Slug: RandomnessTesting
Title: Randomness Testing Tools
Date: 2016-12-15 22:43
Category: TRNG
This page explains some basics on testing for randomness, and the background necessary to understand their outputs.
When testing the randomness of an alleged/assumed random bit/byte stream, there are two fundamentally different categories of tests: There are blackbox tests which are independent of the particular source, and whitebox tests tailored to the particular properties of the given source.
For reasons rather unsurprising, the blackbox and whitebox categories largely correlate with the categories publicly available tests from others and tests developed by the designers of the particular stream source; this obviously leads to other issues concerning the trustworthiness and reliability of the tests---testing your own stuff is always a somewhat risky approach.
All tests basically work by taking a certain part of the data space, declaring them as "not random" or possibly "suspect", and then comparing some test input against this partitioning. Mathematically speaking, all the possible partitionings are equally relevant and as such any stream is equally "random" or "non-random", so choosing any partitioning definition is somewhat arbitrary.
The usual blackbox tests are effectively based on tests used for non-crypto pseudo random number generators; as such, they are generally useful for quick, preliminary testing and for establishing confidence in the source by outsiders. But to test the health of a source, eventually whitebox tests tailored to the failure modes of the given source are needed.
Dieharder is by far the most extensive blackbox test suite. However, it is originally aimed at testing the output of non-crypto pseudo random number generators; aside from the limitations of using it for an entirely different purpose, it is rather notorious with regard to its need for extensive amounts of input. Additionally, it runs all tests sequentially, rather than making use of today's multi-core multi-threaded CPU architectures; this leads to unnecessary I/O while reading an input file that doesn't fit the disk cache, and runtimes of about an hour or so depending on the sort of computer used.
Generally the best approach to use dieharder
is to first generate an output file, e.g. random.out
to run the tests on, so dieharder
can apply all its individual tests to the same data. For a standard test, at least about 14 GB worth of data are needed; more if one of the tests needing large amounts of data returns a suspect result and dieharder
re-tries the same test with more data.
The command line options I (bs) personally use are dieharder -g 201 -f random.out -s 1 -Y 1 -k 2 -a
:
-g 201 -f random.out
: Don't use a compiled-in pseudo RNG but the file random.out
as input.
-s 1
: Rewind the input after every test. Without this, successive tests use successive parts of the input file.
-Y 1
: Keep testing until a definite (in probabilistic terms:-) test result is obtained.
-k 2
: Use some high precision numerics for the KS test; recommended by the man page.
-a
: Run all tests.
Additionally, these may be useful for more targeted testing:
-m <n>
: Multiply the psamples
value by n
; good for getting even more reliable results, at the expense of the additional data needed.
-d <test name/number>
: Perform a specific test.
-l
: List all available tests by name and number.
-p <n>
: Set the psamples
value. See below why you may need this.
The way dieharder
works, it simply returns a clear assessment of the test results; interpretation should be immediately obvious.
There are a number of things to keep in mind when using dieharder
, especially so when running it on a reduced amount of test data.
- If
dieharder
reaches the end of the input file, it rewinds and uses the same test data again. This has rather drastic effects on several tests which assume that some sort of repetition in the input is a sign for a seriously flawed generator.
- The
-Y 1
option works by adding 100 to the value of psamples
until a conclusive result is found. This works reasonably well with tests that start with a value of 100 to psamples
, but there are tests starting with 1000, and others starting with 1.
- The
-m <n>
option also just affects the initial value of psamples
.
- Some of the tests themselves are marked as "suspect" or "do not use" if you run
dieharder -l
. Still, -a
runs them, for whatever reason.
- Expect about one test in a
-a
run to return a "weak" result that needs to be resolved. According to the man page, about 1 in 1000 tests (not -a
runs!) may fail despite the fact that the input is good. In that case I suggest doubling the input size, either by using -m <n>
(which only works in conjunction with -a
) or by adjusting psamples
and/or tsamples
manually.
The Dieharder home page provides the source code as well as documentation. Dieharder is also available via package systems in Linux (apt-get install dieharder) and by brew for OSX.
While somewhat questionable in the way it has been defined in the FIPS140-2 document (and silently been removed in the -3 draft...), these tests are generally considered useful for a quick preliminary test on small amounts of test data.
They generally work on blocks of 20000 bits.
The rngtest
program reads data from its standard input and by default returns a statistics overview when it reaches EOF. This can be changed with these two options (among others):
-c <n>
: Stop running after n
blocks.
-b <n>:
Give intermediate results every n
blocks.
Use at least one of these when running on a pipe or device...
Since rngtest
works on rather small sample sizes it causes a significant number of false alarms:
Test |
Expected failure rate |
Total |
800ppm |
Monobit |
100ppm |
Poker |
100ppm |
Runs |
300ppm |
Long run |
300ppm |
Continuous run |
Extremely rare |
These failure rates were however measured experimentally rather than derived from the algorithms themselves, so caveat utilitor.
Seriously flawed inputs often show excessive failures from very small input; it is generally a good idea to keep testing until at least about 100 failures in total have occurred before seriously comparing the measured results to the expected failure rates from the table.