{{{ #!htmlcomment This page is maintained automatically by a script. Don't modify this page by hand, your changes will just be overwritten the next time the script runs. Talk to your Friendly Neighborhood Repository Maintainer if you need to change something here. }}} {{{ #!html

ModExpNG

Core Description

This "next-generation" IP core implements the modular exponentiation math primitive using the specialized DSP slices present in the Artix-7 FPGA found on the CrypTech Alpha board. It can be used during RSA operations such as encryption/decryption and signing. Given a pair of blinding factors, the core internally blinds the message before signing and then unblinds it afterwards. The blinding factors are automatically mutated during each exponentiation for later reuse. Another prominent feature is the full support for Chinese Remainder Theorem. Given smaller moduli P and Q, the core can do the mixed-radix conversion variant of the algorithm ("Garner's formula") for faster computation. Due to limitations of the underlying hardware (primarily the capacity of the BlockRAM tile in the 7-Series Xilinx devices) the largest supported key length is 4096 bits.

Compile-Time Settings

The core has one synthesis-time parameter:

Combined DSP slice utilization is outlined in the following table:

NUM_MULTS DSP Usage
8 36
16 68
32 132
64 260

Currently the core has been tested in hardware with a balanced setting of NUM_MULTS = 8. Larger values should decreate latency, but proportionally increase DSP slice usage. The core takes advantage of the dedicated high-speed cascade paths between DSP slices, thus all the NUM_MULTS slices must be placed in the same DSP column. Given that the column height for the Artix-7 FPGA is 100 DSP slices, the upper limit on NUM_MULTS is 64. This has not been yet tested in hardware and may require floorplanning or additional operand "broadcast" tree, which is a topic for future research.

API Specification

The interface of the core is similar to other CrypTech cores. FMC memory map is split into four regions (REGISTERS, INPUT_0, INPUT_1 and OUTPUT), each region is 32 kilobits (4 kilobytes). The first one (REGISTERS) contains core registers and looks like the following:

REGISTERS Memory Map
Offset Register
0x0000 NAME0
0x0004 NAME1
0x0008 VERSION
0x0020 CONTROL
0x0024 STATUS
0x0040 MODE
0x0044 MODULUS_BITS
0x0048 EXPONENT_BITS
0x004C BANK_BITS
0x0050 NUM_MULTS

The core has the following registers:

The two following memory regions (INPUT_0 and INPUT_1) contain input quantities, each region is split into eight banks, each bank is 4096 bits (512 bytes). The least significat byte of an input quantity should be written to offset 0 in a bank, i.e. one should start filling banks by writing the least significant word at the lowest offset.

The second region (INPUT_0) contains the following banks:

INPUT_0 Memory Map
Offset Bank
0x1000 M
0x1200 N
0x1400 N_FACTOR
0x1600 N_COEFF
0x1A00 X
0x1C00 Y

The third region (INPUT_1) contains the following banks. Note, that since the third region contains secret components, it is "write-only", any reads from the region will return 0xDEADCODE.

INPUT_1 Memory Map
Offset Bank
0x2000 D
0x2200 P
0x2300 DP
0x2400 P_FACTOR
0x2600 P_COEFF
0x2800 Q
0x2900 DQ
0x2A00 Q_FACTOR
0x2C00 Q_COEFF
0x2E00 QINV

The fourth region (OUTPUT) contains three banks where the core will store the output quantities:

OUTPUT Memory Map
Offset Bank
0x3000 S
0x3200 XM
0x3400 YM

Montgomery Factors and Modulus-dependent Speed-up Coefficients

The core uses the Montgomery modular multiplier, which can only operate on numbers in the so called Montgomery domain. Bringing input numbers into Montgomery domain and converting output numbers out of Montgomery domain requires a special quantity called Montgomery factor: FACTOR = 2 ** (2 * (MODULUS_BITS + WORD_WIDTH)) mod N. This quantity has at most as many bits as the modulus and should be precomputed during key generation. The core's internal data buses are 16-bit wide, so for the formula above WORD_WIDTH = 16. The following pseudocode calculates the Montgomery factor given modulus N:

F = 1
for i from 0 to 2 * (MODULUS_BITS + 16) - 1
    F1 = F << 1
    F2 = F1 - N
    F = (F2 < 0) ? F1 : F2
return F

The final step of Montgomery modular multiplication is Montgomery modular reduction. It is done by adding a multiple of the modulus to the intermediate product. The multiple is selected in such a way, that the lower half of the sum is all zero bits, so it can be safely reduced by just a trivial right shift. This speeds things up, since there's no more need to do computationally difficult division anymore. To determine the multiple of the modulus, another quantity is required, which is the so called modulus-dependent speed-up coefficient: COEFF = -N ** -1 mod 2 ** (MODULUS_BITS + WORD_WIDTH). The number of bits needed to store the coefficient exceeds the width of the modulus by one word. The core internally operates on 16-bit words, so the coefficient is wider than the modulus by two bytes. This non-obvious and somewhat inconvenient feature is required to skip the final conditional subtraction step of the "classic" Montgomery modular reduction. Since a selected multiple of the modulus is added to the intermediate product, the result after shifting it to the right can exceed the modulus, so the traditional solution is to do a conditional subtraction as the very last step. The more precise analysis indicates, that the conditional subtraction can be eliminated, if the main loop of the reduction is repeated for at least three additional times, i.e. MODULUS_BITS + 3 total iterations. The model operates on entire words internally, so it can only do MODULUS_BITS + WORD_WIDTH iterations and thus needs 16 extra bits of the speed-up factor. The following pseudocode calculates the modulus-dependent speed-up coefficient:

R = 1
B = 1
NN = ~N + 1
for i from 1 to (MODULUS_BITS + 16 - 1)
    B = B << 1
    T = R * NN mod 2 ** (MODULUS_BITS + 16)
    if T[i] then
        R = R + B
return R

The stm32/modexpng_util.c file also has reference C code, that computes the Montgomery factor and the speed-up coeficient. Note that each keypair ideally requires three pairs of precomputed numbers: one for the public key N and one for each of the smaller secret moduli P and Q.

References

  1. Hardware Aspects of Montgomery Modular Multiplication
}}} [[RepositoryIndex(format=table,glob=user/shatov/modexpng)]] || Clone `https://git.cryptech.is/user/shatov/modexpng.git` ||