From 6791175e7466ec09bf8464be99078ef33fe8da23 Mon Sep 17 00:00:00 2001 From: "Pavel V. Shatov (Meister)" Date: Tue, 26 Nov 2019 15:26:40 +0300 Subject: One more cosmetic fix. --- README.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/README.md b/README.md index 17e19ee..a66c2f8 100644 --- a/README.md +++ b/README.md @@ -11,7 +11,7 @@ Compile-Time Settings The core has one synthesis-time parameter: - * **NUM\_MULTS** - Sets the number of DSP slices to use per modular multiplier, which must be a power of 2. Each multiplier consumes NUM\_MULTS + 1 slices, since one additional multiplier is required to eliminate the final conditional subtraction step of the Montgomery modular multiplication routine. The core internally consists of a pair of _dual modular multipliers_, thus the total number of occupied DSP slices is 4 * (NUM\_MULTS + 1). When in CRT mode, both _dual multipliers_ operate simultaneously, each with its own modulus. The two multipliers inside of the _dual multiplier_ simultaneously do the squaring and multiplication operations, which constitute one step of the Montgomery powering ladder. + * **NUM\_MULTS** - Sets the number of DSP slices to use per modular multiplier, which must be a power of 2. Each multiplier consumes NUM\_MULTS + 1 slices, since one additional multiplier is required to eliminate the final conditional subtraction step of the Montgomery modular multiplication routine. The core internally consists of a pair of *dual modular multipliers*, thus the total number of occupied DSP slices is 4 * (NUM\_MULTS + 1). When in CRT mode, both *dual multipliers* operate simultaneously, each with its own modulus. The two multipliers inside of the *dual multiplier* simultaneously do the squaring and multiplication operations, which constitute one step of the Montgomery powering ladder. Combined DSP slice utilization is outlined in the following table: @@ -97,7 +97,7 @@ The core has the following registers: Length of operand bank in bits. This read-only parameter returns the length of internal operand bank and allows the largest supported operand width to be determined at run-time. Currently BANK\_BITS is hardwired to always return 4096. - * **NUM_MULTS** + * **NUM\_MULTS** This read-only parameter returns the width of the internal parallel multiplier, that was specified at compile-time. This parameter is currently 8. @@ -187,7 +187,7 @@ The core uses the Montgomery modular multiplier, which can only operate on numbe F = (F2 < 0) ? F1 : F2 return F -The final step of Montgomery modular multiplication is Montgomery modular reduction. It is done by adding a multiple of the modulus to the intermediate product. The multiple is selected in such a way, that the lower half of the sum is all zero bits, so it can be safely reduced by just a trivial right shift. This speeds things up, since there's no more need to do computationally difficult division anymore. To determine the multiple of the modulus, another quantity is required, which is the so called modulus-dependent speed-up coefficient: **COEFF** = -**N** \*\* -1 mod 2 \*\* (**MODULUS\_BITS** + **WORD\_WIDTH**). The number of bits needed to store the coefficient exceeds the width of the modulus by one word. The core internally operates on 16-bit words, so the coefficient is wider than the modulus by two bytes. This non-obvious and somewhat inconvenient feature is required to skip the final conditional subtraction step of the "classic" Montgomery modular reduction. Since a selected multiple of the modulus is _added_ to the intermediate product, the result after shifting it to the right can exceed the modulus, so the traditional solution is to do a conditional subtraction as the very last step. The more precise analysis indicates, that the conditional subtraction can be eliminated, if the main loop of the reduction is repeated for at least three additional times, i.e. **MODULUS\_BITS** + 3 total iterations. The model operates on entire words internally, so it can only do **MODULUS\_BITS** + **WORD\_WIDTH** iterations and thus needs 16 extra bits of the speed-up factor. The following pseudocode calculates the modulus-dependent speed-up coefficient: +The final step of Montgomery modular multiplication is Montgomery modular reduction. It is done by adding a multiple of the modulus to the intermediate product. The multiple is selected in such a way, that the lower half of the sum is all zero bits, so it can be safely reduced by just a trivial right shift. This speeds things up, since there's no more need to do computationally difficult division anymore. To determine the multiple of the modulus, another quantity is required, which is the so called modulus-dependent speed-up coefficient: **COEFF** = -**N** \*\* -1 mod 2 \*\* (**MODULUS\_BITS** + **WORD\_WIDTH**). The number of bits needed to store the coefficient exceeds the width of the modulus by one word. The core internally operates on 16-bit words, so the coefficient is wider than the modulus by two bytes. This non-obvious and somewhat inconvenient feature is required to skip the final conditional subtraction step of the "classic" Montgomery modular reduction. Since a selected multiple of the modulus is *added* to the intermediate product, the result after shifting it to the right can exceed the modulus, so the traditional solution is to do a conditional subtraction as the very last step. The more precise analysis indicates, that the conditional subtraction can be eliminated, if the main loop of the reduction is repeated for at least three additional times, i.e. **MODULUS\_BITS** + 3 total iterations. The model operates on entire words internally, so it can only do **MODULUS\_BITS** + **WORD\_WIDTH** iterations and thus needs 16 extra bits of the speed-up factor. The following pseudocode calculates the modulus-dependent speed-up coefficient: R = 1 B = 1 -- cgit v1.2.3