user/shatov/modexpng - "Next-generation" modular exponentiation using the specialized DSP slices present in the Artix-7 FPGA

Age	Commit message	Author
2020-01-16	This commit modifies the REGULAR_ADD_UNEVEN micro-operation to use DSP slices	Pavel V. Shatov (Meister)
	for addition instead of fabric logic. This opcode is only needed in CRT mode and is executed once per entire exponentiation to recombine the two "easier" exponentiations. This was the final change necessary to eliminate fabric math from the general worker module.
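The recombination step mentioned above can be sketched in Python. This is a minimal illustration of Garner's formula for RSA-CRT, not the core's implementation; the names `m_p`, `m_q`, `q_inv` and both function names are assumptions taken from common RSA-CRT notation, not from the source.

```python
def garner_recombine(m_p, m_q, p, q, q_inv):
    """Combine the two half-size results into the full-size result.

    Assumed notation (not from the commit log):
      m_p = c^d mod p, m_q = c^d mod q, q_inv = q^-1 mod p.
    """
    diff = (m_p - m_q) % p      # modular subtraction
    h = (diff * q_inv) % p      # modular multiplication
    product = q * h             # regular (not modular) multiplication
    return product + m_q        # regular addition: full-size + half-size


def rsa_crt(c, d, p, q):
    """Hypothetical helper: run the two easier exponentiations, then recombine."""
    m_p = pow(c % p, d % (p - 1), p)
    m_q = pow(c % q, d % (q - 1), q)
    q_inv = pow(q, -1, p)       # modular inverse, needs Python 3.8+
    return garner_recombine(m_p, m_q, p, q, q_inv)
```

The last line of `garner_recombine` is where an "uneven" addition appears: `product` is as wide as the modulus, while `m_q` is only half that width.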
2020-01-16	Reworked modular subtraction micro-operation. Previously it used "two-pass"	Pavel V. Shatov (Meister)
	bank address space sweep: during the first pass (a-b) and (a-b+n) were computed, and during the second pass either the former or the latter quantity was written to the output bank (depending on the value of the very last borrow flag). This is no longer possible, since the FSM now generates only one "interleaved" address space sweep. The solution is to split the one complex modular subtraction operation into simpler sub-operations. Modular subtraction is currently achieved by running a sequence of three micro-operations:
	* MODULAR_SUBTRACT_X computes (a-b) and latches the final borrow flag
	* MODULAR_SUBTRACT_Y computes (a-b+n)
	* MODULAR_SUBTRACT_Z writes either (a-b) or (a-b+n) into the output bank depending on the latched value of the borrow flag
	Unfortunately we can't compute both (a-b) and (a-b+n) during one address space sweep, since a fully pipelined adder/subtractor DSP slice has 2-cycle latency.
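The three-step sequence can be modelled in software as three separate sweeps over word arrays. This is only a behavioural sketch: the 16-bit word width, the list-of-words bank model and the lower-case function names are assumptions made for illustration and are not taken from the core.

```python
WORD_BITS = 16                  # hypothetical word width
MASK = (1 << WORD_BITS) - 1


def to_words(value, count):
    """Split an integer into `count` little-endian words."""
    return [(value >> (WORD_BITS * i)) & MASK for i in range(count)]


def from_words(words):
    """Reassemble an integer from little-endian words."""
    return sum(w << (WORD_BITS * i) for i, w in enumerate(words))


def modular_subtract_x(a, b):
    """Sweep 1: compute (a-b) and return it with the final borrow flag."""
    out, borrow = [], 0
    for aw, bw in zip(a, b):
        d = aw - bw - borrow
        borrow = 1 if d < 0 else 0
        out.append(d & MASK)
    return out, borrow


def modular_subtract_y(a, b, n):
    """Sweep 2: compute (a-b+n), truncated to the operand width."""
    out, carry = [], 0
    for aw, bw, nw in zip(a, b, n):
        s = aw - bw + nw + carry
        out.append(s & MASK)
        carry = s >> WORD_BITS  # signed carry: -1, 0 or 1
    return out


def modular_subtract_z(diff, diff_plus_n, borrow):
    """Sweep 3: select the result according to the latched borrow flag."""
    return list(diff_plus_n if borrow else diff)


def modular_subtract(a, b, n):
    diff, borrow = modular_subtract_x(a, b)
    diff_plus_n = modular_subtract_y(a, b, n)
    return modular_subtract_z(diff, diff_plus_n, borrow)
```

With both inputs already reduced modulo n, a set borrow flag means a < b, so (a-b+n) is the value in the range [0, n).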
2020-01-16	Turns out, fabric addition and subtraction in the general worker module are	Pavel V. Shatov (Meister)
	actually in the critical paths of the ModExpNG core and are plaguing the place and route tools. I was barely able to achieve timing closure at 180 MHz even with the highest Map and PaR effort levels. This means that any further clock frequency increase is effectively impossible; moreover, any small change in the design may prevent it from meeting timing constraints. The obvious solution is to use DSP slices not only for modular multiplication, but also for supporting math operations. When fully pipelined, they can be clocked even faster than the block memory, so there definitely should not be any timing problems with them. The general worker module does three things that currently involve fabric-based math operations:
	* carry propagation (conversion to non-redundant representation)
	* modular subtraction
	* regular addition
	This commit adds four DSP slice instances and makes the carry propagation opcode use DSP slice products instead of fabric logic.
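For readers unfamiliar with the term, carry propagation here means turning words that hold more than one word's worth of bits into a canonical form where every word fits its width. A minimal Python sketch follows; the 16-bit word width and the function name are assumptions for illustration and say nothing about how the DSP slices are actually wired.

```python
WORD_BITS = 16                  # hypothetical word width
MASK = (1 << WORD_BITS) - 1


def propagate_carries(redundant_words):
    """Convert little-endian words with excess high bits to non-redundant form.

    Returns the normalized words and the carry out of the top word.
    """
    out, carry = [], 0
    for w in redundant_words:
        s = w + carry
        out.append(s & MASK)    # keep the low WORD_BITS bits
        carry = s >> WORD_BITS  # push the excess into the next word
    return out, carry
```

The numeric value is unchanged by the conversion; only its representation becomes unique.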
2020-01-16	Had to rework the general worker module to reach 180 MHz core clock. The module	Pavel V. Shatov (Meister)
	is responsible for certain supporting operations (mostly moving operands between banks and doing some simple math, such as modular subtraction and regular addition). Depending on the particular operation, one of three bank address space sweep patterns was used:
	* one-pass (for things like carry propagation)
	* two-pass (for things like modular subtraction that produce intermediate values in the process)
	* one-pass interleaved (for copying when only either CRT_?.X or CRT_?.Y is rewritten: we can only write to X and Y simultaneously, so we have to interleave reads from the source bank with reads from the destination bank and overwrite the destination with its just-read value, otherwise the second destination operand is lost)
	I initially coded three FSMs, one for each of the address space sweeps, and triggered one of them depending on the opcode, but that turned out to be too complicated. There is now only one FSM that always does the "one-pass interleaved" pattern, with the second read (from the destination bank) inhibited when the opcode does not need it.
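The reason for the interleaved pattern can be shown with a small software model. The dictionary-based bank and the function names below are hypothetical; the only property carried over from the description above is that a write always updates X and Y together.

```python
def write_xy(bank, addr, x_word, y_word):
    """Model of the constraint: one write always updates both X and Y."""
    bank["X"][addr] = x_word
    bank["Y"][addr] = y_word


def copy_x_interleaved(src, dst):
    """Rewrite only dst.X, preserving dst.Y via an interleaved read."""
    for addr in range(len(src["X"])):
        new_x = src["X"][addr]  # first read: source bank
        old_y = dst["Y"][addr]  # second read: destination bank
        write_xy(dst, addr, new_x, old_y)


def copy_xy(src, dst):
    """Rewrite both halves; the second read is not needed here."""
    for addr in range(len(src["X"])):
        write_xy(dst, addr, src["X"][addr], src["Y"][addr])
```

Without the second read in `copy_x_interleaved`, the write would have nothing valid to put into Y and the other destination operand would be lost.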
2019-10-23	Added missing copyright headers.	Pavel V. Shatov (Meister)

2019-10-21	Further work:	Pavel V. Shatov (Meister)
	- added core wrapper
	- fixed module resets across entire core (all the resets are now consistently active-low)
	- continued refactoring
2019-10-21	Added support for non-CRT mode. Further refactoring.	Pavel V. Shatov (Meister)

2019-10-21	Added the regular (not modular) addition operation required during the final	Pavel V. Shatov (Meister)
	step of the Garner's formula algorithm. Note that the addition is "uneven" in the sense that the first operand is full-size (as wide as the modulus), while the second one is only half the size. The adder internally blanks the second input port during the second half of the addition.
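A word-level Python sketch of such an uneven addition is given below. It assumes that the second input is simply treated as zero once the shorter operand runs out; the 16-bit word width and the function name are likewise illustrative assumptions.

```python
WORD_BITS = 16                  # hypothetical word width
MASK = (1 << WORD_BITS) - 1


def regular_add_uneven(a_words, b_words):
    """Add a half-size operand to a full-size one (little-endian words).

    Assumption: past the end of b_words the second input is zero, so only
    the carry keeps rippling through the upper half.
    """
    out, carry = [], 0
    for i, aw in enumerate(a_words):
        bw = b_words[i] if i < len(b_words) else 0
        s = aw + bw + carry
        out.append(s & MASK)
        carry = s >> WORD_BITS
    return out, carry
```

The returned carry is the carry out of the top word; whether it can ever be non-zero depends on the operand ranges, which this sketch does not restrict.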
2019-10-21	Added "MERGE_LH" micro-operation. To be able to do Garner's formula we need	Pavel V. Shatov (Meister)
	regular (not modular) multiplication. We're doing this by telling the modular multiplier to stop after the "square" step, which computes A*B. The problem is that the multiplier stores the lower part of the product in the internal bank L and the upper part in the internal bank H, but we need to be able to do operations on the product as a whole. MERGE_LH combines the two halves of the product into one bank.
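In a software model the merge is just a concatenation of the two halves. The sketch below assumes little-endian word lists and a 16-bit word width; `merge_lh` and the helper names are hypothetical and do not describe the core's actual bank layout.

```python
WORD_BITS = 16                  # hypothetical word width
MASK = (1 << WORD_BITS) - 1


def to_words(value, count):
    """Split an integer into `count` little-endian words."""
    return [(value >> (WORD_BITS * i)) & MASK for i in range(count)]


def from_words(words):
    """Reassemble an integer from little-endian words."""
    return sum(w << (WORD_BITS * i) for i, w in enumerate(words))


def merge_lh(bank_l, bank_h):
    """Place the lower half (L) and upper half (H) into one double-width bank."""
    return list(bank_l) + list(bank_h)
```

After the merge, the product A*B can be handled as one operand, for example as the full-size input of a later addition.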
2019-10-21	Refactored general worker module	Pavel V. Shatov (Meister)
	Added modular subtraction micro-operation
2019-10-03	Added more micro-operations, entire Montgomery exponentiation ladder works now.	Pavel V. Shatov (Meister)

2019-10-03	Added more micro-operations, also added "general worker" module. The worker	Pavel V. Shatov (Meister)
	is basically a block memory data mover, but it can also do some supporting operations required for the Garner's formula part of the exponentiation.