Age | Commit message | Author |
|
bank address space sweep, during the first pass (a-b) and (a-b+n) were
computed, during the second pass either the former or the latter quantity was
written to the output bank (depending on the very last borrow flag value).
This is no longer possible, since the FSM now only generates one "interleaved"
address space sweep. The solution is to split one complex modular subtraction
operation into simpler sub-operations. Currently modular subtraction is
achieved by running a sequence of three micro-operations:
* MODULAR_SUBTRACT_X computes (a-b) and latches the final borrow flag
* MODULAR_SUBTRACT_Y computes (a-b+n)
* MODULAR_SUBTRACT_Z writes either (a-b) or (a-b+n) into the output bank
depending on the latched value of the borrow flag
Unfortunately we can't compute both (a-b) and (a-b+n) during one address space
sweep, since a fully pipelined adder/subtractor DSP slice has a 2-cycle latency.
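
The three-step sequence above can be modelled in software roughly as follows.
This is only a behavioural sketch, not the RTL: the 16-bit word width, the
little-endian word order and all function names are assumptions made for
illustration, and whether MODULAR_SUBTRACT_Y reuses the stored (a-b) or
recomputes it is not stated here (the sketch reuses it). Both a and b are
assumed to be already reduced modulo n.

```python
WORD_BITS = 16  # assumed word width, for illustration only
WORD_MASK = (1 << WORD_BITS) - 1


def to_words(value, num_words):
    """Split an integer into little-endian words."""
    return [(value >> (WORD_BITS * i)) & WORD_MASK for i in range(num_words)]


def from_words(words):
    """Reassemble an integer from little-endian words."""
    return sum(w << (WORD_BITS * i) for i, w in enumerate(words))


def modular_subtract_x(a, b):
    """Step X: word-wise a - b, returning the words and the final borrow."""
    out, borrow = [], 0
    for aw, bw in zip(a, b):
        d = aw - bw - borrow
        borrow = 1 if d < 0 else 0
        out.append(d & WORD_MASK)
    return out, borrow


def modular_subtract_y(d, n):
    """Step Y: word-wise (a - b) + n; the carry out of the top word is dropped."""
    out, carry = [], 0
    for dw, nw in zip(d, n):
        s = dw + nw + carry
        carry = s >> WORD_BITS
        out.append(s & WORD_MASK)
    return out


def modular_subtract_z(d, d_plus_n, borrow):
    """Step Z: pick (a - b + n) if the latched borrow is set, else (a - b)."""
    return list(d_plus_n) if borrow else list(d)


def modular_subtract(a, b, n, num_words=4):
    """Run the X, Y, Z sequence on integers a, b < n."""
    aw, bw, nw = (to_words(v, num_words) for v in (a, b, n))
    d, borrow = modular_subtract_x(aw, bw)
    d_plus_n = modular_subtract_y(d, nw)
    return from_words(modular_subtract_z(d, d_plus_n, borrow))
```

For example, modular_subtract(5, 7, 11) takes the borrow path and selects
(a-b+n), while modular_subtract(7, 5, 11) selects (a-b) directly.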
|
|
actually in the critical paths of the ModExpNG core and are plaguing the place
and route tools. I was barely able to achieve timing closure at 180 MHz even
with the highest Map and PaR effort levels. This means that any further clock
frequency increase is effectively impossible; moreover, any small change in the
design may prevent it from meeting timing constraints. The obvious solution is
to use DSP slices not only for modular multiplication, but also for supporting
math operations. When fully pipelined, they can be clocked even faster than the
block memory, so there definitely should not be any timing problems with them.
The general worker module does three things that currently involve fabric-based
math operations:
* carry propagation (conversion to non-redundant representation)
* modular subtraction
* regular addition
This commit adds four DSP slice instances and makes the carry propagation
opcode use DSP slice products instead of fabric logic.
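
For readers unfamiliar with the term, carry propagation can be sketched in
software as below. This is a behavioural model only and says nothing about how
the DSP slices are wired: the 16-bit word width, the little-endian word order
and the function name are assumptions. Each input word may hold more than
WORD_BITS bits (redundant form); the output words each fit in WORD_BITS bits.

```python
WORD_BITS = 16  # assumed word width, for illustration only
WORD_MASK = (1 << WORD_BITS) - 1


def propagate_carries(words):
    """Convert little-endian words from redundant to non-redundant form.

    Returns the normalized words and the carry out of the top word.
    """
    out, carry = [], 0
    for w in words:
        t = w + carry            # add the carry from the previous word
        out.append(t & WORD_MASK)  # keep the low WORD_BITS bits in place
        carry = t >> WORD_BITS     # pass the excess on to the next word
    return out, carry
```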
|
|
is responsible for doing certain supporting operations (mostly moving operands
between banks and doing some simple math operations, such as modular
subtraction and regular addition). Depending on the particular operation, one
of three bank address space sweep patterns was used:
* one-pass (for things like carry propagation)
* two-pass (for things like modular subtraction that produce intermediate
values in the process)
* one-pass interleaved (for copying when only either CRT_?.X or CRT_?.Y is
rewritten: we can only write to X and Y simultaneously, so we have to
interleave reads from the source bank with reads from the destination bank
and overwrite the destination with its just read value, otherwise the second
destination operand is lost)
I initially coded three FSMs, one for each of the address space sweeps and
triggered one of them depending on the opcode, but that turned out too
complicated. There's now only one FSM that always does the "one-pass
interleaved" pattern, whereas the second read (from the destination bank) is
inhibited when not needed by the opcode.
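
The single "one-pass interleaved" pattern can be illustrated with a small
software model. It is a sketch under assumptions, not the FSM itself: a bank is
modelled as a list of (X, Y) pairs so that one write always replaces both
halves, and the function and parameter names are hypothetical.

```python
def copy_sweep(src, dst, copy_x=True, copy_y=True):
    """Copy X and/or Y from src to dst in one address space sweep.

    Each bank is a list of (x, y) tuples, so every write replaces both halves.
    Returns the number of bank reads performed.
    """
    # The destination only has to be read back when one half must be kept.
    need_dst_read = not (copy_x and copy_y)
    reads = 0
    for addr in range(len(src)):
        sx, sy = src[addr]          # first read: source bank
        reads += 1
        if need_dst_read:
            dx, dy = dst[addr]      # second read: destination bank
            reads += 1
        else:
            dx, dy = sx, sy         # second read inhibited, values unused
        # Single write of both halves; the half not being copied is
        # overwritten with the value just read from the destination.
        dst[addr] = (sx if copy_x else dx, sy if copy_y else dy)
    return reads
```

Copying only X costs two reads per address, while copying both halves costs
one, which corresponds to the inhibited second read.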
|
|
|
|
- added core wrapper
- fixed module resets across entire core (all the resets are now consistently
active-low)
- continued refactoring
|
|
|
|
step of the Garner's formula algorithm. Note that the addition is "uneven" in
the sense that the first operand is full-size (as wide as the modulus), while
the second one is only half the size. The adder internally blanks the second
input port during the second half of the addition.
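
A behavioural sketch of such an "uneven" addition is given below. It assumes
16-bit little-endian words and assumes that the second input simply contributes
zero during the second half of the sweep; the function name is hypothetical.

```python
WORD_BITS = 16  # assumed word width, for illustration only
WORD_MASK = (1 << WORD_BITS) - 1


def uneven_add(a_words, b_words):
    """Add a half-size operand b to a full-size operand a, word by word.

    Returns the sum words (same length as a_words) and the final carry.
    """
    out, carry = [], 0
    for i, aw in enumerate(a_words):
        # Past the end of b, the second input is treated as zero.
        bw = b_words[i] if i < len(b_words) else 0
        s = aw + bw + carry
        out.append(s & WORD_MASK)
        carry = s >> WORD_BITS
    return out, carry
```

Note that the carry still has to ripple through the upper half of the first
operand, which is why the sweep cannot stop after the second operand ends.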
|
|
regular (not modular) multiplication. We're doing this by telling the modular
multiplier to stop after the "square" step, which computes A*B. The problem is
that the multiplier stores the lower part of the product in the internal bank L
and the upper part in the internal bank H, but we need to be able to do
operations on the product as a whole. MERGE_LH combines the two halves of
the product into one bank.
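
The effect of the merge can be shown with a toy model. Everything here is an
assumption made for illustration (16-bit little-endian words, the helper names,
and the lower half ending up at the lower addresses of the merged bank); it
only shows why the two halves have to be brought together.

```python
WORD_BITS = 16  # assumed word width, for illustration only
WORD_MASK = (1 << WORD_BITS) - 1


def to_words(value, num_words):
    """Split an integer into little-endian words."""
    return [(value >> (WORD_BITS * i)) & WORD_MASK for i in range(num_words)]


def from_words(words):
    """Reassemble an integer from little-endian words."""
    return sum(w << (WORD_BITS * i) for i, w in enumerate(words))


def multiply_to_lh(a, b, num_words):
    """Model the multiplier output: lower half in bank L, upper half in bank H."""
    words = to_words(a * b, 2 * num_words)
    return words[:num_words], words[num_words:]


def merge_lh(bank_l, bank_h):
    """Combine the two halves into one double-width bank (L words first)."""
    return list(bank_l) + list(bank_h)
```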
|
|
Added modular subtraction micro-operation
|
|
|
|
is basically a block memory data mover, but it can also do some supporting
operations required for the Garner's formula part of the exponentiation.
|