Age | Commit message (Collapse) | Author |
|
|
* cosmetic rename of Verilog include file
|
|
for addition instead of fabric logic. This opcode is only necessary when in
CRT mode and is executed once per entire exponentiation to recombine the two
"easier" exponentiations. This was the final change necessary to get rid of
using fabric math in the general worker module.
|
|
bank address space sweep, during the first pass (a-b) and (a-b+n) were
computed, during the second pass either the former or the latter quantity was
written to the output bank (depending on the very last borrow flag value).
This is no longer possible, since the FSM now only generates one "interleaved"
address space sweep. The solution is to split one complex modular subtraction
operation into simpler sub-operations. Currently modular subtraction is
achieved by running a sequence of three micro-operations:
* MODULAR_SUBTRACT_X computes (a-b) and latches the final borrow flag
* MODULAR_SUBTRACT_Y computes (a-b+n)
* MODULAR_SUBTRACT_Z writes either (a-b) or (a-b+n) into the output bank
depending on the latched value of the borrow flag
Unfortunately we can't compute both (a-b) and (a-b+n) during one address space
sweep, since a fully pipelined adder/subtractor DSP slice has a 2-cycle latency.
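The three micro-operations above can be modelled in software roughly as follows. This is only a behavioural sketch in Python, not the core's RTL: the 16-bit word width, the little-endian word lists and all function names are assumptions made for illustration.

```python
WORD_BITS = 16                      # assumed word width
WORD_MASK = (1 << WORD_BITS) - 1

def to_words(value, count):
    """Split an integer into `count` little-endian words (hypothetical helper)."""
    return [(value >> (WORD_BITS * i)) & WORD_MASK for i in range(count)]

def from_words(words):
    """Reassemble an integer from little-endian words (hypothetical helper)."""
    return sum(w << (WORD_BITS * i) for i, w in enumerate(words))

def modular_subtract_x(a, b):
    """Model of MODULAR_SUBTRACT_X: compute (a-b), return it with the final borrow."""
    out, borrow = [], 0
    for aw, bw in zip(a, b):
        d = aw - bw - borrow
        borrow = 1 if d < 0 else 0
        out.append(d & WORD_MASK)
    return out, borrow

def modular_subtract_y(diff, n):
    """Model of MODULAR_SUBTRACT_Y: compute (a-b+n), discarding the final carry."""
    out, carry = [], 0
    for dw, nw in zip(diff, n):
        s = dw + nw + carry
        carry = s >> WORD_BITS
        out.append(s & WORD_MASK)
    return out

def modular_subtract_z(diff, diff_plus_n, borrow):
    """Model of MODULAR_SUBTRACT_Z: select the result using the latched borrow."""
    return diff_plus_n if borrow else diff

def modular_subtract(a, b, n, count):
    """Run the X, Y, Z sequence on integers a, b < n."""
    aw, bw, nw = (to_words(v, count) for v in (a, b, n))
    diff, borrow = modular_subtract_x(aw, bw)
    diff_plus_n = modular_subtract_y(diff, nw)
    return from_words(modular_subtract_z(diff, diff_plus_n, borrow))
```

Each function corresponds to one address space sweep, which is why the borrow has to be latched between the X and Z steps.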
|
|
actually in the critical paths of the ModExpNG core and are plaguing the place
and route tools. I was barely able to achieve timing closure at 180 MHz even
with the highest Map and PaR effort levels. This means that any further clock
frequency increase is effectively impossible; moreover, any small change in the
design may prevent it from meeting timing constraints. The obvious solution is to
use DSP slices not only for modular multiplication, but also for supporting
math operations. When fully pipelined, they can be clocked even faster than the
block memory, so there definitely should not be any timing problems with them.
The general worker module does three things that currently involve fabric-based
math operations:
* carry propagation (conversion to non-redundant representation)
* modular subtraction
* regular addition
This commit adds four DSP slice instances and makes the carry propagation
opcode use DSP slice products instead of fabric logic.
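For readers unfamiliar with the term, carry propagation here means turning words that hold more bits than their nominal width into a normal representation. A minimal Python sketch follows, assuming 16-bit nominal words stored little-endian; the function name is hypothetical and this does not reflect how the DSP slices are actually wired.

```python
WORD_BITS = 16                      # assumed nominal word width
WORD_MASK = (1 << WORD_BITS) - 1

def propagate_carries(words):
    """Convert redundant words (each may exceed WORD_BITS) to non-redundant form.

    Returns the normalized little-endian words and the carry out of the top word.
    """
    out, carry = [], 0
    for w in words:
        s = w + carry               # add the carry from the previous word
        out.append(s & WORD_MASK)   # keep the low WORD_BITS bits
        carry = s >> WORD_BITS      # pass the excess bits upwards
    return out, carry
```

The value represented by the word list is unchanged; only the excess bits are moved into the next word.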
|
|
is responsible for doing certain supporting operations (mostly moving operands
between banks and doing some simple math operations, such as modular
subtraction and regular addition). Depending on the particular operation, one
of three bank address space sweep patterns was used:
* one-pass (for things like carry propagation)
* two-pass (for things like modular subtraction that produce intermediate
values in the process)
* one-pass interleaved (for copying when only either CRT_?.X or CRT_?.Y is
rewritten: we can only write to X and Y simultaneously, so we have to
interleave reads from the source bank with reads from the destination bank
and overwrite the destination with its just read value, otherwise the second
destination operand is lost)
I initially coded three FSMs, one for each of the address space sweeps, and
triggered one of them depending on the opcode, but that turned out to be too
complicated. There's now only one FSM that always does the "one-pass
interleaved" pattern, whereas the second read (from the destination bank) is
inhibited when not needed by the opcode.
|
|
outputs were going directly into a LUT-based ternary adder which was causing
timing problems. Added a layer of flip-flops, so instead of BRAM -> LUT -> FF
we have BRAM -> FF -> LUT -> FF. This increases core latency by
(number_of_supporting_modular_multiplications + number_of_exponent_bits) ticks.
|
|
The FSM previously had four states encoded using two bits, so the next state
logic didn't have a default case, since all the possible states were used.
Addition of the fifth state required one more state bit, so the FSM now has
five states out of eight possible and a default case is thus necessary.
|
|
and DECODE. Apparently one clock cycle is not enough to entirely decode an
instruction, so decoding now takes two clock cycles (DECODE_1 and DECODE_2).
This seems to solve the problem. If we run into more timing violations here, we
can add an extra DECODE_3 cycle and register the currently combinatorial
uop_opcode_* flags at DECODE_2. This fix increases the core's latency by 59/32
clock cycles (CRT/non-CRT mode) plus two extra clock cycles per each bit of the
exponent.
|
|
necessarily 1:2.
Fixed compile-time issue where ISE fails to place two DSP slices next to each
other, if A and/or B cascade path(s) between them are partially connected.
Basically, if cascade is used, the entire bus must be connected.
|
|
- added core wrapper
- fixed module resets across entire core (all the resets are now consistently
active-low)
- continued refactoring
|
|
faster than the bus clock now. It can be the same, or say four times faster.
|
|
Moved micro-operations handler into a separate module file, this way we don't
have any synthesized stuff in the top-level module, just instantiations. This
is more consistent from the design partitioning point of view. Btw, Xilinx
claims their tools work better that way too, but who knows...
Added optional simulation-only code to assist debugging. Un-comment the
ENABLE_DEBUG `define in 'rtl/modexpng_parameters.vh' to use, but don't ever
try to synthesize the core with debugging enabled.
|
|
step of the Garner's formula algorithm. Note that the addition is "uneven" in
the sense that the first operand is full-size (as wide as the modulus), while
the second one is only half the size. The adder internally banks the second
input port during the second half of the addition.
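The "uneven" addition can be pictured with a small Python model. This is a sketch under assumptions: 16-bit little-endian words, a hypothetical function name, and a reading of the note above in which the second input contributes zero once its words run out.

```python
WORD_BITS = 16                      # assumed word width
WORD_MASK = (1 << WORD_BITS) - 1

def uneven_add(wide, narrow):
    """Add a half-size operand to a full-size one, word by word.

    `wide` and `narrow` are little-endian word lists. During the upper half
    of the sweep the second input is treated as zero (assumed behaviour).
    Returns the sum words and the final carry.
    """
    out, carry = [], 0
    for i, ww in enumerate(wide):
        nw = narrow[i] if i < len(narrow) else 0
        s = ww + nw + carry
        out.append(s & WORD_MASK)
        carry = s >> WORD_BITS
    return out, carry
```

Only the carry continues to ripple through the upper half, so the sweep still has to cover the full operand width.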
|
|
regular (not modular) multiplication. We're doing this by telling the modular
multiplier to stop after the "square" step, which computes A*B. The problem is
that the multiplier stores the lower part of the product in the internal bank L
and the upper part in the internal bank H, but we need to be able to do
operations on the product as a whole. MERGE_LH combines the two halves of
the product into one bank.
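In software terms the merge amounts to concatenating two word arrays. The Python sketch below assumes 16-bit little-endian words; `split_product` and `merge_lh` are hypothetical names used only to illustrate the data movement, not the micro-operation's actual implementation.

```python
WORD_BITS = 16                      # assumed word width
WORD_MASK = (1 << WORD_BITS) - 1

def split_product(a, b, count):
    """Model the multiplier output: low `count` words in L, high `count` words in H."""
    ab = a * b
    words = [(ab >> (WORD_BITS * i)) & WORD_MASK for i in range(2 * count)]
    return words[:count], words[count:]

def merge_lh(bank_l, bank_h):
    """Model of MERGE_LH: place the lower half followed by the upper half in one bank."""
    return list(bank_l) + list(bank_h)

def from_words(words):
    """Reassemble an integer from little-endian words."""
    return sum(w << (WORD_BITS * i) for i, w in enumerate(words))
```

After the merge, the whole product can be read back from a single bank in one sweep.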
|
|
Added modular subtraction micro-operation
|
|
is basically
a block memory data mover, but it can also do some supporting operations required for the
Garner's formula part of the exponentiation.
|
|
the B input of
the modular multiplier to 1, this is necessary to bring numbers out of Montgomery domain).
|
|
there's
only one instance of input/output values, while storage manager has dual storage space
for P and Q multipliers).
Started working on microcoded layer, added input operation and modular multiplication.
|
|
have eight 4kbit entries and occupy one 36K BRAM tile.
|
|
addition of AB and M then reduction by right-shift.
|
|
"rectangular" stage of the multiplication process, i.e. computation of how many
copies of the modulus N to add to the intermediate product AB to zeroize the
lower half: M = Q * N.
|
|
part of multiplication, i.e. compute the "magic" reduction coefficient
Q = LSB(AB) * N_COEFF.
|
|
do the "square" part of the multiplication, i.e. compute the twice larger
intermediate product AB = A * B.
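The multiplication stages named in the entries above (AB = A * B, Q = LSB(AB) * N_COEFF, M = Q * N, then the addition of AB and M followed by a right shift) match the textbook Montgomery multiplication. A minimal Python sketch follows, assuming N is odd, A and B are smaller than N, LSB(AB) means the lower k bits of AB, and N_COEFF is -N^-1 mod 2^k. None of these assumptions is confirmed by the log itself, and the function name is hypothetical.

```python
def montgomery_multiply(a, b, n, k):
    """Return a * b * 2^(-k) mod n using the four steps from the commit log.

    Assumes n is odd, n < 2^k, and a, b < n. Requires Python 3.8+ for pow(n, -1, r).
    """
    r = 1 << k
    mask = r - 1
    n_coeff = (-pow(n, -1, r)) % r      # assumed definition of N_COEFF
    ab = a * b                          # "square" part: AB = A * B
    q = ((ab & mask) * n_coeff) & mask  # Q = LSB(AB) * N_COEFF, lower half only
    m = q * n                           # "rectangular" part: M = Q * N
    s = (ab + m) >> k                   # lower half of AB + M is zero, shift it out
    return s - n if s >= n else s       # final conditional subtraction
```

Calling `montgomery_multiply(x_mont, 1, n, k)` with B set to 1 brings `x_mont` out of the Montgomery domain, which is the trick one of the entries above mentions.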
|
|
* Working microcode for non-CRT exponentiation (i.e. when only d is known)
|
|
* All the data buses are now either 16 or 18 bits wide for consistency
* More consistent naming of micro-operations
* More debugging options (can specify which ladder iteration to dump)
|
|
* Working microcode for CRT exponentiation
* Further refactoring
|
|
* Added initial operand bank structure (working "wide"/"narrow" pairs plus
input & output banks). The core has four pairs of working banks (X.X and X.Y
for Montgomery ladder with modulus P, Y.X and Y.Y for modulus Q)
|
|
- intentionally trigger internal overflow handler
- dump MAC inputs
- dump intermediate numbers during the reduction phase
* Bus width changes
* Some cosmetic changes
|
|