Age | Commit message (Collapse) | Author |
|
is responsible for doing certain supporting operations (mostly moving operands
between banks and doing some simple math operations, such as modular
subtraction and regular addition). Depending on the particular operation, one
of three bank address space sweep patterns was used:
* one-pass (for things like carry propagation)
* two-pass (for things like modular subtraction that produce intermediate
values in the process)
* one-pass interleaved (for copying when only either CRT_?.X or CRT_?.Y is
rewritten: we can only write to X and Y simultaneously, so we have to
interleave reads from the source bank with reads from the destination bank
and overwrite the destination with its just read value, otherwise the second
destination operand is lost)
I initially coded three FSMs, one for each of the address space sweeps and
triggered one of them depending on the opcode, but that turned out too
complicated. There's now only one FSM that always does the "one-pass
interleaved" pattern, whereas the second read (from the destination bank) is
inhibited when not need by the opcode.
|
|
|
|
|
|
|
|
outputs were going directry into a LUT-based ternary adder which was causing
timing problems. Added a layer of flip-flops, so instead of BRAM -> LUT -> FF
we have BRAM -> FF -> LUT -> FF. This increases core latency by
(number_of_supporting_modular_multiplications + number_of_exponent_bits) ticks.
|
|
The FSM previously had four states encoded using two bits, so the next state
logic didn't have a default case, since all the possible states were used.
Addition of the fifth state required one more state bit, so the FSM now has
five states out eight possible and a default case is thus necessary.
|
|
|
|
and DECODE. Apparently one clock cycle is not enough to entirely decode an
instruction, so decoding now takes two clock cycles (DECODE_1 and DECODE_2).
This seems to solve the problem. If we run into more timing violations here, we
can add an extra DECODE_3 cycle and register the currently combinatorial
uop_opcode_* flags at DECODE_2. This fix increases the core's latency by 59/32
clock cycles (CRT/non-CRT mode) plus two extra clock cycles per each bit of the
exponent.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
necessarily 1:2.
Fixed compile-time issue where ISE fails to place two DSP slices next to each
other, if A and/or B cascade path(s) between then are partially connected.
Basically, if cascade is used, entire bus must be connected.
|
|
- added core wrapper
- fixed module resets across entire core (all the resets are now consistently
active-low)
- continued refactoring
|
|
|
|
faster than the bus clock now. It can be the same, or say four times faster.
|
|
Moved micro-operations handler into a separate module file, this way we don't
have any synthesized stuff in the top-level module, just instantiations. This
is more consistent from the design partitioning point of view. Btw, Xilinx
claims their tools work better that way too, but who knows...
Added optional simulation-only code to assist debugging. Un-comment the
ENABLE_DEBUG `define in 'rtl/modexpng_parameters.vh' to use, but don't ever
try to synthesize the core with debugging enabled.
|
|
step of the Garner's formula algorithm. Note, that the addition is "uneven" in
the sense, that the first operand is full-size (as wide as the modulus), while
the second one is only half the size. The adder internally banks the second
input port during the second half of the addition.
|
|
regular (not modular) multiplication. We're doing this by telling the modular
multiplier to stop after the "square" step, which computes A*B. The problem is
that the multiplier stores the lower part of the product in the internal bank L
and the upper part in the internal bank H, but we need to be able to do
operations on the product as a whole. MERGE_LH that combines the two halves of
the product into one bank.
|
|
Added modular subtraction micro-operation
|
|
|
|
is basically
a block memory data mover, but it can also do some supporting operations required for the
Garner's formula part of the exponentiation.
|
|
the B input of
the modular multiplier to 1, this is necessary to bring numbers out of Montgomery domain).
|
|
there's
only one instance of input/output values, while storage manager has dual storage space
for P and Q multipliers).
Started working on microcoded layer, added input operation and modular multiplication.
|
|
|
|
have eight 4kbit entries and occupy one 36K BRAM tile.
|
|
|
|
addition of AB and M then reduction by right-shift.
|
|
"rectangular" stage of the multiplication process, i.e. computation of how many
copies of the modulus N to add to the intermediate product AB to zeroize the
lower half: M = Q * N.
|
|
part of multiplication, i.e. compute the "magic" reduction coefficient
Q = LSB(AB) * N_COEFF.
|
|
do the "square" part of the multiplication, i.e. compute the twice larger
intermediate product AB = A * B.
|
|
|
|
* Working microcode for non-CRT exponentiation (i.e. when only d is known)
|
|
* All the data buses are now either 16 or 18 bits wide for consistency
* More consistent naming of micro-operations
* More debugging options (can specify which ladder iteration to dump)
|
|
* Working microcode for CRT exponentiation
* Further refactoring
|
|
* Added initial operand bank structure (working "wide"/"narrow" pairs plus
input & output banks). The core has four pairs of working banks (X.X and X.Y
for Montgomery ladder with modulus P, Y.X and Y.Y for modulus Q)
|
|
- intentionally trigger internal overflow handler
- dump MAC inputs
- dump intermediate numbers during the reduction phase
* Bus widths changes
* Some cosmetic changes
|
|
|
|
|
|
working consistently. Still need to rework recombination routines.
|
|
|
|
(debugging output, simpler MAC clearing and index rotation logic).
|
|
|
|
* more precise modelling of DSP slice
|
|
Better debug printout of accumulators.
|
|
Reworked index rotation code for better readability.
|
|
|