Age | Commit message (Collapse) | Author |
|
|
|
and DECODE. Apparently one clock cycle is not enough to entirely decode an
instruction, so decoding now takes two clock cycles (DECODE_1 and DECODE_2).
This seems to solve the problem. If we run into more timing violations here, we
can add an extra DECODE_3 cycle and register the currently combinatorial
uop_opcode_* flags at DECODE_2. This fix increases the core's latency by 59/32
clock cycles (CRT/non-CRT mode) plus two extra clock cycles per each bit of the
exponent.
|
|
|
|
|
|
|
|
necessarily 1:2.
Fixed compile-time issue where ISE fails to place two DSP slices next to each
other, if A and/or B cascade path(s) between then are partially connected.
Basically, if cascade is used, entire bus must be connected.
|
|
- added core wrapper
- fixed module resets across entire core (all the resets are now consistently
active-low)
- continued refactoring
|
|
|
|
faster than the bus clock now. It can be the same, or say four times faster.
|
|
Moved micro-operations handler into a separate module file, this way we don't
have any synthesized stuff in the top-level module, just instantiations. This
is more consistent from the design partitioning point of view. Btw, Xilinx
claims their tools work better that way too, but who knows...
Added optional simulation-only code to assist debugging. Un-comment the
ENABLE_DEBUG `define in 'rtl/modexpng_parameters.vh' to use, but don't ever
try to synthesize the core with debugging enabled.
|
|
step of the Garner's formula algorithm. Note, that the addition is "uneven" in
the sense, that the first operand is full-size (as wide as the modulus), while
the second one is only half the size. The adder internally banks the second
input port during the second half of the addition.
|
|
regular (not modular) multiplication. We're doing this by telling the modular
multiplier to stop after the "square" step, which computes A*B. The problem is
that the multiplier stores the lower part of the product in the internal bank L
and the upper part in the internal bank H, but we need to be able to do
operations on the product as a whole. MERGE_LH that combines the two halves of
the product into one bank.
|
|
Added modular subtraction micro-operation
|
|
|
|
is basically
a block memory data mover, but it can also do some supporting operations required for the
Garner's formula part of the exponentiation.
|
|
the B input of
the modular multiplier to 1, this is necessary to bring numbers out of Montgomery domain).
|
|
there's
only one instance of input/output values, while storage manager has dual storage space
for P and Q multipliers).
Started working on microcoded layer, added input operation and modular multiplication.
|
|
|
|
have eight 4kbit entries and occupy one 36K BRAM tile.
|
|
|
|
addition of AB and M then reduction by right-shift.
|
|
"rectangular" stage of the multiplication process, i.e. computation of how many
copies of the modulus N to add to the intermediate product AB to zeroize the
lower half: M = Q * N.
|
|
part of multiplication, i.e. compute the "magic" reduction coefficient
Q = LSB(AB) * N_COEFF.
|
|
do the "square" part of the multiplication, i.e. compute the twice larger
intermediate product AB = A * B.
|