aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2019-11-26One more cosmetic fix.Pavel V. Shatov (Meister)
2019-11-26Cosmetic fix.Pavel V. Shatov (Meister)
2019-11-26Forgot to push minor cosmetic fix.Pavel V. Shatov (Meister)
2019-11-20Small change to the reductor module to try to get past 180 MHz. Previously BRAMPavel V. Shatov (Meister)
outputs were going directry into a LUT-based ternary adder which was causing timing problems. Added a layer of flip-flops, so instead of BRAM -> LUT -> FF we have BRAM -> FF -> LUT -> FF. This increases core latency by (number_of_supporting_modular_multiplications + number_of_exponent_bits) ticks.
2019-11-19Removed the latch accidentally created while pipelining the uOP engine module.Pavel V. Shatov (Meister)
The FSM previously had four states encoded using two bits, so the next state logic didn't have a default case, since all the possible states were used. Addition of the fifth state required one more state bit, so the FSM now has five states out eight possible and a default case is thus necessary.
2019-11-18Refactored reductor module.Pavel V. Shatov (Meister)
2019-11-16The uOP engine didn't compile at 180 MHz. The pipeline had two stages: FETCHPavel V. Shatov (Meister)
and DECODE. Apparently one clock cycle is not enough to entirely decode an instruction, so decoding now takes two clock cycles (DECODE_1 and DECODE_2). This seems to solve the problem. If we run into more timing violations here, we can add an extra DECODE_3 cycle and register the currently combinatorial uop_opcode_* flags at DECODE_2. This fix increases the core's latency by 59/32 clock cycles (CRT/non-CRT mode) plus two extra clock cycles per each bit of the exponent.
2019-11-13Beautified the README.md, should look somewhat less nasty now.Pavel V. Shatov (Meister)
2019-10-23Added missing copyright headers.Pavel V. Shatov (Meister)
2019-10-23Added demo driver code for STM32.Pavel V. Shatov (Meister)
2019-10-23Added readme file.Pavel V. Shatov (Meister)
2019-10-23Fixed port width mismatch warning.Pavel V. Shatov (Meister)
2019-10-23Added simulation-only code to measure multiplier load.Pavel V. Shatov (Meister)
2019-10-23Fixed all the testbenches to work with the latest RTL sources.Pavel V. Shatov (Meister)
2019-10-21Reworked testbench, clk_sys and clk_core can now have any ratio, notPavel V. Shatov (Meister)
necessarily 1:2. Fixed compile-time issue where ISE fails to place two DSP slices next to each other, if A and/or B cascade path(s) between then are partially connected. Basically, if cascade is used, entire bus must be connected.
2019-10-21Further work:Pavel V. Shatov (Meister)
- added core wrapper - fixed module resets across entire core (all the resets are now consistently active-low) - continued refactoring
2019-10-21Added support for non-CRT mode. Further refactoring.Pavel V. Shatov (Meister)
2019-10-21Redesigned the testbench. Core clock does not necessarily need to be twicePavel V. Shatov (Meister)
faster than the bus clock now. It can be the same, or say four times faster.
2019-10-21Entire CRT signature algorithm works by now.Pavel V. Shatov (Meister)
Moved micro-operations handler into a separate module file, this way we don't have any synthesized stuff in the top-level module, just instantiations. This is more consistent from the design partitioning point of view. Btw, Xilinx claims their tools work better that way too, but who knows... Added optional simulation-only code to assist debugging. Un-comment the ENABLE_DEBUG `define in 'rtl/modexpng_parameters.vh' to use, but don't ever try to synthesize the core with debugging enabled.
2019-10-21Added the regular (not modular) addition operation required during the finalPavel V. Shatov (Meister)
step of the Garner's formula algorithm. Note, that the addition is "uneven" in the sense, that the first operand is full-size (as wide as the modulus), while the second one is only half the size. The adder internally banks the second input port during the second half of the addition.
2019-10-21Added "MERGE_LH" micro-operation. To be able to do Garner's formula we needPavel V. Shatov (Meister)
regular (not modular) multiplication. We're doing this by telling the modular multiplier to stop after the "square" step, which computes A*B. The problem is that the multiplier stores the lower part of the product in the internal bank L and the upper part in the internal bank H, but we need to be able to do operations on the product as a whole. MERGE_LH that combines the two halves of the product into one bank.
2019-10-21Refactored general worker modulePavel V. Shatov (Meister)
Added modular subtraction micro-operation
2019-10-03Added more micro-operations, entire Montgomery exponentiation ladder works now.Pavel V. Shatov (Meister)
2019-10-03Added more micro-operations, also added "general worker" module. The worker ↵Pavel V. Shatov (Meister)
is basically a block memory data mover, but it can also do some supporting operations required for the Garner's formula part of the exponentiation.
2019-10-03Expanded micro-operation parameters (added dedicated control bit to force ↵Pavel V. Shatov (Meister)
the B input of the modular multiplier to 1, this is necessary to bring numbers out of Montgomery domain).
2019-10-03Reworked storage architecture (moved I/O memory to a separate module, since ↵Pavel V. Shatov (Meister)
there's only one instance of input/output values, while storage manager has dual storage space for P and Q multipliers). Started working on microcoded layer, added input operation and modular multiplication.
2019-10-03Redesigned storage modules, added top-level module, added I/O storage space.Pavel V. Shatov (Meister)
2019-10-01Redesigned core architecture, unified bank structure. All storage blocks nowPavel V. Shatov (Meister)
have eight 4kbit entries and occupy one 36K BRAM tile.
2019-10-01Major rewrite (different core hierarchy, buses, wrappers, etc).Pavel V. Shatov (Meister)
2019-10-01Implemented the final stage of the Montgomery modular multiplication, i.e.Pavel V. Shatov (Meister)
addition of AB and M then reduction by right-shift.
2019-10-01Further work on the Montgomery modular multiplier. Added the thirdPavel V. Shatov (Meister)
"rectangular" stage of the multiplication process, i.e. computation of how many copies of the modulus N to add to the intermediate product AB to zeroize the lower half: M = Q * N.
2019-10-01Further work on the Montgomery modular multiplier. Can now to the "triangular"Pavel V. Shatov (Meister)
part of multiplication, i.e. compute the "magic" reduction coefficient Q = LSB(AB) * N_COEFF.
2019-10-01Started working on the pipelined Montgomery modular multiplier. Currently canPavel V. Shatov (Meister)
do the "square" part of the multiplication, i.e. compute the twice larger intermediate product AB = A * B.
2019-10-01Moved to "modexpng_fpga_model" repo, this one was meant for Verilog.Pavel V. Shatov (Meister)
2019-08-19* More cleanup (got rid of .wide. and .narrow.)Pavel V. Shatov (Meister)
* Working microcode for non-CRT exponentiation (i.e. when only d is known)
2019-08-19* MASSIVE CLEANUPPavel V. Shatov (Meister)
* All the data buses are now either 16 or 18 bits wide for consistency * More consistent naming of micro-operations * More debugging options (can specify which ladder iteration to dump)
2019-08-19* Added more micro-operationsPavel V. Shatov (Meister)
* Working microcode for CRT exponentiation * Further refactoring
2019-08-19* Started conversion of the model to use micro-operationsPavel V. Shatov (Meister)
* Added initial operand bank structure (working "wide"/"narrow" pairs plus input & output banks). The core has four pairs of working banks (X.X and X.Y for Montgomery ladder with modulus P, Y.X and Y.Y for modulus Q)
2019-08-19* Added more debugging options:Pavel V. Shatov (Meister)
- intentionally trigger internal overflow handler - dump MAC inputs - dump intermediate numbers during the reduction phase * Bus widths changes * Some cosmetic changes
2019-04-04Intermediate version to fix recombinaton overflow bug.Pavel V. Shatov (Meister)
2019-04-04Fixed 4096-bit test vector generation.Pavel V. Shatov (Meister)
2019-04-02Removed some boilerplate code, all the three multiplication flavours are nowPavel V. Shatov (Meister)
working consistently. Still need to rework recombination routines.
2019-04-02Cosmetic fixes.Pavel V. Shatov (Meister)
2019-04-02Same changes for "triangle" multiplication phase as for the "square" onePavel V. Shatov (Meister)
(debugging output, simpler MAC clearing and index rotation logic).
2019-04-02Rewrote "square" recombination to match how it works in hardware.Pavel V. Shatov (Meister)
2019-03-30 * more debugging outputPavel V. Shatov (Meister)
* more precise modelling of DSP slice
2019-03-24Simplified index calculation and accumulator clearing logic.Pavel V. Shatov (Meister)
Better debug printout of accumulators.
2019-03-23Added optional output of intermediate quantities for debugging.Pavel V. Shatov (Meister)
Reworked index rotation code for better readability.
2019-03-23Mutate blinding tuple.Pavel V. Shatov (Meister)
2019-03-23Added blinding into math model.Pavel V. Shatov (Meister)