Age | Commit message | Author |
|
|
|
worked because the testbench set both NUM_SYSTOLIC_CYCLES = 4 and
SYSTOLIC_ARRAY_LENGTH = 4. Now should work with any array power, not
only 2.
|
|
Added 512-bit testbench.
|
|
* works in simulator
* passes synthesis without major issues
Started adding pre-multiplication logic...
|
|
* passes testbench tests again
* this time synthesizes fine (without major issues)
List of things that need polishing in the future:
* Parallelized operand loader can be reduced by a factor of 3
to store only one operand at a time: it currently stores
B, N_COEFF and N. After B is consumed, it can be overwritten
with AB, and N_COEFF can be loaded sequentially the same way
A is loaded. After that, the loader can be filled with Q while
N is loaded sequentially.
* Turns out QN block memory is not needed at all. After we obtain
the next word of QN, we immediately calculate SN. After that QN
can be discarded, no need to store it.
* Currently there are two wide memories, T and PE_C_OUT. XST throws
weird warnings about multi-port RAM before finally deciding
to implement them using flip-flops. Those memories should be turned
into FIFOs to simplify the design and not confuse XST.
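The point about QN in the list above can be illustrated with a small software model. This is only a sketch in Python, not the Verilog from this repository. It assumes the core computes a Montgomery product, that N_COEFF means -N^-1 mod R, and that each word of QN is added to the matching word of AB as soon as it is produced. The function name, its parameters and the word width are hypothetical.

```python
def montgomery_product(a, b, n, num_words, word_bits=32):
    """Software model: return a * b * R^-1 mod n, R = 2**(num_words * word_bits).

    Requires n odd, n < R and a, b < n. All names are illustrative
    assumptions, not identifiers from the Verilog source.
    """
    mask = (1 << word_bits) - 1
    r = 1 << (num_words * word_bits)

    n_coeff = (-pow(n, -1, r)) % r   # assumed meaning of N_COEFF (Python 3.8+)
    ab = a * b                       # double-width product AB
    q = (ab * n_coeff) % r           # Q = AB * N_COEFF mod R

    q_words = [(q >> (word_bits * i)) & mask for i in range(num_words)]
    n_words = [(n >> (word_bits * i)) & mask for i in range(num_words)]

    acc = 0     # column accumulator for the Q * N product
    carry = 0   # carry of the running AB + QN addition
    s = 0       # upper half of AB + QN, assembled word by word
    for k in range(2 * num_words):
        # Produce word k of QN by summing all partial products in column k.
        for i in range(num_words):
            j = k - i
            if 0 <= j < num_words:
                acc += q_words[i] * n_words[j]
        qn_word = acc & mask
        acc >>= word_bits

        # Consume the word immediately; it is never written to a QN memory.
        ab_word = (ab >> (word_bits * k)) & mask
        total = ab_word + qn_word + carry
        carry = total >> word_bits
        if k >= num_words:
            # The lower num_words words of AB + QN are zero by construction,
            # so only the upper half is kept.
            s |= (total & mask) << (word_bits * (k - num_words))

    s |= carry << (word_bits * num_words)   # final carry-out
    return s - n if s >= n else s           # conditional final subtraction
```

In this model only one word of QN exists at any time, which is why a dedicated QN memory would be unnecessary under these assumptions.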
|
|
* turned crazy triple multiplier array into one array with input mux
|
|
Cleaned up Verilog a bit
|
|
* fixed bug with latency compensation
* cleaned up Verilog source
* added 512-bit testbench
* works in simulator
* synthesizes without warnings
Changes:
* made latency of generic processing element configurable
|
|
* work in progress
|
|
* works in simulator
* passes synthesis w/o warnings
* code needs minor cleanup
|
|
* works in simulator
* may have to change how internal operand buffer is pre-loaded
(shift register instead of wide mux?)
* code needs some cleanup