* Added readme file (v0.20)

* Enabled vendor-specific primitive usage for compilation
author: Pavel V. Shatov (Meister) <meisterpaul1@yandex.ru> 2017-08-07 12:42:50 +0300
committer: Pavel V. Shatov (Meister) <meisterpaul1@yandex.ru> 2017-08-07 12:42:50 +0300
commit: 06dadb7faa692831f7353910269ecbdf0dd6b21c (patch)
tree: 87c3ea6f50526c0c751be5b3ec5f27435bff2ebc
parent: 5c4d3b9b62cd8de2fae6ae49d479ee06173cadc4 (diff)
2 files changed, 101 insertions, 1 deletions
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..8abc6bc
--- /dev/null
+++ b/README.md
@@ -0,0 +1,100 @@
+# ModExpA7
+
+## Core Description
+
+This core implements modular exponentiation on the Artix-7 FPGA found on the CrypTech Alpha board. It can be used during RSA operations such as encryption/decryption and signing.
+
+## Compile-Time Settings
+
+The core has two synthesis-time parameters:
+
+ * **OPERAND_ADDR_WIDTH** - Sets the _largest supported_ operand width, which determines the amount of block memory reserved for operand storage. The largest operand width the core can handle, in bits, is 32 * (2 ** OPERAND_ADDR_WIDTH). If the largest possible modulus is 1024 bits long, set OPERAND_ADDR_WIDTH = 5; for 2048-bit moduli, set OPERAND_ADDR_WIDTH = 6; for a 4096-bit capable core, set OPERAND_ADDR_WIDTH = 7, and so on.
+
+ * **SYSTOLIC_ARRAY_POWER** - Determines the number of processing elements in the internal systolic array; the total number of elements is 2 ** SYSTOLIC_ARRAY_POWER. This affects the number of DSP slices dedicated to parallelized multiplication. Allowed values are 1..OPERAND_ADDR_WIDTH-1; higher values yield a faster core at the cost of higher device utilization.
+ 
+---
+TODO: Give device utilization numbers for different values of SYSTOLIC_ARRAY_POWER.
+
+---
+ 
+## API Specification
+
+The interface of the core is similar to that of other CrypTech cores. The FMC memory map is split into two parts. The first part contains the registers and is laid out as follows:
+
+| Offset | Register      |
+|--------|---------------|
+| 0x0000 | NAME0         |
+| 0x0004 | NAME1         |
+| 0x0008 | VERSION       |
+| 0x0020 | CONTROL       |
+| 0x0024 | STATUS        |
+| 0x0040 | MODE          |
+| 0x0044 | MODULUS_BITS  |
+| 0x0048 | EXPONENT_BITS |
+| 0x004C | BUFFER_BITS   |
+| 0x0050 | ARRAY_BITS    |
+
+The core has the following registers:
+
+ * **NAME0**, **NAME1**  
+Read-only core name ("mode", "xp7a").
+
+ * **VERSION**  
+Read-only core version, currently "0.20".
+
+ * **CONTROL**  
+Register bits:  
+[31:2] Don't care, always read as 0  
+[1] "next" control bit  
+[0] "init" control bit  
+The core uses a Montgomery modular multiplier, which requires precomputation of a modulus-dependent speed-up coefficient. Every time a new modulus is loaded into the core, this coefficient must be precalculated before exponentiation can be started. Changing the "init" bit from 0 to 1 starts precomputation. The bit is edge-triggered, so to start another precomputation it must first be cleared and then set to 1 again. The "next" control bit works the same way as the "init" bit: changing it from 0 to 1 triggers a new exponentiation operation. When repeatedly encrypting/signing with the same modulus, precomputation needs to be done only once, before the very first exponentiation.
+
+ * **STATUS**  
+Read-only register bits:  
+[31:2] Don't care, always read as 0  
+[1] "valid" status bit  
+[0] "ready" status bit  
+The "valid" status bit is cleared as soon as the core starts exponentiation and is set again once the operation is complete. The "ready" status bit is cleared when the core starts precomputation and is set once the speed-up coefficient has been precalculated.
+
+The second part of the address space contains four operand banks.
+
+The length of each bank (BANK_LENGTH) depends on the largest supported operand width: 0x80 bytes for a 1024-bit core (OPERAND_ADDR_WIDTH = 5), 0x100 bytes for a 2048-bit core (OPERAND_ADDR_WIDTH = 6), 0x200 bytes for a 4096-bit core (OPERAND_ADDR_WIDTH = 7), and so on.
+
+The second part starts at offset 4 * BANK_LENGTH: 0x200 for a 1024-bit core, 0x400 for a 2048-bit core, 0x800 for a 4096-bit core, and so on. The core has the following four banks:
+
+| Offset          | Register       |
+|-----------------|----------------|
+| 4 * BANK_LENGTH | MODULUS        |
+| 5 * BANK_LENGTH | MESSAGE (BASE) |
+| 6 * BANK_LENGTH | EXPONENT       |
+| 7 * BANK_LENGTH | RESULT         |
+
+## Implementation Details
+
+The top-level core module contains:
+ * Block memory buffers for input and output operands
+ * Block memory buffers for internal quantities
+ * Precomputation module (Montgomery modulus-dependent speed-up coefficient)
+ * Precomputation module (Montgomery parasitic power compensation factor)
+ * Exponentiation module
+
+The exponentiation module contains:
+ * Buffers for storage of temporary values
+ * Two modular multipliers that do right-to-left binary exponentiation (one multiplier does squaring, the other one does multiplication simultaneously)
+
+The modular multiplier module contains:
+ * Buffers for storage of temporary values
+ * Wide operand loader
+ * Systolic array of processing elements
+ * Adder
+ * Subtractor
+ 
+The systolic array of processing elements contains:
+ * Array of processing elements
+ * Two FIFOs that accommodate carries and products
+
+Note that the core is supplemented by a reference model written in C, which has extensive comments describing tricky corners of the underlying math.
+
+## Vendor-specific Primitives
+
+The CrypTech Alpha platform is based on the Xilinx Artix-7 200T FPGA, and this core takes advantage of Xilinx-specific DSP slices to carry out math-intensive operations. All vendor-specific math primitives are placed under /rtl/pe/artix7/. The core also offers generic replacements under /rtl/pe/generic/, which can be used for simulation with third-party tools that are not aware of Xilinx-specific primitives. When porting to other architectures, only those three low-level modules need to be ported. Selection of vendor/generic primitives is done in modexpa7_primitive_switch.v. Note that if you change the latency of the processing element, the SYSTOLIC_PE_LATENCY setting in modexpa7_settings.v must be changed accordingly.
diff --git a/src/rtl/pe/modexpa7_primitive_switch.v b/src/rtl/pe/modexpa7_primitive_switch.v
index d38069b..3551d7a 100644
--- a/src/rtl/pe/modexpa7_primitive_switch.v
+++ b/src/rtl/pe/modexpa7_primitive_switch.v
@@ -1,4 +1,4 @@
-//`define USE_VENDOR_PRIMITIVES
+`define USE_VENDOR_PRIMITIVES
 
 `ifdef USE_VENDOR_PRIMITIVES
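
The address arithmetic described in the README above can be sketched in C. This is a minimal illustration and not part of the commit: all function and enum names are hypothetical, and only the formulas (32 * 2 ** OPERAND_ADDR_WIDTH bits, banks at 4..7 * BANK_LENGTH, 2 ** SYSTOLIC_ARRAY_POWER processing elements) come from the README text. The bank length is assumed to be the operand width in bytes, which agrees with the 0x80/0x100/0x200 values the README lists.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper names; only the formulas are taken from the README. */

/* Largest operand width in bits: 32 * (2 ** OPERAND_ADDR_WIDTH). */
static uint32_t modexpa7_max_operand_bits(unsigned operand_addr_width)
{
    return UINT32_C(32) << operand_addr_width;
}

/* BANK_LENGTH in bytes (assumed: one bank stores one full-width operand). */
static uint32_t modexpa7_bank_length(unsigned operand_addr_width)
{
    return modexpa7_max_operand_bits(operand_addr_width) / 8;
}

/* Bank indices in units of BANK_LENGTH, as given in the README bank table. */
enum modexpa7_bank {
    MODEXPA7_BANK_MODULUS  = 4,
    MODEXPA7_BANK_MESSAGE  = 5,
    MODEXPA7_BANK_EXPONENT = 6,
    MODEXPA7_BANK_RESULT   = 7
};

/* Byte offset of an operand bank from the start of the core's address space. */
static uint32_t modexpa7_bank_offset(enum modexpa7_bank bank,
                                     unsigned operand_addr_width)
{
    return (uint32_t)bank * modexpa7_bank_length(operand_addr_width);
}

/* Number of processing elements, or 0 if SYSTOLIC_ARRAY_POWER is outside
 * the allowed range 1..OPERAND_ADDR_WIDTH-1. */
static uint32_t modexpa7_systolic_elements(unsigned systolic_array_power,
                                           unsigned operand_addr_width)
{
    if (systolic_array_power < 1 || systolic_array_power >= operand_addr_width)
        return 0;
    return UINT32_C(1) << systolic_array_power;
}

/* Cross-check the helpers against the example values quoted in the README. */
static void modexpa7_layout_self_check(void)
{
    assert(modexpa7_max_operand_bits(5) == 1024);
    assert(modexpa7_max_operand_bits(6) == 2048);
    assert(modexpa7_max_operand_bits(7) == 4096);

    assert(modexpa7_bank_length(5) == 0x80);
    assert(modexpa7_bank_length(6) == 0x100);
    assert(modexpa7_bank_length(7) == 0x200);

    assert(modexpa7_bank_offset(MODEXPA7_BANK_MODULUS, 5) == 0x200);
    assert(modexpa7_bank_offset(MODEXPA7_BANK_MODULUS, 6) == 0x400);
    assert(modexpa7_bank_offset(MODEXPA7_BANK_MODULUS, 7) == 0x800);

    assert(modexpa7_systolic_elements(1, 5) == 2);
    assert(modexpa7_systolic_elements(5, 5) == 0); /* out of range */
}
```

A driver would add the offset returned by `modexpa7_bank_offset()` to the base address at which the core is mapped; that base address is platform-specific and is not given in the README.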