**Contact**

| ||||
---|---|---|---|---|

Day | Events | |||

02/20 Tuesday | 19^{30} - meetup | |||

02/21 Wednesday | ||||

Day | Events |

**Sponsors**

This document describes an efficient A5/1 implementation on SIMD hardware. Then, a fast method for hard-drive lookups on Linux is described.

Read basic operation in a Wikipedia article if you have never seen it before.

You can find some example implementations around the internets. They usually work like this:

- Extract the clocking bits from the registers (bitshift + AND)
- Compare majority clocking bits (some logical operations and a conditional branch)
- Compute next bits (several conditional branches and logical operations)
- Update registers with these bits (bitshift)

Running the aforementioned naive implementation on ATI HD7970 card gives us about 90 Mkey/s. The conditional branches don't really suit the GPU architecture and the registers are only 20 bits long, whereas the card can process 64bit long data types.

crypto.lo.gy article on bitslicing

We will run multiple A5/1 instances in parallel. We will achieve this by loading 64 A5/1 instances somewhat “rotated” into 64 registers that are 64 bits long.

64*64bit variables (vertical) - r[64] array ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ +-+-+-+-+-+-+-+-+ | | | | | | | | | ← A5/1 instance #1 (registers concatenated) | | | | | | | | | ← A5/1 instance #2 | | | | | | | | | ← A5/1 instance #3 | | | | | | | | | ... | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | +-+-+-+-+-+-+-+-+

Now for example register shifting is done by circular assignment

for(i=18; i>0; i--) { r[i] = r[i-1]; }

This rotates registers of **all** A5/1 instances at once.

Another popular operation in A5/1 is bitwise XOR. Again, XORing r[i] with r[j] XORs the bits i and j in all A5/1 instances - with a single instruction!

The original Kraken written by SRLabs in 2009 used hand-vectorized assembly-resembling C. We believed that 2015 compilers will unroll the loops and then convert all these commands into vector instructions for us. So for example XORing would be done by the vector unit 256 bits at once and circular assignments would be done by efficient vector rotation instructions. And the whole thing would use only 16 vector registers instead of 64 general purpose registers. However, it turned out the compiler sucks at this and it generates code that is even slower than the naive implementation. So we had to do the vectorization by hand. To produce a better maintainable code, we generate the C code with a shell script.

This is implemented in genkernel* shell scripts.

We found out that using 128-bit vectors (4*32 bit) is about 8 times faster (and hence 4 times faster in total as shorter vectors process less data) than using 256-bit vectors (4*64 bit) on our AMD HD 7970. We hypothesize long vectors won't fit into L1 cache. It may not be the case on other hardware, so we ship “genkernel32” and “genkernel64”. Pick the one that is faster on your hardware.

With this implementation, we get 1.2 Gkey/s on AMD HD 7970.

However, we need to clock two register and the third not - and this is different in each clocking for each cipher instance. So let's create three vectors - called clock1, clock2 and clock3 - each containing “1” on the given position when the given register of the given instance shall be clocked. Now we can clock the register with

r[i] = (r[i] & ~clock1) | (r[i-1] & clock1);

If the register is clocked, clock1 will have “1” and the right part of the expression will apply. Otherwise, the register will be overwritten by itself.

If the tables are sorted, you can find the endpoint in log(n) time, or about 35 seeks in table of size 2^35 (32 GiB). Disk seeks are the bottleneck of the attack. Let's do it better:

At first, we notice that the endpoint values are evenly distributed from 0x0000 to 0xFFFF. Hence we can limit the first part of binary search to a smaller area than the whole disk.

Then, we notice that we have several gigabytes of RAM that is much faster than the drive, and that drive works with “blocks” (usually 4 KiB on current hardware). We will compute the index, which has for each 4KiB block its starting value. Then we will do the binary search in the index in RAM and then load **one** block that contains the solution.

The 32GiB table consists of 8 Mi blocks and you need 8 bytes for a block, so the index will take about 64 MiB RAM - no problem with current computers.

This is implemented in TableConvert from original Kraken.

Naive table would contain tuples “(starting value, endpoint)”. We will notice that the endpoint has 12 bits zero by definition, but it still takes 2*64-12 = 116 bits.

Let's encode the starting point better. We don't need the whole 64 bits, as we have only about 2^34 chains in one table (less in the tables released, probably due to merges). Let's take a 34bit value and a hashing function that hashes to 64 bits (I guess that it's not a good idea to just pad it with zeros as the A5/1 machine might go nuts). This is the magic with kr02_whitening() and kr02_mergebits() you can see in delta.c. The 0x93cbc4077efddc15ULL constant is probably taken from some Linear Congruential Generator, but I haven't found any more info on this.

Now we have 34+64-12 = 86 bits.

The tables are sorted by endpoints and when there are ~2^32 evenly distributed points in 2^52 space (2^64 space minus 12 bits that are always zero), the average gap will be 2^20. So let's encode the endpoint with only 20 bits (+ some margin/padding, as we need variable length encoding). One chain then takes on average only 34+20 = 54 bits.

This is called delta encoding and you can see the parser in delta.c. The format is not documented and we did not want to mess with it, so we edited the three functions (StartEndpointSearch, CompleteEndpointSearch and load_idx) from DeltaLookup from the original Kraken to be compilable with GCC and used them as-is.

Original Kraken is a mix of non-free code and code with unstated license (presumably written by someone else) - DeltaLookup is the unstated case. We tried to contact the author to clarify the license situation and we received no reply. We believe using a small interface to binary undocumented format can be treated as fair use.

So now we know which block we want. Straightforward method would be to open the device, fseek() to a desired position, and fread() the block.

We ran this on a SSD and we got about 5000 IOPS - far less than what was advertised on the SSD product sheet. The problem is the latency - your request goes through the kernel, controller, SATA and SSD controller. The latency of 200 μs seems good, but we would like to get it better.

Current hardware supports NCQ. It can queue several (~32) commands. This is good both for rotation drives, as the commands can be executed in optimal order, and for SSDs, as this eliminates the latency. Linux kernel supports NCQ.

We have 4 SSD drives and the queue for each is 32 commands long. Having the application run in 128 threads sounds like a thread-synchronization nightmare to me. We will rather use the madvise() function and leave the parallelism to the kernel. We flag the blocks we need as MADV_WILLNEED and the kernel will cache the pages in question to memory. We can then just read them. Oh, and use MADV_RANDOM to disable readahead and MADV_DONTNEED once we got it to free the memory.

This is also implemented in delta.c.

Original Kraken uses the following approach:

- mmap the 4096 long block
- madvise it
- wait until it is ready (mmap and madvise other blocks meanwhile)
- read it
- munmap it

We have run this on Linux 3.2 and the mmap and munmap system calls were very inefficient (no idea whether it was misconfigured or if it is just a feature). So we mmapped the whole device. We achieved a 14 % performance gain by it. mmap()ing 1.7 TB is possible on 64-bit system and I don't think someone would run such application on something 32-bit.

Except where otherwise noted, content on this wiki is licensed under the following license: CC Attribution-Noncommercial-Share Alike 4.0 International