Installing deka

Getting the source

Get the deka and Kraken sources.

git clone

git clone git://

The original Kraken might not work on recent systems. However, someone published my patched version on GitHub; that version should work on something like Debian Jessie. (Note the Czech comments in Fragment.cpp :-P)

Getting tables

Get the table files (*.dlt) generated by the TMTO Project/SRLabs. There are 40 files, 1.7 TB in total. You can get md5sums at . There is a torrent, or you can find someone to mail you a hard drive. Or, if you happen to live in Prague, you can get a copy at brmlab.
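Before spending hours installing, it is worth verifying the downloaded files against the published md5sums. A self-contained sketch of the check follows; it uses a throwaway file, since the real 1.7 TB of tables are obviously not present here, and the tables.md5 filename is hypothetical.

```shell
# Demonstration of checking *.dlt files against an md5 list with md5sum -c.
tmp=$(mktemp -d) && cd "$tmp"
printf 'demo table data' > 100.dlt     # stand-in for a real table file
md5sum 100.dlt > tables.md5            # in reality: the published md5sum list
md5sum -c tables.md5                   # prints "100.dlt: OK" on success
```

With the real tables, run the `md5sum -c` step in the directory holding the *.dlt files against the downloaded checksum list.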

Installing tables

It is done this way:

./TableConvert      di        /mnt/tables/gsm/100.dlt 100.ins:0            100.idx
#              table format   source file             destination:offset   index destination

if the tables are stored in files. However, to avoid filesystem overhead, a direct installation on a block device is advised. The script should help you with this.
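To install all 40 tables in one pass, a loop along these lines may help. This is only a sketch: the device path is a placeholder, and the offset step is a rough figure derived from 1.7 TB spread over 40 files; take the real offsets from the generated tables.conf.

```shell
# Sketch: install every *.dlt onto one block device at increasing offsets.
# DEV is a placeholder and the ~43 GB step is illustrative, not authoritative.
DEV=/dev/disk/by-id/YOUR-TABLE-DISK
offset=0
for f in /mnt/tables/gsm/*.dlt; do
    id=$(basename "$f" .dlt)
    echo ./TableConvert di "$f" "$DEV:$offset" "$id.idx"   # drop "echo" to really run
    offset=$((offset + 43000000000))
done
```

The `echo` makes this a dry run; compare the printed commands against tables.conf before removing it.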

Configuring tables for deka

Edit delta_config.h and fill in the paths to the devices and index files, and the offsets, taken from the generated tables.conf.
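The authoritative format is whatever delta_config.h already contains; purely as an illustration (every name below is hypothetical, not taken from deka), an entry ties together a device path, an offset, and an index file from tables.conf:

```c
/* Hypothetical sketch only; use the macro/field names delta_config.h
 * actually defines. Values come from the generated tables.conf. */
#define TABLE_100_DEVICE "/dev/disk/by-id/ata-SomeDisk"
#define TABLE_100_OFFSET 0UL
#define TABLE_100_INDEX  "/opt/deka/idx/100.idx"
```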

Protip: do not use /dev/sdX names, but a stable path (e.g. /dev/disk/by-id/ or /dev/disk/by-uuid/) or a UUID; /dev/sdX names tend to get mixed up between boots!

Generating kernel

Run ./ > slice.c, or the 64 variant, to generate a kernel with 4×32-bit or 4×64-bit vectors. One of them will probably be faster; so far it looks like genkernel32 is faster on AMD cards. We have no data for nVidia.

Switching to 64-bit also requires changing “slices” in and vankusconf.h.

Compilation fails with (older?) nVidia compilers due to the unsupported “unsigned long long” type. Replacing it with a “ulong” variable seems to help:

ulong one = 1; mask |= one << i;
  if(diff != all) {

Setting kernel options

In and .h, the number of concurrently launched kernels can also be changed. A good starting value is a small integer multiple of the number of compute cores on your card, minus 1: for example 4095 if your card has 2048 cores.
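The arithmetic behind the example value, as a quick sanity check (the multiplier 2 is just one choice of “small integer”):

```shell
CORES=2048    # compute cores on the card
MULT=2        # small integer multiplier
echo $((MULT * CORES - 1))    # 2*2048 - 1 = 4095
```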

Additionally, QSIZE can be changed; it should hold about twice the number of fragments processed in parallel.

Running deka


Run, once or twice for each OpenCL device. It will ask which device you want to use and tell you to set the PYOPENCL_CTX environment variable so that it does not ask again.

Run, once or twice.

(or use to run all of the above, but running them manually is better the first time, as you can see the debug prints)

Then, connect to the server (for example with telnet) and test it.

~> telnet localhost 1578
Trying ::1...
Connected to localhost.
Escape character is '^]'.
crack 001110001001010111000110000100110100001000011010100001000010000110101100101010100110110100100111110011101110000000
Cracking #0 001110001001010111000110000100110100001000011010100001000010000110101100101010100110110100100111110011101110000000
Found 44D85D82BAF275B4 @ 2 #0 (table:412)
crack #0 took 35586 msec

Congratulations, you have a working setup!

Performance tuning

By entering “stats”, you can view the sizes of the burst queues and see whether your bottleneck is the storage (the “endpoints” queue) or the chain computation.

Possible speedups:

  • tune loop unrolling in kernel
  • tune number of iterations in kernel (currently 3000)
  • tune number of kernels executed
  • use async IO or multiple threads to read blocks
Except where otherwise noted, content on this wiki is licensed under the following license: CC Attribution-Noncommercial-Share Alike 4.0 International