# trans-fat

**Repository Path**: haoanqi/trans-fat

## Basic Information

- **Project Name**: trans-fat
- **Description**: https://github.com/cjg91/trans-fat/tree/main
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2023-08-25
- **Last Updated**: 2023-11-27

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

https://github.com/cjg91/trans-fat/tree/main

# trans-fat

An FPGA Accelerator for Transformer Inference

We accelerate a BERT layer across two FPGAs, partitioned into four pipeline stages, and conduct three levels of optimization using Vitis HLS, reporting runtimes for each. The accelerator implements a transformer layer of standard BERT size, with a sequence length of 128 (which can be modified).

## Instructions

This repository is designed to run on a host node with at least two Xilinx U200s. The instructions provided are specific to the Pitt CRC fpga-n0 node; however, they may be adapted as needed for other nodes.

### Dependencies

The required dependencies can be loaded using the following commands.

```
module load xilinx/vitis/2020.2
module load libfaketime
source /opt/xilinx/xrt/setup.sh
```

### Building

All building is performed in the `fpga/` directory. Navigate there and enter the following command, where `PART` selects which FPGA build to produce (`fpga1`, `fpga2`, or `all`, matching the `PART=all` usage under Running below).

```
faketime 'last year' make all TARGET= VERSION=<0, 1, 2, 3> PART= JOBS=<# of jobs requested>
```

If building for hardware, the output artifacts will automatically be copied into `/builds/v#/fpga#/`.

### Running

To run everything, enter `make test VERSION=<0, 1, 2, 3> PART=all` in the `fpga/` directory. Individual FPGA builds can be run directly using the host and executable in the desired `builds/` directory.

## Optimization Versions

### v0

- None

### v1

- Linear layer tiling
- Buffering of input and output data
- Unrolling of multiplication inner loops (see the sketch after this section)

### v2

- Transpose the A matmul input
- Cache a line of A.T
- Increase tile size in the j dimension
- Unrolling of computation in attention heads

### v3

- Stream DDR inputs/outputs in linear layers (a streaming sketch follows the results table)
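As a rough illustration of the v1 techniques (tiling, on-chip buffering, and unrolled inner loops), the sketch below shows what a tiled linear layer can look like in Vitis HLS. All names, dimensions, tile sizes, and unroll factors here are illustrative assumptions, not the repository's actual kernel code.

```
// Hypothetical sketch of a tiled, buffered, unrolled linear layer (v1-style).
// Dimensions and factors are illustrative, not the repository's actual values.
constexpr int N = 128;   // sequence length (the README's default)
constexpr int D = 768;   // hidden size of standard BERT
constexpr int TILE = 16; // illustrative tile height

// C = A * B, where A is N x D and B is D x D.
void linear_tiled(const float *A, const float *B, float *C) {
#pragma HLS interface m_axi port=A bundle=gmem0
#pragma HLS interface m_axi port=B bundle=gmem1
#pragma HLS interface m_axi port=C bundle=gmem2
    float a_buf[TILE][D]; // on-chip buffer for a tile of input rows
    float c_buf[TILE][D]; // on-chip buffer for the corresponding output rows
#pragma HLS array_partition variable=a_buf cyclic factor=8 dim=2

    for (int ti = 0; ti < N; ti += TILE) {
        // Buffer one tile of A from DDR.
        for (int i = 0; i < TILE; ++i)
            for (int k = 0; k < D; ++k)
                a_buf[i][k] = A[(ti + i) * D + k];

        // Compute the tile; the multiplication inner loop is unrolled.
        for (int i = 0; i < TILE; ++i) {
            for (int j = 0; j < D; ++j) {
                float acc = 0.0f;
                for (int k = 0; k < D; ++k) {
#pragma HLS unroll factor=8
                    acc += a_buf[i][k] * B[k * D + j];
                }
                c_buf[i][j] = acc;
            }
        }

        // Write the output tile back to DDR.
        for (int i = 0; i < TILE; ++i)
            for (int j = 0; j < D; ++j)
                C[(ti + i) * D + j] = c_buf[i][j];
    }
}
```

Buffering a whole tile before computing lets DDR accesses happen in bursts, and partitioning `a_buf` gives the unrolled inner loop enough read ports to actually run in parallel.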
## Results

Latency (ms):

| Version | fpga1   | fpga2    | all      |
|---------|---------|----------|----------|
| v0      | 4723.71 | 10950.90 | 15676.30 |
| v1      | 274.98  | 120.91   | 397.45   |
| v2      | 48.36   | 95.60    | 145.27   |
| v3      | 35.03   | 71.76    | 110.99   |
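As for the v3 streaming optimization referenced above, the sketch below shows one way DDR inputs/outputs can be streamed through concurrent dataflow stages in Vitis HLS. The function names, sizes, and the placeholder compute stage are illustrative assumptions, not the repository's actual kernels.

```
// Hypothetical sketch of v3-style streaming: load, compute, and store run as
// concurrent dataflow stages connected by FIFOs, so DDR transfers overlap
// with computation. Names and sizes are illustrative.
#include <hls_stream.h>

constexpr int N = 128; // sequence length
constexpr int D = 768; // hidden size

static void load(const float *in, hls::stream<float> &s) {
    for (int i = 0; i < N * D; ++i) {
#pragma HLS pipeline II=1
        s.write(in[i]);
    }
}

static void compute(hls::stream<float> &in, hls::stream<float> &out) {
    for (int i = 0; i < N * D; ++i) {
#pragma HLS pipeline II=1
        out.write(in.read() * 0.5f); // stand-in for the real linear-layer math
    }
}

static void store(hls::stream<float> &s, float *out) {
    for (int i = 0; i < N * D; ++i) {
#pragma HLS pipeline II=1
        out[i] = s.read();
    }
}

void streamed_layer(const float *in, float *out) {
#pragma HLS interface m_axi port=in bundle=gmem0
#pragma HLS interface m_axi port=out bundle=gmem1
#pragma HLS dataflow
    hls::stream<float> s0("s0"), s1("s1");
    load(in, s0);
    compute(s0, s1);
    store(s1, out);
}
```

Because the three stages run concurrently under the dataflow pragma, memory traffic overlaps with computation instead of serializing around it.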