Barra
Barra - NVIDIA G80 GPU Functional Simulator
What is Barra
Barra simulates CUDA programs at the assembly language level (Tesla ISA). Its ultimate goal is to provide a 100% bit-accurate simulation, offering bug-for-bug compatibility with NVIDIA G80-based GPUs. It works directly with CUDA executables; neither source modification nor recompilation is required.
Barra is primarily intended as a tool for research on computer architecture, although it can also be used to debug, profile and optimize CUDA programs at the lowest level.
Getting Barra
Source Tarballs (recommended)
Binary packages
Older versions
Development source repository
- Accessible in the Unisim subversion repository.
Quick start
See the Barra Tutorial.
What's new?
See the Changelog.
Installation
Supported Platforms
Barra was tested on GNU/Linux i386 and x86_64. It should be easily portable to Win32 with Cygwin and Mac OS X, but this has not been tested so far.
Binary packages require a CPU with SSE2 support (Intel Pentium 4, AMD Athlon 64 or newer), and a fairly up-to-date Linux distribution.
Tested platforms:
- Ubuntu 8.04 x86_64
- Ubuntu 8.10 i386
- Debian Lenny x86_64
CUDA 2.0, 2.1, 2.2 and 2.3 are supported.
Requirements
Binary package:
- libc6 (>= 2.7)
- libstdc++6 (>= 4.2)
- libxml2 (2.6.30 recommended)
- ncurses (>= 5.6 recommended)
- zlib (1.2.3 recommended)
- libboost-thread1.35
Sources:
- Development versions of the above packages
- automake (1.10 recommended), autoconf
- libtool (2.2.4 recommended)
Installing
From a binary package:
Extract the archive contents to an empty directory (say $HOME/usr or /usr/local). The Barra library is in lib/libcuda.so. More information is available in the Barra Tutorial.
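After extraction, it can be useful to confirm that the driver library is where the text above says it should be. The following is a minimal sketch; barra_check_install is an illustrative helper name and is not shipped with Barra.

```shell
# Check that an extracted Barra tree contains the driver library at the
# documented location, lib/libcuda.so, relative to the install prefix.
# barra_check_install is a hypothetical helper, not part of Barra.
barra_check_install() {
    prefix="$1"
    if [ -f "$prefix/lib/libcuda.so" ]; then
        printf 'Barra driver found: %s/lib/libcuda.so\n' "$prefix"
    else
        printf 'No lib/libcuda.so under %s\n' "$prefix" >&2
        return 1
    fi
}
```

For example, barra_check_install /usr/local would report whether the library is present under that prefix.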
From sources:
Refer to Compiling Barra.
If either of these fails, please read Barra Troubleshooting.
Working With Barra
Barra is made of a simulator and a driver. The Barra driver is a dynamic library which exports the same symbols and API as the CUDA Driver library (libcuda.so/cuda.dll).
How Does It Work?
To simulate a CUDA program which uses the Driver API, we temporarily replace the NVIDIA-provided CUDA Driver library with our Barra library by setting the LD_LIBRARY_PATH or PATH variable. This way, cuXxx calls are redirected to the Barra driver, which can then configure and run the simulator as if it were an actual GPU. When cuLaunch or cuLaunchGrid is called, control is transferred to the simulator for execution.
Programs that use the CUDA Runtime API are still linked with the official CUDA Runtime library provided by NVIDIA (libcudart.so/cudart.dll). This Runtime library is only a wrapper over the Driver library, which translates cudaXxx calls into cuYyy calls. We can then trick the Runtime library into using our driver instead of NVIDIA's driver, just as we do with programs using the Driver API directly.
How To Use It?
Let's assume we want to simulate the matrixMul sample of the NVIDIA CUDA SDK under GNU/Linux and Barra was installed in /usr/local/barra-0.4-linux_x86_64.
We temporarily override the default library search path so that it points to the Barra lib directory, and also specify where libcudart.so resides (the latter may not be required, depending on how the CUDA toolkit was installed):
export LD_LIBRARY_PATH="/usr/local/barra-0.4-linux_x86_64/lib/:/usr/local/cuda/lib"
Then, we can launch our executable:
cd NVIDIA_CUDA_SDK/bin/linux/debug
./matrixMul
Barra then outputs lots of debug information (CUDA function calls, cubin data, disassembly, memory allocation, thread scheduling...) during program execution.
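To avoid changing the search path of the whole shell session, the override can be limited to a single command. The sketch below assumes the install location used in this example; BARRA_HOME, CUDA_LIB_DIR, barra_library_path and barra_run are illustrative names, not settings or tools provided by Barra.

```shell
# Hypothetical helpers: run one command with the Barra lib directory
# placed first on the library search path, leaving the calling shell
# unchanged. BARRA_HOME and CUDA_LIB_DIR are illustrative variable names.
BARRA_HOME="${BARRA_HOME:-/usr/local/barra-0.4-linux_x86_64}"
CUDA_LIB_DIR="${CUDA_LIB_DIR:-/usr/local/cuda/lib}"

# Print the search path to use: Barra first, then the CUDA toolkit
# libraries, then whatever LD_LIBRARY_PATH already contained (if any).
barra_library_path() {
    printf '%s/lib:%s%s\n' "$BARRA_HOME" "$CUDA_LIB_DIR" \
        "${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
}

# Run an external command under Barra; env applies the override to that
# command only.
barra_run() {
    env LD_LIBRARY_PATH="$(barra_library_path)" "$@"
}
```

With these definitions loaded, barra_run ./matrixMul would start the sample against the Barra driver, while other programs started from the same shell keep using the NVIDIA driver.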
Multithreaded simulation
By default, only one Streaming Multiprocessor (SM) is simulated by one host thread. Multiple SMs can be simulated by independent host threads to accelerate simulation on multi-core and multi-processor machines. To enable this feature, set the environment variable CORE_COUNT to the number of threads (and SMs) to use (typically the number of logical processors of the host computer):
export CORE_COUNT=4
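Rather than hard-coding the value, CORE_COUNT can be derived from the host. This is a sketch that assumes getconf _NPROCESSORS_ONLN is available, as it is on GNU/Linux; detect_core_count is an illustrative helper name.

```shell
# Pick a CORE_COUNT value from the number of logical processors online.
# Falls back to 1 (a single host thread, Barra's default behaviour) if
# the count cannot be queried or is not a plain number.
detect_core_count() {
    n="$(getconf _NPROCESSORS_ONLN 2>/dev/null)" || n=1
    case "$n" in
        ''|*[!0-9]*) n=1 ;;
    esac
    printf '%s\n' "$n"
}

CORE_COUNT="$(detect_core_count)"
export CORE_COUNT
```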
Hacking, Debugging
As a computer architecture research tool, Barra is designed to be modified to suit the user's needs (e.g. gathering statistics on instructions or generating traces).
Statistics gathering can be enabled by setting the environment variable EXPORT_STATS:
export EXPORT_STATS=1
For each kernel run, a file named kernelname.csv will be created in the current directory. (Note that for C++ applications, the kernel name will be the mangled name, such as __globfunc__Z9matrixMulPfS_S_ii.)
This file can be opened in any spreadsheet software, and provides the following data for each kernel instruction:
- Address: instruction address
- Name: instruction mnemonic
- Executed: number of times it was executed
- Exec. scalar: number of SIMD channels it was executed on
- Integer: if it is an integer instruction
- FP32: if it is a single-precision floating-point instruction
- Flow: if it is a control-flow instruction
- Memory: if it accesses global or local memory
- Shared: if it accesses shared memory
- Constant: if it accesses constant memory
- Input regs: number of input operands from the register file
- Output regs: number of output operands to the register file
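The statistics file can also be processed from the command line. The sketch below totals the Executed column per instruction mnemonic. It rests on assumptions that this page does not confirm: that fields are separated by commas, that the first line is a header row using the column names listed above, and that no field contains a comma. executed_by_mnemonic is an illustrative name.

```shell
# Total the "Executed" count per mnemonic in a Barra statistics file and
# print "mnemonic,total" lines, highest total first.
# ASSUMPTIONS (not confirmed by the documentation): comma-separated
# fields, a header row with the column names "Name" and "Executed",
# and no commas inside fields.
executed_by_mnemonic() {
    awk -F',' '
        NR == 1 {
            # Locate the two columns by their header names.
            for (i = 1; i <= NF; i++) {
                if ($i == "Name") name_col = i
                if ($i == "Executed") exec_col = i
            }
            next
        }
        name_col && exec_col { count[$name_col] += $exec_col }
        END { for (m in count) printf "%s,%.0f\n", m, count[m] }
    ' "$1" | sort -t',' -k2,2nr -k1,1
}
```

If the actual file uses a different separator or header spelling, the -F option and the two header strings are the only parts that need to change.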
Some support is also present to generate debug traces. It is intended to be used for manual debugging and the trace format may change without notice. Several environment variables control the level of verbosity of traces:
- TRACE_INSN outputs to stderr each instruction executed along with the warp number and program counter. All subsequent trace types are designed to be used with TRACE_INSN.
- TRACE_MASK outputs the current predication mask of the warp after (not during) each instruction.
- TRACE_REG outputs the destination register of each instruction executed, in hex.
- TRACE_REG_FLOAT outputs the same data as TRACE_REG, but as floating-point. Requires TRACE_REG.
- TRACE_LOADSTORE traces every load and store from/to any memory type. *Very* verbose.
- TRACE_BRANCH controls the output of the SIMD branching algorithms.
- TRACE_SYNC outputs which warps are waiting at synchronization barriers.
Enabling tracing is done by setting the variable to 1 before running Barra. For example:
export TRACE_INSN=1
Tracing is disabled by setting each variable to 0:
export TRACE_INSN=0
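Because traces are verbose, it may be more convenient to enable them for a single run and capture them in a file. The following sketch relies on TRACE_INSN writing to stderr, as stated above; trace_run and its log-file argument are illustrative and not part of Barra.

```shell
# Run a command with instruction and mask tracing enabled for that run
# only, redirecting stderr (where TRACE_INSN output goes) to a log file.
# Usage: trace_run LOGFILE COMMAND [ARGS...]
# trace_run is a hypothetical helper name.
trace_run() {
    log="$1"
    shift
    env TRACE_INSN=1 TRACE_MASK=1 "$@" 2> "$log"
}
```

For example, trace_run matrixMul.trace ./matrixMul would leave the trace variables unset in the calling shell, so later runs are not traced.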
Features
Supported Features
- Simulator
- Integer arithmetic on 32-bit and 16-bit registers, floating-point arithmetic, bitwise instructions.
- Memory scatter/gather instructions from/to global and local memory.
- Shared and constant memory.
- Control flow instructions.
- Reciprocal, reciprocal square root and transcendental instructions (not bit-accurate).
- Synchronization barrier instruction.
- Integer texture sampling over linear memory.
- Driver
- Most of the CUDA Driver API.
- CUDA runtime API, through NVIDIA-provided libcudart.so/cudart.dll.
- Support for cubin files and Fat Executables (through CUDA Runtime).
- Host <-> Device linear memory copy.
- Multithreaded simulation.
Unimplemented Features (aka TODO List)
- Simulator
- Atomic instructions.
- Warp vote instructions.
- Double precision instructions.
- Complete texture sampling.
- Bit-accurate transcendentals.
- Run-time checks. As Barra is primarily designed to run valid CUDA benchmarks, few safety checks are performed on instruction validity, memory addresses... Running an invalid or buggy CUDA program is likely to result in a segmentation fault.
- Driver
- Asynchronous execution.
- Streams.
- Complete texture support.
- Arrays.
- Multiple contexts.
- Multiple devices.
How Fast (Or How Slow) Is It?
Barra runs 4 times faster on average (ranging from 8 times slower to 10 times faster) than source-level emulation in debug mode (nvcc --deviceemu) on a dual-core CPU, which is itself several orders of magnitude slower than execution on a high-end GPU. Barra is competitive with the emulator of the Ocelot project, and an order of magnitude faster than the CUDA Debugger.
Test platform: Core 2 Duo E8400, GeForce 9800 GX2, CUDA 2.2, gcc 4.3, Ocelot 0.4.46
What We Plan To Do Next
- More statistics about the CUDA code: coalesced memory accesses, bank conflicts in shared and constant memory...
- Transaction-Level Modeling of the G80 memory architecture to provide a realistic timing model.
- (Close to) cycle-accurate modeling of Streaming Multiprocessors.
- Modeling of power consumption.
- Simulation speed optimization.
About Us
Authors
- Sylvain Collange (sylvain.collange @ dcc.ufmg.br), maintainer
- Marc Daumas (marc.daumas @ univ-perp.fr)
- David Defour (david.defour @ univ-perp.fr)
- David Parello (david.parello @ univ-perp.fr)
Credits
Barra is supported by:
- Université de Perpignan
- The ANR (French National Research Agency) BioWic project, which provides funding
- Hardware donations from Nvidia
Feedback
Contact: sylvain.collange at ens-lyon.fr
Bug reports and comments are welcome.
Thanks
We wish to thank the following people for their contributions, ideas and/or support to Barra:
- Gregory Diamos
- Fabrice Ferrand
- Hou Yunqing
- Marcin Kościelnicki
- Hendra Sumilo
- Wladimir J. van der Laan
- Guillaume Yziquel
Publications related to Barra
- Yao Zhang, John D. Owens. A Quantitative Performance Analysis Model for GPU Architectures. 17th IEEE International Symposium on High-Performance Computer Architecture (HPCA 17), 2011.
- Sylvain Collange. Une architecture unifiée pour traiter la divergence de contrôle et la divergence mémoire en SIMT. SYMPosium en Architectures nouvelles de machines (SYMPA), 2011.
- Sylvain Collange. Enjeux de conception des architectures GPGPU : unités arithmétiques spécialisées et exploitation de la régularité. PhD Thesis, UPVD, 2010.
- Sylvain Collange, Marc Daumas, David Defour, David Parello. Barra: A Parallel Functional Simulator for GPGPU. 18th Annual IEEE/ACM International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS), 2010.
- Sylvain Collange, Marc Daumas, David Defour, David Parello. Étude comparée et simulation d’algorithmes de branchements pour le GPGPU. SYMPosium en Architectures nouvelles de machines (SYMPA), 2009.
- Sylvain Collange, David Defour, Yao Zhang. Dynamic detection of uniform and affine vectors in GPGPU computations. Europar 3rd Workshop on Highly Parallel Processing on a Chip (HPPC), 2009.
- Sylvain Collange, David Defour, David Parello. Barra, a Parallel Functional GPGPU Simulator. Technical Report hal-00359342, Université de Perpignan, 2009.
- Sylvain Collange. Barra : un simulateur de GPU pour CUDA. Poster, ARCHI09, Pleumeur-Bodou, 2009.


