As is customary with any programming language or API, we start with a hello world program. Keep in mind, however, that nvcc is better thought of as a compiler driver than as a single compiler. What it does is:
- separate device from host code into two separate files
- compile device code (with nvcc, cudafe, ptxas)
- invoke gcc for host code
Hence, if a plain C++ hello world program is passed to nvcc, the first two steps have nothing to do, and gcc is simply invoked to compile the C++ host code. To conform to CUDA, we need a more CUDA-specific hello world program:
//helloworld.cu
//Compile with => nvcc helloworld.cu -o helloworld
//Run with ./helloworld
#include <cuda.h>
#include <stdio.h>

__global__ void kernel(void) { }

int main(void) {
    kernel<<<1,1>>>();
    printf("Hello World \n");
}
Running helloworld prints Hello World, but the program differs from a C++ hello world. Let's look at the three CUDA-specific aspects of the program:
#.1. cuda.h
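Defines the public host functions and types for the CUDA driver API. In a CUDA 5.5 installation on Ubuntu 12.04, cuda.h is located at /usr/local/cuda-5.5/include/cuda.h. Note that nvcc implicitly includes cuda_runtime.h at the top of every .cu file, so this particular program would also compile without the explicit include.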
#.2. __global__
Functions that run on the device are declared with either __device__ or __global__. The __global__ keyword marks a kernel: a function that runs on the device, is launched from host code, and must return void. The __device__ keyword marks a function that runs on the device and can only be called from other device code. The two cannot be combined on the same function. In this program, since the body of the kernel is empty, nothing is computed on the GPU.
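To make the difference between the two qualifiers concrete, here is a minimal sketch in which the kernel actually computes something. It is not part of the original program: the names add, add_kernel, and add_on_device are made up for illustration, and the sketch assumes a working CUDA toolkit and a CUDA-capable GPU.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// __device__: runs on the GPU and can only be called from device code.
__device__ int add(int a, int b) { return a + b; }

// __global__: a kernel. Runs on the GPU, is launched from the host, returns void.
__global__ void add_kernel(int a, int b, int *result) {
    *result = add(a, b);
}

// Hypothetical host helper (illustration only): launches the kernel and
// copies the result back. Returns 0 on success, 1 on any CUDA error.
int add_on_device(int a, int b, int *out) {
    int *d_result = NULL;
    if (cudaMalloc(&d_result, sizeof(int)) != cudaSuccess) return 1;

    add_kernel<<<1, 1>>>(a, b, d_result);

    // A blocking cudaMemcpy on the default stream waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(out, d_result, sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_result);
    return err == cudaSuccess ? 0 : 1;
}

int main(void) {
    int sum = 0;
    if (add_on_device(2, 7, &sum) != 0) {
        fprintf(stderr, "CUDA error\n");
        return 1;
    }
    printf("2 + 7 = %d\n", sum);
    return 0;
}
```

Because a kernel cannot return a value, the result travels through device memory and is copied back to the host explicitly.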
#.3. kernel<<<1,1>>>()
The kernel launch introduces parallelism into the program. Here the kernel is configured with one block per grid and one thread per block. CUDA's thread abstraction means that you write the code for a single thread, and every thread in the launch executes that same code, typically on different data. This model scales well, since adding threads does not require changing the code. All the threads created by a single kernel launch form a grid. A grid consists of blocks, and a block consists of threads. Grids, blocks, and threads each have their own properties, such as dimensions and indices. A thread is the smallest unit of execution in a CUDA program.
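The grid, block, and thread hierarchy is easier to see when each thread records where it sits. The following is a small sketch under the same assumptions as the rest of this post (CUDA toolkit installed, GPU present). The names write_global_index and fill_indices are hypothetical and chosen only for this illustration.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Every thread runs this same function. The built-in variables blockIdx,
// blockDim, and threadIdx tell each thread its position in the grid.
__global__ void write_global_index(int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique index across the grid
    if (i < n) out[i] = i;
}

// Hypothetical host helper (illustration only): launches blocks * threads
// threads and copies their indices into host[]. Returns 0 on success.
int fill_indices(int blocks, int threads, int *host) {
    int n = blocks * threads;
    int *d_out = NULL;
    if (cudaMalloc(&d_out, n * sizeof(int)) != cudaSuccess) return 1;

    write_global_index<<<blocks, threads>>>(d_out, n);

    // Blocking copy: waits for the kernel, then transfers the results.
    cudaError_t err = cudaMemcpy(host, d_out, n * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return err == cudaSuccess ? 0 : 1;
}

int main(void) {
    int host[8];
    if (fill_indices(2, 4, host) != 0) {   // 2 blocks of 4 threads each
        fprintf(stderr, "CUDA error\n");
        return 1;
    }
    for (int i = 0; i < 8; ++i)
        printf("out[%d] = %d\n", i, host[i]);
    return 0;
}
```

With two blocks of four threads, eight threads run the same kernel body, and each one writes to a different array element because its computed index is different.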
This kernel takes no input parameters and produces no output. The generic syntax for a kernel launch is:
kernel_function<<<BlocksPerGrid, ThreadsPerBlock>>>( argument1, argument2, ... );
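Since the GPU does no work in this program, kernel<<<10,10>>>() prints the same thing. Even kernel<<<0,0>>>() does: zero blocks or zero threads is an invalid launch configuration, so the launch fails, but the program never checks for errors and the host still prints its message. Later parts of the series will discuss better hello world examples.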
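One way to see what each configuration really does is to ask the runtime for the status of every launch. The sketch below is an illustration rather than part of the original program. The helper try_launch is a hypothetical name, while cudaGetLastError, cudaDeviceSynchronize, and cudaGetErrorString are CUDA runtime API calls.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void kernel(void) { }

// Hypothetical helper (illustration only): launches the empty kernel with
// the given configuration and returns the resulting error status.
cudaError_t try_launch(int blocks, int threads) {
    kernel<<<blocks, threads>>>();
    cudaError_t err = cudaGetLastError();  // reports launch-time errors and clears them
    if (err != cudaSuccess) return err;
    return cudaDeviceSynchronize();        // reports errors raised during execution
}

int main(void) {
    printf("<<<1,1>>>   : %s\n", cudaGetErrorString(try_launch(1, 1)));
    printf("<<<0,0>>>   : %s\n", cudaGetErrorString(try_launch(0, 0)));
    printf("<<<10,10>>> : %s\n", cudaGetErrorString(try_launch(10, 10)));
    printf("Hello World \n");
    return 0;
}
```

The <<<0,0>>> launch should report an error while the other two succeed, yet Hello World is printed in every case, which is why the original program cannot tell the difference.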