Sunday 1 September 2013

Hello CUDA -- Part 2

A hello world program is often the first program one writes when starting out in a new language. CUDA, however, involves the GPU, and a program that checks the GPU's status plays much the same role as a hello world program. CUDA installations come with sample programs that query the GPU; in a CUDA 5.5 installation on Ubuntu 12.04, such a program can be found at /usr/local/cuda-5.5/samples/1_Utilities/deviceQuery.cpp.

Here we look at a simple variant of such a program:

//devicequery.cu
//Compile with => nvcc devicequery.cu -o devicequery
//Run with ./devicequery


#include <stdio.h>
#include <cuda_runtime.h>  // declares the runtime API; nvcc includes it automatically for .cu files
 
// Print device properties
void printDevProp(cudaDeviceProp devProp)
{
    printf("Major revision number:         %d\n",  devProp.major);
    printf("Minor revision number:         %d\n",  devProp.minor);
    printf("Name:                          %s\n",  devProp.name);
    printf("Total global memory:           %zu\n", devProp.totalGlobalMem);
    printf("Total shared memory per block: %zu\n", devProp.sharedMemPerBlock);
    printf("Total registers per block:     %d\n",  devProp.regsPerBlock);
    printf("Warp size:                     %d\n",  devProp.warpSize);
    printf("Maximum memory pitch:          %zu\n", devProp.memPitch);
    printf("Maximum threads per block:     %d\n",  devProp.maxThreadsPerBlock);
    for (int i = 0; i < 3; ++i)
        printf("Maximum dimension %d of block:  %d\n", i, devProp.maxThreadsDim[i]);
    for (int i = 0; i < 3; ++i)
        printf("Maximum dimension %d of grid:   %d\n", i, devProp.maxGridSize[i]);
    printf("Clock rate:                    %d\n",  devProp.clockRate);
    printf("Total constant memory:         %zu\n", devProp.totalConstMem);
    printf("Texture alignment:             %zu\n", devProp.textureAlignment);
    printf("Concurrent copy and execution: %s\n",  (devProp.deviceOverlap ? "Yes" : "No"));
    printf("Number of multiprocessors:     %d\n",  devProp.multiProcessorCount);
    printf("Kernel execution timeout:      %s\n",  (devProp.kernelExecTimeoutEnabled ? "Yes" : "No"));
    return;
}
 
int main()
{
    // Number of CUDA devices
    int devCount;
    cudaGetDeviceCount(&devCount);
    printf("CUDA Device Query...\n");
    printf("There are %d CUDA devices.\n", devCount);
 
    // Iterate through devices
    for (int i = 0; i < devCount; ++i)
    {
        // Get device properties
        printf("\nCUDA Device #%d\n", i);
        cudaDeviceProp devProp;
        cudaGetDeviceProperties(&devProp, i);
        printDevProp(devProp);
    }
 
    printf("\nPress any key to exit...");
    char c;
    scanf("%c", &c);

    return 0;
}

Running this program 'queries' the GPU and prints the properties above for each CUDA device found.
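Note that the runtime calls above are not checked for errors; on a machine without a working CUDA driver, cudaGetDeviceCount fails and devCount is left unset. A minimal sketch of the same query with error checking (the file name here is just an illustration) would be:

```cuda
//devicequery_checked.cu -- a sketch of the device count query with error checking
//Compile with => nvcc devicequery_checked.cu -o devicequery_checked
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int devCount = 0;
    cudaError_t err = cudaGetDeviceCount(&devCount);
    if (err != cudaSuccess)
    {
        // cudaGetErrorString turns the error code into a readable message
        fprintf(stderr, "cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("There are %d CUDA devices.\n", devCount);
    return 0;
}
```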


Tuesday 27 August 2013

Hello CUDA -- Part 1

As is the practice in most computer languages and APIs, one starts with a hello world program. However, the nvcc compiler is more of a compiler driver; what it does is:

  1. separate device from host code into two separate files
  2. compile device code (with nvcc, cudafe, ptxas)
  3. invoke gcc for host code
Hence, if a plain C++ hello world program is sent through nvcc, nothing happens in the first two steps, but gcc is invoked to compile the C++ host code. For a hello world that actually exercises CUDA, we need more CUDA-specific code:

//helloworld.cu
//Compile with => nvcc helloworld.cu -o helloworld
//Run with ./helloworld

#include <cuda.h>
#include <stdio.h>

__global__
void kernel(void)
{
}

int main(void)
{
    kernel<<<1,1>>>();
    printf("Hello World \n");
    return 0;
}

Running helloworld prints hello world; however, this is different from a C++ Hello World. Let's look at the three CUDA-specific aspects of the program:

#.1. cuda.h

Defines the public host functions and types for the CUDA driver API. In a CUDA 5.5 installation on Ubuntu 12.04, cuda.h is located at /usr/local/cuda-5.5/include/cuda.h. (The runtime API calls used in these posts, such as cudaGetDeviceProperties, are declared in cuda_runtime.h, which nvcc includes automatically when compiling .cu files.)

#.2. __global__
Functions that are to be executed as kernels on the device have their declaration prefixed with __device__ and/or __global__. In particular, the keyword __global__ marks a function that runs on the device and is called from host code. In this program, since the body of the kernel is empty, nothing is computed on the GPU device.

#.3. kernel<<<1,1>>>()
The kernel launch introduces parallelism into the program; this kernel is configured with one block per grid and one thread per block. CUDA's thread abstraction means one writes code for a single thread, and that same code is executed by every launched thread. This programming model has the advantage of scalability, since all threads execute the same piece of code. A single invoked kernel is referred to as a grid; a grid is comprised of blocks of threads, and a block is comprised of multiple threads. Grids, blocks, and threads have different properties, and a thread is the smallest execution unit in a CUDA program.
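The block/thread hierarchy becomes visible as soon as more than one thread is launched. A common sketch (not part of this hello world, and indexKernel is just an illustrative name) is to compute a unique global index per thread from the built-in blockIdx, blockDim, and threadIdx variables:

```cuda
// A sketch: each thread computes its own global index.
__global__
void indexKernel(int *out)
{
    // blockIdx.x  : which block this thread belongs to
    // blockDim.x  : number of threads per block
    // threadIdx.x : this thread's position inside its block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = i;
}

// Launched as, e.g., indexKernel<<<4, 256>>>(d_out);
// 4 blocks x 256 threads = 1024 threads, with i ranging over 0..1023.
```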


The kernel function takes no input parameters and has no output. The generic syntax for a kernel launch is:

kernel_function<<<BlocksPerGrid, ThreadsPerBlock>>>(argument1, argument2, ...);

Since the kernel body is empty and launch errors go unchecked, even kernel<<<0,0>>>() leads to the same output, as does kernel<<<10,10>>>(); note, though, that <<<0,0>>> is in fact an invalid launch configuration, which the runtime rejects without any visible effect here. Later parts of the series will discuss better hello world examples.
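That the <<<0,0>>> launch is rejected can be observed by querying the runtime's last error after the launch. A sketch, using the runtime API's cudaGetLastError, is:

```cuda
//Compile with => nvcc helloworld_checked.cu -o helloworld_checked
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void kernel(void) {}

int main(void)
{
    kernel<<<0,0>>>();   // invalid configuration: zero blocks, zero threads
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("Launch failed: %s\n", cudaGetErrorString(err));
    printf("Hello World \n");  // still prints; the host code is unaffected
    return 0;
}
```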