The following code takes an array of 20 million integers and adds all the numbers together to get a final answer: each BLOCK on the GPU sums its slice of the array, and the host then adds up the per-BLOCK totals. I have heavily commented the code for your convenience. CREDIT: Professional CUDA C Programming by John Cheng & Max Grossman | https://amzn.to/2C6tUPg | I have learned tremendously from this book. Please consider buying.
//CUDA Programming Example | 20 Million Array Addition | Very basic implementation that has not been optimized for speed
//Company: CUDA Education | cudaeducation.com | cudaeducation@gmail.com | Please donate at cudaeducation.com
//YouTube Channel (please subscribe): https://www.youtube.com/channel/UCzpwNg0Ai8zCzbsEtozkfFQ
//Twitter: @cudaeducation
//Credit: Professional CUDA C Programming by John Cheng & Max Grossman | https://amzn.to/2C6tUPg | I have learned tremendously from this book. Please consider buying.
//Have questions? Comment on the YouTube channel https://www.youtube.com/channel/UCzpwNg0Ai8zCzbsEtozkfFQ or email cudaeducation@gmail.com
//DISCLAIMER: This code is for teaching purposes only! CUDA Education does not guarantee the accuracy of this code in any way. This code should not be used in a production or commercial environment. Any liabilities or loss resulting from the use of this code, in whole or in part, will not be the responsibility of CUDA Education.
//All rights reserved. This code is the property of CUDA Education. Please contact CUDA Education at cudaeducation@gmail.com if you would like to use this code in any way, shape or form.

//CODING ENVIRONMENT:
//CUDA Toolkit 9.0
//Windows environment
//Visual Studio 2017 Community Edition
//nVidia GeForce 1050 ti Graphics Card

//make sure to include all relevant libraries
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <windows.h>
//#include <time.h>
#include <stdio.h>
#include <stdlib.h> //for rand(), malloc(), and free()

__global__ void super_fast_addition_on_CUDA_GPU(int *data_received_from_host, int *data_to_send_back_to_host, unsigned int size_of_array)
{
	//set thread ID: this thread's index within its own BLOCK
	unsigned int cuda_education_thread_id = threadIdx.x;

	//linear address of the thread: its index within the grid in the x dimension.
	//the GPU only sees linear addresses.
	//we use this index value in the "boundary check" below so that no more threads do work than there are elements in the array (my_array_size is defined on the host side and passed to this kernel as size_of_array).
	//for more information on why this calculation is done, check out https://www.pelagos-consulting.com/wp-content/uploads/2017/08/CUDA_description.pdf
	unsigned int linear_id_of_thread = blockIdx.x * blockDim.x + threadIdx.x;

	//convert the global data pointer to a local pointer for this BLOCK's slice of the array
	int *local_data_of_block = data_received_from_host + blockIdx.x * blockDim.x;

	//boundary check: make sure the number of threads processing the array is not greater than the size of the array.
	//note: this simple version assumes size_of_array is a multiple of blockDim.x (20,000,000 / 256 = 78,125 exactly), so no thread actually takes this early exit here. if some threads of a BLOCK returned early, the __syncthreads() in the loop below would be reached by only part of the BLOCK, which is not allowed.
	if (linear_id_of_thread >= size_of_array) return;

	//in-place reduction in global memory
	for (int stride = 1; stride < blockDim.x; stride = stride * 2)
	{
		if ((cuda_education_thread_id % (2 * stride)) == 0)
		{
			//add the value at index cuda_education_thread_id + stride into the value at index cuda_education_thread_id
			local_data_of_block[cuda_education_thread_id] = local_data_of_block[cuda_education_thread_id] + local_data_of_block[cuda_education_thread_id + stride];
		}

		//synchronize at the end of every iteration before any thread proceeds to the next one.
		//this is very important: you have to make sure that ALL the threads within this BLOCK have completed their calculation before going to the next iteration. for example, ALL threads in this BLOCK have to finish the stride = 1 pass before any of them can start the stride = 2 pass. it's almost like synchronized dancing, where everyone has to reach a specific line before the performers can proceed to the next dance move. even a thread on its last iteration that is about to exit the loop must wait until all threads have concluded their business; that way, the BLOCK records an accurate final sum.
		//if you refuse to synchronize the threads, you will get a different (and likely wrong) partial sum on each iteration, because some threads may not be done with their calculation yet. the final number from the BLOCK will then be incorrect, and your math for the entire array will be off.
		//__syncthreads() is your friend in this game, even though it slows processing (but the result is correct, which is most important).
		//synchronize within the BLOCK
		__syncthreads();
	}

	/*at this point, all the processing (summation) within the BLOCK has concluded, and you can record the BLOCK's final value in the master list*/

	//write the result of the BLOCK to global memory
	if (cuda_education_thread_id == 0) data_to_send_back_to_host[blockIdx.x] = local_data_of_block[0];
}
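//ASIDE: to make the pairwise pattern of the reduction loop above easier to see, here is a small host-side sketch of the same neighbored reduction, run sequentially on one "block" of data. the helper name neighbored_reduction_sketch is just for this illustration; it is not called anywhere in this program.
//with 8 elements: at stride = 1, positions 0, 2, 4, 6 each absorb their right-hand neighbor; at stride = 2, positions 0 and 4 absorb; at stride = 4, position 0 absorbs, leaving the block total in element 0 (for the values 1..8 that total is 36).
static int neighbored_reduction_sketch(int *block_data, int block_size)
{
	for (int stride = 1; stride < block_size; stride = stride * 2)
	{
		//this inner loop plays the role of the threads in one BLOCK, one "thread" at a time
		for (int tid = 0; tid < block_size; tid++)
		{
			//the extra tid + stride < block_size check guards odd block sizes; the kernel above can omit it because blockDim.x is a power of two and each BLOCK owns a full slice of the array
			if ((tid % (2 * stride)) == 0 && tid + stride < block_size)
			{
				block_data[tid] = block_data[tid] + block_data[tid + stride];
			}
		}
	}
	return block_data[0]; //the block total, just like local_data_of_block[0] above
}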
int main(int argc, char **argv)
{
	//get the nVidia GPU running CUDA ready
	int cuda_education_device = 0;

	//set the device to be used for CUDA execution
	cudaSetDevice(cuda_education_device);

	//how many integers are going to be in your array?
	int my_array_size = 20000000;

	//print out the number of integers in your array
	printf("array size %d \n", my_array_size);

	//how many threads in each BLOCK will be launched in the CUDA kernel?
	int number_of_threads_in_each_block = 256;

	//set up the CUDA timer for tracking how long the kernel takes on the device (GPU).
	//for more information on this, check out https://devblogs.nvidia.com/how-implement-performance-metrics-cuda-cc/
	cudaEvent_t start, stop;
	cudaEventCreate(&start);
	cudaEventCreate(&stop);

	//START create variables for the CUDA kernel launch
	//this is where we plan how the array data will be broken up.
	//each BLOCK gets a piece of the data to sum using the threads assigned to it. each BLOCK's processing concludes with one number (not an array). this all occurs on the device (GPU).
	//the number from each BLOCK is captured in an array on the device (GPU) side. that array is sent back to the host, where a for loop performs the final summation and gives us the final answer.
	//the performance of the code depends heavily on the parameters chosen for BLOCK and cuda_education_grid.

	//create the BLOCK variable and declare the number of threads in each block.
	//we do NOT declare the number of blocks here, just the number of threads in each block.
	//the variable type is dim3, which simply means it can describe up to 3 dimensions (no need to discuss further here)
	dim3 BLOCK(number_of_threads_in_each_block, 1);

	//create the grid variable and declare the number of BLOCKs in the grid.
	//the number of BLOCKs in the grid is usually a calculated number, based on the size of the array and the number of threads in each block -> you need enough threads to cover every element in the array. a worked example follows just below.
	dim3 cuda_education_grid((my_array_size + BLOCK.x - 1) / BLOCK.x, 1);

	//print out the values used for grid and block
	printf("cuda_education_grid %d | BLOCK %d\n", cuda_education_grid.x, BLOCK.x);
	//END create variables for the CUDA kernel launch
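	//WORKED EXAMPLE: with the defaults above, the grid size comes out to (20,000,000 + 256 - 1) / 256 = 78,125 BLOCKs.
	//note that 256 * 78,125 is exactly 20,000,000, so in this particular case no thread fails the boundary check in the kernel.
	//the "+ BLOCK.x - 1" makes the integer division round up: if my_array_size were 20,000,001, the grid would grow to 78,126 BLOCKs so the one leftover element still gets a thread.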
	//START memory management on the host, NOT the device (GPU)

	//take the number of bytes in type int and multiply it by the number of ints in your array
	size_t int_bytes = my_array_size * sizeof(int);

	//allocate an array on the host side to hold the data you will send to the device (GPU)
	int *host_data_send_to_device = (int *)malloc(int_bytes);

	//allocate an array on the host side to receive the data returned from the device (GPU).
	//notice that it holds 1 int value for each BLOCK. remember, cuda_education_grid.x holds the number of BLOCKs
	int *host_data_received_from_device = (int *)malloc(cuda_education_grid.x * sizeof(int));

	//initialize a variable on the host side that will hold our final answer. it is one number, NOT an array
	int host_variable_final_sum_of_block_totals_from_gpu = 0;
	//END memory management on the host, NOT the device (GPU)

	//fill the array with random integers. we are still on the host side
	for (int i = 0; i < my_array_size; i++)
	{
		//random numbers are placed into the array so that you can perform the summation.
		//the & 0x3F masks each value down to the range 0..63, so that even the worst-case total (20,000,000 * 63 = 1,260,000,000) fits in a 32-bit int. without the mask, summing 20 million raw rand() values would overflow, and signed overflow is undefined behavior in C
		host_data_send_to_device[i] = (int)(rand() & 0x3F);

		//print out the numbers in the array if you like. not recommended with a large data set like the default 20 million set in my_array_size. change my_array_size to 5 if you would like to see the values used
		//printf("%d \n",host_data_send_to_device[i]);
	}

	//ACCURACY CHECK: perform the same calculation on the host side so we can check whether it matches our calculation on the device side. the result is saved in the variable cpu_addition_result
	int cpu_addition_result = 0;
	for (int i = 0; i < my_array_size; i++)
	{
		cpu_addition_result = cpu_addition_result + host_data_send_to_device[i];
	}

	//START memory management FOR the device (GPU).
	//please note that we are performing these actions from the host side

	//pointer for the data the device (GPU) will receive from the host
	int *device_data_received_from_host = NULL;

	//pointer for the data the device (GPU) will send back to the host
	int *device_data_send_back_to_host = NULL;

	//allocate memory on the device (GPU) for the data received from the host.
	//we already know the necessary size from the int_bytes variable
	cudaMalloc((void **)&device_data_received_from_host, int_bytes);

	//allocate memory on the device (GPU) for the data sent back to the host.
	//the memory size is based on the number of BLOCKs used to perform the summation.
	//remember, cuda_education_grid.x holds the number of BLOCKs
	cudaMalloc((void **)&device_data_send_back_to_host, cuda_education_grid.x * sizeof(int));
	//END memory management on the device (GPU)
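	//WORKED EXAMPLE: with the defaults above and 4-byte ints (the size on this Windows/Visual Studio setup), int_bytes = 20,000,000 * 4 = 80,000,000 bytes (about 76 MB), allocated once on the host and once on the device.
	//the per-BLOCK result arrays are far smaller: 78,125 BLOCK totals * 4 bytes = 312,500 bytes each.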
	//transfer data from the host to the device
	cudaMemcpy(device_data_received_from_host, host_data_send_to_device, int_bytes, cudaMemcpyHostToDevice);

	//make sure all preceding tasks on the device (GPU) have been completed before proceeding with new business. wouldn't want to mix ingredients from two completely different recipes, would you?!
	cudaDeviceSynchronize();

	cudaEventRecord(start);

	//launch the CUDA kernel.
	//cuda_education_grid and BLOCK are used as launch parameters.
	//device_data_received_from_host, device_data_send_back_to_host, and my_array_size are sent as arguments to assist with processing on the device (GPU)
	super_fast_addition_on_CUDA_GPU <<<cuda_education_grid, BLOCK>>> (device_data_received_from_host, device_data_send_back_to_host, my_array_size);

	cudaEventRecord(stop);

	//another check to make sure all preceding business on the device (GPU) is complete before proceeding.
	//remember, by default the CPU launches the kernel and immediately continues processing other instructions.
	//we use cudaDeviceSynchronize() to tell the CPU to halt further processing until the CUDA kernel has finished doing its business
	cudaDeviceSynchronize();
	cudaEventSynchronize(stop);

	//calculate and print the amount of time it took the CUDA kernel to process
	float milliseconds = 0;
	cudaEventElapsedTime(&milliseconds, start, stop);
	printf("DEVICE elapsed time: %f milliseconds\n", milliseconds);

	//at this point, the summation within each BLOCK has completed, and the per-BLOCK totals are waiting to be transferred from the device (GPU) to the host.
	//please note that this is not our answer yet: we still have to add up the numbers from each BLOCK. we do that last addition on the host side.

	//copy the BLOCK totals from the device (GPU) to the host
	cudaMemcpy(host_data_received_from_device, device_data_send_back_to_host, cuda_education_grid.x * sizeof(int), cudaMemcpyDeviceToHost);

	//our final-answer variable was already initialized above; reset it here just for clarity
	host_variable_final_sum_of_block_totals_from_gpu = 0;

	//add up all the numbers from each BLOCK to get the final answer.
	//cuda_education_grid.x gives us the number of BLOCKs
	for (size_t i = 0; i < cuda_education_grid.x; i++)
	{
		host_variable_final_sum_of_block_totals_from_gpu = host_variable_final_sum_of_block_totals_from_gpu + host_data_received_from_device[i];
	}

	//print out our final answer
	printf("DEVICE FINAL ANSWER: %d | Subscribe to the Cuda Education YouTube channel and check out cudaeducation.com \n", host_variable_final_sum_of_block_totals_from_gpu);

	//ACCURACY CHECK: check whether the result from the device matches the result from the host
	if (host_variable_final_sum_of_block_totals_from_gpu == cpu_addition_result)
	{
		printf("Device result MATCHES host result \n");
	}
	else
	{
		printf("Device result DOES NOT MATCH host result \n");
	}

	//practice good housekeeping: release the memory and events you allocated, then reset the device when you are done
	cudaFree(device_data_received_from_host);
	cudaFree(device_data_send_back_to_host);
	cudaEventDestroy(start);
	cudaEventDestroy(stop);
	free(host_data_send_to_device);
	free(host_data_received_from_device);
	cudaDeviceReset();

	return 0;

	//CUDA EDUCATION
	//Website: cudaeducation.com
	//Twitter: @cudaeducation
	//Email: cudaeducation@gmail.com
	//YouTube: Cuda Education
	//Credit: Professional CUDA C Programming by John Cheng & Max Grossman | https://amzn.to/2C6tUPg | I have learned tremendously from this book. Please consider buying.
}
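One thing the listing above never does is check the status codes that CUDA runtime calls return: cudaMalloc, cudaMemcpy, cudaDeviceSynchronize, and friends each return a cudaError_t, and if one of them fails, the only symptom will be a mismatched or garbage result. Below is a minimal sketch of a checking macro you could wrap those calls in. The macro name CUDA_CHECK is my own choice for this example; cudaError_t, cudaSuccess, and cudaGetErrorString are standard parts of the CUDA runtime API.

#include <stdio.h>
#include <stdlib.h>
#include "cuda_runtime.h"

//abort with a readable message if a CUDA runtime call fails
#define CUDA_CHECK(call)                                                  \
	do {                                                                  \
		cudaError_t cuda_education_status = (call);                       \
		if (cuda_education_status != cudaSuccess) {                       \
			fprintf(stderr, "CUDA error '%s' at %s:%d\n",                 \
				cudaGetErrorString(cuda_education_status),                \
				__FILE__, __LINE__);                                      \
			exit(EXIT_FAILURE);                                           \
		}                                                                 \
	} while (0)

//usage, applied to calls from the listing above:
//CUDA_CHECK(cudaMalloc((void **)&device_data_received_from_host, int_bytes));
//CUDA_CHECK(cudaMemcpy(device_data_received_from_host, host_data_send_to_device, int_bytes, cudaMemcpyHostToDevice));
//CUDA_CHECK(cudaDeviceSynchronize());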