CUDA, Supercomputing for the Masses (6) > 소프트웨어

[App 개발] CUDA, Supercomputing for the Masses (6)

hongjuny

128.♡.28.1

2009.08.21 04:09 1,915 0 0 0

LV.1 0%

본문

- 광역 메모리와 CUDA 프로파일러

이번 6회에서는 메모리 성능과 reverseArray_multiblock_fast.cu 에서 공유 메모리를 사용하는 것을 살펴보겠습니다. 이번에는 CUDA 프로파일러를 이용하여 광역 메모리를 점검할 것입니다.

눈치 빠른 독자분들은 이미 우리가 4회와 5회에서 살펴본 배열 뒤집기를 생각하시고 어째서 공유 메모리가 광역 메모리보다 빠른지 궁금해하실 겁니다. 공유 메모리 버젼 reverseArray_multiblock_fast.cu 에서 커널은 광역 메모리에서 공유 메모리로 데이터를 복사하고 결과를 다시 광역 메모리로 되돌려 놓지만, reverseArray_multiblock.cu 에서는 광역 메모리에서 광역 메모리로 데이터를 한 번만 복사하기 때문입니다. 광역 메모리 속도가 공유 메모리에 비해 100 에서 150배 정도 느리므로 광역 메모리 억세스 과정이 두 가지 예제에서 가장 속도를 잡아먹는 과정이 되어야 하는데, 어째서 공유 메모리 버젼이 더 속도가 빠를까요?

이 문제에 답하기 위해서는 광역 메모리에 대한 이해와 CUDA 개발장비에서 제공하는 CUDA 프로파일러 사용법을 익히셔야 합니다. CUDA 소프트웨어를 프로파일링하는 것은 빠르고 쉽습니다. 텍스트와 그래픽 버젼으로 제공되는 소프트웨어는 CUDA 장치의 하드웨어 프로파일 카운터를 읽어들입니다.텍스트 프로파일링은환경 설정에서 간단히 실행시킬 수 있습니다. 그래픽 버젼도 간단합니다. cudaprof 를 실행한 후 클릭만 하면 됩니다. 프로파일링은 중요한 단서들을 제공합니다. 프로파일 이벤트는 CUDA 하드웨어 장치 내에서 처리됩니다. 하지만 프로파일된 커널은 비동기적으로 실행할 수 없습니다. 통신 부하를 줄이기 위해서 커널이 종료된 이후 결과를 호스트로 전송합니다.

- 광역 메모리

숙련된 CUDA 프로그래머가 되려면 광역 메모리를 효과적으로 사용할 줄 알아야 합니다. 이제 소개할 내용은 reverseArray_multiblock.cu 와 reverseArray_multiblock_fast.cu 의 성능 차이를 이해하기에 충분한 광역 메모리에 대한 내용을 다룰 것입니다. 향후에도 필요에 따라 광역 메모리의 효율적인 사용에 대해 언급하겠습니다. 광역 메모리의 자세한 내용은 CUDA 프로그래밍 가이드 5.1.2.1 장을 참조하세요.

광역 메모리는 하드웨어가 데이터를 가장 적은 작업으로 페치(혹은 저장) 할 수 있는 반 워프 내로 합쳐서 억세스할 때 가장 메모리 밴드폭을 크게 사용할 수 있습니다. CUDA 계산 장치 (1.0 과 1.1) 는 데이터를 단일 64비트 혹은 128비트씩 페치할 수 있습니다. 만약 메모리 작업을 합칠 수 없다면, 메모리 작업이 각 쓰레드마다 반 워프씩 나뉘어지므로 비효율적입니다. 합쳐지지 않은 메모리 억세스로 보는 손해는 데이터 종류와 크기에 따라 달라집니다. CUDA 문서에서는 서로 다른 데이터 종류에 따라 예상되는 성능 저하를 대략 정리해 놓고 있습니다.

* 32 비트 데이터형은 10 배 정도 느리다
* 64 비트 데이터형은 4 배 정도 느리다
* 128 비트 데이터형은 2 배 정도 느리다

G80 아키텍쳐 내에서 모든 쓰레드가 반 워프 크기 블럭의 광역 메모리 억세스를 효율적으로 처리하려면:

1. 쓰레드는 32, 64, 128 비트 데이터형을 억세스한다.
2. 처리되는 모든 16워드 데이터는 동일한 메모리 처리 크기의 동일 세그먼트 내에 있다. (혹은 128비트 워드일 때 두 배의 메모리 처리 크기) 따라서 데이터 시작 주소와 배치가 중요하다.
3. 쓰레드는 워드 단위로 순차적으로 억세스한다. 반 워드 내의 k 번째 쓰레드는 k 번째 워드를 억세스한다. 그러나 모든 워프 내의 쓰레드가 합동 쓰레드 억세스로 메모리를 억세스할 필요는 없다. 이 경우 “분산 워프” 라고한다.

GT200 같은 새 아키텍쳐는 위에서 살펴본 합동 방식보다 쉽게 처리할 수 있습니다. 아키텍쳐별 차이점은 이후에 깊이 다루도록 하겠습니다. 지금은 간단히 말해 만약 G80 CUDA 장치에서 최적화된 코드는 GT200 에서도 충분히 동작한다고 정리할 수 있습니다.

- 텍스트 프로파일링 사용법

CUDA 프로파일러 텍스트 버젼을 위한 환경 변수들은

* CUDA_PROFILE: 프로파일러 enable = 1, disable = 0
* CUDA_PROFILE_LOG: 로그파일 이름 지정 (기본값은 ./cuda_profile.log)
* CUDA_PROFILE_CSV: Comma Separated Version 로그파일 enable = 1, disable = 0
* CUDA_PROFILE_CONFIG: 네 가지 신호에 대한 환경 설정을 저장하는 파일 지정

마지막 항목이 중요한데, 한 번에 네 가지 신호만을 프로파일링 할 수 있습니다. 개발자는 CUDA_PROFILE_CONFIG 에 지정된 파일에 기록하고자 하는 이벤트 이름을 한 줄씩 넣어둡니다.

* gld_incoherent: Number of non-coalesced global memory loads
* gld_coherent: Number of coalesced global memory loads
* gst_incoherent: Number of non-coalesced global memory stores
* gst_coherent: Number of coalesced global memory stores
* local_load: Number of local memory loads
* local_store: Number of local memory stores
* branch: Number of branch events taken by threads
* divergent_branch: Number of divergent branches within a warp
* instructions: instruction count
* warp_serialize: Number of threads in a warp that serialize based on address conflicts to shared or constant memory
* cta_launched: executed thread blocks

- 프로파일 카운터에 대하여

성능 카운터 값은 개별 쓰레드 활동에 관련이 없습니다. 그 대신 이 값은 쓰레드 워프 내의 이벤트를 의미합니다. 예를 들어 쓰레드 워프 내의 부조화한 데이터 저장은 gst_incoherent 값을 1 증가시킬 것입니다. 따라서 저장된 최종값은 모든 워프의 부조화한 데이터 저장 정보를 나타냅니다.

더불어, 프로파일러는 GPU 내의 멀티프로세서 중 한 개만을 포착합니다. 따라서 카운터 값은 특정 커널을 위해 실행된 워프의 총 개수와 무관합니다. 따라서 프로파일러 내의 퍼포먼스 카운터를 사용할 때에는 충분히 많은 쓰레드 블럭을 실행해서 멀티프로세서가 전체 작업에서 일관된 분량을 처리하게 하여야 합니다.

그러므로 카운터 값을 커널 코드를 검사하기 위한 값으로 사용할 수 없습니다. 카운터 값은 최적화된 코드와 그렇지 않은 코드의 상대평가로서 사용해야 합니다. 예를 들어, 소프트웨어 시작 단계에서 합동되지 않은 광역 메모리 읽기가 포착되었다면, 코드를 분석하여 합동되지 않은 메모리 억세스를 줄이는 방향으로 수정합니다. 대부분의 경우 목표는 합동되지 않은 글로벌 메모리 억세스를 0 으로 만드는 것이므로, 이 목표를 이루기 위해서 카운터 값을 유용하게 활용할 수 있습니다.

- 프로파일링 결과

프로파일러로 reverseArray_multiblock.cu 와 reverseArray_multiblock_fast.cu 를 살펴봅시다. 먼저 리눅스 bash 쉘 환경에서 환경 변수와 환경설정 파일을 다음과 같이 만듭니다.

export CUDA_PROFILE=1
export CUDA_PROFILE_CONFIG=$HOME/.cuda_profile_config

< Profiler configuration via environnent variables in Linux with bash >

gld_coherent
gld_incoherent
gst_coherent
gst_incoherent

< Contents of the CUDA_PROFILE_CONFIG file >

Running the reverseArray_multiblock.cu executable generates the following profiler report in ./cuda_profile.log:

reverseArray_multiblock.cu 를 실행해보면 프로파일러는 다음의 결과를 ./cuda_profile.log 로그파일에 기록합니다.

method,gputime,cputime,occupancy,gld_incoherent,gld_coherent,gst_incoherent,gst_coherent
method=[ memcopy ] gputime=[ 438.432 ]
method=[ _Z17reverseArrayBlockPiS_ ] gputime=[ 267.520 ] cputime=[ 297.000 ] occupancy=[ 1.000 ] gld_incoherent=[ 0 ] gld_coherent=[ 1952 ] gst_incoherent=[ 62464 ] gst_coherent=[ 0 ]
method=[ memcopy ] gputime=[ 349.344 ]

< Profile report for reverseArray_multiblock.cu >

동일한 방법으로 reverseArray_multiblock_fast.cu 를 실행하면 다음의 결과를 기존 파일 위에 덮어씁니다.

method,gputime,cputime,occupancy,gld_incoherent,gld_coherent,gst_incoherent,gst_coherent
method=[ memcopy ] gputime=[ 449.600 ]
method=[ _Z17reverseArrayBlockPiS_ ] gputime=[ 50.464 ] cputime=[ 108.000 ] occupancy=[ 1.000 ] gld_incoherent=[ 0 ] gld_coherent=[ 2032 ] gst_incoherent=[ 0 ] gst_coherent=[ 8128 ]
method=[ memcopy ] gputime=[ 509.984 ]

< Profile report for reverseArray_multiblock_fast.cu >

위의 두 프로파일러 결과를 비교해 보면 reverseArray_multiblock_fast.cu 는 비합동적 데이터 기록이 없었지만 reverseArray_multiblock.cu 에는 많이 발견되었습니다. reverseArray_multiblock.cu 소스코드를 살펴보시고 혹시 문제를 고칠 수 있는지 시도해보십시오. 고치고 난 다음 두 프로그램의 속도가 상대적으로 어떻게 바뀌었는지 측정해 보세요.

편의를 위해서 reverseArray_multiblock_cu 를 첫 번째에, reverseArray_multiblock_fast.cu 를 두 번째에 놓았습니다.

// includes, system
#include
#include

// Simple utility function to check for CUDA runtime errors
void checkCUDAError(const char* msg);

// Part3: implement the kernel
__global__ void reverseArrayBlock(int *d_out, int *d_in)
{
int inOffset = blockDim.x * blockIdx.x;
int outOffset = blockDim.x * (gridDim.x - 1 - blockIdx.x);
int in = inOffset + threadIdx.x;
int out = outOffset + (blockDim.x - 1 - threadIdx.x);
d_out[out] = d_in[in];
}

////////////////////////////////////////////////////////////////////////////////
// Program main
////////////////////////////////////////////////////////////////////////////////
int main( int argc, char** argv)
{
// pointer for host memory and size
int *h_a;
int dimA = 256 * 1024; // 256K elements (1MB total)

// pointer for device memory
int *d_b, *d_a;

// define grid and block size
int numThreadsPerBlock = 256;

// Part 1: compute number of blocks needed based on array size and desired block size
int numBlocks = dimA / numThreadsPerBlock;
// allocate host and device memory
size_t memSize = numBlocks * numThreadsPerBlock * sizeof(int);
h_a = (int *) malloc(memSize);
cudaMalloc( (void **) &d_a, memSize );
cudaMalloc( (void **) &d_b, memSize );

// Initialize input array on host
for (int i = 0; i < dimA; ++i)
{
h_a[i] = i;
}

// Copy host array to device array
cudaMemcpy( d_a, h_a, memSize, cudaMemcpyHostToDevice );

// launch kernel
dim3 dimGrid(numBlocks);
dim3 dimBlock(numThreadsPerBlock);
reverseArrayBlock<<< dimGrid, dimBlock >>>( d_b, d_a );

// block until the device has completed
cudaThreadSynchronize();

// check if kernel execution generated an error
// Check for any CUDA errors
checkCUDAError("kernel invocation");

// device to host copy
cudaMemcpy( h_a, d_b, memSize, cudaMemcpyDeviceToHost );

// Check for any CUDA errors
checkCUDAError("memcpy");

// verify the data returned to the host is correct
for (int i = 0; i < dimA; i++)
{
assert(h_a[i] == dimA - 1 - i );
}

// free device memory
cudaFree(d_a);
cudaFree(d_b);

// free host memory
free(h_a);

// If the program makes it this far, then the results are correct and
// there are no run-time errors. Good work!
printf("Correct!
");

return 0;
}
void checkCUDAError(const char *msg)
{
cudaError_t err = cudaGetLastError();
if( cudaSuccess != err)
{
fprintf(stderr, "Cuda error: %s: %s.
", msg, cudaGetErrorString( err) );
exit(EXIT_FAILURE);
}
}
< reverseArray_multiblock.cu >

// includes, system
#include
#include

// Simple utility function to check for CUDA runtime errors
void checkCUDAError(const char* msg);

// Part 2 of 2: implement the fast kernel using shared memory
__global__ void reverseArrayBlock(int *d_out, int *d_in)
{
extern __shared__ int s_data[];

int inOffset = blockDim.x * blockIdx.x;
int in = inOffset + threadIdx.x;

// Load one element per thread from device memory and store it
// *in reversed order* into temporary shared memory
s_data[blockDim.x - 1 - threadIdx.x] = d_in[in];

// Block until all threads in the block have written their data to shared mem
__syncthreads();

// write the data from shared memory in forward order,
// but to the reversed block offset as before

int outOffset = blockDim.x * (gridDim.x - 1 - blockIdx.x);

int out = outOffset + threadIdx.x;
d_out[out] = s_data[threadIdx.x];
}

////////////////////////////////////////////////////////////////////////////////
// Program main
////////////////////////////////////////////////////////////////////////////////
int main( int argc, char** argv)
{
// pointer for host memory and size
int *h_a;
int dimA = 256 * 1024; // 256K elements (1MB total)

// pointer for device memory
int *d_b, *d_a;

// define grid and block size
int numThreadsPerBlock = 256;

// Compute number of blocks needed based on array size and desired block size
int numBlocks = dimA / numThreadsPerBlock;

// Part 1 of 2: Compute the number of bytes of shared memory needed
// This is used in the kernel invocation below
int sharedMemSize = numThreadsPerBlock * sizeof(int);

// allocate host and device memory
size_t memSize = numBlocks * numThreadsPerBlock * sizeof(int);
h_a = (int *) malloc(memSize);
cudaMalloc( (void **) &d_a, memSize );
cudaMalloc( (void **) &d_b, memSize );

// Initialize input array on host
for (int i = 0; i < dimA; ++i)
{
h_a[i] = i;
}

// Copy host array to device array
cudaMemcpy( d_a, h_a, memSize, cudaMemcpyHostToDevice );

// launch kernel
dim3 dimGrid(numBlocks);
dim3 dimBlock(numThreadsPerBlock);
reverseArrayBlock<<< dimGrid, dimBlock, sharedMemSize >>>( d_b, d_a );

// block until the device has completed
cudaThreadSynchronize();

// check if kernel execution generated an error
// Check for any CUDA errors
checkCUDAError("kernel invocation");

// device to host copy
cudaMemcpy( h_a, d_b, memSize, cudaMemcpyDeviceToHost );

// Check for any CUDA errors
checkCUDAError("memcpy");

// verify the data returned to the host is correct
for (int i = 0; i < dimA; i++)
{
assert(h_a[i] == dimA - 1 - i );
}

// free device memory
cudaFree(d_a);
cudaFree(d_b);

// free host memory
free(h_a);

// If the program makes it this far, then the results are correct and
// there are no run-time errors. Good work!
printf("Correct!
");

return 0;
}

void checkCUDAError(const char *msg)
{
cudaError_t err = cudaGetLastError();
if( cudaSuccess != err)
{
fprintf(stderr, "Cuda error: %s: %s.
", msg, cudaGetErrorString( err) );
exit(EXIT_FAILURE);
}
}
< reverseArray_multiblock_fast.cu >

0 0

로그인 후 추천 또는 비추천하실 수 있습니다.