[GPU] CUDA Samples Utilities

CRAY KOREA Blog 2024. 5. 2. 15:56

- Download the source code

$ wget -O cuda-samples-12.4.1.tar.gz https://github.com/NVIDIA/cuda-samples/archive/refs/tags/v12.4.1.tar.gz
$ tar xvzf cuda-samples-12.4.1.tar.gz
$ cd cuda-samples-12.4.1
$ cd Samples/1_Utilities/bandwidthTest

- How to build:

$ CUDA_PATH=/usr/local/cuda-11.8 make TARGET_ARCH=x86_64

1. bandwidthTest

# ./bandwidthTest
[CUDA Bandwidth Test] - Starting...
Running on...

Device 0: NVIDIA A100-SXM4-40GB
Quick Mode

Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
   Transfer Size (Bytes) Bandwidth(GB/s)
   32000000 23.8

Device to Host Bandwidth, 1 Device(s)
PINNED Memory Transfers
   Transfer Size (Bytes) Bandwidth(GB/s)
   32000000 24.3

Device to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
   Transfer Size (Bytes) Bandwidth(GB/s)
   32000000 1130.1

Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
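To put the host-to-device and device-to-host numbers above in context, they can be compared with the theoretical bandwidth of the host link. The sketch below assumes a PCIe Gen4 x16 link (16 GT/s per lane, 128b/130b encoding); the post does not state the link type, so treat that as an assumption. The helper name `pcie_peak_gbs` is made up for this example.

```python
# Assumption (not stated in the post): the GPU sits on a PCIe Gen4 x16 link.
GT_PER_S = 16.0        # PCIe 4.0 raw rate per lane, in GT/s
LANES = 16             # x16 link
ENCODING = 128 / 130   # 128b/130b line encoding used by PCIe 3.0 and later


def pcie_peak_gbs(gt_per_s=GT_PER_S, lanes=LANES, encoding=ENCODING):
    """Theoretical one-direction link bandwidth in GB/s, before protocol overhead."""
    return gt_per_s * lanes * encoding / 8


# Measured values copied from the quick-mode run above (GB/s).
measured = {"H2D": 23.8, "D2H": 24.3}

peak = pcie_peak_gbs()
for name, gbs in measured.items():
    print(f"{name}: {gbs:.1f} GB/s is {gbs / peak:.0%} of the {peak:.1f} GB/s link peak")
```

The peak figure ignores packet and protocol overhead, so a measured value well below 100% is expected even on a healthy link.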

# ./bandwidthTest --device=all
[CUDA Bandwidth Test] - Starting...

!!!!!Cumulative Bandwidth to be computed from all the devices !!!!!!

Running on...

Device 0: NVIDIA A100-SXM4-40GB
Device 1: NVIDIA A100-SXM4-40GB
Device 2: NVIDIA A100-SXM4-40GB
Device 3: NVIDIA A100-SXM4-40GB
Quick Mode

Host to Device Bandwidth, 4 Device(s)
PINNED Memory Transfers
   Transfer Size (Bytes) Bandwidth(GB/s)
   32000000 95.3

Device to Host Bandwidth, 4 Device(s)
PINNED Memory Transfers
   Transfer Size (Bytes) Bandwidth(GB/s)
   32000000 100.0

Device to Device Bandwidth, 4 Device(s)
PINNED Memory Transfers
   Transfer Size (Bytes) Bandwidth(GB/s)
   32000000 4571.9

Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
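With --device=all the tool reports cumulative bandwidth over all devices, as the banner in the output says. A rough way to compare this run with the single-device run is to divide by the device count. The sketch below does that with the numbers from the two runs; the helper name `per_device` is made up for this example, and an even split across GPUs is an assumption.

```python
# Values copied from the two runs above (GB/s).
single = {"H2D": 23.8, "D2H": 24.3, "D2D": 1130.1}        # ./bandwidthTest
cumulative = {"H2D": 95.3, "D2H": 100.0, "D2D": 4571.9}   # ./bandwidthTest --device=all
NUM_DEVICES = 4


def per_device(cumulative_gbs, num_devices):
    """Average bandwidth per device, assuming the total splits evenly across GPUs."""
    return {name: total / num_devices for name, total in cumulative_gbs.items()}


avg = per_device(cumulative, NUM_DEVICES)
for name in single:
    print(f"{name}: single run {single[name]:8.1f} GB/s, "
          f"per-device average of all-device run {avg[name]:8.1f} GB/s")
```

If the per-device average stays close to the single-device figure, the devices are not obviously limiting each other in this test.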

2. deviceQuery

# ./deviceQuery
./deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 4 CUDA Capable device(s)

Device 0: "NVIDIA A100-SXM4-40GB"
  CUDA Driver Version / Runtime Version          11.8 / 11.8
  CUDA Capability Major/Minor version number:    8.0
  Total amount of global memory:                 40385 MBytes (42346610688 bytes)
  (108) Multiprocessors, (064) CUDA Cores/MP:    6912 CUDA Cores
  GPU Max Clock rate:                            1410 MHz (1.41 GHz)
  Memory Clock rate:                             1215 Mhz
  Memory Bus Width:                              5120-bit
  L2 Cache Size:                                 41943040 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        167936 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 5 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 7 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 1: "NVIDIA A100-SXM4-40GB"
  CUDA Driver Version / Runtime Version          11.8 / 11.8
  CUDA Capability Major/Minor version number:    8.0
  Total amount of global memory:                 40385 MBytes (42346610688 bytes)
  (108) Multiprocessors, (064) CUDA Cores/MP:    6912 CUDA Cores
  GPU Max Clock rate:                            1410 MHz (1.41 GHz)
  Memory Clock rate:                             1215 Mhz
  Memory Bus Width:                              5120-bit
  L2 Cache Size:                                 41943040 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        167936 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 5 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 70 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 2: "NVIDIA A100-SXM4-40GB"
  CUDA Driver Version / Runtime Version          11.8 / 11.8
  CUDA Capability Major/Minor version number:    8.0
  Total amount of global memory:                 40385 MBytes (42346610688 bytes)
  (108) Multiprocessors, (064) CUDA Cores/MP:    6912 CUDA Cores
  GPU Max Clock rate:                            1410 MHz (1.41 GHz)
  Memory Clock rate:                             1215 Mhz
  Memory Bus Width:                              5120-bit
  L2 Cache Size:                                 41943040 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        167936 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 5 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 133 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 3: "NVIDIA A100-SXM4-40GB"
  CUDA Driver Version / Runtime Version          11.8 / 11.8
  CUDA Capability Major/Minor version number:    8.0
  Total amount of global memory:                 40385 MBytes (42346610688 bytes)
  (108) Multiprocessors, (064) CUDA Cores/MP:    6912 CUDA Cores
  GPU Max Clock rate:                            1410 MHz (1.41 GHz)
  Memory Clock rate:                             1215 Mhz
  Memory Bus Width:                              5120-bit
  L2 Cache Size:                                 41943040 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        167936 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 5 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 199 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
> Peer access from NVIDIA A100-SXM4-40GB (GPU0) -> NVIDIA A100-SXM4-40GB (GPU1) : Yes
> Peer access from NVIDIA A100-SXM4-40GB (GPU0) -> NVIDIA A100-SXM4-40GB (GPU2) : Yes
> Peer access from NVIDIA A100-SXM4-40GB (GPU0) -> NVIDIA A100-SXM4-40GB (GPU3) : Yes
> Peer access from NVIDIA A100-SXM4-40GB (GPU1) -> NVIDIA A100-SXM4-40GB (GPU0) : Yes
> Peer access from NVIDIA A100-SXM4-40GB (GPU1) -> NVIDIA A100-SXM4-40GB (GPU2) : Yes
> Peer access from NVIDIA A100-SXM4-40GB (GPU1) -> NVIDIA A100-SXM4-40GB (GPU3) : Yes
> Peer access from NVIDIA A100-SXM4-40GB (GPU2) -> NVIDIA A100-SXM4-40GB (GPU0) : Yes
> Peer access from NVIDIA A100-SXM4-40GB (GPU2) -> NVIDIA A100-SXM4-40GB (GPU1) : Yes
> Peer access from NVIDIA A100-SXM4-40GB (GPU2) -> NVIDIA A100-SXM4-40GB (GPU3) : Yes
> Peer access from NVIDIA A100-SXM4-40GB (GPU3) -> NVIDIA A100-SXM4-40GB (GPU0) : Yes
> Peer access from NVIDIA A100-SXM4-40GB (GPU3) -> NVIDIA A100-SXM4-40GB (GPU1) : Yes
> Peer access from NVIDIA A100-SXM4-40GB (GPU3) -> NVIDIA A100-SXM4-40GB (GPU2) : Yes

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.8, CUDA Runtime Version = 11.8, NumDevs = 4
Result = PASS
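Several deviceQuery fields can be cross-checked with simple arithmetic. The sketch below recomputes the CUDA core count, estimates the theoretical memory bandwidth from the memory clock and bus width, and compares the decimal PCI bus IDs printed here with the bus IDs printed by p2pBandwidthLatencyTest in section 3, which look like hexadecimal. The function names are made up for this example, and the factor of 2 for double-data-rate memory is an assumption.

```python
# Fields copied from the deviceQuery output above.
SM_COUNT = 108
CORES_PER_SM = 64
MEM_CLOCK_MHZ = 1215
BUS_WIDTH_BITS = 5120


def total_cores(sm_count, cores_per_sm):
    """CUDA cores = multiprocessors x cores per multiprocessor."""
    return sm_count * cores_per_sm


def peak_mem_bandwidth_gbs(clock_mhz, bus_bits, transfers_per_clock=2):
    """Theoretical memory bandwidth in GB/s.

    Assumption: the memory moves data twice per clock (double data rate).
    """
    return clock_mhz * 1e6 * transfers_per_clock * bus_bits / 8 / 1e9


# Bus IDs as printed by deviceQuery (decimal) and by p2pBandwidthLatencyTest.
decimal_bus_ids = [7, 70, 133, 199]
p2p_test_bus_ids = ["7", "46", "85", "c7"]

print("CUDA cores:", total_cores(SM_COUNT, CORES_PER_SM))
print("Peak memory bandwidth (GB/s):",
      round(peak_mem_bandwidth_gbs(MEM_CLOCK_MHZ, BUS_WIDTH_BITS), 1))
for dec, text in zip(decimal_bus_ids, p2p_test_bus_ids):
    # Interpreting the second tool's value as base 16 should give the same bus.
    print(f"bus {dec:3d} (decimal) vs {text:>2} (hex): match = {int(text, 16) == dec}")
```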

3. p2pBandwidthLatencyTest

# ./p2pBandwidthLatencyTest --help
Usage:  p2pBandwidthLatencyTest [OPTION]...
Tests bandwidth/latency of GPU pairs using P2P and without P2P

Options:
--help Display this help menu
--p2p_read Use P2P reads for data transfers between GPU pairs and show corresponding results.
Default used is P2P write operation.
--sm_copy                      Use SM intiated p2p transfers instead of Copy Engine
--numElems=<NUM_OF_INT_ELEMS>  Number of integer elements to be used in p2p copy.

# ./p2pBandwidthLatencyTest
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA A100-SXM4-40GB, pciBusID: 7, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA A100-SXM4-40GB, pciBusID: 46, pciDeviceID: 0, pciDomainID:0
Device: 2, NVIDIA A100-SXM4-40GB, pciBusID: 85, pciDeviceID: 0, pciDomainID:0
Device: 3, NVIDIA A100-SXM4-40GB, pciBusID: c7, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=0 CAN Access Peer Device=2
Device=0 CAN Access Peer Device=3
Device=1 CAN Access Peer Device=0
Device=1 CAN Access Peer Device=2
Device=1 CAN Access Peer Device=3
Device=2 CAN Access Peer Device=0
Device=2 CAN Access Peer Device=1
Device=2 CAN Access Peer Device=3
Device=3 CAN Access Peer Device=0
Device=3 CAN Access Peer Device=1
Device=3 CAN Access Peer Device=2

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

P2P Connectivity Matrix
     D\D     0     1     2     3
     0      1     1     1     1
     1      1     1     1     1
     2      1     1     1     1
     3      1     1     1     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0 1276.55  17.45  17.51  17.41
     1  17.67 1275.51  17.56  17.49
     2  17.60  17.87 1290.26  17.52
     3  17.79  18.02  18.01 1284.95
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D     0      1      2      3
     0 1253.01  93.27  93.48  93.47
     1  93.25 1305.35  93.46  93.53
     2  93.57  93.50 1307.53  93.27
     3  93.48  93.48  93.23 1306.44
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0 1274.99  19.77  20.03  19.96
     1  20.15 1313.03  20.07  19.94
     2  19.95  19.94 1313.03  19.67
     3  20.09  20.11  19.99 1314.13
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0 1272.91 185.04 185.28 185.26
     1 184.77 1313.58 185.26 185.41
     2 185.23 185.28 1311.37 184.84
     3 185.26 185.45 184.93 1313.58
P2P=Disabled Latency Matrix (us)
   GPU     0      1      2      3
     0   2.67  67.54  21.48  21.50
     1  21.39   2.73  21.48  21.56
     2  21.49  21.57   2.16  21.50
     3  21.46  21.49  21.49   2.70

   CPU     0      1      2      3
     0   2.20   6.25   6.27   6.31
     1   6.13   2.07   6.33   6.22
     2   6.20   6.21   2.10   6.28
     3   6.30   6.21   6.29   2.10
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU     0      1      2      3
     0   2.66   2.63   2.66   2.61
     1   2.72   2.73   2.71   2.61
     2   2.64   2.62   2.15   2.60
     3   2.68   2.71   2.65   2.69

   CPU     0      1      2      3
     0   2.16   1.70   1.73   1.71
     1   1.78   2.15   1.71   1.73
     2   1.79   1.74   2.18   1.76
     3   1.80   1.73   1.75   2.22

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
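The effect of peer access is easier to see as a single ratio. The sketch below takes the two unidirectional bandwidth matrices from the output above, averages the off-diagonal entries (the GPU-to-GPU transfers, as opposed to the on-device copies on the diagonal), and divides P2P=Enabled by P2P=Disabled. The helper name `mean_off_diagonal` is made up for this example.

```python
# Unidirectional bandwidth matrices copied from the output above (GB/s).
p2p_disabled = [
    [1276.55, 17.45, 17.51, 17.41],
    [17.67, 1275.51, 17.56, 17.49],
    [17.60, 17.87, 1290.26, 17.52],
    [17.79, 18.02, 18.01, 1284.95],
]
p2p_enabled = [
    [1253.01, 93.27, 93.48, 93.47],
    [93.25, 1305.35, 93.46, 93.53],
    [93.57, 93.50, 1307.53, 93.27],
    [93.48, 93.48, 93.23, 1306.44],
]


def mean_off_diagonal(matrix):
    """Average of all entries where source and destination GPU differ."""
    values = [value
              for i, row in enumerate(matrix)
              for j, value in enumerate(row)
              if i != j]
    return sum(values) / len(values)


disabled_avg = mean_off_diagonal(p2p_disabled)
enabled_avg = mean_off_diagonal(p2p_enabled)
speedup = enabled_avg / disabled_avg

print(f"P2P disabled: {disabled_avg:.2f} GB/s average between GPUs")
print(f"P2P enabled:  {enabled_avg:.2f} GB/s average between GPUs")
print(f"Speedup from enabling P2P: {speedup:.2f}x")
```

The same function can be applied to the bidirectional and latency matrices by pasting in those values instead.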

- Reference #1: https://github.com/NVIDIA/cuda-samples

- Reference #2: https://docs.nvidia.com/cuda/demo-suite/index.html
