HPE CRAY Resource Sharing

[GPU] CUDA Samples Utilities
- Download the source (cuda-samples release tags track CUDA toolkit versions, so pick the tag that matches your installed toolkit):

```shell
$ wget -O cuda-samples-12.4.1.tar.gz https://github.com/NVIDIA/cuda-samples/archive/refs/tags/v12.4.1.tar.gz
$ tar xvzf cuda-samples-12.4.1.tar.gz
$ cd cuda-samples-12.4.1
$ cd Samples/1_Utilities/bandwidthTest
```
- Build (point CUDA_PATH at the installed toolkit; CUDA 11.8 in this case):

```shell
$ CUDA_PATH=/usr/local/cuda-11.8 make TARGET_ARCH=x86_64
```
1. bandwidthTest
```
# ./bandwidthTest
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: NVIDIA A100-SXM4-40GB
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(GB/s)
   32000000                     23.8

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(GB/s)
   32000000                     24.3

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(GB/s)
   32000000                     1130.1

Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

# ./bandwidthTest --device=all
[CUDA Bandwidth Test] - Starting...

!!!!!Cumulative Bandwidth to be computed from all the devices !!!!!!

Running on...

 Device 0: NVIDIA A100-SXM4-40GB
 Device 1: NVIDIA A100-SXM4-40GB
 Device 2: NVIDIA A100-SXM4-40GB
 Device 3: NVIDIA A100-SXM4-40GB
 Quick Mode

 Host to Device Bandwidth, 4 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(GB/s)
   32000000                     95.3

 Device to Host Bandwidth, 4 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(GB/s)
   32000000                     100.0

 Device to Device Bandwidth, 4 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(GB/s)
   32000000                     4571.9

Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
```
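When running bandwidthTest across many nodes, it is convenient to pull the PASS/FAIL verdict and the bandwidth figures out of the log programmatically. A minimal sketch; the `parse_bandwidth_test` helper and its regex are illustrative, not part of the CUDA Samples, and assume the output format shown above:

```python
import re

def parse_bandwidth_test(log: str) -> dict:
    """Extract bandwidth figures (GB/s) and the final verdict from bandwidthTest output."""
    result = {"pass": "Result = PASS" in log}
    # Each section header names the direction; the number after the
    # transfer size on the data line is the bandwidth in GB/s.
    pattern = re.compile(
        r"(Host to Device|Device to Host|Device to Device) Bandwidth.*?"
        r"\d+\s+([\d.]+)", re.S)
    for direction, gbps in pattern.findall(log):
        result[direction] = float(gbps)
    return result

# Trimmed-down sample of the single-device output above.
sample = """\
 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(GB/s)
   32000000                     23.8

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(GB/s)
   32000000                     24.3

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(GB/s)
   32000000                     1130.1
Result = PASS
"""

print(parse_bandwidth_test(sample))
```

With figures like these in a dict, flagging nodes whose host-device bandwidth falls below an expected threshold is a one-line comparison.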
2. deviceQuery
```
# ./deviceQuery
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 4 CUDA Capable device(s)

Device 0: "NVIDIA A100-SXM4-40GB"
  CUDA Driver Version / Runtime Version          11.8 / 11.8
  CUDA Capability Major/Minor version number:    8.0
  Total amount of global memory:                 40385 MBytes (42346610688 bytes)
  (108) Multiprocessors, (064) CUDA Cores/MP:    6912 CUDA Cores
  GPU Max Clock rate:                            1410 MHz (1.41 GHz)
  Memory Clock rate:                             1215 Mhz
  Memory Bus Width:                              5120-bit
  L2 Cache Size:                                 41943040 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        167936 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 5 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 7 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

(The output for Devices 1, 2 and 3 is identical to Device 0 except for the PCI Bus ID: 70, 133 and 199 respectively.)

> Peer access from NVIDIA A100-SXM4-40GB (GPU0) -> NVIDIA A100-SXM4-40GB (GPU1) : Yes
> Peer access from NVIDIA A100-SXM4-40GB (GPU0) -> NVIDIA A100-SXM4-40GB (GPU2) : Yes
> Peer access from NVIDIA A100-SXM4-40GB (GPU0) -> NVIDIA A100-SXM4-40GB (GPU3) : Yes
> Peer access from NVIDIA A100-SXM4-40GB (GPU1) -> NVIDIA A100-SXM4-40GB (GPU0) : Yes
> Peer access from NVIDIA A100-SXM4-40GB (GPU1) -> NVIDIA A100-SXM4-40GB (GPU2) : Yes
> Peer access from NVIDIA A100-SXM4-40GB (GPU1) -> NVIDIA A100-SXM4-40GB (GPU3) : Yes
> Peer access from NVIDIA A100-SXM4-40GB (GPU2) -> NVIDIA A100-SXM4-40GB (GPU0) : Yes
> Peer access from NVIDIA A100-SXM4-40GB (GPU2) -> NVIDIA A100-SXM4-40GB (GPU1) : Yes
> Peer access from NVIDIA A100-SXM4-40GB (GPU2) -> NVIDIA A100-SXM4-40GB (GPU3) : Yes
> Peer access from NVIDIA A100-SXM4-40GB (GPU3) -> NVIDIA A100-SXM4-40GB (GPU0) : Yes
> Peer access from NVIDIA A100-SXM4-40GB (GPU3) -> NVIDIA A100-SXM4-40GB (GPU1) : Yes
> Peer access from NVIDIA A100-SXM4-40GB (GPU3) -> NVIDIA A100-SXM4-40GB (GPU2) : Yes

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.8, CUDA Runtime Version = 11.8, NumDevs = 4
Result = PASS
```
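deviceQuery derives the "6912 CUDA Cores" figure by multiplying the SM count by a per-architecture cores-per-SM table (see `helper_cuda.h` in the samples tree). A sketch of that arithmetic, plus the peak FP32 throughput it implies; the lookup table below is a hand-copied subset and should be checked against the headers for other architectures:

```python
def fp32_cores_per_sm(major: int, minor: int) -> int:
    """FP32 CUDA cores per SM for a few compute capabilities (subset)."""
    table = {(7, 0): 64,   # V100
             (7, 5): 64,   # Turing
             (8, 0): 64,   # A100
             (8, 6): 128,  # GA10x
             (9, 0): 128}  # H100
    return table[(major, minor)]

# Figures reported by deviceQuery for the A100 above.
sm_count = 108
clock_ghz = 1.410

cores = sm_count * fp32_cores_per_sm(8, 0)  # 108 SMs x 64 cores/SM = 6912
tflops = cores * clock_ghz * 2 / 1000.0     # FMA counts as 2 FLOPs per cycle
print(cores, round(tflops, 1))              # 6912 cores, ~19.5 peak FP32 TFLOPS
```

The ~19.5 TFLOPS result matches NVIDIA's published peak FP32 figure for the A100, which is a quick sanity check that the clock and SM count reported by deviceQuery are plausible.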
3. p2pBandwidthLatencyTest
```
# ./p2pBandwidthLatencyTest --help
Usage:  p2pBandwidthLatencyTest [OPTION]...
Tests bandwidth/latency of GPU pairs using P2P and without P2P
Options:
--help             Display this help menu
--p2p_read         Use P2P reads for data transfers between GPU pairs and show corresponding results.
                   Default used is P2P write operation.
--sm_copy          Use SM intiated p2p transfers instead of Copy Engine
--numElems=<NUM_OF_INT_ELEMS>  Number of integer elements to be used in p2p copy.
```
```
# ./p2pBandwidthLatencyTest
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA A100-SXM4-40GB, pciBusID: 7, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA A100-SXM4-40GB, pciBusID: 46, pciDeviceID: 0, pciDomainID:0
Device: 2, NVIDIA A100-SXM4-40GB, pciBusID: 85, pciDeviceID: 0, pciDomainID:0
Device: 3, NVIDIA A100-SXM4-40GB, pciBusID: c7, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=0 CAN Access Peer Device=2
Device=0 CAN Access Peer Device=3
Device=1 CAN Access Peer Device=0
Device=1 CAN Access Peer Device=2
Device=1 CAN Access Peer Device=3
Device=2 CAN Access Peer Device=0
Device=2 CAN Access Peer Device=1
Device=2 CAN Access Peer Device=3
Device=3 CAN Access Peer Device=0
Device=3 CAN Access Peer Device=1
Device=3 CAN Access Peer Device=2

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

P2P Connectivity Matrix
     D\D     0     1     2     3
     0       1     1     1     1
     1       1     1     1     1
     2       1     1     1     1
     3       1     1     1     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D       0       1       2       3
     0 1276.55   17.45   17.51   17.41
     1   17.67 1275.51   17.56   17.49
     2   17.60   17.87 1290.26   17.52
     3   17.79   18.02   18.01 1284.95
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D       0       1       2       3
     0 1253.01   93.27   93.48   93.47
     1   93.25 1305.35   93.46   93.53
     2   93.57   93.50 1307.53   93.27
     3   93.48   93.48   93.23 1306.44
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D       0       1       2       3
     0 1274.99   19.77   20.03   19.96
     1   20.15 1313.03   20.07   19.94
     2   19.95   19.94 1313.03   19.67
     3   20.09   20.11   19.99 1314.13
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D       0       1       2       3
     0 1272.91  185.04  185.28  185.26
     1  184.77 1313.58  185.26  185.41
     2  185.23  185.28 1311.37  184.84
     3  185.26  185.45  184.93 1313.58
P2P=Disabled Latency Matrix (us)
   GPU       0       1       2       3
     0    2.67   67.54   21.48   21.50
     1   21.39    2.73   21.48   21.56
     2   21.49   21.57    2.16   21.50
     3   21.46   21.49   21.49    2.70

   CPU       0       1       2       3
     0    2.20    6.25    6.27    6.31
     1    6.13    2.07    6.33    6.22
     2    6.20    6.21    2.10    6.28
     3    6.30    6.21    6.29    2.10
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU       0       1       2       3
     0    2.66    2.63    2.66    2.61
     1    2.72    2.73    2.71    2.61
     2    2.64    2.62    2.15    2.60
     3    2.68    2.71    2.65    2.69

   CPU       0       1       2       3
     0    2.16    1.70    1.73    1.71
     1    1.78    2.15    1.71    1.73
     2    1.79    1.74    2.18    1.76
     3    1.80    1.73    1.75    2.22

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
```
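The unidirectional matrices make it easy to quantify what enabling P2P buys on this node: averaging the off-diagonal entries gives roughly 17.7 GB/s over the fallback copy path versus roughly 93.4 GB/s with P2P writes, about a 5.3x gain per GPU pair. A small sketch of that arithmetic, with the values hand-copied from the matrices above:

```python
# Off-diagonal GPU-pair bandwidths (GB/s), P2P disabled vs enabled,
# read row by row from the unidirectional matrices above.
disabled = [17.45, 17.51, 17.41, 17.67, 17.56, 17.49,
            17.60, 17.87, 17.52, 17.79, 18.02, 18.01]
enabled  = [93.27, 93.48, 93.47, 93.25, 93.46, 93.53,
            93.57, 93.50, 93.27, 93.48, 93.48, 93.23]

avg_off = sum(disabled) / len(disabled)  # fallback copy path
avg_on = sum(enabled) / len(enabled)     # direct peer-to-peer writes
speedup = avg_on / avg_off
print(f"{avg_off:.1f} GB/s -> {avg_on:.1f} GB/s ({speedup:.1f}x with P2P enabled)")
```

The latency matrices tell the same story from the other side: pair latency drops from ~21 us on the fallback path to ~2.7 us with P2P writes.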