HPE CRAY Resource Sharing


[BMT] STREAM

CRAY KOREA Blog 2021. 7. 5. 13:59

STREAM: Sustainable Memory Bandwidth in High Performance Computers

- Reference: https://www.cs.virginia.edu/stream

1. Source code download

$ wget https://www.cs.virginia.edu/stream/FTP/Code/Versions/stream_omp.c

2. Compile

$ gcc -O3 -fopenmp -D OPENMP stream_omp.c -o stream_c.x
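
- Choosing the array size (sketch): STREAM fixes its array size at compile time. Its instructions (in the source comments and on the STREAM site) ask for each array to be at least 4 times the size of the available last-level cache, so that the kernels measure memory bandwidth rather than cache bandwidth. The shell arithmetic below is a minimal sketch of that rule. The 256 MiB L3 value for one EPYC 7543 is an assumption that you should confirm with lscpu on your own node. min_stream_elems is a hypothetical helper name.

```shell
#!/bin/sh
# Minimal sketch of STREAM's sizing rule: each array should be at least
# 4 x the last-level cache (LLC), so the kernels stream from DRAM.
# ASSUMPTION: the LLC size in MiB is supplied by the caller; 256 MiB is
# used below only as an illustrative value for one EPYC 7543.
# min_stream_elems is a hypothetical helper, not part of STREAM.
min_stream_elems() {
    llc_mib=$1
    # 4 x LLC bytes, divided by 8 bytes per double-precision element
    echo $(( 4 * llc_mib * 1024 * 1024 / 8 ))
}

min_stream_elems 256      # prints 134217728 (elements per array)
echo $(( 2000000 * 8 ))   # bytes per array in the sample run: 16000000
```

The sample run in step 3 uses 2,000,000 elements, or 16,000,000 bytes per array, which is far below this threshold. The current stream.c (rather than stream_omp.c) accepts the size as a compile-time macro, for example -DSTREAM_ARRAY_SIZE=134217728. With gcc on x86-64, static arrays of that total size (about 3 GiB) usually also need -mcmodel=medium.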

3. Test

- Write the test script

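#!/bin/sh
#SBATCH -J stream
#SBATCH -p short
#SBATCH -N 1
#SBATCH -n 1
# Reserve the whole node so that no other job shares its CPUs or memory bandwidth
#SBATCH --exclusive
# %x = job name, %j = job ID (e.g. stream_12345.out)
#SBATCH -o %x_%j.out
#SBATCH -e %x_%j.err
#SBATCH --time=01:00:00

# Number of OpenMP threads that STREAM runs with
export OMP_NUM_THREADS=32

/home/hpcd/leesangy/stream/stream_c.x

exit 0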

- Submit the job

$ sbatch job.sh

- The output looks like the following (32 threads; the rates match the 32-thread row of Test #2 below):

-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 2000000, Offset = 0
Total memory required = 45.8 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Number of Threads requested = 32
[... this line appears 32 times in total, once per OpenMP thread ...]
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 87 microseconds.
   (= 87 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:      681308.2640       0.0001       0.0000       0.0001
Scale:     604584.3604       0.0001       0.0001       0.0001
Add:       696631.8062       0.0001       0.0001       0.0001
Triad:     696631.8062       0.0001       0.0001       0.0001
-------------------------------------------------------------
Solution Validates
-------------------------------------------------------------

- Test #1: compiler gcc 4.8.5, options -O2 -fopenmp

Threads    Copy(MB/s)   Scale(MB/s)     Add(MB/s)   Triad(MB/s)
1          35394.9705    21461.1014    34215.9402    31516.3732
10        200026.4203   110014.5311   202541.8431   129387.2699
24        371794.2604   251815.6248   390167.8140   300039.6304
32        464421.2042   323416.2120   480493.0597   377723.4371
64        464421.2042   438620.0261   479349.0286   475949.3901

- Test #2: compiler gcc 9.2.0, options -O3 -fopenmp -march=znver1

Threads    Copy(MB/s)   Scale(MB/s)     Add(MB/s)   Triad(MB/s)
1          51286.8659    36611.4915    49236.1438    45579.9393
10        271146.9253   162491.1961   177068.2427   175218.9661
24        593883.7522   532610.0317   608237.4381   608237.4381
32        681308.2640   604584.3604   696631.8062   696631.8062
64        654720.6244   651542.3689   748425.9926   751218.6269

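# gcc 9.2 was used in order to apply "-march=znver1", the optimization option for AMD EPYC™ 7xx1 Series Processors (gcc 4.8.5 does not support it).
# Note: the EPYC 7543 in the test system is a 3rd-generation (7xx3, Zen 3) processor, whereas znver1 targets the 1st-generation (7xx1) core.

# Reference: https://www.amd.com/system/files/documents/compiler-options-guide-amd-epyc-7xx1-series-processors.pdf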
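
- Collecting the tables (sketch): the tables above report one row per OpenMP thread count. A loop like the one below could produce rows in the same layout. It is an illustration, not the author's original procedure. STREAM_BIN, stream_rates and stream_sweep are hypothetical names. The awk filter keeps the second field of the Copy/Scale/Add/Triad lines, which is the Rate (MB/s) column.

```shell
#!/bin/sh
# Minimal sketch (not the author's original procedure): run STREAM once
# per thread count and print rows in the layout of the tables above.
# ASSUMPTION: STREAM_BIN points at the binary built in step 2; the
# default below is a placeholder. All names here are hypothetical.
STREAM_BIN=${STREAM_BIN:-./stream_c.x}

# Read STREAM output on stdin and print the Rate (MB/s) column (field 2)
# of the Copy, Scale, Add and Triad lines, in that order, on one line.
stream_rates() {
    awk '$1 ~ /^(Copy|Scale|Add|Triad):$/ { r[$1] = $2 }
         END { print r["Copy:"], r["Scale:"], r["Add:"], r["Triad:"] }'
}

# Print a header, then one "threads copy scale add triad" row per argument.
stream_sweep() {
    echo "Threads Copy(MB/s) Scale(MB/s) Add(MB/s) Triad(MB/s)"
    for t in "$@"; do
        printf '%s ' "$t"
        OMP_NUM_THREADS=$t "$STREAM_BIN" | stream_rates
    done
}
```

To cover the thread counts in both tables, set STREAM_BIN=/home/hpcd/leesangy/stream/stream_c.x in the job script and call stream_sweep 1 10 24 32 64 in place of the single run.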

4. Other (test environment)

Item        Details                               Notes
Hostname    aixl645dn04
Platform    Apollo 6500 Gen10+ XL645d (Bantha)
Processor   1 x AMD EPYC 7543                     hyperthreading enabled
OS          Ubuntu 20.04
Memory      8 x 64GB 3200MHz = 512GB
GPU         4 x A100-SXM4-40GB
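
- Theoretical memory bandwidth (sketch): a rough DRAM peak for this node is channels * transfer rate * 8 bytes per transfer. The sketch assumes one DIMM per channel, so that the 8 DIMMs listed above populate the 8 DDR4-3200 memory channels of the single EPYC 7543. peak_mb_s is a hypothetical helper name.

```shell
#!/bin/sh
# Minimal sketch: theoretical peak DRAM bandwidth of one socket.
# peak (MB/s) = channels * transfer rate (MT/s) * 8 bytes per transfer.
# ASSUMPTION: one DIMM per channel, i.e. the 8 DIMMs fill the 8 DDR4-3200
# channels of the single EPYC 7543. peak_mb_s is a hypothetical helper.
peak_mb_s() {
    channels=$1 mts=$2
    echo $(( channels * mts * 8 ))
}

peak_mb_s 8 3200   # prints 204800 (MB/s), about 205 GB/s
```

At 24 threads and above, every rate in both tables exceeds 204,800 MB/s. This suggests that the 2,000,000-element arrays (about 46 MB in total) were served largely from the processor caches rather than from DRAM.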
