HPE CRAY Resource Sharing
[BMT] STREAM
STREAM: Sustainable Memory Bandwidth in High Performance Computers
- Reference: https://www.cs.virginia.edu/stream
1. Source code download
    $ wget https://www.cs.virginia.edu/stream/FTP/Code/Versions/stream_omp.c
2. Compile
    $ gcc -O3 -fopenmp -D OPENMP stream_omp.c -o stream_c.x
3. Test
- Write the test script (job.sh)

    #!/bin/sh
    # Slurm job: one node, one task, on the "short" partition
    #SBATCH -J stream
    #SBATCH -p short
    #SBATCH -N 1
    #SBATCH -n 1
    #SBATCH -o %x_%j.out
    #SBATCH -e %x_%j.err
    #SBATCH --time=01:00:00

    # Number of OpenMP threads STREAM will use
    export OMP_NUM_THREADS=32
    /home/hpcd/leesangy/stream/stream_c.x
    exit 0

- Submit the job

    $ sbatch job.sh

- The output looks like the following.
    -------------------------------------------------------------
    This system uses 8 bytes per DOUBLE PRECISION word.
    -------------------------------------------------------------
    Array size = 2000000, Offset = 0
    Total memory required = 45.8 MB.
    Each test is run 10 times, but only
    the *best* time for each is used.
    -------------------------------------------------------------
    Number of Threads requested = 32
    [the line above is printed once per thread; repeats omitted]
    -------------------------------------------------------------
    Your clock granularity/precision appears to be 1 microseconds.
    Each test below will take on the order of 87 microseconds.
       (= 87 clock ticks)
    Increase the size of the arrays if this shows that
    you are not getting at least 20 clock ticks per test.
    -------------------------------------------------------------
    WARNING -- The above is only a rough guideline.
    For best results, please be sure you know the
    precision of your system timer.
    -------------------------------------------------------------
    Function      Rate (MB/s)   Avg time     Min time     Max time
    Copy:      681308.2640       0.0001       0.0000       0.0001
    Scale:     604584.3604       0.0001       0.0001       0.0001
    Add:       696631.8062       0.0001       0.0001       0.0001
    Triad:     696631.8062       0.0001       0.0001       0.0001
    -------------------------------------------------------------
    Solution Validates
    -------------------------------------------------------------
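To compare runs, the four rates can be pulled out of the job output with a few lines of awk. This is a minimal sketch that assumes the result lines keep the `Copy:` / `Scale:` / `Add:` / `Triad:` layout shown above; `stream_rates` is a hypothetical helper name, and `stream_12345.out` is only an illustrative file name following the `%x_%j.out` pattern.

```shell
#!/bin/sh
# Hypothetical helper: print the Copy, Scale, Add and Triad rates (MB/s)
# found in STREAM output. Reads the files given as arguments, or stdin.
stream_rates() {
    awk '
        $1 == "Copy:"  { copy  = $2 }
        $1 == "Scale:" { scale = $2 }
        $1 == "Add:"   { add   = $2 }
        $1 == "Triad:" { triad = $2 }
        END { print copy, scale, add, triad }
    ' "$@"
}

# Example (illustrative file name):
#   stream_rates stream_12345.out
```

The rate is always the second whitespace-separated field, so the helper ignores the timing columns and the banner text.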
- Test #1: compiler: gcc 4.8.5, options: -O2 -fopenmp
Threads | Copy (MB/s) | Scale (MB/s) | Add (MB/s) | Triad (MB/s) |
---|---|---|---|---|
1 | 35394.9705 | 21461.1014 | 34215.9402 | 31516.3732 |
10 | 200026.4203 | 110014.5311 | 202541.8431 | 129387.2699 |
24 | 371794.2604 | 251815.6248 | 390167.8140 | 300039.6304 |
32 | 464421.2042 | 323416.2120 | 480493.0597 | 377723.4371 |
64 | 464421.2042 | 438620.0261 | 479349.0286 | 475949.3901 |
- Test #2: compiler: gcc 9.2.0, options: -O3 -fopenmp -march=znver1
Threads | Copy (MB/s) | Scale (MB/s) | Add (MB/s) | Triad (MB/s) |
---|---|---|---|---|
1 | 51286.8659 | 36611.4915 | 49236.1438 | 45579.9393 |
10 | 271146.9253 | 162491.1961 | 177068.2427 | 175218.9661 |
24 | 593883.7522 | 532610.0317 | 608237.4381 | 608237.4381 |
32 | 681308.2640 | 604584.3604 | 696631.8062 | 696631.8062 |
64 | 654720.6244 | 651542.3689 | 748425.9926 | 751218.6269 |
# gcc 9.2 was used so that "-march=znver1", the optimization option for AMD EPYC™ 7xx1 Series processors, could be applied.
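The thread counts in the tables above can be swept in a single job by looping over `OMP_NUM_THREADS`. The following is a rough sketch, not necessarily how the numbers above were collected: `sweep_threads` is a hypothetical helper, and the binary path is passed in as an argument rather than hard-coded.

```shell
#!/bin/sh
# Hypothetical helper: run a STREAM binary once per thread count and print
# one row per run, in the same column order as the result tables.
# usage: sweep_threads /path/to/stream_c.x 1 10 24 32 64
sweep_threads() {
    bin=$1
    shift
    for t in "$@"; do
        # OMP_NUM_THREADS is set only for this single invocation.
        OMP_NUM_THREADS=$t "$bin" | awk -v t="$t" '
            $1 == "Copy:"  { copy  = $2 }
            $1 == "Scale:" { scale = $2 }
            $1 == "Add:"   { add   = $2 }
            $1 == "Triad:" { triad = $2 }
            END { printf "%s | %s | %s | %s | %s |\n", t, copy, scale, add, triad }
        '
    done
}
```

Inside a Slurm script, the call would replace the single `stream_c.x` line, for example `sweep_threads /home/hpcd/leesangy/stream/stream_c.x 1 10 24 32 64`.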
4. Miscellaneous (test environment)
Item | Details | Remarks |
---|---|---|
Hostname | aixl645dn04 | |
Platform | Apollo 6500 Gen10+ XL645d (Bantha) | |
Processor | 1 x 7543 | hyperthreading enabled |
OS | Ubuntu 20.04 | |
Memory | 8 x 64GB 3200MHz = 512GB | |
GPU | 4 x A100-SXM4-40GB | |
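As a sanity check, the measured rates can be compared with the theoretical DRAM peak for this memory configuration. The sketch below assumes one DIMM per memory channel (8 channels), a 64-bit (8-byte) data path per channel, and that "3200MHz" means 3200 MT/s. These are assumptions read off the Memory row, not measured values, and `peak_mbs` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: theoretical peak DRAM bandwidth in MB/s.
# peak = channels x transfers per second (MT/s) x bytes per transfer
peak_mbs() {
    channels=$1
    mts=$2
    bytes=$3
    echo $(( channels * mts * bytes ))
}

# Assumed configuration: 8 channels, 3200 MT/s, 8 bytes per transfer.
peak_mbs 8 3200 8    # prints 204800
```

Under these assumptions the peak is about 204,800 MB/s, well below the 600,000 to 700,000 MB/s reported for 32 threads in test #2. With a total footprint of only 45.8 MB, the arrays may fit in the processor's last-level cache, so those rates likely reflect cache bandwidth rather than main memory. A larger array size would be needed to measure DRAM bandwidth.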