[nvidia-docker] tensorflow multi gpu test

Unknown user, 2021. 9. 15. 15:45

1. Test environment

HPE HPC Partner Lab

znode44

2. Writing and building the Dockerfile

Example Dockerfile

FROM tensorflow/tensorflow:latest-gpu
RUN pip install tensorflow-datasets

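As described below, the container is run under a user account, which makes it hard to install Python packages into the image afterwards, so the packages are installed at build time.

Write the Dockerfile first, then build it from the directory that contains it:

$ docker build -t image:tag .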

3. Slurm interactive allocation

$ srun -p short -N 1 -n 1 -w znode44 --pty bash

4. (nvidia) docker command

(on znode44)
$ docker run -u $(id -u):$(id -g) -v /path/on/host:/tensorflow_datasets --gpus num_gpus -it --rm image:tag bash

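To avoid permission problems when bind-mounting a host volume, run docker with -u $(id -u):$(id -g), so that the container process uses the caller's UID and GID.

(Recommended by PLAB)

Likewise, to avoid permission problems with the location where the input files are stored, the host directory is mounted in the container as /tensorflow_datasets.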
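Before starting a long run, it can help to confirm that the mounted directory is actually writable by the UID the container runs as. The following is a minimal sketch using only the standard library; the helper name is_writable_dir is hypothetical, and the path /tensorflow_datasets is assumed from the docker run command above.

```python
import os
import tempfile


def is_writable_dir(path):
    """Return True if `path` is an existing directory the current user can create files in."""
    if not os.path.isdir(path):
        return False
    try:
        # Create (and automatically delete) a real file instead of relying on
        # os.access, which can be misleading on some network filesystems.
        with tempfile.TemporaryFile(dir=path):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    data_dir = "/tensorflow_datasets"  # assumed container-side mount point
    print(data_dir, "writable:", is_writable_dir(data_dir))
```

If this prints False inside the container, the host directory is probably owned by a different user than the one passed with -u.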

The --gpus option sets how many GPUs the container may use.
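The options above can also be assembled programmatically, for example when the same test is launched repeatedly with different GPU counts. This is only a sketch: build_docker_run is a hypothetical helper, image:tag and ./data are placeholders, and the default container path is assumed from the command above.

```python
import os
import shlex


def build_docker_run(host_dir, image, gpus="all", container_dir="/tensorflow_datasets"):
    """Assemble the argument list for the `docker run` invocation shown above."""
    return [
        "docker", "run",
        # Run as the calling user, equivalent to -u $(id -u):$(id -g).
        "-u", f"{os.getuid()}:{os.getgid()}",
        # Bind-mount the host directory; docker needs absolute paths here.
        "-v", f"{os.path.abspath(host_dir)}:{container_dir}",
        # Either a GPU count or "all".
        "--gpus", str(gpus),
        "-it", "--rm",
        image, "bash",
    ]


if __name__ == "__main__":
    cmd = build_docker_run("./data", "image:tag", gpus=2)
    # Print the command instead of running it, so it can be inspected first.
    print(shlex.join(cmd))
```

Printing the command rather than executing it keeps the sketch safe to run on a machine without docker; passing the list to subprocess.run would launch the container.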

5. Running tensorflow

(in tf_docker, in the /tensorflow_datasets folder)
$ python tf_multi_keras.py

tf_multi_keras.py

import tensorflow_datasets as tfds
import tensorflow as tf
tfds.disable_progress_bar()

import os
import time

starttime = time.time()

datasets, info = tfds.load(name='mnist', with_info=True, as_supervised=True)

mnist_train, mnist_test = datasets['train'], datasets['test']

# MirroredStrategy replicates the model on every GPU visible to the container.
strategy = tf.distribute.MirroredStrategy()

print('Number of GPU : {}'.format(strategy.num_replicas_in_sync))

num_train_examples = info.splits['train'].num_examples
num_test_examples = info.splits['test'].num_examples

BUFFER_SIZE = 10000

# The global batch is split evenly across replicas, so scale it with the GPU count.
BATCH_SIZE_PER_REPLICA = 64
BATCH_SIZE = BATCH_SIZE_PER_REPLICA * strategy.num_replicas_in_sync

def scale(image, label):
  image = tf.cast(image, tf.float32)
  image /= 255

  return image, label

train_dataset = mnist_train.map(scale).shuffle(BUFFER_SIZE).batch(BATCH_SIZE)
eval_dataset = mnist_test.map(scale).batch(BATCH_SIZE)

with strategy.scope():
  model = tf.keras.Sequential([
      tf.keras.layers.Conv2D(32, 3, activation='relu', input_shape=(28, 28, 1)),
      tf.keras.layers.MaxPooling2D(),
      tf.keras.layers.Flatten(),
      tf.keras.layers.Dense(64, activation='relu'),
      tf.keras.layers.Dense(10, activation='softmax')
  ])

  model.compile(loss='sparse_categorical_crossentropy',
                optimizer=tf.keras.optimizers.Adam(),
                metrics=['accuracy'])

checkpoint_dir = './training_checkpoints'
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

def decay(epoch):
  if epoch < 3:
    return 1e-3
  elif 3 <= epoch < 7:
    return 1e-4
  else:
    return 1e-5

class PrintLR(tf.keras.callbacks.Callback):
  # Print the learning rate in effect at the end of each epoch.
  def on_epoch_end(self, epoch, logs=None):
    print('\nLearning rate for epoch {} is {}'.format(
        epoch + 1, self.model.optimizer.learning_rate.numpy()))

callbacks = [
    tf.keras.callbacks.TensorBoard(log_dir='./logs'),
    tf.keras.callbacks.ModelCheckpoint(filepath=checkpoint_prefix,
                                       save_weights_only=True),
    tf.keras.callbacks.LearningRateScheduler(decay),
    PrintLR()
]

model.fit(train_dataset, epochs=12, callbacks=callbacks)

model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))

eval_loss, eval_acc = model.evaluate(eval_dataset)

print('Eval loss: {}, Eval acc {}'.format(eval_loss, eval_acc))

print('Elapsed time : ',time.time()-starttime)
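
Because the script multiplies the per-replica batch size by the number of replicas, the number of training steps per epoch shrinks as GPUs are added. The arithmetic can be sketched without TensorFlow as follows; the function names are hypothetical, and 60000 is the size of the MNIST training split.

```python
import math

NUM_TRAIN_EXAMPLES = 60000    # MNIST training split
BATCH_SIZE_PER_REPLICA = 64   # same value as in tf_multi_keras.py


def global_batch_size(num_replicas):
    """Batch size seen by the input pipeline for a given number of GPUs."""
    return BATCH_SIZE_PER_REPLICA * num_replicas


def steps_per_epoch(num_replicas):
    """Steps per epoch; the final partial batch is kept, so round up."""
    return math.ceil(NUM_TRAIN_EXAMPLES / global_batch_size(num_replicas))


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        print("GPUs:", n,
              "global batch:", global_batch_size(n),
              "steps/epoch:", steps_per_epoch(n))
```

Note that the script keeps the same learning-rate schedule regardless of the GPU count, so comparing elapsed times across GPU counts also compares runs with different effective batch sizes.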
