LISA GPU TUTORIALS: PYTORCH AND TENSORFLOW
==========================================

* Installing the tools and downloading the dataset and the examples
-----------------------------------------------------------------

To download the dataset (CIFAR10) and install the tools, it is best to use the main login nodes of Lisa:

ssh lgpu0XXX@lisa.surfsara.nl

wget https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz

Installing PyTorch requires cuDNN (a library of CUDA functions for efficient convolution and matrix multiplication) and NCCL (a library of efficient collective communication functions), as well as the correct module loads and exports on the Lisa GPU login node:

module load pre2019
module load Python/3.6.3-foss-2017b
module load cuDNN/7.0.5-CUDA-9.0.176
module load NCCL/2.0.5-CUDA-9.0.176
export LD_LIBRARY_PATH=/hpc/eb/Debian9/cuDNN/7.1-CUDA-8.0.44-GCCcore-5.4.0/lib64:$LD_LIBRARY_PATH

pip3 install --user torch torchvision
pip3 install --user tensorflow_gpu

Download the example repository from PyTorch:

git clone https://github.com/pytorch/examples.git pytorch_examples

Download the example repository from TensorFlow:

git clone https://github.com/tensorflow/models.git

============================================
============================================
============================================

* PyTorch example
---------------

For this example it is necessary to use the Lisa GPU login node:

ssh lgpu0XXX@login-gpu.lisa.surfsara.nl

We will run a GAN (generative adversarial network) on the CIFAR10 dataset. First, copy the dataset to the dcgan folder of the examples and use that folder as the working directory:

cp cifar-10-python.tar.gz pytorch_examples/dcgan/
cd pytorch_examples/dcgan/

Then place the following pytorch.job in the same folder (edit with "nano pytorch.job").

Contents of pytorch.job:
================
#!/bin/bash
#SBATCH --job-name=example
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=3
#SBATCH --ntasks-per-node=1
#SBATCH --time=1:00:00
#SBATCH --mem=60000M
#SBATCH --partition=gpu_shared_course
#SBATCH --gres=gpu:1

module purge
module load eb
module load pre2019
module load Python/3.6.3-foss-2017b
module load cuDNN/7.0.5-CUDA-9.0.176
module load NCCL/2.0.5-CUDA-9.0.176
export LD_LIBRARY_PATH=/hpc/eb/Debian9/cuDNN/7.1-CUDA-8.0.44-GCCcore-5.4.0/lib64:$LD_LIBRARY_PATH

srun python3 -u main.py --dataset=cifar10 --dataroot=. --cuda
================

Submit the job with:

sbatch pytorch.job
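The job runs main.py from the dcgan example, which alternates between a discriminator update and a generator update on every batch. The sketch below is a simplified illustration written for this tutorial, not the exact code in main.py (the networks netD and netG, the optimizers, the noise dimension nz and the device are assumed to be set up as in the example); it shows where the Loss_D, Loss_G, D(x) and D(G(z)) columns in the log below come from:

================
import torch
import torch.nn as nn

# One schematic DCGAN training step (simplified; for illustration only).
def dcgan_step(netD, netG, optD, optG, real, nz, device):
    criterion = nn.BCELoss()
    batch = real.size(0)
    noise = torch.randn(batch, nz, 1, 1, device=device)

    # (1) Update the discriminator: push D(x) towards 1, D(G(z)) towards 0.
    netD.zero_grad()
    out_real = netD(real)
    loss_real = criterion(out_real, torch.ones_like(out_real))
    fake = netG(noise)
    out_fake = netD(fake.detach())   # detach: no gradient flows into G here
    loss_fake = criterion(out_fake, torch.zeros_like(out_fake))
    loss_D = loss_real + loss_fake   # "Loss_D" in the log
    loss_D.backward()
    optD.step()

    # (2) Update the generator: push D(G(z)) towards 1.
    netG.zero_grad()
    out_gen = netD(fake)
    loss_G = criterion(out_gen, torch.ones_like(out_gen))   # "Loss_G"
    loss_G.backward()
    optG.step()

    return (loss_D.item(), loss_G.item(),
            out_real.mean().item(),   # "D(x)"
            out_fake.mean().item(),   # "D(G(z))" during the D update
            out_gen.mean().item())    # "D(G(z))" during the G update
================

Note that the same fake batch is reused for the generator update, but it is detached during the discriminator update so that only netD receives gradients there.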
Then, check the SLURM output. It should look something like this:

Namespace(batchSize=64, beta1=0.5, cuda=True, dataroot='.', dataset='cifar10', imageSize=64, lr=0.0002, manualSeed=None, ndf=64, netD='', netG='', ngf=64, ngpu=1, niter=25, nz=100, outf='.', workers=2)
Random Seed: 8194
Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to ./cifar-10-python.tar.gz
Generator(
  (main): Sequential(
    (0): ConvTranspose2d(100, 512, kernel_size=(4, 4), stride=(1, 1), bias=False)
    (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): ReLU(inplace)
    (3): ConvTranspose2d(512, 256, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
    (4): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (5): ReLU(inplace)
    (6): ConvTranspose2d(256, 128, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
    (7): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (8): ReLU(inplace)
    (9): ConvTranspose2d(128, 64, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
    (10): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (11): ReLU(inplace)
    (12): ConvTranspose2d(64, 3, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
    (13): Tanh()
  )
)
Discriminator(
  (main): Sequential(
    (0): Conv2d(3, 64, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
    (1): LeakyReLU(negative_slope=0.2, inplace)
    (2): Conv2d(64, 128, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
    (3): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (4): LeakyReLU(negative_slope=0.2, inplace)
    (5): Conv2d(128, 256, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
    (6): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (7): LeakyReLU(negative_slope=0.2, inplace)
    (8): Conv2d(256, 512, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)
    (9): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (10): LeakyReLU(negative_slope=0.2, inplace)
    (11): Conv2d(512, 1, kernel_size=(4, 4), stride=(1, 1), bias=False)
    (12): Sigmoid()
  )
)
[0/25][0/782] Loss_D: 1.9226 Loss_G: 4.7332 D(x): 0.3970 D(G(z)): 0.4802 / 0.0112
[0/25][1/782] Loss_D: 1.4282 Loss_G: 5.9478 D(x): 0.7768 D(G(z)): 0.5775 / 0.0048
[0/25][2/782] Loss_D: 1.0728 Loss_G: 5.3339 D(x): 0.6434 D(G(z)): 0.2968 / 0.0070
[0/25][3/782] Loss_D: 1.1183 Loss_G: 6.1291 D(x): 0.7238 D(G(z)): 0.4363 / 0.0031
[0/25][4/782] Loss_D: 0.8597 Loss_G: 6.8755 D(x): 0.7401 D(G(z)): 0.3364 / 0.0016
[0/25][5/782] Loss_D: 0.9826 Loss_G: 6.2175 D(x): 0.6471 D(G(z)): 0.2519 / 0.0028
[0/25][6/782] Loss_D: 1.1336 Loss_G: 8.0856 D(x): 0.7768 D(G(z)): 0.5040 / 0.0005
[0/25][7/782] Loss_D: 0.7810 Loss_G: 7.2080 D(x): 0.7059 D(G(z)): 0.2014 / 0.0012
. . . . .

============================================
============================================
============================================

* TensorFlow example
------------------

For this example it is necessary to use the Lisa GPU login node:

ssh lgpu0XXX@login-gpu.lisa.surfsara.nl

We will run a CNN (convolutional neural network) on the CIFAR10 dataset. First, copy the dataset to the cifar10_estimator folder of the examples and use that folder as the working directory. Then unpack the dataset there:

cp cifar-10-python.tar.gz models/tutorials/image/cifar10_estimator/
cd models/tutorials/image/cifar10_estimator/
tar xzf cifar-10-python.tar.gz

Now the TFRecords (the images packed into TensorFlow's binary record format) need to be prepared.
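A TFRecord file is simply a sequence of serialized tf.train.Example protocol buffers. As a rough sketch of what such a conversion script does, here is how one (image, label) pair can be written with the TF 1.x API; the feature keys and file name here are illustrative and not necessarily the ones generate_cifar10_tfrecords.py uses:

================
import tensorflow as tf

# Serialize one (image, label) pair into a TFRecord file (TF 1.x API).
def make_example(image_bytes, label):
    return tf.train.Example(features=tf.train.Features(feature={
        'image': tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_bytes])),
        'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
    }))

with tf.python_io.TFRecordWriter('example.tfrecords') as writer:
    # 3072 zero bytes stand in for one raw 32x32x3 CIFAR10 image.
    writer.write(make_example(b'\x00' * 3072, 7).SerializeToString())
================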
IMPORTANT PATCH: you need to comment out the download line in the script generate_cifar10_tfrecords.py (line 42), since the dataset has already been downloaded and unpacked.

After you have commented out that line, run the following to pack the data into TFRecords (first loading the modules needed to run Python with the CUDA libraries):

module load pre2019
module load Python/3.6.3-foss-2017b
module load cuDNN/7.0.5-CUDA-9.0.176
module load NCCL/2.0.5-CUDA-9.0.176
export LD_LIBRARY_PATH=/hpc/eb/Debian9/cuDNN/7.1-CUDA-8.0.44-GCCcore-5.4.0/lib64:$LD_LIBRARY_PATH

python generate_cifar10_tfrecords.py --data-dir=${PWD}

Then place the following tensorflow.job in the same folder (edit with "nano tensorflow.job").

Contents of tensorflow.job:
================
#!/bin/bash
#SBATCH --job-name=example
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=3
#SBATCH --ntasks-per-node=1
#SBATCH --time=1:00:00
#SBATCH --mem=60000M
#SBATCH --partition=gpu_shared_course
#SBATCH --gres=gpu:1

module purge
module load eb
module load pre2019
module load Python/3.6.3-foss-2017b
module load cuDNN/7.0.5-CUDA-9.0.176
module load NCCL/2.0.5-CUDA-9.0.176
export LD_LIBRARY_PATH=/hpc/eb/Debian9/cuDNN/7.1-CUDA-8.0.44-GCCcore-5.4.0/lib64:$LD_LIBRARY_PATH

srun python3 -u cifar10_main.py --data-dir=${PWD} --num-gpus=1 --job-dir=${PWD}/jobs
================

Submit the job with:

sbatch tensorflow.job
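While the job is queued or running, you can also verify that this TensorFlow build sees a GPU at all. The following minimal check should be run on a node that actually has a GPU (for example inside a short job on the same partition), with the modules above loaded:

================
import tensorflow as tf
from tensorflow.python.client import device_lib

# On a GPU node this should list a /device:GPU:0 entry next to the CPU.
print('TensorFlow', tf.VERSION)
for dev in device_lib.list_local_devices():
    print(dev.device_type, dev.name)
================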
Then, check the SLURM output. It should look something like this:

2018-11-30 14:23:37.426372: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2018-11-30 14:23:37.667213: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.6705
pciBusID: 0000:3b:00.0
totalMemory: 10.92GiB freeMemory: 10.76GiB
2018-11-30 14:23:37.667262: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2018-11-30 14:23:40.870782: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-11-30 14:23:40.870842: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0
2018-11-30 14:23:40.870853: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N
2018-11-30 14:23:40.871194: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/device:GPU:0 with 10409 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:3b:00.0, compute capability: 6.1)
WARNING:tensorflow:From cifar10_main.py:380: RunConfig.__init__ (from tensorflow.contrib.learn.python.learn.estimators.run_config) is deprecated and will be removed in a future version.
Instructions for updating:
When switching to tf.estimator.Estimator, use tf.estimator.RunConfig instead.
WARNING:tensorflow:From cifar10_main.py:387: run (from tensorflow.contrib.learn.python.learn.learn_runner) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.estimator.train_and_evaluate.
INFO:tensorflow:Using config: {'_task_type': None, '_task_id': 0, '_cluster_spec': , '_master': '', '_num_ps_replicas': 0, '_num_worker_replicas': 0, '_environment': 'local', '_is_chief': True, '_evaluation_master': '', '_train_distribute': None, '_eval_distribute': None, '_device_fn': None, '_tf_config': gpu_options { per_process_gpu_memory_fraction: 1.0 }, '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100, '_protocol': None, '_session_config': gpu_options { force_gpu_compatible: true } allow_soft_placement: true, '_save_checkpoints_steps': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_model_dir': '/home/valeriuc/gpu_island/slurm/models/tutorials/image/cifar10_estimator/jobs'}
WARNING:tensorflow:From cifar10_main.py:360: Experiment.__init__ (from tensorflow.contrib.learn.python.learn.experiment) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.estimator.train_and_evaluate. You will also have to convert to a tf.estimator.Estimator.
WARNING:tensorflow:From /home/valeriuc/.local/lib/python3.6/site-packages/tensorflow/contrib/learn/python/learn/monitors.py:279: BaseMonitor.__init__ (from tensorflow.contrib.learn.python.learn.monitors) is deprecated and will be removed after 2016-12-05.
Instructions for updating:
Monitors are deprecated. Please use tf.train.SessionRunHook.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:image after unit resnet/tower_0/stage/residual_v1/: (?, 16, 32, 32)
INFO:tensorflow:image after unit resnet/tower_0/stage/residual_v1_1/: (?, 16, 32, 32)
INFO:tensorflow:image after unit resnet/tower_0/stage/residual_v1_2/: (?, 16, 32, 32)
INFO:tensorflow:image after unit resnet/tower_0/stage/residual_v1_3/: (?, 16, 32, 32)
INFO:tensorflow:image after unit resnet/tower_0/stage/residual_v1_4/: (?, 16, 32, 32)
INFO:tensorflow:image after unit resnet/tower_0/stage/residual_v1_5/: (?, 16, 32, 32)
INFO:tensorflow:image after unit resnet/tower_0/stage/residual_v1_6/: (?, 16, 32, 32)
. . . . .
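The Estimator in cifar10_main.py periodically writes checkpoints and event files under the directory passed as --job-dir. You can inspect training progress by pointing TensorBoard at that directory (tensorboard --logdir=${PWD}/jobs), or read the logged scalars directly from the event files. A minimal sketch with the TF 1.x API, assuming the jobs directory used above:

================
import glob
import tensorflow as tf

# Print every scalar summary (step, tag, value) found in the job directory.
for path in glob.glob('jobs/events.out.tfevents.*'):
    for event in tf.train.summary_iterator(path):
        for value in event.summary.value:
            if value.HasField('simple_value'):
                print(event.step, value.tag, value.simple_value)
================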