Google Cloud's Deep Learning VMs are very useful for the average ML use case that needs an easy-to-set-up GPU training instance. They ship with all the necessary NVIDIA drivers, TensorFlow, and other ML goodies. Unfortunately, if you want to do GIS-based ML that depends on a modern GDAL and Python, you will need to do some work.
One choice is to add these dependencies manually to Debian 9 Stretch (at the time of writing, the only distro available) after you get your Deep Learning VM up and running. Unfortunately, the version of GDAL that works with Debian 9 is very old (2.x), and Python is a little old (3.5). Replacing these with newer versions in a stable and secure way is easier said than done.
Instead, a simpler approach is to create your own custom GCP machine image from a distro that is more recent than Debian 9 and install all the NVIDIA drivers yourself. That is what you will do in this guide.
Prerequisites
These are the same prerequisites as in the Google Cloud Machine Image guide:
- Install or update to the latest version of the gcloud command-line tool.
- Set a default region and zone.
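The second prerequisite can be done from the command line. Here is a minimal sketch, assuming us-west1 and us-west1-a (the example zone used later in this guide); substitute the region and zone where you plan to train.

```shell
# Assumed values: us-west1-a is only the example zone used later in this guide.
REGION="us-west1"
ZONE="us-west1-a"

if command -v gcloud >/dev/null 2>&1; then
  # Store the defaults in the active gcloud configuration.
  gcloud config set compute/region "$REGION"
  gcloud config set compute/zone "$ZONE"
  # Print them back to confirm what later commands will use.
  gcloud config get-value compute/region
  gcloud config get-value compute/zone
else
  echo "gcloud not found; install the Cloud SDK first" >&2
fi
```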
STEP 1 of 3: Create the VM that our Machine Image will be based on
Please consider the region where you will be doing the actual training, along with its GCP dependencies. You will want your resulting machine image to be in the same region. Make sure your default region and zone are set.
Create the VM
You can delete this VM at the end of this guide. We create an image from it and no longer need it running once we have the image.
gcloud compute instances create gis-ml \
--boot-disk-size 100GB \
--maintenance-policy=TERMINATE \
--machine-type=n1-standard-4 \
--accelerator=type=nvidia-tesla-v100,count=1 \
--image=ubuntu-2004-focal-v20200529 \
--image-project=ubuntu-os-cloud \
--metadata="install-nvidia-driver=True"
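V100 GPUs are only offered in some zones, so the create command can fail if your default zone does not have them. The sketch below lists where the accelerator type is available and then checks that the new instance came up. It assumes the gis-ml instance name and accelerator type used above, and that your default zone is already set.

```shell
GPU_TYPE="nvidia-tesla-v100"   # accelerator type from the create command above
INSTANCE="gis-ml"              # instance name from the create command above

if command -v gcloud >/dev/null 2>&1; then
  # List the zones that offer this accelerator type.
  gcloud compute accelerator-types list --filter="name=${GPU_TYPE}"
  # After creation, the status should read RUNNING.
  gcloud compute instances describe "$INSTANCE" --format="value(status)"
else
  echo "gcloud not found; skipping checks" >&2
fi
```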
Update and install CUDA and CuDNN
With your VM now running, we need to install the NVIDIA drivers. The steps below work for Ubuntu 18.04 as well. We explicitly use NVIDIA's Ubuntu 18.04 repositories because otherwise the build dependencies are too far ahead for Ubuntu 20.04, which this VM is based on. We didn't use 18.04 above because its Python is too old and fixing that is a bigger pain.
SSH to the VM using the Cloud Console or gcloud compute ssh.
sudo apt update
sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt-key adv --fetch-keys http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub
sudo bash -c 'echo "deb http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64 /" > /etc/apt/sources.list.d/cuda.list'
sudo bash -c 'echo "deb http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64 /" > /etc/apt/sources.list.d/cuda_learn.list'
sudo apt update
sudo apt install cuda-10-1 libcudnn7 python3-pip
Set the CUDA paths and some aliases
Add the lines below to both your current user's ~/.bashrc and /etc/skel/.bashrc. This makes sure that you, as a user of this VM, and any future users of a VM based on it (our machine image) will have a working profile.
# set PATH for cuda 10.1 installation
if [ -d "/usr/local/cuda-10.1/bin/" ]; then
    export PATH=/usr/local/cuda-10.1/bin${PATH:+:${PATH}}
    export LD_LIBRARY_PATH=/usr/local/cuda-10.1/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
fi
alias python="python3.8"
alias pip="pip3"
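If you would rather not edit the two files by hand, the profile lines can be appended with a small helper that skips files that already contain them. This is only a sketch: the add_cuda_profile function name is made up for illustration, and it uses the first comment line of the snippet as a marker to detect an earlier run.

```shell
# Marker used to detect whether the block was already added.
MARKER="# set PATH for cuda 10.1 installation"

# Hypothetical helper: append the profile block to the file given as $1.
add_cuda_profile() {
  local target="$1"
  # Skip files that already contain the marker so reruns do not duplicate it.
  if [ -f "$target" ] && grep -qF "$MARKER" "$target"; then
    return 0
  fi
  # The quoted 'EOF' keeps ${PATH} and friends unexpanded in the output.
  cat >> "$target" <<'EOF'
# set PATH for cuda 10.1 installation
if [ -d "/usr/local/cuda-10.1/bin/" ]; then
    export PATH=/usr/local/cuda-10.1/bin${PATH:+:${PATH}}
    export LD_LIBRARY_PATH=/usr/local/cuda-10.1/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
fi
alias python="python3.8"
alias pip="pip3"
EOF
}

# Usage:
#   add_cuda_profile "$HOME/.bashrc"
#   add_cuda_profile /etc/skel/.bashrc    # needs root
```

Writing to /etc/skel/.bashrc requires root, so that second call has to run from a root shell.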
Restart
sudo reboot now
Verify Install
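SSH back into the VM and run the following. Each command should report a version, the ldconfig line should list libcudnn, and nvidia-smi should show the GPU. Nothing should look abnormal.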
pip --version
python --version
nvcc --version
/sbin/ldconfig -N -v $(sed 's/:/ /g' <<< "$LD_LIBRARY_PATH") 2>/dev/null | grep libcudnn
nvidia-smi
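The same checks can be scripted, which is handy if you rebuild the image later. The sketch below only tests that the expected commands are on the PATH. The missing_tools function name is illustrative. It checks pip3 and python3.8 rather than pip and python because aliases from .bashrc are not available in non-interactive scripts.

```shell
# Hypothetical helper: print the names of any tools that are not on PATH.
missing_tools() {
  local tool missing=""
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      # Join missing names with single spaces.
      missing="${missing}${missing:+ }${tool}"
    fi
  done
  printf '%s\n' "$missing"
}

missing="$(missing_tools pip3 python3.8 nvcc nvidia-smi)"
if [ -n "$missing" ]; then
  echo "Missing: $missing" >&2
else
  echo "All expected tools found"
fi
```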
Install GDAL
You can skip this step if you are not running GIS-based workloads.
sudo apt-get install gdal-bin libgdal-dev
ogrinfo --version
gdalinfo --version
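The packages above provide the GDAL command-line tools and development headers. If your training code also imports GDAL from Python, one common approach is to pin the pip package to the installed library version, which gdal-config (shipped with libgdal-dev) reports. This is a sketch rather than a required step. The gdal_requirement helper name is made up for illustration, and because the pip package builds from source it may also need a compiler and the Python development headers. The python3-gdal apt package is an alternative.

```shell
# Hypothetical helper: build a pip requirement for a given GDAL version.
gdal_requirement() {
  printf 'GDAL==%s\n' "$1"
}

if command -v gdal-config >/dev/null 2>&1; then
  # Pin the Python bindings to the installed library version.
  pip3 install "$(gdal_requirement "$(gdal-config --version)")"
  # Confirm that Python can import the bindings.
  python3 -c 'from osgeo import gdal; print(gdal.__version__)'
else
  echo "gdal-config not found; install libgdal-dev first" >&2
fi
```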
STEP 2 of 3: Create the Machine Image
Shut down the VM if it's still running
sudo shutdown now
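If you have already closed your SSH session, you can stop the VM from your workstation instead. This sketch assumes the gis-ml instance name from Step 1 and that your default zone is set.

```shell
INSTANCE="gis-ml"   # instance name from Step 1

if command -v gcloud >/dev/null 2>&1; then
  # Stop the instance; gcloud waits for the operation to finish.
  gcloud compute instances stop "$INSTANCE"
  # A stopped instance reports the status TERMINATED.
  gcloud compute instances describe "$INSTANCE" --format="value(status)"
else
  echo "gcloud not found; run this from a machine with the Cloud SDK" >&2
fi
```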
Create the image
Replace [ZONE] with the zone of your disk (in this guide, the default zone you set for this project), for example us-west1-a.
gcloud compute images create ml-with-gis \
--source-disk gis-ml \
--source-disk-zone [ZONE]
Get a list of our images
gcloud compute images list | grep ml-with-gis
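As an alternative to grep, gcloud can filter the list itself and report whether the image is usable yet. This sketch assumes the ml-with-gis image name from the create command above.

```shell
IMAGE="ml-with-gis"   # image name from the create command above

if command -v gcloud >/dev/null 2>&1; then
  # Show only this project's custom images that match the name.
  gcloud compute images list --no-standard-images --filter="name=${IMAGE}"
  # The image can be used once its status is READY.
  gcloud compute images describe "$IMAGE" --format="value(status)"
else
  echo "gcloud not found; skipping image checks" >&2
fi
```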
STEP 3 of 3: Create an Instance for Training
This is where all the hard work pays off. The single command below creates a GCE instance with a GPU, along with a modern GDAL and Python, on top of a modern Linux distro, Ubuntu 20.04. If you find yourself repeatedly adding the same dependencies (TensorFlow, etc.) after you spin these up, install them once on one of these instances and repeat Step 2 above, using that instance's disk as the source, to capture a new image.
When this is up and running you can then SSH to it and train your model.
gcloud compute instances create gis-ml-trainer-1 \
--boot-disk-size 100GB \
--maintenance-policy=TERMINATE \
--machine-type=n1-standard-4 \
--accelerator=type=nvidia-tesla-v100,count=1 \
--image=ml-with-gis
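As mentioned in Step 1, the original gis-ml VM is no longer needed once the image exists. A minimal cleanup sketch, assuming that instance name and your default zone; the image itself is a separate resource and is not removed by this.

```shell
BASE_INSTANCE="gis-ml"   # the VM the image was created from in Step 1

if command -v gcloud >/dev/null 2>&1; then
  # Delete the base VM; gcloud asks for confirmation before proceeding.
  gcloud compute instances delete "$BASE_INSTANCE"
  # Confirm what is still running in the project.
  gcloud compute instances list
else
  echo "gcloud not found; skipping cleanup" >&2
fi
```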