Retrograde Realizations

My AI dev environment

May 27, 2023 - 1766 words - 9 mins

I’ve recently got into learning about AI and thought I’d share my dev environment for doing experiments. My goal was to be able to follow along with various online learning resources that center around Jupyter notebooks using Python and PyTorch. Rather than just use these notebooks directly, I wanted a full IDE (VS Code) with Copilot integrated so that I could ask questions about the code I’m writing and solicit help. I cannot recommend Copilot X enough for that. The economics region of my brain is broken, so I also wanted to develop against two Nvidia 4090 GPUs I bought recently.

TL;DR: my setup

I’ve broken this post down into a hardware section and a software section, but here’s the TL;DR if you don’t care about the details and are just curious:

  1. 2 x Nvidia 4090 GPUs because VRAM, 8-bit performance
  2. 2 x Arch Linux computers because desktop environments hate running out of RAM
  3. SSH and SSHFS for connecting the two
  4. Connect Jupyter notebooks to VS Code to have access to Copilot X
  5. Docker and Pipenv to isolate projects

Hardware

I’ve got two desktop computers sitting under my desk (well, more, but two that are relevant here). One, which I’ll call the ML computer, has two beefy GPUs in it and the other, which I’ll call the desktop computer, is a bare-bones machine used to render my desktop environment.

GPU setup: 2 x Nvidia 4090s

Cloud is the way to go for a hobbyist if you are rational about money. It takes quite a few hours of machine learning to justify a dedicated GPU, and even full-time students are unlikely to rack up that many. I’m more of an emotional spender and I really can’t bring myself to pay by the hour for anything, so I went ahead and did something absurd: I bought two Nvidia 4090 GPUs. At least I’ll be able to finally try ray tracing in games.

If you’re thinking of doing the same, first read this blog: Which GPU(s) to Get for Deep Learning. It’s very thorough. TL;DR: if you care about VRAM, the 4090 is currently the best bang for the buck, especially if you want to do stuff with 8-bit weights.

I went with two because I wanted the extra VRAM for larger models, though the 4090s don’t support NVLink or SLI, so the two cards have to be managed explicitly by the software I write. It’s kind of a pain in the ass and I haven’t had much luck consistently getting newbie-friendly projects working across both, but I’m learning. PiPPy is one promising avenue that I intend to explore further for this. Having two also means I’ll be able to sort out multi-GPU problems locally even if I end up doing the heavy lifting in the cloud. Honestly, though, it would have been better to start with just one GPU.
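
For a sense of what managing the two cards in software looks like, here’s a minimal PyTorch sketch that splits a model across them by hand. The class name and layer sizes are made up purely for illustration:

import torch
import torch.nn as nn

# Naive model parallelism: the first half of the model lives on cuda:0, the
# second half on cuda:1, and the forward pass shuttles activations between them.
class TwoGPUModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.first = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        self.second = nn.Linear(4096, 10).to("cuda:1")

    def forward(self, x):
        x = self.first(x.to("cuda:0"))
        return self.second(x.to("cuda:1"))

model = TwoGPUModel()
print(model(torch.randn(8, 1024)).device)  # cuda:1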

Other GPU-related tips and notes:

  • Make sure you plan how you’re going to power them. I got a 1500W power supply. With both computers on the same 15 amp circuit, I’d best not turn on my printer while training.
  • I haven’t had any heat issues, surprisingly, despite having everything air-cooled.
  • Make sure they’ll actually fit – they can be nearly 14 inches long and take three PCIe slots worth of space. I had to drill out some rivets and remove the hard drive bay to make room.
  • Have some way to secure them – they’re so bulky that they can sag and wear out the PCIe slot in your computer. The GPU I bought came with a clever way of securing the tail end of the card, but it only worked for one of the two, so I had to get a little support post for the lower one.
  • If you have NVMe drive slots under where they’ll go, fill those first – taking these beefy GPUs in and out is a pain.
  • All the advice I’ve seen on the Internet says to avoid AMD for now; ROCm is meant to be a drop-in replacement for CUDA but is apparently poorly supported at the moment.

Computers: two of them!

I wasn’t planning to go this route at first, but it quickly became apparent that my desktop environment hates running out of VRAM unexpectedly, which turns out to be a common occurrence with ML workloads. When it happens, the desktop environment crashes or hangs, forcing a restart each time. I’m running Arch Linux with Cinnamon, though I suspect this is true for just about any OS. Ultimately, I decided to set up my primary ML computer to dual-boot into a console-only install of Arch and got a second computer whose only job is to run a desktop environment and connect over SSH to the ML computer. I’d recommend an Intel NUC or similar for the desktop computer, though I happened to already have an old computer I could use.

Having my desktop environment on the second computer means nothing bad will happen if I run out of VRAM on the main computer – at worst, I’ll just have to start training or inference over again with different parameters. Fortunately, everything else I wanted to do works pretty easily across the two machines too.
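
A related habit that saves some wasted runs: check free VRAM from the notebook before kicking off anything big. Here’s a rough sketch, assuming a reasonably recent PyTorch (mem_get_info is a fairly new addition):

import torch

# Print free and total VRAM for each visible GPU so an over-sized model or
# batch can be caught before it OOMs mid-run.
for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)
    print(f"cuda:{i}: {free / 1e9:.1f} GB free of {total / 1e9:.1f} GB")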

Other hardware specs

Some other hardware specs for the ML computer:

  • CPU: Threadripper 1950x - now an old 16-core 3.4 GHz processor, it still does the job.
  • RAM: 256GB of ECC unbuffered 2667MHz - this is definitely excessive, but comes in handy when processing lots of large files on the CPU. I’d recommend enabling swap instead of buying this much RAM.
  • Storage: 4TB of NVMe disk space across two drives - I am effectively doing a RAID0 with these, though via LVM. It took about two weeks of playing around with models and datasets to fill this up, so space is important.
  • A floppy drive - ok, this isn’t relevant, but I really do have a floppy drive hooked up to it.

Software

Once I got the hardware installed, I set up Arch Linux on both machines. The ML computer boots to runlevel 3 (no desktop environment) when I’m doing ML stuff, and the desktop computer runs Cinnamon. If you’re not a Linux nerd, I recommend Debian for both; I prefer Arch because I like configuring everything myself.

Connecting the two: SSH and SSHFS

I have an ~/ai directory on both computers, where I put all my code and models etc. On the desktop computer, it’s empty and I mount the ML computer to it over SSHFS:

$ sshfs primary:ai ~/ai

This way, I have convenient access to all of the files from both machines; the ML computer’s directory behaves as if it were local to the desktop computer. I also keep a couple of SSH terminals open to run commands and monitor GPU VRAM on the ML computer (with nvidia-smi).
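
If you’d rather watch VRAM from a Python session than keep nvidia-smi running in a terminal, a rough equivalent using nvidia-smi’s query flags looks like this:

import subprocess
import time

# Print a compact per-GPU memory summary every couple of seconds, roughly
# what keeping `watch nvidia-smi` open in an SSH terminal gives you.
while True:
    result = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=index,name,memory.used,memory.total",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    print(result.stdout.strip(), flush=True)
    time.sleep(2)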

Jupyter with VS Code

Python is what just about everybody in AI is using. In particular, Jupyter notebooks seem to be popular among AI researchers. Jupyter comes with a decent web interface, but it leaves much to be desired if you’re used to a full IDE (and GitHub Copilot…). Fortunately, VS Code supports Jupyter notebooks, including remote ones, which means you can run the IDE on the desktop computer and the Jupyter environment on the ML computer. I’m running the Insiders edition of VS Code to gain access to Copilot X, though there is also a Genie GPT plugin as an alternative.

Unfortunately, not everything works right out of the box when using VS Code with Jupyter notebooks. In particular, I’ve discovered some of the fast.ai widgets don’t render correctly without some tweaks (and even then, I’ve found the ImageClassifierCleaner never works).

To get progress bars to render, run this in a cell:

from IPython.display import clear_output, DisplayHandle

# Define a function that updates an existing display object
def update_patch(self, obj):
    # Clear any outputs in the current IPython cell,
    # but wait until new outputs arrive before doing so
    clear_output(wait=True)

    # Update the display with a new object
    self.display(obj)

# Extend the DisplayHandle class by adding the update_patch function
# as a method named 'update'
# This effectively overwrites the existing 'update' method (if any) in DisplayHandle
DisplayHandle.update = update_patch

Docker to isolate projects

Rather than mess with pip, pipenv, conda, etc. to keep my various projects isolated, I run Docker from my project directory. It’s more hermetic than the alternatives and provides a modicum of security should I pull in a third-party package with some kind of vulnerability in it (though this isn’t a perfect solution).

For Jupyter notebooks, this is the command I use:

# --shm-size=2G        set the size of the shared memory to 2GB
# --gpus all           pass through all GPUs available
# --detach             run the container in the background
# --interactive --tty  keep STDIN open and allocate a pseudo-TTY
# --publish 8848:8888  publish the container's 8888 on the host's port 8848
# --publish 8080:8080  publish the container's 8080 on the host's port 8080
# --volume             mount the current directory as the notebook home
# --env                grant sudo, enable JupyterLab, and set the access token
#                      (replace "secrettoken" with your own, made-up token)
# --user root          set the user to root
$ docker run \
    --shm-size=2G \
    --gpus all \
    --detach \
    --interactive \
    --tty \
    --publish 8848:8888 \
    --publish 8080:8080 \
    --volume "$(pwd)":/home/jovyan \
    --env GRANT_SUDO=yes \
    --env JUPYTER_ENABLE_LAB=yes \
    --env JUPYTER_TOKEN="secrettoken" \
    --user root \
    cschranz/gpu-jupyter:v1.5_cuda-11.6_ubuntu-20.04_python-only

Change secrettoken to something only you know; it acts as a password of sorts for connecting to the server. When you open a notebook file with VS Code, it’ll ask you to pick the Python environment. There’s an option to connect to a remote Jupyter server; enter the ML computer’s address with the published port (8848 in the command above) and append ?token=secrettoken to connect. I often hit a bug where the server doesn’t show up in the list until I restart VS Code, however.
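
For reference, with the port mapping above, the address you give VS Code ends up looking something like this (substitute your ML computer’s hostname or IP for ml-computer):

http://ml-computer:8848/?token=secrettoken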

I arrived at this Docker command through trial and error; if you use it, be sure that you understand everything it does first – even Docker cannot guarantee perfect isolation.
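
Once VS Code is connected, a quick sanity check worth running in the first cell is to confirm that the container actually sees both GPUs through the --gpus all passthrough:

import torch

# Should report 2 devices and name both 4090s if the passthrough worked.
print(torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))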

On the desktop computer, I do use Pipenv as well so that VS Code knows about the various APIs I pull in and can give me hints about them.