The biggest pain point of a GPU server shared by many people is that everyone wants root privileges, and everyone is overconfident.

Since taking over administration of our lab's servers, I have been running LXD containers as the virtualization solution for over a year. The machines are mostly 4x Titan XP or 4x 2080 Ti boxes, shared by nearly 30 people. Overall, the LXD setup has been stable and easy to use; combined with a set of scripts, it greatly reduces the administrator's workload.

A Google search for LXD + GPU turns up plenty of material in both Chinese and English, so this article only outlines the installation and configuration steps, and focuses instead on the challenges we hit in different environments and how we solved them.

My documentation is also publicly available for reference: https://deserts.gitbook.io/gpu/manual

Installation and configuration

Host driver setup

I prefer Ubuntu Server. First, install the NVIDIA driver; CUDA is not required on the host.

apt install git gcc g++ make cmake build-essential curl -y
apt-get remove --purge nvidia* -y

# Blacklist and disable the nouveau kernel module
# by appending the following lines to blacklist-nouveau.conf
echo "blacklist nouveau" >> /etc/modprobe.d/blacklist-nouveau.conf
echo "options nouveau modeset=0" >> /etc/modprobe.d/blacklist-nouveau.conf

# Then rebuild the initramfs
update-initramfs -u
# Make the driver .run file executable:
sudo chmod +x NVIDIA-Linux-x86_64-<version>.run
# The flag below is essential; do not omit it:
sudo ./NVIDIA-Linux-x86_64-<version>.run --no-opengl-files
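Once the installer finishes, a quick sanity check (after a reboot if nouveau was previously loaded):

```shell
# The driver should enumerate all GPUs on the host
nvidia-smi
# nouveau should no longer be loaded; this should print nothing
lsmod | grep nouveau
```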

Install nvidia-container-runtime so that containers can use the host's GPU driver directly.

# Add the package repositories
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | apt-key add -

distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | tee /etc/apt/sources.list.d/nvidia-docker.list
apt-get update

apt install libnvidia-container-dev libnvidia-container-tools nvidia-container-runtime -y
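To confirm the container toolkit can talk to the driver before touching LXD, nvidia-container-cli can query it (purely a sanity check):

```shell
# Prints driver/NVML versions and the GPUs the runtime can see
nvidia-container-cli info
```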

Host LXD setup

Install and configure ZFS as LXD's storage backend, with deduplication enabled on the dataset. Replace /dev/sda with the disk you intend to dedicate to the pool.

apt install zfsutils-linux

zpool create tank /dev/sda
zfs create tank/lxd
zfs set dedup=on tank/lxd
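Note that ZFS deduplication keeps its dedup table in RAM and can hurt performance on memory-starved machines; it pays off here because containers cloned from one template share most of their files. The setting and its effect can be checked with:

```shell
zfs get dedup tank/lxd   # should report dedup=on for the dataset
zpool list tank          # the DEDUP column shows the achieved ratio
```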

Install LXD

snap install lxd

Switch to a mirror and pull an image

lxc remote add tuna-images https://mirrors.tuna.tsinghua.edu.cn/lxc-images/ --protocol=simplestreams --public
lxc image copy tuna-images:ubuntu/18.04 local: --alias ubuntu/18.04 --copy-aliases --public

Initialize LXD. Note: point it at the ZFS dataset created above; whether to use a bridge depends on your network setup.

lxd init
Would you like to use LXD clustering? (yes/no) [default=no]:
Do you want to configure a new storage pool? (yes/no) [default=yes]:
Name of the new storage pool [default=default]: 
Name of the storage backend to use (btrfs, ceph, dir, lvm, zfs) [default=zfs]:
Create a new ZFS pool? (yes/no) [default=yes]: no
Name of the existing ZFS pool or dataset: tank/lxd
Would you like to connect to a MAAS server? (yes/no) [default=no]:
Would you like to create a new local network bridge? (yes/no) [default=yes]: no
Would you like to configure LXD to use an existing bridge or host interface? (yes/no) [default=no]: yes
Name of the existing bridge or host interface: br0
Would you like LXD to be available over the network? (yes/no) [default=no]:
Would you like stale cached images to be updated automatically? (yes/no) [default=yes]:
Would you like a YAML "lxd init" preseed to be printed? (yes/no) [default=no]:
lxc profile set default nvidia.runtime true
lxc profile device add default gpu gpu
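The two profile commands above can be verified by dumping the profile; expect nvidia.runtime: "true" under config and a device of type gpu under devices:

```shell
# Inspect the default profile that every new container will inherit
lxc profile show default
```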

Create a template container

lxc init ubuntu/18.04 template -p default
lxc start template

Enter the template container and install whatever base software you need, such as conda.

lxc exec template bash
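As an example of provisioning, a system-wide Miniconda can be installed inside the template; the install path and profile hook below are my own choices, not a requirement:

```shell
# Inside the template container
apt update && apt install -y wget
# Official Miniconda installer; -b = batch mode, -p = install prefix
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O /tmp/miniconda.sh
bash /tmp/miniconda.sh -b -p /opt/miniconda3
# Put conda on every future user's PATH
echo 'export PATH=/opt/miniconda3/bin:$PATH' > /etc/profile.d/conda.sh
# The GPU should already be visible thanks to nvidia.runtime
nvidia-smi
```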

When finished, publish the template container as an image, then delete the container.

lxc stop template
lxc publish template --alias template --public
lxc delete template

Automation scripts

With the shell scripts I provide, a container is created automatically whenever a new user account is created. Use add_user.sh to create the user and container; when the user logs into the host, login.sh runs.
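The script itself is not reproduced here, but its core is roughly the following. Everything in this sketch -- the function names, the port scheme, the device name -- is an illustrative assumption, not the actual script; only the port number 10020 in the transcript below suggests a base-10000 scheme.

```shell
#!/bin/sh
# Hypothetical sketch of add_user.sh: create a host account, clone a
# container from the published template image, and map a unique host
# port to sshd inside the container.

alloc_port() {
    # Assumed scheme: 10000 plus the account's uid offset from 1000
    # (consistent with the port 10020 shown in the transcript).
    echo $((10000 + $1 - 1000))
}

create_user() {
    username=$1
    useradd -m -s /bin/bash "$username"        # host account (runs the menu)
    lxc init template "$username" -p default   # container from template image
    port=$(alloc_port "$(id -u "$username")")
    # Forward a unique host port to port 22 inside the container
    lxc config device add "$username" sshproxy proxy \
        listen=tcp:0.0.0.0:"$port" connect=tcp:127.0.0.1:22
    lxc start "$username"
}
```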

Creating a user:

ssh addu@172.26.xxx.xxx
# enter the password
addu@172.26.xxx.xxx's password:
=====Welcome!
We need to get sudo permission first. Enter the password for `addu`.
# enter addu's password to gain sudo privileges
[sudo] password for addu:
=====Let's setup a new account and create a container now.
# enter a username; the account and container are then created automatically
Enter your username: test
Creating user...
Allocating container for test...
Creating test
Allocating ssh port... 10020
Device sshproxy added to test
# set the user's password
set password for test now (host only).
Enter new UNIX password:
Retype new UNIX password:
passwd: password updated successfully
Login this host via `ssh <username>@<host-ip>` to manage your container.
Done!

User login:

# log in with the new account to manage the container
ssh test@172.26.xxx.xxx
test@172.26.xxx.xxx's password:
Welcome to Ubuntu 18.04.2 LTS (GNU/Linux 4.15.0-54-generic x86_64)
...
 Hi, test
 You're using the GPU Server in Vision Group.

==========About your container:
Your container is not running.
Transfer data to your container using scp or sftp;
File sharing is encouraged: datasets are under shared/datasets, downloaded files under shared/downloads, etc.

See GPU load: nvidia-smi.
    memory usage: free -h.
    disk usage: df -h.

===== main menu  =====
[1] start your container  # boot the container
[2] enter your container  # attach to the container
[3] stop your container   # shut down (or run `shutdown now` inside the container)
[4] change your password  # host password (run `passwd` inside the container for the container password)
[5] allocate ports        # set up port forwarding
[6] release ports         # release forwarded ports
[0] show info             # show container status
[x] exit                  # quit the menu
# start the container
Enter your choice: 1
========== Starting your container...

Press any key to continue...
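Menu options [5] and [6] presumably wrap LXD proxy devices, the same mechanism used for the ssh mapping; done by hand, the equivalent would be (container name, device name, and ports here are examples):

```shell
# Forward host port 16006 to TensorBoard's default port inside "test"
lxc config device add test tb6006 proxy \
    listen=tcp:0.0.0.0:16006 connect=tcp:127.0.0.1:6006
# And to release it again
lxc config device remove test tb6006
```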
Tinkering with Python