Header photo taken at Beijing Olympic Park.
This article describes how to use the Bitfusion capability built into vSphere 7 from Kubernetes, so that applications can consume remote GPU resources.
(The author is an engineer at the VMware Cloud Native Lab; this article was first published by the VMware China R&D Center.)
Background
Taking GPUs as an example, Bitfusion's GPU virtualization technology lets users transparently share and consume AI accelerators sitting on any server in the data center, without modifying their workloads. This not only improves resource utilization, but also greatly simplifies the deployment of AI applications and builds a data-center-wide pool of AI accelerators.
Bitfusion helps address these problems by providing a pool of remote GPUs. It makes GPUs first-class citizens that can be abstracted, partitioned, automated and shared just like traditional compute resources. Kubernetes, meanwhile, has become the standard platform for deploying and managing machine-learning workloads.
This article shows how the newly developed Bitfusion Device Plugin makes it easy to consume the GPU pool provided by Bitfusion from Kubernetes and run popular TensorFlow deep-learning workloads. First, build the Device Plugin image; the build command and its Dockerfile are shown below.
docker build -f bitfusion-device-plugin/Dockerfile -t bitfusion_device_plugin/bitfusion-device:v0.1 bitfusion-device-plugin
FROM bitfusion-base:v0.1

RUN apt-get update && apt-get install -y curl

# Download and unpack the Go toolchain into /goroot.
RUN \
  mkdir -p /goroot && \
  curl .9.linux-amd64.tar.gz | tar xvzf - -C /goroot --strip-components=1

# Set environment variables.
ENV GOROOT /goroot
ENV GOPATH /gopath
ENV PATH $GOROOT/bin:$GOPATH/bin:$PATH

# Define working directory.
WORKDIR /gopath/src/bitfusion-device-plugin
COPY . .

# Build the device-plugin binary and install it together with the start script.
RUN go build -o bitfusion-device-plugin
RUN cp bitfusion-device-plugin /usr/bin/bitfusion-device-plugin \
    && cp *.sh /usr/bin

CMD ["./start.sh"]
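The DaemonSet below pulls this image on every node, so unless each node already has the image locally, it usually has to be pushed to a registry the cluster can reach. A minimal sketch, assuming a hypothetical registry at registry.example.com (adjust the image field in device_plugin.yml to match):

docker tag bitfusion_device_plugin/bitfusion-device:v0.1 registry.example.com/bitfusion_device_plugin/bitfusion-device:v0.1
docker push registry.example.com/bitfusion_device_plugin/bitfusion-device:v0.1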
Update the image field in device_plugin.yml as shown below; the Device Plugin is then installed on the Kubernetes nodes as a DaemonSet.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: bitfusion-cli-device-plugin
  namespace: kube-system
  labels:
    tier: node
spec:
  selector:
    matchLabels:
      tier: node
  template:
    metadata:
      labels:
        tier: node
    spec:
      hostNetwork: true
      containers:
        - name: device-plugin-ctr
          image: bitfusion_device_plugin/bitfusion-device:v0.1
          securityContext:
            privileged: true
          command: ["./start.sh"]
          env:
            - name: REG_EXP_SFC
              valueFrom:
                configMapKeyRef:
                  name: configmap
                  key: reg-exp
            - name: SOCKET_NAME
              valueFrom:
                configMapKeyRef:
                  name: configmap
                  key: socket-name
            - name: RESOURCE_NAME
              valueFrom:
                configMapKeyRef:
                  name: configmap
                  key: resource-name
          volumeMounts:
            - mountPath: "/root/.bitfusion"
              name: bitfusion-cli
            - mountPath: /gopath/run
              name: docker
            - mountPath: /gopath/proc
              name: proc
            - mountPath: "/root/.ssh/id_rsa"
              name: ssh-key
            - mountPath: "/var/lib/kubelet"
              name: kubelet-socket
            - mountPath: "/etc/kubernetes/pki"
              name: pki
      volumes:
        - name: bitfusion-cli
          hostPath:
            path: "/root/.bitfusion"
        - name: docker
          hostPath:
            path: /var/run
        - name: proc
          hostPath:
            path: /proc
        - name: ssh-key
          hostPath:
            path: "/root/.ssh/id_rsa"
        - name: kubelet-socket
          hostPath:
            path: "/var/lib/kubelet"
        - name: pki
          hostPath:
            path: "/etc/kubernetes/pki"
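The Device Plugin container reads its configuration from a ConfigMap named configmap in the kube-system namespace (keys reg-exp, socket-name and resource-name, as referenced in the env section above). The repository presumably ships one; if not, a minimal sketch could look like the following, where the socket-name and reg-exp values are purely illustrative assumptions, while resource-name should match the bitfusion.io/gpu resource requested by pods later on.

apiVersion: v1
kind: ConfigMap
metadata:
  name: configmap
  namespace: kube-system
data:
  resource-name: bitfusion.io/gpu   # extended resource name advertised to the kubelet
  socket-name: bitfusion            # assumed device-plugin socket name
  reg-exp: bitfusion                # assumed regular expression used by the plugin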
kubectl apply -f bitfusion-device-plugin/device_plugin.yml
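Once the DaemonSet is running, it is worth verifying that the plugin has registered the extended resource with the kubelet. A quick check (the node name is environment-specific) is to confirm the DaemonSet pods are up and that the nodes now advertise bitfusion.io/gpu:

kubectl -n kube-system get daemonset bitfusion-cli-device-plugin
kubectl -n kube-system get pods -o wide | grep bitfusion
kubectl describe node <node-name> | grep bitfusion.io/gpu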
Next, build the TensorFlow client image that the workload pods will use:
docker build -f bitfusion-device-plugin/docker/bitfusion-tfl-cli/Dockerfile -t bitfusion-tfl-cli:v0.1 bitfusion-device-plugin/docker/bitfusion-tfl-cli
FROM bitfusion-base:v0.1

# Install TensorFlow with GPU support into the image's conda environment.
RUN conda install tensorflow-gpu==1.13.1
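As a quick sanity check of the image, one can confirm that TensorFlow imports correctly (a sketch, assuming python from the conda environment is on the image's default PATH; it does not exercise the GPU, which is only attached remotely through Bitfusion at run time):

docker run --rm bitfusion-tfl-cli:v0.1 python -c "import tensorflow as tf; print(tf.__version__)"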
Add the label to pod.yaml and adjust the parameters as follows:
resource limit: sets how many bitfusion.io/gpu resources the application may use;
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: bfs-pod-configmap
---
apiVersion: v1
kind: Pod
metadata:
  name: bfs-demo
  labels:
    purpose: device-demo
spec:
  hostNetwork: true
  containers:
    - name: demo
      image: bitfusion-tfl-cli:v0.1
      imagePullPolicy: Always
      workingDir: /root
      securityContext:
        privileged: true
      command: ["/bin/bash", "-c", "--"]
      args: ["python /benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py
              --local_parameter_device=gpu
              --batch_size=32
              --model=inception3 "]
      volumeMounts:
        - mountPath: "/root/.bitfusion"
          name: config-volume
      resources:
        limits:
          bitfusion.io/gpu: 1
  volumes:
    - name: config-volume
      hostPath:
        path: "/root/.bitfusion"
TensorFlow has its own official benchmark suite, tensorflow/benchmarks. The tf_cnn_benchmarks scripts it contains cover models such as resnet50, resnet152, inception3, vgg16, googlenet and alexnet; providing a few simple parameters is enough to start a test, e.g. by switching the --model flag as sketched below.
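For example, changing the --model argument in the pod's args selects a different network; a sketch of the same command line with resnet50 instead of inception3:

python /benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
    --local_parameter_device=gpu \
    --batch_size=32 \
    --model=resnet50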
Here we choose the inception3 model for the benchmark and check whether the Bitfusion client inside the pod successfully connects to the Bitfusion server.
kubectl apply -f bitfusion-device-plugin/example/pod/pod.yaml
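After the pod is scheduled, its logs show whether the Bitfusion client reached the server and how the benchmark is progressing (tf_cnn_benchmarks reports throughput in images/sec):

kubectl get pod bfs-demo
kubectl logs -f bfs-demo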
Source: ITPUB Blog, http://blog.itpub.net/31557890/viewspace-2707077/. Please credit the source when reposting.