Pitfalls of running containerized flannel in ali-vpc mode on Alibaba Cloud

Friday, October 26, 2018

Notes on some pitfalls of deploying a k8s cluster on Alibaba Cloud with the flannel plugin in ali-vpc mode. The deployment followed https://github.com/coreos/flannel/blob/master/Documentation/alicloud-vpc-backend-cn.md.

OS: CentOS 7.4
Docker: 18.06
k8s: v1.12.2
flannel: v0.10.0-amd64 (deployed as a container via a DaemonSet)
CoreDNS: 1.2.2

Flannel's Alibaba Cloud VPC mode (the ali-vpc backend) replaces packet encapsulation with VPC route table entries for the best performance: no extra flannel interface is needed on the nodes, and throughput is better than with the vxlan backend.
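
As a rough sanity check on a node (assuming iproute2; device and route layouts can differ by setup), the absence of a VXLAN device is easy to spot:

# with the vxlan backend every node carries a flannel.1 VXLAN device;
# with ali-vpc there is none, and cross-node routes live in the VPC route table
ip -d link show type vxlan   # prints nothing in ali-vpc mode
ip route                     # no routes for other nodes' pod subnets; those are in the VPC route table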

Pitfall 1: the flannel container cannot resolve external domain names at startup

When flannel is being deployed, the cluster's internal DNS is not ready yet, since the DNS pods (CoreDNS/kube-dns) themselves depend on the cluster network.
In ali-vpc mode, flannel has to call the Alibaba Cloud API to add a VPC route for the subnet assigned to each node, and on startup it fails with:

dial tcp: lookup ecs-cn-hangzhou.aliyuncs.com: no such host

The flannel container cannot resolve ecs-cn-hangzhou.aliyuncs.com, so it cannot reach the Alibaba Cloud API and cannot configure the VPC routes.
Fix: point the pod at the DNS server 223.6.6.6 (any other public DNS works too).
The change is as follows:

spec:
  template:
    spec:
      # add DNS server 223.6.6.6 (any public DNS works)
      dnsPolicy: "None"
      dnsConfig:
        nameservers:
        - 223.6.6.6

The complete flannel.yaml is as follows:

--- 
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: flannel
rules:
  - apiGroups:
      - ""
    resources:
      - pods
    verbs:
      - get
  - apiGroups:
      - ""
    resources:
      - nodes
    verbs:
      - list
      - watch
  - apiGroups:
      - ""
    resources:
      - nodes/status
    verbs:
      - patch
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: flannel
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: flannel
subjects:
- kind: ServiceAccount
  name: flannel
  namespace: kube-system
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: flannel
  namespace: kube-system
---
kind: ConfigMap
apiVersion: v1
metadata:
  name: kube-flannel-cfg
  namespace: kube-system
  labels:
    tier: node
    app: flannel
data:
  cni-conf.json: |
    {
      "name": "cbr0",
      "plugins": [
        {
          "type": "flannel",
          "delegate": {
            "hairpinMode": true,
            "isDefaultGateway": true
          }
        },
        {
          "type": "portmap",
          "capabilities": {
            "portMappings": true
          }
        }
      ]
    }
  net-conf.json: |
    {
      "Network": "10.10.0.0/16",
      "Backend": {
        "Type": "ali-vpc"
      }
    }
---
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: kube-flannel-ds
  namespace: kube-system
  labels:
    tier: node
    app: flannel
spec:
  template:
    metadata:
      labels:
        tier: node
        app: flannel
    spec:
      # add DNS server 223.6.6.6 (any public DNS works)
      dnsPolicy: "None"
      dnsConfig:
        nameservers:
        - 223.6.6.6
      hostNetwork: true
      nodeSelector:
        beta.kubernetes.io/arch: amd64
      tolerations:
      - key: node-role.kubernetes.io/master
        operator: Exists
        effect: NoSchedule
      serviceAccountName: flannel
      initContainers:
      - name: install-cni
        image: quay.mirrors.ustc.edu.cn/coreos/flannel:v0.10.0-amd64
        command:
        - cp
        args:
        - -f
        - /etc/kube-flannel/cni-conf.json
        - /etc/cni/net.d/10-flannel.conflist
        volumeMounts:
        - name: cni
          mountPath: /etc/cni/net.d
        - name: flannel-cfg
          mountPath: /etc/kube-flannel/
      containers:
      - name: kube-flannel
        image: quay.mirrors.ustc.edu.cn/coreos/flannel:v0.10.0-amd64
        command:
        - /opt/bin/flanneld
        args:
        - --ip-masq
        - --kube-subnet-mgr
        #- --iface=eth1
        resources:
          requests:
            cpu: "100m"
            memory: "50Mi"
          limits:
            cpu: "100m"
            memory: "50Mi"
        securityContext:
          privileged: true
        env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: POD_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: ACCESS_KEY_ID
          value: your_ali_key_id
        - name: ACCESS_KEY_SECRET
          value: your_ali_key_secret
        volumeMounts:
        - name: run
          mountPath: /run
        - name: flannel-cfg
          mountPath: /etc/kube-flannel/
      volumes:
        - name: run
          hostPath:
            path: /run
        - name: cni
          hostPath:
            path: /etc/cni/net.d
        - name: flannel-cfg
          configMap:
            name: kube-flannel-cfg
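
Assuming the manifest above is saved as flannel.yaml (with the real AccessKey ID and secret filled in), it is applied the usual way:

kubectl apply -f flannel.yaml
kubectl -n kube-system get pods -l app=flannel -o wide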

With the DNS resolution problem out of the way, flannel starts up normally:

main.go:475] Determining IP address of default interface
main.go:488] Using interface with name eth0 and address 172.16.20.10
main.go:505] Defaulting external address to interface address (172.16.20.10)
kube.go:131] Waiting 10m0s for node controller to sync
kube.go:294] Starting kube subnet manager
kube.go:138] Node controller sync successful
main.go:235] Created subnet manager: Kubernetes Subnet Manager - 172.16.20.10
main.go:238] Installing signal handlers
main.go:353] Found network config - Backend type: ali-vpc
alivpc.go:63] Unmarshal Configure : { }
alivpc.go:164] Keep target entry: rtableid=vtb-xxxxxxxxxx, CIDR=10.10.0.0/24, NextHop=i-xxxxxxxxxx 
alivpc.go:187] Keep route entry: rtableid=vtb-xxxxxxxxxx, CIDR=10.10.1.0/24, NextHop=i-xxxxxxxxxx
...
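
To double-check from inside the cluster that the custom resolver actually took effect, something like the following works (the pod name is just a placeholder; use one of the flannel pods listed by the get pods command above):

kubectl -n kube-system exec kube-flannel-ds-xxxxx -c kube-flannel -- cat /etc/resolv.conf
# expected to show: nameserver 223.6.6.6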

Pitfall 2: pods on different nodes cannot reach each other

Pods on the same node can ping each other, but pods on different nodes cannot.
The natural first step was to troubleshoot the flannel network between the nodes, but flannel could not have been healthier; only then did it dawn on me that the Alibaba Cloud security group was the culprit (the official documentation says nothing about having to change the security group).
The fix is to whitelist 10.10.0.0/16 (the flannel network, i.e. the pod IP range) in the Alibaba Cloud security group.
Once the rule was added, everything worked.
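
For reference, the same whitelist rule can also be added with the aliyun CLI instead of the console. This is only a sketch: the security group ID is a placeholder and the parameters assume the ECS AuthorizeSecurityGroup API with an all-protocol rule.

# allow all inbound traffic from the pod CIDR 10.10.0.0/16 into the node security group
aliyun ecs AuthorizeSecurityGroup \
  --RegionId cn-hangzhou \
  --SecurityGroupId sg-xxxxxxxxxx \
  --IpProtocol all \
  --PortRange -1/-1 \
  --SourceCidrIp 10.10.0.0/16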