Notes on the pitfalls of deploying a Kubernetes cluster on Alibaba Cloud with the flannel plugin in ali-vpc mode. The deployment followed https://github.com/coreos/flannel/blob/master/Documentation/alicloud-vpc-backend-cn.md.
OS: CentOS 7.4
Docker: 18.06
Kubernetes: v1.12.2
flannel: v0.10.0-amd64 (deployed in containers as a DaemonSet)
CoreDNS: 1.2.2
flannel's Alibaba Cloud VPC mode (the ali-vpc backend) replaces packet encapsulation with VPC route entries, so no extra flannel interface is created on the nodes and performance is better than with the vxlan backend.
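The effect of the ali-vpc backend can be pictured as a plain VPC route table: each node's Pod subnet maps to that node's ECS instance, and packets are delivered directly, with no tunnel. A minimal sketch of this lookup (the CIDRs and instance IDs here are made-up placeholders, not real values from the cluster):

```python
import ipaddress

# Hypothetical VPC route table as the ali-vpc backend would program it:
# one entry per node, mapping the node's Pod /24 to its ECS instance ID.
routes = {
    ipaddress.ip_network("10.10.0.0/24"): "i-node-a",
    ipaddress.ip_network("10.10.1.0/24"): "i-node-b",
}

def next_hop(pod_ip: str) -> str:
    """Return the ECS instance that owns the subnet containing pod_ip."""
    ip = ipaddress.ip_address(pod_ip)
    for cidr, instance in routes.items():
        if ip in cidr:
            return instance
    raise LookupError(f"no route for {pod_ip}")

print(next_hop("10.10.1.25"))  # a Pod on node b -> "i-node-b"
```

The VPC itself does the forwarding, which is why this mode has no per-packet encapsulation cost.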
Pitfall 1: containers cannot resolve external domain names at startup
When flannel is being deployed, the cluster's internal DNS is not ready yet, because the DNS Pods (CoreDNS/kube-dns) themselves depend on the cluster network that flannel provides.
In ali-vpc mode, flannel has to call the Alibaba Cloud API to add a VPC route entry for the subnet assigned to each node. At startup, flannel reported:

dial tcp: lookup ecs-cn-hangzhou.aliyuncs.com: no such host

The flannel container could not resolve ecs-cn-hangzhou.aliyuncs.com, so it could not reach the Alibaba Cloud API and the VPC routes were never configured.
Fix: give the flannel Pod an explicit DNS server, 223.6.6.6 (any public DNS works). Change the DaemonSet spec as follows:
spec:
  template:
    spec:
      # Add DNS server 223.6.6.6 (any public DNS works)
      dnsPolicy: "None"
      dnsConfig:
        nameservers:
          - 223.6.6.6
The complete flannel.yaml:
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: flannel
rules:
  - apiGroups:
      - ""
    resources:
      - pods
    verbs:
      - get
  - apiGroups:
      - ""
    resources:
      - nodes
    verbs:
      - list
      - watch
  - apiGroups:
      - ""
    resources:
      - nodes/status
    verbs:
      - patch
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: flannel
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: flannel
subjects:
  - kind: ServiceAccount
    name: flannel
    namespace: kube-system
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: flannel
  namespace: kube-system
---
kind: ConfigMap
apiVersion: v1
metadata:
  name: kube-flannel-cfg
  namespace: kube-system
  labels:
    tier: node
    app: flannel
data:
  cni-conf.json: |
    {
      "name": "cbr0",
      "plugins": [
        {
          "type": "flannel",
          "delegate": {
            "hairpinMode": true,
            "isDefaultGateway": true
          }
        },
        {
          "type": "portmap",
          "capabilities": {
            "portMappings": true
          }
        }
      ]
    }
  net-conf.json: |
    {
      "Network": "10.10.0.0/16",
      "Backend": {
        "Type": "ali-vpc"
      }
    }
---
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: kube-flannel-ds
  namespace: kube-system
  labels:
    tier: node
    app: flannel
spec:
  template:
    metadata:
      labels:
        tier: node
        app: flannel
    spec:
      # Add DNS server 223.6.6.6 (any public DNS works)
      dnsPolicy: "None"
      dnsConfig:
        nameservers:
          - 223.6.6.6
      hostNetwork: true
      nodeSelector:
        beta.kubernetes.io/arch: amd64
      tolerations:
        - key: node-role.kubernetes.io/master
          operator: Exists
          effect: NoSchedule
      serviceAccountName: flannel
      initContainers:
        - name: install-cni
          image: quay.mirrors.ustc.edu.cn/coreos/flannel:v0.10.0-amd64
          command:
            - cp
          args:
            - -f
            - /etc/kube-flannel/cni-conf.json
            - /etc/cni/net.d/10-flannel.conflist
          volumeMounts:
            - name: cni
              mountPath: /etc/cni/net.d
            - name: flannel-cfg
              mountPath: /etc/kube-flannel/
      containers:
        - name: kube-flannel
          image: quay.mirrors.ustc.edu.cn/coreos/flannel:v0.10.0-amd64
          command:
            - /opt/bin/flanneld
          args:
            - --ip-masq
            - --kube-subnet-mgr
            #- --iface=eth1
          resources:
            requests:
              cpu: "100m"
              memory: "50Mi"
            limits:
              cpu: "100m"
              memory: "50Mi"
          securityContext:
            privileged: true
          env:
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: POD_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
            - name: ACCESS_KEY_ID
              value: your_ali_key_id
            - name: ACCESS_KEY_SECRET
              value: your_ali_key_secret
          volumeMounts:
            - name: run
              mountPath: /run
            - name: flannel-cfg
              mountPath: /etc/kube-flannel/
      volumes:
        - name: run
          hostPath:
            path: /run
        - name: cni
          hostPath:
            path: /etc/cni/net.d
        - name: flannel-cfg
          configMap:
            name: kube-flannel-cfg
With the DNS issue fixed, flannel starts normally:
main.go:475] Determining IP address of default interface
main.go:488] Using interface with name eth0 and address 172.16.20.10
main.go:505] Defaulting external address to interface address (172.16.20.10)
kube.go:131] Waiting 10m0s for node controller to sync
kube.go:294] Starting kube subnet manager
kube.go:138] Node controller sync successful
main.go:235] Created subnet manager: Kubernetes Subnet Manager - 172.16.20.10
main.go:238] Installing signal handlers
main.go:353] Found network config - Backend type: ali-vpc
alivpc.go:63] Unmarshal Configure : { }
alivpc.go:164] Keep target entry: rtableid=vtb-xxxxxxxxxx, CIDR=10.10.0.0/24, NextHop=i-xxxxxxxxxx
alivpc.go:187] Keep route entry: rtableid=vtb-xxxxxxxxxx, CIDR=10.10.1.0/24, NextHop=i-xxxxxxxxxx
...
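The /24 entries in the log come from flannel carving a per-node subnet out of the /16 Network set in net-conf.json (a /24 per node is flannel's default SubnetLen). The split can be reproduced with Python's ipaddress module:

```python
import ipaddress

# "Network": "10.10.0.0/16" from net-conf.json; flannel hands each node
# one /24 out of this range by default.
network = ipaddress.ip_network("10.10.0.0/16")
node_subnets = list(network.subnets(new_prefix=24))

print(node_subnets[0])    # 10.10.0.0/24 - first node, matching the log
print(node_subnets[1])    # 10.10.1.0/24 - second node
print(len(node_subnets))  # 256 possible node subnets in total
```

Each of these subnets becomes one route entry in the VPC route table, so the /16 here also bounds how many nodes the cluster can hold.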
Pitfall 2: Pods on different nodes cannot reach each other
Pods on the same node can ping each other, but Pods on different nodes cannot.
Given this symptom, the obvious first step is to check the flannel network between the nodes, but flannel turned out to be working perfectly. Only then did I remember the Alibaba Cloud security group (the official document says nothing about changing it).
Add a whitelist rule for 10.10.0.0/16 (the flannel network, i.e. the Pod IP range) to the security group.
After adding the rule, everything works.
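To sanity-check which addresses the 10.10.0.0/16 security-group rule actually covers, a quick membership test with ipaddress is enough (the sample IPs below are illustrative):

```python
import ipaddress

# The range whitelisted in the security group: the flannel/Pod network.
pod_cidr = ipaddress.ip_network("10.10.0.0/16")

def allowed(ip: str) -> bool:
    """True if the Pod-range security-group rule covers this IP."""
    return ipaddress.ip_address(ip) in pod_cidr

print(allowed("10.10.3.7"))     # True: cross-node Pod traffic passes
print(allowed("172.16.20.10"))  # False: a node IP needs its own rule
```

Note that the rule only covers Pod-to-Pod traffic; node-to-node traffic is governed by whatever rules already exist for the node subnet.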