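Overview
TDH 5.x ships with Docker 1.13. When Docker exits abnormally, a container may disappear while the processes that ran inside it stay alive. Restarting the service can then fail because the port is still occupied, and both the pod log and the service component log show the following error: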
Caused by:java.net.BindException:Address already in use
The service's container is stuck: it cannot be stopped or removed, and exec fails with:
[docker] rpc error: code = 2 desc = containerd: container not found
This case briefly describes how to troubleshoot this kind of problem and how to resolve the port conflict.
Environment for this case: TDH 5.2.2
Details
Procedure
- Check the logs to find the occupied port
- Find the process occupying the port and kill it
- Restart the service
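1. Check the logs to find the occupied port
A TDH cluster has pods in two namespaces. The examples below query pods in the default namespace. To query TOS-related pods (usually TOS Master), append -n kube-system to the kubectl commands below to specify the namespace that TOS belongs to.
1.1. Run kubectl get pods | grep {service_name} to get the pod status of the failing service. The following example gets the pod status of the hyperbase service: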
$ kubectl get pods -owide | grep hyperbase
NAME READY STATUS RESTARTS AGE IP NODE
hyperbase-master-hyperbase1-85769621-19nx6 1/1 Running 0 21d 172.22.22.1 tdh-01
hyperbase-master-hyperbase1-85769621-kfwk0 1/1 Running 0 21d 172.22.22.2 tdh-02
hyperbase-master-hyperbase1-85769621-xmvh1 1/1 Running 0 21d 172.22.22.3 tdh-03
hyperbase-regionserver-hyperbase1-1971244736-j2lv0 1/1 CrashLoopBackOff 0 21d 172.22.22.1 tdh-01
hyperbase-regionserver-hyperbase1-1971244736-rv84n 1/1 Running 0 21d 172.22.22.2 tdh-02
hyperbase-regionserver-hyperbase1-1971244736-tfxg4 1/1 Running 0 21d 172.22.22.3 tdh-03
hyperbase-thrift-hyperbase1-288079256-b74mw 1/1 Running 0 21d 172.22.22.1 tdh-01
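When a service has many pods, the failing one can be picked out mechanically instead of by eye. The following is a minimal sketch, not part of TDH: unhealthy_pods is a hypothetical helper name, and it assumes the column order NAME READY STATUS RESTARTS AGE IP NODE shown in the listing above.

```shell
# Print name, status and node of every pod whose STATUS column is not "Running".
# Reads `kubectl get pods -owide` output on stdin.
# Assumption: columns are NAME READY STATUS RESTARTS AGE IP NODE.
unhealthy_pods() {
  awk 'NF >= 7 && $1 != "NAME" && $3 != "Running" { print $1, $3, $7 }'
}

# Usage on a node with kubectl access:
#   kubectl get pods -owide | grep hyperbase | unhealthy_pods
```

The node name in the last column is the host on which the port has to be checked later.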
1.2. Run kubectl logs -f {pod_name} to view the log of the failing pod. The following example gets the log of the pod named hyperbase-regionserver-hyperbase1-1971244736-j2lv0:
$ kubectl logs -f hyperbase-regionserver-hyperbase1-1971244736-j2lv0
+ export TDH_SCRIPT_DIR=/usr/lib/transwarp/scripts
+ TDH_SCRIPT_DIR=/usr/lib/transwarp/scripts
+ source /usr/lib/transwarp/scripts/repeat_until_ready.sh
+ echo 'umount /etc/hosts'
+ umount /etc/hosts
+ '[' -z '' ']'
+ '[' == false ']'
/usr/bin/boot.sh: line 81: [: ==: unary operator expected
+ sudo -u hbase /usr/lib/transwarp/scripts/hbase.sh HYPERBASE_REGIONSERVER
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/hbase/lib/slf4j-log4j12-1.6.4.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/guardian-plugins/lib/slf4j-log4j12-1.7.7.jar!/org/slf4j/impl/StaticLoggerBinder.class]
java.net.BindException: Port in use: 0.0.0.0:60020
at org.apache.hbase.hadoop.http.HttpServer2.openlisteners(HttpServer2.java:954)
at org.apache.hbase.hadoop.http.HttpServer2.start(HttpServer2.java:894)
at org.apache.hbase.hadoop.util.run(Runner.java:70)
Caused by:java.net.BindException: Address already in use
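The occupied port can also be pulled out of the log automatically. The sketch below assumes the log contains lines of the form "Port in use: <address>:<port>", as in the output above; ports_in_use is a hypothetical helper name.

```shell
# Print the distinct ports named in "Port in use: <address>:<port>" log lines.
# Reads log text on stdin.
ports_in_use() {
  grep -oE 'Port in use: [0-9.]+:[0-9]+' | awk -F: '{ print $NF }' | sort -u
}

# Usage (without -f, so that the pipeline terminates):
#   kubectl logs --tail=200 {pod_name} | ports_in_use
# For a pod in CrashLoopBackOff, the log of the previous container run
# is often more useful:
#   kubectl logs --previous {pod_name} | ports_in_use
```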
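1.3. If the pod log contains no recent entries, restart the failing pod and view the latest log with the same commands:
Run kubectl delete pods {pod_name} to delete the failing pod; K8s automatically starts a new one.
Then run kubectl get pods | grep {service_name} to get the pod status of the service.
Finally, run kubectl logs -f {pod_name} to view the pod log. In the following example the pod hyperbase-regionserver-hyperbase1-1971244736-j2lv0 is deleted and the log of its replacement, hyperbase-regionserver-hyperbase1-1971244736-q1vdj, is viewed: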
$ kubectl delete pods hyperbase-regionserver-hyperbase1-1971244736-j2lv0
pod "hyperbase-regionserver-hyperbase1-1971244736-j2lv0" deleted
$ kubectl get pods -owide | grep hyperbase
NAME READY STATUS RESTARTS AGE IP NODE
hyperbase-master-hyperbase1-85769621-19nx6 1/1 Running 0 21d 172.22.22.1 tdh-01
hyperbase-master-hyperbase1-85769621-kfwk0 1/1 Running 0 21d 172.22.22.2 tdh-02
hyperbase-master-hyperbase1-85769621-xmvh1 1/1 Running 0 21d 172.22.22.3 tdh-03
hyperbase-regionserver-hyperbase1-1971244736-q1vdj 1/1 CrashLoopBackOff 0 21d 172.22.22.1 tdh-01
hyperbase-regionserver-hyperbase1-1971244736-rv84n 1/1 Running 0 21d 172.22.22.2 tdh-02
hyperbase-regionserver-hyperbase1-1971244736-tfxg4 1/1 Running 0 21d 172.22.22.3 tdh-03
hyperbase-thrift-hyperbase1-288079256-b74mw 1/1 Running 0 21d 172.22.22.1 tdh-01
$ kubectl logs -f hyperbase-regionserver-hyperbase1-1971244736-q1vdj
+ export TDH_SCRIPT_DIR=/usr/lib/transwarp/scripts
+ TDH_SCRIPT_DIR=/usr/lib/transwarp/scripts
+ source /usr/lib/transwarp/scripts/repeat_until_ready.sh
+ echo 'umount /etc/hosts'
+ umount /etc/hosts
+ '[' -z '' ']'
+ '[' == false ']'
/usr/bin/boot.sh: line 81: [: ==: unary operator expected
+ sudo -u hbase /usr/lib/transwarp/scripts/hbase.sh HYPERBASE_REGIONSERVER
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/hbase/lib/slf4j-log4j12-1.6.4.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/guardian-plugins/lib/slf4j-log4j12-1.7.7.jar!/org/slf4j/impl/StaticLoggerBinder.class]
java.net.BindException: Port in use: 0.0.0.0:60020
at org.apache.hbase.hadoop.http.HttpServer2.openlisteners(HttpServer2.java:954)
at org.apache.hbase.hadoop.http.HttpServer2.start(HttpServer2.java:894)
at org.apache.hbase.hadoop.util.run(Runner.java:70)
Caused by:java.net.BindException: Address already in use
at org.apache.hbase.hadoop.util.run.NetBind0(Native Method)
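1.4. If none of the above reveals a detailed error, check the log of the service itself. The log files are usually under /var/log/{service_name}/ on the node where the service runs.
The following example lists the log files of hyperbase1: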
$ ll /var/log/hyperbase1/
total 236156
-rwxr-xr-x 1 1004 1005 39412805 Nov 11 09:13 hbase-hbase-master-tdh-01.log
-rwxr-xr-x 1 1004 1005 63046304 Nov 14 12:24 hbase-hbase-regionserver-tdh-01.log
-rwxr-xr-x 1 1004 1005 67111609 Sep 5 17:29 hbase-hbase-regionserver-tdh-01.log.1
-rwxr-xr-x 1 1004 1005 86225 Nov 11 09:13 hbase-hbase-thriftserver-tdh-01.log
-rwxr-xr-x 1 1004 1005 53250829 Nov 14 12:22 SecurityAuth.audit
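These log files are tens of megabytes in size, so searching them is more practical than reading them. A minimal sketch, assuming the bind failure is logged with the same "BindException" / "Address already in use" wording as in the pod log; recent_bind_errors is a hypothetical helper name.

```shell
# Show the last few bind-related error lines of a log file, with line numbers.
#   $1  path to the log file
#   $2  how many matching lines to show (default 5)
recent_bind_errors() {
  local log_file=$1 count=${2:-5}
  grep -n -E 'BindException|Address already in use' "$log_file" | tail -n "$count"
}

# Usage on the node that runs the failing RegionServer:
#   recent_bind_errors /var/log/hyperbase1/hbase-hbase-regionserver-tdh-01.log
```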
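2. Find the process occupying the port and kill it
On the node where the service is failing, run netstat -anp | grep {port} | grep LISTEN to find the process listening on the port, then kill it. {port} is the port reported in the log line "java.net.BindException: Port in use: 0.0.0.0:60020". The following example finds the process occupying port 60020 and kills it.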
$ netstat -anp|grep 60020 |grep LISTEN
tcp 0 0 0.0.0.0:60020 0.0.0.0:* LISTEN 28223/java
$ kill -9 28223
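A plain grep on the port number also matches lines where the number appears in the remote address or in the PID column. The sketch below matches the local address and the LISTEN state exactly, and it tries SIGTERM before falling back to kill -9. Both function names are hypothetical, and the column layout is assumed to be the one netstat -anp prints above.

```shell
# Print the PID(s) of processes listening on the given TCP port.
# Reads `netstat -anp` output on stdin (run as root so the PID column is filled in).
listening_pids() {
  local port=$1
  awk -v port="$port" '
    $6 == "LISTEN" && $4 ~ (":" port "$") {
      split($7, owner, "/")                 # column 7 looks like "28223/java"
      if (owner[1] ~ /^[0-9]+$/) print owner[1]
    }' | sort -u
}

# Send SIGTERM, wait up to $2 seconds (default 10), then send SIGKILL
# only if the process is still alive.
stop_process() {
  local pid=$1 grace=${2:-10} waited=0
  kill "$pid" 2>/dev/null || { echo "cannot signal $pid" >&2; return 1; }
  while kill -0 "$pid" 2>/dev/null && [ "$waited" -lt "$grace" ]; do
    sleep 1
    waited=$((waited + 1))
  done
  if kill -0 "$pid" 2>/dev/null; then
    kill -9 "$pid"
  fi
}

# Usage:
#   netstat -anp | listening_pids 60020      # prints e.g. 28223
#   ps -fp 28223                             # confirm it is the leftover process
#   stop_process 28223
```

Checking the process with ps before killing it guards against stopping a healthy service that legitimately owns the port.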
3. Restart the service
Restart the service from the Manager page, or run kubectl delete pods {pod_name} on the command line to delete the pod; K8s automatically starts a new one.
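After the restart it is worth confirming that the replacement pod actually comes up. A minimal sketch, assuming the kubectl get pods column layout shown earlier (READY in column 2, STATUS in column 3); all_pods_ready is a hypothetical helper name.

```shell
# Succeed only if at least one pod is listed and every listed pod is
# "Running" with all of its containers ready (READY column n/n).
# Reads `kubectl get pods` output on stdin.
all_pods_ready() {
  awk 'NF >= 3 && $1 != "NAME" {
         seen = 1
         split($2, ready, "/")
         if ($3 != "Running" || ready[1] != ready[2]) bad = 1
       }
       END { exit !(seen && !bad) }'
}

# Usage: poll every 5 seconds until the hyperbase pods are healthy.
#   until kubectl get pods | grep hyperbase | all_pods_ready; do sleep 5; done
```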