ITPub Blog


Monitoring multiple Tomcat services on one web application server with Nagios check_http

Original post, category: Linux, author: mchdba, date: 2014-06-13 15:38:18

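Monitoring Tomcat with Nagios is both simple and complicated. It is simple if you only need to know whether a single Tomcat service on a web application server is running. It gets complicated if you also want metrics such as connection counts or JVM memory usage, for which I could not find a suitable monitoring script on Google. And if you need to monitor several Tomcat instances on one web server, many of which answer with redirects, there is a good deal more work to do.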

 

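The usual approach is a TCP check against the Tomcat port. Its weakness is that when Tomcat hangs, the TCP port still accepts connections even though the web applications deployed in Tomcat can no longer be reached. In that situation the Tomcat state has to be checked over HTTP.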

 

This post therefore records how to use HTTP checks to monitor multiple Tomcat application servers on a single web server.
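The difference between the two kinds of probe can be sketched in shell. This is only an illustration and not part of the original setup: `tomcat_state` is a hypothetical helper that receives the results of a TCP probe and an HTTP probe as plain arguments and prints the state a check would report.

```shell
# Hypothetical helper (not from the original post).
#   $1 = "open" or "closed": result of a TCP connect to the Tomcat port
#   $2 = HTTP status code returned for the test page, or "" if the request timed out
#   $3 = expected HTTP status code
tomcat_state() {
  if [ "$1" != "open" ]; then
    echo "CRITICAL: port closed"
  elif [ -z "$2" ]; then
    # The case a TCP-only check misses: the port answers, the application does not.
    echo "CRITICAL: port open but no HTTP response (hung Tomcat)"
  elif [ "$2" != "$3" ]; then
    echo "CRITICAL: unexpected HTTP status $2"
  else
    echo "OK: HTTP $2"
  fi
}

tomcat_state open "" 200
tomcat_state open 200 200
```

A TCP-only check sees just the first argument, so it would report the hung instance as healthy.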

 

1 Install the NRPE client on the Tomcat web server

The RPM packages can be downloaded from: http://download.csdn.net/detail/mchdba/7493875

1.1 Install the NRPE client from RPM packages


[root@localhost nagios]# ll
total 768
-rw-r--r-- 1 root root 713389 12-16 12:08 nagios-plugins-1.4.11-1.x86_64.rpm
-rw-r--r-- 1 root root 32706 12-16 12:09 nrpe-2.12-1.x86_64.rpm
-rw-r--r-- 1 root root 18997 12-16 12:08 nrpe-plugin-2.12-1.x86_64.rpm
[root@localhost nagios]# rpm -ivh *.rpm --nodeps --force
Preparing... ########################################### [100%]
   1:nagios-plugins ########################################### [ 33%]
id: nagios: No such user
   2:nrpe ########################################### [ 67%]
   3:nrpe-plugin ########################################### [100%]
[root@cache-1 ~]#


1.2 Append the check commands and the monitoring server's IP address to the end of the configuration file

[root@localhost nagios]# vim /etc/nagios/nrpe.cfg


# add by tim on 2014-06-11
command[check_users]=/usr/local/nagios/libexec/check_users -w 8 -c 15
command[check_load]=/usr/local/nagios/libexec/check_load -w 15,10,5 -c 30,25,20
command[check_sda1]=/usr/local/nagios/libexec/check_disk -w 20% -c 10% -p /dev/sda
command[check_zombie_procs]=/usr/local/nagios/libexec/check_procs -w 5 -c 10 -s Z
#command[check_total_procs]=/usr/local/nagios/libexec/check_procs -w 50 -c 80
command[check_total_procs]=/usr/local/nagios/libexec/check_procs -w 750 -c 800
command[check-host-alive]=/usr/local/nagios/libexec/check_ping -H 10.xx.xx.10 -w 3000.0,80% -c 5000.0,100% -p 5
allowed_hosts = 127.0.0.1,10.xx.xxx.xx1


Check that a command works:


[root@webserver nrpe-2.15]# /usr/local/nagios/libexec/check_users -w 8 -c 15
USERS OK - 2 users currently logged in |users=2;8;15;0
[root@webserver nrpe-2.15]#


The "USERS OK - ..." output shows that the command works.

 

1.3 Starting NRPE fails with the following error:


[root@webserver ~]# service nrpe restart
Shutting down nrpe: [FAILED]
Starting nrpe: /usr/sbin/nrpe: error while loading shared libraries: libssl.so.6: cannot open shared object file: No such file or directory
                                                           [FAILED]
[root@webserver ~]#
[root@db-m2-slave-1 nagios_client]# service nrpe start
Starting nrpe: /usr/sbin/nrpe: error while loading shared libraries: libssl.so.6: cannot open shared object file: No such file or directory
                                                           [FAILED]
[root@db-m2-slave-1 nagios_client]#


Create a symlink for the missing library:

[root@db-m2-slave-1 nagios_client]# ln -s /usr/lib64/libssl.so /usr/lib64/libssl.so.6

(If libssl.so does not exist, link another version such as libssl.so.10 instead: ln -s /usr/lib64/libssl.so.10 /usr/lib64/libssl.so.6)

Then start NRPE again:

[root@webserver nagios_client]# service nrpe start
Starting nrpe: /usr/sbin/nrpe: error while loading shared libraries: libcrypto.so.6: cannot open shared object file: No such file or directory
                                                           [FAILED]
[root@web-10 ~]# ll /usr/lib64/libcrypto.so
lrwxrwxrwx. 1 root root 18 Oct 13 2013 /usr/lib64/libcrypto.so -> libcrypto.so.1.0.0
[root@webserver nagios_client]#

This time libcrypto.so.6 is missing, so create a second symlink:

[root@webserver nagios_client]# ln -s /usr/lib64/libcrypto.so /usr/lib64/libcrypto.so.6

(If libcrypto.so does not exist, link libcrypto.so.10 instead: ln -s /usr/lib64/libcrypto.so.10 /usr/lib64/libcrypto.so.6)

[root@webserver nagios_client]# service nrpe start
Starting nrpe:                                             [  OK  ]
[root@webserver nagios_client]#
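Both symlink steps follow the same rule: use the unversioned library if it exists, otherwise fall back to the .so.10 one. A minimal dry-run sketch of that rule is shown below. `pick_lib` is a hypothetical helper, and the loop only prints the ln commands rather than running them. Keep in mind that pointing an old soname at a newer OpenSSL library is a workaround, since the two versions are not guaranteed to be binary compatible.

```shell
# Hypothetical helper: print the first existing candidate for a library.
#   $1 = library directory, $2 = library base name (libssl or libcrypto)
pick_lib() {
  for candidate in "$1/$2.so" "$1/$2.so.10"; do
    if [ -e "$candidate" ]; then
      echo "$candidate"
      return 0
    fi
  done
  return 1
}

# Dry run: show the symlink commands instead of executing them.
for name in libssl libcrypto; do
  if target=$(pick_lib /usr/lib64 "$name"); then
    echo "ln -s $target /usr/lib64/$name.so.6"
  else
    echo "no candidate found for $name"
  fi
done
```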

 

1.4 Check that NRPE is running properly:

Run the check from the Nagios server:

[root@cache-2 ~]# /usr/local/nagios/libexec/check_nrpe -H 10.xx.xx.10
NRPE v2.12
[root@cache-2 ~]#

The "NRPE v2.12" reply shows that the connection works.

 

1.5 Add a JSP test page to the web application

(1) Create the test file

vim ./webapps/nagios_test_0611/nagios_test_0611.jsp


<%@ page language="java" contentType="text/html; charset=gb2312"
pageEncoding="gb2312"%>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=gb2312">
<title>nagios test here</title>
</head>
<body>
 <center>Now time is: <%=new java.util.Date()%></center>
</body>
</html>


(2) Try the check_http command

[root@webserver ~]# /usr/local/nagios/libexec/check_http -I 10.xx.xx.10 -p 8300 -u /nagios_test_0611/nagios_test_0611.jsp -e 200
HTTP CRITICAL - Invalid HTTP response received from host on port 8300: HTTP/1.1 404 Not Found

Tomcat had to be restarted before the newly added JSP could be opened. Run the following stop and start commands:

/usr/local/app/apache-tomcat-6.0.37_8300/bin/shutdown.sh
/usr/local/app/apache-tomcat-6.0.37_8300/bin/startup.sh

[root@webserver ~]# /usr/local/nagios/libexec/check_http -I 10.xx.xx.10 -p 8300 -u /nagios_test_0611/nagios_test_0611.jsp -e 200
HTTP OK: Status line output matched "200" - 571 bytes in 0.882 second response time |time=0.882479s;;;0.000000 size=571B;;;0
[root@webserver ~]#

 

1.6 Review the NRPE check commands


[root@webserver nrpe-2.15]# cat /etc/nagios/nrpe.cfg |grep -v "^#"|grep -v "^$"
log_facility=daemon
pid_file=/var/run/nrpe.pid
server_port=5666
nrpe_user=nagios
nrpe_group=nagios

dont_blame_nrpe=0
debug=0
command_timeout=60
connection_timeout=300
command[check_users]=/usr/local/nagios/libexec/check_users -w 8 -c 15
command[check_load]=/usr/local/nagios/libexec/check_load -w 15,10,5 -c 30,25,20
command[check_sda1]=/usr/local/nagios/libexec/check_disk -w 20% -c 10% -p /dev/sda
command[check_zombie_procs]=/usr/local/nagios/libexec/check_procs -w 5 -c 10 -s Z
command[check_total_procs]=/usr/local/nagios/libexec/check_procs -w 750 -c 800
command[check-host-alive]=/usr/local/nagios/libexec/check_ping -H 10.xx.xx.10 -w 3000.0,80% -c 5000.0,100% -p 5
allowed_hosts=127.0.0.1,10.xx.xxx.xx1
[root@webserver nrpe-2.15]#


2 Add the host and service definitions on the Nagios server

2.1 Add the host to hosts.cfg


define host{
        use linux-server
        host_name webserver
        alias webserver
        address 10.xx.xx.10
        check_command check-host-alive
        max_check_attempts 5
        check_period 24x7
        contact_groups ops
        notification_interval 60
        notification_period 24x7
        notification_options d,u,r
        }


2.2 Add the service checks for the web server to service.cfg


# No.007 webserver
# service definition
define service{
        host_name webserver
        service_description check_load
        check_command check_nrpe!check_load
        max_check_attempts 5
        normal_check_interval 3
        retry_check_interval 2
        check_period 24x7
        notification_interval 10
        notification_period 24x7
        notification_options w,u,c,r
        contact_groups opsweb
        }

define service{
        host_name webserver
        service_description check-host-alive
        check_command check-host-alive
        max_check_attempts 5
        normal_check_interval 3
        retry_check_interval 2
        check_period 24x7
        notification_interval 10
        notification_period 24x7
        notification_options w,u,c,r
        contact_groups opsweb
        }

define service{
        host_name webserver
        service_description Check Disk sda1
        check_command check_nrpe!check_sda1
        max_check_attempts 5
        normal_check_interval 3
        retry_check_interval 2
        check_period 24x7
        notification_interval 10
        notification_period 24x7
        notification_options w,u,c,r
        contact_groups opsweb
        }

define service{
        host_name webserver
        service_description Total Processes
        check_command check_nrpe!check_total_procs
        max_check_attempts 5
        normal_check_interval 3
        retry_check_interval 2
        check_period 24x7
        notification_interval 10
        notification_period 24x7
        notification_options w,u,c,r
        contact_groups opsweb
        }

define service{
        host_name webserver
        service_description Current Users
        check_command check_nrpe!check_users
        max_check_attempts 5
        normal_check_interval 3
        retry_check_interval 2
        check_period 24x7
        notification_interval 10
        notification_period 24x7
        notification_options w,u,c,r
        contact_groups opsweb
        }

define service{
        host_name webserver
        service_description Check Zombie Procs
        check_command check_nrpe!check_zombie_procs
        max_check_attempts 5
        normal_check_interval 3
        retry_check_interval 2
        check_period 24x7
        notification_interval 10
        notification_period 24x7
        notification_options w,u,c,r
        contact_groups opsweb
        }

define service{
        host_name webserver
        service_description Check Tomcat 9300 Status
        check_command check_nrpe!check_tomcat_9300_status
        max_check_attempts 5
        normal_check_interval 3
        retry_check_interval 2
        check_period 24x7
        notification_interval 10
        notification_period 24x7
        notification_options w,u,c,r
        contact_groups opsweb
        }


2.3 Add the new opsweb contact group in contacts.cfg (vim contacts.cfg)


define contactgroup{
        contactgroup_name opsweb
        alias pl ops team
        members tim,mch,nagiosadmin
        }


 

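2.4 Add a new Tomcat check command, check_tomcat_9300_status

A check_tcp!8080 style port check is not used here because in practice, once Tomcat hangs, none of its JSP pages can be opened, yet the monitored port 8080 still looks healthy and no alert is raised. Instead, check_http requests a generic test page, /nagios_test_0611/nagios_test_0611.jsp, and the result of that request is what gets monitored:

vim commands.cfg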


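# add by tim on 20140611
# Nagios only expands its own macros, so the port, URL, expected status,
# warning and critical thresholds are passed as $ARG1$ .. $ARG5$.
define command{
        command_name check_tomcat_9300_status
        command_line $USER1$/check_http -I $HOSTADDRESS$ -p $ARG1$ -u $ARG2$ -e $ARG3$ -w $ARG4$ -c $ARG5$
        }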
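The services in section 2.2 call the check as check_nrpe!check_tomcat_9300_status, so what Nagios actually runs is the check_nrpe command with the text after the "!" as its first argument. The sketch below imitates that macro expansion with sed. It assumes the commonly used check_nrpe definition ($USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$), which the post does not show, plus a placeholder address 192.0.2.10; `expand_macros` is a hypothetical helper, not a Nagios tool.

```shell
# Hypothetical helper: substitute three Nagios macros in a command_line template.
#   $1 = template, $2 = value of $USER1$, $3 = value of $HOSTADDRESS$, $4 = value of $ARG1$
expand_macros() {
  printf '%s\n' "$1" | sed \
    -e 's|\$USER1\$|'"$2"'|g' \
    -e 's|\$HOSTADDRESS\$|'"$3"'|g' \
    -e 's|\$ARG1\$|'"$4"'|g'
}

# Assumed check_nrpe command_line (single quotes keep the macros literal).
template='$USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$'

expand_macros "$template" /usr/local/nagios/libexec 192.0.2.10 check_tomcat_9300_status
```

The resulting command name is then looked up among the command[...] entries in nrpe.cfg on the monitored host, which is where the real check_http call lives.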


The JSP file is the same test page created in section 1.5 (webapps/nagios_test_0611/nagios_test_0611.jsp).



2.5 Add the Tomcat port checks to nrpe.cfg on the monitored client:

command[check_tomcat_9300_status]=/usr/local/nagios/libexec/check_http -I 10.xx.xx.10 -p 9444 -u /nagios_test_0611/nagios_test_0611.jsp -e 200 -w 5 -c 10

command[check_tomcat_8300_status]=/usr/local/nagios/libexec/check_http -I 10.xx.xx.10 -p 8300 -u /nagios_test_0611/nagios_test_0611.jsp -e 200 -w 5 -c 10

 

2.6 Errors during testing

[root@cache-2 objects]# /usr/local/nagios/libexec/check_nrpe -H 10.xx.xx.10  -c check_load

NRPE: Unable to read output

[root@cache-2 objects]#


 

Run the check from a shell on the Nagios server:

/usr/local/nagios/libexec/check_nrpe -H 192.168.15.178 -c check_mysql_myisam_lock

[root@cache-2 etc]# /usr/local/nagios/libexec/check_nrpe -H 10.xx.xx.10  -c check_load

NRPE: Unable to read output

[root@cache-2 etc]#

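The same error appears, so the problem is probably on the monitored host.

 

In the end the cause was a wrong path in nrpe.cfg. A source install puts the plugins in /usr/local/nagios/libexec/ (for example /usr/local/nagios/libexec/check_http), while the RPM install puts them in /usr/lib/nagios/plugins/. This host uses the RPM install, so the commands in nrpe.cfg must use the RPM path /usr/lib/nagios/plugins/. After replacing the paths and running service nrpe restart, the problem was solved (the original post showed a screenshot here).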
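A wrong plugin path like this can be spotted before NRPE is restarted. The following sketch is an assumption-laden illustration rather than an NRPE feature: `check_plugin_paths` is a hypothetical helper that reads the command[...] lines of a config file and reports every entry whose plugin binary is missing or not executable. The demo writes a small temporary file instead of touching /etc/nagios/nrpe.cfg.

```shell
# Hypothetical helper: report command[...] entries whose plugin cannot be executed.
#   $1 = path to an nrpe.cfg style file
check_plugin_paths() {
  grep '^command\[' "$1" | while IFS= read -r line; do
    name=${line#command\[}   # drop the leading "command["
    name=${name%%]*}         # keep the text before the first "]"
    cmd=${line#*=}           # everything after the first "="
    bin=${cmd%% *}           # first word is the plugin path
    if [ ! -x "$bin" ]; then
      echo "MISSING $name -> $bin"
    fi
  done
}

# Demo on a temporary file: /bin/sh exists, the second path does not.
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
command[check_ok]=/bin/sh -c true
command[check_http_demo]=/nonexistent/libexec/check_http -I 192.0.2.10 -p 8300
EOF
check_plugin_paths "$cfg"
rm -f "$cfg"
```

On the real host the function would be pointed at /etc/nagios/nrpe.cfg; any line it prints is a candidate cause of "NRPE: Unable to read output".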

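3 Monitoring and alerting for multiple Tomcat ports

The Tomcat on port 9300 has already been added; now add the Tomcat on port 8300.


3.1 Add the command to nrpe.cfg on the client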

[root@webserver root]# vim /etc/nagios/nrpe.cfg

command[check_tomcat_8300_status]=/usr/lib/nagios/plugins/check_http -I 10.xx.xx.10 -p 8300 -u /xx_xx_xx/index.html -e 200 -w 5 -c 10
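With several Tomcat instances on one host, the nrpe.cfg lines differ only in port and URL, so they can be generated. A minimal sketch, assuming the RPM plugin path from section 2.6 and the same -e 200 -w 5 -c 10 thresholds for every instance; `gen_tomcat_checks`, the address 192.0.2.10 and the /app/index.html URL are placeholders introduced for illustration.

```shell
# Hypothetical generator for nrpe.cfg command lines.
#   $1 = host address, $2 = plugin directory, remaining args = PORT:URL pairs
gen_tomcat_checks() {
  host=$1
  dir=$2
  shift 2
  for spec in "$@"; do
    port=${spec%%:*}   # text before the first ":"
    url=${spec#*:}     # text after the first ":"
    echo "command[check_tomcat_${port}_status]=$dir/check_http -I $host -p $port -u $url -e 200 -w 5 -c 10"
  done
}

gen_tomcat_checks 192.0.2.10 /usr/lib/nagios/plugins \
  8300:/app/index.html \
  9300:/nagios_test_0611/nagios_test_0611.jsp
```

The output can be reviewed and appended to nrpe.cfg by hand; each generated command name still needs a matching service definition on the Nagios server.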

 



3.2 On the Nagios server

Add the command definition:


[root@cache-2 etc]# vim ./objects/commands.cfg
# $ARG1$ .. $ARG5$ = port, URL, expected status, warning and critical thresholds
define command{
        command_name check_tomcat_8300_status
        command_line $USER1$/check_http -I $HOSTADDRESS$ -p $ARG1$ -u $ARG2$ -e $ARG3$ -w $ARG4$ -c $ARG5$
        }


Add the service definition:


define service{
        host_name webserver
        service_description Tomcat_8300_Status
        check_command check_nrpe!check_tomcat_8300_status
        max_check_attempts 5
        normal_check_interval 3
        retry_check_interval 2
        check_period 24x7
        notification_interval 10
        notification_period 24x7
        notification_options w,u,c,r
        contact_groups opsweb
        }


3.3 Check from the Nagios server that the new command works

[root@cache-2 etc]# /usr/local/nagios/libexec/check_nrpe -H 10.xx.xx.10 -c check_tomcat_8300_status
HTTP OK HTTP/1.1 200 OK - 611 bytes in 0.003 seconds |time=0.003152s;5.000000;10.000000;0.000000 size=611B;;;0
[root@cache-2 etc]#

The command works.

 

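3.4 Reload Nagios and check the result

[root@cache-2 etc]# service nagios reload
Running configuration check...
Reloading nagios configuration...
done
[root@cache-2 etc]#

About three minutes after the reload, the new Tomcat on port 8300 is being monitored (the original post showed a screenshot here).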

 

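To verify the Tomcat monitoring, stop the Tomcat on port 9300 on the web server. Shortly afterwards an alert email arrives and the Nagios page shows a red alert (the original post showed a screenshot here).

This shows that the two Nagios service entries monitor two ports, 9300 and 8300.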

 

4 Fixing the -e 200 failure when adding a check for the new port 8200

[root@webserver OCC_MANAGER_Web]#  /usr/lib/nagios/plugins/check_http -I 10.xx.xx.10 -p 8200 -u /OCC_REPORT_Web/index.html -e 200 -w 5 -c 10

HTTP CRITICAL - Invalid HTTP response received from host on port 8200

[root@webserver OCC_MANAGER_Web]#

 

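4.1 Access the Tomcat service and index.html directly

http://10.xx.xx.10:8200/OCC_REPORT_Web/index.html can be reached, but it redirects to

http://www.xxxx.xx/OCC_SSO_Web/login.htm?redirect=http%3A%2F%2F10.xx.xx.10%3A8200%2FOCC_REPORT_Web%2Findex.html

This shows that the web application itself is fine; the request has simply been redirected to a page under another domain.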

 

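4.2 Detailed analysis with -v

At this point the Tomcat server is running normally and the web application responds normally. The error only means that check_http considers the response from port 8200 invalid: the command requests /OCC_REPORT_Web/index.html and uses -e 200 to decide whether the HTTP response is OK. So drop the alert thresholds (-w 5 -c 10) and the -e 200 string match, and look at what the check returns: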

[root@webserver OCC_MANAGER_Web]# /usr/lib/nagios/plugins/check_http -I 10.xx.xx.10 -p 8200 -u /OCC_REPORT_Web/index.html

HTTP OK - HTTP/1.1 302 Found - 0.003 second response time |time=0.003367s;;;0.000000 size=317B;;;0

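The response is HTTP/1.1 302 Found. The HTTP status code reference shows that this means a new URL was issued:

……

301  Moved Permanently  The requested document is located elsewhere. The new URL is given in the Location header, and the browser should follow it automatically.
302  Found  Similar to 301, but the new URL should be treated as a temporary replacement rather than a permanent one. Note that in HTTP 1.0 the corresponding status text is "Moved Temporarily".

……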

 

Finally, add the -v option to see the full exchange:

[root@webserver OCC_MANAGER_Web]# /usr/lib/nagios/plugins/check_http -H www.xxxx.com -I 10.xx.xx.10 -p 8200 -u /OCC_REPORT_Web/index.html -v

GET /OCC_REPORT_Web/index.html HTTP/1.0

User-Agent: check_http/v1861 (nagios-plugins 1.4.11)

Connection: close

Host: www.xxxx.com

 

http://10.xx.xx.10:8200/OCC_REPORT_Web/index.html is 323 characters

STATUS: HTTP/1.1 302 Found

**** HEADER ****

Server: Apache-Coyote/1.1

Set-Cookie: ploccSessionId=45CD9C9921A5B89C59FCB2E34FE52734; Path=/

Location: http://www.xxx.com/OCC_SSO_Web/login.htm?redirect=http%3A%2F%2Fwww.xxx.com%2FOCC_REPORT_Web%2Findex.html

Content-Length: 0

Date: Thu, 12 Jun 2014 02:52:45 GMT

Connection: close

**** CONTENT ****

HTTP OK - HTTP/1.1 302 Found - 0.003 second response time |time=0.003268s;;;0.000000 size=323B;;;0

 

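The page is redirected to the login page under the public domain while the Tomcat server itself is running normally, so 302 Found can also be taken as a sign that Tomcat is healthy. The architecture uses LVS load balancing, so checking the public domain that the redirect points to would not tell us whether this particular host's Tomcat is up, because on each request the public domain maps to only one of the Tomcat services. Since the goal here is to monitor the Tomcat on one specific web server, checking for the 302 status is a reasonable approach. Change the check command for port 8200 in the client's nrpe.cfg so that it expects the 302 status:

vim /etc/nagios/nrpe.cfg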

/usr/lib/nagios/plugins/check_http -I 10.xx.xx.10 -p 8200 -u /OCC_REPORT_Web/index.html  -e 302 -w 3 -c 10
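The effect of switching from -e 200 to -e 302 can be illustrated with a small sketch. It is a simplification under stated assumptions: the real check_http matches the expected string against the first line of the response, whereas the hypothetical `check_status` below compares only the numeric code and mimics the plugin exit codes 0 (OK) and 2 (CRITICAL).

```shell
# Hypothetical helper: compare the code in an HTTP status line with the expected code.
#   $1 = expected status code, $2 = status line such as "HTTP/1.1 302 Found"
check_status() {
  code=$(printf '%s\n' "$2" | awk '{print $2}')   # second field is the status code
  if [ "$code" = "$1" ]; then
    echo "HTTP OK - $2"
    return 0
  else
    echo "HTTP CRITICAL - expected $1, got: $2"
    return 2
  fi
}

# The redirecting Tomcat on port 8200 answers with 302:
check_status 200 "HTTP/1.1 302 Found" || echo "exit code $?"
check_status 302 "HTTP/1.1 302 Found"
```

Expecting 302 ties the check to the redirect behaviour of this application; if the application later serves the page directly with 200, the expected value has to change with it.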

 

Error record: NRPE: Unable to read output

[1402557345] SERVICE ALERT: webserver;Tomcat_6100_OCC_SSO_Service_Status;UNKNOWN;SOFT;3;NRPE: Unable to read output

 

Solution: usually the plugin path configured in nrpe.cfg is wrong.

 

Error record: CHECK_NRPE: Error - Could not complete SSL handshake.

[root@cache-2 etc]# /usr/local/nagios/libexec/check_http -I 10.xx.3.xx -p 8100 -u /tradeAdmin/index.html

HTTP OK: HTTP/1.1 302 Found - 319 bytes in 0.064 second response time |time=0.064033s;;;0.000000 size=319B;;;0

[root@cache-2 etc]#

[root@cache-2 etc]# /usr/local/nagios/libexec/check_nrpe -H 10.xx.3.xx -c check_load

CHECK_NRPE: Error - Could not complete SSL handshake.

[root@cache-2 etc]#

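Solution: the Nagios server's IP address had not been added to allowed_hosts in /etc/nagios/nrpe.cfg on the client:

vim /etc/nagios/nrpe.cfg

allowed_hosts=127.0.0.1,10.xx.xxx.xx1

Then restart NRPE with service nrpe restart and verify again from the Nagios server: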

[root@cache-2 etc]# /usr/local/nagios/libexec/check_nrpe -H 10.xx.3.xx -c check_load

OK - load average: 0.43, 0.17, 0.06|load1=0.430;15.000;30.000;0; load5=0.170;10.000;25.000;0; load15=0.060;5.000;20.000;0;

[root@cache-2 etc]#
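Whether a given server address is listed can be checked on the client before restarting NRPE. A minimal sketch, assuming allowed_hosts contains only plain IP addresses separated by commas; `host_allowed` is a hypothetical helper and the addresses are placeholders.

```shell
# Hypothetical helper: succeed if an IP appears in the allowed_hosts line.
#   $1 = path to an nrpe.cfg style file, $2 = IP address of the Nagios server
host_allowed() {
  grep '^allowed_hosts' "$1" | cut -d= -f2 | tr -d ' ' | tr ',' '\n' | grep -Fxq "$2"
}

# Demo on a temporary file instead of /etc/nagios/nrpe.cfg.
cfg=$(mktemp)
echo 'allowed_hosts = 127.0.0.1,192.0.2.20' > "$cfg"
for ip in 192.0.2.20 192.0.2.99; do
  if host_allowed "$cfg" "$ip"; then
    echo "$ip is allowed"
  else
    echo "$ip is NOT allowed"
  fi
done
rm -f "$cfg"
```

The exact line match (grep -Fx) avoids treating 192.0.2.2 as present just because 192.0.2.20 is listed.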

 

 

 

From "ITPUB Blog", link: http://blog.itpub.net/26230597/viewspace-1182345/. Please credit the source when reposting; otherwise legal action will be pursued.
