This post is inspired by a question on the Twisted mailing list by Michele:
"I'm currently working on a PoC with twisted, Python, to prove the technology as an alternative to more established enterprise choices (java app servers, etc..). the question is: if I have N number of processes running in a M number of machines, given that there are no network restriction, and that at least http and https are always available, how these services would be efficiently monitored?"
What I do is have the service listen for HTTP connections and return some monitoring data. The service returns a simple XML message containing some datapoints.
Then I use a custom Nagios plugin to request this URL, and Nagios-PNP to graph the datapoints the monitor produces. If it doesn't return anything, I know the service is down.
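The "no answer means down" rule can be sketched roughly as follows. This is not the actual plugin, just a minimal Python 3 illustration using only the standard library. The function name `fetch_monitor_page` and the `opener` parameter are hypothetical; `opener` exists only so the function can be tried without a real network connection.

```python
import urllib.request

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def fetch_monitor_page(url, timeout=10, opener=urllib.request.urlopen):
    """Return (exit_code, payload).

    An unreachable service or an empty response is treated as CRITICAL.
    `opener` is a hypothetical hook that defaults to urllib's urlopen.
    """
    try:
        response = opener(url, timeout=timeout)
        try:
            body = response.read()
        finally:
            response.close()
    except OSError as exc:
        # URLError, HTTPError, timeouts and refused connections
        # are all subclasses of OSError.
        return CRITICAL, "service unreachable: %s" % exc
    if not body.strip():
        return CRITICAL, "service returned an empty response"
    return OK, body
```

A real plugin would print the message and exit with the returned code, since Nagios reads the state from the process exit status.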
The plugin is set up by giving it a URL to monitor (say http://myservice:893/monitor). The plugin will then both monitor the application and graph it.
I have also added an extra field for "errors" that I use to report odd exceptions or other types of failures that are non-fatal but should be investigated. If the error count exceeds a set level, the errors may also go into the reporting.
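To make the threshold idea concrete, here is a rough Python 3 sketch of how a plugin could turn the XML into a Nagios status line with performance data. It is an illustration under assumptions, not the real plugin: the function name `evaluate_status` is made up, the `value`, `warning` and `critical` attributes follow the page shown in this post, and treating a value equal to the threshold as a breach is my own choice.

```python
import xml.etree.ElementTree as ET

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}


def evaluate_status(xml_text):
    """Turn the monitor XML into (exit_code, Nagios output line)."""
    root = ET.fromstring(xml_text)
    code = OK
    perfdata = []
    for element in root:
        value = element.get("value")
        if value is None:
            continue
        try:
            number = float(value)
        except ValueError:
            continue  # non-numeric datapoints are skipped in this sketch
        warn = element.get("warning", "")
        crit = element.get("critical", "")
        # Nagios perfdata format: label=value;warn;crit
        perfdata.append("%s=%s;%s;%s" % (element.tag, value, warn, crit))
        # Assumption: reaching a threshold counts as breaching it.
        if crit and number >= float(crit):
            code = max(code, CRITICAL)
        elif warn and number >= float(warn):
            code = max(code, WARNING)
    service = root.get("service", "service")
    return code, "%s %s | %s" % (service, LABELS[code], " ".join(perfdata))
```

Everything after the `|` in the output line is what Nagios hands to a grapher such as Nagios-PNP, so each numeric datapoint in the XML becomes one graphed series.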
The Twisted page
The page looks like this:
    from cStringIO import StringIO
    from nevow import loaders, rend, static, inevow, guard, url, tags
    from xml.etree import cElementTree as ET


    class Monitor(rend.Page):
        """
        Basic monitoring interface
        XML format:

        <status ...>
            <... min="0" max="300"/>
            <errors ...>
                <error>Exception</error>
                …
            </errors>
        </status>

        UOM (unit of measurement) is one of:
            no unit specified - assume a number (int or float) of things (eg, users, processes, load averages)
            s - seconds (also us, ms)
            % - percentage
            B - bytes (also KB, MB, TB)
            c - a continuous counter (such as bytes transmitted on an interface)

        """
        def __init__(self, config):
            # Serve this resource directly; do not look up child pages.
            self.isLeaf = True
            self.config = config

        def renderHTTP(self, ctx):
            inevow.IRequest(ctx).setHeader('Content-Type', 'text/xml; charset=UTF-8')

            #que_length = str(len(getter.get_ids(0, -1)))
            #num_updated = str(get_nr_updated_last_day(self.config.get_db()))
            num_errors = str(len(self.config.getErrors()))

            # Root element; every child element is one datapoint.
            _root = ET.Element('status', {'service': 'PDFIndexer'})
            if 'total' in self.config.stats:
                _doc = ET.SubElement(_root, 'total', {'value': str(self.config.stats['total'])})

            _doc = ET.SubElement(_root, 'runnerStatus', {'value': str(self.config.checksStatus())})
            #doc = ET.SubElement(root, 'itemsAddedlast24', {'value': num_updated})
            # Error count plus the thresholds the Nagios plugin should apply.
            _errors = ET.SubElement(_root, 'errors', {'value': num_errors, 'critical': "3", 'warning': "2"})

            for error in self.config.getErrors():
                # Note: the text= keyword sets an XML attribute named "text",
                # not the element's text content.
                t = ET.SubElement(_errors, 'error', text=error)

            # Serialize the tree to a UTF-8 string and return it as the response body.
            _xmlcontainer = StringIO()
            ET.ElementTree(_root).write(_xmlcontainer, encoding="UTF-8")
            return _xmlcontainer.getvalue()
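For anyone who wants to try the XML format without Twisted or Nevow, the document-building part can be sketched on its own in Python 3 with just the standard library. The function name `build_status_xml` and its parameters are hypothetical, and unlike the page above this sketch stores each error message as element text, which is an assumption about the intended format.

```python
import xml.etree.ElementTree as ET


def build_status_xml(service, stats, errors, warning=2, critical=3):
    """Build the monitoring document and return it as UTF-8 encoded bytes.

    `stats` maps datapoint names to values; `errors` is a list of messages.
    """
    root = ET.Element("status", {"service": service})
    for name, value in stats.items():
        # One child element per datapoint, value stored as an attribute.
        ET.SubElement(root, name, {"value": str(value)})
    errors_el = ET.SubElement(root, "errors", {
        "value": str(len(errors)),
        "warning": str(warning),
        "critical": str(critical),
    })
    for error in errors:
        # Assumption: the message goes into the element text.
        ET.SubElement(errors_el, "error").text = str(error)
    return ET.tostring(root, encoding="UTF-8")
```

Calling `build_status_xml("PDFIndexer", {"total": 42}, ["some exception"])` gives a document with the same overall shape as the one the page returns, which makes it easy to test a plugin against canned data.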
The Nagios plugin is found here and should be fairly self-explanatory.
Note: I haven't managed to get all the bugs out of it yet with regard to graphing different datapoints. Also, it is probably not the most efficient Nagios plugin around.