Scalable SNMP Polling

One of the major challenges when building a modern network monitoring system is the need to constantly collect vast amounts of SNMP data for statistical and alerting purposes. A scalable SNMP poller is one of the key requirements for a monitoring system to be successful.

SNMP is a fairly simple protocol: the monitoring application sends a request to a remote device for a specific piece of data (e.g. interface transmit octets) and waits for a response. Network latency and CPU load on the remote device mean the response may take some time to arrive, so getting through the poller workload in a timely fashion can be a massive challenge in large network infrastructures.

Monolithic Poller Architecture

A monolithic poller typically sends its requests synchronously. That is, it sends a request to device A, waits for a response, then sends a request to device B. The major flaw with this architecture is that network latency significantly restricts the number of requests the poller can perform each minute.

For example, a network latency of 100 milliseconds means the poller would only be able to perform 600 requests per minute. At 30 MIB objects per request, that is only 18,000 objects, or data for only 1,800 interfaces (assuming a minimum of 10 MIB objects per interface).
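
As a rough check of the figures above, the following Python sketch works through the arithmetic for a strictly synchronous poller. The function name and its parameters are illustrative assumptions; it assumes exactly one request is outstanding at a time and that each round trip costs the full network latency.

```python
def sync_poller_capacity(latency_ms, objects_per_request, objects_per_interface):
    """Upper bound for a poller that waits for each reply before sending the next request.

    All names here are hypothetical and only mirror the worked example in the text.
    """
    requests_per_minute = 60_000 // latency_ms          # one round trip at a time
    objects_per_minute = requests_per_minute * objects_per_request
    interfaces = objects_per_minute // objects_per_interface
    return requests_per_minute, objects_per_minute, interfaces


requests, objects, interfaces = sync_poller_capacity(
    latency_ms=100, objects_per_request=30, objects_per_interface=10
)
print(requests, objects, interfaces)
```

Halving the latency doubles every figure, which is why latency, not CPU, is the limiting factor for this design.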

Multi-Process Poller Architecture

A multi-process poller architecture is a crude method of increasing scalability. This is typically done by dividing the workload into N lumps, where N is the number of poller processes fired off. Ten poller processes would typically increase the scale of the above example to 180,000 objects, or 18,000 interfaces.

  • Linear increase in scale
  • Many more processes to manage
  • Complex poller configuration
  • Increase in system memory usage
  • Each poller process is still affected by network latency
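
A minimal sketch of dividing a device list into N lumps, one per poller process. The round-robin split is an assumption made for illustration (a real deployment might split by site, device type or expected load), and split_workload and the device names are hypothetical.

```python
def split_workload(devices, n_processes):
    """Deal devices round-robin into n_processes lumps of near-equal size."""
    return [devices[i::n_processes] for i in range(n_processes)]


devices = [f"device-{i}" for i in range(25)]   # hypothetical device names
lumps = split_workload(devices, 10)

# Each process is still a synchronous poller limited by latency, so total
# capacity is simply N times the single-process figure from the example above.
objects_per_process = 18_000
total_objects = len(lumps) * objects_per_process
print([len(lump) for lump in lumps], total_objects)
```

The scale gain is linear in N, and so is the number of processes and configuration lumps that have to be managed.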

Multi-Threaded Poller Architecture

A multi-threaded poller is another crude method of increasing scalability. This is done by firing off many threads, each thread handling a single request / response transaction.

Threads consume less memory than processes because they share the same virtual memory address space, file descriptors and process state information. They are also faster to start up and to context switch. See the Wikipedia article on threads vs processes.

  • Linear increase in scale
  • Consumes less memory than a multi-process poller
  • Complex poller configuration
  • Added complexity of managing many threads
  • Threads have to perform locking so they don't clobber each other
  • Each poller thread is still affected by network latency
  • Significantly more complicated to implement, test and debug
  • One rogue thread can crash all other threads because they all share the same memory and process state information
  • Has no scale advantage over a multi-process poller
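
The thread-per-transaction model can be sketched as follows. This is an illustration only: time.sleep stands in for waiting on an SNMP response, and poll_all, the device names and the 10 millisecond latency are assumptions. Note the lock around the shared results, and that every worker thread still spends its time blocked on latency.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def poll_all(devices, n_threads, latency_s=0.01):
    """Poll each device in its own worker thread (simulated, no real SNMP)."""
    results = {}
    lock = threading.Lock()          # threads must not clobber shared state

    def poll_device(name):
        time.sleep(latency_s)        # stand-in for the request/response wait
        with lock:
            results[name] = "ok"

    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        list(pool.map(poll_device, devices))   # wait for every transaction
    return results


results = poll_all([f"device-{i}" for i in range(20)], n_threads=5)
print(len(results))
```

With 5 threads and 20 devices, each thread handles about four transactions back to back, so throughput grows only linearly with the thread count.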

Distributed Poller Architecture

A distributed poller architecture is another way of increasing scale. Deploying multiple pollers closer to the end devices significantly reduces the underlying network latency, so each poller is able to perform many more requests each minute.

  • Faster to get through its polling workload due to lower network latency
  • Significantly more expensive to deploy and maintain
  • Requires deploying many remote physical or virtual servers

In Summary ...

None of the above architectures comes anywhere close to providing a cost-effective, scalable solution.
  • Single monolithic synchronous pollers do not scale.
  • Multi-threaded pollers are grossly inefficient due to all the thread management overheads.
  • Distributed pollers are plainly too expensive to deploy and maintain.

The AKIPS Poller Architecture

  • Monolithic architecture - one process, no threads
  • Asynchronous polling algorithm - polls thousands of devices at the same time.
  • Synchronous device requests algorithm - N outstanding requests to each device at any one time (defaults to 1). This significantly reduces the chances of overrunning the CPU in the remote devices.
  • In-flight window tuning on a per device basis. This feature completely negates the issue of polling large remote devices with high network latency.
  • Direct database insertion - achieved by tightly coupling the configuration, time-series and event databases into a single process.
  • MIB database integrated directly into the SNMP engine for high speed packet encoding and decoding.
  • 60 second polling interval for every MIB object
  • Scales to over 10 million MIB objects per minute
  • 40 second polling window each minute
  • Integrated Ping Poller (RTT measured in microseconds)
  • 15 second Ping interval for every device
  • Interleaving of Ping and SNMP requests for good packet mix
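
The combination of asynchronous polling across devices and a bounded number of outstanding requests per device can be illustrated with Python's asyncio. This is a generic sketch and not the AKIPS implementation: fake_snmp_get simulates a round trip with asyncio.sleep, and the function names, device names, window sizes and latency are all assumptions.

```python
import asyncio


async def fake_snmp_get(device, request_id, latency_s):
    """Hypothetical stand-in for one SNMP request/response round trip."""
    await asyncio.sleep(latency_s)
    return (device, request_id)


async def poll_device(device, n_requests, window, latency_s, stats):
    """Send n_requests to one device, never more than `window` in flight."""
    gate = asyncio.Semaphore(window)
    in_flight = 0

    async def one_request(request_id):
        nonlocal in_flight
        async with gate:
            in_flight += 1
            # Record the highest number of simultaneous requests seen.
            stats[device] = max(stats.get(device, 0), in_flight)
            try:
                return await fake_snmp_get(device, request_id, latency_s)
            finally:
                in_flight -= 1

    return await asyncio.gather(*(one_request(i) for i in range(n_requests)))


async def poll_all(windows, n_requests=5, latency_s=0.01):
    """Poll every device concurrently; `windows` maps device -> in-flight window."""
    stats = {}
    replies = await asyncio.gather(
        *(poll_device(d, n_requests, w, latency_s, stats) for d, w in windows.items())
    )
    return replies, stats


# One process, no threads: all devices are polled at the same time, while each
# device sees at most its own window of outstanding requests (default 1).
windows = {"edge-router": 1, "core-switch": 1, "remote-chassis": 4}
replies, stats = asyncio.run(poll_all(windows))
print(stats)
```

Raising the window for a single high-latency device, as with remote-chassis here, lets its requests overlap without increasing the load placed on any other device.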

Example Deployment

Test System:
  • ASUS P8H67-M PRO motherboard
  • Intel® Core™ i5-2500K CPU @ 3.30GHz (4 cores, no hyperthreading)
  • 16 gigabytes of 1333 MHz DDR3 RAM (Stream benchmark of 16.4 gigabytes/sec)
  • 3 x 1 terabyte Western Digital WD10EZRX disks (set up in a single ZFS pool)
  • Operating system - FreeBSD ® 10.2 RELEASE
  • Total cost - less than $1000

Note: This is a four-year-old server from our test lab. The system uses a stock standard FreeBSD 10.2 installation. No special modifications were made apart from raising a few sysctl values.

Network Topology Monitored:
  • Total of 7540 devices
  • 2998 devices using SNMPv2
  • 4542 devices using SNMPv3 (SHA Authentication, AES 256 encryption)
  • Total of 407,160 interfaces (13 objects per interface)
  • 5,360,940 polled MIB objects per minute (average of 134,000 per second over the 40 second polling window), which includes things like CPU, Memory, Temperature, PSU and Fan States, etc.
  • 30,160 pings per minute (502 per sec)

Poller CPU Usage (1 second samples)

The poller has a 40 second SNMP polling window, starting at second 5 and ending at second 45. This leaves the system mostly idle at the start and end of each minute for other processes to do work with little CPU and I/O contention.
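
The deployment figures can be cross-checked with a few lines of arithmetic. The variable names are illustrative; every input number is taken from the topology list and the polling window described above, and the derived values are simple consequences of them.

```python
devices_v2, devices_v3 = 2_998, 4_542
devices = devices_v2 + devices_v3                      # 7,540 devices in total

interfaces, objects_per_interface = 407_160, 13
interface_objects = interfaces * objects_per_interface  # 5,293,080

total_objects = 5_360_940                               # per minute, from the list above
other_objects = total_objects - interface_objects       # CPU, memory, temperature, etc.

window_s = 45 - 5                                       # polling runs from second 5 to 45
objects_per_second = total_objects / window_s           # averaged over the window

pings_per_minute = devices * (60 // 15)                 # one ping every 15 seconds
pings_per_second = pings_per_minute / 60

print(devices, interface_objects, other_objects)
print(round(objects_per_second), pings_per_minute, round(pings_per_second, 1))
```

The per-second object rate only matches the quoted 134,000 when averaged over the 40 second window rather than the full minute.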

The Poller CPU graph shows there is still a lot of available headroom on this system, enough to at least double the number of polled MIB objects.

AES encryption to roughly 4500 SNMPv3 devices has very little impact because the encryption is offloaded to hardware on the Core i5. The poller constantly tracks the Engine ID, BootTime and BootCount for each SNMPv3 device, so it rarely needs to perform engine discovery.

Poller Context Switches (1 second samples)

Involuntary context switching can significantly impact performance, which is one of the major drawbacks of multi-threaded pollers. Since the AKIPS Ping/SNMP poller is a single monolithic process, it does not experience high levels of involuntary context switching when there are ample idle CPU cores available.

Some Observations ...

  • A single AKIPS poller process easily scales to 10 million MIB objects per minute using commodity grade hardware. There will always be an upper limit but we have the luxury of firing off a second and third poller process.
  • A well engineered asynchronous poller absolutely blows away a poller running hundreds of processes or threads. It's like turning up to a drag race: one car with a well tuned supercharged V8, the other with hundreds of weed whacker motors, each with its own fuel tank, spark plug and starter cord.
  • How well does AKIPS scale against other commercial or open source products? That's very difficult to determine because every vendor uses their own "gobbledygook" terminology, so you are not comparing apples with apples. AKIPS collects 13 MIB objects for every interface every minute, while most other vendors collect a smaller subset at five minute intervals.
  • With network security such an important issue these days, we highly recommend moving entirely to authenticated and encrypted SNMPv3. There is very little measurable impact on data collection.
  • AKIPS runs solely on FreeBSD®, whilst most other vendors use one of the many Linux® distributions.
  • The AKIPS SNMP implementation has been engineered from scratch, whilst most other commercial and open source software is based on legacy implementations such as Net-SNMP.