Virtual Machine and Hardware Deployment

1. Platform Recommendation

Specifying a VM or bare metal platform is difficult because every network is different (e.g. number of users, devices, polled MIB objects, syslog/trap/netflow rates). AKIPS recommends starting with a VM installation to establish the baseline resources required to monitor your infrastructure, then increasing the CPU/RAM/storage resources as needed.

As a general rule, we recommend:

a. VM Deployment
  • Commercial grade VM (e.g. VMware)
  • Dedicated CPU cores
  • Ample RAM (50% free)
  • SAN with thick provisioned preallocated storage

OR

b. Physical / Bare Metal
  • Off the shelf server (e.g. Cisco, Dell, HP, IBM)
  • Ample RAM (50% free)
  • RAID 1 or 10

NOTE: Before purchasing physical hardware, contact AKIPS support with your intended vendor/model/spec so we can confirm the operating system has the appropriate disk and ethernet controller driver support.

AKIPS is known to work on the following virtual machine platforms:

  • VMware
  • VirtualBox
  • Hyper-V
  • KVM

2. Minimum Recommended Platform

Network Size           Interfaces   Flows / sec   Platform          CPU Cores   RAM            Disk space
Smallish               5000         10,000        Virtual Machine   1           4 GB           100 GB
Medium                 50,000       100,000       Virtual Machine   2+          8 GB           200 GB
Large                  100,000      100,000       Virtual Machine   4+          16 GB          500 GB
Huge                   500,000      500,000       Virtual Machine   8+          32 GB          1 TB
Bigger than the rest   1 million+   500,000       Virtual Machine   8+          Lots (64 GB)   Several terabytes

For networks bigger than the rest: contact AKIPS. Probably works OK in a virtual machine. May require dedicated hardware.
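
As a rough illustration of how the table above might be used, the following Python sketch picks the smallest tier that covers a given interface count and flow rate. The function name, the data layout, and the choice to treat the table figures as upper bounds are assumptions made for this example; they are not part of AKIPS.

```python
# Tiers transcribed from the table above. Treating each row's figures as
# upper bounds is an assumption made for this sketch.
TIERS = [
    # (name, max_interfaces, max_flows_per_sec, cpu_cores, ram_gb, disk_gb)
    ("Smallish", 5_000, 10_000, 1, 4, 100),
    ("Medium", 50_000, 100_000, 2, 8, 200),
    ("Large", 100_000, 100_000, 4, 16, 500),
    ("Huge", 500_000, 500_000, 8, 32, 1000),  # 1 TB written as 1000 GB
]


def minimum_platform(interfaces, flows_per_sec):
    """Return the first tier covering both inputs, or None if beyond "Huge"."""
    for name, max_if, max_flows, cores, ram_gb, disk_gb in TIERS:
        if interfaces <= max_if and flows_per_sec <= max_flows:
            return {
                "tier": name,
                "cpu_cores": cores,
                "ram_gb": ram_gb,
                "disk_gb": disk_gb,
            }
    return None  # larger than "Huge": contact AKIPS support


if __name__ == "__main__":
    print(minimum_platform(20_000, 50_000))
```

A result of None corresponds to the "Bigger than the rest" row, where the advice is to contact AKIPS rather than rely on a formula.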

3. System Resources: Ping and SNMP Polling

AKIPS Ping/SNMP polling scales very well. The poller consumes less than 50% of a core when monitoring in excess of 15 million MIB objects per minute on commodity hardware, which equates to over 1 million interfaces. Refer to the Scalable SNMP Polling blog for details.
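
The figures above imply a ratio of roughly 15 polled MIB objects per interface. That ratio is derived here from the two quoted numbers and is not a value stated by AKIPS. A small Python sketch of the arithmetic, with hypothetical function names:

```python
def objects_per_interface(objects_per_minute, interfaces):
    """Average number of MIB objects polled per interface each minute."""
    return objects_per_minute / interfaces


def objects_per_second(objects_per_minute):
    """Average polling rate if the load were spread evenly over a minute."""
    return objects_per_minute / 60


if __name__ == "__main__":
    print(objects_per_interface(15_000_000, 1_000_000))  # 15.0
    print(objects_per_second(15_000_000))                # 250000.0
```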

4. System Resources: Syslog and SNMP Trap

Sustained high volumes (e.g. 200+ per second) of Syslog messages and SNMP Traps may require the resources of an additional CPU core. Syslog/Trap messages are stored in compressed 10 megabyte chunks. The higher the volume of Syslog/Traps, the more often the data has to be compressed.
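
To get a feel for how message volume drives compression frequency, the sketch below estimates how long it takes to fill one chunk. It assumes that a chunk is compressed once 10 megabytes (taken here as 10,000,000 bytes) of raw messages have arrived, and it uses a hypothetical average message size; neither assumption comes from AKIPS.

```python
CHUNK_BYTES = 10_000_000  # assumed size of one uncompressed chunk


def seconds_per_chunk(messages_per_sec, avg_message_bytes):
    """Estimate the seconds needed to accumulate one chunk of messages."""
    bytes_per_sec = messages_per_sec * avg_message_bytes
    return CHUNK_BYTES / bytes_per_sec


if __name__ == "__main__":
    # 200 messages/sec at an assumed 200 bytes each
    print(seconds_per_chunk(200, 200))  # 250.0
```

Under these assumptions, doubling the message rate halves the time between compression runs.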

5. System Resources: Netflow

The AKIPS flow collector and meters were engineered in the expectation of a large number of flow records (e.g. 1 million flows per second) from a small number of flow exporters (e.g. 50 to 100). The software performs as expected in that environment when ample CPU cores and RAM are available.

What was unexpected was customers wanting to send flows from 1000s of flow exporters. A flow meter process is started for each flow exporter, which means 1000s of concurrent meter processes. This issue is being investigated and will be rectified by allowing a meter process to handle data from many flow exporters, significantly reducing the number of running processes.
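
The effect of the planned change on process count can be sketched as follows. The number of exporters a single meter process would handle is a hypothetical parameter chosen for illustration; AKIPS has not published a figure.

```python
import math


def meter_processes(exporters, exporters_per_meter=1):
    """Number of meter processes needed for a given number of exporters.

    exporters_per_meter=1 models the current behaviour described above
    (one meter process per flow exporter).
    """
    return math.ceil(exporters / exporters_per_meter)


if __name__ == "__main__":
    print(meter_processes(3000))       # 3000
    print(meter_processes(3000, 100))  # 30
```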

6. Increasing the specs of a VM

The procedure for increasing CPU/RAM/storage sizes in a VM is straightforward:

  1. Shut down the VM using the Admin -> System -> System Shutdown menu.
  2. Wait for the VM to completely shut down.
  3. Increase the number of CPU cores, memory size or disk space.
  4. Start up the VM.

The AKIPS startup script will automatically detect the expanded disk space and run the appropriate partition and file system commands.

7. System Performance Graphs

AKIPS provides internal system and application performance information under the Admin -> Performance menus. The important things to note are:
  • System Graphs:
    • Memory usage should be fairly static over a day. Assign enough memory so the memory usage graph generally shows less than 50% usage. Lots of free memory is always useful because the operating system will consume as much memory as you give it (e.g. for disk caching).
    • CPU load, System Calls, Context Switches, Interrupts and Disk I/O will spike every 80 minutes when the background data processing occurs. This is normal.
  • Poller Graphs:
    • Ping rate should be constant.
    • SNMP requests should start at second 5 and complete by second 45 (i.e. a 40 second polling window each minute).
    • Poller memory should be constant.
    • Poller CPU usage should be under 50%. In most cases it will be below 10%.
    • Poller Context Switches should be mostly Voluntary. If there are a lot of Involuntary context switches, then additional CPU cores may be required. High levels of involuntary context switches are a sign of process CPU contention.
  • Database Graphs:
    • Compression Runtime should be less than 20 minutes. The database compression works on 30 day data blocks. At the end of 30 days, a new block will be created and the compression runtime will drop. The compression runtime is usually CPU limited. Adding CPU cores to the system should have a linear decrease in the compression runtime, unless the limiting factor is the backend storage speed.
    • Rotation Runtime should be less than Compression Runtime. The limiting factor will be the storage speed. A database file is rotated when it becomes more than 1% fragmented.
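
The guidelines above can be summarised as a simple checklist. The Python sketch below applies the stated thresholds to values read manually from the graphs. The dictionary keys are hypothetical names invented for this example; AKIPS does not necessarily expose its metrics in this form.

```python
def check_performance(m):
    """Return a list of warnings for values outside the guidelines above."""
    warnings = []
    if m["memory_used_pct"] >= 50:
        warnings.append("memory usage is 50% or more: assign more RAM")
    if m["poller_cpu_pct"] >= 50:
        warnings.append("poller CPU usage is 50% or more")
    if m["snmp_end_sec"] > 45:
        warnings.append("SNMP requests finish after second 45")
    if m["compression_runtime_min"] >= 20:
        warnings.append("compression runtime is 20 minutes or more")
    if m["rotation_runtime_min"] >= m["compression_runtime_min"]:
        warnings.append("rotation runtime is not below compression runtime")
    return warnings


if __name__ == "__main__":
    sample = {
        "memory_used_pct": 35,
        "poller_cpu_pct": 8,
        "snmp_end_sec": 41,
        "compression_runtime_min": 12,
        "rotation_runtime_min": 4,
    }
    print(check_performance(sample))  # []
```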

8. CPU

General Notes:
  • The number of required CPU cores depends entirely on the size of your configuration (i.e. number of monitored devices, MIB objects, syslog/trap rate, Netflow exporters and flows/sec).
  • Hyperthreading on modern Xeon and Core i3/5/7 CPUs works fine. Leave it turned on.
  • In a VM environment, always assign dedicated CPU cores. Do NOT over-provision CPU cores. Over-provisioning CPU cores will lead to significant pauses in real-time data polling and processing, and large jumps in time.

Number of CPU cores:

You want enough free CPU cores to handle user requests without any noticeable delays. The Ping/SNMP poller is an extremely efficient single monolithic process, so it will only ever consume a portion of a single CPU core, even when monitoring 1 million+ interfaces. The poller context switch performance graphs (Admin -> Performance -> Poller) are a very good indicator of whether there are enough CPU cores. For smooth operation, you want mostly voluntary context switches, not involuntary context switches.
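
As an illustration of the voluntary versus involuntary distinction, the sketch below computes the involuntary share from two counters. The 0.5 threshold is an assumption standing in for "mostly voluntary", not an AKIPS figure. The helper own_context_switches uses Python's standard resource module (Unix only) and reports the counters of the script itself, not of the AKIPS poller.

```python
def involuntary_share(voluntary, involuntary):
    """Fraction of context switches that were involuntary."""
    total = voluntary + involuntary
    if total == 0:
        return 0.0
    return involuntary / total


def may_need_more_cores(voluntary, involuntary, threshold=0.5):
    """True when involuntary switches exceed the assumed threshold."""
    return involuntary_share(voluntary, involuntary) > threshold


def own_context_switches():
    """Voluntary and involuntary switch counts of this Python process."""
    import resource  # standard library, available on Unix-like systems only

    usage = resource.getrusage(resource.RUSAGE_SELF)
    return usage.ru_nvcsw, usage.ru_nivcsw


if __name__ == "__main__":
    print(involuntary_share(900, 100))    # 0.1
    print(may_need_more_cores(900, 100))  # False
```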

CPU clock speed:

Comparing raw CPU core clock speeds is fairly meaningless due to differences in core architectures (e.g. number of on-die cores, L1/2/3 cache sizes and speeds). AKIPS performs various CPU speed tests for gzip/md5/sha, which can be viewed in the Admin -> System -> System Info menu. The following are some examples:

                                   Seconds
CPU Model                     GZIP   MD5   SHA
Xeon E5-2683 v3 2.00GHz        1.7   2.9   3.6
Xeon E5-2660 2.20GHz           1.9   4.0   3.7
Xeon E5-2670 v3 2.30GHz        1.4   2.8   2.9
Xeon E5-2630L v2 2.40GHz       1.5   3.0   3.3
Xeon E5-2670 2.60GHz           1.2   2.7   3.3
Xeon E5-4650 2.70GHz           1.3   2.8   3.4
Xeon X5660 2.80GHz             2.6   3.7   4.8
Core i5-2500K 3.30GHz          1.1   2.4   2.9
Core i7-5820K 3.30GHz          1.1   2.2   2.3

9. Memory

Memory speed is fairly critical for performance. The Admin -> System -> System Info menu will display the memory speed of your system. A value of 8 Gigabytes/sec or greater is recommended. Older/legacy systems appear to have poor memory speed (e.g. 5 Gigabytes/sec or less).

Over provisioning memory in a VMware VM works fine because AKIPS loads the necessary kernel module that performs memory ballooning. Memory ballooning allows the guest VM to gracefully hand back unused free memory to the host machine.

10. Storage

Storage Size

UNIX file systems require plenty of spare space so they can write files out sequentially. In a VM this is less of an issue because increasing the storage size is trivial. When deploying on physical hardware, it's best to install enough disk space up front for the entire life cycle of the box (e.g. several terabytes). Disks are cheap. Contact AKIPS support if unsure of disk space requirements.

Sequential Read / Write Performance

Databases typically access storage in a random order, but AKIPS databases are arranged so that the majority of read/write I/O is performed sequentially. Large databases are repacked if they become more than 1% fragmented. Good sequential I/O performance is important in large installations.

Spindles vs SSD

A modern 2 TB SATA disk typically achieves sequential transfer rates of over 200 MB/sec, whereas an SSD typically achieves ~400 MB/sec read but somewhat slower write performance, because an SSD uses a copy-on-write mechanism in which every write operation has to be written to a zeroed disk block. The painfully slow part of an SSD is zeroing disk blocks. If there are no zeroed blocks available for a write operation, write performance falls off a cliff while unused blocks are zeroed.

Having a large pool of pre-zeroed blocks greatly enhances consistent write performance. The SSD trim feature (turned on by default in AKIPS) allows the operating system to inform the SSD when a disk block can be zeroed. Some SSDs also have a hidden pool of pre-zeroed blocks.

DAS vs SAN vs NAS

AKIPS preferred order of storage types:
  • SAN
  • DAS RAID 10
  • DAS RAID 1
  • DAS RAID 0
  • DAS JBOD
  • NAS (thick provisioned)

DAS and SAN provide efficient "block level" storage to the operating system, whereas a NAS is just a "file store" accessed over 10G Ethernet/IP/NFS. A NAS will have significantly higher latency and more fragmentation-related performance issues than a SAN/DAS.

Thick vs Thin Provisioning

  • Thick provisioning - storage is preallocated when created (preferred)
  • Thin provisioning - storage is allocated on-the-fly (slow, poor performance)

Do NOT use thin provisioned dynamically allocated storage. It ALWAYS leads to massive database performance problems due to fragmentation. AKIPS reads/writes large sequential database files and expects minimal underlying block level fragmentation and latency.

Using thin provisioned storage is also pointless because AKIPS uses a copy-on-write file system, therefore all disk blocks on the virtual storage will be quickly allocated and consumed, but in a highly fragmented order.