Tuesday, July 1, 2008

The spare 5th tire

No more sitting around doing nothing.

There are all kinds of ways to achieve system reliability, or more precisely system resilience against some kind of failure. They usually involve some kind of redundancy, rather like having that 'spare 5th wheel' for your 4-wheel car. When it comes to network video recorders (NVRs), the classic approach to resilience is to have, say, 4 NVRs in constant use and keep a 5th sitting idle, hidden away somewhere, unused but eagerly awaiting the single command to spring into action and take over the recording responsibility from one of the other 4 that has just been detected to have failed. This kind of automatic failover is well known as "n+1 failover" and is in widespread use today. In the car analogy, n=4 and the '+1' is my spare in the trunk.

Upon inspection, n+1, though completely effective, is not very efficient. I have to pay for one NVR to sit around all day doing nothing except aging and worse, slowly going out of warranty without seeing any action - it is my insurance policy where my monthly payments go into a dark hole. But in the real world I do not have a team of 4 employees working hard, and one sitting twiddling his thumbs waiting months for one of the 4 to fall sick and not turn up for work without notice. Instead all 5 work hard, and if one fails then the remaining 4 do their level best to pick up the slack. I do not have 1 power station powered up and ready, waiting for one of the other 4 to fail - all 5 run at below maximum output so that if one fails then the remaining ones again pick up the slack.

What is that called? n+0 failover? Load balancing? And how can that concept be applied to IP video centralized recording like NVRs offer?

IT offers such a technique through virtualized storage - the concept of taking a number of physical storage systems, in this case iSCSI disk arrays, and making them appear as if they were a single pool of shared storage. When this virtualized storage is managed, as is the case with Bosch's Video Recording Manager software, all the physical iSCSI disk arrays are used equally - all the IP cameras are load-balanced across all the available physical storage. All the physical units are actively being used - you won't find one unit sitting idle waiting to spring into action and save the world. Now when a physical disk array fails, or becomes temporarily inaccessible (they are network devices, may be located anywhere on the network, and all networks need maintenance), the managing software simply sees the total pool of available storage as having shrunk slightly and instantly instructs the remaining physical units to pick up the slack. Recording does not stop - it continues, and the remaining units just have to work a little harder to keep up.
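To make the idea concrete, here is a minimal sketch of pooled recording with failover. It is not how Video Recording Manager is implemented; the function names, the array names and the 4 Mbit/s camera streams are all made up for illustration, and the "least-loaded array wins" rule is just one simple way to balance.

```python
def assign_cameras(cameras, arrays):
    """Spread camera streams over the arrays, always choosing the least-loaded one.

    cameras: dict of camera name -> bit rate in Mbit/s (illustrative values)
    arrays:  list of array names (hypothetical)
    """
    load = {a: 0.0 for a in arrays}
    placement = {}
    # Place the heaviest streams first so the greedy choice balances better.
    for cam, mbps in sorted(cameras.items(), key=lambda kv: (-kv[1], kv[0])):
        target = min(load, key=lambda a: (load[a], a))
        placement[cam] = target
        load[target] += mbps
    return placement, load


def fail_over(placement, cameras, load, failed):
    """Move only the cameras of the failed array onto the surviving arrays."""
    new_load = {a: l for a, l in load.items() if a != failed}
    new_placement = dict(placement)
    orphans = sorted(c for c, a in placement.items() if a == failed)
    for cam in orphans:
        target = min(new_load, key=lambda a: (new_load[a], a))
        new_placement[cam] = target
        new_load[target] += cameras[cam]
    return new_placement, new_load


# 20 cameras at an assumed 4 Mbit/s each, recorded onto 5 arrays.
cameras = {f"cam{i:02d}": 4.0 for i in range(20)}
arrays = ["array1", "array2", "array3", "array4", "array5"]

placement, load = assign_cameras(cameras, arrays)
print(load)        # every array carries 16.0 Mbit/s

# array3 drops off the network: the pool shrinks, recording carries on.
placement_after, load_after = fail_over(placement, cameras, load, "array3")
print(load_after)  # the four survivors carry 20.0 Mbit/s each
```

No array is ever idle in this sketch, and no camera stops recording when one array disappears; the survivors simply carry a larger share.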

Naturally if you do not want a system to degrade in performance (e.g. retention time) upon a component failure, such as a disk array, then you have to build in redundancy. You need to make sure that the remaining physical units are not running maxed out under normal conditions, else they will not be able to pick up the slack. If I have 5 ambulances in continuous 24x7 use, then I cannot survive a breakdown. However if each of the 5 is running at under 80% utilization then I have a chance. So maybe with 5 iSCSI disk arrays I achieve 38 days of video retention, and on one failure that temporarily drops by about 20% to roughly 30 days, which I consider to be an acceptable minimum. Because the 5th disk array is not sitting idle, you improve your Return on Assets by utilizing everything you own to deliver more value (38 days) than you expected.
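The retention arithmetic can be sketched in a few lines. This assumes, purely for illustration, that all arrays are the same size and that the cameras keep recording at the same total bit rate, so retention scales with the number of surviving arrays. The function names are hypothetical.

```python
def retention_days(total_arrays, failed_arrays, full_pool_days):
    """Retention after some arrays fail, assuming equal-sized arrays
    and an unchanged total recording bit rate."""
    surviving = total_arrays - failed_arrays
    if surviving <= 0:
        return 0.0
    return full_pool_days * surviving / total_arrays


def required_full_pool_days(min_days, total_arrays, tolerated_failures):
    """Retention the full pool must deliver so that the minimum still
    holds after the tolerated number of failures."""
    surviving = total_arrays - tolerated_failures
    if surviving <= 0:
        raise ValueError("at least one array must survive")
    return min_days * total_arrays / surviving


print(retention_days(5, 0, 38.0))           # all 5 arrays healthy
print(retention_days(5, 1, 38.0))           # one array lost: about 30.4 days
print(required_full_pool_days(30.0, 5, 1))  # 37.5 days needed up front
```

Under those assumptions, a guaranteed 30-day minimum with one failure tolerated calls for 37.5 days from the full pool, which is where a figure like 38 days comes from.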

A final difference is that with n+1, you need a spare for every failure you anticipate. If you need to survive 2 concurrent failures, then you need 2 spare NVRs. In large or highly critical systems that is very realistic (think of the RAID 6 analogy, or NetApp's RAID-DP). However with this concept of n+0, as long as redundant capacity exists throughout the remaining system, further failures can occur, at any time - the remaining units just have to pedal faster.
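How much pedaling is left depends on the headroom. As a rough sketch, assume every unit is identical, runs at the same utilization, and that the load of a failed unit spreads evenly over the survivors (simplifications of mine, not a description of any product). The hypothetical helper below counts how many concurrent failures the pool can absorb before the survivors would exceed 100%.

```python
def max_failures_absorbed(units, utilization_pct):
    """Number of concurrent unit failures a pool can absorb.

    units:           number of identical units sharing the load
    utilization_pct: normal utilization of each unit, as a whole percentage

    Integer percentages are used to avoid floating-point rounding surprises.
    """
    total_demand = units * utilization_pct  # in "percent of one unit"
    failures = 0
    # Keep failing units while the survivors could still carry the full demand.
    while failures + 1 < units and total_demand <= (units - failures - 1) * 100:
        failures += 1
    return failures


print(max_failures_absorbed(5, 80))   # 1: four survivors end up at 100%
print(max_failures_absorbed(5, 60))   # 2: three survivors end up at 100%
print(max_failures_absorbed(5, 100))  # 0: no headroom, no failover
```

With n+1, surviving k failures means buying k idle spares. Here the same protection comes from running the units you already use a little below their limit.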

Bosch's Video Recording Manager is not just about resilience. It is about squeezing the most out of everything you have and n+1 just doesn't do that. The simple concept of sharing the load is as valuable to fault tolerance as it is to not wasting storage just because you over-allocated space to a cluster of cameras that turned out to consume storage slower than another cluster. Bosch's Video Recording Manager pools all the available and active storage and allocates it appropriately to cameras on-demand.

One day n+1 failover will go out of fashion - not because of a lack of reliability but because of inefficiency. However since almost everyone's software architecture is based on it, it will inevitably take time.

No more sitting around doing nothing - it's time for everyone to pull their own weight.

1 comment:

gwalborn said...

The other nice thing about the Bosch Video Recording Manager solution is that not all iSCSI units in the system need to be of equal capacity or capable of the same number of device connections. It will simply place the load wherever the remaining system is capable of handling the recording and continue to record all cameras. Small, medium and large units all do their fair share of recording until the Video Recording Manager is assured that the video is retained for the prescribed amount of time. This technology has given me solutions for numerous applications, whether large IP camera configurations, long-term storage configurations or very diverse recording bit rates across several different camera locations in a common system layout.