High Availability Redundant Processing Architecture Frequently Asked Questions (FAQs)
Is the dual processor board solution fault tolerant?
The dual redundant processor architecture provides fault resiliency. The active CPU board can isolate the backup CPU board if it has failed. The failed board can be isolated from the bus or powered down. The ability to power the board down and up again makes it possible to run diagnostics to determine whether the takeover was caused by a hardware or a software fault. If the cause was software, the CPU board may be functional again and not need to be replaced. If the hardware is faulty, the board will need service.
At the system level, the backplane is potentially a single point of failure; however, 99.999% availability for the system is still achievable.
How does clustering compare with this architecture?
A cluster is defined here as a group of autonomous desktop servers that work together as a single system. Autonomous implies they have everything they need to be standalone, most particularly processors, memory, storage, and peripherals, as shown in Figure 1. Their advantages are multiple commercial sources, ease of implementation, N+1 redundancy, and scalability. Scalability comes from incrementally adding systems to the cluster, which expands its aggregate processing power, whether to increase job speed, take on more difficult tasks, or accommodate higher transaction volume.
Their disadvantages are an incremental loss in processing power if a node fails, the expense of duplicating peripherals if the application requires a substantial amount of I/O, and a takeover or switchover time that can be on the order of 30-90 seconds.
In contrast, the high availability N+1 peripheral architecture described here is ideal for OEMs building unique servers with custom and typically expensive peripherals. The N+1 architecture's advantages are reduced cost if many peripherals are needed, a smaller system envelope, very fast switchover times, and hot swappability of all components. Its disadvantage is a longer system development time, because drivers need modification and application code must maintain synchronization.
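To put the switchover figures in context, the arithmetic below is a rough sketch rather than a measurement. It assumes a 365-day year, treats each takeover as pure downtime, and takes the 10 millisecond value from the switch-over answer later in this FAQ. The function names are illustrative only.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: 365-day year


def yearly_downtime_budget(unavailability: float) -> float:
    """Seconds of downtime per year allowed for a given unavailability.

    99.999% availability corresponds to an unavailability of 1e-5.
    """
    return unavailability * SECONDS_PER_YEAR


def budget_fraction(takeover_seconds: float, unavailability: float = 1e-5) -> float:
    """Fraction of the yearly downtime budget used by one takeover."""
    return takeover_seconds / yearly_downtime_budget(unavailability)


if __name__ == "__main__":
    print(f"five-nines budget: {yearly_downtime_budget(1e-5):.2f} s/year")
    # 30 s and 90 s bracket the cluster switchover range quoted above;
    # 0.010 s is the board-level takeover figure quoted elsewhere in this FAQ.
    for seconds in (30.0, 90.0, 0.010):
        share = budget_fraction(seconds)
        print(f"one {seconds} s takeover uses {share:.4%} of the budget")
```

Under these assumptions a handful of cluster-style switchovers per year can exhaust a five-nines budget, while a takeover measured in milliseconds uses a negligible share of it.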
How does this architecture align with the PICMG* (PCI Industrial Computer Manufacturers Group) High Availability specification?
Directly, it doesn't. The PICMG CompactPCI* Hot Swap Specification definition for High Availability is limited only to peripherals and does not address CPU boards or HA systems. Our CPU boards are fully hot swappable and can be replaced without halting system operation. Indirectly, the PICMG Hot Swap Specification does apply. Of the three levels of hot swapping that are defined (basic, full and high availability) we support the full definition. This means that peripherals wanting to be hot swapped will be automatically detected by the CPU board and notification will be passed to the application.
Will this architecture become a standard?
A PCI Industrial Computer Manufacturer's Group (PICMG) subcommittee is developing a specification for the implementation of redundant System Master CPU boards on a CompactPCI backplane. Intel® is an active member of this subcommittee and many features of our RSS architecture are expected to be included in the specification.
Are drivers available for off-the-shelf peripherals?
Intel is actively working with OTS peripheral vendors to enable device drivers for the HA Redundant Processing environment. If drivers for the particular peripheral device you are using are not currently HA aware, contact Intel. The peripheral vendor can be provided the appropriate documentation to enable the required driver enhancements.
What tools are available to write drivers for proprietary peripherals?
Purchasers of Intel's HA Redundant Processing products receive documentation that outlines the required device driver refinements for operation in this HA environment. A driver shell is also available that provides the framework for an HA driver.
How long does it typically take for software to switch-over?
Takeover time depends on the strategy an HA driver uses during a switch-over and the number of peripheral cards in the system. As an example, in a system utilizing eight intelligent slave Peripheral Master Boards (such as Intel's compactnet product) the total time from fault detection to active system operation is less than 10 milliseconds.
What is the calculated percentage of up-time? How was the calculation made?
System up time, or availability, is based upon a number of factors. These factors include the Mean Time Between Failure (MTBF) of each component, the number of each component employed in the system (e.g., number of system fans), the level of redundancy (i.e., n+1, n+2, etc.), and the amount of time required to service a failed component (= time to reach system + time to replace component). Most of these factors are relatively fixed at the application level; however, the time to service a component is largely determined by the environment into which the system is installed. Systems deployed in remote installations that take hours to reach for service will have a different availability than a system installed in a facility with on-site service personnel and on-site spares.
Calculated system availability is over 99.999% with average service times of 2 hours and 99.999% with average service times of 0.5 hours.
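The factors listed above can be combined in a few lines. The sketch below uses the common steady-state approximation, availability = MTBF / (MTBF + service time), multiplies availabilities for components that must all work, and treats a redundant pair as failing only when both units are down. This FAQ does not state the method actually used for the quoted figures, and every MTBF value in the code is a hypothetical placeholder, so the output illustrates the shape of the calculation only.

```python
def availability(mtbf_hours: float, service_hours: float) -> float:
    """Steady-state availability of one component."""
    return mtbf_hours / (mtbf_hours + service_hours)


def redundant(unit_availability: float, copies: int = 2) -> float:
    """Availability of independent redundant units when any one suffices.

    Assumes independent failures and a perfect takeover.
    """
    return 1.0 - (1.0 - unit_availability) ** copies


def series(*parts: float) -> float:
    """Availability of components that must all be working."""
    result = 1.0
    for part in parts:
        result *= part
    return result


def system_availability(service_hours: float) -> float:
    # Hypothetical MTBF values, chosen only for illustration.
    cpu_pair = redundant(availability(100_000, service_hours))
    fan_pair = redundant(availability(50_000, service_hours))
    backplane = availability(2_000_000, service_hours)  # not redundant
    return series(cpu_pair, fan_pair, backplane)


if __name__ == "__main__":
    for hours in (2.0, 0.5):
        print(f"service time {hours} h: {system_availability(hours):.7f}")
```

With placeholder inputs like these, the non-redundant backplane term dominates the result, and shortening the service time improves every term, which matches the point made above about on-site personnel and spares.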
Is application modification required for the redundant processing environment?
Yes. Applications need to be tailored for the redundant CPU environment. Applications should control what data is synchronized with the hot standby CPU board and perform internal checkpointing and monitoring watchdog timeouts.
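As a rough illustration of application-controlled checkpointing, the sketch below frames each checkpoint with a sequence number and a CRC32 so the standby side can reject damaged or stale data. The frame layout, the class, and the function names are assumptions made for this example; they are not part of the HA software API, and the transport to the standby board is left out.

```python
import json
import struct
import zlib

# Hypothetical frame header: sequence number, then CRC32 of the payload.
HEADER = struct.Struct("!II")


def encode_checkpoint(seq: int, state: dict) -> bytes:
    """Serialize the state the application chooses to synchronize."""
    payload = json.dumps(state, sort_keys=True).encode("utf-8")
    return HEADER.pack(seq, zlib.crc32(payload)) + payload


def decode_checkpoint(frame: bytes):
    """Return (seq, state), or raise ValueError if the frame is damaged."""
    if len(frame) < HEADER.size:
        raise ValueError("frame too short")
    seq, crc = HEADER.unpack_from(frame)
    payload = frame[HEADER.size:]
    if zlib.crc32(payload) != crc:
        raise ValueError("checksum mismatch")
    return seq, json.loads(payload.decode("utf-8"))


class StandbyMirror:
    """Standby-side copy of the last good checkpoint."""

    def __init__(self):
        self.seq = 0
        self.state = {}

    def apply(self, frame: bytes) -> bool:
        """Accept a frame only if it is intact and newer than the last one."""
        try:
            seq, state = decode_checkpoint(frame)
        except ValueError:
            return False
        if seq <= self.seq:
            return False
        self.seq, self.state = seq, state
        return True
```

A typical use would be for the active side to call `encode_checkpoint` at each point it considers safe to resume from, and for the standby side to feed received frames to `StandbyMirror.apply`.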
Can a rear I/O board be hot swapped?
Yes, both CPU front cards and rear I/O boards can be hot swapped.
Is this system usable with Microsoft's Wolfpack clustering software?
Intel's high availability (HA) redundant processing architecture is not a cluster, in that the failover to standby hardware is at the board level instead of the system level. However, it is possible to configure the Intel NetStructure® system processor boards in an active/active mode and operate the system as a "cluster-in-a-box" using Wolfpack software. In this mode, an entire bus segment and all cards on that segment will be inoperable in the event of a failover.
Are configured platforms available?
Yes. Fully configurable platforms are available to meet customer needs.
With the backplane being a potential single point of failure for the system, what safeguards are in place to prevent a CompactPCI I/O card from locking the bus when it fails?
If an I/O card causes a persistent fault on the bus and that card cannot be disabled, then that CompactPCI bus segment will not operate. The other bus segment can continue to operate.
How aware of the architecture does an application have to be?
Applications operating in a redundant System Master environment have some constraints that are not encountered in a single System Master environment. The application must be structured to become active gracefully when it is called into service from hot standby mode. The application is also responsible for synchronization.
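One way to picture these constraints is an application object that tracks its own role. The sketch below is a toy model: the `Role` enum, the `on_sync` and `on_takeover` callbacks, and the `next_job` field are hypothetical names for this example and do not come from the HA software environment.

```python
from enum import Enum


class Role(Enum):
    STANDBY = "standby"
    ACTIVE = "active"


class RedundantApp:
    """Toy application that stays passive until it is called into service."""

    def __init__(self):
        self.role = Role.STANDBY
        self.mirrored = {}   # last state received from the active board
        self.next_job = 0    # position to resume from once active

    def on_sync(self, state: dict) -> None:
        """Hypothetical callback: the active board sent synchronized state."""
        if self.role is Role.STANDBY:
            self.mirrored = dict(state)

    def on_takeover(self) -> None:
        """Hypothetical callback: this board has just become active."""
        if self.role is Role.ACTIVE:
            return  # already active, nothing to do
        self.role = Role.ACTIVE
        # Resume from the last synchronized position, not from scratch.
        self.next_job = self.mirrored.get("next_job", 0)

    def handle(self, job) -> bool:
        """Process work only while active; the standby refuses it."""
        if self.role is not Role.ACTIVE:
            return False
        self.next_job += 1
        return True
```

The design point is that the standby copy does no externally visible work, yet always holds enough synchronized state to continue from where the active copy stopped.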
What tools are available to help application development?
A documented API is available to ease the interfacing of applications with the HA software environment. For maximum performance in the redundant processor HA environment, significant application tailoring should be expected.
What OSs are you planning to support?
Support will be provided for VxWorks*, Linux* and Windows* 2000.
What family of Intel processors is used?
The Pentium® III processors from Intel's Embedded Processor Group. These processors offer low power dissipation while providing high performance.
Will the system support Symmetric Multiprocessing (SMP) across the two CPU boards?
The system provides redundant system master CPU boards. SMP is not supported across the two CPU boards.
Do the CPU boards load share?
Either CPU board has access to all peripheral slots. Any bus master on either CompactPCI bus segment can communicate directly with any other peripheral on its bus segment or through the active CPU (transparently) to any other peripheral on the other bus segment. However, the two CPUs do not interleave bus transactions. One or the other is in control of the bus at all times. The standby CPU board, which is not in control of the bus, is not idle. It can be receiving program execution information over the Ethernet connection, doing independent on-board I/O, verifying correct program execution on the other CPU board, maintaining data tables or databases, etc., but it is isolated from the backplane.
How does the system support data concurrency between the two processor boards?
It is the responsibility of the application to maintain program concurrency, to the resolution the application requires, via a dedicated 100BASE-T Ethernet connection.
How does the system support accurate Program Counter (PC) transference?
PC resolution between the two CPU boards is not provided. The application provides data and status information for program resumption after a takeover occurs.
What is the switchover mechanism for the processor boards? Who decides?
There are many conditions under which a takeover can occur, including a watchdog timer time-out on either CPU and hardware fault detection on the active processor. All of these conditions are software configurable in software-protected logic within the Host Controller.
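The decision logic can be modeled in software to make the idea concrete. The sketch below is a simplified simulation, not the Host Controller's register interface: the policy fields, the board labels "A" and "B", and the method names are all assumptions for illustration, and time is passed in explicitly so the model is deterministic.

```python
from dataclasses import dataclass, field


@dataclass
class TakeoverPolicy:
    """Hypothetical, software-configurable takeover conditions."""
    watchdog_timeout_s: float = 0.5
    takeover_on_hardware_fault: bool = True


@dataclass
class HostControllerModel:
    """Toy model of an arbiter that decides which board is active."""
    policy: TakeoverPolicy
    active: str = "A"
    last_kick_s: float = 0.0
    log: list = field(default_factory=list)

    def kick(self, now_s: float) -> None:
        """The active board reports that it is still alive."""
        self.last_kick_s = now_s

    def poll(self, now_s: float, hardware_fault: bool = False):
        """Check the configured conditions; return the reason or None."""
        reason = None
        if hardware_fault and self.policy.takeover_on_hardware_fault:
            reason = "hardware fault"
        elif now_s - self.last_kick_s > self.policy.watchdog_timeout_s:
            reason = "watchdog time-out"
        if reason is not None:
            self._takeover(now_s, reason)
        return reason

    def _takeover(self, now_s: float, reason: str) -> None:
        self.active = "B" if self.active == "A" else "A"
        self.last_kick_s = now_s  # the new active board starts a fresh window
        self.log.append((now_s, reason, self.active))
```

For example, with a 0.5 second time-out, a board that last kicked the watchdog at t = 0.0 is still active at t = 0.4 and is replaced at the first poll after t = 0.5.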
What portions of the switchover are based in hardware/firmware/OS/application software?
The system hardware can perform a takeover without firmware or software intervention in the event of a catastrophic failure of the active CPU board. System software can also initiate a takeover in response to preprogrammed conditions that constitute a failure mode, such as over-temperature or incorrect supply voltages. Once the takeover has occurred, the Host Controller driver will notify the application that it can resume.
What is required to develop a working system?
- Define what points of synchronization are important for the application to maintain between the active and hot standby CPUs.
- Determine the takeover strategy by configuring the host controller configuration registers.
- Provide drivers based on the driver model defined for High Availability for any of your custom CompactPCI cards.
- Provide an application that communicates with the hot Standby CPU with provided Ethernet drivers.
What is the best case for processor switchover time?
In the best case, hardware switchover is on the order of microseconds, assuming the backup processor is in step (i.e., the OS is up and running and all drivers are installed). In the worst case, software latency delays would be incurred to bring the standby CPU board to the point where the formerly active CPU was executing. Assuming the standby CPU has just taken over from the active CPU because a fault was detected, this might involve power cycling the faulted processor in order to execute the hardware diagnostics, load the OS and device drivers, and then resume application operation.
What type of performance impact is there in providing the second processor board?
On the CompactPCI bus there is no performance impact. On the active CPU board the performance impact is the overhead associated with maintaining contact with the backup processor via Ethernet.