So what makes these mainframe thingies so special?
Well, the main characteristics that most contribute to the mainframe’s refusal to die are its RAS, its I/O ability, and the ISA.
RAS (Reliability, Availability, and Serviceability) is a term IBM often uses to describe its mainframes. By the early '70s IBM had realized that the market for commercial systems was far more lucrative than that for scientific computing, and that one of the most important attributes for its commercial customers was reliability. If customers were going to use these machines for critical business functions, they had to know they could depend on the machines being available at all times. So, for the last 30 years or so IBM has focused on making each new family of systems more reliable than the last. Today's systems are so reliable that it is extremely rare to hear of any hardware-related system outage.

There is such a high level of redundancy and error checking in these systems that very few scenarios, short of a Vogon Constructor fleet flying through your datacenter, can cause a system outage. Each CPU die contains two complete execution pipelines that execute each instruction simultaneously. If the results of the two pipelines are not identical, the CPU state is regressed and the instruction retried. If the retry also fails, the original CPU state is saved, and a spare CPU is activated and loaded with the saved state data. This spare then resumes the work that was being performed by the failed chip. Memory chips, memory busses, I/O channels, power supplies, etc. are all either redundant in design or have corresponding spares which can be put into use dynamically. Some of these failures may cause a marginal loss in performance, but they will not cause the failure of any unit of work in the system.
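The retry-and-spare scheme can be sketched as a toy model. To be clear, this is purely illustrative: all names here are mine, and the real z-series sparing happens in hardware and microcode, not in software like this.

```python
class SpareCPU:
    """Stand-in for a spare processor that can be loaded with saved state."""
    def load_state(self, state):
        self.state = state

    def run(self, instruction):
        # in this toy model an "instruction" just adds a value to the state
        return self.state + instruction


def execute_checked(instruction, pipelines, spares, state):
    """Execute `instruction` on two mirrored pipelines and compare results.

    On a mismatch the state is rolled back and the instruction retried
    once; if the retry also mismatches, a spare CPU is activated, loaded
    with the saved state, and resumes the failed unit of work.
    """
    for _ in range(2):                 # initial attempt plus one retry
        results = [p(instruction, state) for p in pipelines]
        if results[0] == results[1]:
            return results[0]          # pipelines agree: instruction completes
    spare = spares.pop()               # persistent failure: bring a spare online,
    spare.load_state(state)            # load it with the saved CPU state,
    return spare.run(instruction)      # and resume the work on the spare
```

The point of the double pipeline is that a transient fault shows up as a disagreement rather than silently corrupting data; only a repeatable disagreement triggers sparing.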
Serviceability comes into play in the rare event that there is a failure. Many components can be replaced concurrent with system operation (hot swapped); even microcode updates can often be installed while the system is running. For those components, such as CPUs, that cannot be replaced concurrently, the existence of spares allows the service outage to be scheduled at the customer’s convenience.
In addition to the inherent reliability of the system design, IBM has also created a tightly coupled clustering technology called Parallel Sysplex, which allows up to 32 systems to be operated as a single system image. In a properly deployed Parallel Sysplex, even the complete catastrophic loss of a single system (like those Vogons flying through the datacenter) can be tolerated without any loss of work: any work that was being performed on the failed system can be automatically restarted on a surviving system. One additional advantage of Parallel Sysplex is that one (or more) system(s) can be removed from the Sysplex for hardware or software maintenance (for instance during off-hours), while the remaining systems continue processing the workload. When the maintenance activity has been completed, the system(s) can then be brought back into the Sysplex. One way this can be exploited is to upgrade the software levels on the entire Sysplex (one system at a time) without ever causing any application outage.
With all these capabilities, true 100% system availability is entirely practical, and is being achieved at many sites.
The mainframe's channels are actually I/O processors: they are given "channel programs" to execute, which contain chains of I/O commands, including a primitive branch capability. This greatly reduces the CPU's involvement in I/O operations, allowing the CPU to work more efficiently. Each channel is capable of handling many concurrent I/O operations and thousands of devices.
One typical attribute of a mainframe configuration is that there are multiple independent paths to each device. A DASD device will normally be accessible through as many as four independent control units, each connected to different channels. In fact, it is not unheard of for a given control unit to be connected to as many as 32 separate systems. One concept that may sound very foreign to PC/Unix users is that it is quite normal for a single DASD device to be physically connected, online, and in use by many systems concurrently. Significant hardware and software protocols are in place to ensure data integrity across multiple systems and allow this to happen.
In the 360 and 370 architectures the operating system would create a channel program and attempt to execute it on a channel with a connection to the desired device. If the channel or control unit was busy, the SIO (Start I/O) instruction would fail, and the OS would attempt to start the channel program on another channel connected to a different control unit. If all paths were busy, the OS would have to queue the request to be retried later. One of the significant changes introduced in XA was the channel subsystem, which coordinates and schedules the activity on all the channels in the system. Now the OS only has to create the channel program and hand it to the channel subsystem, which then handles all channel/control-unit selection and queuing issues. This enables even greater I/O throughput while allowing the CPU to be even more efficient, because it is interrupted only when the entire I/O operation has completed.
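To give a feel for what a channel program actually is, here is a sketch that packs a classic four-CCW (Channel Command Word) DASD read: seek, search for the record ID, branch back until the search matches, then read. The command codes are the traditional S/360 DASD values; the storage addresses are made-up placeholders, not anything a real OS would use.

```python
import struct

CC = 0x40  # command-chaining flag: after this CCW, continue with the next one

def ccw(cmd, addr, flags, count):
    """Pack one 8-byte S/370-format CCW: command code (1 byte),
    data address (3 bytes), flags (1 byte), unused (1 byte),
    byte count (2 bytes), all big-endian."""
    return struct.pack(">B3sBBH", cmd, addr.to_bytes(3, "big"), flags, 0, count)

# Placeholder storage addresses (illustrative only)
SEEK_ADDR, ID_ADDR, TIC_ADDR, BUF_ADDR = 0x1000, 0x1008, 0x2008, 0x3000

program = b"".join([
    ccw(0x07, SEEK_ADDR, CC, 6),    # SEEK: position the access arm
    ccw(0x31, ID_ADDR,   CC, 5),    # SEARCH ID EQUAL: compare record IDs
    ccw(0x08, TIC_ADDR,  0,  1),    # TIC: branch back to the SEARCH
    ccw(0x06, BUF_ADDR,  0,  4096), # READ DATA into the buffer
])
```

The SEARCH/TIC pair is the "primitive branch capability" mentioned above: the TIC loops back to the SEARCH on each record, and when the search finally matches, the channel skips the TIC and falls through to the READ. The CPU is not involved in any of this looping.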
The total I/O throughput capacity of the current z900 mainframes is no less than 24GB (that's bytes, not bits) per second. I have not personally had the opportunity to benchmark these latest systems, and while theoretical numbers can sometimes be misleading, I wouldn't be at all surprised to see a z900 performing as many as 100,000 I/O operations per second.
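A quick back-of-the-envelope check suggests those two numbers are consistent. Assuming a 4KB block size (my figure, purely for illustration, not from any benchmark), 100,000 operations per second would consume only a small fraction of the aggregate channel bandwidth:

```python
ops_per_sec = 100_000          # the IOPS estimate above
block_bytes = 4 * 1024         # assumed block size, for illustration only
channel_bw  = 24 * 1024**3     # 24GB/s aggregate channel bandwidth

data_rate   = ops_per_sec * block_bytes  # ~0.4 GB/s of actual data
utilization = data_rate / channel_bw     # fraction of aggregate bandwidth, ~1.6%
print(f"{utilization:.1%} of channel bandwidth")
```

In other words, at that I/O rate the channels themselves are nowhere near saturated; the limits lie elsewhere (devices, control units, software pathlength).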
While there have been significant changes in the overall instruction set of IBM mainframes over the years, IBM has maintained an amazing amount of backward compatibility for applications. Many of the most significant architectural changes have affected facilities (such as the I/O subsystem) that are invoked directly only by the OS and are not available to application programs. IBM has taken great pains to ensure that its customers will not have to re-write or re-compile their programs to run on newer systems. This makes it much easier for customers to adopt new hardware: they can simply swap out an older system for a newer model without extensive software testing. It is very common for a company with a single mainframe to replace it in a span of a few hours, without any testing on the new system before putting it into production. Typically this is done with the customer remaining on the same OS version, upgrading to the newer version only as needs allow. For instance, a customer can install a new z900 system while still running a 31-bit OS, then install and test a 64-bit version of the OS in a separate LPAR before moving production onto the 64-bit OS.
That’s it, for now…
I hope you found this article to be informative and thought provoking, but due to its brevity I’m sure that there will be questions. I would like to invite everyone to head on over to the message boards and pose any questions you may have. If things get too deep over there for me I may call on a friend or two to help me answer your questions, so go ahead and ask away. Also, feel free to let me know what subjects you might like to see me cover in future articles.
You had better appreciate this Guide entry more than “Mostly Harmless”, it certainly took more research! Anyway, I’m a bit worn out from all this work so I think I’ll nip off to Milliways for a few Pan Galactic Gargle Blasters and watch the Universe explode for my pleasure, anyone know where that frood Zaphod’s gotten off to? I’ll see you at the message boards when I get out of recovery.