I always like to get viewer feedback regarding news posts. It just so happens that "Brett" saw our post and decided to respond with a bit of a humorous situation!
Thanks, Brett!
Hi...
I'm a consultant in New York. I have a client who houses their systems in
an IBM data center in New Jersey. In the cage next to my client's is the
current IBM Linux supercomputer called the Shark. I'm not sure of the
specs, but it is the precursor to the one that you discuss in your post. IBM
uses it as a giant multiprocessor time share Linux machine. Basically you
can "buy" as many simultaneous instances of Linux with as much RAM and disk
(arranged however you like) as you need. The CPU and RAM are in one cage,
and the disks are in another--all connected via some type of fiber
connection. (If you want 500 virtual Linux boxes each with 2 GB of RAM and
500 GB of disk, it's only a few keystrokes and mouse clicks away--that's the
theory anyway.)
Anyway, the reason I'm writing to you is to tell you of a BIG screw-up that
happened on Friday. Although the data center is IBM's, the actual facility
is owned and mostly operated by AT&T. IBM occupies about 85% of the
gigantic facility, but things like HVAC, security and power are handled by
AT&T. At about 1pm here in New York it seems that all of IBM's space in the
data center lost all electrical power. Under no circumstances should this
have happened, as there are 5 different levels of power redundancy built into
the facility (multiple outside power sources, multiple UPSs and multiple
diesel generators).
At 11pm that evening, the IBM Shark was still totally down. There were tons
of IBM technicians and managers bleary-eyed and very frustrated. Although
the power was restored minutes after it was lost, the system was totally
mangled. They just couldn't get the overarching OS that controls each
instance of Linux to load. Also, the disks were in terrible condition with
corruption all over the place. The ramifications of this were (and maybe
still are) that none of the many existing customers who buy capacity on this
system were operational, and they will surely have problems when things get
sorted out. (Imagine having to manually fix THOUSANDS of individual disks!)
One more thing... The reason for the power outage was that AT&T was trying
to save some money by turning off power distribution units that were not
being used. The technician who was doing it never bothered to determine
which PDUs actually had loads on them. He simply switched them all off.
When I walked into the data center to fix my client's systems one of the IBM
people said to me that he had never seen so many "OK" prompts in one day.
Basically everything in the place just came to a screeching halt. It was a
disaster but it was funny too.
Just thought that you might like to know.
Brett