Lessons Learned at JPL
by Wayne Tustin
Considerable newspaper and technical publication coverage was given to an overly-severe March 21, 2000 vibration test in Room 144 of Building 100 at the Jet Propulsion Laboratory, Pasadena, California. The over-test caused significant damage (over $1,000,000) to the High Energy Solar Spectroscopic Imager (HESSI) satellite built by the University of California at Berkeley (UCB). A Mishap Investigation Board (MIB) was convened. The principal source for this article is a prerelease version of the MIB report, which will eventually appear at http://www.gsfc.nasa.gov.
My intent here is to help readers (who are in any way involved in testing) appreciate the possibly disastrous consequences of actions that are taken or not taken in seemingly routine procedures.
Event Sequence
Earlier on March 21, at 13:39 hours, the spacecraft had passed a nominally sinusoidal 0.25g survey test, Run #2. (I say nominally because records later investigated by the MIB showed that there had been significant waveform distortion.) The spacecraft also passed a random vibration test at 17:50 hours. Run #9 utilized force limitations and spectrum notches. (Other records investigated by the MIB showed that this test also had unusual characteristics.) In retrospect, both of these tests had revealed symptoms of trouble, symptoms that unfortunately were ignored on March 21.
The MIB Report mentions aborts on Runs 1, 5, 6, 7 and 8. These are blamed on various overloads, but these overloads are not explained in the Report. I've been told of solar array panel rattling, with acceleration peaks causing limit channels to clip.
At 18:13 hours (Run #10) the 13:39 sine survey test was repeated. The third test of the day, Run #11, was to be a 7.5g open-loop sine burst vibration test.
Plans called for six bursts at -12 dB or 1.88g peak, one burst at -6 dB or 3.75g and (after a review of input and responses) a single burst at full level, 7.5g peak. Unfortunately, the first -12 dB (1.88g) check, at 18:43 hours, was much too severe. The test was aborted manually. Records showed that acceleration reached 21g for 4+ cycles. Solar panel arrays were damaged.
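The burst levels quoted above follow from the usual amplitude relation, level = full level x 10^(dB/20). The short sketch below reproduces that arithmetic; the function names are illustrative only and are not taken from the MIB report or from the control system software.

```python
import math


def level_at_db(full_level_g: float, db: float) -> float:
    """Peak acceleration (g) at `db` decibels relative to the full test level."""
    return full_level_g * 10 ** (db / 20.0)


def ratio_in_db(measured_g: float, planned_g: float) -> float:
    """How far a measured peak lies above a planned peak, in dB."""
    return 20.0 * math.log10(measured_g / planned_g)


FULL_LEVEL_G = 7.5   # full-level sine burst peak, from the test plan
MEASURED_G = 21.0    # peak recorded during the first -12 dB check

for db in (-12.0, -6.0, 0.0):
    print(f"{db:+5.1f} dB -> {level_at_db(FULL_LEVEL_G, db):.2f} g peak")

planned = level_at_db(FULL_LEVEL_G, -12.0)
print(f"Over-test relative to the planned -12 dB check: "
      f"{ratio_in_db(MEASURED_G, planned):.1f} dB")
```

The exact -6 dB factor is slightly more than one half, so the computed value is a shade above the 3.75g quoted in the plan. The 21g recorded during the first check was roughly eleven times the intended 1.88g.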
What had happened?
The Board prepared a lengthy list of all possible causes for the mishap. One by one, most of the possible causes were exonerated.
- Endevco 2271A/A20 accelerometers
- Trig-Tek 1273A charge amplifier
- m + p VCP9000 vibration control system
- LDS power amplifier driving the Ling A-249 shaker
- Shaker internals (flexures and bearings and other possible mechanical and electrical difficulties).
(Spacecraft response instrumentation is not discussed here ... only the mechanical input to the spacecraft.)
Attention soon focused on the oil film slip plate (magnesium alloy) to which the spacecraft was attached via an adapter ring which was connected by 24 Kistler 9251A force sensors (to measure force input to the spacecraft) to an aluminum fixture plate. (After the event, several of the mounting ring-to-slip plate bolts were found to be loose. See log at 18:53 hours.) The slip table, supporting granite block and shaker armature were found to be misaligned, not parallel. The magnesium slip table had evidently been rubbing on the granite block for some time, generating considerable heat. Magnesium had transferred from the underside of the plate to the granite surface. Considerable evidence from the earlier tests that day indicated that the resulting stiction (greater than normal coefficient of static friction) had been present throughout the day's testing and was the root cause of the mishap.
This author asks: were records of tests from March 13-17 and from March 20 reviewed? Yes, and there is some evidence that the stiction problem had existed (but had not been recognized) before March 21.
Concerning the pressurized oil film slip table, the question was raised: had oil pressure been turned on? General agreement: yes, but no documentation and (false economy) no system interlocks.
MIB found that the granite block itself had not shifted, but that the shaker body had moved, misaligning the moving system (consisting of shaker armature, flexures, bearings, bullnose and slip plate) and creating stiction. Two 1-inch-diameter bolts that had secured the shaker "saddle" to the shaker base were broken, with the holes misaligned by approximately 0.5 inch. Upon disassembly of the shaker supporting base, one of the two trunnion support needle bearings was found to have a broken outer race. Some rollers were loose; others were missing. Replacement parts will be taken from a surplus A-249 shaker being shipped from Huntsville, Alabama.
When did the shaker base fail?
I've been told that the shaker had been used in the vertical attitude for "a long time" prior to March 21. During rotation of the shaker body into the horizontal attitude, the soon-to-fail (or already failed) trunnion bearing must have emitted loud noises. Why did no one hear that noise? Careful realignment of slip plate to shaker followed that rotation.
How did stiction cause the overtest?
MIB reconstructs the events thus: upon initiating the sine-burst test, the shaker control computer had (at much reduced level) developed and stored a drive signal. Unfortunately, that drive signal was incorrect (much too high) due to slip plate stiction. Excessive force had been required to obtain the required low level of slip plate motion. The computer poorly estimated the drive signal which would be needed for the -12 dB check. Unfortunately, no procedure required the operator to run the sine burst test before mounting the spacecraft.
When the -12 dB check occurred, that excessive force not only overcame stiction but created excessive motion. That's what damaged the spacecraft.
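A toy calculation may help show why a drive signal learned at low level can be badly wrong when friction is present. The sketch below is not the MIB's analysis: it assumes a rigid mass, a constant friction force and a controller that simply scales the low-level force linearly, and every number in it except the 1.88g target is hypothetical.

```python
G0 = 9.80665  # standard gravity, m/s^2


def force_needed(accel_g, mass_kg, friction_n):
    """Peak force (N) to move a rigid mass at accel_g against a constant friction force."""
    return mass_kg * accel_g * G0 + friction_n


def accel_achieved(force_n, mass_kg, friction_n):
    """Peak acceleration (g) once friction is overcome; zero if the plate stays stuck."""
    return max(force_n - friction_n, 0.0) / (mass_kg * G0)


def open_loop_result(target_g, check_g, mass_kg, friction_n):
    """Acceleration reached when the low-level check force is scaled linearly to target_g."""
    check_force = force_needed(check_g, mass_kg, friction_n)
    drive_force = check_force * (target_g / check_g)  # linear extrapolation, no feedback
    return accel_achieved(drive_force, mass_kg, friction_n)


MASS_KG = 500.0   # hypothetical moving mass
CHECK_G = 0.1     # hypothetical low-level self-check amplitude
TARGET_G = 1.88   # the -12 dB level from the test plan

for friction_n in (0.0, 500.0, 2000.0):  # hypothetical friction forces
    reached = open_loop_result(TARGET_G, CHECK_G, MASS_KG, friction_n)
    print(f"friction {friction_n:6.0f} N -> {reached:5.2f} g reached "
          f"(target {TARGET_G} g)")
```

In this simplified model the error grows as the self-check level shrinks, because friction then dominates the force measured during the check. That is consistent with the MIB's recommendation to run self-checks at levels representative of the planned test.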
Evidence of stiction
MIB found evidence of stiction in acceleration vs. time plots taken during the earlier 0.25g sine sweep, Run #10. Large amplitude "glitches" occur immediately after the zero velocity points. This author surmises that the accelerometer signals leading to those plots were available to test personnel in Room 104 during the several sine sweeps. A subsequent question brought assurance that an oscilloscope is provided and is always turned on. Those charged with watching the oscilloscope blamed distortion on "shaker-spacecraft interaction and electrical noise". It seems to me they should have been instructed to stop the test if motion was not sinusoidal.
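A "stop if not sinusoidal" rule can be made objective with a simple distortion measure. The sketch below fits a sine at the known drive frequency and reports the residual as a fraction of the fundamental. It is only an illustration: the synthetic glitch shape, the sample rate and the 10 percent limit are assumptions, not values from the MIB report, and the fit is exact only when the record spans a whole number of cycles.

```python
import math


def sine_distortion_ratio(samples, sample_rate_hz, freq_hz):
    """Residual RMS divided by fundamental RMS after fitting a sine at freq_hz.

    Uses correlation with sin and cos, which is an exact least-squares fit
    only if the record contains a whole number of cycles.
    """
    n = len(samples)
    w = 2.0 * math.pi * freq_hz / sample_rate_hz
    mean = sum(samples) / n
    a = 2.0 / n * sum((x - mean) * math.sin(w * i) for i, x in enumerate(samples))
    b = 2.0 / n * sum((x - mean) * math.cos(w * i) for i, x in enumerate(samples))
    residual_sq = sum(
        (x - mean - a * math.sin(w * i) - b * math.cos(w * i)) ** 2
        for i, x in enumerate(samples)
    )
    fundamental_rms = math.hypot(a, b) / math.sqrt(2.0)
    return math.sqrt(residual_sq / n) / fundamental_rms


def synthetic_survey(amplitude_g, glitch_g, cycles=10, samples_per_cycle=100):
    """Sine with a short glitch just after each acceleration peak (zero velocity)."""
    out = []
    for i in range(cycles * samples_per_cycle):
        k = i % samples_per_cycle
        x = amplitude_g * math.sin(2.0 * math.pi * k / samples_per_cycle)
        if k in (26, 27, 28):      # just after the positive peak
            x += glitch_g
        elif k in (76, 77, 78):    # just after the negative peak
            x -= glitch_g
        out.append(x)
    return out


DISTORTION_LIMIT = 0.10  # hypothetical abort threshold

for label, glitch in (("clean", 0.0), ("glitchy", 0.15)):
    ratio = sine_distortion_ratio(synthetic_survey(0.25, glitch), 2000.0, 20.0)
    verdict = "STOP AND INVESTIGATE" if ratio > DISTORTION_LIMIT else "ok"
    print(f"{label:8s} distortion = {ratio:.3f} -> {verdict}")
```

A number like this does not replace watching the oscilloscope, but it gives the operator a criterion that is harder to explain away as "electrical noise".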
Observers in Room 100 had noticed a "different" low frequency sound during the control computer's equalization process, prior to the earlier random vibration tests (Runs 3 through 9). MIB found indications that stiction had affected equalization, particularly in the low frequency, large displacement spectral region. This was confirmed by reviewing control accelerometer PSD from the self-check at 18:43 hours. This author now asks: 1. Had no one said to everyone present in Shaker Room 100 and Control Room 104 something like "Listen up, folks. This is an important test upon a very valuable satellite. If you should observe any anomalies, holler so we can investigate."? 2. Why did no one shout "Stop the test"?
Schedule requirements
This author asks: Why was the sine-burst test initiated, in the face of this evidence that all was not right? One of the investigators opines that the misalignment was not sudden but rather was a degradation to which the operators had become accustomed. Perhaps. But might we not ascribe some blame to pressure from above? Tests, coming late in any program, always seem to commence behind schedule. Page 23 of the MIB Report mentions "tight schedule". Was any individual afraid to stop the test?
How long had that test crew been working? 18:13 hours is 6:13 pm. 10 hours? Pressure to finish testing so all could go home to dinner is certainly understandable. Was there another "hurry up" test inflexibly scheduled for that facility next morning? One of the investigators told me privately that testing for 12 hours would not be unusual or unsafe and that project personnel are used to even longer hours. I've learned that commercial testing laboratories sometimes work their people in 18-hour shifts. Accident-investigation psychologists tell us that judgment deteriorates when people are tired, and that quite often the people involved in an accident will later deny having been tired.
Contributing factors
- Misalignment caused the slip table to bind at low force levels. MIB recommends checks for routinely assessing the mechanical "health" of shaker and slip table system.
- Test personnel did not know that quality data was available prior to initiating the sine-burst test. MIB recommends additional procedure steps, to review such data.
- No facility validation test was performed. MIB recommends simulating tests before the test article arrives.
- The shaker base failure. This was also identified as the root cause. MIB recommends refurbishing or replacing the shaker.
- The self-check amplitude was too low. MIB recommends self-checks at appropriate levels.
To MIB's five, I would like to suggest a sixth and seventh. I'm told there has been considerable test personnel turnover in the JPL test organization. Much experience has been lost. Possibly the year 2000 staff had had little formal precautionary training on this specific shaker system.
Much of the shaker equipment at JPL is 30 or more years old. In generating vibratory force, shaker internals and externals gradually deteriorate with each test. Anticipate failure. Treat your shaker at least as well as you treat your automobile.
Observations
MIB offers nine very thoughtful observations, accompanying each with recommendations.
Lessons learned
The MIB came up with some excellent steps to prevent recurrence of these events. All dynamics labs should copy these six points and post them prominently. Here is Section 10 of the prerelease report. (I have added numbers and slightly changed wordings.)
- Test facilities must be maintained with test equipment in good working order. Metrics that assess the mechanical health of the systems must be developed and tracked.
- "Canned" tests should be developed (and used periodically) to provide a trended database for test systems' responses. Any deviations in any system response should be investigated.
- Critical control system response data, such as (a) the transfer function or (b) the inverse transfer function, and (c) the calculated drive voltage, must be evaluated in real time during testing to ensure that they are reasonable and do not indicate system maladies.
- A facility validation test should be done for each planned test series, before flight or critical hardware is mounted. This should represent actual test conditions.
- Run self-checks that provide a representative response for the forcing range of the planned test. For higher force shock tests, shaker systems and test fixtures often do not respond in a linear fashion. Don't assume that test facilities are always in perfect working order.
- Define all test requirements (for a particular test) in the test plan. Provide test operators with adequate data. Require complete (system) verification testing before testing critical hardware.
To those six, I would like to add six more to make a dozen:
- Run self-checks at approximately the intended test intensity.
- Don't operate shaker systems "open loop".
- Supplement your shaker system with an independent "soft shutdown" system protector that derives its signal from an additional redundant (safety) accelerometer.
- Encourage anyone in the lab, whether part of the test team or observing, to call out "Question" if he/she suspects something is not right.
- Display unfiltered accelerometer time histories on at least one old-fashioned analog oscilloscope. During a sine vibration test, fully investigate any departures from a sine waveform.
- Listen to your shaker. It may be trying to tell you something. If, as at JPL, the operators cannot hear the shaker from their control room location, place a microphone near the shaker and a loudspeaker that is always turned "on" in the control room.
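The independent "soft shutdown" protector suggested above is normally a separate hardware unit, but its logic is simple enough to sketch. The example below is a minimal illustration under stated assumptions: the class name, the 1.5x margin and the three-sample trip rule are all hypothetical, and a real protector would ramp the drive down gently instead of merely raising a flag.

```python
from dataclasses import dataclass


@dataclass
class SafetyMonitor:
    """Trips when the redundant accelerometer exceeds limit_g for several samples in a row."""
    limit_g: float
    trip_count: int = 3      # consecutive over-limit samples required (hypothetical)
    over_count: int = 0
    tripped: bool = False

    def update(self, sample_g: float) -> bool:
        """Feed one sample; return True once a shutdown has been requested."""
        if self.tripped:
            return True      # latched: stays tripped until an operator resets it
        if abs(sample_g) > self.limit_g:
            self.over_count += 1
        else:
            self.over_count = 0
        if self.over_count >= self.trip_count:
            self.tripped = True
        return self.tripped


def first_trip_index(monitor, samples):
    """Index of the sample at which the monitor trips, or None if it never does."""
    for i, sample in enumerate(samples):
        if monitor.update(sample):
            return i
    return None


PLANNED_PEAK_G = 1.88            # the -12 dB check level from the test plan
LIMIT_G = 1.5 * PLANNED_PEAK_G   # hypothetical safety margin

single_spike = [0.5, 3.0, 0.5, 1.0]          # one noisy sample, no trip expected
runaway = [1.0, 3.0, 5.0, 9.0, 21.0]         # sustained growth well past the limit

print(first_trip_index(SafetyMonitor(LIMIT_G), single_spike))
print(first_trip_index(SafetyMonitor(LIMIT_G), runaway))
```

Requiring a few consecutive over-limit samples is a design choice to avoid nuisance trips from a single noisy reading. The price is a short delay, so the count should be small compared with one cycle of the test waveform.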
Thanks to several people who commented upon drafts of this article: Guido Bossaert, Dan Worth, Bob Newton, Bob Mercado.