Software Error Protection Technique for High-Density Memory

 

Mario Eugenio Saturno

INPE, Instituto Nacional de Pesquisas Espaciais

Av. dos Astronautas, 1752 Sao Jose dos Campos, SP, Brazil

Phone:  55 12 39456622, Fax: 55 12 39411890, saturno@dea.inpe.br

ABSTRACT

Present microsatellite experiments generate a significant amount of data, so microsatellite on-board computers need a large mass memory to store data while the satellite is out of ground visibility. Space radiation affects high-density memories and can corrupt stored data (single event upset, SEU). An error detection and correction (EDAC) circuit, which adds circuitry and memory, can detect and correct these errors, but this strategy requires a large number of components. Microsatellite constraints on mass, power and volume therefore lead to a software approach for protecting memory against errors induced mainly by space radiation. The proposed technique uses software that implements a Hamming code to protect the data memory. A comparison between the Reed-Solomon method and the proposed technique is also presented. The new approach detects single and double errors, corrects single errors and logs events by dividing the data memory into blocks of 256 32-bit words, nine of which are used as check words. In addition, to prevent an SEU that affects more than one memory cell from causing a double error in a block, the address lines of memory cells adjacent in the column direction were exchanged so that these cells are located in different blocks.

1. INTRODUCTION

The National Space Research Institute (INPE) developed the first Brazilian scientific microsatellite (SACI-1) based on state-of-the-art technology and on the experience resulting from the development of the Brazilian Space Program [1].

SACI-1 flies in a low Earth polar orbit at an altitude of 750 km. It has a total mass of 60 kg, dimensions of 600 x 400 x 400 mm, and is spin stabilized.

On-board scientific experiments will generate around 4 kbytes/s of data. Because the time between antenna visibilities can be as long as 10 hours, a large data memory capacity and adequate processing and data transfer capability are necessary.

The SACI-1 On Board Computer (OBC) is responsible for environment and experiment on board data acquisition, data message encoding and transmission to ground; command reception, decoding and distribution; experiment control, attitude computing and controlling.

The computer comprises three fault-tolerant central processing units (CPU) based on the INMOS T805 transputers and three I/O interfaces. They are connected through 10-Mbps serial links. Each CPU contains a Watch-Dog Timer used to detect CPU malfunction and generate a flag to the other CPU.

Each transputer has a main memory divided into a ROM of 128 kBytes (32k x 32 bits) and a RAM of 128 kBytes (32k x 44: 32 bits for data plus 12 bits for parity check). The RAM is protected by an error detection and correction (EDAC) circuit that detects single and double errors and corrects single errors. Each transputer also has an extended memory to store data, a RAM of 4 Mbytes (1M x 32), which has a circuit to detect latch-up.

Each interface is internally redundant and is connected to two processing units. The Telemetry and Telecommand Interface receives telecommand frames from the ground communication system (at 19200 bps) and sends telemetry frames (at 250000 bps). The Serial Interface implements RS-422 communication between the OBC and experiments (at 19200 bps).

The T805 transputer is an INMOS microprocessor with four full-duplex communication links, developed for parallel and concurrent applications.

Transputers can be programmed in most high-level languages, but to gain the most benefit from the transputer architecture, the software must be programmed in Occam. The basic software manages all facilities and hardware fault tolerance and implements some primitive functions. Application programs implement the microsatellite functions and database.

Electronic components are sensitive to space radiation. A single event upset (SEU) occurs when a radiation particle changes the content of a memory cell. In addition, the total ionizing dose shifts the threshold voltage, increasing power consumption and leading to eventual failure.

The implementation of the EDAC in the main memory RAM requires 37.5% more memory components. It may be inferred that a single cosmic particle might corrupt several adjacent bits. To protect the extended data memory RAM, it is therefore necessary to protect several bits in each memory word. Hardware implementations are prohibitive considering the satellite constraints.

An interesting method to protect high-density memory was developed by the University of Surrey team using a Reed-Solomon code, and it is discussed below.

2. REED-SOLOMON METHOD

The Reed-Solomon method is a low-cost software implementation. It was used to protect high-density memory on the University of Surrey satellites UoSAT2, UoSAT3 and UoSAT5 [2]. The technique protects data that are written to and read from memory in blocks. It is not suitable for protecting byte-by-byte accesses, only block accesses.

The Reed-Solomon code was used to protect 8k x 8 memory components (96 kBytes) on UoSAT2 and 32k x 8 memory components (4 MBytes) on UoSAT3. The memory was divided into 255-byte blocks (actually 256, as explained below), 3 bytes of which were used for parity checking. Block protection is limited to one SEU affecting from one to eight bits of a single byte. If two SEUs occur in a block, the error is detectable but cannot be located.

Data are handled in blocks of 252 bytes. Whenever a data block is written to memory, an additional 3 parity bytes are derived from the data and appended to the block, resulting in a 255-byte block. In practice, each block is assigned 256 bytes for easy addressing, and the extra byte is unused.

Since it is very complex to implement a Reed-Solomon code in software, seven pre-calculated look-up tables were used, for a total table space of 1350 bytes.

The calculation of the three check parity bytes requires 828 look-ups and 828 XOR operations. The calculation of the three check parity bytes requires 765 look-ups and 765 XOR operations.

Observations of the behavior of the RAM on board UoSAT2 gave an expected error rate of around 5 x 10^-7 SEU/bit-day. To meet the desired goal, the entire memory should be verified every 4 hours and 33 minutes.

Unfortunately, uncorrectable errors occurred when UoSAT3 was in use. The memory verification rate was increased by a factor of 4. After 60 days, 2107 SEUs had occurred (13 uncorrectable, against 1.7 expected), or 1.047 x 10^-6 SEU/bit-day. This is about double the rate observed on UoSAT2, but it is consistent with the 32k x 8 memory components. After 41 days, 1419 SEUs had occurred (4 uncorrectable, against 0.3 expected), or 1.0 x 10^-6 SEU/bit-day.
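The quoted rates can be cross-checked from the upset counts and the observation periods. The following is a minimal sketch, assuming that the 4 MByte UoSAT3 memory holds 4 x 2^20 bytes of 8 bits each; the name seu_rate is illustrative only and does not come from the referenced work.

```python
# Cross-check of the UoSAT3 upset rates quoted in the text.
# Assumption: 4 MBytes means 4 * 2**20 bytes of 8 bits each.
MEMORY_BITS = 4 * 2**20 * 8


def seu_rate(upsets: int, days: float, bits: int = MEMORY_BITS) -> float:
    """Return the observed upset rate in SEU per bit per day."""
    return upsets / (bits * days)


rate_60 = seu_rate(2107, 60)   # 60-day observation period
rate_41 = seu_rate(1419, 41)   # 41-day observation period
print(f"{rate_60:.3e} SEU/bit-day")
print(f"{rate_41:.3e} SEU/bit-day")
```

Both periods give a rate close to 1 x 10^-6 SEU/bit-day, in line with the figures quoted above.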

It was concluded that some single particles are able to corrupt more than one bit when crossing a memory component, and that increasing the memory verification rate will have no effect on such errors. A single-byte correcting code is therefore not sufficient to maintain an error-free memory.

3. AN ERROR DETECTION AND CORRECTION BLOCK MEMORY CODE

A study [3] resulted in a proposal that uses software to protect the data memory better than the previous method and is easier to implement. The implementation combines a very simple hardware modification in the memory address circuit with the use of the modified Hamming code for block protection [4].

The modification suggested in the extended memory address circuit is to exchange the address bits A1A0 with A9A8. This provides an internal "spread" of the memory: bytes that are physically adjacent in a component no longer have adjacent addresses and belong to different blocks. When a heavy radiation particle crosses a component and affects two adjacent bytes, two single errors will occur in two different blocks instead of a double error in one block.
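The effect of the address exchange can be sketched in a few lines. This is only an illustration under stated assumptions: addresses are treated as word addresses, blocks are assumed to be aligned on 256-word boundaries (so the block index is the address shifted right by 8), and the names swap_address_bits and block_of are hypothetical.

```python
def swap_address_bits(addr: int) -> int:
    """Exchange address bits A1A0 with A9A8, leaving all other bits unchanged."""
    low = addr & 0x3            # A1A0
    high = (addr >> 8) & 0x3    # A9A8
    rest = addr & ~0x303        # every bit except A9, A8, A1 and A0
    return rest | (low << 8) | high


def block_of(logical_addr: int) -> int:
    """Block index, assuming 256-word blocks aligned on 256-word boundaries."""
    return logical_addr >> 8


# Two physically adjacent cells inside a memory component.
physical_a, physical_b = 0x1234, 0x1235
# The exchange is its own inverse, so it maps in both directions.
logical_a = swap_address_bits(physical_a)
logical_b = swap_address_bits(physical_b)
print(hex(logical_a), "-> block", block_of(logical_a))
print(hex(logical_b), "-> block", block_of(logical_b))
```

Because two neighbouring physical addresses always differ in A0, their logical addresses differ in A8 after the exchange, so under these assumptions they always fall into different blocks.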

A protected block of the extended data memory consists of 256 words of 32 bits, nine of which are used for parity checking. Dividing the extended data memory into blocks gives 8192 blocks/CPU. The block code method implements detection of single and double errors, correction of single errors and logging of these events. The nine parity check words store the modified Hamming code check bits computed for each bit column of the block. These 9 words represent 3.5% of the block.

To implement the modified Hamming code, a known algorithm is used. The 32-bit data words are stored in positions 0 to 246 (D0 to D246), and positions 247 to 255 are reserved for the parity check words C0 to C8, respectively. Each parity check bit verifies a set of data bits, so a table that maps each check bit to its data bits must be generated. The first step is to generate all nine-bit binary combinations (from 0 to 1FFH). After that, all combinations that have an even number of 1s (000000011, 000000101 and 000011110, for instance) are eliminated.

This operation leaves 256 different combinations, 9 of them with a single bit set. These combinations form the check bit table. To guard against bits stuck at 0 or 1, the bits C1, C5 and C6 are inverted. The parity check equations are generated from the check bit table.
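The table construction described above can be sketched as follows. The text does not say how the 247 multi-bit combinations are assigned to the data positions, so the ascending order used here is an assumption, as is the convention that bit j of an entry selects check word Cj; the function name is illustrative.

```python
def build_check_bit_table() -> list[int]:
    """Build the 256-entry check bit table from the odd-weight 9-bit patterns.

    Positions 0..246 (data words D0..D246) receive the 247 patterns with
    three or more bits set, in ascending order (an assumed ordering).
    Positions 247..255 (check words C0..C8) receive the single-bit patterns.
    """
    odd_weight = [v for v in range(0x200) if bin(v).count("1") % 2 == 1]
    check_columns = [1 << j for j in range(9)]
    data_columns = [v for v in odd_weight if v not in check_columns]
    return data_columns + check_columns


table = build_check_bit_table()
print(len(table), "entries,", len(set(table)), "distinct")
```

Since every entry is distinct and has odd weight, a single error produces a syndrome with an odd number of 1s that identifies one position, while two errors in the same bit column always produce a non-zero syndrome with an even number of 1s.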

Data are handled in blocks of 247 32-bit words. Whenever a data block is written from the main memory (hardware protected) to the extended memory, an additional 9 parity words are derived from the data and appended to the block, resulting in a 256-word block. To encode a data block, the parity check is calculated over its 247 data words.

The calculation of the nine parity check words requires 2223 look-ups and 1112 XOR operations or, for each 256 bytes of the block, 556 look-ups and 278 XOR operations.
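The encoding step can be sketched as below. This is not the flight implementation: the ordering of the table, the convention that bit j of a table entry selects check word Cj, and the reading of "inverted" as storing the bitwise complement of C1, C5 and C6 are all assumptions, and every name is illustrative.

```python
MASK32 = 0xFFFFFFFF
INVERTED = (1, 5, 6)  # check words stored inverted (assumed meaning of the text)


def build_check_bit_table() -> list[int]:
    """Odd-weight 9-bit patterns: 247 data positions, then C0..C8."""
    odd = [v for v in range(0x200) if bin(v).count("1") % 2 == 1]
    checks = [1 << j for j in range(9)]
    return [v for v in odd if v not in checks] + checks


def encode_block(data: list[int], table: list[int]) -> list[int]:
    """Append the nine check words to 247 data words, giving a 256-word block."""
    if len(data) != 247:
        raise ValueError("a data block holds exactly 247 words")
    checks = [0] * 9
    for position, word in enumerate(data):
        column = table[position]           # one table look-up per data word
        for j in range(9):
            if (column >> j) & 1:          # check word Cj covers this position
                checks[j] ^= word & MASK32
    for j in INVERTED:
        checks[j] ^= MASK32                # store C1, C5 and C6 inverted
    return [w & MASK32 for w in data] + checks


table = build_check_bit_table()
block = encode_block([0] * 247, table)
print(len(block), [hex(c) for c in block[247:]])
```

With this convention an all-zero data block is stored with three check words set to all 1s, so a block that reads back as all 0s or all 1s does not pass the check.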

The data blocks must be scanned often: their parity check words are decoded and verified, a correction is performed if a single error is detected, and all events are logged. This operation is done when the OBC is idle.

The decoding algorithm is similar to encoding, except that the parity check words are included in the verification table. The equations for the nine syndrome words S0, ..., S8 are then derived.

After that, if all syndrome words are zero, there is no error in the block. Otherwise, an error has occurred. For each bit column, the nine syndrome bits form a pattern S. If the number of 1s in S is odd, a single error has occurred and S indicates which bit must be corrected: S is compared with the entries of the check bit table, and the index of the matching entry points to the word whose bit must be inverted. If the number of 1s is even, a double error has occurred. These events are logged.
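The verification pass described above can be sketched as follows. It is a minimal illustration, not the flight software: the table ordering, the treatment of the inverted check words, the event format and all function names are assumptions. The encode_block helper exists only to build a test block; it relies on the fact that, with the check words zeroed, the syndromes equal the check words to be stored.

```python
MASK32 = 0xFFFFFFFF
INVERTED = (1, 5, 6)  # check words stored inverted (assumed meaning of the text)


def build_check_bit_table():
    """Odd-weight 9-bit patterns: 247 data positions, then C0..C8."""
    odd = [v for v in range(0x200) if bin(v).count("1") % 2 == 1]
    checks = [1 << j for j in range(9)]
    return [v for v in odd if v not in checks] + checks


def syndromes(block, table):
    """Return the nine 32-bit syndrome words S0..S8 of a 256-word block."""
    s = [0] * 9
    for position, word in enumerate(block):
        for j in range(9):
            if (table[position] >> j) & 1:
                s[j] ^= word
    for j in INVERTED:
        s[j] ^= MASK32  # undo the stored inversion
    return s


def encode_block(data, table):
    """Demo helper: syndromes over zeroed check words are the check words."""
    return list(data) + syndromes(list(data) + [0] * 9, table)


def scrub_block(block, table):
    """Correct single errors in place and return the list of logged events."""
    s = syndromes(block, table)
    events = []
    for bit in range(32):                       # one code word per bit column
        pattern = sum(((s[j] >> bit) & 1) << j for j in range(9))
        if pattern == 0:
            continue                            # no error in this column
        if bin(pattern).count("1") % 2 == 1:    # odd weight: single error
            position = table.index(pattern)
            block[position] ^= 1 << bit         # invert the faulty bit
            events.append(("single", position, bit))
        else:                                   # even weight: double error
            events.append(("double", None, bit))
    return events


table = build_check_bit_table()
block = encode_block([(i * 2654435761) & MASK32 for i in range(247)], table)
block[100] ^= 1 << 5                 # inject one upset
print(scrub_block(block, table))     # [('single', 100, 5)]
block[3] ^= 1 << 7
block[200] ^= 1 << 7                 # two upsets in the same bit column
print(scrub_block(block, table))     # [('double', None, 7)]
```

Upsets in different bit columns of the same block are handled independently, because each column carries its own nine check bits.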

The calculation of the nine syndrome words requires 2304 look-ups and 1142 XOR operations or, for each 256 bytes of the block, 576 look-ups and 288 XOR operations.

4. CONCLUSION

The use of high-density memory components in space applications is recent, and the available literature has no conclusive information about the effects of ionizing radiation, mainly heavy particles. This calls for very good memory protection that respects the constraints of mass, power, volume and computing performance.

The solution adopted for this problem includes a simple way to spread memory positions within a component, protecting it against multiple errors caused by a heavy particle. The proposed error protection method uses 3.5% of the memory, which is a very good rate, especially considering that the main memory protection uses 37.5%. An additional benefit is that the data stored in the extended memory are ready to be transmitted to ground.

The proposed method protects at least eight times more than the Reed-Solomon method while spending only double the parity check words. In addition, it is easier to implement.

5. REFERENCES

1.   J.A.C.F. Neri, M.E. Saturno, et al., The Brazilian Scientific Microsatellite SACI-1, International Academy of Astronautics Symposium on Small Satellites for Earth Observation, November 4-8, 1996, Berlin, Germany.

2.   M.S. Hodgart, C.I. Underwood, J.W. Ward, Single Event Upset Error Protection For Solid State Data Memory On Microsatellites, Proceedings of the 5th Annual AIAA/USU Conference on Small Satellites, August 26-29, 1991, Logan, Utah, USA.

3.   A.R. Paula, M.E. Saturno, F. Vargas, R. Velazco, A Strategy Allowing to use Commercial Circuits in Mass Memory of Micro-Satellites, International Workshop in Computer Aided Design, Test and Evaluation for Dependability, July 2-3, 1996, Beijing, China.

4.   S. Lin, D.J. Costello Jr., Error Control Coding: Fundamentals and Applications, 1983, Prentice-Hall, Inc., Englewood Cliffs.