
9. Maintenance

9.1 Activating a spare

When running a RAID 3 or 5 set (provided you configured one or more drives to be spares), the 5070 will detect when a drive goes offline and automatically select a spare from the spares pool to replace it. The data will be rebuilt on the fly. The RAID will continue to operate normally during the reconstruction process (i.e. it can be read from and written to just as if nothing had happened). When a backend fails, you will see messages similar to the following displayed on the 5070 console:

930 secs: Redo:1:1 Retry:1 (DIO_cim_homes_D1.1.0_q1) CDB=28(Read_10)Re-/Selection
 Time-out @682400+16
932 secs: Redo:1:1 Retry:2 (DIO_cim_homes_D1.1.0_q1) CDB=28(Read_10)Re-/Selection
 Time-out @682400+16
933 secs: Redo:1:1 Retry:3 (DIO_cim_homes_D1.1.0_q1) CDB=28(Read_10)Re-/Selection
 Time-out @682400+16
934 secs: CIO_cim_homes_q3 R5_W(3412000, 16): Pre-Read drive 4 (D1.1.0)
 fails with result "Re-/Selection Time-out"
934 secs: CIO_cim_homes_q2 R5: Drained alternate jobs for drive 4 (D1.1.0)
934 secs: CIO_cim_homes_q2 R5: Drained alternate jobs for drive 4 (D1.1.0)
 RPT 1/0
934 secs: CIO_cim_homes_q2 R5_W(524288, 16): Initial Pre-Read drive 4 (D1.1.0)
 fails with result "Re-/Selection Time-out"
935 secs: Redo:1:0 Retry:1 (DIO_cim_homes_D1.0.0_q1) CDB=28(Read_10)SCSI
 Bus ~Reset detected @210544+16
936 secs: Failed:1:1 Retry:0 (rconf) CDB=2A(Write_10)Re-/Selection Time-out
 @4194866+128

Then you will see the spare being pulled from the spares pool, spun up, tested, and engaged, and the data being reconstructed onto it:

937 secs: autorepair pid=1149 /raid/cim_homes: Spinning up spare device
938 secs: autorepair pid=1149 /raid/cim_homes: Testing spare device/dev/hd/1.5.0/data
939 secs: autorepair pid=1149 /raid/cim_homes: engaging hot spare ...
939 secs: autorepair pid=1149 /raid/cim_homes: reconstructing drive 4 ...
939 secs: 1054
939 secs: Rebuild on /raid/cim_homes/repair: Max buffer 2800 in 7491 reads,
 priority 6 sleep 500

The rebuild script will print out its progress each time another 10% of the job is completed:

939 secs: Rebuild on /raid/cim_homes/repair @ 0/7491
1920 secs: Rebuild on /raid/cim_homes/repair @ 1498/7491
2414 secs: Rebuild on /raid/cim_homes/repair @ 2247/7491
2906 secs: Rebuild on /raid/cim_homes/repair @ 2996/7491
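
These progress lines also give a rough idea of how long the rebuild will take. As an illustration only (your counts and times will differ), the sample log above shows 1498 of the 7491 reads completed in the 981 seconds between the 939 and 1920 second marks, so the remaining reads would take roughly:

    (1920 - 939) secs / 1498 reads        ~= 0.65 secs per read
    0.65 secs/read * (7491 - 1498) reads  ~= 3900 secs (about an hour)

The actual rate will depend on the rebuild priority and sleep settings and on how heavily the RAID is being used at the time.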

9.2 Re-integrating a repaired drive into the RAID (levels 3 and 5)

After you have replaced the bad drive, you must re-integrate it into the RAID set using the following procedure.

  1. Start the text GUI (agui).
  2. Look at the list of backends for the RAID set(s).
  3. Backends that have been marked faulty will have a (-) to the right of their ID (e.g. D1.1.0-).
  4. If you set up spares, the ID of the faulty backend will be followed by the ID of the spare that has replaced it (e.g. D1.1.0-D1.5.0).
  5. Write down the ID(s) of the faulty backend(s) (NOT the spares).
  6. Press Q to exit agui.
  7. At the husky prompt type:
    replace <name> <backend> 
    

Where <name> is whatever you named the RAID set and <backend> is the ID of the backend being re-integrated into the RAID. If a spare was in use, it will be automatically returned to the spares pool. Be patient; reconstruction can take anywhere from a few minutes to several hours, depending on the RAID level and the size of the set. Fortunately, you can use the RAID as you normally would during this process.
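
For example, using the names from the sample log messages in the previous section (yours will differ), where the RAID set is named cim_homes, the failed backend was D1.1.0, and the spare D1.5.0 took its place, you would type at the husky prompt:

    replace cim_homes D1.1.0

The spare D1.5.0 would then be automatically returned to the spares pool.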

