This page contains some notes on how to recover from failed disks, etc. on Glued Sun boxes using the Solstice DiskSuite package. Since there is currently only a single Sun server in Physics, we are addressing that specific configuration.
The existing system has 2 hot-swappable internal 36GB disk drives, named c0t0d0 and c0t1d0, which we refer to as disk 0 and disk 1, respectively. Please note that the physical label on the machine refers to them as drives 1 and 2, with 1 being the lower. Both drives are identically partitioned, and every partition in use is mirrored, with the exception of partition #7 (c0t0d0s7 and c0t1d0s7), which stores the DiskSuite state meta-database replicas.
The two disks should have the same partitions defined on each. Note that slice 7 is used to hold the DiskSuite state meta-database replicas on Physics servers, and that we generally put 3 replicas on each disk.
Physics uses the following naming scheme to refer to the various meta-devices, as it makes it simple to figure out which submirrors are part of a mirror and which partition a submirror refers to. The scheme is:

dXY, where X is the slice/partition number + 1 and Y is the disk number. So X runs from 1 to 8, and Y is 0 or 1. Slice c0t0d0s4 would map to d50, and c0t1d0s2 would map to d31. (The +1 comes in because d00, d01, etc. do not exist.)
dX, where, as above, X is the slice/partition number + 1. So d5 is the mirror consisting of d50 on disk 0 (c0t0d0s4 to be precise) and d51 on disk 1 (c0t1d0s4).
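As a quick illustration, here is a small Bourne shell helper that applies this scheme. The function name is ours and purely hypothetical, not part of DiskSuite:

# Print the submirror name for a given disk and slice.
# Usage: submirror_name <disk: 0 or 1> <slice: 0-7>
# e.g. "submirror_name 0 4" prints d50 (the submirror on c0t0d0s4).
submirror_name() {
    echo "d`expr $2 + 1`$1"
}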
Everything is mirrored, so the system should be able to remain up despite a single drive failure. However, there are issues related to rebooting in such a case. If possible, avoid rebooting until a systems administrator is available.
There are two issues related to rebooting after a drive fails. One is that DiskSuite requires more than half of the replicas to be available in order to reboot. Since we carefully put equal numbers of replicas on each disk (3 per disk, 6 in all), a single disk failure leaves exactly half of the replicas available, and exactly half is not good enough. The solution is simple: just delete the replicas on the failed disk, so that all of the remaining replicas are available. Fortunately, this can be done even after the disk has failed, and it is best done before the reboot. If the system reboots before this can be done, it can only be brought up in single-user mode (password on box).
The other issue is that the OpenBoot console software does not know anything about mirroring. It needs the name of a drive to boot from, and it defaults to the first drive (c0t0d0s0, drive 0, labelled 1 on the case). If that drive fails, the system won't boot normally, and you will have to issue the boot command manually with a drive specification, e.g.

boot altboot

or

boot /pci@1f,4000/scsi@3/disk@1,0

If the device alias altboot was defined properly, the first version should work. Otherwise, you will need to give the full specification for the drive. The above shows what it should be on the Physics dept. Sun server.
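If the altboot alias has not been defined, it can be created from the OpenBoot prompt with nvalias, so that it persists in NVRAM across power cycles; the device path below is the one for the Physics server shown above. Check the existing aliases with devalias first:

devalias
nvalias altboot /pci@1f,4000/scsi@3/disk@1,0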
If the system is still up ...

Run /usr/opt/SUNWmd/sbin/metadb -i. The valid replicas should have flags a, l and possibly m, p, and u. The problem replicas will likely have one or more capital letter flags, like M, W, D, F or R. The partition holding the replica appears on the right (note there will be 3 lines per partition, as there are 3 replicas per disk); the number following the t indicates whether it is disk 0 or 1.
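For example, to look at just the replica lines for disk 1 (the grep filter is our suggestion; adjust the pattern for disk 0):

/usr/opt/SUNWmd/sbin/metadb -i | grep c0t1d0s7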
Also run

/usr/opt/SUNWmd/sbin/metastat | less

and look for submirrors with a state other than "Okay". The submirror name should end in 0 or 1 depending on the disk. Generally, for a failed disk, all submirrors on that disk should have failed, so they should all agree.
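To see all of the states at a glance (again, the grep filter is our suggestion, not part of the original procedure):

/usr/opt/SUNWmd/sbin/metastat | grep State: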
If the system is not up, try booting from the OpenBoot ("Ok") prompt with the command boot with no arguments. If it quickly (shortly after the memory test) complains about being unable to open the boot device or some such, drive 0 is likely the dead one. Try boot altboot, which should work, in which case drive 1 is OK. Note that in either case it will not boot normally, and will complain about insufficient metadb replicas being located. You will only be able to boot to single-user mode (you will need the root password for this, which is on the case of the machine). Get into single-user mode. You can then try the metadb and/or metastat commands described under the "If the system is still up ..." case above to confirm this.
If drive 0 failed (c0t0d0), the command

/usr/opt/SUNWmd/sbin/metadb -d /dev/dsk/c0t0d0s7

will delete the replicas on the failed drive. If drive 1 failed (c0t1d0), replace the c0t0d0s7 with c0t1d0s7. Be sure to delete the correct replicas; double check the digit following the t. At this point, the system can be rebooted fully, although if drive 0 failed, you will need to issue the command boot altboot at the "Ok" prompt to specify the second drive. If the system is in single-user mode, reboot it with

shutdown -i6 -y -g0

(this is not necessary otherwise).
Once the failed drive has been replaced (and partitioned to match the surviving disk), the replicas and submirrors on it must be rebuilt. Assuming drive 0 is the one that was replaced, first recreate its 3 state database replicas on slice 7:

metadb -a -c 3 /dev/dsk/c0t0d0s7

Then re-enable the replaced component in each mirror; for the slice 0 mirror d1, that is:

metareplace -e d1 /dev/dsk/c0t0d0s0

Alternatively, the submirror can be forcibly detached, cleared, recreated, and reattached (note that metainit with just the metadevice name assumes d10 is defined in the md.tab file):

metadetach -f d1 d10
metaclear d10
metainit d10
metattach d1 d10
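After the metareplace or metattach, the mirror will resync onto the new drive; the progress (reported as a percentage) can be checked with:

/usr/opt/SUNWmd/sbin/metastat d1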