There are some preliminary notes and comments about the DiskSuite package which are relevant to its use on Glue, as follows.
The single-user password is essential for a two-disk mirror, because if a disk fails you will not have a majority of state database replicas available, and the system will not reboot in such a case until the replicas on the failed disk are deleted. This can be done either before the system reboots or from single-user mode, so you want to be sure you can access single-user mode if needed.
Alternatively, at least on Solaris 8 and 9, you can add the line

set md:mirrored_root_flag=1

to /etc/system, which will override the default behavior and allow the system to boot with exactly 1/2 of the database replicas. This setting is recommended for a two-disk mirrored configuration.
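For example, a minimal sketch of adding the flag from a root shell (assuming the line is not already present):

# append the flag to /etc/system if no mirrored_root_flag line exists yet
grep md:mirrored_root_flag /etc/system > /dev/null || \
    echo 'set md:mirrored_root_flag=1' >> /etc/system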
Because DiskSuite did not come bundled with the OS for Solaris 2.7 (it is bundled with Solaris 8 and 9), it had to be installed separately, after the fact, by Glue staff. Due to some quirks in how the package was rdisted down to the clients, there are certain symbolic links and/or pseudo-devices in the /dev and /devices trees that appear to be missing on Solaris 2.7 clients. This does not appear to be a problem on Solaris 8 or 9 boxes. At least the /dev/md/admin, /dev/md/dsk/dN, and /dev/md/rdsk/dN special files are needed; the directory tree and links under /dev/md/shared may also be needed. These all resolve to device special files in /devices/pseudo which are not normally present on Glued Solaris 2.7 systems, but I am not sure if they must exist prior to running things or if they will get created.
There are about a thousand symlinks, and another thousand devices, missing.
The script /group/pcs/project/PNCE-Unix/disksuite/disksuite-create-metadevices.pl might be of help in setting these up.
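As a quick sanity check, something like the following should show whether the key special files resolve:

# list the admin device and the metadevice directories, following symlinks
ls -lL /dev/md/admin /dev/md/dsk /dev/md/rdsk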
The directory /etc/opt/SUNWmd must exist on Solaris 2.7 systems, and its contents might have to be there as well; this also appears to be missing on Solaris 2.7 clients. Solaris 8 and 9 clients use /etc/lvm instead of the above, and that does seem to be present on Glued clients of those OSes. A sample tarball for /etc/opt/SUNWmd is available at /group/pcs/project/PNCE-Unix/disksuite/etc-opt-SUNWmd.tar.

On Solaris 2.7, the DiskSuite configuration files live in /etc/opt/SUNWmd; Solaris 8 and 9 systems store similar data in /etc/lvm. You will likely be editing the file md.tab within there, and the metadisk commands seem to alter md.cf and md.ctlrmap as well. I do not know how important any of these files (or the other files in that directory) are to the functioning of DiskSuite; I believe for the most part they are not that critical, with the critical information being stored in the metadb replicas. The md.tab file is created manually, and usually has useful comments. Anyway, I recommend copying all of the files in this directory to the server's config tree, as it is better to have stuff that is not needed than to need stuff you do not have in an emergency. I doubt symlinks will work.
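For example, a hedged sketch of such a copy on a Solaris 8 or 9 box (the destination path is hypothetical; use your actual config tree location):

# copy the whole config directory, preserving structure
cd /etc/lvm
tar cf - . | ( cd /path/to/config/tree/etc-lvm && tar xf - )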
After the above, you should be at the "normal" beginning point for using DiskSuite. The procedure from here will vary greatly depending on what is desired. This page covers some common uses in the Physics Dept, like mirroring system disks, but other uses should be fairly straightforward, and the standard resources (see next section) should be useful.
In addition, since the purpose of mirroring the system drives is to ensure that the system stays up, it is useful to be cognizant of the recovery procedure should a disk fail. In particular, there are some small tricks involved when one of a set of mirrored system disks fails. Also, familiarity with the recovery procedure may affect choices made during the initial setup, and recovery can be made easier by proper documentation during the setup. The standard resources mentioned in the next section may be of use here as well.
Before proceeding with a discussion of the setup of DiskSuite, it is useful to discuss some conventions and policies in use by Glue, the Physics department, or both. While these are not mandatory (unless you are configuring a Physics system :) ), you should probably have an idea of how you want things set up, and these may provide useful guides. It is also a good idea for you to clearly document your conventions; Physics documents them here as well as in comments in the md.tab file.
DiskSuite requires space on the disk to store its metadb database replicas. Because this database contains the critical information needed for you to access the disks, it must be replicated as widely as possible. You should in general spread the replicas out evenly over as many disks as are available. On a system with only 2 internal disks, the replicas will most likely be limited to those 2 disks, and should be divided equally between them (this way the system can stay up if either disk fails). We also have some systems with 4 internal drives, and in these cases we replicate the databases across all 4 drives, even if only two of them are to be mirrored.
The database replicas take up room on the disk which cannot be used for filesystems, etc. The typical partition scheme for a Glued Solaris box is as follows:

- / (root slice)
- /usr
- /usr/vice/cache (AFS cache)
- /var
- /usr/afs (on AFS servers; maybe DB replicas)
As can be seen, the standard Glue setup uses most of the slices available. Slice 2 might be usable, but I would recommend against it, especially on a system disk. That leaves slices 6 and 7 free. Physics generally puts the DB replicas on one of these 2 slices. The replicas are not that large, 8192 blocks or about 4 MB, and we usually put 2-3 copies on each disk. (NOTE: it is important to spread the copies out over multiple disks, and to have the same number of replicas on each disk.) Since I dislike making slices smaller than 50 or so MB, we usually waste a fair amount of space anyway. The other slice may hold additional local space if the disk is big enough that I cannot justify expending the entire disk on system slices.
However, if you hope to make the system an AFS server (thus using slice 6), and possibly put data on slice 7, you have a problem, as there are no more partitions free in which to put the DB replicas. Fortunately, there is a way around that, at least if you do the mirroring before making the system an AFS server: DiskSuite can share a slice between the DB replicas and a filesystem in some cases.
Because it is unwise to have DiskSuite manage a /vicep partition on an AFS server, and since you would want the AFS server software mirrored also, the best bet is to mirror the system before the AFS server software is installed. Put the DB replicas on slice 6; mirror root (/), /usr, /var, swap, and the AFS cache as normal; then create an empty metadevice on slice 6, newfs it, and mount it on /usr/afs.
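A minimal sketch of that last step, assuming the Physics naming scheme described below (slice 6 maps to d7/d70/d71) and two disks c0t0d0 and c0t1d0; since slice 6 also holds the DB replicas here, DiskSuite builds the metadevice around them:

# submirrors on each disk's slice 6
metainit d70 1 1 c0t0d0s6
metainit d71 1 1 c0t1d0s6
# one-way mirror, then attach the second submirror
metainit d7 -m d70
metattach d7 d71
# create a filesystem on the mirror and mount it
newfs /dev/md/rdsk/d7
mount /dev/md/dsk/d7 /usr/afs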
Some example configurations from Physics:

- /usr/afs on slice 6, and three DB replicas on slice 7. Slices 0, 1, 3, 4, 5, and 6 are all mirrored.
- DB replicas sharing slice 6 with /usr/afs, and slice 7 containing the extra space. Slices 0, 1, 3, 4, 5, and 6 are mirrored. Slice 7 may or may not be mirrored (definitely not if used as a vice partition).
- On systems with 4 internal drives, DB replicas spread over all the disks, with slice 6 holding /usr/afs. Slices 0, 1, 3, 4, 5, and 6 on the system disks are mirrored. Slices 0-6 on the other two disks are available, and may or may not be mirrored (definitely not if used as a vice partition).
In addition to the locations of the DB replicas, you will need to come up with a naming scheme for the mirrors and submirrors. Each slice to be mirrored will need a distinctly named DiskSuite submirror device on each of the disks being mirrored, and in addition the redundant mirrored device also needs to be named. The metadevice names should be of the form dM or dMN, where M and N are digits and M cannot be 0 (e.g. d01 is not allowed).
You probably want to come up with a reasonable naming scheme; ideally it should make it obvious what slice the mirror metadevice refers to, and what disk and slice each submirror refers to. Physics uses the following:
- Submirrors: dMN, where M is the slice/partition number + 1 and N is the disk number. So M runs from 1 to 8 (the plus 1 eliminating the problematic 0 value), and N is generally 0 or 1. E.g., the submirror for c0t0d0s4 would be called d50, and for c0t1d0s2 would be d31.
- Mirrors: dM, where, as above, M is the slice/partition number + 1. E.g., d5 is the mirror consisting of d50 on disk 0 (c0t0d0s4 to be precise) and d51 on disk 1 (c0t1d0s4).
Regardless of whether you want to do logging, mirroring, striping, or RAID, you need to create the metadb DB replicas for Disksuite. Because this step is so universal, it is being covered in its own section.
Before creating the DB replicas, you should have partitioned the disks, leaving a slice free for the replicas; you can use the format command to do this. If you are mirroring the disks, you want them to have the same partition structure anyway, so once the first disk is set up, you can use the command

prtvtoc /dev/rdsk/DISK1 | fmthard -s - /dev/rdsk/DISK2

to copy the partition table from DISK1 to DISK2.
We are now ready to create the state meta-databases. First, make sure no one configured DiskSuite without your knowledge by checking for the existence of DB replicas with the command metadb. Solaris 2.7 users may have to give a full path to the metadb command, e.g. /usr/opt/SUNWmd/sbin/metadb. On Solaris 8 and 9 it is in /usr/sbin, which should be in your path. This should return an error complaining that "there are no existing databases". It might also just return nothing (usually indicating that DB replicas were set up once and then all were deleted).
If you get a list of replicas, STOP. Someone set up (or tried to set up) DiskSuite before you; figure out what the status is before proceeding further. Using the command below to try to create another initial database set will hopefully yield an error, but if not it could be disastrous, wiping out the previous DB and making the previously mirrored, striped, etc. disks inaccessible.
For a two disk system, Sun advises a minimum of 2 replicas per disk; Physics uses 3. To create the initial replicas, issue the command (as root):

metadb -a -f -c 3 SLICE

where SLICE is the slice to hold the replicas, e.g. /dev/dsk/c0t0d0s7 to put them on slice 7 of the 1st disk.
The -c 3 in the above command instructs it to put three copies of the DB there. The -a says we are adding replicas, and the -f forces the addition. NOTE: the -f option should only be used for the initial DB replicas, when it is REQUIRED to avoid errors due to the lack of any previously existing replicas.
You can check that the databases were created successfully by issuing the metadb command without arguments, or with just the -i argument to get a legend for the flags. You should now see 3 (or whatever value you gave to the -c argument) DB replicas in successive blocks on the slice you specified. At this point, only the a (active) and u (up-to-date) flags should be set.
Now add the replicas on the second (and any other) drives. This is done with a command like:

metadb -a -c 3 /dev/dsk/c0t1d0s7

If you make a mistake, you can use the -d option to delete all replicas on the named partition, and then re-add the correct number.
Again, you can use the plain metadb command (or give it the -i option for the flags legend) to verify the databases were created successfully. This command is also useful later to verify things are OK. At this early stage, you should see a line with the a and u flags for each replica on each disk.
Once DiskSuite is fully functioning and operational, you should again see a line for each replica on each disk. The following flags seem to be set on a functioning system (flags should appear for every replica unless otherwise stated):

- a: the replica is active. This should always be set.
- m: flags the master replica (only one replica should have this set, usually the first).
- p: the replica is patched into the kernel. This should get set after the first reboot (why? what does it mean?).
- l: the replica was read successfully. This should get set after the first reboot?
- u: the replica is up-to-date. This should always be set.
- o: the replica was active prior to the last database change. This should get set after the first reboot.
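For reference, on a healthy two-disk system with three replicas per disk, the metadb -i output should look something like the following (block offsets and device names are illustrative, not exact):

        flags           first blk       block count
     a m  p  luo        16              8192            /dev/dsk/c0t0d0s7
     a    p  luo        8208            8192            /dev/dsk/c0t0d0s7
     a    p  luo        16400           8192            /dev/dsk/c0t0d0s7
     a    p  luo        16              8192            /dev/dsk/c0t1d0s7
     a    p  luo        8208            8192            /dev/dsk/c0t1d0s7
     a    p  luo        16400           8192            /dev/dsk/c0t1d0s7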
This section gives instructions on setting up mirroring of the root disk on a new Glued Solaris box with 2 internal disks. Most of the instructions can be easily adapted for systems with more than 2 disks, or if mirroring something other than the root disk.
The instructions assume that we are enabling mirroring on a newly installed, non-production machine. These restrictions are not strictly required, but obviously enabling mirroring of the system disk on a system already in production runs some risks. The system WILL need to be rebooted at least once when mirroring the system disk; if a non-system disk is being mirrored you can probably get away with just umounting and remounting the partitions being mirrored at the appropriate times. NOTE: all filesystems being mirrored must be mounted using the single-ply metadisk mirror before attaching the second submirror to the mirror metadevice; otherwise new data written to the filesystem will only be written to one submirror, and DiskSuite will get very upset upon an attempt to reboot (as the two submirrors are listed as being in sync but are not); if you do this with root you will probably not even be able to get to single-user mode.
First, duplicate the partition structure of the first disk onto the second, copying the partition table from DISK1 to DISK2:

prtvtoc /dev/rdsk/DISK1 | fmthard -s - /dev/rdsk/DISK2
Next, create the state database replicas as described in the previous section. E.g., for disks c0t0d0 and c0t1d0, you would use:

metadb -f -c3 -a /dev/dsk/c0t0d0s7
metadb -c3 -a /dev/dsk/c0t1d0s7
Define the metadevices in the md.tab file (found in /etc/opt/SUNWmd on Solaris 2.7 systems, and /etc/lvm on Solaris 8 and 9 systems); this has the advantage of allowing comments and a more lasting record of what was done (especially if you save a copy in the Glue config tree).
Each slice to be mirrored gets a submirror definition on each disk. E.g., the submirrors for slice 5 (/var) would be defined with:

/dev/md/dsk/d60 1 1 /dev/dsk/c0t0d0s5
/dev/md/dsk/d61 1 1 /dev/dsk/c0t1d0s5
Define each mirror metadevice with the -m option and the name of the submirror metadevice corresponding to the disk/slice that currently has data on it. The only submirror that should be mentioned in the mirror metadevice definitions in the md.tab file should be the one referring to the currently mounted physical device. Do not mention the other submirror in the mirror definition or there can be data loss on the currently mounted slice. Assuming we installed the system on the first disk from the previous examples, the mirror metadevice for /var (slice 5) would be defined with:

/dev/md/dsk/d6 -m /dev/md/dsk/d60
Note that the md.tab file does not actually define the metadevices, but stores options to be used by the metainit command when called with insufficient arguments, much like /etc/vfstab can be used to store options for mount. Because of the standard partitioning scheme in use by Glue, these md.tab files are pretty standard, with usually just the names of the physical devices for the two disks needing to be changed.
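Putting the pieces together, a hypothetical md.tab fragment for root (slice 0) and /var (slice 5) on a system installed on c0t0d0 might read:

# root (slice 0): submirrors d10/d11, mirror d1 (only d10 in the mirror!)
/dev/md/dsk/d10 1 1 /dev/dsk/c0t0d0s0
/dev/md/dsk/d11 1 1 /dev/dsk/c0t1d0s0
/dev/md/dsk/d1 -m /dev/md/dsk/d10
# /var (slice 5): submirrors d60/d61, mirror d6 (only d60 in the mirror!)
/dev/md/dsk/d60 1 1 /dev/dsk/c0t0d0s5
/dev/md/dsk/d61 1 1 /dev/dsk/c0t1d0s5
/dev/md/dsk/d6 -m /dev/md/dsk/d60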
Review your md.tab file and make sure it is correct. In particular, make sure each mirror metadevice definition refers to the submirror for the slice which has the good filesystem on it.
Initialize the submirrors for the slices that currently hold data, using metainit -f SUBMIRROR_NAME. E.g., the /var slice would have its submirror device initialized, retaining the current contents of /var, with the command:

metainit -f d60

The -f is required because the physical device referenced in the md.tab file for submirror d60 is currently mounted. Similar commands should be issued for each slice to mirror.
Initialize the submirrors on the second disk with metainit SUBMIRROR_NAME. E.g., initialize the second submirror for /var with the command:

metainit d61

No -f flag is needed in this case. Similar commands should be issued for each slice to mirror.
Now create the (single-ply) mirror metadevices with metainit MIRROR_NAME. E.g., the mirror for /var would be initialized with:

metainit d6

This creates a one-way mirror containing only the submirror d60 (because that is how d6 was defined in the md.tab file). Similar commands should be issued for each slice to mirror. Note: it is important that the single submirror be the submirror referring to the currently active/mounted disk/slice, otherwise data loss may occur.
You may want to review your md.tab
file again
before issuing the commands to create the mirrors.
At this point, run the metastat command to see that all the submirrors and single-ply mirrors were set up correctly. Everything should be indicating a state of "Okay". Mirror metadevices should list a single submirror corresponding to the currently mounted device (if the other submirror is listed, correct it NOW to avoid data loss), and submirrors should list the physical device they are associated with and the mirror metadevice they are attached to. The unattached submirrors (i.e. those referring to slices not currently mounted) will show up as "Concat/Stripe" at this time.
If anything looks wrong or a mistake was made, FIX IT NOW to avoid data loss. You should be able to metadetach the incorrect mirror, and possibly metaclear the problematic mirror and submirrors; edit the md.tab file to correct things, and recreate the affected submirrors and mirror.
If only non-system filesystems are being mirrored, you may be able to just umount the filesystem, edit vfstab, and remount. But most likely a reboot will be required after making the changes below.
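For a non-system filesystem, the sequence would be something like the following sketch (the /export mount point and mirror d8 are hypothetical):

umount /export
# edit /etc/vfstab: change the device fields to /dev/md/dsk/d8 and /dev/md/rdsk/d8
mount /export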
If the root (/) filesystem is being mirrored, you MUST run the metaroot command. Usage is:

metaroot ROOT_MIRROR_METADEVICE

This updates the /etc/vfstab entry for root, but more importantly it modifies /etc/system so that the kernel knows where the root filesystem is. Failure to run this command might result in an inability to boot the system. Using the standard Physics naming scheme, the root device on slice 0 will get mirrored to the metadevice d1, and so you would issue the command:

metaroot d1
Afterwards, the /etc/vfstab file should list /dev/md/dsk/d1 as the device to mount on /. Also, the text file /etc/system should contain a new stanza with metadisk related stuff in it, something like:
* Begin MDD root info (do not edit)
forceload: misc/md_trans
forceload: misc/md_raid
forceload: misc/md_hotspares
forceload: misc/md_sp
forceload: misc/md_stripe
forceload: misc/md_mirror
forceload: drv/pcisch
forceload: drv/mpt
forceload: drv/sd
rootdev:/pseudo/md@0:0,1,blk
* End MDD root info (do not edit)
For each of the other filesystems being mirrored, edit /etc/vfstab to have the filesystem mounted using the mirror metadevice. You can use the root entry as modified by metaroot as a guide. Following our previous examples, the line for /var would be something like:

/dev/md/dsk/d6 /dev/md/rdsk/d6 /var ufs 1 no logging
If you are mirroring root (/), /usr, or other critical system filesystems, you must reboot before proceeding further. The filesystems MUST be remounted using the mirror metadevices before attaching the second submirror to the mirror device, or serious problems (including an inability to boot even to single user mode) can occur.
While the system is down for the reboot, you may wish to define an OpenBoot alias for the alternate boot disk, e.g.:

devalias altboot /pci@1f,4000/scsi@3/disk@1,0

(see the discussion of boot aliases at the end of this section).
After the reboot, run the mount command and make sure all filesystems to be mirrored are listed as mounted under the metadisk mirror device, not the physical device. Do not proceed any further unless this is the case, as attaching the second submirror while the physical device for the first submirror is mounted will result in metadisk syncing the disks, but any writes to the filesystem will go to one disk only, thereby breaking the synchronization. What is worse is that metadisk thinks they are synchronized, and will fail hard. You might not even be able to get into single user mode on the next reboot. Make sure all filesystems to be mirrored are mounted with the metadisk mirror device before attaching the other submirror. This generally means a reboot after editing vfstab as indicated in the previous step.
Now attach the second submirrors, with metattach MIRROR SUBMIRROR2. This causes SUBMIRROR2 to be attached to MIRROR, and will cause the newly attached submirror to be synchronized with the previous submirror, copying information from the old submirror to the new one. For the /var filesystem in our example, the command is:

metattach d6 d61

Repeat this for all filesystems to mirror. Because the synchronization can take a while and keep the disks quite busy, you may want to allow each slice to finish synchronizing before starting the next on a production system (see the metastat command below). On non-production systems I usually let them all sync in parallel.
Use metastat to see how things are going. All the mirrors you recently attached should be syncing up. This can take some time. The new submirrors should now show up as "Submirrors" and not as "Concat/Stripes". States for the newly attached submirrors will be "Resyncing" but should eventually change to "Okay" when synchronized.
If swap is mirrored, update the crash dump device to point at the swap metadevice with the command:

dumpadm -d `swap -l | tail -1 | awk '{print $1}'`

With the standard Physics naming scheme, this should set the dump device to /dev/md/dsk/d2. You can verify this at a later point by issuing the command dumpadm without any arguments. Failure to do this may cause serious problems if a crash dump ever gets written, and may make the system unbootable.
If you mirrored the root filesystem (/), you need to make the mirror disk bootable. On SPARC systems, this can be done with the command:

installboot /usr/platform/`uname -i`/lib/fs/ufs/bootblk /dev/rdsk/ROOT2

where ROOT2 is the root slice of the mirrored disk, e.g. c0t1d0s0.
You will also want to set up a boot alias for the mirror disk, so you can boot off it easily if the primary disk fails. (This could have been done from the OpenBoot prompt during the reboot after editing the vfstab file.) If you did not do it then, you can do it now, as follows:
First, find the physical device path of the mirrored root disk by listing its symlink under /dev/dsk and extracting everything following the /devices part. E.g., if the mirrored root disk is c1t1d0s0, issue the command ls -l /dev/dsk/c1t1d0s0. This should return something like:

benfranklin:~# ls -l /dev/dsk/c1t1d0s0
lrwxrwxrwx 1 root root 70 May 4 2004 /dev/dsk/c1t1d0s0 -> ../../devices/pci@8,600000/SUNW,qlc@4/fp@0,0/ssd@w21000004cf8a6403,0:a

from which the physical path is /pci@8,600000/SUNW,qlc@4/fp@0,0/ssd@w21000004cf8a6403,0:a.
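A one-liner sketch for extracting that path (assumes the standard ../../devices/... symlink layout shown above):

ls -l /dev/dsk/c1t1d0s0 | awk '{print $NF}' | sed 's!.*/devices!!'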
Next, check for any existing NVRAM boot definitions with the command eeprom nvramrc. If it returns something like "data not available", you can just go ahead. Otherwise, you need to determine whether the previous definitions should be kept or not. If you want to wipe them out, just proceed as if there were no previous definitions; otherwise you need to cut and paste the previous definitions into the next command.
eeprom "nvramrc=OLDDEFS devalias mirror PHYSPATH
OLDDEFS
are any old definitions in the NVRAM that you wish
to keep, and PHYSPATH
is the physical path to the secondary boot
disk. For the above example, assuming no previous definitions exist (or want
to keep), you would have:
eeprom "nvramrc=devalias mirror /pci@8,600000/SUNW,qlc@4/fp@0,0/ssd@w21000004cf8a6403,0:a"
Alternatively, from the OpenBoot ok prompt, you can use:

nvalias mirror /pci@8,600000/SUNW,qlc@4/fp@0,0/ssd@w21000004cf8a6403,0:a

(You can avoid typing the long device name at the OpenBoot prompt by using the show-disks command, selecting the device you want, and then typing ^Y (control-Y) instead of the long device name in the nvalias command.)
eeprom "use-nvram?=true"
setenv use-nvramrc? true
eeprom "nvramrc"
devalias
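Once the alias is defined and the mirrors have finished resyncing, you can test it from the OpenBoot ok prompt (hedged example; this boots the system from the second disk):

boot mirror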