Tuesday, December 21, 2010

CLOSE_WAIT Connections – Tuning Solaris

This article describes a way to tune TCP parameters on Solaris to get better performance from a web server. I will show, at a high level, how a TCP connection is initiated and terminated, and then focus on how to tune the relevant TCP parameters on Solaris.

I have experienced problems where a pile of CLOSE_WAIT connections on Solaris affected the application, causing delays in response time and refused new connections.

I will start this article by explaining how a connection is initiated (the TCP three-way handshake sequence).

Let’s use the scenario of two servers (A and B) where the server A is going to initiate the connection.

1-) The first segment (SYN) is sent by Node A to Node B. This is a request for the server to synchronize sequence numbers.

Node A —— SYN —–> Node B

2-) Node B sends an acknowledgement (ACK) of Node A’s request. At the same time, Node B sends its own request (SYN) to Node A to synchronize its sequence numbers.

Node A <—– SYN/ACK —– Node B

3-) Node A then sends an acknowledgement to Node B.

Node A —— ACK ——> Node B

At this time the connection should be established.

Now let’s look at how connections are terminated (this is where the CLOSE_WAIT issue appears):

In the termination process, it is important to remember that the application process on each side of the connection must close its half of the connection independently.

Let’s suppose Node A closes its half of the connection first.

1-) Node A transmits a FIN packet to Node B.

(ESTABLISHED)             (ESTABLISHED)
Node A —- FIN —-> Node B
(FIN_WAIT_1)

2-) Node B transmits an ACK packet to Node A:

Node A <—- ACK —- Node B
(FIN_WAIT_2)             (CLOSE_WAIT)

Here is the CLOSE_WAIT issue: the application on Node B must invoke close() to close the connection on its end. If it does not, the connection stays stuck in CLOSE_WAIT for the time specified in the TCP stack. If the server has a lot of traffic and many connections in CLOSE_WAIT state, it can cause issues such as:

- Refused new connection requests.
- Slow response times.
- High processing resource utilization.

Now I will describe some tips that helped me solve problems on web servers. They consist of changing a few TCP parameters on Solaris to reduce the time a connection spends in CLOSE_WAIT, releasing such connections more quickly.

- tcp_time_wait_interval

Description: Tells TCP/IP how long to keep connection control blocks after the connection is closed. After the applications complete the TCP/IP connection, the control blocks are kept for the specified time. When high connection rates occur, a large backlog of TCP/IP connections accumulates and can slow server performance. The server can stall during certain peak periods. If the server stalls, the netstat command shows that many of the sockets opened to the HTTP server are in the CLOSE_WAIT or FIN_WAIT_2 state. Visible delays can occur for up to four minutes, during which the server does not send any responses, but CPU utilization stays high, with all of the activity in system processes.

1-) Verify the current value:
ndd -get /dev/tcp tcp_time_wait_interval
2-) Set the new value:
ndd -set /dev/tcp tcp_time_wait_interval 60000
(Default value is 240000 milliseconds = 4 minutes. Recommended is 60000 milliseconds.)

- tcp_fin_wait_2_flush_interval

Specifies how long a connection may remain in the FIN_WAIT_2 state before being flushed.

1-) Verify the current value:
ndd -get /dev/tcp tcp_fin_wait_2_flush_interval
2-) Set the new value:
ndd -set /dev/tcp tcp_fin_wait_2_flush_interval 67500
(Default value is 675000 milliseconds. Recommended is 67500 milliseconds.)

- tcp_keepalive_interval

The keepalive packet ensures that a connection stays in an active, established state.

1-) Verify the current value:
ndd -get /dev/tcp tcp_keepalive_interval
2-) Set the new value:
ndd -set /dev/tcp tcp_keepalive_interval 300000
(Default value is 7200000 milliseconds. Recommended is 15000 milliseconds.)

- tcp_conn_req_max_q (connection backlog)

If the backlog is too small, a high number of incoming connections results in failures.

1-) Verify the current value:
ndd -get /dev/tcp tcp_conn_req_max_q
2-) Set the new value:
ndd -set /dev/tcp tcp_conn_req_max_q 8000
(Default value is 128. Recommended is 8000.)

These configuration changes help improve system performance and, better than that, help reduce major impacts. I have experienced situations where the application stopped responding because of a large number of connections in CLOSE_WAIT. In my case we identified a bug in the application and used these tunings as a workaround. They are very useful and can help when you are experiencing problems caused by many connections in this state.

Reference: This article was inspired by IBM’s “Tuning Solaris systems” in the WebSphere Application Server Information Center.

Additional info:

Local Server closes first:
ESTABLISHED -> FIN_WAIT_1 -> FIN_WAIT_2 -> TIME_WAIT -> CLOSED.

Remote Server closes first:
ESTABLISHED -> CLOSE_WAIT -> LAST_ACK -> CLOSED.

Local and Remote Server close at the same time:
ESTABLISHED -> FIN_WAIT_1 -> CLOSING -> TIME_WAIT -> CLOSED.
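If you suspect this problem, a quick check is to count sockets per TCP state. The snippet below runs the counting pipeline over inlined sample netstat-style lines so it can be demonstrated anywhere; on a live Solaris server you would feed the output of netstat -an into the same awk/sort/uniq chain.

```shell
# Count sockets per TCP state. Sample "netstat -an"-style lines are
# inlined here; on a real server, replace the printf with: netstat -an
netstat_sample='10.0.0.1.80  10.0.0.2.51515  49640 0 49640 0 ESTABLISHED
10.0.0.1.80  10.0.0.3.51516  49640 0 49640 0 CLOSE_WAIT
10.0.0.1.80  10.0.0.4.51517  49640 0 49640 0 CLOSE_WAIT'

# The state is the last field; count occurrences of each state.
printf '%s\n' "$netstat_sample" | awk '{print $NF}' | sort | uniq -c | sort -rn
```

A large and growing CLOSE_WAIT count here is the symptom described above.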

Monday, December 20, 2010

Backup commands – usage and examples

Backup commands – ufsdump, tar , cpio
Unix backup and restore can be done using the commands ufsdump, tar, and
cpio. Though these commands may be sufficient for small setups, for an
enterprise backup you have to go for a dedicated backup and restore
solution such as Symantec NetBackup, EMC NetWorker, or Amanda.
Any backup solution built on these commands depends on the type of backup you
are taking and on the capability of the command to fulfill the requirement. The following
paragraphs give an idea of the commands, their syntax, and examples.

Features of ufsdump , tar , cpio

ufsdump
1. Used for complete file system backups.
2. It copies everything from regular files in a file system to special character and block device files.
3. It can work on mounted or unmounted file systems.

tar:
1. Used for single or multiple file backups.
2. Cannot back up special character & block device files (they end up archived as 0-byte files).
3. Works only on mounted file systems.

cpio:
1. Used for single or multiple file backups.
2. Can back up special character & block device files.
3. Works only on mounted file systems.
4. Needs a list of the files to be backed up.
5. Preserves hard links and time stamps of the files.

Identifying the tape device in Solaris

dmesg | grep st

Checking the status of the tape drive

mt -f /dev/rmt/0 status

Backup restore and disk copy with ufsdump :

Backup file system using ufsdump
ufsdump 0cvf /dev/rmt/0 /dev/rdsk/c0t0d0s0
or
ufsdump 0cvf /dev/rmt/0 /usr

To restore a dump with ufsrestore

ufsrestore rvf /dev/rmt/0
ufsrestore can also run in interactive mode, allowing selection of individual files and
directories with the add, ls, cd, pwd and extract commands:
ufsrestore if /dev/rmt/0

Making a copy of a disk slice using ufsdump


ufsdump 0f - /dev/rdsk/c0t0d0s7 | (cd /mnt/backup; ufsrestore xf -)

Backup restore and disk copy with tar :


Backing up all files in a directory, including subdirectories, to a tape device (/dev/rmt/0):

tar cvf /dev/rmt/0 *

Viewing a tar backup on a tape

tar tvf /dev/rmt/0

Extracting tar backup from the tape

tar xvf /dev/rmt/0
(Restoration goes to the present directory or to the original backup path, depending on
whether relative or absolute path names were used for the backup.)
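To illustrate the relative-path point without a tape drive, here is a small self-contained run against a scratch directory; a plain archive file stands in for /dev/rmt/0, and the paths and file names are made up for the example.

```shell
# Create a scratch directory with one file in a subdirectory.
tmp=$(mktemp -d)
mkdir "$tmp/data"
echo hello > "$tmp/data/file1.txt"

# Archive with a relative path (a file stands in for /dev/rmt/0);
# extraction of this archive would land under the current directory.
( cd "$tmp" && tar cf backup.tar data )

# View the backup, the equivalent of: tar tvf /dev/rmt/0
tar tf "$tmp/backup.tar"
```

Because the archive stores the relative name data/file1.txt, extracting it elsewhere recreates the tree under whatever directory you run tar xf from.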

Backup restore and disk copy with cpio :

Back up all the files in the current directory to tape.

find . -depth -print | cpio -ovcB > /dev/rmt/0
cpio expects a list of files, and the find command provides that list. cpio has
to put these files on some destination, and the > sign redirects them to the tape device. The destination can be a regular file as well.

Viewing cpio files on a tape

cpio -ivtB < /dev/rmt/0

Restoring a cpio backup

cpio -ivcB < /dev/rmt/0

Compress/uncompress files :

You may have to compress files before or after the backup, which can be done with the following commands.
Compressing a file

compress -v file_name
gzip filename

To uncompress a file

uncompress file_name.Z
or
gunzip filename
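A round trip with gzip on a throwaway file shows the naming convention in action: gzip replaces the file with file.gz, and gunzip reverses it.

```shell
# Compress and decompress a scratch file to show the round trip.
tmp=$(mktemp -d)
echo "some text" > "$tmp/file"

gzip "$tmp/file"        # produces file.gz and removes the original
ls "$tmp"               # shows only: file.gz

gunzip "$tmp/file.gz"   # restores the original file
cat "$tmp/file"
```

compress/uncompress behave the same way, using the .Z suffix instead of .gz.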

What is a sticky bit

In Unix, the sticky bit is a permission bit that protects the files within a directory. If the directory has the sticky bit set, a file in it can be deleted only by the owner of the file, the owner of the directory, or the superuser. This prevents users from deleting other users’ files from public directories. A t or T in the access permissions column of a directory listing indicates that the sticky bit has been set, as shown here:

drwxrwxrwt 5 root sys 458 Oct 21 17:04 /public

The sticky bit can be set with the chmod command. You assign the octal value 1 as the first number in a series of four octal values.

# chmod 1777 public
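You can verify the effect on a scratch directory: after chmod 1777 the listing shows a t in place of the final x, exactly as in the /public example above.

```shell
# Create a scratch directory and set the sticky bit on it.
tmp=$(mktemp -d)
mkdir "$tmp/public"
chmod 1777 "$tmp/public"     # 1 = sticky bit, 777 = rwx for everyone

# The permission string ends in "t" instead of "x".
ls -ld "$tmp/public" | cut -c1-10
```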

Solaris Volume Manager (SVM) – Creating Disk Mirrors

One great thing about Solaris (x86 and Sparc) is that some really cool disk management software is built right in: SVM, or Solaris Volume Manager. In previous versions of Solaris it was called Solstice Disksuite, or just Disksuite for short, and it’s still sometimes referred to by that name by people who have been doing this long enough to have worked with it first. The point is that they are the same thing; SVM is just the new version of the tool. Today, we are going to look at what we need to create a mirror out of two disks. Actually, we’ll be creating a mirror between two slices (partitions) of two disks. You can, for example, create a mirror between the root file system slices if you want. Or, if you follow old school rules and break out /var, /usr, etc., you can mirror those as well. You can even mirror your swap slices if you don’t mind the performance hit and need that extra uptime assurance, but we’ll talk about swap in another article. For now, let’s talk about SVM and mirrors.
For the purposes of this article, I am going to assume I have a server with two SCSI hard drives; the process is the same for IDE drives, but the drive device names will be different. The device names I am going to use are /dev/dsk/c0t0d0 and /dev/dsk/c0t1d0. Notice that they are the same except for the target (t) number, which indicates the next disk on the bus. For the slices, let’s mirror the root file system on slice 0 and swap on slice 1, sound good? Good.
In order to use SVM, we have to set up what are called “meta databases”. These small databases hold all of the information pertaining to the mirrors that we create, and without them the machine won’t start. It’s important to note that the server won’t boot normally (it drops into single-user mode) if SVM is set up and it can’t find at least 50% of these meta databases. This means that you should put meta databases on your main two drives, or even distribute copies across all local drives if you want, but don’t, for any reason, put any meta databases on removable, external or SAN drives! If you do, and you ever try to start your machine with those drives gone, it won’t start! So keep them on the local drives to make your life easier later.
The disk mirroring is done after the Solaris OS (operating system) has been installed, and therefore we can be sure that the main drive is partitioned correctly since we had to do that as part of the install. However, we need to partition the second disk the same way, the disk label (partition structure) needs to be the same on both disks in the mirror.
We need to pick which partition will hold the meta databases. We already know where / and swap are going to go, and don’t forget that slice 2 is the whole-disk or backup partition, so we don’t want to use that for anything. I normally put the meta databases on slice 7, in a partition of 256MB, which is more than you need (10MB would probably do); I just like to have some room to grow. It’s important to make sure you get all the slices set up before you do the install! Now that we have determined where all the slices are going to be and what they will hold (slice 0 is / or root, slice 1 is swap, and slice 7 holds the meta information), let’s copy the partition table from disk 0 to disk 1. Luckily, you can accomplish this in one easy step, like this:

#prtvtoc /dev/rdsk/c0t0d0s2 | fmthard -s - /dev/rdsk/c0t1d0s2

Do you understand what we are doing here? We are using the prtvtoc (print vtoc, or disklabel) command to print the current partition structure, and piping it into the fmthard (format hard) command to essentially push the partition table from one drive to the other. Be sure you get the drive names absolutely correct, or you WILL destroy data! This will NOT ask you if you are sure, and there is NO WAY to undo this if you get it backwards, or wrong! Ok, the two disks now have matching labels, awesome! Next we need to create the meta databases, which will live on slice 7.

The command will look like this:
#metadb -a -c 3 -f c0t0d0s7 c0t1d0s7
See what we are doing here? We are issuing the metadb command: the -a says to add databases, the -c 3 says to add three copies (in case one gets corrupted), and the -f option is used to create the initial state databases (it is also used to force the deletion of replicas below the minimum of one; -a and -f should be used together only when no state databases exist). Lastly on the line we have the slices we want to set the databases up on. Note that we didn’t have to give the absolute or full device path (no /dev/dsk), and we added an s7 to indicate slice 7. Sweet, isn’t it?! Now we have our meta databases set up, so next we need to initialize the root slice on the primary disk. Don’t worry, even though we say initialize, it isn’t destructive. Basically, we tell the SVM software to create a meta device using that root partition, which will then be paired up with another meta device that represents the root partition of the other disk to make the mirror. The only thing here that you have to think about is what you want to call the meta devices. Each will be a “d” with a number: you will have a meta device for each partition, and those will be mirrored to create another meta device that is the mirror. Got that? I normally name them all close to each other, something along the lines of d11 for the root slice of disk 1, d12 for the root slice of disk 2, and then d10 for the mirror itself that is made up of disks 1 and 2. Make sense? You can name them anything you want, and some folks use complicated naming schemes that involve disk IDs and parts of the serial number, but I really don’t see the point in all that. The commands to initialize the root slices for both disks are as follows:

#metainit -f d11 1 1 c0t0d0s0
#metainit -f d12 1 1 c0t1d0s0
See how easy that is? We run the metainit command, using the -f again since we already have an operating system in place, we specify d11 and d12 respectively, and we want 1 physical device in the meta device (the 1 1 tells metainit to create a one to one concatenation of the disk). Again, like before, we specify the target disk, and again with no absolute device name. Take a look though and notice that we did change from s7 to s0, since we are trying to mirror slice 0 which is our root slice. Now that we have initialized the root slices of both disks, and created the two meta devices, we want to create the meta device that will be the mirror. This command will look like this:

#metainit d10 -m d11
Again, we use the metainit command, this time using -m to indicate we are creating a mirror called d10, and attaching d11. Whoah! Wait a minute pardner! Where’s d12 at you are asking? I know you are, admit it, you’re that good! I am glad you noticed. We actually will add that to the mirror (d10) later, after we do a couple other things and reboot the machine. This is a good spot to mention the metastat command. This command will show you the current status of all of your meta devices, like the mirror itself, and all of the disks in the mirror. It’s a good idea to run this once in awhile to make sure that you don’t have a failed disk that you don’t know about. For my systems, I have a script that runs from cron to check at regular intervals and email me when it sees a problem. Before we can reboot and attach d12, we have to issue the metaroot command that will setup d10 as our boot device (essentially it goes and changes the /etc/vfstab for you). Remember that this is only for a boot device. If you were mirroring two other drives (like in a server that has four disks) that you aren’t booting off of, you don’t metaroot those. The command looks like so:

#metaroot d10
How simple. That’s it! Well, that’s it for the root slice anyway. We’ll run through those same commands to mirror the swap devices, which I will put down for you here with some notes, but without all the explanation. We’ll be using numbers in the 20s for our devices: d20, d21 and d22. See if you can follow along:
(*Note: At this point, we already have the label and meta databases in place, so the prtvtoc and metadb steps aren’t needed.)

Initialize the swap slices:

#metainit d21 1 1 c0t0d0s1   (notice we changed to slice 1 (s1) for swap)
#metainit d22 1 1 c0t1d0s1

Now, initialize the mirror:

#metainit d20 -m d21
And there you go, at least for the meta device part. One thing to remember though, whether you are doing swap, or a separate set of disks, if you don’t run that metaroot command (like if it’s not the boot disk), you have to change the /etc/vfstab yourself or it won’t work. Here is where we point out a device name difference for meta devices. Instead of /dev/dsk for your mirror, the meta device is now located at /dev/md/dsk/ and then the meta device name. So, our root mirror is /dev/md/dsk/d10 and our swap mirror is /dev/md/dsk/d20. Simple huh? So for your swap mirror, you would edit /etc/vfstab and change the swap device from whatever it is now, to your meta device, which is /dev/md/dsk/d20 in this example. The rest of the entry stays the same, it’s just a different device name. Lastly, in order to make all this magic work, you have to restart the machine. Once it comes back up, you can attach the second drives of the mirror with this command:
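A hedged sketch of that vfstab edit: the sed expression below rewrites a sample swap line the same way you would by hand. The device names are this article's examples, and the line is held in a variable here; on a real system you would run sed against a backed-up copy of /etc/vfstab.

```shell
# Sample swap entry as it appears before the change. On a real system,
# back up /etc/vfstab first and edit the file, not a variable.
vfstab_line='/dev/dsk/c0t0d0s1 - - swap - no -'

# Swap the raw slice for the mirror meta device; the rest of the
# entry stays the same.
printf '%s\n' "$vfstab_line" | sed 's|^/dev/dsk/c0t0d0s1|/dev/md/dsk/d20|'
```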

For the root mirror
#metattach d10 d12
For the swap mirror
#metattach d20 d22

Once this is done, you should be able to see the mirrors re-syncing when you run the metastat command. Just run metastat, and for each mirror meta device, you should see the re-syncing status for a while. Once the sync is done, it should change to OK.

Example metastat output for d10 after the attachment:

d10: Mirror
Submirror 0: d11
State: Okay
Submirror 1: d12
State: Resyncing
Resync in progress: 0 % done
Pass: 1
Read option: roundrobin (default)
Write option: parallel (default)
Size: 279860352 blocks (133 GB)

d11: Submirror of d10
State: Okay
Size: 279860352 blocks (133 GB)
Stripe 0:
Device Start Block Dbase State Reloc Hot Spare
c0t0d0s0 0 No Okay Yes

d12: Submirror of d10
State: Resyncing
Size: 279860352 blocks (133 GB)
Stripe 0:
Device Start Block Dbase State Reloc Hot Spare
c0t1d0s0 0 No Okay Yes

There you have it, the output from the metastat command shows the meta device that is the mirror, d10, and the meta devices that make up the mirror. In addition, it shows the status of the mirror and devices which is real handy. For example, in the script that I use to monitor my disks, I use the following command to tell me if any meta devices have any status other than Okay. Check it out:

#metastat | grep State | egrep -v Okay

If I get any information back from that command, I just have the script email it to me so I know what is going on. Cool, huh?
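The filtering idea behind that monitoring script can be demonstrated with inlined sample metastat output; a real cron job would run metastat itself and mail anything the filter lets through.

```shell
# Sample metastat output with one healthy and one failed submirror.
metastat_sample='d10: Mirror
    State: Okay
d12: Submirror of d10
    State: Needs maintenance'

# Same filter as above: print only State lines that are not Okay.
printf '%s\n' "$metastat_sample" | grep 'State' | grep -v 'Okay'
```

An empty result means all meta devices are healthy; anything printed is worth an email.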

We just had the long version, so here I am going to put the commands together, so you can simply see them all at once, and even use this as a reference. See what you think:

#prtvtoc /dev/rdsk/c0t0d0s2 | fmthard -s - /dev/rdsk/c0t1d0s2

#metadb -a -c 3 -f c0t0d0s7 c0t1d0s7
#metainit -f d11 1 1 c0t0d0s0
#metainit -f d12 1 1 c0t1d0s0
#metainit d10 -m d11
#metaroot d10
#metainit d21 1 1 c0t0d0s1
#metainit d22 1 1 c0t1d0s1
#metainit d20 -m d21
>REBOOT<
#metattach d10 d12
#metattach d20 d22
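The same sequence can be wrapped in a small dry-run script: each command is printed instead of executed, so the whole plan can be reviewed (and adapted) before running it for real on a Solaris box. The disk names are this article's examples; adjust them to your hardware.

```shell
# Print-only rehearsal of the mirror setup. Remove the echo in run()
# to actually execute (on Solaris, as root, with the right disks!).
PRIMARY=c0t0d0
MIRROR=c0t1d0
run() { echo "+ $*"; }

run "prtvtoc /dev/rdsk/${PRIMARY}s2 | fmthard -s - /dev/rdsk/${MIRROR}s2"
run "metadb -a -c 3 -f ${PRIMARY}s7 ${MIRROR}s7"
run "metainit -f d11 1 1 ${PRIMARY}s0"
run "metainit -f d12 1 1 ${MIRROR}s0"
run "metainit d10 -m d11"
run "metaroot d10"
run "metainit d21 1 1 ${PRIMARY}s1"
run "metainit d22 1 1 ${MIRROR}s1"
run "metainit d20 -m d21"
run "(reboot here)"
run "metattach d10 d12"
run "metattach d20 d22"
```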

There you have it! That’s how easy it is to create disk mirrors and protect your data with SVM. I hope you enjoyed this article and found it useful!

HOWTO: Mirrored root disk on Solaris

0. Partition the first disk
# format c0t0d0
Use the partition tool (type "p" for the partition menu, then "p" again to print the table) to set up the slices. We assume the following slice setup afterwards:
# Tag Flag Cylinders Size Blocks
- ---------- ---- ------------- -------- --------------------
0 root wm 0 - 812 400.15MB (813/0/0) 819504
1 swap wu 813 - 1333 256.43MB (521/0/0) 525168
2 backup wm 0 - 17659 8.49GB (17660/0/0) 17801280
3 unassigned wm 1334 - 1354 10.34MB (21/0/0) 21168
4 var wm 1355 - 8522 3.45GB (7168/0/0) 7225344
5 usr wm 8523 - 14764 3.00GB (6242/0/0) 6291936
6 unassigned wm 14765 - 16845 1.00GB (2081/0/0) 2097648
7 home wm 16846 - 17659 400.15MB (813/0/0) 819504
1. Copy the partition table of the first disk to its future mirror disk
# prtvtoc /dev/rdsk/c0t0d0s2 | fmthard -s - /dev/rdsk/c0t1d0s2
2. Create at least two state database replicas on each disk
# metadb -a -f -c 2 c0t0d0s3 c0t1d0s3
Check the state of all replicas with metadb:
# metadb
Notes:
A state database replica contains configuration and state information about the meta devices. Make sure that at least 50% of the replicas are always active!

3. Create the root slice mirror and its first submirror
# metainit -f d10 1 1 c0t0d0s0
# metainit -f d20 1 1 c0t1d0s0
# metainit d30 -m d10
Run metaroot to prepare /etc/vfstab and /etc/system (do this only for the root slice!):
# metaroot d30
4. Create the swap slice mirror and its first submirror
# metainit -f d11 1 1 c0t0d0s1
# metainit -f d21 1 1 c0t1d0s1
# metainit d31 -m d11
5. Create the var slice mirror and its first submirror
# metainit -f d14 1 1 c0t0d0s4
# metainit -f d24 1 1 c0t1d0s4
# metainit d34 -m d14
6. Create the usr slice mirror and its first submirror
# metainit -f d15 1 1 c0t0d0s5
# metainit -f d25 1 1 c0t1d0s5
# metainit d35 -m d15
7. Create the unassigned slice mirror and its first submirror
# metainit -f d16 1 1 c0t0d0s6
# metainit -f d26 1 1 c0t1d0s6
# metainit d36 -m d16
8. Create the home slice mirror and its first submirror
# metainit -f d17 1 1 c0t0d0s7
# metainit -f d27 1 1 c0t1d0s7
# metainit d37 -m d17
9. Edit /etc/vfstab to mount all mirrors after boot, including mirrored swap

/etc/vfstab before changes:
fd - /dev/fd fd - no -
/proc - /proc proc - no -
/dev/dsk/c0t0d0s1 - - swap - no -
/dev/md/dsk/d30 /dev/md/rdsk/d30 / ufs 1 no logging
/dev/dsk/c0t0d0s5 /dev/rdsk/c0t0d0s5 /usr ufs 1 no ro,logging
/dev/dsk/c0t0d0s4 /dev/rdsk/c0t0d0s4 /var ufs 1 no nosuid,logging
/dev/dsk/c0t0d0s7 /dev/rdsk/c0t0d0s7 /home ufs 2 yes nosuid,logging
/dev/dsk/c0t0d0s6 /dev/rdsk/c0t0d0s6 /opt ufs 2 yes nosuid,logging
swap - /tmp tmpfs - yes -
/etc/vfstab after changes:
fd - /dev/fd fd - no -
/proc - /proc proc - no -
/dev/md/dsk/d31 - - swap - no -
/dev/md/dsk/d30 /dev/md/rdsk/d30 / ufs 1 no logging
/dev/md/dsk/d35 /dev/md/rdsk/d35 /usr ufs 1 no ro,logging
/dev/md/dsk/d34 /dev/md/rdsk/d34 /var ufs 1 no nosuid,logging
/dev/md/dsk/d37 /dev/md/rdsk/d37 /home ufs 2 yes nosuid,logging
/dev/md/dsk/d36 /dev/md/rdsk/d36 /opt ufs 2 yes nosuid,logging
swap - /tmp tmpfs - yes -
Notes:
The entry for the root device (/) has already been altered by the metaroot command we executed before.

10. Reboot the system
# lockfs -fa && init 6
11. Attach the second submirrors to all mirrors
# metattach d30 d20
# metattach d31 d21
# metattach d34 d24
# metattach d35 d25
# metattach d36 d26
# metattach d37 d27
Notes:
This will finally cause the data from the boot disk to be synchronized with the mirror drive.
You can use metastat to track the mirroring progress.

12. Change the crash dump device to the swap metadevice
# dumpadm -d `swap -l | tail -1 | awk '{print $1}'`
13. Make the mirror disk bootable
# installboot /usr/platform/`uname -i`/lib/fs/ufs/bootblk /dev/rdsk/c0t1d0s0
Notes:
This will install a boot block to the second disk.

14. Determine the physical device path of the mirror disk
# ls -l /dev/dsk/c0t1d0s0
... /dev/dsk/c0t1d0s0 -> ../../devices/pci@1f,4000/scsi@3/sd@1,0:a
15. Create a device alias for the mirror disk
# eeprom "nvramrc=devalias mirror /pci@1f,4000/scsi@3/disk@1,0"
# eeprom "use-nvramrc?=true"
Add the mirror device alias to the Open Boot parameter boot-device to prepare the case of a problem with the primary boot device.
# eeprom "boot-device=disk mirror cdrom net"
You can also configure the device alias and boot-device list from the Open Boot Prompt (OBP a.k.a. ok prompt):
ok nvalias mirror /pci@1f,4000/scsi@3/disk@1,0
ok use-nvramrc?=true
ok boot-device=disk mirror cdrom net
Notes:
From the OBP, you can use boot mirror to boot from the mirror disk.
On my test system, I had to replace sd@1,0:a with disk@1,0. Use devalias on the OBP prompt to determine the correct device path.