Tuesday, December 21, 2010

CLOSE_WAIT Connections – Tuning Solaris

This article describes how to tune TCP parameters on Solaris to get better performance from a web server. I will show, at a high level, how a TCP connection is initiated and terminated, and then focus on which TCP parameters to tune on Solaris.

I have run into problems on Solaris where a large number of connections stuck in CLOSE_WAIT affected the application, causing slow response times and refused new connections.

I will start this article by explaining how a connection is initiated (the TCP three-way handshake sequence).

Let's use a scenario with two servers (A and B), where server A initiates the connection.

1-) Node A sends the first segment (SYN) to Node B. This is a request to synchronize sequence numbers.

Node A —— SYN —–> Node B

2-) Node B acknowledges (ACK) Node A's request. At the same time, Node B sends its own SYN to Node A to synchronize its sequence numbers.

Node A <—– SYN/ACK —– Node B

3-) Node A then sends an acknowledgement (ACK) to Node B.

Node A —— ACK ——> Node B

At this point, the connection is established.
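If you want to see this handshake on the wire, Solaris ships with the snoop utility. A minimal sketch follows; the interface name hme0 and the addresses are placeholders, and the exact output format varies by release:

# Capture traffic to/from Node B (192.168.1.20 here) on port 80
snoop -d hme0 host 192.168.1.20 and port 80

# The first three summary lines of a new connection correspond to the
# SYN, SYN/ACK and ACK segments described above, roughly:
#   192.168.1.10 -> 192.168.1.20  TCP D=80 ... Syn Seq=... Len=0 ...
#   192.168.1.20 -> 192.168.1.10  TCP D=... Syn Ack=... Seq=... Len=0 ...
#   192.168.1.10 -> 192.168.1.20  TCP D=80 ... Ack=... Seq=... Len=0 ...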

Now let's look at how connections are terminated (this is where the CLOSE_WAIT issue appears):

In the termination process, it is important to remember that the application on each side of the connection must close its half of the connection independently. The termination process works as follows.

Let's suppose Node A closes its half of the connection first.

1-) Node A transmits a FIN packet to Node B.

(ESTABLISHED)            (ESTABLISHED)
Node A —- FIN —-> Node B
(FIN_WAIT_1)

2-) Node B transmits an ACK packet to Node A:

Node A <—- ACK —- Node B
(FIN_WAIT_2)         (CLOSE_WAIT)

Here is the CLOSE_WAIT issue: the application on Node B must call close() to close the connection on its end. If the application does not call close(), the connection stays stuck in CLOSE_WAIT for the time specified in the TCP stack. If the server handles a lot of traffic and accumulates many connections in CLOSE_WAIT, you can see problems such as:

- Refused new connection requests.
- Slow response times.
- High processing resource utilization.

Now I will describe some tips that helped me solve problems on web servers. They basically consist of changing some TCP parameters on Solaris so that connections spend less time in these states and are released more quickly.

- TCP_TIME_WAIT_INTERVAL parameter

Description: Notifies TCP/IP how long to keep the control blocks of closed connections. After the applications complete the TCP/IP connection, the control blocks are kept for the specified time. When high connection rates occur, a large backlog of TCP/IP connections accumulates and can slow server performance. The server can stall during certain peak periods. If the server stalls, the netstat command shows that many of the sockets opened to the HTTP server are in the CLOSE_WAIT or FIN_WAIT_2 state. Visible delays can occur for up to four minutes, during which time the server does not send any responses, but CPU utilization stays high, with all of the activity in system processes.

1-) Verify the current value: ndd -get /dev/tcp tcp_time_wait_interval
2-) Set the new value: ndd -set /dev/tcp tcp_time_wait_interval 60000
(Default value is 240000 milliseconds = 4 minutes. Recommended is 60000 milliseconds.)

- TCP_FIN_WAIT_2_FLUSH_INTERVAL parameter

Description: Specifies the timer interval that prevents a connection in the FIN_WAIT_2 state from remaining in that state.

1-) Verify the current value: ndd -get /dev/tcp tcp_fin_wait_2_flush_interval
2-) Set the new value: ndd -set /dev/tcp tcp_fin_wait_2_flush_interval 67500
(Default value is 675000 milliseconds. Recommended is 67500 milliseconds.)

- TCP_KEEPALIVE_INTERVAL parameter

Description: The keepAlive packet ensures that a connection stays in an active and established state.

1-) Verify the current value: ndd -get /dev/tcp tcp_keepalive_interval
2-) Set the new value: ndd -set /dev/tcp tcp_keepalive_interval 300000
(Default value is 7200000 milliseconds. Recommended is 15000 milliseconds.)

- Connection backlog (tcp_conn_req_max_q)

Description: Increase this queue when a high rate of incoming connection requests results in connection failures.

1-) Verify the current value: ndd -get /dev/tcp tcp_conn_req_max_q
2-) Set the new value: ndd -set /dev/tcp tcp_conn_req_max_q 8000
(Default value is 128. Recommended is 8000.)

These configuration changes help to improve system performance and, better than that, help to reduce the impact of major problems. I have experienced situations where the application was not responding because of a large number of connections in CLOSE_WAIT. In my case, we identified a bug in the application and used this tuning as a workaround. It is very useful and can help you when you are experiencing problems caused by many connections in this state. Two small sketches follow: one counts connections per state with netstat, and one collects the ndd settings above into a single script.
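A quick way to check whether you are accumulating connections in these states is to count them with netstat. This small sketch assumes the usual Solaris netstat -an output, where the state is the last column; adjust the pattern if your release prints a different layout:

# Count TCP connections per state
netstat -an | awk '$NF ~ /CLOSE_WAIT|FIN_WAIT_2|TIME_WAIT|ESTABLISHED/ {count[$NF]++} END {for (s in count) print s, count[s]}'

# Or simply count connections stuck in CLOSE_WAIT
netstat -an | grep -c CLOSE_WAIT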
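Also keep in mind that values set with ndd do not survive a reboot, so it is common practice to collect them in a small startup script. Here is a sketch using the values from this article; where you hook the script into the boot sequence is up to you:

#!/bin/sh
# Apply the TCP tuning described in this article
ndd -set /dev/tcp tcp_time_wait_interval 60000
ndd -set /dev/tcp tcp_fin_wait_2_flush_interval 67500
ndd -set /dev/tcp tcp_keepalive_interval 300000
ndd -set /dev/tcp tcp_conn_req_max_q 8000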

Reference: this article was inspired by "Tuning Solaris systems" in the IBM WebSphere Application Server Information Center.

Additional info on the state transitions:

Local Server closes first:
ESTABLISHED -> FIN_WAIT_1 -> FIN_WAIT_2 -> TIME_WAIT -> CLOSED.

Remote Server closes first:
ESTABLISHED -> CLOSE_WAIT -> LAST_ACK -> CLOSED.

Local and Remote Server close at the same time:
ESTABLISHED -> FIN_WAIT_1 -> CLOSING -> TIME_WAIT -> CLOSED.
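If you want to watch a particular connection walk through these states, you can repeat netstat and filter on the peer address. A tiny sketch (192.168.1.20 is just a placeholder):

# Print the state of all connections to/from 192.168.1.20 once per second
while true
do
    netstat -an | grep 192.168.1.20
    sleep 1
done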
