Friday, June 18, 2010

RCA for Siebel problem and Best practices recommendation

Tata Sky : RCA for Siebel problem and Best practices recommendation
Problem Summary: Users were unable to reach the web server of the Siebel application
for approximately 10 minutes.
Problem details:
On 4th Jan 07 the web server was unable to communicate with the load balancer for
about 10 minutes, and no users were able to access the application. The problem
resolved itself without intervention. The following message was observed in the
web server logs:
[04/Jan/2007:16:00:39] failure (17568): HTTP3068: Error receiving
request from 10.1.19.44 (Connection refused)
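To see how the errors were distributed over the problem window, the log entries can be counted per minute. The following is only a sketch: the function name count_http3068 is hypothetical, and it assumes every entry starts with a timestamp in the same [DD/Mon/YYYY:HH:MM:SS] form as the message above.

```shell
# Count HTTP3068 entries per minute in a web server errors log.
# Assumption: each entry begins with a timestamp like [04/Jan/2007:16:00:39].
count_http3068() {
    grep 'HTTP3068' "$1" |
        # Keep only the timestamp, truncated to the minute.
        sed -n 's/^\[\([^]]*\):[0-9][0-9]\].*/\1/p' |
        sort | uniq -c
}
```

It would be called with the path of the errors log as its only argument, for example count_http3068 errors.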
Event Timeline:
· Event Date: 4th January 2007
· Event Time: 15:50 to 16:01
· Problem reported: 8th January 2007
· Domains services restored: Siebel
· Diagnosis/Analysis time:
· Reboot time:
· H/W replacement time:
Diagnosis summary :
1) For the problem window, the only relevant message available is that the web server
was not able to reach the load balancer.
2) As per the onsite team, users also did not reach beyond the load balancer.
Since no other data is available to pinpoint whether the cause lies with the load balancer,
the network or the web server, the approach taken was to analyze the full setup and plan
best practices to prevent a recurrence.
Analysis:
Summary:
There were 4 cases logged related to Siebel setup problems:
10965815 Siebel application server restarted as user sessions were hung
10966494 Web server ping packet drop
10968351 Siebel db performance problems
10979783 Webserver unable to communicate with the application server
These are the highlights of the Analysis:
The web server error message indicates that the client opened a connection and then
closed it from the client side before the web server managed to read any data from that
connection. For the web server, the Cisco load balancer (logical IP) is the immediate client. It
is possible that the connections between the load balancer and the end-user clients were also
disconnected.
No packet drop was seen in the network between the web server and the Cisco load balancer when
testing was carried out after the problem was observed. However, there is a message
coming from the load balancer. Please check this message with Cisco.
Workaround if any:
Suggested Fix and recommendations:
1) Implement the best practices
2) Collect as much diagnostic data as possible at problem time.
3) Enable debug options at the application and network levels.
4) Implement the NFS option planned for the image file store, as agreed in the phase 1b
architecture layout.
The best practice recommendations are as follows:
Web server:
1. In the supplied magnus.conf file of the web server, KeepAliveTimeout is set to 1200
seconds (20 minutes). The default value is 30 seconds. In a multi-tier architecture, it is best to set
KeepAliveTimeout to zero.
2. Please modify the following entries in /etc/system.
Remove the following entry:
set segkmem_lpsize=0x400000
Add the following entry:
set pcie:pcie_aer_ce_mask=0x1
3. Install the latest level EIS CD Patches.
4. Transition to e1000g.
Convert from ipge to e1000g by installing patch 123334-02 and running
the script provided.
5. Please refer to the following guide for performance and tuning
guidance:
http://docs.sun.com/app/docs/doc/817-6249
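Before and after changing the keep-alive setting, the configured value can be read back from magnus.conf. This is a minimal sketch, assuming the file holds one "Directive value" pair per line; the function names keepalive_timeout and check_keepalive are hypothetical.

```shell
# Print the last KeepAliveTimeout value found in a magnus.conf-style file.
# Assumption: one "Directive value" pair per line.
keepalive_timeout() {
    awk '$1 == "KeepAliveTimeout" { v = $2 } END { if (v != "") print v }' "$1"
}

# Compare the configured value with the recommended value of 0.
check_keepalive() {
    value=$(keepalive_timeout "$1")
    if [ -z "$value" ]; then
        echo "KeepAliveTimeout not set (server default applies)"
    elif [ "$value" -eq 0 ]; then
        echo "KeepAliveTimeout is 0 (matches recommendation)"
    else
        echo "KeepAliveTimeout is $value seconds (recommendation: 0)"
    fi
}
```

It would be run against the magnus.conf file in the configuration directory of the web server instance.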
Action when the web server is unresponsive:
Please confirm that the web server is hung or unresponsive by accessing a static page or by using telnet
to connect to the HTTP port of the system. If both attempts time out, we can confirm that the
web server is hung.
Identify the web server child process as follows:
1. ps -ef | grep webservd | grep
The highest PID in the output is the child PID.
With this PID, collect the following details:
1. Open terminal 1 and run prstat against the PID as given below.
# prstat -L -c -p pid > prstat.hung
Let this command run for 3 minutes, then terminate it with Ctrl+C.
2. Meanwhile, open terminal 2 and issue kill -3 against the PID 3 times,
one minute apart.
This will write Java thread dumps to the errors log file.
3. In terminal 2, run pstack, pmap, pldd and pfiles against the PID.
# pstack pid > pstack.hung
# pmap pid > pmap.hung
# pldd pid > pldd.hung
# pfiles pid > pfiles.hung
4. In terminal 2, run gcore to generate the core file.
# gcore pid
This will create the core file as core.pid in the present working directory.
5. Run the pkgcore script to collect the binaries and libraries for root cause analysis.
# pkgcore.sh {case id} {core.pid} {pid}
This will create packages such as caseid_corefiles.tar.gz and caseid_libraries.tar.gz.
6. netstat -na > netstat.hung
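Steps 2 and 3 above can be wrapped in a small script so that nothing is missed while the server is hung. This is only a sketch: the names run, collect_snapshots, thread_dumps and DRY_RUN are hypothetical, and it assumes a POSIX shell such as ksh or bash rather than the legacy /bin/sh. With DRY_RUN=1 the commands are printed instead of executed.

```shell
# Execute a command, or only print it when DRY_RUN=1 (hypothetical switch).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Step 3: capture pstack, pmap, pldd and pfiles output for one PID.
collect_snapshots() {
    pid=$1
    for tool in pstack pmap pldd pfiles; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "$tool $pid > $tool.hung"
        else
            "$tool" "$pid" > "$tool.hung" 2>&1
        fi
    done
}

# Step 2: send kill -3 to the PID several times with a pause in between.
# Defaults follow the text: 3 dumps, 60 seconds apart.
thread_dumps() {
    pid=$1
    count=${2:-3}
    interval=${3:-60}
    i=1
    while [ "$i" -le "$count" ]; do
        run kill -3 "$pid"
        # No pause is needed after the last dump.
        [ "$i" -lt "$count" ] && run sleep "$interval"
        i=$((i + 1))
    done
}
```

A dry run such as DRY_RUN=1 thread_dumps 1234 shows the sequence without touching the process; the PID 1234 is only an example.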
Application Server PSBLA001
1. Please modify the following entries in /etc/system.
Remove the following entries:
set segkmem_lpsize=0x400000
set ip:ip_squeue_bind = 0
set ip:ip_squeue_fanout = 1
set ipge:ipge_tx_syncq=1
set ipge:ipge_bcopy_thresh = 512
set ipge:ipge_dvma_thresh = 1
set consistent_coloring=2
Add the following setting
set pcie:pcie_aer_ce_mask=0x1
2. Install the latest level EIS CD Patches.
3. Transition to e1000g
Convert from ipge to e1000g by installing patch 123334-02 and running
the script provided
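Before editing /etc/system by hand, the file can be audited against the list above. The sketch below is read-only and makes a few assumptions: the function name audit_etc_system and the variable REMOVE_KEYS are hypothetical, comment lines start with *, and a POSIX grep is available (on Solaris, typically /usr/xpg4/bin/grep).

```shell
# Tunables the recommendation asks to remove (names only; values are ignored).
REMOVE_KEYS="segkmem_lpsize ip:ip_squeue_bind ip:ip_squeue_fanout \
ipge:ipge_tx_syncq ipge:ipge_bcopy_thresh ipge:ipge_dvma_thresh \
consistent_coloring"

# Report which entries still need to be removed or added in the given file.
audit_etc_system() {
    file=$1
    for key in $REMOVE_KEYS; do
        # Skip comment lines, then look for "set key = value".
        if grep -v '^[[:space:]]*\*' "$file" |
            grep -q "^[[:space:]]*set[[:space:]][[:space:]]*$key[[:space:]]*="; then
            echo "remove: $key"
        fi
    done
    if ! grep -v '^[[:space:]]*\*' "$file" |
        grep -q '^[[:space:]]*set[[:space:]][[:space:]]*pcie:pcie_aer_ce_mask[[:space:]]*='; then
        echo "add: set pcie:pcie_aer_ce_mask=0x1"
    fi
}
```

Running audit_etc_system /etc/system prints one line per outstanding change and prints nothing once the file matches the recommendation.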
Action when the application server is hung:
1. Identify the PID of the application process.
2. truss -o truss.out -ealfd -vall -p "pid of the application"
3. pstack "pid of the application" ==> get it 3 times.
4. snoop -o snoop.out -d
5. ndd /dev/tcp tcp_listen_hash
6. savecore -L
7. guds output
8. prstat -mvL -n 10 1 600
9. iostat -xnz 1 600
10. mpstat 1 600
11. vmstat 1 600
12. lockstat -C -s 50 sleep 30
    lockstat -H -s 50 sleep 30
13.lockstat -kIW -s 50 -i 971 sleep 30
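The four statistics commands in steps 8 to 11 each run for 600 seconds, so starting them one after another would miss most of the hang. One way to run them side by side is sketched below; the function names collector_commands and start_collectors and the output file naming are hypothetical, and a POSIX shell such as ksh or bash is assumed.

```shell
# The statistics commands from steps 8 to 11, one per line.
collector_commands() {
    cat <<'EOF'
prstat -mvL -n 10 1 600
iostat -xnz 1 600
mpstat 1 600
vmstat 1 600
EOF
}

# Start every collector in the background, writing to <outdir>/<command>.out,
# and return once all of them have finished.
start_collectors() {
    outdir=$1
    mkdir -p "$outdir" || return 1
    collector_commands | {
        while read -r name args; do
            # $args is left unquoted on purpose so it splits into options.
            $name $args > "$outdir/$name.out" 2>&1 < /dev/null &
        done
        # Wait inside the same subshell that started the background jobs.
        wait
    }
}
```

For example, start_collectors /var/tmp/hang_data would return after about ten minutes with four output files; the directory name is only an example.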
Siebel Database server:
1. Modify the following entries in /etc/system.
Remove:
set ce_reclaim_pending=1
exclude: lofs
Add:
set ce:ce_bcopy_thresh=97
set ce:ce_dvma_thresh=96
set ce:ce_ring_size=8192
set ce:ce_comp_ring_size=8192
set ce:ce_tx_ring_size=8192
set sq_max_size=100
2. Please note that the dump device /dev/dsk/c0t0d0s1 is a submirror of the
swap metadevice /dev/md/dsk/d101.
Change the dump device to the swap mirror /dev/md/dsk/d101 with: "dumpadm -d
swap"
3. Install the latest level EIS CD Patches which includes cluster
patches.
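The dump device change in item 2 can be verified by reading the device back from dumpadm. The sketch below assumes the usual dumpadm output layout, in which one line reads "Dump device: <path> (...)"; the function names dump_device and classify_dump_device are hypothetical.

```shell
# Read dumpadm output on stdin and print the configured dump device path.
# Assumption: the output contains a line of the form "Dump device: <path> ...".
dump_device() {
    sed -n 's/^[[:space:]]*Dump device:[[:space:]]*\([^[:space:]]*\).*/\1/p'
}

# Report whether a device path is a raw disk slice or a metadevice.
classify_dump_device() {
    case "$1" in
        /dev/md/dsk/*) echo "metadevice" ;;
        /dev/dsk/*)    echo "disk slice" ;;
        *)             echo "other" ;;
    esac
}
```

On the database server it would be used as classify_dump_device "$(dumpadm | dump_device)", which should report "metadevice" once the change has been made.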
Action Plan:
Team:
SSE: Vinod SAM: Rajesh
Onsite Team: Prashant Customer engineer/Sysadmin:Shams Khan
