Sunday, May 4, 2008

Sun Cluster 3.1 cheat sheet

Daemons
clexecd
This daemon is used by cluster kernel threads to execute userland commands (such as the run_reserve and dofsck commands). It is also used to run cluster commands remotely (such as the cluster shutdown command). This daemon registers with failfastd so that a failfast device driver will panic the kernel if this daemon is killed and not restarted in 30 seconds.
cl_ccrad
This daemon provides access from userland management applications to the CCR. It is automatically restarted if it is stopped.
cl_eventd
The cluster event daemon registers and forwards cluster events (such as nodes entering and leaving the cluster). There is also a protocol whereby user applications can register themselves to receive cluster events. The daemon is automatically respawned if it is killed.
cl_eventlogd
The cluster event log daemon logs cluster events into a binary log file. At the time of writing, there is no published interface to this log. It is automatically restarted if it is stopped.
failfastd
This daemon is the failfast proxy server. The failfast daemon allows the kernel to panic if certain essential daemons have failed.
rgmd
The resource group management daemon, which manages the state of all cluster-unaware applications. A failfast driver panics the kernel if this daemon is killed and not restarted in 30 seconds.
rpc.fed
This is the fork-and-exec daemon, which handles requests from rgmd to spawn methods for specific data services. A failfast driver panics the kernel if this daemon is killed and not restarted in 30 seconds.
rpc.pmfd
This is the process monitoring facility. It is used as a general mechanism to initiate restarts and failure action scripts for some cluster framework daemons (in Solaris 9 OS), and for most application daemons and application fault monitors (in Solaris 9 and 10 OS). A failfast driver panics the kernel if this daemon is stopped and not restarted in 30 seconds.
pnmd
The public network management service daemon manages network status information received from the local IPMP daemon running on each node and facilitates application failovers caused by complete public network failures on nodes. It is automatically restarted if it is stopped.
scdpmd
The disk path monitoring daemon monitors the status of disk paths, so that they can be reported in the output of the cldev status command. It is automatically restarted if it is stopped.
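A quick way to spot-check that these daemons are alive is a small shell loop; a minimal sketch (the daemon list is taken from this cheat sheet — trim it for your release, and pgrep is assumed available):

```shell
#!/bin/sh
# Spot-check that the cluster framework daemons listed above are running.
DAEMONS="clexecd cl_ccrad cl_eventd cl_eventlogd failfastd rgmd rpc.fed rpc.pmfd pnmd scdpmd"

check_daemon() {
    # pgrep -x matches the exact process name; exits 0 if at least one is found
    if pgrep -x "$1" >/dev/null 2>&1; then
        echo "$1: running"
    else
        echo "$1: NOT running"
    fi
}

for d in $DAEMONS; do
    check_daemon "$d"
done
```

Remember from the descriptions above that killing some of these daemons (rgmd, rpc.fed, rpc.pmfd, clexecd) triggers a failfast panic if they do not restart within 30 seconds, so a "NOT running" result deserves immediate attention.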
File locations
man pages
/usr/cluster/man
log files
/var/cluster/logs
/var/adm/messages
sccheck logs
/var/cluster/sccheck/report.
CCR files
/etc/cluster/ccr
Cluster infrastructure file
/etc/cluster/ccr/infrastructure
SCSI Reservations
Display reservation keys
scsi2: /usr/cluster/lib/sc/pgre -c pgre_inkeys -d /dev/did/rdsk/d4s2
scsi3: /usr/cluster/lib/sc/scsi -c inkeys -d /dev/did/rdsk/d4s2
Determine the device owner
scsi2: /usr/cluster/lib/sc/pgre -c pgre_inresv -d /dev/did/rdsk/d4s2
scsi3: /usr/cluster/lib/sc/scsi -c inresv -d /dev/did/rdsk/d4s2
Cluster information
Quorum info
scstat -q
Cluster components
scstat -pv
Resource/Resource group status
scstat -g
IP Networking Multipathing
scstat -i
Status of all nodes
scstat -n
Disk device groups
scstat -D
Transport info
scstat -W
Detailed resource/resource group
scrgadm -pv
Cluster configuration info
scconf -p
Installation info (prints packages and version)
scinstall -pv
Cluster Configuration
Integrity check
sccheck
Configure the cluster (add nodes, add data services, etc)
scinstall
Cluster configuration utility (quorum, data services, resource groups, etc)
scsetup
Add a node
scconf -a -T node=<nodename>
Remove a node
scconf -r -T node=<nodename>
Prevent new nodes from entering
scconf -a -T node=.
Put a node into maintenance state
scconf -c -q node=<nodename>,maintstate
Note: use the scstat -q command to verify that the node is in maintenance state; the vote count should be zero for that node.
Get a node out of maintenance state
scconf -c -q node=<nodename>,reset
Note: use the scstat -q command to verify that the node is out of maintenance state; the vote count should be one for that node.
Admin Quorum Device
Quorum devices are nodes and disk devices, so the total quorum will be all nodes and devices added together. You can use the scsetup interactive utility to add/remove quorum devices, or use the commands below.
Adding a device to the quorum
scconf -a -q globaldev=d11
Note: if you get the error message "unable to scrub device", use scgdevs to add the device to the global device namespace.
Removing a device from the quorum
scconf -r -q globaldev=d11
Remove the last quorum device
Evacuate all nodes, then put the cluster into install mode
# scconf -c -q installmode
Remove the quorum device
# scconf -r -q globaldev=d11
Check the quorum devices
# scstat -q
Resetting quorum info
scconf -c -q reset
Note: this will bring all offline quorum devices online
Bring a quorum device into maintenance mode
Obtain the device number
# scdidadm -L
# scconf -c -q globaldev=<device>,maintstate
Bring a quorum device out of maintenance mode
scconf -c -q globaldev=<device>,reset
Device Configuration
Lists all the configured devices including paths across all nodes.
scdidadm –L
List all the configured devices including paths on the local node only.
scdidadm -l
Reconfigure the device database, creating new instance numbers if required.
scdidadm -r
Perform the repair procedure for a particular path (use this when a disk gets replaced)
scdidadm -R <device>   - device
scdidadm -R 2          - device id

Configure the global device namespace
scgdevs
Status of all disk paths
scdpm -p all:all
Note: the path format is <node>:<disk>
Monitor device path
scdpm -m <node>:<disk>
Unmonitor device path
scdpm -u <node>:<disk>
Disk groups
Adding/Registering
scconf -a -D type=vxvm,name=appdg,nodelist=<node1>:<node2>,preferenced=true
Removing
scconf -r -D name=<disk group>
Adding a single node
scconf -a -D type=vxvm,name=appdg,nodelist=<node>
Removing a single node
scconf -r -D name=<disk group>,nodelist=<node>
Switch
scswitch -z -D <disk group> -h <node>
Put into maintenance mode
scswitch -m -D <disk group>
Take out of maintenance mode
scswitch -z -D <disk group> -h <node>
Onlining a disk group
scswitch -z -D <disk group> -h <node>
Offlining a disk group
scswitch -F -D <disk group>
Resync a disk group
scconf -c -D name=appdg,sync
Transport cable
Enable
scconf -c -m endpoint=<node>:qfe1,state=enabled
Disable
scconf -c -m endpoint=<node>:qfe1,state=disabled
Note: a cable must be disabled before it gets deleted
Resource Groups
Adding
scrgadm -a -g <res group> -h <node>,<node>
Removing
scrgadm -r -g <res group>
Changing properties
scrgadm -c -g <res group> -y <property=value>
Listing
scstat -g
Detailed List
scrgadm -pv -g <res group>
Display mode type (failover or scalable)
scrgadm -pv -g <res group> | grep 'Res Group mode'
Offlining
scswitch -F -g <res group>
Onlining
scswitch -Z -g <res group>
Unmanaging
scswitch -u -g <res group>
Note: (all resources in group must be disabled)
Managing
scswitch -o -g <res group>
Switching
scswitch -z -g <res group> -h <node>
Resources
Adding failover network resource
scrgadm -a -L -g <res group> -l <logicalhost>
Adding shared network resource
scrgadm -a -S -g <res group> -l <shared address>
Adding a failover Apache application and attaching the network resource
scrgadm -a -j apache_res -g <res group> \
  -t SUNW.apache -y Network_resources_used=<logicalhost> \
  -y Scalable=False -y Port_list=80/tcp \
  -x Bin_dir=/usr/apache/bin
Adding a shared Apache application and attaching the network resource
scrgadm -a -j apache_res -g <res group> \
  -t SUNW.apache -y Network_resources_used=<shared address> \
  -y Scalable=True -y Port_list=80/tcp \
  -x Bin_dir=/usr/apache/bin
Create a HAStoragePlus failover resource
scrgadm -a -g rg_oracle -j hasp_data01 -t SUNW.HAStoragePlus \
  -x FileSystemMountPoints=/oracle/data01 \
  -x AffinityOn=true
Removing
scrgadm –r –j res-ip
Note: must disable the resource first
changing properties
scrgadm -c -j <resource> -y <property=value>
List
scstat -g
Detailed List
scrgadm -pv -j res-ip
scrgadm -pvv -j res-ip
Disable resource monitor
scrgadm -n -M -j res-ip
Enable resource monitor
scrgadm -e -M -j res-ip
Disabling
scswitch -n -j res-ip
Enabling
scswitch -e -j res-ip
Clearing a failed resource
scswitch -c -h <node>,<node> -j <resource> -f STOP_FAILED
Find the network of a resource
# scrgadm -pvv -j <resource> | grep -i network
Removing a resource and resource group
Offline the group
# scswitch -F -g rgroup-1
Remove the resource
# scrgadm -r -j res-ip
Remove the resource group
# scrgadm -r -g rgroup-1
Resource Types
Adding
scrgadm -a -t <resource type>, e.g. SUNW.HAStoragePlus
Deleting
scrgadm -r -t <resource type>
Listing
scrgadm -pv | grep 'Res Type name'

Friday, May 2, 2008

Procedure to Add the disks in Sun Cluster


1. Create a metaset and add all the hosts to the metaset.
# metaset -s <metaset name> -a -h <primary node> <secondary node>

2. Find the DID devices for the shared Hitachi disks.
# scdidadm -L

3. Add the shared disks to the metaset.

# metaset -s <metaset name> -a /dev/did/rdsk/dx /dev/did/rdsk/dy

4. Format the disks and allocate all the space to slice 0, leaving the 7th partition as it is.

5. Create a metadevice as required for the 33 GB LUNs.

# metainit -s <metaset name> d100 1 1 /dev/did/rdsk/dxs0

6. Create a striped metadevice for the 66 GB LUNs.

# metainit -s <metaset name> dz 1 2 /dev/did/rdsk/dxs0 /dev/did/rdsk/dys0 -i 512k


7. newfs the device

8. Create mount points

9. Edit /etc/vfstab and add entries for the new filesystems.
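Put together, the steps above can be sketched as a dry-run script. The set name, nodes and DID devices below are illustrative placeholders, and each command is only echoed for review; replace the echo with real execution once the names are correct for your cluster:

```shell
#!/bin/sh
# Dry-run sketch of the disk-addition steps above. All names are examples.
run() { echo "+ $*"; }   # echo only; change to "$@" to really execute

SET=appset                 # metaset name (placeholder)
NODES="node1 node2"        # primary and secondary nodes (placeholders)
DISK=/dev/did/rdsk/d10     # DID device found via scdidadm -L (placeholder)

run metaset -s $SET -a -h $NODES                # 1. create the set, add hosts
run scdidadm -L                                 # 2. list the DID devices
run metaset -s $SET -a $DISK                    # 3. add the shared disk
                                                # 4. format: all space to slice 0
run metainit -s $SET d100 1 1 ${DISK}s0         # 5. simple concat (33 GB LUN)
run metainit -s $SET d200 1 2 ${DISK}s0 /dev/did/rdsk/d11s0 -i 512k  # 6. stripe (66 GB LUNs)
run newfs /dev/md/$SET/rdsk/d100                # 7. build the filesystem
run mkdir -p /mnt/appdata                       # 8. create the mount point (placeholder)
                                                # 9. then add the entry to /etc/vfstab
```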

20K domain shutdown procedure

Following is the high-level procedure for powering the domains off and on.
In this example I have used the c domain; replace it with the domain you
have the outage for.

1. Log in to tsky-20k-01-sc and also to the second SC, tsky-20k-02-sc.
2. su - sms-svc
3. Log in to the respective console for both nodes in the cluster,
e.g. console -d c on 20k-1-sc and
console -d c on 20k-2-sc
4. On one of the nodes run
#scshutdown -y -g0
5. Both consoles will come to the OK prompt.
Do the following in each of the domains:
a. ok> ~. (press both keys simultaneously)
you will get the SMS prompt.
b. setkeyswitch -d c off
c. setkeyswitch -d c on
d. console -d c
e. ok> boot
Do the same procedure on the other node.

PROCEDURE FOR REPLACING MIRRORED DISKS

Given all of the above, the following set of commands should work in all cases (though depending on the system configuration, some commands may not be necessary).
To replace a Solaris VM-controlled disk which is part of a mirror, follow these steps:

1. Run 'metadetach' to detach all submirrors on the failing disk from their
respective mirrors:
metadetach -f <mirror> <submirror>
Note: If the "-f" option is not used, the following message will be returned:
"Attempt an operation on a submirror that has erred component".
Then run 'metaclear' (**) on those submirror devices:
metaclear <submirror>
Verify there are no existing metadevices left on the disk by running:
metastat -p | grep c#t#d#

2. If there are any replicas on this disk, remove them using:
metadb -d c#t#d#s#
Verify there are no existing replicas left on the disk, by running:
metadb | grep c#t#d#

3. If there are any open filesystems on this disk (not under Solaris VM
control), unmount them.

4. Run the 'cfgadm' command to remove the failed disk.
cfgadm -c unconfigure c#::dsk/c#t#d#

NOTE: Use the "cfgadm -al" command to obtain the variable "c#::dsk/c#t#d#".
The variable will be listed under the 'Ap_Id' column from the "cfgadm
-al" command's output.
NOTE: if the message "Hardware specific failure: failed to unconfigure SCSI
device: I/O error" appears, check to make sure that you cleared all
replicas and metadevices from the disk, and that the disk is not being
accessed.

5. Insert and configure the new disk.
cfgadm -c configure c#::dsk/c#t#d#
cfgadm -al (to confirm that disk is configured properly)

6. Run 'format' or 'prtvtoc' to put the desired partition table on the new disk

7. If necessary, recreate any replicas on the new disk:
metadb -a c#t#d#s#

8. Recreate each metadevice to be used as a submirror, then use 'metattach' to
attach those submirrors to the mirrors and start the resync.
NOTE: If the submirror was something other than a simple one-slice concat device, the metainit command will be different than shown here.
metainit <submirror> 1 1 <slice>
metattach <mirror> <submirror>

9. Run 'metadevadm' on the disk, which will update the New DevID.
metadevadm -u c#t#d#
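The whole replacement sequence above can be sketched as a dry-run script for one submirror (d11 on mirror d10) and one disk (c1t2d0); all names are illustrative, and each command is only echoed for review:

```shell
#!/bin/sh
# Dry-run sketch of the mirrored-disk replacement steps above.
run() { echo "+ $*"; }   # echo only; change to "$@" to really execute

DISK=c1t2d0           # failing disk (placeholder)
AP=c1::dsk/c1t2d0     # attachment point from cfgadm -al (placeholder)

run metadetach -f d10 d11          # 1. detach the submirror from the mirror
run metaclear d11                  #    clear the submirror metadevice
run metadb -d ${DISK}s7            # 2. delete any replicas on the disk
run cfgadm -c unconfigure $AP      # 4. remove the failed disk
run cfgadm -c configure $AP        # 5. configure the replacement disk
run format                         # 6. put the desired partition table back
run metadb -a ${DISK}s7            # 7. recreate replicas, if there were any
run metainit d11 1 1 ${DISK}s0     # 8. recreate the submirror
run metattach d10 d11              #    reattach it and start the resync
run metadevadm -u $DISK            # 9. update the device ID
```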

Thursday, May 1, 2008

E20K Adding System Boards

1. Verify that the selected board slot can accept a board.
# cfgadm -a -s "select=class(sbd)"
2. Add the board to the slot, then connect and configure the board.
# cfgadm -v -c configure SBx    (where x represents the number of the board)


To add a system board to the domain, the board must already be assigned to the domain, or must be in the domain's ACL.

Application Monitoring

Introduction:
Application monitoring is a very important aspect of a project, but unfortunately not much attention is paid to developing effective monitoring while the project is not yet live. Once the project is live, the lack of proper monitoring costs in terms of downtime, when support people are not aware that the application is having problems or is not working at all.
Discussion of application monitoring should start early, at least from the time deployment details are being worked out. Some applications may require specific scripts, tools or authorizations, and an early discussion of monitoring makes it easier to address delays in its implementation.
This document gives a basic introduction to the challenges, types of monitoring and best practices which can be followed to ensure high availability of live systems.
Challenges in Application Monitoring:
Following are some of the challenges faced today in application monitoring:
1. Proactive Monitoring: Proactive monitoring means monitoring system and application health and taking corrective action when a certain threshold level is reached. The threshold level is defined as the level where the application is not yet showing deterioration but can deteriorate if corrective actions are not taken. The biggest challenge is gathering the statistics to work out a threshold and the number of parameters and processes that need to be monitored. Applications which interact directly with the customer, for example ecommerce, banking and other online applications, need to be monitored proactively so that problems are detected even before they impact the end customer.

2. Complexity & number of applications: An application may become more complex if it has a global user base. The application has to support multiple languages, cultures and currencies. An application may have multiple instances located in different regions of the world and may use different time or logging formats. To effectively monitor global applications one has to understand the application instances, their interconnectivity and flow, coordinate with regional teams, and in most cases depend on regional teams for monitoring the application.

3. Shared Systems: Applications often share a system in order to utilize the full capacity of the hardware, and this brings its own set of challenges. For a single-application system it is easy to track resources like memory, CPU, disk and network bandwidth, but in a shared environment one application may take the resources and others may be impacted through no fault of their own. Sometimes application owners may not be contactable to take corrective actions.

4. Clustered Systems: To avoid a single point of failure, applications are hosted in a clustered environment with a number of machines on different networks and in different locations. From a monitoring perspective this poses the additional challenge of keeping track of the request and failure logs and the memory, CPU, network and disk resources, as one has to look at the logs and resources of all the cluster member machines just to isolate which one is performing badly.

5. Limited logging in the production environment: Since the volume of transactions is very high and the application code has already been run through performance, reliability and quality assurance cycles, the code in the production environment is generally enabled for minimum logging. This may lead to a situation where the actual indicator of a problem does not show up in the logs; the logs may not show the error message until the logging level is increased.

6. Custom logging in production: Logging in online production environments can at most be raised to a higher level as provided by the code. For a particular problem, when logging and other debugging methods do not provide a clue, special instrumented code has to be developed and deployed to capture error-condition events. The instrumented code has to be deployed in the production environment only, since the problem could not be replicated under test conditions. Deploying custom code in production calls for application downtime, which may not be acceptable to the application owners and business groups involved, and also requires considerable effort on the part of the supporting team to maintain it. This custom code may get overwritten by the next release cycle's code.

Types of Monitoring for Applications:
Applications are simultaneously monitored at various points to ensure their availability, and monitoring as a whole falls under the following categories:

1. Health Monitoring: As a proactive step, application health has to be monitored constantly in order to address any issue before it becomes serious. Health monitoring in a simple arrangement consists of taking a snapshot of system and application parameters and comparing it to standard benchmarks. For example, if a transaction is known to take around one second to complete, we can monitor this response time and set up alerts if it increases. Automated monitoring of health parameters is the best way of ensuring high availability of an application environment.

2. Error Monitoring: Errors in any application can impact the user experience adversely. An error condition can cause the user experience to fail outright, or can cause unexpected errors such as timeouts or failure to submit or display the requested data. Errors can arise either from a software problem relating to the application code, web server, application server or database server, or from a hardware issue relating to memory, CPU, disk space or the network. These types of errors are monitored differently. Application errors are mostly monitored by analyzing the application, web server and application server logs, understanding the error message and using it to find the nature of the problem. For example, an application may stop processing new requests, and from the log files we may find the possible reason for this behavior, such as a resource shortage of CPU, memory, network bandwidth, database performance, etc. Application monitoring requirements and tools can be designed by studying the application documentation, architecture, platform, error messages and so on.
Hardware monitoring is done using the standard tools and commands available for the particular hardware. Every operating system has tools and commands to monitor memory, CPU and disk usage, but to monitor and report these resources on a regular basis, custom scripts can be written independently of the application code.

3. Performance Monitoring: Performance of an application is critical to a good user experience. An application which responds to user requests in a reasonable amount of time will leave a good impression, whereas an application which takes seconds or minutes to respond will cause users to abandon it. Application performance derives from the application code and the supporting hardware: the code ensures that the program routines can handle at least the desired number of actual user requests, and the hardware provides the necessary memory and processing capability.
Application performance can be monitored from the application access time, request processing time and the times reported for various transactions in the application logs. While the application logs may provide some data about processing time, the actual user experience can be simulated by sending requests to the application from different locations and measuring the resulting response time in real time.

4. Configuration Monitoring: Application releases and operating system changes can impact the hardware and software configuration of a machine. It is very important to monitor configuration to avoid any undocumented and untested configuration element. Each configuration change needs to be documented and monitored for any unauthorized change. The best way to monitor configuration is through a change control process where a change is submitted, approved and then implemented. The change control process keeps a record of all changes and allows the people responsible for the applications to monitor them.

5. Security Monitoring: In today's global scenario it is very important to monitor applications for security. Security monitoring involves ensuring the latest security patches are applied to application servers, web servers and database servers. Software companies frequently issue security warnings for their products, and these should be carefully studied and implemented to ensure compliance and protection against hackers. At any given point in time, the software versions in use should be reviewed to understand whether they pose any security threat, and updated to newer, secure versions where they do.
Some companies have security teams who constantly monitor hardware and software for possible security breaches and send their recommendations, but in general the support team should subscribe to software vendors' newsletters, which give notice of the latest security threats.

Best Practices for Application Monitoring:
Systems can fail for various reasons related to hardware, the operating system, the network or the applications themselves. Sometimes, despite good efforts, systems and applications fail. Although one cannot guarantee the always-available status of these components, there are some best practices which can be followed to ensure high availability of applications:

1. Plan Early: If a new application or software component is going live and needs monitoring, it is better to be involved in the early discussions of architecture and design to get an overview of things to come. This gives time to think about and implement the monitoring solution when required. In many cases this helps, as the monitoring solution may not be straightforward and may require additional resources and effort.

2. Monitor Proactively: Don't let systems and applications go down and use the failure as the starting point for corrective action. Monitor systems and applications proactively for symptoms of a problem so that corrective action can be initiated before they fail. Proactive monitoring can be achieved by watching threshold values for resource utilization, such as CPU, memory and network bandwidth, and for application health parameters. If the system crosses the threshold values, a system health check should be performed, including finding the running processes, checking memory utilization by each process, monitoring application logs, etc. Proactive health checks and corrective action can avoid a system or application crash.

3. Balance the Load: Load balancers are used to distribute load across the servers which can handle it. In the event of one server being heavily loaded or down, the load balancer can automatically direct traffic to a healthy server. This operation is transparent to users, who will not notice the difference. Load balancers can be hardware- or software-based, and one should be used for any high-transaction application.

4. Cluster the Servers: Clustering removes the single point of failure by providing multiple points for request processing. In the event of one server being down due to hardware failure, network failure or heavy load on its resources, requests are sent to and processed by the other members of the cluster.

5. Create a Recovery Plan: To avoid delay, online applications should have a well-documented and tested recovery plan. The plan should cover the steps and checklists to be followed in the event of an application failure. A simple example would be to test the failover feature of a server and observe the total request failures and the time taken to fail over, which gives an estimate of when an alternate server will be up. Having a plan at the time of failure avoids wasting time looking for alternatives.

6. Deploy Application Code from a Trusted & Tested Source: Application code should be released from a trusted and tested source such as a version control system or the staging or quality assurance environments. No code should be released which has changes from outside the trusted source, where only authorized people have access. Using code in this way gives the development teams an opportunity to simulate any code problem and examine the code base itself.

7. Create a Service Level Agreement: A service level agreement in writing emphasizes the need for and scope of monitoring. It provides monitoring requirements for the support team and a standard by which the business groups can measure application availability. This document gives an estimated time to respond to and fix issues, and teams can work in advance to create a recovery plan which meets the service level agreement.

8. Use Good Hardware: Hardware which has proven reliable in the industry should be used for the production environment. All additional components, cabling, etc. should be of a high standard to avoid problems due to hardware failure. Replacement components should be of exactly the same specification as the originals. The hardware should be backed by a support arrangement with the manufacturer or another company which can supply components and troubleshooting expertise in case of a failure.

9. Seek Professional Help: If your application is mission-critical and involves impact to customers and revenue, it is not sufficient to rely on home-grown monitoring solutions; you should seek professional advice from companies which provide monitoring for others. Besides monitoring applications, these companies can provide different types of reports, such as response time, downtime and uptime, which may be helpful in maintaining and planning the application's resources.

Implementation:
To implement effective application monitoring, one has to understand the nature of the application and what exactly it is trying to do. For this, one does not need full knowledge of the application code, but the basic flow of information should be clear.

1. Uptime Monitoring
In this type of monitoring, applications are checked to see whether they are up and running. A simple monitor can be set up by probing the server URLs or server processes. The problem with this type is that while it can tell whether an application is up, it cannot tell whether the application can process transactions.

2. Transaction Monitoring
Transaction-based applications are best monitored using a transaction monitor. If the application involves submitting a form and displaying a success message, the same behavior can be simulated using scripts, and the status can be captured to confirm success. The script can run the transactions at repeated intervals and send alerts if something fails.
This can be used effectively in proactive monitoring if the application can return the transaction processing time or some other status which can be quantified. The transaction completion time or status can be monitored and compared with expected values. If a transaction takes too long, one can look at the application logs to figure out the problem and take corrective action to avoid a crash.
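A minimal sketch of such a transaction monitor in shell: run a probe command, time it, and alert on failure or on exceeding a threshold. The probe and threshold here are placeholders; in practice the probe would submit a real transaction and check the success message:

```shell
#!/bin/sh
# Time a probe command and alert if it fails or runs too long.
THRESHOLD=5   # seconds considered acceptable (placeholder value)

timed_probe() {
    start=$(date +%s)
    "$@" >/dev/null 2>&1
    status=$?
    elapsed=$(( $(date +%s) - start ))
    if [ $status -ne 0 ]; then
        echo "ALERT: transaction failed"
    elif [ $elapsed -gt $THRESHOLD ]; then
        echo "ALERT: transaction took ${elapsed}s (threshold ${THRESHOLD}s)"
    else
        echo "OK: ${elapsed}s"
    fi
}
```

Run from cron at repeated intervals, the ALERT lines can be mailed to the support team, giving exactly the quantified completion time/status comparison described above.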

3. Data files monitoring
In some application environments transactions happen offline, where the data travels from one point to another in the form of a data file, for example businesses sending their daily sales data to their head office every night. This type of flow can be monitored by constantly watching the various drop and pickup points for the data files. At frequent intervals, counts can be taken at the drop and pickup points to ensure the files are moving properly.
This also provides a means of proactive monitoring, as a problem will be known as soon as files start to accumulate at a drop point, and the system can be prevented from clogging by investigating the cause of the accumulation.
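A sketch of such a drop-point check: count the files waiting in a drop directory and alert when they accumulate past a threshold. Directory and threshold are placeholders:

```shell
#!/bin/sh
# Alert when files accumulate at a drop point, as described above.
check_drop_point() {
    dir=$1; max=$2
    # count the entries currently waiting in the drop directory
    count=$(ls "$dir" 2>/dev/null | wc -l)
    if [ "$count" -gt "$max" ]; then
        echo "ALERT: $count files waiting in $dir (threshold $max)"
    else
        echo "OK: $count files in $dir"
    fi
}
```

Running this against every drop and pickup point at frequent intervals gives the file-movement counts the text describes, and the first ALERT flags accumulation before the system clogs.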

4. Database Monitoring
Applications use databases, and databases should be monitored both for their uptime state and their transactional state. The uptime state is easy to monitor: by watching some key processes we can determine whether the database is up or not. To monitor the transactional health of a database, monitoring transactions such as creating and updating records can be run, noting the time taken for each transaction and its final status.
When the transactions start to fail we know the database is having issues, but as proactive monitoring we can watch the time taken to complete each transaction. In most cases, if the system becomes overloaded the transaction time will rise, and that can give a vital clue to look at the problem area in the database and correct it before it goes down.

5. Resource Monitoring
Monitoring CPU, memory, network and disk is equally important as the types above. Constantly monitoring the system resources can prevent application and operating system slowdowns and crashes. If CPU and memory usage reach their peak, the application can go into a hung state. If the disk is full, applications can crash right away, as they may not be able to write their logs to disk. Network bandwidth over-utilization can also cause an application crash, with request queues building up because of the slow network. All these resources offer quantitative measurements and can be monitored with scripts using existing system utilities. For proactive monitoring, threshold values can be set for each resource, and on reaching a threshold one can investigate the cause of the over-utilization.
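As an example of a resource threshold built from an existing system utility, here is a disk-usage check based on df; the filesystem and threshold are placeholders:

```shell
#!/bin/sh
# Alert when a filesystem's usage crosses a threshold percentage.
check_disk() {
    fs=$1; limit=$2
    # df -P gives POSIX one-line-per-filesystem output; field 5 is use%
    pct=$(df -P "$fs" | awk 'NR==2 { sub(/%/, "", $5); print $5 }')
    if [ "$pct" -ge "$limit" ]; then
        echo "ALERT: $fs at ${pct}% (threshold ${limit}%)"
    else
        echo "OK: $fs at ${pct}%"
    fi
}

check_disk / 90
```

The same pattern applies to the other resources: pull one number from vmstat, netstat or ps, compare it to a threshold, and alert before the application hangs or crashes.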

Metaset problem in a cluster node

To correct the metaset problem, please follow the action plan below.
There are two nodes, NodeA and NodeB; I assume that NodeB is the one missing the metaset information.
This action plan will require downtime on the node (NodeB) which does not have the metaset information.
On node NodeB:
# init 0
on node NodeA:
# metaset -s setname -f -d -h NodeB
(this may take 3-4 minutes or so to return)
on node NodeB:
ok> boot
(wait for the node to fully join the cluster)
on node NodeA:
# metaset -s setname -a -h NodeB
on node NodeB:
# metaset (should now list the metaset information)
Test the cluster switchover (this requires an outage):
# scswitch -z -g RGname -h NodeB