Tuesday, December 9, 2008

A Holistic Approach to Siebel CRM Monitoring

What should we monitor in Siebel CRM?

It turns out to be a rather common question, even for some of our long-time customers. In fact, I was on a call with a customer this morning and heard a rather lively discussion among their staff on this topic. I probably should write a white paper about this. However, knowing how much work I have to finish before taking some time off for Christmas, it could be a while before I can publish a formal white paper, so let me try to share some of my thoughts in real time. Consider this the first installment of a best practice white paper.

Before I talk about what needs to be monitored, let me define what I mean by monitoring. Monitoring, as defined by Webster's Dictionary, is to watch, to keep track of, or to check, usually for a specific purpose. In a technical sense, it is the set of activities to gather telemetry data from a piece of hardware or software, analyze the data, and provide some sort of notification if exceptions are found. Monitoring is closely related to diagnostics. In fact, the same piece of telemetry can be used for both purposes. One might monitor CPU usage using data gathered in real time, and examine a time series of that same CPU data when diagnosing a performance problem. Personally, I tend to classify monitoring as the set of tasks that lead to the realization of an exception, and diagnostics as the set of tasks that follow to determine the problem's root cause. In ITIL terms, monitoring may lead to the creation of an incident, while diagnostics are carried out in incident and problem management.

Now that I have defined what I mean by monitoring, let's talk about what needs to be monitored.

The obvious things to monitor are CPU, memory, disk space, and I/O (disk, network, and so on). These are the most basic computing resources that Siebel and its underlying database depend on, and they are finite, so it makes sense to monitor them. However, these are not the only things, nor are they necessarily the most important things.
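To make this concrete, here is a minimal sketch of a host-level resource check in Python using the psutil library. The thresholds are illustrative placeholders, not recommendations; in practice you would use whatever monitoring agent your shop has standardized on.

```python
# Minimal host resource check using psutil.
# Thresholds are illustrative placeholders, not recommendations.
import psutil

THRESHOLDS = {"cpu_pct": 90.0, "mem_pct": 90.0, "disk_pct": 85.0}

def check_host():
    alerts = []
    cpu = psutil.cpu_percent(interval=1)       # CPU utilization over 1 second
    mem = psutil.virtual_memory().percent      # physical memory in use
    disk = psutil.disk_usage("/").percent      # root filesystem usage
    if cpu > THRESHOLDS["cpu_pct"]:
        alerts.append(f"CPU at {cpu:.0f}%")
    if mem > THRESHOLDS["mem_pct"]:
        alerts.append(f"Memory at {mem:.0f}%")
    if disk > THRESHOLDS["disk_pct"]:
        alerts.append(f"Disk at {disk:.0f}%")
    return alerts

if __name__ == "__main__":
    for alert in check_host():
        print("WARNING:", alert)
```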

One thing that makes monitoring Siebel different from monitoring other technologies is that Siebel is an application. As an application, it interacts with users directly, whereas most users do not deal directly with the database, or the load balancer, or the storage devices, and so on. Consequently, the primary purpose of application monitoring is to make sure that the application is providing the service level that users expect in order to do their jobs.

Many things can impact application service level. In fact, every component in a Siebel environment (including but not limited to the Siebel application server, web server, gateway server, report server, CTI, database, storage devices, servers, network switches, routers, load balancers, and the WAN) can impact service level. Therefore, it is important to monitor everything, right? Yes and no.

Traditionally, application monitoring means monitoring all the components, with the health of the application taken as the aggregate health of all the components. However, this kind of bottom-up approach is increasingly ineffective, both because of the amount of redundancy built into production application environments and because many applications are becoming more and more service-oriented. For example, with RAID, it is no big deal to lose a disk. With Oracle RAC, you can lose a database server node and the database will keep on running. With Siebel app server clustering, you can lose an app server altogether and the application will continue to function (yes, users logged onto that server would need to log on again). The point I want to make is that while component failures are still bad, they no longer have the catastrophic service level impacts they used to.

The starting point of Siebel monitoring should be the top – monitor from the end user perspective by focusing on interactive user sessions and batch jobs, and then move downward to the components. If users have problems accessing application functionality or getting good response times, or if batch jobs are not completing within the targeted batch window, you clearly have a problem with the application, and those problems may be caused by component-level outages. On the other hand, if a server goes down but interactive user sessions and batch jobs are working just fine, you have less to worry about. You'll still want to find and fix the problem, because the service level of your Siebel environment may drop below your target if another server goes down. Still, the server outage is less urgent than it used to be. In a traditional component-based monitoring approach, a server outage would be a fatal problem that demanded immediate action. In this top-down, end-user-focused approach, a server outage would most likely be a warning unless there is no redundancy for the component.
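As a sketch of what monitoring from the top can look like, here is a simple active (synthetic) probe in Python that times a request to the Siebel web entry point. The URL and the response-time target are hypothetical placeholders; a real synthetic transaction would log in and exercise actual application functionality rather than just fetch the start page.

```python
# Active (synthetic) end-user check: time a request to the Siebel web
# entry point. The URL and SLA threshold below are hypothetical.
import time
import urllib.request

SIEBEL_URL = "http://siebelweb.example.com/callcenter_enu/start.swe"  # hypothetical
RESPONSE_SLA_SEC = 5.0

def probe():
    start = time.monotonic()
    try:
        with urllib.request.urlopen(SIEBEL_URL, timeout=30) as resp:
            ok = resp.status == 200
    except Exception as exc:
        return ("CRITICAL", f"request failed: {exc}")
    elapsed = time.monotonic() - start
    if not ok:
        return ("CRITICAL", "unexpected HTTP status")
    if elapsed > RESPONSE_SLA_SEC:
        return ("WARNING", f"slow response: {elapsed:.1f}s")
    return ("OK", f"{elapsed:.1f}s")

print(*probe())
```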

Both active and passive approaches should be used for monitoring interactive user workload, and critical alerts should be generated if exceptions occur. I wrote about these two monitoring approaches in two previous postings (1, 2), so you can refer to those articles for more details. For batch workload, the key things to focus on are whether the job finishes on time and whether errors or warnings are generated while processing the entries. Most of the data that you need to watch is in Siebel log files.
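For the log-watching side of batch monitoring, a sketch like the following scans Siebel component log files for error entries. The log directory and the error pattern (matching Siebel's SBL-prefixed error codes) are assumptions to adjust for your own install.

```python
# Passive check of Siebel component logs for error entries. The log
# path and the error pattern are assumptions; adjust per environment.
import glob
import re

LOG_GLOB = "/siebel/siebsrvr/log/*.log"   # hypothetical install path
ERROR_PAT = re.compile(r"\b(SBL-\w+-\d+|Error|Fatal)\b")

def scan_logs():
    hits = []
    for path in glob.glob(LOG_GLOB):
        with open(path, errors="replace") as fh:
            for lineno, line in enumerate(fh, 1):
                if ERROR_PAT.search(line):
                    hits.append((path, lineno, line.strip()))
    return hits

for path, lineno, line in scan_logs():
    print(f"{path}:{lineno}: {line}")
```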

The next set of things to monitor are resources. They are important to monitor because resources tend to be finite; if they run out, processing either stops or is delayed. Keep in mind the relative importance of these resources at the component level, though – a resource outage may not be a critical event in the grand scheme of things. Traditional resources to monitor include CPU, memory, disk space, and I/O, but don't forget about Siebel-specific resources such as task counts, and when monitoring traditional resources, you need to do it in the context of Siebel. In other words, you should monitor not only server-level CPU, but also the CPU consumption of the Siebel processes themselves.
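As an illustration of monitoring CPU in the context of Siebel, the sketch below sums CPU and memory across the Siebel server processes using psutil. The process names are typical of a UNIX install (siebmtshmw is the usual multithreaded server executable), but treat them as assumptions and adjust per platform.

```python
# Sum CPU and memory for Siebel server processes. Process names are
# assumptions based on a typical UNIX install; adjust per platform.
import psutil

SIEBEL_PROC_NAMES = {"siebmtsh", "siebmtshmw", "siebproc"}

def siebel_usage():
    cpu_total, rss_total, count = 0.0, 0, 0
    for proc in psutil.process_iter(["name", "cpu_percent", "memory_info"]):
        if proc.info["name"] in SIEBEL_PROC_NAMES:
            # cpu_percent is measured since the previous call, so a real
            # agent would sample on an interval rather than once.
            cpu_total += proc.info["cpu_percent"] or 0.0
            rss_total += proc.info["memory_info"].rss
            count += 1
    return count, cpu_total, rss_total / (1024 * 1024)

n, cpu, rss_mb = siebel_usage()
print(f"{n} Siebel processes, {cpu:.0f}% CPU, {rss_mb:.0f} MB resident")
```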

Lastly, monitor for exceptions. These can be errors showing up in log files, or summarized Siebel server and component statistics such as the number of level 0 and level 1 errors, the number of component crashes and restarts, or even the number of database connection retries. They are important to monitor in the sense that while a single exception may not be a critical problem, a swarm of these errors within a relatively small time window is usually a bad sign, and may point to problems that could cause service level targets to be missed.
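One simple way to implement the idea that a swarm in a small window is a bad sign is a sliding-window counter: stay quiet on isolated errors, and alert only when the count within the window exceeds a limit. A minimal sketch, with illustrative window size and threshold:

```python
# Alert on a burst of exceptions within a sliding window, rather than
# on every single error. Window and limit values are illustrative.
from collections import deque
import time

class BurstDetector:
    def __init__(self, window_sec=300, limit=20):
        self.window, self.limit = window_sec, limit
        self.events = deque()

    def record(self, ts=None):
        """Record one exception; return True if the burst threshold is hit."""
        ts = ts if ts is not None else time.time()
        self.events.append(ts)
        # Drop events that have aged out of the window.
        while self.events and self.events[0] < ts - self.window:
            self.events.popleft()
        return len(self.events) > self.limit
```

Each error parsed from a log or statistics feed would be passed to record(); isolated errors stay below the limit, while a burst trips the alert.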

What about the other Siebel server and component statistics? For the most part, they are useful for diagnostics and performance tuning, not for generating alerts. For example, it is not really practical to set an absolute threshold on a metric such as Average Reply Size, which shows the amount of data Siebel returns; what would a good threshold value even be? On the other hand, it would be useful to capture the information and see how the value changes before and after a major application change in order to understand the performance impact. Statistics such as this one should be collected and saved into a database so that trend analysis can be performed.
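Here is a minimal sketch of that collect-and-trend idea, using SQLite as the trend database. The schema and metric name are illustrative; the point is simply to timestamp each sample so you can compare values before and after a change.

```python
# Persist sampled component statistics (e.g., Average Reply Size) so
# they can be trended over time. Schema and metric names are illustrative.
import sqlite3
import time

conn = sqlite3.connect("siebel_stats.db")
conn.execute("""CREATE TABLE IF NOT EXISTS stats (
    collected_at REAL, server TEXT, component TEXT, metric TEXT, value REAL)""")

def save_stat(server, component, metric, value):
    conn.execute("INSERT INTO stats VALUES (?, ?, ?, ?, ?)",
                 (time.time(), server, component, metric, value))
    conn.commit()

def avg_since(metric, since_ts):
    """Average of a metric since a given time, e.g., since a release date."""
    row = conn.execute(
        "SELECT AVG(value) FROM stats WHERE metric = ? AND collected_at >= ?",
        (metric, since_ts)).fetchone()
    return row[0]
```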

I have just scratched the surface of what should be monitored. There's more, as some of the more critical components require specific approaches. I guess I'd better add that white paper to my to-do list.