Monday, March 21, 2011

Rethinking the Role of SARM in 2011

It is almost 1 a.m., and I am still up trying to finish up a bit of research via the Internet. By chance I came across an article that talks about Siebel Application Response Measurement (SARM). What is the chance for this to happen in the middle of the night? I thought. What's even more amazing is that the author referred to one of my previous posts (thanks, @lex)! Even though I am getting sleepy, I feel compelled to provide a response.

SARM is indeed not used by every Siebel customer, even though it should be. In a survey that we conducted at Oracle OpenWorld several years ago, we learned that administrators had two reasons for not turning on SARM: (1) the data was too hard to understand, and (2) SARM was perceived to consume too much overhead. There was some truth in #1, but the Siebel Transaction Diagnostic Tool in Application Management Suite for Siebel should have largely solved this problem. Instead of you having to worry about fetching the right set of SARM log files and running SARMquery manually, the tool does it for you and generates graphical reports that help you quickly visualize and understand the performance diagnostic data that SARM captures.


#2 is a bit of a myth though. SARM does consume capacity, but the amount is quite reasonable given the critical insights it provides for managing a Siebel application properly. The alternative, leaving SARM off, is to run Siebel as a black box, which makes it very hard to manage.

While SARM is useful, there are also other tools that one should use for managing Siebel application performance. I covered this topic in my previous article "A Holistic Approach to Siebel CRM Monitoring" a while back. SARM should be used in conjunction with other tools because, since we introduced SARM, we have made available several newer complementary technologies that are better suited to some application performance management tasks.

SARM was created in-house at Siebel. At the time, we thought we would use it as an all-encompassing framework for both monitoring and diagnostics. However, as in any 1.0 software development project, there were not enough resources to build everything that we wanted, so we had to phase in the capabilities. SARM was first made available in 7.5, and we made subsequent enhancements to the framework in 7.7 and 8.0. In addition to resource limitations, we also had to live with technology limitations. The original intent of supporting the ARM 2.0 interface was to provide an in-memory feed to monitoring tools so that alerts could be sent if application response time failed to meet the service level target. However, because the ARM 2.0 API data fields were not wide enough for SARM to pass contextual data such as screen and view names, the usefulness of this interface for real-time monitoring was limited, and it is useless for performance diagnostics, where that contextual data is critical to troubleshooting the application.

Another shortcoming of SARM is that its instrumentation is not available in the Siebel UI client frameworks. Consequently, SARM can only tell you server time, and not the end-to-end transaction request time that end users see. This means that any network-related problems are totally invisible to SARM. By the time we tried to address this shortcoming, Siebel was already part of Oracle, and we had a new option available.

This new option was a technology called Real User Experience Insight (RUEI). It turned out that Siebel was not the only application at Oracle for which we had to solve the application performance management problem. In fact, administrators of Oracle E-Business Suite, PeopleSoft, and JD Edwards EnterpriseOne all need to monitor application performance. Instead of building a one-off for Siebel, we needed something that worked across all those applications and could be used in the future for Fusion Applications. RUEI fit the bill perfectly.

RUEI, which is also part of Application Management Suite for Siebel, goes beyond what SARM can do in several respects and is the perfect complement to SARM. First, RUEI does not consume any processing capacity on any of the Siebel web, application and database servers. RUEI gathers monitoring data through network protocol analysis, which does not require any software to be installed on the Siebel server boxes, so it does not interfere with the Siebel application. The original approach that we thought about implementing would have required running SARM in the client, and it would only have worked for the Siebel HI framework. Other approaches that require agents to be statically or dynamically installed on Siebel clients or servers to intercept Siebel end user traffic may also interfere with Siebel operations.

Second, because RUEI uses network protocol analysis, it can measure the end-to-end response time and the volume of network traffic that the Siebel application generates. The information can be used in the initial performance problem triage to decide whether a response time problem is caused by the network or the server. Also, because RUEI captures network information, you can often determine the physical location of the user via network address mapping that is built into the tool.
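To make the triage idea concrete, here is a minimal sketch in Python. It is not RUEI's or SARM's API: the `Sample` record, the `triage` function, and the 50% threshold are all hypothetical, and it assumes you already have an end-to-end time and a server time for the same request from whatever tools you use. Anything that is not server time is attributed to the network, which ignores client rendering time.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One request, timings in milliseconds (hypothetical record layout)."""
    view: str
    end_to_end_ms: float  # response time as observed on the network
    server_ms: float      # time spent inside the server for the same request


def triage(sample: Sample, network_share_threshold: float = 0.5) -> str:
    """Return 'network' or 'server' depending on where most of the time went."""
    if sample.end_to_end_ms <= 0:
        raise ValueError("end_to_end_ms must be positive")
    # Assumption: time not spent on the server was spent on the network.
    network_ms = max(sample.end_to_end_ms - sample.server_ms, 0.0)
    share = network_ms / sample.end_to_end_ms
    return "network" if share > network_share_threshold else "server"
```

For example, a request that took 4000 ms end to end but only 500 ms on the server would be classified as a network problem, while one that spent 3600 ms of those 4000 ms on the server would point back at the server, which is where SARM data becomes useful.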

Third, RUEI can not only measure end user response time, but also capture errors that end users see on the user interface. This insight is very important for tech support, as end users may or may not report the errors that they see accurately in their help requests. Error statistics may also be used to improve the usability of the application or user training, as repeated occurrences of errors may indicate that the user interface is too hard to use, or that users simply are not trained properly.

Fourth, RUEI provides much finer-grained real-time alerting of Siebel performance issues than is possible via the ARM 2.0 API approach that SARM implements. With RUEI, one could set KPI targets on specific Siebel screens, views or applets, and have alerts go off when a certain percentage of activities on those objects exceed the acceptable service level target.
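The alerting rule described here boils down to a simple percentage check. The Python sketch below illustrates the logic only; it is not how RUEI is configured, and the function name, parameters, and numbers are made up for illustration.

```python
def kpi_breached(response_times_ms, target_ms, max_slow_fraction):
    """True when more than max_slow_fraction of samples exceed target_ms.

    response_times_ms: response times observed for one screen, view or applet
    target_ms:         acceptable service level target for that object
    max_slow_fraction: tolerated share of slow activities, e.g. 0.25 for 25%
    """
    if not response_times_ms:
        return False  # no activity, nothing to alert on
    slow = sum(1 for t in response_times_ms if t > target_ms)
    return slow / len(response_times_ms) > max_slow_fraction


# Hypothetical samples for a single view, with a 2-second target and a
# tolerance of 25% slow activities: 3 of the 5 samples are slow.
if kpi_breached([1000, 2500, 3200, 900, 4100], target_ms=2000, max_slow_fraction=0.25):
    print("ALERT: view is missing its service level target")
```

Using a percentage rather than a single slow request keeps one outlier from triggering an alert.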

Finally, RUEI comes with a built-in OLAP database and a very nice set of tools for generating both ad-hoc and pre-defined performance reports. You can even use it to carry out click stream analysis that is typically done with web analytics software to answer questions that business analysts care about. Think of it as a business intelligence tool for understanding end user experience.

If RUEI is so nice, does it mean that SARM is no longer needed? Of course not. RUEI can tell you from a business perspective and end user perspective who the end users are, where they come from, what they tried to do on Siebel and the kind of response time and errors that they received. However, except for network problems, it won't tell you why the application is running slowly. You need SARM for this.

In addition to RUEI, which provides real user monitoring within Application Management Suite for Siebel, the suite also includes tools for synthetic user monitoring, workflow monitoring, Siebel component monitoring, log file monitoring, and configuration change monitoring. More information about the product can be found on this website.

3 comments:

Marc Rix said...

Excellent article, Chung. RUEI is a very powerful and cool product. (My company, Bridgescape, has implemented RUEI for several Oracle customers, some specifically for Siebel performance monitoring. Folks are always amazed at how much information it provides, and with no agents!) I also thought your readers may be interested in these online RUEI demos as well:

RUEI Dashboard

Full Session Replay

Root Cause Analysis / Drill-Downs

Marc Rix
Twitter: @marcrix

Unknown said...

Great Work!
Very useful and informative blog.
Thanks for sharing such quality information on application performance management.

Stefan said...

Hi Chung, thanks for a great post! We are running SARM right now, and I am really excited about what it might be able to do for us. We find it hard to actually understand what the metrics we are capturing mean. For instance, we can see some very large self response times for SWSE_SENDMSG, and are trying to figure out what this actually is. I assume that it's the time it takes the SWSE to send a REQ to the AOM plus the time it takes to get an ACK, but I can't seem to find any documentation about this anywhere. Do you know where I might be able to find such information?