I had an extensive conversation today with Erik Loftstrand at Microsoft. We discussed in detail the benefits (or lack thereof) of clustering the root management server (RMS) in a System Center Operations Manager 2007 deployment. Erik was of the opinion that there is no benefit to clustering the RMS.
My arguments for clustering have to do with the monthly reboot of the RMS for patch management and the fundamental need for high availability. But if you think about it, the system is already highly available as long as you break it up into server roles.
An agent will report to its assigned management server regardless of the health of the RMS. Losing the RMS affects alert notification and console access. For a server reboot, that is an impact of less than five minutes, and the reboot is done after hours when the load on the system is minimal. Most administrators will be in bed sleeping peacefully during this outage.
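To put the reboot window in perspective, here is a rough back-of-the-envelope sketch. It assumes one patch reboot per month, the five-minute upper bound mentioned above, and a 30-day month; the 30-day figure and the function name are my own choices for illustration, not measurements from a real environment.

```python
# Rough upper bound on RMS downtime from one patch reboot per month.
REBOOT_MINUTES = 5        # upper bound quoted above for a single reboot
DAYS_PER_MONTH = 30       # assumption, for illustration only
MINUTES_PER_MONTH = DAYS_PER_MONTH * 24 * 60


def availability(downtime_minutes: float, period_minutes: float) -> float:
    """Fraction of the period during which the service is up."""
    return 1.0 - downtime_minutes / period_minutes


rms_availability = availability(REBOOT_MINUTES, MINUTES_PER_MONTH)
print(f"{rms_availability:.4%}")  # prints 99.9884%
```

Even at the full five minutes, that is a small slice of the month, and it only touches notification and console access, not agent reporting.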
In an extended outage, where the RMS is going to be out of service for more than just a reboot, another management server can be promoted to the RMS role. This would fix any issues with subscriptions. As for the Engyro connector, it is reasonable to load the connector on multiple servers. If there is a problem with the RMS, Engyro will remain unaffected.
One major drawback to SCOM is that Microsoft does not support a geo-located cluster. In short, if the Bellevue data center is taken out in an earthquake, we cannot fail over (in a cluster configuration) to a clustered node in Southern California. You can, however, transfer the RMS role to a management server in Southern California. As long as the management servers hosting the agents are up and available, the agents will report regardless of the health of the RMS. If an agent's management server goes down, the agent will fail over to another management server (or the RMS) in the management group.
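The behavior described above can be sketched as a toy model. This is not SCOM code and uses no SCOM API; the class names, server names, and the promote_to_rms helper are all hypothetical, made up purely to illustrate that agent reporting does not depend on the RMS and that the RMS role can move to another management server.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Server:
    name: str
    up: bool = True
    is_rms: bool = False


@dataclass
class Agent:
    name: str
    primary: Server
    failover: List[Server] = field(default_factory=list)

    def reporting_server(self) -> Optional[Server]:
        # Try the assigned management server first, then each failover target in order.
        for server in [self.primary, *self.failover]:
            if server.up:
                return server
        return None


def promote_to_rms(servers: List[Server], new_rms: Server) -> None:
    # Toy stand-in for role promotion: exactly one server holds the RMS role afterwards.
    for server in servers:
        server.is_rms = server is new_rms


# Hypothetical management group: names are illustrative only.
rms = Server("rms-bellevue", is_rms=True)
ms1 = Server("ms-bellevue")
ms2 = Server("ms-socal")
group = [rms, ms1, ms2]
agent = Agent("agent01", primary=ms1, failover=[ms2, rms])

rms.up = False                               # the RMS is lost
print(agent.reporting_server().name)         # ms-bellevue: agent is unaffected

promote_to_rms(group, ms2)                   # move the RMS role to Southern California
print([s.name for s in group if s.is_rms])   # ['ms-socal']

ms1.up = False                               # the agent's own management server is lost
print(agent.reporting_server().name)         # ms-socal: agent fails over
```

The point of the model is that the agent's path to a management server never passes through the RMS unless the RMS happens to be one of its failover targets.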
Today I've opened my eyes and seen the future for what it is. In the really real world, there are a number of bugs and idiosyncrasies between SCOM and clustering. Clustering is more trouble than it is worth when fundamental high availability is already built into the SCOM architecture. A good disaster recovery process, in writing, is important. If a blockpoint or OS patch on the RMS is going to take the server down for any extended length of time, it is very easy to promote another management server to the RMS role.