Many of my customers ask me what differentiates a successful MOSS deployment from a difficult, troublesome, or problematic environment. My answer is simple to say but difficult to do well, and can be summarized in two phrases: "Load Testing" and "Configuration/Patch Management".
These two phrases are fantastic because they describe an end goal... and doing them properly implies that a lot of prerequisites have been met. For example, proper load testing requires that you have an appropriate load testing environment, and that you do testing at all (this is common sense, yet it is surprisingly rarely done). Configuration Management implies that you have a process for moving changes and updates from one environment to another and have clear documentation of those processes (even rarer than testing). Ultimately, having these capabilities also implies having a certain amount of infrastructure to support them.
In a perfect world, here's what your MOSS infrastructure might look like:
Each environment provides support for a specific purpose in the infrastructure. The descriptions are as follows:
Developer Environment (virtual): Provides an independent environment for developers to create functionality without impacting the productivity of any other developer. Performance and fault tolerance are not considerations for this environment.
Developers can create functionality as desired, rebooting or restarting services as necessary to support their development activities.
When a significant issue presents itself, Undo disks enable safe rollback to a known-good configuration. (Note: code must be stored outside of the VM to prevent the loss of work!)
Provides a complete MOSS install that is relatively close to (but does not necessarily exactly match) the production configuration.
If laptops are used, allows developers to take their development environment with them when not directly connected to the corporate network, increasing productivity.
Development workstations must have sufficient disk and memory capacity to support a small virtual OS and MOSS configuration running locally on their machines.
Integration Environment (optionally virtual): Provides an environment where all developers can deploy their new functionality and perform initial UAT with business owners. Performance and fault tolerance are not considerations for this environment.
Provides an environment where developers can verify their code in parallel with code deployed/created by other developers, and 3rd party solutions.
Contains the same functional code as the production environment, plus any updates that are scheduled to be deployed to production.
Allows developers to directly manipulate the system and perform investigations into any functional errors or impacts of their code on the environment.
If virtualized, allows for relatively easy recovery from unintentional or otherwise irreversible actions using Undo disks (Virtual Server) or snapshots (Hyper-V) (assumes SQL exists in the same VM).
Provides the initial testing and UAT environment for 3rd party solutions, and for investigation of potential 3rd party solutions.
Communication should be maintained between developers to ensure that the activities of one developer (restarting services) do not impact the activities of other developers.
QA/Test (Non-virtual): Provides an environment to test functional code under load. Performance characteristics comparable to (or, if possible, identical to) the production environment are a significant requirement for this environment. Fault tolerance is not.
Many errors in SharePoint are only visible when the system is under considerable user load. Memory constraints, storage performance, locking issues, and race conditions are extremely difficult to detect by any other method.
Verifies successful and proper creation of SharePoint solution packages, and a clear separation between the developer and farm administrator roles. Developers have no administrative or console access to this environment.
Provides a near-perfect match to the production environment, including general infrastructure/network configuration and all deployed code.
If hardware that matches the production environment is used, performance characteristics are easily aligned. This environment can also serve as an emergency repository of replacement servers if a hardware-level error is encountered or if emergency capacity is required.
Contains considerable debugging and investigation tools supporting investigation of issues identified through load testing.
This environment is never accessed or used by business owners or end-users.
Test “agent” machines will be required to create the necessary request characteristics to simulate production load on this environment.
Functional specifications and test plans are required from the primary developer/team to ensure that the functionality being deployed is properly tested and the results are valid.
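To make the idea of a test "agent" concrete, here is a minimal sketch in Python of what one agent does: it runs a number of simulated users in parallel and records failures and response times. Everything named here (run_load, fake_request, the user counts) is hypothetical and for illustration only. A real agent would replace fake_request with actual HTTP requests against the QA/Test farm, driven by the test plan, and a dedicated load testing tool will do far more than this.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import mean
from typing import Callable, Dict, List, Tuple


def run_load(request: Callable[[], bool], users: int,
             requests_per_user: int) -> Dict[str, float]:
    """Simulate `users` concurrent users, each issuing `requests_per_user` calls."""

    def one_user() -> List[Tuple[bool, float]]:
        results = []
        for _ in range(requests_per_user):
            start = time.perf_counter()
            try:
                ok = bool(request())
            except Exception:
                ok = False  # any exception counts as a failed request
            results.append((ok, time.perf_counter() - start))
        return results

    # One worker thread per simulated user, all running at the same time.
    with ThreadPoolExecutor(max_workers=users) as pool:
        futures = [pool.submit(one_user) for _ in range(users)]
        samples = [s for f in futures for s in f.result()]

    latencies = [elapsed for _, elapsed in samples]
    return {
        "requests": len(samples),
        "failures": sum(1 for ok, _ in samples if not ok),
        "mean_seconds": mean(latencies),
        "max_seconds": max(latencies),
    }


def fake_request() -> bool:
    # Hypothetical stand-in for an HTTP GET against a QA/Test farm URL.
    time.sleep(0.001)
    return True


if __name__ == "__main__":
    print(run_load(fake_request, users=5, requests_per_user=4))
```

Passing the request in as a function keeps the sketch independent of any particular HTTP library, and lets the same loop exercise different pages or operations from the test plan.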
Production (Non-virtual): Provides the primary (frequently only) environment used by the general company or employees. Performance and fault tolerance are critical components for this environment.
Isolated, stable environment that only has “known-good” code running on it that has successfully passed all other functional and performance testing. Developer access is never provided in this environment.
Maintains isolation between various roles, including System Administrator (for OS level administration), Farm Administrator (for SharePoint farm-wide administration and maintenance), Site Collection and Web administration for content access, and other general user access.
Is carefully managed so that all changes and configuration elements are tracked, ensuring recoverability in a disaster scenario.
Can provide support for numerous “vanity URLs” giving the indication of dedicated servers while ensuring consistency of management.
Can provide support for non-functional content staging using functional code that has passed testing as above.
Contains specific, low-impact troubleshooting tools that are installed but generally inactive, and that can be enabled with little disruption or end-user impact if issues arise that cannot otherwise be identified.
Directly accessed by end-users. Random or unscheduled restarting of services or servers is generally unacceptable.
Any need that requires new functionality (web parts, features) must be deployed to this environment before content utilizing those features may be introduced.
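The four environments and the rule that code only reaches production after passing everything before it can be summarized in a small Python sketch. This is just one way to write the descriptions above down as data. The names (Environment, PIPELINE, next_environment, can_deploy) are hypothetical and not part of any SharePoint tooling, and the strict Developer to Integration to QA/Test to Production ordering is my reading of the descriptions rather than a product feature.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass(frozen=True)
class Environment:
    name: str
    hosting: str                # "virtual", "optionally virtual", "non-virtual"
    performance_critical: bool  # must performance match production?
    fault_tolerant: bool        # is fault tolerance a requirement?
    developer_access: bool      # may developers touch the servers directly?


# Ordered from first stop to last, as described above.
PIPELINE = (
    Environment("Developer", "virtual", False, False, True),
    Environment("Integration", "optionally virtual", False, False, True),
    Environment("QA/Test", "non-virtual", True, False, False),
    Environment("Production", "non-virtual", True, True, False),
)


def next_environment(current: str) -> Optional[Environment]:
    """Return the environment a change moves to next, or None after production."""
    names = [env.name for env in PIPELINE]
    index = names.index(current)  # raises ValueError for unknown names
    return PIPELINE[index + 1] if index + 1 < len(PIPELINE) else None


def can_deploy(target: str, passed: Set[str]) -> bool:
    """A change may enter `target` only after passing every earlier environment."""
    names = [env.name for env in PIPELINE]
    required = names[: names.index(target)]
    return all(name in passed for name in required)
```

For example, can_deploy("Production", {"Developer", "Integration"}) is False because the change has not been through QA/Test yet, which is exactly the shortcut that turns your users into your load testers.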
As you can see, each environment in this infrastructure provides significant capabilities that are both critical and necessarily isolated. For example, you would never want to perform load testing in your production environment… and yet, if you do not have the ability to perform load testing, this is effectively what you’re doing, and you’re making your users suffer through the experience.
Does it seem like a lot of infrastructure? Probably… but the benefits far outweigh the costs. Not following this minimal guideline frequently results in increases in expenses in other areas, such as additional support staff due to the need to support user requests when unforeseen problems arise, culminating in a “reactive” support model. If you find yourself never being able to get ahead, constantly “putting out fires” rather than planning and providing for new or enhanced capabilities, you’re operating in reactive mode.
No… this won’t solve all of your problems, and it doesn’t take into account other needs that could be best served by having yet another completely separate and independent infrastructure (recovery farm?), but following this guideline will move you significantly into the “Proactive” area… and maybe win you a few awards from your customers. :)