When running large-scale upgrades (think thousands of computers), a critical component of success is having an effective process for triaging and resolving issue reports.
Every Windows and Office rollout generates issues, some real and some imagined.
In this post, I describe the process we often use with customers.
As context we use the following guiding principles for issue resolution:
- A rollout should take reasonable steps to minimise issues; however, issue-free rollouts are not realistic for large deployments.
- Issue resolution should be focussed on the most time-efficient methods for minimising user disruption.
- Rollout duration should be minimised, as an environment in transition introduces more unforeseen challenges than a post-rollout environment.
- Not all issues prevent rollout from proceeding.
Given the above principles, we champion the process below.
Criteria for accepting an issue
All the information in this section is best gathered by the team directly communicating with end-users; this may be the project team, but often it includes local Site Support and Service Desk teams.
- Confirmation the device was impacted by a rollout.
- It may sound obvious, but make sure the impacted device was actually touched by the rollout (Users and Support teams alike can make mistakes and incorrectly assume a rollout impacted a machine).
- Specific information must be provided.
- Time and date of an issue.
- A screenshot of the issue if applicable.
- Provision of local logs if pre-defined as required.
- Timing of a reported issue with rollout to a specific device must be validated.
- The rollout can be excluded from consideration if the issue actually occurred before the rollout reached the impacted device. This sounds obvious, but in our experience User and Support teams can get confused and incorrectly attribute an issue to a rollout that has not yet occurred.
- In some cases, if the issue was actually much later (for example weeks later) than a rollout action then rollout impact can also be excluded from consideration.
- More than one computer must be impacted.
- An issue on just 1 computer is treated as noise (remember, this approach is for large scale rollouts after initial pilot periods are complete).
- Confirm whether computers outside of the rollout scope are impacted – this is often neglected, and the perils of confirmation bias mean support teams focus only on machines within the scope of a rollout when attempting to attribute a cause.
- The issue must be re-occurring.
- An issue that occurs during a single window of time, and then never appears again, is no longer an issue, irrespective of whether multiple computers were impacted.
- Reasonable troubleshooting must be attempted by operational support teams.
- This is in the interests of getting a user back up and running.
- A process checklist of what “reasonable steps” are is central to this step working efficiently. This may require any combination of rollback, re-install, re-image or provision of a spare device. The process checklist is also key for communicating known issues and workarounds.
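The acceptance criteria above can be sketched as a simple triage check. This is an illustrative sketch only: the record fields, the two-week cut-off, and the rejection messages are assumptions for the example, not a prescribed schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Illustrative issue-report record; field names are assumptions, not a schema.
@dataclass
class IssueReport:
    device_in_rollout: bool          # device was actually touched by the rollout
    issue_time: datetime             # when the issue occurred
    rollout_time: datetime           # when the rollout reached this device
    has_screenshot_or_logs: bool     # specific evidence was provided
    impacted_device_count: int       # distinct devices showing the issue
    recurring: bool                  # issue has re-occurred
    troubleshooting_attempted: bool  # support team tried reasonable steps

def accept_issue(r: IssueReport, max_gap: timedelta = timedelta(weeks=2)) -> tuple:
    """Return (accepted, reason) per the acceptance criteria above."""
    if not r.device_in_rollout:
        return False, "device not touched by rollout"
    if not r.has_screenshot_or_logs:
        return False, "missing specific information (time/date, screenshot, logs)"
    if r.issue_time < r.rollout_time:
        return False, "issue pre-dates rollout on this device"
    if r.issue_time - r.rollout_time > max_gap:
        return False, "issue occurred too long after rollout"
    if r.impacted_device_count < 2:
        return False, "single-device reports are treated as noise"
    if not r.recurring:
        return False, "issue has not re-occurred"
    if not r.troubleshooting_attempted:
        return False, "reasonable troubleshooting not yet attempted"
    return True, "accepted"
```

Encoding the checklist this way also gives the returning ticket a concrete reason, which keeps the validation loop with Site Support fast.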
Approach for categorising an issue
The steps in this section should be completed by the project team and can be completed by a relatively junior resource.
- If the criteria for accepting an issue (per the previous section) have not been met, the ticket should be returned with a request to supply the missing information. This validation process has to be quick so that it helps rather than hinders efficient resolution.
- If the criteria for accepting an issue have been met, the issue should be logged in whatever system has been agreed for issue management. Each issue should be given a priority for resolution, a technical owner, and, for the avoidance of doubt, an explicit flag indicating whether the issue prevents the rollout from continuing. This categorisation should be visible to any team providing end-user support.
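The logged record can stay minimal. A sketch of the categorisation fields named above, with illustrative names and values (the priority scale and owner label are assumptions for the example):

```python
from dataclasses import dataclass
from enum import Enum

class Priority(Enum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3

# Illustrative logged-issue record mirroring the categorisation in the text:
# a priority, a technical owner, and an explicit "does this block the rollout?" flag.
@dataclass
class LoggedIssue:
    summary: str
    priority: Priority
    technical_owner: str
    blocks_rollout: bool

issue = LoggedIssue(
    summary="Intermittent Wi-Fi drops after upgrade",
    priority=Priority.HIGH,
    technical_owner="network-team",   # hypothetical owner
    blocks_rollout=False,             # rollout continues while investigation proceeds
)
```

Making `blocks_rollout` a dedicated field, rather than inferring it from priority, is what removes the doubt for support teams watching the queue.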
Approach for addressing an issue
The steps that follow require relatively skilled technical resources where “first call resolution” is highly unlikely (i.e. there is a real issue that does not yet have a treatment identified). This is the reason for the information gathering and pre-validation that occurs in advance.
- Review logs
- Logs are the easiest place to look first, so automate the collection of logs as much as possible in advance (a topic for a separate post).
- Reviewing logs can be an art in itself, and at a minimum, understanding what “normal” or “healthy” logs look like is a pre-requisite. Microsoft Premier Support has custom tools and dedicated experts for log review, so they should be leveraged as part of the log review process.
- The more logs you have, the better your chances not only of identifying the cause of a reported issue but also of finding occurrences of the issue that have not yet been reported.
- Identifying an issue just from logs is a quick win, but the reality is often the logs just don’t help. Either the right logs are not enabled by default, or there is too much noise in the logs to identify anything meaningful.
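That second payoff of broad log collection, finding unreported occurrences, can be sketched as a sweep for a known error signature across the collected logs. The directory layout (one folder per machine) and the signature pattern are assumptions for the example:

```python
import re
from pathlib import Path

# Hypothetical error signature for a wireless-disconnect issue.
ERROR_PATTERN = re.compile(r"DISCONNECT_REASON=\d+", re.IGNORECASE)

def machines_with_signature(log_dir: Path) -> set:
    """Scan collected logs (one subdirectory per machine) and return the
    names of machines whose logs contain the error signature — including
    machines whose users never raised a ticket."""
    hits = set()
    for log_file in log_dir.glob("*/wlan.log"):
        text = log_file.read_text(errors="ignore")
        if ERROR_PATTERN.search(text):
            hits.add(log_file.parent.name)  # machine name taken from directory name
    return hits
```

In practice this kind of sweep gives you an early read on how widespread an issue really is, before relying on users to report it.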
- Reproducing an issue becomes the next step if logs do not help. If the issue can be reproduced, logs can be enabled as necessary to allow step 1 to be retried.
- For some issues, you may find that logs do not help and the issue cannot be reproduced on demand. So while the issue is re-occurring, catching it in the act is impractical. The approach we use here is to find more users with the issue via a sample user survey. This achieves two objectives: firstly, it validates whether the issue is widespread or limited; secondly, it provides more target samples. The result is that either the issue is deprioritised and put on a “watching” status due to insignificant impact, or log collection is targeted at multiple machines to increase the probability of catching the issue.
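The value of targeting multiple machines can be seen with a back-of-envelope calculation. Assuming (purely for illustration) that an intermittent issue fires independently with probability p per machine per day, collecting logs from n machines for d days catches at least one occurrence with probability 1 − (1 − p)^(n·d):

```python
def catch_probability(p: float, machines: int, days: int) -> float:
    """Probability of capturing at least one occurrence, assuming the issue
    fires independently with probability p per machine per day."""
    return 1 - (1 - p) ** (machines * days)

# With a 5%-per-day issue, one machine watched for a week catches it only
# ~30% of the time; twenty machines for a week make capture near-certain.
single = catch_probability(0.05, machines=1, days=7)
fleet = catch_probability(0.05, machines=20, days=7)
```

The independence assumption is crude, but the direction of the result is the point: widening the collection net is usually cheaper than waiting longer on one machine.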
- Trial by elimination
- This step only applies where log data provides no clues to an issue, but an issue can either be reproduced, or there are enough occurrences of an issue to allow trial by elimination experiments.
- Trial by elimination provides an effective method to rule components out, and thus focus on the remaining components of the technology stack. For example, is a wireless disconnection issue due to a local Wi-Fi Access Point, or due to wireless driver instability? Test the same computer model and drivers at a different site: if the issue re-occurs, investigation should focus on the driver; if it does not, the local Access Point becomes the prime suspect.
- The above process is key to getting us back to identifying the cause of an issue via logs.
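The Wi-Fi example above reduces to a one-variable experiment: hold the machine model and drivers constant, vary the site, and let the outcome narrow the suspect list. A minimal encoding of that decision, with the outcome labels as illustrative placeholders:

```python
def wifi_elimination(reoccurs_at_other_site: bool) -> str:
    """Same computer model/drivers tested at a different site:
    the outcome tells us which component to investigate next."""
    if reoccurs_at_other_site:
        return "wireless driver"        # the issue followed the hardware/driver combo
    return "local Wi-Fi access point"   # the issue stayed with the original site
```

Each such experiment either eliminates a component or points the targeted log collection from the earlier steps at the right one.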
- Identify a treatment
- Steps 1 to 4 are focussed on identifying the cause of an issue via logs. This is the only method that identifies a root cause with confidence, and it prevents issue resolution reverting to conjecture and guesswork.
- Attempting to identify a treatment without evidence in logs is highly undesirable, thus “try harder” on the above steps if log evidence is not identified.
- Somewhere in the above steps there will be a Eureka moment, and treatment options can then be identified. These may include short-term options to step around the cause of the issue to allow rollout to continue, workarounds, or the deployment of a fix.
The above process, or variations of it, is what we have found works successfully. We find there are two key takeaways:
- Have an issue resolution process in the first place, and agree on guiding principles for what the process should achieve; e.g. zero issues and maximum rollout pace are competing objectives, so the approach to balancing them should be consciously acknowledged.
- By having a process you can set expectations on the work impact to end-user support teams. There will be an impact, and a process is the best way to communicate and agree on what that will be.