Certificate Services Disaster Recovery
Hey Everyone, welcome to my blog and my very first blog post :)
Public Key Infrastructure is my speciality and today I'd like to share some thoughts on disaster recovery. This blog post is an update to the disaster recovery article posted in 2010 linked here. When working with customers I like to look at disaster recovery by breaking it down into scenarios that could go wrong and how each scenario can be recovered from. Also, there are some mitigations that can be performed in an emergency situation if you absolutely cannot get a CA back online for a period of time, which I would like to touch on. I usually put these into a Backup Matrix so the different scenarios and their specific problems can be addressed easily, and because anything with matrix in the title sounds cool!
This post will use many acronyms throughout and assumes at least an intermediate level of knowledge about PKI. So, without further ado, fasten your seat-belts as this post will be a whopper!
The table below is the backup matrix I was referring to. It is not an authoritative list of everything that can go wrong with a PKI, but it covers some common high-level scenarios. The recovery steps are broken down further below with some detailed advice and guidance on what needs to be backed up so these recovery processes can be followed.
Scenario: Complete failure on Root CA
Impact: The CA can't issue any new certificates until it has been recovered. The CA is unable to publish a new CRL until it has been recovered, which will lead to certificate validation problems.
Scenario: Loss of access to private key
Impact: The CA can't issue any new certificates until it has been recovered.
Scenario: Database corruption or other database issue
Impact: The CA can't issue any new certificates until it has been recovered. The CA is unable to publish a new CRL until it has been recovered, which will lead to certificate validation problems.
Scenario: CDP is unavailable
Impact: If clients cannot access the CDP then certificate validation errors will occur.
Scenario: Custom templates removed from directory or lost due to other AD related failure
Impact: The CA will not be able to issue certificates based on certificate templates. The custom templates would need to be re-created or recovered.
Backing up a Certificate Authority
So, to the fundamentals first. Before we can restore a certificate authority we need to have a good valid backup. The best way to back up a certificate authority is to use the certutil command line tool, the Swiss army knife of anything PKI! There are three options you can use to back up the CA, depending on the information you want to capture. They are as follows:
certutil -backup -> This will back up both the database and private key of the CA
certutil -backupDB -> As the switch suggests, this will back up the database only
certutil -backupKey -> Again, as the name suggests, this will back up the CA private key only
Which one do I use, I hear you ask? Well, that depends on a couple of things. Firstly, are you using a Hardware Security Module (which is best practice for a CA)? If so, you will never use the full backup or backup key options, as private key backup and recovery will be specific to the type of HSM you are using. If you aren’t using a HSM, then you can use the other backup options to back up the key. However, once installed, the CA private key only changes at very lengthy intervals, such as when the CA certificate is renewed or otherwise re-keyed. Based on this, we only need a valid backup of the current private key, which may be valid on a CA for many years, such as for a root CA. In my opinion the private key only needs to be backed up at these intervals, and documentation should exist to either run “certutil -backupKey” or export the CA certificate with its private key manually using the MMC console at said intervals. When the backup of the private key is taken it should be protected with a very secure password that is safely recorded, such as in a fire safe with limited access. As the private key is the crown jewels of the CA, the backup of the key should also be secured and not left lying around on the server or in another easily accessible location. This is where the value of HSMs comes in: ordinarily anyone who has administrative access to the CA server can export or back up the private key to a PKCS#12 file, import it onto their own server and impersonate your CA, which is very bad! With a HSM the private key is protected in a hardware device which has special controls in place such that the keys cannot just be exported.
Now we have a good backup of the CA certificate and private key we can look to back up the CA database. As I mentioned already, the easiest and most elegant way to do this is to use the “certutil -backupDB” command; wrapping this into a batch file and running it on a schedule makes the most sense. In addition to allowing for CA recovery, backing up the CA database has the added benefit of truncating the CA log files. Some good information on huge CA databases, compaction and truncating log files is detailed in this blog post. Here is a sample script for backing up the CA database that I use; it is very rudimentary but gets the job done:
REM Set variables
REM NOTE: the two variable definitions below are missing from the listing as published and are assumptions.
REM E:\CA-Backup matches the restore path used later in this post; adjust to suit your server.
set myCABackup=E:\CA-Backup
REM Locale-independent date stamp (yyyyMMdd) used in the event log file names
for /f %%i in ('powershell -NoProfile -Command "Get-Date -Format yyyyMMdd"') do set myBackupDate=%%i
REM Create backup folders (the root folder must exist before the log is written)
if not exist %myCABackup% md %myCABackup%
echo %time% %date% > %myCABackup%\Backuplog.log
REM Rotate the last seven backups: drop 7, then shift 6 to 7, 5 to 6 and so on
if exist %myCABackup%\7 rd /s /q %myCABackup%\7
if exist %myCABackup%\6 rename %myCABackup%\6 7
if exist %myCABackup%\5 rename %myCABackup%\5 6
if exist %myCABackup%\4 rename %myCABackup%\4 5
if exist %myCABackup%\3 rename %myCABackup%\3 4
if exist %myCABackup%\2 rename %myCABackup%\2 3
if exist %myCABackup%\1 rename %myCABackup%\1 2
REM Publish new CRL
certutil -CRL >> %myCABackup%\Backuplog.log
REM Backup CA database
certutil -backupDB %myCABackup%\1 >> %myCABackup%\Backuplog.log
REM Backup CA registry key
certutil -getreg ca > %myCABackup%\1\ca-registry.txt
REM Backup CSP registry key
certutil -getreg ca\CSP > %myCABackup%\1\ca-registry-CSP.txt
REM Backup all templates published at CA
certutil -catemplates > %myCABackup%\1\CATemplates.txt
REM Backup events logs
wevtutil.exe epl Application %myCABackup%\1\Application-%myBackupDate%.evtx >> %myCABackup%\Backuplog.log
wevtutil.exe epl System %myCABackup%\1\System-%myBackupDate%.evtx >> %myCABackup%\Backuplog.log
wevtutil.exe epl Security %myCABackup%\1\Security-%myBackupDate%.evtx >> %myCABackup%\Backuplog.log
REM Backup CAPolicy file (only if one exists)
if exist %systemroot%\CAPolicy.inf copy %systemroot%\CAPolicy.inf %myCABackup%\1\
One thing you will notice about the script is that it creates folders for the last seven days so there are seven backups. Also, the script backs up the CA registry settings, CAPolicy file and the event logs for server. I think the event logs are useful, especially the security log, so that you have an audit trail of the CA that can be archived. I must mention that I always turn the auditing on the CA up to max as well, you never know when you’ll need to have that audit information! Now, I usually configure the script to run every 24 hours so there is a fresh backup every day for the last seven days stored on the data drive of the server. From there other tools can be used to backup the file system of the server for full server recovery, longer term archive and data retention. There are also some new PowerShell cmdlets that can be leveraged to perform the CA backup, so if you’re on a 2012 or later server you can leverage PowerShell instead of the old batch files.
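For anyone who prefers to see the seven-day rotation logic in isolation, here is a minimal Python sketch of it. It is not part of my backup script: the function name rotate_backups and the keep parameter are made up for illustration, and it only handles the folder shuffle, not the certutil calls.

```python
import shutil
from pathlib import Path


def rotate_backups(root: Path, keep: int = 7) -> Path:
    """Shift numbered backup folders up by one and return the path for the new backup.

    Same idea as the batch script: delete the oldest folder, then rename
    6 to 7, 5 to 6 and so on, which frees up folder 1 for the next backup.
    """
    root.mkdir(parents=True, exist_ok=True)
    oldest = root / str(keep)
    if oldest.exists():
        shutil.rmtree(oldest)  # equivalent of: rd /s /q <root>\7
    for n in range(keep - 1, 0, -1):
        src = root / str(n)
        if src.exists():
            src.rename(root / str(n + 1))  # equivalent of: rename <root>\6 7
    # Folder 1 is returned but not created, leaving that to the backup tool.
    return root / "1"
```

Calling rotate_backups(Path("E:/CA-Backup")) before each nightly run would leave folders 2 to 7 holding the previous six backups, oldest last.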
In addition to the backup of the CA database it is a good idea to have copies of the original installation scripts and CAPolicy file. You do have detailed installation documents and configuration management, right? I think the best thing to do in this situation is to have all of the commissioning scripts in a folder on the data drive where the CA database backup is, that way when it comes to restoring the database using the long term file backup solution the original scripts can also be restored as well. I also think it’s a good idea to publish a CRL every 24 hours regardless of the CRL validity period settings and to do this within the backup script. The main reason for this is because in my experience certain certificate policies require you to publish every 24 hours. This also ensures CRL information is up to date, see detail later about how the CDP can be kept up to date with this data as well.
The last thing on backing up the CA I want to mention is around having a record of all certificates that are issued. If you are only backing up the CA every 24 hours then the worst case scenario is that you will have nearly 24 hours’ worth of issued certificates that aren’t included in the backup. For a high volume CA that is pumping out lots of certificates this can be an issue. The certificates themselves will continue to work and the users, devices or services which are using them will function as before, however the big risk is that you can’t revoke those certificates. From a security point of view this is bad as potentially there are many certificates out in the wild that you have lost control of. There is a trick we’ll talk about later that can be used to enable the CA to have knowledge of these lost certificates or if all else fails we can revoke the certificates if we have their serial number. The Microsoft CA has a built in feature we can use to help us record the serial number and base64 encoded value of every certificate that is issued, this is done using the SMTP exit module. The SMTP exit module is not configured by default so this will need configuring during the initial building of the CA.
Detailed information on the SMTP exit module and how to configure it are detailed in the following articles:
Note in the script that you will need to configure the correct parameters for your SMTP server and emails addresses you wish to use.
Complete CA Failure
Now we have a full backup of the CA we can look at the different scenarios and what methods we can use to restore the CA to service or at least ensure clients and certificate validations are working. So in our first scenario let’s imagine the CA has had a complete failure; this could be anything from a physical server crashing and burning to, if virtual, a corruption of the virtual hard disk. In this scenario the CA will be unable to issue any new certificates or publish a new CRL. Now, it may appear that being unable to issue new certificates is the most important issue to fix so service can be restored, but being unable to publish a CRL is a far more important issue to address. Not having a current valid CRL will have a huge impact on a business where certificates are used heavily. Let’s take an example of an organisation that uses smartcards for Windows logon: if the CA’s CRL expires, all of their users will be unable to log on. I have indeed seen this happen in my career. From a business continuity point of view users not being able to log on to their Windows desktops on a Monday morning is just about the worst thing that could happen. Now, the amount of time you have before this world ending event occurs depends on your CRL publication interval and its overlap period. This is essentially your SLA. By way of example, if you have a CRL which has a base validity period of 24 hours and an overlap period of 24 hours, you have at a minimum 1 day (the CA fails just before the next CRL is due) and at a maximum 2 days (the CA fails just after publishing) to recover your CA. One point to make on this is that if you are using Delta CRLs with a shorter validity period, say 1 hour, then that will become your minimum time for recovery, as a valid Delta CRL is needed for certificate validation when Delta CRLs are in use.
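To make the SLA arithmetic concrete, here is a small Python sketch. It assumes the CRL expires at publication time plus validity plus overlap and ignores any clock-skew allowance the CA may add; the function name recovery_window is hypothetical.

```python
from datetime import datetime, timedelta


def recovery_window(last_publish: datetime, failure: datetime,
                    validity: timedelta, overlap: timedelta) -> timedelta:
    """Time left between the CA failing and its last published CRL expiring."""
    crl_expiry = last_publish + validity + overlap
    # Never report a negative window: zero means the CRL has already expired.
    return max(crl_expiry - failure, timedelta(0))
```

With a 24 hour validity and a 24 hour overlap, a failure immediately after publication gives a 48 hour window, and a failure 24 hours later (just before the next CRL was due) gives 24 hours, which matches the 1 to 2 day range described above.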
Ok, now that whole complicated explanation is out of the way, how do we fix this and not allow this apocalyptic event to occur? Well, there is only one way to fully restore the CA, which is to recover the CA from backup onto a new or rebuilt server. This may take some time depending on your own infrastructure, your level of monitoring and alerting for the CA (check back later for another post on this!) and your ability to action the server rebuild. If you cannot fully restore the CA within the SLA period (the CRL validity period) then there is a process to re-sign the CRL on another server and extend its validity so that clients can validate the CRL and carry on with their business. Obviously new certificate issuance and revocations will still be on hold until the CA is recovered, but at least the CEO will be able to log on and check his emails.
Full CA Recovery
The following steps can be used to fully recover a CA:
Build a new server, either new hardware or virtual, as your CA. It is best practice to use the same server name here, this will save some headache later.
Configure server OS to your standard, set IP address and join to the domain.
Restore the directory previously backed up to the data drive on the CA, this should include 7 days’ worth of CA database backups, the various certutil outputs, event logs and commissioning scripts.
Copy the CAPolicy.inf file back into the %WINDIR% directory.
Obtain a copy of the private key from secure storage and make it available on the CA. This may mean restoring the PFX file if not using a HSM, or configuring the HSM and restoring the private key using the HSM manufacturer’s instructions.
When using a HSM it is usually necessary to re-associate the hardware protected private key with the CA certificate. To do this, the CA certificate will need to be installed into the machine store of the CA (it can be retrieved from the AIA location where it should be published) and the CA private key available in the HSM device.
Once the certificate and private key are available the following command can be run to join the key to the certificate:
certutil -csp "<hardware CSP/KSP>" -repairstore MY "<CA Certificate Serial Number>"
You will receive feedback similar to the following:
certutil -csp "ncipher security world key storage provider" -repairstore My "12 36 2b a2 00 00 00 00 00 03"
================ Certificate 0 ==
Serial Number: 12362ba2000000000003
Issuer: CN=Contoso Issuing CA,OU=Contoso,O=Lab,C=GB
NotBefore: 01/06/2010 13:51
NotAfter: 01/06/2030 13:59
Subject: CN=Contoso Root CA,OU=Contoso,O=Lab,C=GB
CA Version: V0.0
Signature matches Public Key
Cert Hash(sha1): 45 f5 f9 e5 db 75 87 d2 ba e1 67 05 16 bb 00 d9 62 92 ed a2
Key Container = Contoso Issuing CA
Provider = ncipher security world key storage provider
Private key is NOT exportable
Signature test passed
CertUtil: -repairstore command completed successfully.
Note you will need to replace the CSP section with the CSP of the HSM manufacturer and the CA certificate serial number with the serial number of the CA certificate you are restoring.
- Once the certificate and private key have been restored by either importing the PFX file locally on the server or by using the HSM you should be able to see the certificate in the local machine store with the private key symbol as shown in the following screenshot:
If you are running on a 2012 upwards CA you can now install the Active Directory Certificate Services role before we perform the configuration. If you are on a lower level operating system then just install the role as normal and configure the installation as per the instructions in the next step.
Now we can perform a recovery installation of the CA following these steps, note the screenshots are from a 2012 R2 server but the options are the same across any CA version:
Select Enterprise CA or Standalone CA depending on which CA you are restoring.
Select Root CA or Subordinate CA depending on which CA you are restoring:
Select Use existing private key and then the Select a certificate and use its associated private key option:
The certificate you imported earlier should be listed in the Certificates dialog box, ensure this is selected and click Next
Note: If you are using a HSM there may be a need to check the “Allow administrator interaction…” check box. Ensure this setting is configured if required by the HSM vendor.
Configure your database and log location, this can be the same as before:
Observe the confirmation of the settings and click Configure (or Install on a non-2012 server):
Click Close to finish the configuration:
Once the CA role has been installed and configured the CA database can be restored, this can be done through the CA GUI or via the command line by running the following commands:
- net stop certsvc
- certutil -restoredb -f e:\CA-Backup\1
- net start certsvc
Note that the restore path above will be correct based on the sample script that is detailed above. Also note the -f parameter, which forces an overwrite of the existing database created during the install.
Now the CA server, private key and database have all been restored the original install script can be re-run to configure the CA. This will ensure the correct CDP and AIA paths are configured and CRL lifetimes as per the original install. If you are in doubt of the original config then the CA registry export in the backup folder can be used to rebuild the configuration of the CA. Any other configuration should now be applied to the server such as permissions on the CA, certificate managers etc.
All certificate templates which were previously published to the CA should be re-published. The list of these templates will be in the CATemplates.txt file located in the backup folder.
Any scheduled tasks should also be configured for the monitoring of the CA and copying of the CRL as per the original install.
The CA will now be fully operational and should be validated to ensure it is functioning correctly. You can test a certificate enrolment and revocation and then publication of the CRL to ensure this is working.
The last thing to do is to replay any certificates into the database that have been issued since the last backup. As we touched on earlier we can get the details of the certificates by filtering the inbox where the SMTP exit module sends the emails to. This should be filtered to only show emails which have been received between the last backup date and the current time. We can then look at all emails which contain details of the issued certificates. The following screenshot shows a sample of this from my lab:
As you can see, the email contains certificate details but pay particular attention to the base64 encoded certificate, which can be taken and pasted into notepad to re-construct the whole certificate. Follow these steps to replay the certificates to the database:
Save the base64 encoded certificate from each of the emails to a .cer file in a folder on the CA. You should have a .cer file for every certificate you wish to restore.
Before we can replay any certificates into the database we need to configure the CA to allow this by running the following command (restart the CA service afterwards so the change takes effect):
certutil -setreg ca\KRAFlags +KRAF_ENABLEFOREIGN
Run the following command to add the certificate into the database:
certutil -f -importcert <certificate file>
Where <certificate file> is one of the saved files. You could script this so it loops through all of the certificates in the folder one by one and adds them to the database if you have many.
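As a rough sketch of how that loop could be scripted, the Python below cleans up base64 text pasted from an email, saves it as a .cer file and builds one import command per file. Everything here is an assumption for illustration: the function names are hypothetical, the input is expected to be bare base64 without BEGIN/END header lines, and the commands are only built as argument lists rather than executed.

```python
import base64
from pathlib import Path


def save_certificate(b64_text: str, target: Path) -> Path:
    """Write base64 certificate text copied from an email to a .cer file."""
    # Drop the line breaks and spaces that mail clients tend to add.
    compact = "".join(b64_text.split())
    # Raises an error if the paste is damaged or incomplete.
    base64.b64decode(compact, validate=True)
    lines = [compact[i:i + 64] for i in range(0, len(compact), 64)]
    target.write_text(
        "-----BEGIN CERTIFICATE-----\n"
        + "\n".join(lines)
        + "\n-----END CERTIFICATE-----\n",
        encoding="ascii",
    )
    return target


def import_commands(folder: Path) -> list:
    """Build one 'certutil -f -importcert <file>' argument list per .cer file."""
    return [["certutil", "-f", "-importcert", str(p)]
            for p in sorted(folder.glob("*.cer"))]
```

Each argument list could then be handed to subprocess.run on the CA itself. Note the check only proves the text is valid base64, not that it is a certificate issued by this CA, so certutil remains the real judge.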
Once all of the above steps have been completed your CA should be fully recovered and up to date with all issued certificates. The above procedure can also be adapted to migrate a Certificate Authority to another server, say during an upgrade to a newer platform, the concept is essentially the same. There are a few differences in the exact steps the full details of which can be found here. The process can also be used to recover the CA to a test environment for validating your disaster recovery plans and backups.
Manual CRL Resign
As you can see from the steps above a full CA recovery can be a lengthy process, particularly if you have not been alerted to its failure in a reasonable time frame. As we mentioned earlier this may mean that the CA cannot be recovered before the CRL is due to expire which would have dire consequences to any applications or services relying on that CA’s certificates. To mitigate this a manual re-sign of the CRL can be done to extend the validity of the CRL for a short time period while the full recovery above is performed. The following steps go through a manual CRL resign:
Log on to a server in your environment, preferably another CA, or you could use an up to date client operating system such as Windows 8.
Obtain a copy of the private key and make available on the machine, as detailed above this may be restoring the PFX file if not using a HSM or by configuring the HSM (which is why another CA is preferred) and restoring the private key using the HSM manufacturer’s instructions. I won’t go into the details of this as they are above in the Full CA Recovery section, but you should have the CA certificate and private key installed with the certificate showing the private key symbol as seen in the screenshot further above.
Obtain the latest CRL from the CDP and copy it to a temporary directory on the machine, then rename the CRL by adding -old to the end of the file name.
Extend the CRL lifetime by running the following command:
certutil -f -sign <CRLFileName-old.crl> <CRLFileName.crl> 0:XX
Where 0:XX is the number of hours you want to add to the CRL’s validity, such as 0:48 for adding 48 hours. You can specify whichever time period you desire here, but don’t go crazy: only specify a time period long enough to be able to perform a full recovery. It is tempting to add many days here, but remember that the CRL may then be cached at the client side for that period of time, and the client may not retrieve an up to date CRL until that extended period has passed.
Also note that the old file is specified first and then the newly signed file. The newly signed file should have the name that is referenced in the CDP.
The newly signed CRL should then be published to the CDP, this may include a web CDP and LDAP CDP. For the web CDP, copy the file over to the virtual directory and overwrite the existing file, for the LDAP CDP run the following command:
certutil -f -dspublish <CRLFileName.crl>
- Once the CRL has been published then the CDPs should be checked by using the Enterprise PKI console, you should see the status as OK for all CDPs, similar to the below screenshot:
Another point to make here, although slightly off topic, is that the above certutil command can be used to add entries to the CRL. So if you haven’t restored the CA fully or you are having some issues and you need to revoke a certificate in a hurry (maybe an employee has gone rogue?) you can manually add a serial number to the CRL by running the following command:
certutil -f -sign <CRLFileName-old.crl> <CRLFileName.crl> +"<serial number>"
Where <serial number> represents the serial number of the certificate you wish to add to the revocation list such as “42 00 00 00 05 06 0a 1a 29 6d 40 a1 3e 00 00 00 00 00 05”
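Serial numbers get copied around in several shapes (with spaces, with colons, as one long string), so a small helper can tidy them up before they go anywhere near the CRL. This Python sketch is illustrative only: the function names are hypothetical, and it builds the argument list for the command above without running it.

```python
import re


def normalise_serial(serial: str) -> str:
    """Return a serial number as lower-case hex byte pairs separated by spaces."""
    digits = re.sub(r"[\s:]", "", serial).lower()
    if not digits or len(digits) % 2 or not re.fullmatch(r"[0-9a-f]+", digits):
        raise ValueError(f"not a valid hex serial number: {serial!r}")
    return " ".join(digits[i:i + 2] for i in range(0, len(digits), 2))


def revoke_via_resign(old_crl: str, new_crl: str, serial: str) -> list:
    """Argument list for: certutil -f -sign <old CRL> <new CRL> +"<serial number>"."""
    # Passed as a single argument, so no extra quoting is needed around the spaces.
    return ["certutil", "-f", "-sign", old_crl, new_crl, "+" + normalise_serial(serial)]
```

For example, normalise_serial("12362BA2000000000003") gives "12 36 2b a2 00 00 00 00 00 03", the same spaced form used in the repairstore example earlier.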
Loss of Access to Private Key
Let’s now imagine a scenario where the CA loses access to its private key only. This could be due to a HSM failure or even accidental deletion of a software based key from the local Crypto folder by an administrator. In this instance the private key should be restored following step 5 in the Full CA Recovery section above. Once the key is restored, the service should be restarted and the CA should then function correctly.
Database corruption or other database issue.
Another scenario that could happen is database corruption, maybe due to a disk error or something similar. The process for recovery or mitigation is similar to the steps involved for a full recovery. Ultimately a recovery of the CA database will need to be done following steps 8 and 11 in the Full CA Recovery section above. Once the database has been restored the service should be started and, in theory, the CA will be good to go.
If it will take some time to recover the database then the procedure for resigning the CRL can be performed to extend the CRL lifetime.
CDP is Unavailable
When designing CRL distribution points you should ensure that redundancy is built into the design to cope with any failure at any point in the infrastructure. You should typically have more than one CDP listed and each CDP should be accessible from the clients you wish to serve, e.g. if you are providing services to remote clients then the CDP should be externally accessible. I normally advocate having a web CDP as the first one in the list due to its ubiquity amongst client types, including network devices and other non-Windows clients. This ensures that clients of all flavours will be able to retrieve a CRL from the first CDP in the list. Remember that with PKI the first CDP in the list is tried by the relying party, and if this is unavailable the second CDP is tried using the retry algorithm. This is explained in great detail here, although each device type may be different. So, having a web CDP first should minimise timeouts while connecting to the CDP. The downside to using a web CDP is that the server on which it is hosted needs to be redundant using something such as Network Load Balancing, or at a minimum DNS round robin. This adds complexity and potential cost to the web service when it may only be serving files of a few kilobytes. It would be good practice to configure the CDP on another highly available web property that you own which is already configured for load balancing and monitored for up time.
Another area of concern for a web CDP, probably the biggest area for failure, is in the process used for copying the CRL from the CA over to the web server. Generally the CA will be configured to publish the CRL to the local file system and then a scheduled task of some sort used to copy the file over to the server. You can use any mechanism you like here for copying the file, maybe a PowerShell script, a VBScript or a good old batch file using robocopy. The way I like to do this is to create a task attached to Event ID 4872 in the security log. This audit event is generated in the security log when you have CA auditing switched on within the CA console and the “Audit Object Access” local security setting configured. I highly recommend you configure these settings for your CA. More information on the CA audit events can be found here. The steps for creating this task are listed below:
Open the security log in event viewer and filter for event ID 4872 as per the following screenshot:
In the results select one of the audit events for the CRL publication, right click this and select Attach Task To This Event…
Give the task a name and description how you like and click Next.
Click Next again.
Select to Start a program and click Next.
Select the script you wish to run, again this can be any copy script you desire, and click Next.
Click Finish to create the task.
Now any time the CA publishes a CRL, either automatically or manually, the copy script will run and push that CRL to the web CDP. This mechanism isn’t infallible but generally works quite well. You may wish to add your own monitoring to make sure the script is running, maybe by sending a mail on CRL copy or by writing a log file. This mechanism also has the added bonus that the CRL file on the CDP will always be the latest one instead of waiting for some other arbitrary schedule.
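If you want the copy script to double as its own monitoring, something along these lines would do. This is a minimal Python sketch rather than the script I use: the function names, folder layout and log format are all assumptions, and the web CDP folder is expected to already exist (for example as a share on the web server).

```python
import hashlib
import shutil
from datetime import datetime
from pathlib import Path


def sha256(path: Path) -> str:
    """Hex digest of a file, used to confirm the copy arrived intact."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def copy_crls(publish_dir: Path, cdp_dir: Path, log_file: Path) -> list:
    """Copy every .crl file to the web CDP folder and log the outcome of each copy."""
    copied = []
    with log_file.open("a", encoding="utf-8") as log:
        for crl in sorted(publish_dir.glob("*.crl")):
            target = cdp_dir / crl.name
            shutil.copy2(crl, target)  # overwrites the previous CRL on the CDP
            ok = sha256(crl) == sha256(target)
            status = "OK" if ok else "MISMATCH"
            log.write(f"{datetime.now().isoformat()} {crl.name} {status}\n")
            if ok:
                copied.append(crl.name)
    return copied
```

The log file gives you something simple to alert on: no new line after a CRL publication, or a MISMATCH entry, means the web CDP may be serving a stale CRL.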
The other type of CDP you can configure is LDAP based, where a Microsoft CA will store CRLs in Active Directory. LDAP CDPs may not be supported by all devices, hence my reasoning for adding them second in the list, but they provide redundancy “out of the box”. Because they are stored in the configuration partition of AD, they are replicated to all domain controllers in the forest. AD provides inherent redundancy via its multi-master model but also allows clients to use AD sites to connect to their closest domain controller. A Microsoft CA can also publish its CRL into AD automatically on CRL publication, so no extra scripts or configuration are required.
To summarise CDP failure, it is vitally important to ensure you have multiple CDPs configured and that with web CDPs some form of load balancing is implemented. Also, a robust copy mechanism should be used to ensure the web CDP always has the latest CRL available. LDAP CDPs have redundancy and availability by virtue of being stored in AD.
Custom templates removed from directory or lost due to other AD related failure
The last disaster recovery point I want to cover is around certificate templates. Imagine that an administrator has accidentally deleted all certificate templates from AD or something has gone wrong with AD whereby the templates are unusable. There isn’t a built in way to back up and restore those templates apart from using the AD tools, which is a bit heavy handed just to restore the certificate templates. To help with managing this I’ve created a PowerShell script which will back up the certificate templates to an XML file which can later be used to re-create all of the certificate templates. The script can be found here. This script uses the Certificate Enrolment Web Services and Certificate Enrolment Policy Web Services roles, as the API needed to dump the templates out this way was added with this service. As far as I’m aware there isn’t any other supported mechanism to dump the certificate template objects out of AD. It should be noted that the CEP and CES services should be installed and the script needs to be run on the box where these services are. More detail on the script can be found on the script centre posting.
Once you have a backup of your certificate templates the script can be used to re-create the exported templates within any Active Directory environment. All elements of the exported templates will be re-created apart from security permissions, these will need to be manually re-added.
Using the backup and restore information above along with the backup matrix you should be well equipped to create your own disaster recovery documentation and approach to recovering your CA infrastructure. However, you may still be in a predicament whereby you can’t execute a recovery for some reason; maybe all your backups have been lost, stolen or otherwise corrupted. Well, there are still options for allowing users, computers and services to carry on functioning when a new CRL cannot be issued. You can switch off certificate revocation checking.
NOTE: I do not advocate using this approach unless you absolutely have no choice in the matter, as disabling revocation checking is a huge security risk. Follow the below advice at your own risk!
Revocation checking is down to each individual application and will need to be configured on a per-app basis depending on which application you need to restore functionality to. As an example, if you are using smartcards for logon then you can disable revocation checking at the domain controller to allow user certificates to carry on being validated even when the CRL has expired. As noted, this is a big risk but may be your only choice. Other applications and services will have their own method for disabling CRL checking and you should look this information up or, if need be, contact the application vendor if it is a non-Microsoft application, service or device.
To summarise this monster post and close out I’d like to leave you with a few key points:
Having a robust backup strategy for your PKI is essential and this should be documented. You can use the approach detailed above to help build this out.
There is great value in understanding the different disaster scenarios and ensuring you have a restoration strategy for them. Using something like the backup matrix and referencing the particular restore process would be beneficial.
There are some techniques you can use to get yourself up and running in an emergency. Re-signing the CRL is preferable but as an absolute last resort you can disable revocation checking.