
3 Votes
Chris-7032 asked JanKraus-1911 commented

Windows Server 2019 dedup compatibility issues after 2021-06 Cumulative Update (KB5003646)

I have discovered a compatibility issue with the Windows Deduplication feature on Windows Server 2019 after the 2021-06 Cumulative Update (KB5003646) is applied.

If the Windows Server 2019 machine has a volume attached that was originally deduped on a Windows Server 2016 machine, it will crash with a BSOD relating to dedup.sys.

The dedup.sys driver for Windows Server 2019 was updated between 2021-05 CU (file version 10.0.17763.1554) and 2021-06 CU (file version 10.0.17763.1971).
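For anyone checking which side of this change a server is on, here is a minimal Python sketch that compares a dedup.sys file version against the two versions quoted above. The function names are made up for illustration, and the 10.0.17763.1971 threshold is only what was observed in this testing, not an official list of affected builds. Versions are compared as integer tuples because a plain string comparison would sort 10.0.17763.999 after 10.0.17763.1554.

```python
# Driver versions quoted in this post (observations, not an official list).
MAY_2021_DRIVER = (10, 0, 17763, 1554)   # 2021-05 CU, no crash observed
JUNE_2021_DRIVER = (10, 0, 17763, 1971)  # 2021-06 CU (KB5003646), crash observed


def parse_version(text):
    """Turn '10.0.17763.1971' into a tuple of ints so versions compare numerically."""
    return tuple(int(part) for part in text.strip().split("."))


def has_june_2021_driver_or_later(version_text):
    """Return True if the given dedup.sys version is at or above the 2021-06 CU driver."""
    return parse_version(version_text) >= JUNE_2021_DRIVER


if __name__ == "__main__":
    for version in ("10.0.17763.1554", "10.0.17763.1971"):
        print(version, "->", has_june_2021_driver_or_later(version))
```

The version string itself would be read from the file properties of C:\Windows\System32\Drivers\dedup.sys on the server in question.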

I have reproduced this issue with a clean install of Windows Server 2019 from the RTM ISO with no additional software or configuration changes. Up to and including OS build 17763.1935 (2021-05 CU) there are no issues accessing data on a volume originally deduped on Windows Server 2016. Once the system is updated to OS build 17763.1999 (2021-06 CU), accessing data on that same volume causes a BSOD relating to dedup.sys.


The issue is not present for volumes which were first deduped on Windows Server 2019; it occurs only when the volume was first deduped on Windows Server 2016.

This was originally discovered when performing a rolling OS upgrade of a failover file server cluster running Windows Server 2016 nodes which use deduped volumes for file shares. Once the roles with the deduped volumes were moved to the first Windows Server 2019 node it caused that node to BSOD.

windows-server

0 Votes
TeemoTang-MSFT answered TeemoTang-MSFT edited

Your discovery and reproduction steps are valuable, thank you for the effort. I will submit this situation to Microsoft.
However, in-place upgrades are never recommended. In fact, since Windows 10/Server 2016 was released, the upgrade process has essentially been a clean install followed by a data migration. It is actually quite a neat technique they are moving towards: almost like a partition-based install, but keeping the same registry, data and program folders.
Therefore, if you clean-install Windows Server 2019 and then update it to build 10.0.17763.1971, I think the Windows Deduplication feature will not cause a BSOD. See this similar case:
https://docs.microsoft.com/en-us/answers/questions/435392/post-server-2016-gt-2019-upgrade-dedupsys-bsod.html?childToView=435473#answer-435473





Hi TeemoTang,

Thank you for your reply.

However, I have not done an in-place upgrade in any of the scenarios described. The above was done with all fresh installs for the system drive.

The process where this is particularly an issue is with a Failover Cluster rolling OS upgrade which is supposed to be a supported scenario. An overview of the process is:

  1. Drain roles for one node in the failover cluster and evict from the cluster.

  2. Create a new VM, install Windows Server 2019 and configure as needed.

  3. Attach the shared storage to the new VM (in this case, multiple VHDSets).

  4. Add the new VM to the failover cluster.

  5. Repeat for the second node.

As you can see, these are fresh installs; only the existing data-only volumes are carried over.




0 Votes 0 ·

When I did testing on the fresh VM as described in my post, I created a new VM and installed Windows Server 2019. Then I attached a VHDX containing only data (not a system drive) which had already been deduped on a Windows Server 2016 system. When 2021-06 CU was installed on the Windows Server 2019 VM (which contains an updated dedup.sys file) it would BSOD when accessing data on that deduped volume.

It is worth noting too, on this test VM, if I replace the newer dedup.sys with the prior version then accessing the data on that deduped volume does not BSOD.

The steps to replace the driver were: (again this is a test VM only, not production)

  1. Shut down the test VM.

  2. Mount the system drive for the VM.

  3. Take ownership of C:\Windows\System32\Drivers\dedup.sys and grant Administrators write permission.

  4. Replace C:\Windows\System32\Drivers\dedup.sys with the older version from Windows Server 2019 with CU 2021-05 installed.

  5. Unmount the volume and boot the VM.



0 Votes 0 ·

OK, I understand: your test doesn't involve an upgrade, only a previously deduped VHD being attached, and when you replace the newer version of dedup.sys (from KB5003646) with the older version from Windows Server 2019 with CU 2021-05, the BSOD doesn't appear. That is enough to support your conclusion.
I have submitted this case via our channel, so let's wait for an update.
If you need further assistance, such as a remote session or in-depth analysis, you can open a request ticket with Microsoft support.
https://support.serviceshub.microsoft.com/supportforbusiness

0 Votes 0 ·
0 Votes
Chris-7032 answered TeemoTang-MSFT commented

Hi TeemoTang-MSFT,

Thank you for your assistance.

I would also add the following:

  • The BSOD doesn't occur as soon as you access the volume, but rather when you try to read a large chunk of data from it. The most reliable way I found to reproduce the crash is to start an antivirus scan targeting the deduped volume. My tests were done with Windows Defender, not third-party AV, but I have also confirmed that the same behavior occurs if Windows Defender is removed and a third-party AV is installed.

  • I have tested with two different volumes which were created and deduped on Windows Server 2016 originally and then moved to a Windows Server 2019 CU 2021-06 system. So it is not a specific volume that has an issue.




OK, got it. For now, all we can do is wait for a further update.

0 Votes 0 ·
0 Votes
Stefan-9361 answered

Hi,

Can we have an update on this problem, please?
This should also be communicated immediately at a higher level, so that other customers have a chance to react.

This just turned out to be a huge issue for our production environment. We literally lost money due to an outage of more than an hour.

All the details described above apply to us: volumes originally deduped on Server 2016, now attached to Server 2019.
We had one node fail 4 days ago and the second node fail 3 days ago. So far no big deal, failover handled it. We deactivated AV to reduce the amount of data being read.

Today both nodes failed simultaneously. A restart did not help; the BSOD came back as soon as 30 seconds after reboot.
We aimed to uninstall the cumulative update, which is impossible with a BSOD immediately after startup. Booting into safe mode was no solution, since the uninstall did not work there. I guess some required services were missing, but who knows which ones to start...
After some failed reset attempts I was lucky enough to prevent the File Server role from starting up. At that point the server did not crash and I was able to uninstall the cumulative update. Even then I had to wait 15 to 30 minutes at a "100% applying updates" boot screen, not sure whether the machine was stuck or actually doing something...
Back in Windows without the latest update, everything seems to be working fine at the moment.


0 Votes
Chris-7032 answered

I have a support case open with Microsoft for this bug but I have also had no updates since providing them with the crash dump and event logs 5 days ago, despite being told it would take 1-2 days to analyse.

In my situation I have been fortunate enough to be able to go back to Server 2016. I am considering disabling dedup and unoptimising the volumes on Server 2016, then creating new Server 2019 nodes and deduping the volumes again (I've tested this and it does appear to work around the issue). My dataset is not so big that this is out of the question. Even if Microsoft do eventually fix their bug, who's to say the same thing won't happen again with a later update, as there is clearly a difference between dedup volumes created in Server 2016 and those created in Server 2019.


0 Votes
Stefan-9361 answered Chris-7032 commented

I agree, unoptimizing the volumes and starting over with the 2019 June driver would be a safe workaround. Unfortunately this might be almost impossible for us. We have multiple volumes, all of which would hit their capacity limits once unoptimized. This would mean a lot of time-consuming copy jobs and disk space on a scale we just don't have lying around, and all of that for folders that mostly have to be available 24/7.
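A rough way to estimate whether a volume could be unoptimized in place is to add the space saved by dedup back onto the space currently used and compare the result with the volume capacity. Below is a minimal Python sketch of that arithmetic. The function names, the 10% headroom and the example figures are all assumptions for illustration; the real capacity, used-space and saved-space figures would have to be taken from the dedup reporting on the server, and the estimate ignores any growth while the job runs.

```python
TIB = 1024 ** 4  # bytes per tebibyte


def unoptimized_bytes(used_bytes, saved_bytes):
    """Estimate the on-disk size once dedup savings are given back (approximation)."""
    return used_bytes + saved_bytes


def shortfall_bytes(capacity_bytes, used_bytes, saved_bytes, headroom_percent=10):
    """Return how many bytes are missing to unoptimize in place, or 0 if it fits.

    headroom_percent is an assumed safety margin that should stay free on the volume.
    """
    usable = capacity_bytes * (100 - headroom_percent) // 100
    needed = unoptimized_bytes(used_bytes, saved_bytes)
    return max(0, needed - usable)


if __name__ == "__main__":
    # Hypothetical volume: 10 TiB capacity, 6 TiB used, 7 TiB saved by dedup.
    missing = shortfall_bytes(10 * TIB, 6 * TIB, 7 * TIB)
    print(f"Missing space: {missing / TIB:.1f} TiB")
```

A non-zero result means the data would have to be moved to additional storage before or during unoptimization.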

The changed driver also had another serious impact for us:
Our backup software, Veeam, relies on it, and I guess Veeam is not the only backup solution with similar mechanisms. Currently I cannot restore any single file originally stored on a deduped volume of the file server. I hope this will work again as soon as I uninstall KB5003646 from the backup machine, too.

Supposedly this is merely a slight change in the driver, yet the point above shows that it made dedup literally incompatible within the same server version! You expect that when you step up from 2016 to 2019, and it is well known and documented for Veeam. But this broke so many things at once that we feel at risk right now.

Oh, and by the way, shadow copies are broken right now too. They were disabled after the rollback to the May update and cannot be activated again. I guess something changed in the way scheduled tasks are implemented in the June CU.


I have had a response from Microsoft support that this is a known issue and that it is because the reparse points used by dedup were changed. I have not yet been given a solution.

0 Votes 0 ·
0 Votes
MasterIT answered

We have the exact same issue ourselves.

Server 2016, with an in-place upgrade to 2019 a few years ago.

Everything was working perfectly until a month ago, which would coincide with the last Patch Tuesday.

If we disable dedup on our storage volumes, it brings our reboots down to one or two after hours, when backups to this file server commence.

We can uninstall the update via DISM from Windows RE/PE:

dism.exe /Image:D:\ /get-packages /format:list

In the output, look at the "Package Identity" field to find the package name of the update that you want to remove,

e.g. "Package_for_KB4058702~31bf3856ad364e35~amd64~~16299.188.1.0"

Then give the following command to remove the problematic update package:

dism /image:D:\ /Remove-Package /PackageName:PackageName
e.g. to remove the "Package_for_KB4025376~31bf3856ad364e35~amd64~~10.0.1.0" package, give the following command:

dism /image:D:\ /Remove-Package /PackageName:Package_for_KB4025376~31bf3856ad364e35~amd64~~10.0.1.0

Excerpt from here: [https://www.repairwin.com/how-to-remove-updates-from-windows-recovery-environment-winre/][1]


[1]: https://www.repairwin.com/how-to-remove-updates-from-windows-recovery-environment-winre/
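If the package list is long, picking out the right identity by eye is error-prone. Here is a minimal Python sketch that filters a saved copy of the /get-packages /format:list output and builds the matching remove command. It assumes each package appears on a line of the form "Package Identity : name"; the function names and the State lines in the sample are made up for illustration, and the two identities are the examples quoted above. Note that on Server 2019 a cumulative update may be listed under a name such as Package_for_RollupFix with the OS build number in it rather than the KB number, in which case the build number is the thing to search for.

```python
def find_package_identities(dism_output, needle):
    """Return the Package Identity values that contain the search text `needle`."""
    matches = []
    for line in dism_output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Package Identity" and needle in value:
            matches.append(value.strip())
    return matches


def remove_package_command(image, identity):
    """Build the dism command line that removes one package from an offline image."""
    return f"dism /image:{image} /Remove-Package /PackageName:{identity}"


# Illustrative listing: the identities are the examples from this post,
# the State lines are invented for the sketch.
SAMPLE_OUTPUT = """\
Package Identity : Package_for_KB4058702~31bf3856ad364e35~amd64~~16299.188.1.0
State : Installed

Package Identity : Package_for_KB4025376~31bf3856ad364e35~amd64~~10.0.1.0
State : Installed
"""

if __name__ == "__main__":
    for identity in find_package_identities(SAMPLE_OUTPUT, "KB4025376"):
        print(remove_package_command("D:\\", identity))
```

The script only prints the command; it still has to be run by hand in the recovery environment, which does not ship with Python.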




1 Vote
Chris-7032 answered Chris-7032 edited

I have had a further update from Microsoft support that they have acknowledged the issue and the development team is looking at it but “it may take a long time to get a released fix”.

I would say that this will go in the large basket of bugs that never get fixed. The best way to deal with it, if at all possible, is to re-attach the volume to a Server 2016 machine, disable dedup and unoptimise the volume, then re-attach the volume to Server 2019, enable dedup and optimise the volume.

Obviously that’s not going to be possible if you have a large amount of data and/or it is a physical server, but I really don’t see Microsoft fixing this given that it is still not even a documented acknowledged issue.


0 Votes
JanKraus-1911 answered JanKraus-1911 commented

We had the same problem on several servers. It seems that the July CU fixed the issue. At least we have had no more BSODs with the July CU installed.
Can anyone confirm this? Is there an official statement from Microsoft?

We are still scared because we had several crashes a day on more than five file servers, affecting several hundred users.


Well, if I take a look at the current version of the dedup.sys file, it says 10.0.17763.2061, which is different from the one Chris-7032 mentioned in the first post for June (10.0.17763.1971). So SOMETHING was definitely changed.

I currently have one volume in a state which lets me test this easily. As soon as I start an unoptimization job on that volume, the active node crashes with a BSOD. And that happens even though I rolled back to the May update. I guess the two weeks with the June version changed something permanently, and now the May version of the dedup driver fails with some operations on some data. Even copying with robocopy caused a BSOD, though a copy job in Windows Explorer did not.

So I will update one node to the July CU today and start an unoptimization job. If it crashes, that still might not mean the issue wasn't fixed. It could just be that the volume is permanently broken.

1 Vote 1 ·

I would be happy about any confirmation.
But if you attach your volume to a Server 2016 machine again, you should still be able to copy your data. So most likely the volume is not permanently broken.

0 Votes 0 ·

I can’t say whether it is fixed or not. I took the route of disabling dedup, unoptimising, then upgrading the cluster nodes to Server 2019 and re-enabling dedup.

I did email Microsoft support to get an update a few days ago and they said they are still working on the fix.

0 Votes 0 ·

We have still had no more crashes with the July CU, so it seems the fix is included there.

0 Votes 0 ·