While working on one of our S2D clusters, we ran into a rather strange issue. During regular maintenance, a colleague reported that the system had trouble enabling maintenance mode: the disks entered maintenance mode, but the process never continued.
The system is a test environment running only two nodes of Windows Server 2022 with a bunch of HDDs, SSDs and NVMe drives on each node. (The first installation was Windows Server 2016, upgraded to 2019, then 2022.) Technically, the solution might also work on Azure Stack HCI, which uses S2D under the hood.
Risk of data loss
This guide describes some steps that involve direct manipulation of storage-related objects. Do *NOT* perform these steps unless you're certain about what you're doing!
TL;DR - Just tell me how to fix it
If you just want to skip the whole story, try the following step for every disk that is stuck.
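A minimal sketch of that step. It assumes the stuck disks report a maintenance-related OperationalStatus and that the Maintenance method takes a boolean EnableMaintenanceMode argument; verify both on your own system before running it. The variable names and the status filter are illustrative.

```powershell
# Sketch only: this clears the maintenance flag WITHOUT validating virtual disks.
# Assumption: stuck disks show "... Maintenance Mode" in OperationalStatus.
$stuck = Get-PhysicalDisk |
    Where-Object { $_.OperationalStatus -match 'Maintenance Mode' }

foreach ($disk in $stuck) {
    # Assumption: the method parameter is named EnableMaintenanceMode (boolean).
    $result = Invoke-CimMethod -InputObject $disk -MethodName Maintenance `
        -Arguments @{ EnableMaintenanceMode = $false }

    # A ReturnValue of 0 conventionally indicates success.
    '{0} ({1}): ReturnValue = {2}' -f $disk.FriendlyName, $disk.SerialNumber, $result.ReturnValue
}
```

Running the filter on its own first (without the loop) shows which disks would be touched.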
What happens here: we invoke the Maintenance method on the CIM class that represents a physical disk (which internally maps the call to WMI methods, if I'm right). The call does not verify higher-level objects such as virtual disks; it more or less "just" changes a flag on the disk (which might be applied through a regular metadata modification of the storage pool).
By running this on every disk in maintenance state, we enable access to the disks again. That allows the storage jobs to resume, which fixes our virtual disks, which in turn makes us happy.
The long way
So, our system has been stuck, just like shown above and on the text version right below.
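For anyone following along, a listing of the disk state can be produced with something like the following. This is a sketch that uses the standard storage cmdlets; the selected columns are just a suggestion.

```powershell
# List all physical disks with their current status, sorted by name.
Get-PhysicalDisk |
    Sort-Object FriendlyName |
    Format-Table FriendlyName, SerialNumber, MediaType, Usage, OperationalStatus, HealthStatus
```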
That's it. No step forward, no step backward. Every operation that tried to modify this state either returned an error or got stuck forever (well, we didn't actually wait "forever", just a few hours :-)).
Things that did NOT work
Finding a solution for this wasn't easy (although, looking back, the solution seems rather simple). We tried common commands like:
Modify S2D properties like disabling caching
Invoke Maintenance method via CIM on MSFT_StorageHealth directly
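For reference, the usual way out of maintenance mode looks roughly like this. It is a sketch of standard cmdlet usage, not a record of the exact commands we ran; the node name NODE01 is a placeholder, and the status filter is an assumption.

```powershell
# Placeholder: name of the cluster node (storage scale unit) in maintenance.
$nodeName = 'NODE01'

# Standard S2D approach: take the whole scale unit out of maintenance mode.
Get-StorageFaultDomain -Type StorageScaleUnit |
    Where-Object FriendlyName -eq $nodeName |
    Disable-StorageMaintenanceMode

# Per-disk variant: pipe the affected physical disks into the same cmdlet.
# Assumption: affected disks show "... Maintenance Mode" in OperationalStatus.
Get-PhysicalDisk |
    Where-Object { $_.OperationalStatus -match 'Maintenance Mode' } |
    Disable-StorageMaintenanceMode
```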
None of these changed the state of the disks; each returned a message similar to this:
So we can assume that the endpoint has an option to omit the VD check (our VDs are in a degraded state, which is expected, but maybe that is what causes the issue here).
In our setup, we are very confident that there is nothing wrong with our VDs. They run just fine on a single copy; we simply cannot disable maintenance mode and let the storage jobs resume.
Well, passing a ValidationFlags argument with a value of 0 (= uint16, no flags set) returns a successful status code on invocation, but doesn't actually change the disks' state. As I haven't been able to find any documentation about valid flags, this path ended here. Nearly.
One thing that did work
Digging around for quite a while led me to one method: Maintenance via MSFT_PhysicalDisk.
So, there's a method available on these objects. Let's check its arguments:
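One way to inspect the class is sketched below. It assumes the storage classes live in the root/Microsoft/Windows/Storage namespace, which is where the Storage module normally exposes them.

```powershell
# Fetch the class definition for physical disks.
$class = Get-CimClass -Namespace root/Microsoft/Windows/Storage -ClassName MSFT_PhysicalDisk

# List every method the class offers.
$class.CimClassMethods | Select-Object Name

# Show the parameters of the Maintenance method, including their types.
# The In/Out qualifiers tell input arguments apart from returned values.
$class.CimClassMethods['Maintenance'].Parameters |
    Select-Object Name, CimType, Qualifiers
```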
Quite interesting: there's also an option to disable maintenance mode, so let's invoke it.
And - yes - finally. Maintenance mode is disabled on this disk. Repeat this step on every disk and the storage jobs will resume.
While the jobs are running, let's check the fault domains:
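A sketch of how both can be checked with the standard cmdlets; the selected columns are only a suggestion.

```powershell
# Progress of the repair/rebalance jobs that resumed after clearing the flag.
Get-StorageJob |
    Select-Object Name, JobState, PercentComplete, BytesProcessed, BytesTotal

# State of the nodes (storage scale units) as fault domains.
Get-StorageFaultDomain -Type StorageScaleUnit |
    Select-Object FriendlyName, OperationalStatus, HealthStatus
```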
After job completion, VDs are healthy again.
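The virtual disk state can be confirmed with a query along these lines (again a sketch using standard cmdlets):

```powershell
# All virtual disks should report OperationalStatus "OK" and HealthStatus "Healthy"
# once the storage jobs have finished.
Get-VirtualDisk |
    Select-Object FriendlyName, ResiliencySettingName, OperationalStatus, HealthStatus
```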
Looking back, the solution sounds rather simple: just fetch the CIM class definition for the type of affected device, check its methods and invoke one of them. It doesn't seem like something that should take hours of searching :-)