Recovering the VCSA on a vSAN cluster

Disclaimer: Credit for the answer goes to John Nicholson (http://thenicholson.com/), a.k.a. lost_signal from the VMware SABU; I merely added some points.

As I am going through my physical design decisions, I came across a simple question for which I couldn’t find an immediate answer:

How can I restore my vCenter instance (VCSA) if I put it on the very same cluster it is supposed to manage? Can I restore directly onto vSAN via an ESXi host?

As my google-Fu let me down, it was time to start a discussion on reddit:

vSAN question: Restore VCSA on vSAN from vmware

 

TL;DR: The good news is: yes, you can recover it directly, and with vSAN 6.6 clusters this is straightforward with no prerequisites. Look into the vSAN Multicast Removal guide for the post-processing steps.
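For reference, a minimal sketch of what the post-restore check looks like from any ESXi host (assuming SSH access; the detailed post-processing steps are in the guide mentioned above):

# Verify the host sees the vSAN cluster and that unicast mode is enabled (6.6 and later)
esxcli vsan cluster get

# List the unicast agent entries; after a vCenter restore the other nodes'
# UUIDs and vSAN VMkernel IPs should still show up here
esxcli vsan cluster unicastagent list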

As there are other aspects you generally need to consider (not only for vSAN), I decided to summarize some basic points (for vSAN 6.6 and onward clusters):

  • First things first: back up your VCSA on a regular schedule that matches your recovery objectives
    • If you are on vSAN, check whether your selected backup product supports SPBM: good if you have support, bad if you don’t
  • Create ephemeral port groups as a recovery option for the VCSA and vSAN port groups (see the sketch after this list)
    • This is not vSAN-specific but should generally be considered whenever the vCenter sits on the same vDS it manages
  • Make a backup of your vDS on a regular basis (or at least after changes)
  • Export your storage policies
    • Either for fallback in case you make accidental changes or for reference/auditing purposes
    • You might need them in case you are ever forced to rebuild the vCenter from scratch
  • John pointed out that a backup product with “boot from backup” capability (e.g. Veeam Instant Recovery) sidesteps the initial question entirely, as the VM is booted from an additional (NFS) datastore that gets mounted on the host
    • A point from myself: verify the impact of your NIOC settings if you followed the recommended shares from the vSAN guide for the vDS. The NFS mount uses the management VMkernel interface, which gets a rather small share (note: this only matters if you actually have bandwidth congestion)
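As a sketch for the recovery networking point above: if vCenter (and with it the vDS management path) is unavailable, you can also stage a temporary standard vSwitch port group directly on a host via esxcli and attach the restored VCSA to it. vmnic1 and VLAN 10 are placeholders; pick an unused uplink and the correct VLAN for your environment:

# Temporary standard vSwitch with a free uplink (placeholder: vmnic1)
esxcli network vswitch standard add --vswitch-name=vSwitch-Recovery
esxcli network vswitch standard uplink add --vswitch-name=vSwitch-Recovery --uplink-name=vmnic1

# Port group for the restored VCSA (placeholder: VLAN 10)
esxcli network vswitch standard portgroup add --vswitch-name=vSwitch-Recovery --portgroup-name=VCSA-Recovery
esxcli network vswitch standard portgroup set --portgroup-name=VCSA-Recovery --vlan-id=10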

I would be more than happy if anyone is willing to contribute to this.

When you are using SPBM but the rest of the world is not (vSAN)

Today I came across an issue I did not immediately think about when selecting a data protection or replication solution for a vSAN deployment:

Let us say we have a vSAN datastore as the target for replication (failover target) or for a restore from backup. But what if your data protection or disaster recovery/replication product does not support storage policies?

You might find yourself facing some unexpected problems.

The restore or failover might succeed, but your VM files (including VMDKs) end up protected with the vSAN default policy. If you did not modify it, this results in FTT=1 and FTM=RAID1 (if you are not familiar with FTT and FTM, search for these terms in conjunction with vSAN).
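If you want to check what those defaults look like from the host side, a quick sketch (run on any vSAN node; the datastore default storage policy itself is managed in vCenter):

# Per-host default vSAN policy used for objects created without an explicit SPBM policy
esxcli vsan policy getdefault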

At first glance, this does not look too bad, does it?

But…
Now what if the source VM was protected with FTT=2 and FTM=RAID6?
The restored VM now has less protection at a higher space consumption, and it might not even fit on the destination datastore, even if the clusters are set up identically or it is the same cluster (in case of a restore).

Example:

A VM with a 100 GB disk consumes 150 GB on the source vSAN datastore (FTT=2 and FTM=RAID6: the 4+2 erasure coding adds 50% overhead, so 1.5 × 100 GB) and can withstand two host failures. On the destination datastore (FTT=1 and FTM=RAID1) it would consume 200 GB (2 × 100 GB), as the latter creates two full copies, and only one host failure can be tolerated.

Sure, you could modify the default policy to match, but what if you have different settings per VM? The beauty of SPBM lies in the fact that you can apply policies per disk, and re-applying the policy settings by hand in a more complex setup quickly becomes messy and error prone.

Now if you ask me for good examples of how to do it properly:

Veeam shows how to integrate this here.

VMware offers storage policy mappings in SRM.

Fun with vSAN, Fujitsu and (LSI) Broadcom SAS3008

Update 2017-08-07:

Found a newer version of the package, named P15, containing version 16.00.00 (right…) of the sas3flash utility here

This one works fine.

——-

Today was “one of those days”:

A simple vSAN installation turned into a nightmare of reboots, downloads, google searches and so on.

The problem at hand was a cluster of Fujitsu RX2540M2 vSAN nodes with CP 400i controllers (based on LSI/Broadcom SAS3008 chipset) and vSAN 6.5U1.

For vSAN 6.5 U1 the HCL requires the controller firmware version 13.00.00, but Fujitsu delivered it with version 11.00.00.

Easy enough, the plan looked like this:

  1. Download the sas3flash tool for VMware from Broadcom here
  2. Copy the VIB to the host and install it via esxcli
  3. Download and extract the firmware (with 7zip you can open the .ima file) from Fujitsu here
  4. Flash the controller online with /opt/lsi/bin/sas3flash (see the sketch after this list)
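A sketch of steps 2 and 4, assuming the VIB was copied to /tmp on the host (the file name is a placeholder, take it from the Broadcom download):

# Install the sas3flash VIB (path and file name are placeholders)
esxcli software vib install -v /tmp/vmware-esx-sas3flash.vib

# List the controllers the tool can see; this is exactly where the CP 400i never showed up
/opt/lsi/bin/sas3flash -listall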

Unfortunately, the controller did not show up in sas3flash, no matter what I did.

Then the fun began. I tried:

  • Online patching via lifecycle management: Did not find any updates
  • The Fujitsu ServerView Update DVD: Wouldn’t find any CP400i or any required update
  • The USB stick from Fujitsu with the firmware (see link above): Wasn’t able to find the controller when booting into DOS mode
  • Getting into the UEFI shell: that is not so easy with Fujitsu anymore, read here

In the end, I gave up:

I pulled the controllers from the server, fitted them into a UEFI-capable workstation and flashed them with the Fujitsu USB stick from the UEFI shell. Worked like a charm and took me like 5 minutes. If your USB stick is mounted on fs0, the command looks like this:

sas3flash -f fs0:lx4hbad0.fw -b b83100.rom -b e150000.rom
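(For reference: with sas3flash, -f flashes the controller firmware image and -b the legacy BIOS / UEFI boot ROM images; the .fw and .rom file names come straight from the extracted Fujitsu package, so they will differ for other firmware versions.)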

or you could simply use the force.nsh script 🙂

After refitting the controllers in the server and rebooting, sas3flash on ESXi still reports no controllers.
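To at least narrow it down, these host-side checks show whether ESXi itself sees the SAS3008 and which driver claims it (a sketch; the grep patterns may need adjusting, as the vendor shows up as LSI, Avago or Broadcom depending on the release):

# Is the controller visible on the PCI bus, and which driver is bound to it?
esxcli hardware pci list | grep -i -A 12 -e LSI -e Avago -e Broadcom
vmkchdev -l | grep vmhba

# Does it show up as a storage adapter?
esxcli storage core adapter list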

I am at a total loss as to why this happens. A few weeks ago this worked totally fine with a similar setup at another customer’s site.

I am curious to know why this didn’t work and how to fix it.

Any updates will be posted here.