Recovering the VCSA on a vSAN cluster

Disclaimer: Credit for the answer goes to John Nicholson (http://thenicholson.com/), a.k.a. lost_signal from the VMware SABU; I merely added some points.

While going through my physical design decisions, I came across a simple question for which I couldn’t find an immediate answer:

How can I restore my vCenter instance (VCSA) if I put it on the very same cluster it is supposed to manage? Can I restore directly onto vSAN via an ESXi host?

As my Google-fu let me down, it was time to start a discussion on Reddit:

vSAN question: Restore VCSA on vSAN from vmware

 

TL;DR: The good news is: yes, you can recover it directly, and with vSAN 6.6 clusters this is straightforward with no prerequisites. Look into the vSAN Multicast Removal guide for the post-processing steps.

As there are other aspects you generally need to consider (not only for vSAN), I decided to summarize some basic points (for vSAN 6.6 and onward clusters):

  • First things first: back up your VCSA on a regular schedule that matches your recovery objectives (see the backup sketch after this list).
    • If you are on vSAN, check whether your selected product supports SPBM: good if it does, bad if it doesn’t
  • Create ephemeral port groups as a recovery option for the VCSA and vSAN port groups
    • This is not vSAN-specific and should generally be considered whenever the vCenter runs on the same vDS it manages
  • Make a backup of your vDS on a regular basis (or at least after changes)
  • Export your storage policies
    • Either for fallback in case you make accidental changes or for reference/auditing purposes
    • You might need them in case you are ever forced to rebuild the vCenter from scratch
  • John pointed out that a backup product with “boot from backup” capability (e.g. Veeam Instant Restore) doesn’t raise the initial question at all, since the VM is started from an additional (NFS) datastore that gets mounted on the host.
    • A point from myself: verify the impact of your NIOC settings if you followed the recommended shares for the vDS in the vSAN guide. The NFS mount uses the management VMkernel interface, which gets a rather low share (note that this only matters if you run into bandwidth congestion anyway).
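
To illustrate the first bullet: VCSA 6.5 and later include a file-based backup that can also be driven through the appliance REST API, which makes scheduling it straightforward. The following Python sketch is only a rough, unofficial example; the host name, credentials and backup target are placeholders, and the request body fields are written from memory of the API documentation, so verify them against your vCenter’s API explorer before using anything like this.

import requests

VCSA = "https://vcsa.lab.local"              # placeholder: your VCSA FQDN
SSO_USER = "administrator@vsphere.local"     # placeholder: SSO credentials
SSO_PASS = "VMware1!"

# 1. Create an API session; the session token comes back in the response body
resp = requests.post(f"{VCSA}/rest/com/vmware/cis/session",
                     auth=(SSO_USER, SSO_PASS), verify=False)
headers = {"vmware-api-session-id": resp.json()["value"]}

# 2. Kick off a file-based backup job to an external target
#    (field names taken from memory of the API docs -- double-check them)
body = {
    "piece": {
        "location_type": "SCP",
        "location": "backuphost.lab.local/vcsa-backups",   # placeholder target
        "location_user": "backup",
        "location_password": "secret",
        "parts": ["common"],        # add "seat" for stats/events/tasks
        "comment": "scheduled VCSA backup",
    }
}
job = requests.post(f"{VCSA}/rest/appliance/recovery/backup/job",
                    json=body, headers=headers, verify=False)
print(job.status_code, job.json())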

I would be more than happy if anyone is willing to contribute to this.

When you are using SPBM but the rest of the world is not (vSAN)

Today I came across an issue I had not immediately thought about when selecting a data protection or replication solution for a vSAN deployment:

Let us say we have a vSAN datastore as the target for replication (a failover target) or for a restore from backup. But what if your data protection or disaster recovery/replication product does not support storage policies?

You might find yourself facing some unexpected problems.

The restore or failover might succeed, but your VM files (including the VMDKs) are subsequently protected with the vSAN default policy. If you did not modify it, this results in FTT=1 and FTM=RAID1 (if you are not familiar with FTT and FTM, search for these terms in conjunction with vSAN).

At first glance, this does not look too bad, does it?

But…
Now what if the source VM was protected with FTT=2 and FTM=RAID6?
The restored VM now has less protection at higher space consumption, and it might not even fit on the datastore anymore, even if both clusters are set up identically or it is in fact the same cluster (in the case of a restore).

Example:

A VM with a 100 GB disk consumes 150 GB on the source vSAN datastore (FTT=2, FTM=RAID6) and can withstand two host failures. On the destination datastore (FTT=1, FTM=RAID1), however, it would consume 200 GB, as that policy creates two full copies and can only mitigate one host failure.
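
If you like to sanity-check such numbers, the space amplification behind the example boils down to a few lines. The sketch below is deliberately simplified and ignores witness components, metadata and slack space:

def vsan_consumed_gb(vmdk_gb, ftt, ftm):
    """Approximate raw capacity a VMDK consumes on vSAN for a given policy."""
    if ftm == "RAID1":                  # mirroring: the data plus one full copy per tolerated failure
        return vmdk_gb * (ftt + 1)
    if ftm == "RAID5" and ftt == 1:     # erasure coding, 3+1 layout
        return vmdk_gb * 4 / 3
    if ftm == "RAID6" and ftt == 2:     # erasure coding, 4+2 layout
        return vmdk_gb * 1.5
    raise ValueError("unsupported FTT/FTM combination")

print(vsan_consumed_gb(100, 2, "RAID6"))   # 150.0 GB at the source
print(vsan_consumed_gb(100, 1, "RAID1"))   # 200.0 GB with the default policy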

Sure, you could modify the default policy to match, but what if your VMs use different settings? The beauty of SPBM lies in the fact that policies can be applied per disk, and re-applying the correct settings by hand in a more complex setup quickly becomes messy and error-prone.

Now, if you ask me for good examples of how to do it:

Veeam shows how to integrate this here.

VMware offers storage policy mapping in SRM.

I/O acceleration at host level – Part II: PrimaryIO appliance deployment

In part 1 I already talked about the basics of I/O acceleration and PrimaryIO as a possible alternative to PernixData FVP. In this (short) post we’ll look at the deployment of the APA appliance.

I recently had the time to download the newest version, GA 2.0, in order to set up a customer Proof of Concept (PoC).

And I failed at the initial deployment.

Out of the box the OVA would throw an error about an unsupported chunk size.

PrimaryIO – chunk size error with vCenter 6.5

Now, I was already sitting in front of a vCenter 6.5 (with ESXi hosts on 6.0), and, as with FVP, this version is currently not supported by APA (I got the info from support that PrimaryIO is targeting April/May 2017 for 6.5 support).

But since this is a PoC/Lab I didn’t give up easily:

A nice VMware KB article describes  the problem at hand and offers a solution.

Since an OVA is essentially just an archive, I used 7zip to extract the files and decided to have a look at the appliance definition file (.OVF).

Lines 5 and 6 contained the virtual disk definitions, including the parameter:

 ovf:chunkSize="7516192768"

Removing it from both lines left me with an edited .OVF without the unsupported attribute.

Gotcha, which is also mentioned in the KB article: you have to delete the .mf file afterwards, or at least update the checksums, since the content was modified and they no longer match.

I skipped the step of re-creating an .OVA file, since the .OVF and .VMDK files can be used directly in the Flex client deployment wizard. The only remaining adjustment was to relieve the .VMDK files of their trailing zeros.
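
If you prefer scripting the edit over doing it by hand, a small Python sketch like the one below achieves the same result: it strips the ovf:chunkSize attribute and rebuilds the .mf checksums. The file names are placeholders for whatever 7zip extracted in your case.

import hashlib
import re

ovf_path = "PrimaryIO_APA.ovf"             # placeholder: extracted OVF descriptor
vmdk_path = "PrimaryIO_APA-disk1.vmdk"     # placeholder: extracted disk file
mf_path = "PrimaryIO_APA.mf"               # placeholder: manifest file

# 1. Remove the unsupported chunkSize attribute from the disk definitions
with open(ovf_path) as f:
    ovf = f.read()
with open(ovf_path, "w") as f:
    f.write(re.sub(r'\s+ovf:chunkSize="\d+"', "", ovf))

# 2. Rebuild the manifest so the checksums match the edited files
#    (alternatively, simply delete the .mf file, as the KB article suggests)
with open(mf_path, "w") as f:
    for name in (ovf_path, vmdk_path):
        with open(name, "rb") as data:
            digest = hashlib.sha1(data.read()).hexdigest()
        f.write(f"SHA1({name})= {digest}\n")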

This left me with these three files:

PrimaryIO GA 2.0 files after some adjustments

 

After that, the deployment worked like a charm, and my next task was to set up networking, since I had opted for “fixed IP” in the OVA deployment wizard. Unfortunately, the OVA does not include a script to set the IP information; however, this step is well documented in the manual.

Essentially, the APA appliance is an Ubuntu VM with “root” login enabled (default password: admin@123), and setting an IP is straightforward.

PrimaryIO – Linux Appliance screen after login

You might want to adjust additional Linux settings, like syslog and NTP, according to your needs.

However, from a security standpoint I am a bit worried.

The appliance is based on Ubuntu 12.04 LTS, which reaches its end of life/support in a few weeks; after that, there will be no more updates.
As you can see, many updates are already missing right after deployment, and I am not sure what the update policy for the appliance is (i.e. can I just use apt-get?).

I will raise these questions with PrimaryIO support and update this article.

Updated info from support:

We do not recommend an apt-get upgrade of the appliance. If you are facing any specific issue – we can help address that. […]
I have a confirmation that the APA 2.5 release scheduled for May 2017 GA will have the latest ubuntu LTS based PIO Appliance.

 

All right, for a few weeks I am OK with an “old version”.

Part 3 will go into the APA configuration.