VCSA 6.5 U1: vAPI status “yellow” and content library not started (possible fix)

As this error has now affected me at multiple customer installations, it is time for a blog post 🙂

After upgrading a site to 6.5 U1 I noticed several issues:

  • The vAPI Endpoint status changed to “yellow”
  • The Content Library service would not start

The only resolution I found in the VMware KB was “restart the services”.

As that didn’t help, I searched further and found VMware KB 2151085, which describes the cause of the error:


The file is deployed with the noreplace option. With this option, the file will no longer be overwritten; instead, the new version is saved with the extension .rpmnew.
and a nice hint:
This is a known issue seen with several upgrade paths to vSphere 6.5 Update 1. Not all upgrade paths are affected; VMware is investigating affected paths and this article will be updated once confirmed.
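The fix from the KB boils down to promoting the .rpmnew files over the stale configs and restarting the affected services. A minimal sketch (the search path and file names below are placeholders; the KB lists the exact files for your build):

```shell
#!/bin/sh
# Promote each .rpmnew file over its stale original, keeping a backup.
# ROOT is a placeholder; KB 2151085 names the exact files for your build.
ROOT="${1:-/etc/vmware-vapi}"
find "$ROOT" -name '*.rpmnew' | while read -r new; do
  orig="${new%.rpmnew}"
  [ -f "$orig" ] && cp "$orig" "$orig.bak"   # back up the old config
  mv "$new" "$orig"                          # activate the new config
done
# Afterwards, restart the services, e.g.:
#   service-control --stop vmware-vapi-endpoint
#   service-control --start vmware-vapi-endpoint
```

Run it against a test directory first; on a production VCSA, take a snapshot before touching the config files.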
I hope this helps someone else as the KB entry isn’t obvious.

Fun with vSAN, Fujitsu and (LSI) Broadcom SAS3008

Update 2017-08-07:

Found a newer version of the file, named P15, containing version 16.00.00 (right…) of the sas3flash utility here

This one works fine:



Today was “one of those days”:

A simple vSAN installation turned into a nightmare of reboots, downloads, google searches and so on.

The problem at hand was a cluster of Fujitsu RX2540M2 vSAN nodes with CP 400i controllers (based on LSI/Broadcom SAS3008 chipset) and vSAN 6.5U1.

For vSAN 6.5 U1 the HCL requires controller firmware version 13.00.00, but Fujitsu delivered the nodes with version 11.00.00.

Easy enough, the plan looked like this:

  1. Download the sas3flash tool for VMware from Broadcom here
  2. Copy the VIB to the host and install it via esxcli
  3. Download and extract the firmware (with 7zip you can open the .ima file) from Fujitsu here
  4. Flash the Controller online with /opt/lsi/bin/sas3flash

Unfortunately, the controller did not show up in sas3flash, no matter what I did.

Then the fun began, I tried:

  • Online patching via lifecycle management: Did not find any updates
  • The Fujitsu ServerView Update DVD: Wouldn’t find any CP400i or any required update
  • The USB stick from Fujitsu with the firmware (see link above): Wasn’t able to find the controller when booting into DOS mode
  • Get into the UEFI shell: that isn’t easy with Fujitsu anymore: read here

In the end, I gave up:

I pulled the controllers from the server, fitted them into a UEFI-capable workstation and flashed them with the Fujitsu USB stick from the UEFI shell. Worked like a charm and took me about 5 minutes. If your USB stick is mounted as fs0, the command looks like this:

sas3flash -f fs0:lx4hbad0.fw -b b83100.rom -b e150000.rom

or you could simply use the force.nsh script 🙂

After refitting the controllers in the server and rebooting, sas3flash on ESXi still reported no controllers.

I am at a total loss as to why this happens. A few weeks ago this worked fine with a similar setup at another customer.

I am curious to know why this didn’t work and how to fix it.

Any updates will be posted here.

Some notes about Oracle Backup with Veeam

Veeam introduced the Oracle backup feature with version 9, and a lot of people are very excited about this, because Oracle backups were one of the last bastions of the “old guard” of enterprise backup products.

I was facing a bit of a problem here, since Oracle DBs are definitely not my turf, but now it is expected that I can handle them in a backup/recovery case.

So, the goal of this post is to establish some basic concepts about Oracle’s DBMS and share some notes on how Veeam Backup behaves when backing up Oracle databases (perhaps this will interest folks who can compare it to RMAN). Feel free to correct me; as stated above, this is not my usual business.



Taken directly from the Veeam documentation center, these are the requirements for enabling log shipping.

  • Veeam Backup & Replication supports archived logs backup and restore for Oracle database version 11 and later. The Oracle database may run on a Microsoft Windows VM or Linux VM.
  • Automatic Storage Management (ASM) is supported for Oracle 11 and later.
  • Oracle Express Databases are supported if running on Microsoft Windows machines only.
  • The database must run in the ARCHIVELOG mode.

So, that is that for version requirements – OK, but what is an archive log?

Oracle Logging

Redo Logs

Coming from MSSQL, I am familiar with transaction logs; in Oracle these are called “redo logs”. Unlike the t-log, you need at least two redo logs (quotes are from the Oracle DB manual):

The redo log of a database consists of two or more redo log files. The database requires a minimum of two files to guarantee that one is always available for writing while the other is being archived.

Of these (at least two, more likely three) only one is written into:

LGWR (log writer) writes to redo log files in a circular fashion. When the current redo log file fills, LGWR begins writing to the next available redo log file. When the last available redo log file is filled, LGWR returns to the first redo log file and writes to it, starting the cycle again.

Notice that all logs will be overwritten at some point, i.e. when the log files fill up. This is more like the SIMPLE recovery model in MSSQL.
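The circular overwrite is easy to picture with a toy shell simulation (this illustrates the concept only, not Oracle internals):

```shell
#!/bin/sh
# Toy model: three "redo log" files written in a circle. The '>' redirection
# truncates on each pass, so once LGWR cycles back, the old records are gone
# unless ARCHIVELOG mode copied them away first.
cd "$(mktemp -d)"
i=1
while [ "$i" -le 4 ]; do
  slot=$(( (i - 1) % 3 + 1 ))
  echo "record $i" > "redo0$slot.log"   # record 4 lands in slot 1 again
  i=$(( i + 1 ))
done
cat redo01.log   # prints "record 4": record 1 has been overwritten
```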

And finally, some nomenclature to distinguish the logs being written to from those currently not in use:

Oracle Database uses only one redo log file at a time to store redo records written from the redo log buffer. The redo log file that LGWR is actively writing to is called the current redo log file.

Redo log files that are required for instance recovery are called active redo log files. Redo log files that are no longer required for instance recovery are called inactive redo log files.

Archived Redo Logs

I stated that we lose the content of the logs at some point; the solution for keeping the logs is ARCHIVELOG mode.

An archive log is an “archived redo log”, and it leads us to the equivalent of the FULL recovery model in MSSQL.

I am borrowing the following explanation from Oracle Terminology for the SQL Server DBA

These are redo log files that have been backed up. There are a number of ways to have Oracle automatically manage creating backups of redo log files that vary from manual to completely automated.

If the disks storing these files fills up, Oracle will not be able to write to the data files – active redo log files can’t be archived any more. To ensure safety, writes are stopped.

The last paragraph is an important one, we will come back to this later on.
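Since a full archive destination stops the database from accepting writes, monitoring that file system is worth a cron job. A hedged sketch (the mount point and threshold are placeholders; use your actual archive log destination):

```shell
#!/bin/sh
# Warn before the archive log destination fills up and the DB halts writes.
# $1 = mount point (placeholder), $2 = warn threshold in percent.
check_arch_usage() {
  used=$(df -P "$1" | awk 'NR==2 { gsub("%", "", $5); print $5 }')
  if [ "$used" -ge "$2" ]; then
    echo "WARN: $1 is ${used}% full"
  fi
}
check_arch_usage / 90   # example call; point this at your archive destination
```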



Veeam with Oracle backup


At this point I will trust that you know how to set up a Veeam job; perhaps I will find time to add the more basic stuff later on.

Most people will use a 24-hour image backup schedule for their VMs. To reduce the RPO you would enable “log shipping” in the application-aware processing tab.

Note 1: Veeam does not use RMAN (Recovery Manager) but rather the Oracle Call Interface (OCI).

Note 2: Veeam will only delete log files with the next image backup (full or incremental), not with the log shipping.

Note 3: Veeam doesn’t delete empty log folders after clearing *.arc files

Note 4: Veeam will issue a (global) log switch as part of the backup. This archives all online redo log files, meaning that you need space on your archive log partition.

So, with note 4 in mind, imagine a situation where your archive partition is quite full and you want a backup to clear it up. The backup might actually lead to more used space before anything can be cleared. Also, you cannot use Veeam to clear an already full partition. Right now, there doesn’t seem to be an option to change this behavior.

Note 5: Behavior of Delete logs older than <N> hours or Delete logs over <N> GB may not be so obvious.

First, this only takes effect when you run a full/incremental backup (see note 2). If you keep the default “delete after 24 hours” and make an image backup once a day, you will always have nearly two days’ worth of logs on disk: the last day’s and the current logs.
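A quick back-of-the-envelope calculation makes the point (assuming one image backup per day and the default 24-hour retention):

```shell
#!/bin/sh
# Worst case: a log written right after a backup survives the retention window
# plus the wait for the next image backup, which is what actually deletes it.
backup_interval_h=24   # one image backup per day
retention_h=24         # "delete logs older than 24 hours"
max_log_age_h=$(( retention_h + backup_interval_h ))
echo "up to ${max_log_age_h}h of logs on disk"   # prints "up to 48h of logs on disk"
```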

Deploying the IBM StorWize / SVC IP quorum

*Update: Thanks to Daniel Huber for making some corrections. He found an error in the systemd script as well as in my Java/JDK notes.

A while ago, in late 2015, IBM announced “IP quorum” support with version 7.6.

I didn’t have the need or opportunity to play around with this, but after a colleague implemented an installation with an IP quorum, I wanted to try it for myself. The storage system part is fairly easy, but they leave it up to you how to implement the Linux part.

A bit of technical background

A quorum device is needed in IBM SVC “stretched” or StorWize “HyperSwap” installations and acts as a “tie breaker” placed at an independent location outside your two main data centers.

The traditional quorum device for StorWize/SVC is an “extended quorum”-qualified Fibre Channel array. Usually either the FC connectivity or the storage array (cost, rack space, heat dissipation) at the third site becomes a problem. So, here comes the IP quorum to save the day.

The IP quorum consists of a simple java application that needs to be executed on a Linux server. However, IBM has narrowed the choice for your deployment considerably:

I don’t want to go into possible reasons why they do this, but if you are in a production environment make sure you fulfill the requirements.

Note that there is a gotcha with using the IP quorum (taken from the IBM documentation center):

Unlike quorum disks, all IP quorum applications must be reconfigured and redeployed to hosts when certain aspects of the system configuration change.

Now this is serious stuff, the “good old” FC quorum is like a fire and forget solution which you don’t need to touch after the setup.

Reading further on, you’ll find some more information about the required network quality. In a typical metro/campus installation this should be of no concern, but make sure you check it:

  • The maximum round-trip delay must not exceed 80 milliseconds (ms), which means 40 ms each direction.

  • A minimum bandwidth of 2 megabytes per second is guaranteed for node-to-quorum traffic.

If you are worried about the availability:

The maximum number of IP quorum applications that can be deployed is five.

Be grateful for this. You want to patch your Linux on a regular basis, don’t you? And a 1U rack server or workstation does not offer the same level of redundancy a storage system does. On the downside, if you have more than one installation you’ll have to redeploy the quorum to multiple locations after a major configuration change (see above).

Personal comment time:

Using an IP quorum looks like a nice thing: no storage array sitting idle at a third location, lower costs, and so on. However, make sure you know the implications for your storage environment. This is a kind of paradigm shift, since you give a key part of your availability solution out of your hands and rely on network and servers. Organisation-wise this might mean working with different teams, so change management becomes crucial.

Make sure you know how the network behaves in a site disaster so that the quorum stays available: the one time your IP quorum is really needed, it must be reachable. Test this on a regular basis. Make sure the server team doesn’t patch all your quorum servers at once, and test them after updating.

Time for some deployment:

For my tests I am using an IBM StorWize V7000 generation 1.

Since all models (including SVC) share the same code base, this howto should be valid for all platforms.

As you can see I am using the latest and greatest code level, version 7.8.

Creating the quorum application

The latest version gives me some advantages, including the fact that you can deploy and monitor the IP quorum from the UI:

Since my installation uses IPv4 only (shame on me), the option for the IPv6 download is greyed out. And yes, I am using the “superuser” account for this demonstration (I guess that’s the second mark).

As you can see in the screenshot you can do this in the command line by issuing

  • mkquorumapp 

The application jar file is created in the /dumps directory; since you are already using the CLI, I assume you know how to retrieve it from there with scp.

On the UI your browser will offer you a download, save it and transfer it to your quorum server.

I am using a VM based on RHEL 7.3. For a real-life deployment this might be an option if you have an ESXi host at your third location; otherwise a simple 1U rack server or even some kind of workstation might be a good choice.

Deploying the quorum application

At this point you should have created the IP quorum application and installed a server with a supported Linux OS. At the end you’ll find a short summary of all commands, but here is a walk-through to understand the process.

I am not going to run an application as root if I do not have to, so creating a service user is a good idea. Mine is named “ip-quorum”.

My quorum application will be deployed into /usr/local/bin/ip_quorum and therefore I need to create the directory and change permissions accordingly:

Time to install Java, but not any Java – you’ll need IBM Java.

As Daniel pointed out, I mentioned IBM Java but drifted off and used OpenJDK. Sorry for the error. As you can see it works, but it will not be officially supported by IBM. Get your IBM Java here.

As I haven’t had time to update the post with new instructions, please be aware that the next two steps describe OpenJDK instead of IBM Java:

Since I do not have a working Satellite installation, I’ll get mine straight from the online repositories. Have a look at the Red Hat KB for instructions regarding Satellite or RHEL 6.

For SLES and RHEL you’ll need a working subscription if you are going to do it this way.

After the installation you might want to check the status

and start the application once by running

java -jar /usr/local/bin/ip_quorum/ip-quorum.jar

As you can see, the quorum application initiates a connection to your storage array, not the other way around as a normal service would. The target port on the StorWize is 1260/tcp if you need to implement a firewall rule.
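Because the connection is initiated from the quorum server, you can verify reachability from there before setting up the service. A sketch using bash’s /dev/tcp (the node address below is a placeholder):

```shell
#!/bin/bash
# Check whether a StorWize/SVC node answers on the IP quorum port (1260/tcp).
# The connection direction matters: run this FROM the quorum server.
check_quorum_port() {   # $1 = node service IP or hostname (placeholder)
  if timeout 3 bash -c "exec 3<>/dev/tcp/$1/1260" 2>/dev/null; then
    echo "open"
  else
    echo "closed"
  fi
}
check_quorum_port 192.0.2.10   # replace with your node's service IP
```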

Go back to your storage system to check the status:

Looks good: we made sure our quorum can reach the storage system on the network and application layers. Now terminate the application with CTRL-C.

Setting up a service for the quorum application

Since we want our quorum to start and stop with the server, we need some kind of autostart implementation. There are multiple ways to achieve this; I chose a systemd service definition which I call “ip-quorum”.

Create the file in the system folder of systemd

and add your content:

[Unit]
Description=IBM Storwize IP Quorum Service
After=network.target

[Service]
# Note: "Type" with a capital T (typo corrected by Daniel).
Type=simple
User=ip-quorum
ExecStart=/usr/bin/java -jar /usr/local/bin/ip_quorum/ip_quorum.jar

[Install]
WantedBy=multi-user.target


Make sure you tell systemd that unit files have changed

systemctl daemon-reload

Enable and start the service, do a quick check with “netstat” and on the storage system UI to see if the stuff is running


Here are the required commands. Please make sure to check them, fill in the needed values, and do not paste them blindly into your system:

# creating the user and files

useradd -r ip-quorum
mkdir -p /usr/local/bin/ip_quorum
cp (SOURCE)/ip_quorum.jar /usr/local/bin/ip_quorum
chown -R ip-quorum:ip-quorum /usr/local/bin/ip_quorum
cd /usr/local/bin/ip_quorum
chmod 774 ip_quorum.jar

# install java
subscription-manager repos --enable rhel-7-server-supplementary-rpms
yum install -y java-1.8.0-openjdk-headless

# create service definition
cat <<'EOF' > /lib/systemd/system/ip-quorum.service
[Unit]
Description=IBM Storwize IP Quorum Service
After=network.target

[Service]
Type=simple
User=ip-quorum
ExecStart=/usr/bin/java -jar /usr/local/bin/ip_quorum/ip_quorum.jar

[Install]
WantedBy=multi-user.target
EOF

# start and enable the service 
systemctl daemon-reload
systemctl enable ip-quorum
systemctl start ip-quorum

I/O acceleration at host level – Part II: PrimaryIO appliance deployment

In part 1 I talked about the basics of I/O acceleration and PrimaryIO as a possible alternative to PernixData FVP. In this (short) post we’ll look at the deployment of the APA appliance.

I recently had the time to download the newest version, GA 2.0, in order to set up a customer Proof of Concept (PoC).

And I failed at the initial deployment.

Out of the box the OVA would throw an error about an unsupported chunk size.

PrimaryIO – chunk size error with vCenter 6.5

Now I was sitting in front of a vCenter 6.5 (with ESXi hosts on 6.0), and as with FVP this is currently not supported for APA (support told me that PrimaryIO targets April/May 2017 for 6.5 support).

But since this is a PoC/Lab I didn’t give up easily:

A nice VMware KB article describes  the problem at hand and offers a solution.

Since an OVA is essentially a compressed archive, I used 7zip to extract the files and had a look at the appliance definition file (.OVF).

Lines 5 and 6 contained the virtual disk definitions and the offending parameter:


Removing it, the .OVF looked like this after editing:

Gotcha, which is also mentioned in the KB article: you have to delete the .mf file afterwards, or at least update the checksums, since the content was modified and they no longer match.

I skipped the step of re-creating an .OVA file, since we can use the .OVF and .VMDK files directly in the FlexClient deployment wizard. The only remaining adjustment was to relieve the .VMDK files of their trailing zeros.
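The OVF edit and the manifest cleanup can be scripted. Note that the attribute name below is my assumption based on the error message; check the Disk elements in your own .OVF before running anything like this:

```shell
#!/bin/sh
# Strip the chunk-size attribute from the OVF disk definitions and drop the
# manifest so the now-stale checksums cannot fail validation. The file name
# and the "ovf:chunkSize" attribute are assumptions; verify against your OVF.
OVF="${1:-appliance.ovf}"
sed -i.orig 's/ ovf:chunkSize="[^"]*"//g' "$OVF"   # keeps a .orig backup
rm -f "${OVF%.ovf}.mf"                             # or regenerate the checksums
```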

This left me with these three files:

PrimaryIO GA 2.0 files after some adjustments


After that, the deployment worked like a charm. My next task was to set up networking, since I opted for “fixed IP” in the OVA deployment wizard. Unfortunately the appliance does not include a script to set the IP information; however, this step is well documented in the manual.

Essentially the APA appliance is an Ubuntu system with “root” login enabled (default password: admin@123), and setting an IP is straightforward.

PrimaryIO – Linux Appliance screen after login

You might adjust additional Linux settings, like syslog and NTP, according to your needs.

However, from a security standpoint I am a bit worried.

The appliance is based on Ubuntu 12.04 LTS, which reaches end of life/support in a few weeks; after that there are no more updates.
As you can see, there are initially many updates missing after deployment. I am not sure what the update policy for the appliance is (i.e. can I just use apt-get?).

I will raise these questions with PrimaryIO support and update this article.

Updated info from support:

We do not recommend an apt-get upgrade of the appliance. If you are facing any specific issue – we can help address that. […]
I have a confirmation that the APA 2.5 release scheduled for May 2017 GA will have the latest ubuntu LTS based PIO Appliance.


All right, for a few weeks I am OK with an “old version”.

Part 3 will go into the APA configuration.

Adding a second syslog server to a VCSA 6.5 (Appliance MUI)

Beware: This is probably not supported

I was asked if I could add another syslog server to an existing VCSA deployment. With the nice blog post from William Lam in mind, adding the second server should be easy. Just edit the configuration and there you are.

The UI won’t allow this.

I guess it is CLI time then, luckily the blog post mentions this:

A minor change, but syslog-ng is no longer being used within the VCSA and has been replaced by rsyslog.

So we are just looking at a matter of finding the right config file.

In the main file

  • /etc/rsyslog.conf

you can find an “include” statement pointing towards the file

  • /etc/vmware-syslog/syslog.conf

The only content is our first syslog server, configured as a remote syslog target.

At this point adding the second server is not a big deal, the file now looks like this:

Remember to reload the syslog service afterwards.
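For reference, a sketch of what the edit looks like (the hostnames are placeholders; in rsyslog’s classic syntax `@@` means TCP and a single `@` means UDP):

```shell
# Rewrite /etc/vmware-syslog/syslog.conf to forward everything to two targets.
cat <<'EOF' > /etc/vmware-syslog/syslog.conf
*.* @@syslog1.example.com:514
*.* @@syslog2.example.com:514
EOF
systemctl restart rsyslog   # pick up the new config
```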


Another gotcha (besides the lack of support):

Changing the settings via the VAMI/UI will overwrite your modifications.

I/O acceleration at host level – Part I: Overview & PrimaryIO

In my last two posts I have been rambling about certifications, but it is time to put something useful on this page.

When PernixData released 2.x of FVP they really got the attention of the vCommunity. Lots of posts and happy people all around about two years ago.

For those of you who are not familiar with the topic:
The idea of FVP and similar products is to take the I/O handling as close to the source as possible, which in the case of VMs is the ESXi host. You can use memory or flash storage devices as a read or (mirrored) write cache for your most demanding VMs. (Cached) requests are instantly answered and load is taken off your storage system. This is especially useful if you have an older model with very little or no flash and/or experience performance problems and want to protect your investment (read: no chance to upgrade hardware any time soon). There are of course other use cases, but I’ll try to keep this simple.

But after Nutanix acquired PernixData they essentially buried FVP as a product.
However, Nutanix fulfills the support for customers with an existing contract, but by now you should have a “plan B” if you need I/O acceleration at host level.
FVP lacks vSphere 6.5 support, and I find it hard to believe that great efforts are being made to deliver it anytime soon. (Side note: More than FVP, I’ll be missing PernixData Architect, which delivers quite a good view into the storage layer and is very easy to handle.) *UPDATE: I am very sorry, I didn’t give the folks at Pernix/Nutanix the credit they deserve; an update is targeted for 04/17 according to support.

So, let’s have a look at a possible alternative:

Last year PrimaryIO offered PernixData FVP customers their Application Performance Accelerator (APA) for VMware at no charge except support costs.

APA uses vSphere APIs for I/O Filtering (VAIO) for their implementation, which  is quite nice in my opinion since this is standardized within the vSphere environment. Here I take the liberty to quote directly from VMware:

VAIO is a framework that enables third parties (partners) to develop filters that run in ESXi and can intercept any I/O requests from a guest operating system to a virtual disk. An I/O will not be issued or committed to disk without being processed by I/O Filters created by 3rd parties (source)

Right now there are two supported use cases (caching and replication) and according to VMware

caching will significantly increase the IOPS available, reduce latency, and increase hardware utilization rates (quote from the source from above)

If someone is interested in an overview of currently supported VAIO solutions, you may find it here.

What does APA offer?
According to their technical brief, they do not cache randomly or just the most frequently accessed blocks; they do this application-aware (hence the name, I guess):

Only the most important application components such as frequently accessed tables or indexes that speed up queries are optimized, while less critical elements such as log records, replicas, audit entries, or ad hoc user activity are de-prioritized.

I’m not sure how to verify this claim in practice, so comments are welcome, and if I find a way I’ll let you know 🙂

Like FVP, APA offers the possibility to mirror the write cache across hosts, which is a must if you take this into production. I’ll need to check whether they support fault domains (i.e. two data centers).

For more details on how it works, have a look at the VMware blogs, where Murali Nagaraj, CTO of PrimaryIO, posted about this.


Thank you for your attention; in part 2 I will continue with the APA appliance deployment.


Note: My posts on this topic are in no way sponsored by PrimaryIO.

Personal blog, remember? 🙂

Adding some thoughts on the value of VMware certifications

This morning I posted about how I feel about the VCDX price increase. TL;DR: I can understand the reasons behind it, but VMware has to deliver value for the money.

Having said that, there is a bigger issue in the room in my opinion.

Essentially this tweet from Jason Nash triggered this post:

For what it is worth, I think that with VMware, the partner tier does not say much about technical skills and qualification.

As you can read here, the requirement for the highest level, Premier Partner, is essentially revenue-driven. Sure, you need four VCPs, but when you are big enough for a million in sales within 12 months, sending four people to an ICM course is peanuts.

Do not get me wrong, the VCP has its place, but from the higher partner levels I would expect more to verify the expertise. VCP is a multiple-choice exam; VCAP Deployment is hands-on (you cannot braindump that), and Design requires you to draw and place something (again, no braindumps here).

So why would or should a partner spend any money to certify their employees at the “VMware Certified Advanced Professional” level or above?

The answer is: I do not know and I cannot see a business case for this at the moment.

Back on Twitter, Joe Silvagi from VMware pointed out that there are business benefits for a partner:


Nevertheless, here I try to see it from a potential customer’s point of view.

You cannot pick an enterprise/premier partner and know that they have at least n VCIXs or even a VCDX in a certain field (solution competency) to guarantee a certain amount of knowledge.

This would really count for something BUT…

… VMware needs to promote their advanced certifications so these get the attention and value they deserve.

I had to explain to many people, customers and coworkers alike, what my VCAP or VCIX actually means, and even what the next level, VCDX, would be. For a VCDX attendee this is the worst case: you put effort into your certification in order to get benefits (from a pay increase to a new job), but if no one knows what the title is, you have a problem, because it lowers your ROI.

Compare this to Cisco: if you say “CCIE”, everyone has an idea what you are talking about and goes “ahhh” and “ohhh”.

This might be different in the US, but this is my unfortunate experience here in Germany.