OMFG that was painful. And it’s not complete yet.

I am replacing the web server’s hard drive with an SSD, and I had the whole thing figured out: take snapshots of the virtual machines, copy the bulk of the virtual drives to another machine while everything stayed up, then shut the VMs down, copy only the post-snapshot data, and bring them up on the other PC while I swapped the drive. Then repeat the process to move the VMs back to the original server. Two downtimes of a few minutes each.
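
For reference, the rough shape of that plan, sketched with the Server 2012 Hyper-V cmdlets (the VM name, host name and paths below are placeholders, and since the old host is 2008 R2 part of this would actually be done through the GUI there):

    # Checkpoint the VM so the base VHD stops changing, then copy the
    # bulk of the data to the other host while the VM keeps running.
    Checkpoint-VM -Name "WebServer" -SnapshotName "pre-move"
    robocopy D:\VMs\WebServer \\otherhost\d$\VMs\WebServer /E

    # Short downtime: stop the VM and copy only what changed since then.
    Stop-VM -Name "WebServer"
    robocopy D:\VMs\WebServer \\otherhost\d$\VMs\WebServer /E /XO

    # On the receiving host: register the copied VM in place and start it.
    Import-VM -Path "D:\VMs\WebServer\Virtual Machines\<GUID>.xml"
    Start-VM -Name "WebServer"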

But the first problem was that the web server VM image sat on bad block(s) on the physical drive and therefore wouldn’t copy over. Ok, so the new plan was to have the server down an hour or two while I repaired the drive, then just bring it back up on the same machine after copying from the HDD to the SSD. Well, the repair ran several hours. D’oh! The copy went well, though….
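
The repair itself was just the classic whole-disk scan, something along the lines of (drive letter illustrative):

    # Full surface scan: locate bad sectors and recover what's readable.
    # This is the pass that takes hours on a big spinning disk.
    chkdsk D: /R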

The next problem was that, since my old physical server was also my domain controller, I had fits trying to manage the new Core server hosts with the controller offline. There are ways of doing it from the command line, but I’m not up to speed on them yet and wasted a few hours. I needed to create a virtual switch on the new host and import the VMs, and I had also planned to convert the old physical domain controller into a virtual machine, but that would require more command-line-fu I haven’t learned yet.
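
For future me, the command-line route I was fumbling toward looks roughly like this; the host name, adapter name and credentials are placeholders:

    # From the management PC: let WinRM talk to the new host with a local
    # account while the domain controller is down.
    Set-Item WSMan:\localhost\Client\TrustedHosts -Value "newhost" -Force
    $cred = Get-Credential "newhost\Administrator"
    Enter-PSSession -ComputerName newhost -Credential $cred

    # On the Hyper-V Server 2012 Core host: create the external virtual
    # switch, then Import-VM/Start-VM as in the sketch above.
    New-VMSwitch -Name "External" -NetAdapterName "Ethernet" -AllowManagementOS $true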

I finally gave up and decided to bring the physical domain controller back online so I could get the web server back up, and do some other gyrations this weekend to make it work the way I want and finish the upgrade. But the old physical machine wouldn’t boot. That turned out to be due to my reassigning drive letters, and I managed to fix it with some profanity mixed in.

But during the process I had wanted to boot into safe mode and disable Hyper-V on the old physical host, and I couldn’t get in: apparently I used a different password than usual for the local admin account on that PC, and since the machine was in safe mode and was itself the domain controller, domain logins weren’t available. :P So I just booted up normally after the fix and quickly disabled Hyper-V to keep the old DC/host from starting its VMs.
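
All I really needed was to keep the old host from launching its VMs. On a 2008 R2 box without the newer Hyper-V cmdlets, the quick ways are something like:

    # Stop and disable the Hyper-V management service so no VMs get started.
    Stop-Service vmms
    Set-Service vmms -StartupType Disabled

    # Or leave the role installed but keep the hypervisor from loading at boot.
    bcdedit /set hypervisorlaunchtype off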

So now I’ve got my old server setup splattered across three different physical PCs, and at this point I guess I’ll create a new virtual domain controller, and then I’ll be able to take the physical one down and manage the new Hyper-V host on the SSD.
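
Standing up the new virtual DC on the 2012 host should be the easy part; roughly, with made-up names, sizes and domain (and assuming the guest runs Server 2012):

    # Create and start the VM on the SSD host.
    New-VM -Name "DC2" -MemoryStartupBytes 2GB -NewVHDPath "D:\VMs\DC2\DC2.vhdx" -NewVHDSizeBytes 60GB -SwitchName "External"
    Start-VM -Name "DC2"

    # Inside the guest once Windows is installed: add AD DS and promote it
    # as an additional domain controller for the existing domain.
    Install-WindowsFeature AD-Domain-Services -IncludeManagementTools
    Install-ADDSDomainController -DomainName "example.local" -InstallDns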

Lessons learned:

  • WinRM is a royal pain in the ass when the domain controller isn’t available
  • Best practice will be either to have the server management tools on the server itself or to know all the command-line-fu before relying on WinRM
  • Having the source of IPv6 autoconfiguration (the router advertisements) go offline makes things challenging when I’m relying on IPv6. Either assign static IPv6 addresses to all host servers (see the sketch after this list) or have multiple sources advertising the prefix.
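
Assigning the static IPv6 addresses is quick on the 2012 hosts (the address below is a placeholder ULA; on the 2008 R2 box it would be netsh interface ipv6 add address instead):

    # Give each host a fixed IPv6 address so management traffic doesn't
    # depend on router advertisements being available.
    New-NetIPAddress -InterfaceAlias "Ethernet" -IPAddress fd00::10 -PrefixLength 64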

What’s really ironic is that some of my upgrade plans will prevent these kinds of problems in the future. The new 2012 Hyper-V Core server would have logged the bad clusters as they were found and fixed them quickly offline instead of forcing a whole-drive bad cluster search. And I was intending to move the domain controller to a VM so it could move between hosts and stay online when the physical host is down.
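
That’s the Server 2012 disk-repair model: problems get logged online as NTFS runs into them, and the offline pass only touches the logged spots, along the lines of:

    # Online scan: note the problems without taking the volume offline...
    Repair-Volume -DriveLetter D -Scan
    # ...then fix only the logged spots, which takes seconds instead of the
    # hours a full chkdsk /R surface scan needs.
    Repair-Volume -DriveLetter D -SpotFix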

Items of concern: Both my Linux servers crashed after I moved them between hosts. They went offline, dumped error messages to the console and chewed up a bunch of CPU. They are usually quite robust, so I don’t know whether the move triggered a problem or there is undetected corruption on the old drive. Actually, based on when and how they crashed, it seems like the machine or host is the problem. Although, while typing this, I remembered that the old host was Server 2008 R2 and the new one is Hyper-V Server 2012 Core; perhaps there is a driver mismatch between the VM and the host. Yeah, that makes sense, because my media services VM hasn’t had any problems on the new host, so it’s probably the Hyper-V drivers in the Linux guests that need updating.
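
Something to check when I dig into it, from the host side (the VM name is a placeholder):

    # See which integration services the Linux guests are reporting back
    # to the new Hyper-V Server 2012 host.
    Get-VMIntegrationService -VMName "linux1" | Format-Table Name, Enabled, PrimaryStatusDescription

And inside the guests, whether the hv_vmbus/hv_netvsc/hv_storvsc modules are loading cleanly.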

I think overall downtime wound up being 12-18 hours. However, the next downtime should be short since the bad block problem is solved, and the web server can stay online until the new host is fully ready.