We ran into this bit of fun while setting up a NIS domain for testing in the lab today:
rob@rob-kubuntu3:~$ ypcat -d nisdom -h rhel5-64-2 passwd.byname
No such map passwd.byname. Reason: No such map in server's domain

It turns out this was a problem with the /var/yp/securenets file, but I’m still not sure what is wrong. The man page for ypserv shows:

A sample securenets file might look like this:

# allow connections from local host -- necessary
host 127.0.0.1
# same as 255.255.255.255 127.0.0.1
#
# allow connections from any host
# on the 131.234.223.0 network
255.255.255.0 131.234.223.0

So we set up our securenets to look like this:

host 127.0.0.1
255.255.255.0 10.10.10.0

And tried to connect to the server:
rob@rob-kubuntu3:~$ ip addr show dev wlan0 |grep "inet "
inet 10.10.10.210/24 brd 10.10.10.255 scope global wlan0
rob@rob-kubuntu3:~$ ypcat -d nisdom -h rhel5-64-2 passwd.byname
No such map passwd.byname. Reason: No such map in server's domain
rob@rob-kubuntu3:~$ ping -c1 rhel5-64-2
PING rhel5-64-2 (10.10.10.213) 56(84) bytes of data.
64 bytes from rhel5-64-2 (10.10.10.213): icmp_req=1 ttl=64 time=0.823 ms

--- rhel5-64-2 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.823/0.823/0.823/0.000 ms

Removing the /var/yp/securenets file allowed us access, so as best I can determine it wasn't a firewall, RPC, or portmap issue. Adding "host 10.10.10.210" also worked and allowed the client access. So what's wrong with the format, or with the man page?
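For what it's worth, the netmask line should have matched: the documented rule is that a securenets entry matches when (client address AND netmask) equals the network address. Here's a quick sanity check of that arithmetic; this is only a sketch of the documented rule, not ypserv's actual code:

```shell
# Sanity check of the documented securenets rule: an entry matches when
# (client address AND netmask) equals the network address. This is just
# a sketch of the arithmetic, not ypserv's actual code.
ip_to_int() {
    # Convert a dotted quad to a single integer.
    set -- `echo "$1" | tr '.' ' '`
    echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

CLIENT=`ip_to_int 10.10.10.210`
MASK=`ip_to_int 255.255.255.0`
NET=`ip_to_int 10.10.10.0`

if [ $(( $CLIENT & $MASK )) -eq "$NET" ]; then
    echo "10.10.10.210 matches 255.255.255.0 10.10.10.0"
else
    echo "no match"
fi
```

By that rule, the client at 10.10.10.210 should have matched the "255.255.255.0 10.10.10.0" line, which is exactly why the behavior was so puzzling.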

I upgraded the TNS lab this past week from Windows 2008 to Windows 2008 R2, including replacing the 4 Domain Controllers (rather than upgrading them in place). It gave me a chance to review the procedure for moving a Certificate Server to a new system, which I hadn't done since 2005. For those who haven't tried, the procedure for moving a Certificate Server is reasonably well documented at the Microsoft Support site here: http://support.microsoft.com/kb/555012. The trickiest part of this, at least in our lab, is the renaming of the DC.

In our lab we have an empty forest root, as per the old (Windows 2000-era) Microsoft recommendations, to match several large customer environments. Because it’s a lab, and no clients connect to it, we only have a single DC. I snapshotted it as a backup, and went through the procedure to rename a domain controller, also well documented by Microsoft, this time at TechNet.

For review, the procedure we planned to run was:
netdom computername dc04 /add:dc01.lwtest.corp
netdom computername dc04 /makeprimary:dc01.lwtest.corp
shutdown -r -t 0
netdom computername dc01 /enum
netdom computername dc01 /verify
netdom computername dc01 /rem:dc04.lwtest.corp

I’m still not sure what caused it, but in this case, this command failed:
netdom computername dc04 /makeprimary:dc01.tns.lab
At this point, I couldn’t make the old name primary again (I would get an “Access Denied” error), so I rebooted to see which name had taken. And that’s where things went bad.

When the DC came up, we were getting this error: Source: NETLOGON, EventID: 5602, Data: "An internal error occurred while accessing the computer's local or network security database."

Because the DC rename hadn't completed successfully, the computer couldn't actually log into itself to load AD. Very bad for the root of the forest. I wasn't able to find anything helpful in my searches, so I thought I'd share the fix:

Name it back to the old name and try again:
Reboot into Safe Mode.
netdom computername localhost /makeprimary:dc04.lwtest.corp
shutdown -r -t 0

Boot normally
netdom computername localhost /makeprimary:dc04.lwtest.corp
netdom computername dc01 /enum
netdom computername dc01 /verify
shutdown -r -t 0

After *that* reboot, use the verify command to make sure the old name took and that you can log in, then just try the rename again.

I couldn’t get the “rename back” to take until after the attempt in Safe Mode. Strange, but it’s working great now! Hopefully this will help someone.

I had a Bourne Shell (sh) script I needed to capture the exit status of, but it was being run through “tee” to capture a log file, so “$?” always returned the exit status of “tee”, not the script. In a nutshell, it went something like this:
#!/bin/sh
DO_LOG=$1
LOGNAME="`hostname`.out"
if [ "$DO_LOG" -eq "1" ]; then
# Logging is turned on, so relaunch ourself with logging disabled, and tee the output to the logfile
sh $0 0 | tee $LOGNAME
exit $?
fi
#... Do lots of things in the script
exit $ERRORCODE

Now, the important thing here is that the script sets very specific error codes (we have 16 defined) based on different error states, so that a tool like HP Opsware can give us different reports based on the exit status. When run with “0” for no logging, this works great, but it requires the controlling tool to capture logs, and not all do (especially cheap “for” loops in a shell script).

But when run with logging enabled, all of the fancy error code handling (45 lines of subroutines’ worth) gets lost, because “$?” is equal to the status code of the “tee” command. Bash scripters out there will say “but what about $PIPESTATUS?” If we could use bash, the code would be:
#!/bin/sh
DO_LOG=$1
LOGNAME="`hostname`.out"
if [ "$DO_LOG" -eq "1" ]; then
# Logging is turned on, so relaunch ourself with logging disabled, and tee the output to the logfile
sh $0 0 | tee $LOGNAME
exit ${PIPESTATUS[0]}
fi
#... Do lots of things in the script
exit $ERRORCODE

(Note the single line change in the conditional exit.)
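To make the difference concrete, here's a tiny demonstration (assuming bash is available) of "$?" versus PIPESTATUS after a pipeline whose first command fails:

```shell
# After 'false | tee', $? is tee's status (0), while bash's
# PIPESTATUS[0] holds the status of 'false' (1). Assumes bash.
bash -c 'false | tee /dev/null; echo "via \$?: $?"'
bash -c 'false | tee /dev/null; echo "via PIPESTATUS: ${PIPESTATUS[0]}"'
```

The first line prints 0 (tee's exit status); the second prints 1 (the exit status of false), which is the value we actually care about.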

But I don’t have the luxury of bash (thanks, AIX, FreeBSD, and Solaris 8), so we needed to get fancy…
#!/bin/sh
DO_LOG=$1
LOGNAME="`hostname`.out"
if [ "$DO_LOG" -eq "1" ]; then
# Logging is turned on, so relaunch ourself with logging disabled, and tee the output to the logfile
cp /dev/null $LOGNAME
tail -f $LOGNAME &
TAILPID=$!
sh $0 0 >> $LOGNAME 2>&1
RETURNCODE=$?
kill $TAILPID
exit $RETURNCODE
fi
#... Do lots of things in the script
exit $ERRORCODE

In this last example, we’re creating the empty logfile by copying /dev/null to the logname, then starting a backgrounded “tail” command on the empty file. Because we haven’t disconnected STDOUT in the backgrounding, we will still get the screen output we desire from “tail”. The script now only writes *its* output, with redirected STDOUT and STDERR, to the log file, which is already being tailed to the actual screen. At the end of the script, we capture the true exit code, clean up the tail ugliness, and exit with the desired status code.

This does have a serious downside: if the script encounters an error and exits, the “tail” is left running indefinitely on Linux and Solaris, since the kernel there simply reparents the orphaned process to init. So, if you take this approach, be very careful to catch every error you may possibly encounter. Or just use a better scripting tool. 🙂
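One way to reduce that risk is to register the cleanup in a trap before anything can fail, so even an early exit reaps the tail. A sketch follows; the logfile name and the Bourne-style signal numbers 0/2/15 (exit/INT/TERM) are illustrative, not from the original script:

```shell
# Kill the backgrounded tail from a trap, so an early or error exit
# still cleans up. A trap on "0" (exit) works even in old Bourne shells.
LOGNAME=trap-demo.out        # illustrative name, not the original's
cp /dev/null $LOGNAME
tail -f $LOGNAME &
TAILPID=$!
trap 'kill $TAILPID 2>/dev/null; rm -f $LOGNAME' 0 2 15

echo "work output" >> $LOGNAME
sleep 2                      # give tail a moment to echo the line
exit 0                       # the trap fires here and reaps the tail
```

The same trap line dropped into the script above (right after TAILPID is captured) would cover most unexpected exits, though a SIGKILL will still orphan the tail.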

I have recently pushed the main ESX host for TNS to 70% overcommit on RAM, since upgrading to 4.1. Interestingly (if expectedly), the performance now is the same as it was on 3.5 with 2 fewer VMs and only 50% overcommit. But it’s still pretty poor in the “Lab” performance pool, even after changing that pool from “low” to “normal” shares. So we finally ordered new memory, doubling the server to 16 GB. It goes in Sunday night, so we’ll see how things perform next week when Rob’s on site with customers.

I recently upgraded the totalnetsolutions.net internal network from ESX 3.5 to ESXi 4.1. The ESX Host upgrade itself is simple, and not worth mentioning. When complete, however, you have an option to upgrade the Guest OS Virtual Hardware from v4 to v7. Support for USB devices, thin-provisioned disks, and supposed speed improvements come with the upgrade.

The process should always be:

1. Upgrade VMware Tools to the latest available version. This pre-stages the drivers for the newest hardware, even though it’s not “installed” yet.
2. Reboot the guest and make sure it boots and runs properly after all upgrades (host and guest) have been completed.
3. Back up the entire guest VM, including the VMX and VMDK files.
4. Upgrade the virtual hardware through vSphere.
5. Boot the VM and verify all settings are working properly.

I started the upgrades in the Unix lab. The Red Hat Enterprise Linux (4 and 5) and Ubuntu (10) systems went without a hitch: the VMware Tools automatic upgrade went properly, the systems rebooted fine, and after upgrading the virtual hardware, I didn’t have to change a thing in the guests. The Solaris 10 x86 guest had some issues, however. I believe a rescan was all that was required to fix it, but we were planning on rebuilding the box anyway, so we used the issues as the final nail in the coffin for the old hardware.

On the Windows side, we have 2 pools in our ESX environment: one for test machines, and one running our production environment. We have Domain Controllers (and separate forests) in both environments, but all file and Exchange operations only live in production.

The Windows 2003 DC / Exchange 2003 server came up fine, although it lost its network configuration (the adapter MAC changed); that had to be reset, but it’s a simple fix.

All Windows 2008 DCs in the test lab, including the RODC, came up fine, but with the same “lost network configuration” hiccup. These systems all have the NTDS data and logs on the C: drive.

The Windows 2008 Server Core DC / File server, however, was a different story. Upon reboot, the server kept giving a BSOD and rebooting, so I couldn’t read the error. As this system is the primary (200GB) file server, the primary DNS server (including conditional forwarding to the test lab), and the DC that handles the most load (the DNS weight on the Windows 2003 box is slightly lower), fixing the Blue Screen was of major importance. This is how it was fixed:

1. Safe Mode and “Last Known Good Configuration” didn’t work, so I hit F8 during boot to choose “Do not restart on system failure”. This allows you to read the BSOD message. In our case, it was simply “File Not Found”, which means no minidump, and you might be sunk.
2. On a whim, since it is a DC, I tried to boot into Directory Services Restore Mode, hoping the “not found” file was AD related… and I was right.
3. This leads us down the path of this support article.
4. Immediately upon booting, I ran “ntdsutil files integrity”, which gave this error:
Could not initialize the Jet engine: Jet Error -566.
Failed to open DIT for AD DS/LDS instance NTDS. Error -2147418113
5. Searching turned up nothing much of use, but we know it’s a failure to read the DIT. This could be a security problem, or horrid corruption.
6. I quit ntdsutil to try to check the files on the E: drive, where they lived, only to find there was no E: drive. With no MMC, it’s diskpart to the rescue.
7. diskpart
DISKPART> list disk
Disk ###  Status   Size    Free  Dyn  Gpt
--------  -------  ------  ----  ---  ---
Disk 0    Online   24 GB   0 B
Disk 1    Offline  100 GB  0 B
Disk 2    Offline  100 GB  0 B

8. I ran:
select disk 1
online
select disk 2
online
exit

9. Now I could read the E: drive, so I tried “ntdsutil files integrity” again… and got the same error message. Checking the disk, everything looked fine. On Linux, I’d check permissions with a quick “touch filename”; here I had to use notepad instead, only to discover the entire disk was marked read-only. Back to diskpart!
diskpart
select disk 1
attributes disk clear readonly
select disk 2
attributes disk clear readonly

10. Now ntdsutil ran properly; after a reboot into normal mode, the system was fixed!

I haven’t seen posts from other people having disks get marked offline and unreadable on their VMs after an upgrade, but this only happened on the Windows 2008 system, and only to its non-system disks.
