Neil Flake
63 followers
OK, after the ordeal I just went through trying to rescue a locked-up/corrupted Active Directory box, I figured I would do a write-up about what happened, what I did or should have done, and the final resolution. Hopefully someone can gain some knowledge from this and not have to go through the nightmare I did. Someone with more advanced AD knowledge would likely have just killed the main box, since there was a backup; my concern at the time was the amount of corruption in the main box and whether it would have "spread" to the backup. OK, I'm getting ahead of myself. I apologize in advance: I type like I talk, so it may be a bit hard to read. Try to follow if you can :)
 
History:
It was a cold, dreary morning on Tuesday ... just trying to set the mood here. Actually it wasn't so dreary; things were going well. School was out for snow/ice, so I had settled in to do a few chores I had set aside for days like that. One of them was replacing an admin's PC with a newer one, and I was across campus doing that. I had copied their My Documents over to their H: drive (the home profile drive on the server), and had just plugged in the new PC and started the copy from H: back over. As the files were copying I saw an old backup in that user's drive, so I went ahead and deleted that too. In the middle of both processes, all of a sudden they just stopped. I tried again, but this time the H: drive was no longer working. Internet was sort of working, so I thought maybe a network card or cable had gone bad. (Funny thing is, the previous PC had some network issues and I had spent a lot of time getting it to connect properly, so I knew for sure that it was a network issue or something in the wiring at that building.) I spent 2-3 hours replacing network cards, changing out network cables, and doing everything else I could think of.

I finally got so exasperated that I decided to head back to my office (next to the server room) and try to re-image the PC. In my office I was trying to get the image going, but DHCP wasn't handing the PXE boot off to FOG. Odd, since I had just done that a few days before, so I got up and walked into the server room. THERE STARING ME IN THE FACE was a server screen that was LOCKED UP! No amount of key presses or anything else would get a response from it. This is what had kept the PC from accessing the H: drive, and why Internet would work a little bit (some of the services, DNS apparently, were still running in the background). So my only course of action was to RE-POWER the server.

Before I go too far into the issue and final resolution: there will likely be some purists, or those who deal with AD daily, who will cringe at what I did or didn't do. Keep in mind that I'm not a specialist. I can create an AD and manipulate it, and just recently I had gotten the backup AD going (I know there is no primary/secondary anymore; it just helps me keep it all straight). So the thought of demoting the main one, trying to clean up metadata, then creating a new one, promoting it, and hoping that the 2003 backup AD would propagate to it after the failure of the first was daunting. Along with that, all the file shares, printer shares, folder permissions, etc. that I would possibly have to redo were also on my mind. Since this is now resolved, I am going to run tests on how easy or hard that would have been, using 2012 Hyper-V VMs. I never had time before, but I'm going to make time, so the next time there is an AD crash I'll be better prepared for triage/recovery.
 
Start of Woe:
I have never had good luck when a server is rebooted; any time I've seen errors in the past, it has been during a reboot. So I do everything I can to keep the boxes running at all times. Had no choice with this one. Well, upon reboot the dreaded BLUE SCREEN OF DEATH was staring me in the face, and I started sweating bullets. I wrote down the error message: (C00002e2, Directory could not start because: "A device attached is not functioning ..0xc0000001). Upon research (Google), it pointed to a corrupted ntds.dit (the AD database) or a log file. The process was to boot into Directory Services Restore Mode and do the following (taken from this website: http://social.technet.microsoft.com/Forums/windowsserver/en-US/771b97ad-4e1c-4e7c-8617-91601224dd7f/server-core-2008-r2-blue-screen?forum=winserverManagement  )
_____________________________________________________________________
1.  Restart the server and press F8 key, select Directory Services restore mode.
2.  Log in with the local administrator username and password (hope you remember what you set it to!).
3.  Type cd \windows\system32
4.  type NTDSUTIL
5.  type activate instance NTDS
6.  type files
7.  If you encounter an error stating that the Jet engine could not be initialized exit out of ntdsutil.
8.  type cd\
9.  type md backupad
10. type cd \windows\ntds
11. type copy ntds.dit c:\backupad
12. type cd \windows\system32
13. type esentutl /g c:\windows\ntds\ntds.dit
14. This will perform an integrity check, (the results indicate that the jet database is corrupt)
15. Type esentutl /p   c:\windows\ntds\ntds.dit
16. Agree with the prompt
17. type cd \windows\ntds
18. type move *.log c:\backupad   (or just delete the log files)
This should complete the repair.  To verify that the repair has worked successfully:
1.  type cd \windows\system32
2.  type ntdsutil
3.  type activate instance ntds
4.  type files        (you should no longer get an error when you do this)
5.  type info       (file info should now appear correctly)
One final step, not sure if it's required:
From the NTDSUTIL command prompt:
1.  type Semantic Database Analysis 
2.  type Go 
_______________________________________________________________
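The repair steps above can be written down as an ordered plan, which makes it harder to skip the backup copy before the destructive esentutl /p step. This is only a rough Python sketch: the function names are made up for illustration, the default paths are the ones from the steps above, and it only builds and prints the commands. It does not run anything.

```python
from pathlib import PureWindowsPath


def build_repair_plan(db_path=r"C:\Windows\NTDS\ntds.dit",
                      backup_dir=r"C:\backupad"):
    """Return the offline repair commands in the order given in the steps.

    Hypothetical helper: it only assembles command lines, it never runs them.
    """
    db = PureWindowsPath(db_path)
    backup = PureWindowsPath(backup_dir)
    return [
        ["cmd", "/c", "md", str(backup)],                # make the backup folder
        ["cmd", "/c", "copy", str(db), str(backup)],     # copy ntds.dit first
        ["esentutl", "/g", str(db)],                     # integrity check
        ["esentutl", "/p", str(db)],                     # repair (can lose data)
        ["cmd", "/c", "move", str(db.parent / "*.log"), str(backup)],  # move logs
    ]


def format_plan(plan):
    """Turn each command list into a single printable line."""
    return [" ".join(step) for step in plan]


if __name__ == "__main__":
    for line in format_plan(build_repair_plan()):
        print(line)
```

Printing the plan and reading it over before touching the server is the whole point; the copy of ntds.dit has to come before the repair because esentutl /p can discard damaged pages.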
OK, not bad, I thought: all the errors on the Jet database went away and it was displaying what it should according to the above. So I rebooted the server. BLUE SCREEN. Went back into recovery mode, tested the above again, and no errors. I tried to think of what "device" it could be talking about and spent another hour or two going back and forth like this.
System State Recovery, surely this will save my bacon. Keep in mind I have current data backups as of the night before: all of the main drive and all of the data drive are backed up to a networked NAS daily, so all user information is there. I had also used Windows Server Backup to do a full backup with system state. The only problem was that it was from November 18th (and it's March 4th now); I had not set up the schedule on it like I thought I had. I was still within the tombstone lifetime of 180 days, so I thought I was good. Did a system state recovery, which took about 30 min, and then the server rebooted NORMALLY… I thought I was saved, but before the login screen it booted to the "Windows Update issues, reverting changes" screen. OK, I can wait. Alright, 3 hours later it has finally reverted all the changes and I'm at a login screen. I started the process at 3pm and it's now going on 8pm. I log in and I'm staring at the wonderful desktop.
(All the while in the background, I had been checking a PC to see what state logins were in. Logins were taking a little longer (DNS) but they did all go through; the only issue was that none of the mapped drives or the home profile drive (H: in our case) were working.)
So I run dcdiag and it spits back all sorts of errors at me. I check AD Sites and Services, Users and Computers, etc. and none will start; each gives a "target name is invalid" error. I go to the backup AD 2003 server and they all do show up there. AD Sites and Services there shows the W2K8 box as "unavailable". I try to ping it by IP and it works. I try nslookup by server name (I'll call it DC1 for example) and it shows the IP but says "server: unknown". I then try to browse to DC1 from DC2 using the server name, \\DC1, and it will not go there, but \\10.x.x.x does show up. So something on the server is not letting it "serve". Researching the target name issue turns up lots of other items to try from the command line, from Kerberos password resetting to stopping services (Netlogon, Kerberos, File Replication, etc.). I did a variety of these from 8pm to about 10:30pm. Out of sheer frustration, I rebooted the server again, thinking that maybe just stopping and starting the services wasn't working right. Upon reboot, it now had me COMPLETELY locked out of the server: I could not log in as any of the domain users, nor would it let me switch to any of the local accounts. It's now 11pm and school had been called off another day, so I figured that with some sleep, something would come to me.
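The "am I still inside the tombstone window" question is just date arithmetic, so here is a quick Python sketch of it. The years are an assumption on my part (the dates above only give month and day), the 180-day figure is the one quoted above and may differ in your forest, and the function names are made up for illustration.

```python
from datetime import date

# Value quoted in the post; check the actual tombstone lifetime of your forest.
TOMBSTONE_LIFETIME_DAYS = 180


def backup_age_days(backup_date, today):
    """Number of whole days between the backup and the restore attempt."""
    return (today - backup_date).days


def is_inside_tombstone(backup_date, today,
                        tombstone_days=TOMBSTONE_LIFETIME_DAYS):
    """True if a system state backup is still younger than the tombstone lifetime."""
    return backup_age_days(backup_date, today) < tombstone_days


# Assumed years; only "November 18th" and "March 4th" come from the post.
backup_taken = date(2013, 11, 18)
restore_day = date(2014, 3, 4)

print(backup_age_days(backup_taken, restore_day))      # 106
print(is_inside_tombstone(backup_taken, restore_day))  # True
```

Being inside the window only means the restore is allowed to rejoin replication; as the rest of this story shows, it says nothing about how much cleanup a backup that old will need afterwards.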
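The "works by IP, fails by name" symptom is worth separating into two questions: does the name resolve at all, and does it resolve to the address you expect? Below is a small Python sketch of that triage. The function names and messages are hypothetical, the resolver can be swapped out for testing, and the last message is only my reading of what happened here (DNS answered, so the fault was higher up).

```python
import socket


def resolve(hostname, resolver=socket.gethostbyname):
    """Return the IPv4 address for hostname, or None if lookup fails."""
    try:
        return resolver(hostname)
    except OSError:  # socket.gaierror is a subclass of OSError
        return None


def classify(hostname, expected_ip, resolver=socket.gethostbyname):
    """Rough triage of a server that answers by IP but not by name."""
    ip = resolve(hostname, resolver)
    if ip is None:
        return "name does not resolve"
    if ip != expected_ip:
        return f"name resolves to unexpected address {ip}"
    return "name resolves correctly; look above DNS (authentication, services)"
```

Something like `classify("DC1", "10.0.0.5")` run from a client would have told me early on that DNS was not the real problem, since nslookup was already returning the right IP.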
Day 2, Woes Continue:
In at 7:30am to try to tackle this beast. After several attempts at password recovery that continued to fail, I decided to do the system state restore again. It worked, but I then had to sit and wait for all the updates to revert themselves again, so at about 11:30am I was able to start working on it again. This time I had done a lot more reading on the internet. At one point I had gone to DC2 (the 2003 backup AD) to see what its dcdiag reported, but it wouldn't run; it stated I needed the server support tools, yet the installed programs list on the server showed them as installed (the resolution for this comes later as well). So I turned my focus back to the main DC1. The same problems existed; the further info was that SYSVOL was not replicating (i.e. \\10.x.x.x\netlogon was dead). This falls under File Replication somewhere and turned out to be one of the main reasons it wasn't serving home drives, etc. I tried copying SYSVOL items from DC2 over to it and I did the BurFlags D4 setting in regedit (but it turns out I didn't shut the correct services off at the appropriate times). I couldn't get this to work, so I turned my focus back to the problem of DC1 and DC2 not being able to replicate back and forth. I found more and more on the Kerberos thing and even on purging tickets between the two. So on DC1 I decided to try to reset the Kerberos password, again thinking that since it was the one that was not working, it was the one I needed to reset (later on it turns out it needed to be reset on both boxes). After a lot of finagling I did get the Kerberos password to reset. At that time I also purged the Kerberos tickets (klist purge, while the Kerberos service is off; same with the password reset). Rebooted DC1 hoping that it would churn back to life… NOTHING. Still not replicating, still not working across the network as \\DC1. It's now about 2pm and I'm really starting to stress out.
I had an idea I was on the right track, but I was missing something that was causing \\DC1 not to share its resources by server name. I start thinking it's becoming a lost cause. My one plan is to contact a friend of mine who has 100s of certs (that is what I joke with him about all the time), is a virtual machine master, and sets up servers all the time. He is at work, so he is only able to give me a little bit of his time; after looking around for 30 min he lets me know it's likely hosed and I need to just drop AD on this one and start up a different one. I know that's probably right, but the thoughts of all those shares and stuff having to change pop back into my mind. I actually start thinking more about the process and change all the batch files to reference \\10.x.x.x (DC1's IP) instead of \\DC1. My mind is racing trying to think of how to do all of this. I have one more ace up my sleeve: another friend who owns a multi-state network services company. He says that from my problem description he isn't sure he can save it either, BUT he has a server management solution that can involve some server specialists based in India. (After what they were able to accomplish and all that it involves, I'll likely be recommending this service to everyone, but that's a later email/post.) At this point I'm open to anything, so I start that process. It's now about 4:00pm. His company has to get me set up on the management platform, which has its initial hiccups (likely from my firewall), so finally at about 6:30pm we have the platform installed and the main techs from India engaged. I was assigned an Active Directory specialist, and at about 7:30pm I get the call from him so I can explain what happened and what I needed fixed. It was extremely difficult to communicate; while his English was not bad, he spoke super soft and super fast. I got across that I needed to save DC1 as the AD and get it and DC2 replicating properly, and also that no users were getting their H: profile drives.
With that platform software they were able to use LogMeIn to directly manage both DC1 and DC2. I hung up the phone and he went to work. I watched the whole process to see where in the world I had missed it and what magical command was going to make it all start working. For the first hour he focused on DC1, mainly scouring the logs, looking at the errors, and getting a grasp of how it was all set up: DNS, name server, IPs, etc. About 8pm he started working through some of the issues, using dcdiag to diagnose them. He went through lots of steps using repadmin but wasn't getting anywhere with it; a few errors went away, but there were several that would not. At about 8:45 he started really nailing the solutions. On DC1 he was trying to reset the Kerberos password but kept running into errors, so he went to DC2 and shut Kerberos off there, but wasn't able to run the dcdiag and netdom tools. He went and downloaded the 2003 server support tools (R2), which include all of these. When I had looked at it, the installed tools were simply older ones and I just needed to download new ones (smacking forehead). After these were downloaded, he was able to stop the Kerberos service on both machines and reset the password on both (netdom resetpwd /s:server /ud:domain\User /pd:*). Then he also purged the Kerberos tickets on both (klist purge); this was another step I missed, and he actually had to do it multiple times. The next step was to reset DNS, Netlogon, etc. on both, all from the command line (net stop dns & net stop netlogon & ipconfig /flushdns & net start dns & ipconfig /registerdns … etc). The flushdns and registerdns were another two things I hadn't run across.
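Since the part I kept getting wrong was doing these steps on only one box, or in the wrong order, here is a Python sketch that lays the sequence out per DC. Treat it as a checklist generator, not a tested procedure: the function names are made up, "kdc" is the service name for the Kerberos Key Distribution Center, pointing /s: at the other DC is my assumption, and the two restart lines marked below are my guess at what was hidden in the "etc".

```python
def recovery_commands(partner_dc, domain_user):
    """Ordered commands to run on ONE domain controller.

    partner_dc:  the other DC, used as the /s: target (assumption).
    domain_user: an account in domain\\user form for /ud:.
    """
    return [
        "net stop kdc",                       # Kerberos off before the reset
        f"netdom resetpwd /s:{partner_dc} /ud:{domain_user} /pd:*",
        "klist purge",                        # may need repeating
        "net stop dns",
        "net stop netlogon",
        "ipconfig /flushdns",
        "net start dns",
        "net start netlogon",                 # assumption: part of the "etc"
        "ipconfig /registerdns",
        "net start kdc",                      # assumption: bring Kerberos back
    ]


def plan_for_pair(dc_a, dc_b, domain_user):
    """Build the same sequence for both DCs, each pointing at the other."""
    return {
        dc_a: recovery_commands(dc_b, domain_user),
        dc_b: recovery_commands(dc_a, domain_user),
    }


if __name__ == "__main__":
    for dc, commands in plan_for_pair("DC1", "DC2", r"EXAMPLE\admin").items():
        print(f"--- on {dc} ---")
        for command in commands:
            print(command)
```

The names DC1, DC2 and EXAMPLE\admin are placeholders. The useful part is that both boxes get the full list, which is exactly what I did not do on my own.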
After doing this to BOTH DC1 and DC2, they started replicating back and forth properly (dcdiag only had one minor error instead of a screen full). He called at about 9:15, thinking the issue was resolved. I informed him that something was still not functioning right, because no H: drive shares were working yet and \\DC1 still would not work. He went back and looked at the SYSVOL/NETLOGON issue. I got a bit lost here, but to generalize: he copied a good copy of the SYSVOL folder from DC2, shut off the File Replication Service (I think that was the one), and set the D4 BurFlags value in the registry (D4 marks the copy as authoritative; the non-authoritative value is D2). After about 15 more min and a reboot of DC2, EVERYTHING STARTED WORKING AGAIN!!!!
So without that extra pro support I would still be triaging a user folder move, shares, folder permissions, etc. today. The irony of it all is that I had already ordered a new server, and it's in the build phase right now in preparation for a major server change this summer. But NOW, if there is an issue with AD replication, I have the tools to know exactly where to look and how to fix it (for the most part).
To sum up: right now, check and make sure your system state backup is running and backing up normally. I think that was the crux of most of these issues, because my backup was over 100 days old, OR Windows Server Backup sucks; not sure which one caused it. Going forward, I've updated my backup plan and I'm searching for a better real-time snapshot style of server backup, so that if something like this happens again, a few hours and an image restore should take care of the problem.
If you are still reading this, then I'm sorry; again, I apologize if I seemed to ramble a bit. It all seemed so crazy at the time. I will also likely start recommending that monitoring service, which includes a TON more than just high-end help in case of emergency.
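The BurFlags step is the one I fumbled on my own because I did not stop the service first, so here is a Python sketch of the order as I understand it: stop FRS, set the value, start FRS. The registry path is the standard NtFrs "Process at Startup" key as far as I know, "ntfrs" is the File Replication Service name, and the helper name is made up. It only builds the command lines; double-check everything against Microsoft's FRS documentation before running any of it.

```python
# Standard FRS location for the BurFlags value (verify on your own server).
NTFRS_KEY = (r"HKLM\SYSTEM\CurrentControlSet\Services\NtFrs"
             r"\Parameters\Backup/Restore\Process at Startup")

# D4 = authoritative restore, D2 = non-authoritative restore.
BURFLAGS = {"authoritative": 0xD4, "non-authoritative": 0xD2}


def burflags_commands(mode):
    """Return stop / set BurFlags / start commands for the chosen mode."""
    value = BURFLAGS[mode]  # written as decimal in the reg command below
    return [
        "net stop ntfrs",
        f'reg add "{NTFRS_KEY}" /v BurFlags /t REG_DWORD /d {value} /f',
        "net start ntfrs",
    ]


if __name__ == "__main__":
    for command in burflags_commands("authoritative"):
        print(command)
```

Only one DC should be marked authoritative (the one holding the good SYSVOL copy); any other DC pulls from it with the non-authoritative value.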

Lake of the Ozarks

Lake of the Ozarks late June
#fishing #lakeoftheozarks #boating  

#bassfishing on #lakeoftheozarks   biggest fish of the week.   One of the best fishing vacations we've had.

#Walleye on #LakeoftheOzarks    First one ever.

Grandma's Garden + Google Plus filter.

Who is ready for "Burgers and Kabobs"
#bbq #backyardbbq #cookout  

Refresh share, since it's been at the bottom for a while.

Don't know why but the hen and chicks plant just fascinates me!

A couple of cool-looking products, a nightlight and a USB charger: just change the faceplate and that's it. Gonna snag several of each before long!!
https://www.youtube.com/watch?v=5uCL79X24kA

Amazing, almost makes me want to take up beekeeping!