Working in technology usually means a fair amount of fixing stuff that’s broken. The advice of Roy and Moss from The IT Crowd is often handy: “Have you tried turning it off and on again?” Strong Google/DuckDuckGo skills and some free time are usually all you need to fix almost anything these days. I’ve really only had one situation in my tech career that had me stumped for a long time: bad networking at HSRA. That issue eventually got resolved with new hardware, but it’s a terrible feeling when technology doesn’t follow a logical troubleshooting process.
A week ago, another issue surfaced that may join the troubleshooting hall of shame. It started over the weekend with a team member who couldn’t log in to the file server on our office network. With most of our staff working remotely, the on-premises file server is usually accessed via VPN, from both Mac and Windows clients. I connected from home and was able to see the file server (and thankfully all of the files) from my office iMac, but could not connect from my MacBook Pro. I find it useful to keep troubleshooting notes for future me, so please continue on if you’d like some nerd-tastic reading.
Like most everything in our building, our infrastructure is starting to get old. The file server is part of the core infrastructure that was installed back in 2016. Our EMC VNXe3200 SAN is the foundation of a virtual environment with VMware hosts (Dell acquired EMC in 2016). There is a pair of Windows virtual servers, along with a Linux VM. The Windows servers act as our primary and secondary directory servers (Active Directory, still on Windows Server 2012), while the Linux box runs our intranet and some other IPTV services. It’s a fairly complex setup, but has been rock solid up to this point.
The VNXe3200 can serve CIFS shares directly, using AD for file permissions and access management. My AD servers are set to automatically install Windows updates, which I suspect was the root cause of this problem. The SAN hardware all seemed to be fine – no disk, power or network issues. People who were connected had no problems; it was looking like an issue with the authentication from AD. The web-based Unisphere management interface for the SAN was still running the Flash version of the Operating Environment (OE), so I needed to figure out a way around that issue to get more info from the logs (since Flash is now dead).
One of my original project engineers was able to set me up with a very old VM that had a copy of Firefox with the Flash plugin. Being careful to restrict network access for both security and auto-update reasons, I managed to get Unisphere updated to the latest OE version with HTML5 (220.127.116.1186894). Looking at the logs, the SAN had lost connectivity to directory services:
All Domain Controller servers configured for the CIFS server are not reachable. Please check this is not a network connectivity issue. Ensure at least one Domain Controller is up and running and is reachable by VNXe storage array.
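Since the log message asks you to rule out network connectivity first, here is a minimal sketch of how that check could be scripted from any machine on the same network. It is not what I actually ran. The port list is my assumption about which services a CIFS server typically needs from a domain controller, and the function names are made up for illustration.

```python
import socket

# Standard TCP ports for services a domain controller usually exposes.
# Which of these the SAN actually depends on is an assumption here.
DC_PORTS = {"DNS": 53, "Kerberos": 88, "LDAP": 389, "SMB": 445}


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS lookup failures.
        return False


def check_domain_controller(host):
    """Map each service name to whether its TCP port accepted a connection."""
    return {name: port_open(host, port) for name, port in DC_PORTS.items()}
```

Calling `check_domain_controller("dc1.domain.local")` (a hypothetical hostname) returns a dictionary of service names and True/False values. A clean result only shows that the ports answer; it says nothing about whether authentication itself is working.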
It did not appear to be a network issue and I could log in to the AD servers with no problem (with multiple user accounts). The support contract I had with Dell/EMC on the VNXe3200 had expired and my experience trying to contact them did not go well. They took forever to respond and when they finally did, they wanted to charge me for all of the expired time, plus another year (for an amount that was about equal to just buying a new solution).
My immediate concern was making sure we had reliable backups of all files and folders. My ultra-low cost cloud backup strategy is to connect to the file server on my office iMac and use an app called qBackup that connects to a Backblaze B2 Cloud Storage account. This has worked great for years – the script runs nightly with an incremental backup, qBackup was a one-time cost of $30 and the Backblaze charge is usually under $10 a month. The flaw in this cloud backup strategy is that it takes a really long time to restore 2.5 terabytes of data online (you can pay $189 and wait for Backblaze to ship a hard drive copy, but I don’t know how long that takes).
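To put a rough number on "a really long time", here is a back-of-the-envelope sketch. The link speeds and the 80% efficiency factor are assumptions for illustration, not measurements from our connection or anything Backblaze publishes.

```python
def restore_hours(size_tb, mbps, efficiency=0.8):
    """Estimate hours to download size_tb terabytes over an mbps link.

    efficiency is an assumed fraction of the line rate actually achieved.
    """
    bits = size_tb * 1e12 * 8                     # decimal terabytes -> bits
    seconds = bits / (mbps * 1e6 * efficiency)    # usable bits per second
    return seconds / 3600


print(f"{restore_hours(2.5, 100):.1f} hours at 100 Mbps")    # 69.4 hours
print(f"{restore_hours(2.5, 1000):.1f} hours at 1000 Mbps")  # 6.9 hours
```

Under those assumptions, a full 2.5 TB restore over a 100 Mbps connection would take close to three days of continuous downloading.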
Since I still had access to the file server from the iMac, I stopped at Best Buy to get a 4TB external hard drive (and some thumb drives for people who needed files right away via SneakerNet). Now I had the cloud backup *and* locally attached copies of everything in my office. I tracked down the original engineer who helped install and configure this setup and he helped me create a new share that is served directly from the domain controller. I copied over all of the backup files from the 4TB drive and tested connections. That worked, so I reviewed security settings with our GM and applied permissions to folders via AD security groups. Cloud backup was re-pointed to the new share and ran successfully from my iMac. I created cheat sheets on how to connect to the new share from both Mac and Windows clients and sent them out to our staff. Done, right? Well…
All of my Mac users had no problem connecting and seeing what they were permitted to see. Some Windows users were also completely fine, but others connected and could not view all of the folders they were allowed to see. I initially had access-based enumeration turned on, so some hidden folders were expected (but not the ones they should have been able to see). One colleague saw all folders in the office, but not over VPN from home (on a brand-new laptop). This TechNote pointed towards a local cache issue (which this article also talks about). We’ve been experimenting with various offline settings and most people are now connected successfully. Here’s how we are troubleshooting Windows connections now:
- Restart the workstation first, Roy
- Re-map the network drive using a different letter
- Use the full AD name (e.g. domain.local\username)
- Delete local offline cache files
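The re-map and full-AD-name steps can be combined into a single `net use` command. Here is a small sketch that builds it; the server name, share name and drive letter are placeholders, and the helper function is hypothetical, not something from our cheat sheets.

```python
def net_use_command(letter, server, share, domain, user):
    """Build a Windows `net use` command that maps a share to a drive letter."""
    drive = letter.rstrip(":").upper() + ":"   # accept "y", "Y" or "Y:"
    unc = f"\\\\{server}\\{share}"             # UNC path, e.g. \\server\share
    # Using the full domain\username avoids relying on cached short names.
    return ["net", "use", drive, unc, f"/user:{domain}\\{user}", "/persistent:yes"]


# Placeholder values; substitute the real server, share and account.
print(" ".join(net_use_command("y", "fileserver", "shared", "domain.local", "username")))
```

On a Windows client the resulting list could be passed to `subprocess.run(..., check=True)`, or the printed line can simply be pasted into a Command Prompt, which should then prompt for the password.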
Dell/EMC sent me a notice this weekend that there is another update to the Unisphere OE (18.104.22.16899487), but I didn’t see anything relevant in the release notes. I did download the huge .gpg file anyway, but so far the health check is timing out and I can’t get it installed. I also forgot to mention that we rolled back two of the automatic Windows Server updates that installed in mid-February (and turned off automatic updates). That obviously isn’t a long-term strategy, so I’d like to get updated to Windows Server 2019 soon (VMware updates too). OneDrive, Teams, SharePoint and even Box/Dropbox for Business are all options that may come into play as well.
Serving files shouldn’t be rocket science. At least I felt a little better when the engineer said, “I’ve installed hundreds, if not a thousand, file servers like this and I’ve never seen one do what yours is doing.”
Trailblazing!

Originally published by DK on March 1, 2021 at 5:12 pm