Comcastic (The Tale of the Good Tech)

[follow-up] So in the end, resolving the issue we had with Comcast took someone who knew what they were doing, with good debugging skills and a good attitude, and about 15 minutes. It helps to be methodical. It’s fine to twist knobs and keep trying stuff, but you need to be scientific about things. Write stuff down if you need to. Be organized. Change one variable at a time.

1. Examine the data carefully

Clueless tech: “I don’t even know where to start.”

Good tech: “Hmmm, only some channels work when they ALL should. Let’s look up some of these channel frequencies on the channel map reference and see if they’re correct.”

2. See if you can state the problem simply (possibly by recognizing that there is a problem)

Clueless tech: “Lots of channels aren’t working. Huh. Did you give us money for them?”

Good tech: “Channel 647 isn’t working, and it should.”

3. Ask yourself if there’s any more data you can gather

Clueless tech: [Argues about subscription packages with the customer]

Good tech: “Hmmm, I have a signal strength meter with me, maybe I should measure this.”

4. For a complex system, try tracing a fault back to its source (there's a sketch of the idea at the end of this post)

Clueless tech: [still arguing about the channels we’re paying to not get. Shows me a page on his phone that he says supports his argument that we shouldn’t get those channels; I do some research in front of him to show that he’s flat wrong and looking at stuff from years ago]

Good tech: “Hmm, no signal here on that frequency, maybe sampling the signal closer to the street will help localize the fault?”

5. While it might make sense to twist a knob a few times to see if the problem just goes away, twisting the same knob thirty or forty times only makes you look like a frustrated Skinner-box rat whose researcher has gone on vacation and forgotten to fill the pellet jar.

Clueless tech: “Let’s do that thing that didn’t work before, again, a few more times, because it’s the only tool I’ve got.”

Good tech: [Already out the door and finding the cable closet]

Root cause was one of the two things I thought might be the issue. The first possibility was a bad channel assignment from the head-end (and believe me, the software that runs the head-ends is crap); the second was a cable trap that had been left on the line. Turned out to be the trap: a filter had been installed maybe ten years ago in a locked utility closet and never removed, and it was happily quashing the signals between 270 and 670 MHz. Since Comcast went entirely digital in our area a few years ago, these traps should all have been removed.

The first tech just flailed and couldn’t actually think about the problem, though he could have solved it if he had paid attention. The second tech knew a few failure modes, but more importantly, he knew how to think about a problem.

6. Don’t design systems that lack end-to-end diagnostics. They will be expensive for you, and your customers will have little pity as they stand back and watch you flail. I’m happy to pay NASA to publicly fail because they do stuff that is exciting, dangerous, and hard; cable provisioning is none of those.
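Point 4 above is really just bisection along the signal path. Here's a minimal sketch of the idea in Python, with entirely made-up test points and a signal_ok probe standing in for the tech's meter; it's an illustration of the technique, not anything Comcast actually runs:

    # Hypothetical sketch: localize a fault along an ordered signal chain by bisection.
    # signal_ok(point) stands in for "put the meter on this tap and take a reading".

    def find_fault(chain, signal_ok):
        """Return the first point in the chain where the signal goes bad,
        assuming it is good at the source and bad at the far end."""
        lo, hi = 0, len(chain) - 1          # invariant: chain[lo] good, chain[hi] bad
        while hi - lo > 1:
            mid = (lo + hi) // 2
            if signal_ok(chain[mid]):
                lo = mid                    # signal still good here: fault is downstream
            else:
                hi = mid                    # signal already bad: fault is here or upstream
        return chain[hi]

    # Made-up example: the readings go bad at the utility closet, which is where the trap was.
    chain = ["head-end", "street tap", "utility closet", "wall jack", "TV"]
    bad_from = {"utility closet", "wall jack", "TV"}        # pretend meter readings
    print(find_fault(chain, lambda p: p not in bad_from))   # -> utility closet

With n test points that's roughly log2(n) meter readings instead of n, which is why the good tech headed for the cable closet instead of re-measuring at the TV for the fortieth time.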

And we’re back

Nothing like a good random savaging from your $hostingProvider. Sorry about that. It’s been a week of downtime because the tools that $hostingProvider has for diagnosing problems are about as useful as a stack of moldy Ouija boards. Logs? Oh you can ask for logs, but they will not come. Forum quality on $hostingProvider ranges from “Cargo Cult” nonsense to “plz hlp” because it’s a cheap-ass neighborhood.

Me to support: “Site down. Apache looks utterly dead. Logs? WTF.”

Support: “Huh. Lots of stuff wrong, too much to describe, really, and we’re throttling traffic to that server because there are too many people on it. Why are you on that old server anyway?”

Me: “What?”

Support: “No problem, we’ll just move your site. Oh, wait. You have to delete all your old databases first.”

Me: “What?”

Support: “Bye now. Merry Christmas.”

Took another call to support to get Apache high enough to crash. Actually, the support folks are pretty good; it’s the policies and configurations they’re called on to support that are, shall we say, legacy. It says so, right there.

Another call or two to have them repair permissions. On stuff. No word on what actually changed or what permissions they had to fix, just, um, you know . . . stuff that needed permissions changing. Oh yes, all this time, the entire time the site was down, the little status widget in the corner of my site’s administration page was cheerfully reassuring me that the site was 100% up and functional (while right next to that, a site preview page was declaring HORRIBLE ERROR and APACHE CATASTROPHE). Yay for modern technology.

So, I mucked with crap for a week, twisting knobs and pushing buttons and waiting for random pieces of web UI to agree with each other. Honestly, half of that time went into figuring out how to run the hell away from $hostingProvider by moving a few domains experimentally and discovering the roadblocks that $hostingProvider decided were necessary to prevent customer flight. Go figure. Finally got things in order, restored the database, restored it again (because $hostingProvider), and things seem to be working now. Hi there.

$hostingProvider will soon be an ex-hosting provider. I’ve already moved a couple of domains; this one is next. It’s fun.