David Anderson
Posts

Neat remote linux surgery story, if you're into that sort of thing.

I'm visiting my in-laws, and ssh'd into my home desktop to do a bit of hacking. Whoa, it's extremely sluggish, what the hell. Load average is 265, wtf?

Turns out my nvidia GPU seems to have crashed due to some sort of hardware fault, and that caused the nvidia kernel module to lock a kernel thread into a tight 100% CPU loop. Xorg, Cinnamon and Chrome were also all pegged at 100%. This has happened before while I was at my desk, and it manifests as the machine seemingly locking up. TIL it doesn't completely lock up, it merely gets veeeeeery slow and freezes the display.

Due to a quirk of how the drives in my desktop are organized, I can't just reboot the machine remotely, because it'll reboot into Windows and I'll lose remote access. I also can't unload the nvidia kernel module that's causing havoc, because it's all borked and not cooperating.

But wait, there's kexec! I can tell this borked kernel to jump ship and pass control to a fresh copy of itself! And indeed, a few heart-stopping seconds after the `kexec -e`, my desktop pings again. Triumphantly, I ssh in... and the load average is 150 and climbing. Turns out a soft kexec doesn't reset the GPU, so when the nvidia driver loaded, it went straight back to spinning in the borked codepath.
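
For reference, the dance looks roughly like this (a sketch; the kernel and initrd paths are assumptions, check what's actually in your /boot):

```
# Load a fresh copy of the running kernel into RAM.
kexec -l /boot/vmlinuz-$(uname -r) \
      --initrd=/boot/initrd.img-$(uname -r) \
      --reuse-cmdline    # keep the current kernel command line
# Jump into it: no firmware reset, no bootloader, no GRUB defaulting to Windows.
kexec -e
```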

Okay fine, blacklist the nvidia module and kexec again. Wrong! Blacklisting doesn't prevent the module from being loaded as a dependency of another module. Instead, you need to specify `install nvidia /bin/false`, which tells modprobe to run /bin/false instead of actually inserting the module.
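
In file form, that lands in modprobe.d (the filename is my choice; only the directives matter):

```
# /etc/modprobe.d/disable-nvidia.conf
# "blacklist" only stops alias-based autoloading; the "install" override
# also defeats loading as a dependency of another module.
blacklist nvidia
install nvidia /bin/false
```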

Okay, that's better: after one more reboot, the load average is a mere 25. With the nvidia module blacklisted, X still started up using some other generic driver and tried talking to the GPU... getting itself stuck in a similar, but less fatal (i.e. userspace-only) loop. `systemctl disable lightdm && systemctl kill lightdm` took care of that, and the machine's pretty happy now, despite the brained GPU.

Now, anyone know if there's a way to get linux to powercycle a PCIe slot without rebooting the entire machine? I don't even know if PCIe has that capability, or if the power rail is always-on.
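
The closest thing I've found so far is the sysfs remove/rescan dance, which detaches and re-enumerates the device rather than actually cutting power, so I assume it wouldn't un-wedge a truly dead GPU:

```
# The 0000:01:00.0 address is a placeholder; find the real one with lspci.
echo 1 > /sys/bus/pci/devices/0000:01:00.0/remove   # detach from the kernel
echo 1 > /sys/bus/pci/rescan                        # re-enumerate the bus
# Hotplug-capable slots may expose /sys/bus/pci/slots/<N>/power, which can
# really cut slot power, but most desktop boards don't wire that up.
```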

(Yes yes, year of linux on the desktop, and all that. I'm still reasonably impressed at the amount of invasive surgery that was possible to work around a hardware failure from 500 miles away)

Post has shared content
No. Burnout is caused when you repeatedly pour large amounts of sacrifice and/or effort into high-risk problems that fail. It's the result of a negative prediction error in the nucleus accumbens. You effectively condition your brain to associate work with failure.

This is a really interesting insight, and worth keeping in mind.

Post has attachment

Ignition (currently) doesn't have a way to provide the Ignition config on bare metal installs other than on the kernel command line. This is fine when you're PXE-booting, but if you're installing to local disk, then it gets a bit icky.

Fortunately, OEM configs to the rescue! The OEM partition can provide a grub.cfg containing configuration snippets. Just drop a 'set linux_append="coreos.config.url=oem://ignition.json"' into that grub.cfg, and Ignition will load ignition.json from the OEM partition. Lovely!
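
Concretely, something like this; the /usr/share/oem mount point is my assumption from how Container Linux images mount the OEM partition, so double-check on yours:

```
# grub.cfg at the root of the OEM partition (mounted at /usr/share/oem),
# with ignition.json sitting alongside it
set linux_append="coreos.config.url=oem://ignition.json"
```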

Post has shared content
My random thanksgiving hacking project: another programmable flight computer for KSP. Unlike kOS, which exposes a high-level scripting interface, my still-nameless project exposes a raw machine with a well-defined programmer's model, which you program in assembler (or in anything you feel like writing that can output assembler).

The machine itself is also kinda cool. It's a multicore, SIMD-capable stack machine. Broken down: multicore means your program can spin up additional CPUs running different code. You might have one CPU dedicated to attitude control, one dedicated to thrust and staging control, and so on. The CPUs all run their code simultaneously, and can communicate in a few well-defined ways (e.g. to send a "Thrust for 200m/s" command to the thrust control CPU).

SIMD-capable means that the CPUs have vector arithmetic instructions, so you can e.g. add a 3-value vector to another 3-value vector in a single "add.v" instruction, rather than 3 "add" instructions. This should hopefully make it easier to do the vector math necessary for astronavigation.

Stack machine means that all operations reference a per-CPU stack of values. This is a fairly easy programming model for hand-written assembler, since you don't have to keep track of which registers are in use, which have magical meanings, and so forth. It's an unusual architecture to see in hardware, but Forth is a well-known programming language that uses a stack machine. As a result, the assembler feels a lot like "low-level Forth" in a few ways.
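
For flavor, stack-machine code tends to look like the snippet below. The syntax here is entirely made up for illustration; the real assembler may differ on every detail:

```
; hypothetical syntax, for illustration only
push.v 0.0 9.81 0.0   ; push a 3-value gravity vector onto this CPU's stack
push.v 0.4 15.2 1.3   ; push a thrust acceleration vector
add.v                 ; pop both, push their sum: one SIMD op, no registers
```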

I promise, it's not that complicated to use. With a few examples, anyone can pick up how the machine works and start writing useful code. The screenshot below shows my "end to end" test program running within KSP: two CPUs counting to 100 in different increments before doing an orderly shutdown.

It's far from finished. Notably, there is currently no I/O support, meaning that the computer can't actually get data from, or send commands to, the spacecraft. That's the next part that needs implementing: the ability to actually control stuff.
[Photo: the end-to-end test program running in KSP]

Post has attachment
Armed with the KSPX parts pack, an Apollo-style Mun mission: a LEM packed under the service module, and a two-stage LEM at that.
[Album: Derpollo 1&2, 44 photos]

Post has attachment
I've been playing with the kethane mod, in preparation for resources in .19. After doing recon with a probe in a polar orbit, I landed that probe in a near-equatorial patch at the edge of the large sea. The probe serves as a marker for larger ships which do not carry a kethane detector.

Following that, a mining rig made its way to Minmus, landed, and proved the kethane field. I'm now setting up a space station network, +Scott Manley style, to provide orbital support for passing transports. It's currently just a service module around Minmus (rocket and RCS fuel, solar panels and a docking port with a probe brain), but habitat modules and other comforts are on the launchpad, ready for the off.

The eventual plan is to turn Luna Control into an interplanetary launch station, where ships fill up on the way to the outer worlds. And, of course, the same thing on the other end. Wouldn't want to be short on fuel for the long trek back from Jool.
[Album: KSP kethane playthrough, 3 photos]

Post has attachment
Arrived at Jool, but overdid the aerobraking a little... Jeb hasn't given up though, he'll be thrusting to the bitter end!

Oh well, I'm sure the next three kerbals will be fine...

Post has attachment
Reddit's thoughts on the Spanner paper: "No need to take out patents on it, Google already employs everyone capable of understanding it and creating an implementation."

It is a very information-dense paper, and the technology stack required to make Spanner work the way it does is fearsome (for example, do your machines run a time daemon that provides a drift SLA? NTP doesn't count, since it makes no guarantees about the rate of drift). But once you put it all together, magic happens.

Today's random sysadmin post: how to forward all the mail from your machine to a single address.

MTAs like Exim and Postfix are extremely powerful and flexible. However, neither seems to document something very simple that I'd like them to do: "I don't care who the recipient is, send everything you receive to this one address."

The use case here is virtual machines whose daemons may send email (e.g. cron, logwatch). Typically, these daemons will send to some_user@localhost, or just some_user. This means notifications I care about end up sitting on the disk of the VM, where I can't see them. Most of the daemons can be reconfigured to send to a specific non-local address, but a few can't. And I don't want to sprinkle .forwards for every system user, especially given that many daemons will define random new ones for their use.

So, I want my MTA to do something trivial: all email it receives should get its sender and recipient rewritten to logs+HOSTNAME-USER@natulte.net, and be forwarded to gmail's servers for processing (my domain is managed by Google Apps). This sounds fairly simple, but it took quite a bit of fiddling to get right.

Here's the config that works, for the machine "aquaman.natulte.net" forwarding email to logs@natulte.net. First, /etc/postfix/main.cf: http://pastebin.com/UvfgnScd
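
In case the pastebin rots, here's a rough reconstruction of the interesting bits, pieced together from this post's description; it is not the original file, and the relayhost in particular is an assumption:

```
# /etc/postfix/main.cf (reconstructed sketch, not the original pastebin)
myhostname = aquaman.natulte.net
# Rewrite sender and recipient addresses through the regexp map below.
canonical_maps = regexp:/etc/postfix/canonical_maps.cf
# Assumption: relay everything to Google's MX for the Apps-managed domain.
relayhost = [aspmx.l.google.com]
# Only accept mail generated on this machine.
inet_interfaces = loopback-only
```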

Two influential settings that are not set in that config, because their defaults are just fine, are append_at_myorigin and append_dot_mydomain, which canonicalize addresses like 'root' or 'root@aquaman' into fully qualified ones. After that simple canonicalization, /etc/postfix/canonical_maps.cf is consulted to rewrite the recipient.

Here's that file: http://pastebin.com/qNUqSfCT

Basically, that file says to only rewrite if the address is not already of the form logs+SOMETHING@natulte.net. Then a series of increasingly general patterns rewrites it, grabbing as much information out of the original address as possible. In order:

USER+PROGRAM@anything becomes logs+aquaman-USER-PROGRAM@natulte.net
USER@anything and USER become logs+aquaman-USER@natulte.net
Anything else (which should be impossible, but just in case) becomes logs+aquaman-root@natulte.net (i.e. assume only root can convince postfix to mail to broken addresses)
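
Again hedging against pastebin rot, here's a sketch of a regexp map implementing those rules; it's my reconstruction, not the original file:

```
# /etc/postfix/canonical_maps.cf (reconstructed sketch)
# Don't touch addresses that are already rewritten.
if !/^logs\+.*@natulte\.net$/
/^([^@+]+)\+([^@]+)@.*$/    logs+aquaman-$1-$2@natulte.net
/^([^@+]+)(@.*)?$/          logs+aquaman-$1@natulte.net
/^.*$/                      logs+aquaman-root@natulte.net
endif
```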

That's it. With those settings, any email sent from the machine, by anything, will end up in my logs mailbox with appropriate +metadata to describe where it's from.

Ideally, I'd also like to rewrite the Subject to prepend the originating hostname and user, but I haven't dug into how to make that happen yet, nor how to make it jibe with the rewrite above.
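
One untested idea for later (my speculation, not part of this setup): a header_checks rule can at least prepend a static hostname tag, though the sending user isn't visible at that layer:

```
# main.cf addition:  header_checks = regexp:/etc/postfix/header_checks
# /etc/postfix/header_checks (untested sketch)
/^Subject: (.*)$/   REPLACE Subject: [aquaman] $1
```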

There may be a simpler way to implement this, but this is the first one I've found that captures really all of the email generated by a machine.