Matthew Garrett ([info]mjg59) wrote,
@ 2009-11-13 09:32:00
Entry tags:advogato, fedora

Legacy PC design misery
I've spent chunks of the last couple of days fighting a problem that's existed for about 25 years. The 8086 was a 16-bit processor with a 20-bit address space, limiting the maximum physical address that could be accessed to 1MB. However, quirks of the segmented memory system meant that addresses at or above 1MB (up to a little under 1MB plus 64KB) could be constructed - these would wrap around to the bottom of memory. Because loading the segment registers was a time-consuming operation, some programmers used this behaviour as a performance optimisation.
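As a rough illustration of that wraparound, here is a small Python sketch of real-mode address arithmetic. The function names are made up for this example; the only facts assumed are that a real-mode address is the 16-bit segment shifted left by four bits plus the 16-bit offset, and that the 8086 only had 20 address lines.

```python
def linear_address(segment: int, offset: int) -> int:
    """Combine a real-mode segment:offset pair into a linear address.

    The result can need 21 bits, because 0xFFFF0 + 0xFFFF exceeds 1MB.
    """
    return ((segment & 0xFFFF) << 4) + (offset & 0xFFFF)


def physical_address_8086(segment: int, offset: int) -> int:
    """What the 8086 actually put on its bus: only the low 20 bits."""
    return linear_address(segment, offset) & 0xFFFFF


# FFFF:0010 is exactly 1MB as a linear address...
print(hex(linear_address(0xFFFF, 0x0010)))
# ...but on an 8086 the 21st bit is simply lost, so it lands at address 0.
print(hex(physical_address_8086(0xFFFF, 0x0010)))
```

Code that kept a segment register at 0xFFFF could therefore reach the bottom of memory without reloading it, which is the optimisation described above.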

The 80286 introduced a 24-bit address space. Unfortunately, this meant that the addresses that previously wrapped to the bottom of memory now pointed at real addresses - not ideal if you were expecting the old behaviour. IBM fixed this by routing the 21st address line (A20 - they're zero indexed) through an AND gate, with the default behaviour being to keep it tied at 0 and thus maintaining the old wraparound behaviour. Applications that wanted to access the full address space needed to enable the A20 logic gate. IBM didn't want to add any extra hardware to their system if they could avoid it, so tied the other side of the AND gate to a spare pin on the keyboard controller. By writing a couple of bytes to the keyboard controller, your PC-AT stopped pretending to be an XT and gave you access to all of the insanely expensive RAM it had stuffed in it. Hurray!
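The effect of the gate can be modelled in a few lines of Python. This is only a sketch of the logic, not of any real hardware interface, and the names are invented for the example. Note that the gate forces just one bit to zero, so with A20 disabled every odd megabyte aliases onto the even megabyte below it, not only the first one.

```python
A20_BIT = 1 << 20  # the 21st address line, counting from A0


def gated_address(address: int, a20_enabled: bool) -> int:
    """Model the AND gate on A20: when disabled, bit 20 is forced to 0."""
    if a20_enabled:
        return address
    return address & ~A20_BIT


# With the gate disabled, 1MB wraps to 0 just as it did on the 8086.
print(hex(gated_address(0x100000, a20_enabled=False)))
# With it enabled, the same address reaches real memory above 1MB.
print(hex(gated_address(0x100000, a20_enabled=True)))
# Only bit 20 is masked, so 3MB aliases to 2MB rather than to 0.
print(hex(gated_address(0x300000, a20_enabled=False)))
```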

PCs have been emulating this behaviour since the AT was first cloned. Of course, this being the PC industry, many have got it wrong. There's a set of approaches for controlling the A20 gate that may work, varying in terms of performance and desirability. Most hardware will give the desired result (i.e., I have no desire to run DOS executables from 1982, make my A20 work damnit) using any of the various methods of A20 enabling. Some hardware doesn't. The most common method used in bootloaders (where we still have access to system BIOS services) is to call int 15h with an ax of 0x2401, which asks the BIOS to enable A20 for us. This isn't implemented on all hardware, but we should get a failure back that lets us go and bang on the keyboard controller in an attempt to get it to pay attention[1].

Enter the Kohjinsha SC3.

I picked this up second hand in Japan. It's a ridiculously cute little tablet, only slightly larger than hardware that's comfortably in the MID range. It booted a Fedora liveCD perfectly, though having GMA500 graphics meant that what appeared wasn't terribly attractive. Installation proceeded happily enough, followed by a reboot and... nothing. Grub loaded the kernel and initrd, jumped to the kernel and everything hung.

So, for the past couple of days, I was stepping through the kernel setup code, trying to work out where and why it was hanging. I'd got it narrowed down to the region where the kernel tried to free the memory used by the initramfs, but the failure hopped around depending on my kernel build. Something was clearly very wrong. The strangest thing about this was that if I booted the liveCD boot menu and selected "Boot from local drive", everything worked perfectly. isolinux was clearly doing something that grub wasn't, but there's rather a lot of code to step through there.

Things became a lot easier once I found that the OpenSuse version of grub worked. Their grub has a rather smaller set of patches than ours, and only a few looked even plausibly relevant. It only took ten minutes or so to figure out that it was one that altered the A20 code. Things became much clearer then.

The main functional difference between the Suse A20 implementation and the upstream one[2] is that the Suse one explicitly tests whether the A20 enabling worked by putting different values at two addresses that would refer to the same memory location if A20 were disabled. By comparing them, we know whether A20 is working properly or not. If not, it can then fall back to other mechanisms. The Fedora code trusted the BIOS's claim that the int 15 call had worked. The Kohjinsha's BIOS lied, A20 remained disabled, grub copied the kernel and initramfs to chunks of address space that contained lies rather than RAM and everything fell over horribly.
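The idea behind that check can be sketched against a toy machine in Python. Everything here is a simulation invented for illustration (the `SimulatedMachine` class, the method functions and the probe address are all assumptions), and it is not the actual grub or Suse code, which does this in real-mode assembly. The point is only the shape of the logic: try a method, then verify by writing to two addresses 1MB apart rather than believing the return value.

```python
A20_BIT = 1 << 20


class SimulatedMachine:
    """Toy model of byte-addressed memory behind an A20 gate."""

    def __init__(self):
        self.a20_enabled = False
        self._ram = {}

    def _physical(self, address):
        # With the gate disabled, bit 20 of every address is forced to 0.
        return address if self.a20_enabled else address & ~A20_BIT

    def write(self, address, value):
        self._ram[self._physical(address)] = value & 0xFF

    def read(self, address):
        return self._ram.get(self._physical(address), 0)


def a20_is_enabled(machine, low=0x000500, probe=0x5A):
    """Write different bytes 1MB apart and see whether they stay distinct."""
    high = low + A20_BIT
    saved_low, saved_high = machine.read(low), machine.read(high)
    machine.write(low, probe)
    machine.write(high, probe ^ 0xFF)
    distinct = machine.read(low) == probe
    # Restore the original contents (high first, in case both alias).
    machine.write(high, saved_high)
    machine.write(low, saved_low)
    return distinct


def lying_bios_call(machine):
    """Hypothetical BIOS service that reports success but does nothing."""
    return True


def fast_gate(machine):
    """Hypothetical method that really does enable the gate."""
    machine.a20_enabled = True
    return True


def enable_a20(machine, methods):
    """Try each method in turn, trusting the memory test over its result."""
    for method in methods:
        method(machine)
        if a20_is_enabled(machine):
            return method.__name__
    return None


machine = SimulatedMachine()
print(enable_a20(machine, [lying_bios_call, fast_gate]))
```

In this sketch the reported result of each method is ignored entirely and only the memory test decides whether to move on. A loader that trusted the first method's claim of success would stop there with A20 still disabled, which is the failure described above.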

Thankfully, not a difficult fix once the problem was identified. But seriously, people. How hard can it be not to screw this up?

(For an excruciatingly detailed analysis of how hard it can be not to screw this up, see here)

[1] The Intel Macs don't implement the int 15 approach, but return a failure. They also don't have a legacy keyboard controller, so attempting to hit that resulted in grub falling over. The magic IO port approach works. Another example of how the Intel Macs aren't really PCs...

[2] grub2 implements the more paranoid check




(20 comments)


[info]simont
2009-11-13 03:03 pm UTC (link)
But seriously, people. How hard can it be not to screw this up?

Given the constant use of the phrase "A20gate" in this post, I wonder that you managed to avoid sneaking in the word "scandalous" at this point...



[info]martling
2009-11-13 05:30 pm UTC (link)
Jeez. The sheer number of ways that one quick hack has bitten people in the ass in the last 27 years is amazing.

Not least the fact it gave a backdoor into the Xbox.



[info]lionsphil
2009-11-13 06:10 pm UTC (link)
That is somewhere between hilarious and absolutely tragic.

(You know, when you have to reach for AltGr, captchas have gone too far:
wedges 430(half)(copyright or at symbol)
)



[info]martling
2009-11-13 06:50 pm UTC (link)
That is somewhere between hilarious and absolutely tragic.

I was in there when that hack was first publicly revealed, and the consensus was certainly with the former option. They didn't even have to explain, they just stuck "A20#" in huge letters on a slide and everyone immediately burst out laughing. Good audience, that.



[info]lionsphil
2009-11-13 06:13 pm UTC (link)
The Intel Macs don't implement the int 15 approach, but return a failure. They also don't have a legacy keyboard controller, so attempting to hit that resulted in grub falling over. The magic IO port approach works.


Oh Gods...Macs picked up this brain-damage in the switch to x86, even though they have absolutely no legacy code that expects wraparound?



[info]mjg59
2009-11-13 06:16 pm UTC (link)
I suspect that they don't have this behaviour - the problem is that grub didn't check whether or not A20 behaved itself, and so went off to hit a keyboard controller that wasn't there anyway.


x86 change-over
(Anonymous)
2009-11-13 07:11 pm UTC (link)
The legacy here isn't software, it's the x86 architecture itself. In fact, the entire reason grub fails is that, since the Macs don't have legacy code, they didn't have to have the hacky work-around.

The core issue here is that the x86 family is by design 100% (in theory) backwards compatible. Every buggy behavior, every obsolete command gets carried forward in case there's code that calls it.

As a result, working that close to the processor, especially at boot time, gets rather sticky. Actually, this is a good explanation of the magic and hackery that goes into it: http://duartes.org/gustavo/blog/post/how-computers-boot-up/


Re: x86 change-over
[info]lionsphil
2009-11-13 08:57 pm UTC (link)
You appear to be aggressively agreeing with me. That Macs may have adopted hardware containing default-active workarounds for legacy software that doesn't exist in the Mac world was my point.



[info]thaytan
2009-11-14 03:40 am UTC (link)
When you have to dive that deep into the computer's brain, you better be wearing a Tron suit. Regardless of reality, that's the way *I'm* picturing it.


GRUB *Legacy*
(Anonymous)
2009-11-14 11:09 am UTC (link)
Alright, but doesn't all this only apply to *Legacy* GRUB code?

(i.e. something that was released 4 years ago which we consider completely obsolete now)

--
Robert Millan


Re: GRUB *Legacy*
[info]lionsphil
2009-11-14 12:26 pm UTC (link)
I love how open source can declare 1.0 series (oh, wait, 0.9x) unsupported "legacy" before 2.0 (no, hang on, 1.9x) is ready.


Re: GRUB *Legacy*
[info]mjg59
2009-11-14 03:09 pm UTC (link)
See footnote 2.


Tried MS-DOS 1.0?
(Anonymous)
2009-11-14 11:49 am UTC (link)
I do slightly wonder if 1982-era software really runs on modern PCs. Has anyone ever tried booting MS-DOS 1.0 on one?



[info]rasmatan
2009-11-14 09:47 pm UTC (link)
This article terrifies me. Standard PC architecture terrifies me. What the HECK.

And tragically, any attempt to replace it with a sane new architecture will never take off...



[info]rasmatan
2009-11-14 10:03 pm UTC (link)
Also, in this era of virtual machines, we should be able to ditch all this scary stuff and run legacy apps in their own happy little imaginary world. My university eventually realized that the amount of electricity our Vax was sucking vastly outweighed the costs of buying the emulator and running on a $300 Windows box. (And we're still using a Vax because a completely custom database was developed for us in the 80s and replacing it would entail either spending a bazillion dollars on programmers or retraining all our employees and destroying our workflow they've been comfortable with for 20+ years. You can tell it's a nice place to work, because many secretaries here really have been working here that long...)

Why do I keep triggering the captcha? No URLs or anything!



[info]mjg59
2009-11-14 10:05 pm UTC (link)
Because otherwise I get ridiculous quantities of spam from bots that are registered users.


Love it
(Anonymous)
2009-11-18 01:48 pm UTC (link)
Whether it's "Old New Thing"-style entries, like this one, or just random esoterica, your blog is in my top five favorites of Planet Gnome.

If you ever speak at a conference in Paris, I'm there in a heartbeat.

- John



[info]iabervon
2009-11-18 11:46 pm UTC (link)
I understand why hardware continues to support methods of turning A20 on, but I don't understand why it doesn't boot with it on and not support any way of turning it off (that is, simply treat any attempt the software makes to use any of the common methods for turning it on or off as a no-op). It's not like you could drive any of your new hardware without an OS new enough to support virtual memory, and if your OS supports virtual memory, you can emulate having A20 disabled for your ancient software.



[info]teferi
2009-11-20 03:12 am UTC (link)
Someone, somewhere, has a crucial line-of-business application that depends on real mode, warts and all, working as it did on the 8086, and they won't buy new hardware if it breaks that.



[info]iabervon
2009-11-20 03:44 am UTC (link)
I'm fairly certain that said application wouldn't be able to run directly on the hardware for some other random reason by now, if for no other reason than that it's got to be on a 5.25" floppy or a pre-IDE hard drive, and the new hardware doesn't have controllers for any such devices (and, for that matter, its timing will be incredibly wrong, due to not being on a fixed-instruction-speed processor). And if it doesn't actually have to be bare metal, it wouldn't be able to tell it's not in real mode because the 8086 didn't have anything else to check for.

Now, I don't doubt that there are programs out there that depend on the system being indistinguishable from an 8086, including A20 not existing, but that's easy enough to provide when you have the cycles to take a page fault on every memory access without the program noticing, and when you have to be doing some sort of virtualization anyway.


