junto 5 hours ago

This reminds me of Adrian Thompson’s (University of Sussex) 1996 paper, “An evolved circuit, intrinsic in silicon, entwined with physics,” ICES 1996 / LNCS 1259 (published 1997), which was extended in his later thesis, “Hardware Evolution: Automatic Design of Electronic Circuits in Reconfigurable Hardware by Artificial Evolution” (Springer, 1998).

Before Thompson’s experiment, many researchers tried to evolve circuit behaviors on simulators. The problem was that simulated components are idealized, i.e. they ignore noise, parasitics, temperature drift, leakage paths, cross-talk, etc. Evolved circuits would therefore fail in the real world because the simulation behaved too cleanly.

Thompson instead let evolution operate on a real FPGA device itself, so evolution could take advantage of real-world physics. This was called “intrinsic evolution” (i.e., evolution in the real substrate).

The task was to evolve a circuit that could distinguish between a 1 kHz and a 10 kHz square-wave input, outputting high for one and low for the other.
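
Conceptually the loop itself is simple; all the magic comes from scoring each candidate on the physical chip. A rough Python sketch, not Thompson’s code (the population of 50 and the roughly 1,800-bit genotype are as I recall from the paper, and program_fpga / score_tone_discrimination are stand-ins for the real hardware interface):

  import random

  POP, BITS, MUT = 50, 1800, 0.01   # ~1,800-bit genotype, population of 50

  def program_fpga(bits):           # stand-in: load the config onto the real chip
      raise NotImplementedError("hardware interface")

  def score_tone_discrimination():  # stand-in: drive 1 kHz / 10 kHz bursts and
      raise NotImplementedError     # measure how well the output separates them

  def evaluate(bits):
      program_fpga(bits)                  # fitness is measured on real silicon,
      return score_tone_discrimination() # not in a simulator

  def mutate(bits):
      return [b ^ (random.random() < MUT) for b in bits]

  def crossover(a, b):
      cut = random.randrange(len(a))
      return a[:cut] + b[cut:]

  pop = [[random.randint(0, 1) for _ in range(BITS)] for _ in range(POP)]
  for gen in range(5000):
      ranked = sorted(pop, key=evaluate, reverse=True)
      elite = ranked[:POP // 5]     # survivors of this generation
      pop = elite + [mutate(crossover(*random.sample(elite, 2)))
                     for _ in range(POP - len(elite))]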

The final evolved solution:

- Used fewer than 40 logic cells

- Had no recognisable structure, no pattern resembling filters or counters

- Worked only on that exact FPGA and that exact silicon patch.

Most astonishingly:

The circuit depended critically on five logic elements that were not logically connected to the main path.

Removing them should not have affected a digital design, since they were not wired to the output, but in practice the circuit stopped functioning when they were removed.

Thompson determined via experiments that evolution had exploited:

- Parasitic capacitive coupling

- Propagation delay differences

- Analogue behaviours of the silicon substrate

- Electromagnetic interference from neighbouring cells

In short: the evolved solution used the FPGA as an analog medium, even though engineers normally treat it as a clean digital one.

Evolution had tuned the circuit to the physical quirks of the specific chip. It demonstrated that hardware evolution could produce solutions that humans would never invent.

  • karolinepauls 2 hours ago

    I wonder what would happen if someone evolved a circuit on a large number of FPGAs from different batches. Each of the FPGAs would receive the same input in each iteration, but the output function would be biased to expose the worst-behaving units (maybe the bias should be raised in later iterations, when most units behave well).
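
    A sketch of what that biased fitness might look like (score_on_device is a made-up stand-in for programming one FPGA and measuring it, and w is the worst-case weight you'd ramp up over the generations):

      def population_fitness(candidate, fpgas, w):
          # Score one bitstream on every physical device. With w = 0 this
          # is a plain average; as w -> 1 the single worst-behaving unit
          # dominates, pushing evolution away from chip-specific tricks.
          scores = [score_on_device(candidate, dev) for dev in fpgas]
          return (1 - w) * sum(scores) / len(scores) + w * min(scores)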

    • mmastrac 11 minutes ago

      Either it would generate a more robust (and likely more recognizable) solution, or it would fail to converge, really.

      You may need to train on a smaller number of FPGAs and gradually increase the set. Genetic algorithms have been finicky to get right, and you might find that more devices would massively increase the iteration count.

  • rcxdude 5 hours ago

    Though the unreplicable nature of it certainly limited its usefulness. I'd also suspect it would be quite sensitive to temperature.

    • junto 4 hours ago

      I’d argue that this was a limitation of the GA fitness function, not of the concept.

      Now that we have vastly faster compute, open FPGA bitstream access, on-chip monitoring, cheap and dense temperature/voltage sensing, and reinforcement learning + evolution hybrids, it becomes possible to select explicitly for robustness and generality, not just for functional correctness.

      The fact that human engineers could not understand how this worked in 1996 made researchers incredibly uncomfortable, and the same remains true today, but now we have vastly better tooling than back then.

      • tremon 4 hours ago

        I don't think that's true; for me it is the concept that's wrong. The second-order effects you mention:

          - Parasitic capacitive coupling
          - Propagation delay differences
          - Analogue behaviours of the silicon substrate
        
        ...are not just influenced by the chip design, they're influenced by substrate purity and doping uniformity -- exactly the parts of the production process that we don't control. Or rather: we shrink the technology node to right at the edge where these uncontrolled factors become too big to ignore. You can't design a circuit based on the uncontrolled properties of your production process and still expect to produce large volumes of working circuits.

        Yes, we have better tooling today. If you use today's 14A machinery to produce a 1µ chip like the 80386, you will get amazingly high yields, and it will probably be accurate enough that even these analog circuits are reproducible. But the analog effects become more unpredictable as the node size decreases, and so will the variance in your analog circuits.

        Also, contrary to what you said: the GA fitness process does not design for robustness and generality. It designs for the specific chip you're measuring, and you're measuring post-production. The fact that it works for reprogrammable FPGAs does not mean it translates well to mass production of integrated circuits. The reason we use digital circuitry instead of analog is not because we don't understand analog: it's because digital designs are much less sensitive to production variance.

        • junto 3 hours ago

          Possibly, but maybe the real difference is the subtlety between a planned deterministic (logical) result versus a deterministic (black box) outcome?

          We’re seeing this shift already in software testing around GenAI. Trying to write a test around non-deterministic outcomes comes with its own set of challenges, so we need to plan for deterministic variances, which seems like an oxymoron but is not in this context.

    • paulgerhardt 4 hours ago

      That unreplicability between chips is actually a very, very desirable property when fingerprinting chips (sometimes known as ChipDNA) to implement unique keys for each chip. You use precisely this property (plus a lot of magic to control for temperature as you point out) to give each chip its own physically unclonable key. This has wonderfully interesting properties.
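
      A classic construction in this family is the ring-oscillator PUF: lay out pairs of nominally identical oscillators and let process variation decide which of each pair runs faster. A toy sketch (measure_ring_osc_hz is a stand-in for the on-chip frequency counters):

        def derive_key_bits(pairs):
            # Each pair is two identically laid-out ring oscillators.
            # Which one runs faster is set by uncontrolled manufacturing
            # variation, so the resulting bit pattern is unique per chip
            # and hard to clone.
            return [int(measure_ring_osc_hz(a) > measure_ring_osc_hz(b))
                    for a, b in pairs]

      Real designs then add error correction (helper data), since temperature and voltage shifts can flip marginal bits.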

      • rowanG077 an hour ago

        The technical term is usually "Physical unclonable function".

hyperman1 14 hours ago

There are 2 interesting articles here. Not only does Ken treat us to a great text, but hidden in footnote 1 is a second gem. Thanks for the early Christmas gift!

  • tremon 3 hours ago

    > we found that the engineers were automating things by writing their own scripts where in earlier days you might have to go to ask a CAD person to come and do something for you -- and that’s difficult to do. Much easier if the engineers can do it themselves and I think that all came about because we instituted Unix for the 386 design. Again if management knew what we were doing they wouldn’t have let us do it.

    > He walked across the street from Santa Clara 4 to Amdahl and they had a Unix that ran on 370 computers. So he went over there and got a tape and brought it back, sent it over to Phoenix where the mainframes were and told 'em to load it. They did, not knowing what was on that tape because they never would have done it if they had known

    It's wild to read that Intel's flagship product, the part that basically defined the next 40 years of computing, might have turned out very differently if management and/or IT knew what the engineers were doing.

    Everything old is new again, I guess.

    • kens 2 hours ago

      Another interesting thing is that the Unix guru on the 386 project was Pat Gelsinger, who later became Intel's CEO. Gelsinger also converted at least one member of the 386 team to Christianity.

dcassett 9 hours ago

> However, the 386 uses a different approach—CMOS switches—that avoids a large AND/OR gate.

Standard cell libraries often implement multiplexers using transmission gates (CMOS switches) with inverters to buffer the input and restore the signal drive. This implementation has the advantage of eliminating static hazards (glitches) in the output that can occur with conventional gates.

  • zozbot234 8 hours ago

    Static hazards are most often dealt with by just adding some redundant logic (consensus terms) to the circuit. This can even be done automatically.
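
    A toy unit-delay simulation (my illustration, not from the article) shows both the hazard in a gate-built 2:1 mux and the consensus-term fix:

      # f = (A AND S) OR (B AND NOT S). With A = B = 1, f should stay 1
      # while S falls, but the inverter's extra delay opens a one-step
      # window where both AND terms are 0: a static-1 hazard.
      def delay(sig, d=1):
          return [0] * d + sig[:-d]

      S = [1, 1, 1, 0, 0, 0, 0]                    # select falls at t=3
      A = [1] * 7
      B = [1] * 7
      nS = delay([1 - s for s in S])               # NOT S (one gate delay)
      t1 = delay([a & s for a, s in zip(A, S)])    # A AND S
      t2 = delay([b & n for b, n in zip(B, nS)])   # B AND (NOT S)
      print(delay([x | y for x, y in zip(t1, t2)]))            # glitch: a 0 mid-run

      t3 = delay([a & b for a, b in zip(A, B)])    # redundant consensus term A AND B
      print(delay([x | y | z for x, y, z in zip(t1, t2, t3)])) # stays 1 after startup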

skissane 16 hours ago

> Regenerating the cell layout was very costly, taking many hours on an IBM mainframe computer.

I would love to know more about this – how much info is publicly available on how Intel used mainframes to design the 386? Did they develop their own software, or use something off-the-shelf? And I'm somewhat surprised they used IBM mainframes, instead of something like a VAX.

  • kens 16 hours ago

    Various papers describe the software, although they are hard to find. My earlier blog post goes into some detail: https://www.righto.com/2024/01/intel-386-standard-cells.html

    The 386 used a placement program called Timberwolf, developed by a Berkeley grad student, and a proprietary routing tool.

    Also see "Intel 386 Microprocessor Design and Development Oral History Panel" page 13. https://archive.computerhistory.org/resources/text/Oral_Hist...

    "80386 Tapeout: Giving Birth to an Elephant" by Pat Gelsinger, Intel Technology Journal, Fall 1985, discusses how they used an Applicon system for layout and an IBM 3081 running UTS unix for chip assembly, faster than the VAX they used earlier. Timberwolf also ran on the 3081.

    "Design And Test of the 80386" (https://doi.org/10.1109/MDT.1987.295165) describes some of the custom software they used, including a proprietary RTL simulator called Microsim, the Mossim switch-level simulator, and the Espresso PLA minimizer.

  • retrac 15 hours ago

    VAXes were relatively small computers for the time. They grew upward in the late 80s, eventually rivalling the mainframes for speed (and cost). But in the early 80s, IBM's high-end machines were an entire order of magnitude larger.

    Top of the line VAX in 1984 was the 8600 with a 12.5 MHz internal clock, doing about 2 million instructions per second.

    IBM 3084 from 1984 - quad SMP (four processors) at 38 MHz internal clock, about 7 million instructions per second, per processor.

    Though the VAX was about $50K and the mainframe about $3 million.

  • themafia 16 hours ago

    There's not a lot of "off the shelf" in terms of mainframes. You're usually buying some type of contract. In that case I would expect a lot of direct support for customer-created modules that took an existing software library and turned it into the specific application they required.

  • f1shy 15 hours ago

    > Did they develop their own software

    Knowing Intel SW, and based on the fact that it was successful, I really doubt it.

userbinator 16 hours ago

But in the end, the 386 finished ahead of schedule, an almost unheard-of accomplishment.

Does that schedule include all the revisions they did too? The first few were almost uselessly buggy:

https://www.pcjs.org/documents/manuals/intel/80386/

  • kens 15 hours ago

    According to "Design and Test of the 80386", the processor was completed ahead of its 50-man-year schedule from architecture to first production units, and set an Intel record for tapeout to mask fabricator.

  • adrian_b 12 hours ago

    Except for the first stepping, A0, whose list of bugs is unknown (it also implemented a few extra instructions that were dropped in the next revisions instead of having their bugs fixed), the other steppings have errata lists that are not significantly worse than those of most recent Intel or AMD CPUs, which also have long lists of bugs, with workarounds available in most cases at the hardware or operating-system level.

wolfi1 16 hours ago

if I remember correctly the 386 didn't have branch prediction, so as a thought experiment: how would a 386 with today's design sizes (~9nm) fare against other chips?

  • Earw0rm 13 hours ago

    It would lose by a country mile. A 386 can handle about one instruction every three or four clocks; a modern desktop core can do as many as four or five ops PER clock.

    It's not just the lack of branch prediction, but the primitive pipeline, no register renaming, and of course it's integer only.

    A Pentium Pro with modern design size would at least be on the same playing field as today's cores. Slower by far, but recognisably doing the same job - you could see traces of the P6 design in modern Intel CPUs until quite recently, in the same way as the Super Hornet has traces of predecessors going back to the 1950s F-5. The CPUs in most battery chargers and earbuds would run rings around a 386.

    • anthk 12 hours ago

      A 386 was a beast against a 286, a 16-bit CPU. It was the minimum to run Linux with 4MB of RAM, but a 486 with an FPU destroyed it, and not just in FP performance.

      Bear in mind that with a 386 you can barely decode an MP2 file, while with a 486 DX you can play most MP3 files at least in mono audio, and maybe run Quake at the lowest settings if you own a 100 MHz one. A 166 MHz Pentium can at least multitask a little while playing your favourite songs.

      Also, under Linux, a 386 would manage itself relatively well with just terminal and SVGAlib tools (now framebuffer) and 8MB of RAM. With a 486 and 16MB of RAM, you can run X at sane speeds, even FVWM in wireframe mode to avoid window repaintings upon moving/resizing them.

      Next, TLS/SSL. With a 486 DX you can use dropbear/bearssl and even Dillo happily, with just a light lag upon handshaking, good enough for TLS 1.2. On a 486, a 30-35? year old CPU: IRC over TLS, SSH with RSA256 and similar methods, web browsing/Gemini under Dillo with TLS. Doable. I did it under a VM and it worked, even email and NNTP over TLS with a LibreSSL fork against BearSSL.

      With a 386, in order to keep your sanity, you can have plain HTTP, IRC and Gopher and plain email/Usenet. No MP3 audio, whereas with a 486 you could at least read news over Gopher (even today) while multitasking, if you forced yourself to a terminal environment (not as hard as it sounds).

      If you emulate some old i440FX based PC under Qemu, switching between the 386 and 486 with the -cpu flag gives clear results. Just set one up with the Cirrus VGA and 16MB and you'll understand upon firing up X.

      This is a great old distro to test how well 386's and 486's behaved:

      https://delicate-linux.net/

      • Earw0rm 8 hours ago

        Yep, we had a few later-generation 486s in college. They would run Windows NT4 with full GUI - not especially well, but they'd run it. And they'd do SSL stuff adequately for the time.

        ISTR the cheap "Pentium clones" at the time - Cyrix, early AMDs before the K5/K6 and Athlon - were basically souped-up 486 designs.

        (As an aside - it's very noticeable how much innovation happened between a single generation of CPU architectures at that time, compared to today. Even if some of them were buggy or had performance regressions. 5x86 to K5 was a complete redesign, and the same again between K6 and K7).

      • accrual 2 hours ago

        I did some multitasking recently on my iDX4-100 + 64MB FPM. I used NT4 with SP2 because the full SP6 was much slower. I could have a browser open, PuTTY, and some tracker music playing no problem. :)

      • rwmj 9 hours ago

        I ran X and emacs and gcc on a 386DX with 5MB of RAM circa 1993, and while not pleasant it was workable. The upgrade to 16MB (that cost me £600!) made a big difference.

        • masfuerte 8 hours ago

          Ten years before that I saved up for ages and spent £25 on 16KB of RAM. I could have bought a house for the cost of 16MB. It's amazing how quickly it changed.

          • Earw0rm 5 hours ago

            Both the RAM (for the better) and the house (for the worse).

          • rwmj 7 hours ago

            ZX81 rampack, right?

            • masfuerte 5 hours ago

              Nearly, it was actually for a BBC Micro.

              • rwmj 21 minutes ago

                We can't be friends!

      • iberator 11 hours ago

        You could run Linux with 2MB of RAM with kernels before 1994 AFAIK, and with the a.out binary format instead of ELF.

        Nowadays I think it's still doable in theory, but the Linux kernel has some kind of hard-coded limit of 4MB (something to do with memory paging size).

        • ptspts 2 hours ago

          Why is ELF so much slower and/or more memory hungry than a.out on Linux?

        • anthk 9 hours ago

          Yep, but badly. Read the 4MB Laptop HOWTO. Nowadays, if I had a Pentium/K5 laptop I'd just fit a 64 MB SIMM in it and keep everything TTY/framebuffer with NetBSD and most of the unneeded daemons disabled. For a 486: Delicate Linux, plus a custom build queue for bearssl, with libressl on top (there's a fork out there), plus a BearSSL-linked lynx, mutt, slrn, mpg123, libtls and hurl.

      • rasz 3 hours ago

        >A 386 was a beast against a 286

        The 386, both SX and DX, ran 16-bit code at about the same clock-for-clock speed as the 286. The 286 topped out at 25MHz, the Intel 386 at 33MHz. Now add the fact that early Intel chips had broken 32-bit operation and it's not so beastly after all :)

        In one of the Computer History Museum videos, someone from Intel mentioned they managed to cost-reduce the 386SX so hard it cost Intel $5 out the door; the rest of the initial 1988 $219 price was a pure money printer. Only in 1992 did Intel finally calm down, with the i386SX-25 going from $184 in Q1 1990 to $59 in Q4 1992 after losing the AMD Am386 lawsuit, and only to screw with AMD by relegating its Am386DX-40, a $231 flagship in Q2 1991, to a $51 bottom feeder by Q1 1993.

    • immibis 7 hours ago

      Presumably it would be much smaller. A similar but different thought experiment would be to fill a 14th-gen-sized die with 386s running in parallel.

      • atq2119 6 hours ago

        If you continue that thought experiment, you'd very quickly run into the issue that the way the 386 interfaces with memory is hopelessly primitive and not a good match for running thousands of cores in parallel.

        A large reason why out of order speculative execution is needed for performance is to deal with the memory latencies that appear in such a system.

      • toast0 6 hours ago

        That gets you close to Larrabee/Xeon Phi, although that was Pentium-based (with amd64 and a vector engine added), and later products were Atom-derived.

  • tliltocatl 8 hours ago

    Modern CPUs are more or less built around the memory hierarchy, so it would be really hard to compare the two: a 386 in a modern process might be able to run at the same clock speed or even faster, but with only a few kB of memory available. As soon as you connect a large memory, it will spend most of its time idling (and then of course there is the problem of power dissipation density).

    • adrian_b 6 hours ago

      While there were also cheap motherboards with an 80386SX and no cache memory, most motherboards for the 80386DX had a write-through cache memory, typically of either 32 kB or 64 kB.

      By the time of the 80486, motherboard cache sizes had increased to the range of 128 to 256 kB, while the 80486 also had an internal cache of 8 kB (much later increased to 16 kB in the 80486DX4, at a time when the Pentium already existed).

      So except for the lower-end motherboards, a memory hierarchy already existed in 80386-based computers, because the DRAM was already not fast enough.

dcassett 9 hours ago

> (Note 4) But to write a value into the latch, the switch is enabled and its output overpowers the weak inverter.

This implementation is sometimes called a "jam latch" (the new value is "jammed" into the inverter loop).

z3ratul163071 16 hours ago

amazing and very informative work. thank you!

burnt-resistor 16 hours ago

I'm curious to know which model, speed, voltage, stepping, and package marking the evaluated sample(s) had, because there isn't just one 386. An i386DX, I assume, but it doesn't specify whether it was one with the buggy 32-bit multiply, or a "ΣΣ", or newer.

"Showing one's work" would need details that are verifiable and reproducible.

  • kens 15 hours ago

    I've looked at a bunch of 386 dies; see https://www.righto.com/2023/10/intel-386-die-versions.html for details. I typically use an earlier 1.5µm chip since it's easier to study under the microscope than a 1µm chip, and I use "ΣΣ" chips because they are more obtainable. Typical steppings are S40362 or S40344, whatever is cheapest on eBay.

ermaa 11 hours ago

Great work and pleasant reading!