> Microcode can be implemented in a variety of ways. Many computers use "vertica...

monocasa · on Jan 25, 2022

From what little we know of recent designs (the best public documentation being the fantastic work to reverse engineer AMD K8 and K10 microcode here https://github.com/RUB-SysSec/Microcode ), I'd describe x86 microcode as particularly wide vertical microcode, with 64-bit ops in the case of K8/K10.

The bit width is more of a heuristic. With horizontal microcode you can look at each group of bits and it's clear: 'these three bits are the selection input to this mux', 'this bit is an enable for the buffer linking these two buses', etc. Vertical microcode, in contrast, is decoded further, with bit fields taking on different meanings depending on opcode-style fields. RISC was in a lot of ways the realization: 'hey, with this new arch we can assume there's an i-cache, so why have microcode at all? Instead, load what would have been vertical microcode from RAM dynamically and execute it directly.'
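To make the horizontal/vertical distinction concrete, here is a toy sketch. Both encodings and all field names are hypothetical and not taken from any real CPU. In the horizontal word every field drives a control line directly; in the vertical micro-op the low 12 bits mean different things depending on the 4-bit opcode.

```python
# Hypothetical 8-bit horizontal control word (illustration only):
#   bits 0-2: ALU input mux select
#   bit 3:    enable for the buffer linking bus A to bus B
#   bit 4:    register-file write enable
#   bits 5-7: ALU function select
def decode_horizontal(word: int) -> dict:
    """Every field maps straight onto a control line; no opcode is consulted."""
    return {
        "alu_mux_sel": word & 0b111,
        "bus_a_to_b_en": (word >> 3) & 1,
        "regfile_we": (word >> 4) & 1,
        "alu_func": (word >> 5) & 0b111,
    }

# Hypothetical 16-bit vertical micro-op: the top 4 bits are an opcode,
# and the meaning of the low 12 bits depends on that opcode.
OP_ALU = 0x1
OP_JUMP = 0x2

def decode_vertical(uop: int) -> dict:
    opcode = (uop >> 12) & 0xF
    payload = uop & 0xFFF
    if opcode == OP_ALU:
        # The 12 payload bits split as dst(4) | src(4) | func(4).
        return {"kind": "alu", "dst": payload >> 8,
                "src": (payload >> 4) & 0xF, "func": payload & 0xF}
    if opcode == OP_JUMP:
        # The same 12 bits are now a single microcode address.
        return {"kind": "jump", "target": payload}
    raise ValueError(f"unknown micro-op opcode {opcode:#x}")
```

For example, `decode_vertical(0x1A3F)` yields an ALU op with `dst=10, src=3, func=15`, while `decode_vertical(0x2A3F)` reads the very same low bits as jump target `0xA3F`. That second level of decoding is what makes it "vertical".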

Pretty universally, OoO superscalar cores use vertical microcode (or vertical-microcode-looking micro-ops, even if they don't originate from microcode), because that's the abstraction you want at the most expensive part of the design: tracking in-flight and undispatched operations in the reorder buffer, and routing results through the bypass network. Any additional width there really starts to hit your power budget, and it's the wrong level for horizontal microcode because the execution units will make different choices about even how many control signals they want.
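A minimal sketch of the idea that the core tracks compact micro-ops and each execution unit expands them into its own control signals locally. The `Uop` fields, unit names, and signal names are hypothetical, and a plain in-order queue stands in for the scheduler; renaming, out-of-order wakeup, and the bypass network are not modeled.

```python
from collections import deque
from dataclasses import dataclass

@dataclass(frozen=True)
class Uop:
    """Compact vertical-style micro-op: roughly what the scheduler has to track."""
    unit: str          # execution port this uop is steered to
    op: str            # unit-specific operation name
    dst: int           # physical destination register tag
    srcs: tuple = ()   # physical source register tags

# Hypothetical local decoders: each unit expands the same compact encoding
# into however many control signals *it* needs (3 for the ALU, 2 for loads here).
def alu_controls(u: Uop) -> dict:
    funcs = {"add": 0, "sub": 1, "and": 2, "or": 3}
    # Subtraction as a + ~b + 1, so the carry-in is set for "sub".
    return {"func_sel": funcs[u.op], "carry_in": int(u.op == "sub"), "write_en": 1}

def load_controls(u: Uop) -> dict:
    # This toy load unit needs no per-uop fields beyond being selected.
    return {"mem_read": 1, "write_en": 1}

DECODERS = {"alu": alu_controls, "load": load_controls}

def dispatch(queue: deque) -> list:
    """Drain the queue in order; only at the unit does a uop become control signals."""
    issued = []
    while queue:
        u = queue.popleft()
        issued.append((u.unit, DECODERS[u.unit](u)))
    return issued
```

The design point is that the wide, unit-specific signal sets never sit in the tracking structures; only the narrow `Uop` does.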

colejohnson66 · on Jan 25, 2022

They're wider, but that's just because one "word" holds the whole instruction, instead of multiple bytes. In fact, reverse engineering efforts[0] (and the "RISC86" patent[1]) make clear that they're actually "vertical". Intel Goldmont (from [0]) has entries that are 176 bits each, but each one is actually three (distinct) 48-bit uops and a 30-bit "sequence word".

Horizontal microcode is much simpler for in-order processors, but from my understanding of this stuff, it wouldn't work well with today's superscalar processors. Gating the hundreds of control lines seems (to me) like more effort than gating a few dozen bits of a uop.

[0]: https://github.com/chip-red-pill/uCodeDisasm

[1]: https://patents.google.com/patent/US5926642A

rbanffy · on Jan 25, 2022

In college we were tasked with designing a CPU. Mine was stack oriented (it started out register based and kept the registers, but most ops worked on the stack) and used a very large microcode word, one per clock cycle of the instruction being executed. In the end, I was saving bits from the control word and using "ready" signals between the blocks so that the microcode didn't need to drive everything. In theory, it could do more than one thing in a clock cycle if the stars aligned just right and there were no dependencies. In the end no instruction used the feature, because the deadline was too close.

Wish I had the time to implement it. OTOH, I'm glad I never had to debug all the analog glitches and timing bugs that design would certainly have shown when it collided with reality.
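For readers who haven't built one, here is a toy model of the kind of design described above: a stack machine where each macro-instruction is a sequence of wide control words, one per clock cycle. This is a hypothetical sketch, not rbanffy's actual design; the field names are made up, and the "ready" handshaking between blocks is not modeled.

```python
# Hypothetical control-word fields; every field is asserted (or not) each cycle.
FIELDS = ("pop_to_a", "pop_to_b", "alu_add", "push_result", "push_imm")

def cw(*asserted: str) -> dict:
    """Build one wide control word with only the named fields asserted."""
    unknown = set(asserted) - set(FIELDS)
    if unknown:
        raise ValueError(f"unknown control fields: {sorted(unknown)}")
    return {f: f in asserted for f in FIELDS}

# One control word per clock cycle of each macro-instruction.
MICROCODE = {
    "PUSH": [cw("push_imm")],
    "ADD": [cw("pop_to_a"), cw("pop_to_b"), cw("alu_add", "push_result")],
}

def run(program):
    """Execute (mnemonic, immediate) pairs; return the final stack and cycle count."""
    stack, a, b, alu_out, cycles = [], 0, 0, 0, 0
    for mnemonic, imm in program:
        for word in MICROCODE[mnemonic]:
            cycles += 1
            # Fields are applied in datapath order within the cycle.
            if word["pop_to_a"]:
                a = stack.pop()
            if word["pop_to_b"]:
                b = stack.pop()
            if word["alu_add"]:
                alu_out = a + b
            if word["push_result"]:
                stack.append(alu_out)
            if word["push_imm"]:
                stack.append(imm)
    return stack, cycles
```

Running `run([("PUSH", 2), ("PUSH", 3), ("ADD", None)])` returns `([5], 5)`: one cycle per push and three for the add. Note that the last ADD word asserts two fields at once; wide words make that kind of same-cycle parallelism cheap to express, which is exactly what ends up hard to verify against real timing.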