Nand to Tetris 2

This post is the second part of our Nand to Tetris mental model diffing series. This week we look at machine language, computer architecture, and assemblers.

The format is the same as last week - I’ve written down a collection of assertions and then received answers and comments from people who know a lot more about these things. Several people provided feedback this this time and you’ll find a list of them at the end of this post. Let’s begin!

Assertions

1. Assembly language is just mnemonic sugar over binary codes.

This means that if you read assembly you can straightforwardly, albeit tediously, translate it into binary code. It also means you have “one assembler” per computer architecture.

There are certainly “high level” assembly languages where one mnemonic has multiple possible valid translations, but in general it’s accurate. It depends on what you mean exactly, for example look at AT&T vs Intel x86 assembly - they’re quite similar, and can be translated to each other, but people get really annoyed when they’re used to one and have to deal with the other, even just reading it. (And all of this, in turn, is leaving out what defines a computer architecture - is a system that can be big-Endian or little-Endian one architecture or two?)

2. Screen, Keyboard et al. are usually accessed through a memory map.

Alternatively, I know we can issue high-level instructions to a GPU that yields much better performance, such as “draw a line from here to here”. I don’t know why that is.

This depends on both the hardware and the OS, but I think that this is generally not true for “slow” I/O devices like a keyboard. For example, in x86/DOS, I believe that you get an int 0x09 for each keypress. With a modern version of Windows and an x64/SMP system, it somehow uses ACPI and APIC, and I believe it’s valid to think of that as an abstraction over interrupts. The DOS case is pretty simple, and the link above explains how you can get access to keypresses. Modern interrupt I/O is much more complicated, and for Windows you can read more about it in Russinovich’s Windows Internals book.

Fast I/O devices will just DMA stuff in and out of memory, though. There are multiple reasons why fast I/O devices are faster. The first is that they can just directly talk to memory instead of having to talk to a device, which talks to another device, which signals the CPU to talk to the I/O device to do work. And that signal will cause a context switch, which is pretty expensive.

Another reason is that they can sit on much faster busses. PCIe gen 3 can deliver about 1GB per lane, and it’s not uncommon to see 16+ lane devices. That’s a lot more than you can get out of USB. Compare that to USB3, where you get maybe 10Gb/s, i.e., barely more than 1GB/s, total. And you can do even better than PCIe if you’re on the same bus that the processor is on, and that’s one reason that AMD and IBM expose that stuff. I don’t know if you can get a QPI license from Intel, but maybe?

Why is the GPU so much faster?

Using a memory map adds memory copy latency and saturates the memory bandwidth between CPU and GPU. For example, consider pressing return to create a new line at the bottom of a page – you need to scroll everything on the view up by a line height, then add your new blank line. Doing this pixel by pixel is slower than issuing a GPU command that says “copy a rectangle from [(x=0, y=100), (x=width, y=height)] to [(x=0, y=0), (x=width, y=height-100)], then fill in a background color area from height-100 to height.”

In short: The GPU is so much faster because there’s no throughput/latency bottleneck in copying memory across the motherboard. GPUs are also optimized for parallel processing (such as performing an operation on a matrix of pixels), and CPUs aren’t.

3. GOTOs and jumps are the only branching statements in Assembly.

This depends on the assembly language and architecture. For example, x86 has a loop instruction. This is not just a macro. It’s possible some CPUs will implement loop as a conditional branch, but it’s literally a hardware instruction. For an interesting example, look up x86 string instructions. They were designed to be fast, then were a slow option that stuck around for backwards compatibility, etc.

What about Dijkstra’s “GOTO considered harmful?”

Dijkstra wanted to promote another, more structured, way of dealing with software: procedures with one entry and one exit point. Taken to its extreme, this leads to hard-to-analyze programs, full of extra control flags for the one exit point of the function, and much uglier error-handling code - the Linux kernel extensively uses goto in functions where there have to be multiple memory allocations, any of which may fail, and where the subset that were done need to be cleaned up, for instance. All that said, some level of structured programming is generally much preferable to most uses of GOTO.

4. The von Neumann bottleneck is still a thing.

Our computer architecture illustrates the von Neumann bottleneck, without explicitly talking about it. Here’s a cameo from John Backus which laments the existance of it:

Surely there must be a less primitive way of making big changes in the store than by pushing vast numbers of words back and forth through the von Neumann bottleneck. Not only is this tube a literal bottleneck for the data traffic of a problem, but, more importantly, it is an intellectual bottleneck that has kept us tied to word-at-a-time thinking instead of encouraging us to think in terms of the larger conceptual units of the task at hand. Thus programming is basically planning and detailing the enormous traffic of words through the von Neumann bottleneck, and much of that traffic concerns not significant data itself, but where to find it. (from his Turing Award lecture).

Yes. Higher level languages help conceptually, but not so much in terms of implementation. Having tools to cut the CPU out of dealing with things like blatting data from devices to RAM helps a bit. But fundamentally, CPUs have gotten really fast, memory access is amazingly slow, disk access slower yet, and caches only partly help. This has some numbers that give an idea of the costs.

5. All symbols in an Assembler resolve to some memory address.

No, one might denote a numerical constant, for instance. Symbol resolution is surprisingly deep, and parts happen at runtime - check out LD_PRELOAD, LD_LIBRARY_PATH, and similar under Linux, for instance, where even the library to be loaded, much less addresses within it, can vary every time you launch the program, not just at compile time. Linkers and loaders if quite a fun read if you have a few hours to spare.

Conclusion

All the credit goes to these people, and any inaccuracies are due to this author:

Compared to last week, things got a lot more subtle quickly. It’s not that last week was less complicated, but the answers and diff seemed more straightforward. I suspect this will be increasingly true as we move up the stack, as things change more quickly the higher up we get.

A very low-resolution view of how the assertions fared would go something like this: Ish, No, No, Yes, No. This is good news: it means we are learning :)