This scheduler is probably the most salient feature of Go, but is only indirectly described in the language specification.
Perhaps it is just me, but it seems all this user space rigamarole to map bits of execution onto cores points to an overall architecture “smell”. This should be performed and enabled by the OS.
You can see the seams between the OS and the Go runtime tear a little whenever a library acquires an ownership lock where the thread id is recorded. In Go, computation moves freely between threads, so that lock doesn’t work (at least without special instructions to the runtime to lock that goroutine to a thread).
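For reference, the "special instructions" are presumably `runtime.LockOSThread` and `runtime.UnlockOSThread`. A minimal sketch, assuming a hypothetical helper `runPinned` (not part of any library) that wraps a thread-affine call:

```go
package main

import (
	"fmt"
	"runtime"
)

// runPinned is a hypothetical helper: it runs f on a goroutine that stays
// wired to one OS thread for the whole call, then waits for it to finish.
func runPinned(f func()) {
	done := make(chan struct{})
	go func() {
		defer close(done)
		runtime.LockOSThread()         // goroutine will not migrate to another thread
		defer runtime.UnlockOSThread() // release the thread once f returns
		f()                            // thread-affine library calls would go here
	}()
	<-done
}

func main() {
	ran := false
	runPinned(func() { ran = true })
	fmt.Println("ran on a pinned thread:", ran)
}
```

While the goroutine is locked, a lock that records the thread id sees the same thread for acquire and release, at the cost of tying up that thread.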
The whole POSIX threading model seems broken in this context.
User-space threading is not broken. Windows even directly provides support for user-space scheduled threads[0]. The whole model isn't broken; rather, it's liberating. Once the application programmer gets rid of the idea that threads are expensive and starts creating thousands of them willy-nilly, these applications often benefit from a much simpler architecture and fewer bugs. All these complexities are pushed into the user-space scheduler. It's worth it.
Unfortunately the work seems to have stalled out and never made it into the kernel. If that work actually makes it into the Linux kernel, then other languages like C++ and Rust that have more stringent runtime requirements could make use of lightweight threading as well.
What’s the advantage of lightweight threading features? I thought your position (based on comments elsewhere) was that kernel threads are roughly as fast as it gets? What am I misunderstanding?
POSIX threading is not broken; the Go scheduler just does a bunch of goofy things that aren't really supported. Moving stacks between threads breaks all kinds of things. A more idiomatic approach would be for the compiler to emit properly resumable functions, like most async/await implementations do.
LLVM IR has async/await and coroutines, but most real-world VMs and language static compilers cannot depend on such intrinsics because of their memory and execution models. For example, Pony's ORCA has unique memory barrier and execution models that wouldn't work with this approach, although it uses LLVM for compilation down to metal. This is why LLVM is a loose framework and collection of tools split into "middleware" passes, rather than a single monolith.
PS: According to its paper, ORCA is supposedly one of the fastest GCs for most use-cases. It beat Azul's C4, Erlang BEAM and another one in a deathmatch. It's too bad it can't be extracted as a separate project or integrated into OpenJDK or LLVM without lots of work. Of course, no GC is better (I'm staring at you, Rust. :).
There are two main ways to do async/concurrency where you release the thread to do other work while you are waiting.
1. stackless (async/await), where the operation becomes an inspectable object that you can choose what to do with (awaiting means suspending until it completes), as taken by C#, C++, Python, JS, PHP, Swift and Rust
2. "with stack", where you pretend it's not async; but this means when something else uses the thread you need to get the suspended operation's stuff off the thread; usually by not using the thread's stack at all and having it in the heap and just jumping into and out of these "off-thread" stacks; as used by Go and being looked at for Java (as Project Loom)
> While fibers may have looked like an attractive approach to write scalable concurrent code in the 90s, the experience of using fibers, the advances in operating systems, hardware and compiler technology (stackless coroutines), made them no longer a recommended facility.
Disadvantage of stackless is the extra boilerplate (e.g. async/await everywhere); though it also gives more control as the consumer of the operations (e.g. fanout and wait for many; or continue not waiting for the result at all)
Advantage of the "with stack" approach is it looks the same as non async code as it's all hidden (goroutines aside); which is why Java is no doubt looking at doing it, as there is a large body of code that would otherwise need to be rewritten, so "hiding it" is the easier way to avoid that.
C# had/has teething issues when async/await was introduced, as it kept the original thread-blocking methods and added the async ones alongside them. The two don't mix very well; you really need to go one way or the other when developing.
JavaScript leapt at async/await as it was all async anyway, but callback-based, which makes for horrible code to follow; so it made everything much cleaner.
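To make the "with stack" style concrete, here is a rough Go sketch of the fan-out-and-wait case mentioned above. `fetch` is a made-up blocking operation (the sleep stands in for I/O) and `fetchAll` is a hypothetical helper; the point is only that the body of `fetch` reads like ordinary synchronous code, with no async/await markers:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// fetch is a hypothetical blocking operation; Sleep stands in for I/O.
func fetch(id int) int {
	time.Sleep(10 * time.Millisecond)
	return id * id
}

// fetchAll fans out one goroutine per id and waits for all of them.
func fetchAll(ids []int) []int {
	results := make([]int, len(ids))
	var wg sync.WaitGroup
	for i, id := range ids {
		wg.Add(1)
		go func(i, id int) {
			defer wg.Done()
			results[i] = fetch(id) // each goroutine writes its own slot
		}(i, id)
	}
	wg.Wait()
	return results
}

func main() {
	fmt.Println(fetchAll([]int{1, 2, 3})) // [1 4 9]
}
```

The fan-out control that stackless designs get from holding the operation as an object is recovered here with a `sync.WaitGroup`, so the trade-off is more about where the machinery lives than whether it exists.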
I think to keep Go code directly callable from C, they have to follow the platform's C calling conventions which means the same stack layout. So for cooperative concurrency on a single thread to work, each Goroutine needs its very own stack. On Intel, that means saving stack pointers RSP and RBP (16 bytes) for each. Also, each will need memory allocated for its stack for the stack pointers to point to... another 8-16 bytes (pointer and length).
Not sure where you got this idea. The only significant part of the Go codebase that was inherited from Plan 9 was the C compilers, used to build the original Go compiler that was written (from scratch) in C. I think perhaps a hash table implementation was also brought over from P9. That stuff is all long gone now, though.
The idea that Go's scheduler design is somehow inherited from Plan 9 is ridiculous.
Well, in general, POSIX threads are much more expensive (RAM) than some unit of minimal cooperative concurrency/parallelism, say Erlang "processes." The idea of using a threadpool isn't broken because a user-space "scheduler" decides which tasks to run on which threads. It might also decide how to scale or shrink the threadpool. Ultimately, only one thing can run on a processor at a given time, and that typically means a task structure containing at least two items if executing on an interpreted/p-code VM:
0. next unit of work (pointer/counter; instruction, function pointer, etc.)
1. task-local heap (pointer or structure)
2. operand stack (pointer or structure; for stack-oriented VMs only)
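As a toy illustration of that task structure (not how any real VM lays it out), the three items could be sketched in Go like this. The `task` type, `runRoundRobin`, and `newAdder` are all hypothetical names, and the "scheduler" is a plain single-threaded round-robin loop:

```go
package main

import "fmt"

// task is a hypothetical minimal task record for a toy VM scheduler.
type task struct {
	next  int             // 0. next unit of work (index into steps)
	heap  map[string]int  // 1. task-local heap
	stack []int           // 2. operand stack (stack-oriented VMs only)
	steps []func(t *task) // the task's program, one step per unit of work
}

// runRoundRobin runs one step of each unfinished task in turn until all finish.
func runRoundRobin(tasks []*task) {
	for active := true; active; {
		active = false
		for _, t := range tasks {
			if t.next < len(t.steps) {
				t.steps[t.next](t)
				t.next++
				active = true
			}
		}
	}
}

// newAdder builds a task that pushes a and b, then stores their sum in its heap.
func newAdder(a, b int) *task {
	return &task{
		heap: map[string]int{},
		steps: []func(t *task){
			func(t *task) { t.stack = append(t.stack, a) },
			func(t *task) { t.stack = append(t.stack, b) },
			func(t *task) {
				n := len(t.stack)
				t.heap["sum"] = t.stack[n-1] + t.stack[n-2]
				t.stack = t.stack[:n-2]
			},
		},
	}
}

func main() {
	tasks := []*task{newAdder(1, 2), newAdder(10, 20)}
	runRoundRobin(tasks)
	fmt.Println(tasks[0].heap["sum"], tasks[1].heap["sum"]) // 3 30
}
```

Each task carries only a counter, a small heap, and an operand stack, which is why such units can be far smaller than a thread with its own machine stack.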
Depends on your codebase. Userspace C libraries and programs, including libc, often store surprisingly large buffers on the stack.
For example, try setting 'ulimit -s 128' (128kB stack limit) and see how many C programs crash. Then try, say, 16. Go's default is 8 kB, raised from 4 kB in 1.2: https://golang.org/doc/go1.2#stack_size
Linux's default userspace stack limit is 8 megabytes for a reason — programs really do use it.
Not really, the 8MB limit was added back in '95 from a previous limit of "essentially none"[0] with a justification of
> Limit the stack by to some sane default: root can always increase this limit if needed.. 8MB seems reasonable.
Developers don't generally think about their stack size, especially for single-threaded programs[1] so the defaults need to be a sweet spot of not unnecessarily big (such that you can catch unbounded recursion) but not so small that you'd segfault more than a very small fraction of all programs.
Most of the memory cost of a POSIX thread is in the stack, and you can customize the stack size to be quite small. Small stacks are properly thought of as a property that GC enables, not a property that M:N threading enables.
> Most of the memory cost of a POSIX thread is in the stack, and you can customize the stack size to be quite small.
The problem there is that you need to very carefully size your stack as a mis-sizing will lead to a risky stack overflow. I'm not sure it's necessary either as allocating a "large stack" but using very little of it means most of it is never committed, and thus only costs memory mappings.
Are there any systems where the C stack is growable? Do stack frames get prefixed with an explicit request for some amount of stack memory, leading to the stack possibly being moved before the funcall happens?
We used to have growable C stacks in Rust using that technique. They worked (though were too slow for us). It could have been fixed by using stack copying like Go does.
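For comparison, the Go side of this is easy to observe from user code. A small sketch, assuming default runtime settings on a 64-bit platform: the goroutine starts with a small stack and the hypothetical `depth` function recurses far past it, relying on the runtime to grow the stack as needed.

```go
package main

import "fmt"

// depth recurses n frames deep; the addition after the call means it is
// not a tail call, so each level needs its own stack frame.
func depth(n int) int {
	if n == 0 {
		return 0
	}
	return 1 + depth(n-1)
}

func main() {
	done := make(chan int)
	// The new goroutine starts with a small stack; the runtime grows it
	// as the recursion deepens.
	go func() { done <- depth(1_000_000) }()
	fmt.Println(<-done) // 1000000
}
```

The same recursion on a thread with a fixed small stack would overflow, which is the sizing risk discussed above.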
As I recall the minimum total user + kernel stack size is 10kB in the Linux kernel, so 1M threads is 10GB of space. It should be doable, though you will probably have to bump up kernel limits.
A million threads is an extreme case, though. No system can reliably spawn that many threads that are actually doing something interesting without a very large amount of memory. When you leave the realm of microbenchmarks you have to expect that an unknown quantity of threads will have deep call stacks at any given time, so you really need to give yourself leeway to avoid the risk of OOM.
So a big advantage that Go's M:N goroutine model brings to the table is how cheap they are. Cheap enough that tons of concurrency-related stuff you want to do in application code, like implementing a highly-concurrent algorithm, can be done with goroutines directly, without having to think too hard about mechanical sympathy and e.g. translate logical concurrency to physical threading. Go processes commonly have 1M or even 10M active goroutines at once.
So I don't think it's fair to say POSIX threads are comparable or whatever if they don't have this property.
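A rough sketch of what "using goroutines directly" looks like. The `spawn` helper is hypothetical; it parks every goroutine on a channel so that all `n` are alive at the same moment, then releases them. The count in `main` is deliberately modest to keep memory use small:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// spawn starts n goroutines, keeps them all parked at the same time,
// then releases them and returns how many ran to completion.
func spawn(n int) int64 {
	var wg sync.WaitGroup
	var count int64
	release := make(chan struct{})
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			<-release // block until every goroutine has been created
			atomic.AddInt64(&count, 1)
		}()
	}
	close(release) // wake all parked goroutines at once
	wg.Wait()
	return count
}

func main() {
	fmt.Println(spawn(10_000)) // 10000
}
```

No thread pool or explicit mapping of work to threads appears in the code; the runtime multiplexes the goroutines onto OS threads.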
Go processes do not typically have 1M or 10M active goroutines at once. The initial stack size for goroutines is 2kB, so 10M goroutines would mean 20GB just for stacks, even assuming that the goroutines never grow their stack (which cannot be assumed for anything nontrivial). The 2kB minimum stack size is on the same order of magnitude as the 10kB POSIX thread stack size.
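The arithmetic is simple enough to check directly. This sketch uses decimal units (1 kB = 1000 bytes) to match the round figures quoted in the thread; `stackBytes` is just an illustrative helper:

```go
package main

import "fmt"

// stackBytes returns the minimum memory needed for n stacks of perStack bytes,
// assuming no stack ever grows beyond its initial size.
func stackBytes(n, perStack int64) int64 { return n * perStack }

func main() {
	const kB, GB = 1000, 1000 * 1000 * 1000
	fmt.Println(stackBytes(10_000_000, 2*kB)/GB, "GB") // 10M goroutines at 2kB: 20 GB
	fmt.Println(stackBytes(1_000_000, 10*kB)/GB, "GB") // 1M threads at 10kB: 10 GB
}
```

Both figures are lower bounds, since they assume every stack stays at its minimum size.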
The mismatch occurs because Go implements its own threading model, completely ignoring your operating system's implementation of userspace pthreads. If it then attempts to interact with programs using pthreads without taking special care, yeah, it can violate the pthreads API. Such Go programs are broken.
I'm not sure why this leads you to the conclusion that "the whole POSIX threading model seems broken."