This scheduler is probably the most salient feature of Go, but is only indirectl...

kccqzy · on Dec 8, 2019

User-space threading is not broken. Windows even directly provides support for user-space scheduled threads[0]. The whole model isn't broken; rather, it's liberating. Once the application programmer gets rid of the idea that threads are expensive and starts creating thousands of them willy-nilly, these applications often benefit from a much simpler architecture and fewer bugs. All these complexities are pushed into the user-space scheduler. It's worth it.

[0]: https://docs.microsoft.com/en-us/windows/win32/procthread/us...

pcwalton · on Dec 8, 2019

I agree. Paul Turner at Google did a presentation at LPC in which he presented an alternative model that actually uses OS threads: https://blog.linuxplumbersconf.org/2013/ocw/system/presentat...

Unfortunately the work seems to have stalled out and never made it into the kernel. If that work actually makes it into the Linux kernel, then other languages like C++ and Rust that have more stringent runtime requirements could make use of lightweight threading as well.

weberc2 · on Dec 8, 2019

What’s the advantage of lightweight threading features? I thought your position (based on comments elsewhere) was that kernel threads are roughly as fast as it gets? What am I misunderstanding?

pcwalton · on Dec 8, 2019

I think M:N threads mostly aren't worth the drawbacks right now, but those proposed kernel features would change the calculus significantly.

toolslive · on Dec 8, 2019

Threads are fast, when they are working for you. If not, you need to wait until the work gets scheduled... and thread context switching is slow.

gok · on Dec 8, 2019

POSIX threading is not broken; the Go scheduler just does a bunch of goofy things that aren't really supported. Moving stacks between threads breaks all kinds of things. A more idiomatic approach would be for the compiler to emit properly resumable functions, like most async/await implementations do.

duelingjello · on Dec 8, 2019

There is no one-size-fits-all approach.

LLVM IR has async/await and coroutines, but most real-world VMs and language static compilers cannot depend on such intrinsics because of their memory and execution models. For example, Pony's ORCA has unique memory barrier and execution models that wouldn't work with this approach, although it uses LLVM for compilation down to metal. This is why LLVM is a loose framework and collection of tools split into "middleware" passes, rather than a single monolith.

PS: According to its paper, ORCA is supposedly one of the fastest GCs for most use-cases. It beat Azul's C4, Erlang BEAM and another one in a deathmatch. It's too bad it can't be extracted as a separate project or integrated into OpenJDK or LLVM without lots of work. Of course, no GC is better (I'm staring at you, Rust. :).

http://releases.llvm.org/8.0.0/docs/Coroutines.html

dboreham · on Dec 8, 2019

This is always a problem with green threads aka M:N threading.

mitchty · on Dec 8, 2019

The go runtime moves stacks between threads?

Oof, that’s horrible. Any pointers to the logic behind it? I’m curious about the rationale.

benaadams · on Dec 9, 2019

There are two main ways to do async/concurrency where you release the thread to do other work while you are waiting.

1. stackless (async/await), where the operation becomes an inspectable object that you can choose what to do with (awaiting it suspends until completion), as taken by C#, C++, Python, JS, PHP, Swift and Rust

2. "with stack", where you pretend it's not async; but this means that when something else uses the thread you need to get the suspended operation's stuff off the thread, usually by not using the thread's stack at all, keeping it in the heap and just jumping into and out of these "off-thread" stacks; as used by Go and being looked at for Java (as Project Loom)

Interesting paper on it http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2018/p136...

> While fibers may have looked like an attractive approach to write scalable concurrent code in the 90s, the experience of using fibers, the advances in operating systems, hardware and compiler technology (stackless coroutines), made them no longer a recommended facility.

Disadvantage of stackless is the extra boilerplate (e.g. async/await everywhere); though it also gives more control as the consumer of the operations (e.g. fanout and wait for many; or continue not waiting for the result at all)

Advantage of the "with stack" approach is that it looks the same as non-async code, as it's all hidden (goroutines aside); which is no doubt why Java is looking at it: there is a large body of existing code that would otherwise need to be rewritten, and "hiding it" avoids that.
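As a rough illustration of the "with stack" style, here is a minimal Go sketch of fanning out and waiting for many operations. The `fetch` and `fanOut` names are hypothetical and not from the thread; `fetch` just sleeps to stand in for blocking I/O. The point is only that `fetch` is written like ordinary blocking code, with no async/await markers.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// fetch is a hypothetical blocking operation. To its caller it looks
// synchronous; the runtime parks the goroutine while it waits.
func fetch(id int) string {
	time.Sleep(10 * time.Millisecond) // stands in for blocking I/O
	return fmt.Sprintf("result-%d", id)
}

// fanOut runs fetch for ids 0..n-1, each in its own goroutine,
// and waits for all of them before returning the results in order.
func fanOut(n int) []string {
	results := make([]string, n)
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			results[i] = fetch(i) // each goroutine writes its own slot
		}(i)
	}
	wg.Wait()
	return results
}

func main() {
	fmt.Println(fanOut(3)) // prints [result-0 result-1 result-2]
}
```

The fan-out and the wait are still explicit here (`go` and `WaitGroup`), which is the "goroutines aside" caveat: only the operation itself stays free of async annotations.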

C# had/has teething issues when async/await was introduced, as it kept the original thread-blocking methods and added the async ones, and the two don't mix very well; you really need to go one way or the other when developing.

JavaScript leapt at async/await as it was all async anyway, but callback-based, which makes for horrible code to follow; so it made everything much cleaner.

pcwalton · on Dec 8, 2019

The valid reasons for M:N threading are reducing syscalls on goroutine spawning, and avoiding the overhead of the kernel scheduler on context switch.

duelingjello · on Dec 8, 2019

I think to keep Go code directly callable from C, they have to follow the platform's C calling conventions which means the same stack layout. So for cooperative concurrency on a single thread to work, each Goroutine needs its very own stack. On Intel, that means saving stack pointers RSP and RBP (16 bytes) for each. Also, each will need memory allocated for its stack for the stack pointers to point to... another 8-16 bytes (pointer and length).

echlebek · on Dec 9, 2019

The gc compiler, used by the vast majority of Go developers, does not use the C calling convention.

https://golang.org/doc/faq#Do_Go_programs_link_with_Cpp_prog...

gok · on Dec 8, 2019

Like much of Go, it was likely done because it made it easier to recycle Plan 9 code.

enneff · on Dec 8, 2019

Not sure where you got this idea. The only significant part of the Go codebase that was inherited from Plan 9 was the C compilers, used to build the original Go compiler that was written (from scratch) in C. I think perhaps a hash table implementation was also brought over from P9. That stuff is all long gone now, though.

The idea that Go's scheduler design is somehow inherited from Plan 9 is ridiculous.

duelingjello · on Dec 8, 2019

Well, in general, POSIX threads are much more expensive (RAM) than some unit of minimal cooperative concurrency/parallelism, say Erlang "processes." The idea of using a threadpool isn't broken because a user-space "scheduler" decides which tasks to run on which threads. It might also decide how to grow or shrink the threadpool. Ultimately, only one thing can run on a processor at a given time, and that typically means a task structure containing at least two items if executing on an interpreted/p-code VM:

0. next unit of work (pointer/counter; instruction, function pointer, etc.)

1. task-local heap (pointer or structure)

2. operand stack (pointer or structure; for stack-oriented VMs only)

masklinn · on Dec 8, 2019

> Well, in general, POSIX threads are much more expensive (RAM) than some unit of minimal cooperative concurrency/parallelism, say Erlang "processes."

The vast majority of the "expense" is irrelevant as it's virtual memory and unlikely to ever be touched (and thus committed).

loeg · on Dec 9, 2019

Depends on your codebase. Userspace C libraries and programs, including libc, often store surprisingly large buffers on the stack.

For example, try setting 'ulimit -s 128' (128kB stack limit) and see how many C programs crash. Then try, say, 16. Go's default is 8 kB, raised from 4 kB in 1.2: https://golang.org/doc/go1.2#stack_size

Linux's default userspace stack limit is 8 megabytes for a reason — programs really do use it.

masklinn · on Dec 9, 2019

> for a reason

Not really, the 8MB limit was added back in '95 from a previous limit of "essentially none"[0] with a justification of

> Limit the stack by to some sane default: root can always increase this limit if needed.. 8MB seems reasonable.

Developers don't generally think about their stack size, especially for single-threaded programs[1] so the defaults need to be a sweet spot of not unnecessarily big (such that you can catch unbounded recursion) but not so small that you'd segfault more than a very small fraction of all programs.

[0] https://git.kernel.org/pub/scm/linux/kernel/git/history/hist...

[1] which would be why e.g. OSX has a large main thread stack (8MB) and a relatively puny secondary thread stack (512k).

pcwalton · on Dec 8, 2019

Most of the memory cost of a POSIX thread is in the stack, and you can customize the stack size to be quite small. Small stacks are properly thought of as a property that GC enables, not a property that M:N threading enables.

masklinn · on Dec 8, 2019

> Most of the memory cost of a POSIX thread is in the stack, and you can customize the stack size to be quite small.

The problem there is that you need to very carefully size your stack as a mis-sizing will lead to a risky stack overflow. I'm not sure it's necessary either as allocating a "large stack" but using very little of it means most of it is never committed, and thus only costs memory mappings.

pcwalton · on Dec 8, 2019

If you have the runtime infrastructure to grow stacks, then you can use that with POSIX threads too.

masklinn · on Dec 9, 2019

Are there any systems where the C stack is growable? Do stack frames get prefixed with an explicit request for some amount of stack memory, leading to the stack possibly being moved before the funcall happens?

pcwalton · on Dec 10, 2019

We used to have growable C stacks in Rust using that technique. They worked (though were too slow for us). It could have been fixed by using stack copying like Go does.
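For readers who have not seen growable stacks in practice, a small Go sketch follows. It only shows the observable effect (a goroutine recursing far deeper than a small initial stack would allow), not the copying mechanism itself. The `depth` and `deepInGoroutine` names and the 64-byte padding are illustrative choices, not anything from the thread.

```go
package main

import "fmt"

// depth recurses n levels. The padding array makes every frame
// occupy some stack space, so deep recursion needs a large stack.
func depth(n int) int {
	var pad [64]byte
	pad[0] = byte(n)
	if n == 0 {
		return int(pad[0]) // 0 at the bottom of the recursion
	}
	return 1 + depth(n-1)
}

// deepInGoroutine runs the recursion on a fresh goroutine, which
// starts with a small stack that the runtime grows on demand.
func deepInGoroutine(n int) int {
	out := make(chan int)
	go func() { out <- depth(n) }()
	return <-out
}

func main() {
	fmt.Println(deepInGoroutine(100_000)) // prints 100000
}
```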

sagichmal · on Dec 9, 2019

Is there a stack size configuration that would enable a program to spawn 1M POSIX threads?

pcwalton · on Dec 9, 2019

As I recall the minimum total user + kernel stack size is 10kB in the Linux kernel, so 1M threads is 10GB of space. It should be doable, though you will probably have to bump up kernel limits.

A million threads is an extreme case, though. No system can reliably spawn that many threads that are actually doing something interesting without a very large amount of memory. When you leave the realm of microbenchmarks you have to expect that an unknown quantity of threads will have deep call stacks at any given time, so you really need to give yourself leeway to avoid the risk of OOM.

sagichmal · on Dec 9, 2019

So a big advantage that Go's M:N goroutine model brings to the table is how cheap they are. Cheap enough that tons of concurrency-related stuff you want to do in application code, like implementing a highly-concurrent algorithm, can be done with goroutines directly, without having to think too hard about mechanical sympathy and e.g. translate logical concurrency to physical threading. Go processes commonly have 1M or even 10M active goroutines at once.
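To get a feel for the cost, a sketch like the following parks a large number of goroutines and reports how many are alive and how much stack memory the runtime says is in use. `parkGoroutines` is a hypothetical helper, the count of 100,000 is arbitrary, and the numbers will vary by Go version and platform, so no expected output is given.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// parkGoroutines starts n goroutines that block until released, then
// reports the live goroutine count and the runtime's stack usage
// while they are all parked.
func parkGoroutines(n int) (live int, stackInUse uint64) {
	stop := make(chan struct{})
	var started, done sync.WaitGroup
	started.Add(n)
	done.Add(n)
	for i := 0; i < n; i++ {
		go func() {
			defer done.Done()
			started.Done()
			<-stop // park until released
		}()
	}
	started.Wait() // every goroutine is now running or parked

	live = runtime.NumGoroutine()
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	stackInUse = m.StackInuse

	close(stop) // release everything and wait for a clean exit
	done.Wait()
	return live, stackInUse
}

func main() {
	live, stack := parkGoroutines(100_000)
	fmt.Printf("goroutines alive: %d, stack in use: %d bytes, GOMAXPROCS: %d\n",
		live, stack, runtime.GOMAXPROCS(0))
}
```

Comparing the goroutine count against `GOMAXPROCS` shows the M:N multiplexing: many goroutines, few threads running Go code at once.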

So I don't think it's fair to say POSIX threads are comparable or whatever if they don't have this property.

pcwalton · on Dec 10, 2019

Go processes do not typically have 1M or 10M active goroutines at once. The initial stack size for goroutines is 2kB, so 10M goroutines would mean 20GB just for stacks, even assuming that the goroutines never grow their stack (which cannot be assumed for anything nontrivial). The 2kB minimum stack size is on the same order of magnitude as the 10kB POSIX thread stack size.
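The arithmetic behind these figures can be checked with a few lines of Go. This is only a back-of-the-envelope sketch: `totalStackBytes` is a hypothetical helper, kB is taken as 1000 bytes to match the round numbers quoted in the thread, and it assumes no stack ever grows past its minimum.

```go
package main

import "fmt"

// totalStackBytes returns the memory needed if every task keeps
// exactly its minimum stack and never grows it.
func totalStackBytes(tasks, perTaskBytes uint64) uint64 {
	return tasks * perTaskBytes
}

func main() {
	const kB = 1000 // decimal kB, an assumption to match the thread's round figures

	// 10M goroutines at 2kB each: prints 20000000000 (20GB)
	fmt.Println(totalStackBytes(10_000_000, 2*kB))

	// 1M POSIX threads at 10kB each: prints 10000000000 (10GB)
	fmt.Println(totalStackBytes(1_000_000, 10*kB))
}
```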

sagichmal · on Dec 10, 2019

> Go processes do not typically have 1M or 10M active goroutines at once.

It depends on domain, but in my domain of high-RPS network servers, they absolutely do.

> 10M goroutines would mean 20GB just for stacks.

When I deploy to metal, the average host has ~512GB of RAM, and not much cotenancy.

> The 2kB minimum stack size is on the same order of magnitude as the 10kB POSIX thread stack size.

There's also a question of cost to create and destroy; it's very common for goroutines to live for O(µs). I don't know how POSIX threads compare here.
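For the goroutine side of that question, a crude measurement can be sketched as below. `spawnAndJoin` is a hypothetical helper; this is a microbenchmark with all the usual caveats, it says nothing about POSIX threads, and results depend heavily on hardware and Go version.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// spawnAndJoin starts n goroutines that finish immediately, waits
// for all of them, and returns the total elapsed wall-clock time.
func spawnAndJoin(n int) time.Duration {
	var wg sync.WaitGroup
	wg.Add(n)
	start := time.Now()
	for i := 0; i < n; i++ {
		go wg.Done() // the goroutine's whole body is one call
	}
	wg.Wait()
	return time.Since(start)
}

func main() {
	const n = 1_000_000
	elapsed := spawnAndJoin(n)
	fmt.Printf("%d goroutines in %v (about %v each)\n",
		n, elapsed, elapsed/time.Duration(n))
}
```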

loeg · on Dec 9, 2019

The mismatch occurs because Go implements its own threading model, completely ignoring your operating system's implementation of userspace pthreads. If it then attempts to interact with programs using pthreads without taking special care, yeah, it can violate the pthreads API. Such Go programs are broken.

I'm not sure why this leads you to the conclusion that "the whole POSIX threading model seems broken."