> In more extreme scenarios, if journald is blocked long enough for its watchdog timer to expire, it will be killed with SIGABRT, dump core, and abruptly close all connected sockets losing whatever log data was in-flight. In some cases this can even cause services to crash due to the unexpected hangup of their stdout and/or stderr.
Ouch.
> we were seeing a delay between .1-.5 seconds twice every 5 seconds
Double ouch. Half a second can be an eternity when you're wringing all the performance you can out of a system, and with so many cloud providers using network-attached storage, that half second can balloon quickly.
200KB is not an unusual size for Unix socket buffers, and it wouldn't be hard to exceed 200KB of log writes in half a second on a busy system.
As an org that will be moving to systemd with the release of the next Ubuntu LTS version, I appreciate what CoreOS is doing here. Great article, great finds.
As a side note, this is something to look out for on AWS instances where the default for /var/log is on an EBS mount (which can see delays due to network latency). It would make sense to mount /var/log to the ephemeral drive to avoid or minimize such unpleasantness. Of course, that limits the ability to use /var/log to troubleshoot kernel panics.
Durability and performance is a classic tradeoff, and systemd-journald is hardly the only application affected by this. If you need more performance while maintaining durability, buy faster storage technology.
That said, if journald could use multiple I/O threads, it might be able to take advantage of concurrency in the underlying storage if available (e.g. RAID). MySQL's InnoDB is a good example of such an implementation.
At any rate, in a modern production system, you generally want logs to be copied to an indexed archive located elsewhere in the network. On-host logs should be considered ephemeral, and mainly used as a buffer against slow consumers.
I think the bigger problem is how outside services are being negatively impacted (up to and including crashing) by delays in writing to disk, while having no choice in the performance/safety tradeoff.
Any program that uses the syslog(3) library function will also have this problem, as it also blocks (internally it writes to the /dev/log socket file). If the consumer can't keep up, whether it be rsyslogd or syslog-ng or whatever is listening to the socket, the producer can block as well.
So this is not a new problem. If you really want your producer to drop logs in the event of backpressure, you can directly use the syslog protocol over UDP, but it has tradeoffs too.
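To make the difference concrete, here's a minimal sketch of that fire-and-forget UDP approach in C. The RFC 3164-style "<14>" priority prefix and the collector at 127.0.0.1:514 are assumptions for illustration; point it at whatever is actually listening.

    /* Minimal sketch: fire-and-forget logging via the syslog protocol
     * over UDP (RFC 3164-style framing, collector on 127.0.0.1:514
     * assumed). UDP never blocks on backpressure; if the collector
     * can't keep up, datagrams are simply dropped. */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void) {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        if (fd < 0) { perror("socket"); return 1; }

        struct sockaddr_in dst = {0};
        dst.sin_family = AF_INET;
        dst.sin_port = htons(514);                    /* standard syslog port */
        inet_pton(AF_INET, "127.0.0.1", &dst.sin_addr);

        /* <14> = facility 1 (user-level), severity 6 (informational) */
        const char *msg = "<14>myapp: something happened";
        sendto(fd, msg, strlen(msg), 0,               /* never blocks on a */
               (struct sockaddr *)&dst, sizeof(dst)); /* slow consumer     */

        close(fd);
        return 0;
    }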
Great discussion, but one issue I've run into with journald, compared to a traditional logging mechanism, is the size of the journal itself.
journald's archives are generally at least 200x the size of syslog's for the same amount of information stored.
That is, on two identical servers, I've run rsyslog with the traditional logrotate mechanism and journald side-by-side to see how they compared.
I'm pleased with the extra querying mechanism provided by journald, which is absolutely great for debugging. However, while I can store the logs of one server virtually forever with just bz2 compression, I've unexpectedly run out of space in my /var partition with journald after just a couple of months. The extra bandwidth is also a factor in journald's performance, and is equally overlooked.
In fact, this has forced me to reconsider my log storage policy. It's much more effective to use rsyslog to send encrypted logs over the network (faster, lower latency, and less bandwidth) than it would be with journald. I still keep the journal on individual machines for easier querying, but with a very limited retention.
If I have to choose between a log I can query easily and a log file with 200x the density, I'll easily pick the second.
I've also found that the resilience of the journald archives is much lower. A text log file corrupted by a crash needs no recovery step and can still be read to inspect issues in any worst-case scenario. journald archives, however, are often corrupted in crashes and moved aside after the reboot. For remote logging it might not be an issue, but for embedded systems and/or desktop machines it is. Ensuring logs can be read even when corrupted is easily much more important than the signing mechanism introduced in the archives (which, on the other hand, is totally a non-issue, as sensitive hosts should do remote logging).
I don't think this is entirely true: even if you're appending lines to a text file, doing it from multiple concurrent threads can corrupt individual lines.
And if you want to make absolutely sure that the log line you've just written is actually safely on disk, then you still have to fsync().
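For the append case, a minimal sketch of what "safely on disk" costs (the file name and line content are placeholders):

    /* Minimal sketch: one write() per line through an O_APPEND
     * descriptor, then fsync() when durability matters. O_APPEND makes
     * the offset update atomic; on local filesystems a single write()
     * per line generally keeps concurrent writers from interleaving,
     * though POSIX doesn't strictly promise that for arbitrary sizes. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("app.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (fd < 0) { perror("open"); return 1; }

        const char *line = "something happened\n";
        if (write(fd, line, strlen(line)) != (ssize_t)strlen(line))
            perror("write");    /* short write; see the POSIX quotes below */

        if (fsync(fd) < 0)      /* only now is the line safely on disk */
            perror("fsync");

        close(fd);
        return 0;
    }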
I would argue that journald's mmap-with-periodic-fsync() approach is actually better for latency.
If you want to make sure that your system performance isn't affected by stalled logging I/O, then you should ship the logs off-machine using a non-blocking protocol. Of course, then you lose guarantees that your log entries have actually been stored (which might or might not be a problem for you).
> If a write() requests that more bytes be written than there is room for (for example, the file size limit of the process or the physical end of a medium), only as many bytes as there is room for shall be written.
(Admittedly, disk full is pretty much an edge case.)
> If write() is interrupted by a signal after it successfully writes some data, it shall return the number of bytes written.
"write() is atomic, but may not write the number of bytes asked to write."
A possible workaround is to use a start-of-line marker and provide a way to detect broken lines.
For example, if one starts every line with a 0x02 and ends it with a 0x03 (the Start of Text/End of Text ASCII characters) and removes any such character from the text being logged, partially written log entries can be detected and ignored when reading the log. If one does that, one can just retry writing the entire line until write() indicates it wrote the entire buffer passed in.
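A rough sketch of that framing idea in C, assuming entries fit a fixed buffer; the retry loop also handles the EINTR/short-write cases quoted above:

    /* Sketch of the framing idea: strip 0x02/0x03 from the payload,
     * wrap the entry in them, and retry until write() has consumed the
     * whole buffer, so a reader can discard any entry that is missing
     * its terminator. */
    #include <errno.h>
    #include <string.h>
    #include <unistd.h>

    /* Write the full buffer, retrying on EINTR and short writes. */
    static int write_all(int fd, const char *buf, size_t len) {
        while (len > 0) {
            ssize_t n = write(fd, buf, len);
            if (n < 0) {
                if (errno == EINTR) continue;    /* interrupted: retry */
                return -1;
            }
            buf += n;
            len -= (size_t)n;
        }
        return 0;
    }

    static int log_line(int fd, const char *msg) {
        char frame[1024];
        size_t j = 0;

        frame[j++] = 0x02;                       /* STX: start of entry */
        for (; *msg && j < sizeof(frame) - 1; msg++)
            if (*msg != 0x02 && *msg != 0x03)    /* remove framing bytes */
                frame[j++] = *msg;
        frame[j++] = 0x03;                       /* ETX: end of entry */

        return write_all(fd, frame, j);
    }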
I disagree. It is far better to use a Pascal-string-like approach: a structure that includes the length of the next written object.
I believe that sacrificing a little more storage would also make records easier to split out. Pre-pad the message with zeros up to the pad size, then add a pseudo-random number of additional pad objects (1 to n). Next, store either the offset of the start or of the end, as well as the length of the payload (the start would probably be easier to verify). With that structure it should be difficult to craft a payload that looks like a valid message to the log.
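Here's a minimal sketch of just the basic length-prefix part of that idea (the padding and offset refinements are left out, and the record layout is an arbitrary illustrative choice, not any real journal format):

    /* Sketch of the basic length-prefix idea only (the zero padding and
     * start/end offset refinements are omitted). The record layout, a
     * native-endian 4-byte length followed by the payload, is an
     * illustrative choice. */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    static int write_record(FILE *f, const char *msg) {
        uint32_t len = (uint32_t)strlen(msg);
        if (fwrite(&len, sizeof(len), 1, f) != 1) return -1;  /* prefix  */
        if (fwrite(msg, 1, len, f) != len) return -1;         /* payload */
        return 0;
    }

    /* A reader skips exactly len bytes per record; a tail shorter than
     * its prefix promises marks the broken final entry. */
    static int read_record(FILE *f, char *buf, size_t cap) {
        uint32_t len;
        if (fread(&len, sizeof(len), 1, f) != 1) return -1;   /* EOF/torn */
        if (len >= cap || fread(buf, 1, len, f) != len) return -1;
        buf[len] = '\0';
        return (int)len;
    }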
I don't quite understand what approach you are advocating, but for me "it should be difficult to craft a payload message that looked like a valid message for the log." is not good enough for a system service.
This is a back-of-the-napkin-grade design; refine it and/or develop a proof for your specific use case.
Can you see any way it would be obviously incorrect? I could not when I wrote it, but I did not take the much greater effort of /proving/ it was a correct design (which, much like crypto algorithms, should also be done on an implementation-by-implementation basis).
> write() is atomic, but may not write the number of bytes asked to write.
as an alternative formulation, but I'm not sure what that's even supposed to mean, semantically. "Atomic" literally means "indivisible" (well, "cannot cut" or "un-cuttable"). So either "all of it" or "none of it".
You shouldn't need to choose. MySQL's InnoDB engine supports compression, and not only does it use less space on disk, it's actually faster to read and write. I don't see why journald can't implement something similar.
The deflate algorithm is well understood, well defined, available in every language, and not hard to implement. At this point, we might benefit from having it included and provided by the kernel. In fact, the kernel does have limited support for compression[1] (well, decompression only currently, it seems). Expanding upon this to allow compression as well would likely yield quite a few benefits for multiple applications.
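For illustration, a tiny sketch of deflate via zlib's one-shot API (link with -lz; the sample log text is made up):

    /* Sketch: compressing a chunk of log text with zlib's one-shot
     * deflate wrapper. compressBound() gives the worst-case output
     * size for a given input. */
    #include <stdio.h>
    #include <string.h>
    #include <zlib.h>

    int main(void) {
        const char *logs =
            "app[123]: repetitive log text compresses well\n"
            "app[123]: repetitive log text compresses well\n";
        uLong src_len = (uLong)strlen(logs);

        Bytef out[1024];        /* comfortably >= compressBound(src_len) here */
        uLongf out_len = sizeof(out);

        if (compress2(out, &out_len, (const Bytef *)logs, src_len,
                      Z_BEST_SPEED) != Z_OK) {  /* favour speed for a logger */
            fprintf(stderr, "compress2 failed\n");
            return 1;
        }
        printf("%lu bytes -> %lu bytes\n",
               (unsigned long)src_len, (unsigned long)out_len);
        return 0;
    }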
This is one of those incredibly clearly written technical articles by someone who really knows their stuff that brings together a few things I understood vaguely into a coherent understanding. Lines like:
> There’s a per-kernel sync(), per-filesystem syncfs(), per-file fsync() and fdatasync(), per-mmap msync(), and per-file-range sync_file_range().
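For anyone who wants to see the per-file members of that family side by side, a quick illustrative sketch (mine, not from the article):

    /* sync() and syncfs() flush a whole system / filesystem; these
     * three narrow the scope to one file, or a byte range within it. */
    #define _GNU_SOURCE              /* for sync_file_range() */
    #include <fcntl.h>
    #include <unistd.h>

    void flush_examples(int fd) {
        fsync(fd);                   /* data + metadata for this file    */
        fdatasync(fd);               /* data only; skips e.g. timestamps */
        sync_file_range(fd, 0, 4096, /* just bytes 0..4096, and only     */
                        SYNC_FILE_RANGE_WRITE); /* initiate writeback    */
    }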
Incidentally, CoreOS has one of the best home pages that I've ever encountered. "These are our main products, and for each of them here's an information icon that you can click that will actually inform you what they do and what they're for".
I would like to hire whoever wrote those descriptions too - clear concise explanation of what things do and why you'd want them.
This is a fantastic analysis and writeup of the problem.
Is there a practical reason that systemd-journald doesn't do O_DIRECT and non-blocking writes/fsync (using the io_submit family of syscalls)? If minimizing blocking is such an important issue, then given the state of disk I/O in Linux, that's your only solution.
As long as you pre-allocate the space and are not required to do metadata updates, you will get mostly non-blocking operations. Since this is done for logging, which I presume is not going to be read right back, there's no downside to using O_DIRECT; in fact, not polluting the page cache might be a net benefit.
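A rough sketch of that approach with libaio (link with -laio; the file name, block size, and preallocation amount are arbitrary illustrative choices):

    /* Rough sketch: an aligned, preallocated O_DIRECT write submitted
     * through Linux AIO, so the submitting thread doesn't block on the
     * write itself. Error handling trimmed for brevity. */
    #define _GNU_SOURCE              /* for O_DIRECT */
    #include <fcntl.h>
    #include <libaio.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define BLOCK 4096               /* O_DIRECT wants aligned buffers/offsets */

    int main(void) {
        int fd = open("journal.bin", O_WRONLY | O_CREAT | O_DIRECT, 0600);
        if (fd < 0) { perror("open"); return 1; }

        /* Preallocate so the write needs no metadata updates. */
        if (posix_fallocate(fd, 0, 64 * BLOCK) != 0) return 1;

        void *buf;
        if (posix_memalign(&buf, BLOCK, BLOCK) != 0) return 1;
        memset(buf, 0, BLOCK);
        strcpy(buf, "log entry goes here\n");

        io_context_t ctx = 0;
        if (io_setup(8, &ctx) < 0) return 1;

        struct iocb cb, *cbs[1] = { &cb };
        io_prep_pwrite(&cb, fd, buf, BLOCK, 0);  /* one block at offset 0 */
        if (io_submit(ctx, 1, cbs) != 1) return 1;

        /* Reap the completion later, without ever having blocked above. */
        struct io_event ev;
        io_getevents(ctx, 1, 1, &ev, NULL);
        printf("wrote %ld bytes\n", (long)ev.res);

        io_destroy(ctx);
        close(fd);
        free(buf);
        return 0;
    }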
Btw, is there any way to get in touch with the author to talk about the issue? Can't find his contact details.
Jesus. From the first bug mentioned in your comment in that thread:
Reporter: "libsystemd-journal or systemd-journald loses some systemd's parts of message context when few messages logged in a moment"
Poettering: "This is a race. ... To fix this we need [two new things.] ... These items are listed on the plumbers wishlist, because we need [those things mentioned previously] from the kernel folks."
Software Development Jesus descends from the clouds, wearing a virtual reality headset and a t-shirt with the words "freedesktop.org". Something about him reminds you of Bill Gates, but you are not sure what, the headset with its peculiar red laser pointer attached to the side covers too much of his face. He opens his right palm towards your computer, which surprisingly seems to be running Fedora Linux now, and through the magic of Zeroconf networking and D-Bus, the following text appears on your screen:
Happy to explain this to you, simoncion!
After receiving a log message from a client process, the journal tries to get trusted metadata, like the application name, about that process. If the application exits too soon after logging its message, some of that metadata, e.g. /proc/<pid>/cmdline, might not be available anymore. The message itself is always logged, though.
Syslog, of course, also logs application names, but that information cannot be trusted, because it is reported by the clients themselves. So even if not all expected metadata is logged, you lose nothing compared to syslog.
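For the curious, a sketch of how that trusted-metadata lookup can work on a Unix socket; this is illustrative, not journald's actual code:

    /* Sketch of the trusted-metadata idea: a Unix-socket server asks
     * the kernel for the peer's pid via SO_PEERCRED, then reads that
     * pid's /proc entry itself rather than trusting anything the
     * client sent. */
    #define _GNU_SOURCE              /* for struct ucred */
    #include <stdio.h>
    #include <sys/socket.h>

    void print_peer_cmdline(int conn_fd) {
        struct ucred cred;
        socklen_t len = sizeof(cred);

        /* pid/uid/gid filled in by the kernel: can't be spoofed. */
        if (getsockopt(conn_fd, SOL_SOCKET, SO_PEERCRED, &cred, &len) < 0)
            return;

        char path[64], cmdline[256] = {0};
        snprintf(path, sizeof(path), "/proc/%d/cmdline", (int)cred.pid);

        /* Races with process exit: if the client is already gone, the
         * open fails and the metadata is simply missing, which is
         * exactly the situation described above. */
        FILE *f = fopen(path, "r");
        if (!f) return;
        fread(cmdline, 1, sizeof(cmdline) - 1, f);
        fclose(f);
        printf("peer pid %d ran: %s\n", (int)cred.pid, cmdline);
    }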