There are a whole bunch of other things I dislike about Pascal, but in this one he was just undeniably correct. Worrying about 1-3 extra bytes (on the 16 and 32 bit platforms where it potentially mattered) was just not worth all the issues that null-terminated strings brought with them.
My current favorite string implementations are the various compact string crates for Rust. Generally, you want a string to be able to do at least three things:
- Pointer, length, capacity tuple, for heap-allocated strings, 24B on x64.
- String inlined into the 24B buffer.
- Pointer, length tuple where pointer points to rodata.
You can do any of that and the discriminator in 24B, given the healthy assumption that all strings are shorter than 2^63.
Sadly, switching costs are massive and every programming language is pretty much struck with the string they started with. Hopefully, whatever comes next can crib from smartstring or compact_str or the like.
There was a paper on this in the late 70s from the Cedar group at PARC. This was back when computer science papers were actual scientific papers, so full of analysis of different alogorithms' performance with counted vs delimited strings.
Counted strings won hands down on anything but strings so short the length was a large percentage of overall size.
Yet...since nobody reads the literature, we have all continued to suffer.
> Sadly, switching costs are massive and every programming language is pretty much struck with the string they started with.
Haskell is _almost_ flexible enough to be able to use a different string than the one it started with.
The language itself actually is flexible enough, but many of the libraries are not.
The main thing making Haskell flexible enough is that a literal like "foo" can be statically determined to be the right string type that you want to use. (And that happens at compile time, it's not a runtime conversion.)
Imagine you are writing performance sensitive code. You want to get a substring from a string, one that is not going to live outside your hot loop. In standard C you can just reference a part of a string with a pointer offset. All standard functions will continue working and you didn’t have to make any calls outside of your loop, not to the allocator, not to memcopy, nothing.
With strings being objects that are prefixed by a header cannot do this. At a minimum you need to allocate a new header, if not the whole string. Yes that’s the safer route but also a lot less performant.
Most crucially, you can build the header string implementation on top of C strings. You cannot do the opposite.
Realistically though C strings (aka null terminated strings) are just not a great thing because of the null termination. For my money, I would prefer to just use untermianted arrays and a separate size variable, as well as wide character strings for actual display stuff. This way all the interop must include string lengths (or some other way to determine length), and all internal stuff may be just ASCII but must not leave your internal logic and never be shown to the user.
> Imagine you are writing performance sensitive code. You want to get a substring from a string, one that is not going to live outside your hot loop. In standard C you can just reference a part of a string with a pointer offset.
If you want your substring to terminate in the same place as the original, at a null terminator. But that sadly is almost never the case, and as many C practitioners know, references like this are often unsafe and so APIs that substring tend to copy. That's just what they have to do to pass address sanitizer and static analysis checks.
If you want arbitrary views on a null terminated string, well, it's no longer null terminated and that's just the start of your problems in C.
In languages like Rust and Go, taking a view of a string or array is safe and doesn't copy the underlying data or require an allocation. So if you are writing performance sensitive code where substrings are a major contributor to CPU cycles, best go with those language (or C++) rather than C.
That’s fair: you won’t be able to use any libc functions that rely on null termination. But a lot of the time you don’t need to either. Think writing the substring to a socket or comparing it to a known constant.
In Rust, you would do both of those with a &str, which works fine. Just works exactly as in C, with no calls to memcopy or allocator or anything. And you would also be able to do all the other things that in C use null termination, too.
The solution in Rust is separate String and &str. &str is a reference to somewhere within String, and the length of the referred to region, and borrows from the String it refers to.
Any function that does not need to modify a String takes a &str. Any function that does modify a String typically takes a String, which means they consume their input. (Because of utf-8, in-place modification is generally a pipedream.)
Also, the headers are typically allocated on stack. Rust is a lot less shy about types that are larger than a pointer living inline whereever they are used, and this is something that seems to work a lot better than the alternative.
So which is it, then? Does keeping size separate "blows your CPU cache"¹ or not? You can't argue it does in one case (Rust) but not in your case…
(And note that the representation you're responding to is not really a "header", in the same sense that the trailing null is a "footer". The representation does not require the length be contiguous with the data, but that's what upthread was trying to say in the first place.)
So now you are arguing that by default your strings should come with a length? Great!
If you want that, you might as well bake that length into the string type by default (and use a specialised type, perhaps a naked raw pointer into the string) for when you don't want to pass the length.
To get a correct understanding, if you aren't a Rust person, Rust's String is (literally, though this is opaque) Vec<u8> with checks to ensure it's actually not just arbitrary bytes you're storing but UTF-8 text.
Vec<u8> unlike String has a direct equivalent (well, the Rust design is slightly better from decades of extra experience, but same rough shape) in C++ std::vector<std::byte>
The C++ std::string is this very weird thing that results from having standardized std::string in C++ 98, then changing their mind after implementation experience. So while it's serviceable it's pretty poor and nobody should model what they're doing on this. There have been HN links to articles by people like Raymond Chen on what this type looks like now.
In order to access the string contents in the first place you need the pointer. The length is stored right next to it. So they're both going to be in the same cache line, assuming proper alignment. In the rare case in which they straddle a cache line, you just have to load once and then the length remains in cache for the remainder of the loop. (This is true regardless of where the length lives, in fact; as far as CPU cache is concerned it really makes little difference either way.)
(This is assuming SROA didn't break apart the string and put the length in a register, which it often does, making this entire question moot.)
Huh? The headers are either in registers or in stack. The top of stack is always in L1. There is no way in which this is inferior to handing over a pointer to a string and a length separately, other than requiring two additional words of storage in registers/stack.
If you are reading 1000 lines from stdin at once to separate Strings, you are already going to be accessing memory in 1000 places at the same time, and making it 1001 isn't meaningfully worse for your cache. (Implementation would be Vec<String>, which would lay out the 1000 headers contiguously.)
But I genuinely have a hard time understanding for what kind of workload I would ever do that. If you want to read a 1000 lines of stdin, and cannot use an iterator and must work on them at the same time, I would likely much rather read them into a single string and then split that into a 1000 &str using the .lines() iterator.
Presumably the idea is, for example, sorting? In which case you do have to read the entire input before you can do anything. But the way I'd do that is to read the entire stdin to a single String, then work with &str pointers to it.
Null terminated strings have a footer, so it is the exact same problem, just on the other end of the string. It is inherently impossible to substring an arbitrary string without copying and using the same memory layout for the full string and the substring(s).
Of course, if your string type is a struct containing a size and a pointer, you can easily have multiple substrings pointing into the same byte array.
> Imagine you are writing performance sensitive code. You want to get a substring from a string, one that is not going to live outside your hot loop.
Zig uses slices for this (and everything else except interop): a pointer and a length, with some nicely ergonomic syntax for making another one, like `slice[start..][0..length]`.
When you're building strings then you have an ArrayList, which keeps a capacity around, and maybe a pointer to an Allocator depending on what you want. It's trivial to get the slice out of this when you need it.
Doing anything useful with a string requires knowing where it is (the pointer) and how much of it you have (the length) so keeping them together in one fat pointer is just good sense. It's remarkable how much easier this is to work with than C strings.
48bit address space and 128bit of return value on systemv make pointer, size, 32bit of capacity-past-the-end attractive on x64. Specifically ptr, size as u64 with the extra capacity stored across the high 16 bits of each of them.
What do you do when you have a string that's longer than 2^32 that gets truncated to len=0? Instantly freeing the buffer might not be what the user wants, if they intend to immediately reuse it for an another very long string, for example.
I think that's a pretty bad case of premature optimization, especially because the first CPUs with 57 bit support are now hitting mainstream. Just use 3 words, it's not that much extra space.
Realloc/remap down to 4gb in that case sounds OK to me. > 4gb allocated from a structure which can't do any resizing seems moderately unlikely, but sure, I guess free is also correct in that case.
Two 64 bit values can be returned in registers on the systemv x64 abi, three get passed as a pointer to stack memory. It's an optimisation but I think it's a valid one.
57 bit address space has been coming any year now for maybe a decade, I'll worry about that when it happens.
Yes, his arrays which have length as an immutable part of their type certainly prevent certain kinds of bugs. Too bad about making it impossible to write generic array-handling subroutines, even if you accept the generally inexpressive type system as a given.