Er, no, it wouldn’t - the point is that you’re taking something that resembles a real world spec and interpreting it into code. You’d be surprised at how many people stumble on this despite how basic the solutions are.
I gave out a file that was 1000 lines long. It may contain duplicates. The numbers are all valid, but are not necessarily reduced (e.g. the number 4 may be represented as IV or IIII).
Print out the reduced numbers in sorted order. If both IV and IIII exist in the file, then IV should be printed out twice (in the proper spot of the overall sorted order).
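For anyone who wants to see how small the problem is, here is a minimal sketch of one way to satisfy that spec, not the code from my repo. The class and method names (`RomanSort`, `toInt`, `toRoman`, `sortReduced`) are made up for illustration, and it assumes one numeral per line, uppercase, with blank lines ignored.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: parse each numeral to an int, sort the ints,
// then re-encode every value in its reduced form.
final class RomanSort {
    private static final int[] VALUES =
            {1000, 900, 500, 400, 100, 90, 50, 40, 10, 9, 5, 4, 1};
    private static final String[] SYMBOLS =
            {"M", "CM", "D", "CD", "C", "XC", "L", "XL", "X", "IX", "V", "IV", "I"};

    // Value of a single Roman digit; anything else is rejected.
    static int digit(char c) {
        switch (c) {
            case 'I': return 1;
            case 'V': return 5;
            case 'X': return 10;
            case 'L': return 50;
            case 'C': return 100;
            case 'D': return 500;
            case 'M': return 1000;
            default: throw new IllegalArgumentException("Not a Roman digit: " + c);
        }
    }

    // Accepts reduced and unreduced forms alike ("IV" and "IIII" both give 4):
    // a digit is subtracted when it is smaller than the digit that follows it.
    static int toInt(String roman) {
        int total = 0;
        for (int i = 0; i < roman.length(); i++) {
            int current = digit(roman.charAt(i));
            int next = i + 1 < roman.length() ? digit(roman.charAt(i + 1)) : 0;
            total += current < next ? -current : current;
        }
        return total;
    }

    // Greedy encoding from the largest symbol down always yields the reduced form.
    static String toRoman(int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < VALUES.length; i++) {
            while (n >= VALUES[i]) {
                sb.append(SYMBOLS[i]);
                n -= VALUES[i];
            }
        }
        return sb.toString();
    }

    // Duplicates are kept on purpose: "IV" and "IIII" both come out as "IV".
    static List<String> sortReduced(List<String> lines) {
        List<Integer> values = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                values.add(toInt(trimmed));
            }
        }
        Collections.sort(values);
        List<String> out = new ArrayList<>();
        for (int v : values) {
            out.add(toRoman(v));
        }
        return out;
    }
}
```

Given the lines X, IIII, IV, I, `sortReduced` returns I, IV, IV, X, which is the "printed out twice" case from the spec.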
I timed myself doing this from scratch. It took about an hour (7:38 pm to 8:48 pm, with some thinking and cleanup sporadically afterwards). See https://github.com/shagie/RomanSort/commits/master; it's 6c79654 that has it acceptably working.
One candidate did it very similarly to what I had.
Another candidate didn't have runnable code. Because it was only to take about an hour, and their interview was Monday at 10 am, they didn't start it until they got into their place of current employment that morning (they had a full week to work on it). Neither I nor my manager was impressed with that one.
Another candidate put them into an array (because arrays were faster than lists to access), wrote a Comparator that takes two Roman numeral strings, and called Arrays.parallelSort() (because that was faster too). They then took the String representation of the array, did a replaceAll on the delimiter to turn it into a new line, and printed that out. It worked, but the code was messy.
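To give a rough idea of the array-plus-Comparator shape, here is a tidier sketch of that approach. It is my own illustration, not the candidate's code. `ComparatorSortSketch`, `value` and `sortedLines` are hypothetical names, `String.join` stands in for the replaceAll on the array's String form, and the reduction step is left out, so this only orders the numerals as given.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical sketch: sort an array of Roman numeral strings by numeric value.
final class ComparatorSortSketch {
    // Right-to-left parse: a digit smaller than the one to its right is subtracted.
    static int value(String roman) {
        int total = 0;
        int previous = 0;
        for (int i = roman.length() - 1; i >= 0; i--) {
            int v;
            switch (roman.charAt(i)) {
                case 'I': v = 1; break;
                case 'V': v = 5; break;
                case 'X': v = 10; break;
                case 'L': v = 50; break;
                case 'C': v = 100; break;
                case 'D': v = 500; break;
                case 'M': v = 1000; break;
                default: throw new IllegalArgumentException("Not a Roman digit");
            }
            total += v < previous ? -v : v;
            previous = v;
        }
        return total;
    }

    // Sorts a copy so the caller's array is untouched, then joins with newlines.
    static String sortedLines(String[] numerals) {
        String[] copy = numerals.clone();
        Arrays.parallelSort(copy, Comparator.comparingInt(ComparatorSortSketch::value));
        return String.join("\n", copy);
    }
}
```

For 1000 lines the choice between parallelSort and a plain sort is unlikely to matter, which is part of why the speed arguments didn't carry much weight with me.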
I'm not seeing how your example includes an "obvious improvement" that "only the best engineers would do". I would use arrays over lists but I have a ton of low-level experience so I tend to care about performance, but I wouldn't necessarily expect the "best" engineers to do so unless the prompt says "Please take special care to turn in the fastest solution".
There are a couple of obvious tests (written unit tests) that I would expect candidates to think of for their Roman numeral sorting function, and I usually graded take-homes better when there was 1) evidence of testing and 2) evidence of smart, appropriate edge case testing. That said, not having time to write polished tests in an hour isn't necessarily a red flag.
I'm really struggling to find how simple examples like this can demonstrate anything like GP mentioned.