I'd love to know what use case makes those things important to you. What kinds of benchmarks and cleaning tasks do you need to run?

Also, what kinds of evaluations do you use to measure reasoning quality?