
I'd love to know what use case makes those things important to you, and what kinds of benchmarks and cleaning tasks you need to run.

Also, what kind of evaluations for quality of reasoning do you use?


