
BeautifulSoup is great if you don't care about performance at all, because it is painfully slow.

lxml doesn't work well with broken HTML, but it is one or two orders of magnitude faster for parsing, and the same goes for querying with XPath.
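For reference, parsing plus an XPath query with lxml looks roughly like this. It is only a minimal sketch: the SAMPLE markup and the extract_links helper are made up for illustration, and the import is guarded so the snippet still runs (returning None) when lxml isn't installed.

```python
# Guarded import: lxml is a third-party package and may not be installed.
try:
    from lxml import html as lxml_html
except ImportError:
    lxml_html = None

# Hypothetical sample markup, not taken from any real page.
SAMPLE = """
<html><body>
  <ul>
    <li><a href="/first">First</a></li>
    <li><a href="/second">Second</a></li>
  </ul>
</body></html>
"""


def extract_links(markup):
    """Return every href found on an <a> tag, or None if lxml is unavailable."""
    if lxml_html is None:
        return None
    tree = lxml_html.fromstring(markup)
    # An XPath attribute query returns the attribute values as strings.
    return list(tree.xpath("//a/@href"))


if __name__ == "__main__":
    print(extract_links(SAMPLE))
```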

Apart from that, there is also Scrapy, which is used a lot, but it is also very slow; it is just easy to scale horizontally.

Scraping often doesn't involve HTML parsing at all. When you are scraping pages whose structure changes a lot, it might be better to go with full-text search, and in that case, the faster the better. In that area Python is far from the best, except when .split() and .join() are enough. Even re.match is slow because of the backtracking algorithm it uses.
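To make the "plain string methods" case concrete, here is a small sketch of pulling a value out of a page by searching for surrounding markers, once with str.partition and once with re for comparison. The PAGE snippet and the between / between_re helpers are invented for this example; which one is faster on your data is something you would have to measure.

```python
import re

# Hypothetical page fragment; the markers around the price are assumptions.
PAGE = '<div class="x1"><span>Price:</span> <b>19.99 EUR</b></div>'


def between(text, start, end):
    """Return the text between the first `start` marker and the next `end`.

    Uses only str.partition, so no HTML parsing and no regex engine.
    Returns None if either marker is missing.
    """
    _, found, rest = text.partition(start)
    if not found:
        return None
    value, found, _ = rest.partition(end)
    return value if found else None


def between_re(text, start, end):
    """Same extraction done with a non-greedy regular expression."""
    pattern = re.escape(start) + r"(.*?)" + re.escape(end)
    match = re.search(pattern, text, re.DOTALL)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(between(PAGE, "Price:</span> <b>", "</b>"))
    print(between_re(PAGE, "Price:</span> <b>", "</b>"))
```

Both helpers ignore the page structure entirely, so they keep working when the surrounding tags move around as long as the two markers stay intact.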

And to finish, Requests is also super slow; if you want something fast, you have to use pycurl.



In my experience selectolax is about 10x faster than lxml, and keeps the familiar CSS selector API: https://rushter.com/blog/python-fast-html-parser/


Does Scrapy's slow speed actually matter much? Your main bottleneck is always going to be network calls and rate limiting. I don't know how much optimization can help there.


Selectolax is nice, much faster than bs4 or lxml. Not a very well known project yet though.

Not sure there's anything faster on the javascript side of the fence?


If they beat lxml, that is pretty impressive. Too bad they don't support XPath.


Libxml is pretty slow (lxml uses it). Selectolax is 5 times faster for simple CSS queries. It is basically a thin wrapper for a well optimized HTML parser written in C.


Beautiful Soup can use lxml, and does by default for parsing XML.


There is a big speed difference between lxml alone and lxml + bs4


I was unaware, I always use bs4 with lxml for parsing xml just because I like the interface. For what I'm doing, the bottleneck is the remote system/network, so it doesn't really matter. But now I'm curious about which parts are slower and why. Maybe I'll run some experiments later.
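A rough sketch of what such an experiment could look like, using timeit. Everything here (MARKUP, LinkCounter, build_candidates, benchmark) is a made-up name for illustration; the stdlib html.parser is always timed, while lxml and bs4 + lxml are only added if they happen to be installed. The synthetic input is tiny and well-formed, so treat any numbers as a starting point rather than a result.

```python
import timeit
from html.parser import HTMLParser

# Synthetic, well-formed input: 200 paragraphs with one link each.
MARKUP = "<html><body>" + "<p>hello <a href='/x'>link</a></p>" * 200 + "</body></html>"


class LinkCounter(HTMLParser):
    """Counts <a> start tags using the standard library parser."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.count += 1


def count_links_stdlib(markup):
    parser = LinkCounter()
    parser.feed(markup)
    parser.close()
    return parser.count


def build_candidates():
    """Map a label to a callable that counts links; optional libraries are skipped if missing."""
    candidates = {"html.parser": count_links_stdlib}
    try:
        from lxml import html as lxml_html
        candidates["lxml"] = lambda m: len(lxml_html.fromstring(m).xpath("//a"))
    except ImportError:
        pass
    try:
        from bs4 import BeautifulSoup
        import lxml  # noqa: F401  (bs4's "lxml" parser needs it installed)
        candidates["bs4+lxml"] = lambda m: len(BeautifulSoup(m, "lxml").find_all("a"))
    except ImportError:
        pass
    return candidates


def benchmark(candidates, markup, repeat=5):
    """Return the best wall-clock time in seconds per candidate, one call per run."""
    results = {}
    for name, func in candidates.items():
        timings = timeit.repeat(lambda: func(markup), number=1, repeat=repeat)
        results[name] = min(timings)
    return results


if __name__ == "__main__":
    for name, seconds in benchmark(build_candidates(), MARKUP).items():
        print(f"{name}: {seconds * 1000:.2f} ms")
```

Each candidate does the same job (count the links), so the timings cover parsing plus one simple query. Splitting those two steps into separate callables would show which part is responsible for the difference.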


Yes, the difference is infinite with broken HTML, which is (checks notes) a huge chunk of the Internet.



