BeautifulSoup is great if you don't care about performance at all, because it is painfully slow.
Lxml doesn't handle broken HTML well, but it is one or two orders of magnitude faster for parsing, and likewise for querying with XPath.
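A minimal sketch of the lxml parse-then-XPath workflow, assuming lxml is installed; the sample HTML is made up for illustration:

```python
# lxml wraps libxml2, so both parsing and XPath evaluation run in C.
from lxml import html

doc = html.fromstring(
    "<html><body><p class='x'>hello</p><p>world</p></body></html>"
)
# XPath queries are dispatched to libxml2 rather than interpreted in Python.
texts = doc.xpath("//p/text()")
print(texts)  # ['hello', 'world']
```

The speedup over BeautifulSoup comes from keeping the whole tree and the query engine in C; Python only sees the final result list.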
Apart from that, there is also Scrapy, which is widely used, but it is also very slow; its advantage is that it scales horizontally easily.
There are many cases where scraping doesn't involve HTML parsing at all. When you are scraping pages whose structure changes a lot, it can be better to go with full-text search, and there the faster the better. In that area Python is far from the best, except when .split() and .join() are enough. Even re.match is slow, because the backtracking algorithm it uses is slow.
And to finish, Requests is also super slow; if you want something fast you have to use pycurl.
Does Scrapy's slow speed actually matter much? Your main bottleneck is always going to be network calls and rate limiting. I don't know how much optimization can help there.
Libxml is pretty slow (lxml uses it). Selectolax is about 5 times faster for simple CSS queries; it is basically a thin wrapper around a well-optimized HTML parser written in C.
I was unaware, I always use bs4 with lxml for parsing xml just because I like the interface. For what I'm doing, the bottleneck is the remote system/network, so it doesn't really matter. But now I'm curious about which parts are slower and why. Maybe I'll run some experiments later.