BeautifulSoup is great if you don't care about performance at all, because it is painfully slow.
Lxml doesn't handle broken HTML well, but it is one or two orders of magnitude faster for parsing, and likewise for querying with XPath.
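A minimal sketch of the lxml parse-then-XPath workflow, assuming lxml is installed; the sample HTML is made up for illustration:

```python
# lxml wraps libxml2, so both parsing and XPath evaluation run in C.
from lxml import html

doc = html.fromstring(
    "<html><body><p class='x'>hello</p><p>world</p></body></html>"
)
# XPath queries are dispatched to libxml2 rather than interpreted in Python.
texts = doc.xpath("//p/text()")
print(texts)  # ['hello', 'world']
```

The speedup over BeautifulSoup comes from keeping the whole tree and the query engine in C; Python only sees the final result list.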
Apart from that, there is also Scrapy, which is widely used, but it is also very slow; its advantage is that it scales horizontally easily.
There are many cases where scraping doesn't involve HTML parsing at all. When you are scraping pages whose structure changes a lot, it can be better to go with full-text search, and there the faster the better. In that area Python is far from the best, except when .split() and .join() are enough. Even re.match is slow, because the backtracking algorithm it uses is slow.
And to finish, Requests is also super slow; if you want something fast you have to use pycurl.
Does Scrapy's slow speed actually matter much? Your main bottleneck is always going to be network calls and rate limiting. I don't know how much optimization can help there.
Libxml is pretty slow (lxml uses it). Selectolax is about 5 times faster for simple CSS queries; it is basically a thin wrapper around a well-optimized HTML parser written in C.
I was unaware, I always use bs4 with lxml for parsing xml just because I like the interface. For what I'm doing, the bottleneck is the remote system/network, so it doesn't really matter. But now I'm curious about which parts are slower and why. Maybe I'll run some experiments later.