Nice list, but what would be the arguments for switching over from other libraries? I’ve built my own crawler over time, but from what I see, there’s nothing truly unique.
How is this different from Scrapy?
hey intev,
- Crawlee has out-of-the-box support for headless browser crawling (Playwright). You don't have to install any plugin or set up the middleware.
- Crawlee has a minimalistic & elegant interface: set up your scraper in fewer than 10 lines of code (see the sketch below). You don't have to care about which middleware or settings need to be changed, and on top of that we also have templates, which make the learning curve much smaller.
- Complete type hint coverage, which is something Scrapy hasn't finished yet.
- Based on standard asyncio. Integrating Scrapy into a classic asyncio app requires bridging Twisted and asyncio, which is possible, but not easy, and can cause trouble.
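To make the "fewer than 10 lines" point concrete, here's a minimal sketch roughly following the getting-started example (exact import paths may differ between versions):

```python
import asyncio

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    crawler = BeautifulSoupCrawler(max_requests_per_crawl=10)

    @crawler.router.default_handler
    async def handler(context: BeautifulSoupCrawlingContext) -> None:
        # Grab the page title and store it in the default dataset.
        await context.push_data({
            'url': context.request.url,
            'title': context.soup.title.string if context.soup.title else None,
        })

    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())
```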
You don't have to install any plugin or set up the middleware.
That cuts both ways, in true 80/20 fashion: it also means that anyone who isn't on the happy path of the way that crawlee was designed is going to have to edit your python files (`pip install -e` type business) to achieve their goals
I've been working on a crawler recently and honestly you need the flexibility middleware gives you. You can only get so far with reasonable defaults; crawling isn't a one-size-fits-all kind of thing.
Crawlee isn’t any less configurable than Scrapy. It just uses different, in my personal opinion more approachable, patterns. It makes it easier to start with, but you can tweak whatever you want. Btw, you can add middleware in the Crawlee Router (rough sketch below).
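For example, a rough sketch of per-page-type handling via the Router (hedged; the selector, label, and URL are made up for illustration, and the `enqueue_links` parameters follow the documented pattern):

```python
import asyncio

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext

crawler = BeautifulSoupCrawler()


@crawler.router.default_handler
async def default_handler(context: BeautifulSoupCrawlingContext) -> None:
    # Entry point: enqueue category pages and tag them with a label
    # so a dedicated handler picks them up.
    await context.enqueue_links(selector='a.category', label='CATEGORY')


@crawler.router.handler('CATEGORY')
async def category_handler(context: BeautifulSoupCrawlingContext) -> None:
    # Per-label handler: roughly where the per-request logic you'd otherwise
    # put into middleware can live.
    await context.push_data({'url': context.request.url})


asyncio.run(crawler.run(['https://example.com']))
```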
Crawlee isn’t any less configurable than Scrapy.
Oh, then I have obviously overlooked how one would be able to determine whether a proxy has been blocked and evict it from the pool <https://github.com/rejoiceinhope/scrapy-proxy-pool/blob/b833...> . Or how to use an HTTP cache independent of the "browser" cache (e.g. to allow short-circuiting the actual request if I can prove it is not stale for my needs, which enables recrawls to fix logic bugs, or even downloading the actual request-response payloads for making better tests) https://docs.scrapy.org/en/2.11/topics/downloader-middleware...
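For reference, this is the kind of thing I mean. In Scrapy, the built-in HTTP cache is a handful of settings (values below are illustrative):

```python
# settings.py -- enable Scrapy's built-in HTTP cache so recrawls can be served
# from disk instead of re-hitting the site.
HTTPCACHE_ENABLED = True
HTTPCACHE_DIR = 'httpcache'
HTTPCACHE_EXPIRATION_SECS = 60 * 60 * 24  # treat cached responses as fresh for a day
HTTPCACHE_POLICY = 'scrapy.extensions.httpcache.DummyPolicy'
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
```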
Unless you meant what I said about "pip install -e && echo glhf" in which case, yes, it's a "simple matter of programming" into a framework that was not designed to be extended
Cache management is also what I had in mind. I've been using golang+colly and the default caching behavior is just different enough from what I need. I haven't written a custom cache middleware, but I'm getting to that point.
You'll want to prioritize documenting the existing features, since it's no good having a super awesome full stack web scraping platform if only you can use it. I ordinarily would default to a "read the source" response but your cutesy coding style makes that a non-starter
As a concrete example: command-f for "tier" on https://crawlee.dev/python/docs/guides/proxy-management and tell me how anyone could possibly know what `tiered_proxy_urls: list[list[str]] | None = None` should contain and why?
Sorry about the confusion. Some features, like the tiered proxies, are not documented properly. You’re absolutely right. Updates will come soon.
We wanted to have as many features in the initial release as possible, because we have a local Python community conference coming up tomorrow and we wanted to have the library ready for that.
More docs will come soon. I promise. And thanks for the shout.
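In the meantime, the rough idea behind tiered proxies: each inner list is one tier, ordered from cheapest to most reliable, and the crawler escalates to a higher tier when it detects blocking. A quick sketch with placeholder proxy URLs:

```python
from crawlee.proxy_configuration import ProxyConfiguration

# Each inner list is one tier; the crawler starts with the cheapest tier and
# moves to a higher one for domains where it keeps getting blocked.
proxy_configuration = ProxyConfiguration(
    tiered_proxy_urls=[
        ['http://cheap-datacenter-proxy.example.com:8000'],    # tier 0
        ['http://pricey-residential-proxy.example.com:8000'],  # tier 1
    ],
)

# Pass it to a crawler, e.g. BeautifulSoupCrawler(proxy_configuration=proxy_configuration).
```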
I literally had to go through the entire codebase; the documentation is that lacking. It's boring to document, but IMO it's the lowest-hanging fruit to get people moving down that Crawlee -> Apify funnel.
Can you link to some of the cutesy code? I've never heard this before.
but your cutesy coding style makes that a non-starter
I don't think this is fair. The code looks pretty readable to me.
in one sentence, what does this do that existing web scraping and browser automation tools don't?
In one word. Nothing.
But I personally think it does some things a little easier, a little faster, and a little more conveniently than the other libraries and tools out there.
There's one thing the JS version of Crawlee has which unfortunately isn't in Python yet, but it will be there soon. AFAIK it's unique among all libraries: it automatically detects whether a headless browser is needed or whether plain HTTP will suffice, and uses the more performant option.
is there anything that uses a computer vision model/ocr locally to extract data?
I find some dynamic sites purposefully make it extremely difficult to parse and they obfuscate the XHR calls to their API
I've also seen some websites pollute the data when they detect scraping, which results in garbage data, but you don't know until it's verified.
We tried a self hosted OCR model a few years ago, but the quality and speed wasn’t great. From experience, it’s usually better to reverse engineer the APIs. The more complicated they are, the less they change. So it can sometimes be painful to set up the scrapers, but once they work, they tend to be more stable than other methods.
Data pollution is real. Location-specific results, personalized results, A/B testing and, my favorite, badly implemented websites are real as well.
When you encounter this, you can try scraping the data from different locations, with various tokens, cookies, referrers etc. and often you can find a pattern to make the data consistent. Websites hate scraping, but they hate showing wrong data to human users even more. So if you resemble a legit user, you’ll most likely get correct data. But of course, there are exceptions.
I found Crawlee a few days ago while figuring out a stack for a project. I wanted a Python library, but found Crawlee with TypeScript so much easier that I ended up coding the entire project in less than a week with TypeScript + Crawlee + Playwright.
I found the API a lot better than any Python scraping API to date. However, I am tempted to try out Python with Crawlee.
The Playwright integration with gotScraping makes the entire programming experience a breeze. My crawling and scraping involve all kinds of frontend-rendered websites with a lot of modified XHR responses to be captured. And IT JUST WORKS!
Thanks a ton. I will definitely use the Apify platform to scale, given the integration.
would love to have your feedback on the python one too :)
How does it compare to Selenium for Python?
Can this work on intranet sites like SharePoint or Confluence, which require employee SSO?
I was trying to build a small LangChain-based RAG on internal documents, but getting the documents from SharePoint/Confluence (we have both) is very painful.
Technically it can. You can log in with the PlaywrightCrawler class without issue. The question is if there’s 2FA as well and how that’s handled. Crawlee does not have any abstraction for handling 2FA as it depends a lot on what verification options are supported on the SSO side. So that part would need a custom implementation within Crawlee.
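Here's a rough sketch of what a scripted login could look like inside a Playwright handler (the URL, selectors, and credentials are placeholders; a real SSO flow, especially with 2FA, will need more than this):

```python
import asyncio

from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext

crawler = PlaywrightCrawler()


@crawler.router.default_handler
async def handler(context: PlaywrightCrawlingContext) -> None:
    page = context.page  # a regular Playwright page

    # Placeholder selectors: adjust to whatever the SSO login form actually uses.
    if await page.locator('#username').count() > 0:
        await page.fill('#username', 'employee@example.com')
        await page.fill('#password', 'secret')
        await page.click('button[type="submit"]')
        await page.wait_for_load_state('networkidle')

    await context.push_data({'url': page.url, 'title': await page.title()})


asyncio.run(crawler.run(['https://intranet.example.com']))
```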
For this use case, you might use this ready-made Actor: https://apify.com/apify/website-content-crawler
I wonder if there are any AI tools that do web scraping for you without having to write any code?
Which site would you like to scrape?
Can it be used to obtain RSS contents? Most of the examples focus on HTML.
I didn't try it, but I don't see a reason why not:
- RSS feeds are transferred via HTTP.
- BeautifulSoup can parse both HTML and XML (RSS uses the XML format).
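Something along these lines should work (untested sketch; the feed URL is a placeholder, and the 'xml' parser needs lxml installed):

```python
import asyncio

import httpx
from bs4 import BeautifulSoup


async def fetch_feed(url: str) -> list[dict[str, str | None]]:
    # Fetch the feed over plain HTTP and parse it as XML.
    async with httpx.AsyncClient() as client:
        response = await client.get(url)
        response.raise_for_status()

    soup = BeautifulSoup(response.text, 'xml')  # the 'xml' parser requires lxml
    return [
        {
            'title': item.title.text if item.title else None,
            'link': item.link.text if item.link else None,
        }
        for item in soup.find_all('item')
    ]


if __name__ == '__main__':
    print(asyncio.run(fetch_feed('https://example.com/feed.xml')))
```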
Do you have any plans to monetize this? How are you supporting development?
Crawlee is open source and free to use, and we don't have any plans to monetize it in the future. It will always be free to use.
We provide the Apify platform for publishing your scrapers as Actors for the developer community, and developers earn money through it. You can use Crawlee for Python there as well :)
tl;dr: Crawlee is and always will be free to use and open source.
Does it have event listeners to wait for specific elements based on certain pattern matches? One reason I am still using PhantomJS is because it simulates the entire browser and you can compile your own WebKit in it.
It uses Playwright under the hood, so yes, it can do all of that, and more.
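For example, inside a Playwright-based handler you can wait for elements through the underlying Playwright page (the selector below is a placeholder):

```python
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext

crawler = PlaywrightCrawler()


@crawler.router.default_handler
async def handler(context: PlaywrightCrawlingContext) -> None:
    # Block until an element matching the (placeholder) selector appears,
    # then read its text -- this is plain Playwright API on context.page.
    element = await context.page.wait_for_selector('.result-row', timeout=10_000)
    await context.push_data({'first_row': await element.inner_text()})
```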
Wanted to say thanks for apify/crawlee. I'm a long-time node.js user and your library has worked better than all the others I've tried.
Thank you!
Can you use this to auto-logon to systems?
For sure, simply store cookies after login and then use them to initiate the crawl.
Looks nice, and modern Python.
The code example on the front page has this:
`const data = await crawler.get_data()`
That looks like Javascript? Is there a missing underscore?
Oh wow, thanks! Will fix it right away. Crawlee is originally a JS library.
Does it have support for web scraping opt-out protocols, such as robots.txt, HTTP headers, and content tags? These are getting more important now, especially in the EU after the DSM directive.
Not yet, but it’s on the roadmap
I have been running my project with selenium for some time.
Now I am using Crawlee. Thanks. I will work on integrating it better into my project; however, I can already tell it works flawlessly.
My project, with crawlee: https://github.com/rumca-js/Django-link-archive
I don't really understand it. Tried it on some fund site and it didn't really do much besides apparently grepping for links.
The example should show how to literally find and target all the data, as in .csv/.xlsx files, tables, etc., and actually download it.
Anyone can use requests and just get the text and grep for URLs. I don't get it.
Remember: pick an example where you need to parse one thing to get 1000s of other things to then hit some other endpoints to then get the 3-5 things at each of those. Any example that doesn't look like that is not going to impress anyone.
I'm not even clear if this is saying it's a framework or actually some automation tool. Automation meaning it actually autodetects where to look.
I'd suggest bringing more code snippets from the test cases to documentation as examples.
Nice work though.
Pretty cool, and any scraping tool is really welcome - I'll try it out for my personal project. At the moment, due to AI, scraping is like selling shovels during a gold rush.
The main advantage (for now) is that the library has a single interface for both HTTP and headless browsers, plus bundled autoscaling. You can write your crawlers using the same base abstraction, and the framework takes care of the heavy lifting (see the sketch below). Developers of scrapers shouldn't need to reinvent the wheel and can just focus on building the "business" logic of their scrapers. Having said that, if you wrote your own crawling library, the motivation to use Crawlee might be lower, and that's fair enough.
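To illustrate the "same base abstraction" point, a rough sketch: switching from plain HTTP to a headless browser mostly means swapping the crawler class, while the handler/router pattern stays the same (import paths may differ slightly between versions):

```python
import asyncio

from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext
# For HTTP-only crawling, swap in:
# from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    crawler = PlaywrightCrawler()  # or BeautifulSoupCrawler() for plain HTTP

    @crawler.router.default_handler
    async def handler(context: PlaywrightCrawlingContext) -> None:
        # The handler/router/dataset pattern is the same regardless of crawler type.
        await context.push_data({'url': context.request.url})
        await context.enqueue_links()

    await crawler.run(['https://crawlee.dev'])


asyncio.run(main())
```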
Please note that this is the first release, and we'll keep adding many more features as we go, including anti-blocking, adaptive crawling, etc. To see where this might go, check https://github.com/apify/crawlee
Can I ask - what is anti-blocking?
Usually refers to “evading bot detection”.
Detecting when blocked and switching proxy/“browser fingerprint”.
Is this a good feature to include? Shouldn't we respect the host's settings on this?
It’s a fair and totally reasonable question, but it clashes with reality. Many hosts have data that others want/like to scrape (eBay, Amazon, Google, airlines, etc.), and they set up anti-scraping mechanisms to try to prevent scraping. Whether or not to respect those desires is a bigger question, but not one for the scraping library - it’s one for those doing the scraping and their lawyers.
The fact is that many, many people want to scrape these sites and there is massive demand for tools to help them do that, so if Apify/Crawlee decide to take the moral high ground and not offer a way around bot detection, someone else will.
Ah yes, the old 'if I don't build the bombs for them, someone else will'. I don't think this is taking the moral high ground, this is saying we don't care whether it's moral, there's demand and we'll build it.
There are many legitimate and legal use cases where one might want to circumvent blocking of bots. We believe that everyone has the moral right to access and fairly use non-personal publicly available data on the web the way they want, not just the way the publishers want them to. This is the core founding principle of the open web, which allowed the web to become what it is today.
BTW we continuously update this exhaustive post covering all legal aspects of web scraping: https://blog.apify.com/is-web-scraping-legal/
Thoughts on this law? https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX%3A...
It’s an “old” law that did not consider many intricacies of the internet and the platforms that exist on it, and it’s mostly been made obsolete by EU case law, which has shrunk the definition of a protected database under this law so much that it’s practically inapplicable to web scraping.
(Not my opinion. I visited a major global law firm’s seminar on this topic a month ago and this is what they said.)
Google and Amazon were built on scraped data, who are you kidding?
There's a bidirectional benefit to Google at least. That's why SEO exists. People want to appear in search results.
I make sure to enroll in projects which scrape Google/Amazon en masse, just for the satisfaction.
I'm not gonna feel bad if a corporation gets its data scraped (whenever it's legal to do so, and that's another kind of question I'm not knowledgeable enough to tackle) when they themselves try to scrape other companies' data.
You seem to have a massive category error here. To my understanding, this is not only going to circumvent the scraping protection of companies that scrape other people's data.