This guide (and most other guides) is missing a massive tip: separate the crawling step (finding URLs and fetching the HTML content) from the scraping step (extracting structured data out of the HTML).
More than once, I've written a scraper that did both of these steps together. Only later did I realize I'd forgotten to extract some information I needed, and then had to do the costly task of re-crawling and scraping everything again.
If you do this in two steps, you can always go back, change the scraper and quickly rerun it on historical data instead of re-crawling everything from scratch.
I've found this to be a good practice for ETL in general. Separate the steps, and save the raw data from "E" if you can because it makes testing and verifying "T" later much easier.
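For illustration, a minimal sketch of that two-step split in Python (the file layout, parser, and field pulled out here are just placeholders):

    # step 1: crawl -- fetch each page once and keep the raw HTML on disk
    import hashlib, pathlib, requests

    RAW_DIR = pathlib.Path("raw_html")
    RAW_DIR.mkdir(exist_ok=True)

    def crawl(urls):
        for url in urls:
            name = hashlib.sha256(url.encode()).hexdigest() + ".html"
            path = RAW_DIR / name
            if not path.exists():                     # never re-fetch a page we already have
                path.write_text(requests.get(url, timeout=30).text)

    # step 2: scrape -- parse the saved HTML; rerun this as often as you like
    from bs4 import BeautifulSoup

    def scrape():
        for path in RAW_DIR.glob("*.html"):
            soup = BeautifulSoup(path.read_text(), "html.parser")
            yield {"title": soup.title.string if soup.title else None}

The crawl step is the only part that touches the network; the scrape step can be rerun against the saved HTML whenever you realize you forgot a field.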
this is what i try to do, but i want to learn more about approaches like this. do you know any good resources on how to design ETL pipelines?
Disclaimer: my previous job had a lot of cases where CSVs were dropped over SFTP, so your mileage may vary... JSON APIs are said to be a different flavor of crazy...
Haven't heard much beyond "ask the Old Ones", but "Murphy's law strikes again", "eventually someone will want that data even though they swore it was unnecessary", "eventually someone will ask for a backfill/replay", "eventually someone will give you a duplicate file", "eventually someone will want to slice-and-dice the data a different way" and "eventually someone will change the schema without telling you" have been some things I have noticed.
Even de-duplicating data is, in a sense, deletion; someone will eventually want to get at the data with the duplicates, e.g. for detecting errors or repeats or fraud or some other analysis that mirrors looking at the bullet holes in World War 2 bombers.
Store the data as close to the original form as you can. Keep a timestamp of when you landed the data. Create a UUID for the record. Create a hash of the record if you can. Create a batch_id if you load multiple things at once (e.g. multiple CSVs). Don't truncate and reload a table - rather, append to it. If you still need something that looks like atomic table changes, I've gotten away with something close: "a view that shows only the most recent valid batch". (Yes this is re-inventing the database wheel, but sometimes you make do with the tools you are forced to use.)
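A rough sketch of that landing pattern in Python/SQLite (table and column names are made up, and the "latest batch" view is just one way to fake the atomic swap):

    import hashlib, sqlite3, uuid
    from datetime import datetime, timezone

    con = sqlite3.connect("landing.db")
    con.execute("""CREATE TABLE IF NOT EXISTS raw_rows (
        record_uuid TEXT, batch_id TEXT, landed_at TEXT,
        source_file TEXT, row_hash TEXT, raw_line TEXT)""")

    def land_csv(path):
        # one batch_id per file; append-only, never truncate-and-reload
        batch_id = str(uuid.uuid4())
        landed_at = datetime.now(timezone.utc).isoformat()
        with open(path) as f:
            for line in f:
                con.execute(
                    "INSERT INTO raw_rows VALUES (?, ?, ?, ?, ?, ?)",
                    (str(uuid.uuid4()), batch_id, landed_at, path,
                     hashlib.sha256(line.encode()).hexdigest(), line.rstrip("\n")))
        con.commit()

    # something that looks like an atomic swap: a view over the most recent batch per source
    con.execute("""CREATE VIEW IF NOT EXISTS latest_rows AS
        SELECT * FROM raw_rows WHERE (source_file, landed_at) IN (
            SELECT source_file, MAX(landed_at) FROM raw_rows GROUP BY source_file)""")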
Someone, somewhere, will hand you a file that does not conform to the established agreement. You want to log that schema change, with a timestamp, so you can complain to them with evidence that they ain't sending you what they used to, and they didn't bother sending you an email beforehand...
They're not going to fix it on your timeline, so you're probably going to end up hacking your code to work a different way... Until, you know, they switch it back...
So, yeah. Log it. Timestamp it. Hash it. UUID it. Don't trust the source system to do it right, because they will eventually change the script on you. Keep notes, and plan in such a way that you have audit logs and can move with agility.
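A bare-bones sketch of the schema-change logging, assuming CSV drops and a hypothetical agreed header (Python):

    import csv, logging
    from datetime import datetime, timezone

    logging.basicConfig(filename="schema_audit.log", level=logging.INFO)

    EXPECTED_HEADER = ["customer_id", "amount", "currency"]   # whatever the source agreed to send

    def check_schema(csv_path):
        with open(csv_path, newline="") as f:
            header = next(csv.reader(f))
        if header != EXPECTED_HEADER:
            # timestamped evidence that the file no longer matches the agreement
            logging.warning("%s schema drift in %s: expected %s, got %s",
                            datetime.now(timezone.utc).isoformat(), csv_path,
                            EXPECTED_HEADER, header)
            return False
        return True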
sound advice. thank you.
In conclusion...
I find that in data engineering, the goal is not to prevent everything; it's to be flexible and prepared to handle lots of change, even silly changes, and to be able to audit it, observe it, maneuver around it, and keep the mean time to resolution low.
I built an ETL pipeline for a government client using just AWS, Node, and Snowflake, all in TypeScript. To cache the data, I store responses in S3: if there's a cached copy available, we use the S3 data; if not, we fetch fresh data. We can also clean out the old cache occasionally with a cron job. Then we do transforms and put the result in Snowflake. Sometimes we need to do transforms before caching the data in S3 (e.g. adding a unique ID to CSV rows), or things like splitting giant CSV files into smaller files that can then be inserted into Snowflake (Snowflake has a 50 MB payload limit). We have alerts, logging, and metadata set up as well in AWS and Snowflake. Most of this comes down to your knowledge of cloud data platforms.
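The gist of the cache-or-fetch step, sketched here in Python/boto3 rather than our TypeScript, with a made-up bucket name and key scheme:

    import boto3, requests
    from botocore.exceptions import ClientError

    s3 = boto3.client("s3")
    BUCKET = "my-etl-cache"   # made-up bucket name

    def fetch(url, key):
        try:
            # cache hit: reuse the response already landed in S3
            return s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
        except ClientError as err:
            if err.response["Error"]["Code"] != "NoSuchKey":
                raise
        # cache miss: fetch fresh data and write it back for next time
        body = requests.get(url, timeout=30).content
        s3.put_object(Bucket=BUCKET, Key=key, Body=body)
        return body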
It's honestly not that difficult to build ETL pipelines from scratch. We're using a ton of different sources with different data formats as well. Using the Serverless framework to set up all the Lambda functions and cron jobs also makes things a lot easier.
i appreciate you sharing all that, but it seems like we might be on similar levels of knowledge/experience. i've been a dev who does a lot of data engineering for 5 years. i'm looking more for best practices and theory about designing the pipeline, how to arrange the order of operations, how to separate each step, logging practices, how to make it reproducible, how to restart when it fails halfway in without going back to the beginning, how many retries, what to do if a step gets stuck in a failed state, how to flag that bad data, etc. so. many. questions. while i build these pipelines.
i have figured out these questions by seeing how more experienced devs do it and on my own, but i want to learn from a book or video series because you can only figure out so much yourself, eventually you need to seek out experts and sometimes the experts around you also figured it out themselves and you need to find an expert outside of your circle. unfortunately a lot of the "ETL experts" teaching stuff online are trying to sell me on prefect or airflow or snowflake etc
I wish I did. I currently work at a startup with our core offering being ETL, so I've learned along the way as we've continued. If anyone has any, I'd love to hear as well.
Keeping raw data when possible has been huge. We keep some in our codebase for quick tests during development and then we keep raws from production runs that we can evaluate with each change, giving us an idea of the production impact of the change.
There's quite a bit of new tooling in this space; selecting the right one is going to depend on your needs, then you can spike from there. Check out Prefect, Dagster, Windmill, and Airbyte (although that last one is more ELT than ETL).
I realise from working at a few places that this isn't entirely common practice, but when we built the data warehouse at a startup I worked at, we engaged a consultancy who taught us the fundamentals of how to do it properly.
One of those fundamentals was separating out the steps of landing the data vs subsequent normalisation and transformation steps.
It's unfortunate that "ETL" stuck in mindshare, as afaik almost all use cases are better with "ELT"
I.e. first preserve your raw upstream via a 1:1 copy, then transform/materialize as makes sense for you, before consuming
Which makes sense, as ELT models are essentially agile for data... (solution for not knowing what we don't yet know)
I think ETL is right from the perspective where E refers to “from the source of data” and L refers to “to the ultimate store of data”.
But the ETL functionality should itself live in a (sub)system that has its own logical datastore (which may or may not be physically separate from the destination store), and things should be ELT where the L is with respect to that store. So, it's E(LTE)L, in a sense.
For those confused as to whether ETL or ELT is ultimately more appropriate for you… almost everyone is really just doing ETLTLTLT or ELTLTLTL anyways. The distinction is really moot.
Maybe my understanding is incorrect, but here's an expansion on the distinction.
Assumptions -- We're talking about two separate systems (source and destination) with non-negligible transfer time (although perhaps "quick")
ETL -- Performing the transform before/during the load, such that fields in the destination are not guaranteed to have existed in the source (i.e. 2 db model)
ELT -- Performing a 1:1 copy of source into an intermediary table/db (albeit perhaps with filtering), then performing a transform on the intermediary table/db to generate the destination table/db (either realized or materialized at query time), with the intermediary table/database history retained (i.e. 3 table/db model)
The short distinction: if regenerating or altering the destination is required, ETL relies on history being available in the upstream source.
ELT pulls control of that to the destination-owner, as they're retaining the raw data on their side.
This is how I do it.
I send the URLs I want scraped to Urlbox[0]; it renders the pages and saves the HTML (plus a screenshot and metadata) to my S3 bucket[1]. I get a webhook[2] when it's ready for me to process.
I prefer to use Ruby, so Nokogiri[3] is the tool I use for the scraping step.
This has been particularly useful when I've wanted to scrape some pages live from a web app and don't want to manage running Puppeteer or Playwright in production.
Disclosure: I work on Urlbox now but I also did this in the five years I was a customer before joining the team.
[0]: https://urlbox.com
[1]: https://urlbox.com/s3
[2]: https://urlbox.com/webhooks
[3]: https://nokogiri.org
Does it save the whole page or just the viewport? I just checked the landing page and it looks targeted at the specific case of saving “screenshots”, which is also obvious from the limitations on the pricing page, so wouldn't it be unfeasible for larger projects?
Urlbox will save the whole page.
Its primary purpose is to render screenshots, either full-page or limited to the viewport or an element. To do that as well as it does, the HTML has to be rendered perfectly first.
It's not as cheap as other solutions, but we have customers who render millions of pages per month with us. They value the accuracy and reliability that have come from over a decade of refinements to the service.
Larger projects can request preferential pricing based on the specifics of the kinds of pages they are rendering.
What I find most effective is to wrap `get` with a local cache; it's the first thing I write when I start a web crawling project. That way, from the very beginning, even when I'm just exploring and experimenting, every page only gets downloaded to my machine once. I don't end up accidentally bothering the server too much, and I don't have to re-crawl if I make a mistake in my code.
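Something like this, as a bare-bones version in Python (the on-disk layout is arbitrary):

    import hashlib, pathlib, requests

    CACHE = pathlib.Path(".http_cache")
    CACHE.mkdir(exist_ok=True)

    def get(url):
        # each URL is downloaded at most once per machine
        path = CACHE / hashlib.sha256(url.encode()).hexdigest()
        if path.exists():
            return path.read_bytes()
        body = requests.get(url, timeout=30).content
        path.write_bytes(body)
        return body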
requests-cache [0] is an easy way to do this if you're using the requests package in Python. You can patch requests globally with install_cache(), e.g.:
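    import requests, requests_cache

    requests_cache.install_cache("crawl_cache")   # cache name is arbitrary; creates crawl_cache.sqlite
    requests.get("https://example.com")           # first call hits the network
    requests.get("https://example.com")           # repeat calls are answered from the cache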
and responses will be stored in a local SQLite file.

[0] https://requests-cache.readthedocs.io/en/stable/
I've found this approach works really well using JavaScript and puppeteer for the first stage, and then Python for the second stage (the re module for regular expressions is nice here IMO).
JS/Puppeteer seems a bit easier for things like the rotating of user agents mentioned in the article.
If you're using JS in the first step just because you need Puppeteer, check out Playwright. It's what the original authors of Puppeteer are working on now, and it's been more actively developed in the past few years. It's very similar in usage and features, and it also has an official Python package.
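Fetching rendered HTML with the Python package looks roughly like this (sync API, placeholder URL):

    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto("https://example.com")     # placeholder URL
        html = page.content()                # fully rendered HTML for the scraping step
        browser.close()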
An easy way to do this that I've used is to cache web requests. That way, I can run the part of the code that gets the data again with, say, a modification to grab data from additional URLs, without unnecessarily re-fetching my existing URLs. With this method, I don't need to modify existing code either; best of both worlds.
For this I've used the requests-cache lib.
Looks like a really useful library - thanks for the tip.
It applies to many other projects too: cling on to the raw data as long as it isn't bogging you down too much.
Generally it's enough to archive the retrieved HTML just in case.
Although in general I like the idea of a queue for a scraper to access separately, another option - assuming you have the storage and bandwidth - is to capture and store every requested page, which lets you replay the extraction step later.
Yes!
My Clojure scraping framework [0] facilitates that kind of workflow, and I've been using it to scrape/restructure massive sites (millions of pages). I guess I'm going to write a blog post about scraping with it at scale. Although it doesn't really scale much above that (it's meant for single-machine loads at the moment), it could be enhanced to support larger workloads rather easily.
[0]: https://github.com/nathell/skyscraper
this is the way.
Can confirm. A few discrete scripts each focused on one part of the process can make the whole thing run seamlessly async, and you naturally end up storing the pages for processing by subsequent scripts. Especially if you write a dedicated downloader - then you can really go nuts optimizing and randomizing the download parameters for each individual link in the queue. "Do one thing and do it well" FTW.
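A rough sketch of what such a dedicated downloader can look like in Python (the UA pool, delay range, and file naming are all arbitrary choices):

    import hashlib, pathlib, queue, random, threading, time
    import requests

    USER_AGENTS = ["ua-string-1", "ua-string-2"]   # stand-ins for a real pool of UA strings
    OUT = pathlib.Path("pages")
    OUT.mkdir(exist_ok=True)

    def download_worker(q):
        # pull a link from the queue, randomize the request parameters, save the raw HTML
        while True:
            url = q.get()
            time.sleep(random.uniform(1, 5))       # randomized politeness delay
            resp = requests.get(url, timeout=30,
                                headers={"User-Agent": random.choice(USER_AGENTS)})
            name = hashlib.sha256(url.encode()).hexdigest() + ".html"
            (OUT / name).write_text(resp.text)
            q.task_done()

    q = queue.Queue()
    for url in ["https://example.com/a", "https://example.com/b"]:   # placeholder links
        q.put(url)
    threading.Thread(target=download_worker, args=(q,), daemon=True).start()
    q.join()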
I talked about exactly that at a conference in 2022 (free to watch): https://youtu.be/b0lAd-KEUWg?feature=shared
if you're using requests in python, requests-cache does exactly this for you, saving the data to an sqlite db, and is compatible with your code using requests.
Yes!! https://beepb00p.xyz/unnecessary-db.html really changed how I think about data manipulation, mostly with this principle.
The problem is that crawling is generally optimized using info you find in the page (e.g. which links are worth following next).