This guide (and most other guides) is missing a massive tip: separate the crawling step (finding URLs and fetching the HTML content) from the scraping step (extracting structured data out of the HTML).
More than once, I've written a scraper that did both of these steps together. Only later did I realize I'd forgotten to extract some information I needed, and then had to do the costly task of re-crawling and scraping everything again.
If you do this in two steps, you can always go back, change the scraper and quickly rerun it on historical data instead of re-crawling everything from scratch.
I've found this to be a good practice for ETL in general. Separate the steps, and save the raw data from "E" if you can because it makes testing and verifying "T" later much easier.
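For illustration, a minimal sketch of that two-step split in Python (the file layout, parser, and field pulled out here are just placeholders):

    # step 1: crawl -- fetch each page once and keep the raw HTML on disk
    import hashlib, pathlib, requests

    RAW_DIR = pathlib.Path("raw_html")
    RAW_DIR.mkdir(exist_ok=True)

    def crawl(urls):
        for url in urls:
            name = hashlib.sha256(url.encode()).hexdigest() + ".html"
            path = RAW_DIR / name
            if not path.exists():                     # never re-fetch a page we already have
                path.write_text(requests.get(url, timeout=30).text)

    # step 2: scrape -- parse the saved HTML; rerun this as often as you like
    from bs4 import BeautifulSoup

    def scrape():
        for path in RAW_DIR.glob("*.html"):
            soup = BeautifulSoup(path.read_text(), "html.parser")
            yield {"title": soup.title.string if soup.title else None}

The crawl step is the only part that touches the network; the scrape step can be rerun against the saved HTML whenever you realize you forgot a field.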
this is what i try to do, but i want to learn more about approaches like this. do you know any good resources on how to design ETL pipelines?
Disclaimer: my previous job had a lot of cases where CSVs were dropped over SFTP, so your mileage may vary... JSON APIs are said to be a different flavor of crazy...
Haven't heard much beyond "ask the Old Ones", but "Murphy's law strikes again", "eventually someone will want that data even though they swore it was unnecessary", "eventually someone will ask for a backfill/replay", "eventually someone will give you a duplicate file", "eventually someone will want to slice-and-dice the data a different way" and "eventually someone will change the schema without telling you" have been some things I have noticed.
Even de-duplicating data is, in a sense, deletion; someone will eventually want to get at the data with the duplicates, e.g. for detecting errors or repeats or fraud or some other analysis that mirrors looking at the bullet holes in World War 2 bombers.
Store the data as close to the original form as you can. Keep a timestamp of when you landed the data. Create a UUID for the record. Create a hash of the record if you can. Create a batch_id if you load multiple things at once (e.g. multiple CSVs). Don't truncate and reload a table - rather, append to it. If you still need something that looks like atomic table changes, I've gotten away with something close: "a view that shows only the most recent valid batch". (Yes this is re-inventing the database wheel, but sometimes you make do with the tools you are forced to use.)
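A rough sketch of that landing pattern in Python/SQLite (table and column names are made up, and the "latest batch" view is just one way to fake the atomic swap):

    import hashlib, sqlite3, uuid
    from datetime import datetime, timezone

    con = sqlite3.connect("landing.db")
    con.execute("""CREATE TABLE IF NOT EXISTS raw_rows (
        record_uuid TEXT, batch_id TEXT, landed_at TEXT,
        source_file TEXT, row_hash TEXT, raw_line TEXT)""")

    def land_csv(path):
        # one batch_id per file; append-only, never truncate-and-reload
        batch_id = str(uuid.uuid4())
        landed_at = datetime.now(timezone.utc).isoformat()
        with open(path) as f:
            for line in f:
                con.execute(
                    "INSERT INTO raw_rows VALUES (?, ?, ?, ?, ?, ?)",
                    (str(uuid.uuid4()), batch_id, landed_at, path,
                     hashlib.sha256(line.encode()).hexdigest(), line.rstrip("\n")))
        con.commit()

    # something that looks like an atomic swap: a view over the most recent batch per source
    con.execute("""CREATE VIEW IF NOT EXISTS latest_rows AS
        SELECT * FROM raw_rows WHERE (source_file, landed_at) IN (
            SELECT source_file, MAX(landed_at) FROM raw_rows GROUP BY source_file)""")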
Someone, somewhere, will hand you a file that does not conform to the established agreement. You want to log that schema change, with a timestamp, so you can complain to them with evidence that they ain't sending you what they used to, and they didn't bother sending you an email beforehand...
They're not going to fix it on your timeline, so you're probably going to end up hacking your code to work a different way... Until, you know, they switch it back...
So, yeah. Log it. Timestamp it. Hash it. UUID it. Don't trust the source system to do it right, because they will eventually change the script on you. Keep notes, and plan in such a way that you have audit logs and can move with agility.
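A bare-bones sketch of the schema-change logging, assuming CSV drops and a hypothetical agreed header (Python):

    import csv, logging
    from datetime import datetime, timezone

    logging.basicConfig(filename="schema_audit.log", level=logging.INFO)

    EXPECTED_HEADER = ["customer_id", "amount", "currency"]   # whatever the source agreed to send

    def check_schema(csv_path):
        with open(csv_path, newline="") as f:
            header = next(csv.reader(f))
        if header != EXPECTED_HEADER:
            # timestamped evidence that the file no longer matches the agreement
            logging.warning("%s schema drift in %s: expected %s, got %s",
                            datetime.now(timezone.utc).isoformat(), csv_path,
                            EXPECTED_HEADER, header)
            return False
        return True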
sound advice. thank you.
In conclusion...
I find that in data engineering, the goal is not to prevent everything; it's to be flexible and prepared to handle lots of change, even silly changes, and to be able to audit it, observe it, maneuver around it, and keep the mean time to resolution low.
I built an ETL pipeline for a government client using just AWS, Node, and Snowflake, all in TypeScript. To cache the data, I store responses in S3: if there's a cached copy available, we use the S3 data; if not, we fetch fresh data. We can also clean out the old cache occasionally with a cron job. Then we do transforms and put the result in Snowflake. Sometimes we need to do transforms before caching the data in S3 (e.g. adding a unique ID to CSV rows), or things like splitting giant CSV files into smaller files that can then be inserted into Snowflake (Snowflake has a 50 MB payload limit). We have alerts, logging, and metadata set up as well in AWS and Snowflake. Most of this comes down to your knowledge of cloud data platforms.
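The gist of the cache-or-fetch step, sketched here in Python/boto3 rather than our TypeScript, with a made-up bucket name and key scheme:

    import boto3, requests
    from botocore.exceptions import ClientError

    s3 = boto3.client("s3")
    BUCKET = "my-etl-cache"   # made-up bucket name

    def fetch(url, key):
        try:
            # cache hit: reuse the response already landed in S3
            return s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
        except ClientError as err:
            if err.response["Error"]["Code"] != "NoSuchKey":
                raise
        # cache miss: fetch fresh data and write it back for next time
        body = requests.get(url, timeout=30).content
        s3.put_object(Bucket=BUCKET, Key=key, Body=body)
        return body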
It's honestly not that difficult to build ETL pipelines from scratch. We're using a ton of different sources with different data formats as well. Using the Serverless framework to set up all the Lambda functions and cron jobs also makes things a lot easier.
i appreciate you sharing all that, but it seems like we might be on similar levels of knowledge/experience. i've been a dev who does a lot of data engineering for 5 years. i'm looking more for best practices and theory about designing the pipeline, how to arrange the order of operations, how to separate each step, logging practices, how to make it reproducible, how to restart when it fails halfway in without going back to the beginning, how many retries, what to do if a step gets stuck in a failed state, how to flag that bad data, etc. so. many. questions. while i build these pipelines.
i have figured out these questions by seeing how more experienced devs do it and on my own, but i want to learn from a book or video series because you can only figure out so much yourself, eventually you need to seek out experts and sometimes the experts around you also figured it out themselves and you need to find an expert outside of your circle. unfortunately a lot of the "ETL experts" teaching stuff online are trying to sell me on prefect or airflow or snowflake etc
I wish I did. I currently work at a startup with our core offering being ETL, so I've learned along the way as we've continued. If anyone has any, I'd love to hear as well.
Keeping raw data when possible has been huge. We keep some in our codebase for quick tests during development and then we keep raws from production runs that we can evaluate with each change, giving us an idea of the production impact of the change.
There's quite a bit of new tooling in this space; selecting the right one is going to depend on your needs, then you can spike from there. Check out Prefect, Dagster, Windmill, and Airbyte (although that last one is more ELT than ETL).
I realise from working at a few places that this isn't entirely common practice, but when we built the data warehouse at a startup I worked at, we engaged a consultancy who taught us the fundamentals of how to do it properly.
One of those fundamentals was separating out the steps of landing the data vs subsequent normalisation and transformation steps.
It's unfortunate that "ETL" stuck in mindshare, as afaik almost all use cases are better with "ELT"
I.e. first preserve your raw upstream via a 1:1 copy, then transform/materialize as makes sense for you, before consuming
Which makes sense, as ELT models are essentially agile for data... (solution for not knowing what we don't yet know)
I think ETL is right from the perspective where E refers to “from the source of data” and L refers to “to the ultimate store of data”.
But the ETL functionality should itself live in a (sub)system that has its own logical datastore (which may or may not be physically separate from the destination store), and things should be ELT where the L is with respect to that store. So, it's E(LTE)L, in a sense.
For those confused as to whether ETL or ELT is ultimately more appropriate for you… almost everyone is really just doing ETLTLTLT or ELTLTLTL anyways. The distinction is really moot.
Maybe my understanding is incorrect, but here's an expansion on the distinction.
Assumptions -- We're talking about two separate systems (source and destination) with non-negligible transfer time (although perhaps "quick")
ETL -- Performing the transform before/during the load, such that fields in the destination are not guaranteed to have existed in the source (i.e. 2 db model)
ELT -- Performing a 1:1 copy of source into an intermediary table/db (albeit perhaps with filtering), then performing a transform on the intermediary table/db to generate the destination table/db (either realized or materialized at query time), with the intermediary table/database history retained (i.e. 3 table/db model)
The short distinction: if regenerating or altering the destination is required, ETL relies on history being available in the upstream source.
ELT pulls control of that to the destination-owner, as they're retaining the raw data on their side.
This is how I do it.
I send the URLs I want scraped to Urlbox[0]; it renders the pages and saves the HTML (plus a screenshot and metadata) to my S3 bucket[1]. I get a webhook[2] when it's ready for me to process.
I prefer to use Ruby, so Nokogiri[3] is the tool I use for the scraping step.
This has been particularly useful when I've wanted to scrape some pages live from a web app and don't want to manage running Puppeteer or Playwright in production.
Disclosure: I work on Urlbox now but I also did this in the five years I was a customer before joining the team.
[0]: https://urlbox.com
[1]: https://urlbox.com/s3
[2]: https://urlbox.com/webhooks
[3]: https://nokogiri.org
Does it save the whole page or just the viewport? I just checked the landing page and it looks targeted at the specific case of saving “screenshots”, which is also obvious from the limitations on the pricing page, so wouldn't it be unfeasible for larger projects?
Urlbox will save the whole page.
Its primary purpose is to render screenshots, either full-page or limited to the viewport or an element. To do that as well as it does, the HTML has to be rendered perfectly first.
It's not as cheap as other solutions, but we have customers who render millions of pages per month with us. They value the accuracy and reliability that have come from over a decade of refinements to the service.
Larger projects can request preferential pricing based on the specifics of the kinds of pages they are rendering.
What I find most effective is to wrap `get` with a local cache; it's the first thing I write when I start a web crawling project. That way, from the very beginning, even when I'm just exploring and experimenting, every page only gets downloaded to my machine once. I don't end up accidentally bothering the server too much, and I don't have to re-crawl if I make a mistake in my code.
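Something like this, as a bare-bones version in Python (the on-disk layout is arbitrary):

    import hashlib, pathlib, requests

    CACHE = pathlib.Path(".http_cache")
    CACHE.mkdir(exist_ok=True)

    def get(url):
        # each URL is downloaded at most once per machine
        path = CACHE / hashlib.sha256(url.encode()).hexdigest()
        if path.exists():
            return path.read_bytes()
        body = requests.get(url, timeout=30).content
        path.write_bytes(body)
        return body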
requests-cache [0] is an easy way to do this if you're using the requests package in Python. You can patch requests globally with install_cache(), e.g.:
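    import requests, requests_cache

    requests_cache.install_cache("crawl_cache")   # cache name is arbitrary; creates crawl_cache.sqlite
    requests.get("https://example.com")           # first call hits the network
    requests.get("https://example.com")           # repeat calls are answered from the cache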
and responses will be stored in a local SQLite file.

[0] https://requests-cache.readthedocs.io/en/stable/
I've found this approach works really well using JavaScript and puppeteer for the first stage, and then Python for the second stage (the re module for regular expressions is nice here IMO).
JS/Puppeteer seems a bit easier for things like the rotating of user agents mentioned in the article.
If you're using JS in the first step just because you need Puppeteer, check out Playwright. It's what the original authors of Puppeteer are working on now, and it's been more actively developed in the past few years. It's very similar in usage and features, and it also has an official Python package.
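Fetching rendered HTML with the Python package looks roughly like this (sync API, placeholder URL):

    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto("https://example.com")     # placeholder URL
        html = page.content()                # fully rendered HTML for the scraping step
        browser.close()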
An easy way to do this that I've used is to cache web requests. That way, I can run the part of the code that gets the data again with, say, a modification to grab data from additional URLs, without unnecessarily re-fetching my existing URLs. With this method, I don't need to modify existing code either; best of both worlds.
For this I've used the requests-cache lib.
Looks like a really useful library - thanks for the tip.
It applies to many other projects too: cling on to the raw data as long as it isn't bogging you down too much.
Generally it's enough to archive the retrieved HTML just in case.
Although in general I like the idea of a queue for a scraper to access separately, another option - assuming you have the storage and bandwidth - is to capture and store every requested page, which lets you replay the extraction step later.
Yes!
My Clojure scraping framework [0] facilitates that kind of workflow, and I've been using it to scrape/restructure massive sites (millions of pages). I guess I'm going to write a blog post about scraping with it at scale. Although it doesn't really scale much above that (it's meant for single-machine loads at the moment), it could be enhanced to support larger workloads rather easily.
[0]: https://github.com/nathell/skyscraper
this is the way.
Can confirm. A few discrete scripts each focused on one part of the process can make the whole thing run seamlessly async, and you naturally end up storing the pages for processing by subsequent scripts. Especially if you write a dedicated downloader - then you can really go nuts optimizing and randomizing the download parameters for each individual link in the queue. "Do one thing and do it well" FTW.
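A rough sketch of what such a dedicated downloader can look like in Python (the UA pool, delay range, and file naming are all arbitrary choices):

    import hashlib, pathlib, queue, random, threading, time
    import requests

    USER_AGENTS = ["ua-string-1", "ua-string-2"]   # stand-ins for a real pool of UA strings
    OUT = pathlib.Path("pages")
    OUT.mkdir(exist_ok=True)

    def download_worker(q):
        # pull a link from the queue, randomize the request parameters, save the raw HTML
        while True:
            url = q.get()
            time.sleep(random.uniform(1, 5))       # randomized politeness delay
            resp = requests.get(url, timeout=30,
                                headers={"User-Agent": random.choice(USER_AGENTS)})
            name = hashlib.sha256(url.encode()).hexdigest() + ".html"
            (OUT / name).write_text(resp.text)
            q.task_done()

    q = queue.Queue()
    for url in ["https://example.com/a", "https://example.com/b"]:   # placeholder links
        q.put(url)
    threading.Thread(target=download_worker, args=(q,), daemon=True).start()
    q.join()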
I talked about exactly that at a conference in 2022 (free to watch): https://youtu.be/b0lAd-KEUWg?feature=shared
if you're using requests in python, requests-cache does exactly this for you, saving the data to an sqlite db, and is compatible with your code using requests.
Yes!! https://beepb00p.xyz/unnecessary-db.html really changed how I think about data manipulation, mostly with this principle.
The problem is that crawling is generally optimized using info you find in the page (e.g. which links are worth following next).