
Web scraping with GPT-4o: powerful but expensive

parhamn
29 replies
18h56m

Is there an "HTML reducer" out there? I've been considering writing one. If you take a page's source, it's going to be 90% garbage tokens -- random JS, ads, unnecessary properties, aggressive nesting for layout rendering, etc.

I feel like if you used a DOM parser to walk the tree and only kept nodes with text, the HTML structure, and the necessary tag properties (class/id only, maybe?), you'd have significant savings. Perhaps the XPath thing might work better too. You could even drop the markup symbols and represent it as an indented text file.
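
Something like this rough sketch (assuming BeautifulSoup; the dropped tags and kept attributes are just guesses):

    # Rough "HTML reducer": keep tag names, id/class, and text, as an indented outline.
    # Sketch only -- the set of dropped tags is an arbitrary choice.
    from bs4 import BeautifulSoup, Comment, NavigableString, Tag

    DROP = {"script", "style", "noscript", "svg", "iframe", "head"}

    def reduce_html(html: str) -> str:
        soup = BeautifulSoup(html, "html.parser")
        lines = []

        def walk(node, depth):
            for child in node.children:
                if isinstance(child, Comment):
                    continue  # skip HTML comments
                if isinstance(child, NavigableString):
                    text = child.strip()
                    if text:
                        lines.append("  " * depth + text)
                elif isinstance(child, Tag) and child.name not in DROP:
                    label = child.name
                    if child.get("id"):
                        label += "#" + child["id"]
                    if child.get("class"):
                        label += "." + ".".join(child["class"])
                    lines.append("  " * depth + label)
                    walk(child, depth + 1)

        walk(soup, 0)
        return "\n".join(lines)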

We use Readability for things like this, but you lose the DOM structure, and its quality degrades on JS-heavy websites and pages with actions like "continue reading" which expand the text.

What's the gold standard for something like this?

axg11
11 replies
18h47m

I wrote an in-house one for Ribbon. If there’s interest, will open source this. It’s amazing how much better our LLM outputs are with the reducer.

yadaeno
0 replies
3h21m

+1

ushtaritk421
0 replies
11h26m

I would be very interested in this.

parhamn
0 replies
18h43m

Yes! Happy to try it on a fairly large user base and contribute to it! Email in bio if you want a beta user.

kurthr
0 replies
18h45m

That would be wonderful.

guwop
0 replies
6h55m

sounds amazing!

faefox
0 replies
3h48m

+1

downWidOutaFite
0 replies
17h55m

+1

chrisrickard
0 replies
3h38m

100%

Tostino
0 replies
18h44m

I'm absolutely interested in this.

CalRobert
0 replies
1h3m

Add my voice to the chorus asking for this!

7thpower
0 replies
16h21m

Yes please

edublancas
5 replies
18h50m

author here: I'm working on a follow-up post. Turns out, removing all HTML tags works great and reduces the cost by a huge margin.

7thpower
3 replies
16h20m

What do you mean? What do you use as reference points?

edublancas
2 replies
15h24m

nothing, I strip out all the HTML tags and pass raw text
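
For illustration, that stripping step can be as small as this (a sketch assuming BeautifulSoup):

    # Strip all tags and keep only the raw text of the page.
    from bs4 import BeautifulSoup

    def html_to_text(html: str) -> str:
        soup = BeautifulSoup(html, "html.parser")
        # drop script/style so their contents don't leak into the text
        for tag in soup(["script", "style", "noscript"]):
            tag.decompose()
        return soup.get_text(separator="\n", strip=True)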

isaacfung
1 replies
14h46m

How do you keep table structure?

jaimehrubiks
0 replies
9h18m

They should probably keep tables and lists and strip most of the rest.

AbstractH24
0 replies
7h31m

Am I crazy or is there no way to “subscribe” to your site? Interested to follow your learnings in this area.

lelandfe
2 replies
18h51m

Only works insofar as sites are being nice. A lot of sites do things like: render all text via JS, render article text via API, paywall content by showing a preview snippet of static text before swapping it for the full text (which lives in a different element), lazyload images, lazyload text, etc etc.

DOM parsing wasn't enough for Google's SEO algo, either. I'll even see Safari's "reader mode" fail utterly on site after site for some of these reasons. I tend to have to scroll the entire page before running it.

zexodus
1 replies
10h44m

It's possible to capture the DOM by running a headless browser (e.g. with chromedriver/geckodriver), letting the JS execute, and then saving the HTML.
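
Roughly, with Selenium (a sketch; the fixed sleep is a crude stand-in for smarter waiting):

    # Render the page in headless Chrome, let JS run, then save the resulting DOM.
    import time
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get("https://example.com")
        time.sleep(5)  # crude: give client-side JS a moment to render
        rendered_html = driver.page_source  # the DOM after JS execution
    finally:
        driver.quit()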

If these readers do not use already rendered HTML to parse the information on the screen, then...

lelandfe
0 replies
5h5m

Indeed, Safari's reader already upgrades to using the rendered page, but even it fails on more esoteric pages using e.g. lazy-loaded content (i.e. you haven't scrolled to it yet for it to load), or (god forbid) virtualized scrolling pages, which offload content out of view.

It's a big web out there, there's even more heinous stuff. Even identifying what the main content is can be a challenge.

And reader mode has the benefit of being run by the user. Identifying when to run a page-simplifying action on some headlessly loaded URL can be tricky. I imagine it would need to be something like: load the URL, await the load event, scroll to the bottom of the page, wait for the network to be idle (and possibly for long tasks/animations to finish, too).
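
Roughly that recipe as a Playwright sketch (the waits are guesses and won't cover every site):

    # Load, scroll to the bottom, then wait for the network to go idle (Playwright sync API).
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto("https://example.com", wait_until="load")
        page.evaluate("window.scrollTo(0, document.body.scrollHeight)")  # trigger lazy loading
        page.wait_for_load_state("networkidle")
        html = page.content()
        browser.close()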

parhamn
1 replies
18h48m

I snuck in an edit about Readability before I saw your reply. The quality of that one in particular is very meh, especially for most new sites, and then you lose all of the DOM structure in case you want to do more with the page. Though now I'm curious how it works on the weather.com page the author tried. A Puppeteer -> screenshot -> OCR (or even multi-modal; many models do OCR first) -> LLM pipeline might work better there.

LunaSea
0 replies
11h10m

The issue is that LLMs are at the very least 100x more expensive.

simonw
1 replies
18h49m

Jina.ai offer a really neat (currently free) API for this - you add https://r.jina.ai/ to the beginning of any URL and it gives you back a Markdown version of the main content of that page, suitable for piping into an LLM.

Here's an example: https://r.jina.ai/https://simonwillison.net/2024/Sep/2/anato... - for this page: https://simonwillison.net/2024/Sep/2/anatomy-of-a-textual-us...

Their code is open source so you can run your own copy if you like: https://github.com/jina-ai/reader - it's written in TypeScript and uses Puppeteer and https://github.com/mozilla/readability

I've been using Readability (minus the Markdown bit) myself to extract the title and main content from a page - I have a recipe for running it via Playwright using my shot-scraper tool here: https://shot-scraper.datasette.io/en/stable/javascript.html#...

    shot-scraper javascript https://simonwillison.net/2024/Sep/2/anatomy-of-a-textual-user-interface/ "
    async () => {
      const readability = await import('https://cdn.skypack.dev/@mozilla/readability');
      return (new readability.Readability(document)).parse();
    }"

BeetleB
0 replies
3h11m

+1 for this. I too use Readability via Simon's shot-scraper tool.

purple-leafy
0 replies
16h36m

I wrote one for a project that captures a portion of the DOM and sends it to an LLM.

It strips all JS/event handlers, most attributes, and most CSS, and only keeps important text nodes.

I needed this because I was using an LLM to reimplement portions of a page using just Tailwind, so I needed to minimise input tokens.

nickpsecurity
0 replies
2h33m

That’s easy to do with BeautifulSoup in Python. Look up tutorials on that. Use it to strip the non-essential tags. That will at least work when the content is in the HTML rather than procedurally generated (e.g. by JavaScript).

jumploops
20 replies
18h46m

We've had the best success by first converting the HTML to a simpler format (i.e. markdown) before passing it to the LLM.

There are a few ways to do this that we've tried, namely Extractus[0] and dom-to-semantic-markdown[1].

Internally we use Apify[2] and Firecrawl[3] for Magic Loops[4] that run in the cloud, both of which have options for simplifying pages built-in, but for our Chrome Extension we use dom-to-semantic-markdown.

Similar to the article, we're currently exploring a user-assisted flow to generate XPaths for a given site, which we can then use to extract specific elements before hitting the LLM.

By simplifying the "problem" we've had decent success, even with GPT-4o mini.

[0] https://github.com/extractus

[1] https://github.com/romansky/dom-to-semantic-markdown

[2] https://apify.com/

[3] https://www.firecrawl.dev/

[4] https://magicloops.dev/
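
For illustration, the HTML-to-Markdown pre-processing step can be as small as this (using html2text as a simple stand-in for the tools above):

    # Convert the page to Markdown before sending it to the LLM.
    import html2text
    import requests

    html = requests.get("https://example.com", timeout=30).text
    converter = html2text.HTML2Text()
    converter.ignore_images = True   # images rarely help the LLM
    converter.body_width = 0         # don't hard-wrap lines
    markdown = converter.handle(html)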

pkiv
10 replies
15h3m

If you're open to it, I'd love to hear what you think of what we're building at https://browserbase.com/ - you can run a Chrome extension on a headless browser, so you can do the semantic-markdown conversion within the browser before pulling anything out.

We even have an iFrame-able live view of the browser, so your users can get real-time feedback on the XPaths they're generating: https://docs.browserbase.com/features/session-live-view#give...

Happy to answer any questions!

jumploops
5 replies
12h51m

This is super neat and I think I've seen your site before :)

Do you handle authentication? We have lots of users that want to automate some part of their daily workflow but the pages are often behind a login and/or require a few clicks to reach the desired content.

Happy to chat: username@gmail.com

retrofuturism
3 replies
10h0m

You must get a lot of test emails to that FANTASTIC gmail address. Funny how it might even be worth some decent money.

spyder
2 replies
6h46m

That's not literally his e-mail :D. He means that you have to replace it with his HN username. It would have been better to write it like this: [HN username]@gmail.com

retrofuturism
0 replies
4h4m

Hahaha okay I feel dumb now.

delfinom
0 replies
6h4m

Personally, I thought it was an LLM reply to an LLM marketing post to fake engagement. Lol

peab
0 replies
2h38m

I'm also curious about this! I've been learning about scraping, but I've had a hard time finding good info about how to deal with user auth effectively.

sunshadow
1 replies
12h52m

I don't see any difference from Browserless?

pkiv
0 replies
12h47m

The price and the dashboard are a great start :)

naiv
0 replies
12h13m

Awesome product!

I was just a bit confused that the sign-up buttons for the Hobby and Scale plans are grey; I thought they were disabled until I happened to hover over them.

emilburzo
0 replies
11h54m

Romania is missing from the list of phone number countries on signup, not sure if on purpose or not.

mistercow
4 replies
3h3m

Have you compared markdown to just stripping the HTML down (strip tag attributes, unwrap links, remove obvious non-displaying elements)? My experience has been that performance is pretty similar to markdown, and it’s an easier transformation with fewer edge cases.

nickpsecurity
3 replies
2h43m

That’s what I’ve done for quite a few [non-LLM] applications. The remaining problem is that HTML is verbose vs other formats, which means a higher token cost. So, maybe stripping followed by substituting HTML tags with a compressed notation.

defgeneric
1 replies
1h20m

I've tried this and found it doesn't make much difference. The idea was to somehow preserve the document structure while reducing the token count, so you do things like strip all styles, etc. until you have something like a structure of divs, then reduce that. But I found no performance gain in terms of output. It seems whatever structure of the document is left over after doing the reduction has little semantic meaning that can't be conveyed by spaces or newlines. Even when using something like html2markdown, it doesn't perform much better. So in a sense the LLM is "too good", and all you really need to worry about is reducing the token count.

a_wild_dandan
0 replies
17m

I wonder if using nested markdown bullet points would help. You would preserve the information hierarchy, and LLMs are phenomenal with (and often output) markdown.

mistercow
0 replies
2h16m

Yeah it’s slightly more token heavy, although not as much as it seems like at first glance. Most opening tags are 2-3 tokens, and most closing tags are 3-4. Since tags are generally a pretty small fraction of the text, it typically doesn’t make a huge difference IME, but it obviously depends on the particular content.

neeleshs
1 replies
14h51m

We did something similar, although in a somewhat different context.

We translated a complex JSON representation of an execution graph to a simpler Graphviz dot format first and then fed it to an LLM. We had decent success.

jumploops
0 replies
12h50m

Yes, just like with humans, if you simplify the context, it's often much easier to get the desired result.

Source: I have a toddler at home.

snthpy
0 replies
13h13m

Thank you for these.

I've been wanting to try the same approach and have been looking for the right tools.

pbronez
0 replies
6h44m

First I’ve heard of Semantic Markdown [0]. It appears to be a way to embed RDF data in Markdown documents.

The page I found is labeled “Alpha Draft,” which suggests there isn’t a huge corpus of Semantic Markdown content out there. This might impede LLMs’ ability to understand it due to lack of training data. However, it seems sufficiently readable that LLMs could get by pretty well by treating its structured metadata as parentheticals.

=====

What is Semantic Markdown?

Semantic Markdown is a plain-text format for writing documents that embed machine-readable data. The documents are easy to author and both human and machine-readable, so that the structured data contained within these documents is available to tools and applications.

Technically speaking, Semantic Markdown is "RDFa Lite for Markdown" and aims at enhancing the HTML generated from Markdown with RDFa Lite attributes.

Design Rationale:

Embed RDFa-like semantic annotation within Markdown

Ability to mix unstructured human-text with machine-readable data in JSON-LD-like lists

Ability to semantically annotate an existing plain Markdown document with semantic annotations

Keep human-readability to a maximum

=====

[0] https://hackmd.io/@sparna/semantic-markdown-draft

luigi23
18 replies
20h16m

Why are scrapers so popular nowadays?

rietta
7 replies
20h13m

Because publishers don’t push structured data or APIs enough to satisfy demand for the data.

luigi23
6 replies
20h11m

Got it, but why is it booming now, and why is it so often a showcase for LLM models? Is there some secret market/use case for it?

IanCal
2 replies
19h43m

Building scrapers sucks.

It's generally not hard because it's conceptually very difficult, or because it requires extremely high-level reasoning.

It sucks because when someone changes "<section class='bio'>" to "<div class='section bio'>" your scraper breaks. I just want the bio and it's obvious what to grab, but machines have no nuance.

LLMs have enough common sense to be able to deal with these things and they take almost no time to work with. I can throw html at something, with a vague description and pull out structured data with no engineer required, and it'll probably work when the page changes.

There's a huge number of one-off jobs people will do where perfect isn't the goal, and a fast solution + a bit of cleanup is hugely beneficial.

atomic128
1 replies
19h12m

Another approach is to use a regexp scraper. These are very "loose" and tolerant of changes. For example, RNSAFFN.com uses regular expressions to scrape the Commitments of Traders report from the Commodity Futures Trading Commission every week.

simonw
0 replies
19h10m

My experience has been the opposite: regex scrapers are usually incredibly brittle, and also harder to debug when something DOES change.

My preferred approach for scraping these days is Playwright Python and CSS selectors to select things from the DOM. Still prone to breakage, but reasonably pleasant to debug using browser DevTools.

yuters
0 replies
16h26m

I don't know if many have the same use case but... I'm heavily relying on this right now because my daughter started school. The school board, the school, and the teacher each use a different app to communicate important information to parents. I'm just trying to make one feed with all of them. Before AI it would have been hell to scrape, because, as you can imagine, those apps are terrible.

Fun aside: The worst one of them is a public Facebook page. The school board is making it their official communication channel, which I find horrible. Facebook is making it so hard to scrape. And if you don't know, you can't even use Facebook's API for this anymore, unless you have a business verified account and go through a review just for this permission.

drusepth
0 replies
20h8m

Scrapers have always been notoriously brittle and prone to breaking completely when pages make even the smallest of structural changes.

Scraping with LLMs bypasses that pitfall because it's more of a summarization task on the whole document, rather than working specifically on a hard-coded document structure to extract specific data.

bobajeff
0 replies
19h54m

Personally I find it's better for archiving, as most sites don't provide a convenient way to save their content directly. Occasionally, I do it just to make a better interface over the data.

drusepth
4 replies
20h12m

I'd say scrapers have always been popular, but I imagine they're even more popular nowadays with all the tools (AI but also non-AI) readily available to do cool stuff on a lot of data.

bongodongobob
3 replies
20h8m

Bingo. During the pandemic, I started a project to keep myself busy by trying to scrape stock market ticker data and then do some analysis and make some pretty graphs out of it. I know there are paid services for this, but I wanted to pull it from various websites for free. It took me a couple months to get it right. There are so many corner cases to deal with if the pages aren't exactly the same each time you load them. Now with the help of AI, you can slap together a scraping program in a couple of hours.

MaxPock
2 replies
19h55m

Was it profitable?

keyle
0 replies
19h15m

I'm sure it was profitable in keeping him busy during the pandemic. Not everything has to derive monetary value; you can do something for experience, fun, kicking the tyres, open source, and/or philanthropic avenues.

Besides, it's a low-margin, heavily capitalized, and heavily crowded market you'd be entering, and not worth the negative-monetary investment in the short and medium term (unless you wrote AI in the title and we're going to the mooooooon baby).

bongodongobob
0 replies
14h23m

It was in the sense that I learned that trying to beat the market is fundamentally impossible/stupid, so just invest in index funds.

CSMastermind
2 replies
11h58m

There's been a large push to do server-side rendering for web pages which means that companies no longer have a publicly facing API to fetch the data they display on their websites.

Parsing the rendered HTML is the only way to extract the data you need.

kordlessagain
1 replies
4h26m

I've had good success running Playwright screenshots through EasyOCR, so parsing the DOM isn't the only way to do it. Granted, tables end up pretty messy...

fzysingularity
0 replies
1h56m

We've been doing something similar for VLM Run [1]. A lot of websites that have obfuscated HTML/JS or rendered charts/tables tend to be hard to parse with the DOM. Taking screenshots is definitely more reliable and future-proof, as these webpages are built for humans to interact with.

That said, the costs can be high as the OP says, but we're building cheaper and more specialized models for web screenshot -> JSON parsing.

Also, it turns out you can do a lot more than just web-scraping [2].

[1] https://vlm.run

[2] https://docs.vlm.run/introduction

nsonha
0 replies
14h36m

What do you think all this LLM stuff will evolve into? Of course it's moving on from chitchat over stale information and into an "automate the web" kind of phase, like it or not.

adamtaylor_13
0 replies
19h15m

There’s a lot of data that we should have programmatic access to that we don’t.

The fact that I can’t get my own receipt data from online retailers is unacceptable. I built a CLI Puppeteer scraper to scrape sites like Target, Amazon, Walmart, and Kroger for precisely this reason.

Any website that has my data and doesn’t give me access to it is a great target for scraping.

tom1337
8 replies
19h58m

OpenAI recently announced a Batch API [1] which allows you to prepare all prompts and then run them as a batch. This reduces costs as it's just 50% of the price. I used it a lot with GPT-4o mini in the past and was able to prompt 3,000 items in less than 5 minutes. It could be great for non-realtime applications.

[1] https://platform.openai.com/docs/guides/batch
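
A rough sketch of the flow with the openai Python SDK (the model, prompt, and `pages` variable are placeholders):

    # Write one JSONL line per page, upload the file, and start a batch job (50% of normal price).
    import json
    from openai import OpenAI

    client = OpenAI()
    with open("batch.jsonl", "w") as f:
        for i, page_text in enumerate(pages):  # `pages`: list of pre-extracted page texts
            f.write(json.dumps({
                "custom_id": f"page-{i}",
                "method": "POST",
                "url": "/v1/chat/completions",
                "body": {
                    "model": "gpt-4o-mini",
                    "messages": [{"role": "user", "content": f"Extract the data as JSON:\n{page_text}"}],
                },
            }) + "\n")

    batch_file = client.files.create(file=open("batch.jsonl", "rb"), purpose="batch")
    batch = client.batches.create(
        input_file_id=batch_file.id,
        endpoint="/v1/chat/completions",
        completion_window="24h",
    )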

Tostino
5 replies
19h1m

I hope some of the open-source inference servers start supporting that endpoint soon. I know vLLM has added some "offline batch mode" support with the same format; they just haven't gotten around to implementing it on the OpenAI endpoint yet.

asaddhamani
2 replies
15h25m

Do note it can take up to 24 hours or drop requests altogether. But if that’s not an issue for your use case it’s a great cost saving.

jumploops
0 replies
15h13m

This is neat, I’ve been looking for a way to run our analytics (LLM-based) without affecting the rate limits of our prod app.

May need to give this a try!

altdataseller
0 replies
9h30m

What percentage of requests usually get dropped? Is it something minuscule like 1%, or are we talking non-trivial like 10%?

johndough
1 replies
11h13m

llama.cpp enabled continuous batching by default half a year ago: https://github.com/ggerganov/llama.cpp/pull/6231

There is no need for a new API endpoint. Just send multiple requests at once.

Tostino
0 replies
11h4m

The point of the endpoint is to be able to standardize my codebase and have an agnostic LLM provider that works the same.

Continuous batching is helpful for this type of thing, but it really isn't everything you need. You'd ideally maintain a low priority queue for the batch endpoint and a high priority queue for your real-time chat/completions endpoint.

That would allow you to utilize your hardware much better.

cdrini
0 replies
5h48m

Yeah this was a phenomenal decision on their part. I wish some of the other cloud tools like azure would offer the same thing, it just makes so much sense!

LunaSea
0 replies
11h15m

That's a great proposition by OpenAI. I think however that it is still one to two orders of magnitude too expensive compared to traditional text extraction with very similar precision and recall levels.

namukang
7 replies
10h14m

For structured content (e.g. lists of items, simple tables), you really don’t need LLMs.

I recently built a web scraper to automatically work on any website [0] and built the initial version using AI, but I found that using heuristics based on element attributes and positioning ended up being faster, cheaper, and more accurate (no hallucinations!).

For most websites, the non-AI approach works incredibly well so I’d make sure AI is really necessary (e.g. data is unstructured, need to derive or format the output based on the page data) before incorporating it.

[0] https://easyscraper.com

sebstefan
3 replies
6h38m

The LLM is resistant to website updates that would break normal scraping

If you do like the author did and ask it to generate XPaths, you can use it once, use the XPaths it generated for regular scraping, then once it breaks fall back to the LLM to update the XPaths, and fall back one more time to alerting a human if the data doesn't start flowing again, or if something breaks further down the pipeline because the data is in an unexpected format.
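
A sketch of that fallback pattern (lxml for the cheap path; `ask_llm_for_xpath` and `alert_human` are hypothetical helpers, and "price" is just an example field):

    # Use a cached, LLM-generated XPath first; only go back to the LLM when it stops matching.
    from lxml import html as lhtml

    def extract_price(page_html: str, cache: dict) -> str | None:
        tree = lhtml.fromstring(page_html)
        xpath = cache.get("price")
        nodes = tree.xpath(xpath) if xpath else []
        if not nodes:
            # hypothetical helper: asks the LLM for a fresh XPath given the new HTML
            xpath = ask_llm_for_xpath(page_html, field="price")
            cache["price"] = xpath
            nodes = tree.xpath(xpath)
        if not nodes:
            alert_human("price extraction broke")  # hypothetical escalation hook
            return None
        return nodes[0].text_content().strip()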

melenaboija
1 replies
5h43m

Unless the update is splitting cells.

Turns out, a simple table from Wikipedia (the Human Development Index) breaks the model, because rows with repeated values are merged.

sebstefan
0 replies
4h33m

Nah, still correct :-) that would break the regular scraping as well

The LLM is resistant to website updates that would break normal scraping

is_true
0 replies
5h43m

XPath can be based on content, not only position.

poulpy123
2 replies
6h27m

Yours is the first one I've seen that allows scraping by directly selecting what to scrape. I always wondered why there was no tool doing that.

sebstefan
1 replies
4h30m

I've seen another website like this that had this feature on Hacker News, but it was in a retrospective. These websites have a nasty habit of ceasing operations.

echelon
0 replies
36m

It needs to be a library.

marcell
5 replies
19h22m

I'm working on a Chrome extension to do web scraping using OpenAI, and I've been impressed by what ChatGPT can do. It can scrape complicated text/html, and usually returns the correct results.

It's very early still but check it out at https://FetchFoxAI.com

One of the cool things is that you can scrape non-uniform pages easily. For example I helped someone scrape auto dealer leads from different websites: https://youtu.be/QlWX83uHgHs . This would be a lot harder with a "traditional" scraper.

hydrogenpolo
4 replies
19h13m

Cool, would this work on something like instagram? Scraping pages?

dghlsakjg
2 replies
19h9m

Instagram really doesn’t want you scraping. There are almost certainly terms against it in the user agreement

bangaladore
1 replies
18h55m

Companies like Instagram (Facebook/Meta/Garbage) abuse their users' data day in and day out. Who cares what their TOS says. Let them spend millions of dollars trying to block you, its a drop in the bucket.

houseplant
0 replies
13h40m

Instead, don't do it because it's disrespectful to people. A lot of people weren't made aware - or didn't have the option - to object to that TOS change. Saying "well, THOSE guys do it! Why can't I?" isn't a mature stance. Don't take their images, because it's the right thing to do.

zulko
3 replies
7h42m

Same experience here. Been building a classical music database [1] where historical and composer life events are scraped off wikipedia by asking ChatGPT to extract lists of `[{event, year, location}, ...]` from biographies.

- Using GPT-4o mini was the only cheap option; it worked well (although I have a feeling it's dumbing down these days) and made it virtually free.

- Just extracting the webpage text from the HTML with `BeautifulSoup(html).text` slashes the number of tokens (but can be risky when dealing with complex tables).

- At some point I needed to scrape ~10,000 pages that have the same format and it was much more efficient speed-wise and price-wise to provide ChatGPT with the HTML once and say "write some python code that extracts data", then apply that code to the 10,000 pages. I'm thinking a very smart GPT-based web parser could do that, with dynamically generated scraping methods.

- Finally, because this article mentions tables: Pandas has a very nice feature, `read_html("http://the-website.com")`, that will detect and parse all tables on a page. But the article does a good job pointing at websites where the method would fail because the tables don't use `<table/>`.

[1] https://github.com/Zulko/composer-timelines
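
For reference, a minimal example of that pandas feature (the Wikipedia URL is just an illustrative table-heavy page):

    # pandas.read_html fetches a page and parses every <table> into a DataFrame.
    # Needs lxml (or bs4 + html5lib) installed as the underlying parser.
    import pandas as pd

    tables = pd.read_html("https://en.wikipedia.org/wiki/Human_Development_Index")
    print(len(tables))       # number of <table> elements found
    print(tables[0].head())  # first table as a DataFrame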

davidsojevic
2 replies
7h22m

If you haven't considered it, you can also use the direct wikitext markup, from which the HTML is derived.

Depending on how you use it, the wikitext may or may not be more ingestible if you're passing it through to an LLM anyway. You may also be able to pare it down a bit by heading/section so that you can reduce it to only the sections that are likely to be relevant (e.g. "Life and career"-type sections).

You can also download full dumps [0] from Wikipedia and query them via SQL to make your life easier if you're processing them.

[0] https://en.wikipedia.org/wiki/Wikipedia:Database_download#Wh...?

zulko
1 replies
6h43m

reduce it to only the sections that are likely to be relevant (e.g. "Life and career")

True, but I also managed to do this from the HTML. I tried getting pages' wikitext through the API but couldn't figure out how.

Just querying the HTML page was less friction and fast enough that I didn't need a dump (although when AI becomes cheap enough, there are probably a lot of things to do with a Wikipedia dump!).

One advantage of using online wikipedia instead of a dump is that I have a pipeline on Github Actions where I just enter a composer name and it automagically scrapes the web and adds the composer to the database (takes exactly one minute from the click of the button!).

ozr
3 replies
14h22m

GPT-4 (and Claude) are definitely the top models out there, but: Llama, even the 8b, is more than capable of handling extraction like this. I've pumped absurd batches through it via vLLM.

With serverless GPUs, the cost has been basically nothing.
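
A sketch of that kind of offline batch with vLLM (the model name, prompt, and `page_texts` are placeholders):

    # Offline batched extraction with vLLM: one generate() call over many prompts.
    from vllm import LLM, SamplingParams

    llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
    params = SamplingParams(temperature=0.0, max_tokens=512)

    # page_texts: list of pre-extracted page texts
    prompts = [f"Extract the product name and price as JSON:\n\n{text}" for text in page_texts]
    outputs = llm.generate(prompts, params)
    results = [o.outputs[0].text for o in outputs]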

shoelessone
2 replies
10h7m

Can you explain a bit more about what "serverless GPUs" are exactly? Is there a specific cloud provider you're thinking of, e.g. is there a GPU product with AWS? Google gives me SageMaker, which is perhaps what you are referring to?

ozr
0 replies
3h51m

There are a few companies out there that provide it, Runpod and Replicate being the two that I've used. If you've ever used AWS Lambda (or any other FaaS) it's essentially the same thing.

You ship your code as a container within a library they provide that allows them to execute it, and then you're billed per-second for execution time.

Like most FaaS, if your load is steady-state it's more expensive than just spinning up a GPU instance.

If your use-case is more on-demand, with a lot of peaks and troughs, it's dramatically cheaper. Particularly if your trough frequently goes to zero. Think small-scale chatbots and the like.

Runpod, for example, would cost $3.29/hr or ~$2400/mo for a single H100. I can use their serverless offering instead for $0.00155/second. I get the same H100 performance, but it's not sitting around idle (read: costing me money) all the time.

agcat
0 replies
30m

You can check out this technical deep dive on serverless GPU offerings and the pay-as-you-go model.

It includes benchmarks around cold starts, performance consistency, scalability, and cost-effectiveness for models like Llama 2 7B and Stable Diffusion across different providers: https://www.inferless.com/learn/the-state-of-serverless-gpus... It can save months of your time. Do give it a read.

P.S: I am from Inferless.

wslh
2 replies
19h52m

Isn't ollama an answer to this? Or is there something inherent to OpenAI that makes it significantly better for web scraping?

simonw
1 replies
19h8m

GPT-4o (and the other top-tier models like Claude 3.5 Sonnet and Gemini 1.5 Pro) is massively more capable than models you can run on your own machine using Ollama - unless you can run something truly monstrous like Llama 3.1 405B, but that requires 100GBs of GPU RAM, which is very expensive.

wslh
0 replies
7h58m

I understand that if the web scraping activity has some ROI, it is perfectly affordable. If it's just for fun it doesn't make sense, but the article is already paying for a service and looking for a goal.

albert_e
2 replies
6h45m

Offtopic:

What are some good frameworks for web scraping and PDF document processing -- some sources public, some behind a login, and some requiring multiple clicks before the site displays the relevant data?

We need to ingest a wide variety of data sources for one solution. Very few of those sources supply data as API / json.

kordlessagain
0 replies
4h17m

I have built most of this and have it running on Google Cloud as a service. The framework I built is Open Source. Let me know if you want to discuss: https://mitta.ai

abhgh
2 replies
10h48m

As others have mentioned here you might get better results cheaper (this probably wasn't the point of the article, so just fyi) if you preprocess the html first. I personally have had good results with trafilatura[1], which I don't see mentioned yet.

[1] https://trafilatura.readthedocs.io/en/latest/
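
For reference, extraction with trafilatura is only a couple of lines:

    # trafilatura: fetch a page and pull out the main text content.
    import trafilatura

    downloaded = trafilatura.fetch_url("https://example.com/article")
    text = trafilatura.extract(downloaded)  # returns None if extraction fails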

jeanloolz
1 replies
6h55m

I strongly second trafilatura. It saves a huge amount of money to just send the text to the LLM. I used it on this recent project (shameless plug): https://github.com/philippe2803/contentmap. It's a simple Python library that creates a vector store for any website, using a domain XML sitemap as a starting point. The challenge was that each domain has its own HTML structure, and to create a vector store, we need the actual content, removing HTML tags, etc. Trafilatura basically does that for any URL, in just a few lines of code.

abhgh
0 replies
5h20m

Good to know! Yes, trafilatura is great. Sure, it breaks sometimes, but everything breaks on some website - the real questions are how often and what the extent of the breakage is. For general info, the library is described in [1], where Table 1 provides some benchmarks.

I also forgot to mention another interesting scraper that's an LLM-based service. A quick search here tells me it was mentioned once by simonw, but I think it should be better known just for the convenience! Prepend "r.jina.ai/" to any URL to extract its text. For example, check out [2] or [3].

[1] https://aclanthology.org/2021.acl-demo.15.pdf

[2] https://r.jina.ai/news.ycombinator.com/

[3] (this discussion) https://r.jina.ai/news.ycombinator.com/item?id=41428274

LetsGetTechnicl
2 replies
4h20m

I'm starting to think that LLMs are a solution in need of a problem, like crypto and the blockchain were. Haven't we already solved web scraping?

gallerdude
1 replies
4h12m

New technologies can solve problems better.

"I'm starting to think computers are a solution in the need of a problem. Have we not already solved doing math?"

LetsGetTechnicl
0 replies
1h15m

Does it do it better? It also uses a ton more electricity, so even if it's better, is it worth the cost?

timsuchanek
1 replies
11h49m

This is also how we started a while ago. I agree that it's too expensive, hence we're working on making this scalable and cheaper now! We'll soon launch, but here we go! https://expand.ai

hmottestad
0 replies
10h26m

I’m curious to know more about your product. Currently, I’m using visualping.io to keep an eye on the website of my local housing community. They share important updates there, and it’s really helpful for me to get an email every few months instead of having to check their site every day.

simonw
1 replies
19h12m

GPT-4o mini is 33x cheaper than GPT-4o, or 66x cheaper in batch mode. But the article says:

I also tried GPT-4o mini but yielded significantly worse results so I just continued my experiments with GPT-4o.

Would be interesting to compare with the other inexpensive top tier models, Claude 3 Haiku and Gemini 1.5 Flash.

edublancas
0 replies
18h47m

author here: I'm working on a follow-up post where I benchmark pre-processing techniques (to reduce the token count). Turns out, removing all HTML works well (much cheaper and doesn't impact accuracy). So far, I've only tried gpt-4o and the mini version, but trying other models would be interesting!

kcorbitt
1 replies
20h3m

Funnily enough, web scraping was actually the motivating use-case that started my co-founder and I building what is now openpipe.ai. GPT-4 is really good at it, but extremely expensive. But it's actually pretty easy to distill its skill at scraping a specific class of site down to a fine-tuned model that's way cheaper and also really good at scraping that class of site reliably.

artembugara
0 replies
10h3m

Wow, Kyle, you should have mentioned it earlier!

We've been working on this for quite a while. I'll contact you to show how far we've gotten

hubraumhugo
1 replies
13h45m

We've been working on AI-automated web scraping at Kadoa [0], and our early experiments were similar to those in the article. We started when only the expensive and slow GPT-3 was available, which pushed us to develop a cost-effective solution at scale.

Here is what we ended up with:

- Extraction: We use codegen to generate CSS selectors or XPath extraction code. Using an LLM for every data extraction would be expensive and slow, but using LLMs to generate the scraper code and subsequently adapt it to website modifications is highly efficient.

- Cleansing & transformation: We use small fine-tuned LLMs to clean and map data into the desired format.

- Validation: Unstructured data is a pain to validate. Alongside traditional data validation methods like reverse search, we use LLM-as-a-judge to evaluate data quality.

We quickly realized that doing this for a few data sources with low complexity is one thing, doing it for thousands of websites in a reliable, scalable, and cost-efficient way is a whole different beast.

Combining traditional ETL engineering methods with small, well-evaluated LLM steps was the way to go for us

[0] https://kadoa.com
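
A rough sketch of that LLM-as-a-judge step (the model, prompt, and 1-10 scale here are made up, not Kadoa's actual setup):

    # Ask a small model to score how well an extracted record matches the source text.
    import json
    from openai import OpenAI

    client = OpenAI()

    def judge_record(source_text: str, record: dict) -> int:
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder; could be any small fine-tuned model
            messages=[{
                "role": "user",
                "content": (
                    "Rate 1-10 how faithfully this JSON record reflects the source text. "
                    "Reply with a single integer.\n\n"
                    f"Source:\n{source_text}\n\nRecord:\n{json.dumps(record)}"
                ),
            }],
        )
        return int(resp.choices[0].message.content.strip())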

artembugara
0 replies
9h42m

this! I've been following Kadoa since its very first days. Great team.

fvdessen
1 replies
19h7m

This looks super useful, but from what I've heard, if you try to do this at any meaningful scale your scrapers will be blocked by Cloudflare and the like.

danpalmer
0 replies
14h55m

I used to do a lot of web scraping. Cloudflare is an issue, as are a few Cloudflare competitors, but scraping can still be useful. We had contracts with companies we scraped that allowed us to scrape their sites, specifically so that they didn't need to do any integration work to partner with us. The most anyone had to do on the company side was allowlist us with Cloudflare.

Would recommend web scraping as a "growth hack" in that way, we got a lot of partnerships that we wouldn't otherwise have got.

bilater
1 replies
1h41m

Not sure why the author didn't use 4o-mini. Use 4o for reasoning, but things like parsing/summarizing can be done by cheaper models with little loss in quality.

mthoms
0 replies
1h40m

Note: I also tried GPT-4o mini but yielded significantly worse results so I just continued my experiments with GPT-4o.

artembugara
1 replies
9h27m

Wow, that's one of the most orange tag-rich posts I've ever seen.

We're doing a lot of tests with GPT-4o at NewsCatcher. We have to crawl 100k+ news websites and then parse news content. Our rule-based model for extracting data from any article works pretty well, and we never could find a way to improve it with GPT.

"Crawling" is much more interesting. We need to know all the places where news articles can be published: sometimes 50+ sub-sections.

Interesting hack: I think many projects (including ours) can get away with generating the code for extraction, since the per-website structure rarely changes.

So, we're looking to have an LLM generate code to parse the HTML.

Happy to chat/share our findings if anyone is interested: artem [at] newscatcherapi.com

AbstractH24
0 replies
7h28m

I’d love to look into this for a hobbyist project I’m working on. Wish you had self signup!

antirez
1 replies
8h4m

It's very surprising that the author of this post does 99% of the work and writing and then doesn't go the last 1%: downloading Ollama (or some other llama.cpp-based engine) and testing how a decent local LLM works for this use case. Maybe a 7B or 30B model would do great here, and that's cheap enough to run: no GPT-4o needed.

devoutsalsa
0 replies
1h18m

Not OP, but thanks for the suggestion. I’m starting to play around with LLMs and will explore locally hosted versions.

webprofusion
0 replies
15h36m

Just run the model locally?

tuktuktuk
0 replies
13h0m

Can you share how long it took for you to parse the HTML? I recently experimented with comparing different AI models, including GPT-4o, alongside Gemini and Claude, to parse raw HTML: https://serpapi.com/blog/web-scraping-with-ai-parsing-html-t... The result is pretty interesting.

the_cat_kittles
0 replies
2h49m

Is it really so hard to look at a couple of XPaths in Chrome? It's insane that people actually use an LLM when trying to do this for real. We're headed where automakers are now: just put in idiot lights, and no one knows how to work on any parts anymore. Suit yourself, I guess.

sentinels
0 replies
5h31m

What people mentioned above is pretty much what they did at Octabear, and as an extension of the idea, it's also what a lot of startup applicants did for other types of media: video scraping, podcast scraping, audio scraping, etc. [0] https://www.octabear.com/

raybb
0 replies
16h3m

On this note, does anyone know how Cursor scrapes websites? Is it just fetching locally and then feeding the raw html or doing some type of preprocessing?

nsonha
0 replies
14h48m

Most of the discussion I've found on the topic is about how to extract information. Are there any techniques for extracting interactive elements? I reckon listing all of the inputs/controls would not be hard, but finding the corresponding labels/articles might be tricky.

Another thing I wonder about text extraction: would it be a crazy idea to just snapshot the page and ask it to OCR and generate a bare-minimum HTML table layout? That way both the content and the spatial relationships of elements are maintained (not sure how useful that is, but I'd like to keep it anyway).

mmasu
0 replies
4h19m

As a PoC, we first took a screenshot of the page, cropped it to the part we needed, and then passed it to GPT. One of the things we do is compare prices of different suppliers for the same product (i.e. airline tickets), and sometimes we need to do it manually. While the approach could look expensive, it is in general cheaper than a real person, and it enables the real person to do more meaningful work… so it's a win-win. I am looking forward to hopefully putting this into production.

mjrbds
0 replies
15h58m

We've had lots of success with this at Rastro.sh - but the biggest unlock came when we used this as benchmark data to build scraping code. Sonnet 3.5 is able to do this. It reduced our cost and improved accuracy for our use case (extracting e-commerce products), as some of these models are not reliable at extracting lists of 50+ items.

mfrye0
0 replies
13h42m

As others have mentioned, converting html to markdown works pretty well.

With that said, we've noticed that for some sites that have nested lists or tables, we get better results by reducing those elements to a simplified html instead of markdown. Essentially providing context when the structures start and stop.

It's also been helpful for chunking docs, to ensure that lists / tables aren't broken apart in different chunks.

mateuszbuda
0 replies
7h28m

I think that LLM costs, even for GPT-4o, are probably lower than the proxy costs usually required for web scraping at scale. The cost of residential/mobile proxies is a few dollars per GB. If I were to process cleaned data obtained using 1GB of residential/mobile proxy transfer, I wouldn't pay more for the LLM.

lccerina
0 replies
10h42m

We are re-opening coal plants to do this? Every day a bit more disgusted by GenAI stuff

kimoz
0 replies
20h11m

Is it possible to achieve good results using open-source models for scraping?

kanzure
0 replies
4h32m

Instead of directly scraping with GPT-4o, what you could do is have GPT-4o write a script for a simple web scraper and then use a prompt-loop when something breaks or goes wrong.

I have the same opinion about a man and his animals crossing a river on a boat. Instead of spending tokens on trying to solve a word problem, have it create a constraint solver and then run that. Same thing.

jasonthorsness
0 replies
19h21m

I also had good results with structured outputs, scraping news articles for city names from https://lite.cnn.com for the “in the news” list at https://weather.bingo - code here: https://www.jasonthorsness.com/13

I’ve had problems with hallucinations, though, even for something as simple as city names; also, the model often ignores my prompt and returns country names - I'm thinking of trying a two-stage scrape with one model checking the output of the other.
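
A sketch of that kind of structured-output call with the openai SDK and a pydantic schema (the schema, prompt, and `article_text` are made up for illustration):

    # Structured outputs sketch: constrain the model to a list of city names.
    from pydantic import BaseModel
    from openai import OpenAI

    class Cities(BaseModel):
        cities: list[str]

    client = OpenAI()
    completion = client.beta.chat.completions.parse(
        model="gpt-4o-2024-08-06",
        messages=[
            {"role": "system", "content": "Extract city names only, never countries."},
            {"role": "user", "content": article_text},  # article_text: scraped article body
        ],
        response_format=Cities,
    )
    print(completion.choices[0].message.parsed.cities)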

impure
0 replies
15h40m

I was thinking of adding a feature of my app to use LLMs to extract XPaths to generate RSS feeds from sites that don't support it. The section on XPaths is unfortunate.

godber
0 replies
4h32m

I would definitely approach this problem by having the LLM write code to scrape the page. That would address the cost and accuracy problems. And also give you testable code.

fsndz
0 replies
18h16m

Useful for one-shot cases, but not much more for the moment, IMO.

danielvaughn
0 replies
34m

The author claims that attempting to retrieve XPaths with the LLM proved to be unreliable. I've been curious about this approach because it seems like the best "bang for your buck" with regards to cost. I bet if you experimented more, you could probably improve your results.

btbuildem
0 replies
6h17m

I've had good luck giving it an example of the HTML I want scraped and asking for a BeautifulSoup code snippet. Generally the structure of what you want to scrape remains the same, and it's a tedious exercise coming up with the garbled string of nonsense that ends up parsing it.

Using an LLM for the actual parsing is simultaneously overkill and a risk of polluting your results with hallucinations.

blackeyeblitzar
0 replies
14h35m

I just want something that can take all my bookmarks, log into all my subscriptions using my credentials, and archive all those articles. I can then feed them to an LLM of my choice to ask questions later. But having the raw archive is the important part. I don't know if there are any easy-to-use tools to do this though, especially for paywalled, subscription-based websites.

ammario
0 replies
20h9m

To scale such an approach you could have the LLM generate JS to walk the DOM and extract content, caching the JS for each page.

LetsGetTechnicl
0 replies
5h8m

Surely you don't need an LLM for this

Havoc
0 replies
19h28m

Asking for XPaths is clever!

Plus you can probably use that until it fails (the website changes) and then just re-scrape it with an LLM request.

Gee101
0 replies
14h50m

A bit off topic, but great post title.

FooBarWidget
0 replies
7h45m

Can anyone recommend an AI vision web browsing automation framework rather than just scraping? My use case: automate the monthly task of logging into a website and downloading the latest invoice PDF.