One of the cases where AI isn't needed. There is a very good, well-proven algorithm for extracting content from pages; one implementation: https://github.com/buriy/python-readability
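For reference, the documented usage of that library (readability-lxml) is only a few lines; this sketch assumes pip install readability-lxml requests:

    import requests
    from readability import Document

    # Fetch a page and let readability pull out the main article.
    response = requests.get("https://example.com/article")
    doc = Document(response.text)

    print(doc.title())
    print(doc.summary())  # cleaned article HTML, ready for Markdown conversion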
This is awesome. I kind of want a browser extension that does this to every page I read and saves them somewhere.
Wouldn’t that describe Pocket, Readwise Reader, Matter, or another one of those many apps?
Edit: read too fast. Didn’t notice the automatic and systematic aspects.
Pocket saves the address; I don't think they save the content.
Just checked it, they also save the content.
My choice (manual): Markdown clipper
https://github.com/deathau/markdown-clipper
I guess there are dozens of alternative extensions available out there …
This fork:
https://github.com/deathau/markdownload
It has extensions available for Firefox, Google Chrome, Microsoft Edge, and Safari.
Singlefile for Firefox https://addons.mozilla.org/en-US/firefox/addon/single-file/
Also WebScrapBook: https://github.com/danny0838/webscrapbook
Omnivore. Saves a copy using web archive.
Wallabag + Obsidian + Wallabag Browser Ext. It's a manual trigger, but it's great.
Converting websites to markdown comes with 3 distinct problems:
1. Thoroughly scraping the content of the page (high recall)
2. Dropping all the ads/auxiliary content (high precision)
3. And getting the correct layout/section types (formatting)
For #2 and #3, Trafilatura, Newspaper4k, and python-readability-based solutions work best out of the box; a minimal example follows below. For #1, any scraping service + Selenium is going to do a great job.
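A minimal Trafilatura call, assuming pip install trafilatura (the URL is a placeholder):

    import trafilatura

    downloaded = trafilatura.fetch_url("https://example.com/article")
    if downloaded is not None:
        # extract() addresses #2 (boilerplate removal) and #3 (structure);
        # #1 still needs a real browser upstream for JS-heavy pages.
        text = trafilatura.extract(downloaded, include_comments=False)
        print(text)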
Could you elaborate on what your tool does differently or better? The area has been stagnant for a while, so I'm curious to hear your learnings.
Thanks for the links, I had no idea those existed.
For my article web scraper (WIP), the current steps are:
- Navigate with playwright + adblocker
- Run mozilla's readability on the page
- LLM checks readability output
If the check fails:
- Trim whole page HTML context
- Convert to markdown with pandoc
- LLM extracts from markdown
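A rough sketch of the first two steps (not the author's actual code), assuming pip install playwright, playwright install chromium, and a local copy of Mozilla's Readability.js; the adblocker and the LLM check/fallback are omitted:

    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto("https://example.com/article", wait_until="networkidle")
        page.add_script_tag(path="Readability.js")
        # Readability mutates the DOM, so parse a clone of the document.
        article = page.evaluate(
            "() => new Readability(document.cloneNode(true)).parse()"
        )
        browser.close()

    if article:  # dict with 'title', 'content' (cleaned HTML), etc.
        print(article["title"])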
Mozilla has released Readability as a standalone package so you can avoid spinning up a browser entirely: https://github.com/mozilla/readability
I still wanted the browser for UBlock Origin and handling sites with heavy JS. I was using the standalone Readability script already but today I ended up dropping it for Trafilatura. It works a lot better.
The inefficiency of using a browser rather than just fetching the HTML doesn't really matter, because the limiting factor here is the LLM.
And yes the LLM is essential for getting clean data. None of the existing methods are flexible enough for all cases even if people say "you don't need AI to do this".
You would still need to run a browser for JS-based websites.
Thoroughly scraping is challenging, especially in an environment where you don’t have (or want) a JavaScript runtime.
For content extraction, I found the approach the Postlight library takes quite neat. It scores individual HTML nodes based on some heuristics (text length, link density, CSS classes), then selects the nodes with the highest score. [1] I ported it to Swift for a personal read-later app.
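The scoring idea looks roughly like this (a toy illustration, not Postlight's actual algorithm; assumes pip install beautifulsoup4):

    from bs4 import BeautifulSoup

    def score(node):
        text_len = len(node.get_text(strip=True))
        link_len = sum(len(a.get_text(strip=True)) for a in node.find_all("a"))
        link_density = link_len / text_len if text_len else 1.0
        # Long text is good; text that is mostly links (navigation) is bad.
        return text_len * (1.0 - link_density)

    def best_candidate(html):
        soup = BeautifulSoup(html, "html.parser")
        candidates = soup.find_all(["article", "div", "section"])
        return max(candidates, key=score, default=None)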
This is pretty cool. Care to share your Swift port?
Not planning to. It’s my first Swift/iOS project. I neither want to polish it nor maintain it publicly. Happy to share it privately, email is in the bio. I’m planning on a blog post describing the general approach though!
Those are pretty damn cool, I didn't know those even existed.
Vercel!! Watch out for your bill now that this is being hugged. Hopefully you are not using <Image /> like they pester you to do.
What’s the problem with <Image />?
Vercel charges for image optimizations/dynamic scaling. People get the wrong perception that it should be a free service because the DX is so easy...
It's pretty sad; they encourage you throughout their docs to use it, when just a WebP would work fine.
It's literally that bus meme: the happy guy is you and Vercel, and the sad one is your wallet.
(Unless you actually need dynamic scaling and minification, that is.)
it has actually scaled pretty well and has been negligible cost-wise.
I didn’t have to do anything to handle HN traffic. Just a boilerplate nextjs app.
I think it was hugged to death
working on it!
I think it should be up now
"Either the URL is invalid or the server is too busy".
Aw, tell the user which it is.
Here is an open source alternative to this tool: https://github.com/ozanmakes/scrapedown
I think this doesn't load js before scraping. Many sites hydrate content using JS.
use something like Browserbase?
Great idea to offer image downloads and filtering with GPT!
I built a similar tool last year that doesn't have those features: https://url2text.com/
Apologies if the UI is slow - you can see some example output on the homepage.
The API it's built on is Urlbox's website screenshot API which performs far better when used directly. You can request markdown along with JS rendered HTML, metadata and screenshot all in one go: https://urlbox.com/extracting-text
You can even have it all saved directly to your S3-compatible storage: https://urlbox.com/s3
And/or delivered by webhook: https://urlbox.com/webhooks
I've been running over 1 million renders per month using Urlbox's markdown feature for a side project. It's so much better using markdown like this for embeddings and in prompts.
If you want to scrape whole websites like this you might also want to checkout this new tool by dctanner: https://usescraper.com/
Looks nice, but url2text doesn't seem to have an API, and urlbox doesn't seem to have an option to skip the screenshot if you only want the text. And for just the text, it looks to be really expensive.
Thanks!
Sorry it's not clearer but you can skip the screenshot in the Urlbox API if you want to with:
curl -X POST \
https://api.urlbox.io/v1/render/sync \
-H 'Authorization: Bearer YOUR_URLBOX_SECRET' \
-H 'Content-Type: application/json' \
-d '
{
"url": "example.com",
"format": "md"
}
'
Here's the result of that: https://renders.urlbox.io/urlbox1/renders/5799274d37a8b4e604...

Sorry the pricing isn't a good fit for you. Urlbox has been running for over 11 years. We're bootstrapped and profitable with a team of 3 (plus a few contractors). We're priced to be sustainable so our customers can depend on us in the long term. We automatically give volume discounts as your usage grows.
This is one of those things that the ever-amazing pandoc (https://pandoc.org/) does very well, on top of supporting virtually every other document format.
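If you'd rather call it from code than shell out, the pypandoc wrapper works too (assumes pandoc itself plus pip install pypandoc):

    import pypandoc

    html = "<h1>Title</h1><p>Some <em>emphasised</em> text.</p>"
    # "gfm" is GitHub-Flavored Markdown; plain "markdown" also works.
    print(pypandoc.convert_text(html, "gfm", format="html"))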
I second this. Pandoc is up there as one of the most useful tools in existence that almost no one talks about. It's amazing, easy to use, and it works. I regularly see new tools in the space pop up, but someone would have to have a REALLY unique and compelling feature, or a highly optimized use case, to get me to use anything else (besides Pandoc).
I wrote a series of blog posts about typesetting Markdown using pandoc:
Eventually, I found pandoc to be a little limiting:
* Awkward to use interpolated variables within prose.
* No real-time preview prior to rendering the final document.
* Limited options for TeX support (e.g., SVG vs. inline; ConTeXt vs. LaTeX).
* Inconsistent syntax for captions and cross-references.
* Requires glue to apply a single YAML metadata source file to multiple documents (e.g., book chapters).
* Does not (reliably) convert straight quotes to curly quotes.
For my purposes, I wanted to convert variable-laden Markdown and R Markdown to text, XHTML, and PDF formats. Eventually I replaced my tool chain of yamlp + pandoc + knitr by writing an integrated FOSS cross-platform desktop editor.
KeenWrite uses flexmark-java + Renjin + KeenTeX + KeenQuotes to provide a solution that can replace pandoc + knitr in some situations.
Note how the captions and cross-reference syntax for images, tables, and equations is unified to use a double-colon sigil:
https://gitlab.com/DaveJarvis/KeenWrite/-/blob/main/docs/ref...
There's also command-line usage for integrating into build pipelines:
https://gitlab.com/DaveJarvis/KeenWrite/-/blob/main/docs/cmd...
It looks like, if the website presents a cookie message, the tool just gets stuck on that and does not parse the actual content. As an example, I tried https://www.cnbc.com/ and all it created was a markdown of the cookie message and some legalese around it.
It's not easy working around things like that. But here's how it could work: https://url2text.com/u/wYVake
We were lucky to build this on a mature API that already solves loads of the edge cases around rendering different kinds of pages.
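If you're rolling your own rendering instead, one common (and admittedly fragile) workaround is to try clicking typical consent buttons before extracting; a heuristic Playwright sketch, not what url2text actually does:

    from playwright.sync_api import sync_playwright

    CONSENT_LABELS = ["Accept all", "I accept", "Agree", "Accept"]

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto("https://www.cnbc.com/", wait_until="domcontentloaded")
        for label in CONSENT_LABELS:
            button = page.get_by_role("button", name=label)
            if button.count() > 0:
                button.first.click()
                page.wait_for_timeout(1000)  # let the overlay disappear
                break
        html = page.content()  # hopefully past the cookie wall now
        browser.close()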
"Failed to Convert. Either the URL is invalid or the server is too busy. Please try again later."
can you try now
links -dump
elinks -dump
lynx -dump
let me guess, you need more?
If the page is rendered using JavaScript then yes you need more.
If there were a version of links, elinks, or lynx that executed JS that would be wonderful.
I tried a lot of tools, but none of them work on websites like https://www.globalrelay.com/company/careers/jobs/?gh_jid=507...
Our tool sadly also fails on this: https://url2text.com/u/KYkpBj
The challenge there is that the content is in an iframe.
If you get the URL used for the iframe you can get the content: https://url2text.com/u/kJWaZY
But that's frustrating as it requires two steps.
We might be able to help you get the content from URLs like these in one step. We have quite a bit of power in the Urlbox API that url2text isn't using.
Drop us an email: support@urlbox.com and we'll see what we can do.
I’m getting the server overload error but assuming this mostly works I’d use it every day!
can you try now?
I've found htmltidy [1] and pandoc html->markdown sufficiently capable.
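A sketch of that pipeline, assuming both CLIs are on PATH (the tidy flags are illustrative; check your htmltidy version's docs):

    import subprocess

    def html_to_markdown(raw_html: bytes) -> str:
        # Normalise the HTML first. tidy exits non-zero on mere warnings,
        # so no check=True on this step.
        tidied = subprocess.run(
            ["tidy", "-q", "-asxhtml", "--show-warnings", "no"],
            input=raw_html, capture_output=True,
        ).stdout
        result = subprocess.run(
            ["pandoc", "-f", "html", "-t", "gfm"],
            input=tidied, capture_output=True, check=True,
        )
        return result.stdout.decode("utf-8")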
Never heard of tidy, this looks promising.
I am kind of tempted/horrified to run all of my final templated HTML through this and see if I can spot any lingering malformations. Depending on how structured the corrections are, could make it a test-suite thing.
I did cnn.com and it just downloaded a file that said cnn.com can't be found lol
I wonder if it's not following redirects well
Nice! I did something similar a while back, but just for substack :)
One example --
URL: https://newsletter.pragmaticengineer.com/p/zirp-software-eng...
Sanitized output: https://substack-ai.vercel.app/u/pragmaticengineer/p/zirp-so...
Raw markdown: https://substack-ai.vercel.app/api/users/pragmaticengineer/p...
(Would be happy to open source it if anyone cares!)
I'd be interested in seeing the code, I love working on tools like this
Just tried it on a complex marketing page and it did a great job. Congrats!
I'm curious, if you care to share, what kind of load this places on your host? Is this something that you can keep going for free or will it eventually become non-cost efficient to keep running?
This is slightly heavy due to loading a headless Chrome instance. I will look into optimizing this part.
Other than that, GPT-4 is expensive, but so far the cost has been negligible, so I am hopeful.
I feel I can keep it around for a long time.
Tried it earlier today, but it was hugged to death. Tried it now, but it only gave me the contents of the cookie-wall, not the page I was looking for.
Tried another page of the same site; then it only gave me the last article on a 6-article page. Some weird things going on.
Pretty cool.
I built something very similar: smort.io. Just prepend smort.io/ before any article URL to easily edit, annotate, and share it with anyone.
Also works on ArXiv papers!
This was the Show HN post for Smort - https://news.ycombinator.com/item?id=30673502
If anyone is looking for a C++ solution to convert HTML to Markdown, I'm using this library https://github.com/tim-gromeyer/html2md in my app.
Is the code open source?
Is this open-sourced anywhere, by any chance? Are you using GPT to do the conversion, or just doing it yourself by way of HTML -> Markdown substitutions?
Nitpick: the tooltip (triggered by the question mark icon) does not work on mobile, at least on my iPhone (both Chrome and Safari).
May be worth taking a look at.
Good stuff otherwise! Cheers on the launch
Nice. Turn this into a browser extension and I'd install it. I feel like I'd forget about it otherwise.
I recently built a website using Markdown as the page source, which then uses CommonMark to produce the HTML. I was interested in seeing how different the extracted markdown looked to the original source markdown. It looks identical!
I also made one like this a while back; you can extract to markdown, HTML, text, or PDF. I found that pages that are just the bare tool are very hard to position for SEO, since there's not a lot of text/content on the page, even if the tool could be very useful. Feedback welcome:
These are all "wrappers around readability" AFAIK (including mine); readability is the Mozilla project that makes sites look clean, which I use often.
Very cool. We posted about a similar tool we built yesterday
It also crawls (although you can scrape single pages as well)
I use markdownload extension for Firefox. Seems to work pretty ok.
I wrote a similar article a while ago: https://blog.platypush.tech/article/Deliver-articles-to-your...
In my case the purpose was to share saved links to my e-reader, and I used Markdown as an intermediate format via the mercury.js scraping API, but the possibilities are endless.
I've been looking for this! My method requires too many steps. I look forward to seeing if this improves my results. Thanks!
A year ago I implemented this as well (albeit as a commercial offering with 100 free scrapes per month): https://2markdown.com It also has JavaScript-enabled browsing available in private beta; I'll make it public this week. In my experience, people fall back to simple scraping and don't use JS much, if at all.
Very cool! We also developed a similar tool that can extract information from complex PDFs with embedded tables, images, or graphs, and get your multimodal data sources LLM-ready. https://hellorag.ai
I built a no-code GitHub blog deployer thingy that lets you deploy a blog to GitHub Pages from a Codespace. https://github.com/ShelbyJenkins/easy-astro-blog-creator
Anyway, it uses Astro + Markdown.
It'd be really neat if I could scrape my medium account to convert it to markdown to save me the trouble.
Zzzzzzz
How do you achieve the same things without AI here using that tool?
"How do you do it without AI" is a question I (sadly) expect to see more often.
Feel free to answer, then: how do you do the same things this does with GPT-3/4, without AI?
Edit: This is an excellent use of it, a free-text human input capable of doing things like extracting summaries. It does not seem to be used at all for the basic task of extracting content, but for post-filtering.
I think “copy from a PDF” could be improved with AI. It’s been 30 years and I still get new lines in the middle of sentences when I try to copy from one.
I've long assumed that is a "feature" of PDF akin to DRM. Making it hard to copy text from a PDF makes sense from a publisher's standpoint.
That's a great use case. You might be able to do this if you've got copy and paste on the command line, with https://github.com/simonw/llm in between. An alias like pdfwtf translating to "paste | llm command | copy".
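Something like that alias, sketched in Python (assumes macOS pbpaste/pbcopy and the llm CLI installed and configured; the prompt wording here is made up):

    import subprocess

    clipboard = subprocess.run(["pbpaste"], capture_output=True).stdout
    fixed = subprocess.run(
        ["llm", "Rejoin lines broken mid-sentence; output only the fixed text."],
        input=clipboard, capture_output=True, check=True,
    ).stdout
    subprocess.run(["pbcopy"], input=fixed, check=True)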
Meh, it's just the "how does it work?" question. How content extractors work is interesting, and neither obvious nor trivial.
And even when you see how readability parser works, AI handles most of the edge cases that content extractors fail on, so they are genuinely superseded by LLMs.
How does it compare to mozilla/readability?
It uses readability but does some additional stuff, like relinking images to local paths, etc., which I needed.
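The relinking step might look something like this rough sketch (not the author's actual code; assumes pip install beautifulsoup4 requests):

    import os
    import requests
    from bs4 import BeautifulSoup
    from urllib.parse import urljoin, urlparse

    def localize_images(html, base_url, out_dir="images"):
        os.makedirs(out_dir, exist_ok=True)
        soup = BeautifulSoup(html, "html.parser")
        for i, img in enumerate(soup.find_all("img", src=True)):
            src = urljoin(base_url, img["src"])
            ext = os.path.splitext(urlparse(src).path)[1] or ".img"
            local = os.path.join(out_dir, f"img_{i}{ext}")
            with open(local, "wb") as fh:
                fh.write(requests.get(src).content)
            img["src"] = local  # point the tag at the downloaded copy
        return str(soup)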
I have had challenges with readability. The output is good for blogs, but when we try it on other types of content it misses important details, even when the page is quite text-heavy, just like a blog.
Yeah, that's correct. I put in a checkbox to disable the readability filter if needed…
I was honestly expecting it to be mostly black magic, but it looks like the meat of the project is a bunch of (surely hard won) regexes. Nifty.
Wait, regexes are the epitome of black magic. What do you consider black magic?
Macros? Any situation where code edits other code?
Sure, I could not write a regex engine, but the language itself can be fine if you keep it to straightforward stuff. Unlike the famous e-mail-parsing regex.
Some years ago I compared those boilerplate-removal tools, and I remember that jusText gave me the best results out of the box (I tried readability and a few other libraries too). I wonder what the state of the art is today?
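For anyone who wants to try it, jusText's documented usage is short (assumes pip install justext requests):

    import requests
    import justext

    response = requests.get("https://example.com/article")
    paragraphs = justext.justext(response.content, justext.get_stoplist("English"))
    for paragraph in paragraphs:
        if not paragraph.is_boilerplate:
            print(paragraph.text)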
This is worth having a look at: https://mixmark-io.github.io/turndown/
With some configuration you can get most of the way there.
Last time I tried readability it worked well with articles but struggled with other kinds of pages. It took away far more content than I wanted it to.
Oh, AI is optional here. I do use readability to clean the HTML before converting to .md.