
Show HN: I made a tool to clean and convert any webpage to Markdown

LeonidBugaev
18 replies
22h58m

One of the cases where AI isn't needed. There are well-established algorithms for extracting content from pages; one implementation: https://github.com/buriy/python-readability
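
For reference, a minimal sketch of that library in use (the PyPI package is readability-lxml; no AI involved):

  import requests
  from readability import Document

  html = requests.get("https://example.org/article").text
  doc = Document(html)
  print(doc.title())    # best-guess article title
  print(doc.summary())  # cleaned HTML of the main article content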

IanCal
6 replies
22h18m

How do you achieve the same things here without AI, using that tool?

chrisweekly
5 replies
22h5m

"How do you do it without AI" is a question I (sadly) expect to see more often.

IanCal
3 replies
21h51m

Feel free to answer, then: how do you do the same things this does with GPT-3/4, without AI?

Edit -

This is an excellent use of it: free-text human input capable of doing things like extracting summaries. It does not seem to be used for the basic task of extracting content at all, but for post-filtering.

cactusfrog
2 replies
21h36m

I think “copy from a PDF” could be improved with AI. It’s been 30 years and I still get new lines in the middle of sentences when I try to copy from one.

genewitch
0 replies
19h37m

I've long assumed that is a "feature" of PDF, akin to DRM. Making it hard to copy text from a PDF makes sense from a publisher's standpoint.

IanCal
0 replies
21h25m

That's a great use case. You might be able to do this if you've got copy and paste on the command line, with

https://github.com/simonw/llm

in between: an alias like pdfwtf translating to "paste | llm command | copy".
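
A rough sketch of the same idea using llm's Python API instead of shell pipes (assumes "pip install llm pyperclip" and a configured API key; the model name is just an example):

  import llm
  import pyperclip

  def pdfwtf() -> None:
      broken = pyperclip.paste()  # text just copied out of the PDF
      model = llm.get_model("gpt-4-turbo")  # any model configured for llm works
      fixed = model.prompt(
          "Rejoin lines broken mid-sentence; change nothing else:\n\n" + broken
      )
      pyperclip.copy(fixed.text())  # pasting now gives clean text

  pdfwtf()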

hombre_fatal
0 replies
20h57m

Meh, it’s just the “how does it work?” question. How content extractors work is interesting, and neither obvious nor trivial.

And even when you see how the readability parser works, AI handles most of the edge cases that content extractors fail on, so they are genuinely superseded by LLMs.

foundzen
3 replies
21h50m

How does it compare to mozilla/readability?

asadm
2 replies
21h35m

It uses readability but does some additional stuff I needed, like relinking images to local paths.

foundzen
1 replies
13h49m

I have had challenges with readability. The output is good for blogs, but when we try it on other types of content it misses important details, even when the page is quite text-heavy, just like a blog.

asadalt
0 replies
12h49m

yeah, that’s correct. i put in a checkbox to disable the readability filter if needed…

fbdab103
2 replies
20h49m

I was honestly expecting it to be mostly black magic, but it looks like the meat of the project is a bunch of (surely hard won) regexes. Nifty.

nyokodo
1 replies
17h14m

I was … expecting it to be mostly black magic, but … the meat of the project is a bunch of … regexes

Wait, regexes are the epitome of black magic. What do you consider black magic?

fbdab103
0 replies
1h51m

Macros? Any situation where code edits other code?

Sure, I could not write a regex engine, but the language itself can be fine if you keep it to straightforward stuff. Unlike the famous e-mail parsing regex.

haddr
1 replies
22h28m

Some years ago I compared those boilerplate removal tools and I remember that jusText gave me the best results out of the box (I tried readability and a few other libraries too). I wonder what the state of the art is today?

jot
0 replies
20h5m

Last time I tried readability it worked well with articles but struggled with other kinds of pages. Took away far more content than I wanted it to.

asadalt
0 replies
22h56m

Oh, AI is optional here. I do use readability to clean the HTML before converting to .md.
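
In Python terms that pipeline is roughly the following (a sketch only; the site itself is a Next.js app, and markdownify here is just one possible HTML-to-Markdown converter):

  import requests
  from readability import Document
  from markdownify import markdownify

  html = requests.get("https://example.org/article").text
  clean_html = Document(html).summary()  # readability pass strips boilerplate
  print(markdownify(clean_html))         # HTML -> Markdown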

ChadNauseam
9 replies
22h52m

This is awesome. I kind of want a browser extension that does this to every page I read and saves them somewhere.

ElFitz
2 replies
22h39m

Wouldn’t that describe Pocket, Readwise Reader, Matter, or another one of those many apps?

Edit: read too fast. Didn’t notice the automatic and systematic aspects.

BHSPitMonkey
1 replies
21h56m

Pocket saves the address; I don't think they save the content.

intelkishan
0 replies
14h45m

Just checked it, they also save the content.

jwoq9118
0 replies
20h21m

Omnivore. Saves a copy using web archive.

https://omnivore.app/

ZunarJ5
0 replies
22h35m

Wallabag + Obsidian + Wallabag Browser Ext. It's a manual trigger but it's great.

screye
8 replies
19h12m

Converting websites to markdown comes with 3 distinct problems:

1. Thoroughly scraping the content of the page (high recall)

2. Dropping all the ads/auxiliary content (high precision)

3. And getting the correct layout/section types (formatting)

For #2 and #3, Trafilatura-, Newspaper4k-, and python-readability-based solutions work best out of the box. For #1, any scraping service + Selenium is going to do a great job.
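
For example, a minimal Trafilatura call covering #2 and #3 (a sketch; include_formatting keeps basic Markdown-style structure):

  import trafilatura

  downloaded = trafilatura.fetch_url("https://example.org/article")
  print(trafilatura.extract(downloaded, include_formatting=True))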

Could you elaborate on what your tool does differently or better? The area has been stagnant for a while, so I'm curious to hear your learnings.

msp26
3 replies
8h40m

Thanks for the links, I had no idea those existed.

For my article web scraper (WIP), the current steps are:

- Navigate with playwright + adblocker

- Run mozilla's readability on the page

- LLM checks readability output

If the check fails:

- Trim whole page HTML context

- Convert to markdown with pandoc

- LLM extracts from markdown
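
Roughly, the first two steps look like this (a sketch, with readability-lxml standing in for Mozilla's Readability; the adblocker, LLM check, and pandoc fallback are left out):

  from playwright.sync_api import sync_playwright
  from readability import Document

  def rendered_article(url: str) -> str:
      with sync_playwright() as p:
          browser = p.chromium.launch(headless=True)
          page = browser.new_page()
          page.goto(url, wait_until="networkidle")
          html = page.content()  # the post-JS DOM, not the raw response
          browser.close()
      return Document(html).summary()  # cleaned HTML of the main content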

privatenumber
2 replies
3h45m

Mozilla has released Readability as a standalone package so you can avoid spinning up a browser entirely: https://github.com/mozilla/readability

msp26
0 replies
2h27m

I still wanted the browser for uBlock Origin and for handling sites with heavy JS. I was already using the standalone Readability script, but today I ended up dropping it for Trafilatura. It works a lot better.

The inefficiency of using a browser rather than just fetching the HTML doesn't really matter, because the limiting factor here is the LLM.

And yes the LLM is essential for getting clean data. None of the existing methods are flexible enough for all cases even if people say "you don't need AI to do this".

asadalt
0 replies
3h24m

You would still need to run a browser for JS-based websites.

scary-size
2 replies
11h23m

Thoroughly scraping is challenging, especially in an environment where you don’t have (or want) a JavaScript runtime.

For content extraction, I found the approach the Postlight library takes quite neat. It scores individual HTML nodes based on some heuristics (text length, link density, CSS classes), then selects the node with the highest score. [1] I ported it to Swift for a personal read-later app.

[1] https://github.com/postlight/parser
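
The core of that scoring idea as a toy Python sketch (the real parser uses many more signals and tuned weights):

  from lxml import html

  HINTS = ("article", "content", "entry", "post")

  def score(node) -> float:
      text = node.text_content()
      link_text = sum(len(a.text_content()) for a in node.findall(".//a"))
      link_density = link_text / len(text) if text else 1.0
      hint_bonus = 25 if any(
          h in (node.get("class", "") + " " + node.get("id", "")).lower()
          for h in HINTS
      ) else 0
      return len(text) * (1.0 - link_density) + hint_bonus

  def best_candidate(page_html: str):
      tree = html.fromstring(page_html)
      candidates = tree.findall(".//div") + tree.findall(".//article")
      return max(candidates, key=score, default=tree.body)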

justech
1 replies
10h11m

This is pretty cool. Care to share your Swift port?

scary-size
0 replies
9h52m

Not planning to. It’s my first Swift/iOS project. I neither want to polish it nor maintain it publicly. Happy to share it privately, email is in the bio. I’m planning on a blog post describing the general approach though!

cchance
0 replies
14h11m

Those are pretty damn cool, I didn't know those even existed.

selfie
4 replies
19h43m

Vercel!! Watch out for your bill now that this is being hugged. Hopefully you are not using <Image /> like they pester you to do.

aleksandern08
1 replies
8h24m

What’s the problem with <Image />?

asadm
0 replies
2h55m

Vercel charges for image optimization/dynamic scaling. People get the wrong perception that it should be a free service because the DX is so easy...

compootr
0 replies
14h20m

It's pretty sad; they push you all throughout their docs to use it, when just a WebP would work OK.

it's literally that meme of the bus — the happy guy is you and vercel and the sad one is your wallet

(unless you actually need dynamic scaling and minification)

asadalt
0 replies
15h27m

it has actually scaled pretty well and has been negligible cost-wise.

I didn’t have to do anything to handle HN traffic. Just a boilerplate nextjs app.

julienreszka
3 replies
22h30m

I think it was hugged to death

asadm
1 replies
22h18m

working on it!

asadm
0 replies
22h4m

I think it should be up now

Animats
0 replies
22h25m

"Either the URL is invalid or the server is too busy".

Aw, tell the user which it is.

asadm
1 replies
21h18m

I think this doesn't load JS before scraping. Many sites hydrate content using JS.

ushakov
0 replies
19h49m

use something like Browserbase?

jot
2 replies
20h37m

Great idea to offer image downloads and filtering with GPT!

I built a similar tool last year that doesn't have those features: https://url2text.com/

Apologies if the UI is slow - you can see some example output on the homepage.

The API it's built on is Urlbox's website screenshot API which performs far better when used directly. You can request markdown along with JS rendered HTML, metadata and screenshot all in one go: https://urlbox.com/extracting-text

You can even have it all saved directly to your S3-compatible storage: https://urlbox.com/s3

And/or delivered by webhook: https://urlbox.com/webhooks

I've been running over 1 million renders per month using Urlbox's markdown feature for a side project. It's so much better using markdown like this for embeddings and in prompts.

If you want to scrape whole websites like this you might also want to check out this new tool by dctanner: https://usescraper.com/

jph00
1 replies
19h52m

Looks nice, but url2text doesn't seem to have an API, and urlbox doesn't seem to have an option to skip the screenshot if you only want the text. And for just the text, it looks to be really expensive.

jot
0 replies
19h31m

Thanks!

Sorry it's not clearer, but you can skip the screenshot in the Urlbox API if you want, with:

  curl -X POST \
    https://api.urlbox.io/v1/render/sync \
    -H 'Authorization: Bearer YOUR_URLBOX_SECRET' \
    -H 'Content-Type: application/json' \
    -d '
  {
    "url": "example.com",
    "format": "md"
  }
  '
Here's the result of that: https://renders.urlbox.io/urlbox1/renders/5799274d37a8b4e604...

Sorry the pricing isn't a good fit for you. Urlbox has been running for over 11 years. We're bootstrapped and profitable with a team of 3 (plus a few contractors). We're priced to be sustainable so our customers can depend on us in the long term. We automatically give volume discounts as your usage grows.

blobcode
2 replies
21h36m

This is one of those things that the ever-amazing pandoc (https://pandoc.org/) does very well, on top of supporting virtually every other document format.

rubyn00bie
1 replies
21h30m

I second this. Pandoc is up there as one of the most useful tools in existence, yet almost no one talks about it. It's amazing, easy to use, and it works. I regularly see new tools in this space pop up, but someone would have to have a REALLY unique and compelling feature, or a highly optimized use case, to get me to use anything else (besides Pandoc).

thangalin
0 replies
21h8m

I wrote a series of blog posts about typesetting Markdown using pandoc:

https://dave.autonoma.ca/blog

Eventually, I found pandoc to be a little limiting:

* Awkward to use interpolated variables within prose.

* No real-time preview prior to rendering the final document.

* Limited options for TeX support (e.g., SVG vs. inline; ConTeXt vs. LaTeX).

* Inconsistent syntax for captions and cross-references.

* Requires glue to apply a single YAML metadata source file to multiple documents (e.g., book chapters).

* Does not (reliably) convert straight quotes to curly quotes.

For my purposes, I wanted to convert variable-laden Markdown and R Markdown to text, XHTML, and PDF formats. Eventually I replaced my toolchain of yamlp + pandoc + knitr by writing an integrated FOSS cross-platform desktop editor.

https://keenwrite.com/

KeenWrite uses flexmark-java + Renjin + KeenTeX + KeenQuotes to provide a solution that can replace pandoc + knitr in some situations.

Note how the captions and cross-reference syntax for images, tables, and equations is unified to use a double-colon sigil:

https://gitlab.com/DaveJarvis/KeenWrite/-/blob/main/docs/ref...

There's also command-line usage for integrating into build pipelines:

https://gitlab.com/DaveJarvis/KeenWrite/-/blob/main/docs/cmd...

old_dev
1 replies
20h16m

It looks like, if the website presents a cookie message, the tool just gets stuck on that and does not parse the actual content. As an example, I tried https://www.cnbc.com/ and all it created was a markdown of the cookie message and some legalese around it.

jot
0 replies
20h7m

It's not easy working around things like that. But here's how it could work: https://url2text.com/u/wYVake

We were lucky to build this on a mature API that already solves loads of the edge cases around rendering different kinds of pages.

midzer
1 replies
22h33m

"Failed to Convert. Either the URL is invalid or the server is too busy. Please try again later."

asadm
0 replies
22h3m

Can you try now?

katehikes88
1 replies
21h59m

links -dump

elinks -dump

lynx -dump

let me guess, you need more?

kwhitefoot
0 replies
21h21m

If the page is rendered using JavaScript then yes you need more.

If there were a version of links, elinks, or lynx that executed JS, that would be wonderful.

jot
0 replies
19h13m

Our tool sadly also fails on this: https://url2text.com/u/KYkpBj

The challenge there is that the content is in an iframe.

If you get the URL used for the iframe you can get the content: https://url2text.com/u/kJWaZY

But that's frustrating as it requires two steps.

We might be able to help you get the content from URLs like these in one step. We have quite a bit of power in the Urlbox API that url2text isn't using.

Drop us an email: support@urlbox.com and we'll see what we can do.

dazzaji
1 replies
22h5m

I’m getting the server overload error but assuming this mostly works I’d use it every day!

asadm
0 replies
22h4m

can you try now?

fbdab103
0 replies
20h40m

Never heard of tidy, this looks promising.

I am kind of tempted/horrified to run all of my final templated HTML through this and see if I can spot any lingering malformations. Depending on how structured the corrections are, could make it a test-suite thing.

cchance
1 replies
14h13m

I did cnn.com and it just downloaded a file that said cnn.com can't be found lol

radicalriddler
0 replies
14h6m

I wonder if it's not following redirects well

nbbaier
0 replies
19h41m

I'd be interested in seeing the code, I love working on tools like this

alsetmusic
1 replies
18h25m

Just tried it on a complex marketing page and it did a great job. Congrats!

I'm curious, if you care to share, what kind of load this places on your host? Is this something that you can keep going for free or will it eventually become non-cost efficient to keep running?

asadalt
0 replies
12h51m

this is slightly heavy due to loading a headless Chrome instance. I will look into optimizing this part.

other than that, GPT-4 is expensive, but so far the cost has been negligible, so I am hopeful.

I feel I can keep it around for a long time.

sigio
0 replies
20h0m

Tried it earlier today, but it was hugged to death. Tried it now, but it only gave me the contents of the cookie wall, not the page I was looking for.

Tried another page on the same site; that time it only gave me the last article on a six-article page. Some weird things going on.

sabr
0 replies
14h9m

Pretty cool.

I built something very similar: smort.io. Just prepend smort.io/ before any article URL to easily edit, annotate, and share it with anyone.

Also works on ArXiv papers!

This was the Show HN post for Smort - https://news.ycombinator.com/item?id=30673502

remorses
0 replies
22h6m

Is the code open source?

radicalriddler
0 replies
19h46m

Is this open-sourced anywhere, by any chance? Are you using GPT to do the conversion, or just doing it yourself by way of HTML -> Markdown substitutions?

ohans
0 replies
12h51m

nitpick: the tooltip (triggered by the question mark icon) does not work on mobile - at least on my iPhone (both Chrome & Safari)

May be worth taking a look at.

Good stuff otherwise! Cheers on the launch

jamesstidard
0 replies
21h26m

Nice. Turn this into a browser extension and I'd install it. I feel like I'd forget about it otherwise.

g105b
0 replies
8h22m

I recently built a website using Markdown as the page source, which then uses CommonMark to produce the HTML. I was interested in seeing how different the extracted markdown looked to the original source markdown. It looks identical!

franciscop
0 replies
22h15m

I also made one like this a while back; you can extract to Markdown, HTML, text, or PDF. I found that pages that are just the bare tool are very hard to position for SEO, since there's not a lot of text/content on the page, even if the tool itself is very useful. Feedback welcome:

https://content-parser.com/

These are all "wrappers around readability" AFAIK (including mine); readability being the Mozilla project that makes sites look clean, which I use often.

cpeffer
0 replies
20h14m

Very cool. We posted about a similar tool we built yesterday

https://www.firecrawl.dev/

It also crawls (although you can scrape single pages as well)

brokenmachine
0 replies
19h48m

I use the MarkDownload extension for Firefox. Seems to work pretty OK.

blacklight
0 replies
9h27m

I wrote a similar article a while ago: https://blog.platypush.tech/article/Deliver-articles-to-your...

In my case the purpose was to send saved links to my e-reader, and I used Markdown as an intermediate format via the mercury.js scraping API, but the possibilities are endless.

ben_ja_min
0 replies
20h7m

I've been looking for this! My method requires too many steps. I look forward to seeing if this improves my results. Thanks!

RamblingCTO
0 replies
13h11m

A year ago I implemented this as well (albeit as a commercial offering with 100 free scrapes per month): https://2markdown.com It also has JavaScript-enabled browsing available in private beta; I will make it public this week. In my experience, people fall back to simple scraping and don't use JS that much, if at all.

JoannaWongs
0 replies
15h45m

Very cool! We also developed a similar tool that can extract information from complex PDFs with embedded tables, images, or graphs, and get your multimodal data sources LLM-ready. https://hellorag.ai

J_Shelby_J
0 replies
14h43m

I built a no-code GitHub blog deployer thingy that lets you deploy a blog to GitHub Pages from a Codespace. https://github.com/ShelbyJenkins/easy-astro-blog-creator

Anyways, it uses Astro + Markdown.

It'd be really neat if I could scrape my Medium account to convert it to Markdown, to save me the trouble.

FounderBurr
0 replies
9h19m

Zzzzzzz