One of the cases where AI isn't needed. There is a very good, well-proven algorithm for extracting content from pages; one implementation: https://github.com/buriy/python-readability
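For reference, the documented usage of that library (readability-lxml) is only a few lines; this sketch assumes pip install readability-lxml requests:

    import requests
    from readability import Document

    # Fetch a page and let readability pull out the main article.
    response = requests.get("https://example.com/article")
    doc = Document(response.text)

    print(doc.title())
    print(doc.summary())  # cleaned article HTML, ready for Markdown conversion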
This is awesome. I kind of want a browser extension that does this to every page I read and saves them somewhere.
Wouldn’t that describe Pocket, Readwise Reader, Matter, or another one of those many apps?
Edit: read too fast. Didn’t notice the automatic and systematic aspects.
Pocket saves the address; I don't think they save the content.
Just checked it, they also save the content.
My choice (manual): Markdown clipper
https://github.com/deathau/markdown-clipper
I guess there are dozens of alternative extensions available out there …
This fork:
https://github.com/deathau/markdownload
It has extensions available for Firefox, Google Chrome, Microsoft Edge, and Safari.
Singlefile for Firefox https://addons.mozilla.org/en-US/firefox/addon/single-file/
Also WebScrapBook: https://github.com/danny0838/webscrapbook
Omnivore. Saves a copy using web archive.
Wallabag + Obsidian + Wallabag Browser Ext. It's a manual trigger, but it's great.
Converting websites to markdown comes with 3 distinct problems:
1. Thoroughly scraping the content of the page (high recall)
2. Dropping all the ads/auxiliary content (high precision)
3. And getting the correct layout/section types (formatting)
For #2 and #3, Trafilatura, Newspaper4k, and python-readability-based solutions work best out of the box; a minimal example follows below. For #1, any scraping service + Selenium is going to do a great job.
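A minimal Trafilatura call, assuming pip install trafilatura (the URL is a placeholder):

    import trafilatura

    downloaded = trafilatura.fetch_url("https://example.com/article")
    if downloaded is not None:
        # extract() addresses #2 (boilerplate removal) and #3 (structure);
        # #1 still needs a real browser upstream for JS-heavy pages.
        text = trafilatura.extract(downloaded, include_comments=False)
        print(text)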
Could you elaborate on what your tool does differently or better? The area has been stagnant for a while, so I'm curious to hear your learnings.
Thanks for the links, I had no idea those existed.
For my article web scraper (WIP), the current steps are:
- Navigate with playwright + adblocker
- Run mozilla's readability on the page
- LLM checks readability output
If the check fails:
- Trim whole page HTML context
- Convert to markdown with pandoc
- LLM extracts from markdown
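A rough sketch of the first two steps (not the author's actual code), assuming pip install playwright, playwright install chromium, and a local copy of Mozilla's Readability.js; the adblocker and the LLM check/fallback are omitted:

    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto("https://example.com/article", wait_until="networkidle")
        page.add_script_tag(path="Readability.js")
        # Readability mutates the DOM, so parse a clone of the document.
        article = page.evaluate(
            "() => new Readability(document.cloneNode(true)).parse()"
        )
        browser.close()

    if article:  # dict with 'title', 'content' (cleaned HTML), etc.
        print(article["title"])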
Mozilla has released Readability as a standalone package so you can avoid spinning up a browser entirely: https://github.com/mozilla/readability
I still wanted the browser for UBlock Origin and handling sites with heavy JS. I was using the standalone Readability script already but today I ended up dropping it for Trafilatura. It works a lot better.
The inefficiency of using a browser rather than just fetching the HTML doesn't really matter, because the limiting factor here is the LLM.
And yes the LLM is essential for getting clean data. None of the existing methods are flexible enough for all cases even if people say "you don't need AI to do this".
You would still need to run a browser for JS-based websites.
Thoroughly scraping is challenging, especially in an environment where you don’t have (or want) a JavaScript runtime.
For content extraction, I found the approach the Postlight library takes quite neat. It scores individual HTML nodes based on some heuristics (text length, link density, CSS classes), then selects the nodes with the highest score. [1] I ported it to Swift for a personal read-later app.
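The scoring idea looks roughly like this (a toy illustration, not Postlight's actual algorithm; assumes pip install beautifulsoup4):

    from bs4 import BeautifulSoup

    def score(node):
        text_len = len(node.get_text(strip=True))
        link_len = sum(len(a.get_text(strip=True)) for a in node.find_all("a"))
        link_density = link_len / text_len if text_len else 1.0
        # Long text is good; text that is mostly links (navigation) is bad.
        return text_len * (1.0 - link_density)

    def best_candidate(html):
        soup = BeautifulSoup(html, "html.parser")
        candidates = soup.find_all(["article", "div", "section"])
        return max(candidates, key=score, default=None)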
This is pretty cool. Care to share your Swift port?
Not planning to. It’s my first Swift/iOS project. I neither want to polish it nor maintain it publicly. Happy to share it privately, email is in the bio. I’m planning on a blog post describing the general approach though!
Those are pretty damn cool, I didn't know those even existed.
Vercel!! Watch out for your bill now that this is being hugged. Hopefully you are not using <Image /> like they pester you to do.
What’s the problem with <Image />?
Vercel charges for image optimizations/dynamic scaling. People get the wrong perception that it should be a free service because the DX is so easy...
It's pretty sad; they encourage you throughout their docs to use it, when just a WebP would work fine.
It's literally that bus meme: the happy guy is you and Vercel, and the sad one is your wallet.
(Unless you actually need dynamic scaling and minification, that is.)
it has actually scaled pretty well and has been negligible cost-wise.
I didn’t have to do anything to handle HN traffic. Just a boilerplate nextjs app.
I think it was hugged to death
working on it!
I think it should be up now
"Either the URL is invalid or the server is too busy".
Aw, tell the user which it is.
Here is an open source alternative to this tool: https://github.com/ozanmakes/scrapedown
I think this doesn't load js before scraping. Many sites hydrate content using JS.
use something like Browserbase?
Great idea to offer image downloads and filtering with GPT!
I built a similar tool last year that doesn't have those features: https://url2text.com/
Apologies if the UI is slow - you can see some example output on the homepage.
The API it's built on is Urlbox's website screenshot API which performs far better when used directly. You can request markdown along with JS rendered HTML, metadata and screenshot all in one go: https://urlbox.com/extracting-text
You can even have it all saved directly to your S3-compatible storage: https://urlbox.com/s3
And/or delivered by webhook: https://urlbox.com/webhooks
I've been running over 1 million renders per month using Urlbox's markdown feature for a side project. It's so much better using markdown like this for embeddings and in prompts.
If you want to scrape whole websites like this you might also want to checkout this new tool by dctanner: https://usescraper.com/
Looks nice, but url2text doesn't seem to have an API, and urlbox doesn't seem to have an option to skip the screenshot if you only want the text. And for just the text, it looks to be really expensive.
Thanks!
Sorry it's not clearer but you can skip the screenshot in the Urlbox API if you want to with:
curl -X POST \
https://api.urlbox.io/v1/render/sync \
-H 'Authorization: Bearer YOUR_URLBOX_SECRET' \
-H 'Content-Type: application/json' \
-d '
{
"url": "example.com",
"format": "md"
}
'
Here's the result of that: https://renders.urlbox.io/urlbox1/renders/5799274d37a8b4e604...

Sorry the pricing isn't a good fit for you. Urlbox has been running for over 11 years. We're bootstrapped and profitable with a team of 3 (plus a few contractors). We're priced to be sustainable so our customers can depend on us in the long term. We automatically give volume discounts as your usage grows.
This is one of those things that the ever-amazing pandoc (https://pandoc.org/) does very well, on top of supporting virtually every other document format.
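If you'd rather call it from code than shell out, the pypandoc wrapper works too (assumes pandoc itself plus pip install pypandoc):

    import pypandoc

    html = "<h1>Title</h1><p>Some <em>emphasised</em> text.</p>"
    # "gfm" is GitHub-Flavored Markdown; plain "markdown" also works.
    print(pypandoc.convert_text(html, "gfm", format="html"))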
I second this. Pandoc is up there as one of the most useful tools in existence that almost no one talks about. It's amazing, easy to use, and it works. I regularly see new tools in the space pop up, but someone would have to have a REALLY unique and compelling feature, or a highly optimized use case, to get me to use anything else (besides Pandoc).
I wrote a series of blog posts about typesetting Markdown using pandoc:
Eventually, I found pandoc to be a little limiting:
* Awkward to use interpolated variables within prose.
* No real-time preview prior to rendering the final document.
* Limited options for TeX support (e.g., SVG vs. inline; ConTeXt vs. LaTeX).
* Inconsistent syntax for captions and cross-references.
* Requires glue to apply a single YAML metadata source file to multiple documents (e.g., book chapters).
* Does not (reliably) convert straight quotes to curly quotes.
For my purposes, I wanted to convert variable-laden Markdown and R Markdown to text, XHTML, and PDF formats. Eventually I replaced my tool chain of yamlp + pandoc + knitr by writing an integrated FOSS cross-platform desktop editor.
KeenWrite uses flexmark-java + Renjin + KeenTeX + KeenQuotes to provide a solution that can replace pandoc + knitr in some situations.
Note how the captions and cross-reference syntax for images, tables, and equations is unified to use a double-colon sigil:
https://gitlab.com/DaveJarvis/KeenWrite/-/blob/main/docs/ref...
There's also command-line usage for integrating into build pipelines:
https://gitlab.com/DaveJarvis/KeenWrite/-/blob/main/docs/cmd...
It looks like, if the website presents a cookie message, the tool just gets stuck on that and does not parse the actual content. As an example, I tried https://www.cnbc.com/ and all it created was a markdown of the cookie message and some legalese around it.
It's not easy working around things like that. But here's how it could work: https://url2text.com/u/wYVake
We were lucky to build this on a mature API that already solves loads of the edge cases around rendering different kinds of pages.
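If you're rolling your own rendering instead, one common (and admittedly fragile) workaround is to try clicking typical consent buttons before extracting; a heuristic Playwright sketch, not what url2text actually does:

    from playwright.sync_api import sync_playwright

    CONSENT_LABELS = ["Accept all", "I accept", "Agree", "Accept"]

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto("https://www.cnbc.com/", wait_until="domcontentloaded")
        for label in CONSENT_LABELS:
            button = page.get_by_role("button", name=label)
            if button.count() > 0:
                button.first.click()
                page.wait_for_timeout(1000)  # let the overlay disappear
                break
        html = page.content()  # hopefully past the cookie wall now
        browser.close()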
"Failed to Convert. Either the URL is invalid or the server is too busy. Please try again later."
can you try now
links -dump
elinks -dump
lynx -dump
let me guess, you need more?
If the page is rendered using JavaScript then yes you need more.
If there were a version of links, elinks, or lynx that executed JS that would be wonderful.
I tried a lot of tools, but none of them work on websites like https://www.globalrelay.com/company/careers/jobs/?gh_jid=507...
Our tool sadly also fails on this: https://url2text.com/u/KYkpBj
The challenge there is that the content is in an iframe.
If you get the URL used for the iframe you can get the content: https://url2text.com/u/kJWaZY
But that's frustrating as it requires two steps.
We might be able to help you get the content from URLs like these in one step. We have quite a bit of power in the Urlbox API that url2text isn't using.
Drop us an email: support@urlbox.com and we'll see what we can do.
I’m getting the server overload error but assuming this mostly works I’d use it every day!
can you try now?
I've found htmltidy [1] and pandoc html->markdown sufficiently capable.
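A sketch of that pipeline, assuming both CLIs are on PATH (the tidy flags are illustrative; check your htmltidy version's docs):

    import subprocess

    def html_to_markdown(raw_html: bytes) -> str:
        # Normalise the HTML first. tidy exits non-zero on mere warnings,
        # so no check=True on this step.
        tidied = subprocess.run(
            ["tidy", "-q", "-asxhtml", "--show-warnings", "no"],
            input=raw_html, capture_output=True,
        ).stdout
        result = subprocess.run(
            ["pandoc", "-f", "html", "-t", "gfm"],
            input=tidied, capture_output=True, check=True,
        )
        return result.stdout.decode("utf-8")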
Never heard of tidy, this looks promising.
I am kind of tempted/horrified to run all of my final templated HTML through this and see if I can spot any lingering malformations. Depending on how structured the corrections are, could make it a test-suite thing.
I did cnn.com and it just downloaded a file that said cnn.com can't be found lol
I wonder if it's not following redirects well
Nice! I did something similar a while back, but just for substack :)
One example --
URL: https://newsletter.pragmaticengineer.com/p/zirp-software-eng...
Sanitized output: https://substack-ai.vercel.app/u/pragmaticengineer/p/zirp-so...
Raw markdown: https://substack-ai.vercel.app/api/users/pragmaticengineer/p...
(Would be happy to open source it if anyone cares!)
I'd be interested in seeing the code, I love working on tools like this
Just tried it on a complex marketing page and it did a great job. Congrats!
I'm curious, if you care to share, what kind of load this places on your host? Is this something that you can keep going for free or will it eventually become non-cost efficient to keep running?
This is slightly heavy due to loading a headless Chrome instance. I will look into optimizing this part.
Other than that, GPT-4 is expensive, but so far the cost has been negligible, so I am hopeful.
I feel I can keep it around for a long time.
Tried it earlier today, but it was hugged to death. Tried it now, but it only gave me the contents of the cookie-wall, not the page I was looking for.
Tried another page of the same site; then it only gave me the last article on a 6-article page. Some weird things going on.
Pretty cool.
I built something very similar: smort.io. Just prepend smort.io/ before any article URL to easily edit, annotate, and share it with anyone.
Also works on ArXiv papers!
This was the Show HN post for Smort - https://news.ycombinator.com/item?id=30673502
If anyone is looking for a C++ solution to convert HTML to Markdown, I'm using this library https://github.com/tim-gromeyer/html2md in my app.
Is the code open source?
Is this open-sourced anywhere, by any chance? Are you using GPT to do the conversion, or just doing it yourself by way of HTML -> Markdown substitutions?
Nitpick: the tooltip (triggered by the question mark icon) does not work on mobile, at least on my iPhone (both Chrome and Safari).
May be worth taking a look at.
Good stuff otherwise! Cheers on the launch
Nice. Turn this into a browser extension and I'd install it. I feel like I'd forget about it otherwise.
I recently built a website using Markdown as the page source, which then uses CommonMark to produce the HTML. I was interested in seeing how different the extracted markdown looked to the original source markdown. It looks identical!
I also made one like this a while back; you can extract to markdown, HTML, text, or PDF. I found that pages that are just the bare tool are very hard to position for SEO, since there's not a lot of text/content on the page, even if the tool could be very useful. Feedback welcome:
These are all "wrappers around readability" AFAIK (including mine); readability is the Mozilla project that makes sites look clean, which I use often.
Very cool. We posted about a similar tool we built yesterday
It also crawls (although you can scrape single pages as well)
I use markdownload extension for Firefox. Seems to work pretty ok.
I wrote a similar article a while ago: https://blog.platypush.tech/article/Deliver-articles-to-your...
In my case the purpose was to share saved links to my e-reader, and I used Markdown as an intermediate format via the mercury.js scraping API, but the possibilities are endless.
I've been looking for this! My method requires too many steps. I look forward to seeing if this improves my results. Thanks!
A year ago I implemented this as well (albeit as a commercial offering with 100 free scrapes per month): https://2markdown.com It also has JavaScript-enabled browsing available in private beta; I'll make it public this week. In my experience, people fall back to simple scraping and don't use JS much, if at all.
Very cool! We also developed a similar tool that can extract information from complex PDFs with embedded tables, images, or graphs, and get your multimodal data sources LLM-ready. https://hellorag.ai
I built a no-code GitHub blog deployer thingy that lets you deploy a blog to GitHub Pages from a Codespace. https://github.com/ShelbyJenkins/easy-astro-blog-creator
Anyway, it uses Astro + Markdown.
It'd be really neat if I could scrape my medium account to convert it to markdown to save me the trouble.
Zzzzzzz
How do you achieve the same things without AI here using that tool?
"How do you do it without AI" is a question I (sadly) expect to see more often.
Feel free to answer, then: how do you do the same things this does with GPT-3/4, without AI?
Edit: This is an excellent use of it, a free-text human input capable of doing things like extracting summaries. It does not seem to be used at all for the basic task of extracting content, but for post-filtering.
I think “copy from a PDF” could be improved with AI. It’s been 30 years and I still get new lines in the middle of sentences when I try to copy from one.
I've long assumed that is a "feature" of PDF akin to DRM. Making it hard to copy text from a PDF makes sense from a publisher's standpoint.
That's a great use case. You might be able to do this if you've got copy and paste on the command line, with https://github.com/simonw/llm in between. An alias like pdfwtf translating to "paste | llm command | copy".
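Something like that alias, sketched in Python (assumes macOS pbpaste/pbcopy and the llm CLI installed and configured; the prompt wording here is made up):

    import subprocess

    clipboard = subprocess.run(["pbpaste"], capture_output=True).stdout
    fixed = subprocess.run(
        ["llm", "Rejoin lines broken mid-sentence; output only the fixed text."],
        input=clipboard, capture_output=True, check=True,
    ).stdout
    subprocess.run(["pbcopy"], input=fixed, check=True)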
Meh, it's just the "how does it work?" question. How content extractors work is interesting, and neither obvious nor trivial.
And even when you see how readability parser works, AI handles most of the edge cases that content extractors fail on, so they are genuinely superseded by LLMs.
How does it compare to mozilla/readability?
It uses readability but does some additional stuff, like relinking images to local paths, etc., which I needed.
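The relinking step might look something like this rough sketch (not the author's actual code; assumes pip install beautifulsoup4 requests):

    import os
    import requests
    from bs4 import BeautifulSoup
    from urllib.parse import urljoin, urlparse

    def localize_images(html, base_url, out_dir="images"):
        os.makedirs(out_dir, exist_ok=True)
        soup = BeautifulSoup(html, "html.parser")
        for i, img in enumerate(soup.find_all("img", src=True)):
            src = urljoin(base_url, img["src"])
            ext = os.path.splitext(urlparse(src).path)[1] or ".img"
            local = os.path.join(out_dir, f"img_{i}{ext}")
            with open(local, "wb") as fh:
                fh.write(requests.get(src).content)
            img["src"] = local  # point the tag at the downloaded copy
        return str(soup)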
I have had challenges with readability. The output is good for blogs, but when we try it on other types of content it misses important details, even when the page is quite text-heavy, just like a blog.
Yeah, that's correct. I put in a checkbox to disable the readability filter if needed…
I was honestly expecting it to be mostly black magic, but it looks like the meat of the project is a bunch of (surely hard won) regexes. Nifty.
Wait, regexes are the epitome of black magic. What do you consider black magic?
Macros? Any situation where code edits other code?
Sure, I could not write a regex engine, but the language itself can be fine if you keep it to straightforward stuff. Unlike the famous e-mail-parsing regex.
Some years ago I compared those boilerplate-removal tools, and I remember that jusText gave me the best results out of the box (I tried readability and a few other libraries too). I wonder what the state of the art is today?
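For anyone who wants to try it, jusText's documented usage is short (assumes pip install justext requests):

    import requests
    import justext

    response = requests.get("https://example.com/article")
    paragraphs = justext.justext(response.content, justext.get_stoplist("English"))
    for paragraph in paragraphs:
        if not paragraph.is_boilerplate:
            print(paragraph.text)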
This is worth having a look at: https://mixmark-io.github.io/turndown/
With some configuration you can get most of the way there.
Last time I tried readability it worked well with articles but struggled with other kinds of pages. It took away far more content than I wanted it to.
Oh, AI is optional here. I do use readability to clean the HTML before converting to .md.