HN comments for: Monolith – CLI tool for saving complete web pages as a single HTML file

lopkeny12ko

30 replies

19h40m

2024-03-24 22:46:02 UTC

How does this compare to SingleFile?

https://www.npmjs.com/package/single-file-cli

gildas

29 replies

18h58m

2024-03-24 23:27:29 UTC

Author of SingleFile here, one of the major differences is that monolith doesn't use a web browser to take page captures. As a result, it doesn't support JavaScript, for example. SingleFile, on the other hand, requires a Chromium-based browser to be installed. It should also produce smaller pages and is capable of generating ZIP or self-extracting ZIP files. However, it will take longer to capture a page. Note that since version 2, it is now possible to download executable files of the CLI tool [1].

[1] https://github.com/gildas-lormeau/single-file-cli/releases

darkteflon

13 replies

18h34m

2024-03-24 23:51:46 UTC

SingleFile is amazing - use it tens of times every day across desktop and mobile. Can’t recall a single instance of it breaking. Thank you sincerely for your excellent work.

samstave

4 replies

17h48m

2024-03-25 00:37:49 UTC

May you please share what workflow is having you do this so much each day?

What do?

darkteflon

3 replies

16h16m

2024-03-25 02:10:01 UTC

Sure!

I use SingleFile to save a copy of every article / post / SO & forum discussion I find interesting or useful. I sort them into two buckets: work, and not-work.

I’ve been doing this for 10+ years (before SingleFile I used things like .pdf, plain .html, .webarchive files - although these all have drawbacks).

In the pre-LLM era, I would then interface with these almost exclusively through a search front-end. I use Houdahspot on Mac and easySearch on iOS. That lets me see everything interesting I’ve read on a particular subject just by typing it in (with the usual caveats that apply to basic keyword search - although in practice that alone has proven very effective). Because it’s just a folder of essentially zipped .html files, there’s no lock-in.

Now that we’ve got LLMs, I plug those 10+ years of files straight into my RAG pipeline using llama-index. It’s quite nice :)

felipefar

2 replies

15h43m

2024-03-25 02:43:16 UTC

Sorry for the ignorance, but if the forum posts require login to access then you won't be able to use SingleFile, right?

Also, how is the quality of the output generated compared to a .pdf? I'm used to printing PDFs from Chrome for articles that I want to save, but the layout can become awkward sometimes, and navigation bars can appear several times and hide portions of the text.

I like this feature from chrome, but it's not consistently reliable.

tfsh

0 replies

6h35m

2024-03-25 11:51:19 UTC

SingleFile operates in the context of your browser, so it scrapes files with your cookie jar, meaning you will be authenticated; specifically, it'll scrape files as you see them.

In most cases SingleFile output looks identical to the real thing. Though I generally only use it on simpler sites such as recipes and technical blogs.

freedomben

0 replies

14h0m

2024-03-25 04:25:49 UTC

If you use the browser extension, then pages requiring login are no problem because you are already logged in.

The output compared to PDF is like night and day: high fidelity versus low fidelity. At this point, I only use PDF if for some reason I need it.

profsummergig

4 replies

17h25m

2024-03-25 01:00:59 UTC

How do you use it on mobile? Is there an app for it? I don't see it on the Google Play store.

gildas

3 replies

17h15m

2024-03-25 01:11:02 UTC

It's officially available on Firefox for Android [1] and Safari [2] on mobile. You might also be able to use it with Kiwi Browser [3] on Android.

[1] https://addons.mozilla.org/android/addon/single-file/

[2] https://apps.apple.com/app/singlefile-for-safari/id644432254...

[3] https://play.google.com/store/apps/details?id=com.kiwibrowse...

alpacca-farm

1 replies

16h54m

2024-03-25 01:32:18 UTC

Just stumbled across Monolith and SingleFile recently and it's fascinating to see how these tools approach the challenge of web archiving in different ways. SingleFile seems to be a powerhouse, especially for those who rely heavily on JavaScript-laden pages. The ability to produce smaller pages and even generate ZIP files is pretty handy for content archiving and sharing.

That said, Monolith's approach of not requiring a web browser could be a game changer for simpler projects or where installing a Chromium-based browser isn't viable. It strikes me as a more straightforward, lightweight solution, albeit with the clear trade-off of not supporting JavaScript.

Has anyone run into situations where one tool clearly outperformed the other in real-world usage? I'm particularly curious about the impact on performance and convenience when choosing between these two, especially for mobile use. Also, kudos to the authors and contributors of these tools. The tech community benefits greatly from such innovations that help preserve and share knowledge.

supriyo-biswas

0 replies

13h44m

2024-03-25 04:42:11 UTC

Is this an LLM-generated comment? The structure of this response seems to be too close to the “while X, it’s also important to Y” construction that LLMs like to use.

Anyway, to answer your question, lots of pages need JS to work correctly, so using SingleFile is the better option.

voltaireodactyl

0 replies

11h35m

2024-03-25 06:50:40 UTC

Anybox (on Mac and ios) also supports SingleFile, presenting as a WebDAV server for archives to be saved. It’s flawless and hugely convenient in my experience.

gildas

2 replies

18h22m

2024-03-25 00:03:35 UTC

Thanks a lot! Believe me, there have been a lot of bugs (900+ issues closed as of today) because it's actually hard to save a web page. You were lucky not to suffer ;)

darkteflon

1 replies

16h11m

2024-03-25 02:15:07 UTC

I bet! The proof of that must surely be in how poor a job formats like .webarchive do of it.

SingleFile just makes this one really complex, really important thing trivially easy, and in a portable format. For anyone curating a knowledge base it’s an absolute godsend.

I didn’t see any donation instructions on your GitHub - I for one would certainly love to chip in if you could point me in the right direction?

gildas

0 replies

7h2m

2024-03-25 11:23:38 UTC

You can find links to sponsor the project here [1] in the section "Sponsor this project" at the bottom right of the page.

[1] https://github.com/gildas-lormeau/SingleFile

mikae1

3 replies

11h53m

2024-03-25 06:33:05 UTC

> SingleFile, on the other hand, requires a Chromium-based browser to be installed.

I'm using it as a Firefox extension. Am I missing something?

gildas

2 replies

7h29m

2024-03-25 10:56:37 UTC

Sorry for the confusion, I was referring to SingleFile run via the command line interface.

pbnjeh

0 replies

35m

2024-03-25 17:51:03 UTC

Thank you for the clarification. Great extension!

mikae1

0 replies

5h50m

2024-03-25 12:35:37 UTC

Thanks (it's awesome) :)

codazoda

3 replies

16h10m

2024-03-25 02:15:41 UTC

What does SingleFile do? The intro tells you how to run it, but not what it does.

genewitch

0 replies

15h31m

2024-03-25 02:54:53 UTC

For me it bridged the gap that warped into existence between the time when "take screenshot" existed on firefox and when webpages figured out some people did this to archive pages and started putting crap in to either mess with the layout or otherwise "break" the resulting file.

It snapshots a web page to a single html file. At least that's what i use it for. I use it to both archive stuff and to have proof that some site published something.

The next order up would be archivebox or whatever archive.org uses (the name escapes me) - which is a very heavy caching proxy that can save entire websites into a single directory in a way that wget/curl and all the other crawlers cannot.

If you care that the exact layout and everything is perfect, right now i think singlefile is aces.

freedomben

0 replies

13h58m

2024-03-25 04:27:51 UTC

It takes whatever is in the DOM of the page you are viewing and sticks it into a single HTML file that can be served later and will reproduce the source page with high fidelity.

I use it to export an HTML file that I can stick in my logseq archive for later. So much better than just printing to a PDF!

eviks

0 replies

14h12m

2024-03-25 04:14:20 UTC

It saves a web page into a single file.

Capricorn2481

2 replies

16h59m

2024-03-25 01:26:41 UTC

On the front page, Monolith says it embeds JavaScript. Are you saying it doesn't use this JavaScript to render the page before taking a snapshot?

throwaway290

0 replies

16h45m

2024-03-25 01:40:53 UTC

It probably means that if this JS fetched more JS it won't be included so if you render offline it will be broken.

slmjkdbtl

0 replies

16h45m

2024-03-25 01:40:40 UTC

Sounds like they fetch the JS code from the URL and embed that code in the HTML, but don't have a JS engine to execute it.
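To make the "fetch and embed, but never execute" idea concrete, here is a rough Python sketch. It is not Monolith's actual code (Monolith is a separate tool with its own implementation); the regex, the `inline_scripts` name and the injected `fetch` callable are all assumptions made up for illustration, and other attributes on the original tag are simply dropped.

```python
import re
from typing import Callable
from urllib.parse import urljoin

# Matches <script ... src="..."></script> with an empty body (illustrative only;
# a real tool would use a proper HTML parser).
SCRIPT_SRC = re.compile(
    r'<script\b([^>]*?)\bsrc=(["\'])(.*?)\2([^>]*)>\s*</script>',
    re.IGNORECASE | re.DOTALL,
)


def inline_scripts(html: str, base_url: str, fetch: Callable[[str], str]) -> str:
    """Replace external <script src> tags with inline <script> bodies.

    `fetch` is a hypothetical callable that returns the text at a URL.
    Nothing is executed: whatever the script would load at runtime is not fetched.
    """

    def replace(match: "re.Match[str]") -> str:
        source = fetch(urljoin(base_url, match.group(3)))
        # Keep the embedded text from closing the tag early.
        source = source.replace("</script", "<\\/script")
        return f"<script>{source}</script>"

    return SCRIPT_SRC.sub(replace, html)
```

If the embedded script itself loads `/more.js` at runtime, that second file never makes it into the output, which is why such a page can break when opened offline.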

n8henrie

1 replies

16h21m

2024-03-25 02:04:51 UTC

> requires a Chromium-based browser to be installed

Not to try to correct the author here, but it supports geckobrowser as well (not just chromium-based), right?

I'm currently trying to package for nixpkgs[0] and am using Firefox for the checkPhase.

[0]: https://github.com/NixOS/nixpkgs/pull/283878

pbnjeh

0 replies

11h52m

2024-03-25 06:34:00 UTC

I was about to post a similar question: What does this mean for those using the Firefox versions of the extensions (SingleFile as well as the version that zips the result)?

DavideNL

1 replies

6h29m

2024-03-25 11:56:40 UTC

@gildas Curious, is there any specific reason why single-file-cli is not available in Homebrew on macOS?

PS. I use SingleFile a lot, it's great... Thank you!

gildas

0 replies

6h17m

2024-03-25 12:08:32 UTC

In fact, I've only been making executable files available for a few weeks now. I'll have to see how to distribute them via Homebrew and so on.

simonw

27 replies

19h39m

2024-03-24 22:46:39 UTC

Well this is fun... from the README here I learned I can do this on macOS:

    /Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome \
    --headless --incognito --dump-dom https://github.com > /tmp/github.html

And get an HTML file for a page after the JavaScript has been executed.

Wrote up a TIL about this with more details: https://til.simonwillison.net/chrome/headless

My own https://shot-scraper.datasette.io/ tool (which uses headless Playwright Chromium under the hood) has a command for this too:

    shot-scraper html https://github.com/ > /tmp/github.html

But it's neat that you can do it with just Google Chrome installed and nothing else.

dotancohen

7 replies

9h51m

2024-03-25 08:35:06 UTC

Thank you for shot-scraper! I've tested it in the past, but something severely missing from all screenshot tools, shot-scraper included, is a way to avoid screenshotting popups. For instance, newsletter or login popups, GDPR popups, etc. If shot-scraper has a reliable way of screenshotting websites while avoiding these popups, I would love to know.

I'm on mobile so don't have access to my notes, but I'm pretty sure that a year ago when I tried there was no reliable way to screenshot e.g. the BBC news website without getting the popups.

Again, thank you.

cjr

2 replies

6h39m

2024-03-25 11:47:24 UTC

I work on a paid screenshot api[0] where we have features to either hide these banners and overlays using css, or alternatively we run some javascript to send a click event to what we detect as the 'accept' button in order to dismiss the popups.

It's quite a painful problem and we screenshot many millions of sites a day, our success rate at detecting these is high but still not 100%.

We have gotten quite far with heuristics and are exploring whether we can get better results by training a model.

[0]:https://urlbox.com

qwertox

1 replies

6h18m

2024-03-25 12:07:51 UTC

Interesting service.

Can the API provide a custom tag, comment or ID which will then be inserted in the output? Like in JPEG EXIF, PNG also knows Metadata, PDF description, HTML meta tag?

cjr

0 replies

6h1m

2024-03-25 12:25:04 UTC

No - we don't currently support this feature as nobody has asked for it so far :)

You could append a new meta tag by running some custom JS to add it, but we don't modify exif, metadata or pdf description at the moment.

grey8

1 replies

7h7m

2024-03-25 11:18:56 UTC

Just a thought, but what happens if you orchestrate a browser instance with an installed ad blocker like uBlock Origin?

walthamstow

0 replies

6h39m

2024-03-25 11:47:06 UTC

This works well, I've done it before with Selenium and a headless Firefox with uBlock Origin and Bypass Paywalls installed.

simonw

0 replies

2h52m

2024-03-25 15:33:44 UTC

Try this:

    shot-scraper -h 800 'https://www.spiegel.de/international/' \
      --wait-for "() => {
        const div = document.querySelector('[id^="sp_message_container"]');
        if (div) {
          div.remove();
          return true;
        }
      }"

shot-scraper runs that --wait-for script until it returns true. In this case we're waiting for the cookie consent overlay div to show up and then removing it before we take the screenshot.

Screenshots here: https://gist.github.com/simonw/de75355c39025f9a64548aa3366b1...

DANmode

0 replies

9h2m

2024-03-25 09:23:50 UTC

Screenshot the archive.org render?

samstave

5 replies

18h17m

2024-03-25 00:08:35 UTC

Yay! I love shot-scraper - I wish you had made it a decade ago!

Thanks for shot-scraper.

Off the top of your head, what would be the easiest command to have shot-scraper barf a directory of shot-scraper HTMLs each day from my daily browsing history?

This would be interesting if I have a browsing session for learning something and I am researching across a bunch of sites - roll it all up into a Digi-ography of the sites used in learning that topic?

---

I've always been baffled that this isn't an innate functionality in any app/OS - it's a damn computer - I should have a great ability to recall what it displays and what you have been doing.

Heck - we need our machines to write us a daily status report for what we did at the end of each day.

Surely that would change productivity, if you were forced to do a self-digital-confession and stare your ADHD and procrastination right in the face.

simonw

1 replies

15h24m

2024-03-25 03:01:50 UTC

Yeah, things like Archive Box are probably a better bet there. But... you could write a script that queries the SQLite database of your history, figures out the pages you visited then loops through and runs `shot-scraper html ... > ...html` against each one.

I just wasted a few minutes trying to get Claude 3 Opus to write me a script - nearly got there but Firefox had my SQLite database locked and I lost interest. My conversation so far is at: https://gist.github.com/simonw/9f20a02f35f7a129b9850988117c0...

eichin

0 replies

14h54m

2024-03-25 03:32:00 UTC

My "cheat" for "poke at chrome's sqlite database for current live state" is that they're always locked but none of them are that big, just make a copy and query the copy. `current-chrome-url` runs in `real 0m0.057s` and does a `jq -r .profile.last_used ~/.config/google-chrome/Local\ State` to get the profile, then copies `~/.config/google-chrome/"$PROFILE"` into a mktemp dir, then `sqlite3 -cmd 'select url from urls where id = (select url from visits where visit_time = (select max(visit_time) from visits));' $H .exit` on that copy.
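A minimal Python sketch of the copy-then-query idea, combined with the loop over `shot-scraper html` described upthread. Assumptions, stated loudly: the History file has Chrome's `urls` and `visits` tables as in the query above; the names `recent_urls` and `save_pages` and the numbered output filenames are invented for illustration; and `shot-scraper html URL` writes the page HTML to stdout, as in the earlier example.

```python
import shutil
import sqlite3
import subprocess
import tempfile
from pathlib import Path


def recent_urls(history_db: Path, limit: int = 20) -> list[str]:
    """Return the most recently visited URLs, newest first, from a copy of the DB."""
    with tempfile.TemporaryDirectory() as tmp:
        # Copy first: the browser keeps the live database locked.
        copy = Path(tmp) / "History"
        shutil.copy2(history_db, copy)
        conn = sqlite3.connect(copy)
        try:
            rows = conn.execute(
                """
                SELECT urls.url
                FROM visits JOIN urls ON urls.id = visits.url
                GROUP BY urls.id
                ORDER BY MAX(visits.visit_time) DESC
                LIMIT ?
                """,
                (limit,),
            ).fetchall()
        finally:
            conn.close()
    return [row[0] for row in rows]


def save_pages(urls, out_dir: Path, run=subprocess.run) -> list[Path]:
    """Run `shot-scraper html URL` for each URL, redirecting stdout to a file."""
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for index, url in enumerate(urls):
        target = out_dir / f"{index:04d}.html"  # hypothetical naming scheme
        with target.open("wb") as handle:
            # Equivalent to: shot-scraper html URL > target
            run(["shot-scraper", "html", url], stdout=handle, check=True)
        saved.append(target)
    return saved
```

Usage would be something like `save_pages(recent_urls(Path.home() / ".config/google-chrome/Default/History"), Path("archive"))`, where the `Default` profile directory is an assumption; the profile lookup via `Local State` described above is left out.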

jimmySixDOF

0 replies

17h36m

2024-03-25 00:49:45 UTC

look at ArchiveBox from the comments below

genewitch

0 replies

15h26m

2024-03-25 03:00:24 UTC

This used to be fairly simple to do before https everywhere, just install squid (or whatever) and cron the cache folder to a zip file once a day or whatever.

There's paid solutions that kinda do what you want, but they capture all text on your screen and OCR it to make it searchable, which at least lets you backtrack and has the added advantage that it will make pdfs, meme images, etc searchable, too. last i heard it was mac only but a few folks mentioned some windows software that does it too.

as an aside i don't consider reading/learning nearly all day to be a net negative, even if ADD is to blame. (i haven't had the "H" since i was a child.) A status report wouldn't "stare" me in the face; in fact, it would be nice to have some language model take the daily report and over time suggest other things to read or possible contradictions to link to.

Hendrikto

0 replies

6h48m

2024-03-25 11:38:28 UTC

> Heck - we need our machines to write us a daily status report for what we did at the end of each day.

I am sure Trump, Xi, Putin, etc. would like that very much.

jaimex2

4 replies

15h10m

2024-03-25 03:16:19 UTC

Does shot-scraper have a workaround for sites that detect headless Chrome? i.e. news.com.au, nowsecure.nl

simonw

3 replies

14h3m

2024-03-25 04:23:06 UTC

No, nothing like that. I wonder how that detection works?

I tried this and it took a shot of a "bot detected" screen:

    shot-scraper https://news.com.au/

But when I used interactive mode I could take the screenshot - run this:

    shot-scraper -i https://news.com.au/

It opens a Chrome window. Then hit "enter" in the CLI tool to take the screenshot.

simonw

2 replies

13h47m

2024-03-25 04:39:09 UTC

Got this to work!

    shot-scraper https://news.com.au/ \
      --init-script 'delete Object.getPrototypeOf(navigator).webdriver' \
      --user-agent 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:124.0) Gecko/20100101 Firefox/124.0'

Code and screenshots and prototype --init-script feature here: https://github.com/simonw/shot-scraper/issues/147#issuecomme...

nojs

0 replies

12h59m

2024-03-25 05:26:51 UTC

If you want to go down the nowsecure.nl rabbithole (often used as a benchmark for passing bot detection) [1] is a good resource. It includes a few fixes that undetected-chromedriver doesn’t.

1. https://seleniumbase.io/help_docs/uc_mode/#uc-mode

bonestamp2

0 replies

1h58m

2024-03-25 16:27:46 UTC

Nice! I was going to say that using experimental headless mode with the "User-Agent Switcher for Chrome" plugin might work too.

mkl

2 replies

9h23m

2024-03-25 09:03:24 UTC

Can shot-scraper load a bunch of content on an "infinite scroll" page before saving? I'm guessing Monolith can't as it has no JS. The most effective way I've found to work through the history of a big YouTube channel is to hold page-down for a while then save to a static "Web Page, Complete" HTML file, but it's a bit clunky.

simonw

0 replies

4h11m

2024-03-25 14:14:47 UTC

shot-scraper has a feature that's meant to help with this: you can inject additional JavaScript into the page that can then act before the screenshot is taken.

I use that for things like accepting cookie banners, but using it to scroll down to trigger additional loading should work too.

There's also a --wait-for option which takes a JavaScript expression and polls until it's true before taking the shot - useful for if there's custom loading behavior you need to wait for.

Documentation here: https://shot-scraper.datasette.io/en/stable/screenshots.html

seanwilson

0 replies

4h52m

2024-03-25 13:33:58 UTC

I found other problems in the area when trying to do this, e.g. a lot of landing pages have hidden content that only animates in when you scroll down, subscribe/cookie overlays/modals covering content, hero headers that take the height of the viewport ("height: 100vh") so if you make the page height large for taking a screenshot the header will cover all of it, and also sticky headers get in the way if you want to try scrolling while taking multiple screenshots that are combined at the end.

You can come up with workarounds for each, but it's still hacky and there's always going to be other pages that need special treatment.

wodenokoto

0 replies

13h44m

2024-03-25 04:42:04 UTC

Can Firefox do the same?

simpaticoder

0 replies

15h54m

2024-03-25 02:32:22 UTC

Yes, it's a neat thing. I use a node script[1] that wraps the Chrome invocation to do CLI-driven acceptance testing; it loads the site's acceptance tests[2]. I adopted the simple convention of removing all body elements on successful completion and checking the output string, but I've also considered other methods like embedding a JSON string and parsing it back out.

1 - https://simpatico.io/acceptance.js

2 - https://simpatico.io/acceptance

bonestamp2

0 replies

2h3m

2024-03-25 16:22:32 UTC

I guess while we're talking about useful CLI options for chrome, developers and hackers might enjoy this one... You can disable CORS in Chrome if you launch it from the command line with this switch: --disable-web-security

That's handy for when you're developing a front end and IT/devops hasn't approved/enabled the CORS settings on the backend yet, or if you're just hacking around and want to get data from somewhere that doesn't allow cross domain requests.

aidenn0

0 replies

1h43m

2024-03-25 16:43:11 UTC

I wonder if there's an option to wait for a certain amount of time, or a particular event or something. I was trying to capture a page a few different ways, and most of them ended up with the Cloudflare "checking your browser" page.

Exuma

0 replies

14h41m

2024-03-25 03:45:22 UTC

Mmmm…. This is clever

al_borland

18 replies

19h21m

2024-03-24 23:04:40 UTC

I use read-it-later type services a lot, and save more than I read. On many occasions I've gone back to finally read things and find that the pages no longer exist. I'm thinking moving to some kind of offline archival version would be a better option.

Martinussen

5 replies

15h58m

2024-03-25 02:28:07 UTC

https://omnivore.app basically entirely filled that void in my life. 100% recommend.

jcul

2 replies

11h7m

2024-03-25 07:18:53 UTC

How does it compare to pocket?

Martinussen

1 replies

7h19m

2024-03-25 11:07:08 UTC

Can't say I've used pocket, but I think the newsletter-saving (generated email addresses), open source/selfhostability, and api were differentiators that made me actually start using Omnivore - I wouldn't trust closed source and with premium options for something like this.

jcul

0 replies

2h40m

2024-03-25 15:46:05 UTC

Yeah I do like the idea of hosting it myself.

If there was a KOReader integration it would be amazing.

But if it's self-hosted, then that integration could simply be an SFTP / SSH server that accesses the files.

avinassh

1 replies

12h39m

2024-03-25 05:46:40 UTC

does it archive / save web pages? I am using Omnivore too and I did not find this option.

Martinussen

0 replies

7h25m

2024-03-25 11:00:53 UTC

I believe it saves a reader-mode version of whatever you feed it, by default? I also pull a copy into my Obsidian vault using the plugin/api, but it's easy to implement with the api if you don't want Obsidian too. Makes it very easy to refer to articles from notes later! (or just rip out everything except the part I cared about.)

I've saved shopping carts and logged-in pages regularly, so the markdown reader version in the apps should definitely be independent of the article/page itself being up.

nelsonfigueroa

3 replies

18h25m

2024-03-25 00:01:28 UTC

I've used ArchiveBox in the past and it's been great for this purpose: https://github.com/ArchiveBox/ArchiveBox

hu3

1 replies

6h28m

2024-03-25 11:57:38 UTC

Hi! This seems amazing and sustainable since it leverages industry standard tools such as yt-dl and chrome headless.

Now I'm curious, what made you stop using it?

nelsonfigueroa

0 replies

5h12m

2024-03-25 13:13:43 UTC

I just found myself archiving fewer things over time and it’s been a while since I’ve saved anything. There’s nothing wrong with it though. In fact, I still have it on my machine.

saganus

0 replies

18h3m

2024-03-25 00:22:54 UTC

This is great, thanks!

mateo1

3 replies

18h0m

2024-03-25 00:25:56 UTC

I used to have almost 10k bookmarks that I was keeping from circa 2010 to 2017. Only to realize the majority of them were now useless. Some kind of tool like this is way overdue to become commonplace.

samstave

2 replies

17h44m

2024-03-25 00:42:13 UTC

(They missed a chance to have a link to a download of the mtnl file of the github page haha)

Archive.org and wayback machine should ask for people to submit snaps of pages using this tool directly into the archive - especially during world events.

This would allow digital archaeologists to grok the sentiment of the world during that era...

(aside: when I interviewed at twitter they asked me what I thought twitter was, and I said I thought it was a global sentiment engine...)

But kudos to the world for having us now in the AI birth onto the global internet, as a wayback machine, coupled with AIs and LLMs and this tool - will allow one to ask questions about history in ways that will be very interesting.

"What was the general media coverage of [topic] in [decade] with respect to how we currently look at it - and are they articles covering [SUBJECT] in this topic for that time period.

etc...

toomuchtodo

1 replies

16h36m

2024-03-25 01:49:58 UTC

https://github.com/palewire/savepagenow

https://github.com/jjjake/internetarchive

The Internet Archive cannot trust arbitrary content previously archived, so it is more optimal to have whatever archival tools or operations you’re performing to make a request to Wayback to take a snapshot at the same time.

If you’re bookmarking something, archive it too!
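A small sketch of "archive it when you bookmark it", in Python with only the standard library. It builds a request against the Wayback Machine's public `https://web.archive.org/save/<url>` endpoint rather than using the libraries linked above; the function name and the User-Agent string are made up for illustration, and rate limits or authentication requirements are not handled here.

```python
from urllib.request import Request, urlopen

WAYBACK_SAVE = "https://web.archive.org/save/"


def wayback_save_request(url: str) -> Request:
    """Build (but do not send) a Save Page Now request for `url`."""
    return Request(
        WAYBACK_SAVE + url,
        headers={"User-Agent": "bookmark-archiver-sketch"},  # hypothetical UA string
    )


def archive(url: str, timeout: float = 60.0) -> int:
    """Send the request and return the HTTP status code."""
    with urlopen(wayback_save_request(url), timeout=timeout) as response:
        return response.status
```

Calling `archive("https://example.com/some-article")` from whatever script saves the bookmark would ask Wayback to take its own snapshot at the same time.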

samstave

0 replies

14h25m

2024-03-25 04:01:07 UTC

Yes, that's a better version of what I meant...

arp242

2 replies

18h6m

2024-03-25 00:19:59 UTC

I have a lot of old unsorted bookmarks of "I want to look in to this, but don't have time now". Newer stuff is more organized, but I exported the old stuff and haven't looked at them in about five years.

Last week I started organizing them a bit, and it's shocking how much is a 404. Even from major newspapers and such. I have no idea why anyone would take down old content (outside of some specific and rare reasons). Some are also on neither the Internet Archive nor archive.today.

al_borland

1 replies

16h2m

2024-03-25 02:23:40 UTC

I assume when it happens at big sites it’s from a major site redesign that doesn’t care to keep backward compatibility with old links.

genewitch

0 replies

15h19m

2024-03-25 03:07:05 UTC

How many programmer-hours are required to have a separate page that translates between URI schemes?

Your comment, to me, implies that the 404 links' content still exists but is not at a canonical URI anymore. I'm assuming converting stuff like /2018/08/foo.html to /newscheme/fetch?foo or whatever isn't that difficult? This whole thing is one of the reasons i haven't ever set up a blog or even a website that has dynamic content, because i can't be assed to decide on a URI scheme that will "just work" with any future engine.

Someone has to have written converters, right? I know you can import some blogs to wordpress (and vice versa, export WP to other engines...)
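For what it's worth, the translation itself can be tiny. A sketch in Python, using the made-up paths from above (`/2018/08/foo.html` to `/newscheme/fetch?foo`); the pattern and the `translate` name are purely illustrative, and a real site would wire this into its server as a permanent redirect:

```python
import re
from typing import Optional

# Old scheme: /YYYY/MM/slug.html (hypothetical, taken from the example above)
OLD_ARTICLE = re.compile(r"^/(\d{4})/(\d{2})/([\w-]+)\.html$")


def translate(path: str) -> Optional[str]:
    """Map an old-style article path to the new scheme, or None if it doesn't match."""
    match = OLD_ARTICLE.match(path)
    if match is None:
        return None
    slug = match.group(3)
    return f"/newscheme/fetch?{slug}"
```

The hard part is usually not this function but knowing every old scheme the site ever used and keeping the mapping alive across redesigns.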

amcpu

0 replies

18h43m

2024-03-24 23:42:53 UTC

I use a locally hosted YaCy instance with cached results to work around this scenario. Much of the content I am interested in is kept locally, so it’s good enough. When I have a bunch of “read later” tabs that pile up, I copy all their URLs into the crawler form with “Store to Web Cache” checked and it accomplishes what I described. Just another option to consider.

andai

13 replies

17h48m

2024-03-25 00:37:49 UTC

I always ship single file pages whenever possible. My original reasoning for this was that you should be able to press view source and see everything. (It follows that pages should be reasonably small and readable.)

An unexpected side effect is that they are self contained. You can download pages, drag them onto a browser to use them offline, or reupload them.

I used to author the whole HTML file at once, but lately I am fond of TypeScript, and made a simple build system to let me write games in TS and have them built to one HTML file. (The sprites are base64 encoded.)

On that note, it seems (there is a proposal) that browsers will eventually get support for TypeScript syntax, at which point I won't need a compiler / build step anymore. (Sadly they won't do type checking, but hey... baby steps!)
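For anyone curious what the base64 sprite step of such a build amounts to, here is a minimal Python sketch. It is not my actual build system (that one is TypeScript-oriented); the `data_url` and `img_tag` names are invented for illustration.

```python
import base64
from html import escape


def data_url(payload: bytes, mime: str = "image/png") -> str:
    """Encode raw bytes as a data: URL that can be embedded directly in HTML."""
    encoded = base64.b64encode(payload).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def img_tag(payload: bytes, alt: str, mime: str = "image/png") -> str:
    """Build an <img> element whose src is the embedded data URL."""
    return f'<img alt="{escape(alt)}" src="{data_url(payload, mime)}">'
```

Base64 grows the payload by roughly a third, which is part of why the pages need to stay small for this to be practical.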

slmjkdbtl

3 replies

16h51m

2024-03-25 01:35:14 UTC

I used to only do single-file HTML pages too, until I had a page with multiple occurrences of the same image: it's wasteful to repeat the data URL string every time that img occurs. Maybe I could save the data URL string in JS once and assign it to those imgs in JS, but most of the time my pages don't have any JS, so it feels bad to use JS just for this.

csande17

2 replies

15h38m

2024-03-25 02:48:00 UTC

You could use the <use> element in an inline SVG to duplicate the same bitmap <image> to multiple parts of the page.

slmjkdbtl

1 replies

8h0m

2024-03-25 10:25:54 UTC

Interesting, I'll try that later. But I guess that won't have some useful attributes on <img> like "alt"?

oneeyedpigeon

0 replies

6h51m

2024-03-25 11:35:15 UTC

I found [this suggestion](https://stackoverflow.com/questions/4697100/accessibility-re...), although it's pretty old. Maybe screen readers have made improvements to this scenario in the last few years?

    <svg role="img" aria-label="[title + description]">
        <title>[title]</title>
        <desc>[long description]</desc>
        ...
    </svg>

hasty_pudding

2 replies

16h0m

2024-03-25 02:25:43 UTC

how do you share CSS between pages?

oneeyedpigeon

1 replies

6h49m

2024-03-25 11:36:46 UTC

I'm guessing they use a build process to embed it inline in a <style> element.

andai

0 replies

4h17m

2024-03-25 14:09:07 UTC

Yeah, for the CSS I don't even have a handler because it just goes right in the head.

I don't usually have more than a few lines.

Perhaps my use case is unusual though, I work on simple web apps, games, interactive simulations. I'm about to get into writing, and I expect a small amount of CSS will be sufficient for that too. Though that would probably expand over time, heh. You want to add a quote, and then a floating image...

_flux

2 replies

9h14m

2024-03-25 09:11:34 UTC

> On that note, it seems (there is a proposal) that browsers will eventually get support for TypeScript syntax, at which point I won't need a compiler / build step anymore. (Sadly they won't do type checking, but hey... baby steps!)

Careful what you wish for. Assuming the browser is able to run original TS without processing and you want type checking, then that also seems to effectively lock the typechecking abilities of TS to their current level. Even without type checking it would already hinder the ability to add new syntax or standard types to TS.

Given TS is made for providing expressive typing over JS instead of constructively coming up with a type system with a language, there's still a lot of ground to cover, as can be seen by the improvements made in every TS release.

Sammi

1 replies

7h35m

2024-03-25 10:50:59 UTC

No, the proposal is _not_ to include TypeScript type checking in the browser. The proposal is to make the browser understand where the TS types are so it can _ignore_ them. So that you can write TS and have it be type checked on your machine, and then ship the TS and have it run on client machines _without_ any type checking running there. The browser will run the TS as though it was JS.

So the types will actually be able to be anything. It can be a completely different type checking superset language than Typescript even! Nothing will be locked at the current level.

It's a frikkin magical proposal.

https://github.com/tc39/proposal-type-annotations

_flux

0 replies

5h48m

2024-03-25 12:38:22 UTC

Here we can see concretely the benefits of TypeScript taking the strict stance of not generating any code, so basically stripping types will work.

Though, weren't there some exceptions to that?

It seems, though, that the syntactical structures that are chosen to be ignored need to be listed in the proposal, making the support in browsers non-trivial and still hindering future extensions of TS and similar languages, because all future constructs would need to be supersets of this proposal, or of whatever version is practically supported by current browsers. If a language brings up a new construct, all the users of that construct need to revert back from shipping their source code as is, increasing the cost of introducing such things in the future.

Personally I don't see great benefits in having straight up TS work as-is in the browsers, as you still need to run a type checking phase locally, but I do see that some would like to see that happen and that it would simplify some release processes.

It would not simplify the release process of folks that want to minify and obfuscate their sources, but it's probably fine to make that comparatively even harder ;).

toastedwedge

0 replies

17h31m

2024-03-25 00:55:06 UTC

If I may ask, where can I read more about this? I wouldn't really know where to look for something like that, I'm afraid.

Edit: wording.

teaearlgraycold

0 replies

17h23m

2024-03-25 01:03:26 UTC

The proposal: https://github.com/tc39/proposal-type-annotations

noduerme

0 replies

16h47m

2024-03-25 01:38:51 UTC

Doesn't this make the dead screen time rather long if you have to load all the game assets before you can even display a loader? (I guess you don't even have or need progress bars?)

max_

6 replies

9h24m

2024-03-25 09:01:54 UTC

It still blows my mind that browsers don't provide features like this out of the box.

Alifatisk

2 replies

9h16m

2024-03-25 09:10:01 UTC

I think they do? Have you tried hitting cmd+s or ctrl+s? You can save webpages like that.

But I don’t know if they can compress everything into a single html file though.

vanderZwan

0 replies

9h8m

2024-03-25 09:17:41 UTC

Last time I tried that it saved a static version of the current DOM, instead of the page source. I'm assuming that the reasoning behind that is that most people want to save a snapshot of what they are currently seeing, and that this is the easiest way to have somewhat reliable results for that.

max_

0 replies

9h11m

2024-03-25 09:14:38 UTC

A lot of the CSS & JavaScript is usually broken with ctrl+s.

A great option used to be the mhtml format in chrome. (It had to be enabled in chrome flags)

But mhtml seems to have been removed from chrome recently.

snshn

0 replies

7h40m

2024-03-25 10:45:29 UTC

So true. Monolith is using libraries made by Mozilla for their Rust-driven browser engine (which, I believe, never came to be). I really would love for it to be a part of some browser one day, the demand is clearly there. Nobody likes to have a file+folder abomination on their drive, or some shady formats like .webarchive

hu3

0 replies

5h52m

2024-03-25 12:34:01 UTC

Chrome does support it:

https://i.imgur.com/HF7GXEI.png

Gormo

0 replies

6h39m

2024-03-25 11:46:43 UTC

The MHTML format [1] has been around for 25 years and was natively supported by multiple browsers for decades. Modern browsers have regressed in functionality.

[1]: https://en.wikipedia.org/wiki/MHTML
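For anyone who hasn't looked inside one: MHTML is a MIME `multipart/related` message in which each part carries a `Content-Location` header naming the URL it came from. A minimal sketch using Python's standard `email` package, assuming a made-up page and stand-in image bytes; it does not verify that any particular browser will open the result.

```python
from email import encoders
from email.mime.base import MIMEBase
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

# Hypothetical page and asset, for illustration only.
html = '<html><body><img src="https://example.com/logo.png"></body></html>'
image_bytes = b"\x89PNG\r\n\x1a\n"  # stand-in bytes, not a real image

archive = MIMEMultipart("related")
archive["Subject"] = "Example page"

# First part: the HTML document itself.
page = MIMEText(html, "html", "utf-8")
page["Content-Location"] = "https://example.com/"
archive.attach(page)

# Second part: an asset, keyed by its original URL.
logo = MIMEBase("image", "png")
logo.set_payload(image_bytes)
encoders.encode_base64(logo)
logo["Content-Location"] = "https://example.com/logo.png"
archive.attach(logo)

mhtml_bytes = archive.as_bytes()
print(archive.get_content_type())  # multipart/related
```

Note that the HTML keeps its original `src` URL; the reader resolves it against the `Content-Location` of the parts rather than rewriting the markup.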

lagt_t

4 replies

20h15m

2024-03-24 22:10:35 UTC

I remember IE5 was able to do this lol. It fell out of vogue for some reason, glad to see the concept is still alive.

berkes

2 replies

20h8m

2024-03-24 22:17:56 UTC

Firefox can still do it.

thrdbndndn

0 replies

17h17m

2024-03-25 01:08:33 UTC

Chrome can too

Hamuko

0 replies

11h44m

2024-03-25 06:41:50 UTC

Can it? I'm only having Firefox save a bunch of files.

phrz

0 replies

19h38m

2024-03-24 22:48:21 UTC

Safari does this with .webarchive files

joeyhage

2 replies

19h47m

2024-03-24 22:39:03 UTC

It would be awesome to see support for following links to a specified depth, similar to [Httrack](https://www.httrack.com/)

gildas

0 replies

17h52m

2024-03-25 00:33:43 UTC

You can have a look at the last 2 examples here [1].

[1] https://github.com/gildas-lormeau/single-file-cli?tab=readme...

codetrotter

0 replies

19h32m

2024-03-24 22:53:57 UTC

I made a basic crawler using Firefox, thirtyfour https://docs.rs/thirtyfour/latest/thirtyfour/ and squid

Basically, I took a start URL for the crawl, and my program would load the page in Firefox using thirtyfour, and then extract all links from the page and use some basic rules for keeping track of which ones to visit and in which order. I had Squid proxy configured to save all traffic that passed through it.

It worked ok-ish. I only really stopped that project because of a hardware malfunction.

The main annoyance that I didn’t get around to solving was being smarter about not trying to load non-html content that was already loaded anyway as part of the page. Because of the way I extracted links from the page, I also extracted URLs of JS, CSS etc. that were referenced.
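One way to deal with that annoyance, sketched in Python with only the standard library rather than the original Rust/thirtyfour setup: collect `<a href>` links as crawl candidates and keep `<script>`, `<img>` and `<link>` URLs in a separate list, since the browser already fetched those with the page. The class name `LinkCollector` and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkCollector(HTMLParser):
    """Collect page links (<a href>) separately from asset URLs."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.pages = []   # candidates for the crawl queue
        self.assets = []  # already fetched by the browser with the page

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            # Resolve relative links and drop the #fragment.
            url, _ = urldefrag(urljoin(self.base_url, attrs["href"]))
            self.pages.append(url)
        elif tag in ("script", "img") and attrs.get("src"):
            self.assets.append(urljoin(self.base_url, attrs["src"]))
        elif tag == "link" and attrs.get("href"):
            self.assets.append(urljoin(self.base_url, attrs["href"]))


html = """
<link rel="stylesheet" href="/style.css">
<script src="app.js"></script>
<a href="/about#team">About</a>
<a href="post/1">Post</a>
"""

collector = LinkCollector("https://example.com/blog/")
collector.feed(html)
print(collector.pages)
print(collector.assets)
```

Only `collector.pages` would go into the visit queue. A real crawler would also need a visited set and same-site rules, which are left out here.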

jchook

2 replies

15h35m

2024-03-25 02:51:06 UTC

Hm, very interesting, especially for bookmarking/archiving.

I'm curious, why not use the MHTML standard for this?

- AFAIK data URIs have practical length limits that vary per browser. MHTML would enable bundling larger files such as video.

- MHTML would avoid transforming meaningful relative URLs into opaque data URIs in the HTML attributes.

- MHTML is supported by most major browsers in some way (either natively in Chrome or with an extension in Safari, etc).

- MIME defines a standard for putting pure binary data into document parts, so it could avoid the 33% size inflation from base64 encoding. That said, I do not know if the `binary` Content-Transfer-Encoding is widely supported.
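The 33% figure in the last point falls straight out of how base64 works: every 3 input bytes become 4 output characters. A quick Python check, using arbitrary stand-in bytes rather than a real asset:

```python
import base64

payload = bytes(range(256)) * 12  # 3072 bytes of stand-in "binary asset" data
encoded = base64.b64encode(payload)

# base64 maps every 3 input bytes to 4 output characters.
print(len(payload), len(encoded))               # 3072 4096
print(round(len(encoded) / len(payload), 2))    # 1.33

# A data URI is the encoded payload plus a short media-type prefix.
data_uri = "data:application/octet-stream;base64," + encoded.decode("ascii")
print(len(data_uri))
```

So a 30 MB video embedded as a data URI costs roughly 40 MB in the saved HTML, before any compression of the file itself.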

snshn

0 replies

7h46m

2024-03-25 10:39:52 UTC

MHTML support is planned. There are a couple of other problems that need to be resolved first, but it's a good format for archiving and has been requested many times.

Hamuko

0 replies

9h34m

2024-03-25 08:51:29 UTC

MHTML is supported by most major browsers in some way

Firefox? What about mobile versions of browsers?

andai

2 replies

17h42m

2024-03-25 00:43:43 UTC

Does anyone know how an entire website can be restored from Wayback Machine? A beloved website of mine had its database deleted. Everything's on Internet Archive, but I think I'd have to

(1) scrape it manually (they don't seem to let you download an entire site?),

(2) write some python magic to fix the css URLs etc so the site can be reuploaded (and maybe add .html to the URLs? Or just make everything a folder with index.html...)

It seems like a fairly common use case but I barely found functional scrapers, let alone anything designed to restore the original content in a useful form.
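For step (2), a minimal Python sketch of the URL fix-up, assuming archived URLs have the shape `https://web.archive.org/web/<timestamp><optional modifier>/<original URL>` (with modifiers such as `cs_` or `im_`). The function name `strip_wayback` and the sample HTML are hypothetical, and real snapshots also contain injected toolbar markup that this does not touch.

```python
import re

# Assumed shape of an archived URL:
#   https://web.archive.org/web/<timestamp><optional modifier>/<original URL>
# The host part is optional because snapshots also use root-relative /web/... links.
WAYBACK_PREFIX = re.compile(
    r"(?:https?://web\.archive\.org)?/web/\d{4,14}(?:[a-z]{2}_)?/(?=https?://)"
)


def strip_wayback(html: str) -> str:
    """Remove Wayback Machine prefixes so URLs point at the original site again."""
    return WAYBACK_PREFIX.sub("", html)


snippet = (
    '<link rel="stylesheet" '
    'href="https://web.archive.org/web/20200101000000cs_/http://example.com/style.css">'
    '<a href="/web/20200101000000/http://example.com/about">About</a>'
)
print(strip_wayback(snippet))
```

This leaves absolute URLs to the original domain. Turning those into relative paths, or adding `.html`, would be a separate pass.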

gildas

0 replies

17h37m

2024-03-25 00:48:48 UTC

It's documented here: https://wiki.archiveteam.org/index.php?title=Restoring

belthesar

0 replies

17h38m

2024-03-25 00:48:28 UTC

I bet the ArchiveTeam might be able to help you out with this. They were quite helpful when I wanted to make sure a site was preserved, and might be able to help you as well, or at least point you in the right direction. https://wiki.archiveteam.org/

AdieuToLogic

2 replies

16h39m

2024-03-25 01:46:38 UTC

Or perhaps wget[0] as described here[1] and documented here[2] could do the trick.

0 - https://www.gnu.org/software/wget/

1 - https://tinkerlog.dev/journal/downloading-a-webpage-and-all-...

2 - https://www.gnu.org/software/wget/manual/wget.html

mattsan

1 replies

16h36m

2024-03-25 01:50:10 UTC

This is addressed in the README and a comparison is given

AdieuToLogic

0 replies

15h59m

2024-03-25 02:27:26 UTC

This is addressed in the README and a comparison is given

The only mention of wget in the README reads thusly:

  If compared to saving websites with wget -mpk, this tool
  embeds all assets as data URLs and therefore lets browsers
  render the saved page exactly the way it was on the
  Internet, even when no network connection is available.

This is not the only way to invoke wget in order to download a web page along with its assets. Should the introduction article I referenced above be deemed insufficient, consider this[0] as well.

0 - https://simpleit.rocks/linux/how-to-download-a-website-with-...

russellbeattie

1 replies

18h3m

2024-03-25 00:22:34 UTC

If anyone is interested, I wrote a long blog post where I analyzed all the various ways of saving HTML pages into a single file, starting back in the 90s. It'll answer a lot of questions asked in this thread (MHTML, SingleFile, web archive, etc.)

https://www.russellbeattie.com/notes/posts/the-decades-long-...

rnewme

0 replies

15h13m

2024-03-25 03:12:56 UTC

Cool post. You should make an HN entry.

keyle

1 replies

8h12m

2024-03-25 10:14:00 UTC

I am really loving these 'new' pure rust tools that are super fast and efficient, with lovely API/doco. Ah, it feels like the 90s again... Minus 50% bugs probably.

snshn

0 replies

4h45m

2024-03-25 13:40:55 UTC

Hey, at least no memory leaks this time! Ü

k1ck4ss

1 replies

8h26m

2024-03-25 09:59:56 UTC

How would I archive an on-prem hosted redmine solution (https://www.redmine.org/)? It is many, many years old and I want to abandon it for good but save everything and archive it. Is that possible with monolith?

planb

0 replies

8h12m

2024-03-25 10:13:50 UTC

You're probably better off with a recursive wget here. IIRC redmine was not really javascript heavy and monolith looks to me like it only saves one page.

farzadmf

1 replies

6h6m

2024-03-25 12:20:06 UTC

Ironically, I decided to try with the repo's own Github page, and when I open the resulting HTML file in Chrome, it's all errors in the console, and I don't see the `README` or anything

yencabulator

0 replies

4h34m

2024-03-25 13:52:23 UTC

Github is a pile of Javascript that adds things to the DOM browser-side. The monolith README specifically says it does not run Javascript, and shows you a workaround for when that matters.

ethanpil

1 replies

19h21m

2024-03-24 23:05:06 UTC

Nice. My next step: Figure out how to make a web extension 1 click button. Tab to Monolith to Joplin with a tag.

gildas

0 replies

19h16m

2024-03-24 23:09:52 UTC

You could download SingleFile [1], configure a WebDAV server in the options page (cf. "Destination" section), and set up Joplin to synchronize with the server.

[1] https://github.com/gildas-lormeau/SingleFile

dosourcenotcode

1 replies

16h6m

2024-03-25 02:19:39 UTC

A cool tool to be sure.

However I feel this tool is a crutch for the stupid way browsers handle web pages and shouldn't be necessary in a sane world.

Instead of the bullshit browsers do where they save a page as "blah.html" file + "blah_files" folder they should instead wrap both in a folder that can then later be moved/copied as one unit and still benefit from its subcomponents being easily accessed / picked apart as desired.

genewitch

0 replies

15h13m

2024-03-25 03:12:39 UTC

"save as [single] html" or whatever hasn't worked reliably in over a decade. I wrote a snapshotter that i could post in a slack alternative "!screenshot <URL>" and it would respond (eventually) with an inline jpeg and a .png link of that URL. As i mentioned upthread, this worked for a couple of years (2017-2020 or so) and then it became unreliable on some sites as well. As an example, old.reddit.com hellthread pages would only render blank white after the first couple dozen comments.

I haven't had the heart to try it with singlefile, but now that there's at least 3 tools that claim to do this correctly, i might try again. This tool, singlefile (which i already use but haven't tested on reddit yet) and archivebox. 4 tools, if you count the WARC stuff from archive.org

causality0

1 replies

19h8m

2024-03-24 23:18:02 UTC

How's this better than the MHTML functionality built into my browser?

gildas

0 replies

18h56m

2024-03-24 23:29:37 UTC

You can find a comparison of file formats here: https://github.com/gildas-lormeau/SingleFile?tab=readme-ov-f...

victorbjorklund

0 replies

10h1m

2024-03-25 08:25:08 UTC

This is great. I have wished for something like this.

toomuchtodo

0 replies

20h15m

2024-03-24 22:11:00 UTC

Show HN: CLI tool for saving web pages as a single file - https://news.ycombinator.com/item?id=20774322 - August 2019 (209 comments)

sunshine202022

0 replies

15h1m

2024-03-25 03:24:39 UTC

fun

stringtoint

0 replies

18h54m

2024-03-24 23:31:52 UTC

Nice! Reminds me of the time I was working on a browser extension to do this.

publius_0xf3

0 replies

12h20m

2024-03-25 06:05:49 UTC

Awesome tool. A note to the devs: the latest version on winget is v2.7.0, which is several months behind the latest version.

pbnjeh

0 replies

25m

2024-03-25 18:01:03 UTC

Does anyone remember the Firefox extension Scrapbook, from "back in the day"? I used to use it a lot.

Look "back" 5 - 10 years, or more, and it's striking how many web resources are no longer available. A local copy is your only insurance. And even then, having it in an open, standards compliant format is important (e.g. a file you can load into a browser -- I guess either a current browser or a containerized/emulated one from the era of the archived resource).

Something that concerns me about JavaScript-ed resources and the like: potentially unlimited complexity, making local copies more challenging and perhaps untenable.

fs111

0 replies

2h13m

2024-03-25 16:13:21 UTC

https://en.wikipedia.org/wiki/WARC_(file_format)

fagrobot

0 replies

10h56m

2024-03-25 07:29:54 UTC

https://github.com/gildas-lormeau/SingleFile

dohello1

0 replies

19h20m

2024-03-24 23:05:42 UTC

and I thought my code pages were long haha

arp242

0 replies

19h40m

2024-03-24 22:45:53 UTC

I wrote something very similar a few years ago – https://github.com/arp242/singlepage

I mostly use it for a few Go programs where I generate HTML; I can "just" use links to external stylesheets and JavaScript because that's more convenient to work with, and then process it to produce a single HTML file.

AdmiralAsshat

0 replies

3h34m

2024-03-25 14:51:35 UTC

So what happens if the page is behind a paywall and the embedded Javascript stores some authentication or phone-home code? Does that end up getting invoked on the monolith copy HTML?

I'm wondering how this would work if I wanted to use it to, say, save a quiz from Udemy for offline review.