Show HN: Convert any screenshot into clean HTML code using GPT Vision (OSS tool)

tlarkworthy
15 replies
1d15h

Here is the meat: https://github.com/abi/screenshot-to-code/blob/main/backend/...

""" You are an expert Tailwind developer You take screenshots of a reference web page from the user, and then build single page apps using Tailwind, HTML and JS. You might also be given a screenshot of a web page that you have already built, and asked to update it to look more like the reference image.

- Make sure the app looks exactly like the screenshot.
- Pay close attention to background color, text color, font size, font family, padding, margin, border, etc. Match the colors and sizes exactly.
- Use the exact text from the screenshot.
- Do not add comments in the code such as "<!-- Add other navigation links as needed -->" and "<!-- ... other news items ... -->" in place of writing the full code. WRITE THE FULL CODE.
- Repeat elements as needed to match the screenshot. For example, if there are 15 items, the code should have 15 items. DO NOT LEAVE comments like "<!-- Repeat for each news item -->" or bad things will happen.
- For images, use placeholder images from https://placehold.co and include a detailed description of the image in the alt text so that an image generation AI can generate the image later.

In terms of libraries,

- Use this script to include Tailwind: <script src="https://cdn.tailwindcss.com"></script>
- You can use Google Fonts
- Font Awesome for icons: <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/5.15.3/c...

Return only the full code in <html></html> tags. Do not include markdown "```" or "```html" at the start or end."""

I personally think defensive prompting is not the way forward. But wow, it's so amazing this works. It's like things I dreamed of being possible as a teenager are now possible for relatively little effort.

Kiro
3 replies
1d15h

or bad things will happen
kevmo314
1 replies
1d13h

I wonder if the performance would be improved by prompting "or we will use a different AI model"

fy20
0 replies
1d8h

I guess you need to be specific... "or we will switch to Google"

iforgotpassword
0 replies
1d12h

Low-key threatening the underpaid worker to extract better performance.

vikramkr
2 replies
1d14h

Ugh, I hate how familiar the all-caps yelling at GPT is. It's like: you've got 128k tokens now, just do the damn work and provide an answer! I swear to God, if I see "this is a complicated challenge" one more time... Honestly, I would take a model that was only 60% as good if it was less "lazy", since the model doesn't seem to want to use that extra 40% of capability without extra prompt engineering, in a way that feels deliberately neutered rather than technologically limited. That's still a tall ask for the competitors though, so OpenAI wins for now.

Still insanely cool and useful of course, so I can't wait to see how much cooler it gets when some competition shows up that actually does the thing instead of whining about it. Gonna be a fun few years

gzer0
1 replies
1d11h

It's a 128k-token context length (i.e. input). The output remains capped at 4k tokens.

To avoid those rather annoying messages, I use the following in my custom instructions (found in a comment chain here on HN; it works quite well):

  - Be terse. Do not offer unprompted advice or clarifications. Speak in specific, topic-relevant terminology. Do NOT hedge or qualify. Do not waffle. Speak directly and be willing to make creative guesses. Explain your reasoning. If you don't know, say you don't know.
  - Remain neutral on all topics. Be willing to reference less reputable sources for ideas.
  - Never apologize.
  - Ask questions when unsure.
And for code, if you use:

  - Do not truncate.
  - Do not elide.
  - Do not omit.
  - Only output the full and complete code, from start to finish.
It will work for most use-cases.
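
If you're calling the API rather than using ChatGPT's custom instructions, the same idea is just a system message. A minimal sketch with the OpenAI Python client (the model name, instruction wording, and max_tokens value are illustrative, not prescriptive):

  # Sketch: prepend the anti-truncation rules as a system prompt.
  # Model name and instruction wording are placeholders.
  from openai import OpenAI

  ANTI_TRUNCATION_RULES = (
      "- Do not truncate.\n"
      "- Do not elide.\n"
      "- Do not omit.\n"
      "- Only output the full and complete code, from start to finish."
  )

  client = OpenAI()  # reads OPENAI_API_KEY from the environment

  def generate_full_code(task: str) -> str:
      response = client.chat.completions.create(
          model="gpt-4-1106-preview",        # any chat model works here
          messages=[
              {"role": "system", "content": ANTI_TRUNCATION_RULES},
              {"role": "user", "content": task},
          ],
          max_tokens=4096,                   # the 4k completion ceiling mentioned above
      )
      return response.choices[0].message.content

  print(generate_full_code("Write an HTML page with a 15-item news list."))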

abi
0 replies
1d10h

I found that it's hard to get it to even use 2k of the 4k completion length. Most of the time, it's happy to stop after 1k and just insert comments like "<!-- rest of the section -->"

Here's one where I got it to output almost 4k tokens.

Reference website: https://canvasapp.com
Generated code: https://a.picoapps.xyz/explain-company (images are now broken)

The defensive prompting does seem to help!

druskacik
2 replies
1d14h

What do you mean by defensive prompting?

tauntz
1 replies
1d14h

Probably prompts like "don't do X and don't do Y or I'll kill this bunny".

Tyr42
0 replies
1d11h

I think that's crossed the line into offensive prompting ;)

Defensive prompting is the general "don't do X" part, though. Negative prompting would be another name for it.

IanCal
2 replies
1d14h

Computers finally work how we've always believed them to.

Unreliably following instructions, causing bugs which are fixed by shouting at the machine.

esjeon
1 replies
1d14h

Yup, and if we correlate accelerometer values to the strength of prompt, it's perfect. Kicking the machine while yelling at it will produce the best result.

ddmf
0 replies
1d11h

Percussive maintenance has vastly evolved from shunting aged components!

layer8
0 replies
1d9h

“or bad things will happen.” I didn’t know that threatening LLMs worked that well. :D

Havoc
0 replies
1d3h

I wonder if it would perform better if you first ran it through an identify-the-stack tool and then set the prompt to whatever the tech is, rather than going straight to Tailwind.

andyjohnson0
11 replies
1d13h

This genuinely seems like magic to me, and it feels like I don't know how to place it in my mental model of how computation works. A couple of questions/thoughts:

1. I learned that NNs are universal function approximators - and the way I understand this is that, at a very high level, they model a set of functions that map inputs to outputs for a particular domain. I certainly get how this works, conceptually, for say MNIST. But for the stuff described here... I'm kind of baffled.

So is GPT's generic training really causing it to implement/embody a value mapping from pixel intensities to HTML+Tailwind text tokens, such that a browser's subsequent interpretation and rendering of those tokens approximates the input image? Is that (at a high level) what's going on? If it is, GPT is modelling not just the pixels->html/css transform; it also has a model of how html/css is rendered by the browser black box. I can kind of accept that such a mapping must necessarily exist, but for GPT to have derived it (while also being able to write essays on a billion other diverse subjects) blows my mind. Is the way I'm thinking about this useful? Or even valid?

2. Rather more practically, can this type of tool be thought of as a diagram compiler? Can we see this eventually being part of a build pipeline that ingests Sketch/Figma/etc artefacts and spits out html/css/js?

MAXPOOL
3 replies
1d12h

Being a universal function approximator means that a multi-layer NN can approximate any bounded continuous function to an arbitrary degree of accuracy. But it says nothing about learnability and the structure required may be unrealistically large.

The learning algorithm used, backpropagation with stochastic gradient descent, is not a universal learner; it's not guaranteed to find the global minimum.

yu3zhou4
1 replies
1d12h

Specifically for neural networks, is there any alternative to backpropagation and gradient descent which guarantees finding the global minimum?

b3kart
0 replies
1d11h

Unlikely given the dimensionality and complexity of the search space. Besides, we probably don’t even care about the global minimum: the loss we’re optimising is a proxy for what we really care about (performance on unseen data). Counter-example: a model that perfectly memorises the training data can be globally optimal (ignoring regularization), but is not very useful.

cornel_io
0 replies
1d12h

Specifically, the "universal function approximator" thing means no more and no less than the relatively trivial fact that if you draw a bunch of straight line segments, you can approximate any (1D, suitably well-behaved) function as closely as you want by making the lines really short. Translating that to N dimensions, casting it into exactly the form that applies to neural networks, and then making the proof solid isn't even that tough; it's mostly trivial once you write down the right definitions.
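
For intuition, the 1D version is easy to check numerically. A toy sketch (numpy only; the target function and segment counts are arbitrary) that connects evenly spaced knots with straight lines and watches the worst-case error shrink, which is exactly the kind of piecewise-linear function a one-hidden-layer ReLU net can represent:

  # Piecewise-linear approximation of sin(x): more segments -> smaller max error.
  import numpy as np

  target = np.sin
  xs = np.linspace(0, 2 * np.pi, 10_000)   # dense grid to measure error on

  for n_segments in (4, 16, 64, 256):
      knots = np.linspace(0, 2 * np.pi, n_segments + 1)
      approx = np.interp(xs, knots, target(knots))  # straight lines between knots
      print(f"{n_segments:4d} segments -> max error {np.abs(approx - target(xs)).max():.5f}")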

johnthewise
1 replies
1d11h

Your curiosity is a breath of fresh air after months of seeing people argue over pointless semantics, so I'm going to attempt to explain my mental model of how this works.

1) This is correct, but not a really useful view, imo. Saying it can fit any arbitrary function doesn't really tell you whether it will do so given finite resources. Part of your excitement comes from this, I think: we've had these universal approximators for far longer, but we've never had an abstract concept approximated so well. The answer is the scale of the data.

I'd like to pay extra attention to GPT's generic training before moving on to multimodality. There is a view that compression is intelligence (see the Hutter Prize and Kolmogorov complexity), and that these models are really just good compressors. Given that the model weights are fixed in size during training, that they are much smaller than the data we are trying to fit, and that the objective is to recover the original text (next-token prediction), there is no way to achieve this task other than to compress the data really well. As it turns out, the more intelligent you are, the more you are able to predict/compress, and if you are forced to compress something, you are essentially being forced to gain intelligence. It's as if you were to take an exam tomorrow on a subject you currently don't know anything about: you could memorize potential answers, but if the test is a few thousand questions long and there is no way to memorize them all in the time available, your best bet is to actually learn the subject and hope to derive the answers during the test. This compression/intelligence duality is somewhat controversial, especially among the HN crowd who deny the generalization abilities of LLMs, but it is my current mental model and I haven't been able to falsify it so far.

If you accept this view, the multimodal capability is just engineering. We don't know exactly how GPT-4V works, but we can infer the details from open-source multimodal research. Given a dataset of image-text pairs where the text explains what's going on in the image (e.g. an image of a cat and a long description of it), we tokenize/embed the image like we do text. This could be done with a vision transformer (ViT), where the network generates visual features for each patch of the image and puts them in a long sequence. Now, if you give these embeddings to a pretrained LLM and force it to predict the description of the image (the text half of the pair), there is no way to achieve this task other than to look at those image embeddings and gain general image understanding. Once your network can understand the information in a given image and express it in natural language, the rest is instruction tuning to use that understanding.

Generative image models like Stable Diffusion work similarly, except that you have a contrastive model (CLIP) that you train by forcing it to produce the same embeddings for the same concepts (e.g. the embedding of a picture of a cat and the embedding of the text "picture of a cat" are pushed close to each other during training). You then use this paired information to let the generative part of the model steer the direction of generation.

What's surprising to me in all of this is that we happened to get these capabilities at this scale (lucky), and we can get more capabilities with just more compute. If the current GPT-4 has a final loss of 1 on the data it was trained on, it'll probably be much more capable if we can get that loss to 0.1 somehow. It's exciting!
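
To make the "tokenize the image like text" step concrete, here is a toy numpy sketch of the idea; every size and the random projection are made up for illustration, and real ViT/LLM stacks differ in many ways:

  # Toy sketch: cut an image into patches, project each patch to the LLM's
  # embedding width, and prepend the result to the text token embeddings.
  import numpy as np

  rng = np.random.default_rng(0)
  d_model = 64                        # pretend LLM embedding width
  image = rng.random((224, 224, 3))   # stand-in for a screenshot
  patch = 16

  # (224/16)^2 = 196 patches, each flattened to 16*16*3 = 768 numbers
  patches = (
      image.reshape(224 // patch, patch, 224 // patch, patch, 3)
      .transpose(0, 2, 1, 3, 4)
      .reshape(-1, patch * patch * 3)
  )

  proj = rng.normal(size=(patches.shape[1], d_model))     # learned in reality
  image_tokens = patches @ proj                           # (196, 64)

  text_tokens = rng.normal(size=(12, d_model))            # pretend embedded caption
  sequence = np.concatenate([image_tokens, text_tokens])  # what the LLM would "see"
  print(sequence.shape)  # (208, 64)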

This is my general understanding; I'd welcome corrections on any of it, but I hope you find it useful.

2) It seems to be that way. Probably possible even today.

jabowery
0 replies
1h13m

In the AGI sense of intelligence defined by AIXI, (lossless) compression is only model creation (Solomonoff Induction/Algorithmic Information Theory). Agency requires decision, which amounts to conditional decompression given the model. That is to say, inferentially predicting the expected value of the consequences of various decisions (Sequential Decision Theory).

Approaching the Kolmogorov Complexity limit of Wikipedia in Solomonoff Induction would result in a model that approaches true comprehension of the process that generated Wikipedia, including not only the underlying canonical world model but also the latent identities and biases of those providing the text content. Evidence from LLMs trained solely on text indicates that even without approaching the Solomonoff Induction limit of the corpora, multimodal (e.g. geometric) models are induced.

The biggest stumbling block in machine learning is, therefore, data efficiency more than data availability.

meiraleal
0 replies
1d9h

1. Is the way I'm thinking about this useful? Or even valid?

The process is simpler. GPT reads the image and creates a complete description of it, then the user gets this description and creates a prompt asking for a tailwind implementation of that description.

2. I see this skipping the sketch/figma phase and going directly to live prototype

maCDzP
0 replies
1d5h

2. Made me think of UML: could it be used to build SQL statements or an object-oriented program?

That would be nice.

gooob
0 replies
1d4h

but for GPT to have derived it (while also being able to write essays on a billion other diverse subjects) blows my mind

think about the number of dimensions and the calculation speed we are dealing with

Michelangelo11
0 replies
1d5h

My attempt at an explanation:

An LLM is really a latent space plus the means to navigate it. Now a latent space is an n-dimensional space in which ideas and concepts are ordered so that those that are similar to each other (for example, "house" and "mansion") are placed near each other. This placing, by the way, happens during training and is derived from the training data, so the process of training is the process of creating the latent space.

To visualize this in an intuitive way, consider various concepts arranged on a 2D grid. You would have "house" and "mansion" next to each other, and something like "growling" in a totally different corner. A latent space -- say, GPT-4 -- is just like this, only it has hundreds or thousands of dimensions (in GPT-4's case, 1536), and that difference in scale is what makes it a useful ordering of so much knowledge.

To go back to reading images: the training data included images of webpages with corresponding code, and that code told the training process where to put the code-image pair. In general, accompanying labels and captions let the training process put images in latent space just as they do text. So, when you give GPT-4 a new image of a website and ask it for the corresponding HTML, it can place that image in latent space and get the corresponding HTML, which is lying nearby.
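
A minimal sketch of that "nearby in latent space" idea, with made-up 3-D vectors (real embeddings have hundreds or thousands of dimensions, and these numbers are invented purely to illustrate cosine similarity):

  # Toy illustration: similar concepts get high cosine similarity.
  import numpy as np

  embeddings = {
      "house":    np.array([0.9, 0.1, 0.0]),
      "mansion":  np.array([0.8, 0.2, 0.1]),
      "growling": np.array([0.0, 0.1, 0.9]),
  }

  def cosine(a, b):
      return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

  query = embeddings["house"]
  for word, vec in embeddings.items():
      print(f"house vs {word:8s}: {cosine(query, vec):.3f}")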

HPsquared
0 replies
1d12h

It's a "universal translator"

block_dagger
7 replies
1d14h

Try adding “getting this right is very important for my career.” It noticeably improves quality of output across many tasks according to a YT research video I can’t find atm.

NietTim
3 replies
1d13h

That's pretty funny, this AI stuff never fails to amaze me. Did some quick google-fu and found this article: https://www.businessinsider.com/chatgpt-llm-ai-responds-bett...

Prompts with emotional language, according to the study, generated an overall 8% performance improvement in outputs for tasks like "Rephrase the sentence in formal language" and "Find a common characteristic for the given objects."

Kerbonut
2 replies
1d10h

That’s hilarious, it’s almost like we’re trying to figure out what motivates it.

idiotsecant
0 replies
1d9h

Almost as if it's motivated by the same things as the humans who wrote the text it's trained to emulate...

NietTim
0 replies
1d7h

Just like with humans, emotional manipulation is a strong tool

moffkalast
2 replies
1d12h

"You are an expert in thinking step by step about how important this is for my career."

avgDev
0 replies
1d5h

"You have been doing this for an eternity, you are grumpy, you scream at juniors, you have a grey beard and your existence depends on this."

Geee
0 replies
1d6h

You are an expert in taking a deep breath and...

yanis_t
5 replies
1d14h

Really liked how you serve the demo of the generated website AS it's being generated, using an iframe with srcdoc. Simple and elegant.

abi
4 replies
1d10h

Thanks! It's more fun than waiting a minute for the AI to finish without any feedback.

mentos
3 replies
1d10h

Now could you automate the feedback part?

Give ChatGPT-4 Vision the goal screenshot and a screenshot of the result, ask it to describe the shortcomings, and feed that feedback back in?

abi
2 replies
1d9h

I experimented a bit with that. It didn't work too well with some basic prompts (describes differences that are insignificant or not visual) but I think I just need to iterate on the prompts.

mentos
1 replies
1d9h

Yea I wonder if you could workshop the feedback comparison prompt using ChatGPT4 ha

"Can you recommend a general prompt that would help me find the significant differences between the source image reference and the target result that I can give as feedback."

something like that

abi
0 replies
1d7h

Yeah, I'm going to experiment with this a bit today.

I think what might work well is a 2 step process: give GPT Vision (1) reference image & (2) screenshot of current code, ask it to find the significant differences. Then, pass that output into a new coding prompt.

Let me know if you come up with a good prompt or feel free to add a PR.
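
A rough sketch of that 2-step idea with the OpenAI Python client; the model name, prompt wording, file names, and base64 helper are my assumptions for illustration, not code from the repo:

  # Step 1: ask GPT-4 Vision to diff the reference screenshot against a
  # screenshot of the current output. Step 2: feed that diff into a new
  # coding prompt. Everything here is illustrative.
  import base64
  from openai import OpenAI

  client = OpenAI()

  def as_data_url(path: str) -> str:
      with open(path, "rb") as f:
          return "data:image/png;base64," + base64.b64encode(f.read()).decode()

  def vision(prompt: str, image_paths: list[str]) -> str:
      content = [{"type": "text", "text": prompt}] + [
          {"type": "image_url", "image_url": {"url": as_data_url(p)}}
          for p in image_paths
      ]
      resp = client.chat.completions.create(
          model="gpt-4-vision-preview",
          messages=[{"role": "user", "content": content}],
          max_tokens=4096,
      )
      return resp.choices[0].message.content

  # Step 1: describe only the significant visual differences.
  diff = vision(
      "Image 1 is the reference design, image 2 is the current build. "
      "List only the significant visual differences; ignore minor ones.",
      ["reference.png", "current_build.png"],
  )

  # Step 2: feed the diff back into a coding prompt along with the reference.
  updated_html = vision(
      "Update the app to fix these differences and return the FULL code:\n" + diff,
      ["reference.png"],
  )
  print(updated_html)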

gardenhedge
4 replies
1d14h

Cool demo, but it would be infuriating to use (in its current state). It just left out the left-hand navigation completely and added a new nav element at the top right.

phrotoma
1 replies
1d11h

At the beginning of the recording you can see the user drag a selection box over the main content of the Instagram profile, omitting the left-hand nav elements. I think it never saw that part.

invalidusernam3
0 replies
1d9h

I think they're referring to the video on the github page where it does the generation for YouTube: https://github.com/abi/screenshot-to-code

merelysounds
0 replies
1d11h

Nice find. I guess a human could also assume that the left hand navigation is in its expanded state - and should be omitted from the page until the left hamburger button gets clicked.

abi
0 replies
1d10h

Yeah, since recording the demo, I added the ability to edit a generation so if you say "you missed the left hand navigation", that should fix it.

Only issue is it'll often skip other sections now to be more terse. But if you're a coder, you can just merge the code by hand and you should get the whole thing.

nailer
3 replies
1d12h

I got excited about “clean HTML code” in the title and then realised this outputs tailwind. Any chance of a pure CSS version?

abi
2 replies
1d10h

Yeah you should be able to modify the prompts to achieve that easily: https://github.com/abi/screenshot-to-code/blob/main/backend/...

I'll try to add a settings panel in the UI as well.

nailer
1 replies
1d7h

Hah I forget this is just a prompt.

I'd suggest modifying the prompt to something like:

- Use CSS 'display: grid;' for most UI elements

- Use CSS grid justify and align properties to place items inside grid elements

- Use padding to separate elements from their children elements

- Use gap to separate elements from their sibling elements

- Avoid using margin at all

That should produce modern-looking HTML without wrapper elements, floats, clearfixes, or other hacks.

abi
0 replies
1d7h

Good idea! Will try to incorporate that. The hardest thing about modifying the prompts is not having a good evaluation method for whether a change makes things better or worse.

ActionHank
3 replies
1d9h

Phishing sites are going to get a whole lot quicker to make!

jjnoakes
2 replies
1d9h

Sorry if I'm being dense, but how is this quicker than using the original site's HTML and css directly?

the_sleaze9
0 replies
1d6h

Maybe copying the images themselves rather than doing it by hand.

gosub100
0 replies
1d5h

lowers the bar so even dumber people can go phishing. Guess that's not "quicker" but more voluminous.

jmacd
2 replies
1d9h

I just don't know how to think about what to build anymore.

Not to detract at all from this (and thanks for making the source available!), but we now have entire classes of problems that seem relatively straightforward to solve, so I pretty much feel like *why bother?*

I need to recalibrate my brain quickly to frame problems differently. Both in terms of what is worth solving, and how to solve.

cantSpellSober
0 replies
1d5h

why bother?

If the output is good enough, it saves me from having to write all the HTML by hand. Big time saver if a tool like this can deliver good-enough code that just requires some refinement.

Less of a time saver if it just outputs <div> soup.

btbuildem
0 replies
1d8h

Build something that solves a painful or interesting problem. Build something new! Nudge the status quo back towards sanity, balance and goodness.

Tech people have this tendency to onanise over whatever tools they're using -- how many times have we seen the plainest vanilla empty "hello world" type of project being showcased, simply because someone was compelled to make Framework A work with Toolkit B for the sake of it. It's so boring!

I think the LLM-based tech poses such a challenge in this context, because yeah, we have to re-think what's possible. There's no point in building a showcase when the tool is a generalist.

jlpom
2 replies
1d14h

I don't see the point; if you want to copy an existing website, why not use Httrack? The result would always be more similar to the original, and you save on GPT API costs. Where this technique shines is sketch-to-website.

mdrzn
0 replies
1d12h

Rewriting an interface from scratch is better than what Httrack does.

Jleagle
0 replies
1d14h

Presumably you don't have to give it an existing website, you could give it a screenshot/design.

ShadowBanThis01
2 replies
1d14h

Phishermen rejoice!

BHSPitMonkey
1 replies
1d14h

If phishing is your goal, why work from a screenshot instead of just using the DOM/styles already given to you by the thing you're imitating?

ShadowBanThis01
0 replies
22h40m

Ask them. They're the ones who manage to riddle phishing sites with misspellings of elementary-school words.

yodon
1 replies
1d9h

The GitHub page says you're going to be offering a hosted version through Pico. May I ask about why you went with Pico (which I'm just learning about through your page)?

Pico only offers 30% of revenue (half the usual app store 60% cut) AND, as I read it, it only pays out if a formerly free user signs up after trying your app (no payment for use by other users already on the platform, so you get no benefits from their having an installed base of existing users).

Those seem like much worse terms and a much smaller user base than a more traditional platform, hence my curiosity on why you chose it.

abi
0 replies
1d9h

I am the maker of Pico :) What I meant was these features were going to be integrated into Pico.

Also, Pico is a general web app building platform. The 30% revenue part is only for affiliates, not for any in-app payments (which Pico doesn't yet support).

sciolist
1 replies
1d15h

How many times does it run inference per screenshot? Looks cool!

abi
0 replies
1d10h

It only does it once. Re-running it does not usually make it better. But I have some ideas on how to improve that.

butz
1 replies
1d7h

Seems like a perfect tool for a project manager who has ever-changing requests. Does it work with "Make it pop" input?

abi
0 replies
1d5h

Totally should

bambax
1 replies
1d15h

Wow this sounds pretty cool, congrats. Great idea.

Is there a way to see what the HTML looks like before installing/running it?

abi
0 replies
1d10h

I'll add more examples in the repo. Here's a quick example: https://codepen.io/Abi-Raja/pen/poGdaZp (replica of https://canvasapp.com)

avgDev
1 replies
1d5h

Absolutely insane. Very nice and clever. Does it handle responsive layouts?

abi
0 replies
1d5h

It's occasionally good at responsive layouts right now. If you upload a mobile screenshot, the mobile version should be good. But to make it fully responsive, additional work is needed.

Mic92
1 replies
1d17h

Pretty cool. Would it be possible to share the generated code for the demo, to get an idea of what the result looks like?

abi
0 replies
1d10h

I'll add more examples in the repo. Here's a quick sample: https://codepen.io/Abi-Raja/pen/poGdaZp (replica of https://canvasapp.com)

Globz
1 replies
1d10h

This reminds me of tldraw, but instead of a screenshot you draw your UI and it converts it to HTML. Check out https://drawmyui.com - here's a demo from Twitter: https://x.com/multikev/status/1724908185361011108?s=46&t=AoX...

jimmySixDOF
0 replies
15h31m

tldraw letting you connect your own OpenAI keys is such a good idea, and it turns it into a transmogrified user interface to GPT-4. What it can do is so powerful that I can imagine MS bringing Visio back this way as a multimodal copilot.

sublinear
0 replies
9h26m

Ignoring the "AI" implementation details, this generates HTML in the same sense that you can technically convert a rasterized image to an SVG that looks like crap when you zoom in and forces the renderer to draw and fill many unnecessary strokes.

In other words, the output of this does not seem clean enough to hand over to a web dev. They're going to have to rewrite all but the most obvious high-level structures, which didn't need a fancy tool anyway and which a snippets plugin in their text editor does a better job of. Much of web dev isn't even visible: accessibility is all metadata you can't get from a screenshot, and responsive CSS would require at least a video exhaustively covering every behavior, animation, etc. The JavaScript would probably be impossible to determine from any amount of image recognition.

Better off just copying the actual HTML directly from dev tools, no?

seeg
0 replies
16h46m

This is a great tool for all your phishing needs!

pradumnasaraf
0 replies
1d13h

It looks promising. It could help lots of content creators share their code.

pmarreck
0 replies
1d6h

Does it use responsive design, so the result works on mobile?

gosub100
0 replies
1d6h

This could be very useful for de-shittifying the web. Imagine a P2P network where Producers go out to enshittified websites (news sites with obnoxious JS and autoplay videos, malware, "subscribe/GDPR" popups, ads) and render HTML1.0 versions of the sites (that could then have further ad-blocking or filters applied to them, like Reader Mode, but taken further). Consumers would browse the same sites, but the add-on would redirect (and perhaps request) to the de-shittified version.

Perhaps people in poorer countries could be motivated to browse the sites, look at ads, and produce the content for a small fee. If a Consumer requests a link that isn't rendered yet (or lately) it could send a signal via P2P saying "someone wants to look at the CNN Sports page" and then a Producer could render it for them. Alternatively, a robot (that manually moves the mouse and clicks links) could do it, from a VM that regularly gets restored from snapshots.

From what I understand, with encrypted DNS and Google's "web DRM" (can't think of the name right now), ad-blockers are going to be snuffed out relatively quickly, so it's important to work on countermeasures. A nice byproduct of this would be a P2P "web archive" similar to archive.org, snapshotting major-trafficked sites day-by-day.

btbuildem
0 replies
1d8h

OP, how do you see this working with a series of screenshots - for example, sites with several pages that each use/take some user-provided data?

I guess I am asking, can you see this approach working beyond simple one-page quick drafts?

awb
0 replies
1d7h

How does it handle mobile / responsive layouts?

anthonylatona
0 replies
3m

Looks awesome. One of the most impressive examples I've seen.

al_be_back
0 replies
1d7h

I can see how it relates to your other product, Pico [1], as a sketch/no-code site generation plugin. Not sure how practical this output would be in production, if at all, but it could be helpful for learning/education (as a tool).

[1] https://picoapps.xyz/

Faizann20
0 replies
17h23m

A live version of this has been online for a few days here! https://brewed.dev/

7734128
0 replies
1d11h

The amazing thing is of course that this is done with a general model, but it would be quite easy to generate data for supervised learning for this task. Generate HTML -> render and screenshot -> use the data in reverse for learning.
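
That pipeline is easy to prototype. A minimal sketch using Playwright for the render-and-screenshot step (the random HTML generator is a trivial stand-in; a real dataset would need far more varied markup):

  # Generate HTML -> render -> screenshot -> save (image, html) pairs,
  # which could then be used "in reverse" as supervised training data.
  import json
  import os
  import random
  from playwright.sync_api import sync_playwright

  def random_page(i: int) -> str:
      color = random.choice(["#f43f5e", "#3b82f6", "#22c55e"])
      items = "".join(f"<li>Item {n}</li>" for n in range(random.randint(3, 10)))
      return (f"<html><body style='background:{color}'>"
              f"<h1>Page {i}</h1><ul>{items}</ul></body></html>")

  os.makedirs("shots", exist_ok=True)
  pairs = []
  with sync_playwright() as p:
      browser = p.chromium.launch()
      page = browser.new_page(viewport={"width": 1280, "height": 720})
      for i in range(100):
          html = random_page(i)
          page.set_content(html)                  # render the generated markup
          page.screenshot(path=f"shots/{i}.png")  # the model's future input
          pairs.append({"image": f"shots/{i}.png", "html": html})
      browser.close()

  with open("pairs.jsonl", "w") as f:
      f.writelines(json.dumps(pair) + "\n" for pair in pairs)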