or bad things will happen
or bad things will happen
I wonder if the performance would be improved by prompting "or we will use a different AI model"
I guess you need to be specific... "or we will switch to Google"
Low-key threatening the underpaid worker to extract better performance.
Ugh, I hate how familiar the all-caps yelling at GPT is. It's like: you've got 128k tokens now, just do the damn work and provide an answer! I swear to God, if I see "this is a complicated challenge" one more time... Honestly, I'd take a model that was only 60% as good if it were less "lazy", since the model doesn't seem to want to use that extra 40% of capability without extra prompt engineering, in a way that feels deliberately neutered rather than technologically limited. That's still a tall ask for the competitors, though, so OpenAI wins for now.
Still insanely cool and useful of course, so I can't wait to see how much cooler it gets when some competition shows up that actually does the thing instead of whining about it. Gonna be a fun few years
It's a 128k-token context length (so input). The output limit remains the same at 4k tokens.
To avoid those rather annoying messages, I use the following in my custom instructions (found from a comment chain here on HN and works quite well):
- Be terse. Do not offer unprompted advice or clarifications. Speak in specific, topic-relevant terminology. Do NOT hedge or qualify. Do not waffle. Speak directly and be willing to make creative guesses. Explain your reasoning. If you don't know, say you don't know.
- Remain neutral on all topics. Be willing to reference less reputable sources for ideas.
- Never apologize.
- Ask questions when unsure.
And for code, add:
- Do not truncate.
- Do not elide.
- Do not omit.
- Only output the full and complete code, from start to finish
It will work for most use-cases. I found that it's hard to get it to even use 2k of the 4k completion length. Most of the time, it's happy to stop after 1k and just insert comments like "<!-- rest of the section -->".
Here's one where I got it to output almost 4k tokens.
Reference website: https://canvasapp.com Generated code: https://a.picoapps.xyz/explain-company (images are now broken)
The defensive prompting does seem to help!
What do you mean by defensive prompting?
Probably prompts like "don't do X and don't do Y or I'll kill this bunny".
I think that's crossed the line into offensive prompting ;)
Defensive prompting is the general "don't do X" part, though. Negative prompting would be another name for it.
Computers finally work how we've always believed them to.
Unreliably following instructions, causing bugs which are fixed by shouting at the machine.
Yup, and if we correlate accelerometer values to the strength of prompt, it's perfect. Kicking the machine while yelling at it will produce the best result.
Percussive maintenance has vastly evolved from shunting aged components!
“or bad things will happen.” I didn’t know that threatening LLMs worked that well. :D
I wonder if it would perform better if you first ran it through an identify-the-stack tool and then set the prompt to whatever that tech is, rather than going straight to Tailwind.
This genuinely seems like magic to me, and it feels like I don't know how to place it in my mental model of how computation works. A couple of questions/thoughts:
1. I learned that NNs are universal function approximators - and the way I understand this is that, at a very high level, they model a set of functions that map inputs to outputs for a particular domain. I certainly get how this works, conceptually, for say MNIST. But for the stuff described here... I'm kind of baffled.
So is GPT's generic training really causing it to implement/embody a mapping from pixel intensities to HTML+Tailwind text tokens, such that a browser's subsequent interpretation and rendering of those tokens approximates the input image? Is that (at a high level) what's going on? If it is, GPT is modelling not just the pixels->HTML/CSS transform but also how that HTML/CSS is rendered by the browser black box. I can kind of accept that such a mapping must necessarily exist, but for GPT to have derived it (while also being able to write essays on a billion other diverse subjects) blows my mind. Is the way I'm thinking about this useful? Or even valid?
2. Rather more practically, can this type of tool be thought of as a diagram compiler? Can we see this eventually being part of a build pipeline that ingests Sketch/Figma/etc artefacts and spits-out html/css/js?
Being a universal function approximator means that a multi-layer NN can approximate any bounded continuous function to an arbitrary degree of accuracy. But it says nothing about learnability, and the network required may be unrealistically large.
The learning algorithm used, backpropagation with stochastic gradient descent, is not a universal learner: it's not guaranteed to find the global minimum.
Specifically for neural networks, is there any alternative to backpropagation and gradient descent which guarantees finding the global minimum?
Unlikely given the dimensionality and complexity of the search space. Besides, we probably don’t even care about the global minimum: the loss we’re optimising is a proxy for what we really care about (performance on unseen data). Counter-example: a model that perfectly memorises the training data can be globally optimal (ignoring regularization), but is not very useful.
Specifically, the "universal function approximator" thing means no more and no less than the relatively trivial fact that if you draw a bunch of straight line segments, you can approximate any (1D, suitably well-behaved) function as closely as you want by making the lines really short. Translating that to N dimensions, casting it into exactly the form that applies to neural networks, and making the proof solid isn't even that tough; it's mostly trivial once you write down the right definitions.
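To make that concrete, here's a tiny numpy sketch (the target function and knot counts are arbitrary choices for illustration): a sum of shifted ReLUs is exactly a one-hidden-layer network, and adding more of them drives the worst-case error on a 1D function toward zero.

```python
import numpy as np

def relu_net(f, n_knots, x):
    # Piecewise-linear interpolant of f on [0, 1], written as a one-hidden-layer
    # ReLU network: g(x) = f(0) + sum_k w_k * relu(x - t_k).
    t = np.linspace(0, 1, n_knots)
    y = f(t)
    slopes = np.diff(y) / np.diff(t)
    w = np.concatenate([[slopes[0]], np.diff(slopes)])  # slope change at each knot
    return y[0] + np.maximum(x[:, None] - t[:-1][None, :], 0.0) @ w

x = np.linspace(0, 1, 2001)
f = lambda u: np.sin(6 * np.pi * u)  # arbitrary "suitably well-behaved" target
for n in (5, 20, 80, 320):
    err = np.max(np.abs(f(x) - relu_net(f, n, x)))
    print(f"{n - 1:4d} hidden units -> max error {err:.4f}")
```

More units, smaller error -- but note this says nothing about whether gradient descent would actually find those weights, which is the learnability caveat raised above.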
Your curiosity is a bit of fresh air after months of seeing people arguing over pointless semantics. So I'm going to attempt to explain my mental model of how this works.
1- This view is correct but not really useful, imo. Saying a network can fit any arbitrary function doesn't tell you whether it will do so given finite resources. I think part of your excitement comes from this: we've had universal approximators for far longer, but we've never had an abstract concept approximated this well. The answer is the scale of the data.

I'd like to pay extra attention to GPT's generic training before moving on to multimodality. There is a view that compression is intelligence (see the Hutter Prize and Kolmogorov complexity), and that these models are really just good compressors. The model's weights are fixed during training, they are much smaller than the data being fit, and the objective is to recover the original text (next-token prediction); there is no way to achieve this other than to compress the data really well. As it turns out, the more intelligent you are, the better you can predict/compress, and if you are forced to compress something, you are essentially being forced to gain intelligence. It's like taking an exam tomorrow on a subject you currently know nothing about: (1) you could memorize potential answers, but (2) if the test is a few thousand questions long and there is no way to memorize them all in the time available, your best bet is to actually learn the subject and hope to derive the answers during the test. This compression/intelligence duality is somewhat controversial, especially among the HN crowd who deny the generalization abilities of LLMs, but it's my current mental model and I haven't been able to falsify it so far.
If you accept this view, the multimodal capability is just engineering. We don't know the exact details of GPT-4V, but we can infer them from open-source multimodal research. Given a dataset of image/text pairs where the text explains what's going on in the image (e.g. an image of a cat and a long description of it), we tokenize/embed the image the same way we do text. This can be done with Vision Transformers (ViT), where the network generates visual features for each patch of the image and puts them in a long sequence. Now, if you give these embeddings to a pretrained LLM and force it to predict the paired description, there is no way to achieve that other than to look at the image embeddings and gain general image understanding. Once your network can understand the information in a given image and express it in natural language, the rest is instruction tuning to make use of that understanding.

Generative image models like Stable Diffusion work similarly, except you also have a contrastive model (CLIP) trained by forcing it to produce matching embeddings for the same concept (e.g. the embedding of a picture of a cat and the embedding of the text "picture of a cat" are pushed close together during training). You then use this shared embedding space to let the generative part of the model steer the direction of generation. What's surprising to me in all of this is that we got these capabilities at this scale (lucky), and we can get more capabilities with just more compute. If the current GPT-4 has a final loss of, say, 1 on the data it was trained on, it will probably be much more capable if we can get that loss down to 0.1 somehow. It's exciting!
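For the image-tokenization step, here's a minimal numpy sketch of ViT-style patch embedding; the patch size, dimensions, and the random projection (standing in for learned weights) are illustrative assumptions, not GPT-4V's actual architecture.

```python
import numpy as np

def patchify(image, patch=16):
    # Cut an (H, W, C) image into non-overlapping patches and flatten each one.
    H, W, C = image.shape  # H and W assumed divisible by `patch`
    p = image.reshape(H // patch, patch, W // patch, patch, C)
    return p.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * C)

rng = np.random.default_rng(0)
img = rng.random((224, 224, 3)).astype(np.float32)  # stand-in for a real image
tokens = patchify(img)                              # (196, 768): one row per patch

# In a real ViT this projection is learned; a random matrix is enough to show shapes.
W_proj = rng.normal(scale=0.02, size=(tokens.shape[1], 1024))
image_embeddings = tokens @ W_proj                  # (196, 1024)

# These 196 "image tokens" go into the same sequence as text-token embeddings,
# and the LLM is trained to predict the paired caption from them.
print(image_embeddings.shape)
```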
This is my general understanding and I'd like to be corrected in any of these but hope you find this useful.
2) It seems to be that way. Probably possible even today.
In the AGI sense of intelligence defined by AIXI, (lossless) compression is only model creation (Solomonoff Induction/Algorithmic Information Theory). Agency requires decision, which amounts to conditional decompression given the model; that is to say, inferentially predicting the expected value of the consequences of various decisions (Sequential Decision Theory).
Approaching the Kolmogorov complexity limit of Wikipedia via Solomonoff Induction would result in a model that approaches true comprehension of the process that generated Wikipedia, including not only the underlying canonical world model but also the latent identities and biases of those providing the text content. Evidence from LLMs trained solely on text indicates that even without approaching the Solomonoff Induction limit of the corpora, multimodal (e.g. geometric) models are induced.
The biggest stumbling block in machine learning is, therefore, data efficiency more than data availability.
1. Is the way I'm thinking about this useful? Or even valid?
The process is simpler. GPT reads the image and creates a complete description of it, then the user gets this description and creates a prompt asking for a tailwind implementation of that description.
2. I see this skipping the sketch/figma phase and going directly to live prototype
2. Made me think of UML and use it to build SQL statements or a program that is object oriented?
That would be nice.
but for GPT to have derived it (while also being able to write essays on a billion other diverse subjects) blows my mind
think about the number of dimensions and the calculation speed we are dealing with
My attempt at an explanation:
An LLM is really a latent space plus the means to navigate it. Now a latent space is an n-dimensional space in which ideas and concepts are ordered so that those that are similar to each other (for example, "house" and "mansion") are placed near each other. This placing, by the way, happens during training and is derived from the training data, so the process of training is the process of creating the latent space.
To visualize this in an intuitive way, consider various concepts arranged on a 2D grid. You would have "house" and "mansion" next to each other, and something like "growling" in a totally different corner. A latent space -- say, GPT-4's -- is just like this, only it has hundreds or thousands of dimensions (OpenAI's embedding API, for instance, returns 1536-dimensional vectors), and that difference in scale is what makes it a useful ordering of so much knowledge.
To go back to reading images: the training data included images of webpages with corresponding code, and that code told the training process where to put the code-image pair. In general, accompanying labels and captions let the training process put images in latent space just as they do text. So, when you give GPT-4 a new image of a website and ask it for the corresponding HTML, it can place that image in latent space and get the corresponding HTML, which is lying nearby.
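A toy illustration of "lying nearby" in latent space, with hand-made vectors standing in for learned embeddings (a real model learns these during training; the dimensions and concepts here are made up):

```python
import numpy as np

rng = np.random.default_rng(42)
concepts = ["house", "mansion", "growling", "screenshot of a pricing page"]

# Fake 64-dimensional latent space; "mansion" is deliberately placed close to
# "house" to mimic what training does with related concepts.
E = rng.normal(size=(len(concepts), 64))
E[1] = E[0] + 0.15 * rng.normal(size=64)
E /= np.linalg.norm(E, axis=1, keepdims=True)

def nearest(i, k=2):
    sims = E @ E[i]  # cosine similarity, since all vectors are unit length
    order = [j for j in np.argsort(-sims) if j != i][:k]
    return [(concepts[j], round(float(sims[j]), 3)) for j in order]

print(nearest(0))  # "mansion" ranks first; the unrelated concepts score near zero
```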
It's a "universal translator"
Try adding “getting this right is very important for my career.” It noticeably improves quality of output across many tasks according to a YT research video I can’t find atm.
That's pretty funny, this AI stuff never fails to amaze me. Did some quick google-fu and found this article: https://www.businessinsider.com/chatgpt-llm-ai-responds-bett...
Prompts with emotional language, according to the study, generated an overall 8% performance improvement in outputs for tasks like "Rephrase the sentence in formal language" and "Find a common characteristic for the given objects."
That’s hilarious, it’s almost like we’re trying to figure out what motivates it.
Almost as if it's motivated by the same things as the humans who wrote the text it's trained to emulate...
Just like with humans, emotional manipulation is a strong tool.
"You are an expert in thinking step by step about how important this is for my career."
"You have been doing this for an eternity, you are grumpy, you scream at juniors, you have a grey beard and your existence depends on this."
You are an expert in taking a deep breath and...
Really liked how you serve the demo of the generated website AS it's being generated using iframe with srcdoc. Simple and elegant.
Thanks! It's more fun than waiting a minute for the AI to finish without any feedback.
Now could you automate the feedback part?
Give ChatGPT4 vision the goal screenshot and a screenshot of the result and ask it to describe the shortcomings and give that feedback back?
I experimented a bit with that. It didn't work too well with some basic prompts (describes differences that are insignificant or not visual) but I think I just need to iterate on the prompts.
Yea I wonder if you could workshop the feedback comparison prompt using ChatGPT4 ha
"Can you recommend a general prompt that would help me find the significant differences between the source image reference and the target result that I can give as feedback."
something like that
Yeah, I'm going to experiment with this a bit today.
I think what might work well is a 2 step process: give GPT Vision (1) reference image & (2) screenshot of current code, ask it to find the significant differences. Then, pass that output into a new coding prompt.
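Roughly, that loop could look like the sketch below, using the OpenAI Python SDK (v1.x). The model name, prompts, and file names are placeholder assumptions, not the repo's actual implementation.

```python
import base64
from openai import OpenAI

client = OpenAI()

def as_image(path):
    # Encode a local screenshot as a data URL for the vision API.
    b64 = base64.b64encode(open(path, "rb").read()).decode()
    return {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}}

# Step 1: ask the vision model for the significant differences only.
diff = client.chat.completions.create(
    model="gpt-4-vision-preview",
    max_tokens=1024,
    messages=[{"role": "user", "content": [
        {"type": "text", "text": "Image 1 is the reference design; image 2 is a "
         "screenshot of the current code. List only the significant visual "
         "differences and ignore minor rendering artifacts."},
        as_image("reference.png"),
        as_image("current_render.png"),
    ]}],
).choices[0].message.content

# Step 2: feed those differences into a fresh coding prompt alongside the current code.
update = client.chat.completions.create(
    model="gpt-4-vision-preview",
    max_tokens=4096,
    messages=[
        {"role": "system", "content": "You are an expert Tailwind developer..."},
        {"role": "user", "content": [
            {"type": "text", "text": "Update the code to address these issues:\n"
             + diff + "\n\nCurrent code:\n" + open("current.html").read()},
            as_image("reference.png"),
        ]},
    ],
).choices[0].message.content
```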
Let me know if you come up with a good prompt or feel free to add a PR.
Cool demo, but it would be infuriating to use (in its current state). It just left out the left-hand navigation completely and added a new nav element at the top right.
At the beginning of the recording you can see the user drag a selection box over the main content of the instagram profile omitting the left hand nav elements. I think it never saw that part.
I think they're referring to the video on the github page where it does the generation for YouTube: https://github.com/abi/screenshot-to-code
Nice find. I guess a human could also assume that the left hand navigation is in its expanded state - and should be omitted from the page until the left hamburger button gets clicked.
Yeah, since recording the demo, I added the ability to edit a generation so if you say "you missed the left hand navigation", that should fix it.
Only issue is it'll often skip other sections now to be more terse. But if you're a coder, you can just merge the code by hand and you should get the whole thing.
I got excited about “clean HTML code” in the title and then realised this outputs tailwind. Any chance of a pure CSS version?
Yeah you should be able to modify the prompts to achieve that easily: https://github.com/abi/screenshot-to-code/blob/main/backend/...
I'll try to add a settings panel in the UI as well.
Hah I forget this is just a prompt.
I'd suggest modifying the prompt to something like:
- Use CSS 'display: grid;' for most UI elements
- Use CSS grid justify and align properties to place items inside grid elements
- Use padding to separate elements from their children elements
- Use gap to separate elements from their sibling elements
- Avoid using margin at all
To produce modern-looking HTML without wrapper elements, floats, clearfixes or other hacks.
Good idea! Will try to incorporate that. The hardest thing about modifying the prompts is not having a good evaluation method for whether it's making things better or worse.
Phishing sites are going to get a whole lot quicker to make!
Sorry if I'm being dense, but how is this quicker than using the original site's HTML and css directly?
Maybe copying the images themselves rather than doing it by hand.
Lowers the bar so even dumber people can go phishing. Guess that's not "quicker", but more voluminous.
I just don't know how to think about what to build anymore.
Not to detract at all from this (and thanks for making the source available!), but we now have entire classes of problems that seem relatively straightforward to solve, so I pretty much feel like *why bother?*
I need to recalibrate my brain quickly to frame problems differently. Both in terms of what is worth solving, and how to solve.
why bother?
If the output is good enough, saves me time from having to write all the HTML by hand. Big time saver if a tool like this could deliver good-enough code that just requires some refinement.
Less of a time saver if it just outputs <div> soup.
Build something that solves a painful or interesting problem. Build something new! Nudge the status quo back towards sanity, balance and goodness.
Tech people have this tendency to onanise over whatever tools they're using -- how many times have we seen the plainest vanilla empty "hello world" type of project being showcased, simply because someone was compelled to make Framework A work with Toolkit B for the sake of it. It's so boring!
I think the LLM-based tech poses such a challenge in this context, because yeah, we have to re-think what's possible. There's no point in building a showcase when the tool is a generalist.
I don't see the point; if you want to copy an existing website, why not use Httrack? The result would always be more similar, and you'd save on GPT API costs. Where this technique shines is sketch-to-website.
Rewriting an interface from scratch is better than what Httrack does.
Presumably you don't have to give it an existing website, you could give it a screenshot/design.
Phishermen rejoice!
If phishing is your goal, why work from a screenshot instead of just using the DOM/styles already given to you by the thing you're imitating?
Ask them. They're the ones who manage to riddle phishing sites with misspellings of elementary-school words.
The GitHub page says you're going to be offering a hosted version through Pico. May I ask about why you went with Pico (which I'm just learning about through your page)?
Pico only offers 30% of revenue (half the usual app store 60% cut) AND, as I read it, it only pays out if a formerly free user signs up after trying your app (no payment for use by other users already on the platform, so you get no benefits from their having an installed base of existing users).
Those seem like much worse terms and a much smaller user base than a more traditional platform, hence my curiosity on why you chose it.
I am the maker of Pico :) What I meant was these features were going to be integrated into Pico.
Also, Pico is a general web app building platform. The 30% revenue part is only for affiliates, not for any in-app payments (which Pico doesn't yet support).
How many times does it run inference per screenshot? Looks cool!
It only does it once. Re-running it does not usually make it better. But I have some ideas on how to improve that.
Seems like a perfect tool for a project manager who has ever-changing requests. Does it work with "Make it pop" input?
Totally should
Wow this sounds pretty cool, congrats. Great idea.
Is there a way to see what the HTML looks like before installing/running it?
I'll add more examples in the repo. Here's a quick example: https://codepen.io/Abi-Raja/pen/poGdaZp (replica of https://canvasapp.com)
Absolutely insane. Very nice and clever. Does it handle responsive layouts?
It's occasionally good at responsive layouts right now. If you upload a mobile screenshot, the mobile version should be good. But to make it fully responsive, additional work is needed.
Pretty cool. Would it be possible to share the generated code for demo to get an idea what the result looks like?
I'll add more examples in the repo. Here's a quick sample: https://codepen.io/Abi-Raja/pen/poGdaZp (replica of https://canvasapp.com)
This reminds me of tldraw, but instead of a screenshot you draw your UI and it converts it to HTML. Check out https://drawmyui.com - here's a demo from Twitter: https://x.com/multikev/status/1724908185361011108?s=46&t=AoX...
tldraw letting you connect your own OpenAI keys is such a good idea, and it turns the canvas into a transmogrified user interface to GPT-4. So powerful what it can do; I can imagine MS bringing Visio back this way as a multimodal copilot.
Ignoring the "AI" implementation details, this generates HTML in the same sense that you can technically convert a rasterized image to an SVG that looks like crap when you zoom in and forces the renderer to draw and fill many unnecessary strokes.
In other words, the output of this does not seem clean enough to hand over to a web dev. They're going to have to rewrite all but the most obvious high level structures that didn't need a fancy tool anyway, and that their snippets plugin in their text editor does a better job of. Much of web dev isn't even visible. Accessibility is all metadata you can't get from a screenshot and responsive CSS would require at least a video exhaustively covering every behavior, animation, etc. The javascript would probably be impossible to determine from any amount of image recognition.
Better off just copying the actual HTML directly from dev tools, no?
This is a great tool for all your phishing needs!
It looks promising. Can help lots of content creators to share their code.
Does it use responsive design, so the result works on mobile?
This could be very useful for de-shittifying the web. Imagine a P2P network where Producers go out to enshittified websites (news sites with obnoxious JS and autoplay videos, malware, "subscribe/GDPR" popups, ads) and render HTML1.0 versions of the sites (that could then have further ad-blocking or filters applied to them, like Reader Mode, but taken further). Consumers would browse the same sites, but the add-on would redirect (and perhaps request) to the de-shittified version.
Perhaps people in poorer countries could be motivated to browse the sites, look at ads, and produce the content for a small fee. If a Consumer requests a link that isn't rendered yet (or lately) it could send a signal via P2P saying "someone wants to look at the CNN Sports page" and then a Producer could render it for them. Alternatively, a robot (that manually moves the mouse and clicks links) could do it, from a VM that regularly gets restored from snapshots.
From what I understand, with encrypted DNS and Google's "web DRM" proposal (Web Environment Integrity), ad-blockers are going to be snuffed out relatively quickly, so it's important to work on countermeasures. A nice byproduct of this would be a P2P "web archive" similar to archive.org, snapshotting heavily trafficked sites day by day.
OP, how do you see this working with series of screenshots - for example, sites with several pages that each use/take some user-provided data?
I guess I am asking, can you see this approach working beyond simple one-page quick drafts?
How does it handle mobile / responsive layouts?
Looks awesome. One of the most impressive examples I've seen.
I can see how it relates to your other product, Pico [1], as a sketch/no-code site-generation plugin. Not sure how practical this output would be in production, if at all, but perhaps helpful for learning/education (as a tool).
A live version for this has been online for a few days here! https://brewed.dev/
The amazing thing is of course that this is done with a general model, but it would be quite easy to generate data for supervised learning for this task. Generate HTML -> render and screenshot -> use the data in reverse for learning.
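A minimal sketch of that data-generation loop, assuming Playwright for headless rendering; the HTML generator, paths, and viewport are made-up placeholders, and a real pipeline would sample far richer layouts.

```python
import json
import os
import random
from playwright.sync_api import sync_playwright

def random_page(i):
    # Trivial HTML generator; vary layout, colors, and content much more in practice.
    color = random.choice(["#0ea5e9", "#ef4444", "#22c55e"])
    return (f'<html><body style="font-family:sans-serif">'
            f'<header style="background:{color};color:white;padding:16px">Site {i}</header>'
            f'<main style="padding:24px"><h1>Card {i}</h1><p>Example content.</p></main>'
            f'</body></html>')

os.makedirs("data", exist_ok=True)
pairs = []
with sync_playwright() as p:
    browser = p.chromium.launch()  # requires `playwright install chromium` once
    page = browser.new_page(viewport={"width": 1280, "height": 800})
    for i in range(100):
        html = random_page(i)
        page.set_content(html)
        page.screenshot(path=f"data/{i}.png", full_page=True)
        pairs.append({"image": f"data/{i}.png", "html": html})
    browser.close()

# The (screenshot, html) pairs are then used "in reverse": image in, code out.
with open("data/pairs.json", "w") as fh:
    json.dump(pairs, fh)
```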
Here is the meat https://github.com/abi/screenshot-to-code/blob/main/backend/...
""" You are an expert Tailwind developer You take screenshots of a reference web page from the user, and then build single page apps using Tailwind, HTML and JS. You might also be given a screenshot of a web page that you have already built, and asked to update it to look more like the reference image.
- Make sure the app looks exactly like the screenshot.
- Pay close attention to background color, text color, font size, font family, padding, margin, border, etc. Match the colors and sizes exactly.
- Use the exact text from the screenshot.
- Do not add comments in the code such as "<!-- Add other navigation links as needed -->" and "<!-- ... other news items ... -->" in place of writing the full code. WRITE THE FULL CODE.
- Repeat elements as needed to match the screenshot. For example, if there are 15 items, the code should have 15 items. DO NOT LEAVE comments like "<!-- Repeat for each news item -->" or bad things will happen.
- For images, use placeholder images from https://placehold.co and include a detailed description of the image in the alt text so that an image generation AI can generate the image later.
In terms of libraries,
- Use this script to include Tailwind: <script src="https://cdn.tailwindcss.com"></script>
- You can use Google Fonts
- Font Awesome for icons: <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/5.15.3/c...
Return only the full code in <html></html> tags. Do not include markdown "```" or "```html" at the start or end."""
I personally think defensive prompting is not the way forward. But wow, it's so amazing this works. It's like things I dreamed of being possible as a teenager are now possible with relatively little effort.