Using GPT-4 Vision with Vimium to browse the web

transistorfan
48 replies
1d13h

At my work there are a large contingent of people who essentially do manual data copying between legacy programs (govt), because the tech debt is so large that we can't figure out a way to plug these things together. Excited for tools like this to eventually act as a layer that can run over these sort of problems, as bizarre a solution as it is from a compute perspective

bboygravity
10 replies
1d10h

Funny that you and others on here don't seem to realize that literally everybody who uses the internet has the exact same data entry problem all the time. Blame it on "old software", but how about the entire internet?

Copying (or, in most cases even worse, re-typing) form data from one location on the screen into yet another webform.

Username, password, email address, physical address, credit card info etc etc.

Some extensions try to help with data entry, but none of them work properly and consistently enough to really help. Even consistently filling just username and pw is too much to ask.

It's my number 1 frustration when using the internet (worse than ads) and I find it mind-blowing that this hasn't been solved yet with or without LLMs.

I would pay a monthly fee for any software that solves this once and for all, and it sounds like it's coming (and I'm already paying their monthly fee).

fragmede
2 replies
1d9h

consistently filling out username and password is all I wanted from my password manager, but it turns out it handles credit card number and other bits of information for me as well.

mewpmewp2
0 replies
14h59m

Doesn't chrome out of the box handle all of that?

arkitaip
0 replies
1d8h

I've used Bitwarden to fill out job applications faster.

TeMPOraL
2 replies
1d8h

> It's my number 1 frustration when using the internet (worse than ads) and I find it mind-blowing that this hasn't been solved yet with or without LLMs.

Simple: it's because not solving this problem is how our godawful industry makes most of its money. Empowering the user means relinquishing control over their "journey"[0]. Ergonomics means fewer opportunities to upsell or show ads.

I don't have the link handy, but I'm reminded of one of the earliest Windows user interface guidelines documents, back from the Windows 95/98 era, which, in a section about theming/visual style, already recognized that they have to allow for full flexibility, because vendors will insist on fucking the experience up for the sake of branding anyway, and resisting it is futile[1].

--

[0] - I'm trying really hard to hold back my contempt towards terms like this, and the whole salesy way of viewing human-computer interactions.

[1] - They put it in much more polite terms, but the feeling of helplessness was already there.

musha68k
0 replies
1d1h

Ted Nelson’s “intertwingularity” isn’t far off from the data entry problem described. He argues for universal data access where duplication is obsolete. Imagine form data as a single, linkable object across the web, editable in one place, reflected everywhere—no re-typing, just seamless auto-fill. That’s the unrealized potential of hypertext.

itronitron
0 replies
1d6h

> because vendors will insist on fucking the experience up for the sake of branding anyway

I see that you too have at some point installed printer driver software.

williamcotton
0 replies
1d7h

Bash pipes? The free flow of information through composable tools.

The commercial web? Not the above.

This is just a baseline. I’m sure that an LLM can help the issue but the biggest problem is that these varied HTTP-with-datastores are islands passing messages in bottles back and forth while a bash pipeline is akin to fiber optics.

pseudosaid
0 replies
1d9h

Use a password manager. I haven't copy-pasted form data twice on a site in a long time.

loud_cloud
0 replies
1d7h

FTL. See Niagarafiles.

anonzzzies
0 replies
1d8h

Yeah, my dream would be using this to scrape pages, pop the content into my private db, and serve it up in my own format (which is going to be a white page with text, plus inline images and videos that are not ads). And my interactions fed back to the vision model to post in the original. So I never have to see a 'design' (heavy js-riddled unreadable crap) again in my life. And so I can, with my own tooling, browse and reuse my history including content instead of relying on all the broken stuff bolted onto the web.

haswell
7 replies
1d13h

The industry buzzword is "Robotic Process Automation", which as a category of products has been focused on using various forms of ML/AI to glue these things together in a common/structured way (in addition to good old fashioned screen scraping).

Up to this point, these products have been quite brittle. The recent explosion of AI tech seems like quite a boon for this space.

leovander
3 replies
1d11h

In the OP's specific instance, when would you reach for a traditional ETL tool vs an RPA solution?

transistorfan
1 replies
1d11h

How much does the involvement of a bank of fax machines complicate things?

Roark66
0 replies
1d10h

A little perhaps, but not much. One can replace a bank of physical fax machines with modems.

It would be an interesting job for sure. Why wasn't it done before? I can imagine only two reasons. One, there isn't that much data to move and it makes no sense to build software for what a few people spend 30 min per day on. Two, the data in the legacy system is images and people are not just moving it between systems, but also doing categorisation, verification, etc. In which case an AI model may be useful, but almost always hard-coded rules will be faster.

teaearlgraycold
0 replies
1d11h

RPA is for data sources and destinations that are meant for human consumption and entry. So you’d use RPA to take an image of a table and enter every row into a web form.

keepamovin
2 replies
1d10h

I totally agree on all points, especially around what AI means for this.

I'm kind of in a happy accident situation because I was working on something for RPA, which then became a layer that was factored as its own product, but now might be able to come full circle as a result of AI.

Essentially this layer can function as a "delivery medium" for RPA agent creation that you can use on any device without a download. However, as it has many other uses I've been working on those, but I've been seeking a great reason to get back into RPA.

I have a cool idea to leverage human-guided AI creation of data maps and action tours for RPA, but similar to what you say, unless great care is taken you can end up with a brittle approach. Also, as the market has been quite saturated with many reasonable approaches, I just haven't felt compelled.

Yet now I think the possible merging of GPT level AIs with browser instrumentation to deliver an augmented way to browse the web makes that incredibly compelling.

So I'm incredibly thrilled that I have this happy accident of BrowserBox^0 (the layer factored out of the RPA work above), which provides a pluggable/iframe-embeddable interface for remotely controlling a headless browser. So now I want to look at unifying BrowserBox with this kind of GPT-driven exploration.

It's even cooler because, as BB enables co-browsing by default (multiplayer browsing) and turns the browser into a "client-server" architecture, I can see that plugging in GPT-4V as a connecting client, with some kind of minimal API affordance for it to use, like the very cool Vimium keyboard-enabled browsing in the OP, would be such an interesting project to try!

We're open source so if you want to check us out or get involved in this quest, come say hi, maybe get involved if you're game!

0: https://github.com/BrowserBox/BrowserBox

jimmySixDOF
1 replies
1d7h

I have watched your project for a while as a possible option for embedded browsers for XR applications like WebXR, but the high licensing cost was a factor and solutions like Hyperbeam or Vuplex in Unity have been possible. Definitely agree that multimodal LLM integration is a huge opportunity, and multiplayer browsing with AI in realtime is a super cool idea if you package it right.

keepamovin
0 replies
1d5h

Hi jimmySixDOF thank you for the kind words and the attention on our project! :)

Regarding pricing we have heard that feedback over time and gradually adjusted our licensing costs. It should now be much more affordable as it is targeted towards large deployments, with decreasing cost and increasing value at scale.

If you'd like to send an email with any thoughts on our current prices on https://dosyago.com to cris@dosyago.com, I'd highly value it!

Your idea of WebXR and embedding within Unity is very interesting, and I think it could be a fit.

aikinai
6 replies
1d13h

I remember years ago thinking it was weird in Ghost in the Shell when a robot had fingers on its fingers to type really fast. Maybe that really won’t happen since they can plug into USB at least, but they will probably use the screen and keyboard input sometimes at least.

yjftsjthsd-h
3 replies
1d12h

USB is an attack vector; if it's not exploiting your USB driver it's connecting your data pins to mains power. Keyboards are an air gap.

simbolit
2 replies
1d1h

Isn't the keyboard connected to the computer via USB?

If I have access to the keyboard, I have access to a USB cable plugged into the computer, right?

Perhaps I misunderstand something....

yjftsjthsd-h
1 replies
23h28m

I meant the reverse; the computer attacking the robot using it

simbolit
0 replies
22h35m

Uhhhhh, thanks. That makes a lot of sense!

pixl97
0 replies
1d11h

The issue with USB is you have to have power protection circuits. An analog interface, at least in the show, appeared much harder to hack.

nomel
0 replies
1d13h

Why would a keyboard be required? I think the intent to hit a letter would more easily be sent over a bluetooth HID "device". ;)

hubraumhugo
5 replies
1d11h

I believe that LLMs will automate most of our data entry/copy/transformation work. 80% of the world's data is unstructured and scattered across formats like HTML, PDFs, or images that are hard to access and analyze. Multimodal models can now tap into that data without having to rely on complex OCR technologies or expensive tooling.

If you go to platforms like Upwork, there are thousands of VAs in low-cost labor countries that do nothing else than manual data entry work. IMO that's a complete waste of human capital and I've made it my personal mission to automate such tedious and un-creative data work with https://kadoa.com.

kristopolous
3 replies
1d11h

I was thinking what the payoff would be to pose as human for these terrible pay click jobs and then assign them to an LLM en masse. There's an arbitrage there ... it may be a good strategy.

I heard recently that "click-work" works out to about $4/hr.* If you could do that x50, passively, it's a fine income.

* - see https://journals.sagepub.com/doi/full/10.1177/14614448231183... or listen to https://kpfa.org/episode/against-the-grain-october-30-2023/... It's a fascinating study. Terrible pay (way below minimum wage) but surprisingly high worker satisfaction. The users seem to view it as entertainment, essentially categorizing it as casual gaming.

The "asshole innovator" in me wonders if one could simply make it more entertaining and forego paying the user entirely.

hubraumhugo
1 replies
1d11h

Interesting. Instead of doing the click work manually, microworkers will just instruct and guide multiple GPTs.

kristopolous
0 replies
1d10h

Maybe. A lot of modern clickwork is actually model training, and there is a model-collapse phenomenon (https://arxiv.org/abs/2305.17493) which means that it should be banned for such work. I bet a number of clever people on the platforms are already trying to instrument AI to do the work regardless - it's pretty close to "free money" if you can pull it off and not get caught, and at a spigot size where there are no real serious consequences if you do.

ishan0102
0 replies
1d11h

Yeah, this seems easy to build, but I'd rather work on making tools that improve accessibility 10x.

ishan0102
0 replies
1d11h

Yup, that's my long term goal. I want an "anything API" that brings structure to anything on the web.

yreg
3 replies
1d4h

A long, long time ago I worked on a small project for a major multinational grocery chain.

I made them a tool that parses an Excel file with a specific structure and calls some endpoints in their internal system to submit the data.

I was curious, so I asked how they are doing it currently. They led me to a computer at the back of their office. The wallpaper had two rectangles, one of them said MS EXCEL and the other said INTERNET EXPLORER. Then the person opened these apps, carefully positioned both windows exactly into those rectangles and ran some auto-clicker - the kind cheaters would use in RuneScape – which moved the cursor and copied and pasted the values from the Excel into the various forms on the website.

Amazing.

Valgrim
1 replies
1d2h

I worked with a client who used a multi-million dollar system for moving goods automatically into packaging stations. The system was built and maintained by a major european company. All the data was transferred automatically between systems normally, but one day, for some reason, there was an internal communication error inside the machine which caused a lot of packages to be sent without being recorded as such.

Now normally we would just have contacted the company and asked them for a data extraction so we could cross-reference the data. But since it wasn't clear who was at fault, and we knew it would take weeks for that extraction, we looked for an internal solution first.

Now there was a subsystem in the machine that worked only in Internet Explorer, with an old authentication scheme, that we could use to see the information we needed, so I, being the only person in the team without formal analysis training but having made my way there from a clerk job, knew exactly what to do.

I fired up the old IE and Excel, wrote a VBA script in 5 minutes that did exactly what you described (click there, copy that, etc.), and 30 minutes later we had our extraction, and resolved the issue completely before the packages were even shipped.

All hail Excel.

mst
0 replies
21h39m

For all its flaws as a programming language, VBA made an excellent bodging language and I salute your expedient field hack.

kspacewalk2
0 replies
23h56m

I wonder if it used something like AutoIt[0]. I remember using it at one of my more boring co-op jobs about 20 years ago to automate moving data between a spreadsheet and some obscure database product.

[0] https://en.wikipedia.org/wiki/AutoIt

gumballindie
2 replies
1d2h

Wow. Leaking confidential taxpayer data.

transistorfan
1 replies
14h11m

I should have been clearer: it's between two apps that we host internally - applications on our own intranet cannot talk to each other. If you want to get any data out of either of these apps to the world, you need to do a manual export and email/USB, which would obviously get flagged.

gumballindie
0 replies
1h55m

Correct, but ChatGPT reads screen data to be able to "click" around. So you would need to expose at least the data that is displayed on screen to this external product.

Roark66
1 replies
1d10h

Whenever I hear about such a thing (people doing legacy system data extraction manually) I wonder if in every case someone got the estimate for the "proper" solution and just decided a bunch of people typing is cheaper?

Integrating things like ChatGPT will still require people who know what they are doing to look at it, and I wouldn't be surprised if the first advice they give is "don't use ChatGPT for it".

spaceman_2020
0 replies
1d2h

If the market forces work as they’re supposed to (not a given anymore), then corporations that adopt better tech will see better profits through lower expenses. And then the laggards will have to adapt or die.

Also remember that this is essentially v1 of the software- the Windows 95 of this adoption cycle

specialist
0 replies
1d5h

> a large contingent of people who essentially do manual data copying

Yup.

I was briefly part of a decades-long effort to migrate off a mainframe backend. It was basically a very expensive shared flat-file database (eg FileMaker Pro). Used by thousands of applications, neither inventoried nor managed. Surely a handful were critical for daily operations, but no one remembered which ones.

And the source data (quality) was filthy.

I suggested we pay some students to manually copy just the bits of data our spiffy "modern" apps needed.

No one was amused.

--

I also suggested we find a suitable COBOL runtime and just forklift the mainframe's "critical" infra into a virtual machine.

No one was amused.

Lastly, I suggested we throttle access to every unidentified mainframe client. Progressively making it slower over time. Surely we'd hear about anything critical breaking.

That suggestion flew like a lead zeppelin.

morkalork
0 replies
1d13h

Kinda sci-fi, we're so close to a future where when/if original source code is lost, a mainframe runs in an emulator and the human operating it is also emulated.

monkeydust
0 replies
1d7h

This has been fruitful ground for RPA offerings like UiPath and Automation Anywhere. Multi-modal LLMs open up a chance to disrupt them.

alexirobbins
0 replies
1d8h

Working on this layer at https://autotab.com. This sounds like an amazing problem for browser automation to solve, would love to talk with you if you're interested!

abrichr
0 replies
4h25m

This type of use case is exactly why we are building https://github.com/OpenAdaptAI/OpenAdapt

Garlef
0 replies
1d10h

"Chinese Room Automation"

FooBarWidget
0 replies
1d8h

It's bizarre computationally, but at this point maybe we have to compare it to the alternative: hiring a person. At least the AI only consumes electricity (which is hopefully green), while a person consumes food (grown with mined fertilizers), or meat (which we know is really bad for the environment).

ishan0102
9 replies
1d13h

Hey! Creator here, thanks for sharing! Let me know if anyone has questions and feel free to contribute, I've left some potential next steps in the README.

roland35
1 replies
1d6h

what terminal are you using???

ishan0102
0 replies
1d2h

Warp! (warp.dev)

jgalentine007
1 replies
1d13h

Very cool use for Vimium, I like the approach!

ishan0102
0 replies
1d13h

Thank you!

celeste_lan
1 replies
1d13h

Omg I also just released something pretty similar earlier today: https://github.com/Jiayi-Pan/GPT-V-on-Web. But it received little attention.

ishan0102
0 replies
1d12h

Woah looks great, not surprised that multiple people thought of this! Your prompt looks much better than mine, I'm not really taking advantage of any of the default Vimium shortcuts.

squeegmeister
0 replies
1d11h

How does this differ from how ChatGPT currently browses the web?

poulpy123
0 replies
1d2h

could it be used to make a bot that visit and parse websites to extrat relevant information without writing a parser for each websites ?

jimmySixDOF
0 replies
1d10h

Nice. I know Open Interpreter is trying to get Selenium automated under natural language control, and quite a few other projects are also popping up on HN lately. The Vimium approach is a lot lighter, so it looks promising. One way or another the as-published world wide web is turning into its own dynamic API overlay server. Ingest all the Sources!

snake_doc
8 replies
1d12h

Ah, very similar to Adept’s[1] concept? Though, their product seems not yet ready.

[1] https://www.adept.ai/

jatins
3 replies
1d9h

It's also a little insane to me that what Adept has supposedly been building for years with $300M+ in funding can now be built in a day with OpenAI APIs?

I think Adept pivoted along the way but original concept was very similar to this.

sunshadow
1 replies
1d6h

But it's too expensive to be practical with the OpenAI API. Also, the demo is cool until you see real-world webpages; then you'll realize that this works on less than 50% of webpages.

og_kalu
0 replies
1d4h

GPT-4V may be surprisingly robust here. Set-of-mark prompting (which is accomplished here with Vimium) improves grounding by a silly high amount. https://som-gpt4v.github.io/
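
As an aside, the "marks" half of this is easy to reproduce. A minimal sketch of Vimium-style hint generation (the helper and alphabet are illustrative, loosely modeled on Vimium's home-row hint characters, not the extension's actual code):

```python
def hint_labels(n: int, alphabet: str = "sadfjklewcmpgh") -> list[str]:
    """Generate n distinct two-character hint labels.

    Using fixed-length combinations means no label is ever a prefix
    of another, so typed input is never ambiguous.
    """
    labels = [a + b for a in alphabet for b in alphabet]
    if n > len(labels):
        raise ValueError("too many elements for this alphabet")
    return labels[:n]
```

Each label would then be drawn over its clickable element before the screenshot is sent to the model, which is what makes the "respond with the yellow character sequence" grounding work.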

abrichr
0 replies
4h23m

Agreed! This is part of the motivation behind https://github.com/OpenAdaptAI/OpenAdapt

ishan0102
1 replies
1d11h

Yep, took inspiration from them and a couple other startups

QkPrsMizkYvt
0 replies
1d7h

What other startups did you use for inspiration?

karmasimida
0 replies
1d11h

This is precisely the demo I was thinking of.

FooBarWidget
4 replies
1d9h

Many Dutch companies pay salaries by

1. receiving payslips from the accountant, and then

2. manually initiating bank transfers to each employee for the amount in the corresponding payslip, and then

3. manually initiating a bank transfer to the tax authority to pay the withheld salary taxes.

This is completely useless manual labor. There should be no reason for this to be a manual procedure. And yet it's almost impossible to automate this. The accountant portal either has no API, or it has an API but lets you download the data as PDF, and/or the API costs good money. The bank either has no API, or it requires you to sign up for a developer account as if you're going to publish a public app, when you're just looking to automate some internal procedures.

So the easiest way to pay salaries and taxes is still to hire a person to do it manually. Hopefully one day that won't be necessary anymore. I wouldn't trust an AI to actually initiate the bank transfers, but maybe they can just prepare the transactions and then a person has to approve the submission.

nvm0n2
0 replies
1d4h

That's just a bank problem. Certainly this isn't how payroll works for large companies. Banks usually let you upload XML files that define a set of SWIFT payments; this is how I do payroll even for a small company. The accountants supply the XML file too; presumably they have an app that generates it.

martinald
0 replies
1d8h

I don't think this really has much to do with AI. In the UK there are solutions like Pento now which do all of this, including automating payments via open banking to the employee and the tax authority, and automatically submitting tax filings:

https://www.pento.io/la/payroll-software

is_true
0 replies
1d4h

In my country it's similar, but some data you have to upload to the government agency's site. I think it was earlier this year that they released a statement saying that people using software to perform actions on the website could get banned.

abrichr
0 replies
4h21m

Thanks for the tip!

Automating repetitive GUI workflows is the goal of https://github.com/OpenAdaptAI/OpenAdapt

thekid314
3 replies
1d13h

I'm curious to see what it does when it sees a captcha.

ishan0102
2 replies
1d11h

From OpenAI docs[1]: "For safety reasons, we have implemented a system to block the submission of CAPTCHAs."

[1] https://platform.openai.com/docs/guides/vision

xur17
1 replies
1d11h

Yeah, I've been feeding screenshots from selenium to the vision API, and when I trigger bot detection on a website, chatgpt refuses to process the image.
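
For anyone wiring up the same pipeline, the glue is small. A hedged sketch of packaging a Selenium screenshot for the vision endpoint (the message shape follows OpenAI's documented vision chat format; the helper name is made up):

```python
import base64

def vision_message(png_bytes: bytes, prompt: str) -> dict:
    """Build a chat message carrying a screenshot inline.

    The image is passed as a base64 data URL alongside the text
    prompt, per OpenAI's vision chat message format.
    """
    b64 = base64.b64encode(png_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ],
    }
```

With Selenium, `driver.get_screenshot_as_png()` returns the bytes this expects.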

NorwegianDude
0 replies
1d8h

It does solve, or at least try to solve, captchas for me. It gets like half the characters correct; it's very bad at it.

maccam912
3 replies
1d12h

I've been playing with a similar idea of screenshots and actions from GPT-4 Vision for browsing, but after trying and failing to overlay info in the screenshot, I ended up just getting the accessibility tree from Playwright and sending that along as text so the model would know what options it had for interaction. In my case it seemed to work better. I see the creator is here and has a list of future ideas - maybe add this to the list if you think it's a good idea?
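
The accessibility-tree idea fits in a few lines. Assuming the nested role/name/children dict shape that Playwright's `page.accessibility.snapshot()` returns, a hypothetical flattener could render it as indented text for the model:

```python
def flatten_ax_tree(node: dict, depth: int = 0) -> list[str]:
    """Flatten a Playwright-style accessibility snapshot into text.

    Each node becomes a "role: name" line, indented by depth, so the
    model sees a compact outline of what it can interact with.
    """
    lines = [f"{'  ' * depth}{node.get('role', '?')}: {node.get('name', '')}"]
    for child in node.get("children", []):
        lines.extend(flatten_ax_tree(child, depth + 1))
    return lines
```

The joined lines would then go into the prompt in place of (or alongside) the screenshot.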

manmal
1 replies
1d11h

Probably better to capture all the content and not just what fits on one screen. Most pages should fit as text (or HTML?) in the new extended token window.

arbuge
0 replies
1d4h

Better watch token costs. The per token costs are lower now but even so a full context load still costs almost $4.
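
For reference, the ~$4 figure is consistent with $0.03 per 1K input tokens applied to the full 128K context window (the rate is an assumption; actual pricing varies by model and has changed over time):

```python
# Back-of-envelope cost of stuffing the whole context window.
context_tokens = 128_000
price_per_1k_input = 0.03  # assumed $/1K input tokens
cost = context_tokens / 1000 * price_per_1k_input
print(f"full context load ≈ ${cost:.2f}")
```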

ishan0102
0 replies
1d11h

Cool, that's a solid idea. I was trying to only use visual data, but this could make the agent a lot more powerful. I'll try this really soon.

imranq
3 replies
1d13h

Is the vision model directly reading the screen and therefore also reading the Vimium tags? It might be more effective to export the DOM tags and the associated elements as a JSON object that is fed into ChatGPT without using the vision component.

dymk
2 replies
1d13h

Currently the Vision API doesn't support JSON mode or function calling, so we have to rely on more primitive prompting methods.

maccam912
1 replies
1d11h

I found that it works well to ask it to generate JSON as best it can, then pass it to gpt-3.5-turbo with the JSON response mode and instruct it to just clean up whatever input it received.
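
The local half of that cleanup can be attempted before spending another API call. A hedged sketch (helper name is made up) that strips code fences and grabs the outermost object, leaving the gpt-3.5 repair pass as a fallback for when parsing still fails:

```python
import json
import re

def extract_json(reply: str) -> dict:
    """Best-effort parse of a model reply that should be JSON."""
    # Drop Markdown code fences like ``` or ```json.
    text = re.sub(r"```[a-zA-Z]*", "", reply).strip()
    # Take the outermost {...} span, ignoring any surrounding chatter.
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object found in reply")
    return json.loads(text[start:end + 1])
```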

ishan0102
0 replies
1d11h

Perfect, I have this as a todo in my readme and I’ll implement this soon

e12e
3 replies
23h57m

It's insane that this is now possible:

https://github.com/ishan0102/vimGPT/blob/682b5e539541cd6d710...

"You need to choose which action to take to help a user do this task: {objective}. Your options are navigate, type, click, and done. Navigate should take you to the specified URL. Type and click take strings where if you want to click on an object, return the string with the yellow character sequence you want to click on, and to type just a string with the message you want to type. For clicks, please only respond with the 1-2 letter sequence in the yellow box, and if there are multiple valid options choose the one you think a user would select. For typing, please return a click to click on the box along with a type with the message to write. When the page seems satisfactory, return done as a key with no value. You must respond in JSON only with no other fluff or bad things will happen. The JSON keys must ONLY be one of navigate, type, or click. Do not return the JSON inside a code block."

Maxion
2 replies
22h52m

The speed at which this is moving is mind-boggling. This may become crazier than the dot-com boom.

pms
1 replies
18h51m

Until you realize that it doesn't work well with less popular videos (any items really), because "Large Language Models Struggle to Learn Long-Tail Knowledge" [1].

[1]https://proceedings.mlr.press/v202/kandpal23a.html

heroprotagonist
0 replies
0m

Except in this case, the knowledge is 'how to search the web for X' instead of 'an understanding or familiarity with X'.

jackconsidine
2 replies
1d13h

Looks extremely cool. Trying to run it though, I get stuck at "Getting actions for the given objective..." (using the example on the repo)

ishan0102
1 replies
1d13h

Huh weird, I'm getting that too. OpenAI has been having periodic outages today, think that might be why since it was working fine earlier.

jechamt
0 replies
1d8h

News reports (https://www.bleepingcomputer.com/news/security/openai-confir...) and their incident reports (https://status.openai.com/incidents/21vl32gvx3hb) indicate they are mitigating / fighting off attacks recently.

burcs
2 replies
1d13h

This is amazing, I feel like these vision models are going to make everything so much more accessible. Between the Be My Eyes app integration and now this, I'm really excited for how this transforms the web.

ctoth
1 replies
1d13h

I agree, and I think we're a year or two away from a full end-to-end trained screen reader. The ground truth from existing systems would provide great training material.

As a technical blind person, my only concern is the inherent loss of privacy while sharing stuff with the big models.

reqo
1 replies
1d10h

How will tools like this affect web tracking, or advertisements on the internet generally? Imagine you could have an agent browse the web for you and fetch exactly what you are searching for, without you seeing any ads/pop-ups or being tracked along the way! Could be a great "ad blocker"! Could it perhaps also make SEO useless and thus improve the quality of the internet? But I wonder if it could also have negative effects, such as the ads being "interweaved" into the fetched content somehow!

og_kalu
0 replies
18h32m

Since this is sending screenshots of pages to GPT, won't it see the ads as well?

ranulo
1 replies
1d7h

This could enable human language test automation scripts and could either improve my life as a QA engineer a lot or completely destroy it. Not sure yet.

sunshadow
0 replies
1d6h

You're good until this is cheaper than your salary.

lachlan_gray
1 replies
1d9h

I think vim is unintentionally a great “embodiment” for chatgpt. There’s nothing that can’t be done with a stream of text, and the internet is full of vimscript already

I started a similar experiment if anyone else is thinking along the same lines :)

https://github.com/LachlanGray/vim-agent

gsuuon
0 replies
1d

This is a neat idea!

karmasimida
1 replies
1d11h

We can create an autopilot for the browser.

It is going to be incredibly difficult moving forward to distinguish bot traffic if this is deployed at scale.

The problem I see is this isn't going to be cheap or even affordable in the short term.

ishan0102
0 replies
1d11h

I think costs can come down if you finetune open-source models like LLaVA or CogVLM. This demo also cost about 6 cents, so it's not insanely expensive either, especially with clever prompting.

braindead_in
1 replies
1d9h

Why not build a new browser with GPT baked in?

reustle
0 replies
1d8h

Curious, how would that differ? Assuming it is just grabbing the rendered HTML DOM after each action, isn’t it nearly the same?

ternaus
0 replies
22h32m

Love the idea.

It also shows that GPT-4V created a new angle in web scraping.

I guess, this or similar code would be leveraged in many projects like:

1. Scraping XXX websites. Say LinkedIn or Twitter use all types of methods in the DOM to prevent it, but fighting a well-working GPT-4V + OCR would be ultra hard.

2. Give me an analysis of what these XXX companies are doing. This could be done for competitors, to understand the landscape of some industry, or even plainly to get news.

Large-scale scraping, not depending on the source code of the pages, is a powerful infrastructural change.

startages
0 replies
1d2h

There is just so much you can do with GPT-4 Vision; I just hope it becomes more affordable.

snthpy
0 replies
1d12h

Looks cool. Unfortunately I expected this to enhance my Vimium experience but it looks like this is using Vimium to enhance GPT4, right?

rpigab
0 replies
6h7m

This is amazing that it's possible and works, but I wonder if the electricity cost is sustainable in the long run.

For handicapped people who depend on tools like this for accessibility, it's justified, but I wouldn't use it myself if it uses too much power.

I'm sure OpenAI and friends love operating at a loss until everyone uses their products, then enshittify or raise prices, like Netflix, Microsoft, Google, etc., but CO2 emissions can't be easily reversed.

I'd be glad to listen to other points of view though, maybe everything we do on computers is already bad for the environment anyway and comparing which one pollutes more is vain, idk.

owenpalmer
0 replies
1d11h

This will be fantastic for accessibility

nostrowski
0 replies
23h22m

This will be in a future history book under a chapter titled "the beginning of the end"

mediumsmart
0 replies
9h28m

This is awesome and great news, never mind that the AI found the wrong video in the demo:

https://www.youtube.com/watch?v=jRyX1tC2OS0

mackross
0 replies
1d7h

Been playing with this through the ChatGPT interface for the past few weeks. Couple of tips. Update the css to get rid of the gradients and rounded corners. I found red with bold white text to be most consistent. Increase the font size. If two labels overlap, push them apart and add an arrow to the element. Send both images to the API, a version with the annotations added and a version without.
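
The "push them apart" tip can be done with a simple one-axis sweep. A hypothetical helper (not anyone's actual implementation): sort the label positions, then enforce a minimum gap left to right, so overlaps resolve while labels stay near their elements.

```python
def spread_labels(xs: list[float], min_gap: float) -> list[float]:
    """Nudge overlapping hint-label positions apart along one axis.

    Sort, then sweep left-to-right pushing each label just far
    enough right to keep at least min_gap from its neighbor.
    """
    out = sorted(xs)
    for i in range(1, len(out)):
        if out[i] - out[i - 1] < min_gap:
            out[i] = out[i - 1] + min_gap
    return out
```

Run once per axis; an arrow from label to element (as suggested above) covers the cases where a label ends up displaced.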

jonathanlb
0 replies
1d2h

Hmm interesting. I'm curious what this means for accessibility and screen readers.

gvv
0 replies
1d9h

Nice job! The horrors GPT-4 must endure to watch ads, truly inhumane

doctorM
0 replies
20h43m

i think this is actively dangerous. well not yet. but getting there.

i know - ai isn't meant to be sentient. but if it looks like a duck and quacks like a duck...

how do i know that the comments here aren't done by dedicated hacker news ai bots?

the potential danger could come from lack of supervision down the road.

i didn't get much sleep last night so this is less coherent than it could be.

dangerwill
0 replies
23h7m

How is this making your browsing experience any better? You still have to know what you want to do, and it is just faster to type Rick roll into youtube directly and click the links directly instead of having to type k, or vh, or whatever. You are just adding a useless chatgpt middleman between you and the browser that you likely spend all day in anyway and should be adept at navigating

comment_ran
0 replies
1d13h

It's so cool. I was wondering if we can make crawler tools much easier and better. It's more similar to the "human" way of interacting with a website.

bnchrch
0 replies
1d14h

Personally, this is what I'm really excited about ChatGPT for. Data has just become a lot more free to access.

bilekas
0 replies
1d5h

This is actually pretty interesting... I am thinking maybe it would be faster than writing up Selenium tests ourselves if we could just give a few instructions.

I'm still going through the source, but really nice idea and a great example of enriching GPT with tools like Vimium.

DalasNoin
0 replies
23h35m

I tried to use it, but unfortunately it often did not add the little annotations for the different options to the screen, and it got stuck in a loop. This bot works by adding a two-letter combination to each clickable option, but sometimes they don't show up. It managed to sign in to Twitter once, but really quickly I burned through the 100-image API limit.

Maybe a future version could use vision only for difficult situations in which it gets stuck, and otherwise use the text-based browser?