Show HN: Skyvern – Browser automation using LLMs and computer vision

dtnewman
26 replies
23h20m

I tried it out and it's pretty pricey. My OpenAI API bill is $3.20 after using this on a few different pages to test it out.

Not saying I wouldn't pay that for some use cases, but it would limit me.

One idea: making scrapers is a big pain. But once they are set up, they are cheap and fast to run... this is always going to be slower. What I'd love to see is a way to generate scrapers quickly. So you wouldn't be returning information from the New York City property registry... instead, you'd return Python code that I can use to scrape it in the future.

edit: This is likely because it was struggling, so it had to make extra calls. What would be nice is a simple feature where you can input the maximum number of calls / tokens to use on the entire task. Or even better, do some math and put in a dollar cap, e.g., go fill out the Geico forms for me and don't spend more than $1.00 doing it.
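
To illustrate the dollar-cap idea, a minimal sketch (the per-token prices and the run_step/task_done helpers are hypothetical, not part of Skyvern):

    MAX_DOLLARS = 1.00
    PRICE_PER_1K_PROMPT = 0.01      # assumed GPT-4-class pricing, $ per 1K prompt tokens
    PRICE_PER_1K_COMPLETION = 0.03  # assumed $ per 1K completion tokens

    spent = 0.0
    while not task_done():          # hypothetical: has the form been submitted?
        usage = run_step()          # hypothetical: one LLM-driven browser action
        spent += usage.prompt_tokens / 1000 * PRICE_PER_1K_PROMPT
        spent += usage.completion_tokens / 1000 * PRICE_PER_1K_COMPLETION
        if spent > MAX_DOLLARS:
            raise RuntimeError(f"Stopped: ${spent:.2f} spent, cap is ${MAX_DOLLARS:.2f}")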

keremyilmaz
13 replies
22h47m

You've raised valid points about the cost and efficiency of our approach, which aims to make the LLM function as closely as possible to a human user. We chose this approach primarily for its compatibility with various websites, as it aligns closely with a website's intended audience, which is typically human.

Addressing complex website interactions is a key advantage of this approach. For instance, in the process of generating an auto insurance quote, the sequence of questions and their specifics can vary greatly depending on prior responses. A simple example is the choice of a foreign versus a California driver's license. Selecting a foreign license triggers additional queries about the country of issuance and expiry date, illustrating the complexity and branching nature of such web interactions.

However, we recognize the concerns about cost and are actively working on strategies to reduce it:

- Optimizing the context provided to the LLM
- Implementing caching mechanisms for certain repeated actions, only using LLMs when there's a problem
- Anticipating advancements in LLM efficiency and cost-effectiveness, with the hope of eventually finetuning our own models for greater efficiency

dinobones
7 replies
19h57m

There are two things here:

1) Using the LLM to find elements/selectors in HTML

2) Using LLMs to fill out logical/likely/meaningful answers to things

I highly recommend you decouple these 2 efforts. While you gave a good example of an "insurance quote step-by-step webapp", the vast majority of web scraping efforts are much more mundane.

Additionally, even in this instance, the selector brain/intelligence brain don't need to be coupled.

For example:

Selector brain: "Find/click the button for foreign driver's license."

Selector brain: "Find the country of origin field."

Selector brain: "Find the expiry date field."

LLM-intelligence brain: "Use values from prompt to fill out the country of origin and expiry date fields."

Not-LLM intelligence brain: Inputs values from a JSON object of documentSelector=>value.
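
Concretely, something like this (a rough sketch with made-up selectors, using Playwright for illustration):

    from playwright.sync_api import sync_playwright

    # Output of the "selector brain" (one LLM pass per page layout, then cached).
    selectors = {
        "foreign_license_button": "#license-type-foreign",
        "country_of_origin": "input[name='licenseCountry']",
        "expiry_date": "input[name='licenseExpiry']",
    }

    # Output of the intelligence brain: a plain documentSelector => value map.
    values = {
        "country_of_origin": "Germany",
        "expiry_date": "2027-05-01",
    }

    with sync_playwright() as p:
        page = p.chromium.launch().new_page()
        page.goto("https://insurer.example/quote")  # placeholder URL
        page.click(selectors["foreign_license_button"])
        for field, value in values.items():
            page.fill(selectors[field], value)  # no LLM call at run time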

suchintan
6 replies
18h50m

Interesting. We've decoupled navigation and extraction for specifically this reason, but I suppose decoupling selection from input could let us use cheaper, smaller LLMs to "select" and answer.

We've been approaching it a little bit differently. We think larger, more capable models would actually immediately improve the performance of Skyvern. For example, if you run it with LLaVA, the performance significantly degrades, likely because of the coupling.

But since we use GPT-4V, and it's rumoured to be an MoE model, I wonder if there's implicit decoupling going on.

I'm gonna spend some more time thinking about this

bravura
5 replies
18h15m

I still think you're missing the point. The idea is that you should use vision APIs and LLMs to build traditional browser automation using a DSL or Python.

I don't want to use vision and LLMs for every page. I just want to use vision and LLMs to figure out what elements need to be clicked once. Or maybe every time the site changes the frontend.

pmontra
1 replies
11h8m

The AI would be a compiler that generates the traditional scraper / integration test.

It would save all that time spent manually going through every page and figuring out what mistake we made when that input string doesn't go into that input field or the button on the modal window is not clicked.

Change the UI? Recompile with the AI.

bravura
0 replies
10h50m

I didn’t check the code but there would be a few good ways to specify what you want:

* browser extension that lets you record a few actions

* describing what you want to do with text

* a url with one or two lines of desired JSON to extract

epr
1 replies
4h40m

We call it "prompt caching"

No, that's something completely different than what bravura is talking about, which is why he made a comment to say explicitly that he still thinks you're missing the point.

From your roadmap:

Prompt Caching - Introduce a caching layer to the LLM calls to dramatically reduce the cost of running Skyvern (memorize past actions and repeat them!)

Adding a caching layer is not what they're asking for. They want to periodically use Skyvern to generate automation code, which they could then deploy themselves in their testing/CI setup. Eventually their target website may make breaking UI changes, then you use Skyvern to generate new automation code. Rinse and repeat. This has nothing to do with an internal caching layer within your service.

suchintan
0 replies
4h18m

We've discussed generating automation code internally a bunch, and what we decided on is to do action generation and memorization instead of code generation and memorization. They're not that far apart conceptually, but there is one important distinction: the generated output would just be a list of actions and their associated data sources.

For example, if Skyvern was asked to log in to a website and do a search for product X, the generated action plan would include:

1. Click the log-in button
2. Click "sign in with email"
3. Input the email address retrieved from source X
4. Input the password retrieved from source Y
5. Click log in
6. Click on the search bar
7. Input the search term from source Z
8. Click Search

Now, if the layout changed and suddenly the log-in button had a different XPath, you have two options:

1. Re-generate the entire action plan (or sub-action plan)
2. Re-generate the specific component that broke and assume everything else in the action plan still works
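
As a rough illustration, the memorized plan and option 2 could look like this (hypothetical schema, not our actual format):

    from dataclasses import dataclass

    @dataclass
    class Action:
        kind: str          # "click" or "input"
        xpath: str         # memorized from the last successful run
        source: str = ""   # where input data comes from, e.g. "source_x.email"

    plan = [
        Action("click", "//button[@id='login']"),
        Action("click", "//a[text()='Sign in with email']"),
        Action("input", "//input[@name='email']", source="source_x.email"),
        Action("input", "//input[@name='password']", source="source_y.password"),
        Action("click", "//button[@type='submit']"),
        Action("input", "//input[@role='searchbox']", source="source_z.search_term"),
        Action("click", "//button[text()='Search']"),
    ]

    def run(plan, execute, regenerate_action):  # hypothetical callbacks
        for i, action in enumerate(plan):
            try:
                execute(action)
            except Exception:
                # Option 2: re-generate only the component that broke and
                # assume the rest of the action plan still works.
                plan[i] = regenerate_action(action)
                execute(plan[i])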

dtnewman
4 replies
22h27m

I like this approach. Just as an example, if I'm getting a car insurance quote, I'd rather pay $1 to have the tool fill out the forms for me and be 90% sure that it filled them out correctly than pay $0.01 and only be 70% sure it did it correctly. And there are plenty of use cases like that.

amne
1 replies
20h49m

Isn't that crazy Rabbit thingy supposed to do just that? I hope you pre-ordered. I hear they're in great demand.

Kerbonut
1 replies
8h47m

You would still be willing to pay $1 if it got it wrong 10% of the time, or if it got 10% of the information wrong every time?

dtnewman
0 replies
7h6m

It really depends on the use case.

jumploops
4 replies
21h51m

Scrapers are one of the main use cases we're seeing for Magic Loops[0].

...and you've hit the nail on the head in terms of our design philosophy: use LLMs to generate useful logic, then run that logic without needing to call an LLM/Agent.

With that said, we don't support browser automation. Skyvern is very neat; it reminds me of VimGPT[1], but with a more robust planning implementation.

[0] https://magicloops.dev

[1] https://github.com/ishan0102/vimGPT

umaar
2 replies
20h34m

Really like the simplicity of your website. I think when you first announced it, you mentioned you might open source Magic Loops. Might you do that?

jumploops
1 replies
19h6m

Yes! We’re in the middle of cleaning things up, just need to make the Loops a bit more portable/easy to run, but finally happy with the state of the tool.

anhner
0 replies
10h55m

This brings me so much joy! Thank you for considering this!

suchintan
0 replies
21h39m

Nice! Thanks for sharing this.

We tried approaches like VimGPT before but found the rate of hallucinations to be a bit too high for production use. The sweet spot definitely seems to be combining the magic of DOM parsing AND vision.

We're definitely going to work on logic generation and execution, but we're taking it a bit more carefully. Many of the workflows we automate have changing workflow steps (i.e., I've never seen the exact same Geico flow twice), but this certainly isn't true for all workflows.

suchintan
1 replies
22h56m

I love all of these ideas!!

1. You can set a "max steps" limit when you run it locally https://github.com/Skyvern-AI/skyvern/blob/d0935755963b017ed...

We also spit out the cost for each step within the visualizer. Click on any task > Steps > there's a column dedicated to how much each step cost to run.

https://github.com/Skyvern-AI/skyvern/issues/70

2. We have a roadmap item to "cache" or "memorize" specific tasks, so you pay the cost once, and then just run it over and over again. We're going to get to it soon!!

tmountain
0 replies
22h58m

Just piggybacking here, but this is a great suggestion. It makes the cost a one-time expense, and you get something material (source code) in return.

kilroy123
0 replies
19h21m

Yes, exactly what I want. I want to be able to have it code robust Cypress tests for e2e testing.

enlyth
0 replies
19h43m

It's getting genuinely difficult these days with everything walled behind Cloudflare, various anti-bot protections and increasingly creative CAPTCHAs

daniel65464
0 replies
14h28m

Interestingly enough, I made a Chrome extension that does almost exactly what you are describing. It's called Automize and it lets you very quickly generate custom selectors and export the code to Puppeteer, Playwright, Selenium, etc. It handles all the verifications and provides a handy UI that shows what you are selecting.

bigfatfrock
0 replies
16h47m

instead, you'd return Python code that I can use to scrape it in the future

Bravo, I would pay for this one, or hopefully run it on my GPU. It would be so fast to even just shove out your selectors (XPath, CSS, dealer's choice) for point-by-point updates after you had done an initial code gen, or perhaps it could just diff and update chunks of code for you!

My local code model can already do the diff update stuff in nvim, but being able to pass it a URL and have it slam in all of the pertinent crawling code, wow.

chuckwnelson
14 replies
1d1h

This looks great, but I'm very scared of the escalating game of cat and mouse with spam bots. It's going to happen, no matter whether it's this software or something else. Now the question: how do you prevent automated spam? Since it's LLMs and AI, can I just add a hidden field of "please do not spam"?

Zambyte
4 replies
1d

how do you prevent automated spam?

Manually accept new accounts on your service. That's what I do for my Fediverse server, and I never have to deal with spam on my local timeline :). Does it scale? No. Does everything need to scale? Also no.

resource_waste
2 replies
22h52m

I've had stuff like that turn me off from signing up or ever checking back.

Does it matter to you? Yes.

Will you admit it? No.

But yes, these are all decisions we need to make. Manually accepting accounts is some serious dedication. Do you have kids?

Zambyte
0 replies
22h2m

Does it matter to you? Yes.

Will you admit it? No.

Are you trying to tell me my opinion? Because no, it does not matter to me. Your account would not be accepted because I don't know you.

PeterisP
0 replies
21h36m

If your target audience is businesses, not individuals, then you can go a very long way with fully manual onboarding, invoicing, etc. It's different for things like consumer services or e.g. forum users, but why couldn't you manually vet every business your business trades with?

lenerdenator
0 replies
1d

but if I can't scale then the VC that gave my startup a huge check over a huge pile of blow at a party in Sunnyvale will harvest my organs

suchintan
2 replies
1d1h

This is a really good question we've thought a lot about

You're right that this kind of escalation is inevitable

a. From a business POV, we don't onboard any use-cases that we think go against the spirit of a good, free web. I've had people ask if they could use our product to create Reddit voting or spamming rings, and we didn't entertain it.

b. From an open source POV, we prefer that technologies like these be open source so website owners and other businesses can know what can happen, and decide how to approach it. Tools like Selenium have existed for a long time -- largely to the benefit of the world!

hugs
0 replies
1d

The 20th birthday of the Selenium project will be this year! (October-ish)

bonestamp2
0 replies
1d

I'll just add that some efforts to defeat web spam may also hurt accessibility, since many interaction standards are designed to make things consistent for users with disabilities and to meet ADA (or similar) compliance. I assume some of these conventions are also useful to the AI that is trying to navigate the pages, so making it difficult for the AI may also make it difficult for other users.

MattGaiser
2 replies
1d1h

I am not aware of anyone really successfully defeating spam at the moment.

I mod a 1 million+ Facebook group and they can’t even prevent someone from making 200 posts in a minute with the word “crypto” in it. The word list will flag it, but the spam filter won’t.

Reddit constantly has people messaging you in chat about “opportunities.”

Email is a disaster.

My personal blog has over 100,000 spam comments sitting in the filter so at least they were caught, but processing them is impossible.

suchintan
0 replies
1d

I've heard of a lot of success sifting through email spam using custom gmail scripts + GPT-4. Kind of interesting that we can use LLMs to both create and detect spam to some degree of effectiveness

RunSet
0 replies
26m

I am not aware of anyone really successfully, defeating spam at the moment.

I mod a 1 million+ Facebook group and they can’t even prevent someone from making 200 posts in a minute with the word “crypto” in it.

Could you possibly charge a nickel's worth of bitcoin to approve a post?

cute_boi
1 replies
1d1h

The only way to prevent spam is to charge an appropriate amount of money; I don't see other solutions. That's why many companies use credit cards to verify users. But with virtual cards, spammers still have some ability to spam, though not as much.

crotchfire
0 replies
23h23m

This.

If you charge enough, the spammers become valuable customers. Of course they tend to leave before that point, but you don't really care if they leave or stay; you make money either way.

Value for value.

schappim
0 replies
14h41m

I'm not good at finding fire hydrants either.

dvngnt_
8 replies
1d

Skyvern understands how to solve CAPTCHAs to complete complicated workflows

This seems like it could be used for abuse. CAPTCHAs are specifically designed to stop botting on third-party websites.

Or this will just be another cat-and-mouse game where the next level of CAPTCHAs gets more annoying and invasive in verifying we are human.

worldsayshi
4 replies
1d

It seems to me that the logical conclusion for CAPTCHAs is to connect them indirectly to electronic ID. This could be done in a privacy-respecting way.

You could get some token from the website. It could include the encrypted service name and policies, like rate limits, that the authority should enforce. The client passes the token to the eID authority. The authority signs it and adds a timestamp, but no user info. The client gives the token to the service. Something like that. This is a bad top-of-mind example.
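
Spelled out in pseudo-Python (every name here is hypothetical):

    def anonymous_captcha(service, authority, client):
        # 1. The service issues a token naming itself and its policies;
        #    no user info is involved yet.
        token = service.issue_token(policies={"rate_limit_per_hour": 30})

        # 2. The client hands the token to the eID authority, which verifies
        #    there is a real person behind the session and signs the token
        #    with a timestamp -- attaching no identity, so the service never
        #    learns who the user is.
        signed = authority.sign(token, proof_of_person=client.eid_session)

        # 3. The client returns the signed token; the service only checks the
        #    authority's signature and enforces its own policies.
        assert service.verify(signed)
        return signed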

I think we'll need to rely a lot more on eID in the future. I think it can be done in a good way, but then it needs to be thought through before it gets adopted. And we have to be able to trust the eID institutions.

hirako2000
2 replies
23h44m

But it's the same problem all over again: spammers would get an ID, authenticate, then spam.

Anti-spam is about detecting whether activities are spam.

Binding an identity is the naive mechanism that makes us think spam wouldn't happen. All it does is let us say: OK, we know it's pug35372 that tore the linens apart.

We can put in place all the measures we like to authenticate users; that won't make them not be bots wreaking havoc right after a manual authentication.

There are even farms of manually created accounts, run by gig workers who will fill out forms and pass email and phone number verification for less than a dollar.

worldsayshi
1 replies
6h50m

As I mentioned, the exchanged token could include a number of policies, like rate limits. But I expect there could be more sophisticated policies as well.

The service could send ban or lockout requests to the eId authority so that a misbehaving real life user could be locked out from the service even though the service doesn't know who they are (irl).

I would guess it could even be designed so that the authority doesn't know which services a given user has been banned from either. And all the service would need to know is "This user has violated policy X at <timestamp>".

hirako2000
0 replies
4h25m

I see how issuing a token has its advantages, thanks.

suchintan
0 replies
1d

2FA and the logged-in experience are sort of a proxy for eID. I suspect that's why so many companies require that you log in with something that knows your identity (log in with Google), or ask you for your phone number to confirm your account.

wutwutwat
0 replies
12h36m

Everyone seems to forget that stopping bots with Google's CAPTCHA was never the main goal...

Humans have been training Google's AI models for a decade or more, each and every time they answered a CAPTCHA.

At any rate, if someone wants to abuse your site, CAPTCHAs and even Cloudflare won't help you.

the next level of CAPTCHAs get more annoying and invasive to verify we are human

Like the puzzle-solving ones? Or more advanced object identification, like selecting the correct orientation? That's just training more advanced AI now.

suchintan
0 replies
1d

Agreed.

We didn't open source this functionality on purpose, and are very very specific about what use-cases we onboard that require it.

That being said, we've gotten to learn a lot more about browser fingerprinting and captcha solving and it's a really interesting space.

If you're curious about it, check out this blog post: https://antoinevastel.com/bot%20detection/2018/01/17/detect-chrome-headless-v2.html

lm411
0 replies
19h0m

Unfortunately, CAPTCHAs are already easy for bots to bypass or solve.

There are quite a few services that will solve them in a few seconds, costing less than a dollar per 1000 solved tokens for most common CAPTCHAs (e.g., reCAPTCHA v2 and v3).

I recently had to deal with an attacker doing credit card testing that was using one of these services.

Related, I came across this last week, bypassing ReCAPTCHA with Selenium/Python/OpenAI Whisper API:

https://www.youtube.com/watch?v=-TMNh64ubyM

agreeahmed
8 replies
1d1h

Exciting to see this on HN. I think very soon agents like Skyvern will account for the vast, vast majority of web traffic.

MattDaEskimo
4 replies
1d1h

Maybe for a transition period.

There's no reason for somebody to create a website, pay for resources, and hope for some sort of revenue if their visitors are mostly AI.

So why bother creating a UI? Instead it would make more sense to close the website and offer the same information as a paid API service.

Any sort of website that needs to validate human visitors will be plastered with DRM, rendering these web-browsing LLMs useless. And good riddance as well.

Using an LLM to browse the internet feels like a huge waste of resources.

Instead it would make more sense to have a Wikipedia-like resource for AIs to crawl via embeddings.

suchintan
2 replies
1d1h

I suspect that web traffic will end up including both. Many websites (government ones in particular) aren't interested in API-based access patterns.

This kind of pattern makes it so you can serve both users and agents with a single interface

MattDaEskimo
1 replies
1d

This would be ideal. The only issue here is trust. If my website relies on advertising then of course I would prefer to serve more content to a human visitor.

So what do I do? Bot-protect my site, redirecting the AI to a minimalistic version that most likely expects some sort of value in return?

People will just breach this trust, like OP, and abuse tools like Selenium (as they always have) to imitate being a human.

suchintan
0 replies
1d

I think this is pretty interesting -- I wonder if websites could allow agents to self-identify and not count them towards advertising CPM, to prevent dilution of the advertising metrics.

Perhaps something similar to robots.txt is in order (agents.txt?)
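
Something like this, maybe (a completely made-up format, by analogy with robots.txt):

    # agents.txt -- hypothetical, by analogy with robots.txt
    User-agent: *
    Agent-allowed: /forms/
    Agent-disallowed: /ads/
    Count-toward-ad-metrics: false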

Spivak
0 replies
1d

I mean what kind of websites are we talking about here? The kinds of websites where all the value can be extracted via an LLM are just content farms.

And yeah, that sucks for content farms, but putting up content and getting nothing in return is already how ad blockers work and it hasn't destroyed them. I seriously doubt that AI traffic will even make a dent 1/1000th the size of the traffic loss from Google snippets.

hipadev23
1 replies
1d

Why would the majority of web traffic turn into extremely expensive to operate agents?

failuser
0 replies
1d

The expectation is that the price of AI bots will go down, getting below that of the human-driven click farms we have now, and thus make fighting bots too expensive, because identifying humans gets harder every day.

failuser
0 replies
1d

That’s why we can’t have nice things. Are we at the end of Eternal September? Will all the signs of human life be restricted to paid or otherwise closed groups? If all free users are bots, who will even run ads that feed the Web 2.0 internet?

I still fear that the real internet has already split off from what I see and that I was left behind.

James_K
8 replies
20h1m

God this is depressing. Not the product itself, but the need for it. That software has failed to be programmable to such a degree that a promising approach is rendering the GUI and analysing the resultant image with an AI model. It's insane that we have to treat computers as fax machines, capable only of sending hand-written forms over a network. The gap between how people use computers and the utility they could provide is massive.

suchintan
3 replies
19h31m

Actually this kind of stuff is super exciting -- we don't need to depend on companies exposing APIs for their websites -- we can just use something like Skyvern instead!

mderazon
0 replies
5h22m

Two ways of looking at it. I guess what the OP is saying is that if there were an agreed-upon standard for semantically understanding these pages, without having to use these sophisticated methods, it would be much easier.

darepublic
0 replies
2h2m

I have been interested in doing something similar for a while. I also think this has a lot of potential as the core of a virtual assistant.

James_K
0 replies
17h31m

You could still use Skyvern if they exposed an API.

kevmo314
3 replies
19h33m

On the contrary! Isn't it neat that we now have a unified API that both humans and computers can consume?

croes
1 replies
17h5m

Good luck debugging any errors

ignoramous
0 replies
6h2m

The world is governed by probabilities. What more could go wrong if algorithms did too? /s

James_K
0 replies
17h12m

No, because we already have a machine API. If you want to write an application, you need to write something a computer can understand. So a computer-usable API is always created. It takes additional effort to hide that functionality behind an interface. The process we have now is: machine → GUI → image processing → generative AI. The interface we could have is: machine → machine. It would take no extra effort to do this. It would just need some slight changes in organisation.

In fact it is easier at every level. If you separate logic from interface, you end up with an architecture that is a set of functions (a library) into which you can interface programmatically, or with a GUI, or by any other means. Separating code like this (MVC) is good practice and allows for a range of different interfaces to be created to the same functionality. It is also easier from an engineering perspective and produces a better product. Think of git. There are hundreds of different interfaces created for the functionality git provides. All software should be structured like this (though perhaps by means of a library rather than a shell interface).
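
A toy version of that structure (hypothetical app, sketch only):

    # core.py -- the actual functionality: plain functions, no interface attached.
    def quote_premium(license_country: str, age: int) -> float:
        """Business logic callable by a GUI, a CLI, an agent, or another program."""
        base = 100.0
        foreign = 1.5 if license_country != "US" else 1.0
        young = 1.2 if age < 25 else 1.0
        return base * foreign * young

    # cli.py -- one of many thin interfaces over the same library.
    if __name__ == "__main__":
        import sys
        print(quote_premium(sys.argv[1], int(sys.argv[2])))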

I should add that this is a particularly grim prospect from a software engineering perspective. It makes me imagine a future where no one bothers exposing a stable API to anything, so the only way to interact with other people's code is using an AI middle-man.

is_true
6 replies
1d

Weeks to automate something? Anyone experienced would be able to automate most workflows in a couple of days, tops.

suchintan
3 replies
1d

You're right -- we should have written days to weeks.

What's interesting here is that large companies like UiPath charge thousands of dollars to build a single robot for companies... I wonder if that large up-front expense will still be necessary in this new world.

is_true
2 replies
21h37m

That's crazy. We usually create robots and most of the time we charge less than a thousand USD.

We have a lot of tooling in place now so most things take minutes. The harder step is getting the data into the client's infrastructure.

suchintan
1 replies
21h20m

When you say "getting the data into the client's infrastructure", do you mean self-hosting the robots? Or something else?

is_true
0 replies
18h26m

No. Getting the data into the client's DB, filestore, or similar. For some ERPs we create insert queries; others have import functions.

dang
1 replies
23h36m

I've edited the text above to say "days or even weeks".

suchintan
0 replies
23h32m

thank you!!

mosselman
5 replies
1d1h

If I were to build some custom GPT-powered thing for this, is there a similar project I can use with a command-line interface or some programmatic interface?

keremyilmaz
4 replies
1d1h

Skyvern is actually an API-first product! The UI we built is mainly for simplicity and being able to debug the steps our agent takes.

You can easily copy sample curl requests through our UI. Feel free to check out the quickstart on our GitHub and let us know if you have any questions.

mosselman
3 replies
1d

Thanks I will check it out.

Any idea on pricing/business model?

msikora
1 replies
18h22m

Wait, this is not Open Source??

suchintan
0 replies
23h33m

We tend to charge per request our users send us, although the exact amount depends a lot on the exact task you want to run. Want to send Skyvern on a 40+ page journey to answer a question? It's a bit more expensive than just navigating to a page and extracting information.

I'd love to chat about your use-case. Happy to follow-up over email (suchintan@skyvern.com) or over a quick call (https://meetings.hubspot.com/suchintan)

suchintan
3 replies
22h52m

Saw the launch yesterday. Love all of the excitement in the space!

LaVague is all about generating Selenium code to interact with a specific page, and doing it step-by-step.

Skyvern is all about taking a simple instruction and converting it to a series of LLM-driven actions. It's meant to be more autonomous ("tell Skyvern what to do")

spxneo
2 replies
22h47m

Isn't that the same thing when you interact with the underlying webpage?

suchintan
0 replies
22h23m

We're quite different from LaVague. LaVague passes in the entire HTML DOM to the LLM to help it generate XPaths and valid Selenium code. (https://github.com/lavague-ai/LaVague/blob/main/src/lavague/...)

Try this at your own risk... any reasonable website would result in extraordinarily high input token costs.

We spend quite a bit of our time building a layer between the HTML and the LLM call to distill important pieces of information down to actions the LLM can take, better weighing cost vs. output. We're still not at 100% coverage.
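
A crude sketch of what such a distillation layer does (illustrative only, not our actual implementation):

    from bs4 import BeautifulSoup

    def distill(html: str) -> list:
        """Reduce a full DOM to just the interactable elements the LLM needs to see."""
        soup = BeautifulSoup(html, "html.parser")
        distilled = []
        for i, el in enumerate(soup.select("a, button, input, select, textarea")):
            distilled.append({
                "id": i,  # stable handle the LLM can reference in its chosen action
                "tag": el.name,
                "text": el.get_text(strip=True)[:80],  # truncated to save tokens
                "attrs": {k: el.get(k) for k in ("name", "type", "placeholder") if el.get(k)},
            })
        return distilled  # orders of magnitude fewer tokens than the raw DOM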

LZ_Khan
0 replies
13h19m

It is similar. Hence the timing of the plug, probably :)

giamma
4 replies
1d

At first I thought this was a test tool for Web applications, but now I understand it's meant to be a better RPA.

Would it be usable for test automation? Would the API allow creating asserts?

dvngnt_
1 replies
1d

There are already some existing solutions for e2e testing. I would say Playwright with codegen works well enough, but there are ones that make it even easier by wrapping around openapi, though that seems overkill.

denidoman
0 replies
19h24m

It sounds interesting. Could you please share the links if it is open-sourced?

suchintan
0 replies
1d

Yes, absolutely. You can prompt it to "terminate" if some state isn't met (i.e., XYZ text isn't displayed on the screen), and treat terminated results as failures.

For example, you could instruct it to go to hackernews and terminate if you don't see a comment from giamma by passing in this payload:

{ "url": "https://news.ycombinator.com", "navigation_goal": "goal is met if you see a post from giamma. Terminate if you don't" }

shnkr
3 replies
1d

The moment I saw vision in the title, I knew what was going on. It was first demoed[0] by AI Jason around 4 months back. Is it any different?

[0] https://m.youtube.com/watch?v=IXRkmqEYGZA

suchintan
2 replies
23h45m

Love this video

self-operating-computer

This is quite different from https://github.com/OthersideAI/self-operating-computer

Self-operating-computer uses pixel mapping to control your computer. This is a very good approach, but it's extremely unreliable. GPT-4V frequently hallucinates pixel outputs, causing it to miss interactions or enter fail-loops.

The approach by AI Jason

AI Jason is using image-only methods to interact with the browser. This is a great first step, but this approach tends to be rife with hallucinations or errors. We do DOM parsing in addition to image analysis to help GPT-4V correlate information in the image to the interactable elements within the DOM. This dramatically boosts its ability to perform the same task over and over again reliably (which proved impossible with the image-only approach).

shnkr
1 replies
23h22m

Nice. I was looking for simpler hacks, as GPT-4V didn't scale for me. Later I couldn't find time and this got back-burnered.

Interesting concept for problem solving though. Congrats!

suchintan
0 replies
22h39m

Thanks! We definitely experimented with V only (that's the dream), but there's too much context missing:

1. What's behind a select option? You don't know until you click it, which means you need another iteration. This sucks.

2. How do you consistently correlate things in the images to actual actions (i.e., upload a file to a file input, click on a button, insert a date into a date field)? Having the additional HTML tag information dramatically improves the action selection process (click vs upload vs type).

dinobones
3 replies
1d

Roughly how much does it cost to run to scrape a page? I see from the code this is basically an OpenAI API wrapper but you make no mention of that anywhere on your landing page/documentation, nor any mention of which LLMs this is capable of working with.

Also, an idea is to offer a "record" and "replay" mode. Let the LLM run through the instructions, find the selectors, record and save them. Then you can run through again without using the LLM, replaying the interaction log, until the workflow breaks, then re-generate the "interaction log" or whatever.

suchintan
0 replies
1d

This is a great call-out. It's something currently on our roadmap.

Re: cost of execution. This really depends on the page, but today it costs between 5 and 20 cents per page to execute.

We have an improvement planned to help it "remember" or "cache" actions it's done in the past so it can just replay them and bring the cost down to near zero.

Re: LLMs it's capable of working with, currently it's only GPT-4V. I'll get this updated soon!

pstorm
0 replies
1d

Based on #2, it seems like they only use the LLM when the page changes. I had a prototype of this sort of system working and it was surprisingly fault tolerant.

pkiv
0 replies
1d

If you want to build it yourself, you could try using https://browserbase.com/. We offer managed headless browsers that work everywhere, every time. It costs $0.10 per browser session/hour (billed minutely). Feel free to shoot me an email if you want access! paul@browserbase.com

ushakov
2 replies
1d

How does this compare to OpenAdapt?

I have a feeling that this tech will become a commodity and will probably be built into the OS or browser.

Props for open-sourcing though!

suchintan
0 replies
1d

I agree -- this will likely get commoditized, which is why we didn't focus on making this a Chrome extension. The API access pattern makes this particularly appealing, as you can run multiple instances in the cloud.

suchintan
0 replies
23h29m

Ah cool -- we weren't familiar with OpenAdapt. Will check it out.

One big decision we made was to focus on browser automations (instead of computer automation like Adept or OpenAdapt). The reason for this was that we wanted to leverage the information available inside of a DOM to improve the quality of our agent's actions. We found that relying on image-only analysis with X,Y coordinate interactions wasn't able to offer high enough reliability for production workflows

somethingAlex
1 replies
17h34m

I don’t know what my use case for this would be. I don’t tend to do anything regularly through a browser that I’d want to automate.

Would be kind of handy to have a “pull all my relevant tax info documents from these sites and zip them up” automation but I only do that once a year.

I’m probably being unimaginative. Anybody have any interesting use cases?

Anyone have

suchintan
0 replies
17h1m

Imagine that exact use-case -- pulling up relevant tax information and filling it in.

Now imagine it from the accountant POV, where they have the same use-case for hundreds of clients

This is where we've seen something like Skyvern really shine. It's targeting industries and companies that are doing rote work at a significant scale

smusamashah
1 replies
13h52m

Coming up next in Windows and Chrome: unrecordable, unscreenshotable pages, to defeat all AI tools. Banking apps on Android are already unscreenshotable. Given how LLMs just bypass all HTML obfuscation, that's going to be the next step to protect these (ad) businesses.

Zuiii
0 replies
11h44m

The analog hole.

Until all recording and general computing devices become tamper-proof and locked down, people will always be able to take perfect (yes, perfect) recordings with some work or good-enough recordings trivially.

For this, I'd use a USB camera. For those apps that disable screenshots, I'd just take a picture using the phone of the first person next to me.

In my experience, only the ignorant, fools, and lawyers/lawmakers willingly waste resources on this security theater, with the latter group using it to trick other people or prevent them from exercising their rights (recording media).

Google should remove this misfeature. This feature is only enabling abuse at this point.

samsullivan
1 replies
6h37m

You should consider focusing on intercepting network requests. Most if not all sites I scrape end up fetching data from some API. Like others have said, if you instead had the LLM create an ad hoc script for the scraping task and then used the feedback loop to continuously improve the outputs, it would be really cool. I'd pay between $5 and $50 for each working output script.
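
For example, with Playwright you can already log the API endpoints a page hits while you drive it (sketch; the URL is a placeholder):

    from playwright.sync_api import sync_playwright

    api_hits = []

    def on_response(response):
        # Collect the JSON endpoints the page calls; they can be replayed later
        # with plain requests instead of a full browser.
        if "application/json" in response.headers.get("content-type", ""):
            api_hits.append(response.url)

    with sync_playwright() as p:
        page = p.chromium.launch().new_page()
        page.on("response", on_response)
        page.goto("https://shop.example/search?q=widgets")  # placeholder URL
        page.wait_for_load_state("networkidle")
        print(api_hits)  # candidate search/details APIs for a generated scraper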

suchintan
0 replies
4h15m

We're definitely planning this.

Skyvern currently intercepts all network calls that get made -- we save them all as a HAR file for debugging purposes... but we never look at them.

A lot, lot, lot of scraping use-cases become simpler if you can just inspect a search API or a details API and get the information you're looking for. I'll add this to our roadmap!

samstave
1 replies
1d

>(1) Automating post-checkup data entry with patient data inside medical EHR systems (ie submitting billing codes, adding notes, etc),

FULL FUCKING STOP.

[We talk about AI alignment. THIS is an alignment issue]

Do you understand billing code fraud?

If you supply this function - you will *eliminate ANY AND ALL human accountability* unless you have ALSO built a fully auditable provenance from DR <-ehr-whatever-> codes.

Codes ARE why the US health system is BS.

Here - if you want to be altruistic - then you will take on the fact that CODES are one of the most F'd up aspects of costing.

Codes = [medical service provided]

so code = 50 = checkup = [$50 <--- WHO THE HECK KNOWS]

So let's say I am Big Hospital. "No, we will only allow $25 for code 50" - and so they get that deal.

If I am a single clinic, I have to charge $50.

Build a dashboard for what the large medical groups can negotiate per code, vs. what a small hospital or clinic group gets per code.

Only automate it if you can literally show a dash of all providers and groups and what they can charge per code.

In fact - code pricing is a medical stock market.

(each hospital group negotiates between the price they will pay per code, how much lobbying is a factor and all these other factors...

what we really need an LLM for is to literally map out all the BS in the Code negotiations btwn groups, pharma, insurance, lobbying, kickbacks, political)

That's the medical holy grail.

[EDIT: Just to show how passionate I am on this issue - here are some SOURCES:

I have designed and built & commissioned out 11+ hospitals.

Built the first iPhone app for medical... it was rejected by YC (HL-7 nurse comm system on iTouch devices) (2006?)

Open-sourced that app to OpenVista.

Brother was Joint Chiefs dr / head of the VA

Worked on building medical apps and was blocked by every EHR...

Zuckerberg's name is on top of some of the things I built at SFGH before he got there... (and ECH Mtn Vw)

I've seen way beyond the kimono]

suchintan
0 replies
23h43m

We know very little about this space, except that the entire process is a little bit crazy.

We've talked to a few companies now that would use a product like Skyvern to just automate billing information gathering to make sure patients don't get screwed in the billing process

Are you open to chatting? I'd love to pick your brain about what's behind the kimono

suchintan@skyvern.com or https://meetings.hubspot.com/suchintan

razfar
1 replies
1d

I'm curious about the computer vision aspect of this tool. Specifically, how was the model which draws bounding boxes around interactable elements trained? Definitely a step beyond existing browser automation software!

ilaksh
1 replies
21h29m

Looks terrific. I hope you will consider adding support for Claude 3.

hubraumhugo
1 replies
1d

AI should automate tedious and uncreative work, and data entry tasks definitely fit this description. Rule-based RPA will likely be replaced by fine-tuned AI agents for things like form filling and similar tasks.

Can you share some data on costs and scalability?

At Kadoa, we're working on fully automating unstructured data ETL from websites, PDFs, etc. We quickly realized that doing this for a few data sources with low complexity is one thing, doing it for thousands of sources daily in a reliable, scalable, and cost-efficient way is a whole different beast.

Using LLMs for every data extraction would be way too expensive and very slow. Instead, we use LLMs to generate the scraper and data transformation code and subsequently adapt it to website changes, which is highly efficient.

suchintan
0 replies
1d

Nice! We love what you're doing at Kadoa.

We're trying our best not to move into the web scraping space -- we're focusing on automating uncreative, boring, tedious tasks.

We've seen a lot of success going after form-filling on government websites, which would usually be very boring, but happens to work pretty well for us

hirako2000
1 replies
1d

It reminds me of that bug a kid found to bypass the password-locked screen of a very popular Linux distro.

Might be great for pen testing.

suchintan
0 replies
23h57m

That's a great idea! I hadn't thought of pen-testing as a possible value prop for this product

cryptoBros2023
1 replies
15h3m

I wonder if we could reduce the cost by switching to a local Llama?

wintonzheng
0 replies
10h14m

We're planning to introduce an LLM router in a week (https://github.com/Skyvern-AI/skyvern/issues/76), and you should be able to call your local Llama after that.

We're prioritizing Claude 3, as its performance seems to be good. That said, please join our Discord and bring more thoughts/requests to us. Code contributions are also more than welcome.

chadash
1 replies
23h31m

First of all, wonderful work. I'm gonna be using this for sure. I can think of many use cases. What would be nice though is a simple API: I send you what I need, you send me a jobId that I can use to check the status of my job, and then let me download the results when it's done.

I played with the Geico example, and it seems to do a good job on the happy path there. But I tried another one where it struggled... I want to get car rental prices from https://www.costcotravel.com/. I gave it the airport + time of pickup and dropoff, but it struggled to hit the "rental car" tab. It got caught up on hitting the Rental Car button at the top, which brings up a popup that it doesn't seem to read.

When I put in https://www.costcotravel.com/Rental-Cars, it entered JFK into the pickup location, but then failed to click the popup.

suchintan
0 replies
23h25m

We have a simple API we're building as a part of our cloud offering. It's in private beta today -- if you'd like to check it out please email me at suchintan@skyvern.com and I'd be happy to chat

Thanks for the feedback re: costcotravel.com. Skyvern definitely does NOT have 100% coverage of the web. This is one of the reasons we were excited to open source it -- so we could learn about more websites where it doesn't work as expected.

I've filed an issue for this case here: https://github.com/Skyvern-AI/skyvern/issues/69

barfbagginus
1 replies
12h33m

I wonder if the focus of this system can be shifted from corporate needs and applied to the needs of individuals who wish to organize and build tools seeking to de-enshittify platforms.

There are a great many platform features designed to atomize, isolate, and exploit individuals. Finding meaningful connection on platforms increasingly means navigating past the noise of antagonistic individuals, overcoming profit-extracting attacks on our attention, and endlessly doomscrolling until we find those ephemeral opportunities to genuinely connect.

I wonder if llms and browser automation tooling could help us build overlays that dynamically peel back the layers of enshitware that have been bolted on to our cybernetic perceptions of the world.

If you feel they can, and if you feel people with those aims are welcome in your community, and can find each other to collaborate, then I would be very interested in sending in PRs and helping you burn down backlogged items that benefit non-commercial de-enshittification use cases.

wintonzheng
0 replies
10h33m

I'm Shu, also a cofounder of Skyvern. First of all, you are more than welcome to join our community. One big reason for open-sourcing Skyvern is to serve individuals. This project was inspired by problems we learned from talking to corporates, but it doesn't always have to serve those use cases. Problems like boring form filling are pretty common in real life.

Second, LLMs can definitely help bridge the gap. My 58-year-old mom, who grew up in a rural area of China, doesn't know much about the internet and doesn't know how to order takeout on her phone. She only knows the basic usage of WeChat (the WhatsApp of China) and text messages. I've been a coder for 10+ years and I still find it so darn hard to keep up with the tools and information out there. I do hope Skyvern becomes what you're describing and helps people get access to more of the world.

andy_ppp
1 replies
12h37m

I think I'd really like a React Native version of this! Any plans?

wintonzheng
0 replies
10h11m

We're a pretty small team and don't have a plan for it in the near future :(

I would love to know why you're interested in React Native though, if you don't mind sharing! Please email me or Suchintan at shu@skyvern.com / suchintan@skyvern.com, or join our Discord.

BasieP2
1 replies
20h20m

Is this (finally) a step towards a better way of automated frontend testing?

We're currently testing the DOM instead of vision.

suchintan
0 replies
19h29m

This can definitely be used for front-end testing. Just tell it to do something like a user would, and monitor whether it's successful or not.

Here's a prompt example to try out

{ "url": "https://news.ycombinator.com", "navigation_goal": "goal is met if you see a post from basiep2. Terminate if you don't" }

999900000999
1 replies
20h48m

Don't make me sign up for a demo; I'd rather just give you my credit card number and try it myself.

Aside from that, cool project!

suchintan
0 replies
19h28m

We're gonna build a self-serve UI soon! We just wanted to get it into people's hands ASAP :)

Feel free to email me at suchintan@skyvern.com -- I can let you know when the self-serve UI is live

xeonmc
0 replies
13h58m

What do you call an LLM with vision? LLVM

...oh, that's why it's called Skyvern

tonyoconnell
0 replies
17h58m

To keep costs down, you could start at the sitemap, use an open-source model via OpenRouter to guess the page to navigate to, scrape the text, links, and forms from the page using regex, and fall back to GPT-4 and vision.
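
Roughly like this (a sketch; cheap_model_pick and gpt4v_fallback stand in for the two model tiers):

    import re
    import requests

    def pick_page(sitemap_url: str, goal: str) -> str:
        # Tier 1: a cheap open-source model guesses the right page from the sitemap.
        xml = requests.get(sitemap_url, timeout=30).text
        urls = re.findall(r"<loc>(.*?)</loc>", xml)
        return cheap_model_pick(goal, urls)  # hypothetical OpenRouter call

    def scrape(url: str) -> dict:
        html = requests.get(url, timeout=30).text
        # Tier 2: regex out the cheap, obvious structure first.
        result = {
            "links": re.findall(r'href="([^"]+)"', html),
            "forms": re.findall(r"<form[\s\S]*?</form>", html),
        }
        if result["links"] or result["forms"]:
            return result
        # Tier 3: only now pay for GPT-4 + vision.
        return gpt4v_fallback(url)  # hypothetical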

t14000
0 replies
10h31m

Exciting stuff. My employer would be interested, but it's AGPL3-licensed, so that's a non-starter for them.

aussieguy1234
0 replies
14h41m

There was another AI/browser automation project posted yesterday that got to the front page: https://github.com/lavague-ai/LaVague

I guess the main advantage of this new project is that it's probably more accurate by using computer vision, but as others have said, it uses far more resources.

Costs will come down over time though.

Get ready for a lot of "back office" jobs to be automated away.

abrichr
0 replies
1h39m

Congratulations on shipping!

Check out https://github.com/OpenAdaptAI/OpenAdapt for an open source (MIT license) alternative that also works on desktop (including Citrix!)