I tried it out and it's pretty pricey. My OpenAI API bill is $3.20 after using this on a few different pages to test it out.
Not saying I wouldn't pay that for some use cases, but it would limit me.
One idea: making scrapers is a big pain. But once they are set up, they are cheap and fast to run... this is always going to be slower. What I'd love to see is a way to generate scrapers quickly. So you wouldn't be returning information from the New York City property registry... instead, you'd return Python code that I can use to scrape it in the future.
edit: This is likely because it was struggling, so it had to make extra calls. What would be nice is a simple feature where you can input the maximum number of calls / tokens to use on the entire task. Or even better, do some math and put in a dollar cap, e.g., go fill out the Geico forms for me and don't spend more than $1.00 doing it.
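Something like this rough sketch is what I have in mind (a hypothetical helper, not an existing feature; the prices are placeholders, check your provider's actual rates):

    # Hypothetical sketch of a per-task dollar cap; prices below are placeholders.
    from openai import OpenAI

    PROMPT_PRICE_PER_1K = 0.01        # assumed $ per 1k prompt tokens
    COMPLETION_PRICE_PER_1K = 0.03    # assumed $ per 1k completion tokens

    class BudgetExceeded(Exception):
        pass

    class BudgetedLLM:
        def __init__(self, max_dollars: float):
            self.client = OpenAI()
            self.max_dollars = max_dollars
            self.spent = 0.0

        def chat(self, messages, model="gpt-4-turbo"):
            if self.spent >= self.max_dollars:
                raise BudgetExceeded(f"spent ${self.spent:.2f} of ${self.max_dollars:.2f}")
            resp = self.client.chat.completions.create(model=model, messages=messages)
            self.spent += resp.usage.prompt_tokens / 1000 * PROMPT_PRICE_PER_1K
            self.spent += resp.usage.completion_tokens / 1000 * COMPLETION_PRICE_PER_1K
            return resp.choices[0].message.content

    # llm = BudgetedLLM(max_dollars=1.00)
    # llm.chat([{"role": "user", "content": "Fill out the Geico quote form for me"}])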
You've raised valid points about the cost and efficiency of our approach, which aims to make the LLM function as closely as possible to a human user. We chose this approach primarily for its compatibility with various websites, as it aligns closely with a website's intended audience, which is typically human.
Addressing complex website interactions is a key advantage of this approach. For instance, in the process of generating an auto insurance quote, the sequence of questions and their specifics can vary greatly depending on prior responses. A simple example is the choice of a foreign versus a California driver's license. Selecting a foreign license triggers additional queries about the country of issuance and expiry date, illustrating the complexity and branching nature of such web interactions.
However, we recognize the concerns about cost and are actively working on strategies to reduce it:
- Optimizing the context provided to the LLM
- Implementing caching mechanisms for certain repeated actions, and only using LLMs when there's a problem
- Anticipating advancements in LLM efficiency and cost-effectiveness, with the hope of eventually fine-tuning our own models for greater efficiency
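As a rough illustration of the caching idea (not our actual implementation), the fallback could look something like this:

    # Illustrative sketch: reuse a cached selector, and only call the LLM again
    # when the cached selector no longer matches anything on the page.
    import json
    import pathlib

    CACHE_FILE = pathlib.Path("action_cache.json")

    def resolve_selector(page, action_key: str, ask_llm) -> str:
        """page: a Playwright Page; ask_llm: callable(action_key, html) -> selector."""
        cache = json.loads(CACHE_FILE.read_text()) if CACHE_FILE.exists() else {}
        selector = cache.get(action_key)
        if selector and page.locator(selector).count() > 0:
            return selector                                  # cache hit: no LLM call
        selector = ask_llm(action_key, page.content())       # cache miss or stale: one LLM call
        cache[action_key] = selector
        CACHE_FILE.write_text(json.dumps(cache, indent=2))
        return selector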
There are two things here:
1) Using the LLM to find elements/selectors in HTML
2) Using LLMs to fill out logical/likely/meaningful answers to things
I highly recommend you decouple these 2 efforts. While you gave a good example of "insurance quote step by step webapp", the vast majority of web scraping efforts are much more mundane.
Additionally, even in this instance, the selector brain/intelligence brain don't need to be coupled.
For example:
Selector brain: "Find/click the button for foreign drivers license."
Selector brain: "Find the country of origin field."
Selector brain: "Find the expiry date field."
LLM-intelligence brain: "Use values from prompt to fill out the country of origin and expiry date fields."
Not-LLM intelligence brain: Inputs values from a JSON object of documentSelector=>value.
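To make that concrete, here's a minimal sketch (assuming Playwright; selector_brain is a hypothetical LLM wrapper, and the values dict plays the part of the not-LLM intelligence brain):

    # Sketch: the LLM only finds selectors; plain code fills in the values.
    from playwright.sync_api import sync_playwright

    def selector_brain(instruction: str, html: str) -> str:
        """Hypothetical LLM call: 'Find the country of origin field' -> a CSS selector."""
        raise NotImplementedError

    values = {                        # not-LLM intelligence brain: documentSelector => value
        "#license-country": "Germany",
        "#license-expiry": "2027-06-30",
    }

    with sync_playwright() as p:
        page = p.chromium.launch().new_page()
        page.goto("https://example.com/quote")   # placeholder URL
        # Selector brain, called only when we don't already know the element:
        page.click(selector_brain("Find/click the button for foreign drivers license", page.content()))
        # Deterministic fill from the mapping, no LLM involved:
        for selector, value in values.items():
            page.fill(selector, value)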
Interesting. We've decoupled navigation and extraction for specifically this reason, but I suppose decoupling selection from input could let us use cheaper, smaller LLMs to "select" and answer.
We've been approaching it a little bit differently. We think larger, more capable models would actually immediately improve the performance of Skyvern. For example, if you run it with LLaVA, the performance degrades significantly, likely because of the coupling.
But since we use GPT-4V, and it's rumoured to be a MoE model, I wonder if there's implicit decoupling going on.
I'm gonna spend some more time thinking about this
I still think you're missing the point. The idea is that you should use vision APIs and LLMs to build traditional browser automation using a DSL or Python.
I don't want to use vision and LLMs for every page. I just want to use vision and LLMs to figure out, once, which elements need to be clicked. Or maybe again every time the site changes the frontend.
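In other words, something along these lines (purely illustrative, not how Skyvern works today):

    # Sketch of the "compile once" idea: pay for the LLM one time to emit a plain
    # Playwright script, save it, and run the saved script from then on with no LLM.
    from openai import OpenAI

    client = OpenAI()

    def generate_scraper(url: str, goal: str) -> str:
        resp = client.chat.completions.create(
            model="gpt-4-turbo",
            messages=[{
                "role": "user",
                "content": f"Write a standalone Playwright (Python) script that visits {url} "
                           f"and {goal}. Return only code.",
            }],
        )
        return resp.choices[0].message.content

    # One-time LLM cost:
    # code = generate_scraper("https://example.com/registry", "extracts property records as JSON")
    # open("scrape_registry.py", "w").write(code)
    # Afterwards: `python scrape_registry.py` runs with zero LLM calls until the site changes.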
This is a great point. This is something already on our roadmap. We call it "prompt caching", but I realize writing this that it's a terrible name. Will update! (https://github.com/Skyvern-AI/Skyvern?tab=readme-ov-file#fea...)
Thank you for this feedback
The AI would be a compiler that generates the traditional scraper / integration test.
It would save all that time spent manually going through every page and figuring out what mistake we made when that input string doesn't go into that input field or the button on the modal window is not clicked.
Change the UI? Recompile with the AI.
I didn’t check the code but there would be a few good ways to specify what you want:
* browser extension that lets you record a few actions
* describing what you want to do with text
* a url with one or two lines of desired JSON to extract
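For the last option, the entire input could be as small as this (purely illustrative):

    # Illustrative scrape spec: a URL plus the shape of the JSON you want back.
    spec = {
        "url": "https://example.com/listings",    # placeholder URL
        "example_output": [
            {"address": "123 Main St", "price": 450000, "beds": 2},
        ],
    }
    # The "AI compiler" takes this spec and emits a traditional scraper
    # that returns data in exactly that shape.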
No, that's something completely different than what bravura is talking about, which is why he made a comment to say explicitly that he still thinks you're missing the point.
From your roadmap:
Adding a caching layer is not what they're asking for. They want to periodically use Skyvern to generate automation code, which they could then deploy themselves in their testing/CI setup. Eventually their target website may make breaking UI changes, then you use Skyvern to generate new automation code. Rinse and repeat. This has nothing to do with an internal caching layer within your service.
We've discussed generating automation code internally a bunch, and what we decided on is to do action generation and memorization instead of code generation and memorization. They're not that far apart conceptually, but there is one important distinction: the generated output would just be a list of actions and their associated data sources.
For example, if Skyvern was asked to log in to a website and do a search for product X, the generated action plan would include:
1. Click the log in button
2. Click "sign in with email"
3. Input the email address retrieved from source X
4. Input the password retrieved from source Y
5. Click log in
6. Click on the search bar
7. Input the search term from source Z
8. Click Search
Now, if the layout changed and suddenly the log-in button had a different XPath, you have two options:
1. Re-generate the entire action plan (or sub-action plan)
2. Re-generate the specific component that broke and assume everything else in the action plan still works
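As a sketch (hypothetical field names, assuming a Playwright page for the staleness check), the memorized plan is just data, and option 2 only touches the broken step:

    # Sketch of an action plan memorized as data rather than generated code.
    action_plan = [
        {"action": "click", "target": "button#login"},
        {"action": "click", "target": "a#sign-in-with-email"},
        {"action": "input", "target": "input[name=email]", "source": "X"},
        {"action": "input", "target": "input[name=password]", "source": "Y"},
        {"action": "click", "target": "button[type=submit]"},
        {"action": "click", "target": "input#search"},
        {"action": "input", "target": "input#search", "source": "Z"},
        {"action": "click", "target": "button#search-submit"},
    ]

    def repair_step(page, step: dict, regenerate_target) -> dict:
        """Option 2: if one selector broke, regenerate only that step and keep the rest."""
        if page.locator(step["target"]).count() == 0:            # selector no longer matches
            step = {**step, "target": regenerate_target(step)}   # one targeted LLM call
        return step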
I like this approach. Just as an example, if I'm getting a car insurance quote, I'd rather pay $1 to have the tool fill out the forms for me and be 90% sure that it filled them out correctly rather than pay $0.01 and only be 70% sure it did it correctly. And there are plenty of use cases like that.
Isn't that crazy rabbit thingy supposed to do just that? I hope you pre-ordered. I hear they're in great demand.
https://www.rabbit.tech/research
You would still be willing to pay $1 if it got it wrong 10% of the time, or if it got 10% of the information wrong every time?
It really depends on the use case.
Scrapers are one of the main use cases we're seeing for Magic Loops[0].
...and you've hit the nail on the head in terms of our design philosophy: use LLMs to generate useful logic, then run that logic without needing to call an LLM/Agent.
With that said, we don't support browser automation. Skyvern is very neat, it reminds me of VimGPT[1], but with a more robust planning implementation.
[0] https://magicloops.dev
[1] https://github.com/ishan0102/vimGPT
Really like the simplicity of your website. I think when you first announced it, you mentioned you might open source Magic Loops. Do you still plan to?
Yes! We’re in the middle of cleaning things up, just need to make the Loops a bit more portable/easy to run, but finally happy with the state of the tool.
This brings me so much joy! Thank you for considering this!
Nice! Thanks for sharing this.
We tried approaches like VimGPT before but found the rate of hallucinations to be a bit too high for production use. The sweet spot definitely seems to be combining the magic of DOM parsing AND vision.
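Roughly speaking (illustrative only, not our actual prompt or code), the combination looks like handing the model both the screenshot and the parsed DOM candidates:

    # Illustrative: give a vision-capable model both a screenshot and the candidate
    # DOM elements, and ask it to pick one, instead of relying on either signal alone.
    import base64
    from openai import OpenAI

    client = OpenAI()

    def pick_element(screenshot_png: bytes, candidates: list[str], instruction: str) -> str:
        image_b64 = base64.b64encode(screenshot_png).decode()
        resp = client.chat.completions.create(
            model="gpt-4o",   # any vision-capable model
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text",
                     "text": instruction + "\nCandidate elements:\n" + "\n".join(candidates)
                             + "\nReply with exactly one candidate."},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
                ],
            }],
        )
        return resp.choices[0].message.content.strip()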
We're definitely going to work on logic generation and execution, but we're taking it a bit more carefully. Many of the workflows we automate have changing workflow steps (i.e., I've never seen the exact same Geico flow twice), but this certainly isn't true for all workflows.
I love all of these ideas!!
1. You can set a "max steps" limit when you run it locally https://github.com/Skyvern-AI/skyvern/blob/d0935755963b017ed...
We also spit out the cost for each step within the visualizer. Click on any task > Steps > there's a column dedicated to how much each step cost to run.
https://github.com/Skyvern-AI/skyvern/issues/70
2. We have a roadmap item to "cache" or "memorize" specific tasks, so you pay the cost once, and then just run it over and over again. We're going to get to it soon!!
https://github.com/Skyvern-AI/Skyvern/?tab=readme-ov-file#fe...
Just piggybacking here, but this is a great suggestion. It makes the cost a one-time expense, and you get something material (source code) in return.
Yes, exactly what I want. I want to be able to have it code robust Cypress tests for e2e testing.
It's getting genuinely difficult these days with everything walled behind Cloudflare, various anti-bot protections, and increasingly creative CAPTCHAs.
Interestingly enough, I made a Chrome extension that does almost exactly what you are describing. It's called Automize and it lets you very quickly generate custom selectors and export the code to Puppeteer, Playwright, Selenium, etc. It handles all the verifications as well as provides a handy UI that shows what you are selecting.
Bravo, I would pay for this one, or hopefully run it on my GPU. It would be so fast to even just spit out your selectors (XPath, CSS, dealer's choice) for point-by-point updates after you had done an initial code gen, or perhaps it could just diff and update chunks of code for you!
My local code model can already do the diff update stuff in nvim, but being able to pass it a URL and have it slam in all of the pertinent crawling code, wow.