I don't trust the code quality evaluation. The other day at work I wanted to split a string on ';', but only where it's not within single quotes (think splitting many SQL statements). I explicitly asked for a stdlib Python solution, and preferably one that avoids counting quotes, since that's a bit verbose.
GPT-4 gave me a regex found at https://stackoverflow.com/a/2787979 (minus the " handling), explained it to me, and then successfully added all the necessary unit tests, which passed - I committed all of that to the repo and moved on.
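For reference, the single-quote variant of that lookahead pattern looks roughly like this (I can't vouch it's byte-for-byte what GPT-4 produced):

    import re

    # A ';' matches only when an even number of single quotes follows it,
    # i.e. when the ';' sits outside any quoted section. Assumes balanced,
    # non-escaped quotes.
    SPLIT_RE = re.compile(r";(?=(?:[^']*'[^']*')*[^']*$)")

    print(SPLIT_RE.split("SELECT 'a;b'; SELECT 2; SELECT 'c'"))
    # -> ["SELECT 'a;b'", " SELECT 2", " SELECT 'c'"]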
I couldn't get 70B to answer this question even with multiple nudges.
Every time I try something non-GPT-4, I always go back - it feels like a waste of time otherwise. A bit sad that LLMs follow the typical winner-takes-all tech curve. However, if you could ask the smartest guy in the room your question every time, why wouldn't you?
---
Edit: USE CODE MODE and it'll actually solve it.
I didn't take a look at the code, but to me it sounds quite dangerous to take an implementation AND the unit tests straight from an LLM, commit and move on.
Is this the new normal now?
I guess most people would review the code as if it had been written by a colleague?
Yes, a great way to think of it is as a widely read intern: https://www.oneusefulthing.org/p/on-boarding-your-ai-intern
You’ve still got to avoid prompting for questionable code in the first place; e.g., splitting SQL statements on semicolons with an ad-hoc regex is going to fail in edge cases, though it may be sufficient for a specific task.
Yes, more than sufficient for an internal tool - we can assume good intentions from its users, since people want the tool to actually work and have no intention of hacking it.
Except now it's a vector if anyone gets access to this internal tool.
I would be fine with this for one-off scripts, but I absolutely cannot consider anything less than full SQL parsing (or something equally robust) if it's exposed over the network, even if only internally and behind authn and authz.
For this reason, I tend to ask LLMs additional questions like: "show me another way to do this" or specifically "how would someone with a higher need for security write this?"... knowing that I'm likely to get a more refined answer from different sources that have probably discussed deeper security implications around the same goals, for instance.
If someone uses an LLM to produce the code, I'd guess they'll use it to evaluate the code as well.
This is the part I actually want from an LLM, I write the code and it spots the problems. A mega linter. Unfortunately it's not very good at this yet.
Yeap, I want a code-review bot that just says "this is very improbable; are you sure you didn't mean x instead?"
The old Coverity used to achieve similar results in a different way, spotting probable mistakes based on patterns its heuristics found in the rest of the same codebase.
Right on. These days my llm-assisted workflow feels very similar to the 20% of my day that I used to devote to code review, just now it’s more like 60% of my day.
I’m finding it more effective (and pleasurable) to write using GitHub Copilot and CMD-RIGHT (accept next word). I put a detailed doc comment above and write in tandem with Copilot. I’ve written the structure, and I review as I write, jointly with the model.
This way I don’t need to review a block of code I didn’t write.
<aside>I had an experience yesterday where Copilot correctly freed all the memory, in the correct order, at the end of a rather complicated C algorithm, even where there were nested mallocs.</aside>
It’s the new boot camp dev. It's still the same as copy-pasting SO solutions lol
Mean-spirited, gatekeeping comment unless I’ve misunderstood. Reference to AI is frequently used to punch down like this I’ve noticed.
Reminds me of a Facebook thread I saw a few days ago on the topic of 3D printing houses. All the comments were angry, dismissive "hurr durr that's clearly poor quality work" with no further justification of their position, and it struck me how similar the overall energy was to the "all AI image generation is bad and shit, and is also heinous immoral theft, and you're literally the worst person in the world and you should feel bad" sort of raging you see any time someone posts an SD or Midjourney pic of a cute puppy riding a tricycle. These comments originate from people who've spent their lives learning skills that are now largely replaceable by a few gigs of download and a Python tutorial. No wonder they're upset.
I take it to mean that the code quality deserves more scrutiny because you can't guarantee what it has provided is quality code, without reviewing it first.
The same applies to brand new devs — it's normal to apply a little more scrutiny because they simply don't have the experience to make the right decisions as confidently (or frequently) as someone more senior.
It's an analogy for the natural fact that output reflects experience and practice over time.
What, as in something you should know not to do pretty quickly?
Presumably people look at things before committing the code. And code reviews and pull requests are still normal.
Blindly copying code from any source and running it or committing it to your main branch without even the slightest critical glance is foolish.
Arguably the tests should be easier to review than the implementation.
But if there's non-trivial logic in the test code, I agree this is probably a risky approach.
It's very powerful: I can enter an implementation of any algorithm by typing five words and clicking tab. If I want the AI to use a hashmap to solve my problem in O(n), I just say that. If I need to rewrite a bunch of poorly written code to get rid of dead code, add constants, etc., I do that. If I need to convert files between languages or formats, I do that. I have to do a lot more code review than before, and a lot less writing. It saves a huge amount of time, and it's pretty easy to measure. Personally, the order of consultation is GitHub Copilot -> GPT-4 -> Grimoire -> me. If it gets to me, there's a high probability that I'm trying to do too many things at once in an over-complicated function, or that I'm using a relatively niche library and the AI doesn't know the methods.
Hopefully not; I feel it's a waste of time. The time lost to stupid minor mistakes from GitHub Copilot that I didn't catch probably doesn't really compare to the time I would've spent typing on my own. (I only use that stuff for fancy code completion, nothing more. Every LLM is absolutely moronic. Yesterday I asked ChatGPT to convert gohtml to templ, to no avail...)
It really feels like GPT-4 is Google and everybody else is Yahoo/Bing, i.e. cute but not really.
Agreed, though I'm _really_ interested in trying 1M-token Gemini. The idea of uploading my full codebase for code-assist stuff sounds really interesting. If I can ever get access to the damn thing...
I'm curious how they'll handle this. My understanding is that it takes quite a long time to get an answer, since there's no magic "semantic database" built for you behind the scenes.
That use case also seems inefficient to solve that way in the long run: if you really had to spend a million tokens on every small query over your data, it would be prohibitively costly for anything beyond an experiment.
Don't get your hopes up - Google's article mentioned they'll limit it to 128K (at least in the beginning).
Gemini is much better than the free version of GPT 3.5 though. At least in my experience.
Microsoft’s enterprise Copilot is also fairly decent. It’s really good at helping with Microsoft-related issues or finding the right parts of their ridiculously massive documentation site. Which probably isn’t too weird, considering.
I tried
"zsh rename index.html.1 to image_1.png for many images"
on both Gemini and ChatGPT 3.5. Not a great first impression of Gemini. ChatGPT's answer isn't perfect, but it's a lot closer to correct, only needing me to remove the extra 'index' capture of $1. Curious if someone could commit some light copyright infringement and post what GPT-4 says to the same prompt.
Edit: Also Phind-34B probably gives the best answer, with the correct capture.
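For what it's worth, a plain-Python sketch of the rename I was after is short (filenames as in the prompt; the print is a dry-run trace):

    import os
    import re

    # Rename index.html.1, index.html.2, ... to image_1.png, image_2.png, ...
    for name in os.listdir("."):
        m = re.fullmatch(r"index\.html\.(\d+)", name)
        if m:
            new = f"image_{m.group(1)}.png"
            print(f"{name} -> {new}")  # inspect before trusting the rename
            os.rename(name, new)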
In Stable Diffusion we build x/y plots to evaluate the results because of seed variance. I find it interesting that LLM folks (seemingly) never do that, since their answers aren't deterministic either.
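A minimal sketch of the LLM analogue, assuming the OpenAI Python client (model name and prompt are just placeholders):

    from openai import OpenAI

    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4-turbo-preview",
        messages=[{"role": "user", "content": "Split a string on ';' outside single quotes, stdlib Python only."}],
        n=5,            # five samples of the same prompt
        temperature=1,  # keep sampling stochastic
    )
    for i, choice in enumerate(resp.choices):
        print(f"--- sample {i} ---\n{choice.message.content}\n")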
Here's what gpt4-turbo-preview outputs (with max output of 256 tokens, so the result was truncated).
P.S. Have you tried testing what happens when you clearly describe what you want? The prompt you're using is really low quality - more like a Google search. If you asked me a question like that, I'd tell you to clearly explain what it is you want.
In my experience, Bing's image search is way better than Google's. Also, I'm not going to use a search engine that I have to log in or do a captcha for.
Usually I'd say no, but Google's results these last few months have been terrible.
I'm no fan of Microsoft, but Bing's image search has been better for a long time. Google also removed functionality for no apparent reason.
Thanks for the feedback, could you please post the cached Phind link so we can take a look?
It might also be helpful to try Phind Chat mode in cases like this.
EDIT: It seems like Phind-70B is capable of getting the right regex nearly every time when Chat mode is used or search results are disabled. The search results appear to be polluting the answer for this example; we'll look into how to fix it.
https://www.phind.com/search?cache=r2a52gs77wtmi277o0xi4z2a
Phind-70B worked well for me just now: https://www.phind.com/agent?cache=clsxokt2u0002ig09n1e11bj9.
For writing/manipulating code, Chat mode might work better than Search.
You may want to improve the ui/ux for getting to your chat. It’s very hard to find on your homepage even when looking for it.
woah I've been using phind for at least a few months and can't believe I never noticed the "Chat" button
You're right! It solved it. I didn't know about the Code/Search distinction. I still struggled to get it to write the unit tests - it does write them, they just don't pass. But this is definitely much closer to GPT-4 than I originally thought.
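For reference, the kind of tests I was hoping for look roughly like this (a hypothetical sketch, wrapping the lookahead splitter from upthread so it's self-contained):

    import re
    import unittest

    SPLIT_RE = re.compile(r";(?=(?:[^']*'[^']*')*[^']*$)")

    def split_statements(text):
        return SPLIT_RE.split(text)

    class SplitStatementsTest(unittest.TestCase):
        def test_plain_split(self):
            self.assertEqual(split_statements("a; b"), ["a", " b"])

        def test_semicolon_inside_quotes(self):
            self.assertEqual(
                split_statements("SELECT 'a;b'; SELECT 2"),
                ["SELECT 'a;b'", " SELECT 2"],
            )

    if __name__ == "__main__":
        unittest.main()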
Now if we could get an AI that would switch code/search mode on its own
I've tried it with a question which requires deeper expertise – "What is a good technique for device authentication in the context of IoT?" – and the Search mode is also worse than the Chat mode:
- Search: https://www.phind.com/search?cache=s4e576jlnp1mpw73n9iy4sqc
- Chat: https://www.phind.com/agent?cache=clsyev95o0006le08b5pjrs14
The search was heavily diluted by authentication methods that don't make any sense for machine-to-machine authentication, like multi-factor or biometric authentication, as well as the advice to combine several methods. It also falls into the (admittedly common) trap of assuming that certificate-based authentication is more difficult to implement than symmetric-key (i.e., pre-shared key) authentication.
The chat answer is not perfect, but the signal-to-noise ratio is much better. The multi-factor authentication advice is again present, but it's the only major error, and it also adds relevant side topics that point in the right direction (secure credential storage, secure boot, logging of auth attempts). The Python example is cute but completely useless, though: Python on embedded devices is rare; in any case you wouldn't want a raw TLS socket but would use it within an MQTTS / HTTPS / CoAP+DTLS stack; and last but not least, it provides a server instead of a client, even though IoT devices mostly communicate outbound.
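For illustration, the outbound, certificate-authenticated client the example should have pointed toward might look like this (paho-mqtt; broker name and file paths are made up):

    import paho.mqtt.publish as publish

    # The device dials out to the broker over TLS and authenticates with a
    # per-device client certificate; nothing listens on the device itself.
    publish.single(
        topic="devices/device-0001/telemetry",
        payload='{"temp": 21.5}',
        hostname="broker.example.com",
        port=8883,  # MQTT over TLS
        client_id="device-0001",
        tls={
            "ca_certs": "ca.pem",      # CA that signed the broker certificate
            "certfile": "device.crt",  # per-device client certificate
            "keyfile": "device.key",   # private key, ideally in a secure element
        },
    )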
Doesn't handle escaped quotes, and the time complexity of that regex is very bad.
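For comparison, the quote-counting loop the OP wanted to avoid is a single linear pass, and it copes with SQL's doubled-quote escaping; a sketch:

    def split_statements(text):
        # One pass, toggling quote state; a doubled '' inside a literal
        # toggles twice, so we stay correctly inside the string.
        parts, buf, in_quote = [], [], False
        for ch in text:
            if ch == "'":
                in_quote = not in_quote
                buf.append(ch)
            elif ch == ";" and not in_quote:
                parts.append("".join(buf))
                buf = []
            else:
                buf.append(ch)
        parts.append("".join(buf))
        return parts

    print(split_statements("SELECT 'it''s; fine'; SELECT 2"))
    # -> ["SELECT 'it''s; fine'", " SELECT 2"]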
The time complexity of matching a string against any fixed regular expression is O(length of string).
If you want to talk about constant factors, we need to leave our comfortable armchairs and actually benchmark.
[Just to be clear, I am talking about real regular expressions, not Franken-xpressions with back-references etc here. But what the original commenter described is well within the realm of what you can do with regular expressions.]
You are right about escaped quotes etc. That's part of why parsing with regular expressions is hard.
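A minimal version of that benchmark might look like this (synthetic input; numbers will vary by engine and machine):

    import re
    import timeit

    text = "SELECT 'a;b';" * 1_000  # arbitrary synthetic workload
    lookahead = re.compile(r";(?=(?:[^']*'[^']*')*[^']*$)")

    def linear_split(s):
        # Single pass tracking quote state, for comparison.
        parts, start, in_quote = [], 0, False
        for i, ch in enumerate(s):
            if ch == "'":
                in_quote = not in_quote
            elif ch == ";" and not in_quote:
                parts.append(s[start:i])
                start = i + 1
        parts.append(s[start:])
        return parts

    print("lookahead:", timeit.timeit(lambda: lookahead.split(text), number=10))
    print("linear:   ", timeit.timeit(lambda: linear_split(text), number=10))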
The time complexity of deciding whether an N-letter string matches a regex is O(N). The time complexity of finding all matches is not O(N) - and finding all matches is what's needed in OP's case, because they want to split the string.
Also, OP's solution uses lookahead assertions, so it's not a real regular expression.
(I wonder if we can summon @burntsushi for expert opinion on this?)
I see that the future is brighter than ever for the information security industry.
Sure is! We've got a bright and oh-so-plentiful road ahead, provided we can avoid blowing up the planet.
Can you try this?
"Can you give me an approach for a pathfinding algorithm on a 2D grid that will try to get me from point A to point B while staying under a maximum COST argument, and avoid going into tiles that are on fire, except if no other path is available under the maximum cost?"
I've never found an AI that could solve this, because there's a lot of literature online about A* and tiles with cost, and solving this requires a different approach
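For anyone curious, the answer I'd hope for treats it as a multi-objective search rather than plain A*: keep Pareto-optimal (fire tiles entered, total cost) labels per tile and expand lexicographically, so fire is only used when no fire-free path fits the budget. A sketch (hypothetical names; 4-connected grid, non-negative costs):

    import heapq

    def find_path(grid, cost, fire, start, goal, max_cost):
        # Heap entries: (fire_tiles, total_cost, tile, path). Popping the
        # lexicographically smallest entry first means the first goal we
        # pop minimizes fire tiles, then cost, within the budget.
        frontier = [(0, 0, start, [start])]
        labels = {start: [(0, 0)]}  # tile -> Pareto set of (fire_tiles, cost)
        while frontier:
            fires, spent, tile, path = heapq.heappop(frontier)
            if tile == goal:
                return path
            x, y = tile
            for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                if nxt not in grid:
                    continue
                nf, ns = fires + (nxt in fire), spent + cost[nxt]
                if ns > max_cost:
                    continue  # over budget
                # Prune only labels dominated on BOTH objectives; plain
                # Dijkstra pruning would wrongly discard cheap-but-fiery
                # detours that are the only way to stay under max_cost.
                if any(f <= nf and s <= ns for f, s in labels.get(nxt, [])):
                    continue
                labels.setdefault(nxt, []).append((nf, ns))
                heapq.heappush(frontier, (nf, ns, nxt, path + [nxt]))
        return None  # no path from A to B within max_cost at all

Here grid is a set of walkable (x, y) tiles, cost maps a tile to its entry cost, and fire is the set of burning tiles - all my own assumptions, since the prompt leaves the representation open.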
Yup, LLMs broke well-known benchmarks
same exp