
Code search is hard

ayberk
45 replies
1d23h

It indeed is hard, and a good code search platform makes life so much easier. If I ever leave Google, the internal code search is for sure going to be the thing I miss the most. It's so well integrated into how everything else works (blaze target finding, guice bindings, etc.) that I can't imagine my life without it.

I appreciate it even more every time I use GitHub's search. Not that GitHub's is bad; it's just inherently so much harder to build a generalized code search platform.

peter_l_downs
43 replies
1d23h

If you ever leave you can use Livegrep, which was based on code-search work done at Google. I personally don't use it right now but it's great and will probably meet all your needs.

[0] https://github.com/livegrep/livegrep

keybored
22 replies
1d21h

If you ever leave you can use Livegrep, which was based on code-search work done at Google.

If I've learned anything from the fainting spells that the I-work-at-X crowd have over their internal tools on HN: no, the public/OSS variant is always a mere shadow of the real thing.

scubbo
15 replies
1d20h

I suspect you're being sarcastic - but can confirm that being nearly two years out of Amazon, I still miss its in-house CD system nearly every day. I've actively looked around for OSS replacements and very few come anywhere close.

(I would be _delighted_ for someone to "Umm actually" me by providing a great product!)

mdaniel
10 replies
1d20h

My experience has been that these in-house things do not adapt well to the high chaos of external environments: if there are 3 companies, one will find 9 systems and processes in use, making "one size fits all" a fantasy.

But, I'll bite: what made the CD system so dreamy, and what have you evaluated thus far that fell short?

shepherdjerred
9 replies
1d18h

Amazon's internal tools for building code are _amazing_.

Brazil is their internal dependency management tool. It handles building and versioning software. It introduced the concept of version sets, which essentially let you group related software, e.g. version 1.0 of my app needs version 1.1 of library x and 2.0 of runtime y. This particular set of software versions gets its own version number.

Everything from the CI/CD to the code review tool to your local builds uses the same build configuration with Brazil. All software packages in Brazil are built from source on Amazon's gigantic fleet of build servers. Builds are cached, so even though Amazon builds its own version of Make, Java, etc., these are all built and cached by the build servers and downloaded.

A simple Java application at Amazon might have hundreds of dependencies (because you'll need to build Java from scratch), but since this is all cached you don't have to wait very long.

Lastly, you have Pipelines, their internal CI/CD tool, which integrates naturally with Brazil + the build fleet. It can deploy to their internal fleet with Apollo, or to AWS Lambda, S3 buckets, etc.

In all, everything is just very well integrated. I haven't seen anything come close to what you get internally at Amazon.

mdaniel
4 replies
1d16h

so what I'm hearing is that app-1.0 needs app-1.0-runtime-build-20240410 which was, itself, built from a base of runtime-y-2.0 and layering library-x-1.1 upon it, kind of like

  # in some "app-runtimes" project, they assemble your app's runtime
  cat > Dockerfile <<FOO
  FROM public.ecr.aws/runtimes/runtime-y:2.0
  ADD https://cache.example/library-x/1.1/library-x-1.1.jar /opt/lib/
  FOO
  tar -cf - Dockerfile | podman build -t public.ecr.aws/app-runtimes/app-1.0-runtime-build:20240410 -

  # then you consume it in your project
  cat > Dockerfile <<FOO
  FROM public.ecr.aws/app-runtimes/app-1.0-runtime-build:20240410
  ADD ./app-1.0.jar /opt/app/
  FOO

  cat > .gitlab-ci.yml <<'YML'
  # you can also distribute artifacts other than just docker images
  # https://docs.gitlab.com/ee/user/packages/package_registry/supported_package_managers.html
  cook image:
    stage: package
    script:
    # or this https://docs.gitlab.com/ee/topics/autodevops/customize.html#customize-buildpacks-with-cloud-native-buildpacks
    - podman build -t $CI_REGISTRY_IMAGE .
    # https://docs.gitlab.com/ee/user/packages/#container-registry is built in
    - podman push     $CI_REGISTRY_IMAGE
  review env:
    stage: staging
    script: auto-devops deploy
    # for free: https://docs.gitlab.com/ee/ci/review_apps/index.html
    environment:
      name: review/${CI_COMMIT_REF_SLUG}
      url: https://${CI_ENVIRONMENT_SLUG}.int.example
      on_stop: teardown-review
  teardown-review:
    stage: staging
    script: auto-devops stop
    when: manual
    environment:
      name: review/${CI_COMMIT_REF_SLUG}
      action: stop
  ... etc ...
  YML
and then, yadda, yadda, blue-green, incremental rollout <https://gitlab.com/gitlab-org/gitlab/-/blob/v16.10.2-ee/lib/...>, feature flags <https://docs.gitlab.com/ee/operations/feature_flags.html>, error capture <https://docs.gitlab.com/ee/operations/error_tracking.html#in...>, project-managed provisioning <https://docs.gitlab.com/ee/user/infrastructure/iac/#integrat...>, on call management <https://docs.gitlab.com/ee/operations/incident_management/>, on call runbooks <https://docs.gitlab.com/ee/user/project/clusters/runbooks/in...>

you can orchestrate all that from ~~Slack~~ Chime :-D if you're into that kind of thing https://docs.gitlab.com/ee/ci/chatops/

xyzzy_plugh
3 replies
1d16h

No, not even close. You might even have it exactly backwards.

mdaniel
2 replies
1d16h

which is why, as I originally asked GP: what have you already tried and what features were they missing?

I presume by "exactly backwards" you mean that one should have absolutely zero knobs to influence anything because the Almighty Jeff Build System does all the things, which GitLab also supports but is less amusing to look at on an Internet forum because it's "you can't modify anything, it just works, trust me"

Or, you know, if you have something constructive to add to this discussion feel free to use more words than "lol, no"

xyzzy_plugh
0 replies
1d16h

I don't work at Amazon, and haven't for a long time, and this format is insufficient to fully express what they're doing, so I won't try.

You're better off searching for how Brazil and Apollo work.

That being said, the short of it is this: imagine that when you push a new revision to source control, you (you) can run jobs testing every potential consumer of that new revision. As in, you push libx-1.1.2 and anyone consuming libx >= 1.1 (or any variety of filters) is identified. If the tests succeed, you can update their dependencies on your package and even deploy them, safely and gradually, to production without involving the downstream teams at all. If they don't, you can choose your own adventure: pin them, fork, fix them, patch the package, revise the versioning, whatever you want.

It's designed to be extremely safe and put power in the hands of those updating dependencies to do so safely within reason.

Imagine you work on a library and you can test your PR against every consumer.

It's not unlike what Google and other monorepos accomplish but it's quite different also. You can have many live versions simultaneously. You don't have to slog it out and patch all the dependents -- maybe you should, but you have plenty of options.

It all feels very simple. I'm glossing over a lot.

shepherdjerred
0 replies
1d14h

Sorry, I wish I could phrase it better for you. All I can say is that I have tried a _lot_ of tools, and nothing has come close. Amazon has done a lot of work to make efficient tools.

Here's a better explanation: https://gist.github.com/terabyte/15a2d3d407285b8b5a0a7964dd6...

nosefrog
3 replies
1d16h

How did you avoid version hell? At Google, almost everything just shipped from master (except for some things that had more subtle bugs, those did their work on a dev branch and merged into master after testing).

shepherdjerred
1 replies
1d14h

Version sets take care of everything. A version set can be thought of as a Git repo with just one file. The file is just key/value pairs with the dependencies and major/minor version mappings, e.g.

    <Name>   <Major>-<Minor>
    Java     8-123
    Lombok   1.12-456
    ...

A version set revision is essentially a git commit of that version set file. It's what determines exactly what software version you use when building/developing/deploying/etc.

Your pipeline (which is a specific noun at Amazon, not the general term) acts on a single version set. When you clone a repo, you have to choose which version set you want, when you deploy you have to choose a version set, etc.

Unlike most other dependency management systems, there's no notion of a "version of a package" without choosing what version set you're working on, which can choose the minor versions of _all of the packages you're using_.

e.g. imagine you clone a Node project with all of its dependencies. Each dependency will have a package.json file declaring what versions it needs. You have some _additional_ metadata that goes a step further and chooses the exact minor version that each major version is mapped to.

All that to say that the package can declare what major version they depend on, but not what minor version. The version set that you're using determines what minor version is used. The package determines the major version.

Version sets can only have one minor version per major version of a package, which prevents consistency issues.

e.g. I can have Java 8-123 and Java 11-123 in my version set, but I cannot have Java 8-123 and Java 8-456 in my version set.

Your pipeline will automatically build new minor versions into your version set from upstream. If the build fails, then someone needs to do something. Every commit produces a new minor version of a package; that is to say, you declare that your package is major version X, but the minor version is left up to Brazil.

This scheme actually works pretty well. There are internal tools (Gordian Knot) which perform analysis on your dependencies to make sure they are correct.

It's a lot to know. It took me a year or so to fully understand and appreciate. Most engineers at Amazon treat it like they do Git -- learn the things you need to and ignore the rest. For the most part, this stuff is all hands off; you just need one person on the team keeping everything correct.
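
To make the major/minor split concrete, here is a minimal TypeScript sketch of that resolution step. The names and data shapes are made up for illustration and are not Brazil's actual format:

    // Minimal sketch of the "version set" idea described above.
    // Shapes and names are illustrative, not Brazil's real format.

    // A package declares only the major version of each dependency.
    interface PackageManifest {
      name: string;
      dependsOn: Record<string, string>; // dependency name -> major version
    }

    // A version set revision pins each (name, major) pair to exactly one minor version.
    type VersionSetRevision = Record<string, Record<string, string>>;

    function resolve(pkg: PackageManifest, vs: VersionSetRevision): Record<string, string> {
      const resolved: Record<string, string> = {};
      for (const [dep, major] of Object.entries(pkg.dependsOn)) {
        const minor = vs[dep]?.[major];
        if (minor === undefined) throw new Error(`version set has no ${dep} ${major}-x entry`);
        resolved[dep] = `${major}-${minor}`; // e.g. "8-123"
      }
      return resolved;
    }

    // The package asks for Java 8 and Lombok 1.12; the version set decides
    // that means Java 8-123 and Lombok 1.12-456.
    const app: PackageManifest = { name: "my-app", dependsOn: { Java: "8", Lombok: "1.12" } };
    const rev: VersionSetRevision = { Java: { "8": "123" }, Lombok: { "1.12": "456" } };
    console.log(resolve(app, rev)); // { Java: "8-123", Lombok: "1.12-456" }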

actionfromafar
0 replies
1d7h

That sounds actually brilliant. Someone decided to brush less version stuff under the carpet.

xyzzy_plugh
0 replies
1d16h

You don't, you embrace version hell.

zaphirplane
0 replies
1d3h

AWS has services around pipeline and deploy, right ?

xyzzy_plugh
0 replies
1d16h

NixOS/nixpkgs is about the closest thing you'll find in the wild. You have to squint a bit, I'll admit.

tripdout
0 replies
1d13h

Is it true that teams don't do branches in source control at all? Just publishing a CR?

sdesol
0 replies
1d18h

(I would be _delighted_ for someone to "Umm actually" me by providing a great product!)

I think the issue is that nobody would be willing to pay for a good solution, since they usually come with a steep maintenance cost. I wouldn't be surprised if the in-house CD team at Amazon were putting out fires every week/month behind the scenes.

alienchow
3 replies
1d17h

Just left Google a few months ago.

My take is that there's a difference between a company that is willing to invest money into EngProd endeavors, and a company that uses SaaS for everything. While I can understand that most companies don't have the financial means to invest heavily into EngProd, the outcome is that the tightly integrated development experience in the former is far superior. Code Search is definitely #2 on the list of things I miss the most.

fragmede
2 replies
1d17h

What's #1? Memegen?

voiceblue
0 replies
1d16h

The business end of gThanks

alienchow
0 replies
1d16h

Cheesy answer but, the people.

jeffbee
0 replies
1d20h

It's not just that. Livegrep isn't just a pale imitation of something inside Google. It's totally unrelated in implementation, capabilities, and use case.

elevatedastalt
0 replies
1d17h

You would be better off actually trying to understand those sentiments instead of posting sarcastic replies on HN.

The sort of tight-knit integration and developer focus that internal tools at developer-friendly companies like Google have cannot be matched by cobbling together 50 different SaaS products, half of which will probably run out of funding in 3 years.

You literally have entire companies just LARPing internal tools that Google has because they are just that good. Glean is literally Moma. There's really nothing like Critique or Buganizer.

init
18 replies
1d22h

I've used both Code Search and Livegrep. No, Livegrep does not even come close to what Code Search can do.

Sourcegraph is the closest thing I know of.

tayo42
7 replies
1d22h

Is there like a summary of what's missing from public attempts and what makes it so much better?

sdesol
4 replies
1d21h

The short answer is context. The reason Google's internal code search is so good is that it is tied into their build system. This means that when you search, you know exactly which files to consider. Without context, you are making an educated guess about which files to consider.

riku_iki
3 replies
1d21h

How exactly does integration with the build system help Google? Maybe you could give a specific example?

isker
1 replies
1d20h

Try clicking around https://source.chromium.org/chromium/chromium/src, which is built with Kythe (I believe, or perhaps it's using something internal to Google that Kythe is the open source version of).

By hooking into C++ compilation, Kythe is giving you things like _macro-aware_ navigation. Instead of trying to process raw source text off to the side, it's using the same data the compiler used to compile the code in the first place. So things like cross-references are "perfect", with no false positives in the results: Kythe knows the difference between two symbols in two different source files with the same name, whereas a search engine naively indexing source text, or even something with limited semantic knowledge like tree sitter, cannot perfectly make the distinction.

dmoy
0 replies
20h32m

Yes, the clicking around on semantic links on source.chromium.org is served off of an index built by the Kythe team at Google.

The internal Kythe has some interesting bits (mostly around scaling) that aren't open sourced, but it's probably doable to run something on chromium scale without too much of that.

The grep/search box up top is a different index, maintained by a different team.

sdesol
0 replies
1d21h

If you want to build a product with a build system, you need to tell it what source to include. With this information, you know what files to consider, and if you are dealing with a statically typed language like C or C++, you have build artifacts that can tell you where the implementation was defined. All of this takes the guesswork out of answering questions like "Which foo() implementation was used?"

If all you know are repo branches, the best you can do is return matches from different repo branches with the hope that one of them is right.

Edit: I should also add that with a build system, you know what version of a file to use.

j2kun
0 replies
1d21h

Google builds all the code in its monorepo continuously, and the built artifacts are available for the search. Open source tools are never going to incur the cost of actually building all the code they index.

DannyBee
0 replies
1d16h

The short summary is: It's a suite of stuff that someone actually thought about making work together well, instead of a random assortment of pieces that, with tons of work, might be able to be cobbled together into a working system.

All the answers about the technical details or better/worseness mostly miss the point entirely - the public stuff doesn't work as well because it's 1000 providers who produce 1000 pieces that trade integration flexibility for product coherence. On purpose mind you, because it's hard to survive in business (or attract open source users if that's your thing) otherwise.

If you are trying to do something like make "code review" and "code search" work together well, it's a lot easier to build a coherent, easy to use system that feels good to a user if you are trying to make two things total work together, and the product management directly talks to each other.

Most open source doesn't have product management to begin with, and the corporate stuff often does but that's just one provider.

They also have a matrix of, generously, 10-20 tools with meaningful marketshare they might need to try to work with.

So if you are a code search provider trying to make a code search tool integrate well with any of the top 20 code review tools, well, good luck.

Sometimes people come along and do a good enough job abstracting a problem that you can make this work (LSP is a good example), but it's pretty rare.

Now try it with "discover, search, edit, build, test, release, deploy, debug", etc. Once you are talking about 10x10x10x10x10x10x10x10 combinations of possible tools, with nobody who gets to decide which combinations are the well lit path, ...

Also, when you work somewhere like Google or Amazon, it's not just that someone made those specific things work really well together, but often, they have both data and insight into where you get stuck overall in the dev process and why (so they can fix it).

At a place like Google, I can actually tell you all the paths that people take when trying to achieve a journey. So that means I know all the loops (counts, times, etc) through development tools that start with something like "user opens their editor". Whether that's "open editor, make change, build, test, review, submit" or "open editor, make change, go to lunch", or "open editor, go look at docs, go back to editor, go back to docs, etc".

So I have real answers to something like "how often do people start in their IDE, discover they can't figure out how to do X, leave the IDE to go find the answer, not find it, give up, and go to lunch". I can tell you what the top X where that happens is, and how much time is or is not wasted through this path, etc.

Just as an example. I can then use all of this to improve the tooling so users can get more done.

You will not find this in most public tooling, and to the degree telemetry exists that you could generate for your own use, nobody thinks about how all that telemetry works together.

Now, mind you, all the above is meant as an explanation - I'm trying to explain why the public attempts don't end up as "good". But for me, good/bad is all about what you value.

Most tradeoffs here were deliberate.

But they are tradeoffs.

Some people value the flexibility more than coherence. or whatever. I'm not gonna judge them, but I can explain why you can't have it all :)

birktj
6 replies
1d21h

I see most replies here are mentioning that the build integration is what is mainly missing in the public tools. I wonder if nix and nixpkgs could be used here? Nix is a language-agnostic build system, and with nixpkgs it has build instructions for a massive number of packages. Artifacts for all packages are also available via hydra.

Nix should also have enough context so that for any project it can get the source code of all dependencies and (optionally) all build-time dependencies.

jeffbee
4 replies
1d20h

Build integration is not the main thing that is missing between Livegrep and Code Search. The main thing that is missing is the semantic index. Kythe knows the difference between this::fn(int) and this::fn(double) and that::fn(double) and so on. So you can find all the callers of the nullary constructor of some class, without false positives of the callers of the copy constructor or the move constructor. Livegrep simply doesn't have that ability at all. Livegrep is what it says it is on the box: grep.

humanrebar
3 replies
1d19h

The build system coherence provided by a monorepo with a single build system is what makes you understand this::fn(double) as a single thing. Otherwise, you will get N different mostly compatible but subtly different flavors of entities depending on the build flavor, combinations of versioned dependencies, and other things.

jeffbee
2 replies
1d19h

Sure. Also, if you eat a bunch of glass, you will get a stomach ache. I have no idea why anyone uses a polyrepo.

humanrebar
1 replies
1d19h

The problem with monorepos is that they're so great that everyone has a few.

refulgentis
0 replies
1d18h

God that is good.

yencabulator
0 replies
1h30m

Nix builds suck for development because there is no incrementality there. Any source file changes in any way, and your typical nix flake will rebuild the project from scratch. At best, you get to reuse builds of dependencies.

isker
2 replies
1d22h

Agreed. There are some public building blocks available (e.g. Kythe or meta's Glean) but having something generic that produces the kind of experience you can get on cs.chromium.org seems impossible. You need such bespoke build integration across an entire organization to get there.

Basic text search, as opposed to navigation, is all you'll get from anything out of the box.

init
1 replies
1d22h

In a past job I built a code search clone on top of Kythe, Zoekt and LSP (for languages that didn't have bazel integration). I got help from another colleague to make the UI based on Monaco. We created a demo that many people loved, but we didn't productionize it for a few reasons (it was an unfunded hackathon project and the company was considering another solution when they already had Livegrep).

Producing the Kythe graph from the bazel artifacts was the most expensive part.

Working with Kythe is also not easy as there is no documentation on how to run it at scale.

isker
0 replies
1d22h

Very cool. I tried to do things with Kythe at $JOB in the past, but gave up because the build (really, the many many independent builds) precluded any really useful integration.

I did end up making a nice UI for vanilla Zoekt, as I mentioned elsewhere: https://github.com/isker/neogrok.

jeffbee
0 replies
1d20h

Just want to note that Livegrep, its antecedent "codesearch", and other things that are basically grep bear no resemblance to that which a person working at Google calls "Code Search".

fmobus
0 replies
1d11h

The guice bindings layer thing is nice, but its UI could be improved. I wish I could directly search for providers/usages from the search box.

sqs
14 replies
1d22h

I'm at Sourcegraph (mentioned in the blog post). We obviously have to deal with massive scale, but for anyone starting out adding code search to their product, I'd recommend not starting with an index and just doing on-the-fly searching until that does not scale. It actually will scale well for longer than you think if you just need to find the first N matches (because that result buffer can be filled without needing to search everything exhaustively). Happy to chat with anyone who's building this kind of thing, including with folks at Val Town, which is awesome.
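
For a sense of what that looks like, here is a minimal TypeScript sketch of brute-force search with an early exit once the result buffer is full; the file list, match shape, and limit are illustrative assumptions, not Sourcegraph's implementation:

    import { promises as fs } from "node:fs";

    interface Match {
      file: string;
      line: number;
      text: string;
    }

    // Brute-force search that stops as soon as the result buffer is full.
    // Most queries only need the first page of results, so the long tail of
    // files never gets scanned for the vast majority of searches.
    async function searchFirstN(files: string[], term: string, limit = 50): Promise<Match[]> {
      const matches: Match[] = [];
      for (const file of files) {
        const lines = (await fs.readFile(file, "utf8")).split("\n");
        for (let i = 0; i < lines.length; i++) {
          if (lines[i].includes(term)) {
            matches.push({ file, line: i + 1, text: lines[i] });
            if (matches.length >= limit) return matches; // early exit: buffer is full
          }
        }
      }
      return matches;
    }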

morgante
3 replies
1d20h

I've been surprised at how far you can get without indexing.

Ex. I always assume we'll need to add an index to speed up GritQL (https://github.com/getgrit/gritql), but we've gotten pretty far with doing search entirely on the fly.

worldsayshi
2 replies
1d20h

What does 'on the fly' entail here?

simonw
1 replies
1d20h

I'm going to guess brute force - scan everything for the search term, rather than trying to use an index.

I'm always amazed at how fast ripgrep (rg) can brute force its way through hundreds of MBs of source code.

morgante
0 replies
1d19h

Yes, exactly. When doing a search, we parse and search every file without any indexing.

Of course, it could still be sped up considerably with an index but brute force is surprisingly effective (we use some of the same techniques/crates as ripgrep).

mechanicker
3 replies
1d3h

I hope this and SCIP become standards and we get more programming languages emitting symbols in SCIP format.

mdaniel
1 replies
1d2h

I thought SCIP got promoted into https://lsif.dev/ but chasing the https://github.com/sourcegraph/lsif-java link resolves to https://github.com/sourcegraph/scip-java so maybe I had the evolution relationship backward. Anyway, I'm thankful at least that code is still Apache 2

https://github.com/topics/lsif may interest this audience, too, since the scip topic tag seems to clash with something else

Also, I learned last night that GitLab embraces LSIF, too https://docs.gitlab.com/ee/topics/autodevops/stages.html#aut...

ivanovm
0 replies
23h3m

My, and my friends', experiences with the SCIP indexers built by Sourcegraph have been less than stellar. They are buggy and sparsely maintained.

klysm
0 replies
1d19h

I apply this thinking to lots of problems. Do the dumb thing that involves the least state and prove we need to lean more towards memory for speed. It’s much simpler to keep things correct when nothing is cached

isker
0 replies
1d22h

And when you're ready to do indexed search, Zoekt (over which Sourcegraph graciously took maintainership a while ago) is the best way to do it that I've found. After discounting both Livegrep and Hound (they both struggled to perform in various dimensions with the amount of stuff we wanted indexed, Hound moreso than Livegrep), we migrated to Zoekt from a (necessarily) very old and creaky deployment of OpenGrok and it's night and day, both in terms of indexing performance and search performance/ergonomics.

Sourcegraph of course adds many more sophisticated features on top of just the code search that Zoekt provides.

hinkley
0 replies
1d19h

There was someone doing temporal databases that was compressing blocks on disk and doing streaming decompress and search on them. Things in L2 cache go very very fast.

beembeem
0 replies
1d2h

Any opinions on mozilla's DXR?

baobun
0 replies
1d15h

You'll also be in a much better spot to pick appropriate indexing when you actually have sizable and representative workloads.

FalconSensei
0 replies
1d1h

Do you plan on ever allowing users to change the font size?

hiAndrewQuinn
10 replies
1d13h

Basic code searching skills seem like something new developers are never explicitly taught, but they're an absolutely crucial skill to build early on.

I guess the knowledge progression I would recommend would look something like this:

- Learning about Ctrl+F, which works basically everywhere.

- Transitioning to ripgrep https://github.com/BurntSushi/ripgrep - I wouldn't even call this optional, it's truly an incredible and very discoverable tool. Requires keeping a terminal open, but that's a good thing for a newbie!

- Optional, but highly recommended: Learning one of the powerhouse command line editors. Teenage me recommended Emacs; current me recommends vanilla vim, purely because some flavor of it is installed almost everywhere. This is so that you can grep around and edit in the same window.

- In the same vein, moving back from ripgrep and learning about good old fashioned grep, with a few flags rg uses by default: `grep -r` for recursive search, `grep -ri` for case insensitive recursive search, and `grep -ril` for case insensitive recursive "just show me which files this string is found in" search. Some others too, season to taste.

- Finally hitting the wall with what ripgrep can do for you and switching to an actual indexed, dedicated code search tool.

datascienced
6 replies
1d13h

Also, GitHub is a fantastic tool for searching code across repos, even ones you may not have cloned yet! Either public ones or org ones.

tex0
2 replies
1d9h

The new GitHub CS is pretty great indeed. Still not on par with its role model, but getting closer.

greymalik
1 replies
1d6h

What’s its role model?

lambdaba
0 replies
1d6h

Sourcegraph?

follower
1 replies
1d3h

GitHub's code search functionality is only available to people who are logged in.

It used to be possible to perform global/multi-org/multi-repo/single-repo code searches without being logged in but over time they removed all code search functionality for people who are not logged in.

It is completely stupid that it's not possible for a non-logged-in person to code search even within a single repo[0].

It is textbook enshittification by a company with a monumental amount of leverage over developers.

(The process will presumably continue until the day when being logged in is required to even view code from the myriad Free and Open Source projects who find themselves trapped there.)

[0] Which is why I, somewhat begrudgingly[1], use Sourcegraph for my non-local code search needs these days.

[1] Primarily because Sourcegraph are susceptible to the same forces that lead to enshittification but given they also have less leverage I've left that as a problem for future me to worry about. (But also the site is quite "heavy" for when one just wants to do a "quick" search in a single repo...)

sgift
0 replies
1d2h

You make it sound as if being logged in to github is somehow a big hurdle. It's free and it's easy, so why should one care if it's only available to logged in users?

lambdaba
0 replies
1d12h

yeah, I particularly like the combo regex + path: or lang:

plugin-baby
1 replies
1d10h

Apart from speed, what advantages does ripgrep offer over git grep when searching git repos?

burntsushi
0 replies
1d7h

ripgrep author here.

Better Unicode support in the regex engine. More flexible ignore rules (you aren't just limited to what `.gitignore` says, you can also use `.ignore` and `.rgignore`). Automatic support for searching UTF-16 files. No special flags required to search outside of git repositories or even across multiple git repositories in one search. Preprocessors via the `--pre` flag that let you transform data before searching it (e.g., running `pdftotext` on `*.pdf` files). And maybe some other things.

`git grep` on the other hand has `--and/--or/--not` and `--show-function` that ripgrep doesn't have (yet).

mcintyre1994
0 replies
1d10h

I’d also point out that VSCode uses ripgrep for its search feature which is a great starting point.

sdesol
6 replies
1d22h

It’s hard to find any accounts of code-search using FTS

I'm actually going to be doing this soon. I've thought about code search for close to a decade, but I walked away from it, because there really isn't a business for it. However, now with AI, I'm more interested in using it to help find relevant context and I have no reason to believe FTS won't work. In the past I used Lucene, but I'm planning on going all in with Postgres.

The magic to fast code search (and search in general) is keeping things small. As long as your search solution is context aware, you can easily leverage Postgres sharding to reduce index sizes. I'm a strong believer in "disk space is cheap, time isn't", which means I'm not afraid to create as many indexes as required to shave hundreds of milliseconds off searches.

bevekspldnw
5 replies
1d20h

Mmm, it's not that straightforward: indexes can vastly slow down large-scale ingest, so it's really about when to index as well.

I work with a lot of multi billion row datasets and a lot of my recent focus has been on developing strategies to avoid the slow down with ingest, and then enjoying the speed up for indexed on search.

I've also gotten some mind-boggling speed increases by summarizing key searchable data in smaller tables, some with JSONB columns that are abstractions of other data, indexing those, and using pg_prewarm to serve those tables purely from memory. I literally went from queries taking actual days to < 1 sec.

sdesol
3 replies
1d19h

Yeah I agree. I've had a lot of practice so far with coordinating between hundreds of thousands of tables to ensure ingestion/lookup is fast. Everything boils down to optimizing for your query patterns.

I also believe in using what I call "compass tables" (like your summarization tables), which I guess are indexes of indexes.

bevekspldnw
2 replies
1d18h

Scaling databases is both oddly frustrating and rewarding. Getting that first query that executes at 10x the speed of the old one feels great. The week of agony that makes it possible…less so.

sdesol
1 replies
1d17h

Fully agree. I do have to give hardware a lot of credit though. With SSD and now NVME, fast random read/write speed is what makes a lot of things possible.

bevekspldnw
0 replies
1d15h

Yup, I just wish Samsung made an 8TB NVME!

philippemnoel
0 replies
1d1h

That's wild. Quite impressive how far Postgres can be tuned. Is this all with tsvector?

pomdtr
5 replies
1d19h

Hey! I'm a val.town fanboy and I immediately thought about a workaround while reading the blog post:

What if I dumped every public val into GitHub, in order to be able to use their (awesome) search?

So here is my own "Val Town Search": https://val-town-search.pomdtr.me

And here is the repo containing all vals, updated hourly thanks to a github action: https://github.com/pomdtr/val-town-mirror

simonw
1 replies
1d19h

Well this is fun...

    git clone https://github.com/pomdtr/val-town-mirror
    cd val-town-mirror
    rg news.ycombinator.com
Now I can ripgrep search public Vals, e.g. to see who's hitting Hacker News from a Val.

pomdtr
0 replies
1d18h

Yeah, and you can finally run/debug vals locally (kind of, the version query param is not yet supported)

nbbaier
0 replies
1d16h

This is great!

MatthiasPortzel
0 replies
1d18h

That is, uh, one solution, to say the least.

There's an HN comment I'll never forget where the commenter suggests that Discord move their search infrastructure to a series of text files searched with ripgrep, but Val.town's scale is small enough that they could actually consider it.

peter_l_downs
4 replies
1d23h

Surprised not to see Livegrep [0] on the list of options. Very well-engineered technology; the codebase is clean (if a little underdocumented on the architecture side) and you should be able to index your code without much difficulty. Built with Bazel (~meh, but useful if you don't have an existing cpp toolchain all set up) and there are prebuilt containers you can run. Try that first.

By the way, there's a demo running here for the linux kernel, you can try it out and see what you think: https://livegrep.com/search/linux

EDIT: by the way, "code search" is deeply underspecified. Before trying to compare all these different options, you really would benefit from writing down all the different types of queries you think your users will want to ask, including why they want to run that query and what results they'd expect. Building/tuning search is almost as difficult a product problem as it is an engineering problem.

[0] https://github.com/livegrep/livegrep

isker
3 replies
1d22h

When I investigated using livegrep for code search at work, it really struggled to scale to a large number of repositories. At least at the time (a few years ago) indexing in livegrep was a monolithic operation: you index all repos at once, which produces one giant index. This does not work well once you're past a certain threshold.

I also recall that the indexes it produces are pretty heavyweight in terms of memory requirements, but I don't have any numbers on hand to justify that claim.

Zoekt (also mentioned in TFA) has the correct properties in this regard. Except in niche configurations that are probably only employed at sourcegraph, each repo is (re)indexed independently and produces a separate set of index files.

But its builtin web UI left much to be desired (especially compared to livegrep), so I built one: https://github.com/isker/neogrok.

omitmyname
1 replies
1d2h

Oh my god. This is amazing. I was thinking of building such thing myself. Thank you!

omitmyname
0 replies
1d

Is there any way to open a file like in Zoekt? It's so much better than the native Zoekt UI except for this :(

peter_l_downs
0 replies
1d22h

I like this better than livegrep. I haven't actually operated either zoekt OR livegrep before, but I'll probably start with zoekt+neogrok next time I want to stand up a codesearch page. Thanks for building and sharing this!

worldsayshi
3 replies
1d23h

I suppose using something like tree sitter to get a consistent abstract syntax tree to work with would be a good starting point. And then try building a custom analyzer (if using elasticsearch lingo) with that?

samatman
1 replies
1d21h

Might be overkill unless you're looking to do semantic search. I've thought about what a search DSL for code would look like; it's challenging to embody a query like "method which takes an Int64 and has a variable idx in it" into something compact and memorable.

But a tokenizer seems like a good place to start, I think that's the right granularity for this kind of application. You'd want to do some chunking so that foo.bar doesn't find every foo and every bar, that sort of thing. Code search is, as the title says, a hard problem. But a language-aware token stream, the one you'd get from the lexer, is probably where one should start in building the database.
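
As a rough illustration of that granularity (not a proposal for any particular tool), here is a toy TypeScript tokenizer that keeps qualified names together so a search for foo.bar doesn't degrade into every foo and every bar:

    // Toy token-level indexing sketch: keep qualified names like "foo.bar"
    // as a single token instead of splitting on punctuation. A real system
    // would use the language's own lexer (or tree-sitter) rather than a regex.
    function codeTokens(source: string): string[] {
      // identifiers, optionally joined by "." into a qualified name
      const qualified = /[A-Za-z_$][\w$]*(?:\.[A-Za-z_$][\w$]*)*/g;
      return source.match(qualified) ?? [];
    }

    console.log(codeTokens("const n = foo.bar(idx) + bar(7);"));
    // -> [ "const", "n", "foo.bar", "idx", "bar" ]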

worldsayshi
0 replies
1d20h

Sure, you should definitely not try to do the overkill use case first, but I would assume that tree-sitter can emit "just" tokens as well? Getting the flexibility and control of a tool like tree-sitter should allow you to quickly throw away stuff like comments and keywords if you want, since you can do syntax-aware filtering.

Then again I haven't used tree-sitter, can just imagine that this is a strength of it.

azornathogron
0 replies
1d22h

Another option is to start with Kythe, which is Google's own open source framework for getting a uniform cross-language semantic layer: https://kythe.io/

Worth looking at as a source of inspiration and design ideas even if you don't want to use it itself.

jillesvangurp
3 replies
1d11h

It's why IDE and developer tool builders have long had the insight that in order to do code search properly, you need to open up the compiler platform as a lot of what you need to do boils down to reconstructing the exact same internal representations that a compiler would use. And of course good code search is the basis for refactoring support, auto completion, and other common IDE features.

Easier said than done, of course, as tools are often an afterthought for compiler builders. Even JetBrains made this mistake with Kotlin initially, which is something they are partially rectifying with Kotlin 2.0 now to make it easier to support things like incremental compilation. The Rust community had this insight as well, with a big effort a few years ago to make Rust more IDE friendly.

IBM actually nailed this with Eclipse back in the day and that hasn't really been matched since. IntelliJ never even got close, being 2-3 orders of magnitude slower. We're talking seconds vs. milliseconds here. Eclipse had a blazing fast incremental compiler for Java that could even partially compile code in the presence of syntax errors. The IDE's representation of that code was hooked into that compiler.

With Eclipse, you could introduce a typo and break part of your code and watch the IDE mark all the files that now had issues across your code base getting red squiggles instantly. Fix the typo and the squiggles went away, also without any delay.

That's only possible if you have a mapping between those files and your syntax tree, which is exactly what Eclipse was doing because it was hooked into the incremental compiler.

IntelliJ was never able to do this; it will actively lie to you about things being fine/not fine until you rebuild your code, and it will show phantom errors a lot when its internal state gets out of sync with what's on disk. It often requires full rebuilds to fix this. If you run something, there's a several-second lag while it compiles things. Reason: the IDE's internal state is calculated separately from the compiler and this gets out of sync easily. When you run something, it has to compile your code because it hasn't been compiled yet. That's often when you find out the IDE was lying to you about things being ready to run.

With Eclipse all this was instant and unambiguous because it shared its internal state with the compiler. If it compiled, your IDE would be error free; if it didn't, it wouldn't be. And it compiled incrementally and really quickly, so you would know instantly. It had many flaws and annoying bugs, but that's a feature I miss.

dikei
0 replies
1d8h

While Eclipse truly has an incredible incremental compiler for Java, IntelliJ's better integration with external build systems like Maven and Gradle, together with better cross-language support, was what won me over.

callmeal
0 replies
1d9h

With Eclipse all this was instant and unambiguous ...

Still is, and is the main reason why a lot of us will never jump ship.

jackbravo
3 replies
1d23h

Would LLM vector embeddings work in this context? I'm guessing they should since they are very good at understanding code.

CityOfThrowaway
1 replies
1d22h

Yes but generating that index would be expensive

anonymousDan
0 replies
1d19h

Why exactly? You mean to construct the embeddings or to embed the queries?

ivanovm
0 replies
23h12m

I've found embeddings to perform quite poorly on code because 1) user queries are not semantically similar to target code in most cases 2) often times two very concretely related pieces of code are not at all semantically similar

simonw
2 replies
1d19h

A feature I'd appreciate from Val Town is the ability to point it to a GitHub repo that I own and have it write the source code for all of my Vals to that repo, on an ongoing basis.

Then I could use GitHub code search, or even "git pull" and run ripgrep.

nbbaier
1 replies
1d16h

I've actually built a tool in Val Town that could be used as the basis for something like this: https://www.val.town/v/nbbaier/valToGH

Right now it only commits one val, but it would be trivial to write it into a loop and then use a scheduled val to have it run over all your vals as a cron job!

nbbaier
0 replies
1d14h

nvm, just went in and changed it so now you could theoretically do all your vals at once. I realized though that it will not update file contents as is, so need to figure that out ...

semiquaver
2 replies
1d23h

Be careful with trigram indexes. At least in the postgres 10 era they caused severe index bloat for frequently updated tables.

peter_l_downs
1 replies
1d23h

Interesting, do you know anywhere I can easily read more about this? (I will do my own research, too.)

boyter
0 replies
1d19h

It's a result of trigrams themselves. For example, "searchcode" (please ignore the plug, this is just the example I had to hand) goes from 1 thing you would need to index to 8.

    "searchcode"   -> [sea, ear, arc, rch, chc, hco, cod, ode]
As a result the index rapidly becomes larger than you would expect.
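
A tiny TypeScript sketch of that expansion, just to show where the growth comes from (each term of length n contributes n - 2 trigrams):

    // Why trigram indexes bloat: every term of length n yields n - 2 trigrams,
    // each of which needs its own posting list entry.
    function trigrams(term: string): string[] {
      const grams: string[] = [];
      for (let i = 0; i + 3 <= term.length; i++) {
        grams.push(term.slice(i, i + 3));
      }
      return grams;
    }

    console.log(trigrams("searchcode"));
    // -> [ "sea", "ear", "arc", "rch", "chc", "hco", "cod", "ode" ]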

boyter
0 replies
1d19h

Yes, although the lack of detail about the sparse grams is frustrating.

fizx
2 replies
1d18h

There's a million paths, but here's one I like.

Use ElasticSearch. It will scale more than Postgres. Three hosted options are AWS, Elastic, Bonsai. I founded Bonsai and retired (so am partial), but they will provide the best human support for you, and you won't have to worry about java Xmx.

Your goal with ES is to use the Regex PatternAnalyzer to split the code into reasonable exact code-shaped tokens (not english words).

Here's a rough GPT4 explanation with sample config that I'd head towards: https://chat.openai.com/share/e4d08586-b7ef-48f2-9de1-7f82ea...
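
Here is also a rough sketch of what such index settings might look like, using Elasticsearch's built-in pattern analyzer; the separator pattern (split on anything that isn't an identifier character or a dot) and the field names are my own guesses, not taken from the linked chat:

    // Illustrative index settings for code-shaped tokens using the built-in
    // "pattern" analyzer. The separator pattern and field names are assumptions.
    const codeSearchSettings = {
      analysis: {
        analyzer: {
          code: {
            type: "pattern",
            pattern: "[^A-Za-z0-9_.]+", // split on everything except identifier chars and "."
            lowercase: true,
          },
        },
      },
    };

    const codeSearchMappings = {
      properties: {
        path: { type: "keyword" },
        content: { type: "text", analyzer: "code" },
      },
    };

    // These objects would be passed as the settings/mappings body when creating the index.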

philippemnoel
0 replies
1d1h

Elasticsearch is good, and it does scale, but it is much more cumbersome and expensive to scale and operate than Postgres. If you use the managed service, you'll pay for the operational pain in the form of higher pricing.

The Postgres movement is strong and extensions like ParadeDB https://github.com/paradedb/paradedb are designed specifically to solve this pain point (Disclaimer: I work for ParadeDB)

bytefish
0 replies
1d17h

GitLab is also using ElasticSearch, so one could recreate the ElasticSearch Indices they came up with. [1]

They also share some of the challenges they faced along the way, including interesting ones like implementing the authorization model. [2], [3]

When GitHub removed its most useful search feature, which is sorting results by date, I wrote a small "Search Engine" with ElasticSearch to selectively index Microsoft repositories. It works well enough for my needs. [4]

[1] https://gitlab.com/gitlab-org/gitlab/-/blob/7bbbc00bd871aeb6...

[2] https://about.gitlab.com/blog/2019/07/16/elasticsearch-updat...

[3] https://about.gitlab.com/blog/2020/04/28/elasticsearch-updat...

[4] https://github.com/bytefish/ElasticsearchCodeSearch

bawolff
0 replies
1d20h

Yeah, I agree, that is weird. Especially if you search for something super common like "function", you basically DoS it.

thesuperbigfrog
1 replies
1d19h

OpenGrok (https://github.com/oracle/opengrok) is a wonderful tool to search a codebase.

It runs on-prem and handles lots of popular programming languages.

AstralJaeger
0 replies
1d13h

I fully agree with you there. OpenGrok is a wonderful, outdated-looking-and-feeling but lightning fast code search engine!

nojvek
1 replies
1d7h

You can do to_tsvector with the plain ('simple') configuration and keep the strings intact. No lemmatization, no stemming.

We use plain tsvectors on a gin index and change the queries to allow prefix based searching. So “wo he” matches “hello world”.

Perhaps I should write a blog about it. Took me a few days to read PG documentation to get where we are at.

The only thing it doesn’t handle is typo tolerance.
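
A minimal sketch of that setup, assuming a vals(id, code) table and the node-postgres client; the table and column names are made up, and 'simple' is Postgres's unstemmed dictionary:

    import { Pool } from "pg"; // node-postgres; any Postgres client works

    const pool = new Pool();

    // One-time setup (e.g. in a migration): a GIN index over an unstemmed
    // ("simple") tsvector, so tokens are kept intact rather than stemmed.
    //   CREATE INDEX vals_fts ON vals USING GIN (to_tsvector('simple', code));

    // Turn a user query like "wo he" into the prefix tsquery "wo:* & he:*",
    // so it matches a document containing "hello world".
    function toPrefixQuery(input: string): string {
      return input
        .trim()
        .split(/\s+/)
        .map((t) => t.replace(/[^\w]/g, "")) // strip tsquery syntax characters
        .filter(Boolean)
        .map((t) => `${t}:*`)
        .join(" & ");
    }

    async function search(q: string) {
      const { rows } = await pool.query(
        `SELECT id, code
           FROM vals
          WHERE to_tsvector('simple', code) @@ to_tsquery('simple', $1)
          LIMIT 50`,
        [toPrefixQuery(q)]
      );
      return rows;
    }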

philippemnoel
0 replies
1d1h

Tsvector is amazing and it goes a long way, but unfortunately, as you say, it lacks some of the more complex FTS features like typo tolerance, language tokenizers, etc.

metalrain
1 replies
1d15h

I think you need to parse the code and build an AST to make search good. Even then, normalizing over different aliases may not be simple.

vladak
0 replies
1d10h

The question is what code. In preprocessed languages there can be lots of ifdefs and such for various environments and architectures.

johnthescott
1 replies
1d17h

The RUM index has worked well for us on roughly 1TB of PDFs. Written by postgrespro, the same folks who wrote core text search and JSON indexing. Not sure why RUM is not in core. We have no problems.

   https://github.com/postgrespro/rum

philippemnoel
0 replies
1d1h

RUM is good, but it lacks some of the more complex features like language tokenizers, etc. that a full search engine library like Lucene/Tantivy (and ParadeDB in Postgres) offer

herrington_d
1 replies
1d21h

Is it possible to combine n-grams and the AST to produce a better index?

Take `sourceCode.toString()` as an example: the AST can split it into `sourceCode` and `toString`. A further indexer can break `sourceCode` into `source` and `code`.

For ast dumping, project like https://github.com/ast-grep/ast-grep can help.

boyter
0 replies
1d19h

You could, but I don't know what you would gain out of it. The underlying index would be almost the same size, and n-grams would also allow you to search for `e.t`, for example, which you lose in this process.

healeycodes
1 replies
1d23h

When a val is deployed on val town, my understanding is that it's parsed/compiled. At that point, can you save the parts of the program that people might search for? Names of imports, functions, variables, comments, etc.

MH15
0 replies
1d20h

A val is just TypeScript, no? So unless they are also storing the AST, it would be JavaScript and that's it.

boyter
1 replies
1d20h

Code search is indeed hard. Stop words, stemming and such do rule out most off the shelf indexing solutions but you can usually turn them off. You can even get around the splitting issues of things like

    a.toString()
With some pre-processing of the content. However, where you really get into a world of pain is allowing someone to search for "ring" in the example. You can use partial term search (prefix, infix, or suffix), but this massively bloats the index and is slow to run.

The next thing you try is trigrams, and suddenly you have to deal with false positive matches. So you add a positional portion to your index, and all of a sudden the underlying index is larger than the content you are indexing.

It's good fun though. For those curious about it, I would also suggest reading posts by Michael Stapelberg https://michael.stapelberg.ch/posts/ who writes about Debian Code Search (which I believe he started), in addition to the other posts mentioned here. Shameless plug: I also write about this https://boyter.org/posts/how-i-built-my-own-index-for-search... where I go into some of the issues when building a custom index for searchcode.com

Oddly enough I think you can go a long way brute forcing the search if you don't do anything obviously wrong. For situations where you are only allowed to search a small portion of the content, say just your own (which looks applicable in this situation) that's what I would do. Adding an index is really only useful when you start searching at scale or you are getting semantic search out of it. For keywords which is what the article appears to be talking about, that's what I would be inclined to do.

sgift
0 replies
1d2h

The preprocessing that you need is (in Lucene nomenclature, but it's the same principle for search in general) an Analyzer made for code search: the component that knows how to prepare incoming plain text for storage in an index, plus the corresponding component for a search query. That's not different from analyzers for other languages (stemming sucks for almost everything but English). Thinking about it, the frontend of most compilers for a language could maybe make a pretty good Analyzer. It already knows the language-specific components and can split them into the parts it needs for further processing, which is basically what an analyzer does.

bch
1 replies
1d11h

Why am I not seeing anything here about ctags[0] or cscope[1]? Are they that out of fashion? cscope language comprehension appears limited to C/C++ and Java, but “ctags” (I think I use “uctags” atm) language support is quite broad and ubiquitous…

[0] https://en.wikipedia.org/wiki/Ctags

[1] https://en.wikipedia.org/wiki/Cscope

signa11
0 replies
1d11h

exactly THIS <sorry for shouting !> the only problem with `cscope` is that for modern c++ based code-bases it is woefully inadequate. for plain / vanilla c based code-bases f.e. linux-kernel etc. it is just _excellent_

language-servers using clangd/ccls/... are definitely useful, but quite resource heavy. for example, each of these tools seems to start new threads per file (!) and there are no knobs to not do that. i don't really understand this rationale at all. yes, i have seen this exact behavior with both clangd and ccls. oftentimes, the memory in these processes balloons to some godawful numbers (more with clangd than ccls), necessitating a kill.

moreover, this might be an unpopular opinion, but mixing any regex based tool (ripgrep/... come to mind) with a language-server, f.e. because the language server does not really find what you are looking for, or does not do that fast enough, is a major point against it. if you already have a language-server running, regex based tools should not be required at all.

i don't really understand the reason for sql'ization of code searches at all. it is not a 'natural' interface. typical usage is to see 'who calls this function', 'where is the definition at' of this function etc. etc.

skybrian
0 replies
1d23h

It seems like some of their gists have documentation attached and maybe that’s enough? I’m not sure I’m all that interested in seeing undocumented gists in search results.

ricardobeat
0 replies
1d21h

I don't understand their hand-waving of Zoekt. It was built exactly for this purpose, and is not a "new infrastructure commitment" any more than the other options. The server is a single binary, the indexer is also a single binary, can't get any simpler than that.

To me it doesn't make sense to be more scared of it than Elasticsearch...

reeyadalli
0 replies
1d10h

I have never actually given much thought to the difference between code search and searching normal "literature". Interesting read!!

philippemnoel
0 replies
1d21h

ParadeDB founder here. We'd love to be supported on Render, if the Render folks are open to it...

kermatt
0 replies
1d2h

Are any of the tools mentioned in these comments better suited to searching SQL code, both DML and DDL?

We maintain a tree of files with each object in a separate "CREATE TABLE|VIEW|PROCEDURE|FUNCTION" script. This supports code search with grep, but something that could find references to an object when the name qualifications are not uniform would be very useful:

    INSERT INTO table
    INSERT INTO schema.table
    INSERT INTO database.schema.table

Can all be done with regex, but search is not so easy for programmers new to expressions.
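
For illustration, here is one way to express that as a hedged TypeScript helper; the table name and identifier rules are simplified assumptions (no quoted identifiers, brackets, etc.):

    // Match a reference to one object regardless of how fully it is qualified:
    // "orders", "sales.orders", "prod.sales.orders". Assumes plain identifiers.
    function objectReferencePattern(table: string): RegExp {
      const ident = "[A-Za-z_][A-Za-z0-9_]*";
      // up to two optional qualifiers (database and schema), then the table name
      return new RegExp(`\\b(?:${ident}\\.){0,2}${table}\\b`, "i");
    }

    const pattern = objectReferencePattern("orders");
    console.log(pattern.test("INSERT INTO prod.sales.orders (id) VALUES (1)")); // true
    console.log(pattern.test("INSERT INTO orders (id) VALUES (1)"));            // true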

jessemhan
0 replies
1d21h

Good scalable codebase search is tough. We built a scalable, fast, and super simple solution for codebase semantic search: https://phorm.ai

ivanovm
0 replies
23h15m

One of the most interesting approaches to code search I've seen recently (no affiliation) https://github.com/pyjarrett/septum

The hardest part about getting code search right imo is grabbing the right amount of surrounding context, which septum is aimed at solving on a per-file basis.

Another one I'm surprised hasn't been mentioned is stack-graphs (https://github.com/github/stack-graphs), which tries to incrementally resolve symbolic relationships across the whole codebase. It powers github's cross-file precise indexing and conceptually makes a lot of sense, though I've struggled to get the open source version to work

hanwenn
0 replies
9h23m

Hi,

I wrote zoekt. From what I understand Val Town does, I would try brute force first (i.e. something equivalent to ripgrep). Once that starts breaking down, you could use last-updated timestamps to reduce the brute force:

* make a trigram index using Zoekt or Hound for JS snippets older than X
* do brute force on snippets newer than X
* advance X as you're indexing newer data

If the snippets are small, you can probably use a (trigram => snippets) index for space savings relative to a (trigram => offset) index.
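
A sketch of that split in TypeScript, with the index and snippet store as placeholders rather than real Zoekt/Hound APIs:

    // Hybrid search: old snippets come from a prebuilt trigram index,
    // snippets changed since the last index build are brute forced.
    interface Snippet {
      id: string;
      code: string;
      updatedAt: Date;
    }

    interface SnippetIndex {
      builtAt: Date; // "X": everything older than this is covered by the index
      search(query: string, limit: number): Promise<Snippet[]>;
    }

    async function hybridSearch(
      query: string,
      index: SnippetIndex,
      recentSnippets: Snippet[], // snippets with updatedAt >= index.builtAt
      limit = 50,
    ): Promise<Snippet[]> {
      const indexed = await index.search(query, limit);
      const bruteForced = recentSnippets.filter((s) => s.code.includes(query));

      // Prefer fresh results, then fill from the index, de-duplicating by id.
      const merged: Snippet[] = [];
      const seen = new Set<string>();
      for (const s of [...bruteForced, ...indexed]) {
        if (!seen.has(s.id)) {
          seen.add(s.id);
          merged.push(s);
        }
      }
      return merged.slice(0, limit);
    }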

ethanwillis
0 replies
1d13h

There are tools from bioinformatics that would be more applicable here for code search than the ones linguistics has made for searching natural language.

ectopasm83
0 replies
1d5h

Lemmatization: some search indexes are even fancy enough to substitute synonyms for more common words, so that you can search for “excellent” and get results for documents including “great.”

This isn't what lemmatization is about.

Stemming the word ‘Caring‘ would return ‘Car‘. Lemmatizing the word ‘Caring‘ would return ‘Care‘.

civilized
0 replies
1d16h

Is "hard" a bit of an overstatement for problems like "I'm using a library that mangles the query"? Couldn't you search for the literal text the user inputs? Maybe let them use regex?

chasil
0 replies
1d23h

Oracle has USER/ALL/DBA_SOURCE views, and all of the PL/SQL (SQL/PSM) code that has been loaded into the database is presented there. These are all cleartext visible unless they have been purposefully obfuscated.

It has columns for the owner, object name, LINE [NUMBER], and TEXT [VARCHAR2(4000)], and you can use LIKE or regexp_like() on any of the retained source code.

I wonder if EnterpriseDB implements these inside of Postgres, and/or if they are otherwise available as an extension.

Since most of SQL/PSM came from Oracle anyway, these would be an obvious desired feature.

https://en.wikipedia.org/wiki/SQL/PSM

campbel
0 replies
1d21h

Sourcegraph’s maintained fork of Zoekt is pretty cool, but is pretty fearfully niche and would be a big, new infrastructure commitment.

I don't think Zoekt is as scary as this article makes it out to be. I set this up at my current company after getting experience with it at Shopify, and it's really great.

amarshall
0 replies
1d14h

GitHub’s search is excellent

Is it? I find it near-useless most of the time, and cloning + ripgrep to be way more efficient. Perhaps the problem is more in the UX being awful than the actual search.

Macha
0 replies
1d21h

This is a pretty bad index: it has words that should be stop words, like function, and won’t split a.toString() into two tokens because . is not a default word boundary.

So GitHub used to (maybe still does) "fix" this one, and it's annoying. Although GitHub is ramping up its IDE-like find-usages, it's still not perfect, so sometimes you just want a text search equivalent for "foo.bar()" for all the uses it misses, and this stemming behaviour then finds every file where foo and bar are mentioned, which bloats results.

IshKebab
0 replies
1d23h

I would use Hound.

727564797069706
0 replies
1d21h

If you're serious about scaling up, definitely consider Vespa (https://vespa.ai).

At serious scale, Vespa will likely knock all the other options out of the park.