It indeed is hard, and a good code search platform makes life so much easier. If I ever leave Google, the internal code search is for sure the thing I'll miss the most. It's so well integrated into how everything else works (Blaze target finding, Guice bindings, etc.) that I can't imagine my life without it.
I'm reminded to appreciate it even more every time I use GitHub's search. Not that GitHub's is bad; it's just inherently much harder to build a generalized code search platform.
If you ever leave, you can use Livegrep [0], which was based on code-search work done at Google. I don't personally use it right now, but it's great and will probably meet all your needs.
[0] https://github.com/livegrep/livegrep
If I've learned anything from the fainting spells that I-work-at-X commenters have over their internal tools on HN, it's this: no, the public/OSS variant is always a mere shadow of the real thing.
I suspect you're being sarcastic - but can confirm that being nearly two years out of Amazon, I still miss its in-house CD system nearly every day. I've actively looked around for OSS replacements and very few come anywhere close.
(I would be _delighted_ for someone to "Umm actually" me by providing a great product!)
My experience has been that these in-house things do not adapt well to the high chaos of external environments: if there are 3 companies, you will find 9 systems and processes in use, making "one size fits all" a fantasy.
But I'll bite: what made the CD system so dreamy, and what have you evaluated thus far that fell short?
Amazon's internal tools for building code are _amazing_.
Brazil is their internal dependency management tool. It handles building and versioning software. It introduced the concept of version sets, which essentially let you group related software, e.g. version 1.0 of my app needs version 1.1 of library x and 2.0 of runtime y. This particular set of software versions gets its own version number.
Everything from CI/CD to the code review tool to your local builds uses the same build configuration with Brazil. All software packages in Brazil are built from source on Amazon's gigantic fleet of build servers. Builds are cached, so even though Amazon builds its own versions of Make, Java, etc., these are all built once, cached by the build servers, and downloaded.
A simple Java application at Amazon might have hundreds of dependencies (because you'll need to build Java from scratch), but since this is all cached you don't have to wait very long.
Lastly, you have Pipelines, their internal CI/CD tool, which integrates naturally with Brazil + the build fleet. It can deploy to their internal fleet with Apollo, or to AWS Lambda, S3 buckets, etc.
In all, everything is just very well integrated. I haven't seen anything come close to what you get internally at Amazon.
so what I'm hearing is that app-1.0 needs app-1.0-runtime-build-20240410, which was itself built from a base of runtime-y-2.0 with library-x-1.11 layered upon it, kind of like
and then, yadda, yadda, blue-green, incremental rollout <https://gitlab.com/gitlab-org/gitlab/-/blob/v16.10.2-ee/lib/...>, feature flags <https://docs.gitlab.com/ee/operations/feature_flags.html>, error capture <https://docs.gitlab.com/ee/operations/error_tracking.html#in...>, project-managed provisioning <https://docs.gitlab.com/ee/user/infrastructure/iac/#integrat...>, on call management <https://docs.gitlab.com/ee/operations/incident_management/>, on call runbooks <https://docs.gitlab.com/ee/user/project/clusters/runbooks/in...> and you can orchestrate all that from ~~Slack~~ Chime :-D if you're into that kind of thing https://docs.gitlab.com/ee/ci/chatops/
No, not even close. You might even have it exactly backwards.
which is why, as I originally asked GP: what have you already tried and what features were they missing
I presume by "exactly backwards" you mean that one should have absolutely zero knobs to influence anything because the Almighty Jeff Build System does all the things, which GitLab also supports but is less amusing to look at on an Internet forum because it's "you can't modify anything, it just works, trust me"
Or, you know, if you have something constructive to add to this discussion feel free to use more words than "lol, no"
I don't work at Amazon, and haven't for a long time, and this format is insufficient to fully express what they're doing, so I won't try.
You're better off searching for how Brazil and Apollo work.
That being said, the short of it is this: imagine that when you push a new revision to source control, you (you) can run jobs testing every potential consumer of that new revision. As in, you push libx-1.1.2 and anyone consuming libx >= 1.1 (or any variety of filters) is identified. If the tests succeed, you can update their dependencies on your package and even deploy them, safely and gradually, to production without involving the downstream teams at all. If they don't, you can choose your own adventure: pin them, fork, fix them, patch the package, revise the versioning, whatever you want.
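As a toy illustration of that consumer-matching step, here is a sketch using hypothetical service names and a simplified ">= major.minor" filter (the real system supports far richer filters):

```python
# Each consumer declares the minimum libx version it accepts,
# modeled here as an inclusive (major, minor) floor.
consumers = {
    "checkout-service": (1, 1),   # libx >= 1.1
    "search-service": (1, 0),     # libx >= 1.0
    "legacy-batch": (2, 0),       # libx >= 2.0
}

def affected_by(release: tuple[int, int]) -> list[str]:
    """Consumers whose version floor is satisfied by the new release."""
    return sorted(name for name, floor in consumers.items() if release >= floor)

# Pushing libx-1.1.x identifies every consumer accepting that version:
print(affected_by((1, 1)))  # ['checkout-service', 'search-service']
```

Those identified consumers are then the ones whose test jobs get run against the new revision before any dependency bump goes out.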
It's designed to be extremely safe and put power in the hands of those updating dependencies to do so safely within reason.
Imagine you work on a library and you can test your PR against every consumer.
It's not unlike what Google and other monorepo shops accomplish, but it's also quite different. You can have many live versions simultaneously. You don't have to slog it out and patch all the dependents -- maybe you should, but you have plenty of options.
It all feels very simple. I'm glossing over a lot.
Sorry, I wish I could phrase it better for you. All I can say is that I have tried a _lot_ of tools, and nothing has come close. Amazon has done a lot of work to make efficient tools.
Here's a better explanation: https://gist.github.com/terabyte/15a2d3d407285b8b5a0a7964dd6...
How did you avoid version hell? At Google, almost everything just shipped from master (except for some things that had more subtle bugs, those did their work on a dev branch and merged into master after testing).
Version sets take care of everything. A version set can be thought of as a Git repo with just one file. The file is just key/value pairs with the dependencies and major/minor version mappings, e.g.
    <Name> <Major>-<Minor>
    Java 8-123
    Lombok 1.12-456
    ...
A version set revision is essentially a git commit of that version set file. It's what determines exactly what software version you use when building/developing/deploying/etc.
Your pipeline (which is a specific noun at Amazon, not the general term) acts on a single version set. When you clone a repo, you have to choose which version set you want, when you deploy you have to choose a version set, etc.
Unlike most other dependency management systems, there's no notion of a "version of a package" without choosing what version set you're working on, which can choose the minor versions of _all of the packages you're using_.
e.g. imagine you clone a Node project with all of its dependencies. Each dependency will have a package.json file declaring what versions it needs. The version set is _additional_ metadata that goes a step further and chooses the exact minor version each major version is mapped to.
All that to say that the package can declare what major version they depend on, but not what minor version. The version set that you're using determines what minor version is used. The package determines the major version.
Version sets can only have one minor version per major version of a package which prevents consistency issues.
e.g. I can have Java 8-123 and Java 11-123 in my version set, but I cannot have Java 8-123 and Java 8-456 in my version set.
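A rough model of that resolution rule -- packages pin majors, the version set pins exactly one minor per major -- using hypothetical data in the <Name> <Major>-<Minor> shape from above:

```python
# A version set: for each (package, major), exactly one minor.
version_set = {
    ("Java", "8"): "123",
    ("Java", "11"): "123",
    ("Lombok", "1.12"): "456",
}

def add_mapping(vset: dict, package: str, major: str, minor: str) -> None:
    """Adding Java 8-456 alongside Java 8-123 is rejected: one minor per major."""
    existing = vset.get((package, major))
    if existing is not None and existing != minor:
        raise ValueError(f"{package} {major} already pinned to minor {existing}")
    vset[(package, major)] = minor

def resolve(vset: dict, package: str, major: str) -> str:
    """A package declares only its major; the version set supplies the minor."""
    return f"{package} {major}-{vset[(package, major)]}"

print(resolve(version_set, "Java", "8"))  # Java 8-123
```

So Java 8-123 and Java 11-123 can coexist (different majors), while a second minor for Java 8 raises an error, which is the consistency guarantee described above.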
Your pipeline will automatically build new minor versions from upstream into your version set. If the build fails, then someone needs to do something. Every commit produces a new minor version of a package; that is to say, you can say your package is major version X, but the minor version is left up to Brazil.
This scheme actually works pretty well. There are internal tools (Gordian Knot) that perform analysis on your dependencies to make sure they are correct.
It's a lot to know. It took me a year or so to fully understand and appreciate. Most engineers at Amazon treat it like they do Git -- learn the things you need to and ignore the rest. For the most part, this stuff is all hands off, you just need one person on the team keeping everything correct.
That actually sounds brilliant. Someone decided to brush less of the version stuff under the carpet.
You don't, you embrace version hell.
AWS has services around pipelines and deploys, right?
NixOS/nixpkgs is about the closest thing you'll find in the wild. You have to squint a bit, I'll admit.
Is it true that teams don't do branches in source control at all? Just publishing a CR?
I think the issue is, nobody would be willing to pay for a good solution, since they usually come with a steep maintenance cost. I wouldn't be surprised if the in-house CD team at Amazon were putting out fires every week or month behind the scenes.
Just left Google a few months ago.
My take is that there's a difference between a company that is willing to invest money into EngProd endeavors, and a company that uses SaaS for everything. While I can understand that most companies don't have the financial means to invest heavily into EngProd, the outcome is that the tightly integrated development experience in the former is far superior. Code Search is definitely #2 on the list of things I miss the most.
What's #1? Memegen?
The business end of gThanks
Cheesy answer but, the people.
It's not just that. Livegrep isn't just a pale imitation of something inside Google. It's totally unrelated in implementation, capabilities, and use case.
You would be better off actually trying to understand those sentiments instead of posting sarcastic replies on HN.
The sort of tight-knit integration and developer focus that internal tools at developer-friendly companies like Google have cannot be matched by cobbling together 50 different SaaS products, half of which will probably run out of funding in 3 years.
You literally have entire companies just LARPing internal tools that Google has because they are just that good. Glean is literally Moma. There's really nothing like Critique or Buganizer.
I've used both Code Search and Livegrep. No, Livegrep does not even come close to what Code Search can do.
Sourcegraph is the closest thing I know of.
Is there like a summary of what's missing from public attempts and what makes it so much better?
The short answer is context. The reason Google's internal code search is so good is that it is tied into their build system. This means that when you search, you know exactly what files to consider. Without context, you are making an educated guess as to what files to consider.
How exactly does integration with the build system help Google? Maybe you could give a specific example?
Try clicking around https://source.chromium.org/chromium/chromium/src, which is built with Kythe (I believe, or perhaps it's using something internal to Google that Kythe is the open source version of).
By hooking into C++ compilation, Kythe is giving you things like _macro-aware_ navigation. Instead of trying to process raw source text off to the side, it's using the same data the compiler used to compile the code in the first place. So things like cross-references are "perfect", with no false positives in the results: Kythe knows the difference between two symbols in two different source files with the same name, whereas a search engine naively indexing source text, or even something with limited semantic knowledge like tree sitter, cannot perfectly make the distinction.
Yes, the clicking around on semantic links on source.chromium.org is served off of an index built by the Kythe team at Google.
The internal Kythe has some interesting bits (mostly around scaling) that aren't open sourced, but it's probably doable to run something on chromium scale without too much of that.
The grep/search box up top is a different index, maintained by a different team.
If you want to build a product with a build system, you need to tell it what source to include. With this information, you know what files to consider, and if you are dealing with a statically typed language like C or C++, you have build artifacts that can tell you where the implementation was defined. All of this takes the guesswork out of answering questions like "Which foo() implementation was used?"
If all you know are repo branches, the best you can do is return matches from different repo branches with the hopes that one of them is right.
Edit: I should also add that with a build system, you know what version of a file to use.
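In toy form, the difference between grepping everything and searching only the sources the build actually consumed might look like this (file paths and structures are hypothetical):

```python
# All files matching a text query across every repo and branch...
all_matches = {
    "repoA/foo.cc",          # the foo() the binary actually links
    "repoB/old/foo.cc",      # a stale copy on another branch
    "repoC/vendored/foo.cc", # a vendored fork nobody builds anymore
}

# ...versus the set of sources the build system says went into this target.
build_inputs = {"repoA/foo.cc", "repoA/bar.cc"}

def scoped_search(matches: set[str], inputs: set[str]) -> set[str]:
    """Build context narrows 'grep guesses' down to the files that matter."""
    return matches & inputs

print(scoped_search(all_matches, build_inputs))  # {'repoA/foo.cc'}
```

Without the build_inputs set, the best a tool can do is return all three matches and hope one of them is right.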
Google builds all the code in its monorepo continuously, and the built artifacts are available to the search. Open source tools are never going to incur the cost of actually building all the code they index.
The short summary is: It's a suite of stuff that someone actually thought about making work together well, instead of a random assortment of pieces that, with tons of work, might be able to be cobbled together into a working system.
All the answers about the technical details or better/worseness mostly miss the point entirely - the public stuff doesn't work as well because it's 1000 providers who produce 1000 pieces that trade integration flexibility for product coherence. On purpose mind you, because it's hard to survive in business (or attract open source users if that's your thing) otherwise.
If you are trying to do something like make "code review" and "code search" work together well, it's a lot easier to build a coherent, easy to use system that feels good to a user if you are trying to make two things total work together, and the product management directly talks to each other.
Most open source doesn't have product management to begin with, and the corporate stuff often does but that's just one provider.
They also have a matrix of, generously, 10-20 tools with meaningful marketshare they might need to try to work with.
So if you are a code search provider trying to make a code search tool integrate well with any of the top 20 code review tools, well, good luck.
Sometimes people come along and do a good enough job abstracting a problem that you can make this work (LSP is a good example), but it's pretty rare.
Now try it with "discover, search, edit, build, test, release, deploy, debug", etc. Once you are talking about 10x10x10x10x10x10x10x10 combinations of possible tools, with nobody who gets to decide which combinations are the well-lit path, ...
Also, when you work somewhere like Google or Amazon, it's not just that someone made those specific things work really well together, but often, they have both data and insight into where you get stuck overall in the dev process and why (so they can fix it).
At a place like Google, I can actually tell you all the paths that people take when trying to achieve a journey. So that means I know all the loops (counts, times, etc) through development tools that start with something like "user opens their editor". Whether that's "open editor, make change, build, test, review, submit" or "open editor, make change, go to lunch", or "open editor, go look at docs, go back to editor, go back to docs, etc".
So I have real answers to something like "how often do people start in their IDE, discover they can't figure out how to do X, leave the IDE to go find the answer, not find it, give up, and go to lunch". I can tell you what the top X where that happens are, and how much time is or is not wasted through this path, etc.
Just as an example. I can then use all of this to improve the tooling so users can get more done.
You will not find this in most public tooling, and to the degree telemetry exists that you could generate for your own use, nobody thinks about how all that telemetry works together.
Now, mind you, all the above is meant as an explanation -- I'm trying to explain why the public attempts don't end up as "good". But for me, good/bad is all about what you value.
Most tradeoffs here were deliberate.
But they are tradeoffs.
Some people value the flexibility more than coherence, or whatever. I'm not gonna judge them, but I can explain why you can't have it all :)
I see most replies here are mentioning that build integration is what is mainly missing in the public tools. I wonder if Nix and nixpkgs could be used here? Nix is a language-agnostic build system, and with nixpkgs it has build instructions for a massive number of packages. Artifacts for all packages are also available via Hydra.
Nix should also have enough context so that for any project it can get the source code of all dependencies and (optionally) all build-time dependencies.
Build integration is not the main thing that is missing between Livegrep and Code Search. The main thing that is missing is the semantic index. Kythe knows the difference between this::fn(int) and this::fn(double) and that::fn(double) and so on. So you can find all the callers of the nullary constructor of some class, without false positives of the callers of the copy constructor or the move constructor. Livegrep simply doesn't have that ability at all. Livegrep is what it says it is on the box: grep.
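A toy contrast between a text index and a semantic one keyed on resolved signatures (hypothetical file locations; Kythe's actual graph is far richer than a dict):

```python
# A text index: one bucket per identifier, so all overloads and all
# same-named symbols collapse into a single result list.
text_index = {
    "fn": ["a.cc:10", "a.cc:20", "b.cc:5"],
}

# A semantic index: one bucket per fully resolved signature, so
# overloads and symbols from different scopes stay distinct.
semantic_index = {
    "this::fn(int)": ["a.cc:10"],
    "this::fn(double)": ["a.cc:20"],
    "that::fn(double)": ["b.cc:5"],
}

# grep-style lookup mixes all three symbols together:
print(text_index["fn"])                    # ['a.cc:10', 'a.cc:20', 'b.cc:5']

# semantic lookup returns exactly one symbol's call sites, no false positives:
print(semantic_index["this::fn(double)"])  # ['a.cc:20']
```

Building the semantic side is the expensive part: you need the compiler's own resolution of every reference, which is why it requires build integration rather than text scanning.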
The build system coherence provided by a monorepo with a single build system is what makes you understand this::fn(double) as a single thing. Otherwise, you will get N different mostly compatible but subtly different flavors of entities depending on the build flavor, combinations of versioned dependencies, and other things.
Sure. Also, if you eat a bunch of glass, you will get a stomach ache. I have no idea why anyone uses a polyrepo.
The problem with monorepos is that they're so great that everyone has a few.
God that is good.
Nix builds suck for development because there is no incrementality. If any source file changes in any way, your typical Nix flake will rebuild the project from scratch. At best, you get to reuse builds of dependencies.
Agreed. There are some public building blocks available (e.g. Kythe or meta's Glean) but having something generic that produces the kind of experience you can get on cs.chromium.org seems impossible. You need such bespoke build integration across an entire organization to get there.
Basic text search, as opposed to navigation, is all you'll get from anything out of the box.
In a past job I built a Code Search clone on top of Kythe, Zoekt, and LSP (for languages that didn't have Bazel integration). I got help from another colleague to build the UI on Monaco. We created a demo that many people loved, but we didn't productionize it for a few reasons (it was an unfunded hackathon project, and the company was considering another solution when it already had Livegrep).
Producing the Kythe graph from the Bazel artifacts was the most expensive part.
Working with Kythe is also not easy, as there is no documentation on how to run it at scale.
Very cool. I tried to do things with Kythe at $JOB in the past, but gave up because the build (really, the many many independent builds) precluded any really useful integration.
I did end up making a nice UI for vanilla Zoekt, as I mentioned elsewhere: https://github.com/isker/neogrok.
Just want to note that Livegrep, its antecedent "codesearch", and other things that are basically grep bear no resemblance to that which a person working at Google calls "Code Search".
The Guice bindings layer thing is nice, but its UI could be improved. I wish I could directly search for providers/usages from the search box.