GitHub: Can no longer search code without being logged in

I'm ready to believe the most charitable explanation for this: the new code search (which I have found to be exceptionally good) has a lot more going on under the hood than a normal search engine, which makes it a lot more resource-intensive - limiting it to signed-in accounts saves a huge amount of server resources that would otherwise be used to serve crawlers.

My guess is that the trade-off here is genuinely a question of whether they spend 3-4x (I'm guessing, but I think this is likely a low-ball estimate) on their search infrastructure vs. having people angry at them for requiring login to use the feature.

I actually built my own code search tool for use with GitHub repos, but I've mostly stopped using that because the new GitHub code search is so useful by default: https://simonwillison.net/2020/Nov/28/datasette-ripgrep/

That's interesting, but the technical reasons are really beside the point.

The end result is that it's no longer possible to participate in "GitHub hosted open source" without actually being a member of GitHub.

You can still clone and grep, can't you? You can't file issues or submit PRs through Github without an account, either.

You're kind of making my point: one needs to take the code elsewhere to be able to browse it.

In other words, GitHub is no longer suitable to host open source repositories because of all the closed "extras" it adds on top.

Are you suggesting that there's a place that you can file a PR (or equivalent) without an account?

They're shady AF, but some features require an account not because they're moat water but for some other reason.

There is actually such a place! The Linux kernel (the original use case for Git) is developed by emailing patches to the appropriate mailing list. The mailing list is compatible with self-hosted mail systems.

But you still need an account: an email one, in this case.

You don't need an account to read the list.

http://vger.kernel.org/vger-lists.html

Unless you count your ISP of course.

You don't need an account to browse GitHub PRs or issues either.

You don't need an account to send email. You can send emails using nothing but a CLI tool on a *nix machine. Although most mainstream email services will send those to the spam folder these days.

I thought damn near every ISP blocks port 25 outgoing. I even had to open a ticket to get it unblocked on a vps I use.

probably for the best.

You need an e-mail for GitHub as well, but in addition a GitHub account

I'm really focusing on the search feature that's no longer available for a given repository unless you're logged in. Everything else is "Microsoft added extras to git". If all I can do on GitHub is use it to git clone, then what's the purpose of it?

It's clear the "extinguish" phase of Microsoft's open source venture has started. There're plenty of very good alternatives, it's time to move on.

How does providing more features to some people make the core service unusable for "open source"?

This is providing fewer features to libre software users. I hope everyone migrates to local SourceHut servers when possible.

Do you mean "git clone"? I thought the idea about git is that you always take the code into your local machine before doing anything. Sure, github gives some bells and whistles to obviate the need from time to time, but that's basically decorations.

And can you search a git repo on git://my-domain.example.com? Or can you browse that without cloning? And even cgit, can you search that?

Are these now also no longer "open source"?

No.

None of this is somehow a core requirement of "open source".

It's never been possible to participate in GitHub-hosted open source without an account...

I have a GitHub account. Until recently most of my GitHub participation was done without being logged in.

What do you do that doesn't require authentication?

Read code and issues mostly.

It's rather pedantic, but I wouldn't call that participating. That's more in the realm of observation.

An account has always been required for active participation, e.g. committing directly to any repo, submitting a PR, opening an issue, etc.

Yes, and I use an account for those. But I value not having to use an account for any work before that, which is just as crucial.

You can still do that without being logged in; that's also not participating in open source. It's more observing open source than anything.

These are prerequisites to what I suspect you have in mind when you consider open source participation. Of which I do a decent amount, logged in of course, but this comes first.

right, but being a member isn't a pay-wall. This feels more like an academic argument than a practical one.

Becoming a member means sharing your personal information with GitHub and abiding by their terms of service. It also implies that they will then track and profile locations where you sign in, technologies you use to interact with their services etc. That's a lot to give for "I want to browse a piece of code someone made for free".

And it's what they get in return for hosting and providing indexing and a git server.

I'm not saying it's a deal that makes sense for everyone. But it's certainly a barter trade and is well within their prerogative to offer.

The only bit of "personal information" you need to provide is an email, and you're free to use a burner.

That's a lot to give for "I want to browse a piece of code someone made for free".

You can still do this. All repositories are accessible without signing in.

Facebook recently estimated that knowing about a user costs close to €11/month. That's a lot of money for hosting and indexing git repositories... I guess we need an alternative.

And it's EXACTLY the same on Codeberg, and GitLab, and SourceHut, and anything else online. Some of these don't even have a "search" feature.

This is just degrading into conspiratorial FUD.

Giving them your personal information is a form of payment. I don't mean this in a pedantic or academic sense, but a very real one.

Not at all. FOSS should be built on FOSS tools.

Agreed. If this is indeed a resource issue and they want to just block crawlers/bots/anonymous users from using lots of resources then perhaps having the old "less resource intense search" for users who aren't logged in and having the new "more resource intense search" for users who are logged in would be an improvement.

Ideally yes but the old code search may also require costly infra, like building specialized indexes of the codebase (eg an elasticsearch index).

Since those are "fixed costs" per repo rather than per search, they'd now be much more expensive per search if they were only used by logged-out users.

Pure outside speculation of course.

The kinds of people averse to user accounts wouldn't be using ("participating in") Github in the first place.

Or to put it another way: It's a surprise you need a Github account to use ("participate in") Github?

The most realistic explanation is that people will get fed up and register an account, and GitHub can brag about how many new users they brought in.

The other side is that existing users will get pissed at GitHub, because they can't even search anymore without logging in, and sometimes that's a pain (not their PC, public PC, incognito tab, the time needed to do the 2FA, etc.).

Github can still keep the cheap, fast basic search for users not logged in, but they didn't.

people will get fed up and register an account, and github can brag about how many new users they brought in

This makes no sense. The number of people who use GitHub code search but don’t already have a GitHub account is surely negligible

it's about having you logged in for the day.

A logged-in user ad impression is worth 8 to 12 USD. An anonymous user is worth .01 cent per thousand.

I don’t see how that relates to my comment

I believe the previous poster means that there are people who have an account but aren't logged in all the time.

I honestly don’t understand why. On my computer I log in once and that’s it. They don’t automatically log you out. I haven’t logged into GitHub since setting up my computer.

There are ads on GitHub?

Keeping the old search running for logged out users has substantial costs too:

1: You have to maintain two full independent search and indexing implementations

2: User confusion. Why are the search results different depending on if I'm logged in or logged out?

I agree with you, but I also feel like you're casting pearls before swine. Some people will always have an incredible amount of entitlement to justify their laziness or incompetence.

I mean, the person opening the issue escalated this into socioeconomic issues just because they don't know how to use grep... What else is there to say?

I mean, the person opening the issue escalated this into socioeconomic issues just because they don't know how to use grep... What else is there to say?

I suppose one could try to see the value in the argument and engage in good faith discussion but it's easier to flippantly dismiss people like this, I imagine.

Sometimes you have to be flippant when engaging with trolls.

There wasn't enough thought behind the issue, other than being outraged at their unsatisfied entitlement

In reality, how many people are regularly using github search without having a github account? Can this change really be expected to bring in a meaningful number of users?

Trying to think this through, my own best guess for what is going on is that there is some amount of traffic coming from bots/scripts and github would like to have all queries associated with an account so that they can block accounts? I'm not sure if this makes sense, but I think it's a reasonable possibility.

most open source projects migrated to gh when they pledged it would always be free and open for open source...

now under new ownership they closed search and "pray they don't alter the terms further"

PS: also, everyone here is guessing wrong. Microsoft is an advertising company. Requiring your profile for a service is self-explanatory.

Is it really an advertising company?

Seems to be around 5% of revenue, from a quick search.

Not that I want to defend Microsoft.

I’m not saying you’re wrong, if anything it makes their decision even more ridiculous, but GitHub has completely saturated the market. Other than pissing people off what do they really benefit from this?

GitHub has completely saturated the market

That's an assumption which github may not agree with.

If you really care about being able to search github without logging in https://grep.app/ is pretty good, I would often use it instead of the old github search cause I found the results to be better.

o wow, the KPI of user acquisition should be through the roof with login required

the most realistic one is to prevent other scrapers from building their own tools on top of github code search.

by putting login, they can throttle or block it entirely.

This irks me to no end.

They justify it with "engagement" and it kinda works to a point, since if you don't get pissed with the broken search and register an account, you're not considered a "user", and if you try to use the tool but don't want to register you're not "engaging" with it, I guess... But it makes the initial contact that much more miserable.

I see lots of reasons being proposed, but my guess is that they want or need to avoid content being scraped by AI related companies and researchers.

I just assume this to be the case for all future decisions like this. Surely the Reddit API changes were at least partly from this? Reddit is one of the richest sources of (pretty) authentic human interaction that humanity has ever made. Every AI company and startup was going to mercilessly mine this forever. I'm sure Reddit would much rather sell (our) interactions instead.

Does that mean LLMs are going to speak like redditors? Oh the humanity.

They already do. :)

https://www.reddit.com/r/ChatGPT/comments/10j531u/i_made_cha...

As someone who contributes a decent amount of OSS code on GitHub, I want my code scraped by AI related companies. If GitHub starts making opinionated decisions about how to “protect” my code, then I won’t like that.

If GitHub starts making opinionated decisions about how to “protect” my code . . .

They already did.

. . . then I won’t like that.

There is no universe in the multiverse where Microsoft or one of its subsidiaries gives a single flying fuck. Something about those who forget history being doomed to repeat it . . . .

A scraper won't be using search. Only crawling links. If they required login to browse a repository that would stop bots but also cause a 1000x greater riot.

It's not their code to "protect". The license of each project dictates what can and can't be done with it.

And also, what is stopping the scrapers from creating a bunch of accounts to scrape with, and rotating them once they get banned?

If they are doing this intentionally, then fuck Microsoft. Bunch of assholes as always. If this is mere incompetence, then again I'm not surprised that they fumbled such a simple task. Grep is literally 50 years old. They don't even have to write any code to do text search, they could use free software if they love it so much. This isn't some amateur startup with a CRUD app and 3 devs. This is one of the largest and wealthiest tech companies in the world, and they can't even get text search to work without a login? What a bunch of losers.

If there is a 3rd explanation, I'd love to hear it.

https://github.blog/2023-02-06-the-technology-behind-githubs...

Where in that post does it say that the user not being logged in is making it difficult/impossible for them to return search results?

If they are doing this intentionally, then fuck Microsoft. Bunch of assholes as always.

They explicitly said that they "had" to require being logged in, so, yeah: https://github.com/orgs/community/discussions/77046#discussi... Although they do assure us that they're totally sorry for the inconvenience.

You are correct, but I was just covering all the possibilities since I never trust what they say either.

It's likely to avoid crawling, especially to train competitors to GitHub Copilot or other LLMs.

Why can’t you train on the git repository itself? Clones are not behind a login.

The way AIs are evolving, I wouldn’t be surprised if they have the ability to go to GitHub and search on demand based on a user’s query.

How do you discover the repository, especially in an automated fashion?

This seems like the most probable reason of any I’ve read so far.

Aside from performance concerns, there have been many occurrences of bad actors using code search to find hardcoded credentials, and then using that to gain unintended access to additional systems.

I suspect the change has more to do with security concerns (and having an audit trail) than performance.

I’d like to see these many occurrences.

I think the solution to hard coded credentials isn’t making search harder (security through obscurity == bad) but the other things GitHub is doing to detect and mitigate hard coded credentials.

Search is still possible using google and other methods, so any theoretical gains from forcing login are dangerous to count on as vulnerable projects are still vulnerable.

If anything, I think the solution to projects with hard coded credentials is to make the credentials easier to find and exploit so they get fixed more quickly after being created. The most dangerous are the ones that are hard to find, so someone uses them for long periods of time without detection.

I’d like to see these many occurrences.

Some high-profile cases where credentials were leaked on public GitHub: Uber in 2014 and 2021 [1, 2] and Twitch in 2021 [3].

Search is still possible using google and other methods

Yes, you can search with Google or other sources. But the thing is, pretty much only GitHub has easy access to all the code present there, readily available to search within minutes of pushing.

You could try mirroring GitHub yourself, but you'd need enormous disk space and bandwidth, and would quickly hit rate limits. You also wouldn't be able to do fast full regex search like you can with GitHub's search, as you don't have their search infrastructure.

Hackers are aware of this and do make use of GitHub's search to identify possible leaked credentials. I recently experimented with uploading a dummy AWS credential pair to a public Git repo, and saw that numerous IPs started trying those credentials less than 5 minutes after I pushed.

any theoretical gains from forcing login are dangerous to count on as vulnerable projects are still vulnerable

Indeed, requiring login is not going to _solve_ the problem. But it could lessen the impact of leaked credentials at large scale, by making it more difficult for automated systems to harvest them.

Perhaps more significantly, requiring login could give better audit trails in incident response situations, as the logs would indicate which accounts were searching for secrets.

I think the solution to projects with hard coded credentials is to make the credentials easier to find and exploit so they get fixed more quickly after being created

Yes, this can help! There are several other companies that specialize in secret detection. I've also written Nosey Parker, a fast regex-based detection tool that has higher-precision rules than similar tools [4]. GitHub also has its own offering in Advanced Security to address this problem.

[1] https://www.reuters.com/article/uk-uber-tech-lyft-hacking-ex...
[2] https://www.securonix.com/blog/securonix-threat-research-ube...
[3] https://news.ycombinator.com/item?id=28770590
[4] https://github.com/praetorian-inc/noseyparker
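To make the regex-based detection idea concrete, here is a minimal sketch in Python. It is not how Nosey Parker or GitHub's scanning actually works. The pattern is an assumption based on the commonly described shape of AWS access key IDs ("AKIA" followed by 16 uppercase letters or digits), and `find_candidate_keys` is a hypothetical helper name.

```python
import re

# Assumption: AWS access key IDs look like "AKIA" plus 16 uppercase
# letters or digits. Real tools use many more rules than this one.
AWS_KEY_ID = re.compile(r"\b(AKIA[0-9A-Z]{16})\b")


def find_candidate_keys(text):
    """Return (line_number, match) pairs for strings shaped like AWS key IDs."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in AWS_KEY_ID.finditer(line):
            hits.append((lineno, match.group(1)))
    return hits
```

A match only means "this looks like a key ID"; it says nothing about whether the credential is live, which is why a single pattern like this tends to produce false positives.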

Bad actors are hardly deterred by the requirement to be logged in to be able to search.

Yes, you are correct. But requiring that one be logged in to use the search functionality makes a stronger audit trail possible: looking at the search logs would indicate who has been hunting for secrets!

As for their code search, under the hood it's just Elasticsearch, nothing special. Github's explanation sounds like mostly bullshit. Forced authentication of public endpoints is not an appropriate solution for the issue they're claiming. People building bots who actually care about this endpoint can just have their bots log in with free accounts. This doesn't actually prevent load on their servers in a meaningful way.

When they’re logged in, they are much easier to throttle or even ban compared to IP hopping to dodge IP throttling/bans.
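As an illustration of why an account is an easier handle than an IP address, here is a minimal token-bucket sketch keyed by account name. It is only a sketch under stated assumptions, not GitHub's implementation: the class name `AccountThrottle`, the `rate` and `burst` parameters, and the `ban` method are all hypothetical.

```python
import time


class AccountThrottle:
    """Token bucket per account: `rate` searches per second, bursts up to `burst`."""

    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate = rate
        self.burst = burst
        self.clock = clock      # injectable so the behaviour can be tested
        self.buckets = {}       # account -> (tokens_left, last_seen_time)
        self.banned = set()

    def ban(self, account):
        """Block an account outright, regardless of remaining tokens."""
        self.banned.add(account)

    def allow(self, account):
        """Return True if this account may run one more search right now."""
        if account in self.banned:
            return False
        now = self.clock()
        tokens, last = self.buckets.get(account, (self.burst, now))
        # Refill in proportion to elapsed time, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens < 1:
            self.buckets[account] = (tokens, now)
            return False
        self.buckets[account] = (tokens - 1, now)
        return True
```

The same structure keyed by IP address is what an attacker dodges by hopping addresses; keyed by account, dodging it means creating new accounts instead.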

GitHub's code search is no longer using Elasticsearch, it is using an entirely in house (Rust-based) search engine nicknamed Blackbird: https://github.blog/2023-02-06-the-technology-behind-githubs...

bots login with free accounts

those accounts still need to be created, which is hard to automate

also, now GH can just ban more easily detectable accounts that scrape, vs anonymous visitors

Probably code search was misused by bots to search secrets.

Is there any reason why GitHub can't put their code search behind something like Cloudflare Turnstile? If HaveIBeenPwned can do this, why not GitHub? https://www.troyhunt.com/fighting-api-bots-with-cloudflares-...

Yeah but that’s been the case since search existed. It seems unlikely that somehow now that’s become an urgent problem or requiring login will fix anything.

Might also help with thwarting easy access to tokens/keys... especially for those that are not currently in their filtered list.

While a bad actor can certainly scrape/clone the same data... Given repos with thousands of files... Scanning by scraping vs searching vs cloning.... Search is certainly the cheapest for the attacker while the most expensive for GitHub

That could be mostly solved by requiring login only for searches across all repos. Searches in a single repo could be anonymous.

In fact, that's exactly how it was until in June GitHub started requiring login for searches even in just a single repo.

You'd think, with all that work going on under the hood, that they'd be able to support searches for repos by multiple paths:

Give me repos with project.toml and a flake.nix at the root

The interface supports it, but searches come back empty. I end up leaving scripts running while I sleep which walk search results and narrow them by just checking to see if the files are there.

If you're out there Microsoft, please fix your search so I can stop being a bad citizen.
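The narrowing step described above (walk the search results, then check that the files are really there) can be sketched roughly as follows. It assumes a flat list of file paths has already been fetched for each candidate repository by whatever means; the `listings` mapping and both function names are made up for illustration.

```python
def has_root_files(paths, required):
    """True if every name in `required` appears at the top level of `paths`."""
    # A path without a slash is an entry in the repository root.
    root_entries = {p for p in paths if "/" not in p}
    return set(required) <= root_entries


def filter_repos(listings, required):
    """Return the sorted names of repos whose root contains all `required` files."""
    return sorted(
        name for name, paths in listings.items()
        if has_root_files(paths, required)
    )
```

For example, `filter_repos(listings, ["project.toml", "flake.nix"])` keeps only repositories that have both files at the root and drops ones where `flake.nix` lives in a subdirectory.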

The new search is still so useless that I just maintain a full local clone of our org's code. Someone was requesting perms to run a script to search across the org today -- it would've taken 10+ minutes to run while respecting undocumented secondary rate limits (which there's no way to check via headers, by the way). I rewrote it to use ripgrep locally, and it's down to 20 seconds.
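For anyone without ripgrep at hand, the local-clone approach can be approximated in plain Python. This is a rough sketch, not the script mentioned above, and it will be much slower than ripgrep; `search_tree` is a hypothetical helper name.

```python
import os
import re


def search_tree(root, pattern):
    """Yield (path, line_number, line) for each line under `root` matching `pattern`."""
    regex = re.compile(pattern)
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip Git metadata so only working-tree files are searched.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8", errors="ignore") as handle:
                    for lineno, line in enumerate(handle, start=1):
                        if regex.search(line):
                            yield path, lineno, line.rstrip("\n")
            except OSError:
                continue  # unreadable file; move on to the next one
```

Pointing `root` at a directory holding all of an org's clones gives a cross-repo search with no API calls and no rate limits, at the cost of keeping the clones up to date yourself.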

If the new code search is more resource intensive, then why not have a slimmed down, less resource intensive version for when users aren't logged in?

This may involve running two different sets of (for example, ElasticSearch) clusters, which doesn’t justify itself from a business perspective, especially given that non-logged in users can still clone code and use local tooling for search.

If that's true then they should still provide the old non-heavy search for non-logged in users.

We have been running the Internet without requiring user accounts. Search was possible. Has something changed? Has GitHub been having real problems with bots? No?

If that's the case, they should really fall back on a grep of the codebase because whatever is "under the hood" sucks. It routinely returns no results for exact string matches in code. These days I clone the repo and use ripgrep instead of codesearch because it's worse than useless, it's wrong.

Check out the Chromium code search. It has regex support, and doesn't require login. Although it is smaller than all of GitHub.

Disclosure: I work at Google.

https://source.chromium.org/chromium

I suppose maintaining two systems adds overhead too, but they could've let you use "basic search" without login, with the improvements login-gated.

Why can't they just use the old approach for anyone and the good stuff for logged-in users, though?

They were already rate limiting unauthenticated users.

So we are expected to believe that a hostile user who has a zillion IP addresses (to get around the rate limits) won't also be able to make a handful of accounts?

Part of Microsoft's AI moat is about controlling the competition's ability to use information on GitHub.

I would not be surprised if they start restricting git clone next.

Could you still call code open source (in line with the repo’s license) if the code is gated behind a login, or even a paid subscription?

IIRC GitHub is already rate limited to slow down excessive cloning or API usage.

Yes. You can put a piece of code behind a paywall even. The only requirement under GPL and similar copyleft licenses is to make the code available to those you make the software available to (which can be only paying users, if you so wish), and to allow them to redistribute it under the same terms (which tends to mean that if you do make it pay-only, one of your users can just publish the code if they wish). Absolutely nothing in any commonly used open source license requires anyone to post something publicly and freely.

The user can ask for the code without paying any fees.

Yes (though the GPL does allow you to make the code available via post and charge a reasonable postage fee), if they're a user in the first place. You don't need to give the code to anyone you didn't distribute the software to.

Via postal mail, but today, with ubiquitous internet connections, that makes no sense.

I don't see a real issue with forcing you to make a free account and log in. After all it's also permissible to only offer source code in the form of shipping you a CD in return for postage fees. If people don't like how the code is provided they are free to upload it somewhere else.

A paid subscription could be crossing the line. Or maybe it would be fine as long as no profits go to the entity that provided you with the binary. Hard to tell.

It requires accepting additional terms, seems pretty likely to be a breach of the obligation to provide the sources (if it's the only way to obtain them).

Yes. Open source doesn't require that you distribute the software to anyone. It only requires that the people you do distribute it to (if any) all have access to the source code under an open source license.

If the binaries are publicly available anywhere the sources need to be as well (or at least be provided upon request, but I don't think many developers would like to deal with that)

API Usage yes as it requires a token, but git clone requires no auth and still seems to work behind datacenter IPs (e.g. mullvad) meaning that there is little stopping someone from mass cloning.

It's a perverse reality but it seems that in order to keep some ecosystems open one has to take actions which are resource-wasteful (though I would argue in the larger picture it will save resources)

Yes of course because neither of those things are relevant to the license.

I would not be surprised if they start restricting git clone next.

I would, considering the number of CI systems this would break. Say goodbye to large parts of NPM, Go package management, Jenkins scripts, etc.

Don’t forget Dockerfiles!

That would be pure chaos.

A charitable interpretation might be that search requires a fair amount of compute, and is therefore a big denial of service vector.

I am not sure how much behavioral data GitHub can gather from logged in user, and how useful that is compared to the code that is there anyway. Maybe to figure out which parts of code are important? But that isn't really user-specific.

Yes, it's a real problem for anyone offering any sort of search capabilities. Like, about 0.5% of the traffic to my search engine is human. I'm not aware of any search engine that doesn't have similar stats.

Off topic: how do you determine what percentage of search is coming from humans?

Well about 99% of the search requests I got back when I was using cloudflare couldn't get past their bot-mitigation, and of what made it through, at least half looked very automated.

I'm a human and I can't get past cloudflare "bot mitigation" with my browser. Bot mitigation actually just means your browser executing the latest bleeding edge javascript functions to make sure your behavior is monetizable.

No that's not actually true at all. The website always worked with text-only browsers, cloudflare or not. Thoroughly tested with the likes of w3m and dillo.

Virtually all of the traffic that was intercepted claimed to be modern Chrome or Safari or similar, which should be capable of "executing the latest bleeding edge javascript functions".

The primary reason why anyone gets shit from bot mitigation is IP reputation, this is far more important (and effective) than looking at browser characteristics.

Github could require captcha for non-logged in users, I suppose.

I'm a human, yet I am unable to get past Steam's captcha. It is not the only site where I cannot prove I'm not a robot. I'm guessing the amount of collateral damage is worth it to them. I'm not a big gamer, and wouldn't be a big source of revenue for them anyway.

Steam has a captcha ?

This (behavioural data) is precisely Microsoft's playbook - no charitable interpretations ought apply. As far as I am concerned, no Open Source project has any justification for still being on the platform as of the day of the MS buyout. It's not as though there aren't good alternatives just a git clone away.

This (behavioural data) is precisely Microsoft's playbook

What behavioral data can you glean from a code search like Github's? The context is very different than, for example, Google's, so is there really much useful data you can get here?

From a code search in the wild, with no context? Not a lot. From a code search from a person who's logged in, identified? Well, probably still not a lot, but it's another factoid about that person to hang onto the knowledge graph.

Another factor: anonymous faceted regex search across a huge volume of code allows bad actors to find hardcoded credentials and gain access to additional systems, without a good audit trail.

But yes, there are multiple good explanations for why they would lock down the API.

Nothing other than Microsoft's attempt to stop AI from learning repo code.

I doubt they would use the code search feature instead of just cloning the repo and feeding it to the AI.

I don't doubt it.

If you wished to build an ML bot that taught switches/loops, instead of cloning every repo all you'd need to do is search for switches/loops within X lang.

That would be ridiculously slow and inefficient.

And downloading a blind repo isn't?

Much less so. It can easily be parallelized without running into rate limits of the proprietary search API. Then once you've cloned it once it's on your system and you won't have to talk to github again. So much more sensible.

Sensible isn't a thing when it comes to malicious activity. Same could be used to cheat the search.

Downloading a repo without even knowing whether it has the syntax you wish to learn is still going to be more resource-intensive than having a webpage thrown at you with all the search results.

Very inefficient way to do data access for more realistic training objectives.

Why else do you use search? If not to find a specific piece of code.

They probably want to use user info on what they search for, what they use, etc to better filter between good/bad popular/not-popular code for finetuning things.

If people can search anonymously it gets a lot harder to datamine

Retrieval-augmented generation could be the specific AI use case they’re guarding against if they’re restricting search but not cloning.

Can we please stop treating GitHub like it's an open platform? It's not. It's a closed, walled garden like every other one. The fact that they host a bunch of open source projects you use doesn't make them better; if anything, it makes them worse, because they've helped these projects wall off a part of their contribution infrastructure behind the lock and key of a corporate account.

I think it also needs to be said that GitHub has helped countless open source projects grow: Git hosting, wiki hosting, issues, GitHub Actions, GitHub Pages, a nice API...

There are LOTS of reasons to host your open source project on GitHub

Sourceforge used to have similar reasons.

And if github truly goes to shit like source forge did people will just move to a clone.

But if you look at any SourceForge repo right now, I think you'll see GitHub has a long way to fall before it gets that bad.

The tragedy here is that folks don't learn.

Learn what, exactly? Anything can go sour, including community-run open source stuff. There is never any guarantee for the future about anything.

Yet many expect that everything will always stay the same, and flock like sheep to the next one that sells itself as otherwise.

There are more reasons to run away from GH and just set up a locally managed SourceHut instance for sanity and interop.

If I have to sign up for a "corporate account" or I have to sign up for a random self-hosted gitlab account or I have to sign up for your bugzilla & wiki & gitwhatever.... it makes no difference. As a user, it's all the same. Actually... of all of these options, I prefer a "corporate account" (aka github) because I can participate in a nearly infinite amount of projects without having to create new accounts/logging in/etc.

No one has walled anything off in any way that is materially different than any other option in the space.

I would argue that Github has done a LOT of good in the space. Making good software, making it freely available. Keeping it reasonably open and accessible. Keeping it standards compliant where there are standards. Having API's for the rest. And in general, giving a huge amount of storage and compute away for free for open source projects.

The beauty with git is that it doesn’t need an account. It can all be done via email. Hosted repos that use that are much more open and allow you to use the tools you want to contribute.

A lot of these walled garden platforms have contributed a ton, and GitHub has made source code easier to host, no doubt. At the same time, we need to ask them to do better and not allow them to concentrate power for when the inevitable enshittification begins.

I suspect they are trying to block bots like everyone running a website with useful data these days.

The bots can still do a git clone and index everything; this just inconveniences normal users working on "some other" PC (or browser or incognito tab), where they are not logged in and/or don't want to log in (coworker's PC, 2FA, whatever).

Maybe if the bot operators have the resources, but it's far from trivial to keep an up-to-date mirror of every project on Github, especially if Github is actively putting up barriers to prevent it. Once login is required, it becomes much harder to bypass rate limits because the company can rate-limit signups from unknown domains, enforce 2FA, etc.

They can do all this (rate limits, etc.) for unregistered users already, and most users would never notice (since a human only does a few searches per time unit), but they decided to require a login anyway.

A lot of bot traffic is just mindless "follow any link" traffic, not specialized bots to do X. It really is hugely pointless and wasteful to have tons of these bots request tons of comparatively expensive search links.

The bots can still do a git clone and index everything

Of course they can, but then they're gonna be chewing up their own disk space and bandwidth for anything after the initial hit. I think the real problem is that the bots hit the GitHub servers over, and over, and over again.

I run a small website that tracks releases of a piece of software. It probes releases every few hours, with each run consuming several hundred API requests. GitHub API token limits are pretty generous. There are no paid offerings to increase the limits, so I suppose if you ask GitHub nicely, they will increase the limits given a reasonable justification.

it probes releases every few hours

I can't think of any software I've ever used that I was this concerned about release schedule. Even if I was a user of your site, I might check it daily, but even that is doubtful.

This has been happening since... forever?

I guess code search requires a lot of processing power so they'd rather know who's doing it in case they need to be throttled.

No. It was possible previously. The whole search feature on github seems to have been rewritten recently with several regressions in functionality.

They wrote a pretty interesting blog about the rewrite.

https://github.blog/2023-02-06-the-technology-behind-githubs...

I wasn't aware of the specifics. I was only going by the UX changes. Good to know more detail. Thanks

Yeah, I used to use GitHub to find code when I couldn't remember which repository I had used it in, by searching for keywords I could remember, but that doesn't work anymore.

I stopped being able to do GitHub code search without being logged in several months ago. The search bar would search through open issues; clicking code on the left in the results prompts me to login. Has been this way since at least June.

as an alternative, you may add `1s` after `github` in your address bar to open the repository with a browser-based VS Code, and then use Ctrl + Shift + F to search across all files

Why do it via 3rd party if you can just change github.com to github.dev and get 1st party VSCode? Or better yet, just press "." (dot) character on keyboard and VSCode will pop up.

But that is repository search only, of course and not github-wide.

https://docs.github.com/en/codespaces/the-githubdev-web-base...
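For anyone who wants to script the address-bar trick from the two comments above, here is a minimal sketch. The function and mapping names are hypothetical, and it only rewrites the host; whether the resulting page asks you to log in is exactly what the replies disagree about, and this does not settle it.

```python
from urllib.parse import urlsplit, urlunsplit

# Hosts taken from the comments above; the mapping itself is illustrative.
EDITOR_HOSTS = {
    "github.dev": "github.dev",    # first-party web editor
    "github1s": "github1s.com",    # third-party, "add 1s after github"
}


def to_web_editor(repo_url: str, editor: str = "github.dev") -> str:
    """Swap the host of a github.com URL for a browser-based editor host."""
    parts = urlsplit(repo_url)
    if parts.netloc.lower() not in ("github.com", "www.github.com"):
        raise ValueError(f"not a github.com URL: {repo_url}")
    # Keep scheme, path, query and fragment; replace only the host.
    return urlunsplit(parts._replace(netloc=EDITOR_HOSTS[editor]))


print(to_web_editor("https://github.com/rust-lang/rust"))
print(to_web_editor("https://github.com/rust-lang/rust", editor="github1s"))
```

As noted above, this only gets you search within a single repository, not GitHub-wide search.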

ironically, if i'm not mistaken, github.dev requires being logged in

github.dev also requires being logged into github.com.

A useful tip, thank you, but I think it will ask you to authenticate which is the OP's main gripe.

EDIT: No, I was wrong, you can search it without authentication.

The web has really closed in 2023. StackOverflow, Reddit, Github, Twitter all put the brakes on scraping and API access. The trigger was preventing AI training combined with a push to increase profitability due to commercial realities (tech recession, new ownership at Stack and X, Reddit wants to IPO).

I believe a long term effect will be a rise in the marketplace for proprietary data. Search engines, AI tools, anyone who needs the data will have to pay for the firehose or API access. (That, in turn, may cause antitrust issues if only the richest companies can afford access to that data.)

Yes and no. ExpertsExchange was notable for having "walled" responses and was similar to StackOverflow. In fact, I remember when StackOverflow was launched, it was compared to expertsexchange but "open".

It is time to move from centralized services to something more distributed and "micro transaction based". I know a lot of people on HN will dislike this, but all the ones you mentioned (Q&A forum, news link + commenting aggregator, Git hosting + approval flow + wiki + ticketing system) can and should be implemented in a completely distributed manner (using Kademlia/BitTorrent technologies, IPFS and crypto tokens for paying micro transactions).

Going 100% distributed is the only way these sort of things are going to stay "open" and free of corporate greed (we have seen time and time again, original founders may start the project with good intentions, but in the long term, the product gets sold and bean counters take the lead).

It's been going on longer than this, but this year has had a lot further locking down...

https://theoatmeal.com/comics/reaching_people

... and everybody will just think it's normal for "StackOverflow, Reddit, Github, Twitter" to sell access to content that other people created and own...

If something is truly proprietary, then the compensation should go to its owner, not to some random Git server operator or whatever. And the owner, not the server operator, should be setting the price and terms. If the server needs to make money, it can charge for the basic service itself.

On the other hand, if somebody created something to give it away, on a platform that at that time provided a channel effective for giving things away and advertised itself as suitable for giving things away, then the platform changing the rules midstream is morally piracy of that person's work.

Probably a very large fraction of the material on those platforms would never have been put there in the first place if the actual owners had expected random restrictions and arbitrary access charges.

If they want to radically rewrite the rules like that, then they need to not sell access to any preexisting content unless the original creator explicitly opts in. But we had Reddit, for instance, actively reinstating posts that had been mass-deleted by their authors specifically to prevent that kind of abuse.

This is annoying at worst. I search code without being logged in 3 or 4 times a year, and I'm very active on GitHub.

... try to look at it from a lens other than your own.

Creating an account and logging in does not seem like a big ordeal.

If you're very active on GitHub you probably stay usually logged in any case

Started at least 6 months ago and was announced in their changelog https://github.blog/changelog/2023-06-07-code-search-now-req... (HN discussion https://news.ycombinator.com/item?id=36230929)

Surprised to see this as news as it feels like it's been years at this point. I can scarcely remember when I was last able to search without being logged in.

I understand perfectly why this has been done from a business standpoint as it invariably increases 'user engagement' by driving people to become literal users!

this. that has been my personal experience as i browse github without being logged in for the most part.

even searching for a specific file name inside the repo has been locked behind a login the same way.

If folks want to continue searching open source, https://sourcegraph.com/search does not require sign in and also includes major projects that are not on GitHub.

(Full disclosure: I'm the Sourcegraph CTO)

Why did you decide to allow search without a sign in but not for cody?

Thank you. I don't want to sign into my personal github account on my work computer, so I always use sourcegraph if I need to search a repo.

I remember longing for the ability for github's search to rival git clone + grep, but I never expected a login wall to come with it. IIRC I expected the login wall to just be because the feature was in beta and would be removed when it became the primary search.

HackerNews: Can no longer view the frontpage without seeing a complaint about a free and ad-free service not being provided to users who are not logged in.

Do you have any references for this claim? I browse HN pretty much every day, and "I can't do X without logging in" doesn't seem to be a popular type of post. The only example I can think of is Reddit's API changes, which were a while ago and only loosely related to logging in.

If it's free, then you are the product. That's an adage that holds true in most cases, and in this case, your code is being used to train Copilot, among other things. Even if that weren't the case though, and the service really was as benevolent as you seem to believe it is, would you be sitting there clapping as the service gets worse for a decent subset of users? Is that exciting and interesting to you?

Even if I agreed with the position of person who posted the question (which I don't really, because search is not necessarily cheap and could be used for a DOS style attack), why do people feel the need to be like this? Like, what benefit is saying, "Is it not enough to monetize every bowel movement, you now feel the need to track which individual lines of code I'm browsing?" here? Chill out.

The underlying argument isn't even that good. For one, repos themselves are _not_ gated behind a login, so an open repo can be accessed publicly, including cloning it. And if the point is that the author wants his repos and code to be useful to the public, then anyone who needs to interact beyond downloading and searching the code, such as creating an issue or making a PR, would necessarily need to log in anyway.

Yes, there is very often a need to quickly search in source code (to figure out how something works, to discuss it in places other than GitHub issues...).

Cloning is usually fast but not with very large repositories, and you might not even have enough space on your current device to clone them.

By the way, if a lot of people end up cloning every time instead of logging in (after all, logging in to GitHub is a pain with the 2FA), I don't think their systems will have less load.

And not everyone might want to have a GitHub account

I still can’t get over the fact that GitHub blatantly disabled regular search in favour of some Git mumbo jumbo nonsense that forbids you from looking up recent code. Any new technology, be it an API or otherwise, you can’t explore it on GitHub because “search by recency” is no longer a thing.

What an asinine thing to do. My GitHub usage has dropped to exactly 0% for this very reason. I know I am not alone either, since their org forums are filled with complaints about this.

Not going to use any strong words but I want to. Disgraceful UX choice.

There's a forum thread about that here: https://github.com/orgs/community/discussions/52932#discussi...

We are thinking about ways to make this work in the new code search system, but don't have a date for when it will be possible.

GitHub search is not exactly supported by ads. It was a much more, uh, baffling decision to have limited public access to Twitter, which is literally ad impressions. I guess I can't comprehend that level of business genius.

Has there been an update on whether the "they literally couldn't afford their cloud hosting bill" theory was bunk?

This doesn't seem to be much of a problem given that making an account is free.

Yes, but it does come with some checks and balances: an e-mail account (allowing them to filter out any dodgy domains or throwaway services), a captcha (bot detection of sorts anyway; may be hidden, and would also be used for public search), a barrier to entry via their signup procedure, etc. And then they would have a unique token - your username / ID - to help detect flooding / scraping and set a limit of e.g. 100 searches per hour or whatever they have deemed normal and acceptable behaviour.
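To make the "100 searches per hour per username" idea concrete, here is a minimal sketch of a per-user sliding-window limiter. This is not how GitHub does it (nobody in the thread knows that); the class name, the in-memory storage and the default numbers are all assumptions for illustration.

```python
import time
from collections import defaultdict, deque


class SearchRateLimiter:
    """Allow at most `limit` searches per `window` seconds for each user id."""

    def __init__(self, limit: int = 100, window: float = 3600.0, clock=time.monotonic):
        self.limit = limit      # e.g. 100 searches, the figure from the comment above
        self.window = window    # e.g. one hour, in seconds
        self.clock = clock      # injectable so the limiter can be tested
        self._hits = defaultdict(deque)  # user id -> timestamps of recent searches

    def allow(self, user_id: str) -> bool:
        now = self.clock()
        hits = self._hits[user_id]
        # Forget searches that have fallen out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False  # over the limit: throttle this user
        hits.append(now)
        return True
```

The point of keying on a user id is that the limit follows the account rather than an IP address, which is the advantage a login gives over limiting anonymous traffic.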

To be fair, I’m surprised it was open in the first place. For a website that does not have any ads, having its enormous amount of data not only readily available but also searchable felt like a bad choice.

What I hope never happens is the actual enshittification of the site for logged in users.

Why a bad choice??

I really don’t mind at all and neither should you

Welp, everyone pack up. This one person has decided that there's nothing to mind here.

I wonder how much scraping for various purposes, but maybe more directed towards training LLM's, has affected this change?

To train LLMs it seems asinine to try to scrape things instead of making a clone.

The only automated searches I could envisage to MAYBE be happening are those looking for vulnerabilities, and there are probably only a few dozen actors doing it. State actors seem actually more likely to use clones of everything.

This issue blows it way out of proportion. The code is still freely available, GitHub just doesn't want to give away resources for free because it's a for-profit company. If you're mad at a for-profit company for not giving away things for free, you're simply delusional.

GitHub is still giving away resources for free, they've done it forever because it was their business model: attract as many people as possible through open source repositories, a fraction of which will then use their non-free services (and more recently, also to train AI models).

Now that they're so dominant (and owned by Microsoft) they can afford to worsen the services, but people have every right to be pissed about it and push developers to move their repositories.

Controversial opinion: when a service provider makes antagonistic changes to its service, it's actually a good thing, as it pushes people to find different solutions and not depend too heavily on one service provider

Sometimes yes… sometimes it progresses way too slowly.

To those arguing that GitHub needs to do this because search is resource intensive, why can't GitHub put their code search behind something like Cloudflare Turnstile?

See e.g. Troy Hunt doing this for HaveIBeenPwned: https://www.troyhunt.com/fighting-api-bots-with-cloudflares-...

I have had my own issues with Cloudflare Turnstile (https://news.ycombinator.com/item?id=38412057), but in principle, I don't see why this can't be done

Yes, this is standard now on Github. I noticed this weeks or even a few months ago. Now I often resort to cloning repos and running my own dumb grep. Thanks for being helpful, Github.

You all talk a big game, but your actions are ineffectual. You don't consider the leverage you have to "encourage" Microsoft to turn public GitHub search back on.

Real leverage looks like all of us banding together to boycott Microsoft's most profitable income streams. Not GitHub. But cloud services like Azure, and Office, gaming, etc:

https://www.nasdaq.com/articles/these-2-revenue-streams-acco...

It's about making unilateral ensh*ttification decisions like this so expensive for parent companies that they become unthinkable. The board and investors sometimes need to be reminded how expensive these decisions really are.

We've all seen stock prices tank by half in one day since the Dot Bomb. A weak signal starts a big wave in this age of algorithmic trading.

A high-profile company publicly switching to AWS/GCP to avoid eating MS's sh@t sends a strong signal.

More importantly, after everything that MS has put us through over the decades, divesting is FUN! :-)

Also this ide-in-browser by default(?) broke the ui. I can’t even h-scroll code on mobile without wasting half of the phone’s charge.

My problem with code search is that when I’m searching I’ll randomly get dumped to github.com/search with the search blanked, and anything I put in the search box at that URL is ignored. I have to go back in my history until I’m in the normal GitHub UI again. My search has also been discarded there, but at least I can interact with it.

Searching code server-side (especially with the newly added intellisense-ish features) is probably expensive, yet they think they deserve such a feature for free, when code searching/browsing isn't GitHub's primary goal.

Why not either learn to use local code search tools and clone a repo, or just log in and search then? Are either of these things such a hurdle?

I sympathize that the person opening the post is frustrated by the chain of PEBKAC events that unfolded, but that's not really an issue with GitHub.

The issue opener is behaving like an entitled toddler by making his problems "societal" problems. Honestly, the lack of self awareness with some people is astounding.

I see a lot of people upset about the openness of the web or whatever, and I am too, but I don't see many who point to specific harms that this causes. Of course, this leads others to believe that the complaints are nitpicky and lazy. Since I use this feature heavily, I feel like I might weigh in why this really sucks.

First off, I have a GitHub account. But I'm often in contexts where signing into it is annoying, and intentionally so (it lets me push code!). I usually don't want it on my work machine, even though I search GitHub a whole lot from there. Some of it is of course related to my job directly, for which you might plausibly argue that my employer should somehow compensate GitHub, since a free service is indirectly making my employer money. But a lot of it is literally my work computer being a second machine that I work on, with open source code running on it, and usually these searches are to contribute value back to the community, either because I want to file a bug against a project later, or maybe even to decide whether it is worth my while to contribute code to it. And I do this all the time from other devices, too: I might have logged into GitHub on my phone once, but my iPad? My mom's computer? Being able to search GitHub is an excellent way to answer people's tech questions at any time; if Google asked you to sign in everywhere, it would be a massive pain in much the same way.

Second, and also quite important, is that I can't use GitHub links as a way of pointing people at code anymore. I frequently (check my history!) will post a comment like "yeah the Foo project uses bar API a bunch like this, [GitHub search link]". It's a very quick and very direct way of sharing this information. Of course I have no idea if the people on the other side are logged in or not, which means that if you put them behind a login wall I will slowly stop sharing these, because people will complain that they can't see what I sent them. It's the same way I hesitate to send people links to Twitter/X these days, because whether the content will be accessible is a coin flip. And I'm definitely not going to ask people to clone the repo (on what, their phones?) to see what I'm seeing, so I might as well link to Sourcegraph instead.

Github has not allowed site-wide code search without a login for years.

What seems to be new is that you can't search even a single repo.

You can just clone it and use your local tools.

If the code is hosted or mirrored elsewhere, and elsewhere uses GitWeb, you can use GitWeb's grep search.
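The "clone it and use your local tools" route that keeps coming up in this thread is normally just grep or rg, but for illustration here is a rough Python equivalent. It assumes the repository has already been cloned to a local directory; the function name is hypothetical and it makes no attempt to honour .gitignore.

```python
import re
from pathlib import Path


def grep_tree(root: str, pattern: str):
    """Yield (relative path, line number, line) for each matching line under root."""
    regex = re.compile(pattern)
    base = Path(root)
    for path in sorted(base.rglob("*")):
        # Skip directories and git's own metadata.
        if not path.is_file() or ".git" in path.relative_to(base).parts:
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # binary or unreadable file
        for number, line in enumerate(text.splitlines(), start=1):
            if regex.search(line):
                yield str(path.relative_to(base)), number, line.strip()
```

After a `git clone`, something like `for hit in grep_tree("rust", r"fn main"): print(*hit, sep=":")` prints grep-style results. It obviously does nothing for the case where the repo is too large to clone onto the device you are on.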

It's fair to assume it's very costly to run and has been abused by bad actors. I'm actually surprised it was open to begin with.

KPI driven development.

I imagine Microsoft view the code on their platform as an asset, as they can use it all to train AI that they can sell on. They don't want anyone else doing the same thing with the code on their platform.

everyone here will forget we only have encryption today because openbsd was in Canada.

every usa based project had to quit shipping any cryptography code. but who cares about history. who cares you can only participate in some open source project today if you have an account with usa companies? nobody cares. screw the Iranian engineering student. and hope the usa doesn't outlaw encryption again.

Was going to say that the comments on the link are a dumpster fire, but it looks like HN is not much better.

I've been using Sourcegraph when I'm not logged in, which also has the bonus of being a lot better (at least when compared to the old search).

It's as simple as appending the repo URL, starting from github.com: e.g. sourcegraph.com/github.com/rust-lang/rust/ (you can try searching for unit on this repo)
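The URL pattern described above is simple enough to script. A minimal sketch, with a hypothetical function name, that builds only the repository URL shown in the example and nothing about Sourcegraph's query syntax:

```python
from urllib.parse import urlsplit


def sourcegraph_url(repo_url: str) -> str:
    """Prefix a github.com repository URL with sourcegraph.com."""
    parts = urlsplit(repo_url)
    if parts.netloc.lower() != "github.com":
        raise ValueError(f"not a github.com URL: {repo_url}")
    owner_repo = parts.path.strip("/")  # e.g. "rust-lang/rust"
    return f"https://sourcegraph.com/github.com/{owner_repo}"


print(sourcegraph_url("https://github.com/rust-lang/rust"))
```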

Been using https://grep.app for that

Original HN Discussion: https://news.ycombinator.com/item?id=22396824

Sourcegraph works too: https://sourcegraph.com/search

I have another one: You can't log out of GitHub without enabling 2FA. You can't log a ticket about it either.

So it begins...

Now that the entire programming world has just about hard coded GitHub into the very center of everything, it's time to start enshittifying.

I have a simple guess. We are in a post zirp (zero interest rate policy) world. I've seen very large companies embracing degrowth.

Quite possibly Microsoft told its subsidiary GitHub that they needed to spend less money and this is how they save resources while impacting the users they care about the least.

This is pretty disappointing on GH's part. They seem to be moving further and further away from things that made the platform great.

https://sourcegraph.com/search has a really powerful code search and can be a pretty good alternative; it allows you to search codebases without logging in and across different code hosts.

This is honestly very low on the list of things I dislike about Microsoft GitHub. git clone + rg still works fine anyways.

You also can't sort code-filtered search results by last-indexed anymore. The removal of this feature imposes a new tax on searching popular domains of work involving rapidly changing APIs and library preferences, such as deep learning.