Affected companies are becoming increasingly frustrated with the army of AI crawlers out there, since they won't stick to any scraping best practices (respecting robots.txt, using public APIs, avoiding peak-load times). It's not necessarily about copyright: the heavy scraping traffic also drives up infrastructure costs.
What's the endgame here? AI can already solve captchas, so the arms race for bot protection is pretty much lost.
Require login, then verify the user account is associated with an email address at least 10 yrs old. Pretty much eliminates bots. Eliminates a few real users too, but not many.
this is not a solution if you want a public internet (and sites that don't care about the public internet already don't have a problem)
for read-only content, I just stick it behind a cache and let the bots go wild.
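A minimal in-process sketch of the idea; in practice this is usually a CDN or reverse-proxy cache, but the effect is the same: repeated bot hits never reach the backend. Flask-style, with names and TTL purely illustrative.

```python
# Sketch: serve read-only pages from a simple in-memory cache keyed by path,
# so repeated requests (human or bot) skip the expensive backend work.
import time
from flask import Flask

app = Flask(__name__)
CACHE = {}          # path -> (expires_at, body)
TTL_SECONDS = 300   # how long a rendered page stays cached (illustrative)

def render_page(path: str) -> str:
    # Placeholder for the expensive render / database query.
    return f"<html><body>Content for {path}</body></html>"

@app.route("/<path:path>")
def cached_page(path):
    now = time.time()
    hit = CACHE.get(path)
    if hit and hit[0] > now:
        return hit[1]                       # cache hit: no backend work
    body = render_page(path)                # cache miss: render once
    CACHE[path] = (now + TTL_SECONDS, body)
    return body
```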
I presume OSM has already considered this and ruled it out (probably because the map should be dynamic)
You can't cache this stuff for bot consumption. Humans only want to see the popular stuff. Bots download everything. The size of your cache then equals the size of your content database.
That’s just passing the buck.
Someone still needs to pay for that traffic. If it gets too much for Cloudflare or whoever, you're gonna get the bill.
This is about OpenStreetMap, so you are proposing that my minor daughter not be allowed to read a map?
I must be an outlier here, but I don't keep email addresses that long. After a couple years they're on too many spam lists. I'll wind those addresses down, use them for a couple years only for short interactions that I expect spam from, and ultimately close them down completely the next cycle.
At best any email I have is 4 or 5 years old.
Seems to me eventually we might hit a point where stuff like API access is whitelisted. You will have to build a real relationship with a real human at the company to validate you aren't a bot. This might include an in-person meeting, as anything else could be spoofed. Back to the 1960s business world we go. Thanks, technologists, for pulling the rug out from under us all.
Scraping often uses the same APIs that the website itself does, so to make that work a lot of sites will have to put their content around authentication of some sort.
For example, I have a project that crawls the SCP Wiki (following best practices, rate limiting, etc.). If they were to restrict the API that I use, it would break the website for people. So if they want to limit access, they have no choice but to put it behind some set of credentials that they could trace back to a user, eliminating the public site itself. For a lot of sites that's just not reasonable.
You can't whitelist and also have a consumer-facing service. There is no reliable way to differentiate between a legitimate user and the AI company's scraper.
Scraping implies they're not using an API - they're accessing the site as a user agent. And whitelisting access to the actual web pages isn't a tenable option for many websites. Humans generally hate being forced to sign up for an account before they can see a page they found in a Google search.
I could definitely see this. I worked for a company that had a few popular free inspector tools on their website. The constant traffic load of bots was nuts.
Invite-only authenticated islands based on trust. Which seems like the end result of the rampant centralization of the internet.
The open web is on a crash course. I don't necessarily believe in copyright claims, but I think it makes sense to aggressively prosecute scrapers for DDOSing.
This would already be happening if we could track them.
Bars and coffee shops?
Many would oppose the idea, but if a service (e.g. eBay, LinkedIn, Facebook) were to dump a snapshot of its data to S3 every month, that could be a solution. You can't prevent scraping anyway.
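A rough sketch of such a monthly dump job, assuming boto3 (the real AWS SDK), valid credentials, and a stubbed-out export step; bucket and key names are made up.

```python
# Sketch: export a snapshot of the public dataset and publish it to S3
# once a month, so consumers can download it instead of scraping.
import datetime
import tarfile
import boto3

def export_snapshot(path: str = "/tmp/dump.tar.gz") -> str:
    # Placeholder: in reality this would serialize the site's public data.
    with tarfile.open(path, "w:gz"):
        pass
    return path

def publish_monthly_dump(bucket: str = "example-public-dumps") -> None:
    stamp = datetime.date.today().strftime("%Y-%m")
    key = f"snapshots/{stamp}/dump.tar.gz"
    boto3.client("s3").upload_file(export_snapshot(), bucket, key)
    print(f"published s3://{bucket}/{key}")
```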
Data from S3 isn't free though; it still costs money and has a limit based on the tier you purchase.
Yeah, you can get dumps of Wikipedia and stackoverflow/stackexchange that way.
(Not sure if created by the admins or a 3rd party, but done once for many is better than overlapping individual efforts).
You can rather easily set up semi-hard rate limiting with a proof-of-work scheme. It barely affects human users, while bot spammers have to eat the cost of a million hash computations per hour or whatever.
e.g. HashCash
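A hashcash-style sketch of what that looks like: the server hands out a random challenge and only answers requests that include a nonce whose hash has enough leading zero bits. The difficulty value and names here are illustrative, not tuned.

```python
# Proof-of-work sketch: sha256(challenge + nonce) must start with
# DIFFICULTY zero bits before the server will serve the request.
import hashlib
import os
from itertools import count

DIFFICULTY = 18  # leading zero bits; cheap for one page view, costly at scale

def make_challenge() -> str:
    return os.urandom(16).hex()

def _leading_zero_bits(digest: bytes) -> int:
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits

def verify(challenge: str, nonce: int) -> bool:
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return _leading_zero_bits(digest) >= DIFFICULTY

def solve(challenge: str) -> int:
    # The work a client (or bot) has to burn CPU on before each request.
    for nonce in count():
        if verify(challenge, nonce):
            return nonce

challenge = make_challenge()
nonce = solve(challenge)
assert verify(challenge, nonce)
```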
Yep. That works well enough for password hashing algorithms to deter brute force attackers.
This is a similar situation.
The idea is not to make scraping impossible, but to make it expensive. A human doesn't make requests as fast as a bot, so the pretend human is still rate limited. Eventually you need an account, that account gets tracked, accounts matching specific patterns get purged, and so on. This will not stop scraping, but the point is not to stop it; it's to make it expensive and slow. Eventually it becomes expensive enough that it's better to stop pretending to be human, pay for a license, and then the arms race goes away.
Can defenses be good enough that it's better not to even try to fight them? That's a far harder question than whether a random bot can make a dozen requests while pretending to be human.
I liked the analogy to Gabe Newell's "piracy is a service problem" adage, embodied in Virgin API consumer vs Chad third-party scraper https://x.com/gf_256/status/1514131084702797827
Make it easier to get the data and put fewer roadblocks in the way of legitimate access, and you'll find fewer scrapers. Even if you make scraping _very_ hard, people will still prefer scraping if legitimate use is even more cumbersome than scraping, or if you refuse to offer a legitimate option at all.
Admittedly, we are talking here because some people are scraping OSM when they could get the entire dataset for free... but I'm hoping these people are outliers, and most consume the non-profit org's data in the way they ask.
I think this very example proves that the adage is wrong, or at least doesn't capture many things for the full picture.
I don't know if the AIs have an endgame in mind. As for the humans, I think it's an internet built for a dark forest. We'll stop assuming that everything is benign except for the malicious parts, which we track and block. Instead we'll assume that everything is malicious except for the parts which our explicitly trusted circle of peers have endorsed. When we get burned, we'll prune the trust relationship that misled us, and we'll find ways to incentivize the kind of trust hygiene necessary to make that work.
When I compare that to our current internet the first thought is "but that won't scale to the whole planet". But the thing is, it doesn't need to. All of the problems I need computers to solve are local problems anyway.
Arguably, trying to scale everything to the whole planet is the root cause of most of these problems. So "that won't scale to the whole planet" might, in the long view, be a feature and not a bug.
Right. If your use case for the internet is exerting influence over people who don't trust you, then it's past time that we shut you down anyhow.
For everyone else, this transition will not be a big deal (although your friends may ask you to occasionally spend a few cycles maintaining your part of a web of trust, because your bad decisions might affect them more than they currently do).
Web Attestation, cryptography to the rescue.
How? Watermark everything with a hash?
Feed bad data to heavy users. Instead of blocking, use poison.
Presumes you can distinguish the heavy users. If you knew who the heavy users were, you could just block them.
isn't the answer just rate limiting unauthenticated requests to a level that's reasonable/expected for a human?
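Concretely, something like a per-IP token bucket tuned to roughly human browsing speed; the numbers below are made up, not a recommendation.

```python
# Sketch: token-bucket rate limiting for unauthenticated traffic, keyed by IP.
import time
from collections import defaultdict

RATE = 1.0     # tokens refilled per second (~1 page/second sustained)
BURST = 10.0   # short burst a human might plausibly produce

_buckets = defaultdict(lambda: {"tokens": BURST, "last": time.monotonic()})

def allow_request(client_ip: str) -> bool:
    bucket = _buckets[client_ip]
    now = time.monotonic()
    bucket["tokens"] = min(BURST, bucket["tokens"] + (now - bucket["last"]) * RATE)
    bucket["last"] = now
    if bucket["tokens"] >= 1.0:
        bucket["tokens"] -= 1.0
        return True
    return False   # reject or challenge: this caller is faster than a human
```

The obvious caveat, raised elsewhere in the thread: scrapers that rotate through large IP pools sidestep per-IP limits.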
An optimistic outcome would be that public content becomes fully peer-to-peer. If you want to download an article, you must seed at least the same amount of bandwidth to serve another copy. You still have to deal with leechers, I guess.
We've had good success with:
- Cloudflare Turnstile
- Rate limiting (be careful here, as some of these scrapers use large numbers of IP addresses and User Agents)
- Lowering the max upload speed for certain IPs to 5kb/s (rough sketch below)
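Not their actual setup, but one way to implement that last item: stream the response in small chunks and sleep between them so flagged IPs get a capped transfer rate. The constants are a loose interpretation of the "5kb/s" above.

```python
# Sketch: bandwidth throttling for flagged IPs via a chunked, slowed response body.
import time
from typing import Iterator

THROTTLE_BYTES_PER_SEC = 5 * 1024   # rough reading of "5kb/s"; tune to taste
CHUNK = 512                         # bytes sent per iteration

def throttled_body(payload: bytes) -> Iterator[bytes]:
    for i in range(0, len(payload), CHUNK):
        yield payload[i:i + CHUNK]
        time.sleep(CHUNK / THROTTLE_BYTES_PER_SEC)
```

The generator can be handed to a WSGI/Flask streaming response for requests coming from the slow list.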
How long before companies start putting AI restrictions on new account creation simply because of the sheer amount of noise and storage issues associated with bot spam?
API-based interactions w/ Authentication.
Websites used to have their own in-house APIs that freely delivered content to anyone who requested it.
Now, a website should be a simple interface for the user: it communicates with an external API and displays the result. It's the user's responsibility to have access to the API.
Any information worth taking should be locked behind authentication - which has become stupid simple using OAuth with major providers.
So these people trying to extract content by paying someone or using a paid service should rather use the API which packages it for them and is fairly priced.
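A rough sketch of what that looks like, assuming a Flask-style endpoint; the token check is stubbed with a set, whereas a real setup would validate an OAuth access token against the provider.

```python
# Sketch: the public site is just a client of this API, and every request
# must carry a bearer token, so every hit is attributable to an account.
from flask import Flask, jsonify, request

app = Flask(__name__)
KNOWN_TOKENS = {"example-token"}   # placeholder for real OAuth token validation

@app.route("/api/content/<item_id>")
def get_content(item_id):
    auth = request.headers.get("Authorization", "")
    token = auth.removeprefix("Bearer ").strip()
    if token not in KNOWN_TOKENS:
        return jsonify({"error": "authentication required"}), 401
    # Abuse can now be traced back to (and billed to) the account.
    return jsonify({"id": item_id, "content": "public content would go here"})
```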
Lastly, robots.txt should be enforced by law. There is no difference between stealing something from a store and stealing content from a website.
AI (and greed) has killed the open freedoms of the Internet.