This tends to be a very unpopular opinion around here, but in almost all cases I find Internet scraping to be unethical and downright malicious. I'm not saying all cases, but I'm saying almost.
A lot of the actors involved tend to be hustle culture types who think they are OWED your data, regardless of the ethics, laws, being a good citizen, whatever. They will blatantly disregard terms of service and hide behind massive setups such as these to circumvent protection etc.
And the problem is, if you run any sort of business or service that is data oriented, there will be thousands of people that will do this, which will cause you to devote enormous amounts of time, effort, money, and infrastructure just to mitigate the issues involved with data scraping. That's before you are even addressing whether or not these people are "stealing" your data. People who feel they are entitled to the crux of your business aren't bothered by being nice in the way they take it - they'll launch services that will cripple infrastructure.
Whenever I deal with a scraping process that decides it wants my entire business, and it wants all of it RIGHT NOW, or in 5 minutes, I want to find the person and sit them down in a room and tell them "hey, develop your own ideas and business. Ok? Thanks"
And if you think this was a problem before, it has gotten exponentially worse over the past few months with every Tom, Susan, and Harry deciding they must have all your data to train their new LLM AI model. By the thousands.
I find it aptly hilarious that your own business model at broadcastify.com is recording publicly accessible radio broadcasts and then selling access to those recordings for commercial gain.
Why is that hilarious? We developed an entire community, infrastructure, system, architecture, everything, from scratch, and provide access to something that never existed in the first place on the Internet. That's a significant key difference here.
This would be analogous to you thinking ancestry.com is "aptly hilarious" for arguing against someone just scraping their site for content.
What makes you think you should be entitled to drive by the very unique house that we built, point right at that house, and say "I think I'll take all of that for myself!"?
Because you fail to see the very obvious parallels to scraping. I’m not criticizing your business (I think you provide a valuable service) but your hypocritical stance on what forms of publicly available information are allowed to be gathered and repackaged.
Google’s original (and OpenAI’s) business model was also building a scraping infrastructure, system, and architecture, from scratch — and providing access to something that never existed in the first place.
It's completely perpendicular, not parallel.
Public safety communications are radio waves that are broadcast, and the ability to passively monitor them is enshrined in United States law. That is a massively key difference.
If I was sending data into your home from my infrastructure without any action from you whatsoever, and you were reaching up into the air and gathering it and repackaging it, AND the law said that I have no intellectual property rights to said data, then that's a whole different story.
You are scraping radio signals and selling it. It’s an exact parallel and if you fail to see this it is indeed hilarious.
If you don't understand the difference between intercepting radio signals and Web scraping, I'd say your understanding of physics and technology is pretty hilarious.
Look around in your house dude, there are radio signals present in your house right now as we speak - you just can't see them - the data literally exists right in your home without you even having to do anything. And the law grants you the unequivocal right in the United States to intercept those radio signals.
So you only argue that scraping data is bad because of the cost? How do you know that the site someone is scraping doesn't have a fixed cost?
No, scraping data is bad because it is against the owner's wishes.
In the US, if you broadcast, then by law you consent to be received and recorded.
If you scrape data, there is no such law. And if you get consent (say by finding the permissive robots.txt), then go ahead and scrape.
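For what it's worth, checking for that consent is cheap. Here is a minimal sketch using Python's standard `urllib.robotparser`; the bot name, the example.com URLs, and the robots.txt contents are made up for illustration, and a real scraper would fetch the site's actual /robots.txt first.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, used only for illustration.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("https://example.com/index.html",
                "https://example.com/private/data"):
        verdict = "allowed" if may_fetch(EXAMPLE_ROBOTS, "my-bot", url) else "disallowed"
        print(url, verdict)
```

Note that this only tells you what the operator asked for; nothing in it enforces anything.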
The broadcasters weren't happy about home cassette recording either, and the case went all the way up to the Supreme Court. If I can legally record cable, then it's not a stretch to say I can also "record" what's on the public Internet for my own use.
Morally speaking, we have to consider the other side of the equation - operator may not be happy about being scraped, but as a user, is it okay for me to build or use a scraper-based price-comparison or price-tracking platform? I'd say yes, even though most sellers wouldn't want to have this data scraped.
I see a difference between "scrape for personal use", "scrape for public good" and "scrape to earn money from".
Everything is fine for personal use - you are choosing how to consume the websites, and if you choose to do it by extracting all the data into tables, that's fine.
Public good scraping is slightly murkier morally but I guess it's also fine? Similar to "fair use" copyright exceptions. (Unless it's commercial companies pretending to do "public good" solely for their own benefits, like AI "open dataset". Those should be banned.)
"Scrape to earn money from" is not OK. And sadly, this seems to be the majority of all scraping projects, such as: copying sites wholesale and displaying your own ads on them, collecting data to train AI on, or scraping for SEO (=making everyone's search results worse).
A good analogy is what you would do in a public place like a cafe: can you do your personal work? No problem at all. Can you put up a non-commercial poster or sign? This may be OK. Can you earn money off it (say, sell your own stuff inside)? No way.
The analogy here is that a website that is connected to the internet is considered "free to browse" just as a radio signal is "free to listen to".
The issue isn't listening or browsing (so long as it's not DoS-ing), it's what you do with that information and whether you have permission to use the information (copyright of the host / broadcaster) in the way that you are and in the way that was intended.
It is difficult to get a man to understand something, when his salary depends on his not understanding it.
Every time you use Google you benefit from scraping. Scraping is how the world works for the last 25+ years.
You are trying to draw a distinction between data that is pushed and data that is pulled, and maybe there is some economic argument there in terms of resource usage, but that is very context-dependent.
In the UK, listening to public radio broadcasts is illegal. I think this law is idiotic and I ignore it. It seems you do too, since there appear to be streams from the UK on your site :)
Google benefits from legal scraping - ban them from robots.txt and they'll stop.
Please don't mix consensual and non-consensual scraping, the difference is huge.
You realise web scraping is a legal right too?
Why is it ethical if you build upon other people's data, but unethical if others do it?
Nobody cares how valuable you think your service is. Who's the judge of what's entitled to scrape or not? If you think you're the judge, I find it somewhat arrogant.
It is even more hilarious that you defend a position that, to me, looks authoritarian and individualistic. Might not be your intention, but it's what I read.
Why is it ethical if you build upon other people's data, but unethical if others do it?
Because they GAVE IT TO ME, that's why.
Who's the judge of what's entitled to scrape or not? If you think you're the judge, I find it somewhat arrogant.
You find it arrogant that I want to protect my business interests from people who solely want to just "take" from the hard work my team has put together. Would you be arrogant if you built a platform over 20+ years, and then scrapers just took the data for themselves?
...looks authoritarian and individualistic.
These assertions are ridiculous. LOL. Hyperbole at its finest.
They gave you a right to resell their broadcast content?
Yes, the US government did.
Please read other thread replies.
Look, when you publicize information that is not a human creation or art, you are GIVING IT TO THE PUBLIC.
The Berne Convention intentionally left sui generis database rights outside the scope of copyright. Only in the European Union do you have the kind of protection you're looking for. And even in the EU, I've never come across a case where the law was enforced in courts. Maybe because it's a ridiculous right, in my opinion, that would make information flow dysfunctional in society.
You gave it to them when they visited you.
If your business is just that you have a bundle of information and expose it over an open website, I’m not really sure how you’re able to maintain a mentality that you are somehow entitled to ownership of that information. You already put it out there, it’s now public, any illusion of exclusivity is now gone because anyone could come along at any time and make a copy without your knowledge. A moral position on this issue is even more confusing to me. Do you think that you e.g. own the knowledge on which radio frequencies are used where? Do you think you have a moral claim on ownership of (presumably unpaid) user-submitted information? I think the only legitimate moral grievance you have is high traffic volumes from inconsiderate scrapers.
Do you think you have a moral claim on ownership of (presumably unpaid) user-submitted information?
You're damn right I do. I own, develop, and maintain the entire system that enabled the body of works to exist in the first place.
Do you think that you have a claim on ownership of the data because you drove by, saw what you liked, and decided that now you'll just rip the baton out of my hand?
I don’t think that meets the bar. Running a website is absolutely not equivalent to the collective effort people put in to populate that website with the information that actually gives the overall artifact its value. There is a long history of outrage when similar information repository websites with user-generated content violate expectations of openness. Never mind the fact that the actual information itself isn’t even private or proprietary, just obscure and distributed.
I wouldn’t claim ownership nor want to, when I scrape stuff I usually just want information in a different format. I’m confused as to how you think you can even “own” data to begin with. Suppose that your users uploaded songs instead of RF info, do you believe you own their music solely because they chose to share it on your site? Do you think your users would believe that?
I’m confused as to how you think you can even “own” data to begin with.
It's actually very simple. If I'm in a position to restrict access to the data, then I own it, unless there is some legal authority that has jurisdiction over me that says I must make it available to the public.
Operating a website doesn't automatically put you in that position, as evidenced by the fact that scraping does not require your consent to be possible. Ultimately there's little practical difference between someone's eyes viewing information and a program viewing that same information, a copy has been made in some form. Scraping a new site takes maybe a few hours of python to accomplish, the barrier is low.
I don't think you understand. If I decide as the owner of a site that I don't want you scraping my business and I block you, then I am in that position. I'm automatically in that position because I can implement the blocks necessary to uphold the terms of use of my business, or I can just do it for arbitrary reasons. Maybe you are hammering my server. Maybe I'm in a bad mood this morning and don't like that you're using Python.
I can unilaterally decide whether or not you use my business, in any way shape or form, even if I just don't like you, as long as I don't violate any laws (discrimination etc).
I absolutely understand, it's just not hard to make scraper traffic appear as (or be) legitimate browser traffic and/or simply distribute it across numerous IPs. Other technical controls all have trivial circumvention methods. There is legal precedent (at least in the US) suggesting that scraping public information may be permissible under law (see hiQ Labs v. LinkedIn). Scrapers only ever need to succeed once.
Under these circumstances, how can a website operator feel any sense of practical control over scrapers?
This is kind of a silly argument. If a physical business trespasses me for shoplifting, I can just put on a disguise and go back and shoplift more. Why do businesses think they have control over shoplifters?
This is kind of a silly argument, for every item you shoplift: do you ask if you can take it without paying and then get granted permission?
Given that you haven't fixed your problem with scrapers (judging by the complaints you're making right in this thread), it's obvious you're not in a position to restrict the data. Otherwise you wouldn't be complaining about scrapers. And thus, by your own definition, you don't own it.
Considering Walgreens is still fighting shoplifters, it’s obvious they’re not in a position to restrict their merchandise. They must not own it.
I'm glad you agree with my point that Walgreens owns their merchandise not because they stop shoplifters and restrict access; it's because they purchased it and have title over it, and since GP has done no such thing, they don't actually own it.
Well, exactly. blantonl claims that his ownership rights are based on his ability to restrict access to things which is not a mainstream view.
Your example illustrates this nicely. Walgreens owns the goods on their shelves regardless of shoplifters.
So, in which jurisdiction are you? Because in the US, courts have confirmed multiple times that scraping public websites is legal.
https://techcrunch.com/2022/04/18/web-scraping-legal-court/
Are you just trolling at this point?
_You are handing the baton over_ in an HTTP response. If you don't want to do that, then change the logic of your server.
Good grief man.
Then any store is handing over the baton because you can walk in, take merchandise off the shelf, and walk out.
That's not at all what's happening here. This is me walking in, with a polite and well-formed request, regarding a piece of merchandise: "May I have <item>?"
And the store, clearly and with a signed receipt, saying, "Here is the item you requested. Have a nice day."
I think your basic arguments are either:
- scraping is immoral
- we should bake DRM into the internet
There's no technical or legal difference between a scraping request and any other web request, and I can't really believe that you think that non-scraping web requests are immoral, so I think that probably isn't your argument.
Moving onto DRM, I think most people don't want it baked into the internet. I think individual entities can choose to use it if they want--that's basically how you protect against scraping, so I think people irritated by having their content copied and thus devalued (or their ads replaced) should probably just do that.
It seems like you have this imaginary strawman that you hate and it seems like that's the foundation of why you dislike this.
No. The foundation of why I dislike it is simple. If I own some data, then I get to dictate the terms of how that data is used. Period.
“Hustle culture types” is simply a little anecdote about the types that would look you in the eye and tell you they are entitled to disregard what I said above. They’ll usually wrap it in some altruistic bs to justify as well.
Why do you put it on the open internet if you don't want machines to find and read it?
ToS is nice but you can't expect that it applies - the user (of the machine doing the scraping) might be a child which makes the potential contract automatically void, for example. Also, there are people under jurisdictions where such things have no power, or that don't recognize your rights to the data.
And the whole thing of putting data out publicly and then just expecting machines to see the pile of data and go "oh so where do I sign the ToS?" is weird...
Just put it behind a rate limited API key...
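To make that concrete, here is a rough sketch of per-key rate limiting with a token bucket. The class name, the capacity and refill numbers, and the injectable clock are all assumptions picked for illustration, not anyone's production setup.

```python
import time


class KeyRateLimiter:
    """Token bucket per API key.

    Each key may burst up to `capacity` requests and regains
    `rate` requests per second afterwards.
    """

    def __init__(self, capacity: int, rate: float, clock=time.monotonic):
        self.capacity = capacity
        self.rate = rate
        self.clock = clock      # injectable so the limiter can be tested
        self.buckets = {}       # api_key -> (tokens_left, last_seen_time)

    def allow(self, api_key: str) -> bool:
        now = self.clock()
        # Unknown keys start with a full bucket.
        tokens, last = self.buckets.get(api_key, (float(self.capacity), now))
        # Refill for the time elapsed since the last request, capped at capacity.
        tokens = min(self.capacity, tokens + (now - last) * self.rate)
        if tokens >= 1:
            self.buckets[api_key] = (tokens - 1, now)
            return True
        self.buckets[api_key] = (tokens, now)
        return False


if __name__ == "__main__":
    limiter = KeyRateLimiter(capacity=5, rate=1.0)
    print([limiter.allow("demo-key") for _ in range(7)])
```

A server would call `allow()` before doing any work and answer with something like 429 Too Many Requests when it returns False. It does nothing against someone who registers many keys, so it is a cost-raising measure rather than a wall.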
What makes you think putting data on the Internet all of a sudden means I unilaterally surrender the rights to my intellectual property?
If I choose to make my data available to some businesses to make discovery of it easier, and I choose to decline to allow others to unilaterally copy my data to develop a different business, that's my right. And it is unethical and unreasonable for any other person to assume that they are entitled to the same rights I granted someone else.
If I own some data, I get to be the arbiter of the who/what/when/where on the use of the data. Period.
Sure, you can do whatever you like. Cut the connection if you don't like it. But I can do whatever I like too - read the data that your machine sent me, for example. If your machine sends my machine data it's IMHO reasonable to expect that you don't care about me having it unless we agreed otherwise. But in many countries ToS is not considered a legal contract at all - just having it on your site somewhere is not enough. Sometimes not even having users check the ToS checkmark would form a valid contract.
There are many kinds of data that can't be owned at all. Actually it's the other way around - there is a very small subset of data that can be owned. You can try to cover it under some kind of a non-disclosure clause in a contract, but again - a contract would have to exist.
Look, you are trying to argue that you might want to take some data from me and use it in a personal, non-commercial sense. Cool.
The entire purpose of the OP article is to develop a system to directly circumvent data access and protection mechanisms for profit. Pure and simple.
Spare me the altruistic BS. No one is developing and utilizing a cluster of freaking distributed servers with forty 4G modems to do anything other than steal data from services that don't want their data stolen, so they can use it for profit.
You have to call a spade a spade here.
What I'm saying is - your machine is fully capable of providing just the right amount of data to fulfill your purposes. If you don't like people taking it all, don't build a machine that gives it to them at 1 Gb/s. Stuff about some ToS or rights or IP ownership is just noise.
Because intellectual property doesn't exist.
Scraping doesn’t imply IP violation.
As an analogy, imagine that a gardener builds a beautiful flower garden, bisected by a cute stone path, which she invites the public to view freely, save for a single restriction; a sign reading "keep off the flower beds."
There is a well-understood social contract here. I should not drive my car along the path, even if I don't crush the flowers. I shouldn't walk on the flower beds, even if that sign isn't legally enforceable. And if a runaway lawnmower, RC car, or some other machine of mine does end up in the garden, I am responsible, because it was my machine.
With websites, there is even a TOS specifically for scrapers - robots.txt. The fact that it is easy to bypass or ignore is no excuse for actually bypassing or ignoring it.
The anonymity of the Internet functions as a ring of Gyges, where since people don't face consequences (even social ones), they feel entitled to do as they will. However, just because you can do something does not mean you have a right to do something.
Robots.txt is definitely not any kind of ToS - some people (Google) said they will respect it. There's no reason to expect people to even know about the concept - practically nobody knows about it, not even most developers.
And again - there are countries where any ToS without an explicit signature or other kind of legal agreement doesn't apply at all.
Just like writing "by using the toilet you agree to transfer your soul for infinity" on a piece of toilet paper taped somewhere in the vicinity of a toilet gives you nothing - even if it was a more reasonable contract, nobody agreed to anything.
As for your other point, I think this is more like standing next to a highway with a sign that reads "don't drive cars here" and expecting people to stop and turn around. They didn't even see your sign at their speed and it's kinda unreasonable to expect they would be checking for that kind of a sign on a highway. At least make it properly - big, red, reflective (e.g. a Connection Reset, or at least 403 Forbidden).
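A sketch of what the "big red sign" could look like on the server side: a tiny WSGI app that answers 403 Forbidden to user agents it doesn't want. The blocklist substrings and response bodies here are hypothetical, and matching on User-Agent is trivially spoofed, so treat this as the sign rather than the wall.

```python
# Hypothetical blocklist; real deployments would choose their own criteria.
BLOCKED_AGENT_SUBSTRINGS = ("python-requests", "scrapy", "curl")


def app(environ, start_response):
    """Minimal WSGI app that refuses blocklisted user agents with a 403."""
    agent = environ.get("HTTP_USER_AGENT", "").lower()
    if any(marker in agent for marker in BLOCKED_AGENT_SUBSTRINGS):
        start_response("403 Forbidden", [("Content-Type", "text/plain")])
        return [b"Automated access is not permitted.\n"]
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Welcome.\n"]


if __name__ == "__main__":
    # Serve locally with the standard library's reference WSGI server.
    from wsgiref.simple_server import make_server
    make_server("127.0.0.1", 8000, app).serve_forever()
```

At least with this, a client that keeps going after the 403 can't claim it never saw the sign.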
Robots.txt is definitely not any kind of ToS - some people (Google) said they will respect it. There's no reason to expect people to even know about the concept - practically nobody knows about it, not even most developers.
Oh that's bullshit, how do you expect to be taken seriously with such nonsense?
Is it? Just ask around. I have web app devs around me, and they don't know it. Only those who actually specialize in web sites (for presentation) do.
Yes, there is no legal enforcement mechanism behind robots.txt. Nor do I particularly want there to be. However, most people agree that reasonable requests made regarding the use of someone's property should be followed. The capability to do something without consequences is not the same as the right to do something.
Our gardener should not need to build a brick wall around their public garden to keep your lawnmower out.
I think this analogy would be improved if the sign said "Please don't take any pictures." This is far more restrictive than a sign saying "Please don't take any seeds or cuttings." The latter is more understandable because such activity damages the flower garden (particularly if everyone starts taking seeds and cuttings).
Now let's say a photographer visits the flower garden, takes images, and sells them online as post cards? As long as the photographer is not hindering other people (flooding the site with repeat requests, in the analogy), it doesn't seem to be a problem.
On the other hand, let's say we don't have a flower garden, we have an art gallery or a street artist's display - or the pages of a recently published book. Now the issue is distributing copyrighted material without paying the creator... but what if there's a broad social consensus that copyright is out of control and should have been radically shortened decades ago?
The vast majority of data being scraped is not copyrightable creative work, however, so as long as you're not obnoxiously hammering a site, scraping seems perfectly ethical.
Serving HTML will get you scraped. Your terms don't overrule fair use.
What if you got that data from me/users and I/we claim the same rights (like GDPR for example)? Will you still honour ownership as above?
There's a lot of local history locked up in facebook's nostalgia groups. I want to archive it in an open format.
I want to grab new rental listings and put them in an RSS feed, so I only look at each one once.
Those are my uses for data scraping right now. If that destroys someone's business, I don't actually care. Maybe it's selfish, but my right to re-format data for my own convenience outweighs their right to make a profit.
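For the rental listings case, the "only look at each one once" part is the interesting bit. A minimal sketch, assuming the listings have already been fetched and parsed into dicts with hypothetical `id`, `title`, and `url` keys (the actual fetching depends on the site and is left out):

```python
import xml.etree.ElementTree as ET


def new_listings(listings, seen_ids):
    """Return listings whose id is not in seen_ids, and record them as seen."""
    fresh = [item for item in listings if item["id"] not in seen_ids]
    seen_ids.update(item["id"] for item in fresh)
    return fresh


def to_rss(listings, title="New rental listings", link="https://example.com/"):
    """Render listings as a bare-bones RSS 2.0 document (returned as a string)."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = "Listings not seen before"
    for item in listings:
        entry = ET.SubElement(channel, "item")
        ET.SubElement(entry, "title").text = item["title"]
        ET.SubElement(entry, "link").text = item["url"]
        ET.SubElement(entry, "guid").text = item["id"]
    return ET.tostring(rss, encoding="unicode")


if __name__ == "__main__":
    seen = set()  # in practice this would be persisted between runs
    batch = [
        {"id": "a1", "title": "2BR near park", "url": "https://example.com/a1"},
        {"id": "b2", "title": "Studio downtown", "url": "https://example.com/b2"},
    ]
    print(to_rss(new_listings(batch, seen)))
```

Run on a schedule with the seen set saved to disk, each listing shows up in the feed exactly once.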
If that destroys someone's business, I don't actually care. Maybe it's selfish, but my right to re-format data for my own convenience outweighs their right to make a profit.
Exhibit A
Yeah, it's as unsympathetic a framing of my position as I can offer.
But it's basically the same question as adblockers: Can I do what I want with the 1's and 0's on my own machine?
I'm not going to accept that I owe anyone a business model.
I'm not going to disagree with your use case here.
But I'm going to assume that you have some level of a conscience and you don't really mean you couldn't give 3 shits about someone else's hard work so you can have some satisfaction at home. Because at face value that's exactly what you said.
No, I think that's fair. Unsympathetic framing, but not inaccurate. It's that whole "information wants to be free" thing.
BTW, kudos for presenting your point of view in a hostile forum and holding your own. I should have said that up front.
Not that I think you shouldn't do it or you're doing something wrong, but describing it as a right rubs me the wrong way. You don't have any right to expect someone else's computers to work for you.
I'm not sure how to phrase it except in terms of competing rights, but I take your point.
At the point where I'm scraping, the data's on my computer though.
You could call them interests.
It's often in a business's interest to format data in a specific way to make money, for example interlacing it with ads.
Nice.
I use web scraping to identify and monitor fraud.
Exhibit A: https://archive.ph/0ZUA8
This website is used to recruit people to set up "lead generation" Google Business Profiles and leave paid reviews.
Exhibit B: https://archive.ph/WWZuw
This is an example of the Craigslist ad used to initially attract people to the website above.
Exhibit C: https://archive.ph/wip/7Xig4
This is one of the Google Maps contributors which left paid reviews.
If you start with the reviews on that profile, you'll find a network of Google Business Profiles for fake service-area businesses connected through paid reviews.
Web scraping allows me to collect this type of data at scale.
I also use scraping to monitor the status of fake listings. If they are removed, the actor behind them will often get them reinstated. This allows me to report them again.
I don't care if you use Web scraping to solve the Israeli / Palestinian conflict. You're not entitled to anyone's data, computers, services, etc because you've decided for altruistic reasons that it is appropriate.
Cool use case. Love it. Fascinating stuff. But if Google told you to stop, would you? Or would you instead decide to build a 5 server cluster of 200 4G modems spread across continents to continue your work? Because if you did I would assume that you've decided to move on from a cute little altruistic process into a commercial use of someone else's data to make a profit.
Wait - so you are saying that information on the public internet isn’t public? Man, I wish people would remember the origin of the web and the entire reason it exists. If you don’t want information public, protect it - otherwise, I say it’s fair game.
Remember the OP article is about a system that is designed to completely and directly circumvent protections.
If an organization puts a series of processes in place to prevent scrapers from wholesale taking data in violation of terms of service, and you develop a 5 server cluster of 200x 4G modems it's no longer "fair game" and you're directly being unethical in your use of someone else's services.
Yeah, I think it's fair to say that in the presence of anti-bot measures (whether they work or not) that the content on the website isn't public anymore.
Available to someone meeting certain criteria (student discount, senior discount) doesn't mean available to anyone. I see no reason why "not available to be consumed by autonomous agents" would be any less valid than unlimited refills being available only to humans and not robots.
Maybe it is not the opinion which is unpopular, but the way it is being presented.
I agree that there is a line at using someone else’s data to make a profit, but it is kind of ironic that you mention Google, because their exact business model is scraping websites to feed their search results and litter them with ads to make a profit. For me there is a big line between aggregating publicly available data (search results, reviews, news, job postings, etc.) and intentionally violating terms of service, like signing up for fake accounts and harvesting user data. So entitled, maybe not (sites can try to prevent you from scraping), but if you make something publicly available you shouldn’t be surprised when people use it in ways you may not originally have intended (within legal boundaries of course).
Maybe you should though. It's always worth it to think about which giant's shoulder you're standing on. It's giants all the way down.
That's a lot of righteous anger for somebody building a business on top of other people's data.
"Broadcastify is the worlds largest source of public safety, aircraft, rail, and marine radio live audio streams."
I have no sympathy whatsoever. You're just complaining about the very thing you're doing. If it's fair for you to do that, it's fair for others to do it to you.
They volunteer to provide the data to us. Every single last one of them. Nowhere in our business model did we make the conscious decision to say "hey, look at that business, they have something, and I'm going to take it."
Reading public website data is not "taking it". It is still there.
Observing publicly available information is not theft, nor is it illegal.
Of course copyright rules apply, but that is for if you reproduce something.
reproduce something
No one is developing a 5 server cluster with 200+ 4G modems to observe publicly available information. They are using said cluster to deliberately work around blocks, rate limits, and restrictions on scrapers who are scraping content solely to reproduce the data and use it for commercial purposes (make money).
Aren't you also volunteering your data? Don't browsers just talk to your webserver and say "Hey, what do you have?" and your site responds in kind?
I absolutely agree. In fact, I think the problem is that, like everything, there is an optimal point for efficiency, and crossing that line by making things "too easy" when it comes to data means too much power for one person to handle ethically. Absolute power may corrupt absolutely, but near-absolute power also corrupts quite nicely, too.
In short, we should have limits on the amount of scraping possible, simply because humans can never be trusted past a certain point to remain ethical. After all, ethics at its first approximation is only a mechanism to improve societal cohesiveness, and it only works as long as the person doesn't have enough power to "do away" with society.
Would you make the same argument of the inverse: data gathering?
Is it unethical for a mouse to eat the cheese without triggering the trap?
The Web (you said the "Internet", but you meant the Web) was not envisioned to be a commercial space. Your statement is antithetical to the original idea of the open Web. It's when the MBAs joined the party circa 2k and decided to profit out of it that all of these confused and wrong opinions about what the Web should be arose, and that led to the situation today. Your statement is a vast display of zero historical context. MBAs are obviously not very concerned with history. They just want to protect their own little turd for their own little profit and vanity, which is why they now put it behind a paywall, JS, and anti-bot proxies.