How big is YouTube?

This is such a clever way of sampling, kudos to the authors. Back when I was at Pew we tried to map YouTube using random walks through the API's "related videos" endpoint and it seemed like we hit a saturation point after a year, but the magnitude described here suggests there's a quite a long tail that flies under the radar. Google started locking down the API almost immediately after we published our study, I'm glad to see folks still pursuing research with good old-fashioned scraping. Our analysis was at the channel level and focused only on popular ones but it's interesting how some of the figures on TubeStats are pretty close to what we found (e.g. language distribution): https://www.pewresearch.org/internet/2019/07/25/a-week-in-th...

This technique isn't new. Biologists use it to count the number of fish in a lake. (Catch 100 fish, tag them, wait a week, catch 100 fish again, count the number of tagged fishes in this batch)

Do you get the same 100 dumb fish?

Why are they dumb? Free tag.

Imagine being the only fish without a tag. Everyone at school will know how lame you are.

This comment. Please see here.

It would be illegal not to have a tag. If the fish has nothing to hide, it shouldn't worry about being tagged.

And, also, the fish gets tagged for its own good.

Everyone at school will know how lame you are.

They'll even call you tinfoil fish.

Catching fish is theoretically not perfectly random (risk-averse fish are less likely to get selected/caught) but that's the best method in those circumstances and it's reasonable to argue that the effect is insignificant.

You make a very weak argument, and are simply assuming the conclusion.

What makes it the "best method"? Would it be better to use a seine, or a trap, or hook-and-line? How would we know if there are subpopulations that have different likelihood of capture by different methods?

To say it's "reasonable to argue that the effect is insignificant" is purely assertion. Why is it unreasonable to argue that a fish could learn from the first experience and be less likely to be captured a second time?

If what you mean is that it's better than a completely blind guess, then I'd agree. But it's not clearly the best method nor is it clearly unbiased.

Fair points. But, mark-recapture is about practicality. It's not perfect, but it's a solid compromise between accuracy and feasibility (so I mean best in these regards, to be 100% clear). Sure, different methods might skew results, but this technique is about getting a reliable estimate, not pinpoint accuracy. As for learning behavior in fish, that's considered in many studies (and many other things, like listed here: https://fishbio.com/fate-chance-encounters-mark-recapture-st... ), but overall, it doesn't hugely skew the population estimates. So, again, it's about what works best in the field, not in theory.

only if you're within a 100 mile radius of me the ultimate dumb fish

Wouldn't a previously caught fish be less likely to fall for the same trick a second time?

That's typically the Lincoln-Petersen Estimator. You can use this type of approach to estimate the number of bugs in your code too! If reviewer A catches 4 bugs, and reviewer B catches 5 bugs, with 2 being the same, then you can estimate there are 10 total bugs in the code (7 caught, 3 uncaught) based on the Lincoln-Petersen Estimator.

But this implies that all bugs are of equal likelihood of being found which I would highly doubt, no?

Yes, it's obviously not a perfect estimate, but can be directionally helpful.

You could bucket bugs into categories by severity or type and that might improve the estimate, as well.

Oh this is a really interesting concept.

I guess it underestimates the number hard to find bugs though since it assumes same likelyhood to be found.

A similar approach is “bebugging” or fault seeding: purposely adding bugs to measure the effectiveness of your testing and to estimate how many real bugs remain. (Just don’t forget to remove the seeded bugs!)

https://en.m.wikipedia.org/wiki/Bebugging

That's not actually the technique the authors are using. Catching 100 fish would be analogous to "sample 100 YouTube videos at random", but they don't have a direct method of doing so. Instead, they're guessing possible YouTube video links at random and seeing how many resolve to videos.

In the "100 fish" example, the formula for approximating the total number of fish is:

    total ~= caught / tagged
    (where caught=100 in the example)

In their YouTube sampling method, the formula for approximating the total number of videos is:

    total ~= (valid / tried) * 2^64

Notice that this is flipped: in the fish example the main measurement is "tagged" (the number of fish that were tagged the second time you caught them), which is in the denominator. But when counting YouTube videos, the main measurement is "valid" (the number of urls that resolved to videos), which is in the numerator.

Did you understand where the 2^64 came from in their explanation btw? I would have thought it would be (64^10)*16 according to their description of the string.

Edit: Oh because 64^10 * 16 = (2^6)^10 * (2^4)

The YouTube identifiers are actually 64 bit integers encoded using url-safe base64 encoding. Hence the limited number of possible characters for the 11th position.

It’s not even new in the YouTube space as they acknowledge from 2011

https://dl.acm.org/doi/10.1145/2068816.2068851

That's only vaguely the same. It would be much closer if they divided the lake into a 3D grid and sampled random cubes from it.

Also related is the unseen species problem (if you sample N things, and get Y repeats, what's the estimated total population size?).

https://en.wikipedia.org/wiki/Unseen_species_problem http://www.stat.yale.edu/~yw562/reprints/species-si.pdf

Isn’t this just a variation of the Monte Carlo method?

I made the same connection but it’s still the first time I’ve seen it used for reverse looking up IDs.

I think YouTube locked down their APIs after the Cambridge Analytica scandal.

in the end, that scandal was the open web's official death sentence :(

The issue wasn't the analytics either. The issue was the engagement algorithms and lack of accountability. Those problems still exist today.

And what kind of accountability is that? An engagement algorithm is a simple thing that gives people more of what they want. It just turns out that what we want is a lot more negative than most people are willing to admit to themselves.

Engagement can be quite unrelated to what people like. A well crafted troll comment will draw tons of engagement, not because people like it.

If people didn't like engaging with troll comments, they wouldn't do it. It's not required, and they aren't getting paid.

This comment has a remarkable lack of nuance in it. That isn't even remotely close to how how human motivation works. We do all kinds of things motivated by emotions that have nothing to do with "liking" it.

I don't think people "like" it as much as hate elicits a response from your brain, like it or not.

If people had perfect self-control, they wouldn't do it. IMO it's somewhat irresponsible for the algorithm makers to profit from that - it's basically selling an unregulated, heavily optimized drug. They downrank scammy content for instance, which limits its reach - why not also downrank trolling? (obviously bc the former directly impacts profits, but not the latter, but still)

I would rephrase that to 'what we predictably respond to'.

You can legitimately claim that people respond in a very striking and predictable way to being set on fire, and even find ways to exploit this behavior for your benefit somehow, and it still doesn't make setting people on fire a net benefit or a service to them in any way.

Just because you can condition an intelligent organism in a certain way doesn't make that become a desirable outcome. Maybe you're identifying a doomsday switch, an exploit in the code that resists patching and bricks the machine. If you successfully do that, it's very much on you whether you make the logical leap to 'therefore we must apply this as hard as possible!'

So as usual, the exploitative agents get to destroy the commons and come out on top.

We need to figure out how to target the malicious individuals and groups instead of getting creeped out by them to the point of destroying most of the so praised democratizing of computing. Between this and locking down the local desktop and mobile software and hardware, we've never got to having the promised "bicycle for the mind".

Subscription VPNs show that there is potential for private security online.

Could a company be a bug you put a bounty on?

no one promised you anything

In which ways were the Cambridge Analytica thing and the openness of Youtube APIs (or other web APIs) related? I just don't see the connection

The original open API from the Facebook was open for the benefit of the good actors to use their data. You can disagree with how it's used, but u can't disagree with the intention.

With the CA scandal, now all the big companies would lock down their app data and sell ads strictly through their limited API only, so the ads buyer would have much less control before.

It's basically saying: u cant behave with the open data. Then we will do business only

CA was about 3rd parties scraping private user data.

Companies are locking down access to public posts. This has nothing to do with CA, just with companies moving away from the open web towards vertical integration.

Companies requiring users to login to view public posts (Twitter, Instagram, Facebook, Reddit) has nothing to do with protecting user data. It's just that tech companies now want to be in control of who can view their public posts.

I'm a bit hazy on the details of the event but the spirit still applies: there were more access to the data that were not 100% profit driven. Now the it's locked down as the companies want to cover their asses and do not want another CA

Wasn't the "open" data policy used to create Clearview AI to create a profile and provide it to US govt departments?

They actually held out for a couple of years after Facebook and didn't start forcing audits and cutting quotas until 2019/2020

Google started locking down the API almost immediately after we published our study

Isn't this ironic, given how google bots scour the web relentlessly and hammer sites almost to death?

google bots scour the web relentlessly and hammer sites almost to death

I have been hosting sites and online services for a long time now and never had this problem, or heard of this issue ever before.

If your site can't even handle a crawler, you need to seriously question your hosting provider, or your architecture.

Perhaps stop and reconsider such a dismissive opinion given that "you've never had this issue before" then? Or go read up a bit more on how crawlers work in 2023.

If your site is very popular and the content changes frequently, you can find yourself getting crawled a higher frequency than you might want, particularly since Google can crawl your site at a high rate of concurrency, hitting many pages at once, which might not be great for your backend services if you're not used to that level of simultaneous traffic.

"Hammered to death" is probably hyperbole but I have worked with several clients who had to use Google's Search Console tooling[0] to rate-limit how often Googlebot crawled their site because it was indeed too much.

0: https://developers.google.com/search/docs/crawling-indexing/...

if your site is popular and you have a problem with crawlers use robots.txt (in particular the Crawl-delay stanza)

also for less friendly crawlers a rate limiter is needed anyway :(

(of course the existence of such tools doesn't give carte blanche to any crawler to overload sites ... but let's say they implement some sensing, based on response times, that means a significant load is probably needed to increase response times, which definitely can raise some eyebrows, and with autoscaling can cost a lot of money to site operators)

I have a website thats get crawled at least 50 times per second. Is that a real deal? No not really. The site is probably doing 10.000 requests per second. I mean a popular site is indexed a lot. Your webserver should be designed for it. What tech are you using if I may ask?

I worked at a company back in 2005-2010 where we had a massive problem with Googlebot crawlers hammering our servers, stuff like 10-100x the organic traffic.

That's pre-cloud ubiquity so scaling up meant buying servers, installing them on a data center, and paying rent for the racks. It was a fucking nightmare to deal with.

This is one of the most important parts of the EUs upcoming digital services act in my opinion. Platforms have to share data with (vetted) researchers, public interest groups and journalists.

Vetted always means people with the time, resources and desire to navigate through the vetting process, which makes them biased.

I would argue it's better than nothing, and what are they going to be biased towards?

For aggregated data and stats like this I think it could be fully publicly available.

"Rules for thee, but not for me"

This would find things like unlisted videos which don’t have links to them from recommendations.

That’s a really good point. I wonder if they have an estimate of the percentage of YouTube videos that are unlisted.

I was expecting to find out how much data YouTube has, but that number wasn't present. I've used the stats to roughly calculate that the average video is 500 seconds long. Then using a bitrate of 400 KB/s and 13 billion videos, that gives us 2.7 exabytes.

I got 400KB/s from some FHD 24-30 fps videos I downloaded, but this is very approximate. YouTube will encode sections containing less perceptible information with less bitrate, and of course, videos come in all kinds of different resolutions and frame rates, with the distribution changing over the history of the site. If we assume every video is 4K with a bitrate of 1.5MB/s, that's 10 exabytes.

This estimate is low for the amount of storage YouTube needs, since it would store popular videos in multiple datacenters, in both VP9 and AV1. It's possible YouTube compresses unpopular videos or transcodes them on-demand from some other format, which would make this estimate high, but I doubt it.

That storage number is highly likely to be off by an order of magnitude.

400KB/s, or 3.2Mbps as we would commonly use in video encoding, is quite low for original quality upload in FHD or commonly known as 1080p. The 4K video number is just about right for average original upload.

You then have to take into account YouTube at least compress those into 2 video codec, H.264 and VP9. Each codec to have all the resolution from 320P to 1080P or higher depending on the original upload quality. With many popular additional and 4K video also encoded in AV1 as well. Some even comes in HEVC for 360 surround video. Yes you read that right. H.265 HEVC on YouTube.

And all of that doesn't even include replication or redundancy.

I would not be surprised if the total easily exceed 100EB. Which is 100 (2020 ) Dropbox in size.

For comparison, 100-200EB is roughly the order of magnitude of all HDDs shipped per quarter:

https://blocksandfiles.com/2023/07/11/disk-drive-numbers/

Sure but exabyte-level storage is done with tape, not HDDs.

I doubt any of YouTube is being served off tape. Some impressive amount of its data is probably served from RAM.

I mean, it would explain the minutes-long unskippable ads you get sometimes before a video plays. There's probably an IT maintenance guy somewhere, fetching that old video tape from cold storage and mounting it for playback.

I don’t have ads. No wait times.

The 4K video number is just about right for average original upload.

No, it definitely is not.

For Smartphone / CE grade recorded video. Not for professional ones. Remember that number is average.

4k is not the average. The amount of sd videos out number hd and of the hd most are 1p max.

What is 1p?

It's not the average in either case.

EB

I pine for the day when "hella-" extends the SI prefixes. Sadly, they added "ronna-" and "quetta-" in 2022. Seems like I'll have to wait quite some time.

Originally 'quecca' had been suggested for 1030 but was too close to a profane meaning in Portuguese

(from https://iopscience.iop.org/article/10.1088/1681-7575/ac6afd, via wikipedia)

Seems like we missed opportunity to have an official metric fuckton.

For anyone wondering "queca" would be the normal spelling of the "profanity" although it's probably one of the milder ways to refer to "having sex". "Fuck" would be "foda" and variations. Queca is more of a funny way of saying having sex, definitely not as serious as "fuck".

Hyundai Kona on the other hand was way more serious and they changed it to another Island in the Portuguese market. Kona's (actual spelling "cona") closest translation would be "cunt", in the US sense in terms of seriousness, not the Australian more light one.

Source is I'm portuguese

We should petition for "fukka-"

I was under the impression that all Google storage including GCP (and replication) is under 1ZB.

IIUC, ~1ZB is basically the entire hard drive market for the last 5 years, and drives don't last that long...

I suspect YouTube isn't >10% of all Google.

So is YouTube storing somewhere in the realm of 50-100EB somewhere? How many data centers is that?

Keep in mind youtube permanently keeps a copy of the original upload which may have an even larger file size

Do you know that for certain? I always suspected they would, so they could transcode to better formats in the future, but never found anything to confirm it.

I know it was true at least a couple of years ago, but there is no guarantee that they keep it.

On all of the videos I have uploaded to my YouTube channel, I have a "Download original" option. That stretches back a decade.

Granted, none of them are uncompressed 4K terabyte sized files. I haven't got my originals to do a bit-for-bit comparison. But judging by the filesizes and metadata, they are all the originals.

I'm interested in the carbon footprint.

https://sustainability.google/operating-sustainably/net-zero...

Thanks, 10 million metric tons per year. ~0.1% of global emissions.

Pretty wild stuff.

On one hand: just two formats? There are more, e.g. H264. And there can be multiple resolutions. On the same hand: there might be or might have been contractual obligations to always deliver certain resolutions in certain formats.

On the other hand: there might be a lot of videos with ridiculously low view counts.

On the third hand: remember that YouTube had to come up with their own transcoding chips. As they say, it's complicated.

Source: a decade ago, I knew the answer to your question and helped the people in charge of the storage bring costs down. (I found out just the other day that one of them, R.L., died this February... RIP)

For resolutions over 1080, it's only VP9 (and I guess AV1 for some videos), at least from the user perspective. 1080 and lower have H264, though. And I don't think the resolutions below 1080 are enough to matter for the estimate. They should affect it by less than 2x.

The lots of videos with low view counts are accounted for by the article. It sounds like the only ones not included are private videos, which are probably not that numerous.

I did the math on this back in 2013, based on the annual reported number of hours uploaded per minute, and came up with 375PB of content, adding 185TB/day, with a 70% annual growth rate. This does not take into account storing multiple encodes or the originals.

You’re forgetting about replication and erasure coding overhead. 10 exabytes seems very low tbh. I’d put it closer to 50-100EB at this point.

It’s really easy for them to block this method: return a random video for a certain percent of non existing identifiers. Throw in a bit of random for good measure

This is the risk of explaining the method

My guess is that would be very, very difficult without breaking a ton of other invariant assumptions throughout the system, like:

Video IDs are immutable

A video can only be represented by a single unique video ID

etc.

You could keep that assumption and just serve a random video if some external IP dials a non-existent ID.

Presumably, no one would do that except researchers trying to count videos (or randomly find hidden ones?).

You can break the assumption of unicity (if an unassigned ID is later assigned) if you do that internally, although not sure that’d be common but it’s not an assumption that has to be strict for non-attributed ones, and you never use the fake ID.

This is easy to detect, though, at least for public videos. Click through to the channel on a different IP and find the video link, or search for the video's title and description, or find a canonical link to the video by any other means. If the IDs don't match, it's been faked.

You misunderstand that this technique was executed by using YouTube _search_ to find videos, not by querying the exact URL. They can doctor search results however they like.

return a random video for a certain percent of non existing identifiers.

if you got a video from a randomly generated ID, you can immediately query for it again, and see if the video was the same as before.

If it's not the same, you disgard the result and assume the ID generated was actually non-existent.

If it's the same, then you know it's a real ID.

As long as youtube vidoe URLs are immutable, this method will stop the blocks you described.

Unless they cache the false information…

You could use a hashing function that guarantees the same result every time.

but now you have to store state for some arbitrary url, share it across many different clusters/regions, record which IP has seen it etc...

a lot of work, for very little gain.

I think this would be easily detectable anyway as whatever IP’s they are using would be viewing videos that basically no one ever watches.

The paper says they made 1.8e10 attempts to produce 1e4 videos. TFA says they now have 2.5e4 videos, so they’ve made >4e10 attempts so far. No way in hell you can scrape YouTube like that without buying access to a big proxy IP pool (e.g. Luminati).

They could use a random redirect function. Throw the user a video URL which is random, but then redirects to the video that you want. So one could never count the space?

Are the video IDs sequential in the available domain? or just all over the place? Is there anything in common with all known live video IDs that could make it easier to scan the quintillion possibilities?

Or don't return any search results for searches for random hashes

I can’t imagine trying to debug production errors with a system like this.

Though if they didn’t say they were doing that we wouldn’t know the method was invalid. Further, that other video would have its own existing uid, so in theory we could know if they’d duplicated it to thwart these measures.

does that sampling function assume every "area code" contain the same number of usable numbers? In the case of some big sites out there (twitter, etc), it's common to have certain shards be much less dense when they hold data that's more requested - eg there'd be fewer numbers on the area code Justin Bieber is in, etc. That might skew things considerably..

Yeah that was my immediate response - what makes them think that videos are actually distributed randomly across the space? I would assume that the numbers are more like a UUID which has random sections but also meaningful sections as well, so it is quite likely that certain subsets have skewed or more/less density.

Their estimator does not require any assumption of uniformity. Notice that P(random ID is valid) = total_valid_IDs / total_IDs is the same regardless of the distribution of valid IDs.

I don't think this is correct. In order to generate a random number at all you first have to assume a distribution. If they generate a series of random ids, and those random ids are uniformly distributed across the space of all possible ids - while the valid ids have clusters - then won't their method still give a skewed estimate?

Nope. Let's take a small example. We have 20 bins. You are going to put 3 things in those bins, with each thing getting its own bin.

I'm going to roll a d20, which gives me a uniformly distributed integer from 1 through 20. I'll look into the bin with that number and if there is something there take it.

You don't want me to take any of your things. How can you distribute your 3 items among those 20 bins to minimize the chances that I will take one of your items?

If I were not using a uniformly distributed integer from 1 through 20 and you knew that and knew something about the distribution you could pick bins that my loaded d20 is less likely to choose.

But since my d20 is not loaded, each bin has a 1/20 chance of being the one I try to steal from. Your placement of an item has no affect on the probability that I will get it.

It works the same the other way around. If you place the items using a uniform distribution, then it doesn't matter if I use a loaded d20, or even just always pick bin 1. I'll have the same chance of getting an item no matter how I generate my pick.

In general when you have two parties each picking a number from a given space, if one of the parties is picking uniformly then nothing the other party can do affects the probability of both picking the same number.

Let's take a small example. We have 20 bins. You are going to put 3 things in those bins, with each thing getting its own bin.

Now imagine a number of these bins _can hold zero things_ (not 3). Eg in a world where all bins are the same size, you can always steal 3 things from any of the bins, whereas in a world where the bin sizes vary. You'd hit a few bins which are guaranteed empty. Doesn't this directly affect the probabilities?

The key here is that the query sample is uniformly distributed, and that this is sufficient. I think some other comments in this thread give some decent intuitions why this is true. Cheers

No. Think about it: how could the underlying distribution change the probability of a hit from a uniform random sample?

I'm not sure they've sampled enough videos to be able to make that kind of central limit theorem assertion. A trillionth of a percent is an awfully small sample of the total space.

How did you get the trillionth of a percent? The total space is 2^64 and they found 24,964 videos. Based on their estimated number of videos (13,325,821,970) we can infer[1] that they have made the equivalent of a sample of size 3.46e13. I say equivalent because of the "cheating" they mention in the article which improves the efficiency of their method so their actual sample would be 32000 times smaller I guess). Anyway as shown in the link below that's a good sample size since it gives a precision of about 0.6%.

[1] https://news.ycombinator.com/item?id=38744227

If they aren’t distributed randomly, but the test ids are chosen randomly, it’s still statistically sound.

This is an interesting way to attack mitigations to the German Tank Problem [0].

I expect the optimal solution is to increase the address space to prevent random samples from collecting enough data to arrive at a statistically significant conclusion. There are probably other good solutions which attempt to vary the distribution in different ways, but a truly random sample should limit that angle.

[0] https://en.m.wikipedia.org/wiki/German_tank_problem

I didn't read it in the article but this hinges on it being a discrete uniform distribution. Who knows what kind of shenanigans Google did to the identifiers.

Actually the method works regardless of the distribution. It's an interesting and important feature of random sampling. Consider 1000 IDs assigned in the worst (most skewed) way: as a sequence from 0 to 999. If there are 20 videos they will have IDs 0 to 19. If you draw 500 independent random numbers between 0 and 999, each number will have 2% probability of being in the range 0 to 19. So on average you will find that 2% of your 500 attempts find an existing video. From that you conclude that 2% of the whole space of 1000 IDs are assigned. You find correctly that there are 0.02*1000 = 20 videos.

You've assumed the conclusion, you don't know it's 2%; you don't know there are 1000 possible addresses.

For this example he does, he's saying [0, 999] is the possible space, but the distribution in that space is not done uniformly.

Exactly, and if I make a mistake and assume that the possible space is 0 to 3999, it still works! I'll just need a bigger sample to estimate the number of videos with the same precision. (The method does fail if I exclude valid values, e.g. assume a space of 0 to 499).

Would be quite the challenge to use a skewed distribution of the address space that's skewed enough to mitigate this type of scraping while at the same time minimizing the risk of collisions.

this is exactly what springs to mind when it emerged google "conveniently" autocomplete under certain circumstances, thus making those identifiers more likely to be targeted. this completely skews the sample from the outset

How does a random sample solve for a clustered, say, distribution? Don't the estimations rely on assumptions of continuity?

Suppose I have addresses /v=0x00 to 0xff, but I only use f0 to ff; if you assume the videos are distributed randomly then your estimates will always be skewed, no?

So I take the addressable space and apply an arbitrary filter before assigning addresses.

Equally random samples will be off by the same amount, but you don't know the sparsity that I've applied with my filter?

As long as the sampling isn't skewed, and is properly random and covers the whole space evenly, it will estimate cardinality correctly regardless of the underlying distribution.

There is no way for clustering to alter the probability of a hit or a miss. There is nowhere to "hide". The probability of a hit remains the proportion of the space which is filled.

Does YouTube generate video URLs completely randomly though? I'd be surprised if that was actually the case.

Wait a second. If the IDs are all allocated in a contiguous block, and the author samples the whole space at random—then the estimate will converge to the correct value. If, on the other hand, the IDs are allocated at random, the estimate will also converge to the correct value.

If you believe that there is some structure in YouTube video IDs, that would have no effect on this experiment. It would just reduce the fraction of the total address space that YouTube can use. This is a well-known property of "impure" names, and it means there is a good chance that the IDs have no structure. In other words, the video IDs would be "pure" names.

Is 32,000 a good enough number to estimate the entirety of the Youtube’s video space? It felt to little for what they are trying to accomplish (especially when they started doing year by year analysis)

32000 is just the "cheat factor" by which they increase the method's efficiency.

I'm not sure how much the "cheating" would affect the precision of the result. But assuming it has no effect, it's easy to estimate this precision:

They found X = 24964 videos in a search space of size S = 2^64. For the number of existing videos they report the estimate N = 13,325,821,970. From this we can find their estimate for the probability that a particular ID links to a video: p = N / S ≈ 7.22e-10. So the equivalent number of IDs that they have checked (the number of checks without cheating that would give the same information) is n = X / p ≈ 3.46e13.

Since X is a Binomial, its variance is Var(X)=n⋅P(1-P) (where P is the real proportion corresponding to the estimate p above). And N = X⋅S/n so its variance is Var(X)⋅S^2/n^2. The standard deviation of N is thus σ = S⋅sqrt(P⋅(1-P)/n). Now we don't know P but we can use our estimate p instead to find an estimate of σ!

We find that the standard deviation of their estimator for the number of YouTube videos is approximately S⋅sqrt(p⋅(1-p)/n) ≈ 8.43e7. That's just 0.633% of N so their estimate is quite precise.

When you're estimating a ratio between two outcomes, the rule of thumb is that you want at least 10-100 samples of each outcome, depending on how much precision you want.

They got 10,000 samples of hits, and a huge number of samples of misses. Their result should be very accurate. (32,000 was a different number)

The beauty of this method is that it doesn't matter. Even if YouTube generated sequential IDs, the researchers could still sample them fairly by testing random numbers.

As long as the researchers' numbers were sufficiently random

Randomly, probably not. But I bet that they go pretty uniformly through their URL hashing space.

Anything random-ish will make this estimation method work. On either youtube's side or the measuring side.

Why would they not be random? Nobody has ever found a pattern that I'm aware of, and there are pretty solid claims of past PRNG use. And a leak of the PRNG seed was likely why they mass-privated all unlisted videos a couple years ago.

As a user, for me, YouTube has been shrinking ever since the first adpocalypse. By that I mean I rarely see new videos or relevant recommendations. It is all stale content and pages only couple of pages instead of going infinite like it used to. My most visited page is home page which shows recommended content(it used to be on a dedicated page) and I literally see the same content for days, rarely with new videos. And I have hundreds of subscribed channels. Youtube is a living corpse, I do not care what numbers they put out in the public.

That's what I was about to comment, 13 billion videos and the same 10 videos recommended in a loop, what a product failure.

Not that surprising. 13 billion videos is a lot cheaper to host when everyone is only looking at 10 in a loop.

There's indeed a big difference of objective between YouTube and their users because of that.

However, that's also an opportunity that YouTube hasn't realized yet, they could become a primary platform for content if they had a better algorithm and a better search, it would help them to monetize better those subscriptions.

Youtube has one of the lowest per viewer revenue and the poor discovery isn't an accident in that.

YouTube also seems to forget what videos I've watched, so I end up having videos recommended to me that I've already seen.

People rewatch a lot, so YouTube is suggesting what it thinks you might likely be inclined to rewatch, based on what other people rewatch.

Go to the subscribed page and watch the channels you subscribed to. The home page is whatever YouTube is pushing you to watch - it’s not necessarily what you want to watch…

Yeah some people never do that for some reason. I have a simple little tampermonkey script set up to automatically just redirect me to my subscriptions when I go to the home page on accident.

Funny, for me it's the opposite, I hardly ever set foot outside of my subcriptions page except for searching, which is how I find new channels to add to subs.

The author notes that they used "cheats". Depending on what these do the iid assumption of the samples being independent could be violated. If it is akin to snowball sampling it could have an "excessive" success rate thereby inflating the numbers.

Jason found a couple of cheats that makes the method roughly 32,000 times as efficient, meaning our “phone call” connects lots more often

I agree.

Proving that using cheats and auto complete does not break sample independence and keeps sampling as random as possible would be needed here for stats beginners such as me!

Drunk dialing but having a human operator that each time tries to help you connect with someone, even if you mistyped the number... Doesn't look random to me.

However I did not read the 85 pages paper... Maybe it's addressed there.

Page 9 & 10 of the paper [1] go into some detail:

By constructing a search query that joins together 32 randomly generated identifiers using the OR operator, the efficiency of each search increases by a factor of 32. To further increase search efficiency, randomly generated identifiers can take advantage of case insensitivity in YouTube’s search engine. A search for either "DQW4W9WGXCQ” or “dqw4w9wgxcq” will return an extant video with the ID “dQw4w9WgXcQ”. In effect, YouTube will search for every upper- and lowercase permutation of the search query, returning all matches. Each alphabetical character in positions 1 to 10 increases search efficiency by a factor of 2. Video identifiers with only alphabetical characters in positions 1 to 10 (valid characters for position 11 do not benefit from case-insensitivity) will maximize search efficiency, increasing search efficiency by a factor of 1024. By constructing search queries with 32 randomly generated alphabetical identifiers, each search can effectively search 32,768 valid video identifiers.

They also mention some caveats to this method, namely, that it only includes publicly listed videos:

As our method uses YouTube search, our random set only includes public videos. While an alternative brute force method, involving entering video IDs directly without the case sensitivity shortcut that requires the search engine, would include unlisted videos, too, it still would not include private videos. If our method did include unlisted videos, we would have omitted them for ethical reasons anyway to respect users’ privacy through obscurity (Selinger & Hartzog, 2018). In addition to this limitation, there are considerations inherent in our use of the case insensitivity shortcut, which trusts the YouTube search engine to provide all matching results, and which oversamples IDs with letters, rather than numbers or symbols, in their first ten characters. We do not believe these factors meaningfully affect the quality of our data, and as noted above a more direct “brute force” method - even for the purpose of generating a purely random sample to compare to our sample - would not be computationally realistic.

[1]: https://journalqd.org/article/view/4066

Good observation, but they also acknowledge: > there are considerations inherent in our use of the case insensitivity shortcut, which trusts the YouTube search engine to provide all matching results, and which oversamples IDs with letters, rather than numbers or symbols, in their first ten characters. We do not believe these factors meaningfully affect the quality of our data, and as noted above a more direct “brute force” method - even for the purpose of generating a purely random sample

In short I do believe that the sample is valuable, but it is not a true random sample in the spirit that the post is written, there is a heuristic to have "more hits"

case insensitivity in YouTube’s search engine.

That's very clever. Presumably the video ID in the URL is case-sensitive, but then YouTube went out of their way to index a video's ID for text search, which made this possible.

The data probably wouldn't look so clean if it were skewed. If Google were doing something interesting it probably wouldn't be skewed only by a little bit.

Admittedly, I did not read the paper linked. But my point is not about google doing something funny. Even if we assume that ids are truly random and uniformly distributed this does not mean that the sampling method doesn't have to be iid. This problem is similar to density estimation where Rejection sampling is super inefficient but converges to the correct solution, but MCMC type approaches might need to run multiple times to be sure to have found the solution.

There's probably a checksum in the URL so that typos can be detected without actually trying to access the video.

If you don't know how the checksum is created you can still try all values of it for one sample of the actual ID space.

The fact that reports on "misinformation" don't look at the denominator when considering the volume of impressions is a great example of selectively reporting statistics to support a preconceived notion.

It should be obvious, but "misinformation" is an arbitrary political designation and therefore a constantly moving target. "You should wear an N95 mask to avoid getting COVID" was misinformation in March 2020 but not 3 months later. "Vaccines may not prevent COVID transmission" was misinformation from 2020 until sometime in 2022. "COVID infection may confer significant immunity from future infection" was misinformation for about the same period. The "lab leak hypothesis" was a "racist conspiracy theory" from 2020 until 2023, when it was officially endorsed by YouTube's political sponsors. And so on.

You're correct, no matter how uncomfortable that makes some people. Shame on anyone downvoting you without even attempting a rebuttal.

The obvious rebuttal is that when people talk about “misinformation” colloquially, they’re talking about _deliberate_ misinformation.

The videos they find also show view counts. This allows them to estimate (very roughly!) the view counts across all videos, because it allows them to see the distribution.

The issue with that is if there is one or a handful of videos that have a significant portion of all youtube views. Most likely they will not be in your sample which could lead to a big underestimate.

Not to mention bot farms. The most-viewed source of “misinformation” in the linked 2020 report simply… doesn’t exist anymore. Have we just been hearing about crappy sites buying views these past few years?

The one comment on this post is hilariously unhinged. For posterity:

NOT big enough; that it can’t be wiped away from the surface earth; with a relatively cyber attack from the Alliance. In fact; Google/YouTube will just be one; of 63 communist controlled platforms; that will be completely erased from existence once the order is given.

I don’t know why but; many conspiracy theorists I’ve encountered; use semicolons; wrong

Semi colons are a little understood and rarely used punctuation symbol, the mark of a true intellectual; it's only natural then, that those of us that know how the world really works use them more than others - and then there are those that think they know how they world works, but are mistaken: they similarly misuse the semicolons.

Wow, you went full colon.

Nothing like a full colon to prepare you for writing an interblags comment.

Some of those people are commenting in this very thread ;)

I actually was more interested in that comment than the article (not that the article wasn't interesting, but that comment was just wow). Who is the Alliance?? What are the 63 communist controlled platforms?? Who is giving the order to wipe them out?

I must know these things now.

I recommend checking the dataset of "YouTube dislikes": https://clickhouse.com/docs/en/getting-started/example-datas...

(it is named this way because it was an archival effort to collect the information before the dislike feature was removed)

It can be used to find things like the most controversial videos; top videos with a description in a certain language, etc.

It's important to know stats like this dislike counts, because youtube is such a large and public platform that it borderline a public utility.

from the article:

It’s possible that YouTube will object to the existence of this resource or the methods we used to create it. Counterpoint: I believe that high level data like this should be published regularly for all large user-generated media platforms. These platforms are some of the most important parts of our digital public sphere, and we need far more information about what’s on them, who creates this content and who it reaches.

The gov't ought to make it regulation to force platforms to expose stats like these, so that it can be collected by the statistics bureaus.

I disagree.

Nobody is stopping users from selecting one of the many YouTube competitors out there (eg - Twitch, Facebook, Vimeo) to host their content. We could also argue that savvy marketers/influencers use multiple hosting platforms.

YouTube's data is critical for YouTube and Google, which is basically an elaborate marketing company.

Governments should only enforce oversight on matters such as user rights and privacy, anticompetitive practices, content subject matter, etc.

Vimeo is not a viable alternative for creators who are trying to make monetize their content.

With very limited exception, Vimeo imposes a 2TB/month bandwidth limit [0] on all accounts. If you exceed that limit and don’t agree to pay for your excess usage, Vimeo will shut you down.

[0] https://help.vimeo.com/hc/en-us/articles/12426275404305-Band...

youtube is such a large and public platform that it borderline a public utility.

So has been the big banks, large corporations, land but they all feed off each other and the government. What we want as a community is usually quite different to what they decide to do.

Disclaimer: the author of the comment is the CEO of ClickHouse

Youtube is in a fairly sad state from it's former self. Search has become, quite frankly, unusable. I recall searching for a <band> recently at the gym, and instead of just showing me <band>. Results showed 3-4 results then completely unrelated topics to <band>.

Results are entirely off-topic, but related to other interests I have; tech, planes, trains, and automobiles, etc.

Just FYI, in case you didn't notice, you can bypass the stupid thing that makes search useless by using a filter, like filtering by "Type: Video" or something. I'm sure that workaround will stop working some day randomly.

Oh that's interesting! Thanks for that I'll try it. Much in the same way Google has "Site:X" "filetype:X".

The "For You" section of my youtube search is awful, like terribly awful. It's decided for some reason I enjoy popping and weird skin infection videos. I think this is because I clicked through to an external link to a horse getting a hoof infection shaved once (did not enjoy and did not watch all of it). That's literally the only time I've watched anything like that and since then it's always at the bottom of my search. These videos are never recommended on my front page, though.

There are some weird corners of YouTube.

For anyone who's wondering, their estimation method works like so:

1. Assume a range of values

2. Assume a fair probability function for sampling over the range of values

The estimated size is the %-of-hits * the total range of values.

I skimmed through the article but that's a lot of assumptions there if so.

1. So let's say that possible range of values is true (10 characters of specific range + 1). That would represent one big circle of possible area where videos might be.

2. Distribution of identifiers (valid videos) is everything. If Youtube did some contraints (or skewing) to IDs, that we don't know about, then actual existing video IDs might be a small(er) circle within that bigger circle of possibilities and not equally dispersed throughout, or there mught be clumping or whatever... So you'd need to sample the space by throwing darts in a way to get a silhouette of their skew or to see if it's random-ish, by I don't know let's say Poisson distribution.

Only then one could estimate the size. So is this what they're doing?

Also.. anyone bothered to you know, ask Youtube?

Video IDs are generated by hashing a a secret identifier, so they should be uniformly distributed.

No the distribution doesn't matter at all. I've given an extreme example here: https://news.ycombinator.com/item?id=38742735

Getting a 403 forbidden link from source right now, did we trip the poor guy's ddos protection or is it just me?

Same here. A computer decides I'm not going to see this and that's that, the info on page is 403 forbidden and the word nginx. No recourse or way to let anyone know the thing is broken. If my IP address or language setting were Russian or something I'd at least be able to see a reason for it, but no, Germany and English. On an Android! I know, very dangerous.

Can anyone post what this site is? So I can tell whether this #1 story is worth it trying a different ISP or user agent string or whatnot

It's the blog of a researcher at the University of Massachusetts-Amherst. He does research on the impact of social media on society. It's a Wordpress site, hosted by someone commercial. I know him, and I'm almost certain no blocking would be intentional. If you (or others) give me details about how your access is failing, I'll pass the info on to him so he can try to get it fixed.

In the meantime, this archive page should work for you: https://web.archive.org/web/20231223131348/https://ethanzuck...

you (or others) give me details about how your access is failing

Well, I click the link and get the access denied page. Do you need an IP address (93.135.167.200), user agent string¹, timestamp, or what would help?

Randomly noticed my accept-language header is super convoluted, mixing English-in-Germany, English, German in Germany, Dutch in the Netherlands, Dutch, and finally USA English, with decreasing priority from undefined to 0.9 to 0.4 in decrements of 0.1. I guess it takes this from my OS settings? Though I haven't configured it explicitly like that, particularly the en-US I'd not use because of the inconsistent date format and unfamiliar units system. Maybe the server finds it weird that six different locales are supported?

Thanks for responding and relaying!

¹ Mozilla/5.0 (Linux; Android 10; Pixel Build/QP1A.190711.019; wv) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Mobile Safari/537.36

I wonder how the wingnut in the comments arrived (and whether he read the article at all - wingnuts usually don't):

NOT big enough; that it can’t be wiped away from the surface earth; with a relatively cyber attack from the Alliance. In fact; Google/YouTube will just be one; of 63 communist controlled platforms; that will be completely erased from existence once the order is given.

I have to wonder what this "Alliance" is supposed to be.

Sounds like a headline from https://rense.com/

63 is such an oddly specific number too.

My guess, "The Alliance" would be a QAnon trope, a confederation of select "White Hats" in the US military and parts of US government (and other select humans) and some alien "Arcturian" UFO contingent. They are trying to overthrow the Deep State (corrupt civil/intelligence/military service in USA and Western governments) that are colluding with "globalists" and exploitative capitalists to economically and socially oppress "good white people of European descent" and corrupt their morals and dilute their influence and moral strength. Presumably The Alliance chose Trump to become President in 2016 as part of an attempt to awaken the masses and usher in "The Great Reset / The Storm / The Awakening" where global corruption will be exposed, corrupt politicians, businesspeople, officials, cultural influencers will quickly be judged and executed.

There are many variations on this theme, and though not widely held as a belief, elements seep into the wider discourse and thought.

I'm not sure what the aliens' motivation would be in this scheme, and often they're not part of the story.

EDIT: here's a random page that likely will bring more questions than answers but it references the elements described above ... https://divinecosmos.com/davids-blog/22143-moment-of-truth-q... ... there are variations of this conspiracy theory tailored to many different tendencies: New Agers, Evangelical Christians, White Supremacists, Conservative Republicans, etc. For a further look, find interesting interviews with a certain Jan Halper Hayes and her supposed work with USA Space Force. Some of these promoters a true believers, others are grifters. Also ... grand corporatists and communists are usually conflated in these narratives.

So if youtube is worth 180 billion, with 14 billion videos. Each video is worth $13.

The average length of a youtube video is around 13 minutes. My napkin GPT math values video per minute at $35 per minute.

My napkin GPT math values each video closer to $455.

This leaves me with one unanswered question… Where’s the beef?

These calculations should be viewed as long term expected value. Long-term is probably 10-20 years in this case. So, based on your calculations, each video is about 2-4 bucks a year.

My napkin GPT math values each video closer to $455.

I've read that the number of views / 100 in dollars is a good estimate of how much the channel receives.

Kevin Zheng wrote a whole bunch of scripts to do the dialing, and over the course of several months, we collected more than 10,000 truly random

Several is 3 to 5? So 4 months.

10 000 videos over 4 months. So 2500 per month. If we assume 20 work days.. that's 125 per day. Over 8 hours.. 15,6 per hour.

Those scripts seem to be quite slow? Was this done by hand? A video every 3 minutes.

The context you omitted is:

     Turns out there are 2^64 possible YouTube addresses, an enormous number: 18.4 quintillion. There are lots of YouTube videos, but not that many.

    Let’s guess for a moment that there are 1 billion YouTube videos – if you picked URLs at random, you’d only get a valid address roughly once every 18.4 billion tries.

I mean, sure, they did reduce that a little:

    Jason found a couple of cheats that makes the method roughly 32,000 times as efficient, meaning our “phone call” connects lots more often.

With that in mind, how many attempts did they make to get a hit every three minutes?

It's surprising that they weren't throttled back for making excessive requests.

the cheat was using a botnet of 32k devices, lol

Surely someone at YouTube could answer this question definitively?

I wish I could, but it would break NDA :P

Do all employees at YouTube have access to this kind of info?

This is a fun dataset. The paper leaves a slight misimpression about channel statistics: IIUC, they do not correct for sampling propensity to reweight when looking at subscriber counts (it should be weighted ~1/# of videos per channel since the probability of a given channel appearing is proportional to the number of public videos that channel has, as long as the sample is a small fraction of the population).

I noticed that too. Seems very unlikely that 1,000,000 subscribers represents the 98th percentile and not the 99.999th.

I love https://redditmap.social/

What are other things like this?

Damn, I didn't know about this one. Maybe it can help with my work on https://fediverser.network

The only thing I am suspicious of here is their estimated median view count being so high. I would expect the median view count to be zero.

Come on now, this isn't Twitter

Very cool, kudos!

It would be even cooler if it had a deeper view of categories. Right now the biggest category by far is People & Blogs but it's possible to get much more information if it was broken down into sub-categories.

Here's a study I did a few years ago that broke down popular YT videos into more granular categories, might be of interest: https://www.pewresearch.org/internet/2019/07/25/childrens-co...

Too big.

Last night they issued a notification to my phone requiring me to update the YouTube app.

The problem - it is the last version that runs on my phone.

At least web still works, for now.

Just curious - what device and OS version are you on?

Does anyone has experience with the autocomplete feature that was mentioned in the article?

generate a random 5 character long string where one of them is a dash, and Youtube will autocomplete the string if a matching video exists

A random youtube video service would be interesting! Like Wikipedia random, but less educational XD

There are multiple sites that do already do that, this is one I like - https://perchance.org/youtube-video

"We're starting some experiments to understand how the videos YouTube recommends differ from the "average" YouTube video - YouTube likes recommending videos with at least ten thousand views, while the median YouTube video has 39 views."

In experiments by yours truly, it seems that quality goes down as views go up.

I think the plot is more dome shaped. Big channels are crap, but many smaller ones are too

Non-view-weighted and non-impression-weighted stats are interesting but basically useless for disinformation research (which is what the post starts about). To a first approximation every viewpoint and its opposite is out there somewhere on YouTube, along with zero-value videos like security cam backups. The real questions of societal interest involve reach and volume.

Pretty big

It would be interesting to see if some other method of estimating YouTube's size could be used to corroborate the accuracy of this estimate.

Skim the article, find graph with key piece of info that answers title, y axis label: Estimated size (billions). …

start with Never Going to Give You Up, objectively the center of the internet,

LOL

The outcome of this article is this linked accompanying website: https://tubestats.org/

https://redditmap.social/

This is the best thing I've ever seen.

Hall monitors are using math now. Oh no!

would be interesting to know how this was before/after Shorts and how it's changed (and how notShorts have performed since Shorts launched)

Heh, when he starts quoting reports on COVID “misinformation“, you know where this report is going…

A mountains? A sky?

Google used to ask scaling questions about youtube for some positions. They often ended up in some big-O line of questions about syncing log-data across a growing an distributed infrastructure. The result was some ridiculous big-O(f(n)) where the function was almost impossible to even describe verbally. Fun fun.

source interviewed by Google a few times

Gangnam Style has 4.9 Billion views now. I remember how big a deal it was when it broke 1 Billion!

very big