Loading a trillion rows of weather data into TimescaleDB

ZeroCool2u
31 replies
1d4h

I've done a good amount of geospatial analysis for work.

One thing you quickly realize with geospatial data is that it's incredibly nuanced. You have to be quite careful about understanding which coordinate reference system (CRS) and for visualization which projection is being used. The CRS is somewhat paranoia inducing if you don't have great infrastructure setup with the right tools to carry that metadata with your geospatial data.

I've tested everything AWS has to offer, Postgres/PostGIS, Spark/DB, Snowflake, Trino, and ArcGIS.

I'm convinced the best tool for large scale geospatial work is Google BigQuery and it's not even close. It took an expensive multi-hour query running on PostGIS, deployed on an enormous m6a EC2 instance, down to less than 5 seconds in the BigQuery free tier. It does make sense if you think about it, since Google was early with Maps, but it is still stunning how much better they are in this specific niche domain.

This was using publicly available FEMA data that Snowflake and AWS services would just choke on, because the geometry column exceeded their maximum byte size. Spark doesn't even have geospatial data types and the open source packages that add support leave a lot to be desired.

This guy is running on-prem, so maybe this made sense, but I just would never bother. The storage for BQ would probably be less than $100/month for 20 TB.

ingenieroariel
10 replies
1d3h

I went through a similar phase with a pipeline that started from global OSM and Whosonfirst data. Google costs kept going up ($7k a month with Airflow + BigQuery) and I was able to replace it with a one-time $7k hardware purchase. We were able to do it since the process used H3 indices early on and the resulting intermediate datasets all fit in RAM.

The system is a Mac Studio with 128 GB of RAM + Asahi Linux + mmapped Parquet files and DuckDB. It also runs Airflow for us, and with Nix it can be used to accelerate developer builds and run the Airflow tasks for the data team.

GCP is nice when it is free/cheap but they keep tabs on what you are doing and may surprise you at any point in time with ever higher bills without higher usage.

jfim
4 replies
1d2h

BigQuery is nice but it's definitely a major foot-gun in terms of cost. It's surprisingly easy to rack up high costs with say a misconfigured dashboard or a developer just testing stuff.

dekhn
2 replies
1d1h

Frankly I think this is just a sign that it's a power tool for power users.

carlhjerpe
1 replies
1d1h

Sadly my colleagues aren't always "power users"

brailsafe
0 replies
19h59m

Nobody starts as a power user

mrgaro
0 replies
12h19m

Definitely agree here. Once the data is in BigQuery, people will start doing ad-hoc queries and building Grafana dashboards on top of it.

And sooner or later (usually sooner) somebody will build a fancy Grafana dashboard and set it to refresh every 5 seconds, and you will not notice it until it's too late.

hawk_
1 replies
1d

and may surprise you at any point in time with ever higher bills without higher usage.

What? Really? Do they change your pricing plan? How can they charge more for the same usage?

ingenieroariel
0 replies
19h21m

When you queried their 'Open Data' datasets and linked them with your own, it was absurdly cheap for some time. Granted, we used our hacking skills to make sure the really big queries ran in the free tier and only smaller datasets went into the private tables.

I kept getting emails about small changes, and the bills got bigger all over the place, including for BigQuery and how it dealt with queries on public datasets. The bill got higher.

There is a non-zero chance I conflated things. But from my point of view: I created a system and left it running for years; afterwards the bills got higher out of the blue and I moved off.

nojvek
0 replies
17h6m

DuckDB is the real magic. On an NVMe disk with a decent amount of RAM, it goes brrrrrr.

I would love it if somehow Postgres got DuckDB-powered columnstore tables.

I know hydra.so is doing columnstores.

DuckDB being able to query parquet files directly is a big win IMO.

I wish we could bulk insert parquet files into stock PG.
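
For what it's worth, a minimal sketch of that pattern (the file glob and column names below are made up): DuckDB can scan Parquet in place with read_parquet, no load step required.

    -- Query Parquet files where they sit; no import step needed.
    SELECT
        date_trunc('day', obs_time) AS obs_day,
        avg(t2m)                    AS mean_temp
    FROM read_parquet('era5/2m_temperature/*.parquet')
    GROUP BY obs_day
    ORDER BY obs_day;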

fikama
0 replies
6h48m

You made me curious. Since you are using Linux, why a Mac and not a PC? Wouldn't a PC be cheaper? Or were there other factors?

ZeroCool2u
0 replies
1d2h

That is a very cool setup!

My org would never allow that as we're in a highly regulated and security conscious space.

Totally agree about the BQ costs. The free tier is great and I think pretty generous, but if you're not very careful about enforcing table creation only with partitioning and clustering as much as possible, and don't enforce some training for devs on how to deal with columnar DBs if they're not familiar, the bills can get pretty crazy quickly.

winrid
7 replies
1d2h

M6a is not even remotely enormous. Also were you using EBS?

ZeroCool2u
6 replies
1d2h

m6a.48xlarge: 192 vCPU & 768 GiB

If that's not a large node for you, well you're living in a different world from me. Yes to using EBS.

sgarland
4 replies
1d

Unless you’re CPU-bound (unlikely), the r-family is usually a better fit for RDBMS, IME. 8x RAM:vCPU ratio, compared to 4x for m-family.

Then there’s the x-family, which can go up to bonkers levels, like 4 TiB of RAM (and local NVMe).

As a sibling comment mentioned, though, if you can fit the data into local storage, that’s the way to do it. EBS latency, even with io2.blockexpress, simply cannot compete.

That said, if I did the math correctly based on the 71 GB/31 days footnote, you’re looking at about 2.15 PB to load the entire thing, so, uh, good luck.

ayewo
3 replies
22h51m

In your experience, is the r-family better than the c-family for running an RDBMS?

aPoCoMiLogin
1 replies
21h54m

Things have been changing recently in AWS, but a few years ago R-family instances had one of the fastest uplinks to EBS. For example, only the largest M4 instance (m4.16xlarge) has a 10 Gbps uplink, versus R5b where it starts at 10 Gbps (for the lowest tier) and ends at 60 Gbps @ 260k IOPS. You can very easily choke EBS with a DB.

EDIT: only newer C instances have a comparable uplink to EBS; C5 or C4 (and some C6) start from ~4.7 Gbps. Just compare the EBS bandwidth column in https://aws.amazon.com/ec2/instance-types/

sgarland
0 replies
20h29m

The other fun thing about AWS instances is that the network uplink speed isn’t always what’s advertised. There is of course the “Up to X Gbps” levels (30 minutes of rated speed guaranteed every 24 hours), but there are also other limits, like cross-region speed being capped at 50% of rated capacity.

This rarely matters until it does, like if you’re setting up an active-active DB across regions. Then it’s a fun surprise.

sgarland
0 replies
20h32m

Yes. Most RDBMS are memory-bound (indexes, connection overhead, buffer pool…), so the more, the better.

At my last job, I switched a large-ish (100K QPS at peak) m5 MySQL instance to a smaller (in AWS numbering) r6i that was cheaper, despite having the same amount of RAM and being a generation newer. That, combined with careful tuning and testing, resulted in queries speeding up 20-40%, AND we then had plenty of room for vertical scaling again if necessary.

winrid
0 replies
1d1h

Thanks for the info. The issue is using EBS. If you used an instance with NVMe drives it would probably have been faster than BQ (and you aren't billed per-query...). I would suggest an R or I4 class instance for this; m6 is not good for the money here.

You would just have to set up replication for backups, but this could just be rsync to EBS or some other replication solution depending on your database.

xyzzy_plugh
3 replies
1d3h

Do you mind linking the specific dataset? I agree very wide columns break a lot of tools but other columnar postgres forks should support this no problem. It sounds like you didn't use Redshift, which I find surprising as it directly competes with BQ. Redshift has "super" columns that can be very large, even larger than the maximum supported by BigQuery.

I constantly see folks finding out the hard way that PostGIS is really hard to beat. The fact that Trino/Presto and Spark have languished here is particularly telling.

ZeroCool2u
2 replies
1d2h

It's FEMA's NFHL. I can't recall the specific layer of the GDB file, but you could probably figure it out. Try loading up Iowa into redshift and if that works for you I'd be quite surprised.

My org has a very large AWS spend and we got to have a chat with some of their SWEs who work on the geospatial processing features for Redshift and Athena. We described what we needed and they said our only option was to aggregate the data first or drop the offending rows. Obviously we're not interested in compromising our work just to use a specific tool, so we opted for better tools.

The crux of the issue was that the large problem column was the geometry itself. Specifically, MultiPolygon. You need to use the geometry datatype for this[1]. However, our MultiPolygon column was tens to hundreds of MBs, well outside the max size for the Super datatype, which from what I can tell is 16 MB.

[1]: https://docs.aws.amazon.com/redshift/latest/dg/GeometryType-...

xyzzy_plugh
0 replies
17h41m

The description you provide appears too large for even BigQuery.

beeboobaa3
0 replies
23h16m

Just split it out into multiple polygons, one per row. If you're using a relational database, do as relational databases do.

detourdog
1 replies
1d2h

I’m glad to hear this first-hand experience. I’m pretty sure I want to build and refine my own geospatial data hoard.

I wonder if you think a long-term project is better off rolling its own. The second priority is that I expect it all to be locally hosted.

Thanks for any considerations.

ZeroCool2u
0 replies
1d2h

Frankly I think for long term hoarding BQ is hard to beat. The storage costs are pretty reasonable and you never pay for compute until you actually run a query, so if you're mostly just hoarding, well, you're probably going to save a lot of time, money, and effort in the long run.

PolarizedPoutin
1 replies
1d3h

Thank you for the insights! Yeah I'm still not sure how Postgres/PostGIS will scale for me, but good to know that BigQuery does this nicely.

This is not something I'm productionizing (at least not yet?) and I'm giving myself zero budget since it's a side project, but if/when I do I'll definitely look into BigQuery!

ZeroCool2u
0 replies
1d2h

Good luck! They have some really great tutorials on how to get started with BQ and geospatial data. One other nuance of BigQuery that doesn't seem to apply to many other tools in this space is that you can enable partitioning on your tables in addition to clustering on the geometry (Geography in BQ) column.

https://cloud.google.com/bigquery/docs/geospatial-data#parti...
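
For anyone curious what that looks like, a rough sketch of the DDL (dataset, table, and column names are hypothetical): partition on the timestamp and cluster on the GEOGRAPHY column so BigQuery can prune blocks for both time and spatial filters.

    CREATE TABLE weather.era5_hourly (
      obs_time   TIMESTAMP,
      location   GEOGRAPHY,
      t2m_kelvin FLOAT64
    )
    PARTITION BY DATE(obs_time)
    CLUSTER BY location;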

rtkwe
0 replies
1d2h

Even more fun I bet would be if you're getting data with different reference spheroids.

johnymontana
0 replies
1d2h

Spark doesn't even have geospatial data types and the open source packages that add support leave a lot to be desired.

Could you say more about this? I'm curious if you've compared Apache Sedona [0] and what specifically you found lacking? I currently work at Wherobots [1], founded by the creators of Apache Sedona and would love to hear any feedback.

[0] https://sedona.apache.org/latest/

[1] https://wherobots.com/

efxhoy
0 replies
7h42m

We just ported our ~500 GB data warehouse from an 8-core Postgres RDS instance to BigQuery. Rebuilds went from 5 hours to 11 minutes and costs are about the same or lower. And that’s with caching of some very big tables in Postgres and rebuilding from scratch in BQ.

I much prefer postgres as a tool I can run myself but the order of magnitude performance benefits are hard to argue with.

Cthulhu_
0 replies
1d4h

I hear so many good things about BigTable / BigQuery, it's a shame I've had no opportunity to use it yet.

rkwasny
12 replies
1d4h

Yeah, don't use TimescaleDB, use ClickHouse - I have 10 years of NOAA climate data on my desktop that I query when I want to go on holidays ;-)

mfreed
2 replies
1d3h

Our experience is that Clickhouse and Timescale are designed for different workloads, and that Timescale is optimized for many of the time-series workloads people use in production:

- https://www.timescale.com/blog/what-is-clickhouse-how-does-i...

Sidenote: Timescale _does_ provide columnar storage. I don't believe that the blog author focused on this as part of insert benchmarks:

- Timescale columnar storage: https://www.timescale.com/blog/building-columnar-compression...

- Timescale query vectorization: https://www.timescale.com/blog/teaching-postgres-new-tricks-...

rkwasny
1 replies
1d2h

Well, as a Co-founder and CTO of Timescale, would you say TimescaleDB is a good fit for storing weather data as OP does?

mfreed
0 replies
20h58m

TimescaleDB primarily serves operational use cases: Developers building products on top of live data, where you are regularly streaming in fresh data, and you often know what many queries look like a priori, because those are powering your live APIs, dashboards, and product experience.

That's different from a data warehouse or many traditional "OLAP" use cases, where you might dump a big dataset statically, and then people will occasionally do ad-hoc queries against it. This is the big weather dataset file sitting on your desktop that you occasionally query while on holidays.

So it's less about "can you store weather data", but what does that use case look like? How are the queries shaped? Are you saving a single dataset for ad-hoc queries across the entire dataset, or continuously streaming in new data, and aging out or de-prioritizing old data?

In most of the products we serve, customers are often interested in recent data in a very granular format ("shallow and wide"), or longer historical queries along a well defined axis ("deep and narrow").

For example, this is where the benefits of TimescaleDB's segmented columnar compression emerges. It optimizes for those queries which are very common in your application, e.g., an IoT application that groups by or selected by deviceID, crypto/fintech analysis based on the ticker symbol, product analytics based on tenantID, etc.

If you look at ClickBench, what most of the queries do is scan ALL the data in your database and GROUP BY one of the ~100 columns in the web analytics logs.

- https://github.com/ClickHouse/ClickBench/blob/main/clickhous...

There are almost no time-predicates in the benchmark that Clickhouse created, but perhaps that is not surprising given it was designed for ad-hoc weblog analytics at Yandex.

So yes, Timescale serves many products today that use weather data, but has made different choices than Clickhouse (or things like DuckDB, pg_analytics, etc) to serve those more operational use cases.

dangoodmanUT
2 replies
1d3h

Agreed, clickhouse is faster and has better features for this

lyapunova
0 replies
23h54m

Thanks for maintaining benchmarks here. Is there a github repo that might accompany the benchmarks that I could take a look at / reproduce?

PolarizedPoutin
2 replies
1d3h

Haha very cool use! Yeah reading up on TimescaleDB vs. Clickhouse it seems like columnar storage and Clickhouse will be faster and better compress the time series data. For now I'm sticking to TimescaleDB to learn Postgres and PostGIS, but might make a TimescaleDB vs. Clickhouse comparison when I switch!

rkwasny
0 replies
1d3h

I can replicate your benchmark when I get a moment. The data: is it free to share if I wanted to make an open browser?

I have a feeling the general public has very limited access to weather data, and the graphs in the news that state "are we getting hotter or colder" are all sensational and hard to verify.

cevian
0 replies
1d

Please note that TimescaleDB also uses columnar storage for its compressed data.

Disclosure: I am a TimescaleDB engineer.

anentropic
1 replies
1d4h

Do you say that because its quicker to insert large batches of rows into Clickhouse, or because it's better in other ways?

(I'm currently inserting large batches of rows into MySQL and curious about Clickhouse...)

rkwasny
0 replies
1d3h

It's better in insert speed, query speed and used disk storage.

gonzo41
0 replies
1d3h

So ClickHouse is a columnar DB. Any thoughts on whether the performance would be a wash if you just pivoted the Timescale hypertable and indexed the time + column on Timescale?

rjmunro
12 replies
1d3h

This annoys me:

You can use parameterized queries to protect against SQL injection,

No, that's not what parameterized queries are for. That's just a side benefit.

Parameterized queries are so that the database doesn't have to parse the data from SQL to do the query. It's stupid to turn all the data to text, send the text over the wire, escaped and quoted as appropriate, then parse it back to whatever form it was in originally when you can just send it binary straight to the database.

And if you are doing many similar queries with different values, e.g. inserts as here, or maybe querying the user table by user id on every page load, the database doesn't have to parse any SQL each time and can even reuse the query plan.

This may be why psycopg3 performed better than pandas df.to_sql() function in the single row insert case.
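
A minimal psycopg3 sketch of the point (table and column names are hypothetical): the values travel as parameters rather than being rendered into the SQL text, so the server can parse and plan the statement once and reuse it for every row.

    import psycopg
    from datetime import datetime, timezone

    rows = [
        (datetime(2020, 1, 1, 0, tzinfo=timezone.utc), 55.25, 12.50, 274.1),
        (datetime(2020, 1, 1, 1, tzinfo=timezone.utc), 55.25, 12.50, 273.8),
    ]

    with psycopg.connect("dbname=weather") as conn:
        with conn.cursor() as cur:
            # Values are sent out-of-band, not interpolated into the SQL string,
            # so there is no per-row quoting/escaping/re-parsing.
            cur.executemany(
                "INSERT INTO era5 (time, lat, lon, t2m) VALUES (%s, %s, %s, %s)",
                rows,
            )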

dboreham
5 replies
1d3h

Quick note to observe that all of the above, while true, becomes less of a practical issue as CPUs become faster relative to I/O, which they have done and probably will keep doing.

forrestthewoods
2 replies
1d2h

Wait what no stop that what are you talking about.

Transforming data and then transforming it back will always be stupid and costly – in more ways than one.

relaxing
1 replies
20h8m

Not if you save transit time that can be more quickly made up in processing time.

rjmunro
0 replies
9h54m

True, but in this case we are converting to strings that are longer because they require quoting and escaping etc. Transit time will be longer.

simiones
0 replies
1d1h

Isn't the opposite mostly happening, with CPUs' single-core performance mostly flat for the last decade, while I/O speeds have been improving every year (especially in networking, not as sure about disk I/O)?

pphysch
0 replies
1d2h

Feels like you're a couple decades behind. Single core speeds haven't gone up much in the last decade, especially for "server" CPU models.

eddd-ddde
2 replies
1d3h

Can it actually reuse the query plan? Couldn't that lead to a worse performing plan eventually? Say after inserting lots of data such that a different plan becomes a better option.

rjmunro
0 replies
1d1h

I think it's going to be very rare that a database changes enough for a plan to be significantly worse than the optimal plan during the life of a prepared query.

A database engine could easily mitigate against it by re-planning queries every 1000 runs of the query or have the query plan expire after 60 seconds or something. It might be worth trying to re-prepare the query from time to time as part of this kind of bulk insertion benchmarking.

The other thing that would be good to try is closing the transaction every 1000 inserts or some other number. Doing several inserts in a single transaction is certainly better than 1 insert per transaction, but there may come a point where the transaction being too big starts to slow things down.

munk-a
0 replies
1d1h

Usually the TTLs you'll set on these plans[1] are pretty short. It's helpful if you need to do a burst of a thousand queries of the same form[2] over a short interval.

1. At least effectively - this is usually manged via persisting handles or plans themselves.

2. Which is usually a bad code smell from ORM overuse. Every time you dispatch a query you're paying latency overhead - so in an ideal world your report delivery will involve a single query against the database per report request (whether this is a website and those requests are page loads or you're running some internal system for business reporting).

PolarizedPoutin
1 replies
1d3h

Thank you for reading through it thoroughly and pointing this out! I'm still new and learning Postgres so this is good to know. I will update the post.

munk-a
0 replies
1d2h

I would note that I think the above is a rather minority opinion - while parameterized queries are great for reuse simply using them to guard against SQL injection is still an excellent use of them. If your query is reusable then go for it, but most complex queries tend not to be reusable if they involve optional inputs.

davedx
0 replies
1d2h

I disagree that the security is a side benefit. It’s why most people choose to use parameterized queries.

counters
7 replies
1d

Why?

Most weather and climate datasets - including ERA5 - are highly structured on regular latitude-longitude grids. Even if you were solely doing timeseries analyses for specific locations plucked from this grid, the strength of this sort of dataset is its intrinsic spatiotemporal structure and context, and it makes very little sense to completely destroy that structure unless you were solely and exclusively extracting point timeseries. And even then, you'd probably want to decimate the data pretty dramatically, since there is very little use for, say, a point timeseries of surface temperature in the middle of the ocean!

The vast majority of research and operational applications of datasets like ERA5 are probably better suited by leveraging cloud-optimized replicas of the original dataset, such as ARCO-ERA5 published on the Google Public Datasets program [1]. These versions of the dataset preserve the original structure, and chunk it in ways that are amenable to massively parallel access via cloud storage. In almost any case I've encountered in my career, a generically chunked Zarr-based archive of a dataset like this will be more than performant enough for the majority of use cases that one might care about.

[1]: https://cloud.google.com/storage/docs/public-datasets/era5

rabernat
2 replies
1d

True, but in fact, the Google ERA5 public data suffers from the exact chunking problem described in the post: it's optimized for spatial queries, not timeseries queries. I just ran a benchmark, and it took me 20 minutes to pull a timeseries of a single variable at a single point!

This highlights the need for timeseries-optimized chunking if that is your anticipated usage pattern.

bitschubser_
1 replies
8h27m

A good source for ERA5 historical data is https://open-meteo.com/en/docs/historical-weather-api (not affiliated, just a happy user). You can also run Open-Meteo locally; it's quite fast for spatial and timeseries queries.

open-meteo
0 replies
8h13m

Creator of Open-Meteo here. There is a small tutorial for setting up ERA5 locally: https://github.com/open-meteo/open-data/tree/main/tutorial_d...

Under the hood Open-Meteo is using a custom file format with time-series chunking and specialised compression for low-frequency weather data. General purpose time-series databases do not even get close to this setup.

PolarizedPoutin
2 replies
20h35m

The main reason why was that it's a personal project and I wanted to do everything on my home server so that I wouldn't have to pay for cloud resources, and so that I could learn Postgres, TimescaleDB, and eventually PostGIS.

But as rabernat pointed out in his comment, pulling a long time series out of the cloud replica is also slow. And I know I eventually want to perform complex spatio-temporal queries, e.g. computing the 99th percentile of summer temperatures in Chile from 1940-1980.

I don't doubt that a cloud replica can be faster, but it's at odds with my budget of $0 haha.

roter
1 replies
20h9m

I too need to do percentiles. One option is to loop through the grids but bin/histogram the values. You'll get a really good 99th percentile from a 1 Kelvin bin width.

Also, I've found the diurnal profile from ERA5 analysis can be abysmal in some locations. ERA5-Land is much better, high resolution, though only available over... er... land.

To your point about not relying on the cloud, note what the Google option [1] linked above says:

Update Frequency: The ERA5 dataset is currently not refreshed in the Google Cloud Public Dataset Program. The program provides ERA5 data spanning from 1940 to May 2023.

Another alternative, Amazon [2], also deprecated:

The provider of this dataset will no longer maintain this dataset. We are open to talking with anyone else who might be willing to provide this dataset to the community.

[2] https://registry.opendata.aws/ecmwf-era5/

PolarizedPoutin
0 replies
3h48m

Thanks for pointing out the status of ERA5 on GCP and AWS.

I'll have to try the binning and compare it with listing all the values and finding the 99th percentile! But yeah the spatial part of that query might be the tougher part.

And yeah I also considered whether to use ERA5 or ERA5-Land. I don't remember if ERA5-Land was missing a variable or I just wanted to look at some ocean grid cells, but I decided to start with vanilla ERA5.
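
For reference, a rough sketch of the binned approach in plain SQL (the table, column names, and bounding box are all made up): width_bucket drops each reading into a 1 K bin and the 99th percentile is read off the cumulative distribution, accurate to about 1 K.

    WITH binned AS (
        SELECT width_bucket(t2m, 180, 340, 160) AS bin, count(*) AS n
        FROM era5
        WHERE lat BETWEEN -56 AND -17          -- rough Chile bounding box
          AND lon BETWEEN -76 AND -66
          AND extract(month FROM time) IN (12, 1, 2)
          AND time >= '1940-01-01' AND time < '1981-01-01'
        GROUP BY bin
    )
    SELECT 180 + bin AS approx_p99_kelvin
    FROM (
        SELECT bin,
               sum(n) OVER (ORDER BY bin)::float8 / sum(n) OVER () AS cdf
        FROM binned
    ) c
    WHERE cdf >= 0.99
    ORDER BY bin
    LIMIT 1;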

orhmeh09
0 replies
1d

That might be nice if someone would do it and teach others to use it. Some labs have an RDBMS-based pipeline with published algorithms and data that nobody wants to try to reimplement (and which nobody would be paid to do). About the best improvement we could get was moving from an ancient version of MySQL to Postgres + PostGIS. I think Timescale would have helped. There were also other reasons to run locally: privacy, cluster access, funds, etc.

tonymet
6 replies
1d1h

I encourage people to look into the ERA5 dataset's provenance, especially as you approach the observations made toward the "pre-industrial date" of 1850.

Remember that modern global surface temperatures are collected by satellites, and the dataset is commingled with recordings observed visually and made by hand using buckets by sailors who were not primarily academic researchers. Segments of high resolution, low noise data (satellites) are mixed with low resolution, low coverage, high noise records (hand records made on boats around the United Kingdom).

My point is to be in awe of the technical aspects of this effort but also keep in mind that we are only making copies of low resolution, noisy manuscripts from sailors 170 years ago.

relaxing
3 replies
22h14m

OK? And what’s your point in pointing that out?

tonymet
1 replies
18h59m

The data is noisy so be careful when using it for research. Always account for the provenance of the records when working with "data".

relaxing
0 replies
18h8m

So like basically every data science effort.

tonymet
0 replies
18h53m

One of the project's goals was to load the data and make predictions. The page covered the data loading part, but not the methods and error tolerance in the predictions.

shoyer
1 replies
18h27m

ERA5 covers 1940 to present. That's well before the satellite era (and the earlier data absolutely has more quality issues) but there's nothing from 170 years ago.

tonymet
0 replies
18h9m

Similar noise issues apply. Most of the other surface temp models have to cover 1850.

semiquaver
6 replies
1d4h

Any idea why hypertable insert rates were slower? I thought hypertables were supposed to _increase_ insert rates?

perrygeo
1 replies
1d4h

Hypertable insert rates are faster and more predictable over time. Each individual insert might incur a small bit of extra overhead, but they scale forever since each temporal chunk is indexed separately vs a regular table where the entire index needs to fit in memory. This is a case where you can't make meaningful inferences from micro-benchmarks (they tested 20k rows, you probably need 200M to start seeing the diff)

PolarizedPoutin
0 replies
1d4h

Thanks for the insight! It is true that I started with a micro-benchmark of 20k rows for slower inserts, but I also did some longer benchmarks with ~772 million rows.

leros
1 replies
1d4h

Isn't data inserted into basically a normal Postgres table with hypertable extensions? I don't know the details of Timescale but that sounds like it would incur a cost of a normal Postgres insert, plus potentially extra work at insert time, plus extra work in the background to manage the hypertable.

rahkiin
0 replies
1d1h

Not entirely. A hypertable is a postgres table chunked over time. There is the assumption that most data and queries are time-relevant, but also that older data is less relevant than new data.

Indexes are per chunk. So if the query analyzer understands you only touch 2023 it can omit looking at any chunk that is from other years and keep those out of memory. Same with the indexes.

PolarizedPoutin
1 replies
1d4h

Yeah I'm curious about this too. Been meaning to ask on the Timescale forums. My only guess is that there's some small extra overhead due to hypertable chunking.

I know Timescale has a blog post from 2017 claiming a 20x higher insert rate but that's for inserting into a table with an index. The general wisdom for loading huge amounts of data seems to be that you should insert into a table with no indexes then build them later though. So with no index, inserting into a hypertable seems a bit slower.

Timescale blog post: https://medium.com/timescale/timescaledb-vs-6a696248104e
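
A sketch of that load-first-index-later pattern in plain Postgres (table, columns, and file path are hypothetical): one index build over the finished table replaces per-row index maintenance during the load.

    CREATE TABLE era5_staging (
        time timestamptz NOT NULL,
        lat  float4      NOT NULL,
        lon  float4      NOT NULL,
        t2m  float4
    );

    COPY era5_staging FROM '/data/era5/2020_01.csv' WITH (FORMAT csv, HEADER true);

    CREATE INDEX era5_staging_time_idx ON era5_staging (time DESC);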

h4kor
0 replies
1d4h

Timescale hypertables automatically have an index on the timestamp. To make this more comparable you could create the same index on the normal table and test the ingestion rate.

jhoechtl
5 replies
1d3h

For the German speaking among us: That should be 1 000 000 000 000 rows

Million - million

Milliarde - billion

Billion - trillion

aftbit
3 replies
1d1h

Yes, this article is actually loading (order of magnitude) 10^12 rows.

I was very surprised to learn about this the first time. In USA English:

10^3 -> thousand

10^6 -> million

10^9 -> billion

10^12 -> trillion

10^15 -> quadrillion

10^18 -> quintillion

But there's another scale too! See this Wiki article for more

https://en.wikipedia.org/wiki/Names_of_large_numbers

samatman
2 replies
23h36m

Unless you're reading historical documents, these words have only one meaning in the English language. It is in no sense American English specific.

One advantage of the short scale is that its cadence is the same as that of SI units: million is mega, billion is giga, trillion is tera, and so on. The long scale of course has the -ard names, so this isn't exactly a problem, any more than French saying "four twenties and nineteen" for ninety-nine is a problem.

The difference is one of the major reasons the SI scale names exist in the first place, in fact.

It also matches the digit grouping used by everyone but Indians, which strikes me as more logical.

aftbit
1 replies
23h33m

Sorry, I wasn't intending to imply that they were American English specific, just that I only have experience with English as spoken in the USA.

What's the advantage of the long scale? Just historical?

samatman
0 replies
18h45m

I don't think the long scale actually has advantages compared to the short. One could argue euphony: the alternating repetition of million, milliard, billion, billiard, and so on, is pleasing in a way.

But really, each is just a quirk of languages. I don't expect 7 to sound like "seven" in other languages, so why expect billion to sound like "billion" rather than "milliard" or "trillion"? When conveying information across languages, we use numerals, sometimes with the SI scales, which are universal.

Just another confusing faux amis to figure out when learning another language, really.

jhoechtl
0 replies
6h37m

The downvoters stand out as those who don't know. Thank you, amis!

rabernat
4 replies
1d

Great post! Hi Ali!

I think what's missing here is an analysis of what is gained by moving the weather data into a RDBMS. The motivation is to speed up queries. But what's the baseline?

As someone very familiar with this tech landscape (maintainer of Xarray and Zarr, founder of https://earthmover.io/), I know that serverless solutions + object storage can deliver very low latency performance (sub second) for timeseries queries on weather data--much faster than the 30 minutes cited here--_if_ the data are chunked appropriately in Zarr. Given the difficulty of data ingestion described in this post, it's worth seriously evaluating those solutions before going down the RDBMS path.

ohmahjong
2 replies
22h54m

This is a bit off-topic but I'm interested in the same space you are in.

There seems to be an inherent tension between large chunks (great for visualising large extents and larger queries) and smaller chunks for point-based or timeseries queries. It's possible but not very cost-effective to maintain separately-chunked versions of these large geospatial datasets. I have heard of "kerchunk" being used to try and get the best of both, but then I _think_ you lose out on the option of compressing the data and it introduces quite a lot of complexity.

What are your thoughts on how to strike that balance between use cases?

rabernat
1 replies
22h22m

It's possible but not very cost-effective to maintain separately-chunked versions of these large geospatial datasets.

Like all things in tech, it's about tradeoffs. S3 storage costs about $275 per TB per year. Typical weather datasets are ~10 TB. If you're running a business that uses weather data in operations to make money, you could easily afford to make 2-3 copies that are optimized for different query patterns. We see many teams doing this today in production. That's still much cheaper (and more flexible) than putting the same volume of data in an RDBMS, given the relative cost of S3 vs. persistent disks.

The real hidden costs of all of these solutions is the developer time operating the data pipelines for the transformation.

ohmahjong
0 replies
8h37m

That's a great point, it really is all about tradeoffs. In my use case there is strong motivation to keep data creation times low, so writing out multiple datasets comes at a product/opportunity cost moreso than a storage cost. Thanks for the insight.

PolarizedPoutin
0 replies
20h25m

Hey Ryan and thank you for the feedback!

I agree that storing the data in appropriately chunked Zarr files is almost surely going to be faster, simpler to set up, and take up less space. Could even put an API in front of it to get "queries".

I also agree that I haven't motivated the RDBMS approach much. This is mainly because I took this approach with Postgres + Timescale since I wanted to learn to work with them, and playing around with ERA5 data seemed like the most fun way. Maybe it's the allure of weather data being big enough to pose a challenge here.

I don't have anything to back this up but I wonder if the RDBMS approach, with properly tuned and indexed TimescaleDB + PostGIS (non-trivial to set up), can speed up complex spatio-temporal queries, e.g. computing the 99th percentile of summer temperatures in Chile from 1940-1980, in case many different Zarr chunks have to be read to find this data. I like the idea of setting up different tables to cache these kinds of statistics, but it's not that hard to do with Zarr either.

I'm benchmarking queries and indexes next so I might know more then!

hyperman1
4 replies
1d4h

Two remarks with postgres and lots of data:

1) I always wonder if there is a better way than COPY. I tend to quickly hit 100% CPU without saturating I/O.

2) The row overhead seems big. A row has 23 bytes overhead, this table has 48 bytes data per row, so even without page overhead, we lose ~1/3 of our storage. This is pure data storage, without any index.

PolarizedPoutin
3 replies
1d3h

1. Yeah to me it seems very hard to saturate I/O with Postgres unless maybe you insert into an unlogged table. I guess there's quite a bit of overhead to get all the nice stability/consistency and crash-resistance.

2. That is a good point. I'm hoping TimescaleDB's compression helps here but yeah I don't think you'll ever get the database size below the data's original footprint.

hyperman1
1 replies
1d3h

If I look into perf, it seems to be mostly parsing overhead. I can saturate I/O with a CREATE TABLE newtable AS SELECT * FROM oldtable. Unfortunately, CSV still seems to be the lingua franca for transport between DBs. Maybe some day a more binary-oriented transport format will appear (e.g. Parquet?).

feike
0 replies
1d3h

Many libraries for Python, Rust, and Go support COPY BINARY.

The times I've tested it, the improvement was very small compared to plain COPY, or COPY with CSV, whereas it does require more work and thought upfront to ensure the binary format actually works correctly.

https://www.postgresql.org/docs/current/sql-copy.html
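
For example, a binary COPY with psycopg3 is only a couple of lines more than a text one (a sketch with hypothetical table and column names, not a claim about the speedup):

    import psycopg
    from datetime import datetime, timezone

    rows = [
        (datetime(2020, 1, 1, 0, tzinfo=timezone.utc), 55.25, 12.50, 274.1),
    ]

    with psycopg.connect("dbname=weather") as conn, conn.cursor() as cur:
        # FORMAT BINARY skips text rendering/parsing on both ends; set_types
        # tells psycopg which binary encoding to use for each column.
        with cur.copy(
            "COPY era5 (time, lat, lon, t2m) FROM STDIN (FORMAT BINARY)"
        ) as copy:
            copy.set_types(["timestamptz", "float4", "float4", "float4"])
            for row in rows:
                copy.write_row(row)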

feike
0 replies
1d3h

Timescaler here: if you configure timescaledb.compress_segmentby well, and the data suits the compression, you can achieve 20x or more compression.

(On some metrics data internally, I have a 98% reduction in the size of the data.)

One of the reasons this works is that you only pay the per-tuple overhead once per grouped row, which can cover as many as 1,000 rows.

The other is the compression algorithm, which can be TimescaleDB's own or plain PostgreSQL TOAST.

https://www.timescale.com/blog/time-series-compression-algor... https://www.postgresql.org/docs/current/storage-toast.html

robertlagrant
3 replies
7h24m

I think it would be cool to have historical weather data from around the world to analyze for signals of climate change we’ve already had rather than think about potential future change.

This is querying a model rather than real observations, so it may reflect the model's understanding of the world more than reality. Still useful; just worth noting I think.

defrost
1 replies
7h9m

Fair point, also worth noting is:

    The data is output from a climate model run that is constrained to match weather observations.

    So where we have lots of weather observations, ERA5 should match it closely.

    And where we do not have any weather observations, ERA5 will be physically consistent and should match the climatology, i.e. the simulated weather’s statistics should match reality.

    ERA5 covers the entire globe at 0.25 degree resolution, and stretches back in time to 1940 with hourly resolution.

For the time frame (post 1940) there are likely to be actual weather data points twice a day at almost any point on the globe with cities and/or ports that colonial expansion reached.

What I'd like to see in such tools are two things:

* Indications of actual observed data points (location, time) .. so that the observed point mesh for a day in 1949 around the globe can be seen, and

* Interpolation for a modern day (with dense observed data points) from a reduced set of sparse points (just the ones, say, from that day in 1949) to get a sense of how the infill from sparse points matches the observed but unused points.

I'm not a climate skeptic in the least; global estimation of a coarse global mean from relatively sparse points is fine in my experience (I've done a lot of exploration geophysics for mineral | energy resource mapping) and a bit of interpolated wiggle in the model doesn't blow up the big coarse estimation .. but this question of interpolation from sparse data is one that always gets raised by skeptics.

PolarizedPoutin
0 replies
2h15m

The data is definitely sparse in the earlier decades, especially over unpopulated areas and in developing nations, due to the lack of regularly recorded weather observations (sensors, balloons, etc.). Especially since weather satellites started collecting global data, coverage has been better though. And weather observing stations have proliferated since too.

Totally agree that it would be cool to see when and where the weather observations are that are informing the climate model. A quick search didn't turn up this dataset, but it may be somewhere. I'll see if I can find this data.

PolarizedPoutin
0 replies
2h20m

Yes you are correct of course. ERA5 is climate model output constrained to match weather observations, not actual observations, which I do note in the post.

carderne
3 replies
1d3h

Hey OP (assuming you're the author), you might be interested in this similar experiment I did about four years ago, same dataset, same target, similar goal!

https://rdrn.me/optimising-sql/

Similar sequence of investigations, but using regular Postgres rather than Timescale. With my setup I got another ~3x speedup over COPY by copying binary data directly (assuming your data is already in memory).

PolarizedPoutin
2 replies
1d3h

Wish I saw this before I started haha! I left a footnote about why I didn't try binary copy (basically someone else found its performance disappointing) but it sounds like I should give it a try.

footnote: https://aliramadhan.me/2024/03/31/trillion-rows.html#fn:copy...

carderne
1 replies
1d2h

Yeah I imagine it depends where the data is coming from and what exactly it looks like (num fields, dtypes...?). What I did was source data -> Numpy Structured Array [0] -> Postgres binary [1]. Bit of a pain getting it into the required shape, but if you follow the links the code should get you going (sorry no type hints!).

[0] https://rdrn.me/optimising-sampling/#round-10-off-the-deep-e... [1] In the original blog I linked.

anentropic
0 replies
1d2h

I'd love to hear from anyone who's done the same in MySQL

tmiku
2 replies
1d1h

I think it would be cool to have historical weather data from around the world to analyze for signals of climate change we’ve already had rather than think about potential future change.

This is a very good instinct! A pretty major portion of modern climate science is paleoclimatology, with a goal of reaching far beyond reliable human measurements. A lot of earth's previous climate states were wildly different from the range of conditions we have experienced in the past 10,000 years, and a better climate record is essential to predicting the effects of massive carbon emission.

Ice cores from Antarctica/Greenland are the most famous instance of this, but there's a lot of other cool ones - there are chemical records of climate change in cave stalactites, ocean floor sediments, etc.

PolarizedPoutin
0 replies
2h1m

Thank you for mentioning paleoclimatology! Not sure if what I'm doing with ERA5 data is that rigorous haha, but one of my favorite plots is https://commons.wikimedia.org/wiki/File:All_palaeotemps.svg (still have it hung up in my office haha)

postgresperf
2 replies
22h28m

Contributor to the PG bulk loading docs you referenced here. Good survey of the techniques here. I've done a good bit of this trying to speed up loading the Open Street Map database. Presentation at https://www.youtube.com/watch?v=BCMnu7xay2Y for my last public update. Since then the advance of hardware, GIS improvements in PG15, and osm2pgsql adopting their middle-way-node-index-id-shift technique (makes the largest but rarely used index 1/32 the size) have gotten my times to load the planet set below 4 hours.

One suggestion aimed at the author here: some of your experiments are taking out WAL writing in a sort of indirect way, using pg_bulkload and COPY. There's one thing you could try that wasn't documented yet when my buddy Craig Ringer wrote the SO post you linked to: you can just turn off the WAL in the configuration. Yes, you will lose the tables in progress if there's a crash, and when things run for weeks those happen. With time scale data, it's not hard to structure the loading so you'll only lose the last chunk of work when that happens. WAL data isn't really necessary for bulk loading. Crash, clean up the right edge of the loaded data, start back up.

Here's the full set of postgresql.conf settings I run to disable the WAL and other overhead:

    wal_level = minimal
    max_wal_senders = 0
    synchronous_commit = off
    fsync = off
    full_page_writes = off
    autovacuum = off
    checkpoint_timeout = 60min

Finally, when loading in big chunks, to keep the vacuum work down I'd normally turn off autovacuum as above and then issue periodic VACUUM FREEZE commands running behind the currently loading date partition. (Talking normal PG here.) That skips some of the intermediate work the database normally frets about, where new transactions are written but not yet visible to everyone.

kabes
1 replies
11h46m

Do you have more info on the GIS improvements in PG15?

postgresperf
0 replies
9h58m

There's a whole talk about it we had in our conference: https://www.youtube.com/watch?v=TG28lRoailE

Short version is GIS indexes are notably smaller and build faster in PG15 than earlier versions. It's a major version to version PG improvement for these workloads.

RyanHamilton
2 replies
1d3h

If you want to plot time-series charts or many other charts directly from sql queries, qStudio is a free SQL IDE and works with everything including TimescaleDB: https://www.timestored.com/qstudio/database/timescale Disclaimer, I am the author.

ayewo
0 replies
22h24m

What's the process for adding support for other databases to your tool qStudio?

I'm thinking perhaps you could add support for Timeplus [1]? Timeplus is a streaming-first database built on ClickHouse. The core DB engine Timeplus Proton is open source [2].

It seems that qStudio is open source [3] and written in Java and will need a JDBC driver to add support for a new RDBMS? If yes, Timeplus Proton has an open source JDBC driver [4] based on ClickHouse's driver but with modifications added for streaming use cases.

1: https://www.timeplus.com/

2: https://github.com/timeplus-io/proton

3: https://github.com/timeseries/qstudio

4: https://github.com/timeplus-io/proton-java-driver

PolarizedPoutin
0 replies
3h34m

Thank you for the link! I've mostly used TablePlus (not free I think) and matplotlib (via psycopg3) to plot data returned by queries, but this looks like it'll be faster to use. I've only inserted data so far haha but will be querying and plotting soon.

user3939382
1 replies
8h28m

727,080 snapshots in time for each variable like temperature

Maybe we should be recording and transmitting the parameters for a function or set of functions that describe these temperature (etc.) changes, rather than a set of points, to drastically reduce the size of these datasets.

PolarizedPoutin
0 replies
2h51m

Are you suggesting compressing the data using something like Fourier series or wavelets? I know some of the variables have sharp gradients, especially variables like precipitation, which probably wouldn't compress super well. And some applications of the ERA5 dataset, e.g. closing heat or moisture budgets by accounting for every bit of it, cannot be done if the data has been lossily compressed since you lose some data. Curious if this is what you had in mind or something else.

to11mtm
1 replies
17h43m

As someone who used to do some GIS hacking in an office job[0] before I was a 'Software Developer/engineer' this is super cool.

[0] - Honestly some of the coolest stuff I ever got to do in it's own right. Building tools that could move data between AutoCAD, Microstation, and Google Earth while also importing other bits with metadata from Trimble units[1]. Also it was probably the most I ever used math in my entire career [2], so there's that.

[1] - I wound up finding a custom font maker, and one of our folks made a font library with the symbols, made it easy to write a parser too :D

[2] - One of those PDFs I always seem to wind up having floating on a hard drive is the USGS 'Map Projections: A Working Manual'. At one point I used it as a reference to implement a C# library to handle transforms between coordinate systems... alas it was internal.

PolarizedPoutin
0 replies
2h53m

Thank you! I've relied on that USGS Projections manual multiple times haha. Working with satellite data, some of it was in somewhat obscure projections and the manual always told you how to convert back to latitude-longitude.

smellybigbelly
1 replies
1d

Can anyone give some advice on how they run TimescaleDB in Kubernetes? I’m seeing they dropped support for their Helm chart.

sigmonsays
1 replies
1d3h

where can I download the weather data?

Is it free or available if I sign up?

sammy2255
1 replies
1d2h

ClickHouse will eat this for breakfast. And it has built-in compression, even at the column level.

PolarizedPoutin
0 replies
1h57m

Hoping to find out when I compare TimescaleDB vs. Clickhouse!

roter
1 replies
1d1h

I too use the ERA5 reanalysis data and I too need quick time series. As the data comes in [lat, lon] grids, stacked by whatever period you've chosen, e.g. [month of hourly data, lat, lon], it becomes a massive matrix transpose problem if you want 20+ years.

What I do is download each netCDF file, transpose, and insert into a massive 3D HDF file organized as [lat, lon, hour]. On my workstation it takes about 30 minutes to create one year for one variable (no parallel I/O or processes) but then takes milliseconds to pull a single (lat, lon) location. Initial pain for long-term gain. Simplistic, but I'm just a climatologist not a database guru.
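
A condensed sketch of that workflow (file names, variable name, chunk sizes, and indices are all made up): transpose each netCDF file with xarray and append it into a [lat, lon, hour] HDF5 cube chunked per grid cell, so a point time series is a single contiguous read.

    import numpy as np
    import xarray as xr
    import h5py

    LAT, LON = 721, 1440                      # 0.25 degree global grid

    def append_month(nc_path, h5_path="era5_t2m.h5"):
        # One month of hourly 2 m temperature, dims (time, latitude, longitude).
        ds = xr.open_dataset(nc_path)
        cube = ds["t2m"].transpose("latitude", "longitude", "time").values.astype(np.float32)

        with h5py.File(h5_path, "a") as f:
            if "t2m" not in f:
                # One chunk per grid cell along time: a point series is contiguous.
                f.create_dataset("t2m", shape=(LAT, LON, 0), maxshape=(LAT, LON, None),
                                 chunks=(1, 1, 744), dtype="f4", compression="gzip")
            dset = f["t2m"]
            start = dset.shape[2]
            dset.resize(start + cube.shape[2], axis=2)
            dset[:, :, start:] = cube

    # Later, pulling a multi-year series at one grid point is a handful of chunk reads:
    with h5py.File("era5_t2m.h5", "r") as f:
        series = f["t2m"][200, 600, :]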

PolarizedPoutin
0 replies
1h57m

Haha simplistic but probably faster and more space-efficient than a relational database. Sounds like rabernat and open-meteo who commented here do something similar to you and find it fast as well!

lawn
1 replies
1d4h

What an interesting post!

At a sustained ~462k inserts per second, we’re waiting ~20 days for our ~754 billion rows which is not bad I guess. It’s less time than it took me to write this post.

Hah, as I've been gravitating more to writing larger and more in depth blog posts I can relate to the surprising effort it can require.

PolarizedPoutin
0 replies
1d3h

Thank you! Yeah haha some of the benchmarks took several hours (and a few re-runs) and there was a lot of learning done along the way.

jamesgresql
1 replies
1d1h

This is super cool! I run DevRel @ Timescale, and I love seeing our community create well written posts like this!

My initial reaction is that I think one of the reasons you're seeing a hypertable being slower is almost certainly that it creates an index on the timestamp column by default. You don't have an index on your standard table which lets it go faster.

You can use create_hypertable with create_default_indexes=>false to skip creating the index, or you can just drop the index before you ingest data. You'll eventually want that index - but it's best created after ingestion in a one-shot load like this.
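
In case it's useful to other readers, the shape of that is roughly (hypothetical table and column names; the chunk interval is just an example):

    -- Skip the default time index for the bulk load, build it afterwards.
    SELECT create_hypertable(
        'era5', 'time',
        chunk_time_interval => INTERVAL '1 month',
        create_default_indexes => false
    );

    -- ... bulk load with COPY ...

    CREATE INDEX era5_time_idx ON era5 (time DESC);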

I'd also be interested in how the HDD you're reading data from is holding up in some of the highly parallel setups?

PolarizedPoutin
0 replies
20h46m

Thank you for reading and for your kind words!

Ah I did not know about the `create_default_indexes=>false` and that a time index is created by default for hypertables. I'll add a note to explain this! Also curious to benchmark inserting without the time index then creating it manually.

Even with 32 workers I think the HDD was fine. I did monitor disk usage through btop and the SSD that Postgres lived on seemed to be more of a bottleneck than the HDD. So my conclusion was that a faster SSD for Postgres would be a better investment than moving the data from HDD to SSD.

islandert
1 replies
19h17m

If you don't have access to COPY because the Postgres instance is managed, I've had a lot of luck with encoding a batch of rows as a JSON string, sending the string as a single query parameter, and using `json_to_recordset` to turn the JSON back into a list of rows in the DB.

I haven't compared how this performs compared to using a low-level sql library but it outperforms everything else I've tried in sqlalchemy.
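
A sketch of the trick (hypothetical table and column names): the whole batch travels as one JSON string parameter and is unpacked into rows server-side.

    -- $1 is a single parameter holding a JSON array of objects, e.g.
    -- '[{"time":"2020-01-01T00:00Z","lat":55.25,"lon":12.5,"t2m":274.1}, ...]'
    INSERT INTO era5 (time, lat, lon, t2m)
    SELECT time, lat, lon, t2m
    FROM json_to_recordset($1::json)
         AS r(time timestamptz, lat float4, lon float4, t2m float4);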

twoodfin
0 replies
16h15m

When I see suggestions like this (totally reasonable hack!), I do have to wonder what happened to JDBC’s “addBatch()/executeBatch()”, introduced over 25 years ago.

https://docs.oracle.com/javase/8/docs/api/java/sql/PreparedS...

Did modern APIs & protocols simply fail to carry this evolutionary strand forward?

gingerwizard
1 replies
9h31m

A few in this thread have suggested ClickHouse would do well here. We tested 1 trillion rows recently, albeit with much simpler data - https://clickhouse.com/blog/clickhouse-1-trillion-row-challe...

This is a good dataset though and the level of detail in the post is appreciated. I'll give ClickHouse a go on the same...

Disclaimer: I work for ClickHouse

PolarizedPoutin
0 replies
2h56m

Thanks for the link on the trillion row challenge, interesting read! I'm looking at queries and indexes next and I'm hoping to include Clickhouse in that comparison.

dunefox
1 replies
1d3h

OT: Does anyone know if DuckDB would be of use here?

wiredfool
0 replies
1d3h

Not likely, unless it's a set of parquet files already. Clickhouse would be a better bet.

d416
1 replies
18h55m

“Is a relational database even appropriate for gridded weather data? No idea but we’ll find out.”

Love this. It’s the exact opposite of all the other ‘well actually’ mainstream tech posts and I am here for all of it. Props for keeping the reader fully engaged on the journey.

PolarizedPoutin
0 replies
3h41m

Haha thank you for reading and glad you found it engaging! Maybe it's the benefit of being a beginner and not having any skin in the game. I did a lot of searching but couldn't find any conclusive answers for my use case so figured I may as well run the benchmarks myself.

nojvek
0 replies
17h3m

This should be a benchmark.

Could someone point me to where I can download the whole dataset?

koliber
0 replies
23h19m

Curious if he could squeeze more performance out by using a different structure to store the same data. Some of these float4 cols could probably be stored as int2. Depending on how many decimal places are needed, you can divide the int to get the resulting floating point value.
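
For instance (a sketch, not something from the post; names are made up), 2 m temperature stored as centi-degrees Celsius fits comfortably in a smallint:

    -- 2 bytes per value instead of 4; precision limited to 0.01 °C.
    CREATE TABLE era5_compact (
        time timestamptz NOT NULL,
        lat  smallint    NOT NULL,  -- index on the 0.25° grid, not degrees
        lon  smallint    NOT NULL,
        t2m  smallint    NOT NULL   -- round(temperature_in_celsius * 100)
    );

    SELECT time, t2m / 100.0 AS t2m_celsius FROM era5_compact;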

kinj28
0 replies
16h41m

It's a tangent.

I am curious: if we query the data for temperature at a given time for all lat/lon and plot it geospatially, would the result show anything about the distribution of heat energy received across latitude and longitude at that point in time?

hendiatris
0 replies
1d2h

If you’re going to work with weather data, use a columnar database like BigQuery. If you set things up right, your performance will generally be a few seconds for aggregation queries. I set up a data platform like this at my previous company and we were able to vastly outperform our competitors, and at a much lower cost.

The great thing about this data is it is generally append only, unless errors are found in earlier data sets. But it’s something that usually only happens once a year if at all.

dcreater
0 replies
11h10m

Nice! Can you upload the data to hf, oxen.ai, kaggle or something?

curious_cat_163
0 replies
17h15m

The data is output from a climate model run that is constrained to match weather observations.

That's interesting. Why store it? Why not compute it as needed using the model?

FWIW, I am not an expert in this space and if someone is, it would be good to understand it.

Aeroi
0 replies
17h0m

What did Google use to train GraphCast?