I've done a good amount of geospatial analysis for work.
One thing you quickly realize with geospatial data is that it's incredibly nuanced. You have to be quite careful about which coordinate reference system (CRS) the data is in and, for visualization, which projection is being used. The CRS is somewhat paranoia-inducing if you don't have good infrastructure set up, with the right tools to carry that metadata along with your geospatial data.
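To make that concrete, here's a minimal sketch of the CRS bookkeeping I mean, using GeoPandas (the file name and EPSG codes are just placeholders):

    import geopandas as gpd

    # GeoPandas carries the CRS as metadata on the GeoDataFrame.
    gdf = gpd.read_file("parcels.gpkg")   # hypothetical input layer
    print(gdf.crs)                        # e.g. EPSG:26915, a projected CRS

    # Reproject to WGS84 lon/lat before mixing with other datasets or visualizing.
    gdf_wgs84 = gdf.to_crs("EPSG:4326")

    # Compute areas in an equal-area CRS, never in degrees.
    areas_m2 = gdf.to_crs("EPSG:6933").geometry.area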
I've tested everything AWS has to offer, plus Postgres/PostGIS, Spark/Databricks, Snowflake, Trino, and ArcGIS.
I'm convinced the best tool for large-scale geospatial work is Google BigQuery, and it's not even close. An expensive multi-hour query running on PostGIS deployed on an enormous m6a EC2 instance dropped to less than 5 seconds in the BigQuery free tier. It does make sense if you think about it, since Google was early with Maps, but it is still stunning how much better they are in this specific niche domain.
This was using publicly available FEMA data that Snowflake and AWS services would just choke on, because the geometry column exceeded their maximum byte size. Spark doesn't even have native geospatial data types, and the open source packages that add support leave a lot to be desired.
This guy is running on-prem, so maybe that made sense for him, but I just would never bother. The storage for BQ would probably be less than $100/month for 20 TB.
I went through a similar phase with a pipeline that started from global OSM and Who's On First data. Google costs kept going up ($7k a month with Airflow + BigQuery) and I was able to replace it all with a one-time $7k hardware purchase. We could do it because the process used H3 indices early on and the resulting intermediate datasets all fit in RAM.
The system is a Mac Studio with 128GB of RAM + Asahi Linux + mmapped Parquet files and DuckDB. It also runs Airflow for us, and with Nix it can be used to accelerate developer builds and run the Airflow tasks for the data team.
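For a flavor of what the DuckDB side looks like, here's a rough sketch, assuming the intermediate Parquet files already carry a precomputed H3 cell column (the paths and column names are made up):

    import duckdb

    con = duckdb.connect()  # in-memory; Parquet files are scanned directly from disk

    # Aggregate OSM-derived features per H3 cell straight off the Parquet files.
    df = con.execute("""
        SELECT h3_cell,
               count(*)      AS feature_count,
               sum(length_m) AS total_length_m
        FROM read_parquet('intermediate/osm_roads_*.parquet')
        GROUP BY h3_cell
        ORDER BY feature_count DESC
    """).df()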
GCP is nice when it is free/cheap but they keep tabs on what you are doing and may surprise you at any point in time with ever higher bills without higher usage.
BigQuery is nice but it's definitely a major foot-gun in terms of cost. It's surprisingly easy to rack up high costs with, say, a misconfigured dashboard or a developer just testing stuff.
Frankly I think this is just a sign that it's a power tool for power users.
Sadly my colleagues aren't always "power users"
Nobody starts as a power user
Definitely agree here. Once the data is in BigQuery, people will start doing ad-hoc queries and building Grafana dashboards on top of it.
And sooner or later (usually sooner) somebody will build a fancy Grafana dashboard, set it to refresh every 5 seconds, and you will not notice until it's too late.
What? really? Do they change your pricing plan? How can they charge more for the same usage?
When you queried their 'Open Data' datasets and linked them with your own, it was absurdly cheap for some time. Granted, we used our hacking skills to make sure the really big queries ran in the free tier and only the smaller datasets went into private tables.
I kept getting emails about small changes, and the bills got bigger all over the place, including for BigQuery and how they dealt with queries on public datasets.
There is a non-zero chance I'm conflating things. But from my point of view: I built a system and left it running for years, then the bills got higher out of the blue and I moved out.
DuckDB is the real magic. On an NVMe disk with a decent amount of RAM, it goes brrrrrr.
I would love it if Postgres somehow got DuckDB-powered columnstore tables.
I know hydra.so is doing columnstores.
DuckDB being able to query parquet files directly is a big win IMO.
I wish we could bulk insert parquet files into stock PG.
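You can get pretty close today by using DuckDB as the loader. A rough sketch with DuckDB's postgres extension (the connection string and table names are placeholders, so check the extension docs for your versions):

    import duckdb

    con = duckdb.connect()
    con.execute("INSTALL postgres; LOAD postgres;")

    # Attach a stock Postgres database through the postgres extension.
    con.execute("ATTACH 'dbname=analytics user=me host=localhost' AS pg (TYPE POSTGRES)")

    # Bulk-load a Parquet file into a Postgres table with a single CTAS.
    con.execute("""
        CREATE TABLE pg.public.trips AS
        SELECT * FROM read_parquet('trips.parquet')
    """)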
You made me curious. Since you are using Linux, why a Mac and not a PC? Wouldn't a PC be cheaper? Or were there other factors?
That is a very cool setup!
My org would never allow that as we're in a highly regulated and security conscious space.
Totally agree about the BQ costs. The free tier is great and I think pretty generous, but the bills can get pretty crazy quickly if you don't enforce partitioning and clustering on new tables wherever possible, and don't give devs some training on how to work with columnar databases if they're not familiar.
M6a is not even remotely enormous. Also were you using EBS?
m6a.48xlarge: 192 vCPU & 768 GiB
If that's not a large node for you, well you're living in a different world from me. Yes to using EBS.
Unless you’re CPU-bound (unlikely), the r-family is usually a better fit for RDBMS, IME. 8x RAM:vCPU ratio, compared to 4x for m-family.
Then there’s the x-family, which can go up to bonkers levels, like 4 TiB of RAM (and local NVMe).
As a sibling comment mentioned, though, if you can fit the data into local storage, that’s the way to do it. EBS latency, even with io2.blockexpress, simply cannot compete.
That said, if I did the math correctly based on the 71 GB/31 days footnote, you’re looking at about 2.15 PB to load the entire thing, so, uh, good luck.
In your experience, is the r-family better than the c-family for running an RDBMS?
Things have been changing recently in AWS, but a few years ago R-family instances had one of the fastest uplinks to EBS. For example, only the largest M4 instance (m4.16xlarge) has a 10 Gbps uplink, versus R5b where it starts at 10 Gbps (for the lowest tier) and ends at 60 Gbps @ 260k IOPS. You can very easily choke EBS with a DB.
EDIT: only newer C instances have a comparable uplink to EBS; C5, C4 (and some C6) start from ~4.7 Gbps. Just compare the EBS bandwidth column in https://aws.amazon.com/ec2/instance-types/
The other fun thing about AWS instances is that the network uplink speed isn’t always what’s advertised. There is of course the “Up to X Gbps” levels (30 minutes of rated speed guaranteed every 24 hours), but there are also other limits, like cross-region speed being capped at 50% of rated capacity.
This rarely matters until it does, like if you’re setting up an active-active DB across regions. Then it’s a fun surprise.
Yes. Most RDBMS are memory-bound (indexes, connection overhead, buffer pool…), so the more, the better.
At my last job, I switched a large-ish (100K QPS at peak) m5 MySQL instance to a smaller (in AWS numbering) r6i that was cheaper, despite having the same amount of RAM and being a generation newer. That, combined with careful tuning and testing, resulted in queries speeding up 20-40%, AND we then had plenty of room for vertical scaling again if necessary.
Thanks for the info. The issue is using EBS. If you had used an instance with NVMe drives, it would probably have been faster than BQ (and you aren't billed per query...). I would suggest an R- or I4-class instance for this; m6 is not good for the money here.
You would just have to set up replication for backups, but this could just be rsync to EBS or some other replication solution depending on your database.
Do you mind linking the specific dataset? I agree very wide columns break a lot of tools but other columnar postgres forks should support this no problem. It sounds like you didn't use Redshift, which I find surprising as it directly competes with BQ. Redshift has "super" columns that can be very large, even larger than the maximum supported by BigQuery.
I constantly see folks finding out the hard way that PostGIS is really hard to beat. The fact that Trino/Presto and Spark have languished here is particularly telling.
It's FEMA's NFHL. I can't recall the specific layer of the GDB file, but you could probably figure it out. Try loading Iowa into Redshift, and if that works for you I'd be quite surprised.
My org has a very large AWS spend and we got to have a chat with some of their SWEs who work on the geospatial processing features for Redshift and Athena. We described what we needed and they said our only option was to aggregate the data first or drop the offending rows. Obviously we're not interested in compromising our work just to use a specific tool, so we opted for better tools.
The crux of the issue was that the problem column was the geometry itself, specifically MultiPolygon. You need to use the geometry datatype for this[1]. However, our MultiPolygon values were tens to hundreds of MBs, well outside the max size for the Super datatype, which from what I can tell is 16 MB.
[1]: https://docs.aws.amazon.com/redshift/latest/dg/GeometryType-...
What you describe sounds too large even for BigQuery.
Just split it out into multiple polygons, one per row. If you're using a relational database, do as relational databases do.
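In PostGIS that's basically one ST_Dump away. A minimal sketch, assuming a flood_zones table with a MultiPolygon geom column (names are hypothetical):

    import psycopg2

    conn = psycopg2.connect("dbname=gis user=me")  # placeholder connection string
    with conn, conn.cursor() as cur:
        # Explode each MultiPolygon into its component Polygons, one per row.
        cur.execute("""
            CREATE TABLE flood_zones_exploded AS
            SELECT zone_id,
                   (ST_Dump(geom)).geom AS geom
            FROM flood_zones
        """)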
I'm glad to hear this first-hand experience. I'm pretty sure I want to build and refine my own geospatial data hoard.
I wonder if you think a long-term project is better off rolling its own. A second consideration is that I expect it all to be locally hosted.
Thanks for any considerations.
Frankly, I think for long-term hoarding BQ is hard to beat. The storage costs are pretty reasonable and you never pay for compute until you actually run a query, so if you're mostly just hoarding, you're probably going to save a lot of time, money, and effort in the long run.
Thank you for the insights! Yeah I'm still not sure how Postgres/PostGIS will scale for me, but good to know that BigQuery does this nicely.
This is not something I'm productionizing (at least not yet?) and I'm giving myself zero budget since it's a side project, but if/when I do I'll definitely look into BigQuery!
Good luck! They have some really great tutorials on how to get started with BQ and geospatial data. One other nuance of BigQuery that doesn't seem to apply to many other tools in this space is that you can enable partitioning on your tables in addition to clustering on the geometry (Geography in BQ) column.
https://cloud.google.com/bigquery/docs/geospatial-data#parti...
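To make that concrete, here's a minimal sketch of creating a table that is partitioned on a date column and clustered on its GEOGRAPHY column, via the Python client (the dataset, table, and column names are invented):

    from google.cloud import bigquery

    client = bigquery.Client()

    # Partition on an ingestion date and cluster on the GEOGRAPHY column so
    # spatial predicates can prune blocks instead of scanning the whole table.
    client.query("""
        CREATE TABLE my_dataset.flood_zones
        PARTITION BY ingest_date
        CLUSTER BY geom
        AS
        SELECT DATE(ingest_ts) AS ingest_date,
               zone_id,
               ST_GEOGFROMTEXT(wkt) AS geom
        FROM my_dataset.flood_zones_raw
    """).result()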
Even more fun I bet would be if you're getting data with different reference spheroids.
Could you say more about this? I'm curious if you've compared Apache Sedona [0] and what specifically you found lacking? I currently work at Wherobots [1], founded by the creators of Apache Sedona and would love to hear any feedback.
[0] https://sedona.apache.org/latest/
[1] https://wherobots.com/
We just ported our ~500GB data warehouse from an 8-core Postgres RDS instance to BigQuery. Rebuilds went from 5 hours to 11 minutes and costs are about the same or lower. And that's with caching of some very big tables in Postgres versus rebuilding from scratch in BQ.
I much prefer Postgres as a tool I can run myself, but the order-of-magnitude performance benefits are hard to argue with.
I hear so many good things about BigTable / BigQuery; it's a shame I've had no opportunity to use them yet.