pg_timeseries: Open-source time-series extension for PostgreSQL

riedel
17 replies
21h20m

You may already be asking: “why not just power the stack using TimescaleDB?” The Timescale License would restrict our use of features such as compression, incremental materialized views, and bottomless storage. With these missing, we felt that what remained would not provide an adequate basis for our customers’ time-series needs. Therefore, we decided to build our own PostgreSQL-licensed extension.

I have been using the free version of TimescaleDB to shard a 500-million-observation time-series database. It worked drop-in without much hassle. I would have expected some benchmarks and comparisons in the post, though. I will for sure watch this...

osigurdson
15 replies
17h7m

500 million is very little, however. A regular table with a covering index would probably be fine for many use cases at this number of points.

vegabook
12 replies
17h3m

Indeed. The financial time series I was working with had over 100 million new points, _per day_. For anything serious, TimescaleDB is essentially not open source. Well done, tembo.io crew -- will definitely give this a whirl.

NineStarPoint
5 replies
8h34m

And as I understand that license, you are allowed to use Timescale for anything that doesn’t involve offering Timescale itself as a service. If you were using Timescale to process lots of time series transactions in your backend, it doesn’t seem to me like that would break the license.

(Which is to say that if, like Tembo, you're offering Postgres as a service, you do indeed have a problem. But for other uses, it should be fine.)

wasmitnetzen
2 replies
8h4m

The license doesn't allow you to "give access to, directly or indirectly (e.g., via a wrapper) to [SQL]".

Legally, what's a wrapper? Is a REST API a wrapper?

belk
0 replies
6h54m

I imagine it would legally need a lawsuit to set a precedent, and if a license owner sets an overreaching precedent for what a wrapper is, they risk losing customer trust and companies avoiding them like the plague.

E.g., TimescaleDB going after a TSDB-as-a-service company offering TSDB behind a GraphQL wrapper vs. TimescaleDB going after a financial company offering time-series data collection and viewing.

I think a good border test would be: would TimescaleDB allow you to offer a metrics and logging service? Technically you're offering time-series database functionality, but it's in a constrained domain and very clearly a different product, yet still effectively CRUDing time-series data.

NineStarPoint
0 replies
5h39m

That’s the internal-use restriction. There is also the restriction more relevant to the use cases I’m talking about, on Value Added Products, which is “the customer is prohibited, either contractually or technically, from defining, redefining, or modifying the database schema or other structural aspects of database objects”.

Which is, basically, saying that you can do anything that doesn’t give your customers the ability to redefine or modify the database schema, as long as you are creating a product that adds value on top of Timescale. Is any of this 100% clear? Not any more than legalese generally is, and of course it's probably wise to talk to a lawyer if you’re concerned about it. Timescale has made their intent with the license clear in the past with blog posts and such, though.

dig1
1 replies
8h8m

The tricky thing with these licenses (BSL, SSPL, etc.) is that you can use them freely for internal stuff, but suddenly, if you make your product public (assuming it uses, e.g., TimescaleDB), things can get muddy. Everyone wants the flexibility to either open-source or commercialize a successful internal product in the future.

The problem is that, even if your app is not a mere frontend for TimescaleDB/Mongo/Redis, you can get sued, and you'll have to spend unnecessary time and money proving things in court. No one wants this, especially a startup owner whose money and time are tight. Also, even if your startup/company uses some of these techs, potential company buyers will be very wary of the purchase if they know they'll have to deal with this later.

miohtama
0 replies
6h33m

I would assume TimescaleDB only sues if you have money. In that case you can also afford a commercial license. If you hit it big, just contact them and tell them there was a problem with having a correct license earlier and that you want to fix the situation.

There is a 0% chance Timescale would sue a mom’n’pop operation for breaking their license.

miohtama
3 replies
6h34m

If you have 100 million points per day, it’s likely you can afford to pay for any commercial license.

pgwhalen
2 replies
6h19m

Why would the number of data points correlate to budget? Perhaps there’s a chance if the business scales with paying users, but that’s unlikely to be true in finance.

walr000s
1 replies
6h2m

At that number of observations, I would assume depth-of-market data, so probably an HFT use case. HFT is notoriously expensive to try to compete in.

moogly
0 replies
4h51m

Or IoT data, which is notoriously hard to make money on.

vjerancrnjak
1 replies
16h30m

I think you’re not talking about the same thing. There are two expressions related to time-series data: “high churn” and “active time series”.

500 million active time series is extremely huge.

It does not have anything to do with number of data points.

Good time series databases can scale to 1M-10M writes per second without a hiccup.

osigurdson
0 replies
14h16m

I suppose it depends on what is meant by an "observation". Is that an entire time series for a single property, or a single point? Nevertheless, the number of points absolutely matters.

A regular Postgres database can give you 50-100K inserts per second and can scale to at least 1B rows with 100K+ individual series without much difficulty. If you know you will need less (or much less) than this, my suggestion is to use a regular table with a covering index. If you need more, use ClickHouse.
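
For illustration, that setup needs nothing beyond stock PostgreSQL (a minimal sketch; the table and column names are invented):

    -- plain table, no extension required
    CREATE TABLE readings (
        series_id BIGINT           NOT NULL,
        ts        TIMESTAMPTZ      NOT NULL,
        value     DOUBLE PRECISION NOT NULL
    );

    -- covering index: (series, time) range scans can be answered by
    -- index-only scans, since INCLUDE carries the value along
    CREATE INDEX readings_series_ts_idx
        ON readings (series_id, ts) INCLUDE (value);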

rapsey
0 replies
15h22m

Databases are a tough business. You're just waiting for open source to eat your lunch.

MuffinFlavored
10 replies
19h16m

Dumb question: why can't I just insert a bunch of rows with a timestamp column and indices? Where does that fall short? At a certain # of rows or something?

What does this let me do that can't be achieved with "regular PostgreSQL without the extension"?

gonzo41
6 replies
18h54m

Time-based partitioning.

MuffinFlavored
5 replies
17h53m

    CREATE TABLE logs (
        id SERIAL PRIMARY KEY,
        log_time TIMESTAMP NOT NULL,
        message TEXT
    ) PARTITION BY RANGE (log_time);
Why won't this work on stock PostgreSQL?

kbolino
1 replies
3h44m

I think what's meant here is windowing (partitioning the query), not partitioning the table per se. Though even with this strategy, you must manually create new partitions all the time.
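
With stock declarative partitioning, that maintenance looks like this for every new period (a minimal sketch for a logs table partitioned by log_time):

    -- each range has to exist before rows for it arrive
    CREATE TABLE logs_2024_05 PARTITION OF logs
        FOR VALUES FROM ('2024-05-01') TO ('2024-06-01');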

This also isn't typical time-series data, which generally stores numbers. Supposing you had a column "value INTEGER" as well, how do you do something like the following (pseudo-SQL)?

    SELECT AVG(value) AS avg_value FROM logs GROUP BY INTERVAL '5m'
Which should output rows like the following, even if the data were reported much more frequently than every 5 minutes:

    log_time             | avg_value
    2024-05-20T00:00:00Z | 10.3
    2024-05-20T00:05:00Z | 7.8
    2024-05-20T00:10:00Z | 16.1

ants_a
0 replies
41m

    SELECT date_bin(interval '5 minutes', log_time, TIMESTAMP '2001-01-01') AS log_time,
           AVG(value) AS avg_value
    FROM logs
    GROUP BY 1;

gonzo41
1 replies
16h49m

Read the docs; it's not saying that won't work. This extension, like Timescale, just makes some things more ergonomic.

darkstar_16
0 replies
9h37m

And it provides helper functions to deal with common time-series tasks.
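
For example (a sketch going by the project README; enable_ts_table() is the helper documented there, and the table itself is invented):

    -- declare a partitioned table, then hand partition management
    -- over to pg_timeseries
    CREATE TABLE sensor_reading (
        sensor_id    BIGINT NOT NULL,
        reading_time TIMESTAMPTZ NOT NULL,
        value        DOUBLE PRECISION
    ) PARTITION BY RANGE (reading_time);

    SELECT enable_ts_table('sensor_reading');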

jmaker
0 replies
12h16m

That won’t work as written: your timestamp isn’t part of your primary key.
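
Stock PostgreSQL requires the primary key of a partitioned table to include the partition key, so something like this would be accepted (a minimal sketch):

    CREATE TABLE logs (
        id       SERIAL,
        log_time TIMESTAMP NOT NULL,
        message  TEXT,
        PRIMARY KEY (id, log_time)  -- partition column must be included
    ) PARTITION BY RANGE (log_time);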

skibbityboop
0 replies
17h52m

I'm with you; I need to read up more on where time series could benefit. At work we have a PostgreSQL instance with around 27 billion rows in a single partitioned table, partitioned by week. It goes back to January of 2017 and just contains tons of data coming in from sensors. It's not "fast", but also not ridiculously slow to say, e.g., "Give me everything for sensor 29380 in March of 2019".
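
A query like that (a sketch; the table and column names are guesses) only touches the weekly partitions that overlap the requested month:

    SELECT *
    FROM sensor_data
    WHERE sensor_id = 29380
      AND recorded_at >= '2019-03-01'
      AND recorded_at <  '2019-04-01';  -- pruning skips all other weeks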

I guess it depends on your needs, but I do think I need to investigate time series more to see if it'd help us.

ishikawa
0 replies
5h33m

There are several good articles explaining this, especially on the Timescale blog, but in short: without time partitioning, with just an index, at some point the performance of reads and writes degrades exponentially.

citizen_friend
0 replies
17h46m

It will work just fine.

nitinreddy88
9 replies
19h16m

Most time-series queries (almost all of them) are aggregate queries. Why not leverage or build a top-notch columnar store for this?

Everything seems to be there, so why is there no first-class product like ClickHouse on PG?

paulryanrogers
6 replies
18h56m

Citus, Persona, TimescaleDB?

nitinreddy88
3 replies
18h52m

Looking at the comparison in ClickBench, they are almost pathetic in terms of performance. They can't even handle sub-second aggregation queries over 10M records. Compare that to even DuckDB reading from Parquet files.

nikita
2 replies
13h31m

Postgres is missing a proper columnstore implementation. It's a big gap, and it's not easy to build.

One solution could be integrating DuckDB in a similar way to pgvector. You would need to map DuckDB storage to Postgres storage and reuse the DuckDB query processor. I believe that's the fastest way to get Postgres to have competitive columnstores.

tarasglek
0 replies
10h1m

This sounds interesting. I don't see DuckDB as a supported extension or mentioned anywhere in your code yet ;)

Is this foreshadowing?

anentropic
0 replies
9h38m

Hydra?

bloopernova
0 replies
17h33m

That was very "Klaatu, Barada, Nikto".

applied_heat
0 replies
14h2m

VictoriaMetrics as well; they say it's based on structures similar to those used in ClickHouse.

netik
1 replies
18h45m

The gold standard for this is Druid at very large scale, or ClickHouse. ClickHouse has a lot of problems as far as modifying/scaling shards after the fact, while Druid handles this with ease (with the penalty of not being able to update after the fact).

anentropic
0 replies
9h37m

Doris?

gxyt6gfy5t
4 replies
21h27m

How is it different from TimescaleDB?

logrot
1 replies
21h25m

You may already be asking: “why not just power the stack using TimescaleDB?” The Timescale License would restrict our use of features such as compression, incremental materialized views, and bottomless storage. With these missing, we felt that what remained would not provide an adequate basis for our customers’ time-series needs. Therefore, we decided to build our own PostgreSQL-licensed extension.

samaysharma
0 replies
21h19m

It's much newer, for one, so it's behind on features. But we're working on adding new ones based on customer demand.

We want to build it with a PostgreSQL license using existing community extensions as much as possible and build custom stuff (still permissively licensed) only when necessary.

vantiro
0 replies
21h22m

TimescaleDB's license is more like Redis' new license?

hosh
0 replies
21h17m

The use of PostgreSQL licensing might mean we'll see this available on AWS RDS and other managed PostgreSQL providers.

suyash
3 replies
12h5m

Interesting. How does it compare to a proper (open-source) time-series database like InfluxDB, other than being "Postgres-like"?

RedShift1
1 replies
9h28m

InfluxDB is a "proper" time series database?

sofixa
0 replies
6h19m

It is, in the sense that it's a purpose-built time-series database. Once upon a time it was definitely among the best too, but IMO their sales, tech, and overall strategies have been just massive clusterfucks. They also have the best metrics agent, by far.

They're on their third serious rewrite with breaking changes for users. They pulled features from the open source version (clustering) to have a higher moat for the Enterprise version. They deleted customer data permanently in their hosted version after sending two emails to billing addresses. Their marketing and product naming also sucks:

InfluxDB, InfluxDB Cloud Serverless, InfluxDB Cloud Dedicated, InfluxDB Clustered

tucnak
0 replies
11h37m

It's not Postgres-like, it _is_ Postgres.

matthewmueller
2 replies
13h5m

Would love to use this with RDS!

rywalker
0 replies
3h16m

Tembo CEO here - we are targeting feature parity for Tembo Cloud w/ RDS as soon as possible. Would love to have you give Tembo a try sometime and give us feedback :)

Tembo Cloud is a standard SaaS offering, and our new Tembo Self Hosted (https://tembo.io/docs/product/software/tembo-self-hosted/ove...) allows you to run the same software that powers our SaaS, but in your own K8s cluster.

anentropic
0 replies
9h36m

Same here, but I'm not holding my breath, since all these neat Postgres extensions compete with other AWS DBs like Redshift, Timestream, etc.

wdb
1 replies
7h51m

Would this be a good extension for storing load balancer log entries (status, response body, headers, etc.)?

I think a columnar store would be more efficient than a normal row-based database for this? Load balancer log entries could be considered something similar to analytics events.

samaysharma
0 replies
2h40m

Yes. Columnar is integrated with pg_timeseries already.
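
If I'm reading the announcement right, the columnar side comes via Hydra's columnar access method, so a columnar table looks something like this (a sketch assuming Hydra's documented USING columnar syntax; the table itself is invented):

    -- column-oriented storage suits append-heavy, scan-heavy logs
    CREATE TABLE lb_log (
        ts      TIMESTAMPTZ NOT NULL,
        status  INT,
        headers JSONB
    ) USING columnar;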

plainOldText
1 replies
19h58m

Your site is very well designed and easy to read btw, and the app UI looks great from the demo photos. I might try it!

samaysharma
0 replies
19h18m

Thank you!

vivzkestrel
0 replies
15h39m

Benchmarks with respect to QuestDB, TimescaleDB?

vantiro
0 replies
21h28m

PostgreSQL licensed, good move!

valenterry
0 replies
6h49m

It's about time that Postgres (and other databases) added native append-only tables. That alone doesn't make it a time-series database, but it would probably help with standardization and all the logic/access around it.
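
In the meantime, append-only can be approximated at the privilege level (a minimal sketch in plain SQL; the role and table are invented):

    CREATE TABLE events (
        ts      TIMESTAMPTZ NOT NULL DEFAULT now(),
        payload JSONB
    );

    -- writers may add and read rows, but never rewrite history
    REVOKE ALL ON events FROM app_writer;
    GRANT INSERT, SELECT ON events TO app_writer;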

ramoneguru
0 replies
20h29m

How does this stack up against something like what QuestDB offers?

nhourcard
0 replies
7h15m

Interesting release. It feels like the time-series database landscape is evolving toward:

a) columnar stores built from scratch, converging on open formats such as Parquet & Arrow: InfluxDB 3.0, QuestDB

b) time-series capabilities added on top of Postgres: Timescale, pg_timeseries

c) platforms focused on observability around the Prometheus ecosystem: Grafana, VictoriaMetrics, Chronosphere

mnahkies
0 replies
6h50m

Looking at their roadmap, the killer feature for me would be incremental materialised views:

Incremental view maintenance — define views which stay up-to-date with incoming data without the performance hit of a REFRESH

I wonder if they plan to incorporate something like https://github.com/sraoss/pg_ivm or write their own implementation.

(Although I'm hopeful that one day we'll see IVM land in Postgres core.)
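
For reference, pg_ivm's entry point looks like this (a sketch going by its README; the view name and query are invented):

    -- create an incrementally maintained materialized view;
    -- pg_ivm keeps it current as base-table rows change
    SELECT create_immv(
        'reading_stats',
        'SELECT sensor_id, COUNT(*), AVG(value)
           FROM sensor_reading GROUP BY sensor_id'
    );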

mathfailure
0 replies
20h47m

Thank you for posting it: I followed the links and found out about trunk and https://pgt.dev/

PeterZaitsev
0 replies
19h19m

Great to see this kind of innovation. PostgreSQL is interesting: while the "core" was always open source under a very permissive license, there have been many proprietary and source-available extensions, ranging from replication to time-series support.

Now we see those proprietary extensions being disrupted by proper open source!