HN comments for: Show HN: PostgreSQL index advisor

sbstp

30 replies

17h24m

2024-04-14 05:32:30 UTC

I've often thought that a database that could automatically detect slow queries and create the necessary indexes would be neat. You run a load test on your application, which in turns calls the database and you collect all the queries it makes. Then the database automatically adjusts itself.

tuwtuwtuwtuw

9 replies

17h3m

2024-04-14 05:53:57 UTC

That exists in Microsoft SQL Server. It can create new indexes, drop unused indexes, change query plans when it detect degradation and so on.

BrentOzar

7 replies

12h18m

2024-04-14 10:38:38 UTC

Source? I’ve been working with SQL Server for a couple of decades and I don’t believe it will automatically create or drop indexes under any circumstances. You might be thinking of Azure SQL DB.

tuwtuwtuwtuw

2 replies

11h24m

2024-04-14 11:33:09 UTC

What's the point of asking for a source when you would find it on Google in one minute? Odd way of learning. Not like I brought up some debated viewpoint.

simplyinfinity

0 replies

6h24m

2024-04-14 16:32:24 UTC

Google the name of the person you're replying to :)

hobs

0 replies

10h50m

2024-04-14 12:06:50 UTC

Probably because most of the stuff you'd find in the top search results would include the GP's name.

Just a few sentences later "Automatic tuning in Azure SQL Database also creates necessary indexes and drops unused indexes" - that's not in on-prem SQL Server.

Ciantic

2 replies

11h46m

2024-04-14 11:10:35 UTC

"Automatic tuning, introduced in SQL Server 2017 (14.x), notifies you whenever a potential performance issue is detected and lets you apply corrective actions, or lets the Database Engine automatically fix performance problems." [1]

I have used this in Azure SQL too, but according to that it should be in SQL Server.

https://learn.microsoft.com/en-us/sql/relational-databases/a...

couchand

0 replies

9h54m

2024-04-14 13:02:46 UTC

Good link!

Automatic index management identifies indexes that should be added in your database, and indexes that should be removed. Applies to: Azure SQL Database

BrentOzar

0 replies

9h1m

2024-04-14 13:56:09 UTC

Read that link carefully: only automatic plan regression is available in SQL Server, not the automatic index tuning portion. The index tuning portion only applies to Azure SQL DB.

radicalbyte

0 replies

8h35m

2024-04-14 14:22:09 UTC

They've had a non-automatic "query advisor" in there forever, it operated on profiling data and was highly effective.

taspeotis

0 replies

8h51m

2024-04-14 14:05:33 UTC

That’s an Azure SQL thing, not MSSQL.

elric

5 replies

16h56m

2024-04-14 06:01:01 UTC

I'm sure the database could, but it doesn't mean the database should. Indexes come at the cost of extra disk space, slower inserts, and slower updates. In some cases, some slower queries might be an acceptable tradeoff. In other cases, maybe not. It depends.

kiwicopple

2 replies

15h41m

2024-04-14 07:15:16 UTC

this is our posture for this extension on the supabase platform. we could automate the creation of the indexes using the Index Advisor, but we feel it would be better to expose the possible indexes to the user and let them choose

gneray

0 replies

9h32m

2024-04-14 13:24:33 UTC

this is the way ^^

dmurray

0 replies

5h40m

2024-04-14 17:16:20 UTC

You could tell it "you have a budget of X GB for disk space, choose the indexes that best optimize the queries given the budget cap."

Not perfect, because some queries may be more time-critical than others.

You could even annotate every query (INSERT and UPDATE as well as SELECT) with the dollar amount per execution you're willing to pay to make it 100ms faster, or accept to make it 100ms slower. Then let it know the marginal dollar cost of adding index storage, throw this all into a constraint solver and add the indexes which are compatible with your pricing.

d0100

1 replies

5h34m

2024-04-14 17:22:46 UTC

Are the trade-offs measurable? If they are the database could just undo the index...

Not just indexing, but table partitions, materialized views, keeping things in-memory...

remus

0 replies

4h40m

2024-04-14 18:16:22 UTC

Are the trade-offs measurable?

Yes, but you need the context about what is the correct tradeoff for your use case. If you've got a service that depends on fast writes then adding latency via extra indices for improved read speed may not be an acceptable trade off. It depends on your application though.

arronax

5 replies

16h16m

2024-04-14 06:40:53 UTC

Oracle DB is, or was, very close to that with its query profiles, baselines, and query patches. It wasn't automatic back in 2014 when I last worked on it, but all the tools were there. Heck, it was possible to completely rewrite a bad query on the fly and execute a re-written variant. I suppose it all stems from the fact that Oracle is regularly used under massive black boxes, including the EBS.

Also, the problem with automatic indexing is that it only gets you so far, and any index can, in theory, mess up another query that is perfectly fine. Optimizers aren't omniscient. In addition, there are other knobs in the database, which affect performance. I suppose, a wider approach than just looking at indexes would be more successful. Like Ottertune, for example.

dmurray

2 replies

5h47m

2024-04-14 17:09:12 UTC

Heck, it was possible to completely rewrite a bad query on the fly and execute a re-written variant.

Is there really such a thing as a bad query that can be rewritten to give the same results but faster? For me, that's already the query optimizer's job.

Of course there are "bad queries" where you query for things you don't need, join on the wrong columns, etc. And yeah the optimizer isn't perfect. But a query that you expect the query optimizer to "rewrite" and execute in an optimal way is a good query.

Scene_Cast2

1 replies

4h46m

2024-04-14 18:10:20 UTC

One example is that in Presto, joins assume that left (IIRC) table is the big one you stream, and the right one is the small one you hash. One of the newer features was that in some cases, the optimizer is able to ignore the SQL order and correctly pick which table to hash and which one to stream.

legulere

0 replies

3h5m

2024-04-14 19:51:49 UTC

Isn't that just basic table/index statistics to know which table is the smallest?

Tostino

1 replies

8h30m

2024-04-14 14:26:13 UTC

The problem of new indexes messing up otherwise good queries is something I've battled on and off for the past decade with Postgres. Definitely annoying.

rand_r

0 replies

2h25m

2024-04-14 20:31:53 UTC

How would an index mess up another query? AFAIK indexes would only hurt write performance marginally per index, but most slow queries are read-only. I’ve tended to just add indexes as I go without thinking about it and haven’t run into issues, so genuinely curious.

masklinn

2 replies

16h31m

2024-04-14 06:25:30 UTC

Because indexes have costs you need a much more complicated system which can feed back into itself and downgrade probationary indexes back to unindexed.

fulafel

0 replies

6h9m

2024-04-14 16:48:04 UTC

Several databases index everything, needed or not. (And sometimes have mechanisms to force it off for some specific data)

freedomben

0 replies

5h45m

2024-04-14 17:11:32 UTC

Even this isn't sufficient, because some problems with over-indexing don't become apparent until the size of a table gets much larger, which only happens a drop at a time. I suppose if it was always probationary and continually being evaluated, at some point it could recognize that for example INSERTs are now taking 1000x longer than they were 2 years ago. But that feels like a never-ending battle against corner cases, and any automatic actions it takes add significant complexity to the person debugging later.

ComodoHacker

1 replies

17h3m

2024-04-14 05:53:50 UTC

Big Guys do this. For big bucks, of course.

tuwtuwtuwtuw

0 replies

17h0m

2024-04-14 05:56:45 UTC

big bucks

You get that feature in Azure SQL Database for $5/month.

ed_balls

0 replies

5h44m

2024-04-14 17:12:20 UTC

Default DB for App Engine (NDB) has this feature. Implicit indexes are tad annoying.

b3lm0nt

0 replies

5h48m

2024-04-14 17:09:01 UTC

Andrew Kane built dexter, which is an automatic indexer for Postgres.

https://github.com/ankane/dexter

https://ankane.org/introducing-dexter

GordonS

0 replies

5h44m

2024-04-14 17:12:46 UTC

I might be misremembering, but IIRC RavenDB does this (it's a commercial document DB, written in C#).

jitl

9 replies

18h21m

2024-04-14 04:35:24 UTC

I’ve been needing this and also HypoPG, which I didn’t realize had been available on RDS for a year plus!

The issue I’m struggling with these days is that I have an index that I want Postgres to use for one of the relations in a 3+ wide join, but unless I put a limit on a CTE of one of the tables, Postgres tries to run each join in parallel and ends up trying to join a bazillion rows like an asshole.

These days wrangling the query planner has me on the verge of breaking up with pg :-(

kiwicopple

2 replies

17h57m

2024-04-14 04:59:58 UTC

i don't know how far it was developed, but I believe pg_plan_guarantee[0] was created to solve this sort of issue

This extension provides a mechanism, using which you can set your execution plan in stone; that is, Postgres will execute the plan you give it, or it will throw an error, but it will never try to guess a plan for you.

[0] https://github.com/DrPostgres/pg_plan_guarantee

jitl

1 replies

1h3m

2024-04-14 21:53:31 UTC

Ugh I would love LOVE to have something like this or even better a stable API where I can manually specify my own physical query plan. As is, I’m on RDS and this thing is experimental so I have no way to use it.

jaggederest

0 replies

45m

2024-04-14 22:11:32 UTC

For what it's worth, depending on how many people and/or budget you have available to man on call rotations, running your own postgresql server with replication is not that hard. The RDS premium is like a whole engineer's salary by the time you get to Large Databases (tm), and if you're running production, staging, and maybe a data warehouse, it starts to rapidly be able to pay for a whole team.

jaggederest

2 replies

17h52m

2024-04-14 05:04:35 UTC

You can set max_parallel_workers_per_gather to zero in the session where you're running the problem query, if that's helpful. That will disable the parallel query behavior entirely. You can just reset it back to what it was once the query is complete. I've run into this issue before and that was my go-to fix.

krembo

1 replies

2h41m

2024-04-14 20:15:26 UTC

I'd strongly oppose setting the max parallel workers on session level just to bypass an execution plab. I stead understand the logic why pg behave the way it is, and change the query accordingly

jaggederest

0 replies

47m

2024-04-14 22:09:51 UTC

Yeah you can file a patch with postgres and get it into the next point release or something but in the mean time you really want to keep that query from OOMing your database :)

polishdude20

0 replies

17h31m

2024-04-14 05:25:34 UTC

Have you tried changing the order of the joins? I know it shouldn't matter but I've found sometimes it tricks the query planner into doing the right things!

justinclift

0 replies

17h51m

2024-04-14 05:06:03 UTC

Guessing you've seen this already and it's not doing what you want?

https://github.com/ossc-db/pg_hint_plan

arronax

0 replies

16h0m

2024-04-14 06:56:48 UTC

While pinning a single plan or disabling parallel operations is a valid quick fix, I'd recommend trying to actually understand why the optimizer is doing what it's doing. It's all "just" a simple math of execution node costs. The problem with plan hints, or disabling parallel, is that unless you have a perfect understanding of where the data will be in, say, a year, you may just have created another problem, just delayed. Some column will go from 10 to 10,000 unique values and your good plan will become way off. And at the end, perhaps you just need to tell PG that an index access is not 4 times as expensive as a plain heap scan.

ps fighting the optimizer is the worst

CAP_NET_ADMIN

7 replies

16h36m

2024-04-14 06:20:37 UTC

If you have some PostgreSQL performance issues, I'd recommend checking out PGAnalyze - they've offered a much more advanced index advisor for some time now.

My company is a paid customer since around 2020 and we are very satisfied, easily beats the Datadog's (which we use for the rest of our infra and apps) observability offering for PostgreSQL.

allan_s

6 replies

16h24m

2024-04-14 06:32:59 UTC

Re: performance issue

I used to think that performance issue in relational database was always a matter of :

* missing indexes * non-used indexes due to query order (where A, B instead of B, A)

But we had the case recently where we optimized a query in postgresql which was taking 100% of cpu during 1s (enough to trigger our alerting) by simply splitting a OR in two separate query.

So if you are looking for optimisation it may be good to know about "OR is bad". The two queries run in some ms both.

xvinci

4 replies

15h58m

2024-04-14 06:58:48 UTC

I'm sorry but you just cannot say that "OR is bad" - it being a key part of SQL . It's most likely your use that is bad (e.g. your intermediate result exceeding some cache size).

But "bad performance always due to indexes" gives a hint that you are somewhat new: No, bad performance in my experience was almost always due to developers either not understanding their ORM framework, or writing too expensive queries with or without index. Just adding indexes seldom solved the problem (maybe 1/5 of the time).

mrklol

1 replies

13h50m

2024-04-14 09:07:03 UTC

OR is indeed not bad, but you have to think about when to use it. It can easily make queries slower compared to other operators. That’s exactly why we have the terms "ugly OR" / "bad OR".

thom

0 replies

13h18m

2024-04-14 09:38:39 UTC

It’s worth having a mental model of _why_ OR can be suboptimal. Often it’s because you’re only hitting an index on half the conditional, or forcing PG into a bitmap scan, or worse turning and index lookup into a sequential scan. Not to bang on about indexes too much but a partial index on the OR condition works if you’re lazy, although splitting into two queries is often a great solution as it gives you two fast queries to combine instead of one slow one (although sometimes that implies PG’s statistics are incorrect because it might have been able to know which side of the conditional cuts out more data).

magicalhippo

0 replies

9h26m

2024-04-14 13:31:05 UTC

Just adding indexes seldom solved the problem

We write all our queries by hand. We've got decades of experience and I'd say we're pretty proficient.

For us adding an index is almost always the solution, assuming the statistics are fine.

Either we plain forgot, or a customer required new functionality we didn't predict so no index on the fields required.

Sure sometimes a poorly constructed query slips out or the optimizer needs some help by reorganizing the query, but it's rare.

dz08dl

0 replies

11h10m

2024-04-14 11:46:37 UTC

It's complicated; that's why there isn't a one-size-fits-all solution. In the end, you want to have a good execution plan, and there's usually not just one and the same action to achieve that.

magicalhippo

0 replies

9h40m

2024-04-14 13:16:26 UTC

We are transitioning from SQLAnywhere to MSSQL, and saw the same for a key query.

SQLAnywhere handled the single OR fine, but we had to split the query into two using UNION ALL for MSSQL not to be slow as a snail burning tons of CPU.

No idea why the MSSQL optimizer doesn't do that itself, it's essentially what SQLAnywhere does.

Ozzie_osman

5 replies

8h28m

2024-04-14 14:29:11 UTC

My team has been using Pganalyze (which has its own index advisor), and honestly, it's one of my favorite technical tools ever. I don't know how you could run a large-scale postgres setup without it. In addition to index advice, it also shows top queries, vacuums, etc, and monitors for a whole host of other issues out of the box.

They also have a ton of great content on their blog (5mins of postgres) where the founder will find blog posts by different authors/companies and analyze them in depth.

lfittl

4 replies

7h25m

2024-04-14 15:31:49 UTC

Thanks for the kind words!

For anyone interested in how pganalyze's approach compares to this extension (and other alternatives like dexter, or using HypoPG directly), I gave a talk with my colleague Philippe last year at PgCon that describes how we use constraint programming and CP-SAT for dealing with the trade-off between index write overhead and read performance improvement, across multiple queries on a table:

https://www.pgcon.org/events/pgcon_2023/schedule/session/422...

Rapzid

2 replies

4h17m

2024-04-14 18:39:20 UTC

This is awesome! Was the talk recording by chance?

lfittl

1 replies

3h54m

2024-04-14 19:02:51 UTC

Yep, there is a "link to video" link on the talk page - here is the direct link:

https://www.youtube.com/watch?v=pGN_pORKtSQ

We also did a more recent webinar that has some slight revisions on top of that talk, recording available in our docs: https://pganalyze.com/docs/indexing-engine/cp-model

Rapzid

0 replies

3h44m

2024-04-14 19:13:09 UTC

Cheers! I skimmed right over that.

polivier

0 replies

6h32m

2024-04-14 16:24:19 UTC

If anyone has some background in optimization and is interested in the more technical side of things, I gave a talk at JOPT 2023, comparing MIP and CP approaches for this problem. The slides can be found here: https://github.com/PhilippeOlivier/jopt2023

fforflo

2 replies

13h6m

2024-04-14 09:50:27 UTC

The convenient thing about this is that it's written in vanilla Pl/PgSQL. It can be tempting to copy the `index_advisor(text)`function in a session and start hard-coding stuff and heuristics :D .

Most meaningful extensions need to be compiled, installed, created dropped.

victorbjorklund

0 replies

9h32m

2024-04-14 13:24:59 UTC

Oh that is indeed very nice

pokipoke

0 replies

7h18m

2024-04-14 15:38:36 UTC

You still need to install hypopg which is C extension

brightball

2 replies

8h2m

2024-04-14 14:54:40 UTC

Would be great if there was something you could run on tables to recommend more space efficient data types based on what’s actually stored in the table.

klysm

1 replies

2h16m

2024-04-14 20:40:43 UTC

I can’t imagine a concrete example of this

jpgvm

0 replies

1h45m

2024-04-14 21:12:01 UTC

Examples are easy, usefulness... harder to imagine.

Simple example, bigint column where all values would fit in smallint or if only 0/1 are present then boolean.

For a more complex idea if a large number of boolean columns are present in a table suggest packing them into integer/bigint as appropriate or bit(n) if individual querying/indexing via bit operators is needed.

There are many ways to claw back bytes on disk from PostgreSQL if you really need to and a lot of them could be suggested automatically.

The reason I say usefulness is harder to imagine is I don't know of anyone that would want to do this but wouldn't know how or where to look for these strategies. It's as if awareness of the need is commensurate with ability to resolve it.

vvern

1 replies

7h39m

2024-04-14 15:17:15 UTC

CockroachDB has a similar feature built in. It takes existing queries which are slow and then analyzes hypothetical indexes for better query plans and suggests them. You can add them in the console ui with one click. It’s pretty useful.

danolivo

0 replies

7h18m

2024-04-14 15:38:58 UTC

The term ‘slow’ is too relational and not strong. I guess, we should look up for queries, which can be potentially faster - see into estimation errors or number of data pages involved into the query.

itsgrimetime

1 replies

17h45m

2024-04-14 05:11:29 UTC

Does something like this for MongoDB? Searching hasn’t turned up anything useful.

winrid

0 replies

17h8m

2024-04-14 05:48:58 UTC

Ops Manager (paid tool) kinda supports this but in my experience it misses a lot of obvious suggestions.

cpursley

1 replies

6h57m

2024-04-14 15:59:58 UTC

Good stuff, just added this and some others mentioned in this thread to my Postgres Is Enough listicle: https://gist.github.com/cpursley/c8fb81fe8a7e5df038158bdfe0f...

asah

0 replies

12m

2024-04-14 22:44:27 UTC

Nice. This is 404ing: https://levelup.gitconnected.com/how-cloudflare-achieved-55-...

PikachuEXE

1 replies

15h32m

2024-04-14 07:25:05 UTC

I am using https://github.com/ankane/pghero/ and this is one of its features with GUI

kiwicopple

0 replies

15h16m

2024-04-14 07:40:44 UTC

Andrew builds some amazing tools for the postgres ecosystem (he is also the creator of pgvector)

Bishonen88

1 replies

16h2m

2024-04-14 06:54:24 UTC

How does one install this on a aws rds instance?

kiwicopple

0 replies

15h38m

2024-04-14 07:18:59 UTC

hypopg is now supported in RDS so the steps would be:

1. Enable hypopg

    create extension if not exists hypopg;

2. Copy/paste the plpgsql file:

https://github.com/supabase/index_advisor/blob/main/index_ad...

We are also developing the Trusted Language Extension with the RDS team, so at some point it should be easier to do this through database.dev:

https://database.dev/olirice/index_advisor

pkiv

0 replies

17h28m

2024-04-14 05:28:20 UTC

The supabase team always delivers. Excited to give this a try!

l5870uoo9y

0 replies

12h27m

2024-04-14 10:30:08 UTC

This can be optimized both easier and further[0]. How is case-sensitivity optimally handled? Should the query only match exact titles?

[0]: https://www.sqlai.ai/snippets/cluzdmi8w006d53gt82mguaga

jensenbox

0 replies

15h45m

2024-04-14 07:11:38 UTC

dexter uses a similar premise with hypog.

https://github.com/ankane/dexter

icang888

0 replies

7h32m

2024-04-14 15:24:58 UTC

Twslive

gregw2

0 replies

11h45m

2024-04-14 11:11:42 UTC

Is this aware of parent and child inherited tables?

dveeden2

0 replies

15h16m

2024-04-14 07:41:08 UTC

This is somewhat similar to TiAdvisor for TiDB, which also uses a hypothetical method.

https://www.pingcap.com/blog/introducing-tiadvisor-automated...

danolivo

0 replies

7h12m

2024-04-14 15:44:26 UTC

Having creation advice, the extension obviously must provide candidates to delete and, less obvious, candidates to merge some indexes.

crorella

0 replies

15h58m

2024-04-14 06:58:12 UTC

Interesting! We did something similar for distributed query engines like presto and Spark, instead of indexes we use partitions and buckets, the benefits are less compute, time and money.

ComodoHacker

0 replies

16h58m

2024-04-14 05:58:35 UTC

This doesn't seem to consider or give insight on any tradeoffs involved.

Also, the underlying extension, HypoPG, doesn't seem to collect any statistics on data to influence query planner.