The hunt for the missing data type

obi1kenobi
28 replies
1d1h

As someone who has done a lot of work with graphs, "why don't programming languages have a built-in graph data type?" is a question I’ve been asked a million times.

I'm thrilled I'll be able to point folks to a much more in-depth analysis like the one here, instead of saying some variation of "it's really hard to do it well" and having them just take my word for it.

somat
9 replies
23h20m

This is a super naive take. but I would consider the pointer to be the native graph type. What is wanted is not a graph type but the tooling to traverse graphs.

Nevermark
2 replies
19h17m

Generalized pointers (RAM addresses, disk storage locations, network locations, etc.) would be the general way to implement explicit literal graphs.

Other graphs, that are partly or wholly computed or recomputed as needed from other relationships, could be considered "implicit" graphs and can be implemented many other ways.

The fact that graphs can be any combination of literally or dependently defined, statically or dynamically defined, would add even more complexity to any truly general graph library.

thfuran
1 replies
17h14m

You could say the same of Lists.

Nevermark
0 replies
9h45m

Well, of course! Lists, trees, arrays, structures, etc. are specialized graphs.

alanbernstein
1 replies
23h7m

Maybe a pointer is better described as equivalent to a half-edge?

andyferris
0 replies
20h25m

I think this might be kinda right.

Take Rust. Tree edges use Box, and DAG edges use Arc or Rc. Rust doesn’t really help with building non-DAG graphs as native data types - you can drop back to using “handles” for more generic graphs.

And I actually think that is kinda fair. Languages and standard libraries should support trees and DAGs out of the box. Perhaps Hillel is right that more complex graphs should be in a user-space library or something.
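
To make that concrete, here is a minimal Rust sketch of those three tiers (all names are illustrative, not from any particular library):

    use std::rc::Rc;

    // Tree: Box gives each child exactly one owner, so no sharing and no cycles.
    struct TreeNode {
        value: i32,
        children: Vec<Box<TreeNode>>,
    }

    // DAG: Rc allows shared children; cycles are avoided since they would leak.
    struct DagNode {
        value: i32,
        children: Vec<Rc<DagNode>>,
    }

    // General graph: fall back to "handles", i.e. indices into a node table,
    // which can express arbitrary cycles.
    struct Graph {
        values: Vec<i32>,
        edges: Vec<(usize, usize)>, // (from, to) pairs of indices into `values`
    }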

nine_k
0 replies
23h5m

It's a bit like saying that a byte is a native number type, providing comparison to zero and addition as the only operations, and asking to figure the rest yourself.

I mean, it's technically sufficient, but I'd not call it proper support on a language level.

cwillu
0 replies
22h20m

The period should probably be a comma; I read this comment a couple of times before it made sense as self-deprecation, rather than as calling the grandparent super naive.

computerdork
0 replies
21h23m

Hmm, I think you're taking the title of the article a little too literally and focusing on just the "data type" aspect. Yeah, the article itself indirectly agrees with you that pointers are often how graphs are implemented: "Graphs are a generalization of linked lists, binary trees, and hash tables," which are data structures often implemented with pointers.

Yeah, the article is really saying what you're saying, that building a graph library ("tooling") is very complicated.

But, maybe I'm misunderstanding what you're saying??

boothby
0 replies
21h55m

I actually brought this hot take up in my conversation with Hillel -- something along the lines of "void ** is the canonical native graph representation in C."

jjice
6 replies
1d1h

I used to think that, since graphs are such a broad data structure that can be represented in different ways depending on requirements, it just made more sense to implement them at a domain-ish level (the article mentions this in the "There are too many implementation choices" section).

Then I saw Petgraph [0], which is the first time I had really looked at a generic graph library. It's very interesting, but I've still implemented graphs at a domain level.

[0] https://github.com/petgraph/petgraph

TachyonicBytes
3 replies
21h58m

Why wouldn't an abstract `Graph` type and specific implementations of that work? Like Java has for `Set` and various implementations?

skybrian
1 replies
19h30m

For importing and exporting data, it often makes more sense to use something like a table or a tree rather than a graph. (Like a CSV file or a JSON file.)

So it's not clear what the interface would do. What methods should there be? Again, there are too many choices, and a Graph interface often isn't the best way to represent a view of some subset of a graph.

nyrikki
0 replies
19h0m

To add to this, recursively enumerable is the same as semi-decidable, and that only gets you to finite time.

One of the big reasons to fight to keep dependencies DAG-like is that otherwise exhaustive search gets you to exponential time.

NP-complete, NP-hard are easy to run into with graphs.

Graph k-colorability, finding Hamiltonian cycles, max cliques, max independent sets, and vertex cover on n-vertex graphs aren't just in NP; they have no known sub-exponential time algorithms.

Finding those subgraphs is often impractical.

gryn
0 replies
20h49m

Even at that level of abstraction there are many flavors of "graphs". What some people call graphs are actually hypergraphs, or attributed graphs.

nickpsecurity
1 replies
1d

You might say graphs are an abstract concept, solving our problems requires specialized graphs, and so we just use or build the kind we need.

sanderjd
0 replies
22h8m

Yeah this is exactly how I think of it. I think "graphs" are just at a different level of the abstraction hierarchy than the data structures they're often grouped together with.

This is even further up the abstraction hierarchy, but to illustrate the point, nobody really wonders why languages don't ship with a built-in database implementation. And it's the same basic reason as with graphs; one size doesn't fit most.

paulddraper
5 replies
22h37m

"it's really hard to do it well"

More importantly, there are a lot of tradeoffs.

Virtually every language offers a hash map. You can roll your own to outperform it in an individual circumstance, but the default works pretty well.

You can't really do that with a graph. Maybe if you offered a bunch of graph types.

---

PS. Bit of trivia: Java's HashMap is a bit different from almost every other language in that it lets you tune the load factor.

adastra22
1 replies
21h43m

You can't really do that with a graph. Maybe if you offered a bunch of graph types.

And so why isn't this the solution?

Most languages support both hash map (fast lookup) and balanced tree (ordered entries) primitives, even though they both implement the "associative map" container type.

Can't we have 2, 3, 5, or even 8 different graph types?

masklinn
0 replies
20h38m

Can't we have 2, 3, 5, or even 8 different graph types?

That’s what graph libraries do; look at petgraph. You’ve got a bunch of graph implementations, and a bunch of algorithms over them.

swiftcoder
0 replies
6h25m

It's also a bit different in that every insertion entails at least one allocation, and every lookup entails pointer-chasing through at least 3 dereferences. Java's lack of proper value types really kneecaps their hashmap implementation in many practical performance scenarios...

karmakaze
0 replies
20h52m

It's also different in that each hash bucket can use a Red-Black tree when there are many entries.

dataflow
1 replies
17h28m

"why don't programming languages have a built-in graph data type?"

What I find a little funny about that question is that people miss the fact that there isn't even a tree data structure in most languages. Most languages have static arrays, dynamic arrays, linked lists, and... that's it as far as structural types go. Everything else (BSTs, hashtables, etc.) is some semantic abstraction that hides some capabilities of the underlying structure, not a purely structural representation.

maxiepoo
0 replies
3h51m

Typed functional languages like Haskell (data)/ML (type) do have a built-in way to define new tree types, and so does Rust (enum). It's one of the biggest things I miss when I'm not using these languages, especially when combined with some metaprogramming (deriving/derive) to get some functions defined for these new types very quickly.
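
To sketch what that looks like (a minimal, illustrative example using only the standard library): a user-defined binary tree as a Rust enum, with `derive` doing the metaprogramming mentioned above:

    // The type itself is the tree; no separate graph machinery is needed.
    #[derive(Debug, Clone, PartialEq)]
    enum Tree {
        Leaf(i32),
        Node(Box<Tree>, Box<Tree>),
    }

    // Structural recursion follows the shape of the type.
    fn sum(t: &Tree) -> i32 {
        match t {
            Tree::Leaf(n) => *n,
            Tree::Node(l, r) => sum(l) + sum(r),
        }
    }

    fn main() {
        let t = Tree::Node(Box::new(Tree::Leaf(1)), Box::new(Tree::Leaf(2)));
        println!("{}", sum(&t)); // 3
    }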

sayrer
0 replies
21h28m

I think it's because you have to think it through when graphs get too big for one machine. A lot of those so-called "NoSQL" databases (think DynamoDB) are really meant to represent graphs as adjacency lists. I've had success with Gremlin and JanusGraph, but it's a really messy space. It's not a programming language problem in my opinion, but a distributed systems one.

bloaf
0 replies
15h8m

I had an opposite experience. I was doing some new-to-me graph work in tcl and had just assumed the tcl standard library wouldn't have any graph algorithms. Come to find out I was wrong, and ended up saving myself a bunch of time not having to re-invent the wheel:

https://core.tcl-lang.org/tcllib/doc/trunk/embedded/md/tclli...

atoav
0 replies
14h30m

Python has one, as mentioned, but as a non-CS person I never encountered any problem in my programming life that so fundamentally called for graphs that I needed one, nor did I have so much fascination for graphs that it was a tool I wanted to force onto my problems.

I guess it is just that there are many problems that can be solved incredibly well without graphs and not that many where graphs outshine everything else so clearly that people would use them.

That being said, convince me of the opposite and I am happy.

jes5199
28 replies
1d1h

I think this is a cop-out. Someone in the 1990s could have written the same thing about collections, or dictionaries, but we eventually came up with good-enough compromises that Python, Ruby, and Javascript all basically do the same thing. They don't implement literally every case, but they are good enough for "small" data, where the definition of "small" has grown to be quite large by human standards.

I think the real problem is syntax. Someone needs to come up with a textual graph literal that fits in source code, and it'll be good enough for the small case. Switch to a custom ubergraph when you exceed, idk, ten million entries.

gilleain
11 replies
1d1h

What do you mean by "textual graph literal"?

jes5199
10 replies
1d1h

Textual array literal: [1,2,3]

Textual dict literal: {"a": 1}

Textual graph literal: ???

jes5199
4 replies
1d

I like DOT a lot. Way back in the day we had a semi-serious proposal to use it in the Puppet configuration language to describe dependencies

Groxx
3 replies
1d

Same - DOT is a great way to go from zero to functional in near minimal time. It's also pretty trivial to generate since it's not order dependent, which is great for lightweight automation.

Fine-tuning layout can be a real hassle though, sadly. I haven't found any quick tools for that yet.

kevindamm
1 replies
1d

The way I approach fine-tuning DOT layout is to add subgroups where it seems appropriate, add a 1px border for the subgroup, and see where the layout engine is situating it next to other nearby vertices/edges. Sometimes I may have to put a border around a few subgroups, then attempt to adjust the size of vertices and the entire area to nudge it to a local minimum. Note: I don't attempt to adjust the size of the subgroups; I'm not sure that even works anyway, but maybe it depends on the choice of layout algorithm, too. Padding and other style on the vertex-box may help, too. It's been a few years for me, tbh.

Deciding where the appropriate subgroups are is a bit of an art. Sometimes it's obvious, as in bipartite graphs that are intentionally bipartite. Or, if there is a staged layout like for pipeline architectures. Sometimes it's not obvious even when it seems it should be, like when graphviz really wants to make a certain edge really short. Be ready to backtrack sometimes. Then I usually remove the subgroup border after I'm done, but a few times they have been useful to leave there.

One thing I really like about DOT is that adding hyperlinks to the vertices and edges that translate decently into the compiled output is really nice. I had an oncall dashboard that made liberal use of this feature that I still think back on fondly sometimes.

whitten
0 replies
23h35m

Can you point to where I can learn more about this? E.g., does an example and explanation exist for hyperlink embedding?

I am especially interested in syntax suitable for creating something to input into https://www.viz-js.com and for creating SVGs with embedded hyperlinks.

Izkata
0 replies
1d

If you're using subgraphs, with an edge across subgraph boundaries, it is slightly order dependent - a node will appear in the area it was first mentioned in. If you define all the nodes/subgraphs first and all the edges at the bottom you'd never notice this though.

gre
0 replies
1d

  array{ 1 2 3 }
  dict{ { "a" 1 } }
  graph{ { "a" "b" } { "b" "c" } { "c" "a" } }
  AVL{ { "a" 1 } { "b" 2 } } 
Factor has some syntax like this. All you have to do is pass the `{...}` to the prefix (array/dict/graph/AVL) and it knows how to construct one.

https://github.com/factor/factor/blob/master/extra/trees/avl...

gilleain
0 replies
1d1h

Heh, one possibility is:

AB 0:1(C),0:2(D)

For a three node graph with edges between vertex 0 and 1, and vertex 0 and 2, vertex labels 'A' and 'B' and edge labels 'C', and 'D'. Not great to parse (as I sadly found), but possible to read.

WorldMaker
0 replies
1d

We've got a couple common graph literals today: Graphviz's "dot" language [0] and the Mermaid language (in part inspired from "dot") [1]. Mermaid is increasingly common in documentation today, given Github supports it almost everywhere Markdown is supported. Parsing dot or Mermaid as embedded literals in another language doesn't seem that complicated (Mermaid's parser is written in JS and already runs anywhere you can run a JS parser).

Though one interesting thing to note here is that both languages are edge list languages and are optimized for algorithms that are useful for edge lists, particularly display/diagramming. That gets back to the article's point that there are other useful representations and those can matter for efficiency of algorithms.

They also can matter for efficiency of a graph literal in a source document. Edge lists are great for sparse graphs, but if you have a dense graph it might be easier to write it as a literal with a massive spreadsheet of an adjacency matrix. Especially if it can just be a spreadsheet with modern affordances like scrolling and sticky rows/columns and such. There's no perfect answer for "every" graph. (A small sketch of both styles follows the links below.)

[0] https://graphviz.org/doc/info/lang.html

[1] https://mermaid.js.org/intro/#flowchart
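
To illustrate the two styles with a toy example (illustrative Rust, not any particular library):

    fn main() {
        // Sparse style: an edge list, the shape DOT and Mermaid are built around.
        let edge_list = vec![(0usize, 1usize), (1, 2)];

        // Dense style: an adjacency matrix, the "spreadsheet" that suits dense graphs.
        let adjacency = [
            [false, true, false],
            [true, false, true],
            [false, true, false],
        ];

        println!("{:?} / {:?}", edge_list, adjacency[0]);
    }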

obi1kenobi
5 replies
1d

Two problems I see here, based on the research I've done in high-performance graph algorithms:

- It's hard to find a "good-enough" graph implementation. The best hashtable is only a handful of percent better than the built-in ones. The best graph impl is 1000x or more better than any generic built-in one could be, so there's much more incentive to specialize (and people already specialize hashtables for just a handful of percent speedup!)

- The baseline complexity level of implementing a reasonable hashtable is fairly high, even if for a small dataset. The baseline complexity of implementing a graph algorithm for a small dataset is pretty low, and the real problems come in later / at larger scale. So in graphs there's less incentive to learn a complex library's API when "I could just hack it myself," unlike for hashtables where the API is simple and doing it myself is much harder.
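
To illustrate the second point, here is a sketch of the "I could just hack it myself" path: a complete BFS over an ad-hoc adjacency list in a few lines (the adjacency-list shape is an assumption for illustration):

    use std::collections::{HashMap, HashSet, VecDeque};

    // Breadth-first search over a hand-rolled adjacency list.
    fn bfs(adj: &HashMap<u32, Vec<u32>>, start: u32) -> Vec<u32> {
        let mut seen = HashSet::from([start]);
        let mut queue = VecDeque::from([start]);
        let mut order = Vec::new();
        while let Some(v) = queue.pop_front() {
            order.push(v);
            for &w in adj.get(&v).into_iter().flatten() {
                if seen.insert(w) {
                    queue.push_back(w);
                }
            }
        }
        order
    }

    fn main() {
        let adj = HashMap::from([(1, vec![2, 3]), (2, vec![4]), (3, vec![4])]);
        println!("{:?}", bfs(&adj, 1)); // [1, 2, 3, 4]
    }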

obi1kenobi
1 replies
1d

The work you cited is very impressive and very welcome.

But you seem to be implying that `std::unordered_map` is the default choice one would use, which in my experience is not accurate -- it is well-known to have serious perf shortcomings, and everyone I know uses some other implementation by default. Even so, the delta from `std::unordered_map` to the improved hashtable in the blog post is impressive, and just shy of 10x.

Graph algorithms frequently have 10x improvements from one state-of-the-art approach to the next -- for example, here's one from my own research[1]. The delta between state-of-the-art and "good default" in graph algorithms would often be around 100-1000x. And comparing state-of-the-art to the equivalent of an `std::unordered_map` would be another 10-100x on top of that, so 1000-100000x total.

[1]: https://dl.acm.org/doi/10.1145/3210377.3210395

bruce343434
0 replies
10h44m

Whoah, thank you for sharing. I only knew that, just like with dictionaries, there are quite a few implementation choices when making a graph, depending on what operations the algorithm needs to do often and how sparse the data is.

brazzy
0 replies
1d

The baseline complexity level of implementing a reasonable hashtable is fairly high, even if for a small dataset.

Have you tried doing it? My experience was that it was surprisingly simple. We may have different expectations for what is "reasonable", of course.

DSMan195276
0 replies
23h49m

The baseline complexity level of implementing a reasonable hashtable is fairly high, even if for a small dataset.

I would disagree with this; it's actually really easy to make one if you're willing to do away with many features (which aren't essential, but provide performance benefits). Implementing one is just something you never have to do in most modern languages.

PaulHoule
5 replies
1d1h

I've been involved with RDF and sometimes it seems the gap between "too simple to use RDF" and "too hard to use RDF" is tiny.

For instance I tried to pitch a data processing library a bit like

https://www.knime.com/

but where RDF graphs (roughly like a JSON document) get passed over the "lines". I found that the heavy hitters in this space believed this sort of product has to use columnar execution to be "fast enough". You can certainly build something that can do operations on a dynamic RDF graph (really a set of triples), and in principle you could compile code that treats native data structures as if they were in RDF... You might get pretty good speed, but it won't be as fast as native, and it's hard to make it easier to code for than native.

jes5199
4 replies
1d

RDF is an XML standard, isn't it? There was an XML era for collections and dicts, too. That is now largely considered to have been a mistake.

zozbot234
3 replies
1d

RDF is not dependent on XML. There is an XML representation of RDF, but alternatives include JSON (using JSON-LD) and simple textual formats such as N3 and Turtle.

PaulHoule
1 replies
1d

One trouble w/ RDF is that there are two official ways to do ordered collections: the RDF list, which is basically a LISP list, and then RDF containers, where the order is encoded in predicates. Neither is well supported in most tools; for instance, SPARQL lacks the list-handling capabilities that you'd see in

https://docs.arangodb.com/3.12/aql/

or

https://www.couchbase.com/products/n1ql/

The XMP spec, for instance, hacks Dublin Core by adding ordering information because... it matters what order the authors are in. Dublin Core, on the other hand, seems to have been developed for cataloging elementary school libraries, and they were huge fans of doing the easy stuff and leaving out anything moderately hard, so Dublin Core looks like it has a 1968 level of sophistication while MARC is so much more 2000s. People come to RDF, see all these problems that are ignored, and conclude that RDF is not for them.

hobofan
0 replies
1d

There is a good repo that collects many of the issues with RDF[0], which I would have loved to have known about earlier in my journey with RDF.

[0]: https://github.com/w3c/EasierRDF

hobofan
0 replies
23h4m

While there are (commonly used) alternative serialization formats, RDF is married to the XML data model, as all core datatypes of RDF are defined in terms of XSD (XML Schema Definition) datatypes, and it very closely mimics its data model.

If you want to have an RDF that is independent of that (e.g. based on Apache Arrow so that it's compatible with modern big data tooling) you might as well start from scratch.

abeppu
1 replies
1d1h

And some languages (e.g. java, scala) have standard libraries with interfaces that describe ordered collections, dictionaries, etc, but offer multiple implementations so the programmer can pick based on their specific considerations, but still benefit from library code written around the interface.

Hackbraten
0 replies
1d1h

It also uses hint/marker interfaces (e.g. `RandomAccess`) so the library code can choose among algorithms internally while still presenting a uniform API to the client.

boothby
0 replies
23h39m

As somebody who's actually done the work, I don't think you've really spent time with the question. The problem is that graphs, in general, aren't special. They're a structural dumping ground.

Data structures are specialized graph representations. The source code you write is in a specialized graph representation. For the overwhelming majority of programmers, the fact that something can be viewed as a graph is much like saying "oh well that file is just a really long sequence of bits, so it's effectively an integer." It's not wrong, but is it useful?

082349872349872
0 replies
23h49m

https://lobste.rs/s/uhmhum/hunt_for_missing_data_type#c_1na1...

matrix multiplication and permanents are known to be non-cheap to compute, requiring worse-than-quadratic time! So any sort of good graph algorithm must dig deeper and be more specialized for the task at hand

ylow
16 replies
21h32m

I think this is because a graph is neither a data structure nor a data type. It is really an abstraction.

Fundamentally, all I need to define a graph is a set of vertices v \in V and a function Neighbors(v). And that really is all that is needed for the most foundational set of graph algorithms.

Everything else is case-by-case constraints. Does A->B imply B->A? Is the node set partitionable with certain constraints? Are there colors? Labels?

To make things even more general, I can go up one level and consider the hypergraph. In which case I just have a set of vertices, and a set of sets of vertices. This can be represented in a myriad of different ways depending on what you are interested in. Of which the (non-hyper) graph is simply a special case.

An alternative way to think about it, perhaps from the database perspective, is that it's a query optimization and indexing problem. Depending on what questions you want to ask about the graph, there will be different ways to represent the graph to answer the question better. Just as there is not one way to represent the abstraction called "Table", there is not one way to do "Graph" either. It really depends on the questions you care about.
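
As a sketch of how little that abstraction demands, here is the same idea as a Rust trait (names are illustrative):

    trait Graph {
        type Vertex;
        fn vertices(&self) -> Vec<Self::Vertex>;
        fn neighbors(&self, v: &Self::Vertex) -> Vec<Self::Vertex>;
    }

    // Any algorithm written against this trait (BFS, DFS, reachability, ...)
    // works for every representation behind it: adjacency list, adjacency
    // matrix, or neighbors computed on the fly.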

dataflow
11 replies
17h34m

Fundamentally, all I need to define a graph is a set of vertices v \in V and function Neighbors(v).

Even that is severely overconstrained. It doesn't allow multiple edges to the same neighbor!

lou1306
7 replies
9h33m

Well to be fair, that constraint is also part of the mathematical definition of a graph, where the set of edges E is a binary relation over vertices V (i.e., a subset of V x V). You'd need either a multiset or a labelled relation (i.e., a subset of V x L x V for some set of labels L) to overcome that.

dataflow
6 replies
2h56m

There is no "the" definition. From Wikipedia:

"Definitions in graph theory vary. [...] A multigraph is a generalization that allows multiple edges to have the same pair of endpoints. In some texts, multigraphs are simply called graphs."

lolinder
5 replies
2h45m

It's a bit disingenuous to skip over the Graph subsection of that article, right after the "definitions vary" line:

A graph (sometimes called an undirected graph to distinguish it from a directed graph, or a simple graph to distinguish it from a multigraph) is a pair G = (V, E), where V is a set whose elements are called vertices (singular: vertex), and E is a set of unordered pairs of vertices, whose elements are called edges (sometimes links or lines).

An unqualified "graph" is almost always this one—a simple, undirected graph. If you mean something different you almost always need to use one of the more specific names to be clear.

https://en.m.wikipedia.org/wiki/Graph_(discrete_mathematics)...

dataflow
4 replies
2h40m

"Disingenuous"? Do you have to make this personal before looking for another explanation?

The parent comment I replied to said:

"Does A->B imply B->A?"

That "undirected" condition was already violated before I wrote anything.

lolinder
3 replies
2h29m

Sorry, I didn't intend to make it personal, I was just pointing out that the very next paragraph after the chunk you quoted included the definition of "graph" that lou1306 was referring to, almost verbatim.

Definitions sometimes vary, but lou1306 is correct on the merits that the most widely accepted definition of an unqualified "graph" states that "the set of edges E is a binary relation over vertices V (i.e., a subset of V x V)".

dataflow
2 replies
2h12m

Again, that objection makes no sense to my comment when it was already violated in the discussion before I wrote anything: "Are there colors? labels?"

If you'd like to object to it ("to be fair" or whatever), the parent comment I replied to would be the one to do so to.

lolinder
1 replies
1h45m

You're pulling in context from ylow's post that isn't relevant to this subthread. I'm not defending ylow's definition, I'm defending lou1306's.

Here's the first few parts of the chain of thought of this subthread:

ylow> Fundamentally, all I need to define a graph is a set of vertices v \in V and function Neighbors(v).

You> Even that is severely overconstrained. It doesn't allow multiple edges to the same neighbor!

lou1306> Well to be fair, that constraint is also part of the mathematical definition of a graph ...

You> There is no "the" definition. From Wikipedia ...

You explicitly were only replying to the portion of ylow's comment that was about vertices and a neighbors function, and lou1306 was replying to your assertion that vertices+neighbors was overconstrained because it wouldn't allow multiple edges. All I'm saying is that lou1306 is correct in their definition of a graph. If that means that both you and ylow are wrong, that's fine with me!

dataflow
0 replies
1h2m

All I'm saying is that lou1306 is correct in their definition of a graph.

I never claimed otherwise. I explicitly said the opposite - there are multiple correct definitions. That's literally one of the reasons why there's no general purpose graph type - there are multiple definitions with different properties, all of which are referred to in various contexts as "graphs".

If that means that both you and ylow are wrong, that's fine with me!

This gives off very strong "somebody is wrong on the internet" vibes...

All I said was (a) in the context of the current discussion (not decided by me!), graphs were already assumed to encompass more than the vanilla undirected V x V definition people are pointing me to, and (b) in that context, one more example (supporting the parent's point that I was replying to!) was graphs with multiple edges. All of which seems quite uncontroversial, true, and in line with the context of the parent comment I replied to. I have nothing to add.

I seriously don't get where the desire to die on this hill is coming from, but I don't share it, so I won't keep continuing here.

saghm
2 replies
11h52m

Maybe the naming would be a little weird for that use case, but they didn't specify what the output of `Neighbors(v)` is; I don't see any reason why it couldn't return a multiset or a relation (w, c) where `c` is the number of connections between `v` and `w`.

dataflow
1 replies
2h58m

Returning the same neighbor multiple times kind of misses the point. The point was that you need to return edges (not neighbors) because the edges connecting the same neighbors can be different.

Like, imagine two train lines between the same pair of stations. Or two roads between the same intersections. They might have different travel times, costs, etc.

saghm
0 replies
1h24m

I still don't see how this couldn't work with a `Neighbors(v)` function with an unspecified return type. The outputs I gave were examples of how it could be adapted for various use cases, not an exhaustive list of all possibilities; in the example with multiple edges with multiple weights, the output could instead be a relation of (v2, w, c) that indicates that v connects to v2 with weight w, with c as the number of different edges between the two with that weight.

gloftus
3 replies
20h23m

Yes, graphs are ubiquitous because they are so abstract. They live on the same level of abstraction as pure numbers. There are useful "numerical" libraries, and by analogy I think you could say there are also useful "graphical" libraries. But we don't really have "number" libraries, and we don't really have "graph" libraries, because those concepts are a bit too abstract to write APIs against.

kragen
2 replies
19h30m

it's true that numbers are very abstract, which is what makes it so easy to design apis for them

the python runtime includes four built-in number types (small integer, arbitrary-precision integer, float, and complex) and the python standard library includes two more number types (decimal and fractions), and one of the most popular non-standard libraries for python is numpy, which provides some other kinds of numbers such as single-precision floats, vectors, and matrices. other systems like pari/gp have number libraries that provide other kinds of numbers, such as p-adic numbers and galois field elements

the only programming languages i've ever used that didn't have 'number' libraries were esoteric languages like brainfuck and the lambda calculus

ysofunny
1 replies
16h51m

numbers have all of mathematics as a background which is what makes it so easy to design apis for them

graphs are a much newer development, I think there's a very deep connection between category theory and graphs in general (and also computers make both much more useful somehow)

lambda calculus can be used to define numbers, but it's a wonky construction, reminiscent of how sets can also be used to define numbers.

kragen
0 replies
16h43m

that seems reasonable

pizlonator
9 replies
1d1h

I would claim that object oriented languages are a syntax, semantics and type system for graphs.

Objects are nodes.

Fields are edges.

The object graph is the heap.

So your whole program state is a graph.
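
A rough sketch of that mapping, with Rc/RefCell standing in for a garbage-collected language's object references (illustrative only, not how any particular runtime works):

    use std::cell::RefCell;
    use std::rc::Rc;

    // Objects are nodes; the `friends` field holds this node's outgoing edges.
    struct Person {
        name: String,
        friends: Vec<Rc<RefCell<Person>>>,
    }

    // The set of all live Person objects, together with their `friends`
    // references, is exactly the heap-as-graph described above.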

tlb
1 replies
1d1h

A limitation is that you can only have one such graph in the program. So any graph algorithm that returns a graph, or uses another graph during the computation, doesn't fit.

boothby
0 replies
1d

Don't fall for the abstraction! Mathematically speaking, if you have graphs G1 and G2; you can make another graph H with nodes {G1, G2} and edges {(G1, G2)} and nothing goes wrong. You can definitely view your whole operating system as a graph; it doesn't invalidate having a program running which processes graphs.

qsort
1 replies
1d1h

Graphs are such a general concept that if you squint everything is a graph. Your example works just as well with structs and member fields, we don't even need the OO hypothesis.

pizlonator
0 replies
1d

C and structs let you write code using whatever paradigm you want. So you can do object graphs. And you can also do closures. And you can do many other things.

Lots of things are possible when you just treat memory as bytes and pointers are just integers.

reaperman
0 replies
1d1h

I think this is a useful model for me personally, and don't want to diminish its potential value to others; I often think of programs as graphs.

I think it's interesting to add to the discussion that I'm wary of reducing anything to any particular "Turing-complete concept", because anything can be represented by anything.

gwbas1c
0 replies
1d1h

I was about to post "Data structures are graphs," but your explanation says it better.

dustingetz
0 replies
1d

    let x = 1; // edge named x
    let y = f(x); // node named f with input edge x and output edge y

chubot
0 replies
1d1h

Right, I think that's just what I said the first time this came up: all languages with GC have graphs builtin.

(Though C has graphs too. If every node shares the same lifetime, then it's pretty easy to manage. Otherwise it can be pretty painful)

And the good news is that you simply use the TYPE SYSTEM to categorize your nodes and edges.

Your edges are references to other objects, which are typed. Nodes can be typed as well.

---

Although the original article does get at this -- there are many types of graphs, and some of them can be encoded in a typed object graph.

Some of them can't -- you need the equivalent of void* for the edges.

Others would need a List[T] for the edges, if the out degree is not fixed.

And that only covers directed graphs, etc.

Also, it's true that allocating all these tiny objects as GC objects can be very slow, so then you use other representations of graphs, like a list of pairs of node IDs.

I don't really think of it as a "missing" data structure, but yeah now I do see how that framing can be useful.

PaulHoule
0 replies
1d1h

For that matter it's true about a certain kind of C program. In a civilized language you have run-time typing and introspection and have all the type information so that it is straightforward to look at the program's data structures as a graph.

If you are looking at the debug symbols and a copy of the program you can usually figure this graph out but you might need to think about it sometimes.

criloz2
8 replies
23h46m

Graph drawing tools are also very underwhelming. They work pretty well for small graphs, but once you have something like 500 nodes or more, their output becomes completely incomprehensible or very difficult to look at. They lack the ability to automatically organize those graphs into hierarchical structures and provide a nice interface to explore them; we are used to everything around us having some kind of hierarchy. I think that is the same kind of problem that will need to be solved in order to have a generic graph data type. This kind of thing will also need to be implemented at the compiler level, where generic graph algorithms are adapted to the generated hierarchy of structures; and if you add a theorem prover that can check that certain subgraphs will always have certain structures, you can statically generate those procedures, while for the other supergraphs those methods will be generated dynamically at runtime.

So whoever solves the problem of generic graph drawing will have the ability, or the insight, to implement this too.

swagasaurus-rex
3 replies
21h11m

The illustrations I've seen of neural networks really highlight the sheer incomprehensibility of visualizing large graphs.

criloz2
0 replies
18h14m

Wow, this is really amazing, thank you

cloogshicer
0 replies
18h44m

Super interesting paper on an alternative way to render graphs. Thanks for posting!

samatman
0 replies
17h42m

Some algorithms do better at this than others, but "make a good diagram of a graph" is an intelligence-complete problem in the general case. Two people might render structurally-identical graphs in very different ways, to emphasize different aspects of the data. This is in fact a similar problem to the "generic graph algorithm" and "generic graph data structure" problems.

Graphs straddle the line between code and data. For instance, any given program has a call graph, so in a real sense, the "generic graph algorithm" is just computation.

nine_k
0 replies
22h59m

Ideal things are often tree-like. Real-world structures are usually DAGs if they are nice and well-behaved.

Making things planar, or almost planar with few crossings and nice clustering of related nodes, is usually hard past a couple dozen nodes :(

kjqgqkejbfefn
0 replies
20h26m

Graph drawing tools

It's hard

Graphviz-like generic graph-drawing library. More options, more control.

https://eclipse.dev/elk/

Experiments by the same team responsible for the development of ELK, at Kiel University

https://github.com/kieler/KLighD

Kieler project wiki

https://rtsys.informatik.uni-kiel.de/confluence/display/KIEL...

Constraint-based graph drawing libraries

https://www.adaptagrams.org/

JS implementation

https://ialab.it.monash.edu/webcola/

Some cool stuff:

HOLA: Human-like Orthogonal Network Layout

https://ialab.it.monash.edu/~dwyer/papers/hola2015.pdf

Confluent Graphs demos: makes edges more readable.

https://www.aviz.fr/~bbach/confluentgraphs/

Stress-Minimizing Orthogonal Layout of Data Flow Diagrams with Ports

https://arxiv.org/pdf/1408.4626.pdf

Improved Optimal and Approximate Power Graph Compression for Clearer Visualisation of Dense Graphs

https://arxiv.org/pdf/1311.6996v1.pdf

hobofan
0 replies
23h20m

we are used that everything around us have some kind of hierarchy

I think the problem is more that we are used to the illusion/delusion that everything is hierarchical. The problem we then encounter with graph drawing is that it has to try to reconcile the fact that things in practice are rarely really hierarchical, and it's hard to draw those lines of where the hierarchies are with mathematical rigor. And that problem gets worse and worse the fewer properties you are allowed to assume about the underlying graph structure (connectedness, cyclic/acyclic, sparse/dense).

In practice, when you want to build a UI that interacts with graphs, it's often feasible to determine/impose one or two levels of meta-hierarchy with which you can do clustering (this reduces the layout-destroying impact of hairball nodes and improves rendering performance by reducing node count) and layout with fCOSE (Cytoscape.js has an implementation of that).

rhelz
6 replies
23h34m

Ya, the central obstacle is that:

1. For simple and small graph problems, a simple vector-of-vectors adjacency list is easy enough to code up.

2. For complex and huge graph problems, the only way to get performant solutions is to tailor the graph implementation to the specific details of the problem to be solved.

And it's hard to see what kind of language support would help, other than just having a super-smart compiler which could analyze the code and determine whether an adjacency list, matrix, 3d array, etc. was the best way to implement it. That's the kind of optimization we won't see in compilers for a while.

It's another instance of the phenomenon which Stroustrup noticed: we are really good at code sharing of small things like vectors, and of large things like operating systems. It's the middle-sized problems we are bad at.

twoodfin
2 replies
21h39m

And it's hard to see what kind of language support would help, other than just having a super-smart compiler which could analyze the code and determine whether an adjacency list, matrix, 3d array, etc. was the best way to implement it. That's the kind of optimization we won't see in compilers for a while.

I’m not so sure? Looking at an algorithm against an abstract graph type, then filling in the implementation to optimize for the particular algorithm, seems right in the wheelhouse of code-specialized LLMs.

rocqua
0 replies
20h20m

My experience with Cypher is that the query optimizer doesn't know enough about the graph to pick up on trivial optimizations. This can't be fixed without a way to tell the optimizer about those properties, and even just designing a language to tell the optimizer those things is difficult.

Just an LLM looking at your query isn't going to cut it. It will need to take your actual data into account.

rhelz
0 replies
21h26m

Good point. The game has really changed in terms of what kinds of programs we can write now. Perhaps it's too pessimistic to not expect these sorts of optimizing compilers soon.

Sounds like a good research opportunity to me.

js8
2 replies
19h46m

we are really good at code sharing of small things like vectors, and of large things like operating systems. It's the middle-sized problems we are bad at.

Interesting. But I am not sure we are good at sharing small things - every programming language has its own implementation of vectors. Within one language ecosystem, the API of a vector is small, and that's probably what makes it easy to share.

For operating systems, the API is relatively small compared to the internal complexity of the OS. This is also true for libraries for numerical problems, which are also easily shared. But the more you want to customize things (e.g. share a complicated data structure), this complicates the API and inhibits sharing.

So it seems this is determined by the surface area (relative size of the API) of the thing being shared.

swiftcoder
0 replies
6h23m

every programming language has its own implementation of vectors

Many programming languages have more than one implementations of vectors. Turns out you want tiny vectors stored on the stack, and big vectors stored on the heap...

rhelz
0 replies
1h39m

Well, we could always be better at sharing small things; but recall, the comment was made by Bjarne Stroustrup, and he probably thought that he had pretty much nailed the vector by that time :-)

The point of the OP is a bit broader than that: for something like a vector, we have at least figured out some language features which would help a programmer make an efficient and generic implementation. Templates are not great, but at least they are something.

For graphs, we don't even have that. What kind of built-in graph support would work for pathfinding in a video game, or the internet, or a social networking graph a la Facebook, or a routing graph routing a 100-million-transistor chip?

We are getting better at abstraction all the time, but to abstract across all these kinds of applications is something which eludes us. It's really hard to see how you could give a programmer anything which would actually save them some time.

rb-2
6 replies
1d1h

I wonder if it would be possible to mathematically define (in a theorem proving language like Coq) a bunch of accessor methods as well as a bunch of implementation primitives and then "compile" a custom graph implementation with whatever properties you need for your application. Some accessor methods will be very efficient for some implementations and very inefficient for others, but every method will still be available for every implementation. Profiling your application performance can help adjust the implementation "compiler" settings.

Ironically, this is a graph problem.

andoando
1 replies
23h27m

I've been thinking about something like this: a mathematical definition of a function such that we can search for it. Imagine we had something like "Find a function that fits this signature: input arr[numbers], output such that for every adjacent pair (x1, x2) in arr, x2 > x1."

lupire
0 replies
22h18m

That's https://hoogle.haskell.org/ plus dependent types (data constraints).

Without human provided dependent typing, the search engine would be almost as hard to write as a system to directly generate the code you need.

JonChesterfield
1 replies
23h39m

I think this is a worthwhile direction.

For example, I'd like to program against a sequence abstraction. When sort is applied to it, I hope it's a vector. When slice or splice, I hope it's some sort of linked structure. Size is as cheap as empty for the vector but much more expensive for a linked list.

It should be possible to determine a reasonable data representation statically based on the operations and control flow graph, inserting conversions where the optimal choice is different.

The drawback of course is that people write different programs for different data structures. Knowing what things are cheap and what aren't guides the design. There's also a relinquishing of control implied by letting the compiler choose for you that people may dislike.

As an anecdote for the latter, Clojure uses vectors for lambda arguments. I thought that was silly: since it's a Lisp that mostly works in terms of seq abstractions, why not have the compiler choose based on what you do with the sequence? The professional Clojure devs I was talking to really didn't like that idea.

samatman
0 replies
17h57m

Clojure uses vector syntax for lambda arguments. `read` sees a vector. What comes out of eval is a lambda. Does a Vector get built in the process? You'd have to check, my bet would be that the argument list spends a little while as a Java Array, for performance reasons, but that a Clojure Vector is not actually constructed.

maxiepoo
0 replies
3h50m

You can do something like this with OCaml/SML's module system.

And certainly from an abstraction point of view you can do this in any dependently typed language like Idris/Agda/Coq, but these don't have great implementations.

kevindamm
0 replies
1d

This sounds like Partial Evaluation and the Futamura Projection. The research around that shows that your interpreter determines the shape of the compiled output, so a formal proof of its application isn't necessary, if the mix-equivalent has the appropriate syntax and semantics for graph processes in its design.

I know this has been done for procedural languages and for declarative logical languages but I'm not aware of something like this specifically for graph processing and highly specialized code generation of graph processing. I wouldn't be surprised if Mix has been extended for this already, even if it has I'm sure there is still value in it.

montmorency88
6 replies
1d1h

I'd highly recommend Erwig's FGL library in Haskell as a really nice example of a generally performant graph data structure that is easy to work with. The API feels like working with lists because you are essentially consing contexts (node, neighbours) into a list of contexts that form your graph. Many standard graph algorithms are then built up from depth- or breadth-first search, and you can compose really succinct programs to manipulate the graph. Graphs are labelled with any Haskell data structure, and there is a graphviz library complementary to FGL to generate dot files which can be rendered according to the data carried in the node labels. Often in an application you want to both perform computations on your graph and render a visualization simultaneously, to the end user or for debugging, and in FGL you minimise duplication of application and rendering logic.

tikhonj
4 replies
1d

FGL is a great example of how to make a "nice" high-level graph interface suited for functional programming. I'm a big fan. But it's orders of magnitude too slow and memory-inefficient for performance-sensitive graph computations—if you have even moderately sized graphs and graph algorithms are a bottleneck, you'll need to use something else, and probably something domain-specific. Given the way the interface works, I don't think you could have a high-performance version that would scale to large graphs.

In my experience this leaves FGL in an awkward spot: on the one hand, it isn't sufficient for heavy-duty graph processing; on the other, if you don't need fancy high-performance graph algorithms, chances are that encoding your problem as a graph is going to be more awkward than defining some domain-specific types for what you're doing. Graphs are such a general structure that they're usually the wrong level of abstraction for higher-level domain-specific logic.

Of course, sometimes you're writing graph code specifically and you need a nice way to express your graph algorithms without worrying about performance. In that case, FGL is great. I wrote a tutorial about using it to [generate mazes][1] and it helped me express the algorithms better than I would have been able to do without it. But that still leaves it as too narrow for something to be "the" graph representation in a language's standard library.

[1]: https://jelv.is/blog/Generating-Mazes-with-Inductive-Graphs/

montmorency88
1 replies
23h14m

Interesting. Under the hood FGL is mapping the graph to relatively efficient data structures like Patricia trees as implemented in Data.IntMap, so I would expect it to scale reasonably for inserting edges and mapping over nodes. I agree the memory inefficiency is definitely a limiting factor of the library. As you say, I think it is best suited for expressing graph algorithms, and if those calculations become the bottleneck a custom solution can be developed with the proof of concept already in place. Out of curiosity, what number of nodes/edges would you consider a "moderately sized graph"? My use cases are typically on the order of 500-1000 nodes with a similar number of edges that require BFS and topological sorting.

tikhonj
0 replies
22h39m

I don't have an exact number in mind—I imagine it's pretty context-specific—but 500–1000 nodes seems like it would qualify.

I've played around with IntMap before and it's not a great data structure. It's a binary Patricia trie, which means that you quickly get a relatively deep tree with lots of pointer traversals. Unless I've managed to confuse myself on how it works, you'd end up with, what, at least 10 traversals to look up a value from 1000 keys?

Chris_Newton
1 replies
21h35m

But it's orders of magnitude too slow and memory-inefficient for performance-sensitive graph computations—if you have even moderately sized graphs and graph algorithms are a bottleneck, you'll need to use something else, and probably something domain-specific.

This seems a little pessimistic to me. There are plenty of application domains that can be conveniently represented using graphs where you might have thousands of nodes and edges — which is what I’d characterise as “moderately sized” — and your needs might only extend to relatively simple and efficient graph algorithms. FGL is excellent in this kind of scenario.

If you do need the kind of algorithms that explode in complexity then even a representation a couple of orders of magnitude more efficient won’t help you much either. Big-O is the thing that is going to spoil your day in this story, not the constant factor. Some problems simply don’t have convenient fast solutions and ideally with those you find a way to change the representation so the original problem doesn’t arise in the first place.

It’s true that there’s also a zone where you have significantly larger graphs but still only need computationally tractable algorithms, and in that case the overheads of a library like FGL become a factor in what is viable. I also don’t disagree with you (and Hillel in the original piece) that it would be difficult to define comprehensive graph functionality to include in a standard library when there are so many different trade-offs involved.

A good — and not entirely unconnected — analogy might be calculating with matrices. It’s convenient to have support for simple but widely useful cases like 3x3 and 4x4 built into your language or standard library. However, once you’re solving systems with hundreds or thousands of rows, you probably want more specialised tools like BLAS/LAPACK, and the structure of your matrix and how you can decompose it start to matter a lot more.

kccqzy
0 replies
17h38m

In Haskell though I think Alga has an even nicer API. Don't know about performance as I haven't had a need to use Haskell to process enormous graphs. https://github.com/snowleopard/alga

ogogmad
5 replies
22h29m

I suspect a lot of graph theory can be reduced to abstract algebra. You consider matrices over lots of different scalar types. The scalar types should be:

- Rigs (rings without negation),

- idempotent (that is, where x + x = x for all x),

- equipped with involution (so that undirected graphs can be made the default by restricting the matrices which represent graphs to only self-adjoint matrices),

- and the entries of the matrix can be restricted to a *-ideal. Note that a *-ideal can be considered a scalar type in its own right.

Different choices of the above scalar types can be used to capture different graph types: Weighted, directed, undirected, bipartite.

There's no clue to physical implementation, other than that sparse graphs can be treated like sparse matrices. Anybody tried this? How did it work out?
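
For what it's worth, here is a sketch over one such scalar type: the tropical (min, +) rig, which is idempotent (min(x, x) = x). "Multiplying" adjacency matrices over it performs all-pairs shortest-path relaxation (illustrative Rust, not any particular library):

    const INF: f64 = f64::INFINITY;

    // Matrix "multiplication" over the (min, +) rig: min plays the role of
    // addition, and + plays the role of multiplication.
    fn min_plus(a: &[Vec<f64>], b: &[Vec<f64>]) -> Vec<Vec<f64>> {
        let n = a.len();
        let mut c = vec![vec![INF; n]; n];
        for i in 0..n {
            for j in 0..n {
                for k in 0..n {
                    // rig "multiply-add": c[i][j] = min(c[i][j], a[i][k] + b[k][j])
                    c[i][j] = c[i][j].min(a[i][k] + b[k][j]);
                }
            }
        }
        c
    }

    fn main() {
        // Weighted adjacency matrix of a 3-vertex digraph; INF means "no edge",
        // 0.0 on the diagonal is the rig's multiplicative identity.
        let g = vec![
            vec![0.0, 5.0, INF],
            vec![INF, 0.0, 2.0],
            vec![1.0, INF, 0.0],
        ];
        // Repeated squaring converges to all-pairs shortest paths.
        let g2 = min_plus(&g, &g);
        let g4 = min_plus(&g2, &g2);
        println!("{:?}", g4); // g4[0][2] == 7.0, via 0 -> 1 -> 2
    }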

bigbillheck
2 replies
22h7m

You're a couple decades too late: https://graphblas.org

ogogmad
1 replies
22h5m

If you read what I wrote, you'd see I never claimed originality. Anyway, that's cool. I'll need to check how much abstract algebra stuff they use.

Using semirings (uh, rigs) alone isn't impressive. Do they consider semirings with more algebraic structure attached to them?

bigbillheck
0 replies
19h27m

Using semirings (uh, rigs) alone isn't impressive. Do they consider semirings with more algebraic structure attached to them?

Wiki says they have R, the two tropical semirings, the 'max-min' semiring, and GF(2). The tropical and max-min have your idempotency requirement, all but the max-min have involution.

yen223
0 replies
18h12m

According to the article, the physical implementation of graphs is precisely the thing you cannot ignore, because the performance difference between a generic solution and a "correct" solution is huge, basically the difference between the algorithm not completing and the algorithm completing.

That's the reason why it's hard to come up with a single one-size-fits-all graph implementation.

aweinstock
0 replies
16h0m

The post "A Very General Method of Computing Shortest Paths" (https://r6.ca/blog/20110808T035622Z.html) shows that the Gauss-Jordan algorithm for solving matrix equations, the Floyd-Warshall algorithm for all-pairs shortest paths, and Kleene's algorithm for converting DFAs to regular expressions, are all the same algorithm on matrices of elements from a star-semiring.

4ad
5 replies
1d1h

Ok, trees are not graphs but they are very related, and algebraic data types are trees, so they are ubiquitous in functional programming.

Why don't we have graphs in FP (or in Rust)? Because graphs require mutation (respectively break linearity).

Why don't we have graphs in imperative languages? Perhaps because very few imperative languages have ADTs? Just a thought.

zokier
1 replies
1d

In what sense graphs require mutation?

joshlemer
0 replies
23h46m

I think maybe they're talking about how, in order to create a data structure where two nodes contain a reference (edge) to each other, you have to resort to mutation. i.e. you have to create A, then B with reference to A, then mutate A to refer to B, or you have to create B, then A with reference to B, then mutate B to refer to A.

Though this ignores that there are other ways to represent graphs, such as adjacency matrices, etc.
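
A sketch of one such mutation-free alternative, with nodes referring to each other by index into a shared table (illustrative Rust, names invented for the example):

    struct Graph {
        labels: Vec<&'static str>,
        neighbors: Vec<Vec<usize>>, // neighbors[i] lists node indices adjacent to node i
    }

    fn main() {
        // a is node 0, b is node 1; each refers to the other,
        // with no mutation after construction.
        let g = Graph {
            labels: vec!["a", "b"],
            neighbors: vec![vec![1], vec![0]],
        };
        println!("{} <-> {}", g.labels[0], g.labels[g.neighbors[0][0]]);
    }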

harporoeder
0 replies
23h27m

Even with direct mutual references between some node type this can be represented in a lazy functional language such as Haskell pretty easily without mutation.

boothby
0 replies
1d

I disagree: trees are connected graphs with n nodes and n-1 edges. They're graphs with a special structure that is highly exploitable for performance gains, but they're graphs.

bigbillheck
0 replies
22h1m

Why don't we have graphs ... in Rust?

You might have missed this from the article but: https://docs.rs/petgraph/latest/petgraph/index.html

Because graphs require mutation (respectively break linearity).

I don't think this is actually the case. Graph nodes go in one container (`Vec` or `HashMap` or `BTreeMap`), and the edges go in another container (`HashMap` or `BTreeMap`). The object in which you store the node only needs to know what its name is, you can let something else know what its neighbors are.
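
A minimal sketch of that split, assuming indices as node names and a HashMap adjacency table (this is illustrative, not petgraph's API):

    use std::collections::HashMap;

    struct Graph {
        nodes: Vec<String>,                // node payloads; the index is the node's "name"
        edges: HashMap<usize, Vec<usize>>, // adjacency: node index -> neighbor indices
    }

    impl Graph {
        fn new() -> Self {
            Graph { nodes: Vec::new(), edges: HashMap::new() }
        }
        fn add_node(&mut self, payload: String) -> usize {
            self.nodes.push(payload);
            self.nodes.len() - 1
        }
        // No fight with the borrow checker: edges never hold references into `nodes`.
        fn add_edge(&mut self, from: usize, to: usize) {
            self.edges.entry(from).or_default().push(to);
        }
    }

    fn main() {
        let mut g = Graph::new();
        let a = g.add_node("a".to_string());
        let b = g.add_node("b".to_string());
        g.add_edge(a, b);
        g.add_edge(b, a); // cycles are fine; no ownership issues
    }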

ogogmad
1 replies
9h12m

How flexible and performant is that in different situations?

rdtsc
0 replies
5h39m

I had never used it, aside from a simple dependency checker. But I know the compiler and the Erlang types analyzer (dialyzer) uses them.

DylanSp
1 replies
19h23m

Erlang's briefly mentioned at the end of the article:

There are two other languages I found with graph types: Erlang and SWI-Prolog. I don’t know either language and cannot tell when they were added; with Erlang, at least, it was before 2008. I reached out to a person on the Erlang core language committee but did not hear back.

rdtsc
0 replies
19h1m

Oh I did miss that. Thanks for pointing it out.

mizzlr_
4 replies
1d1h

graph is a data structure, not a data type. if you squint enough, pointers are the building blocks (the data type, if you please) for building the graph data structure. a->b is pointer access; it looks like an edge in the graph.

the graph data structure is the parent of the tree; code execution / function call stacks work like a tree, think flame graphs.

stacks and pointers are baked into assembly and cpu architecture. your claims couldn't be farther from the truth.

renewiltord
1 replies
1d

Pointer-based graph structures will make matrix algos painful to implement. A graph is a concept. The article is quite thorough about how it walks through the various subtypes. Would recommend reading.

Mizzlr
0 replies
7h48m

Firstly, I would recommend reading this comment by me https://news.ycombinator.com/item?id=39592444#39596277

That should provide you with some more context about my earlier comment.

By definition of concept (think ConceptNet), anything is a concept. Any noun is a concept. Graph theory defines a graph as a pair of two sets: the set of nodes and the set of edges, where each edge is itself a set of two nodes (or a tuple of two nodes if the directionality of the edge also needs to be encoded). A node is anything that you can consider putting into a set. And according to set theory, a set is a well-defined collection of things.

According to the Web Ontology Language, a "thing" is the root of all things that can exist (see https://www.w3.org/TR/owl-ref/, specifically owl:Thing), except "nothing" maybe.

What all this means is that a graph is a collection of things, with things sometimes pointing to each other.

Pointers are the underlying data type that makes all other higher level data structures possible, including arrays, matrices, hashmaps, graphs, structs and more.

Chris_Newton
1 replies
21h5m

graph is a data structure, not a data type.

It can refer to either.

Any concrete data structure that uses indirection — which means pretty much anything more complicated than dense arrays and records — is indeed a form of graph.

But graphs, and more constrained forms like DAGs and trees, can also be abstract data types, implemented by a variety of concrete representations.

One of life’s little ironies is that implementing a general abstract graph using a general concrete graph whose records and pointers correspond (roughly) 1:1 with the nodes and edges in the abstract graph is often a poor choice.

Moreover, it’s not unusual to have an abstract graph implemented using a non-graph data structure (for example, a dense adjacency matrix) or to use a graph-like data structure to implement an abstract data type whose interface doesn’t look particularly graph-like (for example, a piece table).

Mizzlr
0 replies
8h1m

A data structure is a group of related data.

A graph is a group of two sets, the set of nodes and set of edges.

You may define operations on the data structure, which is what makes it an abstract data type.

In case of graph, for example, you can define connectivity check (existence of an edge). And graph theory provides plenty more.

And a set (in the set-theoretic sense) is also a data structure; you may define the membership check as an operation on it.

On the other hand, a data type is a tag that a compiler associates with raw bits and bytes in memory in order to operate on them. Examples: datetime is a data type, string is a data type, array is a data type, numbers are data types; these are not data structures. These are language primitives.

Further, graph is generalized by the higraph (which is a foundation for relational databases and tuple algebra), and specialized by, for example, DAGs and trees.

To build an edge in a graph, you need something that can point to something else. Like A points to B: the most fundamental way to capture this mapping is by using pointers (that is, the address of B, stored at a known location accessible from A). A->B or A.B are just syntactic elements built on top of this.

Arrays, Matrices, Structs, Strings, are all made possible by pointers.

Pointers are a data type: a pointer tags a value (usually in the range 0..usize) as being the address of something else in memory. Pointers are not data structures.

I should say primitives vs non-primitives, if that makes the difference between data type and data structure clearer.

jkaptur
4 replies
22h2m

I've often wondered about a missing application: "Excel for graphs".

Just like Excel for tabular data, it would support RAM-sized data (enough to require a computer, but not so much that you need a data center), implement lots of algorithms and visualizations "well enough", and require no programming skill to operate.

As the article says, a lot of our real-world problems are graph problems - why are programmers the only ones who should have the tools to solve them?

roenxi
0 replies
9h22m

As the article says, a lot of our real-world problems are graph problems

The article struggles to back that up though. Eg, it notes that the internet can be modelled with a graph. Undeniably true. But so what? The internet can be represented as many different things and it is unclear that representing it as a graph has any generically useful engineering implications. There is an argument that is just as good that representing the internet as a neural-network (ie, a black-box matrix-encoded function of arbitrary inputs to coherent outputs) is the ideal representation for getting useful info out of it.

Maybe for someone like Google that is a billion-dollar idea (even then though, it might not be - I don't know if they represent their index as a graph or not). But the internet overall isn't much of a graph problem to many other people, and representing it as a graph doesn't solve much.

It is rare to see someone solving a real-life problem on paper as a graph. Using tables happens all the time. Graphs are common, graph problems are uncommon.

jonahss
0 replies
15h54m

I agree. And I think the key to this is VR.

Another comment in this thread is about how hard graphs are to visualize, but a 3D interface gives you a lot more room.

When VR hype began I thought "well what's the excel of VR?". Microsoft's answer was "2D spreadsheets floating in 3D space". What nonsense. I think graphs.

email my username at gmail.com if anyone is interested in exploring this together!

empiko
0 replies
20h43m

Yeah, I feel like the article is too quick with its conclusions. Many other problems can be made arbitrarily complex and difficult with additional requirements. But there are still data structures and standard libraries that provide a good-enough experience for most use cases. And if you need something extra spicy you need to build a custom solution.

The article claims that graphs are often just too big, but yeah, if you ask people who are actively working on graph algorithms they might have that sort of experience. But most programmers and users probably only work with really small graphs.

BlindEyeHalo
0 replies
4h53m

As the article says, a lot of our real-world problems are graph problems - why are programmers the only ones who should have the tools to solve them?

I think programmers and mathematicians are the only ones that model these problems as graphs. I doubt a casual person sees graphs in random real world problems.

And something I learned working in a big corporation: everything can be an Excel spreadsheet if you try hard enough.

qazxcvbnm
3 replies
23h57m

This reminds me of my quest to find the proper method to model what I've termed 'large types' in an ideal language.

As is well known, algebraic data types, as commonly found, consist of sums of products, yet a great many useful types are larger than that; some hopefully illustrative examples include:

1) the type of subsets of another type would be 2^X (hopefully demonstrating what I mean by 'large'ness);

2) in practical languages like TypeScript, the 'Partial' of a product type A x B x C would be (1 + A) x (1 + B) x (1 + C) (see the sketch after this list);

3) data structures in general, as a term amenable to some certain set of operations, when needing to be represented for performance reasons e.g. a) union-find structures (quotients?); b) a list of words and their inverted indexes for searching; c) a sorted list
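
To make example 2 concrete, here is a rough Python analogue (the type names are hypothetical): a TypedDict with total=False makes every field optional, i.e. roughly (1 + A) x (1 + B) x (1 + C):

    from typing import TypedDict

    class Point(TypedDict):
        x: int
        y: int
        z: int

    class PartialPoint(TypedDict, total=False):
        x: int
        y: int
        z: int

    p: PartialPoint = {"x": 1}  # y and z may be absent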

Reading more about type modelling, and learning of the disagreements over how even basic things like quotients ought to be represented as types, I've since resigned myself to treating this as an unsolved problem, and relegated the modelling of the kitchen sinks of types to the kitchen sink of types - i.e. the function type (curbed with suitable type constraints upon the signature - from an index type to a suitable base type) - after all, its power and province being the irreducible kernel of type polymorphism, shadow over Church's types, original sin against type decidability.

gue5t
1 replies
23h48m

These types don't seem to escape the scope of what can be described with algebraic types, but the relationships between them seem like you're looking for a notion of type-level functions: subset ≡ X => 2^X, partial ≡ A×B => (A+1)×partial(B)

qazxcvbnm
0 replies
23h36m

Consider the case of Partials - we might like to restrict the Partials to different subsets of the fields for different purposes; consider the modelling of an inverted index.

Certainly it is possible to represent each specific case as some algebraic type; but beyond trivial cases, I find that when I need such types, I quickly discover that there are myriad ways to express them, none of them uniquely natural, unlike the way a sum-of-products type (and its terms) can be pretty much unambiguously drawn from a specification.

This matters especially when e.g. I need to evolve my types in a data migration.

andoando
0 replies
23h55m

I've been thinking about something like this for the last 3 years. However, I can't find a practical reason for it, even though I am sure there are.

michelpp
3 replies
1d

I think one of the elements the author is missing here is that graphs are sparse matrices, and thus can be expressed with Linear Algebra. They mention adjacency matrices, but not sparse adjacency matrices, or incidence matrices (which can express multi- and hypergraphs).

Linear Algebra is how almost all academic graph theory is expressed, and large chunks of machine learning and AI research are expressed in this language as well. There was a recent thread here about PageRank and how it's really an eigenvector problem over a matrix, and the reality is that all graphs are matrices, typically sparse ones.

One question you might ask is, why would I do this? Why not just write my graph algorithms as a function that traverses nodes and edges? And one of the big answers is, parallelism. How are you going to do it? Fork a thread at each edge? Use a thread pool? What if you want to do it on CUDA too? Now you have many problems. How do you know how to efficiently schedule work? By treating graph traversal as a matrix multiplication, you just say Ax = b, and let the library figure it out on the specific hardware you want to target.
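
As a small illustration of the idea (sketched here with scipy.sparse rather than the GraphBLAS API itself): one breadth-first step over a directed graph is one sparse matrix-vector product.

    import numpy as np
    from scipy.sparse import csr_matrix

    # Directed edges 0->1, 0->2, 2->3 as an adjacency matrix A[i, j].
    A = csr_matrix((np.ones(3), ([0, 0, 2], [1, 2, 3])), shape=(4, 4))

    frontier = np.zeros(4)
    frontier[0] = 1.0               # start the traversal at node 0
    next_frontier = A.T @ frontier  # nonzero at nodes 1 and 2: one BFS step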

Here for example is a recent question on the NetworkX repo for how to find the boundary of a triangular mesh, it's one single line of GraphBLAS if you consider the graph as a matrix:

https://github.com/networkx/networkx/discussions/7326

This brings a very powerful language to the table, Linear Algebra. A language spoken by every scientist, engineer, mathematician and researcher on the planet. By treating graphs like matrices, graph algorithms become expressible as mathematical formulas. For example, neural networks are graphs of adjacent layers, and the operation used to traverse from layer to layer is matrix multiplication. This generalizes to all graphs.

There is a lot of very new and powerful research and development going on around sparse graphs with linear algebra in the GraphBLAS API standard, and its best reference implementation, SuiteSparse:GraphBLAS:

https://github.com/DrTimothyAldenDavis/GraphBLAS

SuiteSparse provides highly optimized, parallel, CPU/GPU-supported sparse matrix multiplication. This is relevant because traversing graph edges IS matrix multiplication, once you realize that graphs are matrices.

Recently NetworkX has grown the ability to have different "graph engine" backends, and one of the first to be developed uses the python-graphblas library that binds to SuiteSparse. I'm not a direct contributor to that particular work, but as I understand it there have been great results.

jcgrillo
2 replies
19h41m

I was really excited about RedisGraph, and sad to see it was cancelled. In my (limited) experience with graph databases they proved very frustrating because it seemed like they tried to do too much. Ultimately the way I thought of a graph was an indexing strategy into some underlying data. So I needed the graph to be very quick, but I didn't have any requirement to store actual data in it--just references. This made triples based graph storage seem very heavy-handed.

The idea of composing sparse linear transformations to optimize queries is really cool. You can get a lot of work done in one shot that way, in a manner that's just quite a lot easier on the machine than chasing pointers around.

jcgrillo
0 replies
13h24m

oh excellent! that's great news

graphviz
3 replies
23h21m

Graphviz has its own foundational graph library that's not used by any other project. It has its good and bad points.

Based on that experience, we had our very own second-system-syndrome experience.

We decided our graph library should be modular, type safe, and efficient. (These properties came up in the comments here, too.) This is probably just a variation of "good, fast, cheap - pick any two."

By modular, we mean we want to write collections of graph algorithm libraries that are developed and even compiled independently.

By type safe, we mean we want to detect programming errors during compilation, or link time at the latest. We don't want programs to throw runtime errors like "your node does not have a color attribute".

By efficient, we mean that accessing an attribute of a graph is as cheap as a field in a C struct. (So, we don't want to carry around external hash table, or do a lot of string conversions, for instance.)

You can argue about whether these things are worth the price or even make sense, but that's what we wanted. We had some famous C++ creators in our lab, so we figured we could get help, and we were willing to give C++ another chance.

Gordon Woodhull, who had been an intern and kept working for us, is a brilliant programmer, and wrote an implementation of this kind of graph library working in templated C++. It's even published at https://www.dynagraph.org/ via sourceforge.

The rest of us were not really sure we could ever understand how the code worked, so we had a code review with said famous C++ inventors. There were a lot of screens of code, and chinstroking, and greybeards pronounced "That would probably work." We knew we might have gone over the complexity cliff. (Let's not even talk about compile-time template errors, where one error fills an entire screen with details that only a... C++ inventor could love.) It was our fault, not anyone else's, and Gordon kept plugging away and even made all the dynamic graph layout stuff work, in Microsoft OLE, too.

In hindsight it was probably our own Project Xanadu. While we got lost in this, a lot of things like Gephi (Java) and NetworkX and NetworKit (python) happened.

Also, John Ellson, a very talented software engineer who had written parts of Graphviz, revitalized the main effort.

transitionnel
0 replies
23h1m

That all sounds fantastic!

samsquire
0 replies
9h5m

Thank you for graphviz.

I use graphviz dot syntax, parsing with networkx to plan expensive tool execution, and the graph structure lets me automatically parallelize.
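
A minimal sketch of that workflow (assuming pydot is installed for the dot parsing; the executor is a hypothetical stand-in):

    import networkx as nx

    g = nx.DiGraph(nx.nx_pydot.read_dot("pipeline.dot"))  # hypothetical file

    # Every node in one "generation" has no unmet dependencies, so all
    # of its tasks can safely run at the same time.
    for generation in nx.topological_generations(g):
        run_tasks_in_parallel(generation)  # hypothetical executor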

Everdred2dx
0 replies
12h54m

I love comments like this. Thank you for sharing.

superlopuh
2 replies
7h7m

This is basically my PhD thesis proposal. I don't think there's any fundamental technological problem here, just that for a graph to be efficient to process you need high-level optimisations that can take the mathematical properties of graphs into account. For that you need to either reimplement a compiler inside your framework or be integrated into an existing compiler, and both obviously take a lot of work.

Some comments here mention GraphBLAS, which is the big breakthrough in decoupling the layout of the graph from an efficient implementation of an algorithm, but none mention MLIR-GraphBLAS [0] which is the most promising integration into a compiler that I've seen.

I still think it's early days, I wouldn't throw in the towel quite yet :)

[0]: https://mlir-graphblas.readthedocs.io/en/latest/index.html

platz
1 replies
3h14m

How will I have any expectations of run-time behavior if I have to hope that my graph will fuse or fail to fuse at run time?

Reminds me of the issues that haskell programmers face when an innocuous change causes list fusion to fail tanking performance; to know how to coax the compiler to fuse again you have to have intimate knowledge of how that fusion process works which isn't visible in the API; you need knowledge of compiler implementation/behavior.

programmers do not like this kind of instability.

superlopuh
0 replies
1h27m

The same could be said of a lot of things. For example in hash maps, you might have a cliff in performance if your hash function is not good for your data distribution, but you'll still happily use the default ones until you're really sure that they're not the right tool for the job. I also feel like this depends a lot on your language philosophy, some languages generally accept the cliffs, some try to expose enough of an API for you to work around the cliff if you feel like you know what you're doing, like custom allocators etc.

I have some personal hunches about how to have better guarantees about these properties but I feel like it's ok for this to not be solved with the v1.

skrebbel
2 replies
22h40m

I think most languages support representing many kinds of graphs very well, particularly directed graphs without data on the edges: with objects, lists, and object references (or pointers).

A tree is a graph. A typical Java-style object composing other objects composing other objects again, etc etc, often with cycles and parent backreferences and whatnot, is a graph. The html DOM is a graph.

I recognize that these are often very tree-like, which feels like cheating in the same way as saying “well a list is also a graph!” is. But given that cycles are common enough that serializers (eg JSON.stringify) need to special-case those, I think maybe this is simply not true, and they’re really just graphs. Very few tree-like class structures tend to remain pure trees.

The only thing missing from references/pointers to be able to represent what the author is looking for is having data on the edges. I think this is trivially solvable by putting nodes halfway along the edge (= adding a level of indirection, an operation so common that we don’t even think of it as “adding data to the edges”).

So I think the answer is that there’s no explicit data structure named “graph” because the basic building block of composition in nearly every language (reference/pointer) is an edge, and the basic building block of data representation (objects/structs/records) is a node. So for most graphs, trying to pour it all into some fancy Graph<V, E> datastructure feels like needless complexity.

e_y_
1 replies
20h16m

Back in the C days, it was common to not use generic data structures like lists; instead you'd have a next_item pointer as a field in the struct. For linked lists, this would save you from needing another memory allocation or wrapper struct, and since C doesn't have templates you'd either have to blindly cast the type or use macros or write a type-specific iterator anyways.

Lists eventually became a standard language feature in C++ and other languages, but it's trickier for trees and graphs. Taking the DOM example, you might be searching through child elements (div, span, etc) or nodes (text nodes, comment nodes) and different operations might only work with a specific subset of the "edges". There might be pointers to other objects, like from a DOM node to accessibility tree node. You might even have multiple parent node pointers, such as a pointer that takes you to the nearest shadow root or something.

Since there are multiple ways to traverse the same data structure, generic functions don't work on it. You could create a separate tree/graph for each thing you want to use it for, but that takes additional memory and has to be updated when the original struct changes. Or you could create some kind of adapter that has a get_edges() function, but this might not be very well optimized or might be clunky for many other reasons. So it usually just ends up being simpler rolling your own functions instead of using a library.

actionfromafar
0 replies
19h56m

Bonus points for not allocating space for the next pointer if you didn’t plan to use it…

mcphage
2 replies
1d

This is a good write up, but for me it seems to miss the mark. I agree with the author that different algorithms require differently organized data structures. But what if your problem requires 2 different algorithms, which each work best with a different data structure? Either you pick one structure and run both algorithms on it (which is not ideal), or you run the first algorithm with one structure, convert it, and then run the second algorithm on the second structure.

And once you can do that… why not have every algorithm ensure it runs in its best structure, and convert where necessary (and possible) on the way in? Yes, there’s absolutely a performance or storage cost… but if the algorithm is that much faster, it should be worth it.

Basically a beefier version of sorting your data before searching in it. If an algorithm works best with a specific model of a graph, then let that be part of the algorithm.

riwsky
0 replies
22h59m

why not have every algorithm ensure it runs in its best structure, and convert where necessary (and possible) on the way in? Yes, there’s absolutely a performance or storage cost… but if the algorithm is that much faster, it should be worth it.

Because it's not that much faster, so it's not worth it. You're severely underestimating the amount of thought that went into the article, or the work of the experts interviewed.

irishsultan
0 replies
6h48m

Sorting your data before searching it will only pay off if you need to search multiple things. If instead you need to search for one specific thing then going through things linearly is O(n) while sorting and searching the sorted result will be O(n log(n)).

js8
2 replies
19h54m

Another data type that would be quite useful is a table (like in a database). For the same reasons, too many design choices.

Anyway, that being said, I have felt that progress will be made in programming languages if the compiler gets to choose an implementation of a data structure, kinda like when a database chooses an execution plan. So you just use an abstract structure (like sequence, map, set, table, graph) and based on the program profile, the compiler will pick the specific implementation. It will also transform the structure into another isomorphic one as needed. (Some programming languages already do something like this, for example, array of structs to struct of arrays conversion.)

Kamq
1 replies
18h2m

So you just use an abstract structure (like sequence, map, set, table, graph) and based on the program profile, the compiler will pick the specific implementation. It will also transform the structure into another isomorphic one as needed.

I'm so not looking forward to having to debug a sudden change in perf characteristics when one additional usage of some feature tips a heuristic over the line and an implementation gets swapped out between builds.

js8
0 replies
11h7m

a sudden change in perf characteristics when one additional usage of some feature tips a heuristic over

This already happens with humans: changing features will change how the product is used, and thus performance characteristics change.

The question is, do you trust the compiler to do a good job? Of course you won't at first; until the late 90s, people didn't trust compilers to do a better job than humans writing assembler.

So it's important to have a good UX for this feature, where the compiler communicates which data structures it is using, and gives humans the option to override its decisions, so that users gain trust in this feature.

jerf
2 replies
1d1h

This is a good article and I endorse it.

I would supplement it with the observation that when I was a younger programmer, like many people, I considered "generic" or "flexible" a positive when describing a library or framework. I have come to see it as generally negative, especially when the developer's summary puts these adjectives or something similar front and center.

Let me show you the most flexible possible Javascript framework. This will look like a joke, but it's not. It fits perfectly into an HN post. The most flexible possible JS framework is simply:

    eval
Similarly flexible frameworks exist for other dynamic scripting languages. For static languages one must invoke the entire compiler as the framework. Of course, if you think about it hard enough, I'm doing that for dynamic languages here too, it just has a snappier representation for dynamic languages.

Frameworks and libraries provide their value precisely through limiting things, and then building on those limitations. Of course, the limitations must be well-chosen, to make what can be built on them interesting enough to pay for the limitations the framework chooses. But the essence of them is in their limitations. I start out from the get-go with the maximally flexible framework my language allows, which is itself the language, and the additional framework needs to impose limitations on my code in order to do anything useful.

(A problem when framework designers don't understand this is that they make a series of little incorrect design decisions that can often add up to a real pain. For instance, if I were to design a web framework that took over some amount of routing from the user, I would still leave you the ability to claim some bit of the URL space and route it entirely out of my framework, because I understand that my framework is based around limitations and you may need to expose a URL to something that can't work under those limitations. But someone who doesn't realize that frameworks intrinsically involve limitations might fail to give that callout because they can't imagine that someone might have a problem that their framework is not "flexible" and "generic" enough to handle. Imagine a CRUD framework, even a very good one, but I need to offer an endpoint based on streaming server events in the same URL space, which is intrinsically foreign to the CRUD framework's concept of page loads. This is just one example; real frameworks designed without this understanding will make dozens or hundreds of such little mistakes.)

Graphs have the same problem. It seems like they're so flexible and generic that they ought to be more used and more useful. But that's precisely what kills them. Even if you nail down the problem to exactly one representation, they still don't fit. For instance I have a great need for data structures that don't admit cycles, but if all I have is a graph, imposing that limitation from a coding perspective is a real challenge. Mathematically it's trivial, I just say "and this graph has no cycles" et voila [1], there are no cycles, but in code I need to enforce that somehow and there's no trivial solution to that.

Another way of viewing graphs is that we do work in graphs all the time, precisely because everything in RAM can be seen as a graph. GC algorithms even generally work by viewing everything in very raw graphy terms. It just turns out the API you'd expect to work over a graph just isn't useful in the general sense when applied to everything in a programming language's memory space. It seems like it ought to be, but it just isn't. It may seem like it would be great to have a set of employees and extract their names and then look that up into another database etc. etc. with a general graph query language or something, but it turns out the special considerations at each layer make it so that what the general purpose programming language is already doing is actually generally better. The details at each layer matter.

I like the metaphor of architecture astronautics and have often discussed "the 30,000 foot view" versus the view on the ground here on HN, and to my mind the key to the metaphor isn't the nerdery of being an astronaut or the difficulty. The key is that when you get high up, everything looks the same. It feels like graphs ought to be awesome when you're looking down at the entire computing landscape from metaphorical low Earth orbit. But down in the trenches, the local concerns overwhelm that viewpoint... and this is real. This is not just because we all suck or we don't try hard enough or we just Don't Get It. It's real. The architecture astronaut is just wrong in this case. It's not even a beautiful vision this world isn't good enough to manifest or any such conciliatory thing... it's just wrong. It is good to write good code and reduce the amount of bespoke details to be considered. The programming community has made great progress there and there is still great opportunity to do more. But there are an awful, awful lot of details in the world, and the world being detailed is fundamental.

[1]: Or if you are, like me, kinda a fan of surprise stringed instrument attacks, et viola.

jcgrillo
0 replies
4h17m

Even if you nail down the problem to exactly one representation, they still don't fit.

This is exactly the problem I've found with graph databases. I've never successfully used a graph database to solve a problem, and multiple other engineers I've spoken to have bounced off them in a similar way. The problem is I don't have an arbitrary graph problem, mine is specific, and as you say the choices a graph database makes matter. It really gives me greater appreciation for the relational model because so many things can be made to work on a relational database. It may not be elegant, but it works.

I think the way I'd like to approach graphs, if I do it again, would be to use a graph represented as sparse matrices in memory as an index. This is more or less in line with what you get from expressing the graph problem in application code, but maybe easier to understand and maintain? I guess that is to say I'm still optimistic there might be some general purpose solution like RedisGraph (now FalkorDB) that makes sense to use this way, but I'm not sure I'd try to use an out-of-core graph database again.

Terr_
0 replies
22h1m

when I was a younger programmer, like many people, I considered "generic" or "flexible" a positive when describing a library or framework [...]

I've come to prefer what I call "design for deletion": most of those long-term "flexibility someday" needs are best met by making sure the inflexible modules or flows can be clearly identified and ripped out for replacement. This leads to a certain kind of decoupling, although with a higher tolerance for coupling that can be kept in check by static analysis.

This is a contrast to my days of youthful exuberance where I thought I could solve the problem by making my work extensible or customizable. No, I cannot make the immortal program, so I should focus on making a mortal one which can pass gracefully.

hobofan
2 replies
1d

I think the post mostly answers the question "why are graph _algorithms_ not better supported in programming languages", with a focus that is much more on "big data" graph processing than on graph support in general.

I think if you look at graph support in general you are also looking at wider questions, like "why are OGMs (Object Graph Mappers) not as popular as ORMs" and "why is JSON so prevalent while RDF (or another low-level graph serialization) isn't"?

And I think in the end it comes down to historic reasons (RDF emerged a bit too early, never evolved, and accrued an ecosystem of horrible academic standards and implementations), and to graphs having just a smidge more inherent complexity in implementation and learning curve, which just doesn't scale well across many developers.

------

I also wouldn't put too much weight on the "Graph Querying Language" part of the article. It sadly reads like exactly the marketing copy you would read from Neo4J or SPARQL enthusiasts that haven't tried building a product on top of it.

The main difference between all GQLs and SQL is that the “joins” (relationships) are first-class entities.

Joins are first-class entities in SQL. They even have their own keyword (hint: it starts with J and ends with OIN) ;)

If you also go to the lower levels of any graph query language and look at its query plans, you'll notice that there isn't any meaningful difference from those of an SQL-based query. The standardization of GQL[0] as an SQL extension is evidence for that.

In SPARQL relationships are just edges, making the same query easy.

SPARQL is easy if you need to do exact path traversals. If you try to do anything sophisticated with it (like you would in the backend of a webapp), you'll quickly run into footguns like joins with unbound values, where you accidentally join your whole result set away.

[0]: https://en.wikipedia.org/wiki/Graph_Query_Language

lmm
1 replies
20h11m

Joins are first-class entities in SQL. They even have their own keyword (hint: it starts with J and ends with OIN) ;)

Having its own keyword is pretty strong evidence that something isn't first-class (e.g. typeclasses are not first-class in Haskell; control flow is not first-class in most programming languages).

eyelidlessness
0 replies
15h32m

I think OP is using “first class” as in “an explicit semantic affordance”, and you seem to be using “first class” as in “a supported operand” (or similar, feel free to correct me if I’m misunderstanding). In which case both points are right, albeit orthogonal.

gue5t
2 replies
1d

One interesting perspective is to view the sequence of lists -> trees -> DAGs -> general graphs as a loosening of restrictions:

- list nodes may have one child

- tree nodes may have multiple

- DAG nodes may have multiple parents though restricted by topological ordering

- graph nodes may have multiple parents from anywhere in the collection

Lists and trees can be fully captured by sum and product types, but extending this representation style to DAGs and graphs doesn't work--you either get inefficiency (for DAGs) and then infinite regress (for cyclic graphs) attempting to continue the "syntactic" style of representation, or you need to adopt an "indirect" representation based on identifiers or indices or hash consing.
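
A tiny sketch of the "indirect" style: nodes live in a flat table, edges refer to indices, and a cycle causes no infinite regress.

    nodes = ["a", "b", "c"]
    successors = {0: [1], 1: [2], 2: [0]}  # a -> b -> c -> a: a cycle,
                                           # with no recursive values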

schindlabua
0 replies
23h0m

I like thinking of this concept via free constructions.

As is somewhat commonly known, the free monoid is the List type; monoids are not commutative so we get a sense of "direction", like a list has a start and an end.

If we add commutativity and look at free commutative monoids, we find they are equivalent to multisets.

If we take associativity away and look at free magmas, we get binary trees, I think?

In some sense removing constraints from the binary operator results in more general free types. Would be interesting to find what free construction makes digraphs but I have to bounce.

nickpsecurity
0 replies
1d

That’s a neat idea. The other side of it might be expressiveness.

The more expressive constructs are usually more productive and concise. The less expressive constructs are usually easier to optimize or analyze with a machine. The old rule, like in LANGSEC, is to pick the least expressive option that works. Some people also develop transforming code (e.g. metaprogramming) to let you write highly expressive code that generates correct, low-expressiveness code.

fatherzine
2 replies
21h19m

What an awesome article. Kudos to the author!

On the core observation "there are too many implementation choices", that is not quite right. True, the author mentions 4, and there are further variations. In practice, a library can:

1. Implement all suitable graph representations.

2. Implement algorithms tailored to the representation(s) that offer the highest performance.

3. Provide transformations from one representation to another. Via a canonical intermediate representation this is O(#representations), trivial to implement and trivial to use. Quite a fair workload for both maintainers and users.

4. Bonus, provide import / export transformations from / to common standard library datatypes and idioms.

Memory and transformations are cheap, 99% of use-cases would likely find the overhead of transforming data, both in RAM and CPU, negligible.
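
For illustration, a sketch of point 3 under the simplest possible assumptions (unweighted digraph, nodes numbered 0..n-1): each direction of the conversion is only a few lines.

    def adjacency_list_to_matrix(adj, n):
        # adj maps a node id to the list of its successor ids.
        matrix = [[0] * n for _ in range(n)]
        for u, successors in adj.items():
            for v in successors:
                matrix[u][v] = 1
        return matrix

    def matrix_to_adjacency_list(matrix):
        return {u: [v for v, bit in enumerate(row) if bit]
                for u, row in enumerate(matrix)}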

Edit: "the harsh truth of working at Google is that in the end you are moving protobufs from one place to another." -- https://news.ycombinator.com/item?id=20132880

josephg
1 replies
20h58m

Sounds like the makings of a huge library that I’m not sure I’d even use in my work. I use graphs heavily in my work, and my experience matches the people the author interviewed.

We always end up reimplementing graphs because:

- Performance matters, and no off the shelf graph library I’ve seen can take advantage of many of the regularities in our particular data set. (We have an append-only DAG which we can internally run-length encode because almost all nodes just have an edge pointing to the last added item).

- I haven’t seen any generic graph library which supports the specific queries I need to make on my graphs. The big one is a subgraph diffing function.

- Writing something custom just isn’t much work anyway! Graphs are way simpler to reimplement than btrees. You can have a simple graph implementation in tens of lines. Our highly optimised library - with all the supporting algorithms - is still only a few hundred lines of code.

I think it would be handy to have ways to export the data into some standard format. But eh. I think pulling a library in for our use case would add more problems than it would solve.

therobots927
0 replies
57m

What do you mean by “subgraph diffing”? I work with graphs a lot and use SQL almost all the time. Sometimes I need to compute connected components with python.

csb6
2 replies
22h7m

This is an interesting article but one major effort it doesn’t mention is the Boost Graph Library [0].

It is kind of clunky in places because it is written in C++03 and uses some weird idioms to simulate keyword arguments and provide generic ways of getting attributes for nodes. Also it suffers from the terrible template instantiation errors that most C++ template libraries do. But I still think it addresses a lot of the difficulties covered in the article:

There are too many design choices

BGL is limited to directed/undirected multigraphs, so hypergraphs are not supported. However I think these cover most use cases.

In terms of implementation choices, BGL provides several concrete data types, such as an adjacency list and an adjacency matrix. It also provides adaptors for the Stanford GraphBase and LEDA graph libraries. If none of these are suitable you can write adaptor functions to support your custom data type. All algorithms* work unmodified on these concrete implementations.

So which algorithms should come with the library?

BGL comes with most of the common ones [1], but I do wish it came with more. The implementations of them are quite hard to read because they are written in highly generic (and before many of the conveniences offered in C++11) C++ code.

Performance is too important

Since BGL is generic using C++ templates instead of runtime polymorphism, it should (in theory) be able to work with a concrete graph implementation that is performant for a certain task and so let you reuse its generic algorithms.

I think the article describes a lot of the difficulties that Stepanov’s generic programming approach tries to solve (e.g. finding the most abstract but still efficient implementation of an algorithm, writing algorithms that depend on a limited set of type requirements, having many data types that can reuse the same algorithms). While C++ supports this style of programming it is not ideal for it, but I think BGL is the closest thing I have seen to a generic graph library that is also performant in many cases.

*Algorithms have varying requirements, e.g. some may need to be able to remove edges while others do not. But these requirements are generic and can be fulfilled by many different graph implementations.

[0] https://www.boost.org/doc/libs/1_84_0/libs/graph/doc/index.h...

[1] section 22, https://www.boost.org/doc/libs/1_84_0/libs/graph/doc/table_o...

csb6
0 replies
21h57m

As a sidenote, Stepanov’s introduction [0] to the Boost Graph Library book explains a lot about his approach to programming. He didn’t write the BGL, but his work on the STL was a major inspiration for it.

[0] http://stepanovpapers.com/siekforeword.html

akoboldfrying
0 replies
7h19m

I fully agree. Despite the syntactical clunkiness and the need to learn some new concepts, BGL's design is beautiful -- surprisingly, it mostly achieves the STL goal of implementing M algorithms on N different representations using just O(M+N) code (vs. the usual O(M*N)), without sacrificing efficiency. Like the OP, after much wandering through the desert I had come to the conclusion that there was no way to generically yet performantly model graphs -- but once I understood (the basics of) BGL, I was persuaded otherwise.

It's a little disappointing that BGL didn't make an appearance in TFA.

bevekspldnw
2 replies
1d

Was seriously hoping at the start that the article would reveal some secret easy solution to all my problems…only to learn there are none.

:’(

boothby
1 replies
23h55m

I take solace in such findings. I learned that my quest was ill-conceived, and I abandoned it; I should be glad that I'm no longer searching for what cannot be found.

bevekspldnw
0 replies
23h42m

Yes that’s true, the “smarter people than I failed at this” factor is definitely a form of solace.

andsens
2 replies
1d1h

Devil's advocate: is this maybe a case of discarding an 80% solution because you can’t do the last 20%?

I understand the constraints, but imagine how legible you could make code by replacing some key parts with a graph type that everybody knows. I honestly think that having a type that supports a small subset of possibilities and only has the simplest algorithms implemented would go a long way.

boothby
0 replies
1d

It isn't just an 80/20 problem. Imagine that you replace every linked list, binary tree, trie, etc., with a generic directed graph datatype. The resulting performance would be abysmal. The resulting notation would be horrid, too.

Here's our nice linked list:

  def last_element(ll):
      last = ll
      while ll is not None:
          last = ll
          ll = ll.next
      return last
And here's an implementation with generic graph notation:

  def last_element(g):
      for v, deg in g.out_degree:
          if deg == 0:
              return v
      return None
There are several problems with this; most importantly, there can be silent failures when g is not a linked list. But it also throws out a useful abstraction where a list is equivalent to a node, so I wrote a horrid implementation that takes O(n) regardless of the position in the list. And then comes all the baggage of representation, because you can't just represent a node with a pointer anymore.

When your data structure better reflects the, well, structure of your data, you can go faster and safer. There's a reason we teach undergrads about these specific datatypes and don't just sweep it all under a rug with "it's a graph!"

bigbillheck
0 replies
22h11m

I think it's more like discarding a 20% solution that can't do the last 80%.

adamnemecek
2 replies
1d

Mathematica, MATLAB, Maple, etc all have graph libraries of some form or another. I am not paying the thousands of dollars in licensing needed to learn more.

Wolfram provides a free Mathematica called Wolfram Engine https://www.wolfram.com/engine/. It's Mathematica without the UI. I hear you can combine it with Jupyter Notebook to get a similar experience to Mathematica.

rhodin
0 replies
22h52m

You can also use wolframcloud.com for free. Which runs the latest Mathematica online.

dekhn
0 replies
1d

Ha! I wrote the predecessor for the Python integration some ~25 years ago https://library.wolfram.com/infocenter/MathSource/585/ and I think it uses a similar approach. The Mathematica engine had a protocol for sending and receiving expressions and evaluating them. I was pretty proud of my implementation, as it was my first real attempt at doing anything remotely complicated with recursion.

JonChesterfield
2 replies
23h58m

I needed a directed graph yesterday. I gave the nodes integer ids by appending them to an arena, then used a hashtable from integer to vector of integer. Iterating over it involves a set of integers to track which nodes have already been visited.

Deduplicating nodes on insert into the arena was more hassle than cobbling together the graph structure out of a hashtable and a vector.
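
Roughly that structure, sketched in Python (names are made up):

    arena = []   # id -> node payload
    ids = {}     # payload -> id, for deduplication on insert
    edges = {}   # id -> list of successor ids

    def intern(node):
        # Reuse the existing id if this node was inserted before.
        if node not in ids:
            ids[node] = len(arena)
            arena.append(node)
            edges[ids[node]] = []
        return ids[node]

    def visit(start_id):
        # Depth-first traversal; the seen set guards against cycles.
        seen, stack = set(), [start_id]
        while stack:
            n = stack.pop()
            if n in seen:
                continue
            seen.add(n)
            yield arena[n]
            stack.extend(edges[n])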

Maybe one reason against putting graphs in the standard library is they're easily put together from more common structures for whatever special case you have in mind.

jkaptur
0 replies
22h6m

I agree with this take - graphs are in an interesting superposition where implementing a large set of algorithms correctly and in their full generality is hard (as the article points out), but getting started with an implementation that solves a particular problem well enough for a particular case is easy and fun.

PeeMcGee
0 replies
21h14m

Maybe one reason against putting graphs in the standard library is they're easily put together from more common structures for whatever special case you have in mind.

This is a fair argument (how implementations tend to combine existing structures in bespoke ways). But any time I've needed to use a graph explicitly, it hasn't really mattered what underlying structures were involved. What has mattered each time is having to invent my own little API to expose well-known/primitive graph operations, then go and implement them, which is unnecessarily error prone.

Your example of de-duplicating nodes on insert sounds like it describes a property of your particular graph that may be better expressed through a type, which would also afford the necessary API. I'm approaching this from an OOP-ish perspective so do with that what you will.

I gave the nodes integer ids by appending them to an arena, then used a hashtable from integer to vector of integer. Iterating over it involves a set of integers to track which nodes have already been visited.

This is what sucks about using graphs IMO. I don't want to think about all that stuff, I just want to think about graphs. In practice I spend most of the time toiling around with noisy boilerplate that dominates my mental model and allows graph concerns to leak into business concerns.

ChicagoDave
2 replies
20h27m

I’ve been building an interactive fiction platform built in C# with a world model contained within a bidirectional graph data structure.

Since IF stories are relatively small in graph terms, it’s a reasonable solution.

Locations and objects are nodes and movement and location of objects are edges. Nodes and edges both can have dynamic properties.

I’ve also noticed the lack of graph structures in programming languages, so this article was very enlightening.

jonahss
1 replies
15h50m

sounds neat. link?

ChicagoDave
0 replies
8h11m

It's really an experimental endeavor. I have a Github repo (https://github.com/ChicagoDave/chatgpt-ifplatform), but I'm still changing things all the time. It's very volatile.

Finding the balance between OO principles, fluid coding capabilities, separating the data, grammar, parser, and world model, and then constructing a standard IF library of common IF "things" is like juggling 20 kittens and 10 chainsaws.

Some things are confounding like do I define a container with a boolean property on an object or is a container a subclass of the base Thing? How does that extend to the underlying graph data store? What will queries look like and which solution is more meaningful to authors?

Seriously, 95% of the fun is figuring all of these things out.

throwaway2037
1 replies
16h21m

The C++ Standard Template Library is now 29 years old. It doesn't have a graph (or generic B-tree) data structure. That says it all for me. In my too many years of programming, I have only needed to program a graph structure once or twice. Yes, they are "ubiquitous in software engineering", but still incredibly rare in most enterprise programming projects.

gpderetta
0 replies
6h37m

Boost does have the BGL (which, following the STL, abstracts graph algorithms from graph implementations). But I don't think it has seen significant updates in more than two decades.

pfdietz
1 replies
1d1h

This is looking like a programming language problem. There is no way to make graph algorithms available in a way where the details are abstracted away.

What would a programming language look like that could address all those issues?

tlb
0 replies
1d1h

Some C++ libraries do pretty well. In the Eigen math library you can call most functions on dense or sparse matrices, and some template magic makes it work efficiently for all combinations.

It's a shame it's so hard to write that kind of generic template code.

gilleain
1 replies
1d1h

And then for each of those types we have hypergraphs, where an edge can connect three or more nodes, and ubergraphs, where edges can point to other edges.

Huh, I've heard of hypergraphs (although never actually really used them) but never an 'ubergraph'. Sounds tricky!

In practice, how often are there situations where you definitely need hypergraphs? I had a particular situation where I needed graphs that were both vertex coloured (labelled) and edge coloured (labelled) - even then it was outside the normal situation for what I was doing (graph canonicalization).

ratmice
0 replies
19h2m

ubergraphs are pretty weird, i've never actually seen them really used anywhere. Just a couple of papers pointing out their existence. They have some weird quirks like every hypergraph and thus graph has a dual, but ubergraphs with uberedges do not appear to have one.

caditinpiscinam
1 replies
21h54m

Relational databases are graphs where the nodes are records and the edges are foreign keys

I disagree with the premise of the article: programming languages do have strong and mature support for graphs in the form of relational database interfaces, which cover most of the real-world use-cases for linked data.

samatman
0 replies
17h15m

Just in the last month I've been working with PEG patterns, which are mostly tree-shaped but rules make it a cyclic graph, and writing a rope, which is a DAG.

How do I use SQLite for those?

zacify
0 replies
20h13m

I feel like this is common knowledge….

xLaszlo
0 replies
18h37m

I found that the best way to think about graph implementation is as sparse matrices, i.e. the adjacency matrix. The CSR/CSC formats have fast lookup, and building and switching between formats is "relatively" efficient. Most graph algorithms need primitives that can be built on top of this.
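
CSR in miniature, as a sketch: two flat arrays, with the neighbors of node i sitting at indices[indptr[i]:indptr[i + 1]].

    # Graph: 0 -> {1, 2}, 1 -> {2}, 2 -> {}
    indptr = [0, 2, 3, 3]
    indices = [1, 2, 2]

    def neighbors(i):
        return indices[indptr[i]:indptr[i + 1]]

    assert neighbors(0) == [1, 2]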

For distributed computing one can look into GraphLab or its smaller, now largely abandoned sibling, GraphChi.

wvlia5
0 replies
2h27m

In my last job, I implemented a transpiler for most GQLs: we supported Gremlin, Cypher, SPARQL, NetworkX, GraphQL and SQL.

It took me about 3 months to implement each language.

We could also convert between different graph serialization formats, like RDF and some JSON formats.

The product is now dead (KgBase). The transpiler tool wasn't exposed to users, it was just part of the internal machinery used to seamlessly support many DBs on the same frontend.

The graph community should split in two. Some are interested in graphs from a math/statistics viewpoint, while others are interested in graphs as a generalization of relational DBs. Graph tools attempt to satisfy both camps simultaneously, but their interests and needs are very different.

wruza
0 replies
12h18m

Partly it’s just our beloved overthinking and rationalization of it. When I needed a graph to repesent and edit a structure of supply contracts, I just stored {who, to-who[]} and loaded these arrays into a graph library written in js. Performance, formats didn’t matter because there’s only so much contracts, 5-20 per single graph. If there was no graph library, that would suck, and no amount of rationalization would change that. Complexity of the area never offsets the value of having at least something useful in it.

tunesmith
0 replies
21h39m

Given the wide variety of implementations, what I'd like is a questionnaire that asks you what sort of graph you are intending to work with, and then recommends what sort of algorithm implementations or software packages would be best... I'm hoping that the language models will get better at this sort of question over time.

taeric
0 replies
1d

A graph, as presented in this article, is a model of something else. That we have more than one way to implement this model is rather natural, all told. The hope that we can have a singular syntax and data structure that represents a graph in code is almost exactly why the author of Java's Linked List posted this gem: https://twitter.com/joshbloch/status/583813919019573248

My favorite part, the idea of having a linked list where the node is first class in your code, is almost precisely the problem. You rarely want/need to work at that level. In a very real sense, objects that have other objects are already trees of data. Many can back-reference, and then you have a graph.

And then there is the joy of trying to use matrix operations to work with graphs. You can do some powerful things, but at that point, you almost certainly want the matrix to be the abstraction.

Excited to see someone come up with good things in this idea. I retain very serious doubts that I want a singular model for my data.

planede
0 replies
6h57m

I think graph data structures are generally missing because they are not something that you would explicitly use most of the time. Sometimes graphs just emerge from existing data structures in an application.

You might want to run generic graph algorithms on such emergent graph data structures, but usually they don't have a uniform graph interface that you can make use of. So you either would need to copy the graph over to some normalized graph data structure, or implement a uniform graph interface facade over the existing objects. The latter is usually more efficient.

Anyway, there are libraries that tend to do this decently, I think Boost does it well[1]. But there are a lot of inherent complexities and so much open design space that you can't really serve with one or even a handful of data structures.

[1] https://www.boost.org/doc/libs/1_84_0/libs/graph/doc/index.h...

lowbloodsugar
0 replies
22h11m

Couple of thoughts:

First, the discussion about representation highlights that the issue is a lack of infinite resources. If we had an infinite computer, that could execute an infinite number of operations in zero time, and had infinite memory, then we wouldn't be worrying about whether it's better to store the graph as a matrix, an edge list, or a pointer graph.

Software Engineering is everywhere and always a job of optimization. Sometimes that optimization is premature, and sometimes it's too little too late. It's always about optimization.

Second, when we're talking about a graph of 10 nodes, it really doesn't matter what data structure we use. It can quickly matter if we have 100s or 1000s of nodes and edges because now the possible arrangements are huge as is the search space. But this is no different than other problems like the knapsack problem where the search space is huge: depending on the problem, there is very likely a "trick" to make it tractable, and that trick is different depending on the problem.

So, like the knapsack problem, there are different, specific solutions for specific graph problems.

khanguy
0 replies
19h50m

I'm surprised I haven't seen anyone mention Stanford's SNAP [0] yet. In addition to their C++ library, they also have a Python interface too.

[0] https://snap.stanford.edu/

kajic
0 replies
11h41m

One of the complications described by the author is performance. Personally, I find stdlib graph libraries extremely useful even if their performance is poor, because it’s often the case that my dataset is small enough, and even if performance turns out to be an issue, first spending time on the problem with an underperforming graph library is a very worthwhile exercise before trying to write my own optimized implementation. By analogy, many programming languages are far from being the fastest, but they can nevertheless be very useful.

That said, I’m not surprised performance came up in interviews with experts; they probably have tons of interesting performance-related stories to tell from their extensive work on graphs.

fiddlerwoaroof
0 replies
20h56m

FWIW, despite its name being an abbreviation for LISt Processing, the fundamental datatype of lisp (the cons cell) is actually a vertex in a directed graph, with the restriction that each vertex can have only two outgoing edges.
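
A rough Python rendering of that observation (Cons here is a stand-in, not any particular lisp implementation):

    # A cons cell: a vertex with at most two outgoing edges (car, cdr).
    class Cons:
        def __init__(self, car, cdr):
            self.car, self.cdr = car, cdr

    # A proper list is just a path along cdr edges...
    lst = Cons(1, Cons(2, Cons(3, None)))

    # ...but nothing stops edges from sharing structure or forming
    # cycles, i.e. a general directed graph.
    shared = Cons("x", None)
    dag = Cons(shared, Cons(shared, None))   # two edges into the same vertex
    loop = Cons("y", None)
    loop.cdr = loop                          # a cycle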

dustingetz
0 replies
1d

Electric Clojure uses Clojure itself (s-expressions) as a graph authoring syntax, using a macro to reify dataflow through a reactive client/server system (here the use case is full stack user interfaces but the idea generalizes) https://github.com/hyperfiddle/electric (I'm the founder).

IMO, the answer to the question "Where are all the graph types?" is: the graph authoring DSL needs to express scope, control flow and abstraction, which essentially makes it isomorphic to a programming language, freed of its evaluation model. In Python and Typescript, embedding a complete programming language is something that's rather hard to do!

Also see my blog post "Four problems preventing visual flowchart programming from expressing web applications" https://www.dustingetz.com/#/page/four%20problems%20preventi...

dietr1ch
0 replies
11h50m

Well, we have many implementations of simpler abstractions like lists, where it might be useful to have contiguous memory, or quick append/pop, or insertion at the front, or slicing.

I think that having clearly defined "instances" of these tailored lists (vector, deque, linked list) helps a bit, but graphs are a harder problem: there are more ways of tailoring them to specific purposes, and with that come more tradeoffs.

de6u99er
0 replies
13h17m

@HN: can we please have fewer of these "Look, I discovered something most of you already know" articles? Or at least a tag to filter them out!

bluemeniscus
0 replies
53m

I really enjoyed reading this thread. The opening line (and core argument?) "Graphs are ubiquitous in software engineering" triggered my neural network to respond.

TLDR: graphs are ubiquitous in _science_, and a good data type should not set the crossbar too high.

(Some history.) The Center for Nonlinear Studies (https://cnls.lanl.gov/External/) has a rich legacy of organizing annual Los Alamos Lab driven conferences in Santa Fe that bring together emergent disciplines. We combine overview talks by world experts with enough space between talks that new bridges can be built at outstanding Santa Fe restaurants. I was co-organizer of the 2003 conference on Complex Networks, and this one was turning out to be a real banger. Sitting in the back with Aric Hagberg and Dan Schult (from Colgate University, but then spending his sabbatical at the CNLS), we were struck by how many really smart people were using "complex networks", but with very few computational tools available to them. That was the origin of networkx (= network "X", reflecting the multidisciplinary renaissance we were watching).

At that time python was a high-productivity starting framework to build the infrastructure, but I honestly expected that eventually another language would be plugged in under the hood to make it more efficient for huge data sets on supercomputing architectures. We used the Guido van Rossum dict-of-dicts data structure idea (a few years old at that time) and built the most natural setup designed for a range of disciplines. The python dict data type was a well-tested and integrated part of the language, so we felt this to be a solid base to work on. We freely borrowed ideas from many smarter than us, e.g. from David Eppstein [1]: one should be able to just say "if n in G" for "if the node n is in the graph G" and "G[n]" for "the neighborhood of node n in the graph G". We loved the graphviz drawing tools but quickly decided that graph drawing was a separate challenge [2]. Our goal was platform-independent, open source tools that would allow any graduate student, from any country, to use them in any field. Often when I came up with some strange subset of mathy graph stuff, Aric would push back with YAGNI! [3].

Fast forward a few years, and the success of networkx (due to the wise management and long midnight hours of leadership by Aric and Dan, who inspired many new contributors) continued to surprise us. The python dict-of-dicts technology allowed a wide range of fields to use these tools, and smart graduate students (working in fields as diverse as epidemiology, proteomics, ecology, architecture, the social sciences, ...) could learn "applied graph theory" on the fly and easily write their own code. If we had used C++ Boost BGL or some other more efficient C data structures, we would likely have bypassed all these thousands [4] of applications. The evolution of networkx continues with great new ideas, as explained for example in the recent scipy talks by current maintainers and contributors. Thanks Jarrod Millman! [5]
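
The heart of that design still fits in a few lines. As a sketch only (not networkx's actual code; the add_edge helper is just for illustration):

    # Dict-of-dicts: the outer dict maps node -> inner dict of neighbors,
    # whose values hold edge attributes. Dictionaries all the way down.
    G = {}

    def add_edge(G, u, v, **attrs):
        G.setdefault(u, {})[v] = attrs
        G.setdefault(v, {})[u] = attrs    # undirected: mirror the edge

    add_edge(G, "a", "b", weight=3)

    print("a" in G)          # node membership: "if n in G"
    print(G["a"])            # neighborhood of node n: "G[n]"
    print(G["a"]["b"])       # edge attributes: {'weight': 3}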

Many programmers ached for better, faster, newer graph libraries, and I know that they used networkx as part of their development, as they should. Borrowing from paper dictionaries, there are some humorous quirks added into old code that allow one to track borrowed memes [6]. One reason the abundance of graph libraries will continue is that programmers, like woodworkers, take great joy in crafting them. I look forward to a future AI that creates a superb graph library just because it should be done. I hope it will contain random_lobster, and the Aric Hagberg Ankh-Morporkian quote "it is dictionaries all the way down".

Pieter Swart

Theoretical Division and Center for Nonlinear Studies, LANL

[1] https://ics.uci.edu/~eppstein/

[2] In his CNLS 2003 talk, Bill Cheswick made the point that for typical internet related graphs any drawing tool soon delivers a "peacock splattered onto your windshield." https://www.cheswick.com/ches/

[3] https://en.wikipedia.org/wiki/You_aren%27t_gonna_need_it

[4] https://scholar.google.com/citations?view_op=view_citation&h...

[5] https://www.jarrodmillman.com/

[6] https://networkx.org/documentation/stable/reference/generate...

I pay homage to the wise soul that gave it its own web page at https://randomlobster.com/

bionhoward
0 replies
23h10m

Could dataframes be a superior backend for graphs to maps/lists?
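
One way to picture it, as a sketch with pandas (the column names are invented):

    import pandas as pd

    # Edges as rows of a DataFrame; edge properties are just columns.
    edges = pd.DataFrame({
        "src": ["a", "a", "b"],
        "dst": ["b", "c", "c"],
        "weight": [1.0, 2.5, 0.3],
    })

    # Neighborhoods fall out of a group-by...
    neighbors = edges.groupby("src")["dst"].apply(list)
    print(neighbors["a"])             # ['b', 'c']

    # ...and bulk queries become ordinary column operations.
    print(edges[edges.weight > 1])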

avgcorrection
0 replies
23h27m

Maybe there could be some utility for a decently general Graph type that can be used for high-level testing and verification. Maybe you could implement your efficient graph per-problem and then validate it with a more high-level and declarative type. You would have to implement some isomorphism between the types though.
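
For instance (a sketch; fast_graph_cls stands in for whatever optimized per-problem structure you wrote, implementing the same minimal interface):

    import random

    # Declarative reference: a plain set of pairs, trivially correct.
    class ReferenceGraph:
        def __init__(self):
            self.edges = set()
        def add(self, u, v):
            self.edges.add((u, v))
        def has_edge(self, u, v):
            return (u, v) in self.edges

    # Randomized check: the optimized graph must agree with the reference.
    def check_against_reference(fast_graph_cls, trials=1000):
        ref, fast = ReferenceGraph(), fast_graph_cls()
        for _ in range(trials):
            u, v = random.randrange(10), random.randrange(10)
            if random.random() < 0.5:
                ref.add(u, v)
                fast.add(u, v)
            assert ref.has_edge(u, v) == fast.has_edge(u, v)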

at_a_remove
0 replies
1d

Hrm. This reminds me of a thought I had about, for the lack of a better term, "groups" in Python. Each data type has a property which is paramount and more or less defines the type.

Lists are ordered. The tuple is immutable. The dict is keyed. The set is unique. (But there are some overlaps: dicts are both keyed and their keys are unique.)

And I thought, what if you had a group where you could make it immutable or mutable, ordered or not-ordered, where the value was a key to something else (or not), and so on? But then I saw the weird edge cases and the explosions of complexity. Some combinations of these attributes are less appealing than others.
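
A toy dispatcher over those attributes makes the explosion visible (a sketch; several combinations simply have no builtin):

    # Map (ordered, mutable, unique) combinations onto Python's builtins.
    def make_group(ordered, mutable, unique):
        table = {
            (True,  True,  False): list,
            (True,  False, False): tuple,
            (False, True,  True):  set,
            (False, False, True):  frozenset,
        }
        try:
            return table[(ordered, mutable, unique)]()
        except KeyError:
            # e.g. ordered + mutable + unique has no builtin: the
            # attribute combinations outrun the available types.
            raise NotImplementedError((ordered, mutable, unique))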

anonymoushn
0 replies
19h17m

To make matters worse, people often run graph algorithms on graphs that are too large to store, the details of which are generated on the fly at each step of the algorithm from the definition of the graph.
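
For example, a BFS over an implicit graph never materializes the vertex set; neighbors come from a rule (a sketch with an invented rule and a crude bound to keep the search finite):

    from collections import deque

    # Implicit graph: vertices are non-negative integers, and from n you
    # may step to n + 3 or n * 2. The graph is never stored; only the
    # frontier and the visited set exist in memory.
    def neighbors(n):
        return (n + 3, n * 2)

    def shortest_path_length(start, goal):
        seen, queue = {start}, deque([(start, 0)])
        while queue:
            node, dist = queue.popleft()
            if node == goal:
                return dist
            for m in neighbors(node):
                if m <= 2 * goal and m not in seen:  # crude bound keeps it finite
                    seen.add(m)
                    queue.append((m, dist + 1))

    print(shortest_path_length(1, 14))   # 3 steps: 1 -> 4 -> 7 -> 14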

Ptitselov
0 replies
18h0m

If you code in .NET, please try my graph library Arborescence [1]. It's small and not very feature-rich. However, I believe it provides a good separation between abstractions [2], algorithms, and data structures. Regarding the points mentioned in the article:

- you can use edges with or without their own identity,

- you can use implicit graphs unfolded on the fly,

- you can use both adjacency (out-neighbors) and incidence (out-edges + head) interfaces,

- no edge type is imposed by the library, although it does provide a basic tail-head pair structure as a utility.

[1] https://github.com/qbit86/arborescence

[2] https://github.com/qbit86/arborescence/tree/develop/src/Arbo...

Borg3
0 replies
22h3m

Ahh, graphs... one of my favorite subjects. I love doing graphs but I'm kinda bad at implementation. Still, I managed to slap together D3 + Cola to make my own little interactive visualizer, which I use for various networking visualizations (L1, L2, L3).

Here is example of IRCnet network:

http://ds-1.ovh.uu3.net/~borg/d3/ircnet.htm

AtlasBarfed
0 replies
20h47m

My own take: a document database, which devs seem to love, combined with relations (aka edges), is basically a property graph, and that's basically all you need. Directionality is a property on the edge.
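
Roughly (a sketch; the field names are invented):

    # Nodes and edges as documents; direction is a property on the edge.
    nodes = {
        "n1": {"label": "person", "name": "ada"},
        "n2": {"label": "person", "name": "grace"},
    }
    edges = [
        {"from": "n1", "to": "n2", "rel": "mentors", "directed": True},
    ]

    # Out-neighbors of n1:
    print([e["to"] for e in edges if e["from"] == "n1"])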