return to table of content

Data Structures for Data-Intensive Applications [pdf] (2023)

russfink
4 replies
14h21m

Needs a TOC

albert_e
2 replies
12h40m

I uploaded PDF and asked ChatGPT 4 for ToC ..it is really struggling to process this.

Even after asking it to ignore page headers and footers.

I thought the state of the art was much better?

082349872349872
1 replies
9h43m

By hand (5 min):

  I Introduction                                          2
  II Performance Metrics and Operational Tradeoffs        9
  III Dimensions of the Data Structure Design Space      21
  IV From Workloads to Data Structures                   67
  V Adaptivity: Evolving Data Structures to a Workload   93
  VI Data Structures for Specific Application Domains   109
  VII Challenging Design Considerations                 124
  VIII Summary                                          132
(along the way I saw some interesting figures, so, for me, this exercise was worth the time)

albert_e
0 replies
5h46m

Yes I was going to do the same but wanted to learn a general way to do this task along the way too and get more details (second level sections as well).

======================

Step 1:

Use pdftk to extract metadata

pdftk input.pdf dump_data output > raw.txt

Step 2:

copy relevant sections of metadata "Bookmarks" into a new tetx file.

Step 3:

Upload this text file and ask ChatGPT to format it as you want -- markdown / HTML / plain text.

Took maybe 10 minutes and this was also worth it for a different reason (learning one more bit about pdftk capabilities).

=======================

CONTENTS

    1. Introduction (2)

    1.1. Data Structures Are Foundational (2)
    1.2. Tradeoffs in Data Structure Design (4)
    1.3. Audience & Prerequisites (7)
    1.4. Learning Outcomes (8)
    1.5. Overview of the Book (8)

    2. Performance Metrics and Operational Tradeoffs (9)
    2.1. Memory Hierarchy (9)
    2.2. From Read/Update to RUM: Memory & Space Costs (10)
    2.3. RUM Performance Costs (11)
    2.4. From RUM to PyRUMID (14)
    2.5. Chapter Summary (17)
    2.6. Questions (18)
    2.7. Further Readings (19)

    3. Dimensions of the Data Structure Design Space (21)
    3.1. Global Data Organization (24)
    3.2. Global Search Algorithms When Not Using an Index (26)
    3.3. Search When Using an Index (32)
    3.4. Local Data Organization (46)
    3.5. Local Search (48)
    3.6. Modification Policy: In-place vs. Out-of-place (48)
    3.7. Buffering (53)
    3.8. Key-Value Representation (55)
    3.9. Summary of the Design Space Dimensions (57)
    3.10. Data Structure Design Expert Rules (58)
    3.11. Chapter Summary (61)
    3.12. Questions (61)
    3.13. Further Readings (64)

    4. From Workloads to Data Structures (67)
    4.1. Point and Range Queries, Modifications, but Rare Scans (Traditional B+-trees and Learned Tree Indexes) (68)
    4.2. Similar Workload With a Working Set That Fits in Memory (Fractal B+-trees) (70)
    4.3. Point and Range Queries, Rare Scans, More Modifications (Insert-Optimized Search Trees) (70)
    4.4. Mixed Workload With No Short Range Queries (Hybrid Range Trees) (75)
    4.5. Mixed Workload, With Ever Increasing Data Size (Radix Trees) (78)
    4.6. Point Queries, Inserts, and Some Modifications (Static Hashing with Overflow Pages) (79)
    4.7. Read-mostly With Long Range Queries (Scans with Zonemaps) (81)
    4.8. Modification-intensive With Point and Range Queries (LSM-tree) (82)
    4.9. Modification-intensive With Point Queries Only (LSM-hash) (84)
    4.10. When to Design Heterogeneous Data Structures (85)
    4.11. Data Structures in Practice (86)
    4.12. Chapter Summary (88)
    4.13. Questions (89)
    4.14. Further Readings (92)

    5. Adaptivity: Evolving Data Structures to a Workload (93)
    5.1. Design Dimension: Reorganization Aggressiveness (96)
    5.2. Adaptivity for Frequently Accessed Data (96)
    5.3. Adaptivity for Value-Organized Data (97)
    5.4. Aggressiveness of Adaptivity during Initialization (100)
    5.5. Partial Adaptive Indexing (101)
    5.6. Adaptive Modifications (102)
    5.7. Adaptivity and Concurrency (103)
    5.8. Adaptivity Metrics (104)
    5.9. Open Topics (105)
    5.10. Chapter Summary (106)
    5.11. Questions (107)

    6. Data Structures for Specific Application Domains (109)
    6.1. Data Structures in Relational Database Systems (109)
    6.2. File Systems & Memory Management (117)
    6.3. Data Structures in Machine Learning Pipelines (118)
    6.4. Cross-System Design Considerations and Tradeoffs (120)
    6.5. Chapter Summary (121)
    6.6. Questions (122)
    6.7. Further Readings (123)

    7. Challenging Design Considerations (124)
    7.1. Concurrency (125)
    7.2. Distributed Systems (127)
    7.3. Emerging Workload Types (127)
    7.4. Hardware Considerations in Data Structure Implementation (128)
    7.5.Chapter Summary (129)
    7.6. Questions (129)
    7.7. Further Readings (130)

    8. Summary (132)
    Acknowledgments (134)
    References (135)

csbartus
0 replies
7h25m

Opening with Firefox gives a full TOC: https://imgur.com/a/cgdy0nY

petra
4 replies
11h58m

One of the authors of this book runs a research lab in this field.

They have this cool tool that can help one design an optimal data structure:

http://daslab.seas.harvard.edu/datacalculator/

5Qn8mNbc2FNCiVV
3 replies
11h13m

I can't seem to find where the actual tool is

skavi
0 replies
6h50m

If you manage to click on the correct image, you’ll get to this page [0]. Then you can find a link to the “interactive demo” [1]. Weirdly well hidden.

[0]: http://daslab.seas.harvard.edu/projects/evolution/

[1]: http://sigmod.dyndns.org:8000/app/#

kinow
0 replies
7h22m

Me too, some text, links, but no actual code or link to download.

hermitcrab
0 replies
7h43m

Same here.

pgwhalen
2 replies
18h35m

I've only had a chance to skim so far, but this is an absolutely superb survey of an enormous space. It doesn't just list out a bunch of data structures, it actully helps you build a mental model of all the concerns to think about when building or using data structures in an application.

PH95VuimJjqBqy
0 replies
4h3m

Easily one of the best technical books I've ever read.

3abiton
0 replies
17h32m

That's the recap I needed!

datadrivenangel
2 replies
21h41m

Data structures across distributed systems are challenging, so this is a useful resource.

Still, better if you can keep your application on a single machine if that's feasible.

teraflop
1 replies
21h23m

This paper does not cover distributed systems, except for a very brief mention at the end.

fithisux
0 replies
5h43m
Ozzie_osman
2 replies
12h40m

I'd love to buy a copy but it's $100 on Amazon.

willsmith72
1 replies
12h9m

I'm still waiting for someone to disrupt and un-amazon the book industry. Authors lose, readers lose. It's broken

eimrine
0 replies
10h17m

Torrents can do this, also offline book shops who sell used books.

bbsz
1 replies
5h56m

More recommendations on this topic? Paper's great. I also know about Martin Klepmann's Designing data intensive applications but it's less about data structures and more about databases.

phyrex
0 replies
5h5m

I got a lot of value out of “Algorithms and Data Structures for Massive Datasets” (https://www.manning.com/books/algorithms-and-data-structures...)

nwareing
0 replies
21h22m

DSDIA ;)

lgkk
0 replies
17h18m

This is a good read

Octokat
0 replies
23h0m

Thank you for sharing this!