
Database Design for Google Calendar: A Tutorial

bigbones
21 replies
1d9h

A random event from my calendar serializes to 740 bytes of iCalendar. An extremely busy calendar with one event every 15 minutes, 9am-5pm, 7 days a week, holds only 11,680 entries for a full year, which works out to around 8MB.

Would suggest instead of designing a schema at all, a calendar is a good example of a problem that might be far better implemented as a scan. Optimizing an iCalendar parser to traverse a range of dumped events at GB/sec-like throughputs would mean the above worst-case calendar could be scanned in single-digit milliseconds.
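A minimal sketch of that scan idea in Python (a toy parser for illustration only: real iCalendar also has line folding, escaping, TZID parameters, and recurrence):

```python
from datetime import datetime

def scan_events(ical_text, range_start, range_end):
    """Yield (start, summary) for VEVENTs whose DTSTART falls in
    [range_start, range_end). Toy parser: assumes unfolded lines and
    UTC 'Z' timestamps only."""
    start, summary = None, None
    for line in ical_text.splitlines():
        if line == "BEGIN:VEVENT":
            start, summary = None, None
        elif line.startswith("DTSTART:"):
            start = datetime.strptime(line[8:], "%Y%m%dT%H%M%SZ")
        elif line.startswith("SUMMARY:"):
            summary = line[8:]
        elif line == "END:VEVENT" and start is not None:
            if range_start <= start < range_end:
                yield start, summary

feed = """BEGIN:VCALENDAR
BEGIN:VEVENT
DTSTART:20240105T090000Z
SUMMARY:Standup
END:VEVENT
BEGIN:VEVENT
DTSTART:20240212T140000Z
SUMMARY:Review
END:VEVENT
END:VCALENDAR"""

# Only the January event falls inside the queried range.
hits = list(scan_events(feed, datetime(2024, 1, 1), datetime(2024, 2, 1)))
```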

Since optimizing a parser is a much simpler problem to solve once than changing or adding to a bad data model after it has many users, and since the very first task involving your new data model is probably to write an iCalendar importer/exporter anyway, I think this would be a really great trade-off.

canucker2016
13 replies
1d9h

iCalendar is designed as an interchange format, not as a storage format for fast access to calendar data.

The format shows its age - you can tell it was designed before XML/JSON were "hot".

see https://en.wikipedia.org/wiki/ICalendar

sleepyhead
3 replies
1d7h

> you can tell it was designed before XML/JSON were "hot"

or you can tell it was designed by people who care more about performance.

dspillett
1 replies
1d1h

> > you can tell it was designed before XML/JSON were "hot"

> or you can tell it was designed by people who care more about performance.

I doubt parsing iCal is significantly more performant than JSON for most use cases. In fact I can imagine it being less so in more cases than it is more so, and as close to the same as makes no odds in the vast majority of cases.

LeonB
0 replies
11h16m

Parsing iCal was far more performant than parsing JSON as your parser could simply start at the first byte, and follow the spec — whereas a JSON parser would first need to wait for ten years before JSON was invented.

bigbones
3 replies
1d9h

It is also a data model that amounts to a de facto standard across all real calendaring apps. The text serialization sure is ugly, and who would want to touch it unless they had to, but that's just a parsing problem. The wiki page shows 14 component types; who is ever going to think of that many when designing a replacement schema? Why not just use the data model you already have 'for free'?

canucker2016
1 replies
1d8h

It's a data model for an early-mid 1990s calendaring app (Lotus Organizer / Microsoft Schedule+).

If that works for you, go for it.

kreetx
0 replies
1d8h

But what is the exact criticism here? E.g. that the serialized form doesn't look "nice" when you open the file?

HelloNurse
0 replies
10h27m

A de facto standard for interchange ("feeds" announcing changes to activities), not for a database (remembering a large set of activities indefinitely).

The appropriate schemas are vastly different: a message like iCalendar should be simple and self-contained (leaving cleverness to the applications exchanging it), while a database like the article discusses should be normalized (nontrivial structures and queries aren't a problem).

asmor
2 replies
1d8h

Showed up to a doctor's appointment during their lunch break once, because iCal supports the time format "TZ is whatever is local, trust me" and the device I used to add the appointment was set to UTC, not UTC+1.

My doctor's calendar app vendor was pretty happy I found the root cause of their very occasional mystery appointment drift, too.

wodenokoto
1 replies
1d6h

> “TZ is whatever is local, trust me”.

New years is midnight, local time, wherever you are. Trust me.

LeonB
0 replies
11h23m

If I invite you (who lives in Spain, let’s say) to fly over and enjoy my New Year’s party (in New Zealand) it definitely is not going to occur at “midnight local time wherever you are” - Trust me.

(Actually there’s a famous missing person’s case related to a New Year’s party in New Zealand … I don’t think that timezones were part of what went wrong, but I can’t be sure.)

rockwotj
1 replies
1d9h

It’s like the email format - it’s the wild west trying to parse these things, it doesn’t help that date/time is already a tricky subject.

LeonB
0 replies
11h21m

In the Wild West things were easier —

“Meet me at high noon!”

Was never met with -

“High noon? What time zone?”

mosselman
4 replies
1d9h

Wouldn't you have to re-invent things like range searches, which are 'BETWEEN ... AND ...' queries in SQL? The same for finding events that are for users 1, 2 and 3, etc.

In a real application you'd probably have some user accounts of some sort that are stored in a relational database already and then you'd suddenly have to scan for events in a directory that you then have to connect to those records in the database.

So there might be some specific set of applications where you are right, but there are specific things that a database is really good at, which would make it a really good choice. With the proper indices you'd probably get the same or even better throughput, unless you come up with some clever directory structure for your events. That structure would in fact be the same as an index, but only on one dimension, whereas in a database you can create indices for many dimensions and combinations of dimensions.

So you are right, trade offs.

bigbones
2 replies
1d8h

I didn't mean to imply avoiding use of a database entirely; almost any DB system tasked with copying a few long strings around in a simple query won't perform much worse than a literal raw disk file.

Even just something like: CREATE TABLE calendar(id, user_id, blob)
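A minimal sketch of that blob-per-calendar model, using SQLite for brevity (table and column names are illustrative):

```python
import sqlite3

# One row per calendar: the whole serialized iCalendar lives in a single column.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE calendar(id INTEGER PRIMARY KEY, user_id INTEGER, blob TEXT)")
db.execute(
    "INSERT INTO calendar(user_id, blob) VALUES (?, ?)",
    (1, "BEGIN:VCALENDAR\nBEGIN:VEVENT\nSUMMARY:Lunch\nEND:VEVENT\nEND:VCALENDAR"),
)

# The query side stays trivial: fetch the serialized calendar, scan it in the app.
(blob,) = db.execute("SELECT blob FROM calendar WHERE user_id = ?", (1,)).fetchone()
```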

mosselman
1 replies
1d7h

Yes sure. I can imagine though that normally you'd also want to be able to query on details of an event. In which case having most things in columns would make sense because you can combine it with JOIN queries, etc.

Also, in the context of web applications, you probably already have a database and probably don't have persisted disks on your application servers, which then adds complexity to the file based scenario. In which case using blobs is a perfectly fine solution indeed.

Still you are right that in many cases, let's say a desktop application, you are probably better off reading directly from tens of files on disk rather than having to deal with the complexity of a database.

The same applies to vector databases. I read an article a few months ago that spoke about just storing vectors in files and looping through them instead of setting up a vector database and the performance was pretty much the same for the author's use case.

dotancohen
0 replies
1d6h

And then you quickly get to the not-author's-use cases where you have to start reinventing wheels, poorly.

nolok
0 replies
1d7h

If I understood him correctly, I think this is where some languages' collection libraries ought to shine: PHP/Laravel collections or C# LINQ, for example. Tell it how to load a record into the collection, add one-liners for each criterion you want to define, and in a few dozen lines you're free to go.

datr
0 replies
1d8h

Such an approach might also provide a neater solution to the infinite time-slots problem mentioned in the article.

HelloNurse
0 replies
1d6h

Maybe you can do without a DBMS, but you still need a good schema for your data, and it isn't a bunch of iCalendar records: for example, instead of a stable user ID there are only mutable and repeated email addresses.

markus_zhang
20 replies
1d10h

Interesting. I wonder what a NoSQL one would look like?

evnix
12 replies
1d9h

I call it the "Build now, cry later" approach.

shreddit
8 replies
1d9h

Seconded. I've built multiple services with MongoDB just to port them now (and probably for some time to come) to Postgres.

irisgrunn
4 replies
1d9h

That only makes sense if you wanted to store relational data in a NoSQL database (and that's not what Mongo is meant to do)

threeseed
2 replies
1d8h

And if your business is okay with all of your data living in a single instance.

Because PostgreSQL is unacceptably poor at HA/replication compared to MongoDB.

spacebanana7
0 replies
1d7h

Is that really true these days? Setting up Postgres read replicas with automatic failover across multiple machines is pretty trivial in the cloud with services like RDS, Spanner, etc. And although doing it in your own datacenter is still a big job, it's far from impossible.

sgarland
0 replies
1d6h

Huh? Replicas are easy, and hot standby nodes aren’t that hard either. There are also various active-active solutions if you need that.

redwood
0 replies
6h52m

I'm genuinely curious what you mean when you say "relational data"? I've seen this phrase thrown around and I think it's something of a misconception.

The way you use the term implies that you're referring to the type of data, but the term generally refers to the method used for storing the data.

This distinction is important because it leads to a circular reasoning dynamic: many of us are accustomed to storing the data in tabular form using a relational data model. But choosing to use that particular model to represent objects or entities or ideas does not make those objects or entities or ideas fundamentally relational data.

throwaway290
1 replies
1d6h

Postgres and NoSQL are not exclusive, Postgres works pretty well as NoSQL (some would say better than Mongo).

redwood
0 replies
6h40m

This is an extremely misleading statement. You have to jump through hoops to do expressive queries, with a limited set of index primitives, a query optimizer that doesn't take advantage of anything that resembles nested JSONB, and syntax and storage overhead that will leave you thinking there's nothing good about storing data with rich JSON structure to begin with. Worst of all worlds.

markus_zhang
0 replies
1d9h

Just curious: why? I have never worked with NoSQL, but I always envy the people who do, because they are a bit further away from the business than us data warehouse modellers. They are usually our upstream, paid more, with less hassle from the business.

threeseed
2 replies
1d8h

And yet almost all of the major websites you use today rely on NoSQL.

Also if you have non-traditional data structures e.g. document, star, graph, time series then storing them in a SQL database will cause you nothing but problems.

There are no black/white answers in tech. Always right tool for the right job.

fredoliveira
1 replies
1d8h

> And yet almost all of the major websites you use today rely on NoSQL.

They may use NoSQL in specific use cases, but certainly not exclusively. Using the right tool for the job is crucial; otherwise, you’re doing yourself and your product a disservice.

In this case, NoSQL database architecture and internals provide little to no advantage over relational databases. I can’t imagine building a calendar implementation with NoSQL. Some flexible parts of the event model might be stored as NoSQL, but in general? No way.

Edit: looks like I wrote my comment as you were editing yours. We agree :-)

threeseed
0 replies
1d7h

It's a weird argument because the article is wrong.

Google Calendar is not implemented on top of a traditional SQL database but rather on top of Spanner which is more akin to a NoSQL database with a SQL front end.

rcaught
5 replies
1d9h

It looks like one you shouldn't implement.

markus_zhang
4 replies
1d9h

May I ask why? I've actually never used a NoSQL one, so I'm curious.

irisgrunn
1 replies
1d9h

Because events are related to users and they both are related to timezones and events can be related to each other. MongoDB is really good for storing big blobs of data you want to retrieve quickly, with some basic search and index, but it's awful at relations between data.

markus_zhang
0 replies
1d9h

Ah I see what you mean. That makes sense!

Lutger
1 replies
1d8h

The answer, somewhat counter-intuitive for most people, is that NoSQL is very rigid. It is counter-intuitive because having no required schema up front appears to be more flexible, not less.

However, having your database not handle the schema means your application must do it; there is no way around it. If you ask for a DayEvent and you get back something totally different, what do you do?

The rigidness in most NoSQL (assuming some form of document store like MongoDB) comes from its inability to combine data in new ways in a performant manner (joins). This is what SQL excels at. That implies you need to design your data in exactly the way it is going to be consumed, because you can't easily recombine the pieces in different ways as you iterate on your application. Generally you must know your data access patterns in advance to create a well-behaved NoSQL database. Changes are hard. This is rigid.

Thus, it actually makes more sense to go from SQL to NoSQL as you gain experience and discover the data access patterns. The advantage of NoSQL is not flexibility; that is actually its disadvantage! The advantage is rather its horizontal scalability. However, a decent SQL server with a competently designed schema will go a very long way.

redwood
0 replies
6h58m

I think you have a very thoughtful take but believe it's a mistake to think of 'NoSQL' as a monolithic category..

There's a very wide spectrum from having an evolvable document oriented data model with evolvable strongly consistent secondary indexes, transactions, aggregations, and joins to simplistic key/value stores like DynamoDB and Cassandra that do force you into a very much waterfall posture that I think you are spot on in pointing out.

captn3m0
0 replies
1d9h

I've been building something using a JSON-blob inside SQLite using the Schema.org/Event schema. It's turning out okay.

canucker2016
11 replies
1d10h

Timezones will bend your mind, especially around the transition times.

Assuming your timezone jumps forward one hour for daylight saving time and falls back one hour for the transition to standard time...

When your time skips forward one hour, your 1 hour event may now be displayed as spanning two hours - the second hour will not be reachable/does not exist.

When your time falls back one hour, your 1 hour event may now show as spanning 2 hours or 0 hours.

Timezones are a man-made construct, so don't hardcode values, because things will change...
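Both transition cases can be seen directly with Python's `zoneinfo` (America/New_York chosen as an example zone; per PEP 495, `fold` picks a side of the ambiguity):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

NY = ZoneInfo("America/New_York")

# Spring forward 2024-03-10: the 02:00-03:00 local hour does not exist.
# Python accepts the nonexistent wall time; round-tripping through UTC
# shifts it to the other side of the gap (02:30 becomes 03:30).
skipped = datetime(2024, 3, 10, 2, 30, tzinfo=NY)
round_trip = skipped.astimezone(timezone.utc).astimezone(NY)

# Fall back 2024-11-03: 01:30 local happens twice; `fold` disambiguates.
first = datetime(2024, 11, 3, 1, 30, tzinfo=NY, fold=0)   # still EDT, UTC-4
second = datetime(2024, 11, 3, 1, 30, tzinfo=NY, fold=1)  # now EST, UTC-5
```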

0xEF
8 replies
1d9h

Are you suggesting just making the calendar UTC with zero timezones? I agree that they are something we created, but we created them for better local organization. I work with customers globally and everyone uses them. This is a pretty ridiculous ask, in my opinion.

Rather, just focus on eliminating the daylight saving concept from the few localities that still use it, as it tends to cause the most confusion across timezones, especially when planning past an upcoming shift.

canucker2016
4 replies
1d9h

I did not say that.

Timezones are not carved in stone. Prepare for that.

A location could go from a +1/-1 timezone as in most of USA/Canada to a fixed timezone with no transitions.

There are various ways to adapt, but the user-friendly way involves a lot more work in the app, especially if the app thought that timezone data wouldn't change.

Lutger
3 replies
1d9h

Indeed. There are multiple timezone rule changes around the world each year, causing gaps in time. Also, not every timezone is a neat 1-hour increment; some are offset by 15 or 30 minutes, or worse. There are even people living in the same geographical location but in different timezones!

Dealing with timezones will drive you mad, quickly.

pantulis
1 replies
1d8h

> Dealing with timezones will drive you mad, quickly.

I guess you mean "implementing timezone logic yourself". Other than that, the suggested approach of storing everything in UTC and translating to the relevant user timezone with a TZ database in the frontend is the way to go.

plibblr
0 replies
1d5h

Not sure; that's good advice for times in the past, or for times in the next few months, but if we're arranging to meet at midday local time in two years' time, I don't think a change in timezone rules in 2025 should cause our meeting to take place at a different time. I suppose you could store the meeting in UTC and use the creation time of the meeting to decide to use the 2024 timezone rules for conversion, not the 2026 rules, but that seems pretty confusing too!

canucker2016
0 replies
1d8h

Definitely.

The timezone inventors specified that transitions would occur when most people weren't awake or affected: early in the day, in the middle of a Saturday/Sunday weekend, for USA/Canada.

But remote teamwork threw a wrench in that. 2AM Sunday meetings sounded unlikely unless your team needs to communicate with a team several hours ahead or behind and Sunday is a regular workday for one of the teams.

Lutger
1 replies
1d9h

One approach is to store everything in UTC, and display in the timezone of the user.

Dealing with timezones - including DST - properly is a must-have, no way around it. I live in a country that uses DST, a lot of Europe does that. If my calendar would be off by 1 hour half of the year, I'd consider it broken and seriously doubt the competence of its authors. This is the core domain of a calendar app! It would be like an email app that just silently drops every other email.

I'd love for us to ditch DST, by the way; I hate it every time. It's bad for the economy, it's bad for our health, it's bad for software.

dqv
0 replies
1d7h

I don't trust UTC for future dates, only for things that have already happened. Future dates are tentative in that their UTC representation can change. For example, if I converted a Kazakh user's event slated for April 1, 2024 to UTC before February 1, 2024, the event time would be off by one hour.
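One common mitigation (sketched here in Python; the record layout is illustrative) is to store the wall-clock time the user asked for plus a zone identifier, and convert to UTC only at read time, so whatever rules are in the tz database at that moment decide the offset:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Store what the user actually asked for: a wall-clock time plus a zone name,
# not a pre-converted UTC instant (illustrative record shape).
stored = {"wall": "2030-04-01 10:00", "zone": "Asia/Almaty"}

def resolve_utc(record):
    """Convert to UTC at read time, so the tz rules current at *read*
    time (not at write time) decide the offset."""
    local = datetime.strptime(record["wall"], "%Y-%m-%d %H:%M").replace(
        tzinfo=ZoneInfo(record["zone"]))
    return local.astimezone(timezone.utc)

utc = resolve_utc(stored)
```

If the zone's rules change between write and read (as Kazakhstan's did), the stored wall time still means what the user meant.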

dotancohen
0 replies
1d4h

  > Rather just focus on eliminating the Daylight Savings concept from the few localization that still use it

Oh, certainly, change the laws and cultures of foreign states and peoples in order to simplify your code. Can you get them to just write everything in ASCII while you're at it?

zild3d
1 replies
1d9h

It gets even more interesting if you're tracking time while considering movement. First time I ran into this was timetracking aboard a ship that is at sea for multiple days. You can cross timezones repeatedly in both directions, the date line, can have a local start datetime that is after the local end datetime, etc.

dotancohen
0 replies
1d5h

Is this logging or planning? If it's logging I would love to know the arguments against storing the records as timestamps.

pg_bot
8 replies
1d8h

You don't want to store two dates for an event. It's easier to store the start time of the event and a duration. This will keep the logic for updating your events simple. You can always calculate the end time of the event based on the start and duration.
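Sketched in Python (field names are illustrative):

```python
from datetime import datetime, timedelta

# Minimal sketch of the start + duration model.
event = {"start": datetime(2024, 5, 6, 12, 0), "duration": timedelta(hours=1)}

def end_time(e):
    # The end is always derived, never stored, so it cannot drift out of sync.
    return e["start"] + e["duration"]

# Rescheduling touches a single field; the derived end follows automatically.
event["start"] = event["start"].replace(hour=14)
```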

sleepyhead
5 replies
1d7h

That would be problematic for database performance. You would have to calculate the end time on demand when querying.

pg_bot
2 replies
1d6h

In practice performance problems are a nonissue. Your API should consist of start time and duration because that is how people think about appointments and meetings. (An hour long meeting starting at noon) It's a pain to write a UI that updates two pieces of information when one piece of information changes. If you are truly worried about query performance you can denormalize the data before saving by storing start/end times as a datetime range field, but it still makes no sense to expose that in your API.

sleepyhead
0 replies
1d5h

Showing duration is helpful, but so is the exact end time. Visually, as a user, I would like to see the exact time when the appointment ends instead of calculating it in my head. It is not that hard to work out that an appointment starting at 2:30 with a duration of 4:15 ends at 6:45, but still.

As for the API, it makes a lot of sense to expose the end time. If you are, for example, creating a calendar widget, then it has start and end datetimes for all events. With only the duration available in the API output, you have to calculate the end time yourself. More lines of code for you.

Fetching from the API, you would in most cases limit it to certain dates, for example next week. So now you suddenly do have to deal with start and end times. Not having them makes no sense.

I've never had any developers ask for duration in our scheduling API's output. It would be useful to include it, but since no one has asked for it, I think having the end time is more critical. https://developer.makeplans.com/#attributes-1

canucker2016
0 replies
1d1h

How would your code generate all the appointments for a given day?

Requirement: Must be able to handle appointments that span a day, i.e. show all Sunday appointments when there's a party appointment that starts at 8PM Saturday and ends at 2AM Sunday, or in your data model, Saturday 20:00 for six hours.

skeeter2020
0 replies
1d6h

one of you is optimizing for transactional operations and the other reporting :)

sgarland
0 replies
1d6h

To be fair, at Google/Apple scale this may not hold true, but an addition should not cause any noticeable stress to a DB as a generated column.

Or do it in the app, of course.
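A sketch of that generated-column idea, using SQLite for brevity (requires SQLite 3.31+; table and column names are illustrative):

```python
import sqlite3

db = sqlite3.connect(":memory:")
# end_at is derived from start + duration by the database itself,
# so it can be queried and indexed without being stored redundantly.
db.execute("""
    CREATE TABLE appointment(
        start_at TEXT,
        duration_min INTEGER,
        end_at TEXT GENERATED ALWAYS AS
            (datetime(start_at, '+' || duration_min || ' minutes')) VIRTUAL
    )
""")
db.execute("INSERT INTO appointment(start_at, duration_min) VALUES (?, ?)",
           ("2024-05-06 12:00:00", 90))
(end_at,) = db.execute("SELECT end_at FROM appointment").fetchone()
```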

petalmind
0 replies
21h16m

> You don't want to store two dates for an event. It's easier to store the start time of the event and a duration for the event.

Frankly speaking you could store all three (begin, end, duration) and just use whatever you need for different purposes. Just introduce a single point of entry/update that keeps alternative representations in sync.

dgrin91
0 replies
1d5h

Isn't the reverse also true? If you have start & end you can just calc duration.

munchler
5 replies
1d8h

Modeling a system is an under-appreciated skill. In a new domain, however, this should really start with an analysis of the entire problem so as to capture both the static structure of the system (i.e. class model) and its dynamic behavior (i.e. use cases). Jumping directly into a static database model tends to leave out the dynamic behavior. That might be OK in a simple CRUD app like this one, but could be a big mistake in more complex systems.

apwheele
4 replies
1d7h

I'm a data scientist, and I have had a few examples of seemingly simple "how would you build a schema" job interview questions that I had a difficult time with on the spot.

The last one I remember was how to build a product table with coupons. Ok, so two tables, right, no big deal. Well, we are going to need to keep a history, right? So now I need to update and have datetimes for different products and coupons. And now I should think about how to do indexes on the tables, and gosh, is my join to get the discounted price a good way to do that? Most coupons only allow a person to use them once; how the hell am I going to implement that?

They probably just wanted the simple product + coupon tables, but I spun on it for quite a while like a madman.

nerdponx
2 replies
1d4h

I don't like these questions. Data warehouse schema design is out of scope for data science, it's data engineering. Yes we have to do a lot of ad-hoc data engineering along the way, but it's such a strange thing to interview for in lieu of the many possible data/math/stats skills and more directly relevant programming skills. It signals a lack of respect for division of labor and specialization, and that lack of respect will be visible in the form of a stretched inefficient team.

To be clear, I think the ability to be your own data engineer is a great attribute as a data scientist. I just don't think it's reasonable to expect it: it's not part of the core job description.

datadrivenangel
1 replies
1d4h

I expect a skilled data scientist to be able to articulate the complexities of the real world in relational data. Otherwise, how the hell will they be able to infer the real world from relational data?

nerdponx
0 replies
9h16m

That's fair. That's the ad-hoc data engineering I was talking about.

I interpreted the question to be asking about "warehousing" moreso than just "laying out data", but I think I might have read too far into it.

blowski
0 replies
1d6h

I'd say this is exactly what the interviewers wanted. They're interested in how you break down the problem, the types of solutions you consider, your understanding of the trade-offs involved. For example, I interviewed somebody who was adamant they could prevent double-booking by polling an end-point and storing the state in Redux. Fantastic JavaScript skills, terrible knowledge of databases.

sensanaty
3 replies
1d9h

The worst I have ever messed up during an interview was building a simple booking system that had to handle recurring appointments. I hadn't felt that lost and confused trying to accomplish something in code since my early university days, lol

To this day whenever I have to work with anything datetime related I dread it, it just does not click in my head for some reason

tonnydourado
1 replies
1d6h

Asking a design question like "Design a booking system, but with recurring appointments" is like asking "Write a function to sort a list of strings, but it has to work with arbitrary UTF-8 strings, in any locale, respecting alphabetical-order conventions". It's a deceptively simple question that gets impossibly hairy if you try to make it work in a real-world, general way.

dmd
0 replies
1d6h

(As someone who hires) the correct answer to any question involving datetimes is always "use a [good] library - do not attempt to reimplement all of human civilization from scratch".

mattgreenrocks
0 replies
1d6h

Don’t feel bad about it. I was tasked with improving a calendar in a CMS to support a bunch more functions, and was basically eaten alive by it. I had 8 years of experience or so by then.

It is deceptively hard and requires excellent data modeling skills.

Sander_Marechal
2 replies
1d7h

Funny how the original article's advice and your article oppose each other.

- OP says to always store a timezone with each date, yours says to convert everything to UTC (I agree with OP)

- OP says to generate database rows for each event, yours says to not do that (I agree with yours)

fendy3002
0 replies
1d6h

> OP says to always store a timezone with each date

Jon Skeet talked about this once. https://codeblog.jonskeet.uk/2019/03/27/storing-utc-is-not-a...

What I take from it is that converting everything to UTC is fine for historical data; even a Unix timestamp is fine. For future datetimes, however, it's more complicated than that.

UglyToad
0 replies
1d5h

Having built recurring stuff in the past (date-based with no time component, luckily for me), I think you gain a lot of usability by generating a row for each occurrence of the event.

Inevitably the user will come back and say "oh, I want it monthly except this specific instance" or if it's a time based event "this specific one should be half an hour later". You could just store the exceptions to the rule as their own data-structure but then you need to correlate the exception to the scheduler 'tick' and if they can edit the schedule, well, you're S.O.O.L either way but I think having concrete occurrences is potentially easier to recover from.
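The skip/move exceptions described above can be sketched as a tiny materializer (a toy weekly-only expansion, nothing like full RRULE semantics):

```python
from datetime import date, timedelta

def expand_weekly(start, count, exceptions=frozenset(), overrides=None):
    """Materialize a weekly recurrence into concrete occurrence dates.
    `exceptions` deletes single occurrences; `overrides` moves them."""
    overrides = overrides or {}
    out = []
    for i in range(count):
        occurrence = start + timedelta(weeks=i)
        if occurrence in exceptions:
            continue  # "I want it weekly, except this specific instance"
        out.append(overrides.get(occurrence, occurrence))  # "move this one"
    return out

dates = expand_weekly(
    date(2024, 1, 1), 4,
    exceptions={date(2024, 1, 15)},
    overrides={date(2024, 1, 8): date(2024, 1, 9)},
)
```

Once occurrences are concrete rows like this, per-instance edits are just row edits, which is the recoverability argument above.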

throwaway211
1 replies
1d10h

Interesting.

How about edits, changes of time and location, who's signed up, and to which revision?

michaelmior
0 replies
1d10h

I think edits and changes of time were pretty well discussed (although certainly some details were missing).

trwhite
0 replies
1d7h

Years ago I worked on a calendar application with recurrence. After lots of research I settled on using RRules to represent this, which I was very pleased with. That initial work was when I was at an agency.

Later I joined the company full time and discovered to my amazement that a contractor from a different company had removed the RRules in favour of creating and destroying instances of events on the fly. It had no/little fault tolerance so sometimes the script (which did other things that would sometimes fail) would fail to create new events. You'd have monthly recurring events with missing months.

I found it so frustrating that, after I had gone through a lot of thought and research, someone had put nowhere near as much effort into removing my work. It took just a few weeks at that company to realise that the CEO expected the Engineering team to pump out features (that nobody used) at his will, and, given the uncertainty of the job market, I sadly stayed there for 2 years.

Unrelated footnote: After Googling them, it's really sad to see what are blatantly fake reviews by the CEO on Glassdoor all written in the same style with nothing bad to say. I (and a bunch of other people I know who worked there) hated him, but the silver lining is that I wrote some of my best essays there. The CTO was hopeless too.

roland35
0 replies
1d7h

All I want in Google calendar is a log of changes to the calendar itself. Please add this to the database!

lichtenberger
0 replies
1d10h

I once implemented the backend of a calendar and resource control for a low code platform.

The control is highly customizable, with a lot of views to choose from: daily, monthly, yearly... but also resource views (you can book resources with custom groupings, by plugin, by resource ID, whatever...). You can define "plugins" on the data sources: what the from- and to- columns are, the title column, and what the resource is (it may come from a foreign key / 1:1 relationship, or 1:N if it's from a "child" data source, or from the same data source/table).

Furthermore, I implemented different appointment series to choose from (monthly, weekly (which weekdays), daily...), including which column values should be copied. Also appointment conflicts (or conflicts only if they book the same resource). You could also configure buffers before and after appointments where no other appointment can be placed.

That was a lot of fun and also challenge sometimes regarding time zones and summer/winter time in Europe and so on :-)

evnix
0 replies
1d9h

Nicely done. This is one of those aspects which is not really touched on in most courses.

Will try to get my company to buy a few copies of your book for each of our team members.

LeonB
0 replies
1d10h

I found this to be a good introduction and a well-chosen domain to demonstrate modelling.

The term “anchor” feels kind of weird to me, but the explanation is so concrete/grounded (like an actual anchor) that I guess it works well enough.

The concept of defining the attributes via a question is solid, great way to get clarity quickly. Too often we jump to a minimal column/property name without defining what question we’re trying to answer, and thus not shaking loose any ambiguity in the mind of the customer(s).