Capturing snapshots for historical graphs

Graphs are essential in any software that records and analyzes data. They present complex data visually, organized in a way that makes it easier to read. Take, for example, a graph that displays a team's throughput over the last 12 months. Graphs like these are valuable to businesses because they help forecast results and make timely, critical decisions.
However, building graphs, particularly on the backend, can be very tricky and prone to integrity issues if not handled with care. Historical graphs are among the deceptively difficult ones to build. Let's see why.
Note: I wrote this blog with the assumption that the software uses the application layer to decide when or how to persist data. Therefore, the SQL examples work with that rule in mind.
On-demand approach
Let's use the team's throughput graph as an example. Suppose we have these tables in the database:
CREATE TABLE teams (
id BIGSERIAL PRIMARY KEY,
name VARCHAR NOT NULL,
created_at TIMESTAMP NOT NULL DEFAULT NOW(),
updated_at TIMESTAMP NOT NULL DEFAULT NOW()
);
CREATE TABLE users (
id BIGSERIAL PRIMARY KEY,
email VARCHAR NOT NULL,
password_digest VARCHAR NOT NULL,
team_id BIGINT NOT NULL,
created_at TIMESTAMP NOT NULL DEFAULT NOW(),
updated_at TIMESTAMP NOT NULL DEFAULT NOW(),
CONSTRAINT fk_users_teams
FOREIGN KEY (team_id)
REFERENCES teams(id)
ON DELETE CASCADE
);
CREATE UNIQUE INDEX index_users_on_email
ON users(email);
CREATE TABLE tasks (
id BIGSERIAL PRIMARY KEY,
title VARCHAR NOT NULL,
completed_at TIMESTAMP,
user_id BIGINT NOT NULL,
created_at TIMESTAMP NOT NULL DEFAULT NOW(),
updated_at TIMESTAMP NOT NULL DEFAULT NOW(),
CONSTRAINT fk_tasks_users
FOREIGN KEY (user_id)
REFERENCES users(id)
ON DELETE CASCADE
);
To summarize this picture: teams have many users, and each task is assigned to a single user. When a user completes a task, the record's completed_at field is set to the current time to reflect the action. Pretty easy to understand.
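For orientation, here is a minimal application-layer sketch of that model in Python. The Task class and the complete helper are illustrative stand-ins, not part of the schema or any particular ORM:

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

# Illustrative in-memory mirror of the tasks table.
@dataclass
class Task:
    id: int
    title: str
    user_id: int
    completed_at: Optional[datetime] = None

def complete(task: Task) -> Task:
    # Completing a task stamps completed_at with the current time.
    task.completed_at = datetime.now(timezone.utc)
    return task
```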
Now, if the company wants to analyze a team's throughput over the last 12 months, we can build the graph on demand using a query like this:
WITH months AS (
SELECT GENERATE_SERIES(
DATE_TRUNC('month', CURRENT_DATE) - INTERVAL '11 months',
DATE_TRUNC('month', CURRENT_DATE),
INTERVAL '1 month'
) AS month
),
completed AS (
SELECT
DATE_TRUNC('month', tasks.completed_at) AS month,
COUNT(tasks.id) AS task_count
FROM tasks
JOIN users ON users.id = tasks.user_id
WHERE tasks.completed_at IS NOT NULL
AND tasks.completed_at >= DATE_TRUNC('month', CURRENT_DATE) - INTERVAL '11 months'
AND users.team_id = 1
GROUP BY DATE_TRUNC('month', tasks.completed_at)
)
SELECT
months.month,
COALESCE(completed.task_count, 0) AS task_count
FROM months
LEFT JOIN completed ON completed.month = months.month
ORDER BY months.month;
which returns these results for team ID 1:
month | task_count
----------------+------------
2025-05-01 | 12
2025-06-01 | 0
2025-07-01 | 7
...
2026-04-01 | 5
And it works, right?
Yes, it does! That is, until... we start deleting records. Once we introduce hard-deletes, the graph loses integrity and no longer accurately displays the throughput. Records may get deleted by accident or on purpose; either way, the graph becomes unreliable.
But what if the software doesn't hard-delete records as part of its process? Mission-critical software often soft-deletes instead to prevent data loss and support data resurrection. For those unfamiliar with soft-deletes: instead of actually removing the record from the database (a hard-delete), a flag hides it from the client. For example:
-- Soft-deleting a task (assumes a discarded_at column on tasks)
UPDATE tasks
SET discarded_at = NOW()
WHERE id = 1;
-- Fetching "non-deleted" tasks
SELECT * FROM tasks WHERE discarded_at IS NULL;
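At the application layer, the same idea can be sketched like this. The discard, undiscard, and kept helpers are illustrative names, not from any particular library:

```python
from datetime import datetime, timezone

def discard(task: dict) -> None:
    # Soft-delete: flag the record instead of removing it.
    task["discarded_at"] = datetime.now(timezone.utc)

def undiscard(task: dict) -> None:
    # Resurrection: clear the flag so the record is visible again.
    task["discarded_at"] = None

def kept(tasks: list) -> list:
    # Clients only ever see "non-deleted" records.
    return [t for t in tasks if t.get("discarded_at") is None]
```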
In this situation, building the graph on demand seems reasonable. However, the graph is still susceptible to data modifications. Tasks can be "reopened" (completed_at = NULL) after completion, users can change teams, and so on. Admittedly, these are very simple examples that can be solved with proper database design, but as more complexity is introduced, it soon becomes difficult to build an accurate graph this way. The query will also suffer from performance issues as records accumulate over time.
So what's the alternative?
Snapshot-based approach
A snapshot represents a record as it existed at a particular time. The software treats snapshots as immutable, with insert-only and read-only constraints, to ensure they are never changed. A cron job captures the snapshots periodically, and when it is time to build the graph, the software no longer has to rely on the raw data to populate it.
Let's look at how we can implement this for the scenario described in the previous section.
First, we need a snapshot table that stores, for each team, the number of tasks completed during a given day:
CREATE TABLE completed_task_snapshot (
id BIGSERIAL PRIMARY KEY,
count INT NOT NULL,
team_id BIGINT NOT NULL,
captured_at DATE NOT NULL,
created_at TIMESTAMP NOT NULL DEFAULT NOW(),
updated_at TIMESTAMP NOT NULL DEFAULT NOW(),
CONSTRAINT fk_completed_task_snapshots_teams
FOREIGN KEY (team_id)
REFERENCES teams(id)
ON DELETE CASCADE
);
ALTER TABLE completed_task_snapshot
ADD CONSTRAINT unique_team_day
UNIQUE (team_id, captured_at); -- To prevent double-counting
Then, we introduce a cron job that runs every midnight to get the necessary data to create the snapshot. The queries inside the cron job would look like this:
-- Count the total number of completed tasks for the current day for team ID 1
SELECT COUNT(tasks.id)
FROM tasks
JOIN users ON users.id = tasks.user_id
WHERE users.team_id = 1
AND tasks.completed_at >= '2026-04-19'::date
AND tasks.completed_at < '2026-04-19'::date + INTERVAL '1 day';
-- Create the snapshot for that day
INSERT INTO completed_task_snapshot (count, team_id, captured_at)
VALUES (4, 1, '2026-04-19'::date)
ON CONFLICT (team_id, captured_at)
DO UPDATE SET count = EXCLUDED.count;
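As a sanity check on that upsert behavior, here is a hedged Python sketch of the capture step. The capture_snapshot function and its in-memory snapshots dict are hypothetical stand-ins for the cron job and the table; the (team_id, day) key plays the role of the unique constraint, so re-running the job for the same day overwrites the row instead of double-counting:

```python
from datetime import date, datetime

def capture_snapshot(snapshots: dict, completed: list, team_id: int, day: date) -> None:
    # `completed` holds (team_id, completed_at) pairs for finished tasks.
    count = sum(
        1 for t_id, completed_at in completed
        if t_id == team_id and completed_at.date() == day
    )
    # Equivalent of INSERT ... ON CONFLICT ... DO UPDATE: the keyed
    # assignment makes the capture idempotent for a given team and day.
    snapshots[(team_id, day)] = count
```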
And finally, we fetch the snapshots to visualize the team's throughput using this query:
WITH months AS (
SELECT GENERATE_SERIES(
DATE_TRUNC('month', CURRENT_DATE) - INTERVAL '11 months',
DATE_TRUNC('month', CURRENT_DATE),
INTERVAL '1 month'
) AS month
),
team_monthly AS (
SELECT
DATE_TRUNC('month', captured_at) AS month,
SUM(count) AS total_completed
FROM completed_task_snapshot
WHERE completed_task_snapshot.team_id = 1
AND completed_task_snapshot.captured_at >= DATE_TRUNC('month', CURRENT_DATE) - INTERVAL '11 months'
GROUP BY DATE_TRUNC('month', completed_task_snapshot.captured_at)
)
SELECT
m.month,
COALESCE(tm.total_completed, 0) AS total_completed
FROM months m
LEFT JOIN team_monthly tm
ON tm.month = m.month
ORDER BY m.month;
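For illustration, the same zero-filled monthly rollup can be sketched at the application layer. The snapshots dict below is a hypothetical in-memory stand-in for the completed_task_snapshot table, keyed by (team_id, captured_at):

```python
from datetime import date

def month_starts(last: date, n: int = 12) -> list:
    # The last n month starts ending at `last`'s month (mirrors the months CTE).
    y, m = last.year, last.month
    out = []
    for _ in range(n):
        out.append(date(y, m, 1))
        m -= 1
        if m == 0:
            y, m = y - 1, 12
    return list(reversed(out))

def monthly_throughput(snapshots: dict, team_id: int, today: date) -> list:
    # Sum daily snapshot counts per month, filling absent months with 0.
    totals = {month: 0 for month in month_starts(today)}
    for (t_id, captured_at), count in snapshots.items():
        month = captured_at.replace(day=1)
        if t_id == team_id and month in totals:
            totals[month] += count
    return sorted(totals.items())
```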
Notice how we achieved the same result as the on-demand approach. This time, the graph's integrity is preserved even if tasks get deleted or reopened, users change teams, and other data modifications occur. None of these should affect the snapshots because of the surrounding constraints. Therefore, we can confidently say this is the superior approach, right?
Well... not quite.
The implementation requires capturing snapshots every day to populate the graph. If snapshots were introduced later than the tasks feature, say months after, then the graph would have empty records for the prior months. There's another limitation to this approach: if we want filters on the graph, we have to figure out every "dimension" of the filter before rolling out the snapshot. For example, a filter by user would require adding a user_id column to the snapshot table and designing the snapshot per user instead of per team.
-- Assuming the table already has `user_id`
-- Count the total number of completed tasks for the current day for user ID 1
SELECT COUNT(id)
FROM tasks
WHERE user_id = 1
AND completed_at >= '2026-04-19'::date
AND completed_at < '2026-04-19'::date + INTERVAL '1 day';
-- Create the snapshot for that day
INSERT INTO completed_task_snapshot (count, team_id, user_id, captured_at)
VALUES (4, 1, 1, '2026-04-19'::date)
ON CONFLICT (team_id, user_id, captured_at) -- Constraint now includes user_id
DO UPDATE SET count = EXCLUDED.count;
It's nearly impossible to backfill the snapshots to apply any new dimensions after the fact. If we go with this approach, we're forced to lock in the graph's filters from the start. This can be very limiting for certain software, and you'd be forced to create another graph to show a different perspective of the data.
Takeaway
You might’ve expected me to pitch one over the other, but honestly, it depends (pretty boring, I know!). There are many factors to consider when deciding on an implementation. Unfortunately, there’s no such thing as a bulletproof solution. You have to weigh the risks carefully and commit to your choice. If there’s any takeaway here, it’s that historical graphs are difficult to build. Things get messy when you consider the scenarios around the data, and software itself is already messy. I think we can all agree on that.
Thanks for reading my blog! It took some time to write, so I appreciate you sticking through the whole thing. Till next time!