Comments & Replies

September 25, 2024

Introduction

In this blog article, we’ll explore how to model comments and replies in a relational database and how to efficiently fetch them. In the case of a commenting system, each post can have multiple comments. A comment consists of the comment text, the username of the commenter, their profile picture, the timestamp of creation, and the number of likes. Additionally, it should track whether the user viewing the comment has already liked it. Each comment made directly on a post (a parent comment) can have multiple replies. However, replies are not nested further - meaning a reply cannot have its own replies. Instead, replying to a reply is treated as a reply to the original parent comment, similar to how Instagram handles comments. Parent comments are ordered based on relevance, which is determined by a combination of the number of likes and the comment's age. Replies (child comments) are simply sorted by time. When viewing a post, only parent comments are initially loaded, and users can expand them to view replies. These are the key requirements for this use case, and we will explore how to model and retrieve this data efficiently in a relational database.

Schema Design

When designing the schema using normalization, you would create separate tables for comments and replies. In this setup, each comment in the comments table can have multiple associated replies stored in a replies table, with the relationship enforced by a foreign key (comment_id). To track which users have liked a particular comment or reply, additional many-to-many relationship tables - such as relation_liked_comment and relation_liked_reply - are introduced. The diagram below illustrates a normalized design:

Normalized Design

While the normalized approach is a valid way to model the data, it comes with some drawbacks. In many cases, a single-table design using a self-referencing parent_comment_id is preferable because comments and replies share nearly identical data and behavior, making them essentially the same entity. This unified approach avoids splitting similar data across two tables, which can increase the complexity of queries and application logic - especially if you later decide to support nested replies beyond a single level. By consolidating everything into one table, you simplify data management, reduce code duplication, and gain the flexibility to traverse comment threads using recursive queries or common table expressions. In contrast, a two-table design introduces redundant structures, adds join overhead, and limits the evolution of the comment system into deeper hierarchies without a significant redesign.

Not Normalized Design

Query Construction

Let’s begin by querying the parent comments for a specific post. The following query demonstrates how to fetch these top-level comments.

</> SQL
SELECT * FROM comments WHERE parent_id IS NULL AND post_id = :${POST_ID}; 
        

The initial query retrieves the comment text and its creation timestamp, but it lacks additional user details. To include information about the comment’s author, we join the comments with the users table.

</> SQL
SELECT c.comment, c.created_at, u.username, u.picture 
FROM comments c 
JOIN users u ON c.user_id = u.id 
WHERE c.parent_id IS NULL
AND c.post_id = :${POST_ID};
        

Next, we want to display the number of likes each comment has received. This is done by joining the relation_liked_comment table, which records the many-to-many relationship between users and comments, into our query.

</> SQL
SELECT 
    c.comment, 
    c.created_at,
    u.username, 
    u.picture, 
    COUNT(rlc) AS like_count 
FROM comments c 
JOIN users u ON c.user_id = u.id 
LEFT JOIN relation_liked_comment rlc ON rlc.comment_id = c.id 
WHERE parent_id IS NULL 
AND c.post_id = :${POST_ID}
GROUP BY
    c.comment,
    c.created_at,
    u.username,
    u.picture;
        

To include the number of replies, we perform a self-join on the comments table using the parent_id field, which connects a comment to its replies.

</> SQL
SELECT
    c.comment,
    c.created_at,
    u.username,
    u.picture,
    COUNT(rlc) AS like_count,
    COUNT(c2.id) AS response_count
FROM comments c
JOIN users u ON c.user_id = u.id
LEFT JOIN comments c2 ON c2.parent_id = c.id
LEFT JOIN relation_liked_comment rlc ON rlc.comment_id = c.id
WHERE c.parent_id IS NULL
AND c.post_id = :${POST_ID}
GROUP BY
    c.comment,
    c.created_at,
    u.username,
    u.picture;
        

Additionally, we need to determine whether the currently logged-in user has liked each comment. This is achieved by leveraging the relation_liked_comment table once more.

</> SQL
SELECT
    c.comment,
    c.created_at,
    u.username,
    u.picture,
    CASE WHEN rlc2.comment_id IS NOT NULL THEN true ELSE false END AS is_liked,
    COUNT(rlc) AS like_count,
    COUNT(c2.id) AS response_count
FROM comments c
JOIN users u ON c.user_id = u.id
LEFT JOIN relation_liked_comment rlc ON rlc.comment_id = c.id
LEFT JOIN relation_liked_comment rlc2 ON rlc2.comment_id = c.id AND rlc2.user_id = :${USER_ID}
LEFT JOIN comments c2 ON c2.parent_id = c.id
WHERE parent_id IS NULL
AND c.post_id = :${POST_ID}
GROUP BY
    c.comment,
    c.created_at,
    u.username,
    u.picture,
    is_liked;
        

The final piece is an ordering column that sorts comments based on relevance. The query below shows how this field is calculated and used.

</> SQL
SELECT
    c.comment,
    c.created_at,
    u.username,
    u.picture,
    CASE WHEN rlc2.comment_id IS NOT NULL THEN true ELSE false END AS is_liked,
    COUNT(rlc) AS like_count,
    COUNT(c2.id) AS response_count,
    (10 * COUNT(rlc.user_id)) / POWER((EXTRACT(EPOCH FROM (NOW() - c.created_at)) / 3600 + 2), 1.5) AS order_col
FROM comments c
JOIN users u ON c.user_id = u.id
LEFT JOIN relation_liked_comment rlc ON rlc.comment_id = c.id
LEFT JOIN relation_liked_comment rlc2 ON rlc2.comment_id = c.id AND rlc2.user_id = :${USER_ID}
LEFT JOIN comments c2 ON c2.parent_id = c.id
WHERE parent_id IS NULL
AND c.post_id = :${POST_ID}
GROUP BY
    c.comment,
    c.created_at,
    u.username,
    u.picture,
    is_liked
ORDER BY
    order_col DESC,
    c.id DESC;
        

Here’s a breakdown of how the ordering column, order_col, is computed:
- The numerator multiplies the like count by 10, ensuring that likes have a significant impact.
- The denominator is derived from the comment’s age in hours (calculated using EXTRACT(EPOCH FROM (NOW() - c.created_at)) / 3600), with a constant added to prevent issues with very recent comments. An exponent of 1.5 controls how quickly the score decays over time.
- This formula means that a comment with zero likes scores 0, a new comment with a few likes scores higher than an older comment with the same number of likes, and popular comments remain prominent even as they age.
- To ensure consistent ordering when scores match, the comment ID is used as a secondary sort key.

Rather than retrieving all comments simultaneously, it's more efficient to use pagination. Since a custom order column determines the sequence, keyset pagination is employed to consistently maintain that order across pages.

</> SQL
SELECT
    c.comment,
    c.created_at,
    u.username,
    u.picture,
    CASE WHEN rlc2.comment_id IS NOT NULL THEN true ELSE false END AS is_liked,
    COUNT(rlc) AS like_count,
    COUNT(c2.id) AS response_count,
    (10 * COUNT(rlc.user_id)) / POWER((EXTRACT(EPOCH FROM (NOW() - c.created_at)) / 3600 + 2), 1.5) AS order_col
FROM comments c
JOIN users u ON c.user_id = u.id
LEFT JOIN relation_liked_comment rlc ON rlc.comment_id = c.id
LEFT JOIN relation_liked_comment rlc2 ON rlc2.comment_id = c.id AND rlc2.user_id = :${USER_ID}
LEFT JOIN comments c2 ON c2.parent_id = c.id
WHERE parent_id IS NULL
AND c.post_id = :${POST_ID}
GROUP BY
    c.comment,
    c.created_at,
    u.username,
    u.picture,
    is_liked
HAVING ((10 * COUNT(rlc.user_id)) / POWER((EXTRACT(EPOCH FROM (NOW() - c.created_at)) / 3600 + 2), 1.5), c.id) 
    < (:${ORDER_VALUE}, :${COMMENT_ID})
ORDER BY
    order_col DESC,
    c.id DESC
FETCH FIRST :${PAGE_SIZE} ROWS ONLY;
        

With this query design, we use keyset pagination to ensure a consistent order of comments across pages. For the first page, we start by using maximum possible values for the ordering fields (such as the highest order_col and comment ID) as our cursor, so that we retrieve the top results. For subsequent pages, the pagination query uses the order_col and ID of the last comment from the previous page as the cursor in the HAVING clause. This filters the results to only those comments that come after the last fetched entry, ensuring no duplicates and maintaining the intended order.

This is the complete query for fetching the parent comments. But how should we retrieve the child comments associated with a parent? The following example illustrates the query for fetching them.

</> SQL
SELECT
    c.comment,
    c.created_at,
    u.username,
    u.picture,
    CASE WHEN rlc2.comment_id IS NOT NULL THEN true ELSE false END AS is_liked,
    COUNT(rlc) AS like_count
FROM comments c
JOIN users u ON c.user_id = u.id
LEFT JOIN comments c2 ON c2.parent_id = c.id
LEFT JOIN relation_liked_comment rlc ON rlc.comment_id = c.id
LEFT JOIN relation_liked_comment rlc2 ON rlc2.comment_id = c.id AND rlc2.user_id = :${USER_ID}
WHERE parent_id = :${PARENT_COMMENT_ID}
AND (c.created_at, c.id) > (:${CREATED_AT}, :${REPLY_ID})
GROUP BY
    c.comment,
    c.created_at,
    u.username,
    u.picture,
    is_liked
ORDER BY 
    c.created_at, 
    c.id
FETCH FIRST :${PAGE_SIZE} ROWS ONLY;
        

Key differences include filtering the parent_id by the specific comment id rather than null. Additionally, the reply count is removed because nested replies are not supported, and the post_id condition is omitted since unique comment ids already ensure the required filtering granularity. Finally, the ordering is based solely on the natural sequence - displaying the oldest responses first and progressing to the newest.

Indexing

Proper indexing is critical for maintaining performance as your comments and replies dataset grows. To ensure efficient query execution, create indexes on columns that are frequently used in filtering (such as post_id, parent_id) and sorting (like created_at). In cases where multiple columns are often queried together, consider composite indexes to optimize join and search operations. Keep in mind that while indexes can significantly speed up data retrieval, they also add overhead during write operations, so it’s important to balance indexing strategies with your update frequency. Regularly review your query execution plans and monitor index usage to adjust or add indexes as your application's query patterns evolve.

Conclusion

This example illustrates how a single SQL query can effectively retrieve comments, providing all the necessary details for display. It fetches parent comments along with associated user information, like counts, reply counts, and a relevance-based ordering mechanism, while also offering a method to retrieve child comments. This approach balances performance with flexibility, streamlining the management and pagination of comment threads - all without duplicating data, as both parent comments and replies are stored in a single table.