Column Store Databases are awesome!
Except when they are not. I'll try to explain some of the underlying concepts, so you can use them better and sound really cool with your friends and coworkers.
In recent years, I’ve seen a growing trend of clients moving their data platforms to solutions like Snowflake, Databricks, or DuckDB. These platforms have become wildly popular for analytics workloads, largely because of one standout feature: column-level data storage. To understand why this is such a game-changer, it helps to start with what it’s replacing: the traditional row-level storage.
Row-level storage
Most applications are built to handle transactional workloads rather than analytical ones. They treat data as individual records to be stored, retrieved, and modified—think of a database as a filing cabinet, where each record contains all the details for a specific entity.
In a traditional row-level storage model, each record is stored as a row, similar to how a CSV file organizes data. Here’s an example of what this looks like:
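In Python terms, a row layout might look like the sketch below. The emissions figures are invented for illustration, not real data:

```python
# Row-level storage: each record's fields live together,
# like lines in a CSV file. Values are made up for illustration.
rows = [
    {"year": 2019, "source": "coal", "co2_mt": 14.5},
    {"year": 2020, "source": "coal", "co2_mt": 13.1},
    {"year": 2021, "source": "coal", "co2_mt": 12.8},
]

# Fetching one full record is cheap: everything sits side by side.
record = rows[2]
print(record)  # {'year': 2021, 'source': 'coal', 'co2_mt': 12.8}
```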
When data is stored this way, all the information for a single row is physically grouped together on the disk. This makes it extremely efficient to retrieve entire rows of data—perfect for CRUD (Create, Read, Update, Delete) operations.
Let’s draw up a quick example of how our application might handle such a request.
If an application needs the data for 2021, it can query an index, locate the relevant row, and retrieve it quickly.
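A minimal sketch of that lookup, with a hypothetical index on the year column and invented values:

```python
# Hypothetical index on the "year" column: it maps each year to the
# row's position, so fetching the 2021 record is a single lookup
# instead of a full scan. Figures are invented for illustration.
rows = [
    {"year": 2019, "source": "coal", "co2_mt": 14.5},
    {"year": 2020, "source": "coal", "co2_mt": 13.1},
    {"year": 2021, "source": "coal", "co2_mt": 12.8},
]
index = {row["year"]: pos for pos, row in enumerate(rows)}

row_2021 = rows[index[2021]]   # jump straight to the right row
print(row_2021["co2_mt"])      # 12.8
```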
However, row-level storage is far less efficient for analytical operations. Let’s say we want to calculate the average CO2 emissions from coal over a period of time. Instead of making a single lookup, the database has to read through every row, extract the relevant column value, and then perform the aggregation. That’s a lot of work.
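The extra work is easy to see in a sketch (invented figures again): every full record has to be read just to extract one field.

```python
# Averaging one column in a row store: the loop touches every
# whole record, even though only co2_mt is needed.
rows = [
    {"year": 2019, "source": "coal", "co2_mt": 14.5},
    {"year": 2020, "source": "coal", "co2_mt": 13.1},
    {"year": 2021, "source": "coal", "co2_mt": 12.8},
]

total = 0.0
for row in rows:              # read every row...
    total += row["co2_mt"]    # ...for a single column's value
average = total / len(rows)
```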
What we really want in this case is to store all the CO2 emissions data together—next to each other on disk or in memory—so we can access it in one go. Enter column-level storage.
Column-level storage
Instead of storing all the fields for a record together, column-level storage organizes data by column. Each column’s values are stored together on disk.
This layout transforms analytical workloads. Instead of making a lookup for every row, we only need to scan the relevant column, which drastically improves performance.
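A sketch of the same data in a columnar layout, with one array per column (figures invented for illustration):

```python
# Column-level layout: each column's values are stored contiguously.
# Averaging co2_mt now reads only that one array; the other columns
# are never touched. Figures are invented for illustration.
columns = {
    "year":   [2019, 2020, 2021],
    "source": ["coal", "coal", "coal"],
    "co2_mt": [14.5, 13.1, 12.8],
}

co2 = columns["co2_mt"]        # only the needed column is scanned
average = sum(co2) / len(co2)
```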
This principle is often referred to as projection pushdown—only the columns you need (the "projection") are scanned.
But column-level storage doesn’t stop there. It also optimizes how data is stored to make queries even faster and more space-efficient. Let’s dive into an example.
In a simple setup, we’d store the Gender column like this:
Male, Male, Male, Male, Male, Male, Male, Male, Female, Female, Female, Female, Female, Female, Female, Female
It’s already more efficient than row storage because all the values for the Gender column are grouped. But we can take it a step further using compression. Instead of storing each individual value, we can store the value and its frequency:
Male: 8, Female: 8
This compressed format stores 4 entries instead of 16, so it takes up less space and requires fewer operations for aggregations (e.g., counting occurrences).
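This scheme is known as run-length encoding, and it can be sketched in a few lines of Python:

```python
from itertools import groupby

# Run-length encode the Gender column: store each run of repeated
# values as a (value, count) pair instead of the raw values.
values = ["Male"] * 8 + ["Female"] * 8
encoded = [(value, len(list(run))) for value, run in groupby(values)]
print(encoded)  # [('Male', 8), ('Female', 8)]

# Aggregations can work on the compressed form directly,
# without expanding it back into 16 values:
female_count = sum(n for v, n in encoded if v == "Female")
```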
However, compression is just one piece of the puzzle. For larger datasets, reading an entire column can still be inefficient, especially if we only need a subset of the data. This is where metadata comes into play. Columns are often divided into chunks (or "row groups"), and metadata is stored about each chunk—such as minimum and maximum values, or even aggregates like sums and counts.
Let’s say we’re only interested in analyzing data for "cool people." With metadata, the system can skip chunks that don’t meet the criteria, retrieving only the relevant data.
In the example above, we can tell our application to retrieve data only from files 1, 2, and 6, and skip files 3, 4, and 5. That greatly decreases the amount of data we need to send and read, which in turn increases performance. When you hear people talk about predicate pushdown, this is what they are talking about: the filter (the "predicate") is pushed down to the storage layer, so chunks that cannot match are never read. More specifically, this is predicate pushdown using row-group level metadata, because each row group carries metadata describing the rows it contains. We could also push down predicates against metadata on the actual values of a column, such as min, max, sum, and count. That would be predicate pushdown using column-group level metadata.
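The chunk-skipping idea can be sketched with min/max metadata on a year column. The chunk boundaries and values below are invented for illustration:

```python
# Predicate pushdown via chunk metadata: each chunk of the "year"
# column carries min/max statistics, so chunks that cannot match
# the filter are skipped without being read or decoded.
chunks = [
    {"min": 2000, "max": 2009, "values": list(range(2000, 2010))},
    {"min": 2010, "max": 2019, "values": list(range(2010, 2020))},
    {"min": 2020, "max": 2023, "values": [2020, 2021, 2022, 2023]},
]

def years_after(chunks, cutoff):
    matches = []
    for chunk in chunks:
        if chunk["max"] < cutoff:  # metadata check only --
            continue               # the chunk is never decoded
        matches += [y for y in chunk["values"] if y >= cutoff]
    return matches

print(years_after(chunks, 2018))  # [2018, 2019, 2020, 2021, 2022, 2023]
```

Real formats such as Parquet store these statistics per row group, but the skipping logic follows the same shape.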
With these techniques, we can greatly improve the performance of our analytical workloads.
When not to use it?
Column-level storage is fantastic for analytical workloads, but it’s not a one-size-fits-all solution. For transactional applications that frequently update, delete, or retrieve individual records, row-level storage still reigns supreme. If you’re building a CRUD application, stick with row-level storage—or explore other options like graph databases if your use case demands it.
Final Thoughts
Column-level storage has revolutionized analytics, making platforms like Snowflake and Databricks the powerhouses they are today. By restructuring how data is stored and accessed, these systems deliver massive performance gains and enable new levels of scalability.