
Friday, February 11, 2022

FW: Learn from Google’s Data Engineers: Dimensional Data Modeling is Dead

My first day at Google was in 2019, and one of my biggest surprises was that I didn’t find any dimensional data marts. I was used to joining facts to dimensions and could rattle off normal forms and preach best practices of data modeling. I considered myself an expert on slowly-changing dimensions and how to apply them. Dimensional data modeling, popularized by Ralph Kimball in his 1996 book The Data Warehouse Toolkit, is a method of organizing data within a data warehouse. While many benefits are preached, I believe it exists for three primary reasons: optimizing compute, organizing data by topic, and optimizing storage. The foundational purposes driving the advent of dimensional modeling have evolved over time. Let’s take this moment in history to revisit why dimensional modeling exists, and whether its roots still meet our needs today.

In the early days of computing, storage cost a premium: as much as $90,000 per gigabyte in 1985. At that price, data needed to be organized to minimize the number of times the same value was stored. To do this, database keys were used as pointers, letting many records reference a single stored copy of a value through its unique identifier. Database normalization came about to describe how de-duplicated and storage-optimized a database had become. Instead of storing a long descriptive value every time a dimension appeared in a table, we could store the description just once and connect it to each record it describes.
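
As a minimal sketch (the table and column names here are hypothetical), normalization replaces a repeated descriptive string with a single row in a lookup table plus an integer key that points to it:

    -- Denormalized: the full product description is repeated on every order row.
    CREATE TABLE orders_denormalized (
        order_id            INT,
        product_description VARCHAR(500)  -- the same long string, stored once per order
    );

    -- Normalized: the description is stored exactly once, and orders point to it by key.
    CREATE TABLE products (
        product_id          INT PRIMARY KEY,
        product_description VARCHAR(500)
    );

    CREATE TABLE orders (
        order_id   INT,
        product_id INT REFERENCES products (product_id)  -- a small integer instead of a 500-byte string
    );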

With the fastest computer in 1985 putting out 1.9 gigaflops of computational power for a whopping $32M, compute optimization was equally important. For reference, some of the fastest computers today put out over 400 petaflops, more than 200 million times the flops. In the early days of databases, reducing the number of operations and the average complexity of each operation could save companies millions. For example, instead of analyzing long, complex string values to relate records, we could assign an integer to each unique instance and relate these integers instead of strings.
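
Sketching the compute side with the same hypothetical tables: the expensive string comparison happens once against the small products table, while the join across the large orders table operates on cheap integer keys:

    -- Costly: every order row requires comparing a long string value.
    SELECT order_id
    FROM orders_denormalized
    WHERE product_description = 'Deluxe Widget, Stainless Steel, Model XJ-900';

    -- Cheaper: the string is matched once against the small products table,
    -- and the join over the many order rows compares only integers.
    SELECT o.order_id
    FROM orders o
    JOIN products p ON o.product_id = p.product_id
    WHERE p.product_description = 'Deluxe Widget, Stainless Steel, Model XJ-900';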

In order to accomplish these optimizations, data was organized into topic-oriented data models. One common example was the star schema.



The benefit of this model was that the center of the star, the fact table, contained values that were physically indexed and easily retrievable. The more costly values to retrieve were stored in dimension tables and could be fetched selectively, saving on processing cost.
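
As a sketch of what such a star might look like in SQL (the sales tables below are hypothetical), wide descriptive dimensions surround a compact fact table of keys and measures:

    -- Points of the star: wide, descriptive dimension tables, retrieved selectively.
    CREATE TABLE dim_date     (date_id INT PRIMARY KEY, full_date DATE, fiscal_quarter VARCHAR(6));
    CREATE TABLE dim_customer (customer_id INT PRIMARY KEY, customer_name VARCHAR(200), region VARCHAR(50));
    CREATE TABLE dim_product  (product_id INT PRIMARY KEY, product_name VARCHAR(200), category VARCHAR(50));

    -- Center of the star: a compact, indexed fact table of keys and measures.
    CREATE TABLE fact_sales (
        date_id     INT REFERENCES dim_date (date_id),
        customer_id INT REFERENCES dim_customer (customer_id),
        product_id  INT REFERENCES dim_product (product_id),
        quantity    INT,
        amount      DECIMAL(12, 2)
    );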

If new dimensions arose relating to the fact, new dimension tables had to be put in place, key relationships enforced, and normalization maintained. In successful dimensional modeling, the source data tables were torn apart and distributed among many tables, and, if done correctly, could be re-assembled into the original source table when necessary.
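
Continuing the hypothetical sales schema above, re-assembly is just a join of the fact table back to each of its dimensions:

    -- Joining the star back together reconstructs the original wide source table.
    SELECT d.full_date,
           c.customer_name, c.region,
           p.product_name, p.category,
           f.quantity, f.amount
    FROM fact_sales f
    JOIN dim_date     d ON f.date_id     = d.date_id
    JOIN dim_customer c ON f.customer_id = c.customer_id
    JOIN dim_product  p ON f.product_id  = p.product_id;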

Why is dimensional modeling disappearing?

1. Storage has become almost free
Database normalization is showing its age. One GB of Google Cloud storage costs just 2 cents per month. The returns from breaking a long and/or wide table down into a star or snowflake schema are poor: even if normalization cut the space required to store a table by 95%, the savings amount to pennies per year and are no longer a deciding factor. This holds true for tables of nearly all sizes, and companies of all sizes.
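
To put rough, illustrative numbers on it: a 1 TB table at $0.02 per GB per month costs about $20 a month to store. Even a full 95% reduction from careful normalization saves roughly $19 a month, or about $230 a year, which rarely covers even a few hours of engineering time.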

2. Compute is inexpensive
While a 10-fold reduction in compute for a data model could once save hundreds of thousands, or even millions, of dollars, those savings are no longer cost-justified. Query speed can still be a factor, but the time required to perform all necessary operations and return a result set is minuscule. With the advent of cloud architectures, compute scales easily, and requesting additional resources for a long-running query has become trivial.

3. Dimensional models are difficult for average users to understand
For average data consumers, such as the business users who act on conclusions drawn from the data surfaced in these models, dimensional models aren’t intuitive. We as Data Engineers may look at data models and understand them intuitively, but the users who ultimately benefit from the systems we maintain almost universally prefer to view data in a format they’re familiar with: flat tables that read like spreadsheets. In my experience, it’s much easier to teach a simple SELECT, FROM, and WHERE than to explain dimensional models, why they exist, and how to join tables to retrieve values.
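
The contrast is easy to see with the hypothetical sales schema from earlier (the flat sales table below is likewise hypothetical). The dimensional version requires knowing which dimensions exist and how to join them; the flat version reads like filtering a spreadsheet:

    -- Dimensional model: the user must understand keys and joins.
    SELECT c.customer_name, SUM(f.amount) AS total_sales
    FROM fact_sales f
    JOIN dim_customer c ON f.customer_id = c.customer_id
    JOIN dim_date     d ON f.date_id     = d.date_id
    WHERE d.fiscal_quarter = '2021Q4'
    GROUP BY c.customer_name;

    -- Flat table: just SELECT, FROM, WHERE over columns the user already recognizes.
    SELECT customer_name, SUM(amount) AS total_sales
    FROM sales
    WHERE fiscal_quarter = '2021Q4'
    GROUP BY customer_name;
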
4. Dimensional models are costly to maintain
New columns added to source systems consume no integration resources on their own, while integrating those columns or new source systems into a dimensional model is relatively time-consuming. And although newer data modeling tools have made integration easier in recent years, if the model isn’t adjusted each time source tables gain columns, those columns often aren’t usable by end users.


What’s next for data design?

The benefits of dimensional data modeling are showing their age. Just as cubes came to fame and soon faded from popularity, star schemas have had their day in the spotlight. In the coming years, data lakes and data lakehouses will take their place. Data lakes provide a better end-user experience, are inexpensive to maintain, and require no additional engineering resources to construct.

The primary benefit of data lakes is usability for the business. The middle layer of the data ingestion machine, Analysts and Business Intelligence Engineers, was once required to interpret complex data models to deliver value to the business; now data can be connected directly from source to end-user. Analysts and Business Intelligence Engineers can instead focus on more valuable problems, such as engineering features to build predictive pipelines. The recent success of data lakes shows that marginal reductions in compute and storage no longer pay off, and that usability has been re-discovered as a major uplift to the entire data ecosystem. The maintenance cost associated with dimensional models can instead be spent creating rapid value for the business.

What about dimensional modeling? Dimensional modeling had its time in history, and much like the cube, I believe it will drift into obscurity. Many companies today are deeply committed to dimensional modeling, so I don’t believe the skill will die out for many years. But as new teams weigh the costs of data lakes against dimensional models, fewer and fewer new dimensional models will be built.

Source: https://blog.devgenius.io/learn-from-googles-data-engineers-dimensional-data-modeling-is-dead-68f6c2cb3fb0