Data Modeling: Part 1 — Goals and Methodology
By Riccardo Tommasini / Mar 08, 2023
This article was originally published in The New Stack and is reposted here with permission.
Data modeling is the process of defining and representing the data elements in a system in order to communicate connections between data points and structures. In his impactful book “Designing Data-Intensive Applications,” Martin Kleppmann describes data modeling as the most critical step in developing any information system.
Understanding which data is relevant for the business and in what form requires communication between functional and technical people. Moreover, enabling data sharing across the components of an information system is critical for the system to function well. Quoting Kleppmann, “data models have such a profound effect not only on how the software is written but also on how we think about the problem that we are solving.”
But what exactly is a data model, then?
A data model is a specification that describes the structure of the data stored in the system.
In addition, it may define constraints that guarantee data integrity and standardize how to represent (rules), store (format) or share (protocol) data. In the literature, we typically distinguish three levels of data modeling (see the pyramid figure):
- The Conceptual level defines what the system contains. Business stakeholders typically create a conceptual model. The purpose is to organize, scope and define business concepts and rules. Definitions are the most important output at this level, for example, what the business means by a product.
- The Logical level defines how the database management system (DBMS) should be implemented. A logical model is technologically biased and is created with the purpose of developing a technical map of rules and data structures. Relationships and attributes become visible, for instance, product name and price.
- The Physical level describes how to use a specific technology to implement the information system. The physical model is created with the purpose of implementing the database. The physical level explores the trade-offs in terms of data structures and algorithms.
During the Beginner Flux training at InfluxDB University, we used the same levels to understand how time series data maps onto the Flux data structure and InfluxDB’s line protocol data model. Here we take this a step further in data modeling for InfluxDB and Flux. Therefore, it is worth recalling that:
- Conceptually, a time series is an ordered set of timestamped data points described by one — and only one — measurement and a set of tags.
- Logically, Flux represents multiple series simultaneously, encoding their different values as a set of key-value pairs named fields. Moreover, tags are key-value pairs that help further partition the data for processing.
- Physically, InfluxDB stores data in a Time-Structured Merge Tree (TSM); it is also worth mentioning that tags are indexed by both key and value. A small line protocol sketch follows this list to make the mapping concrete.
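To illustrate the mapping, here is a minimal sketch that formats a single data point as InfluxDB line protocol by hand. The measurement, tag and field names are hypothetical, the sketch only handles numeric fields and skips the escaping rules; in practice you would typically rely on an InfluxDB client library instead.

```python
from datetime import datetime, timezone

# Hypothetical point: one measurement, a tag set, two numeric fields, one timestamp.
measurement = "home_sensors"
tags = {"room": "kitchen", "sensor_id": "s-42"}
fields = {"temperature": 21.5, "humidity": 0.47}
timestamp_ns = int(datetime.now(timezone.utc).timestamp() * 1e9)

# Line protocol shape: <measurement>,<tags> <fields> <timestamp>
tag_str = ",".join(f"{k}={v}" for k, v in tags.items())
field_str = ",".join(f"{k}={v}" for k, v in fields.items())
line = f"{measurement},{tag_str} {field_str} {timestamp_ns}"

print(line)
# e.g. home_sensors,room=kitchen,sensor_id=s-42 temperature=21.5,humidity=0.47 1678270921000000000
```

Conceptually, this is one timestamped point in a series identified by the measurement and the tag set; logically, the fields carry its values and the tags partition the data; physically, InfluxDB persists and indexes it in the TSM engine.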
A brief history of data modeling methods
Now that we have clarified what a data model is and the goals of data modeling, we can discuss how we get there. In practice, several methodologies exist in the literature. The most prominent ones, listed below, differ in terms of target information systems and workloads, i.e., online transaction processing (OLTP) and DBMSs; online analytical processing (OLAP) and data warehouses; and big data and data lakes.
- Relational modeling (RM) focuses on removing redundant information for a model that encompasses the whole enterprise business. RM uses relations (tables) to describe domain entities and their relationships.
- Dimensional modeling (DM) focuses on enabling complete requirement analysis while maintaining high performance when handling large and complex (analytical) queries. DM aims to optimize data access; thus, it is tailored for OLAP. The star and snowflake models are notable results of dimensional modeling (a small star-schema sketch follows this list).
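To make the contrast tangible, here is a minimal, hypothetical star schema sketched as plain Python dictionaries: a central fact table whose rows reference the surrounding dimension tables by key. The table and column names are invented for illustration.

```python
# Hypothetical dimension tables: descriptive context, keyed by a surrogate id.
dim_product = {
    1: {"name": "kettle", "category": "kitchen"},
    2: {"name": "lamp", "category": "lighting"},
}
dim_date = {
    20230308: {"day": 8, "month": 3, "year": 2023},
}

# Fact table: each row borrows its identity from the dimensions it references
# and carries the measures to be aggregated (quantity, revenue).
fact_sales = [
    {"product_id": 1, "date_id": 20230308, "quantity": 3, "revenue": 89.97},
    {"product_id": 2, "date_id": 20230308, "quantity": 1, "revenue": 24.50},
]

# A typical OLAP-style question: revenue per product category.
revenue_by_category = {}
for row in fact_sales:
    category = dim_product[row["product_id"]]["category"]
    revenue_by_category[category] = revenue_by_category.get(category, 0.0) + row["revenue"]

print(revenue_by_category)  # {'kitchen': 89.97, 'lighting': 24.5}
```

A snowflake schema would go one step further and normalize the dimensions themselves, for instance by moving category into its own table.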
Notably, RM and DM produce significantly different results at the logical and physical levels of abstraction described above. Nonetheless, they share similar conceptualization and tooling when operating at the conceptual level. Indeed, the entity-relationship (ER) modeling technique and its diagrams underpin all the models mentioned above, as well as graph databases and semantic stores. Therefore, it is worth refreshing what ER implies:
- An entity is an object that exists and is distinguishable from other objects. Entities have a type and descriptive attributes; an entity-set groups entities of the same type. An attribute called the primary key uniquely identifies each entity in a set.
- A relationship is an association among several entities. The cardinality of a relationship describes the number of entities to which another entity can be associated; we consider one-to-one, one-to-many and many-to-one. The sketch below illustrates both notions.
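As a small, hypothetical illustration, the following sketch models two entity-sets and a one-to-many relationship between them using Python dataclasses; the entity names, attributes and keys are invented.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Product:
    sku: str       # primary key: uniquely identifies each entity in the set
    name: str      # descriptive attributes
    price: float

@dataclass
class Order:
    order_id: int  # primary key
    customer: str
    # One-to-many relationship: one order is associated with many products.
    products: List[Product] = field(default_factory=list)

kettle = Product(sku="P-001", name="kettle", price=29.99)
lamp = Product(sku="P-002", name="lamp", price=24.50)

order = Order(order_id=1, customer="Ada")
order.products.extend([kettle, lamp])
```

In an ER diagram, the same information would appear as two entity boxes connected by a relationship annotated with its cardinality.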
In different techniques, entities and relationships remain central. However, their nature and roles are reinterpreted according to the business goals. For example, RM stresses identifying as many entities as possible to avoid data redundancy. Indeed, redundancy creates maintenance problems over time, which conflict with the user’s need for consistency.
Conversely, DM builds around facts that borrow their identity from other entities through many-to-many relations. Such entities are interpreted as dimensions, that is, descriptive information that gives context to the facts. DM is of primary interest to data warehouse users, whose top concerns are analytics. Both of the modeling techniques mentioned above can, to some extent, represent time.
- In relational modeling, time is just an attribute. Entities and relationships can be updated, but the conceptual schema does not carry information at this level. Temporal extensions of the relational modeling approaches have been proposed. However, they are tailored for temporal databases, which focus on the temporal validity of their entities (as a form of consistency) rather than time series databases (TSDBs) and the history of their time-varying attributes.
- In dimensional modeling, time is considered an analytical dimension: it represents a possible subject to slice over, producing meaningful aggregates. Dimension tables within the dimensional model do not consider changes at the conceptual level; however, at lower levels, changes may happen. Different approaches to handling such “slowly changing dimensions” have been proposed, including keeping track of their history, which is close to what a TSDB would do. The sketch below contrasts the two perspectives on time.
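To close, here is a minimal hypothetical sketch of the two perspectives on time: a temporal-database-style record that tracks the validity interval of an entity, versus a time-series-style history of a time-varying attribute that can be sliced into aggregates. The attribute names and values are made up.

```python
from datetime import datetime

# Temporal-database view: time bounds the validity of an entity (consistency-oriented).
price_validity = {
    "sku": "P-001",
    "price": 29.99,
    "valid_from": datetime(2023, 1, 1),
    "valid_to": datetime(2023, 3, 1),  # superseded by a new record after this date
}

# Time-series view: time orders the history of a time-varying attribute (history-oriented).
price_history = [
    (datetime(2023, 1, 1), 29.99),
    (datetime(2023, 2, 1), 27.49),
    (datetime(2023, 3, 1), 31.00),
]

# Slicing over time, as a dimensional model would, groups the history into buckets:
by_month = {}
for ts, value in price_history:
    by_month.setdefault(ts.strftime("%Y-%m"), []).append(value)
print(by_month)  # {'2023-01': [29.99], '2023-02': [27.49], '2023-03': [31.0]}
```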