Knowing your consumer
- Analysts / data scientists: easily queryable, no complex data types
- Other data engineers: compact and harder to query, usually done to reduce storage space
- ML models as consumers
- Customers: should be very easy to interpret, e.g. charts
Some of the biggest problems in data engineering occur when data is modeled for the wrong consumer!
- OLAP looks at the entire database or a subset of it, whereas OLTP mostly looks at a single entity or a few rows, e.g. one user. Hence the performance and modelling mechanisms differ.
- Master Data is the middle ground between OLAP and OLTP.
- Production database snapshots → master data → OLAP cubes (slice and dice, easy group by) → metrics
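A loose illustration of the "slice and dice, easy group by" step — the table and column names here (events_olap, country, device_type, user_id) are made up for illustration, not from the video:

```sql
-- Hypothetical flattened OLAP table; names are assumptions.
SELECT
    country,
    device_type,
    COUNT(DISTINCT user_id) AS active_users  -- a metric rolled up from the cube
FROM events_olap
GROUP BY country, device_type;               -- slice along any set of dimensions
```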
Cumulative Table Design
Used to bring data across multiple timeframes into a single table or dataframe. Example: data over multiple days where users log in only on a few of them. The cumulative table design FULL OUTER JOINs each day's data with the cumulated history and COALESCEs values. This way history is preserved.
Can be used for state tracking where the same data goes through different state transitions. Example: dead → live → backburner etc.
Steps:
- FULL OUTER JOIN
- COALESCE ids and unchanging dimensions
- Compute cumulative metrics (e.g. number of days inactive)
- Combine arrays and changing values
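A minimal SQL sketch of these steps, assuming a hypothetical users_cumulated table (the running history) and a daily_events source — all names, columns, and dates are illustrative:

```sql
-- Cumulate one day's activity into the running table (Postgres-style).
INSERT INTO users_cumulated
WITH yesterday AS (
    SELECT * FROM users_cumulated WHERE snapshot_date = DATE '2023-01-01'
), today AS (
    SELECT user_id, event_date FROM daily_events WHERE event_date = DATE '2023-01-02'
)
SELECT
    COALESCE(y.user_id, t.user_id) AS user_id,                 -- COALESCE unchanging ids
    CASE WHEN t.user_id IS NULL                                -- cumulative metric:
         THEN y.days_inactive + 1 ELSE 0 END AS days_inactive, --   days since last login
    CASE WHEN t.user_id IS NULL THEN y.dates_active            -- combine arrays of
         ELSE ARRAY[t.event_date]                              --   changing values
              || COALESCE(y.dates_active, ARRAY[]::DATE[]) END AS dates_active,
    DATE '2023-01-02' AS snapshot_date
FROM yesterday y
FULL OUTER JOIN today t ON y.user_id = t.user_id;              -- FULL OUTER JOIN
```

Rerunning this day over day keeps the full login history in a single row per user.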
- Usability vs compactness tradeoff
- arrays vs structs vs maps
- Run-length encoding is a form of compression that stores a repeated value once along with a count of how many times it repeats, instead of storing every duplicate (e.g. aaaabbb → a4 b3).
New commands
- UNNEST: expands an array column into multiple row entries
- CARDINALITY: returns the size of an array
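A quick illustration of both, against an assumed user_devices table with an array column:

```sql
-- Hypothetical table: user_devices(user_id BIGINT, device_ids BIGINT[]).
SELECT user_id,
       CARDINALITY(device_ids) AS num_devices  -- size of the array
FROM user_devices;

SELECT user_id,
       device_id                               -- one row per array element
FROM user_devices
CROSS JOIN UNNEST(device_ids) AS t(device_id); -- expand the array into rows
```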
- Idempotent pipelines: the pipeline's result should not change if the inputs haven't changed, even when run multiple times. Ways to stay idempotent:
- Using MERGE or INSERT OVERWRITE rather than plain INSERT INTO — rerunning a plain INSERT INTO appends duplicates (should INSERT INTO never be used?). See the sketch after this list.
- Using start_date > and end_date < (a full, bounded window — an unbounded range isn't idempotent because results change as new data arrives)
- Using partition sensors.
- This guards against running the pipeline with insufficient data. Say we need 3 tables of data for the pipeline, but the pipeline runs too early with only 2 tables populated and the 3rd not yet ready.
- Using `depends_on_past`
- Do not rely on the `latest` metric. If data isn't ready, it might look at previous data, which is technically still the latest.
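A hedged sketch of the MERGE / INSERT OVERWRITE point above, in Spark SQL — table, partition, and column names are assumptions:

```sql
-- Rerunning this replaces the 2023-01-02 partition instead of appending to it,
-- so the output is the same no matter how many times the pipeline runs.
INSERT OVERWRITE TABLE user_metrics PARTITION (ds = '2023-01-02')
SELECT user_id,
       COUNT(*) AS num_events
FROM raw_events
WHERE ds = '2023-01-02'
GROUP BY user_id;
```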
- SCD: Slowly changing dimension
- Dimension vs metric? He mentions heart rate as a rapidly changing dimension, but also that it could be a metric and not a dimension.
- Types of SCD:
- Type 0: never changing
- Type 1: changes, but we only care about the latest value
- Type 2: we care about what the value was from start_date to end_date (see the sketch below)
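A minimal Type 2 sketch — the table layout and names (player_scd, scoring_class, seasons) are my own illustration, not from the video:

```sql
-- Each row records one value of the dimension and the window it was valid for.
CREATE TABLE player_scd (
    player_name   TEXT,
    scoring_class TEXT,     -- the slowly changing dimension
    start_season  INTEGER,  -- when this value took effect
    end_season    INTEGER,  -- when it stopped (e.g. 9999 for the current row)
    is_current    BOOLEAN
);

-- "What was the value in season 2001?" becomes a range lookup:
SELECT scoring_class
FROM player_scd
WHERE player_name = 'Michael Jordan'
  AND 2001 BETWEEN start_season AND end_season;
```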
- Additive vs Non-Additive dimensions
- Can the user be 2 things at the same time (a "thing" being a value of that dimension)? If so, the dimension is not additive: summing counts across its values double-counts users. E.g. a user can be both an iPhone user and an Android user in the same window, so device type is non-additive; a user has exactly one age, so age is additive.
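A toy pair of queries showing the double-count, against an assumed events(user_id, device_type) table:

```sql
-- Overall distinct users:
SELECT COUNT(DISTINCT user_id) AS total_users
FROM events;                          -- say this returns 100

-- Distinct users per device type:
SELECT device_type,
       COUNT(DISTINCT user_id) AS users
FROM events
GROUP BY device_type;                 -- say iphone = 70, android = 50
-- 70 + 50 = 120 > 100: users active on both devices are counted twice,
-- so device_type is a non-additive dimension.
```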
Questions
- Fact data modelling, what is it? What was mentioned in the video is that OLTP, OLAP, and master data modelling methods are dimensional data modelling.
- What is shuffle? Is it just randomly ordering data, or is there more to it? Why is it talked about so often in the video? I get that shuffling data will break compression, but why would we even shuffle? This was mentioned in "Badness of denormalized temporal dimensions".
- He mentions that Spark joins will shuffle the data.
- He also mentions that arrays will auto-sort on explosion?
- What is "temporal cardinality explosion" of dimensions? He talks about either using an array of all nights or having another table with all nights and a FK to the listing table. Apart from the compression deets, what is the title talking about?
- Why is an ENUM for the country dimension bad? The reason given in the video is that the number of values is greater than 200. Wouldn't an ENUM make sense where a dimension has fixed cardinality? How often do we think a country would be added or removed 🤔
- What da hell is a broadcast join?