Knowing your consumer

  • Easily queryable, no complex data types
  • Compact and harder to query usually done to reduce storage space.
  • ML models as consumer
  • Customers: Should be very to interpret charts

Some of the biggest problems in data engineering occur when data is modeled for the wrong consumer!

  • OLAP is looking at the entire database or a subset of the database whereas OLTP is mostly looking at a single entities or a few rows, ex: one user. Hence the performance and modelling mechanisms vary.
  • Master Data is the middle ground between OLAP and OLTP.
  • Production database snapshots master data OLAP cubes (slice and dice, easy group by) metrics

Cumulative Table Design

Used to bring data across multiple timeframes into a single table or dataframe. Example: Data over multiple days where users login only on a few days. Cumulative table design will FULL OUTER JOIN all the dataframes with COALESCE values. This way history is preserved.

Can be used for state tracking where the same data goes through different state transitions. Example: dead live backburner etc.

Steps:

  1. FULL OUTER JOIN
  2. COALECSE ids and unchanging dimensions
  3. Compute cumulation metrics (eg: Number of days in-active)
  4. Combine arrays and changing values
  • Usability vs compactness tradeoff
  • arrays vs structs vs maps
  • Run length encoding is a form of compression which removes duplicates.

Questions

  1. Fact data modelling, what is it ? What was mentioned is the video is that OLTP, OLAP and master data modelling methods are Dimensional data modelling.
  2. What is shuffle ? is it just randomly ordering data or is there more to it ? Why is it talked about often in the video ? I get that shuffling data will break compression but why would we even shuffle ? This was mentioned in Badness of denormalized temporal dimensions.
    • he mentions that spark joins will shuffle the data.
    • He also mentions that having arrays will auto sort on explosion ?
  3. What is temporal cardinality explosions of dimensions ? He talk about either using an array for all nights or having another table with all nights and FK to the listing table. Apart from compression deets, what is the title talking about ?