databases

ScyllaDB LIVE - Essentials 29th Jan, 2025

Server configuration to run ScyllaDB on prem: Use a sizing calculator on ScyllaDB for AWS: I4i instance

https://www.scylladb.com/category/engineering/

Books: Database Performance at Scale (freely available online): https://www.scylladb.com/2023/10/02/introducing-database-performance-at-scale-a-free-open-source-book/

shard per code architecture: https://www.scylladb.com/2024/10/21/why-scylladbs-shard-per-core-architecture-matters/

ScyllaDB university: https://university.scylladb.com

Essentials track

Slides are available at scylladb university.

Basic Architecture

2 family groups: relational(SQL) and NoSQL databases

Below are outlines of NoSQL databases with increasing data complexity.

Key-value databases(redis, rocksDB) Document databases(MongoDB, couchbase) Wide column databases(ScyllaDB, Cassandra, DynamoDB)(database model using rows and columns) Graph(JanusGraph, Neo4J)

ScyllaDB:

  1. Provides predictable performance at scale
  2. High availability
  3. High scalability (can add or remove nodes as the system is running)
  4. Performance
  5. Low maintenance
  6. Drop in replacement for Cassandra, dynamoDB

New feature: Tablet (Allows database to scale fast)

  • Written in C++ , no need to worry about garbage collection like in Cassandra (written in Java)

ScyllaDB: All nodes are equal, no master or leaders. Uses Node Ring. Minimum nodes = 3 for a cluster

Keyspace: Contains tables, and has config such as replication factor, etc.

Each table needs to have a partition key. It can be one or many columns. Database will partition the data according to the partition key. It performs consistent hashing on those values.

Consistency Level(CL) Tuneable consistency. Can be per operation. It is = number of nodes that must ack a read/write. Example: CL = 1 | quorum | local quorum | ALL | etc

Higher consistency = lower availability (CAP theorem)

ScyllaDB supports LightWeight Transactions (LWTs) to atomically write with high consistency. Docs https://opensource.docs.scylladb.com/stable/features/lwt.html

Data modelling

In SQL, we think about data first, using entity diagrams, relationships. Then model them in the database, and at the end comes the application. We write queries at the very end.

Whereas in NoSQL, we think about data as well as application. We think about the possible queries, the consistency levels for queries before we model it.

Data will be ordered according to partition key.

Clustering key can also be added. its similar to ordering key in dynamoDB (its optional).

There can be multiple values for the partition key (in green). Data is sorted according to the clustering key.

Choosing a partition key

  • high cardinality (means that, the column has lots of possible values). Example: application user reviews.
  • even distribution

Avoid:

  • low cardinality
  • hot partition
  • large partition

Any ID (user id) is a good candidate for a partition key. Use a monitoring tool always.

CQL also has collections as data types: sets, lists, maps

ScyllaDB also has UDT: User defined types (like a struct). Or a map with fixed keys.

ScyllaDB also has counters: Specialised data types designed for efficient inc or dec operations in a distributed database environment. You can not SET a value, only INC and DEC.

  1. Use prepared statements whenever possible. These types of queries are cached. Prepare the statement first and then fire the query with the dynamic values.
  2. Prefer token-aware policy: Basically, try to send the read request to the node which contains the partition. If this is not done, the node will randomly look for another node which contains the request partition which is an extra hop.
  3. How to write better apps (similar techniques as above): https://university.scylladb.com/courses/data-modeling/lessons/how-to-write-better-apps/

Materialised View: The same table replicated with different partition or clustering keys.