Modeling database tables & migrations

Many real-world systems that persist data rely on "relational databases", which store data in rows and columns (conceptually like spreadsheets, but with a different backing implementation).

Design of database organization

Databases are a common way that applications and systems persist, or store, information across different executions of the program. Application code writes data to a database, and the database management systems stores the database tables to disk, e.g., the file system.

Database systems are organized into tables that store distinct subsets of information (conceptually like tabs in a spreadsheet application like Excel).

The specific organization of data into tables is called a schema.

A "flat" schema is one that uses few or only one distinct table for all information. Flat schema's typically work, but are often less efficient and conceptually more difficult to work with.

A system's schema may change over time as the application code itself changes. For example, if an engineer introduces a new feature, they may also decide to split off data into a new table. The new version of the entire application uses the new schema.

What happens to a user's data in the old schema when they update the application to the new version?

Think, then click!

The data must be moved to the new format via a migration. We can think of the migration as reading in all the rows and writing them back out in the new format, to a new database.

What could happen if the code that implements a migration fails?

Think, then click!

The old data could be deleted.
The old data could be rendered inaccessible, because the current version of the application no longer handles the old schema.

Student question: does a bug in the migration mean it will always fail, or only fail for some users?

Think, then click!

A really bad migration might fail regardless of the data present, but that is more rare in practice because those sorts of bugs are easy to catch in tests (any test of the write two schemas would find them).

Trickier bugs depend on the specific data a user might have on their device. For example, migration code might forget to account for timestamps that change strangely based on daylight savings time, or other rare situations.

For these reasons, we really want to be careful that database migrations always succeed, no matter what initial data a user has in the old format.

Today's example: sensor data

For today's example, we'll consider a database to store data coming in from some sensors.

We have 3 types of sensors: mics, cameras, and video streams.

Data that comes in from a sensor has:

A timestamp (for simplicity, we'll assume these are unique and can be used as an identifier, in practice a real database would use a truly unique ID).
The ID of the sensor it came from.
The type of the sensor it came from.
Optionally: audio data.
Optionally: video data.

How do we model the sensor types in Alloy?

Think, then click!

We want the 3 types to be mutually exclusive, and for all sensors to have a type. So we use an abstract sig and 3 mutually exclusive, extended subsigs.

sig Timestamp {}

abstract sig SensorType {}
one sig Mic, Camera, VideoStream extends SensorType {}

How do we model the optional audio and visual data?

Think, then click!

We define a sig for each and make them lone fields.

sig SensorData {
  sensorID: Int,
  sensorType: SensorType,
  audio: lone Audio,
  visual: lone Image,
}

We'll also add a Timestamp field and label it disj, so that every instance of SensorData has a unique timestamp.

When we run this for examples, we see that sensors can have the same ID but different types, which does not make sense.

How do we restrict this?

Think, then click!

We add a fact that sensors with the same ID are the same type. We do not want the opposite to be true, since it's valid to have multiple mics with different ids.

fact sensorIDs {
  // Capture idea: sensor types don't change for a given ID
  all x, y : SensorData | 
    // If same ID, then same type
    (x.sensorID = y.sensorID) =>  (x.sensorType = y.sensorType)    // <=> ?  no!
}

Next, we need to determine what it means for data to be valid in this model. We'll say that cameras only have image data, mics only have audio, and video streams have both.

How do we write this in Alloy?

Think, then click!

We add a new fact that has if-and-only-if (<=>) for each case:

pred validSensorData {
  all sd: SensorData | {
    let t = sd.sensorType | {
       t = Mic  <=> some sd.audio and no sd.visual
       t = Camera <=> no sd.audio and some sd.visual
       t = VideoStream <=> some sd.audio and some sd.visual
   }  
  }
}

Modeling the migration

When we run instances of the model so far, we notice two inefficiencies in the flat schema:

There are a lot of empty cells for rows with no audio of visual data. In a real database, this amounts to wasted space.
The same sensor type is listed every time we get sensor data from a given ID, as expected. Do we really want to repeat this data if we have, for example, millions of rows?

To solve these inefficiencies, we can design a new schema with the following tables and columns:

SensorDataRow
| timestamp | sensor ID |
    
SensorRow
| ID | type |

AudioRow
| timestamp | audio |
    
VisualRow
| timestamp | visual |

In this new schema, we never have empty cells (we just don't add rows if there is no audio of visual data for a given sensor). We also only save the type based on the number of total sensors, which may be far lower than the number of total datapoints.

Below, we model the new schema and validity in the new schema.

Next time, we'll show how we can model the migration itself.

Full model from Monday's class


sig Timestamp {}

abstract sig SensorType {}
one sig Mic, Camera, VideoStream extends SensorType {}

sig Audio, Image {}

sig SensorData {
  timestamp: disj Timestamp,
  sensorID: Int,
  sensorType: SensorType, // how do I write this type in Alloy?
  audio: lone Audio, // how do I indicate optional?
  visual: lone Image,
}

run {
  // Example statements to show us specific sizes of database tables:
  // some SensorData
  //  sensorType: SensorData->SensorType
  //  #sensorType.VideoStream = 2
  //  #sensorType.Mic = 2
  //  #sensorType.Camera = 2

  // validSensorData
} for 6

fact sensorIDs {
  // Capture idea: sensor types don't change for a given ID
  all x, y : SensorData | 
    // If same ID, then same type
    (x.sensorID = y.sensorID) =>  (x.sensorType = y.sensorType)    // <=> ?  no!
}


pred validSensorData {
  // The data matches the type: mics only audio, camera only visual, 
  // video has to have both. Assumption: sensors all work as expected.
  all sd: SensorData | {
    let t = sd.sensorType | {
       t = Mic  <=> some sd.audio and no sd.visual
       t = Camera <=> no sd.audio and some sd.visual
       t = VideoStream <=> some sd.audio and some sd.visual
   }  
  }
}

// Design our new schema design
sig SensorDataRow {
  timestamp: disj Timestamp,
  sensorID: Int,
}

sig SensorRow {
  ID: disj Int,
  type: SensorType,
}

sig AudioRow {
  timestamp: disj Timestamp,
  audio: Audio, // No longer lone, can't be empty
}

sig VisualRow {
  timestamp: disj Timestamp,
  visual: Image,  // No longer lone, can't be empty
}


pred validSensorDataRow {
  // We don't have missing sensor types
  all sd: SensorDataRow | 
    one sensor: SensorRow | sd.sensorID = sensor.ID

  // Sensor data matches the sensor type
  all sd: SensorDataRow | 
   // First: find the type for a given row 
   let sensorRow = ID.(sd.sensorID) | // Could also write as: sd.sensorID.~ID
   let t = sensorRow.type | 
   let time = sd.timestamp | 
   let hasAudio = time in AudioRow.timestamp |
   let hasVisual =  time in VisualRow.timestamp |
   { 
       t = Mic  <=> hasAudio and not hasVisual
       t = Camera <=> not hasAudio and hasVisual
       t = VideoStream <=> hasAudio and hasVisual
   }
}

Attribution

This lecture draws on inspiration from Hillel Wayne's Formally Modeling Database Migrations post, though the example schemas we use are different.