Modeling database tables & migrations
Many real-world systems that persist data rely on "relational databases", which store data in rows and columns (conceptually like spreadsheets, but with a different backing implementation).
Design of database organization
Databases are a common way that applications and systems persist, or store, information across different executions of the program. Application code writes data to a database, and the database management systems stores the database tables to disk, e.g., the file system.
Database systems are organized into tables that store distinct subsets of information (conceptually like tabs in a spreadsheet application like Excel).
The specific organization of data into tables is called a schema.
For example, one schema might store users' | ID | Name | Email |
in one table, and users' | ID | Age |
in another table. To access the full information for a given user, the engineer would construct a query (typically in SQL) to join the two tables based on the ID
column, like the dot joins we have been using in Alloy all semester.
A "flat" schema is one that uses few or only one distinct table for all information. Flat schema's typically work, but are often less efficient and conceptually more difficult to work with.
A system's schema may change over time as the application code itself changes. For example, if an engineer introduces a new feature, they may also decide to split off data into a new table. The new version of the entire application uses the new schema.
What happens to a user's data in the old schema when they update the application to the new version?
Think, then click!
The data must be moved to the new format via a migration. We can think of the migration as reading in all the rows and writing them back out in the new format, to a new database.
What could happen if the code that implements a migration fails?
Think, then click!
- The old data could be deleted.
- The old data could be rendered inaccessible, because the current version of the application no longer handles the old schema.
Student question: does a bug in the migration mean it will always fail, or only fail for some users?
Think, then click!
A really bad migration might fail regardless of the data present, but that is more rare in practice because those sorts of bugs are easy to catch in tests (any test of the write two schemas would find them).
Trickier bugs depend on the specific data a user might have on their device. For example, migration code might forget to account for timestamps that change strangely based on daylight savings time, or other rare situations.
For these reasons, we really want to be careful that database migrations always succeed, no matter what initial data a user has in the old format.
Today's example: sensor data
For today's example, we'll consider a database to store data coming in from some sensors.
We have 3 types of sensors: mics, cameras, and video streams.
Data that comes in from a sensor has:
- A timestamp (for simplicity, we'll assume these are unique and can be used as an identifier, in practice a real database would use a truly unique ID).
- The ID of the sensor it came from.
- The type of the sensor it came from.
- Optionally: audio data.
- Optionally: video data.
How do we model the sensor types in Alloy?
Think, then click!
We want the 3 types to be mutually exclusive, and for all sensors to have a type. So we use an abstract sig and 3 mutually exclusive, extended subsigs.
sig Timestamp {}
abstract sig SensorType {}
one sig Mic, Camera, VideoStream extends SensorType {}
How do we model the optional audio and visual data?
Think, then click!
We define a sig for each and make them lone
fields.
sig SensorData {
sensorID: Int,
sensorType: SensorType,
audio: lone Audio,
visual: lone Image,
}
We'll also add a Timestamp
field and label it disj
, so that every instance of SensorData
has a unique timestamp.
When we run this for examples, we see that sensors can have the same ID but different types, which does not make sense.
Think, then click!
We add a fact
that sensors with the same ID are the same type. We do not want the opposite to be true, since it's valid to have multiple mics with different ids.
fact sensorIDs {
all x, y : SensorData |
(x.sensorID = y.sensorID) => (x.sensorType = y.sensorType)
}
Next, we need to determine what it means for data to be valid in this model. We'll say that cameras only have image data, mics only have audio, and video streams have both.
How do we write this in Alloy?
Think, then click!
We add a new fact
that has if-and-only-if (<=>
) for each case:
pred validSensorData {
all sd: SensorData | {
let t = sd.sensorType | {
t = Mic <=> some sd.audio and no sd.visual
t = Camera <=> no sd.audio and some sd.visual
t = VideoStream <=> some sd.audio and some sd.visual
}
}
}
Modeling the migration
When we run instances of the model so far, we notice two inefficiencies in the flat schema:
- There are a lot of empty cells for rows with no audio of visual data. In a real database, this amounts to wasted space.
- The same sensor type is listed every time we get sensor data from a given ID, as expected. Do we really want to repeat this data if we have, for example, millions of rows?
To solve these inefficiencies, we can design a new schema with the following tables and columns:
SensorDataRow
| timestamp | sensor ID |
SensorRow
| ID | type |
AudioRow
| timestamp | audio |
VisualRow
| timestamp | visual |
In this new schema, we never have empty cells (we just don't add rows if there is no audio of visual data for a given sensor). We also only save the type based on the number of total sensors, which may be far lower than the number of total datapoints.
Below, we model the new schema and validity in the new schema.
Next time, we'll show how we can model the migration itself.
Full model from Monday's class
sig Timestamp {}
abstract sig SensorType {}
one sig Mic, Camera, VideoStream extends SensorType {}
sig Audio, Image {}
sig SensorData {
timestamp: disj Timestamp,
sensorID: Int,
sensorType: SensorType,
audio: lone Audio,
visual: lone Image,
}
run {
} for 6
fact sensorIDs {
all x, y : SensorData |
(x.sensorID = y.sensorID) => (x.sensorType = y.sensorType)
}
pred validSensorData {
all sd: SensorData | {
let t = sd.sensorType | {
t = Mic <=> some sd.audio and no sd.visual
t = Camera <=> no sd.audio and some sd.visual
t = VideoStream <=> some sd.audio and some sd.visual
}
}
}
sig SensorDataRow {
timestamp: disj Timestamp,
sensorID: Int,
}
sig SensorRow {
ID: disj Int,
type: SensorType,
}
sig AudioRow {
timestamp: disj Timestamp,
audio: Audio,
}
sig VisualRow {
timestamp: disj Timestamp,
visual: Image,
}
pred validSensorDataRow {
all sd: SensorDataRow |
one sensor: SensorRow | sd.sensorID = sensor.ID
all sd: SensorDataRow |
let sensorRow = ID.(sd.sensorID) |
let t = sensorRow.type |
let time = sd.timestamp |
let hasAudio = time in AudioRow.timestamp |
let hasVisual = time in VisualRow.timestamp |
{
t = Mic <=> hasAudio and not hasVisual
t = Camera <=> not hasAudio and hasVisual
t = VideoStream <=> hasAudio and hasVisual
}
}
Attribution
This lecture draws on inspiration from Hillel Wayne's Formally Modeling Database Migrations post, though the example schemas we use are different.
Modeling database tables & migrations
Many real-world systems that persist data rely on "relational databases", which store data in rows and columns (conceptually like spreadsheets, but with a different backing implementation).
Design of database organization
Databases are a common way that applications and systems persist, or store, information across different executions of the program. Application code writes data to a database, and the database management systems stores the database tables to disk, e.g., the file system.
Database systems are organized into tables that store distinct subsets of information (conceptually like tabs in a spreadsheet application like Excel).
The specific organization of data into tables is called a schema.
For example, one schema might store users'
| ID | Name | Email |
in one table, and users'| ID | Age |
in another table. To access the full information for a given user, the engineer would construct a query (typically in SQL) to join the two tables based on theID
column, like the dot joins we have been using in Alloy all semester.A "flat" schema is one that uses few or only one distinct table for all information. Flat schema's typically work, but are often less efficient and conceptually more difficult to work with.
A system's schema may change over time as the application code itself changes. For example, if an engineer introduces a new feature, they may also decide to split off data into a new table. The new version of the entire application uses the new schema.
What happens to a user's data in the old schema when they update the application to the new version?
Think, then click!
The data must be moved to the new format via a migration. We can think of the migration as reading in all the rows and writing them back out in the new format, to a new database.
What could happen if the code that implements a migration fails?
Think, then click!
Student question: does a bug in the migration mean it will always fail, or only fail for some users?
Think, then click!
A really bad migration might fail regardless of the data present, but that is more rare in practice because those sorts of bugs are easy to catch in tests (any test of the write two schemas would find them).
Trickier bugs depend on the specific data a user might have on their device. For example, migration code might forget to account for timestamps that change strangely based on daylight savings time, or other rare situations.
For these reasons, we really want to be careful that database migrations always succeed, no matter what initial data a user has in the old format.
Today's example: sensor data
For today's example, we'll consider a database to store data coming in from some sensors.
We have 3 types of sensors: mics, cameras, and video streams.
Data that comes in from a sensor has:
How do we model the sensor types in Alloy?
Think, then click!
We want the 3 types to be mutually exclusive, and for all sensors to have a type. So we use an abstract sig and 3 mutually exclusive, extended subsigs.
How do we model the optional audio and visual data?
Think, then click!
We define a sig for each and make them
lone
fields.We'll also add a
Timestamp
field and label itdisj
, so that every instance ofSensorData
has a unique timestamp.When we run this for examples, we see that sensors can have the same ID but different types, which does not make sense.
How do we restrict this?
Think, then click!
We add a
fact
that sensors with the same ID are the same type. We do not want the opposite to be true, since it's valid to have multiple mics with different ids.Next, we need to determine what it means for data to be valid in this model. We'll say that cameras only have image data, mics only have audio, and video streams have both.
How do we write this in Alloy?
Think, then click!
We add a new
fact
that has if-and-only-if (<=>
) for each case:Modeling the migration
When we run instances of the model so far, we notice two inefficiencies in the flat schema:
To solve these inefficiencies, we can design a new schema with the following tables and columns:
In this new schema, we never have empty cells (we just don't add rows if there is no audio of visual data for a given sensor). We also only save the type based on the number of total sensors, which may be far lower than the number of total datapoints.
Below, we model the new schema and validity in the new schema.
Next time, we'll show how we can model the migration itself.
Full model from Monday's class
Attribution
This lecture draws on inspiration from Hillel Wayne's Formally Modeling Database Migrations post, though the example schemas we use are different.