Overview of Databases. Relational databases

Definition of a database, DBMS

A database is a collection of data which is organized in a way that allows for easy data retrieval and manipulation. While a folder with several files in it may be viewed as a simple database, database professionals usually require that a databae has a database management system (DBMS). A DBMS is a software tool which stores data in a specified form and provides access to this data for a user or an application. Specificaly, a DBMS provides some or all of the following functionality:
  1. Data Definition -- A DBMS must define a structure for stored data, and provide a means for a user to define and organize their data within that structure.
  2. Data Retrieval -- A DBMS must provide a toolset that allows a user to retrieve data stored in the database (for instance, query tools).
  3. Access Control -- The database administrator should be able to define data access for an individual or a group.
  4. Data Sharing -- More than one user should be able to use the database at the same time without a danger of overwriting each other's data changes.
  5. Data Integrity -- A DBMS should provide mechanisms for maintaining data integrity through system failures and inconsistent, or incomplete, updates.

Data models

The data structure and the access techniques supported by a DBMS is called a data model. Several kinds of data models have been implemented: Most commercial databases use relational model. The main advantage of it is that it is much faster than the other models, has a better support for data integrity and data sharing, and provides an easy way of manipulating data via the query language SQL.

Overview of Relational Model, some terminology

The relational data model organizes data as a collection of tables with the following rules:
  1. Each table has a unique name.
  2. Each table has at least one column. All columns in the same table have different names.
  3. Each table has zero or more rows, each row has a single data value in each column (values in some columns may be NULL, while other columns may not allow NULL values). Rows are also called records.
  4. All data values in a given column have the same data type. For instance, one can't place an integer into a column containing strings, and vice versa.
A row in a table is identified by its primary key. Primary key is a column or a combination of columns which uniquely identifies a row of data. For instance, in a database of students a student is uniquely identified by his/her ID number. Choice of a primary key is important in defining a database. For instance, if I decide to identify a student by the combination of the first name, middle initial, and the last name, then I wouldn't be able to insert a record for two students for whom this combination is the same.

A foreign key can define a relationship between the two tables. A foreign key logically links a column from a second table to the current table. For instance, if we add a table describing courses to the database of students, then we can use the course code as a foreign key in the student table to spacify that a student is taking a course. The relations defined by adding a foreign key to a table may be one-to-one (one record of one table corresponds to at most one record of another) or one-to-many (a record of one table may correspond to many records in another).

A constraint is a user-imposed restriction on the range of values for a particular column. Requiring that values are non-null is a constraint. Another example is a requirement that a telephone number has ten digits, where first three of them are a valid US area code.

A formal description of the structure of the entire database (i.e. its tables, columns of each table, types of each column, primary keys, constraints, etc.) is called the schema of the database.

Codd's 12 rules

In 1970, an IBM researcher E.F. Codd has proposed 12 rules that describe relational model. 15 years later, in 1985, these rules were used to develop first relational databases. While no commercial database fully conforms to all 12 rules, the rules have been closely followed as a guideline for the design of relational database systems. Below is the list of (very simplified) rules with the original names:
  1. The Information Rule: All data should be presented to the user in table form.
  2. Guaranteed Access Rule: All data should be accessible without ambiguity (uniqueness of table and column names and of the primary key accomplishes this requirement).
  3. Systematic Treatment of Null Values:Every field, except for (each part of) primary key and fields restricted by non-null constraints, should be allowed to remain empty. This requires support of NULL value, which is different from zero or an empty string. Many databases allow the user to specify a constraint that certain fields are not allowed to contain NULL values.
  4. Dynamic On-Line Catalog Based on the Relational Model: A relational database must provide access to its structure through the same tools that are used to provide access to data. This means, for instance, that the user can find out the column names and column types of a table in a similar way as finding out values of a row in that table. Data that describes structure of the database is called metadata.
  5. Comprehensive Data Sublanguage Rule: The database must support at least one clearly defined language that includes functionality for data definition, data manipulation, data integrity, and database transaction control. In practice SQL (Structured Query Language) is used as such a language in all commercial relational databases.
  6. View Updating Rule: Data can be presented to the user in different logical combinations, called views. A view is a part of the database that's obtained by selecting some of the tables, some of the column names, and some rows. Each view should support the same full range of data manipulation that direct-access to a table has available. This rule is not fully implemented in commercial databases since updates and deletes are difficult on views.
  7. High-level Insert, Update, and Delete:Data can be retrieved from a relational database in sets constructed of data from multiple rows and/or multiple tables. This rule states that insert, update, and delete operations should be supported for any retrievable set rather than just for a single row in a single table.
  8. Physical Data Independence:The user is isolated from the physical method of storing and retrieving information from the database. The underlying architecture may change without the user being aware of the change.
  9. Logical Data Independence: How a user views data should not change when the logical structure (tables structure) of the database changes. For instance, adding columns to a table should not change the view for a user who is not using these columns.This rule is particularly difficult to satisfy in practice.
  10. Integrity Independence: The database language (like SQL) should support constraints on user input that maintain database integrity. This rule is not fully implemented by most major vendors. At a minimum, all databases do preserve two constraints through SQL.
  11. Distribution Independence: A user should be totally unaware of whether or not the database is distributed (whether parts of the database exist in multiple locations).
  12. Nonsubversion Rule: There should be no way to modify the database structure other than through the multiple row database language (like SQL). Most databases today support administrative tools that allow some direct manipulation of the datastructure.

Database queries, SQL

A query language is a language for user's requests to DBMS. Every commercial relational database support (a version of) a query language SQL.

SQL has been developed in late 1970th. The first version of the language was called SEQUEL (Structured English Query Language). After the languages has been modified, the name has been changed to SQL (usually considered an abbreviation for Structured Query Language).

SQL has commands for creating and deleting tables, adding, updating, and deleting data, and selecting data that satisfies certain properties. It also supports working with metadata, i.e. information about the structure of the database, thus conforming to the 4th Codd's rule.

Transactions

Transaction is a sequence of database commands which is treated as one unit. Most modern databases allow many users to access the database at the same time. It is essential to prevent inconsistency of data because of simultaneous updates (for instance, one airline ticket sold to two customers because two sales clerks have accessed the database at the same time). To prevent inconsistencies, transactions should satisfy the following properties, usually abbreviated as ACID:
  1. Atomicity: a transaction should be done or undone completely. In the event of a failure, all operations and procedures should be undone, and all data should rollback to its previous state.
  2. Consistency: a transaction should transform a system from one consistent state to another consistent state.
  3. Isolation: each transaction should happen independently of other transactions occurring at the same time.
  4. Durability: completed transactions should remain permanent, even during system failure.
There are two operations which allow the user to control behaviour of a transaction: COMMIT, which makes the changes permanent, and ROLLBACK, which puts the relevant part of the database into the state prior to the transaction (as if it has never happenned).

Access control is achieved via locks: if a transaction needs to access a part of a database, the relevant part of a database (one or several tables, or one or several rows of a table) is locked. There are locks for reading (while one transaction is reading a data, no other transaction is allowed to update, since this may make the information read by the first transaction inconsistent) and for writing (while a transaction is updating data, no other transaction is allowed to access the data).

Distributed databases, remote access to databases

Modern databases are often accessed remotely. A database may reside on database server, while an application running on a client machine submit queries to the server. All queries are executed on the server side. As we have mentioned, a database may be accessed by multiple users at the same time.

There is a variety of Java tools for remote access to and interaction with relational databases. We will work with JDBC (Java Database Connectivity) package.

A database may be distributed, i.e. a part of it may reside on one server, and a part on another. Usually, however, the user shouldn't be aware of where the data resides, the communication between the servers should be handled by DBMS.


Some material on this page has been adopted from a subset of online sources listed here
This page has been created and is maintained by Elena Machkasova
Comments and suggestions are welcome at emachkas@wellesley.edu

Spring Semester 2002