Relational databases (Article)

Description :: What they are, and what they are not ...

[Note: I realize this article doesn't quite agree with my scratchbook articles on the subject. Some of it is just slight imprecision in the use of terms, some of it is an evolving understanding of those same terms. Bear with me, I'll get it right eventually.]

A relational database is a concise structure for storing and using statements of truth, together or separately. New terms are marked in bold as they are defined, examples (long and short) are marked in italics. Please email me if you find I have slipped in my use of the following terminology -- I've formed some bad habits by actually working in the field.

Values, Variables, and Domains
A value, such as the number "2", is one of many possible values in a set of similar values, such as "all possible numbers". Your street address is one value out of the set of all of the possible street addresses. A variable is a (usually named) storage space which may contain a representation of a value. A variable may only represent one value at any given time. Values can be represented in many possible ways, and you cannot ever store a value itself, only its representation. The number "2", a concept, can be represented by ink on paper forming its shape ("2", or "II",) or by a painting, or by spoken words, or by bits in a computer. Values are members of sets of possible values (which may or may not be easy to enumerate, that is, list.) Each such set is its own domain. The domain of all integers is a subset of the range of all real numbers: you may form one domain from another by applying a constraint; you can also (though not in any system I've seen) form a new domain as the union of several other domains, such as the set of "all values that are either integers or names of places". A constraint is a predicate expression (that is, an expression whose domain is that of boolean values, "true" and "false") which must be "true". For example, "even numbers" are a domain formed by constraining the "integers" domain to cases where the number is evenly divisible by "2".

Databases and Relations
A database contains zero or more named relations, zero or more named relation variable prototypes, each of a given relation type, and zero or more named domains (to be used in the relation definitions). Each relation has a relation header which names zero or more attributes, each of which has an associated domain. A relation named "addresses" might have the following attributes: "city" (text), "state" (text), "zip code" (number), and so forth. These attributes are not ordered (left to right or otherwise) though, for ease, an order may be chosen when representing a relation (perhaps alphabetically, for example.) Attributes can be imagined to be parts of a sentence, such as "A lives on Nth street, in B, C." By supplying values for each attribute, we can have statements like "Joe lives on 8th street, in Birmingham, Alabama." The attributes "A", "N", "B", "C" are here replaced with the values "Joe", "8", "Birmingham", "Alabama".

Though not commonly recognized, a database may state that there will be several relation variables of the same relation type in the database variable. You might define an "addresses" relation, but define several relation variables, "home_addresses" and "work_addresses" of that exactly relation type. Whether or not this is a good idea is an entirely different question.

A database variable is a variable of the domain defined by the database which describes it. For each relation variable prototype named in the database, the database variable will have a relation variable. A relational database variable may not contain any top-level named variables that are not of a relation type. You cannot, for example, store a variable named "bob" of the domain "integer" at the top (global, main, etc.) level of the relational database variable. A relation variable, apart from having the same header (and therefore set of attributes) as the relation that describes it, also has a relation body. A relation body is a set (that is, it cannot contain any duplicates) of tuples. A tuple, like a relation header, has named attributes. Unlike a relation header, it also has values for each attribute and the values must match the domain defined for each attribute. All of the tuples of a relation body will have the same set of attributes, as described and required by the relation header. The tuples of a relation body are not ordered (top to bottom or otherwise), though an order may be chosen for convenience. Therefore, a relation variable is not at all like a spreadsheet grid, where the order of columns and the order of rows may matter: there is no order here, only sets of items.

Constraints
Additionally, a database may include database constraints which must be satisfied in any database variables described by the database. No changes may be made to such database variables that violate these rules. Such constraints may include things like "A given social security number may be used only by one person at a time" or "Payments may not be sent to anyone for whom we do not have an address". These constraints are in addition to domain constraints, though the two are related as we will now see.

A database, then, is also a domain, describing a set of possible variables. Two companies may use the same database, in different database variables. Each will have different data, but the same rules and layouts still apply. A relation is a domain, describing a set of possible relation values matching the description. As domains may necessarily have constraints, we can infer that (though not mentioned yet) each relation may also have constraints. It is generally preferable to keep constraints localized: while the previously described constraint about payments may be done at the database level, the uniqueness of social security numbers would likely be a relation-level constraint.

Normal Forms
As stated earlier, a variable may only represent one value at a time. A relational database is by definition in first normal form which requires that each attribute of each tuple in each relation body have only one value at a time. To dispell a myth, however, this does not mean that you cannot have an attribute which is "list of things to buy at the store". Such an attribute is perfectly valid. First normal form requires only that you always consider the entire value at a time, even if you can see how it might be built from smaller pieces. If an attribute is of the domain "integers", it may only have one integer in it. If it is of the domain "arrays of integers", it may contain an array of multiple integers. The entire array is the value, not each integer.

To be in second normal form, a database must be in first normal form (which should be a given) and also be free of partial-key functional dependencies. A candidate key is a set of attributes in a relation which are considered uniquely identifying. By knowing values for each of the attributes in a candidate key, you can find exactly zero or one tuples in the relation body. A relation may have multiple candidate keys. No two tuples in a given relation variable may have exactly the same values for all the attributes of any candidate key. As an example, consider that you were given a unique number by your government (social security number), and that you have likely been given such unique identifiers by various companies. A relation variable might store information about you, including several such identifiers. By knowing any of them, information about exactly you might be found. Non-key information, such as your date of birth, might find several people's information. A partial-key functional dependency describes a scenario in which a relation contains attributes whose values could be guessed by knowing only part of a candidate key, rather than the whole thing. For example, the first three digits of your United States social security number are based on zip-code / area-code information. By knowing only part of your SSN, other information (mailing address given when applying for the number) may be guessed. If a relation were to have attributes for both an SSN and this other information (area where the application was filed), there would be a partial-key functional-dependency. The location information is not unique to the SSN, it is unique to part of the SSN. Such information should be stored elsewhere, in another relation variable. You might have a relation which defines the three-digit code, and information about it (the location), as well as another relation in which the SSN is stored as three fields along with other information about the SSN's owner. By using the two relations together, you can find out something about each SSN owner. The location information, however, is no longer repeated unnecessarily.

Third normal form extends second normal form by requiring that relations not contain any non-key functional dependencies, also called transitive dependencies. An example of an offense would be "Joe lives in Colorado, which is abbreviated CO". While the candidate key might be the person's name, the state of residence is unlikely to be a candidate key at all. (Many people will live in the same state.) The abbreviation is obviously derived from the state name. This redundancy should also be eliminated, by storing state names and their abbreviations in one relation, and information about residency in another. Note that in this case, both the full state name and the abbreviation are candidate keys. Because of this, you could have either "Joe lives in CO" or "Joe lives in Colorado" -- the two are equivalent.

Primary, Foreign, and Surrogate keys
A primary key is a pragmatic choice of one of several possible candidate keys, to make things easier. Either the full state name or its abbreviation will be chosen to be the primary key for the relation, and all other relations which refer to it should use fields of the same type (domain) as the chosen primary key. It would be confusing, though not wrong, for a database to have relations using both candidate keys in various circumstances. Using the abbreviation when talking about residency, but using the full name when talking about elections, would unnecessarily complicate things.

A foreign key constraint is a very commonly used database-level constraint. It requires that an attribute in one relation variable never contain a value which is not present in a tuple in another relation. For example, such a constraint would prevent you from saying "Joe lives in YT" when "YT" is not present in the state-name-to-abbreviation relation variable. This is a basic way to protect a database variable from being filled with "junk" data. A foreign key constraint is formed between one group of attributes in one relation variable and a group of similar attributes in another relation variable (though sometimes the same relation variable, and in rare cases even the same exact attributes.) As an example of self-referential constraints, consider a "person" relation in which the attributes "mother" and "father" must refer to other people in the same relation variable.

In some cases, it may be preferable to use a surrogate key rather than any of the already available candidate keys, to be the primary key. These are attributes for which you define your own unique identifier. You may assign a unique number to each person in the database, even though a social security number could have worked just as well. Both will be candidate keys, but your generated number will be the primary key used throughout the database when referring to people. This is often more efficient (integers being easier to sort and compare than some other datatypes) and allows for more flexibility (such as missing SSN's), but that same flexibility may also be a liability or an oversight. In some circles, the use of surrogate keys is seen as a "hack" because it is often caused by poor design, and disregard for real-world truths.

Domain Agnosticism
While relational databases are generally free of the need to define particular datatypes (such as numbers or text), there are a few requirements. For constraints to work, there must be some way to determine the truth or falsity of an expression, which implies the need for boolean datatypes of some sort. All sets require that their contents not contain any duplicates, which means that items must be comparable at least enough to determine if two items are duplicates. As a relation header is a set of attributes, where the uniqueness is based on the names, we need to be able to compare two attribute names and determine whether or not they are different. Relation bodies must be unique overall (regardless of the uniqueness of candidate key values) which means tuples must be comparable. Two tuples are identical if all of their attributes are equal in value. Because of this, you must be able to determine the equality of any two values of any of the domains used in any of the attributes of any of the relations in the database. Relations and tuples are also required, as they are also domains themselves, and essential to the definition of relational databases.

Terminology in the Real World (tm)
This terminology is not often fully used in the "real world" (of vendors). In general, database and relation types are ignored, though the term "schema" is sometimes used to refer to databases (as domains). Instead, database variables and relation variables are referred to as being databases and relations. "Database" may also refer to a database management system (also called a database server under some circumstances). Further, the terminology of spreadsheets is often used: relations are "tables", attributes are "columns", tuples in a relation body are "rows", and attribute variables in each tuple are either "fields" or "cells". Candidate keys are generally forgotten, though primary keys are not. In the place of foreign key constraints, the term "foreign key" is used to refer to the volatile attributes likely to cause a problem, and all foreign keys "point" to primary keys (though this is not actually necessarily true.)

E.F. (Ted) Codd's 12 rules
For want of finding a good copy of his original twelve (actually thirteen) rules, I'll instead provide my current interpretation thereof:
0) An RDBMS must manage all data through relational facilities.
1) All information is represented explicitly as values in relation variables. (No pointers, no row numbers, no hidden fields, etc.)
2) Any datum can be accessed by relation variable name, attribute name, and primary key value (together).
3) NULL, it present, consistently represents missing information, and is distinct from any other value. (It's not actually a value, really.)
4) The system catalog (which provides information about the database structure) is accessible relationally, just like other data.
5) At least one well-defined text-based language must be defined which allows DML (data-manipulation), DDL (data-definition) and transaction management.
6) All views which are theoretically updatable should be updatable.
7) Views (derived relation variables,) inasmuch as possible, should be indistiguishable from the base relation variables upon which they depend.
8) Physical changes to the database variable should be as transparent as possible. Users should not notice a change unless absolutely necessary.
9) Logical changes to the database should be as transparent as possible. Again, users shouldn't have to notice. (This one's a bit odd in that you normally mean for logical changes to be visible.)
10) All constraints should be definable and enforcable in the RDBMS itself, rather than in the client application.
11) Distributed and clustered databases should act like their non-distributed / non-clustered counterparts as much as possible. (Think of it as a physical change, as per rule 8.)
12) No matter how you access the RDBMS, security and constraints should be enforced at all times.

Coming up next...
So far, then, we have seen that relational databases are set-based constructs which can store truth statements with no redundancy (if designed properly.) We should cover relational algebra next. Just as integers have algebra (addition, subtraction, etc.), relations (being domains, just like integers) have their own allowable operations. Joining two relation values together (by matching attributes) or concatenating them (employee addresses plus manager addresses), getting their difference, finding their intersection, and so forth are all part of the relational algebra.

We should also look at transactions (units of work). While neither unique to relational databases, nor required, transactions are a common component of relational database systems, and have interesting interactions with constraints. (At what point during a transaction should various types of constraints be satisfied?)

Also also coming up... (Monty Python/Holy Grail reference)
I should go back and fix some terminology just a bit more. The following table should be more correct, at least as I see things right this second.

Domain Value Variable
Domain Value Variable
Database Schema Database Database Variable (db-var)
Relation Header* Relation Relation Variable (rel-var)

* I'm not entirely sure about that, as a header likely has its own type which could be defined in terms of other types. The same would also be true of database schemas, actually. Can't you store the declaration of a database, as if it were itself a value? Say, in a text field, store information about the rules and layout of a database... Is your brain on fire yet? The main change here is to consider a "database" to be the value which a database-variable may represent and name, and the same for relations. You can do math with relations, you can store their representations in relation variables, etc.

Continued at top

Extras