Article > On the perils of genericity
Description :: Programmers easily take "handles anything" too far
0. Index
1. Introduction
 1.1 Class definition
 1.2 Key-value
 1.3 User-defined fields
 1.4 XML Blob
 1.5 DOM
2. Convenience
3. Performance
 3.1 Joins
 3.2 Field types
 3.3 Record sizes
 3.4 Indexing
1. Introduction
 I won't lie. I've been there. I've blundered down this path myself. It's tempting. You need a database to contain variable information. By variable, I mean you don't know what it'll be. You may have a rough idea, but you're intending to deploy the software to a customer who will maintain "forms" themselves. So you get to thinking about how to make a "do-anything" generic database. You re-invent the wheel. Some examples:
1.1 Class definition
You're a type-safe object-oriented programmer (C++ or Java) and decide to replicate all that safety in your database. Everything must be well-defined. But flexible. So you come up with this:
 create table forms (
  formId integer not null primary key,
  formName varchar(100) not null unique
 );

 create table records (
  recordId integer not null primary key,
  formId integer not null references forms (formId)
 );

 create table fields (
  fieldId integer not null primary key,
  formId integer not null references forms (formId),
  fieldName varchar(100) not null,
  fieldType varchar(100) not null check (fieldType in ('text', 'number', 'date')),
  unique (formId, fieldName)
 );

 create table values (
  recordId integer not null references records (recordId),
  fieldId integer not null references fields (fieldId),
  fieldValue varchar(10000),
  primary key (recordId, fieldId)
 );

   *or*

 create table values (
  recordId integer not null references records (recordId),
  fieldId integer not null references fields (fieldId),
  textValue varchar(10000),
  numberValue decimal(15,5),
  dateValue timestamp,
  ...
  primary key (recordId, fieldId)
 );

   *or, because you want to allow multi-value*

 create table values (
  valueId integer not null primary key,
  recordId integer not null references records (recordId),
  fieldId integer not null references fields (fieldId),
  ...
 );
1.2 Key-value
You have no idea what your data will be, but you've heard of master-detail table relationships and want to do one better than so-called "tagging". Maybe you'll do this:
 create table documents (
  documentId integer not null primary key,
  ...
 );

 create table slots (
  slotName varchar(100) not null primary key
 );

 create table attributes (
  documentId integer not null references documents (documentId),
  slotName varchar(100) not null references slots (slotName),
  attributeValue varchar(10000)
 );
1.3 User-defined fields
If you're, on the whole, using your database correctly but find it inconvenient in a few spots, or if you come from a COBOL background and are accustomed to leaving yourself a little padding in your data files, maybe this would happen:
 create table items (
  itemId integer not null primary key,
  ...
  userfield1 varchar(100),
  userfield2 varchar(100),
  ...
  dateField1 date,
  dateField2 date,
  ...
 );
1.4 XML Blob
If you happen to think databases are nothing more than unjustly glorified filesystems, and you already use XML everywhere and can't be arsed to learn what a database is actually for, you might take this shortcut:
 create table things (
  thingId integer not null primary key,
  xmlData blob
 );
1.5 DOM
If you were forced to take an XML file that can "do anything" (and does) and normalize it into tables, you might come up with something like this:
 create table elements (
  elementId integer not null primary key,
  parentElementId integer references elements (elementId),
  elementContent blob
 );

 create table documents (
  documentId integer not null primary key,
  rootElementId integer not null references elements (elementId)
 );

 create table attributes (
  elementId integer not null references elements (elementId),
  attributeName varchar(100) not null,
  attributeValue varchar(10000),
  primary key (elementId, attributeName)
 );
 Now that you know what kind of database I'm talking about, let's talk about why this may seem like the awesomest Mountain Dew-inspired 4am idea you've ever had, but will be your worst caffeine-dependent 11pm nightmare.
2. Convenience
 You've been clever. Great. Good for you. But now you'll discover that whether you're accessing your own database through a generic GUI database tool or trying to run reports off of it (especially through any sort of GUI report designer), no standard tools have any clue what you've done. They can't help you. They can't auto-discover the "underlying" database layout. All they see is your awesomeness, which is no help at all when you want them to see things from the client's point of view, not yours.
 You see, all those tools assume that your end-user data is in normalized tables — tables with normal names, normal fields, normal datatypes. Nothing fancy. You can recreate that "simplistic" model with a lot of extra views (and triggers, to make them updatable), but then you'll have defeated the purpose of a generic database: one that isn't constrained to a particular business requirement, to one set of tables and fields with specific meanings. You're on your own.
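 To make that concrete, here's roughly what one of those views might look like against solution 1a, assuming a hypothetical "person" form (formId 1) with "name" and "birthdate" fields (fieldIds 1 and 2; all of these identifiers are invented for illustration):

 create view personView as
 select records.recordId,
        max(case when values.fieldId = 1 then values.fieldValue end) as name,
        cast(max(case when values.fieldId = 2 then values.fieldValue end) as date) as birthdate
 from records
 left join values on values.recordId = records.recordId
 where records.formId = 1
 group by records.recordId;

 And you'd need one of these (plus insert/update/delete triggers) for every form, regenerated whenever the customer edits a form definition.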
3. Performance
 This won't sway those who already believe databases are intrinsically slow (and who may have resorted to XML Blobs to "fix" that problem), but generic databases really don't help performance. At all.
3.1 Joins
 Every time you want to extract data from solution 1 ("class definition"), you'll find yourself doing a lovely join: records to forms (inner), to fields (left), to values (left, and across both fields and records). That doesn't even get you an easy-to-query view of the data, but it certainly does slow things down:
select records.recordId, forms.formName, 
       fields.fieldName, fields.fieldType, values.fieldValue
       /* or values.numberValue, etc. if using solution 1b */
from records
inner join forms on forms.formId = records.formId
 /* if values are sparse, you could use default values from the fields table */
left join fields on fields.formId = records.formId
 /* if values have a nullable fieldId, or fieldId accidentally gets changed to be on another form, or you want to add a fields.disabled attribute, you can use this to strip out values that no longer make sense */
left join values on values.recordId = records.recordId and values.fieldId = fields.fieldId
where records.recordId = ?;
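 Compare that with the equivalent lookup against a plain normalized table (a hypothetical persons table, for the sake of argument):

select personId, name, birthdate
from persons
where personId = ?;

 One table, no joins, and every tool on the planet understands it.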
 The rest of the solutions just get more annoying. With slots, you don't know which slots might make sense. With user-defined fields, you wind up renaming and casting fields everywhere. (You could create a view, sure, but that gets into the realm of the customization you're trying to avoid.) With the XML and DOM solutions, you're on your own, period. They work out a little better if your end product is XML anyway, to be consumed by some other layer that gets to deal with the complexity, but that's just shifting the burden.
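 That user-defined-fields view, for what it's worth, is at least straightforward, which is part of the temptation. Something like this, with the column meanings invented for illustration (say the customer uses userfield1 for a phone number and userfield2 for a credit limit):

 create view itemsView as
 select itemId,
        userfield1 as phoneNumber,
        cast(userfield2 as decimal(15,2)) as creditLimit,
        dateField1 as renewalDate
 from items;

 But now the view itself encodes one customer's meanings, which is exactly the customization you set out to avoid.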
3.2 Field types
 Generally speaking, extra nullable fields don't waste a lot of space in a modern database. They did back when records were fixed-size, and continued to when CHAR datatypes were all we had, but nowadays NULL and otherwise blank fields aren't a huge cost. In Firebird, for example, every record contains a bit-array that marks, up-front, which fields are NULL; after that, only the fields marked non-NULL have values, packed densely and compressed (RLE). Other systems use a "tagging" approach, where a record is made up of little chunks that each identify which field they represent, and NULL fields simply don't show up. So I'm really not worried about having extra fields lying around unused. Having something like solution 3's dateField1, dateField2, etc. won't hurt you.
 But if you have something like solution 1a's fieldValue, you've got a constant cost in terms of casting values between datatypes. You also need to make room for the biggest possible value, which may mean using a BLOB field, which I don't object to in itself, but which makes life difficult in all sorts of ways. BLOBs may be stored outside the record, they're not necessarily retrieved inline when you select, they have different rules for comparison operators, casting is required for just about anything useful, you can't always index them, ...
 In solution 4, xmlData isn't just one attribute whose type is variable; no, it's far worse than that. It's a whole document tree, full of attributes and node values that are all, essentially, untyped. Sure, there are XSDs and DTDs for validation. I'm sure you're going to run those every time you update that field, right? And when you pull the data out, they really help with casting, hunh? (Please, just don't use solution 4. Every time you do, a deity kills something cute and cuddly.)
3.3 Record sizes
 If your database product has per-record size limits (whether for storage or transmission), then solutions 1b and 3 may hit them quickly. The database must plan for every field you add, however rarely used: your records may never come close to the limit in practice, but their theoretical maximum size soon exceeds it, and your database product will probably decline to add the columns.
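 For a concrete example: Firebird caps a record at 64KB, so a sketch like this would be rejected at create time, even though no real row would ever come close to holding that much data:

 create table items (
  itemId integer not null primary key,
  userfield1 varchar(32000),
  userfield2 varchar(32000),
  userfield3 varchar(32000) /* theoretical record size now past 64KB: rejected */
 );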
3.4 Indexing
 Having extra fields per type isn't so bad, but once you start indexing them, you have indices on lots of unused fields, or indices that are NULL-heavy because solution 1b's numberValue is rarely used. Each index is larger than it needs to be, because you've got more value rows than a traditional table would have and more NULLs than really make sense, and you now pay the price of keeping each index up to date more often.
 Furthermore, the database can't easily combine indices when you run an AND search across different fields: it's now trying to find different value records, from which it can then find the relevant records, then AND those together, and so on — a normalized table (combined with a database server that has a habit of doing this, like Firebird) would do that with a bitwise combination of the indices. Even better, you could optimize a normalized table with compound indexing, but you lose that option with these generic solutions, even with solution 3's user-defined fields: you don't know what the fields will mean, so you can't plan ahead on which fields to index, in which order, in a compound index. Admittedly, such an index could be added after the fact by the end-user, and it would assist even your dynamically-built queries, but we're getting close to the realm of customization again.
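 To see what you're giving up, compare the two. With a normalized table (a hypothetical persons table again), planning for an AND search across last and first name is one compound index; with solution 1a, those two values live in separate rows of the values table, so no single compound index can span them, and the best you can do is index each (fieldId, fieldValue) pair:

 /* normalized: one compound index serves the whole AND search */
 create index nameIdx on persons (lastName, firstName);

 /* generic: the best available, one entry per value row */
 create index valueIdx on values (fieldId, fieldValue);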
 Creating an index on something like solution 1a's fieldValue isn't great either: if you're using a massive text field that will later be cast to its correct type, the index will be a text index, which is useless for date or number operations. You could create an expression/function-based index (computed by) for each value type you might want to store, written to avoid casting errors (the expression returns NULL, or something similar, for any value that can't be cast to the appropriate type), but that solution is cumbersome and expensive at runtime. An expression index only gets used when the query optimizer can recognize that what you're searching by is indexed, which means that every time you want to sort by a field and treat it as numeric, you have to match that same expression operator-for-operator, something like:
... order by (case when isnumeric(fieldvalue) then cast(fieldvalue as integer) else null end)
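 In Firebird syntax, the matching expression index would look something like this (isnumeric is a stand-in here, not a built-in; substitute whatever safe-cast check your database actually offers):

 create index numericValueIdx on values
  computed by (case when isnumeric(fieldValue)
                    then cast(fieldValue as integer)
                    else null end);

 And every query that hopes to use it must repeat that exact case expression, operator for operator.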
 END of article content.
[outline, because I'm making this public well before it's done]
Convenience
 isql: no view/procedure can make that easy
 no reporting tool will know how to use db without views/procedures
Performance
 Joins
 Field types
 Record sizes
 Indexing
 Filtering
 Blobs, extracting pieces of xml
Correctness
 Joins (left)
 Fields (which mean what)
 Field types
 Meta changes (add, remove, change type)
 Allowing multi-values
Constraints
 Drop-downs
 Other FK's
 FK's into these records (type)
 Multi-field constraints
Limits
 Field type limits
 Rowsize, if maxing field limits
 Inheritance -- it can do anything, right?
Accidents
 Values (in the wrong place)
 Meta-change, missing values, extra values, wrong type
 Not filtering rows properly
 Mass-update
Business rules
 Apply to all record types
 Defer until all values inserted, then fire
 Even triggers deferred
Security
 Views
 Procedures
 ...
Backup/restore
 Single-table, can't be partial
Usefulness
 Searching 'all fields'