On database design [scratchbook] (Article)

Description :: Things to keep in mind during the process, not hard rules

[Unfinished. I've made at least all of these mistakes, and will make more. Need to shorten and remove commentary so I don't bore you before I get to the point.]

Will the data be coming from an outside system?
Consider a case I happen to be familiar with: a point of sale system actually gets its inventory data from a warehouse, which has its own inventory database; throughout the day, new part numbers may get imported into your database for local use in the point of sales system. The warehouse is really only concerned with the basic information for products: a name their people can use, maybe a SKU, a manufacturer name, etc. Your system, on the other hand, needs to interact with customers, and needs to have pretty pictures, good text, maybe even internationalized text. You'll be needing to associate this "marketing" data with the imported "inventory" data. It's true, you could add a bunch of fields to the inventory tables (on your side), and allow them to be null until a script pulls in rows from the warehouse, and then you'd go back through and fix them up. You could also decide that some part numbers share marketing information, because they're just nitpicking on variations between models of the same basic product; so you decide to create a new table of marketing information for "products", and add a foreign key to that table in the inventory table. I would recommend against this approach as well; the inventory rows won't exist until the warehouse notifies you of the new part numbers, and although you can pre-create the marketing information, you have to stand around waiting for the warehouse in order to complete the process. It also means that if you should need to completely re-load inventory data, those tables are shared between the two systems and a simple delete-and-insert will be destructive.

I would therefore recommend that you make sure tables coming from another system are entirely segregated, and that if you need to have foreign keys from their tables to yours (because they logically make sense that way) then you should instead do a many-to-many into your own tables; find some part number that you can use as a candidate key in their tables, and create an extra table (product_parts) linking the inventory and marketing information together. If you're worried about cardinality, just add a unique constraint on this many-to-many table that guarantees that an inventory part number will only be related to one marketing product. This also has the benefit that as you're typing in the product information, you can go ahead and associate it (if you relax the foreign key constraint) with the inventory part, so that as soon as that data arrives in your database, everything's synched up and ready to go. If you do relax that foreign key (i.e. not implement one) then at least put a trigger on the 'foreign' table to detect updates to key fields and perform your own version of on-update-cascade. You may have known what their product key was when you first prepared marketing materials, but nothing prevents them from changing their primary key values; if the system that updates your tables is smart enough to update existing rows to change their primary key (or whatever key you're using), you need to be smart enough to stay synchronized with those changes. By keeping the foreign table clean of any of your own fields (which you moved into that many-to-many table), you make truncate/insert operations a lot easier. You can treat it as "their data", separate from your own. You could even give that vendor access to modify that table, without giving them the ability to re-associate their parts with random marketing products on your side.

Are any core concepts and processes shared across the database?
Most databases are pretty bland: you've got people, parts, locations, services, bills, problems, solutions, changes, etc. But sometimes you get to play with a database that requires some actual algorithm work: maybe some queues (tasks), or some trees (org-tree), or better yet some graphs (shipping lanes and transfer centers) and some finite state machines. Maybe some of these "interesting" concepts even come up in several different places in the database -- maybe your robots and your people both have work queues; maybe you've got a tree of people and a tree of expressions in a decisioning system; maybe you've got a graph of routes between locations as well as a graph of technologies (sometimes incorrectly called a "tech tree" in video games) and a graph of relationships between documents.

You could model those "properly", right along with the entities they relate to -- but you could also model them separately, and "tack on" the specific entities for specific circumstances (satellite tables that share a primary key with the tasks in the generic table, but extend it with required fields,) bundling all trees together, all graphs togther, all queues together, all finite state machines together, regardless of the entities involved. If you find yourself building stored procedures for tasks related to these algorithms, it should be helpful to have a single procedure for "increasing the priority of a thing within a queue", regardless of who owns the queue or what the items in that queue are, than having separate procedures for "increasing the priority of a robotic task" and "increasing the priority of a personnel task". Stored procedures are generally checked for correctness as long as you don't embed your queries in strings inside the stored procedure; although you could build a single procedure to which you passed a flag indicating which table (robot_tasks vs. human_tasks) to operate against, it would likely involve concatenating strings together to form queries, and the server wouldn't be able to warn you of table changes that could break your procedure. On the other hand, having a procedure that uses a large "if" (case, switch) statement to decide which table to operate against is rather messy and prone to error. The single-table approach for algorithms helps clear all that up, and is consistent with object-oriented techniques in traditional programming. If you feel the need, you can always create some views to expose robot_tasks and human_tasks to users and hide the single internal task queue table from them, bringing you back to a more traditional table layout as far as anyone else is concerned.

What concepts haven't occurred to anyone yet? (Use a thesaurus)
Thesaurii are absolutely amazing brainstorming tools. Go design a database about problems and solutions; would you think of symptoms, diagnoses, treatments, tests, etc.? Try designing one for tracking changes (to software); sure, you'd think of bugs and their states, and maybe even of releases, but would you separate out the problems from their solutions, solutions from changes, and changes from releases? A thesaurus can point you to related words that might help you frame the context of your thinking; remind you of related concepts with nitpicking differences that might be important to you, but that your customer hasn't discussed so far. Maybe you'll find concepts that aren't important yet, but that will be important in a future release, and should be planned for early. For that matter, maybe you'll just find a better name for your tables; we hate long table names, or ugly table names, or hackjobs of table names -- a thesaurus can fix that for you. Use one.

Separation of powers (don't assume you know anything)
So you're storing someone's name -- do you really feel confident you should be splitting that into first, middle, and last names? Not all cultures give people two or three names; some names may be in character sets you forgot to support; your system probably needs to handle both business and individual names, and business names being fit into a last-name field column, while convenient, is also rather messy if you step back and think about it. (Is it really a family name? Sure, it might be Zimmerman Dentistry, but it could also be Bob's Auto Parts -- is that not more correctly placed in the first name column?) Prefix and suffix information ("Dr. Julius Caesar, M.D.") is just as messy; yes, you may want some gender and age information (Mr., Mrs., and Ms.) for call-center purposes when they cold-call the customer, but should you drive that off the name fields? Consider keeping those separate and explicit ("young unmarried female" is a perfectly valid lookup code) and just put the prefix/suffix information into the name field. You might keep separate name fields, but I suggest splitting them by purpose: formal (with prefix and suffix), slightly less formal (prefix only), informal (nickname, no prefix), official (exact first and last name, nothing else), etc. so your phone calls, letters, and government documents can all get filled in appropriately. Of course when it comes to government documents, you may be forced to use the first-name/last-name concept; in that case I recommend a set of fields just for that purpose, separate from what you use for letters, phone calls, etc.

Now consider street addresses (for mail delivery): you've got a lot of things going against you if you try to take an address apart.
- Delivery addresses for USPS, while handy for FEDEX and UPS, are really in a USPS format; in the future, we could easily see delivery services ask for addresses in another format entirely (possibly by unique name only, where they do the lookup to get a physically routable address, taking care of any change-of-address issues?)
- The format you're thinking of, that goes something like:

Name
Line1
Line2
City, State, Zip5-Zip4
Country

... is really for the USA only. In France, at least, the format would be more like:

Name
Line1
Line2
Zip5 City
Country

Please see Publication 28 in the links section below for details on how messy an address can actually get, and still be properly formatted.

I've seen even more odd formats for other countries. Yes, there are some commonalities, but should we really assume they're here to stay?
- One of the messaging systems I deal with specifically tries to split out the number part of the first line of the address, assuming all addresses are in a format like "Number Direction Name Type" (123 North Main Dr.) The problem is that some addresses, such as addresses starting with a PO Box, or addresses in which a user has decided to use line 1 for an "attention" line, will fail within the messaging system for no very good reason. Neither the sender nor the recipient really wanted the information filtered, but the messaging system did it anyway.

You could, of course, create a set of tables for each possible delivery system (and that's an odd case too -- you may send a package to another country, using their addressing format, but it's going to first go through your local routing and delivery service, such as USPS, and they have to be able to read it too!) and track very precisely the rules for every country. You could even make it extensible so you can easily add in a table for the new Star-Trek-like transporter system which doesn't use city, state, and zip, but merely uses a "station" number. But you probably won't -- if you're feeling spunky you'll just create a blob field and use linefeed characters in the text you store, ready for printing. (And then your reporting team will throw a fit when they have to explain your reasoning to some grumpy business person.)

Topics I still actively intend to address soon:

Do I need to audit this data later?
Do I need to easily recover point-in-time data?
What if I don't own the source data?
What would make this easy to figure out?
Will some areas need to be further refined in future revisions?
Is there a concept of plugins involved?
Is there a concept of inheritance involved?
What would make this easy to secure?
How can I make the upgrade painless or foolproof?

Continued at top