Database vs. Filesystem, Blob vs. File (Article)

Description :: Marginal performance drain for drastic feature gain

Storage

File/Filesystem

Blob/Database

I generally tell people to use the technologies they're comfortable with, and most people are more comfortable with files than with databases, and certainly more than with blobs stored in databases. I'd like to think that databases are easy enough to use that even a newcomer will make fewer mistakes storing blobs than managing files for mass storage. Rather than rely on that argument, I'll rely on this one: databases are easy enough to learn that you're better off learning one than implementing a complex file-based solution.

Claim: files make databases easier to back up. Smaller database files are more transportable, it is true. If you've ever had to split a database into 2GB chunks for transfer via FTP, you'll know this for sure. But don't forget that you still need to back up those external files! Database backups have improved: online backups are now possible, where concurrent users, even ones making changes, won't block the backup process; incremental and differential backups produce smaller backup files and give you nearly point-in-time recovery options at little cost. While you can mimic this with your externally-stored files, you'll need to synchronize your backup processes so you can reliably restore the database and the files to the same point in time. File-based backups also need to take care with files that are being modified during the backup, to avoid capturing corrupt copies.
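
To make the online backup point concrete, here is a minimal sketch using SQLite through Python's standard `sqlite3` module. The `documents` table, the sample bytes, and the `online_backup` helper are my own illustrative assumptions; other database engines have their own backup tools, and this only shows the general shape.

```python
import os
import sqlite3
import tempfile


def online_backup(source: sqlite3.Connection, target_path: str) -> None:
    """Copy a live database, blobs included, into a new file."""
    target = sqlite3.connect(target_path)
    try:
        # Pages are copied in batches, so the source is only locked
        # briefly for each batch rather than for the whole copy.
        source.backup(target, pages=64)
    finally:
        target.close()


# Hypothetical table holding file content as a blob.
src = sqlite3.connect(":memory:")
src.execute(
    "CREATE TABLE documents (id INTEGER PRIMARY KEY, name TEXT, body BLOB)"
)
src.execute(
    "INSERT INTO documents (name, body) VALUES (?, ?)",
    ("logo.png", b"\x89PNG\x00\x01"),
)
src.commit()

backup_path = os.path.join(tempfile.mkdtemp(), "backup.db")
online_backup(src, backup_path)

# The backup is a complete database: metadata and content together.
restored = sqlite3.connect(backup_path)
restored_body = restored.execute(
    "SELECT body FROM documents WHERE name = ?", ("logo.png",)
).fetchone()[0]
restored.close()
```

The point is that one call captures the records and the blobs at the same moment, with no second backup process to keep in step.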

A database without its associated external files is generally of little use, but one good case is often made: developers need to take a snapshot of a production database, copy it to their computer, and do some analysis on it. They generally don't need copies of all the external files for this, and copying the larger data set would only slow down the bug-hunting process.

Claim: some tools really need files. Most applications are built to open files from the filesystem, not from a database; you can't provide Photoshop with an SQL query to go fetch the exact image you want to edit. Just as the strength of the Internet lies in its open, re-usable nature, filesystems provide a shared space for data manipulation that databases have yet to equal. But there's hope yet. While building a new filesystem that pulls its data from a database is possible, it's neither easy nor widespread, though there have been a few examples, such as the Gmail filesystem (GmailFS). Rather than building every layer of this solution yourself, you can build a WebDAV layer to expose your database blobs as HTTP resources in a standard way, then use one of many existing WebDAV filesystems for the last mile; see davfs2 for Linux as one example (cadaver, a command-line WebDAV client, is handy for testing). WebDAV is not the highest-performance protocol out there, and it certainly has its limitations: the locking protocol is optional, and file operations tend to replace the entire file rather than individual blocks, which makes it a poor fit for streaming. But it has the advantage of being standardized and not overly difficult to implement on top of whatever database tables you might have. The result is that applications don't need to know they're accessing blobs in a database; they can continue to see them as files on a filesystem. So don't let the tooling issue discourage you entirely. A little bit would be fine though.
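
As a rough sketch of the first step, exposing blobs as HTTP resources, here is a tiny handler built on Python's standard `http.server` and `sqlite3` modules. It only covers GET and PUT; a real WebDAV layer would also need methods such as PROPFIND, which I have left out. The `blobs` table, `read_blob`, `write_blob`, and `BlobHandler` are all names I made up for illustration.

```python
import sqlite3
from http.server import BaseHTTPRequestHandler

# Hypothetical schema: one row per resource, keyed by URL path.
db = sqlite3.connect(":memory:", check_same_thread=False)
db.execute("CREATE TABLE blobs (path TEXT PRIMARY KEY, body BLOB NOT NULL)")


def write_blob(path: str, body: bytes) -> None:
    """Create or replace the blob stored under a path."""
    with db:  # commits on success, rolls back on error
        db.execute(
            "INSERT OR REPLACE INTO blobs (path, body) VALUES (?, ?)",
            (path, body),
        )


def read_blob(path: str):
    """Return the blob stored under a path, or None if there is none."""
    row = db.execute(
        "SELECT body FROM blobs WHERE path = ?", (path,)
    ).fetchone()
    return None if row is None else row[0]


class BlobHandler(BaseHTTPRequestHandler):
    """Serve database blobs as if they were files under a web root."""

    def do_GET(self):
        body = read_blob(self.path)
        if body is None:
            self.send_error(404)
            return
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_PUT(self):
        length = int(self.headers.get("Content-Length", 0))
        write_blob(self.path, self.rfile.read(length))
        self.send_response(204)
        self.end_headers()
```

To try it, something like `HTTPServer(("127.0.0.1", 8080), BlobHandler).serve_forever()` (with `HTTPServer` imported from `http.server`) should be enough; the address and port are arbitrary.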

END of article content.

[once more, unfinished content. I swear I'll finish this someday.]
a. Files are more convenient
. A database without its associated large files may be useless or dangerous.
. Sometimes you just want to copy the database back to a dev box for testing, quickly.
You can cluster the database servers while keeping a single file store, but ...?
. Real backups are difficult to snapshot; something will be out of synch.
. A database provides that.
Restoring "deleted" records is not really made easier by restoring individual files.
Logical deletes, audit triggers, etc. provide that.
. Some tools really need files.
. WebDAV is a pain, but depending on platform may get you very close (cadaver)

b. Files are faster
Only true as long as the files are accessed outside of any database access. (security?)
Filesystems historically deal poorly with large file counts.
You can work around the folder issue with name hashing and sub folders but
You have to reinvent the wheel each time
You have to carefully reimplement tree balancing
Your file names, or at least paths, are likely meaningless
Databases have their own cache mechanism, not just a filesystem feature.
Databases may store small blobs near the records you're accessing anyway.
Independent image access is only a result of HTML IMG tags.
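
The name hashing workaround mentioned above usually looks something like this sketch. The `sharded_path` function, the two-level layout, and the choice of SHA-256 are my assumptions, not a standard; every project tends to pick its own variation, which is rather the point.

```python
import hashlib
from pathlib import PurePosixPath


def sharded_path(name: str, levels: int = 2, width: int = 2) -> PurePosixPath:
    """Spread files across subfolders derived from a hash of the name."""
    digest = hashlib.sha256(name.encode("utf-8")).hexdigest()
    # The first few hex characters pick the folders, e.g. ab/cd/<digest>.
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return PurePosixPath(*parts, digest)


path = sharded_path("holiday-photo.jpg")
```

Note that the resulting path says nothing about the file it holds, so you still need a lookup table somewhere to map names to paths.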

c. Files are safer
You don't fail gracefully back to images without a database. If that were true,
everyone would use a database for index cards, and paper for real data!
You can't reliably commit to the database and filesystem without leftovers somewhere.
You can't modify files without locking or corruption to fellow readers (incl. backup.)
You can't guarantee that file changes will also update the full-text-index, hash, etc.
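
For contrast, here is a minimal sketch of what the database gives you when content and derived data live together. The `files` and `file_hashes` tables and the `store` helper are hypothetical, and the failure is simulated; the point is only that a crash between the two writes leaves nothing behind.

```python
import hashlib
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE files (name TEXT PRIMARY KEY, body BLOB NOT NULL);
CREATE TABLE file_hashes (name TEXT PRIMARY KEY, sha256 TEXT NOT NULL);
""")


def store(name: str, body: bytes, fail_midway: bool = False) -> None:
    """Write the blob and its hash in one transaction."""
    with db:  # commits if the block finishes, rolls back if it raises
        db.execute("INSERT INTO files (name, body) VALUES (?, ?)", (name, body))
        if fail_midway:
            raise RuntimeError("simulated crash between the two writes")
        db.execute(
            "INSERT INTO file_hashes (name, sha256) VALUES (?, ?)",
            (name, hashlib.sha256(body).hexdigest()),
        )


store("ok.bin", b"data")
try:
    store("broken.bin", b"data", fail_midway=True)
except RuntimeError:
    pass  # the half-finished write was rolled back

file_count = db.execute("SELECT COUNT(*) FROM files").fetchone()[0]
hash_count = db.execute("SELECT COUNT(*) FROM file_hashes").fetchone()[0]
```

Doing the same across a database and a filesystem means writing your own cleanup logic for every place the two can disagree.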

d. Files are more secure
Databases can secure content by more than one attribute (folder.)
By associated fields
By current document content
By time of day, concurrent users, or friends list, ...
Databases provide a single authentication and authorization layer, not also shares.
Databases can have triggers (not just "watch" daemons) Could even audit reads.
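
A small sketch of the trigger idea, again with SQLite via Python. The `documents` and `audit_log` tables and the trigger names are made up for illustration. SQLite triggers only fire on writes, so auditing reads, as suggested above, would need a different engine or an application-level hook.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE documents (
    id INTEGER PRIMARY KEY,
    owner TEXT NOT NULL,
    body BLOB NOT NULL
);
CREATE TABLE audit_log (
    id INTEGER PRIMARY KEY,
    document_id INTEGER NOT NULL,
    action TEXT NOT NULL,
    at TEXT DEFAULT CURRENT_TIMESTAMP
);
-- Every change is recorded, whichever application makes it.
CREATE TRIGGER documents_update AFTER UPDATE ON documents
BEGIN
    INSERT INTO audit_log (document_id, action) VALUES (OLD.id, 'update');
END;
CREATE TRIGGER documents_delete AFTER DELETE ON documents
BEGIN
    INSERT INTO audit_log (document_id, action) VALUES (OLD.id, 'delete');
END;
""")

with db:
    db.execute(
        "INSERT INTO documents (owner, body) VALUES (?, ?)", ("alice", b"v1")
    )
    db.execute("UPDATE documents SET body = ? WHERE id = ?", (b"v2", 1))
    db.execute("DELETE FROM documents WHERE id = ?", (1,))

audit = db.execute(
    "SELECT document_id, action FROM audit_log ORDER BY id"
).fetchall()
```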

e. Files are easy
The folder rebalancing (above) is not easy.
Concurrent access is not easy.
Consistent file manipulation across apps is not easy.
When using external tools, can't even rely on libraries to keep access consistent!
Ad-hoc requests for new types of manipulation (bulk delete, etc.) are not easy.
You don't need to worry about escaping binary data; use prepared statements, API.
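
Two of the points above, binary data through prepared statements and ad-hoc bulk operations, fit in one short sketch. The `attachments` table and the sample bytes are hypothetical.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE attachments ("
    "id INTEGER PRIMARY KEY, kind TEXT NOT NULL, body BLOB NOT NULL)"
)

# Bytes that would be painful to escape by hand: NUL, quote, high bytes.
tricky = b"\x00'; DROP TABLE attachments; --\xff"

with db:
    # Placeholders pass the bytes through untouched; no escaping needed.
    db.executemany(
        "INSERT INTO attachments (kind, body) VALUES (?, ?)",
        [("thumbnail", tricky), ("thumbnail", b"\x01\x02"), ("original", tricky)],
    )

round_trip = db.execute(
    "SELECT body FROM attachments WHERE kind = ?", ("original",)
).fetchone()[0]

with db:
    # An ad-hoc bulk delete is a single statement, not a directory walk.
    deleted = db.execute(
        "DELETE FROM attachments WHERE kind = ?", ("thumbnail",)
    ).rowcount

remaining = db.execute("SELECT COUNT(*) FROM attachments").fetchone()[0]
```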

f. Files aren't data
Just because it's binary doesn't mean a server can't manipulate it.
UDFs could very well do image manipulation, sound manipulation, etc.
Where do you draw the line? Is an email "data" or not? Just the headers?
Today's "just a file" becomes tomorrow's mineable data
xml blob?
exif data?
shape files?
geographic data?
pdf text extraction?
Advanced features can be transparently added to databases, easier than new filesystem:
Compression/decompression on the fly
Change (diff) detection
Immediate full-text indexing
Content encryption/decryption on the fly
(Apps making changes do not need to know about these, and can't avoid them either)
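
Here is one way the "transparent" part could look, sketched with SQLite user-defined functions registered from Python. The `compress` and `decompress` function names, the `documents_raw` table, and the `documents` view are my own inventions, and I'm assuming a SQLite build with default settings that lets views and triggers call application-defined functions.

```python
import sqlite3
import zlib

db = sqlite3.connect(":memory:")

# Register two UDFs backed by zlib; apps never call them directly.
db.create_function("compress", 1, zlib.compress)
db.create_function("decompress", 1, zlib.decompress)

db.executescript("""
CREATE TABLE documents_raw (name TEXT PRIMARY KEY, packed BLOB NOT NULL);

-- Apps read and write this view and only ever see plain content.
CREATE VIEW documents AS
    SELECT name, decompress(packed) AS body FROM documents_raw;

CREATE TRIGGER documents_insert INSTEAD OF INSERT ON documents
BEGIN
    INSERT INTO documents_raw (name, packed)
    VALUES (NEW.name, compress(NEW.body));
END;
""")

text = b"blob " * 1000
with db:
    db.execute(
        "INSERT INTO documents (name, body) VALUES (?, ?)", ("notes.txt", text)
    )

body = db.execute(
    "SELECT body FROM documents WHERE name = ?", ("notes.txt",)
).fetchone()[0]
stored_size = db.execute(
    "SELECT length(packed) FROM documents_raw"
).fetchone()[0]
```

The application inserts and selects ordinary bytes; the compression happens on the way in and out whether the application likes it or not.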