Poor man's database tricks (Article)

Description :: How to make good use of your free/libre/open-source database system's features

0. Index

1. Materialized Views
1.1 Definition
1.2 Storage
1.3 Scheduled refresh
1.4 On-demand refresh
1.5 Automated refresh
1.6 Read-only
2. User-aware data
2.1 With Views
2.2 Without User Accounts
2.3 With Table Functions
2.4 Updateable
2.5 Other Security
Upcoming topics

1. Materialized Views

Oracle's materialized views (or snapshots) are a means of keeping pre-calculated and often-indexed sets of data on hand, generally for performance reasons, though they may also be used to isolate one database from another where connectivity is unreliable. They are characterized by the following main features:

- Refreshed on a scheduled basis, or

- Refreshed as rows in source tables are modified, or

- Refreshed on demand

- Read-only except when being updated by approved means

If your relational database system of choice does not natively support materialized views, you may mimic the above behavior (and in fact, Oracle used to do this itself) with the following:

1.1 Definition

Create a VIEW ("sourceview" below) with appropriately-aliased column names that can serve as the main source of data for the snapshot, pulling from various source tables.

1.2 Storage

Create a TABLE that resembles the VIEW, most easily via CREATE TABLE ... AS SELECT * FROM SOURCEVIEW. Index the table as needed for the expected usage pattern, both in terms of frequent updates to the table as source data changes (fewer indexes), and the searches performed against it (well-chosen indexes -- materialized views are, after all, generally designed to enhance performance!)

1.3 Scheduled refresh

Create a STORED PROCEDURE that replaces the table's contents with the current result of the view; if you have TRUNCATE available to you, use it with care: it may be non-transactional and leave your users with no accessible data while you are refreshing the snapshot. DELETE and then re-INSERT the data into the table.

You have the option of temporarily disabling your indexes, or if that feature is not available, of dropping and recreating them before and after the refresh, respectively. If you have INSERT OR UPDATE or MERGE INTO available, you may consider them as options for updating the cached dataset; however remember that these might not have the option to delete rows that have disappeared since the last refresh, leaving you with 'cruft' in the materialized view. In some cases, you may have the option of inserting your new data into another table first, then swapping them out quickly (by renaming the tables, or something more physical -- such as swapping out their data segments.)

Changes to the table itself may cause dependent stored procedures and triggers to become invalid, which may or may not be allowed by your database system. You may be able to overcome this by assuming that the overhead of EXECUTE IMMEDIATE or its equivalent, since you perform refreshes relatively rarely, is low enough that this level of isolation from changes to the table suits you. (Database systems that track dependencies between database objects like procedures and views generally don't notice when a table name is embedded in a text variable that is intended to be parsed at runtime via EXECUTE IMMEDIATE and can therefore be fooled.)

Schedule this stored procedure to be executed on a regular basis by your favorite job scheduler (e.g. cron, CosBatch.) Remember to set up notifications and ROLLBACK in case of failure.

1.4 On-demand refresh

As long as you've provided a convenient stored procedure for refreshing the entire materialized view, it's a cinch to grant access to this stored procedure to the users who may need to ask for it. If you need to set up refresh groups (whereby multiple tables are refreshed together) you can create a wrapper stored procedure that knows about all of them, and forces them all to refresh together, rolling them all back together if any of them fail. (Please see the exception-handling section of your database system's procedural-language documentation.)

1.5 Automated refresh

Oracle's ability to automatically refresh parts of materialized views is actually somewhat limited; it only knows to look for changes to rows by PRIMARY KEY or ROWID. If your materialized view, even in Oracle, were to be formed as the JOIN of a master/detail relationship, and you were capturing aggregate results (e.g. "number of shipped items per order"), such that changes to either the master or detail table would need to cause the materialized view to refresh, you would need at least a slightly customized solution.

You may set up triggers on the detail table to make dummy changes to their parent rows, then base all of your snapshot refresh logic on the changes to the master table, or you can set up triggers on all relevant tables to customize how much of the materialized view is updated when changes occur. (For example: your materialized view may be primarily concerned with orders, but include the name of an oft-used vendor in one of the columns, via a JOIN; any change to the vendors table would cause not one, but many rows in the materialized view to refresh, yet you may not want to refresh the whole thing. You could customize the trigger on the vendors table such that it causes all rows in the materialized view that include that vendor's id (and/or name) to be updated all at once.)

You may have the option of re-using the view you created above: your trigger would DELETE the rows that have changed (or in the case of an INSERT on a source table, effectively do nothing) then INSERT the rows that result from SELECT * FROM SOURCEVIEW WHERE ... (none in the case of a DELETE on the source table or some updates that unlink records from each other;) in this case, an UPDATE would be handled as a pair of deletions and inserts.

You could specifically code for this, but as the number of source tables increases, you become more likely to code one of these updates incorrectly; it's easier and safer to have a single point of contact with the materialized view that has barely sufficient knowledge to know which rows to pull from (possibly passing it an old-primary-key and new-primary-key pair, just in case the primary key it bases its queries on changes as part of the update) and just deletes and re-inserts from the source view, making it easy to tweak the definition of the materialized view at runtime. (This is true in Oracle as well, by the way; if you include the entire query in the materialized view definition, it's a lot harder to modify it later, as it will force you to drop and recreate the entire materialized view; if you only tell Oracle to select from a view and define your query there, it won't care if you change the definition of the view at runtime. A simple dbms_mview.refresh will clear things up and get you going again.)

1.6 Read-only

REVOKE all but SELECT privileges on the table from the users who will need access to it; GRANT privileges INSERT, UPDATE, and DELETE to the various stored procedures and/or triggers that will be maintaining the table. Make sure the procedures and triggers are not editable by the users (as a general rule), but grant them EXECUTE privileges so they can still be used. Triggers in particular will be running within the security and transaction context of the end user as she makes changes to the source tables; although the user won't have permission to modify the materialized view, the triggers will be able to do so for her. I wouldn't recommend giving users REFERENCES permissions to the materialized view, particularly if your refresh method involves lots of deletes and inserts: you would have FOREIGN KEY CONSTRAINTs raising exceptions far too often; DEFERRED constraints, if available, might solve this for you.

Most of this should work with at least PostgreSQL or Interbase/Firebird. MySQL recently gained support for triggers and stored procedures, so I assume all this should be possible there as well.

2. User-aware data

It used to be suggested that to provide users or groups of users (roles) with access to the database without giving them all full access to all the data in every table they could see, views should be created for each user or group, and permission to those views only (not to the source tables) granted to the users. If the database system supported schemas, it would be possible to create multiple views with the same name, one in each schema, tying the users or groups to those schemas such that the default namespace they would search in one find the "table" names they were given access to, so they wouldn't know they were accessing anything special. These views would then refer back to a table in another schema where the data was "safe." This meant creating individual views for everyone, tailoring its query definition to each situation; changes to the base table would cause many views to need to be redefined, one at a time. Consistency would be hard to achieve.

Furthermore, if the views were intended to be updateable (to further hide from the users that they were living a lie) triggers or rewrite rules would be defined on each of those views to perform the correct action in the source table. Again, changes would need to be made everywhere, consistency would be hard to achieve, etc. We have other options.

2.1 With Views

Create a single view that you will expose to users. If you're feeling spunky, give it the "real world" name you would normally choose for the table it hides (e.g. "orders"), and name the table something else (e.g. "x_orders".) In the view definition, you can generally use a global session variable provided by the database system to identify the user you are serving (e.g. "CURRENT_USER"); you may not have access via those variables to see the group(s) the user is a member of, but you can perform that logic in the view, joining over to the database system's security tables (your database system does expose catalog tables, right?) and getting a list of group memberships. Your view now has all the information it needs to hide rows (e.g. ... WHERE ORDERS.DEPARTMENT IN (SELECT DISTINCT GROUPS.DEPARTMENT ... WHERE GROUPS.USERNAME = CURRENT_USER) ...) or hide/modify columns (e.g. SELECT ... CASE WHEN EXISTS (SELECT * ... WHERE GROUPS.USERNAME = CURRENT_USER AND ...) THEN SALARY ELSE NULL END AS SALARY ...) on the fly. Only one view is defined, all users can be given access to that one view, yet its definition can be customized to fit their varying needs. It's a single point of modification if your source tables change, and it's a single point of interception when you want to set up triggers to make the view updateable. (With this complex of a view, it's very unlikely that your database system will be able to automatically make the view updateable, even if it sometimes knows how.)

2.2 Without User Accounts

"Ah," you say, "but all my users access the database through a web site that always logs in as the same user!" Indeed, you probably can't perform a magic trick and sudo from the default username your web application uses over to the username you know you should log in as (as provided to you by the user in your web application's login form, or via mapping tables in the database.) You also can't log in as the user directly from the web application, because your security libraries hide the true password from you; besides, maybe the usernames and passwords they use to log in aren't the same as those defined in the database. Then again, maybe the problem is that you haven't defined user accounts in the database because your users were never supposed to log into it directly; you've got it firewalled off, only the web server can access it, why would they need accounts? Or perhaps you manage millions of online accounts, and you see no reason to define that many database accounts -- it would admittedly get messy.

If your database system supports custom session variables, you can make sure your web application always sets a value for that variable as the first thing it does after it logs in as the default user; all of your views are then keyed to read from that variable when deciding what data to return.Firebird 2 users: look up the documentation on rdb$get_context('USER_SESSION', ...) and rdb$set_context('USER_SESSION', ...); these are equivalent to Oracle's SYS_CONTEXT('USER_ENV', ...) and DBMS_SESSION.SET_CONTEXT('USER_ENV', ...) and more than equivalent to its DBMS_SESSION.SET_IDENTIFIER(...) though this last might be the more maintainable solution.

If you don't have session variables available to you, remember that you do have tables; create a table (defined as TEMPORARY if possible) to hold these session variables, and use as above -- except all of your views will now have to do a little more work to go select the current values out of the table. This might be a multi-column, single-row table, or a multi-row, double-column table, as you prefer. A multi-column table would let you set up your datatypes precisely, but would require modification when your application needs to set up more session variables.

Don't put too many constraints on the table; remember that you may have hundreds of transactions simultaneously inserting data into this table, where none of them can see each other's data; you don't want the database to prevent them from doing so just because of a primary key constraint on the "username" column. Remember to delete all rows from the table at the end of your session, or your COMMIT will make your temporary data visible to the whole world. If you have the option of ON COMMIT DELETE ROWS, this isn't an issue. If you have ON COMMIT triggers, you can use them as well, just so you don't forget a spot in your web application. It won't hurt to delete all rows from the table at the beginning of a session either, just in case some cruft is accidentally left in there; if it happens too often, however, multiple concurrent transactions will attempt to delete the same rows, and one will error out unexpectedly.

2.3 With Table Functions

The view method above may give you some amount of trouble with performance; views are generally precompiled by the server, and it would be hard to compile an efficient execution plan that covers all the various security policies you've put in place. Stored procedures (or functions) that return multiple rows (thus "table functions") can help with this; they are also precompiled when defined, but the EXECUTE IMMEDIATE feature may give you a way around this.

Create a stored function that returns a set of rows whose definition match the view you would have used above, and that takes no parameters; it will use the same method as above to gather information about the user it needs to customize data for; have your security policies concatenate together a single SQL statement based on available data that covers just the rules necessary for the current user, and use EXECUTE IMMEDIATE or its equivalent to prepare and execute the statement on the fly, and return the data to the client. You will have an increased execution cost from replanning the query each time the function is accessed, but as each call will generate a simpler SQL statement that is reoptimized each time, you will likely gain some performance in this area; pick the solution that works the best for you. Be careful to test all combinations of security rules, as you now have multiple plans, each with its own performance issues.

Keep in mind that views are not just functions that return sets of rows; to the query optimizer, they are an opportunity to understand what you are trying to do instead of how, and do what's best for you. A stored function is not similarly optimizable, unless you can return queries as first-order objects (rather than the result set) and let the optimizer integrate it into the overall query. Generally, an optimizer can look at how you're using a view inside a larger statement, merge some conditions together between the inner and outer chunks (e.g. SELECT * FROM MYVIEW WHERE ISSALARIED='T' internally becomes SELECT * FROM ALLEMPLOYEES WHERE ISSALARIED='T' AND MANAGER='BOB', which can be optimized), possibly even re-arrange the whole query to run faster; with a stored function, it really has no option but to treat it as a "black box" and work around it (in the example above, it would select the entire contents of the view, then try to find salaried employees, without using an index.)

Since table functions do not look exactly like views or tables when used in a query, consider wrapping the function in a view and making that view accessible to your users; rather than SELECT * FROM TABLE(MYSTUFF()) (Oracle) or SELECT * FROM MYSTUFF() (Firebird), they would see only SELECT * FROM MYSTUFF. Unfortunately, Firebird is not expected to support views based on stored procedures until release 2.5. See JIRA ticket CORE-886 for details. This also allows you to put INSERT/UPDATE/DELETE triggers on the view (see below) so it can be treated as updateable, which you wouldn't be able to do with a trigger.

2.4 Updateable

By now you should see where we're going with this; your view (possibly built on top of a table function) will include triggers which will give it the appearance of being a table, and those triggers will take the user's identity into account when deciding what to allow, how to perform the requested change, etc. It could decide that the row the user is attempting to update can't possibly have been in the result set the user would be messing with, and either ignore the change or raise an exception. It could decide that since the user only has a limited view of the world, her request should be interpreted narrowly: if she asks that an item's price be increased, and she's a regional store manager who can only see pricing for her region, the change is implied to apply only to her region. You might even find opportunities to allow users to update aggregate columns, because you can guess at what they must mean; if they update the column indicating the last date on which a reply from the defendant can be accepted, you can add a row into that table indicating the change, such that the view will in fact reflect the desired change -- the maximum date will have increased because of the new detail row that the user can't directly see.

All the usual caveats about updateable views still apply here; deleting a row from a joined view can imply many things: that the vendor no longer exists, that the order no longer exists, that the customer no longer exists -- you must pick something based on the most obvious meaning to the user of the view, and make it clear to everyone what inserts, updates, and deletes "mean" to the system. A view is typically "about" one particular entity, so this shouldn't be too difficult, but that's not always the case. You can't hide all details from the user; at some point you're going to have to provide them with some alternate procedure calls they can make (if they still don't have direct access to the underlying tables) to request the specific changes they want, rather than update the view directly.

2.5 Other Security

Just because you let your users see all available data doesn't mean all users can make the same changes to that data. You might let everyone see the organizational chart of your enterprise, yet want only your direct manager to be able to re-assign you to another department. This isn't table-level security, as all managers can touch the table, and it's not column-level security, as the same column would be used by any of those managers. You can't use CHECK CONSTRAINTS to validate the data, as it only checks that the new department is valid, but doesn't look at the transition between departments or the user's identity. For that, you'll want to use triggers, as above, that know who the user is by one means or another and can reject (by raising an exception) the change when it is disallowed by business rules.

Upcoming topics

I intend to talk about at least the following: partitioned tables (via UNIONs, VIEWs, and TRIGGERs), after-commit triggers (via the RAISE EVENT or NOTIFY facilities.) Note that Firebird 2.1 providers BEFORE COMMIT and AFTER COMMIT triggers, among other new trigger types, making this pointless in its case.