0. Index1. Ideas only
3. Sharing logic
3.2 Areas of responsibility
3.3 Database objects
3.4 Finding things
4.1 Procedure privileges
4.3 Managing permissions
6. Views vs. Stored Procedures
8. Changes on the fly
9. Database agnosticism
10. Dependency tracking
14. A call to tooling
Use the tools you're comfortable with. That's the first rule. You're far less likely to make mistakes that way! It may not be ideal, your co-workers and successors may not approve or give you gold medals, but there's a lot to be said for a solution that works. This article should just encourage you to invest time in learning stored procedure languages if you're not already familiar with them, because they have certain advantages.
My goal here was only to give you a taste of these technologies; I may have gone beyond that a bit, yet still fallen short of your expectations. For those looking for a decision tree or matrix to help them decide between similar technologies (procedures, functions, views, ad-hoc sql with or without prepared statements), I have a separate article in the works, currently titled 'Where should SQL live?', which I will attach to this one when it's ready. Some of this is also covered in this article (keep reading,) but briefly, for those in a SQLServer environment (or similar) where the following distinctions makes sense:
|Feature||MS Table-function||MS Table-procedure||IB/FB Table-procedure||View|
|tracks dependencies [a]||Yes||Yes ||Yes ||Yes|
|setup, teardown work [c]||No||Yes||Yes||No|
|side-effects [d]||No||Yes||Yes||Unwise |
|default parameters [e]||Yes||Yes||Yes||No|
|re-optimized query [f]||No||No||No||Yes|
|select from, join to [g]||Yes||No||Yes||Yes|
|variable result-set columns [h]||No||Yes||No||No|
|update-able [i]||No||No||No||Yes |
I use the term "stored procedure" loosely here. I'm really talking about a database/server-side construct, accessible via the database API. In some database products, stored procedures are limited to returning zero, one or several scalar return values, while so-called stored functions can return result-sets (like a select statement.) Rarely, a procedure can even return multiple result-sets simultaneously! Often, you can only use the database's internal programming language (PSQL, T-SQL, P-SQL, pgSQL, etc.), but occasionally you can write the code in a multitude of languages (perl, C#, Java, VB, etc.). Consult your product documentation for details.
In the end, what matters is that any program already connected to and using the database can also use these procedures or functions. I'm also including views here, as simple result-set procedures could just as well be written as views, with other advantages. We'll talk about all of this.
As with procedures/functions in any language, two main objectives are to share the logic across uses so you can fix it in just one place, and break up complex logic into readable (and findable!) chunks. If you find yourself writing SQL queries that share a common sub-clause, you might see if you can turn that chunk into a stored procedure or view; see below on the distinction.
create view active_employees as select * from employees where active = 1;
If your database supports them, you may also use computed by columns to avoid creating and using a whole separate view when all you want is to consistently add a calculated column to a table.
alter table invoice_items add item_total computed by (item_count * item_cost);
You don't have to go all the way, it's okay to migrate the bare minimum to the database. If you find that you need to share logic between triggers (which have their own reasons for being in the database, see separate article) and application code, that logic should probably also be migrated to the database.
Also consider that you, as a DBA, are one of the direct users of this logic. When an urgent request comes in to run a wacky one-time report, why should you have to repeat the similar logic you wrote for those old reports? You could have built a view to do most of the heavy-lifting for you, both for the standard reports and these ad-hoc reports. When the FTP site or ETL software blow up one morning and you need to — pronto! — load a shipping file into the database by hand, you'll be thanking yourself that all the complex logic involved in updating inventory numbers is already in stored procedures and triggers, so you know you're not reinventing the wheel, incorrectly, in the heat of the moment. You don't want to discover that after waking up early to a panicked call from your boss, and spending hours massaging data while the system administrators try to figure out what went wrong with the machinery, you've now also messed up the production database and have a new disaster to clean up before customers start complaining. This can really save your bacon.
3.2 Areas of responsibility
If you have different teams working with a shared database, but there's sufficient trust that the issue isn't one of security, you may still want the team with the most experience with certain areas of the database to write server-side logic for the other team to consume, simply to make their lives easier without getting into the nitty-gritty of what exactly they're doing.
Some companies have strictly segregated this task to dedicated database-administration / database-programming teams, who are responsible for learning the details of what every end-user needs and implementing just enough to get them by; again, you don't have to go all the way, you can simply have teams write utility procedures, generic views, or other "half-way" solutions to be used by their coworkers. This is generally sufficient to avoid those frustrating phone calls where the experts try to explain, in detail, to someone else how to write a complex query, when they could at least provide a procedure or view that gets the other programmer back on familiar ground.
3.3 Database objects
But why as server-side objects? Surely there are other ways to share code: code libraries that either perform individual queries, or even return chunks of SQL to embed in other queries, webservices, etc. Sadly, code libraries are still difficult to share across programming languages and environments: the C API is fairly robust in that regard, but still inconvenient; JNI and similar technologies are a pain to setup and use.
The database, however, is already there, and you're already accessing it, and with small variations, the SQL specification has made simple operations like calling procedures or selecting from views standard across database systems. I'll discuss the problems with webservices in a bit.
Another way of looking at it is the concept of object-oriented programming: the data and the code used to modify it live together. Just as you would now look down on the use of bare struct's in your C++ code, you can begin to look down on the use of databases as mere "storage." Database systems have offered "active" features (procedures, views, triggers) for decades and it's about time programmers started to take advantage of them!
3.4 Finding things
I mentioned findability above; all too often, shared utility libraries are cluttered with similar functions, thrown in by well-meaning programmers who failed to notice their coworkers' previous good deeds; when it comes time to use one of those functions, an inability to find or recognize existing code will lead to further duplication. Whatever you do, try to come up with a good naming convention, document the shared logic, and train your programmers to search before they build. For someone whose job is to write code, this can be amazingly difficult to remember.
You may be used to the concept of public, protected, or private functions, but in databases, you get a little better security than that. Access to use stored procedures can be granted on a per-user or per-role basis.
4.1 Procedure privileges
Furthermore, you can limit what the stored procedure itself can do. At first, that may seem pointless, if you think stored procedures can only access clearly-defined objects in clearly-defined ways. But you may have a stored procedure capable of doing quite a bit (when asked) but depending on the user calling it, you want to limit what it will actually do. In those cases, you may be able to assign the varying privileges at the user level and let the procedure inherit from the user's permissions, or you could create proxy procedures for different users, and assign permissions there. Remember that if you give the user permissions to the table directly, they could make an end-run around the procedures you setup for them.
Another problem is created by execute statement: a stored procedure may actually be used as a tiny interpreter, running arbitrary code. You can't statically analyze the procedure to determine what it will want to do at runtime, and assign permissions on that basis: you must assign it explicit permissions.
create procedure lock_document (tablename varchar(50), rowid integer) as begin execute statement 'update ' || :tablename || ' set locked_by = ' || current_user || ', locked_at = current_timestamp where id = ' || rowid; end grant select, update on invoices to julie; grant execute on procedure lock_document to julie;
This also allows a database admin to delegate limited rights to another user to modify his own procedures as needed, within a sandbox. He doesn't have to use all the privileges his procedure is initially given by the administrator.
4.3 Managing permissions
If your procedures are split up by purpose, they provide another security tool: if you wanted to remove access from a given user to do withdrawals without taking away their ability to do deposits, both of which look very similar at the row-update level, you would find that table-level privileges are insufficiently granular. You could write a trigger on the base accounting tables to detect withdrawals, but it would be messy, and you would have to look at the source code of the trigger to determine which users are allowed to do what. It's far easier to simply take away their ability to call the account_withdrawal procedure, knowing they have no other way to perform that same operation.
create procedure account_withdrawal (account_id integer, amount decimal(10,2)) as begin if (amount < 0) then exception e_negative_withdrawal; insert into account_transactions (account_id, amount) values (:account_id, -1.00 * :amount); end create procedure account_deposit (account_id integer, amount decimal(10,2)) as begin if (amount < 0) then exception e_negative_deposit; insert into account_transactions (account_id, amount) values (:account_id, :amount); end grant select, insert on account_transactions to procedure account_withdrawal; grant select, insert on account_transactions to procedure account_deposit; grant execute on procedure account_deposit to teller1; grant execute on procedure account_deposit to teller2; grant execute on procedure account_withdrawal to teller1;
Some shops have a "stored procedures only" policy for security, under the assumption they are inherently more secure. One big fear is "SQL injection"; in fact, stored procedures can be just as vulnerable to this as in-application SQL strings, when the particulars of the procedure call for execute statement and dynamic SQL strings. Even client applications can be rather safe, if they consistently used prepared statements, passing arguments by name or position, rather than building SQL strings with values (supplied by the user) embedded and hopefully properly escaped. Stored procedures can be rather rigid in their parameter types, leading to many dangerous uses of execute statement; see "rigidity" below.
If you find yourself writing client-side code that makes repeated calls to the database in a tight loop or recursion, you should consider other options. Some databases offer recursive statements (e.g. CTE's, common table expressions) that could boost performance by making a single request to the database and returning all your results together. Those may not always work out, and stored procedures give you another option.
The same is true if you were generally discarding data from intermediary steps: why send that data to the client at all, when it could be thrown away, cheaply, on the server? When allowed to return result sets, they are more flexible than views in how, exactly, they go about gathering their data. As long as those various repeated queries were all returning very similar result-sets, a single stored procedure that outputs a much longer (and potentially wider, to allow for the small variation between row types you might have been using) result-set will outperform your repeated across-network queries.
If the stored procedure is allowed to return multiple result-sets, this could give you the extra bit of variation you needed to replace all those queries; I would suggest that these procedures could also prepare and return all the different result-sets you need for a complex report, without recalculations — but that's not unique to multi-set stored procedures. You could just as easily use temporary tables, populate them with a few procedures or queries, then run your later queries on those, saving some time.
Views and procedures can both return result-sets. Which one should you pick? Views will be optimized on the fly by the database; if you apply two different "where" clauses to a given view, the database system will try to find the most efficient way to run each one. With procedures, "where" clauses applied to the result are pretty much always handled the same way: ask the procedure to return its entire result-set, then filter that result before sending it back to the client. The cost is obvious.
Similarly, views can optimize sort-order requests, and even some joins. Again, procedures will run without regard to how they are being used, and leave it to the database system to deal with the result. While you can join to the result of a procedure, you must in general give the procedure all the parameters it needs up-front; if the procedure takes an account number, you won't be able to join between a list of daily transactions and the result of the procedure, as one side of the join has varying account numbers, and the other can only handle one at a time; even unions wouldn't really help you out here.
You may want to give users access to certain data, but never allow them to download the entire possible dataset, perhaps because it's too large, and they should never have a use for the whole thing at a time. In those cases, you could provide them with a stored procedure that takes as one or more parameters the filter(s) to apply to the source data, with the requirement that at least some sort of filtering (e.g. year/month combination, department, account number) must be performed. In the end though, your requirements may leave you no choice, as the logic you need to write may be possible only in a procedural language.
Views have a particular advantage over procedures: you can control both what the user sees, and what happens when they manipulate the data you gave them, using view triggers (sometimes referred to as instead of triggers.) Even in the rare cases where a stored procedure can return updateable results (as a cursor), the only code that will intercept those updates, and can do fancy things with them, will be a trigger defined on the (one) base table; the procedure that generated the data is now out of the picture.
create view account_totals as select account_id, sum(amount) as account_balance from account_transactions group by account_id; create trigger balance_transactions on account_totals before update as begin if (new.account_balance != old.account_balance) then insert into account_transactions (account_id, amount) values (new.account_id, new.account_balance - old.account_balance); end
create view active_employees as select * from employees where active = 1; create trigger deactivate_employees for active_employees before delete as begin update employees set active = 0 where id = new.id; end
create view couples as select a.name, b.name, c.married_on from marriages c inner join people a on a.id = c.a_id inner join people b on b.id = c.b_id where c.divorced_on is null; create trigger divorce_couples for couples before delete as begin update marriages set divorced_on = current_date where a_id = new.a_id and b_id = new.b_id; end
This is more the result of my frustration with slow application-server deployment startup times than anything else (thank you, WebLogic) but it's amazingly faster to fix a stored procedure in production than to redeploy the entire application to fix one little SQL string.
You should of course still talk to your QA department before making even little changes like these, they shouldn't be used as a back-door to avoid discussion and approval.
When you have multiple clients accessing the database, deployments are a much bigger problem: you can't make use of your failover / clustering strategy when you need all of the application servers to get the change at the same time; for two-tier (native) apps, do you really want to kick all the users out of their sessions and force them to download and install the latest version of the app, because you don't know who may or may not need to be using the latest SQL strings? This can come up whether fixing a logic bug, fixing performance issues, or refactoring the database.
In all cases, you really want the change to apply everywhere, all at once. Server-side objects help you with that.
Without getting into the reasons why this happens, let's assume you need to write your application to run against a range of database products. You have a couple of options: treat the database as a pure data storage device, putting all of your logic in the client, and either sticking to very simple SQL statements that work anywhere, putting variations of the same statements in big conditional blocks, or trusting a database-abstraction layer to rewrite your queries for you; or you could use server-side objects to achieve your goals, assuming all of your target database products support them.
Simple SQL has obvious problems: varying performance on different products, inability to use advanced features when available, even trouble with advanced joins and complex queries requiring you to instead download massive amounts of data the client and perform post-processing there.
Conditionals alleviate these problems, but make the application code a lot harder to follow; I'd complain about the problem of keeping the statements synchronized, but that would defeat my pro-procedure argument here in a second.
Database-abstraction layers have their own issues: they tend to be lacking the features you need, they require you to express what you want to do in simplified terms that don't allow them to do the most effective thing in a given situation, they don't know how to use the database's advanced features, they write some queries so poorly the database can't even figure out how to optimize them, and updating them may enhance some statements while breaking others (hard to test), and if you ever need to outright override their decisions, you're back to conditional statements and the problems therein.
With server-side objects, you can simplify your application code; you always either select from a view or procedure (named consistently across database products), you perform simple updates against views or tables (and let triggers do the complicated parts), or you call procedures to make your changes, again simplifying your application code. For each database product, you have a separate installation script to create the appropriate server-side objects; you probably already had such a script anyway for table, view, and sequence creation, now you just include procedures, triggers, and trigger-updatable views as well. As you discover bugs or possible performance enhancements, you can release those as updates to those specific objects, without touching the application itself.
Now, I'll admit, that creates a new problem: keeping the tables, views, procedures, and triggers synchronized across database products when you make a change. I won't say it's easy. It's the same problem faced by internationalization teams, keeping translations of GUI text and documentation synchronized, clear and appropriate in each targeted culture. Maybe technical writers can suggest some good solutions. And if there aren't any, maybe we should build some. I can't tell you how often I've been annoyed that language variants of Wikipedia articles aren't actively kept synchronized.
You don't have to worry about updating the abstraction layer and breaking some statements; you don't have to worry that you can't see exactly what the abstraction layer is doing, because all of your statements are already written out explicitly on the server, where you can tweak them to your content. And again, it comes down to an object-oriented approach: delegate the task via an interface (procedure names) and let the implementations vary.
Some modern IDE's come with "refactoring" tools. They let you search the codebase for uses of a given function, globally rename functions and variables, and so forth. Database systems provide system tables in which you can look up this kind of dependency information. Some database systems will also enforce those dependencies, either preventing you from making a damaging change, or automatically disabling the affected database objects. This can be terribly convenient when you need to perform database refactoring: moving a column from a master table to a detail table, splitting fields up, etc.
SQL strings inside client applications are notoriously difficult to search for references to a given database object, particularly table field names. If you search by tablename, you'll get too many results. If you search by tablename.fieldname, you'll miss most of your uses because you either didn't qualify the field name, or you aliased the table name. And then there's the ubiquitous use of select *. The database is not confused or misdirected by such tricks. It will know exactly which views, procedures, triggers, and constraints make use of a given field, table, sequence, or other database object.
You can use this to plan your outages during a refactoring, discover what teams need to be asked to update their own code, or even hunt down bugs (what piece of code might be updating this field to the wrong value?) This only works if you put the code in the database, where it can be analyzed and tracked. While even partial dependency information can be useful, this is more of an all-or-nothing sort of thing if you want to actually rely on it. That would mean moving even your simple select statements into procedures, or at least doing nothing more than selecting straight from a view, where the view can be assumed to accurately represent your needs and usage.
If you want to get really fancy, make sure you create a new view/procedure every time you change its meaning or API (but not each time you fix a bug in it) so different versions of an app, or different apps, can continue to use that view as long as they need to, until their next deployment. The task then becomes tracking who's using what, but at least you've started down the path of not dropping a supposedly-unused column from a view, breaking an application you'd forgotten about. Instead you'd create a new view, change your code to use it, and online drop the old view when you've done your research and verified it's no longer being used.
In theory, there are specs for transactions in webservices. In practice, I never see it done. A webservice request is treated like any other web request — statelessly: each one stands on its own.
As long as there's a webservice to do exactly what you want, and everything you want, all at once, you're okay. But that's hardly always the case. You'll find yourself wanting to perform two operations together, atomically; you'll want to perform a thousand operations similar operations, without the network overhead and lag of separate calls; you'll want to do a little scripting, looking up a value from one webservice to use as a parameter to another; you'll want to do a bit more scripting, conditionally calling a webservice after a previous one completes (again: transactionally, without overhead, etc.)
Maybe someday we'll get there. But in the mean time, if these webse || :tablename || rvices are just fronts to stored procedures, there's a better way: directly call the stored procedures, using the transaction services already provided by the database. You can upload your requests into a temporary table as fast as uploads can go, then run a single procedure to perform all the updates. You can embed sub-queries in your procedure calls, looking up id values based on names and passing them on to the procedure, without extra network overhead, and without the risk of transaction isolation oddness. (With two completely separate requests, it's entirely possible for your first request to find a row, but when passing it to the second request, discover that the row already no longer exists.)
And at least with Firebird, you can send an execute block statement to the server, effectively running an anonymous, temporary stored procedure, again without the network overhead of bringing results back to the client, processing them, and making more requests.
execute block as declare variable employee_id integer; begin for select id from active_employees into :employee_id do execute procedure send_employee_email(:employee_id, 'Happy New Year!'); end
There's still a ways to go: most database products do not interact with each other well for distributed transactions (e.g. two-phase commit.) But I do believe that in the future, any "server" will have to host a scripting language of some sort to make better use of the functionality it exposes. This is just a step in that direction.
Server-side objects aren't all unicorn tears and silver bullets. The tooling doesn't seem to be there yet to make it easy to look at your application code, find that a stored procedure is called, control-click and get a definition of that stored procedure (or view, etc.) from the server.
Database systems may return a stack trace when an error is encountered inside a procedure, but that trace isn't well-integrated into your application's stack trace indicating where the procedure was called.
You don't get auto-complete to tell you what parameters a stored procedure accepts, what its default values are, what its return type is.
Furthermore, server-side code is difficult to put under proper version control, and difficult to deploy: it's not enough to have a good list of procedures that need to go out, you need to know what order to create/alter them in so your dependency tracker doesn't balk. It's hard to tell what "version" of your server-side code you're currently running, difficult to quickly switch to another version to back-test a change.
Metadata changes are made to an online database; you don't simply disconnect everyone, shut everything down, replace a file, and start back up; you have to make changes in just the right order, on the fly. Your clustered application servers may give you the ability to do rolling blackouts during deployment, but clustered database servers don't serve the same purpose; you'd have to switch to "shards" for that, and deal with the attendant data synchronization issues between independent databases.
If your shop uses an expensive database product, you may find your developers sharing a single instance of the database, yet needing to develop along different branches for different customers — one member's metadata changes unfortunately impact everyone else.
There are some things you shouldn't expect out of stored procedures. Variable output columns, for one: they're not ideal for generating a pivot table based on arbitrary input.
They also tend to have limited parameter types: arrays, complex objects, entire tables can't reliably be used as parameters. The lack of array parameters is what I was alluding to earlier about execute statement; if you need a procedure to take a list of allowed department id's and filter to just those (traditionally a job for department_id in (1,2,4,7)), you'll pull your hair out finding a way to pass that list without risking an SQL injection attack. You don't really want to add CSV parsing, argument escaping, etc. logic to every complex procedure. In those situations, I do recommend you investigate the use of a separate (temporary) table to pass parameters: insert first, then call the procedure, and have the procedure clean up its parameter table when it's done.
create global temporary table search_departments ( department_id integer ); create view employees_in_departments as select employees.* from employees inner join search_departments on search_departments.department_id = employees.department_id; insert into search_departments (department_id) values (1); ... insert into search_departments (department_id) values (7); select * from employees_in_departments;
There are advantages to this round-about approach; for example, if you need to do a join between the passed list of items and other tables in the database, the server can actually help you out! You can also use the parameter table as both input and output, simply updating the rows inside the procedure, thus avoiding the complexity of matching up your outputs with your inputs.
I hope I've inspired you to look into stored procedures and views before your next database application. Maybe I've even encouraged you to look into them for your current project, to fill a void you've been feeling but unable to identify — a vague unease about security, a dissatisfaction with performance, a constant, irritating drip of requests to write SQL for a team that's supposed to be self-sufficient. If so, great! And good luck, you'll need it. Like any other technology worth picking up, you'll need time and dedication to really learn the material. But it's worth it. I hope you'll agree.