Symbols [Scratchbook] (Article)

Description :: Variables, constants, functions, contexts, lambdas

Generally, we classify symbols (not tokens) into categories such as "variable", "function", or "procedure". A variable, such as X, holds a representation of a value (in a few languages it might hold nothing at all, but that's rare). As to whether or not X has a type (domain), languages are fairly evenly split: some do not remember the type at all, some keep a type with the representation but let you assign the variable a representation of some completely different type, and some force you to assign the variable only representations matching a certain type constraint.

Functions and procedures are only sometimes seen as distinct. Both take zero or more parameters, and return zero or one values. Functions are generally seen as having a return value that is entirely based on what is passed to the function, with no side-effects. A side-effect is the modification of anything outside the function by the function itself: changing the value of the parameters (passed by reference) or changing a global variable (that is, one not local to the function). Scope (local and global) is discussed elsewhere, and pass-by-reference isn't important here. The expression "f(x) = x*x" is a mathematical function, and a function by our definition as well.

Procedures are generally allowed to do these things: they can be defined not to return a value, they can modify their parameters, they can have side-effects, they can do things not based purely on what's passed to them. A procedure named SystemTime() may give you the clock time, but you didn't pass it a time as a parameter. A procedure named GetNextNumber(X) may both modify X and give you back its value after modification: this is commonly used with counter-variables, sometimes called sequences or generators or auto-numbers. Repeated use may be handy for generating employee numbers, serial numbers on items leaving a factory, etc. A procedure named FixAccounting() may do some work without returning any information at all.
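As a rough illustration of the distinction, here is a minimal Python sketch. The names mirror the examples in the text (SystemTime, GetNextNumber, FixAccounting), but the bodies, the Counter class standing in for a by-reference parameter, and the ledger list are all assumptions made up for illustration.

```python
import time


def square(x):
    """A function in the strict sense: the result depends only on x."""
    return x * x


def system_time():
    """A procedure: returns a value that was never passed in."""
    return time.time()


class Counter:
    """Hypothetical mutable box, standing in for a by-reference parameter."""

    def __init__(self, value=0):
        self.value = value


def get_next_number(counter):
    """A procedure: modifies its argument and returns the new value."""
    counter.value += 1
    return counter.value


ledger = []  # hypothetical global state, invented for this sketch


def fix_accounting():
    """A procedure with a side-effect and no return value."""
    ledger.append("fixed")


# Repeated use hands out consecutive numbers, e.g. employee numbers.
employee_numbers = Counter()
first = get_next_number(employee_numbers)
second = get_next_number(employee_numbers)
```

Calling square(3) twice always gives the same answer; calling get_next_number twice on the same counter does not, which is exactly the difference described above.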

What these have in common (apart from the rare case of procedures not returning anything) is that they are symbols, to which you pass zero or more parameters, and get a (representation of a) value back. The variable X can also be seen as the procedure X(), to which you pass nothing, and which gives you a value. A function that always returns "3", called Constant(), might be seen as a non-modifiable variable (a constant) called Constant.
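The idea that a variable is just a procedure taking no parameters can be sketched in Python as follows. The x_store dictionary and the read helper are hypothetical; only X and Constant come from the text.

```python
x_store = {"X": 42}  # hypothetical storage behind the variable X


def X():
    """The variable X viewed as a procedure taking no parameters."""
    return x_store["X"]


def Constant():
    """A function that always returns 3: effectively a constant."""
    return 3


def read(symbol):
    """Read any symbol the same way: call it with no parameters."""
    return symbol()


values = [read(X), read(Constant)]
```

From the caller's side there is no difference between reading the variable and reading the constant; only X can later yield a different value.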

We are accustomed to variables being updatable and functions being read-only. You can say "b = a" or "a = 5", but while "b = x()" is common, "x() = 5" is not. (Nor is "x(2) = 4" which might actually change the definition of the function itself, oddly.) It occurs to me that updatable symbols could provide for easy implementation of properties (as they are called in Borland C++ Builder), or getter/setter functions (particularly in the Java world). CurrentTime() might be a function, but that doesn't preclude it from being updated. You could allow for "CurrentTime() = '13:24:06.004'", which would internally set the system clock, or some offset used by the database.
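Python cannot express "CurrentTime() = ..." literally, but its properties come close to the updatable symbol described here. The sketch below is an assumption-laden illustration: the Clock class is hypothetical, the date is invented, and assignment only stores an offset rather than touching the real system clock.

```python
from datetime import datetime, timedelta


class Clock:
    """Hypothetical clock whose current_time symbol can be read and assigned."""

    def __init__(self):
        self._offset = timedelta(0)

    @property
    def current_time(self):
        # Reading behaves like calling CurrentTime().
        return datetime.now() + self._offset

    @current_time.setter
    def current_time(self, value):
        # Assigning stores an offset instead of changing the system clock.
        self._offset = value - datetime.now()


clock = Clock()
# 13:24:06.004 on an arbitrary, made-up date.
clock.current_time = datetime(2000, 1, 1, 13, 24, 6, 4000)
reading = clock.current_time
```

The caller sees a single symbol that can appear on either side of an assignment, while the getter and setter stay hidden behind it.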

Side-notes: you can stop reading here
In an aside, Chris Date notes (insightfully) that databases can be seen as tuples. By this he seems to mean that a database (as a type) is a tuple (header) of named attributes, each with a type. (And as a variable, it's a tuple variable with one value for each attribute.) This would correspond to the system catalog of symbols inside the database, the database itself not being represented inside itself (which is fine). As a tuple, the database would have one (named, typed) attribute for each variable (relation or not). It seems like all types of symbols, including classical variables and functions, would be listed here (one namespace). It's a thought to discuss elsewhere, perhaps in an article on namespaces. (Also included in this discussion would be something about passing by-copy or by-reference, particularly into unnamed functions. Passing functions to other functions, where some of the input parameters are maybe pre-provided, seems even more interesting. Lots of stuff there to look at, though it's likely just a language issue, not a database issue.)
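One possible reading of the database-as-tuple idea, sketched in Python. Everything here is an assumption for illustration: the attribute names are invented, a relation is crudely approximated by a frozenset of tuples, and conforms is a hypothetical helper.

```python
# Hypothetical header: one (name, type) attribute per database variable.
header = {"employees": frozenset, "next_employee_number": int}

# The database value: one value for each attribute in the header.
database = {
    "employees": frozenset({("Alice", 1), ("Bob", 2)}),
    "next_employee_number": 3,
}


def conforms(value, header):
    """True when value has exactly the header's attributes, each of the declared type."""
    return value.keys() == header.keys() and all(
        isinstance(value[name], header[name]) for name in header
    )
```

Here the header plays the role of the system catalog: it lists every symbol in the single namespace together with its type, and a database value is valid only if it supplies one value per listed symbol.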

While I'm leaving notes to myself, it should be noted (to myself, at least) that most of what I say about a database applies in a programming language as well. Programming languages have statements, data in variables, types, etc. Without transactions, a database is pretty much equivalent to any programming environment. The idea that I'm trying to define a good database server doesn't prevent me from thinking that I could design a "client" language as well, without persistence. In the end, it's all the same. Programs that talk to database servers are really just programs extending their namespace around a database, creating a larger one that is their own + the database (shared between several programs.) You get a local namespace visible only to yourself, and a shared one visible to all client programs. All this to say that a database, and its functions, domains, and other definitions, is equivalent to a program, and vice-versa. It's something to explore more clearly. (Current wording is atrocious, I know.)

Again, while I'm thinking of it, I mean to write a separate article that connects the dots concerning the relationship(s) between databases, programming languages, and file systems. Today I came across a whitepaper about a distributed/clustered filesystem, and realized that a lot of what was said there applied equally well to the internet. They had a metadata server (central namespace authority) much like a DNS server, but without the hierarchy our DNS servers use. (DNS servers in turn are a lot like the yellow pages, taking a name and giving you a phone number.) The metadata server redirected you to the proper physical storage device for the information you requested, which in turn could have some sort of load-balancing system (larger websites often have dozens of identical servers on a shelf, and you never know which one will answer; it's based entirely on how busy each one is), or redundancy, etc. The logical/namespace issues were handled on one side of the equation, while the physical/storage/transfer issues were handled elsewhere. A similar case of "disengagement" between services for the sake of all sorts of things (cleanliness, redundancy, speed, maintenance, vendor-independence) appears in the database world with separate transaction managers (TM) which can talk to several database servers to coordinate work. We could also have database namespace managers which redirect your queries to the appropriate resource managers (RM) (re-using the term applied in the discussion of transaction managers) so as to minimize the load on each individual resource manager (make each one responsible for, say, a particular variable, symbol, table, group of tables, etc.). Each of those could in turn really be a dispatcher for several less-abstract resource managers, to lighten the load even more.
This is particularly easily applied in cases of read-only transactions, where different servers can keep entirely separate copies of the database with no ill effect, but is less fun when you need to do updates -- someone has to make sure you don't mess anything up, make sure everyone gets the memo, etc.
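A minimal sketch of the namespace-manager idea for the easy, read-only case. The class names, the register method, and the sample data are all hypothetical; updates, redundancy, and load balancing are deliberately left out.

```python
class ResourceManager:
    """Hypothetical resource manager owning some named variables."""

    def __init__(self, name, data):
        self.name = name
        self.data = dict(data)

    def read(self, symbol):
        return self.data[symbol]


class NamespaceManager:
    """Maps each symbol to the resource manager responsible for it."""

    def __init__(self):
        self.routes = {}

    def register(self, manager):
        # Record which manager is responsible for each of its symbols.
        for symbol in manager.data:
            self.routes[symbol] = manager

    def read(self, symbol):
        # Logical lookup happens here; the physical read happens elsewhere.
        return self.routes[symbol].read(symbol)


rm_a = ResourceManager("rm_a", {"employees": ["Alice", "Bob"]})
rm_b = ResourceManager("rm_b", {"invoices": [101, 102]})
names = NamespaceManager()
names.register(rm_a)
names.register(rm_b)
```

A client only ever talks to names; which resource manager actually answers is a physical detail it never sees. Since NamespaceManager exposes the same read method as a resource manager, one could in principle stand in for the other, giving the nested dispatchers mentioned above.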

Really, what I'm trying to say is that everything is the same, particularly in this field (where you don't have to get down to the level of atoms or quarks or strings to say that two things are similar.) We re-use the same techniques all across our discipline when it comes to speeding things up (distributing the load, minimizing the distance between client and server, caching results, etc.). I think a lot of us re-invent the wheel in these situations, because we fail to see the similarities, we fail to learn the lessons of the previous generation.

And just because IBM runs television ads that say you can just "add another server" when you need one doesn't make it true -- there are logical implications to adding a new server, particularly if you only have one at the moment. The issue of cardinality is the same here as when you decide you want to enter two pieces of information into a database designed to hold only one. Generally, if you plan for "more than one", you wind up planning for any number, ever. (Sadly, a lot of people think that providing five fields named "thing1", "thing2", ..., "thing5" will be sufficient. You should avoid such people at all costs. Besides, I wouldn't mind the contracting work.)

I've been keeping written notes while I travel recently; from reading a Java book (I've been sufficiently mocked by my peers, I assure you), I have the following concerning namespaces and inheritance:
- logical interface and the physical implementation behind it should not be directly tied together: it makes sense to pull logical interfaces from a variety of sources and pull the physical implementations from a subset thereof, but possibly leaving some things undefined in "child" classes, mix'n'matching them, or otherwise playing with things ... both C++ and Java are overly interested in how you implement things when they really should give you the freedom to do it your own way; only the logical interface matters anyway, right?
- access to symbols from other symbols should be more clearly defined; C++ uses public/private/protected, friend, and inheritance to determine access, Java adds packages (its form of namespaces) to the fray. What we need is a way to say "this is an authoritative list of symbols permitted to do these things to this other symbol" and a way of specifying whether or not this is an open set so it can be extended later as people get "bright" ideas. And it needs to be cleaner to look at, darn it.
- scope and namespaces should be more clearly (by which I mean "explicitly") defined, as separate issues from the above. What you can see is not the same as what you can do, which in turn is not the same as the what and how of classes, and so forth.
I was tired; I'm not sure it made sense then, I'm not sure it does now -- but I have a strong sense people mix namespaces too closely with scope and interfaces, and inheritance of logical things with inheritance of physical things ... and that's bad.
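For what it's worth, the first point (logical interface kept apart from physical implementation) can be sketched with Python's typing.Protocol, which matches classes by shape rather than by inheritance. The names NumberSource, InMemorySource, FixedSource, and take_two are hypothetical, and this is only one way to read the note.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class NumberSource(Protocol):
    """Logical interface only: says what can be asked, not how it is done."""

    def next_number(self) -> int: ...


class InMemorySource:
    """One physical implementation; it does not inherit from NumberSource."""

    def __init__(self):
        self._last = 0

    def next_number(self) -> int:
        self._last += 1
        return self._last


class FixedSource:
    """A second, unrelated implementation of the same logical interface."""

    def next_number(self) -> int:
        return 7


def take_two(source: NumberSource):
    """Depends only on the logical interface, never on a concrete class."""
    return [source.next_number(), source.next_number()]
```

Neither implementation inherits anything physical from the interface or from the other; take_two accepts both because only the logical interface matters to it.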

Continued at top

Extras