Michael Zuskin

Elegant Code - Ocean of Databases



Only surrogate primary keys!

Each table must have a primary key consisting of one field which has one and only one meaning - to uniquely identify records (with no other real-life context). The only exception: tables whose task is establishing a many-to-many relation - their PKs consist of 2 fields pointing to the related tables.


Obey this rule even if you are sure that the values which could be used as a PK are unique and required by the business! Such keys are named "natural keys". ALL PRIMARY KEYS MUST SIMPLY BE NUMERATORS (1, 2, 3 ETC.) OR MNEMONIC CODES (ONLY IN CATALOG TABLES) HAVING NO OTHER CONTEXT BESIDES BEING RECORDS' UNIQUE IDENTIFIERS! Such keys are named "surrogate (synthetic) keys". For example, you create an information system for a company where each worker already has an employee number - don't use that number as the employees table's primary key! Instead, create another one, like emp_id. In addition, you can create a field to keep the existing employee numbers and "hang" on it the NOT NULL and UNIQUE constraints (as well as on other "real-life" fields like Social Insurance Number) - why not? Database theory calls such fields "candidate keys" - they are candidates to be elected as primary keys, but don't allow them to win the elections! Many data architects don't realize which problems developers can encounter when natural keys are used! The small problem: when a table has a combination of multiple fields (I have seen 5-8!!!) as its primary key, developers are forced to:

But there is also the big problem that can appear when an information system needs to be changed as a result of changes in business requirements. Imagine: one day the client stops managing employee numbers (another example: duplicate ID card numbers have been found in at least one country, as far as I know). You will have to change the functionality (relationships between tables as well as database and GUI application objects), spending time to re-build and re-test the system. But if you have used surrogate PKs, you don't need to perform such "black" work: you are OK because the primary key's meaning has not changed - it was a primitive numerator yesterday, and it will stay one for years through any business changes! I wonder why developers' forums still argue about which primary keys - surrogate or natural - should be used. We always have to think about what will happen in the worst case, and I have described above what will happen if you use a natural key - a lot of headache and more chances of bugs. And what will happen in the worst case if surrogate keys are used? The table will have one extra field. Not a big trouble - we don't have the goal of saving the maximum of disk space: these days we can leave more money in a restaurant than a hard disk costs.
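A sketch of such a table (T-SQL-style DDL; the table, field and constraint names here are illustrative, not taken from a real system):

```sql
CREATE TABLE employee (
     emp_id          integer      not null  -- surrogate PK: a pure numerator, no business meaning
    ,employee_number varchar(20)  not null  -- natural "candidate key" coming from the business
    ,sin             varchar(11)      null  -- Social Insurance Number: another real-life candidate
    ,first_name      varchar(50)  not null
    ,last_name       varchar(50)  not null
    ,CONSTRAINT pk_employee         PRIMARY KEY (emp_id)
    ,CONSTRAINT uq_employee_number  UNIQUE (employee_number)
);
```

If the business later stops managing employee numbers, only the UNIQUE constraint has to change; the primary key and all foreign keys pointing to it stay intact.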

SELECT statements

Divide one complicated SQL query into a few sub-queries if it doesn't impact performance.


Caution! This tip must be used carefully. It really makes the code more elegant, but you should make sure it doesn't decrease the query execution speed. Especially avoid producing correlated sub-queries - ones where a value of a sub-query field is compared to a value from the outer query (in this situation the sub-query is executed once per each row of the main query!). So, always check execution plans!


The "divide and conquer" method works well in SQL queries too. If you can take a part of a query into a sub-query - do that (unless it can impair performance)! From the readability viewpoint, it's always better to write a number of simple SELECT statements than to throw all the stuff into one unreadable heap of tables, fields, operators and arguments. SQL seems to be an easy language - it's enough to learn 15-20 keywords to say "I know it!" - but, in fact, it is a most difficult one. Don't worry about indenting when you are in a SQL editor - that problem exists more in languages with "command-after-command" flow, not in fourth-generation SQL - it has another problem: sometimes you see a heavy query and want to change your profession... So, why not have a number of simple ones instead? Let's compare the two approaches in a simple query reporting the country name by a city appearing in an address:


*** BAD code (the "all in one heap" approach): ***

 SELECT @country_name = country.country_name
   FROM country,
         city,
         address
  WHERE country.country_id = city.country_id
    AND city.city_id = address.city_id
    AND address.address_id = @address_id;

*** GOOD code (the "divide and conquer" approach): ***

 SELECT @country_name = country_name
   FROM country
  WHERE country_id =

        (SELECT country_id
           FROM city
          WHERE city_id =

                (SELECT  city_id
                   FROM address
                  WHERE address_id = @address_id))

In contrast to the first variant, when you look at the second one, you know definitely where to begin the investigation - in the deepest query (on the other hand, if you get an hourly salary, maybe the "heap" method is better for you...)! OK, it returns address_id, and that converts the next-level query into something like "SELECT city_id FROM address WHERE address_id = 12345". And so on, and so on... As you see, this example is very easy, and breaking it down into sub-queries doesn't really make us happier, but what do you say about our real everyday SQLs? Unfortunately, sometimes the first method is the only one possible... There is also a third method - SELECT each time INTO a variable and pass that variable to the subsequent query as a retrieval argument. It is the best practice from the viewpoint of readability:


*** VERY GOOD code (the "divide into simple sub-tasks" approach + using variables): ***

 SELECT @city_id = city_id
   FROM address
  WHERE address_id = @address_id

 SELECT @country_id = country_id
   FROM city
  WHERE city_id = @city_id -- select by previously populated variable

 SELECT @country_name = country_name
   FROM country
  WHERE country_id = @country_id -- select by previously populated variable

The advantages of this approach:

But it also has small disadvantages:

But moving the SQLs into a stored procedure and calling it from the client, or using it as the object's data source, can easily solve both problems. In fact, that should be the way to create client/server and n-tier applications; I will write more about that below.
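For example, the three SELECTs above could be wrapped like this (a T-SQL sketch; the procedure name is mine, and error handling is omitted for brevity):

```sql
CREATE PROCEDURE get_country_name_by_address
     @address_id   int
    ,@country_name varchar(100) OUTPUT
AS
BEGIN
   DECLARE @city_id    int;
   DECLARE @country_id int;

   SELECT @city_id = city_id
     FROM address
    WHERE address_id = @address_id;

   SELECT @country_id = country_id
     FROM city
    WHERE city_id = @city_id;

   SELECT @country_name = country_name
     FROM country
    WHERE country_id = @country_id;
END
```

The client simply calls the proc and receives @country_name; none of the intermediate variables leak outside the database.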

WHERE clauses

Join tables using the ANSI (the "new") syntax (with the keyword JOIN) instead of listing the tables in the FROM clause and the restrictions in the WHERE clause (the "old" syntax).


This tip will be helpful if you use the "all in one heap" approach anyway. Of course, using the ANSI syntax must be allowed by the coding standards of your company. I had used the old one for a few years before coming to one of my projects where I was forced to adopt the coding rules of the company, and these rules forced developers to use only the "new" method. In my first days of working there I felt a bit uncomfortable, but now I am happy and recommend all my fellow developers to move to the ANSI standard - it is more understandable, and when you are looking at one table of a complicated join, you in fact concentrate on that table, so the SQL statement is no longer one heap...


*** BAD code: ***

 SELECT @country_name = country.country_name
   FROM country,
         city,
         address
  WHERE country.country_id = city.country_id
    AND city.city_id = address.city_id
    AND address.address_id = @address_id;

*** GOOD code: ***

 SELECT @country_name = country.country_name
   FROM country
   JOIN city
     ON city.country_id = country.country_id
   JOIN address
     ON address.city_id = city.city_id
  WHERE address.address_id = @address_id;


OUTER JOIN: only LEFT

Use only left outer join.


When using OUTER JOINs, use only LEFT ones all over the application (even all over your career if possible) - and never RIGHT! It will make the joined tables easier to imagine: the left imagined table always has values in all the rows, and the right one has "holes" of NULLs... Conversely, you could always use RIGHT and never LEFT; it doesn't matter which - what is important is to use only one of them (no politics, really!).
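For instance (the employee and department tables here are invented for illustration):

```sql
-- Every employee appears in the result; employees with no department
-- get a NULL "hole" - always on the right side of the join:
SELECT e.last_name
      ,d.dept_name
  FROM employee e
  LEFT JOIN department d
    ON d.dept_id = e.dept_id;
```

Reading any such query, you instantly know the "full" table is on the left and the possibly-missing rows are on the right.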

INSERT: mention fields

Write field names in the SELECT parts of INSERT statements.


SQL Server only:


In INSERT statements, write the field name near each inserted value ("field_name = inserted_value"):


*** BAD code: ***

    INSERT #recordset (
           company_number
          ,entered_date
          ,description
          ,accounting_flag
          ,interface_flag
          ,subcontractor_contact_ba_id)
    SELECT @company_number
          ,received_date
          ,full_description
          ,'Y'
          ,'N'
          ,NULL /* will be populated later */
      FROM v_change_order_header
     WHERE ...

*** GOOD code: ***

    INSERT #recordset (
           company_number
          ,entered_date
          ,description
          ,accounting_flag
          ,interface_flag
          ,subcontractor_contact_ba_id)
    SELECT company_number = @company_number
          ,entered_date = received_date
          ,description = full_description
          ,accounting_flag = 'Y'
          ,interface_flag = 'N'
          ,subcontractor_contact_ba_id = NULL /* will be populated later */
      FROM v_change_order_header
     WHERE ...

It will allow you to see exactly which value goes into which field - very useful when you add/remove fields to/from an INSERT statement having really a lot of fields. Don't forget that the field name, written with "=" before the inserted value, acts only like a comment - the insert anyway occurs by the position of the field in the INSERT part (so, if the field "last_name" is listed third in the INSERT part and you write "first_name = @first_name" in the third line of the SELECT part, you will populate the field "last_name" with the value stored in @first_name!). Unfortunately, this syntax is impossible in Oracle, but you can use comments to achieve the same effect ("inserted_value /* field_name */" instead of "field_name = inserted_value").
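So, in Oracle the GOOD example above could look like this (a sketch; the variable and view names are carried over from the example only for illustration):

```sql
    INSERT INTO recordset (
           company_number
          ,entered_date
          ,description
          ,accounting_flag
          ,interface_flag
          ,subcontractor_contact_ba_id)
    SELECT v_company_number  /* company_number */
          ,received_date     /* entered_date */
          ,full_description  /* description */
          ,'Y'               /* accounting_flag */
          ,'N'               /* interface_flag */
          ,NULL              /* subcontractor_contact_ba_id - will be populated later */
      FROM v_change_order_header
     WHERE ...
```

The comments carry no weight for the database, so keeping them in sync with the INSERT part is the developer's duty - exactly as with the "field_name =" form.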


By the way, talking about the INSERT statement... If you don't insert any value into a field, you can simply omit that field in the statement. But it's more elegant to explicitly insert NULL (as into the 'subcontractor_contact_ba_id' field in the above example). Why? Firstly, developers will see that there exist columns which are not populated by the INSERT statement (otherwise, they could think the table is smaller than it really is). Secondly, it acts like a comment saying the developer hasn't forgotten to populate the field - he has left it empty intentionally. It's especially good when you do the initial INSERT into a temporary table in your stored procedure leaving some fields empty (these fields will be populated later in the procedure using UPDATE of the temporary table; I even add a comment in this situation: "/* will be populated later */").

Temp table in stored proc

If a stored procedure uses a temporary table to form its returned recordset, the following rules would be helpful:

CREATE TABLE #recordset (
     /*********** Fields to be returned in recordset: ***********/
     start_date             datetime         null
    ,end_date               datetime         null
    ,certificate            varchar(200)     null 
    ,company                varchar(200)     null
    ,date_taken             datetime         null
    ,comments               varchar(200)     null
    ,class_description      varchar(100) not null
    ,class_code             varchar(50)      null 
    ,recertification_date   datetime         null /* recertification_period passed after date_taken */
     /*********** Fields for internal use: ***********/
    ,student_id             integer      not null
    ,sc_id                  integer          null 
    ,recertification_period tinyint          null /* needed to calculate recertification_date */)

Advantages:

Entity existence checking

When checking whether an entity exists in a table, use EXISTS instead of COUNT.


Elegant programming means not only producing code which is elegant cosmetically and reads easily. For example, inefficient code doesn't look very elegant to me even if it is cosmetically perfect... I don't feel comfortable seeing the following construction used to check an entity's existence:

SELECT COUNT(1)
  INTO counter
  FROM...;
IF counter > 0 THEN...

I don't understand why developers perform a full table scan (which sometimes forces users to enjoy minutes of looking at a turning hourglass) while there is another, much nicer way which reports an entity's existence immediately after finding the first occurrence in the table (which can be very large):

BEGIN
   SELECT 1
     INTO existence_flag
     FROM DUAL
    WHERE EXISTS (SELECT 1 FROM...);
EXCEPTION
   WHEN NO_DATA_FOUND THEN
      existence_flag := 0;
END;
IF existence_flag = 1 THEN...
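The fragment above is PL/SQL; in Transact-SQL the same check can be sketched more directly (the table and variable names here are assumptions):

```sql
DECLARE @existence_flag bit;

-- Stops scanning as soon as the first matching row is found:
IF EXISTS (SELECT 1 FROM employee WHERE last_name = @last_name)
   SET @existence_flag = 1;
ELSE
   SET @existence_flag = 0;

IF @existence_flag = 1 ...
```

In both dialects the optimizer can stop at the first hit, while COUNT must touch every qualifying row before the comparison is even made.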

Data Sentinel Layer

Write business logic in database stored procedures.


Caution! The following tip can be irrelevant for very large systems with heavy transaction-intensive processing!


Client parts of many client/server applications (and middle-tier components of many n-tier applications) contain SQL statements in one of the existing forms (embedded into a host language, data sources of data access objects etc.), and that looks normal and traditional. But that should not make us happy. It's better to have a "SQL-free" Business Logic Layer (BLL) - and even Data Access Layer (DAL) if one exists. If you begin to build a new database-oriented application (client/server or distributed) and wish to really facilitate your life, think about creating a Data Sentinel Layer (that's what I call it - it's not an official term) which is the only channel of interaction between the application and the data stored in the database. In fact, it is a set of stored procedures which act as an API between the client (or the middle-layer application's DAL), written in .NET, Java, PowerBuilder etc., and the data storage (physical database tables). Only these procs have permissions to access and manipulate the stored data directly; the DAL/client app is only allowed to call the procs (which are the main "working horses" of such a system). So, SQL statements may not appear in the client/middle application at all - not even in its DAL! Yes, yes, I am serious - a large-scale, enterprise-level data-centric application can perfectly exist with no SQL outside the database!
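A minimal sketch of such a "sentinel" proc (T-SQL; the table, the business rule and all names are invented purely for illustration):

```sql
-- The client never touches the employee table directly; it may only call procs like this one,
-- so the business rule lives in the database, not in the client:
CREATE PROCEDURE emp_update_salary
     @emp_id int
    ,@salary money
AS
BEGIN
   IF @salary < 0
      RAISERROR('Salary cannot be negative.', 16, 1);
   ELSE
      UPDATE employee
         SET salary = @salary
       WHERE emp_id = @emp_id;
END
```

Grant EXECUTE on the procs to the application login and no direct SELECT/INSERT/UPDATE/DELETE permissions on the tables - then the "sentinel" rule is enforced by the database itself, not by developers' discipline.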


So, if an object of ADO.NET's DbCommand class is used in the DAL, set its CommandType property to StoredProcedure and its CommandText to the name of a stored procedure (and not to a hardcoded SQL statement as happens when the CommandType is set to Text); likewise, populate a DbDataAdapter's SelectCommand, InsertCommand, UpdateCommand and DeleteCommand properties with such commands calling the stored procedures used for SELECT, INSERT, UPDATE and DELETE.


Did you pay attention to how development tools have been replacing each other during the last two-three decades? COBOL, C, C++, PowerBuilder, Java; today is the golden age of .NET... And nobody knows what the industry will suggest to us tomorrow! So, if you use the described approach and program business logic in stored procs (and not in the BLL of your current-technology-dependent application), there will be fewer problems with converting the system to modern technologies in the future - I am sure that Oracle and SQL Server (with their stored procedure languages PL/SQL and Transact-SQL) will continue to exist even after the End of Days!


So, go ahead!






