Task: Database Design
This task explains how to design a database to implement persistency within an application.
Disciplines: Analysis & Design
Purpose
  • To ensure that persistent data is stored consistently and efficiently.
  • To define behavior that must be implemented in the database.
Relationships
Main Description

The steps presented in this task assume that the persistent data design of the application will be implemented using a relational database management system (RDBMS). It is assumed that you have familiarity with database concepts, including normalization and de-normalization, as well as with database terminology as covered in references such as [DAT99]. 

The steps in this task also refer to the Unified Modeling Language (UML) profile for database modeling, which is discussed in [NBG01]. In addition, [NBG01] contains a general description of the process for modeling and designing relational databases using UML.  For background information on the relationship between relational data models and object models, consult Concept: Relational Databases and Object Orientation.

Steps
Develop Logical Data Model (Optional)
Purpose Define a model of the logical design of the database.

The purpose of the Logical Data Model is to provide an idealized view of the key logical data entities and their relationships that is independent of any specific software or database implementation. It is generally in third normal form (see Concept: Normalization), which is a data-modeling form that minimizes redundancy and ensures no transitive dependencies. Such a model is concerned with what the database will look like when capturing data, rather than with the applications that use the data and their performance. Note that a Logical Data Model is considered to be part of the Artifact: Data Model and is not a separate RUP work product. However, it is often important to define individual Logical Data Models for:

  • projects in which the database and application designs are being developed by separate teams.
  • projects in which there are multiple applications that will share a common database.

If you are creating a Logical Data Model, you can start from scratch using the model elements discussed in Guideline: Data Model, or you can begin by starting with entities for each persistent class in the Analysis Model or Design Model.

You might decide not to create a separate Logical Data Model, especially if you are designing a database that serves a single application. In this case, the Database Designer develops the Physical Data Model based on the set of persistent classes and their associations in the Design Model.

In either approach, it is important for the Database Designer and the Designer to collaborate throughout the analysis and design process to identify which classes in the Artifact: Design Model need to store information in a database. As described in the step titled, "Identify persistent classes of the Task: Class Design," the database designer works with the designer to identify which design classes in the Design Model are considered to be persistent and are potential candidates for becoming tables in the database.

Develop Physical Database Design
Purpose Define the detailed physical design of the database.

The physical database design includes model elements (such as tables, views, and stored procedures) that represent the detailed physical structure of the database and model elements (such as schemas and tablespaces) that represent the underlying data storage design of the database.  Collectively, these model elements comprise the Physical Data Model of the database.  This Physical Data Model is contained in the Artifact: Data Model and is not a separate model work product.

The detailed steps for developing the physical database design are as follows:

Define Domains

Purpose To define reusable user-defined types. 

Domains might be used by the database designer to enforce type standards throughout the database design. Domains are user-defined data types that can be applied to a column in a table.  Domains have the properties of a column without the name. 

Create Initial Physical Database Design Elements

Purpose Create the initial database tables and relationships.

The database designer models the Physical Data Model elements using tables and columns in tables, as described in Guideline: Data Model

If a Logical Data Model has been created, then its logical entities can be used as the basis for an initial set of tables.

Alternatively, the database designer might jump-start the Physical Data Model by using the persistent classes in the Design Model as a starting point for tables in the Physical Data Model.  The database designer models the persistent classes and their attributes as tables and columns respectively.  The database designer also needs to define the relationships between the tables based on the associations between the persistent classes in the Design Model.  A description of how the Design Model elements and relationships map to Data Model elements and relationships is provided in Work Product Guideline: Forward Engineering Relational Databases.

If you are starting the model from persistent classes rather than from a normalized Logical Data Model, then you will generally need to apply some normalization in order to eliminate data redundancies and non-key field dependencies. See Concept: Normalization for more information on database normalization.

Define Reference Tables

Purpose To define standard reference tables used across the project.

Often there are standard look-up tables, validation tables, or reference tables used throughout the project. Since the data in these tables tends be frequently accessed but seldom-changing, that data is worth special consideration. In the Design Model, these tables might contain standard product codes, state or province codes, postal or zip codes, tax tables, area code validation tables, or other frequently accessed information. In financial systems, these tables might contain lists of policy codes, insurance policy rating categories, or conversion rates. Look in the Design Model for classes that are primarily read-only, providing validation information for a large number of clients.

If the reference table is small, do not bother to index it, since indexing might actually add additional overhead for small tables. A small, frequently accessed table also tends to remain in memory, because caching algorithms often keep frequently accessed tables in the data cache.

If possible, make sure that the database cache is large enough to keep all reference tables in memory, along with normal "working set space" for queries and transactions. Often the secret to increasing database performance is reducing disk I/O.

Once the reference table structures are defined, determine a strategy for populating the reference tables. Since these tables are accessed near the beginning of the project, determining the reference values and loading the tables often need to occur relatively early during application runtime. While the database designer is not responsible for obtaining the data, he or she is responsible for determining how and when the reference tables will be refreshed.

Create Primary Key and Unique Constraints

Purpose To define the one or more columns that uniquely identify a row in the table.
To define constraints on columns that guarantee the uniqueness of the data or collection of data.

A primary key is one or more columns that uniquely identify rows in a table. A table has a single primary key. There is often a "natural" key that can be used to uniquely identify a row of data (for example, the postal code in a reference table). The primary key should not contain data that might change with the business environment. If the "natural" key is a value that can change (for example a person's name), then it is recommended that the database designer create a single non-meaningful, non-user-entered column when creating a primary key. This creates a data structure that has greater adaptability to changes in the business structure, rules, or environment.

The use of a non-meaningful, non-user-entered column as the primary key is an essential concept in designing a data warehouse. Transactional systems often choose a "natural" primary key that might be subject to minimal change over a non-meaningful, non-user-entered column.

A unique constraint designates that the data in the column or collection of columns is unique per row. If the unique constraint is on a column, the data in a specific row in the specified column must be unique from the data in a different row in the same column. 

When a unique constraint is defined for a group of columns, the uniqueness is based on the collective whole of the data in the columns that make up that unique constraint. The data in a specific row in a specific column does not have to be unique from the data in a different row in the same column. The database designer uses the unique constraint to ensure uniqueness of business data.

Define Data and Referential Integrity Enforcement Rules

Purpose To ensure the integrity of the database.

Data integrity rules, also known as constraints, ensure that data values lie within defined ranges. Where these ranges can be identified, the database can enforce them. (This is not to say that data validation should not be done in the application, but only that the database can serve as a "validator of last resort" in the event that the application does not work correctly.) Where data validation rules exist, the database constraints must be designed to enforce them.

A foreign key is one or more columns in a table that map to the primary key in another table. One table might have many foreign keys, and each foreign key is a map to a different table. This mapping, or relationship, between the tables is often referred to as a parent-child relationship. The child table contains the foreign key, which maps to the primary key in the parent table. 

The definition of foreign key constraints is also often used by the query optimizer to accelerate query performance.  In many cases, the foreign key enforcement rules use reference tables.

De-Normalize Database Design to Optimize for Performance

Purpose To optimize the database data structures for performance.

In the case of a relational Data Model, the initial mapping generally yields a simple class-to-table mapping. If objects from different classes need to be retrieved at the same time, the RDBMS uses an operation called a "table join" to retrieve the rows related to the objects of interest. For frequently accessed data, join operations can be computationally expensive. To eliminate the cost of the join, a standard relational technique called "de-normalization" is often employed.

De-normalization combines columns from two or more different tables into the same table, effectively pre-joining the information. De-normalization reflects a tradeoff between more-expensive update operations in favor of less-expensive retrieval operations. This technique also reduces the performance of the system in queries that are interested only in the attributes of one of the objects that are effectively joined in the de-normalized table, since all attributes are normally retrieved on every query. For cases in which the application normally wants all attributes, there can be a significant performance improvement.

De-normalizing more than two tables is rare and increases the cost of inserts and updates as well as the cost of non-join queries. Limiting de-normalization to two tables is a good policy unless strong and convincing evidence can be produced regarding the benefits.

De-normalization can be inferred from the design classes in cases in which classes are nested. Nested classes can be mapped to a de-normalized table.

Some object databases allow a concept similar to de-normalization, in which related objects are clustered together on disk and retrieved in single operations. The concept in use is similar: Reduce object retrieval time by reducing the work the system must do in order to retrieve related objects from the database.

In some cases, optimizing the Data Model can unmask problems in the Design Model, including performance bottlenecks, poor modeling, or incomplete designs. In this event, discuss the problems with the Designer of the class, triggering change requests where appropriate.

Optimize Data Access

Purpose To provide for efficient data access using indexing.
To provide for efficient data access using database views.

Once the table structure has been designed, you must determine the types of queries that will be performed against the data. Indexing is used by the database to speed access. Indexing is most effective when the data values in the column being indexed are relatively distinct.

Consider the following indexing principles:

  • The primary key column of the table must always be indexed. Primary key columns are used frequently as search keys and for join operations.
  • Tables smaller than 100 rows in size with only a few columns benefit little from indexing. Small tables generally fit easily in the database cache.
  • Indexes should also be defined for frequently executed queries or for queries that must retrieve data quickly (generally, any searches done while a person might be waiting). An index should be defined for each set of attributes that are used together as search criteria. For example, if the system needs the ability to find all Orders on which a particular product is ordered, an index on the Line Item table on the product number column would be necessary.
  • Indexes should generally be defined only on columns used as identifiers, not on numeric values, such as account balances or textual information such as order comments. Identifier column values tend to be assigned when the object is created and then remain unchanged for the life of the object.
  • Indexes on simple numbers (integer and number data types) are much simpler and faster than indexes on strings. Given the large data volumes processed on a query or a large join, small savings add up quickly.  Indexes on numeric columns tend to take significantly less space than indexes on characters.

On the down side, the use of indexes is not free; the more indexes on a table, the longer inserts and updates take to process. When contemplating the use of indexes, bear in mind the following precautions:

  • Do not index just to speed up an infrequently executed query, unless that query occurs at a critical point, making maximum speed essential.
  • In some systems, update and insertion performance is more important than query performance. A common example is in factory data acquisition applications in which quality data is captured in real time. In these systems, only occasional online queries are executed, and most of the data is analyzed periodically by batch reporting applications that perform statistical analysis on it. For data-acquisition systems, remove all indexes to achieve maximum throughput. If indexes are needed, they can be rebuilt just before the batch reporting and analysis applications run, then dropped when the reporting and analysis is complete.
  • Always remember that indexes have hidden costs. For example, they take time to update (a tax paid on every insert, update, or delete) and occupy disk space. Be sure you get value from using them.

Many databases offer a choice of index types. The most common include:

  • B-tree indexes-The most frequently used kind are based on balanced b-tree index data structures. They are useful when the index key values are randomly distributed and tend to have wide variability. They tend to perform poorly, however, when data being indexed is already in sequential order.
  • Hashed indexes-Less frequently, index key values are hashed. Hashing offers better performance when the range of index key values is known, relatively unchanging, and unique. This technique relies upon the use of the key value to calculate the address of the data of interest. Because of the need for predictability, hash indexes tend to be useful only for medium-sized lookup tables that change very infrequently.

Your choice of indexing strategy and timing of index creation can have a large impact on performance. Bulk data loads should be performed without indexes (this can be achieved by dropping the index, loading the data, and then re-creating the index). The reason for this is that the index structure is re-balanced as each row is added. Since subsequent rows will change the optimal index structure, the work done re-balancing the index as each row is inserted is largely wasted. It is faster and more efficient to load data without indexes, then re-create the index when the data load is done. Some databases provide bulk data-loaders to do this automatically.

Another strategy for optimizing database access performance is the use of views. Database views are virtual tables that have no independent storage of their own. To the calling program (or user), however, a view behaves like a table. A view supports retrieval of data, and it can be used to update data as well-depending on the database structure and database vendor. The view contains data from one or more tables that can be accessed through a single select statement. The performance gain occurs during the selection of data, especially in frequently queried tables. The data is retrieved from a single location-the view-instead of by searching the multiple or large tables that exist in the database.

Views also play a significant role in database security. A view containing parts of a table can restrict access to sensitive data contained in the base table.

Define Storage Characteristics

Purpose To design the space allocation and disk page organization of the database.

A database designer uses tablespaces to represent the amount of storage space that is allocated to tables, indexes, stored procedures, and so forth. One or more tablespaces are mapped to a database. The database designer must analyze the tables in the Data Model to determine how to distribute them, along with other support database elements, across the storage space in the database.

In determining the tablespace structures for the database, bear in mind that databases do not perform I/O on rows, records, or even whole tables. Instead they perform I/O on disk blocks. The reason for this is simple: Block I/O operations are usually optimized in the software and hardware on the system. As a result, the physical organization of the tables and indexes in the database can have a dramatic impact on the performance of the system.

When planning the space allocation and disk page organization of the database, consider the following factors:

  • the density of information in the disk pages
  • the location of disk pages on disk or across multiple disks
  • the amount of disk space to allocate to the table

These factors are discussed in the sections that follow.

Disk Page Density

The density of disk pages depends on the extent to which data is expected to change over time. Basically, a less-dense page is more capable of accepting changes in values or the addition of data over time, while a fuller data page provides better read performance, since more data is retrieved per block read.

To simplify disk management, the database designer can group tables by the extent to which they tend to change. The following three groups constitute a good beginning for this type of organization:

  • highly dynamic tables
  • somewhat dynamic tables
  • mostly static tables

The highly dynamic tables should be mapped onto disk pages that have a great deal of empty space in them (perhaps 30%); the somewhat dynamic tables should be mapped onto disk pages that have less empty space (perhaps 15%); and the mostly static should be mapped onto disk pages that have very little empty space (perhaps 5%). The indexes for the tables must be similarly mapped.

Disk Page Location

After the groups of tables are mapped, the database designer must determine where to put the disk pages. The goal here is to try to balance the workload across a number of different drives and heads to reduce or eliminate bottlenecks. Consider the following guidelines:

  • Never put data on the same disk as the operating system, its temporary files, or the swap devices. These drives are busy enough without the addition of further workload to them.
  • Put data that is accessed simultaneously on different drives in order to balance the workload. Some systems support parallel I/O channels. If this is the case, put the data on different channels.
  • Put the indexes on a different drive from the data that it indexes in order to spread out the workload.
  • Refer to the database vendor's documentation for guidelines.
  • The type of storage used (for example, RAID-5, RAID-10, SAN, NAS, and channel attached) affects database performance. Make use of the performance guidelines provided by the storage provider.

Database I/O is generally the limiting factor in database performance. I/O balancing is an iterative, experimental process. By prototyping database access performance during the elaboration phase, coupled with appropriate instrumentation to monitor physical and logical I/O, you can uncover performance problems early while there is still time to adjust the database design.

Disk Space Allocation

Using the characteristics of the persistence design mechanism, estimate the number of objects that must be stored. The amount of disk space required to store the objects varies from RDBMS to RDBMS. When calculating disk space, make sure to account for growth due to additions of data.  To estimate the disk space for a database, first estimate the disk space required for each table, and then calculate the space requirements for all tables.  Consult the database administrator manual for the specific RDBMS product to determine the precise size estimation formula.  Here are some general steps for estimating the space requirements for a table:

  • Calculate average row size.  This calculation should include any control information at the record level, as well as any control information required for variable-length columns.
  • Calculate the number of rows that will fit into a page or block of I/O. Because most databases store only complete records on a page or I/O block, this should be the integer number of rows that will fit into a page or block of I/O.
  • Calculate the number of pages or I/O blocks required to store the estimated number of records in the database.  The estimated number of records must include any load factors.
  • Multiply the number of pages or I/O blocks required by the size of the page or I/O block.
  • Add any overhead for additional indexes.
  • Add any fixed overhead for the table.

Once the table space requirements have been defined:

  • Compute the sum of the space required by the tables.
  • Add in any required fixed amount of space for database management.
  • Add in disk space required for the transaction log and audit trail. 

In a frequently updated environment, the retention requirements for the audit trail require significant amounts of storage. The documentation for major commercial database management systems usually provides detailed sizing instructions. Be sure to refer to these instructions when calculating your estimates of the database disk space requirements.

Design Stored Procedures to Distribute Class Behavior to the Database

Purpose To determine if the stored procedures or triggers should be used to implement data access class operations.

Most databases support a stored procedure capability. A stored procedure is executable code that runs within the process space of the database management system. It provides the ability to perform database-related actions on the server without having to transfer data across a network. The judicious use of stored procedures can the improve performance of the system.

Stored procedures are usually one of these two types: actual procedures or triggers. Procedures are executed explicitly by an application, generally have parameters, and provide an explicit return value. Triggers, on the other hand, are invoked implicitly when some database event occurs (for example, insert a row, update a row, or delete a row), have no parameters other than the row being modified (since they are invoked implicitly), and do not provide an explicit return value.

In database systems that lack constraints, triggers are often used to enforce referential and data integrity. Otherwise, they tend to be used when an event needs to trigger (or cause) another event. Triggers are also frequently used for security purposes by auditing the trigger event.

The design classes in the Design Model must be examined to see if they have operations that should be implemented using the stored procedure or trigger facility. Candidates include:

  • any operations that primarily deal with persistent data (creating, updating, retrieving, or deleting it).
  • any operations in which a query is involved in a computation (such as calculating the average quantity and value of a product in inventory).
  • operations that must access the database in order to validate data.

Remember that improving database performance usually means reducing I/O. Therefore, if performing a computation on the DBMS server will reduce the amount of data passed over the network, the computation should probably be performed on the server.

Work with the designer of the design class to discuss how the database can be used to improve performance. The designer will update the operation method to indicate whether one or more stored procedures can be used to implement the operation.

Review the Results
Purpose To ensure the quality and integrity of the Data Model.

Continuously throughout this task, you must consider the Checklist: Data Model to assess the completeness and quality of the effort.  Furthermore, the database designer must regularly review the implemented structure of the database to ensure that the Data Model is consistent with any changes that have been made directly in the database.  If the project is using data-modeling tools that support synchronization of the Data Model with the physical structure of database, the database designer must periodically check the state of the Data Model with the database and makes adjustments as needed. 

Identified defects that will not be corrected at this time must be documented in Change Requests and eventually assigned to someone to own and drive to resolution.

More Information