Modeling Methodologies
From www.aisintl.com
Many
forms of symbolic notation have been developed to enable data models
to represent various levels of abstraction. Some are lexical, others
graphic; the better approaches are both.
One
of the earliest, Chen's
Entity
Relationship Model, offers a set of shapes and lines which, much
like musical notation, deliver a wealth of information with sparse
economy of drawing. Entity Relationship modeling was readily adopted
outside academia, introducing the concepts of data modeling to a
generation of information professionals.
Chen's
ER spawned a number of variations and improvements, some of which
have been embodied in computer-assisted software engineering (CASE) products
employing ER methodology, e.g., CSA's Silverrun.
Barker90
(p. 5-1) defines an entity as "... a thing or object of
significance, whether real or imagined, about which information
needs to be known or held." Martin90
(vol. II, p. 219) agrees that an entity "is something about
which we store data." Chen's original E-R technique made a firm
(if not clear) distinction between entities, as defined above, and
relationships between them. To cope with inevitable complexities,
Chen allowed relationships to have attributes of their own, making
them look a lot like entities and giving rise to heated debate over
just what constitutes an entity versus a relationship.
Given
the lack of clarity in definitions, it is not surprising that Codd90
(p. 477) says "The major problem with the entity-relationship
approach is that one person's entity is another person's
relationship."Date95
(p.363) agrees , saying "[the ER approach] is seriously flawed
because the very same object can quite legitimately be
regarded as an entity by some users and a relationship by
others." Thus Codd90
(p. 9) says emphatically that "... precisely the same structure
is adopted for entities as for relationships between entities."
Date95
(p. 362) puts this in perspective with "[the ER approach] is
essentially just a thin layer on top of the basic relational
model."
James
Martin's Information Engineering, laid out in Martin90,
is a streamlined refinement on the ER theme which discards
the arbitrary notion of the complex "relationship"
with an arity (i.e., the number of entities related) of two,
three, four, or even more. Martin models such relationships
simply as associated entities. Thus every relationship in IE is
binary, involving two entities (or possibly only one if
reflexive). Martin also simplified the graphic notation in
his diagram style. IE has become the basis for a number of
CASE products, including Powersoft's PowerDesigner.
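To make this concrete, the sketch below (in generic SQL, with hypothetical
table and column names of our own choosing) shows how a ternary "supplies"
relationship among suppliers, parts, and projects is handled in the IE
style: the relationship becomes an associated entity, and only binary
relationships remain.

    CREATE TABLE supplier (supplier_id INTEGER PRIMARY KEY);
    CREATE TABLE part     (part_id     INTEGER PRIMARY KEY);
    CREATE TABLE project  (project_id  INTEGER PRIMARY KEY);

    -- The would-be ternary relationship becomes an entity of its own;
    -- each foreign key below is a simple binary relationship.
    CREATE TABLE supply (
        supplier_id INTEGER NOT NULL REFERENCES supplier (supplier_id),
        part_id     INTEGER NOT NULL REFERENCES part     (part_id),
        project_id  INTEGER NOT NULL REFERENCES project  (project_id),
        quantity    INTEGER,
        PRIMARY KEY (supplier_id, part_id, project_id)
    );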
Another
common modeling technique is IDEF,
developed in the late 1970s and early 1980s by Bob Brown at the
Bank of America and well described in Bruce92.
IDEF was later extended by various parties into a set of tools and
standards which were adopted by the U.S. Air Force as the required
methodology for government projects. IDEF
is semantically weaker than ER and IE and forces its practitioners
into some rather arbitrary methods which lack a sound foundation in
theory. Nonetheless it is a workable, easily learned methodology
which has been taken up, either by choice or for government
contracts, by many modelers. LogicWorks' ERwin, Popkin's System
Architect, and InfoModeler from InfoModelers, Inc. all offer IDEF1X
data modeling.
Entity-Relationship,
IDEF1X, and Information Engineering all translate business
requirements into formal symbols and statements which can eventually
be transformed into database structural code. Thus the modeling
process reduces undisciplined, non-mathematical narrative to
algebraic regularity. Early practice (see DeMarco78),
when data modeling techniques were not widely known, was built on a
bottom-up approach. Analysts harvested an inventory of raw data
elements or statements ("A customer order has a date of
entry.") from the broad problem space. This examination was
frequently conducted via data
flow diagram
(DFD) techniques, which were invented for the express purpose of
discovering the pool of data items so that their structure could be
considered. Expert analysis of this pool, including various forms of
normalization, rendered aggregations of data elements into entities.
Unfortunately,
according to Teorey94,
"The number of entities in a database is typically an order of
magnitude less than the number of data elements ..."
Put another way, the number of data items or attributes is one or two
orders of magnitude greater than the number of entities. When analysis
begins with discovery of that multitude of details, one has the
discouraging experience of watching the work funnel into a black
hole of diagrams and documents, seldom allowing the escape of an
illuminating ray of understanding.
Top-down,
entity-based approaches (ER, IE, etc.) are more concise, more
understandable, and far easier to visualize than those which build up
from a multitude of details. Top-down techniques rapidly fan out
through the power of abstraction to generate the multitude of
implementation details. Current practice therefore leans toward
modeling entities (e.g., "customer", "order")
first, since most information systems professionals now understand
the concept of entities or tables in a relational database. Entities
are later related to one another and fleshed out with
attributes; during these processes the modeler may choose to
rearrange data items into different entity structures.
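As a rough illustration of the top-down sequence (again generic SQL with
hypothetical names), the entities can be named first and only later
related and fleshed out:

    -- Step 1: identify the entities; attributes come later.
    CREATE TABLE customer   (customer_id INTEGER PRIMARY KEY);
    CREATE TABLE cust_order (order_id    INTEGER PRIMARY KEY);

    -- Step 2: relate the entities and add attributes as analysis proceeds.
    ALTER TABLE cust_order ADD COLUMN customer_id INTEGER
        REFERENCES customer (customer_id);
    ALTER TABLE cust_order ADD COLUMN date_of_entry DATE;
    ALTER TABLE customer   ADD COLUMN customer_name VARCHAR(60);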
While
this delays the analysts' inevitable agony of populating the model's
details, it has the corollary shortcoming of placing responsibility
for critical structural decisions on the designers. We do not mean
to suggest that professional data analysts are incapable of making
such decisions, but rather that their time could be better spent if
the CASE tool could make those decisions - swiftly, reliably,
consistently - for them.
Proponents
(e.g., Halpin95)
of the Object Role Modeling (ORM)
or NIAM schools represent that their methodologies accomplish
precisely that, in addition to enabling the capture of a much larger
range of structural features and constraints than in ER-based
methods. In ORM it is the calculus of relational mapping, rather
than the whim or experience of a designer, which determines how data
items ("objects") are assembled into entities. This does
not snatch all judgment and creativity from the designer. Rather it
elevates them to a more symbolic plane of discussion concerning
business issues and implementation options. Dr. Terry Halpin
explains this more thoroughly and articulately in his several
articles on Object
Role Modeling.
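The following sketch suggests the flavor of that mapping; the fact types,
uniqueness constraints, and resulting table layout are hypothetical
examples of ours, not drawn from Halpin's texts.

    -- Elementary fact types and their uniqueness constraints:
    --   Employee has EmployeeName      (each Employee has exactly one name)
    --   Employee was hired on Date     (each Employee has exactly one hire date)
    --   Employee works for Department  (each Employee works for exactly one Department)
    --   Employee holds Skill           (many-to-many; uniqueness spans both roles)
    --
    -- The mapping procedure groups the functional fact types keyed on
    -- Employee into one table and gives the many-to-many fact type a
    -- table of its own; the grouping is determined, not chosen by whim.
    CREATE TABLE employee (
        emp_nr    INTEGER     PRIMARY KEY,
        emp_name  VARCHAR(60) NOT NULL,
        hire_date DATE        NOT NULL,
        dept_code CHAR(4)     NOT NULL   -- references a department table, omitted here
    );

    CREATE TABLE employee_skill (
        emp_nr     INTEGER NOT NULL REFERENCES employee (emp_nr),
        skill_code CHAR(4) NOT NULL,
        PRIMARY KEY (emp_nr, skill_code)
    );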
Accidents
of history rather than relative deficiencies seem to have kept ORM
in the shadows of ER for many years. Contrary to a frequent
misconception, the academic foundations of ORM date back twenty
years, to the same period which gave birth to ER. Over the years
several CASE tools have employed this methodology, yet there has
seldom been even one commercial product available. For a
comprehensive display of the current art of ORM, see Asymetrix's InfoModeler.
The
modeling methodologies discussed above deal with conceptual and
logical understanding of data but not necessarily the physical
details of its storage. Additional techniques from the area of
relational schema design are generally employed to represent tables,
columns, indexes, constraints and other storage structures which
implement a data design. For example, the table below illustrates
some design choices which must be expressed as declarative or
procedural integrity constraints to implement a model.
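A few such choices, expressed here as declarative constraints in generic
SQL over hypothetical tables, might include whether a column is mandatory,
what values it may hold, and which rows it must reference; choices a given
DBMS cannot enforce declaratively fall to procedural code such as triggers.

    CREATE TABLE cust_order (
        order_id      INTEGER NOT NULL PRIMARY KEY,
        customer_id   INTEGER NOT NULL                   -- mandatory relationship
                      REFERENCES customer (customer_id),
        date_of_entry DATE    NOT NULL,                   -- mandatory attribute
        status        CHAR(1) NOT NULL
                      CHECK (status IN ('O', 'S', 'C'))   -- restricted domain
    );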
The
conceptual, logical, and physical models together comprise a
complete data model which can represent a given database design from
its highest abstraction through its most detailed level of column
data type and index expression.
In
our limited experience no single methodology, method, or tool covers
the full scope of data modeling from raw discovery to instantiated
database, as sketched above. Notice that in the upper half the
techniques funnel downward toward coalescence and conceptual clarity
(or into the black hole of bloated, aborted projects); in the lower
half the process fans rapidly out as automated algorithms replicate
abstract patterns to implement details (e.g., a simple foreign key
reference expands into a lengthy SQL trigger).
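By way of illustration only (Oracle-style PL/SQL, hypothetical names),
partial enforcement of a single REFERENCES clause from the child side
already expands into a trigger like the one below, and a full emulation
would need further delete and update triggers on the parent table.

    CREATE OR REPLACE TRIGGER order_item_fk_check
    BEFORE INSERT OR UPDATE OF order_id ON order_item
    FOR EACH ROW
    DECLARE
        v_count INTEGER;
    BEGIN
        -- Verify that the referenced parent row exists.
        SELECT COUNT(*) INTO v_count
          FROM cust_order
         WHERE order_id = :NEW.order_id;
        IF v_count = 0 THEN
            RAISE_APPLICATION_ERROR(-20001,
                'order_item references a non-existent cust_order');
        END IF;
    END;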