Memory-Resident Databases
Jiri Soukup
Practically every program must deal with some organization of
data. The data may range from a few integers that are to be multiplied
and added, to large databases such as those required for airline
reservation systems.
Most applications are somewhere in between these two extremes
and work well with a couple of linked lists or arrays and possibly
a hash table. Graphics applications often use trees, either for
fast searches or when displaying windows.
Some programs, e.g., compilers, work with complex data but do
not have to store it on disk. Other programs may have less complex
data but must store it at the end of each run. A simple example
of this might be a program to catalog your home collection of
musical recordings (CDs, records, and magnetic tapes).
Let's consider these two entirely different problems (the compiler
and the catalog of musical recordings). In both cases, it would
be nice to treat the data like a database. Instead of assembling
data structures from the classes available from a library, you
would simply declare the whole organization on several lines and
then load, retrieve, or traverse the data using database commands.
The first type of application (the compiler) would benefit from
this approach because it would not have to deal with complex internal
data directly. Debugging would be easier; difficult-to-find pointer
errors would simply be avoided.
The second type of application (the home music catalog) would
benefit because storage to disk would be automatically available.
Moving data to disk and reading it back is not difficult,
but coding and debugging it takes time.
Both types of applications would also benefit from being more
flexible to changes. Modifying a program with complex internal
data is tricky, and if the data objects or their organization
changes, I/O operations must often be completely redesigned.
In spite of all these advantages, databases are rarely used
in either case. Programs such as compilers must run extremely fast,
and slow access combined with memory overhead makes databases
impractical for this particular application. Small programs like
the home music catalog do not use databases because they are expensive
and require additional resources. Also, the programmer must learn
how to use the database and put up with the restrictions of its
interface, such as SQL.
VLSI DATABASES
Since the early 1980s [1,2], CAD designers have been using a solution
that, with some modifications, is applicable to the two situations
described above.
CAD programs are notorious for large and complex data. A system
that designs a silicon chip typically consists of about a dozen
larger programs that must be executed one at a time. The first
module divides the logic into several larger blocks, the second
module places smaller cells within the blocks, the next module
plans the overall flow of wiring, the following one runs the wires
through the available channels, and so on. These programs are
often called more than once. For example, after the first attempt
to wire the chip, the designer often returns to the placement
stage and reruns the whole design to optimize chip area or the
speed of the circuit.
All these programs, including a fast color display, usually
share common data (essentially a database), which typically contains
10,000-100,000 objects with about 20 different types. Since it
takes more than a single run to design a VLSI chip, the data must
be stored to disk, either at the end of the run or after finishing
a major design step.
The competitiveness and viability of any CAD product depends
on how fast it can access its own internal data. A product is
not going to be successful if it takes ages to display a screen.
Also, if the program is slow, the quality of the result suffers
because the algorithms have less time to perform the same task.
For this reason, practically all commercial CAD systems have
an internal database that is actually a huge memory-resident data
structure, accessible by a special set of macros or functions.
When coding in C, macros are often preferred; even function call
overhead counts in the multiple loops that are often used in this
type of program. More recent C++ projects, such as those at Mentor
Graphics, build the data organization from collections and other
abstract data types taken from a class library.
The important idea here is that we use memory-resident data
but treat it as a database. Access is fast, organized, and clean.
This method also separates data from application code.
I participated in three similar projects (BNR/IBM 1975-80, Bell
Labs 1980-82, and Cadence 1983-88) and in all three cases it took
two to three people about half a year to develop such a database.
The storage to disk and debugging facilities were always an important
part of the project.
EXAMPLE 1
Figure 1 shows a simplified representation of
the connectivity record for an electric circuit (usually called a
netlist). The library contains masters of cells such as NAND,
INVERTER, and FLIP-FLOP gates. Each master has logical ports (here
called formal terminals or fTerms) to which other cells may be
connected. Each logical terminal has one or more pins (we call
them formal pins or fPins) to which the wires actually connect.
fPins of the same fTerm are internally connected; it does not
matter to which fPin the wire is attached.
The circuit is built from instances of these cells. There are usually
many NAND gates and FLIP-FLOPS on a silicon chip. Each instance
has its terminals (actual terminals or aTerms), which are connected
by signal nets. Each cell is an instance of the master cell from
the library, and each aTerm is an instance of the fTerm on the master
cell.
A typical operation on such a database is to find all cell instances
connected to a given cell. The operation involves traversing all
aTerms on the given cell, then for each aTerm finding the connected
Net, traversing aTerms on that Net, and then, finally, finding
the cell instance for each aTerm.
Another typical operation for a given cell instance is to find
xy locations of all its pins. For the sake of saving memory, pins
are usually stored only on the cell master (see Fig. 1). For the
given cell, we have to find its master, all fTerms, and, for each
fTerm, all its fPins. The coordinates of the fPins must be translated
considering the position and rotation of the cell.
Listing 1 illustrates the solution to this example.
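The coordinate translation mentioned above can be sketched in a few lines. This is only an illustration under stated assumptions: the types Point and Placement and the function toChip are hypothetical names, not part of the Code Farms library, and rotation is assumed to be restricted to counterclockwise quarter turns.

```cpp
// Hypothetical types for illustration only; not the Code Farms API.
struct Point {
    int x;
    int y;
};

struct Placement {
    Point origin;      // position of the cell instance on the chip
    int quarterTurns;  // assumed: counterclockwise rotation in 90-degree steps
};

// Translate an fPin coordinate stored on the master into chip coordinates:
// rotate about the master origin first, then shift by the instance position.
Point toChip(Point p, const Placement& pl) {
    int turns = ((pl.quarterTurns % 4) + 4) % 4;  // normalize to 0..3
    for (int i = 0; i < turns; ++i) {
        int nx = -p.y;  // one counterclockwise quarter turn: (x, y) -> (-y, x)
        int ny = p.x;
        p.x = nx;
        p.y = ny;
    }
    return Point{p.x + pl.origin.x, p.y + pl.origin.y};
}
```

For example, a pin stored at (3, 1) on the master, on an instance placed at (100, 200) and rotated by one quarter turn, lands at (99, 203). Because the pins live only on the master, this small calculation replaces a stored copy of every pin on every instance.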
EXAMPLE 2
When cataloging a music collection, you may want to record the
following information:
- int: tape or disc id
- string: composer
- string: title
- string: instrument
The operation will be to add or remove a record and to search for
a combination of composer/title/instrument with * being used as
a wild card. For example, the request "Brahms * violin" would find
everything you have by Brahms for violin.
Since your collection probably contains dozens or hundreds of
items, not thousands, the access speed will not be critical. Assuming
that your computer disk is as full as mine, you may try to minimize
the disk storage and store all records as one singly linked list,
using a hashed vocabulary for names. This will definitely save
memory since the names of composers and instruments will frequently
repeat. The vocabulary can be a simple linked list or a hash table
if you want fast access (see Fig. 2). In terms of a C++ class library
such as NIH, this means implementing a LinkedList of Records and
an IdentSet (hash table) or LinkedList of names. Listing 2 gives
the solution for this example.
Figure 2. Database for musical recordings
(pointer representation).
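For readers without such a library, the same organization can be approximated with standard C++ containers. The sketch below is not the author's implementation: the class names mirror the description above, but the members (intern, add, find, vocabularySize) are assumptions made for illustration. Each distinct name is stored once in a hashed vocabulary, and every record holds three pointers into it.

```cpp
#include <cstddef>
#include <forward_list>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for the hashed vocabulary: each name is stored once.
class Vocabulary {
public:
    // Returns a pointer to the single stored copy of s. References to
    // unordered_set elements stay valid when the table rehashes.
    const std::string* intern(const std::string& s) {
        return &*names_.insert(s).first;
    }
    std::size_t size() const { return names_.size(); }

private:
    std::unordered_set<std::string> names_;
};

struct Record {
    int id;
    const std::string* composer;
    const std::string* title;
    const std::string* instrument;
};

class Catalog {
public:
    void add(int id, const std::string& composer, const std::string& title,
             const std::string& instrument) {
        records_.push_front(Record{id, vocab_.intern(composer),
                                   vocab_.intern(title),
                                   vocab_.intern(instrument)});
    }

    // Returns the ids of all matching records; "*" matches any value.
    std::vector<int> find(const std::string& composer, const std::string& title,
                          const std::string& instrument) const {
        std::vector<int> ids;
        for (const Record& r : records_) {
            if (matches(composer, *r.composer) && matches(title, *r.title) &&
                matches(instrument, *r.instrument)) {
                ids.push_back(r.id);
            }
        }
        return ids;
    }

    std::size_t vocabularySize() const { return vocab_.size(); }

private:
    static bool matches(const std::string& pattern, const std::string& value) {
        return pattern == "*" || pattern == value;
    }

    Vocabulary vocab_;
    std::forward_list<Record> records_;  // singly linked list of records
};
```

With three recordings on file, two of them by Brahms and two of them for violin, the call find("Brahms", "*", "violin") returns only the Brahms violin entry, and the vocabulary holds each repeated name a single time. The search is a linear scan, which matches the assumption that a home collection holds hundreds of items rather than thousands.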
IMPLEMENTING THE DATABASE
At first glance, it may appear that such a database is nothing
more than a combination of abstract and persistent data, available
from more or less any standard class library. This approach will
certainly work in Example 2.
However, if you attempt to implement more complex data such
as that shown in Example 1 with multiple inheritance or templates
(indirect links), you will end up with mutually connected classes
and many new or modified access methods. This is why CAD projects
usually have a database group that manages internal data structure
and maintains a clean interface layer separating the application
programmers from the data.
Another possibility is to design a library that would implement
associations and aggregations, such as described in Reference
4. The difference between such a library and most existing class
libraries would be a generic interface class for each type of
association. By itself, this is a most interesting subject that
I would like to explore in one of my future columns. For today,
I will limit myself to a couple of examples of how such a library
can be used.
The examples have been coded with the Code Farms library, which
allows the declaration of a data organization as a set of statements
that look like a database schema. Each statement corresponds to
one association or aggregation [4]. After declaring logical relations,
you access the data as if working with templates. The declaration
syntax is slightly nonstandard. For example:
ZZ_HYPER_SINGLE_COLLECT(myCol, Head, Record);
vaguely corresponds to the template declaration
declare_collection(Head,Record) myCol;
Figure 3. Two forms of aggregation.
Here, SINGLE_COLLECT represents a singly linked list of Records
with an encapsulated entry point in the class Head (see Fig. 3).
The statement generates an interface class that contains all the
access methods for this collection; myCol will be an instance
of this class. In a similar way:
ZZ_HYPER_SINGLE_TRIANGLE(myTri, Head, Record);
declares an aggregation (linked list of Records with pointers
back to Head). The organization is called a triangle for historical
reasons; its diagram (see Fig. 3) resembles a triangle.
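To make the difference between the two organizations concrete, here is a hand-written pointer layout that roughly corresponds to them. It is only a sketch: the real macros generate their own interface classes, and the names collectAdd, triangleAdd, and listLength are hypothetical.

```cpp
#include <cstddef>

struct Head;

// Hypothetical layout for illustration; not what the library generates.
struct Record {
    int value = 0;
    Record* next = nullptr;  // link used by both organizations
    Head* parent = nullptr;  // back pointer, set only in the triangle
};

struct Head {
    Record* first = nullptr;  // entry point of the singly linked list
};

// Collection: the Head reaches its Records, but a Record cannot
// find its Head.
inline void collectAdd(Head& h, Record& r) {
    r.next = h.first;
    h.first = &r;
}

// Triangle (aggregation): the same list, plus a pointer from every
// Record back to its Head.
inline void triangleAdd(Head& h, Record& r) {
    collectAdd(h, r);
    r.parent = &h;
}

// Walk the list from its entry point and count the Records.
inline std::size_t listLength(const Head& h) {
    std::size_t n = 0;
    for (const Record* r = h.first; r != nullptr; r = r->next) {
        ++n;
    }
    return n;
}
```

The extra parent pointer costs one word per Record, and in return a traversal can step from any Record to its owner in constant time. The netlist example relies on exactly this when it moves from an aTerm to its Net or to its Instance.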
A full listing of this program is available on request from
the email address below. It demonstrates not only the data declaration
and use, but also saving to disk (persistence), which, due to
the lack of space, is not covered here.
BENCHMARKS
Compared to full databases, a memory-resident database must
be faster and use less memory simply because it resides in memory
and cannot have the overhead associated with the other functions
a full database provides. Reference 5 quotes 5 times less memory
and 20 times less time to build a database for a VLSI simulation
compared to a commercial object-oriented database in April 1990.
Current numbers may not be as optimistic because, as a result
of fierce competition, all object-oriented databases have markedly
improved their performance since that time.
Listing 1. Declaration of the data from Example 1.
class Chip; class Instance; class aTerm; class Library;
class Master; class fTerm; class fPin; class Net;
ZZ_HYPER_SINGLE_COLLECT(cells,Chip,Instance);
ZZ_HYPER_SINGLE_COLLECT(nets,Chip,Net);
ZZ_HYPER_SINGLE_COLLECT(lib,Library,Master);
ZZ_HYPER_SINGLE_COLLECT(fTerms,Master,fTerm);
ZZ_HYPER_SINGLE_COLLECT(fPins,fTerm,fPin);
ZZ_HYPER_SINGLE_TRIANGLE(byInst,Instance,aTerm);
ZZ_HYPER_SINGLE_TRIANGLE(byNet,Net,aTerm);
ZZ_HYPER_SINGLE_LINK(master,Instance,Master);
ZZ_HYPER_SINGLE_LINK(term,aTerm,fTerm);
// Traversing cells connected to cell instance ip
Instance *ip, *ii; aTerm *a1, *a2; Net *np;
byInst_iterator it(ip);
while (a1=it++) { // walk through aTerms on ip
    np=byNet.par(a1); // net is the parent in byNet triangle
    byNet_iterator nt(np);
    while (a2=nt++) { // walk through aTerms on net np
        ii=byInst.par(a2); // instance is the parent in byInst
        if (ii!=ip) {
            ... ii is connected to ip by net np ...
        }
    }
}
Cattell and Skeen published a set of benchmarks (see Table 1) for
engineering databases [6]. Even though the benchmark is intended for
true databases that can handle data of 400 Mb and more, it is interesting
to compare the performance of large commercial databases with a
memory-resident database. The smallest benchmark works with about
2 Mb of useful data and 2 Mb of pointers, and can be implemented
as a memory-resident structure. I coded the benchmark with the Code
Farms library and ran it on a SUN3 configuration similar to that
reported in Reference 6.
For the exact definition of the benchmark, see Reference 6.
It assumes two logical records:
Part: RECORD[id: INT, type: STRING[10], x,y: INT, build: DATE]
Connection: RECORD[from: part-id, to: part-id, type: STRING[10], length: INT]
where id values and type strings have random values distributed
in a certain way. There are 20,000 parts and 60,000 connections.
Exactly three connections start from each part and the targets are
then randomly selected using a special formula.
The program tests three operations:
- Look-up: generate 1000 random ids, find the parts, and call
a null function 14.
- Traversal: find all parts connected to a randomly selected
part, then parts connected to them, and so on (total of 3280
parts in seven levels)
- Insert 100 new parts and three random connections for each.
Commit these changes to disk.
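The figure of 3280 parts in the Traversal test follows from the fan-out: with exactly three connections leaving every part, seven levels visit 1 + 3 + 9 + ... + 3^7 = 3280 parts when repeated visits are counted. The sketch below checks that arithmetic on an in-memory structure. The Part layout and the functions traverse and makeParts are assumptions for illustration; in particular, makeParts fills in placeholder targets rather than the benchmark's own random formula.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical in-memory layout: every part stores the indices of the
// three parts that its outgoing connections lead to.
struct Part {
    std::array<std::size_t, 3> to;
};

// Counts part visits, duplicates included, within `hops` levels of `start`.
std::size_t traverse(const std::vector<Part>& parts, std::size_t start,
                     int hops) {
    std::size_t visited = 1;  // the starting part itself
    if (hops == 0) {
        return visited;
    }
    for (std::size_t next : parts[start].to) {
        visited += traverse(parts, next, hops - 1);
    }
    return visited;
}

// Builds n parts with placeholder targets. The real benchmark selects the
// targets randomly with its own formula, which is not reproduced here.
std::vector<Part> makeParts(std::size_t n) {
    std::vector<Part> parts(n);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t k = 0; k < 3; ++k) {
            parts[i].to[k] = (i + k + 1) % n;
        }
    }
    return parts;
}
```

Calling traverse(makeParts(20000), 0, 7) returns 3280 regardless of how the targets are chosen, because the count depends only on the fan-out of three and the depth of seven.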
Listing 2. Solution to problem in Example 2.
class Catalog { ... }; class Vocabulary { ... };
class Name { ... };
class Record { int id; ... };
ZZ_HYPER_SINGLE_COLLECT(records,Catalog,Record);
ZZ_HYPER_HASH(names,Vocabulary,Name); // hash table of Names, on Vocabulary
ZZ_HYPER_NAME(name,Name); // every Name has a variable-length name
ZZ_HYPER_SINGLE_LINK(composer,Record,Name); // pointer to the name
ZZ_HYPER_SINGLE_LINK(title,Record,Name); // pointer to the name
ZZ_HYPER_SINGLE_LINK(instrument,Record,Name); // pointer to the name
// compare given name with a vocabulary name, skip if *
int Name::cmp(char *word) { // * is a wild-card entry
    if (!strcmp(word,"*")) return(0);
    return(strcmp(word,name.fwd(this)));
}
// compare record with the given composer, title, and instrument
int Record::cmp(char *com, char *tit, char *ins) {
    Name *np;
    np=composer.fwd(this); if (np->cmp(com)) return(1);
    np=title.fwd(this); if (np->cmp(tit)) return(1);
    np=instrument.fwd(this); if (np->cmp(ins)) return(1);
    return(0);
}
// print all records with the given composer, title, and instrument
void Catalog::find(char *com, char *tit, char *ins) {
    Record *rp;
    records_iterator it(this); // walk through Records on this Catalog
    while (rp=it++) {
        if (rp->cmp(com,tit,ins)) continue; // nonzero means no match
        printf("%d %s %s %s\n",rp->getId(),
            name.fwd(composer.fwd(rp)),
            name.fwd(title.fwd(rp)),
            name.fwd(instrument.fwd(rp)));
    }
}
Table 1. Results of database benchmark test
                 OODB1  OODB2  OODB3  OODB4  RDBMS  INDEX  MEMR1  MEMR2
Data size (Mb)   5.6    3.4    7.0    3.7    4.5    3.3    3.1    2.2
COLD (sec):
  Look-up        1      10     10     13     27     5.4    1.12   1.31
  Traverse       18     10     8.3    9.8    90     13     0.60   0.18
  Insert         9      5.3    1.9    1.5    22     7.4    0.96   1.10
  Total          39     25     20     24     139    26     2.68   2.59
WARM (sec):
  Look-up        0.1    0.03   1.1    1.0    19     2.4    0.92   1.31
  Traverse       0.7    0.1    1.2    1.2    84     8.4    0.56   0.18
  Insert         3.7    3.1    1.0    2.9    20     7.5    0.86   1.10
  Total          4.5    3.2    3.3    6      123    18     2.34   2.59
The original purpose of the benchmark was to compare relational
and object-oriented technologies and it makes the point clear: object-oriented
databases are about an order of magnitude faster. The authors of
Reference 6 emphasize that the benchmark should not be used for
comparison of individual vendors. Participants (Objectivity, Object
Design, Ontologic, Versant, Servio, and Sun's Index) are listed
anonymously in random order in Table 1.
The last two columns show results obtained with a memory-resident
database. MEMR1 is a simple straightforward implementation using
a GRAPH organization and variable-length names for every Record
and Connection. MEMR2 stored all names in a central hashed table,
saving about one-third of the memory on frequently repeated type
names.
The results do not include storage to disk because, normally,
a memory-resident database does not commit incremental changes
to disk. In a separate test, where I stored every new connection,
the insert time increased from 1.10 to 1.27 seconds (about 15%).
There is no storage requirement in Look-up or Traverse.
A similar database, implemented in plain C, required about the
same amount of memory and runtime as the C++ version. The results
show the memory-resident database to be about 7.5 times faster
than the best of the databases in the COLD mode, and only 1.36
times faster than the best database in the WARM mode. The last,
relatively small difference puzzled me at first and I investigated
where the time was spent. In my runs, about one-third of the time
was spent on allocation and initialization of objects, one-third
on generating random numbers, and only one-third on data manipulation
and other calculations.
It seems that, indeed, the benchmark cannot be used to reliably
compare object-oriented or any other databases with the memory-resident
database. The benchmark overhead masks the differences in their
performance. For example, the algorithms for random values may
have had more effect than the traversing algorithm. A more realistic
benchmark, which would traverse more and allocate fewer objects,
is needed, preferably without using a random number generator.
Also, after experimenting with the benchmark, the look-up value
of 0.03 seconds for OODB2 appears suspicious.
LIMITATIONS AND ADVANTAGES
The main limitation of any memory-resident database is the size
of the data. The database can handle only data that fit virtual
memory without too much paging. In its simple form described here,
it also does not provide multiuser access (shared memory). However,
it is very efficient and inexpensive for small and medium problems,
including VLSI CAD programs. Memory-resident databases are not
suitable for large network applications such as airline reservations
or banking.
Memory-resident databases can coexist with other libraries and
environments. For example, you can use them inside the Borland
C++ integrated environment, with Saber C++, or as a part of some
C++ code that runs with Object Store.
REFERENCES
[1] Soukup, J. Circuit layout, Proceedings of IEEE, Oct. 1981,
pp. 1281-1304.
[2] Hsu, C.P. Theory and algorithms for signal routing in integrated
circuit layout, PhD Dissertation, Univ. of Calif., Berkeley, May
1983.
[3] Soukup, J. Organized C: a unified method of handling data in
CAD programs and algorithms, 27th Design Automation
Conference, Orlando, June 1990, pp. 425-430.
[4] Rumbaugh, J., M. Blaha, W. Premerlani, F. Eddy, and W. Lorensen.
Object-Oriented Modeling and Design, Prentice Hall, Englewood
Cliffs, NJ, 1991.
[5] Galbiati, L. Comparing different implementations of a VLSI
simulation database, paper submitted to the 1992 Design Automation
Conference.
[6] Cattell, R.G.G., and J. Skeen. Engineering database benchmark,
Sun Microsystems Technical Report, April 1990.
Jiri Soukup was the manager of two teams (Bell Labs M.H. 1980-82
and Cadence Design Systems 1983-88) that developed the most advanced
VLSI layout systems available today. Since 1988, Jiri has been involved
in the development of a new generation of C++ libraries that are
more automatic and easier to use than standard class libraries. Jiri
can be reached at Code Farms Inc. at jiri@debra.dgbt.doc.ca, by
telephone at (613) 838-4829, and by fax at (613) 838-3316.