The secret of efficient software
design: Internal data organization
Jiri Soukup, President, Code Farms
Inc. 7214 Jock Trail, Richmond, Ont., Canada, K0A 2Z0, eMail: jiri@codefarms.com,
613-838-4829, fax 613-838-3316
Abstract
Programmers often start coding
without thinking about data. As the program evolves, pointers
form an intricate network which is difficult to debug. The major
improvement introduced by object-oriented programming is that
it forces the programmer to consider both data and functions (methods)
right from the beginning. However, even in object-oriented languages,
relations between objects are often not treated properly, causing
a new style of complex, spaghetti-like code. This paper explains
a new method of managing internal program data. This method completely
separates data objects from relations, improving code clarity
and dramatically increasing software productivity. An additional
benefit is that it also improves run-time performance. The methodology
has been implemented as a library which works with regular C or
C++, without being a special language. The library provides generic,
fully-typed data structures which are automatically persistent.
Introduction
Every program involves two basic
parts: data and algorithms. In a good program, the two parts must
be carefully balanced and play into each other's hands.
For example, let us consider a
program that stores a set of towns connected by highways, and
calculates the shortest connection between two given towns.
Before we start to code, we have
to consider several alternative means of managing internal data:
- If we store the towns in an
array, we use the minimum amount of memory per town.
- If the array is sorted then,
for a given name, we can quickly find the corresponding town,
using binary search.
- If we plan to add new towns
frequently, a linked list provides a more flexible solution
than an array.
- If we plan to delete towns frequently,
the list should be doubly-linked.
- Since linked lists do not allow
fast searches, we may need a hash table.
These are the most important decisions
a programmer makes, because they will affect both software design
and program performance. Different data structures often lead to
different algorithms.
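The trade-offs listed above can be made concrete with a minimal sketch in standard C++ (it does not use the Code Farms library). The type TownRec and the functions findSorted and findLinear are hypothetical names chosen only for this illustration. A sorted array allows binary search by name, while a linked list allows cheap insertion but only a linear scan.

```cpp
#include <algorithm>
#include <list>
#include <string>
#include <vector>

// Hypothetical record type used only in this sketch.
struct TownRec {
    std::string name;
};

// Sorted array: O(log n) lookup by name using binary search.
// Precondition: 'towns' is sorted by name.
const TownRec *findSorted(const std::vector<TownRec> &towns,
                          const std::string &name) {
    auto it = std::lower_bound(
        towns.begin(), towns.end(), name,
        [](const TownRec &t, const std::string &n) { return t.name < n; });
    return (it != towns.end() && it->name == name) ? &*it : nullptr;
}

// Linked list: O(1) insertion at the front, but lookup is a linear scan.
const TownRec *findLinear(const std::list<TownRec> &towns,
                          const std::string &name) {
    for (const TownRec &t : towns)
        if (t.name == name) return &t;
    return nullptr;
}
```

The two lookup functions have different preconditions and costs, which is why the choice of structure tends to shape the algorithm built on top of it.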
As can be seen from this example,
the important part of the decision is not how we form data objects,
but what relations (data structures) we use. The general popularity
of object-oriented programming has helped to recognize the importance
of data. However, many textbooks stress objects and neglect the
importance of relations.
Perhaps, the root of this problem
lies in the object-oriented paradigm itself. This paradigm assumes
that objects contain both data and methods (functions) that operate
on the data.
This model seems to fail when
implementing some data structures. For example, in our network
problem, every town must contain references (pointers or indices)
to adjacent highways, and every highway must contain references
to adjacent towns. The function that adds a new highway must modify
pointers in both highway and town objects, and therefore cannot
belong to either object, if we strictly adhere to the object-oriented
paradigm.
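A minimal sketch in standard C++ illustrates the difficulty. It is not the library's API; the names TownNode, HwyEdge, linkHighway and otherEnd are hypothetical. The function that links a highway writes into the highway and into both towns, so it is written here as a free function that belongs to neither class.

```cpp
#include <vector>

struct HwyEdge;                      // forward declaration

struct TownNode {
    std::vector<HwyEdge *> hwys;     // references to adjacent highways
};

struct HwyEdge {
    TownNode *ends[2] = {nullptr, nullptr};  // references to adjacent towns
    int dist = 0;
};

// Belongs to neither class: it modifies the highway and both towns.
void linkHighway(HwyEdge &h, TownNode &a, TownNode &b, int dist) {
    h.ends[0] = &a;
    h.ends[1] = &b;
    h.dist = dist;
    a.hwys.push_back(&h);
    b.hwys.push_back(&h);
}

// Given one end of a highway, return the town at the other end.
TownNode *otherEnd(const HwyEdge &h, const TownNode *t) {
    return (h.ends[0] == t) ? h.ends[1] : h.ends[0];
}
```

Making linkHighway a member of TownNode or HwyEdge would require that class to reach into the private data of the other, which is the tension described above.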
Instead of getting involved in
esoteric discussions, I will present a practical way of managing
data in C and C++ programs, by coding the town/highway example,
using the C++ version of the Code Farms library. I will refer
to it simply as the library. Readers interested in a comparison
of different methods and reasons for the selection of this particular
library may look at [2]. Note that the Code Farms library is also
available in plain C.
STEP 1: Conceptual design
We can start with a naive representation
of data as shown in Fig.1. We have two classes, Town and Hwy.
Each Hwy includes an id# and a distance. Each Town has a variable-length
name, which is kept as a separate object.
Fig.2 shows a more elegant Booch
notation, where we have only one box for each type. Bars represent
relations. Since Booch does not have a special notation for a
graph, we can compose a graph from two 1-TO-N relations. One relation
links highways that start in a given town, the other links highways
that end there.
The program will have a simple
interface consisting of 3 commands:
add [town1] [town2] [id] [distance] ... add highway,
route [town1] [town2] ... find a route,
exit ... exit the program and store data on disk.
We assume that new towns are automatically
created when required and that, when invoking the program, the
old data is automatically restored from the disk.
This means that the names given
by the user will have to be translated to Town pointers. A fast
search by name is required. Fig.3 adds a hash table as a collection
of Towns (1-TO-N relation) on a dummy object Root.
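The name-to-Town translation with automatic creation can be sketched in standard C++ as follows. This is only an illustration of the idea, not the library's hash table; NamedTown, TownTable and getOrCreate are hypothetical names, and std::unordered_map stands in for the hash table that lives on Root.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>

struct NamedTown {
    std::string name;
};

// Plays the role of the dummy Root object: it owns the name-to-Town table.
class TownTable {
public:
    // Return the town with this name, creating it on first use.
    NamedTown *getOrCreate(const std::string &name) {
        auto it = table_.find(name);
        if (it != table_.end()) return it->second.get();
        auto town = std::make_unique<NamedTown>();
        town->name = name;
        NamedTown *raw = town.get();
        table_.emplace(name, std::move(town));
        return raw;
    }

    std::size_t size() const { return table_.size(); }

private:
    std::unordered_map<std::string, std::unique_ptr<NamedTown>> table_;
};
```

Calling getOrCreate twice with the same name returns the same pointer, so the command interpreter never needs to ask whether a town already exists.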
The route searching algorithm
will expand from town1 to its neighbours, then expand to
their neighbours, updating the distance (best), and using
pointer route to record the best route direction. The algorithm
will need a circular stack of Towns that were updated in the last
round - you can imagine these towns as the front of a wave expanding
from town1.
For this reason, Fig.3 adds another
collection of Towns (wave) under Root. Note that the basic
collection in the library is a ring. Compare Fig.3 with Fig.4,
which shows the pointers that will actually be used. The complexity
of this network is the reason why C and C++ programs are difficult
to debug without a data structure library.
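The wave expansion described above can be sketched in standard C++ without the library. This is a plain label-correcting search written under my own assumptions: towns and highways are addressed by index, std::deque stands in for the wave, and the names Edge, Node and expandWave are hypothetical. It runs until the wave is empty and does not attempt any early termination.

```cpp
#include <climits>
#include <deque>
#include <vector>

struct Edge {
    int a, b;     // indices of the two towns joined by this highway
    int dist;     // length of the highway
};

struct Node {
    std::vector<int> edges;  // indices of adjacent highways
    int best = INT_MAX;      // shortest distance found so far
    int route = -1;          // highway leading back toward the start
    bool inWave = false;     // true while the town waits in the wave
};

// Expand a wave from 'start'. Afterwards nodes[i].best holds the shortest
// distance from 'start' (INT_MAX if unreachable). Distances must be >= 0.
void expandWave(std::vector<Node> &nodes, const std::vector<Edge> &edges,
                int start) {
    std::deque<int> wave;
    nodes[start].best = 0;
    nodes[start].inWave = true;
    wave.push_back(start);
    while (!wave.empty()) {
        int t = wave.front();
        wave.pop_front();
        nodes[t].inWave = false;
        for (int e : nodes[t].edges) {
            int tt = (edges[e].a == t) ? edges[e].b : edges[e].a;
            if (nodes[t].best + edges[e].dist < nodes[tt].best) {
                nodes[tt].best = nodes[t].best + edges[e].dist;
                nodes[tt].route = e;           // remember the best direction
                if (!nodes[tt].inWave) {       // join the wave only once
                    nodes[tt].inWave = true;
                    wave.push_back(tt);
                }
            }
        }
    }
}
```

Following the route fields back from the destination recovers the path, in the same way the route pointers are used in the text.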
The last important consideration
is the persistence of the data. We assume that, after invoking
the program, old data will automatically be pulled in, and when
exiting the program, the data will be stored to the same file,
called backup. Each of these steps is only one command
in the library.
STEP 2: Mapping data into the
library
We will use one class for each
object type, with two special statements in it:
ZZ_EXT_ links the class to the
library
ZZ_INIT(); initializes all pointers
struct Root {
ZZ_EXT_Root
public:
Root() {ZZ_INIT(Root); }
};
struct Hwy {
int id;
int dist;
ZZ_EXT_Hwy
public:
Hwy() {ZZ_INIT(Hwy); };
Hwy(int a,int b);
int getDist(void) {return dist; }
int getID(void){return id;}
};
struct Town {
ZZ_EXT_Town
public:
Town() { ZZ_INIT(Town); };
Town(char *nm);
int best; // temporary
Hwy *route; // temporary
};
After declaring all object types
(classes), we declare the relations (data organization). If you
are familiar with databases, you can think about the following
statements as being similar to a database schema. In a way, they
resemble templates, but something much more complex and efficient
happens behind the scenes:
ZZ_HYPER_SINGLE_GRAPH(netw,Town,Hwy);
ZZ_HYPER_NAME(name,Town);
ZZ_HYPER_HASH(hash,Root,Town);
ZZ_HYPER_DOUBLE_COLLECT(wave,Root,Town);
ZZ_HYPER_UTILITIES (util);
These five lines precisely declare
the entire organization. We have:
SINGLE_GRAPH - undirected graph with Towns as nodes and Hwys as
edges; internally, edges adjacent to each node form a singly-linked
list; netw is an identification name for this organization.
NAME - similar to the string class; assigns a variable length name
to each Town; its identification name is name.
HASH - generic hash table of Towns; it lives on a Root; its identification
name is hash. It can be controlled by a user-provided hashing function,
or by a default function from the library.
DOUBLE_COLLECT - double collection which encapsulates a doubly linked
list of Towns under a Root; wave is its id name.
UTILITIES - memory allocation and disk IO (persistence) utilities.
The word HYPER refers to a new concept of the hyper-class, used in this
particular library.
Each HYPER declaration creates
one instance of a special interface class, which contains no data,
only methods for the given data organization. Even though these
classes are global, they cannot be used out of local scope, if
the class on which they operate is local.
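One way to picture such an interface class in standard C++ is the sketch below. It is not the library's implementation, and the names WaveTown, WaveRoot and Ring are hypothetical. The link pointers live inside the participating objects, while the Ring class holds no data and only provides the operations on a doubly-linked ring. As an assumption suggested by the test if(!fw) in the listing, a NULL forward pointer marks a town that is not in the ring.

```cpp
// Child object: the ring pointers are embedded in the object itself.
struct WaveTown {
    WaveTown *fwd = nullptr;   // next town in the ring, nullptr if not in it
    WaveTown *bwd = nullptr;   // previous town in the ring
};

// Parent object: a single pointer to the tail of the ring.
struct WaveRoot {
    WaveTown *tail = nullptr;  // last town added, nullptr if ring is empty
};

// Interface class: no data members, only operations on the relation.
struct Ring {
    static bool contains(const WaveTown *t) { return t->fwd != nullptr; }

    // First town of the ring, or nullptr if the ring is empty.
    static WaveTown *first(const WaveRoot *r) {
        return r->tail ? r->tail->fwd : nullptr;
    }

    // Append t at the end of the ring; does nothing if t is already linked.
    static void add(WaveRoot *r, WaveTown *t) {
        if (contains(t)) return;
        if (!r->tail) {
            t->fwd = t->bwd = t;          // single-element ring
        } else {
            t->fwd = r->tail->fwd;        // new tail points to the head
            t->bwd = r->tail;
            r->tail->fwd->bwd = t;
            r->tail->fwd = t;
        }
        r->tail = t;
    }

    // Remove t from the ring under r; assumes t is in that ring if linked.
    static void del(WaveRoot *r, WaveTown *t) {
        if (!contains(t)) return;
        if (t->fwd == t) {
            r->tail = nullptr;            // t was the only element
        } else {
            t->bwd->fwd = t->fwd;
            t->fwd->bwd = t->bwd;
            if (r->tail == t) r->tail = t->bwd;
        }
        t->fwd = t->bwd = nullptr;        // mark as disconnected
    }
};
```

Because Ring has no state of its own, the relation can be added to or removed from a design without touching the methods of the Town or Root classes.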
STEP 3: Coding the algorithm
Once the organization is declared,
the library automatically provides all functions required for
its access/modification.
The entire program, including
the data declarations above, is only 180 lines long and took half
a day to code and fully debug. I apologize for the small font and
crowded coding style in the enclosed listing; this paper is limited
to 6 pages.
Comments in the listing will help
you with the logic of the algorithm:
(0) contains data declaration shown in the paper
(1) strAlloc() allocates another copy of string nm
(2) route and best are initialized for the routing
(3) adding highway h between towns t[0] and t[1]
(4) add the name to the dummy Town object, s
(5) hash table search (this table lives on root)
(6) adding t to the hash table
(7) follow route pointers from town2 to town1
(8) for given Hwy h, nodes() returns adjacent Towns
(9) if the first child of wave is NULL, the wave is empty
(10) for() loop goes through the wave round-robin style.
When t==child, it is the end of the next round.
(11) iterator ni runs through all Hwys adjacent to t
(12) add tt to the wave
(13) delete tt from the wave
(14) clear() is similar to search(), but no cost is
considered. The expansion proceeds only through
Towns modified in the last search (route!=NULL).
(15) start search by putting town1 into the wave
(16) testing whether there is file backup
(17) re-opening the old data in one command
(18) for the first call, start with a new root,
and form a new hash table.
(19) interactive loop that reads commands
(20) one command saves all data to disk
(21) functions that control hashing refer to
ZZhashStr() provided by the library
(22) this file has been automatically generated
by the class generator, which is a part
of the library.
Conclusions:
The paper demonstrates how this
new method improves code clarity. From the conceptual design up
to final debugging, clear organization makes the software more
systematic and organized. Experience on large industrial projects
indicates, on average, 2-3 times faster coding and debugging,
with much improved maintenance and code re-usability.
References:
[1] Soukup J.: Organized C: A unified method of
handling data in CAD algorithms and databases,
ACM/IEEE 27th Design Automation Conference, June
1990, pp.425-429.
[2] Soukup J.: Beyond templates, C++ Report, May
1992 (to be published).
#include <stdio.h>
#include <string.h>
#define Zmain
#include "zzincl.h"                         // (22)
#include "data.h"                           // (0)
#define INF 0x7FFFFFFF

Town::Town(char *nm){
    ZZ_INIT(Town);
    char *n=util.strAlloc(nm);              // (1)
    name.add(this,n);
    route=NULL; best=INF;                   // (2)
}
Hwy::Hwy(int a, int b){ ZZ_INIT(Hwy); id=a; dist=b; }
Root *root;
//------------------------------------
// add a new hwy between two towns
void addRoute(char *name1, char *name2, int id, int dist){
    Town *t[2],*getTown(char *); Hwy *h;
    t[0]=getTown(name1); t[1]=getTown(name2);
    h=new Hwy(id,dist);
    netw.add(t,h);                          // (3)
}
//------------------------------------
// get town, create new if not found
Town *getTown(char *nm){
    Town *t; static Town s;
    name.add(&s,nm);                        // (4)
    t=hash.get(root,&s);                    // (5)
    nm=name.del(&s);
    if(!t){
        t=new Town(nm);
        hash.add(root,t);                   // (6)
    }
    return(t);
}
//-------------------------------------
// print a route, find total distance
void prtRoute(Town *t){                     // (7)
    int tot; Hwy *h; Town *tt[2];
    tot=0;
    if(!(t->route)) { printf("no connection\n"); return; }
    while(t){
        printf("%s",name.fwd(t));
        h=t->route; if(!h)break;
        printf("(%d) ",h->getID());
        netw.nodes(tt,h);                   // (8)
        if(tt[0]==t)t=tt[1]; else t=tt[0];
        tot+=h->getDist();
    }
    printf(" dist=%d\n",tot);
}
//--------------------------------------
// find route, mark by 'route' pointers
void search(Town *t2){
    Town *t,*tt,*fw,*nt; Hwy *h; int bot;
    bot=0;
    for(t=wave.child(root);;t=nt){          // (9)
        if(t==wave.child(root)){            // (10)
            if(t2->route && t2->best<=bot)break;
            bot=INF;
        }
        if(t->best<bot)bot=t->best;
        netw_iterator ni(t);
        while(h=ni++) {                     // (11)
            tt=ni.adj();
            if(t->best+h->dist<tt->best) {
                tt->route=h; tt->best=t->best+h->dist;
                fw=wave.fwd(tt);
                if(!fw)wave.add(root,tt);   // (12)
            }
        }
        nt=wave.bwd(t);
        wave.del(root,t);                   // (13)
        if(nt==t)break;
    }
}
//----------------------------------------
// re-initialize used Towns
void clear(Town *t1){                       // (14)
    Town *t,*tt,*fw,*nt; Hwy *h;
    wave.add(root,t1);
    for(t=wave.child(root);;t=nt){
        if(t->route||t==t1){
            netw_iterator ni(t);
            while(h=ni++) {
                tt=ni.adj();
                if(tt->route){
                    fw=wave.fwd(tt);
                    if(!fw)wave.add(root,tt);
                }
            }
        }
        nt=wave.bwd(t); wave.del(root,t);
        t->route=NULL; t->best=INF;
        if(nt==t)break;
    }
}
//------------------------------------------
// for given town names, find the route
void findRoute(char *town1,char *town2){
    Town *t1,*t2,*getTown(char *);
    t1=getTown(town1); t2=getTown(town2);
    if(t1==t2) { printf("same point\n"); return; }
    wave.add(root,t1);                      // (15)
    t1->best=0; search(t2); prtRoute(t2); clear(t1);
}
//-------------------------------------------
#define BSIZE 80
int main(void){
    char buff[BSIZE],cmd[BSIZE],town1[BSIZE],town2[BSIZE];
    static char *t[]={"Root"}; char *v[1];
    FILE *fp; int id,dist;
    fp=fopen("backup","r");                 // (16)
    if(fp){
        fclose(fp);
        util.open("backup",1,v,t);          // (17)
        root=(Root *)(v[0]);
    } else {
        root=new Root; hash.form(root,200); // (18)
    }
    while(fgets(buff,BSIZE,stdin)) {        // (19)
        sscanf(buff,"%s",cmd);
        if(!strcmp(cmd,"route")) {
            sscanf(buff,"%s %s %s",cmd,town1,town2);
            findRoute(town1,town2);
        } else if(!strcmp(cmd,"add")){
            sscanf(buff,"%s %s %s %d %d",cmd,town1,town2,&id,&dist);
            addRoute(town1,town2,id,dist);
        } else if(!strcmp(cmd,"exit"))break;
        else printf("please try again \n");
    }
    v[0]=(char *)root;
    util.save("backup",1,v,t);              // (20)
    return(0);
}
//-------------------------------------------------
int zz_hashCmp_hash(char *t1,char *t2){     // (21)
    return(strcmp(name.fwd((Town *)t1), name.fwd((Town *)t2)));
}
int zz_hashInd_hash(char *t,int size){
    int ZZhashStr(char *,int);
    return(ZZhashStr(name.fwd((Town *)t),size));
}
#include "zzfunc.c"                         // (22)