Computer System Research Group
This data structure is general enough to encompass protocol to link layer address translation such as the Address Resolution Protocol (ARP), and the End System to Intermediate System Protocol (ES-IS), and should apply to any hierarchical routing scheme, such as source and quality-of-service routing, or choosing between multiple Datakits on a single system.
The system uses a message oriented mechanism to communicate between the kernel and user processes to maintain the routing database, inform user processes of spontaneous events such as redirects, routing lookup failures, and suspected timeouts through gateways.
The traditional hash-based scheme of routing lookups would perform poorly in this environment. The older algorithm assumed that it would be cheap to compute hashes, that one could easily identify the network portion of an address, and easily compare them.
It is likely to be expensive to compute the hash of a 20 byte address. Moreover, where there are multiple hierarchies, it would be complicated and context dependent to identify which portion of the address should be considered as the network portion for comparison at changing levels. In general, it is not apparent how to accommodate hierarchies while using hashing, other than rehashing for each level of hierarchy possible.
Van Jacobsen, of the Lawrence Berkeley Laboratory, suggested using the PATRICIA algorithm (described below), but with an additional invariant to maintain a routing tree. This meshes extremely well with notions of multiple hierarchical defaults, and the cost of an entire lookup is approximately the same as the cost of computing a single hash.
Since there is now a means to store variable length addresses, and reason to use addresses of differing sizes within a given route (using a protocol destination address with a link-layer gateway to accomplish ARP-like translation, for example), it was decided that using fixed length ioctl's to communicate between the kernel and routing process would be too restrictive. Instead, a message based mechanism is used for passing routing information to and from the kernel. This mechanism provides additional potential for remote management when future releases supply the ability to splice communications channels.
Some of the routing protocols currently in use give us criteria for making this choice, in what may seem a bizarre way: the space of addresses is partitioned into a set of equivalence classes by specifying a pair consisting of a prototype address and a bitmask; a test address is deemed to belong to the class if any bit in which it differs from the prototype address corresponds to a zero bit of the provided mask.
Let's give an example, using the protocol addresses for the Internet Family [Post], which are 32-bit numbers:
Example 1: Some Address Classes Prototype Mask ClassName 0x80030000 0xffff0000 LBL 0x80200000 0xffff0000 Berkeley 0x80208200 0xffffff00 CsDivSubnet 0x80209600 0xffffff00 SpurSubnet 0 0 TheOutsideThe author's machine (okeeffe.Berkeley.EDU) has the address 0x80208203. Consequently, it belongs to the classes Berkeley, CsDivSubnet, TheOutside, but not LBL nor SpurSubnet. With each class is associated a networking interface, in most cases a next-hop agent, and a collection of other useful information, and that collection is referred to as a route. Continuing the example, okeeffe.Berkeley.EDU can talk directly to any system in the class CsDivSubnet (all such systems are on a single ethernet), but requires an intermediary to talk to anybody else.
The routing lookup problem is to find the most specific class containing a given protocol address. Paradoxically, that will be the one with largest number of one bits in the mask. The NSF net may provide a regional router with about 2000 routes of this type. The lookup algorithm must look up the appropriate class quickly (among both numerous and lengthy addresses), and yet have nice properties with respect to masks.
We build a tree with internal nodes and leaves. The leaves will represent address classes, and will contain information common to all possible destinations in each class. As such, there will be at least a mask and prototype address. Each internal node represents a bit position to test. Given the tree and a candidate address thought of as a sequence of bits, the lookup algorithm is as follows:
Once we arrive at a leaf node, we need to check whether we have selected the appropriate class. The class may consist of a single host. This is a special case where the mask consists of all one bits, but it is a common enough occurence that we check for it (by use of a null pointer for the mask), and do an outright string compare. Otherwise, in which case we perform the masking operation.
It is possible to have the same prototype address with differing masks; this is handled by a linked list of leaf nodes. This arises due to boundary conditions for the smallest representation of the default route (which collides with the boundary marker for the empty tree). It also arises if you want to route to subnet 0 of a subnetted class A or B internet address.
If the leaf node isn't correct, then we backtrack up the tree looking for indications that a more general mask may apply (i.e. one having fewer one bits). This may happen if we are asked to look up an address other than the prototype addresses used to construct the tree. Rather than keep a separate stack of nodes traversed while searching the tree, backtracking is facilitated by having explicit parent pointers in each node. This also facilitates deletion, and allows non-recursive walks of the tree.
scale=100 define m0 | [ box invis ht 48 wid 120 with .sw at 0,0 line from 0,24 to 120,24 box ht 48 wid 120 with .nw at 0,48 ] | define m1 | [ box invis ht 71 wid 63 with .sw at 0,0 line from 24,32 to 0,0 circle rad 23 at 40,48 ] | box invis ht 359 wid 528 with .sw at 0,0 line from 320,196 to 324,207 dotted line from 299,225 to 317,223 dotted line from 300,225 to 282,213 dotted line from 282,213 to 262,185 dotted line -> from 262,185 to 254,128 dotted "\f1\s10\&default\f1\s0" at 57,300 "\f1\s9\&m = 0xffffff00\f1\s0" at 469,11 "\f1\s9\&m = 0xffffff00\f1\s0" at 309,12 "\f1\s9\&p = 0x80208200\f1\s0" at 309,36 "\f1\s9\&p = 0x80200000\f1\s0" at 227,116 "\f1\s9\&m = 0xffff0000\f1\s0" at 149,172 "\f1\s9\&p = 0x80030000\f1\s0" at 149,196 "\f1\s9\&m = 0x00000000\f1\s0" at 60,252 "\f1\s9\&p = 0x00000000\f1\s0" at 60,276 line from 400,80 to 432,48 m0 with .nw at 408,48 m0 with .nw at 248,48 m0 with .nw at 168,128 m0 with .nw at 88,208 m0 with .nw at 0,288 line from 152,320 to 208,272 line from 240,240 to 288,192 line from 320,160 to 368,112 m1 with .nw at 344,120 m1 with .nw at 264,200 m1 with .nw at 184,280 m1 with .nw at 96,359 "\f1\s9\&m = 0xffff0000\f1\s0" at 226,92 "\f1\s9\&p = 0x80209600\f1\s0" at 469,36 "\f1\s10\&SpurSubnet\f1\s0" at 472,60 "\f1\s10\&CsDivSubnet\f1\s0" at 303,61 "\f1\s10\&Berkeley\f1\s0" at 225,141 "\f1\s10\&LBL\f1\s0" at 147,219 "\f1\s9\&bit 0\f1\s0" at 137,335 "\f1\s9\&bit 10\f1\s0" at 225,256 "\f1\s9\&bit 19\f1\s0" at 384,96 "\f1\s9\&bit 16\f1\s0" at 304,176 line from 317,223 to 324,207 dotted .PE .DS C Figure 1. .DEFigure 1. shows how one would construct a reduced radix tree to decide among the prototype addresses given in the example of address classes. In examining the address for okeeffe.Berkeley.EDU (0x80208203), we find that bit 0 is on (0x80000000: go right), bit 10 is on (0x00200000: go right), bit 16 is on (0x00008000: go right), but that bit 19 is off (0x00001000: go left). And, in fact okeeffe does match the CsDivSubnet class.
If we were to look up another machine at Berkeley, say miro.Berkeley.EDU (0x80209514), we are driven down to the SpurSubnet class, which does not match. So we backtrack up to the second internal node above it, which has an indication that there is a mask which may apply. This is represented in the diagram by the dotted line, which actually points to data associated with mask contained in the leaf, rather than the leaf itself.