A Tree-Based Packet Routing Table for Berkeley Unix

Keith Sklower

Computer System Research Group
EECS Department, Computer Science Division
University of California
Berkeley, California 94720

Abstract

Packet forwarding for OSI poses strong challenges for routing lookups: the algorithm must be able to efficiently accommodate variable length, and potentially very long addresses. The 4.3 Reno release of Berkeley UNIX uses a reduced radix tree to make decisions about forwarding packets.

This data structure is general enough to encompass protocol to link layer address translation such as the Address Resolution Protocol (ARP), and the End System to Intermediate System Protocol (ES-IS), and should apply to any hierarchical routing scheme, such as source and quality-of-service routing, or choosing between multiple Datakits on a single system.

The system uses a message oriented mechanism to communicate between the kernel and user processes to maintain the routing database, inform user processes of spontaneous events such as redirects, routing lookup failures, and suspected timeouts through gateways.

Introduction

An important focus of the 4.3 Reno release of Berkeley UNIX was to make support for the OSI protocols publicly available. OSI addresses are typically very long (20 bytes) and with the explosive growth of the Internet, a router may have to contend with thousands of them.

The traditional hash-based scheme of routing lookups would perform poorly in this environment. The older algorithm assumed that it would be cheap to compute hashes, that one could easily identify the network portion of an address, and easily compare them.

It is likely to be expensive to compute the hash of a 20 byte address. Moreover, where there are multiple hierarchies, it would be complicated and context dependent to identify which portion of the address should be considered as the network portion for comparison at changing levels. In general, it is not apparent how to accommodate hierarchies while using hashing, other than rehashing for each level of hierarchy possible.

Van Jacobsen, of the Lawrence Berkeley Laboratory, suggested using the PATRICIA algorithm (described below), but with an additional invariant to maintain a routing tree. This meshes extremely well with notions of multiple hierarchical defaults, and the cost of an entire lookup is approximately the same as the cost of computing a single hash.

Since there is now a means to store variable length addresses, and reason to use addresses of differing sizes within a given route (using a protocol destination address with a link-layer gateway to accomplish ARP-like translation, for example), it was decided that using fixed length ioctl's to communicate between the kernel and routing process would be too restrictive. Instead, a message based mechanism is used for passing routing information to and from the kernel. This mechanism provides additional potential for remote management when future releases supply the ability to splice communications channels.

Kernel Issues

Routing Lookups

Restatement of the Problem.

Let's describe the problem once again in a little more detail: A packet arrives with a very long protocol address. If the destination address is not that of the local system, one wants to decide quickly how to forward it. This decision entails choosing a network interface and a next-hop agent. (Point-to-point links only have one agent on the other end of the link, so sometimes it's enough just to figure out which link!).

Some of the routing protocols currently in use give us criteria for making this choice, in what may seem a bizarre way: the space of addresses is partitioned into a set of equivalence classes by specifying a pair consisting of a prototype address and a bitmask; a test address is deemed to belong to the class if any bit in which it differs from the prototype address corresponds to a zero bit of the provided mask.

Let's give an example, using the protocol addresses for the Internet Family [Post], which are 32-bit numbers:

Example 1:  Some Address Classes

        Prototype        Mask        ClassName
        0x80030000       0xffff0000  LBL
        0x80200000       0xffff0000  Berkeley
        0x80208200       0xffffff00  CsDivSubnet
        0x80209600       0xffffff00  SpurSubnet
        0                0           TheOutside

The author's machine (okeeffe.Berkeley.EDU) has the address 0x80208203. Consequently, it belongs to the classes Berkeley, CsDivSubnet, TheOutside, but not LBL nor SpurSubnet. With each class is associated a networking interface, in most cases a next-hop agent, and a collection of other useful information, and that collection is referred to as a route. Continuing the example, okeeffe.Berkeley.EDU can talk directly to any system in the class CsDivSubnet (all such systems are on a single ethernet), but requires an intermediary to talk to anybody else.

The routing lookup problem is to find the most specific class containing a given protocol address. Paradoxically, that will be the one with largest number of one bits in the mask. The NSF net may provide a regional router with about 2000 routes of this type. The lookup algorithm must look up the appropriate class quickly (among both numerous and lengthy addresses), and yet have nice properties with respect to masks.

The algorithm

The collection of prototype addresses are assembled into a variant of a PATRICIA tree, which is technically a binary radix tree with one-way branching removed. (In fact some writers call any tree with explicit external and internal nodes a trie). Although this algorithm is given a lengthy exposition in [Sedg] and also is discussed in [Knut] and [Morr], we will review it here.

We build a tree with internal nodes and leaves. The leaves will represent address classes, and will contain information common to all possible destinations in each class. As such, there will be at least a mask and prototype address. Each internal node represents a bit position to test. Given the tree and a candidate address thought of as a sequence of bits, the lookup algorithm is as follows:

Set node to the top of the tree.
If at a leaf node, stop.
Extract a bit position to test.
If that bit of the candidate address is on, set the node to the right child of the current node, otherwise set node to the left child.
Repeat steps 2 - 4.

Once we arrive at a leaf node, we need to check whether we have selected the appropriate class. The class may consist of a single host. This is a special case where the mask consists of all one bits, but it is a common enough occurence that we check for it (by use of a null pointer for the mask), and do an outright string compare. Otherwise, in which case we perform the masking operation.

It is possible to have the same prototype address with differing masks; this is handled by a linked list of leaf nodes. This arises due to boundary conditions for the smallest representation of the default route (which collides with the boundary marker for the empty tree). It also arises if you want to route to subnet 0 of a subnetted class A or B internet address.

If the leaf node isn't correct, then we backtrack up the tree looking for indications that a more general mask may apply (i.e. one having fewer one bits). This may happen if we are asked to look up an address other than the prototype addresses used to construct the tree. Rather than keep a separate stack of nodes traversed while searching the tree, backtracking is facilitated by having explicit parent pointers in each node. This also facilitates deletion, and allows non-recursive walks of the tree.

A Lookup Example

scale=100
define m0 |
[ box invis ht 48 wid 120 with .sw at 0,0
line  from 0,24 to 120,24 
box ht 48 wid 120 with .nw at 0,48 
] |

define m1 |
[ box invis ht 71 wid 63 with .sw at 0,0
line  from 24,32 to 0,0 
circle rad 23 at 40,48
] |

box invis ht 359 wid 528 with .sw at 0,0
line  from 320,196 to 324,207 dotted
line  from 299,225 to 317,223 dotted
line  from 300,225 to 282,213 dotted
line  from 282,213 to 262,185 dotted
line -> from 262,185 to 254,128 dotted
"\f1\s10\&default\f1\s0" at 57,300
"\f1\s9\&m = 0xffffff00\f1\s0" at 469,11
"\f1\s9\&m = 0xffffff00\f1\s0" at 309,12
"\f1\s9\&p = 0x80208200\f1\s0" at 309,36
"\f1\s9\&p = 0x80200000\f1\s0" at 227,116
"\f1\s9\&m = 0xffff0000\f1\s0" at 149,172
"\f1\s9\&p = 0x80030000\f1\s0" at 149,196
"\f1\s9\&m = 0x00000000\f1\s0" at 60,252
"\f1\s9\&p = 0x00000000\f1\s0" at 60,276
line  from 400,80 to 432,48 
m0 with .nw at 408,48
m0 with .nw at 248,48
m0 with .nw at 168,128
m0 with .nw at 88,208
m0 with .nw at 0,288
line  from 152,320 to 208,272 
line  from 240,240 to 288,192 
line  from 320,160 to 368,112 
m1 with .nw at 344,120
m1 with .nw at 264,200
m1 with .nw at 184,280
m1 with .nw at 96,359
"\f1\s9\&m = 0xffff0000\f1\s0" at 226,92
"\f1\s9\&p = 0x80209600\f1\s0" at 469,36
"\f1\s10\&SpurSubnet\f1\s0" at 472,60
"\f1\s10\&CsDivSubnet\f1\s0" at 303,61
"\f1\s10\&Berkeley\f1\s0" at 225,141
"\f1\s10\&LBL\f1\s0" at 147,219
"\f1\s9\&bit 0\f1\s0" at 137,335
"\f1\s9\&bit 10\f1\s0" at 225,256
"\f1\s9\&bit 19\f1\s0" at 384,96
"\f1\s9\&bit 16\f1\s0" at 304,176
line  from 317,223 to 324,207 dotted
.PE
.DS C
Figure 1.
.DE

Figure 1. shows how one would construct a reduced radix tree to decide among the prototype addresses given in the example of address classes. In examining the address for okeeffe.Berkeley.EDU (0x80208203), we find that bit 0 is on (0x80000000: go right), bit 10 is on (0x00200000: go right), bit 16 is on (0x00008000: go right), but that bit 19 is off (0x00001000: go left). And, in fact okeeffe does match the CsDivSubnet class.

If we were to look up another machine at Berkeley, say miro.Berkeley.EDU (0x80209514), we are driven down to the SpurSubnet class, which does not match. So we backtrack up to the second internal node above it, which has an indication that there is a mask which may apply. This is represented in the diagram by the dotted line, which actually points to data associated with mask contained in the leaf, rather than the leaf itself.