Reducing BGP Update Noise
Overview
The underlying issues with scalability of the Internet’s
inter-domain routing system and the Border Gateway
Protocol (BGP) contain a number of persistent themes. These
themes relate to the overall stability of the routing protocol and the
performance of the packet
forwarding subsystem when dealing with ever-larger routing domains,
finer granularity of information and
ever-denser levels of element interconnection.
The study of scalability of routing
breaks into the sub-topics of memory, processing and bandwidth. The
larger the routing system grows the larger the required search space in
order to perform the correct lookup operation for packet forwarding,
inferring a requirement for larger memory for forwarding lookups. The
combination of a larger routing space and a denser set of
interconnections also increases the memory requirements for the
operation of the routing protocol itself. The combinations of the
absolute size of the routing space and the level of dynamic update, or
stability, have processing implications. The higher the level of
dynamic changes in the routing state, then the greater the processing
requirement that is imposed on each routing element. If the processing
element can fall behind in processing the update load then there are
implications in terms of the accuracy of the routing state, and
potential implications in increased memory demands to hold the
backlogged queues of unprocessed updates. And because routing is a
distributed protocol, the greater the size of the routing space, the
more densely interconnected the network and, more importantly, the
greater the level of dynamic change, then the larger the amount of
generated routing traffic and, by implication the higher the bandwidth
demands for routing itself. So as the routing system grows so does the
demand for additional memory processing capacity and bandwidth.
Of the three sub-topics, bandwidth is
perhaps the lesser of these scalability concerns, leaving memory and
processing loads as consistent themes in studies of the routing system
over many years. Early efforts to reduce the routing update load
focussed on the perception at the time that this was caused by an
unstable edge link that caused a route object to be announced, then
withdrawn, then announced again, and so on for extended periods of
time, and often at quite high frequency. A response to this type of
load in BGP was the introduction of Route Flap Damping. Subsequent
investigation has revealed that this form of edge behaviour is
relatively rare and current operational advice is to disable this
damping response in BGP routers.
Subsequent efforts in attempting to
mitigate the BGP Update load include use of the Minimum Route
Advertisement Interval (MRAI) timer to enforce a minimum period of 30
seconds between successive BGP advertisements of the same route object
to the same BGP peer, It has been noted that the heterogeneous
deployment of this timer in commonly deployed BGP configurations has
lead to some degree of amplification of prefix instability in certain
cases and extended convergence time intervals. Another recent effort is
combining TCP blocking with output queue compression, so that when the
output queue builds up the BGP speaker ensures that only a single entry
for each route object is in the queue. Enqueuing a new route object has
a side effect of removing any previously queued update that refers to
the same route object.
These measures are relatively general in
scope, and appear to be still somewhat ineffectual in addressing
overall growth in routing protocol update processing loads.
We propose to undertake a detailed
analysis of BGP updates collected at various locations in the network
in order to determine the relationship of the update sequences to
likely root cause events. The particular sequences that are of interest
are here those that relate to the behaviour of the BGP distance vector
algorithm undertaking path hunting prior to a complete withdrawal of
the prefix, and update sequences that relate to BGP path hunting to the
preferred path, or “path affinity”.
In this work program we propose to
undertake a analysis of BGP Update logs gathered from a number of
aggregated route views collectors, augmented by a number of more
specific views of the default-free routing system. The intended
approach in this analysis is to identify sequences of updates that
refer to the same prefix, and that are closely clustered in time, and
to associate with the profile of such update sequences a likely root
cause and the outcome of routing convergence.
We then propose to analyse the likely
effects of modification of the MRAI timer to investigate the outcome of
applying a set of heuristics to the both the application of the timer
and the length of the timer for each route update.
We then propose to investigate the
interactions of this form of BGP update processing behaviour with
commonly deployed BGP implementations when operating as a default-free
router in the public Internet.
To achieve this we propose to implement
output queue compression in the public domain implementation of BGP,
Quagga. We then propose to implement a modified MRAI implementation
that includes heuristics that identify update sequences and perform
selective suppression of BGP updates in those cases where the update
sequence matches the heuristics of withdrawal-triggered path
exploration, and where there appears to be a match of a path
exploration to route that has strong local affinity.