Test 3 Traffic Engineering Flashcards Preview


Traffic Engineering defined

Traffic engineering is a fancy way of describing how network operators deal with large amounts of data flowing through their networks

Traffic engineering is the process of reconfiguring the network in response to changing traffic loads to achieve some operational goal.


Example operational goal accomplished by traffic engineering

A network operator might reconfigure the network in response to changing loads to, for example, maintain traffic ratios in a peering relationship, relieve congestion on certain links, or balance load more evenly across the available links in the network.


Key questions traffic engineering tries to address

how routing should adapt to traffic
--traffic engineering attempts to avoid congested links and satisfy certain application requirements such as delay.
--TCP senders send less traffic during congestion and routing protocols will adapt to topology changes. The problem is that even though these protocols are designed to adapt to various changes, the network may not run efficiently. There may be congested links when idle paths exist or there might be a high-delay path that some traffic is taking when a low-delay path, in fact, exists.


intradomain traffic engineering

how to reconfigure the protocols within a single autonomous system to adjust traffic flows


interdomain traffic engineering

how to adjust how traffic flows between autonomous systems

In a single autonomous system with static link weights, routers flood information to one another to learn the network topology, including the weights on the links connecting individual routers.
--By configuring the link weights, an operator can affect how traffic flows through the network: changing the link weight configuration changes the shortest paths between points in the graph.
--For example, an operator can shift traffic away from a congested link by changing a weight.

By adjusting the link weights in an intra-domain topology, the operator can affect how traffic flows between different points in the network, thus affecting the load on the network links. In practice, network operators set these link weights in a variety of ways: inversely proportional to capacity, proportional to propagation delay, or based on some network-wide optimization over measured traffic.


Steps to Traffic engineering

1. Measurement
--measuring the network to figure out the current traffic loads

2. Modeling
--forming a model of how configuration affects the underlying paths in the network

3. Control
--reconfiguring the network to exert control over how the traffic flows through the network


Intradomain application of three steps of traffic engineering

1. Intradomain traffic engineering attempts to solve an optimization problem. The input is the graph G = (R, L), where R is the set of routers and L is the set of unidirectional links; each link l has a fixed capacity c_l. Another input is the traffic matrix, or offered traffic load, where M_ij represents the traffic load from router i to router j.

2. The output of the optimization problem is a set of link weights, where w_l is the weight on any unidirectional link l in the network topology.

3. Ultimately, the setting of these link weights should result in some fraction of the traffic from i to j traversing each link l, such that those fractions satisfy the network-wide objective function. Defining an objective function is tricky. We might, for example, minimize the load on the most congested link, evenly split traffic loads across links, and so forth.


Link Utilization Function

represents the cost of congestion, which increases roughly quadratically as the load on a link increases, becoming increasingly expensive as link utilization approaches one.


utilization defined

the amount of traffic on the link divided by the capacity


utilization objective

minimize the sum of the piecewise linear cost function over all the links in the network

Unfortunately, solving this optimization problem is NP-complete, meaning there is no known efficient algorithm to find the optimal setting of link weights, even for simple objective functions. The implication is that we have to resort to searching through a large set of combinations of link weight settings to find a good setting. Clearly, searching through all possible link weight combinations is expensive, but the graphs turn out to be small enough in practice that this approach is still reasonably effective.
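The piecewise-linear cost function and the network-wide objective can be sketched as follows. This is a minimal illustration; the breakpoints and slopes below are assumptions chosen to show the shape of the function (cheap at low utilization, sharply more expensive near full utilization), not values from any particular paper.

```python
# Illustrative piecewise-linear link cost: the slope (marginal cost) grows
# as utilization u = load / capacity approaches 1. Breakpoints are assumed.
def link_cost(load, capacity):
    u = load / capacity
    # (utilization breakpoint, slope) pairs -- steeper slopes near u = 1
    segments = [(1 / 3, 1), (2 / 3, 3), (0.9, 10), (1.0, 70), (float("inf"), 500)]
    cost, prev = 0.0, 0.0
    for limit, slope in segments:
        if u <= prev:
            break
        cost += slope * (min(u, limit) - prev)
        prev = limit
    return cost

def network_cost(links):
    # links: list of (load, capacity) pairs; the objective described above
    # is the sum of the per-link cost over all links in the network.
    return sum(link_cost(load, cap) for load, cap in links)

# Balanced load is cheaper than the same total load concentrated on one link:
balanced = network_cost([(50, 100), (50, 100)])
skewed = network_cost([(90, 100), (10, 100)])
assert balanced < skewed
```

A weight-search procedure would wrap this in a loop: try a candidate set of link weights, recompute shortest paths and link loads, evaluate `network_cost`, and keep the best setting found.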


Link Utilization concerns

A practical concern is minimizing the number of changes to the network: often changing just one or two link weights is enough to achieve a good traffic load balance.


Requirements for link utilization optimization

resistant to failure

robust to measurement noise.

limit the frequency of changes that we make to the network


examples of interdomain routing

Peering between two ISPs

peering between a university network and its ISP

any peering at an Internet exchange point

Routing between multiple data centers typically involves routing in a wide area, and hence may involve inter-domain routing

But not routing within a single data center, because that is routing in a single domain.


BGP in Interdomain Traffic Engineering

Interdomain traffic engineering involves reconfiguring the Border Gateway Protocol (BGP) policies or configurations that are running on individual routers in the network. Such changes can:

cause routers inside an autonomous system to direct traffic to or away from certain edge links

change the set of egress links for a particular destination.

For example, an operator of autonomous system 1 might observe traffic to destination D traversing the green path. By adjusting BGP policies, the operator might balance load across the two edge links, or shift all of the traffic for that destination to the lower path.

An operator might wish to use interdomain traffic engineering if an edge link is congested, if a link is upgraded, or if there is some violation of a peering agreement. For example, suppose AS1 and AS2 have an agreement that they only send a certain amount of traffic load over a link in a particular time window. If the load exceeds that amount, the operator would need to use BGP to shift traffic from one edge link to another.


3 Interdomain Traffic Engineering Goals

1. predictability
--should be possible to predict how traffic flows will change in response to changes in the network configuration

2. limit the influence of neighbouring domains
--use BGP policies, and changes to those policies, that limit how neighbouring ASes might change their behaviour in response to changes to the BGP configuration that we make in our own network

3. reduce the overhead of routing changes by achieving our traffic engineering goals with changes to as few IP prefixes as possible.


how the inter-domain routing choices of a particular autonomous system can wreak havoc on predictability

Suppose a downstream neighbor is trying to reach the autonomous system at the top of the figure. That AS might wish to relieve congestion on a particular peering link. To do so, the AS might now send traffic to that destination out a different set of autonomous systems. But once the AS makes that change, note that it is choosing a longer AS path: a path of three hops rather than two. In response, the downstream neighbor might decide not to send its traffic for that destination through this autonomous system at all, thus affecting the traffic matrix that this AS sees. So all the work that went into optimizing the traffic load balance for this AS is for naught, because the change it made effectively changed the offered traffic load, and hence the traffic matrix.


Avoiding interdomain traffic engineering affecting predictability

Need to achieve predictable traffic flow changes

avoid making changes that are globally visible. In particular, note that the change above caused the AS path length of the advertisement for this destination to change from two to three. Other neighbors, such as the downstream neighbor here, might decide to use an alternate path as a result of that globally visible routing change. By avoiding these types of globally visible changes, we can achieve predictability.

limit the influence of neighbors. For example, an autonomous system might try to make a path look longer with AS path prepending. If we consider treating paths that have almost the same AS path length as a common group, we can achieve additional flexibility.

enforce a constraint that our neighbors should advertise consistent BGP route advertisements over multiple peering links, should multiple peering links exist. That gives us additional flexibility to send traffic over different egress points to the same autonomous system.

Enforcing consistent advertisements turns out to be difficult in practice, but it is doable. To reduce the overhead of routing changes:

We can group related prefixes. Rather than exploring all combinations of prefixes to move a particular volume of traffic, we can identify routing choices that group routes with the same AS path, and move those groups of prefixes together. This allows us to move groups of prefixes by making tweaks to local preference, using regular expressions on the AS path.

We can also focus on the small fraction of prefixes that carry the majority of traffic. Ten percent of origin ASes are responsible for about 82 percent of outbound traffic, so we can achieve significant gains in rebalancing traffic by focusing on these heavy hitters.


In summary, to achieve predictability in interdomain traffic engineering

effect changes that are not globally visible

enforce consistent advertisements and limit the influence of AS path length, to limit the influence of neighbors

group prefixes according to those that have a common AS path, to reduce the overhead of routing changes

move traffic in terms of groups of prefixes


Multipath Routing

Another way to perform traffic engineering

an operator can establish multiple paths in advance

This approach applies both to intra-domain routing, and inter-domain routing


equal cost multipath routing

Intra-domain routing
-- set link weights such that multiple paths of equal cost exist between two nodes in the graph

Thus traffic will be split across paths that have equal costs through the network.

A source router might also be able to change the fraction of traffic that's sent along each one of these paths. Sending for example 35% along the top path and 65% along the bottom path.

It might even be able to do this based on the level of congestion that's observed along these paths.

The way the router does this is by having multiple forwarding table entries, with different next hops, for outgoing packets to the same destination.


how can a source router adjust paths to a destination when there are multiple paths to the destination

A source router can adjust traffic over multiple paths by having multiple forwarding table entries for the same destination, and splitting traffic flows across the multiple next hops, depending on, for example, the hash of the IP packet header.
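The hash-based next-hop selection described above can be sketched as follows. This is an illustrative sketch, not any vendor's actual hash function: it hashes the flow's 5-tuple so that all packets of one flow pin to one next hop, while different flows spread across the equal-cost paths.

```python
import hashlib

# Illustrative ECMP next-hop selection: hash the 5-tuple of a packet's flow
# and use it to index into the list of equal-cost next hops. Packets of the
# same flow always hash to the same next hop, avoiding reordering.
def pick_next_hop(src_ip, dst_ip, src_port, dst_port, proto, next_hops):
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    digest = hashlib.md5(key).digest()
    index = int.from_bytes(digest[:4], "big") % len(next_hops)
    return next_hops[index]

hops = ["10.0.1.1", "10.0.2.1"]  # two equal-cost next hops (made-up addresses)
a = pick_next_hop("192.168.1.5", "10.9.9.9", 5555, 80, "tcp", hops)
b = pick_next_hop("192.168.1.5", "10.9.9.9", 5555, 80, "tcp", hops)
assert a == b and a in hops  # same flow, same path
```

Weighted splits (e.g., the 35%/65% example above) are typically approximated by giving a path more hash-bucket slots in proportion to its weight.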


Data Center Networking 3 important characteristics

1. Multi-tenancy
--allows a data center provider to amortize the cost of shared infrastructure, but because there are multiple independent users, the infrastructure must also provide some level of security and resource isolation

2. elastic
--as demand for a service fluctuates, the operator can expand and contract these resources. Service can also be pay-per-use: as the need for resources arises or disappears, a provider can adjust how many resources are devoted to a particular service running in the data center

3. flexible service management
--the ability to move workloads to other locations inside the data center.
--For example, as load changes for a particular service, an operator may need to provision additional virtual machines or servers to handle the load, or potentially move the service to a completely different set of servers inside the infrastructure. This workload movement and virtual machine migration creates the need for traffic engineering solutions inside a data center.


key enabling technology in data center networking

the ability to virtualize servers. This makes it possible to quickly provision, move, and migrate servers and services in response to fluctuations in workload. But while provisioning servers and moving them is relatively easy, we must also develop traffic engineering solutions that allow the network to reconfigure in response to changing workloads and migrating services


Data Center Networking Challenges

traffic load balance and support for migrating virtual machines in response to changing demands

Adjusting server and traffic placement to save power

Provisioning the network when demands fluctuate

providing various security guarantees, particularly in scenarios that involve multiple tenants

To understand these challenges in a bit more detail, let's take a look at a typical data center topology. A topology typically has three layers: an access layer, which connects the servers themselves; an aggregation layer, which connects the access layer; and then the core. Historically, the core of the network has been connected at layer three, but increasingly, modern data centers are built as an entire layer-two topology. A layer-two topology makes it easier to migrate services from one part of the topology to another, since these services can stay on the same layer-two network and hence do not need new IP addresses when they move. It also becomes easier to load balance traffic.

On the other hand, a monolithic layer-two topology makes scaling difficult, since now we have tens of thousands of servers on a single flat topology. Layer-two addresses are not topological, so the forwarding tables in these switches cannot scale as easily, because they cannot take advantage of the natural hierarchy that exists in the topology.

Other problems with this type of topology are that the hierarchy can create single points of failure, and that links at the top of the topology, in the core, can become oversubscribed. Modern data center operators have observed that, moving from the bottom of the hierarchy up towards the core, the links at the top can carry as much as 200 times the traffic of the links at the bottom. So there is a serious capacity mismatch: the top part of the topology has to carry far more traffic than the bottom.


Data Center Topologies - Pods

the scale problem arises because we have tens of thousands of servers on a flat layer-two topology, where every server has a topology-independent MAC (hardware) address. In the default case, every switch in the topology has to store a forwarding table entry for every single MAC address

Solution: pods
--assign pseudo-MAC addresses to each server, corresponding to the pod in which it is located in the topology.
--Each server has both a real MAC address and a pseudo-MAC address
--switches in the data center topology no longer need to maintain forwarding table entries for every host.
--They only need to maintain entries for reaching the other pods in the topology.
--Once a frame enters a pod, the switch of course has entries for all of the servers inside that pod, but switches do not need to maintain entries for the MAC addresses of servers outside their own pod.


Pods - fabric manager

hosts are unmodified, so they will still respond to things like ARP queries with their real MAC addresses

Also need a way of Mapping pseudo MAC addresses to real MAC addresses

The fabric manager achieves hierarchical forwarding:
--a switch intercepts the ARP query and forwards it to the fabric manager
--the fabric manager then responds with the pseudo-MAC corresponding to that IP address
--host A then sends the frame with the destination pseudo-MAC address, and switches in the topology can forward that frame to the pod corresponding to the destination server's pseudo-MAC
--once the frame reaches the destination pod (say, pod 3), the switch at the top of that pod maps the pseudo-MAC address back to the real MAC address
--the server that receives the frame sees an Ethernet frame with its real destination MAC address, so it knows the frame was intended for it
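The fabric manager's role in ARP resolution can be sketched as a simple lookup service. The class and method names below are hypothetical, chosen only to mirror the steps described above (IP to pseudo-MAC at the source, pseudo-MAC back to real MAC at the destination pod):

```python
# Minimal sketch (hypothetical names) of fabric-manager ARP resolution.
class FabricManager:
    def __init__(self):
        self.ip_to_pmac = {}    # soft state: IP address -> pseudo-MAC
        self.pmac_to_amac = {}  # pseudo-MAC -> actual (real) MAC

    def register(self, ip, amac, pmac):
        # Called as hosts/VMs are discovered by their edge switches.
        self.ip_to_pmac[ip] = pmac
        self.pmac_to_amac[pmac] = amac

    def resolve_arp(self, ip):
        # Edge switches intercept ARP queries and forward them here;
        # the reply carries the pseudo-MAC, never the real MAC.
        return self.ip_to_pmac.get(ip)

    def to_real_mac(self, pmac):
        # Used at the destination pod to rewrite pseudo-MAC -> real MAC.
        return self.pmac_to_amac.get(pmac)

fm = FabricManager()
fm.register("10.0.3.7", amac="02:aa:bb:cc:dd:ee", pmac="00:03:01:02:00:01")
assert fm.resolve_arp("10.0.3.7") == "00:03:01:02:00:01"
assert fm.to_real_mac("00:03:01:02:00:01") == "02:aa:bb:cc:dd:ee"
```

In PortLand the pseudo-MAC-to-real-MAC rewrite actually happens at the destination pod's edge switch; the fabric manager just supplies the mappings.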


Data Center (Intradomain) Traffic Engineering goals

reduce link utilization

reduce the number of hops to reach the edge of the data center

make the data center network easier to maintain



VL2: creating virtual layer-2 networks that allow application addresses to be separated from network devices

VL2 has two main objectives.
--achieve layer-two semantics across the entire data center topology. This is done with name-location separation and a resolution service that resembles the fabric manager
--achieve uniform high capacity between servers and balance load across links in the topology. For this, VL2 relies on flow-based random traffic indirection using Valiant load balancing.


Valiant Load Balancing

goals of Valiant load balancing in the VL2 network:
--spread traffic evenly across the servers
--ensure that traffic load is balanced independently of the destinations of the traffic flows

achieved by inserting an indirection level into the switching hierarchy:
--when a switch at the access layer wants to send traffic to a destination, it first selects a switch at the indirection level, at random, to send the traffic to
--the intermediate switch then forwards the traffic to the ultimate destination based on the destination MAC address of the traffic
--subsequent flows may pick different indirection points, again at random
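The indirection step above can be sketched in a few lines. Switch names are illustrative; the point is simply that the intermediate hop is chosen at random per flow, independent of the destination:

```python
import random

# Sketch of Valiant load balancing: each new flow is bounced through a
# randomly chosen intermediate switch, so load on the intermediate layer
# is spread independently of where the flows are actually headed.
def route_flow(src_access_switch, dst_access_switch, intermediate_switches,
               rng=random):
    via = rng.choice(intermediate_switches)  # random indirection point
    return [src_access_switch, via, dst_access_switch]

core = ["core1", "core2", "core3"]  # the indirection level (illustrative)
path = route_flow("access1", "access7", core)
assert path[0] == "access1" and path[-1] == "access7" and path[1] in core
```

Because the choice is per flow (not per packet), packets within a flow stay in order while different flows still spread across the indirection layer.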


Jellyfish Data Center Topology

Jellyfish is a technique to network data centers randomly

--achieve high throughput to support, for example, big data analytics or agile placement of virtual machines
--incremental expandability, so that network operators can easily add or replace servers and switches

For example, large companies like Facebook add capacity on a daily basis. Commercial products make it easy to expand or provision servers in response to changing traffic load, but not the network. Unfortunately, the structure of data center networks constrains expansion: a hypercube, for example, requires 2^k switches, and even more efficient topologies, like a fat tree, still come only in coarse sizes dictated by their structure.



where does the data center topology primarily constrain expansion and which topology answers this?

Most of the congestion occurs at the top level. Jellyfish's answer to how structure constrains expansion is simply to have no structure at all.


Jellyfish Random Regular Graph

Jellyfish's topology is what is called a random regular graph.
--It's random because each graph is uniformly selected at random from the set of all regular graphs.
--A regular graph is simply one where each node has the same degree.
--And a graph in Jellyfish is one where the switches in the topology are the nodes.

In the example, every node in the graph has a fixed degree of 12. Jellyfish's approach is to construct a random graph at the top-of-rack (ToR) switch layer. Every ToR switch i has some total number k_i of ports, of which it uses r_i to connect to other ToR switches. The remaining k_i - r_i ports are used to connect servers. With n racks, the network then supports n(k_i - r_i) servers, and the network is a random regular graph denoted RRG(n, k, r). Formally, random regular graphs are sampled uniformly from the space of all r-regular graphs. Achieving such a property is a complex graph theory problem, but there is a simple procedure that produces a sufficiently uniform random graph that empirically has the desired properties.


Constructing a Jellyfish Topology

pick a random pair of switches with free ports that are not already neighbors

next, join them with a link

repeat this process until no further links can be added

If a switch remains with two or more free ports (which might happen during incremental expansion, when adding a new switch), it can be incorporated into the topology by removing a uniformly random existing link and adding links from both of that link's endpoints to the switch.
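The construction procedure above can be sketched directly. This is a simple illustrative version that implements the pick-join-repeat loop; for brevity it omits the final edge-swap step for leftover free ports:

```python
import random

# Sketch of Jellyfish construction: repeatedly pick a random pair of
# non-adjacent switches that both have free inter-switch ports and join
# them, until no such pair remains.
def build_jellyfish(num_switches, r, rng=random):
    free = {s: r for s in range(num_switches)}  # free inter-switch ports
    links = set()                               # links stored as (a, b), a < b
    while True:
        candidates = [(a, b) for a in free for b in free
                      if a < b and free[a] > 0 and free[b] > 0
                      and (a, b) not in links]
        if not candidates:
            break
        a, b = rng.choice(candidates)
        links.add((a, b))
        free[a] -= 1
        free[b] -= 1
    return links

links = build_jellyfish(10, 4)
# No switch exceeds its r = 4 inter-switch ports, and no link is duplicated.
degree = {}
for a, b in links:
    degree[a] = degree.get(a, 0) + 1
    degree[b] = degree.get(b, 0) + 1
assert all(d <= 4 for d in degree.values())
```

Recomputing the candidate list each iteration is wasteful but keeps the sketch short; a real implementation would maintain the set of eligible pairs incrementally.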

For a particular equipment cost, using identical equipment, the Jellyfish topology can achieve increased capacity, supporting 25% more servers.
--This higher capacity is achieved because paths through the topology are shorter than they would be in a fat-tree topology.


Jellyfish Topology Example

Consider a topology with sixteen servers, twenty switches, and a fixed degree of four for both the fat tree topology and the jellyfish random graph.

In the fat-tree topology, only four of the 16 servers are reachable from a given server in fewer than five hops. In contrast, in the Jellyfish random graph, 12 servers are reachable within that distance.

By making more servers reachable along shorter paths, jellyfish can increase capacity over a conventional Fat tree topology.


Jellyfish open questions

First, how close are these random graphs to optimal, in terms of the throughput achievable with a particular set of equipment?

Second, what about topologies where switches are heterogeneous, with different numbers of ports or link speeds?

From a system design perspective, the random topology model could create problems with physically cabling the data center network.

Finally, how do we perform routing or congestion control without the structure of a conventional data center network like a fat tree?


Traffic engineering three steps: What are the two things that need to be measured, and how could each be measured?

We need to measure the topology, including not only the connectivity but also the capacity of each link and router. This could be done by routers self-reporting, similar to how they exchange information in a link-state protocol, but in practice it is probably more often simply entered as data by a network engineer.

We also need to measure the traffic, or offered load. This can be done using the “simple counters” measurement technique that we learned about earlier, since we want to know how much traffic is on each part of the network but don't necessarily need the details of specific flows.


Traffic engineering three steps: What are two ways that control could be implemented?

The “traditional” way to implement control is by adjusting link weights, which indirectly affects the routes calculated by the routing protocol. In practice, link weights are more often used this way than to represent any “real” property of the network, like bandwidth or link latency. Another way to implement control is by using SDN to directly control the routes that are used on the network.


In intra-AS multipath, traffic can be split among paths with the same path length (i.e., the sum of link weights along the path). In inter-AS multipath, what properties of the paths (i.e., of the BGP advertisements) need to be equal in order to allow multipath over those paths? (You can limit your answer to the BGP properties that we learned about in class; there are a couple that we didn't learn about that also apply here, but don't worry about those.)

LOCAL_PREF, the local preference parameter

AS_PATH length, as determined by counting the number of ASes in the AS_PATH

MULTI_EXIT_DISC, the MED value

IGP metric to the NEXT_HOP, i.e., equal “hot potato” routing distance


How does using pods and pseudo-MACs improve the scalability of a Layer 2 network?

This changes the flat layer 2 addressing (MAC addresses) into hierarchical addressing (pseudo-MAC addresses). This means that switches only need to store a forwarding entry for each host in the same pod, plus one for each other pod, rather than needing an entry for each host on the entire network. (Notice that hierarchical addressing is the same thing that allows IP to scale at layer 3, so the idea is to push that concept down into layer 2.)


What are the advantages of using a Jellyfish topology over a traditional hierarchical data
center topology?

• Network load balancing – prevents bottleneck links and heavily loaded aggregation or core switches

• Higher capacity – since the network is balanced, more hosts can reasonably be hosted on a network with the same number of switches

• Shorter paths – a shorter average number of hops between any two hosts results in faster network performance

• Incremental expansion – allows adding switches to the network without reconfiguring the existing network infrastructure or adding additional “higher-level” switches


What are the drawbacks or problems with using a Jellyfish topology?

• Does not handle heterogeneous switch devices well

• Long cable runs between random switch pairs may be necessary, but are inconvenient and difficult to install


Briefly describe the functions of the logically centralized Fabric Manager used in PortLand.

The Fabric Manager is primarily responsible for maintaining network configuration soft state. Using this soft state, the Fabric Manager performs ARP resolution, provides multicast capability to the network, and achieves fault tolerance goals.


Where does this Fabric Manager reside?

The Fabric Manager is a user process running on a dedicated machine. This machine may be located on the network itself, or it can reside on a separate control network.


Explain how PMAC (Pseudo MAC) addresses are generated for end hosts in a PortLand network, i.e., what are the four components of a PMAC, and what does each component represent?

A PMAC encodes the position of an end host in a fat-tree network. This encoding consists of four components in the format pod.position.port.vmid.

pod component
--encodes the number of the pod in which the end host and its edge switch reside

position component
--encodes the end host's position within the pod

port component
--encodes the physical port number of the switch to which the end host is attached

vmid component
--encodes a unique ID for each virtual machine present on the end host

The edge switch maintains a mapping for each VM, which uses its own AMAC (actual MAC) address. This permits multiplexing of virtual hosts resident on a single physical host.
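The four-component encoding can be sketched with bit shifts. A PMAC is 48 bits, like any MAC address; the 16/8/8/16-bit field widths used here follow the PortLand paper's pod.position.port.vmid format:

```python
# Sketch of PMAC encoding/decoding as 48-bit pod.position.port.vmid.
# Field widths (pod:16, position:8, port:8, vmid:16) follow the PortLand
# paper; the example values below are made up.
def encode_pmac(pod, position, port, vmid):
    assert pod < 2**16 and position < 2**8 and port < 2**8 and vmid < 2**16
    return (pod << 32) | (position << 24) | (port << 16) | vmid

def decode_pmac(pmac):
    return ((pmac >> 32) & 0xFFFF,   # pod
            (pmac >> 24) & 0xFF,     # position within the pod
            (pmac >> 16) & 0xFF,     # switch port the host is attached to
            pmac & 0xFFFF)           # VM id on the end host

pmac = encode_pmac(pod=3, position=5, port=2, vmid=1)
assert decode_pmac(pmac) == (3, 5, 2, 1)
```

Because the pod occupies the high-order bits, a core switch can forward on the pod field alone, which is exactly the longest-prefix-match aggregation described in the next card.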


How does the PMAC design improve forwarding table sizes in large-scale data center networks?

The use of PMACs greatly simplifies layer 2 forwarding due to their hierarchical nature. Switches no longer need a forwarding table entry per virtual host; a single forwarding table entry can aggregate hosts, enabling forwarding behavior that exploits longest-prefix match. Using AMACs, switch state size is O(n), where n is the number of virtual hosts in the data center, whereas state size is O(k) for PMACs, where k is the number of ports on the switches used to construct the fat-tree network.


Describe at a high level how a data center could generate a Jellyfish topology for their
specific network. Assume the desired topology is of the simplest variety, meaning each
switch has the same number of ports and will connect to the same number of servers.

To create a Jellyfish topology, we need to know three values:
--N, the number of racks / switches
--k, the number of ports per switch
--r, the number of ports per switch used to connect to other switches

Next, an approximation algorithm is used to generate an RRG (Random Regular Graph) using N, k, and r as input.

The result is a blueprint for the Jellyfish topology that can be used to physically cable the switches and servers.


Describe at a high level how a Jellyfish topology can be incrementally expanded, using the
same assumptions from the previous question. Can we expect the expanded topology to be
uniformly random after an incremental expansion?

To incrementally add a new server rack, it is not necessary to generate a new RRG with N+1, k, and r.

At a high level, we can add the new rack by iteratively selecting existing links between other ToR switches (not otherwise connected to the new ToR switch) and replacing each such link with two new links, each connecting one of its endpoints to the new switch.

This maintains the previous connectivity of the topology, and consumes two of the new ToR switch's r ports dedicated to connecting to other ToR switches per iteration.

This process is repeated until one or zero of the r ports remain.

It is important to note that, after expansion, the new topology cannot be expected to be uniformly random, as it would be if a new RRG were created and the entire data center re-cabled.
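The expansion step above can be sketched as follows. Links are represented as sorted tuples; the candidate filter also guards against creating a duplicate link to the new switch, a detail the prose glosses over:

```python
import random

# Sketch of Jellyfish incremental expansion: attach a new ToR switch with
# r inter-switch ports by repeatedly removing a random existing link (u, v)
# and replacing it with (u, new) and (v, new), two ports at a time.
def add_switch(links, new_switch, r, rng=random):
    links = set(links)
    free = r
    while free >= 2:
        candidates = [
            (u, v) for (u, v) in links
            if new_switch not in (u, v)
            and tuple(sorted((u, new_switch))) not in links
            and tuple(sorted((v, new_switch))) not in links
        ]
        if not candidates:
            break
        u, v = rng.choice(candidates)
        links.remove((u, v))
        links.add(tuple(sorted((u, new_switch))))
        links.add(tuple(sorted((v, new_switch))))
        free -= 2
    return links

old = {(0, 1), (1, 2), (2, 3), (0, 3)}  # a tiny 4-switch ring (illustrative)
expanded = add_switch(old, new_switch=9, r=4)
assert any(9 in link for link in expanded)  # the new switch is now wired in
```

Note how every iteration preserves connectivity: the removed link (u, v) is replaced by the two-hop path u-new-v.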


Data center network problems and solutions (from Jellyfish reading)

Incremental network expansion:
--adding servers and network capacity incrementally to the data center, necessitated for example by planned overprovisioning of space and power, or by upgrading old servers to a larger number of more powerful but energy-efficient new servers

Workarounds in structured topologies:
--use a switch with a larger port count, or oversubscribe certain switches, but this makes capacity distribution constrained and uneven across the servers
--leave free ports for future network connections [14, 20], but this wastes investment until the actual expansion


Jellyfish defined

Jellyfish is a degree-bounded random graph topology among top-of-rack (ToR) switches.

The inherently sloppy nature of this design has the potential to be significantly more flexible than past designs.

New components (racks of servers, or switches to improve capacity) can be incorporated with a few random edge swaps.

The design naturally supports heterogeneity, allowing the addition of newer network elements with higher port counts as they become available, unlike past proposals, which depend on certain regular port counts.

Jellyfish also allows construction of arbitrary-size networks, unlike the topologies discussed above, which limit the network to very coarse design points dictated by their structure.

Jellyfish supports more servers than a fat tree built using the same network equipment, while providing at least as high per-server bandwidth, measured either via bisection bandwidth or via throughput under a random-permutation traffic pattern.

In addition, Jellyfish has lower mean path length, and is resilient to failures and miswirings.


Jellyfish challenges

routing (schemes that depend on a structured topology are not applicable)

physical construction



Limitations in existing layer 2 and layer 3 network protocols addressed by Data Center Networks like VL2

lack of scalability, difficult management, inflexible communication, and limited support for virtual machine migration


PortLand defined

a scalable, fault-tolerant layer 2 routing and forwarding protocol for data center environments.

PortLand holds promise for supporting a “plug-and-play” large-scale data center network.

The goal of PortLand is to deliver scalable layer 2 routing, forwarding, and addressing for data center network environments.


Requirements for data center plug and play deployment for switches

• R1. Any VM may migrate to any physical machine. Migrating VMs should not have to change their IP addresses, as doing so would break pre-existing TCP connections and application-level state.

• R2. An administrator should not need to configure any switch before deployment.

• R3. Any end host should be able to efficiently communicate with any other end host in the data center along any of the available physical communication paths.

• R4. There should be no forwarding loops.

• R5. Failures will be common at scale, so failure detection should be rapid and efficient. Existing unicast and multicast sessions should proceed unaffected, to the extent allowed by the underlying physical connectivity.


Fabric Manager

PortLand employs a logically centralized fabric manager that maintains soft state about network configuration information, such as topology.

The fabric manager is a user process running on a dedicated machine, responsible for assisting with ARP resolution, fault tolerance, and multicast.