A Parallel Routing Algorithm for Torus NoCs
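The Torus structure has both the Mesh structure's simple hardware implementation and network expansion
With the development of semiconductor manufacturing processes, large numbers of transistors and computing resources can be integrated on a single chip.
From System on Chip (SoC)
Network on Chip (NoC) is regarded as a new paradigm for addressing the scalability and power efficiency of next-generation communication architectures
HMesh to reduce the power consumption of the network on chip, and theoretically analyzes its performance differences from classical topologies.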
show that, compared with Mesh and Torus, the average power of HMesh drops by 12.9% and 11.24% respectively when the network is not congested, so it is more suitable for NoC.
Low Power Research Based on a New NoC Topology Architecture
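CAO Hong-xin, LI Guang-shun, WU Jun-hua (School of Computer Science, Qufu Normal University, Rizhao 276826, China)
connected to 4 adjacent routing nodes. Its network topology is regular, and the corresponding routing algorithm is relatively simple. However, its network diameter and inter-node distance are relatively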

Keywords: Torus, routing, placement, bisection, interconnection network, edge separator, congestion.
1 Introduction
Meshes and torus based interconnection networks have been utilized extensively in the design of parallel computers in recent years [5]. This is mainly due to the fact that these families of networks have topologies which reflect the communication pattern of a wide variety of natural problems, and at the same time they are scalable, and highly suitable for hardware implementation. An important factor determining the efficiency of a parallel algorithm on a network is the efficiency of communication itself among processors. The network should be able to handle a "large" number of messages without exhibiting degradation in performance. Throughput, the maximum amount of traffic which can be handled by the network, is an important measure of network performance [3]. The throughput of an interconnection network is in turn bounded by its bisection width, the minimum number of edges that must be removed in order to split the network into two parts each with about equal number of processors [8].

Here, following Blaum, Bruck, Pifarre, and Sanz [3, 4], we consider the behavior of torus networks with bidirectional links under heavy communication load. We assume that the communication latency is kept minimum by routing the messages through only shortest (minimal length) paths. In particular, we are interested in the scenario where every processor in the network is sending a message to every other processor (also known as complete exchange or all-to-all personalized communication). This type of communication pattern is central to numerous parallel algorithms such as matrix transposition, fast Fourier transform, distributed table-lookup, etc. [6], and central to efficient implementation of high-level computing models such as the PRAM and Bulk-Synchronous Parallel (BSP). In Valiant's BSP-model for parallel computation [14] for example, routing of h-relations, in which every processor in the network is the source and destination of at most h packets, forms the main communication primitive. The complete-exchange scenario that we investigate in this paper has been studied and shown to be useful for efficient routing of both random and arbitrary h-relations [7, 12, 13].
The network of d-dimensional k-torus is modeled as a directed graph where each node represents either a router or a processor-router pair, depending on whether or not a processor is attached at this node, and each edge represents a communication link between two adjacent nodes. Hence, every node in the network is capable of message routing, i.e. directly receiving from and sending to its neighboring nodes. A fully-populated d-dimensional k-torus, where each node has a processor attached, contains $k^d$ processors. Its bisection width is $4k^{d-1}$ (k even), which gives $k^d/2$ processors on each component of the bisection. Under the complete-exchange scenario, the number of messages passing through the bisection in both directions is $2(k^d/2)(k^d/2)$. Dividing by the bisection bandwidth, we find that there must exist an edge in the bisection with a load of at least $k^{d+1}/8$. This means that unlike multistage networks, the maximum load on a link is not linear in the number of processors injecting messages into the network. To alleviate this problem, Blaum et al. [3, 4] have proposed partially-populated tori. In this model, the underlying network is toroidal, but the nodes do not all inject messages into the network. We think of the processors as attached to a (relatively small) subset of nodes (called a placement), while the other nodes are left as routing nodes. This is similar to the case of a
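The bisection argument for the fully-populated torus can be checked numerically. The following is a minimal sketch, assuming only the quantities stated in the text (processor count, bisection width for even k, and complete-exchange traffic); the function name `bisection_load_lower_bound` is illustrative and not from the paper.

```python
from fractions import Fraction


def bisection_load_lower_bound(k: int, d: int) -> Fraction:
    """Average load per bisection edge of a fully populated d-dimensional
    k-torus under complete exchange; some edge must carry at least this much."""
    if k % 2 != 0:
        raise ValueError("the bisection width 4*k**(d-1) is stated for even k")
    processors = k ** d                    # one processor per node
    half = processors // 2                 # processors on each side of the bisection
    crossing_messages = 2 * half * half    # messages crossing, both directions
    bisection_width = 4 * k ** (d - 1)     # directed edges cut by the bisection
    return Fraction(crossing_messages, bisection_width)


if __name__ == "__main__":
    for k, d in [(4, 2), (8, 2), (8, 3)]:
        load = bisection_load_lower_bound(k, d)
        # The closed form from the text is k**(d+1) / 8.
        print(f"k={k}, d={d}: load >= {load}, closed form {Fraction(k ** (d + 1), 8)}")
```

The load grows as k times the number of processors divided by 8, which is the superlinear growth that motivates partially-populated tori.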
Routing algorithm

Abstract

Routing algorithms can be distinguished by many features, depending on the designer's specific objectives. There are many kinds of routing algorithms, with different effects on the network and router resources. The purpose of a routing algorithm is to define a set of rules for transferring units of data, known as packets, from one node to another.

Key Words: hop path length, least cost, update time, time delay, Dijkstra's algorithm, Bellman-Ford algorithm

Introduction

The aim of a routing algorithm is to improve the function of a routing protocol with the least overhead. Our book discusses routing only in switched networks, which include circuit-switching networks and packet-switching networks. In circuit-switching networks, to cope with the growing demands on public telecommunication networks, virtually all providers have moved away from the static hierarchical approach to a dynamic approach. In a packet-switching network, the selection of a route is generally based on some performance criterion, as follows:

1. choosing the minimum-hop route (least-cost routing)
2. decision time and place, network information source and update time

Hence, a large number of routing strategies have evolved for dealing with the routing requirements of packet-switching networks, including fixed routing, flooding, random routing and adaptive routing. The original routing algorithm, designed in 1969, was a distributed adaptive algorithm, a version of the Bellman-Ford algorithm. After some years of experience, the original routing algorithm was replaced by a quite different one using delay as the performance criterion. The third generation damps routing oscillations and reduces routing overhead.

A routing algorithm should be flexible. The key technological factors are as follows:

1. the shortest route (least hops or shortest path length) or the best route
2. whether the communication subnet should adopt virtual circuits or datagrams
3. general routing algorithm or distributed routing algorithm
4. consideration of the network topology, traffic and time delay
5. static routing or dynamic routing

The most common routing algorithm is the least-cost algorithm, which is a variation of Dijkstra's algorithm and the Bellman-Ford algorithm.

System Model

Examples of adaptive-routing algorithms are the Routing Information Protocol (RIP) and the Open Shortest Path First protocol (OSPF). Adaptive routing dominates the Internet. However, the configuration of the routing protocols often requires a skilled touch; networking technology has not developed to the point of the complete automation of routing. In a P2P logical network, there is a simple ROP route which corresponds to a complex RON route in the communication network. The key point of this routing algorithm is how to pass the data to the destination quickly and reliably.
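As an illustration of least-cost routing, here is a minimal sketch of Dijkstra's algorithm. The four-node topology, its link costs, and the function names `least_cost_paths` and `route` are assumptions made up for this example; the essay itself gives no concrete network.

```python
import heapq


def least_cost_paths(graph, source):
    """Dijkstra's algorithm: least cost from `source` to every reachable node.

    `graph` maps each node to a dict {neighbor: non-negative link cost}.
    Returns (dist, prev), where prev[n] is the node before n on a least-cost path.
    """
    dist = {source: 0}
    prev = {}
    heap = [(0, source)]
    while heap:
        cost, node = heapq.heappop(heap)
        if cost > dist.get(node, float("inf")):
            continue  # stale heap entry; a cheaper path was already found
        for neighbor, weight in graph.get(node, {}).items():
            new_cost = cost + weight
            if new_cost < dist.get(neighbor, float("inf")):
                dist[neighbor] = new_cost
                prev[neighbor] = node
                heapq.heappush(heap, (new_cost, neighbor))
    return dist, prev


def route(prev, source, dest):
    """Rebuild the node sequence from source to dest (dest must be reachable)."""
    path = [dest]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1]


# Hypothetical topology with symmetric link costs.
links = {
    "A": {"B": 2, "C": 5},
    "B": {"A": 2, "C": 1, "D": 4},
    "C": {"A": 5, "B": 1, "D": 1},
    "D": {"B": 4, "C": 1},
}

if __name__ == "__main__":
    dist, prev = least_cost_paths(links, "A")
    print(dist)
    print(route(prev, "A", "D"))
```

Setting every link cost to 1 turns the same code into minimum-hop routing, which is the first performance criterion listed in the Introduction.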

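Dual-Port NoC Model and Performance Analysis for the Torus Topology *
Song Yukun (宋宇鲲), Qian Qingsong (钱庆松), Zhang Duoli (张多利)
(Institute of Microelectronics Design, Hefei University of Technology, Hefei 230009)
Vol. 31, No. 3, March 2017
DOI: 10.13382/j.jemi.2017.03.005
Received Date: 2016-07
* Supported by the National Natural Science Foundation of China (61106020, 61204024, 61179036)

At the communication-architecture level, NoC effectively solves the poor scalability and low parallelism of bus-based SoCs, and, by adopting a globally asynchronous, locally synchronous (GALS) mechanism, it overcomes the extra power and area overhead caused by global clock synchronization in SoCs. The design of a network on chip covers topology, routing algorithm, switching mechanism and other aspects, among which the topology has a major influence on the others. The influence of the NoC topology on the overall performance and power consumption of chip multiprocessors keeps growing. In many cases, especially in multi-core systems for high-density computing, the communication capability of the network on chip, rather than the computing units, has become the limit on overall [5]. Typically, NoC-based multi-core system architectures only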
Can the Production Network Be the Testbed?

Rob Sherwood∗, Glen Gibb†, Kok-Kiong Yap†, Guido Appenzeller, Martin Casado, Nick McKeown†, Guru Parulkar†
∗Deutsche Telekom Inc. R&D Lab, Los Altos, CA USA
†Stanford University, Palo Alto, CA USA
Nicira Networks, Palo Alto, CA USA

Abstract

A persistent problem in computer network research is validation. When deciding how to evaluate a new feature or bug fix, a researcher or operator must trade off realism (in terms of scale, actual user traffic, real equipment) and cost (larger scale costs more money, real user traffic likely requires downtime, and real equipment requires vendor adoption, which can take years). Building a realistic testbed is hard because “real” networking takes place on closed, commercial switches and routers with special purpose hardware. But if we build our testbed from software switches, they run several orders of magnitude slower. Even if we build a realistic network testbed, it is hard to scale, because it is special purpose and is in addition to the regular network. It needs its own location, support and dedicated links. For a testbed to have global reach takes investment beyond the reach of most researchers.

In this paper, we describe a way to build a testbed that is embedded in, and thus grows with, the network. The technique, embodied in our first prototype, FlowVisor, slices the network hardware by placing a layer between the control plane and the data plane. We demonstrate that FlowVisor slices our own production network, with legacy protocols running in their own protected slice, alongside experiments created by researchers. The basic idea is that if unmodified hardware supports some basic primitives (in our prototype, OpenFlow, but others are possible), then a worldwide testbed can ride on the coat-tails of deployments, at no extra expense. Further, we evaluate the performance impact and describe how FlowVisor is deployed at seven other campuses as part of a wider evaluation platform.

1 Introduction

For many years the networking research community has grappled with how best to evaluate new research ideas.

[Figure 1: Today’s evaluation process is a continuum from controlled but synthetic to uncontrolled but realistic testing, with no clear path to vendor adoption. Stages: Design, Simulate, Test, Deploy in Slice, Deploy. Examples: Whiteboard, Plan, C/C++/Java, NS2, OPNet, Custom, VINI, Emulab, VMs, FlowVisor, Vendor Adoption. Axes: Control, Realism.]

Simulation [17, 19] and emulation [25] provide tightly controlled environments to run repeatable experiments, but lack scale and realism; neither extends all the way to the end user nor carries real user traffic. Special isolated testbeds [10, 22, …] allow testing at scale, and can carry real user traffic, but are usually dedicated to a particular type of experiment and are beyond the budget of most researchers.

Without the means to realistically test a new idea there has been relatively little technology transfer from the research lab to real-world networks. Vendors are understandably reluctant to adopt features before they have been thoroughly tested under realistic conditions with real user traffic. This slows the pace of innovation, and many good ideas never see the light of day.

Peeking over the wall to the distributed systems community, things are much better. PlanetLab has proved invaluable as a way to test new distributed applications at scale (over 1,000 nodes worldwide), realistically (it runs real services, and real users opt in), and offers a straightforward path to real deployment (services developed in a PlanetLab slice are easily ported to dedicated servers).
In the past few years, the networking research community has sought an equivalent platform, funded by programs such as GENI [8], FIRE [6], etc. The goal is to allow new network algorithms, features, protocols or services to be deployed at scale, with real user traffic, on a real topology, at line-rate, with real users; and in a manner that the prototype service can easily be transferred to run in a production network. Examples of experimental new services might include a new routing protocol, a network load-balancer, novel methods for data center routing, access control, novel hand-off schemes for mobile users or mobile virtual machines, network energy managers, and so on.

The network testbeds that come closest to achieving this today are VINI [1] and Emulab [25]: both provide a shared physical infrastructure allowing multiple simultaneous experiments to evaluate new services on a physical testbed. Users may develop code to modify both the data plane and the control plane within their own isolated topology. Experiments may run real routing software, and expose their experiments to real network events. Emulab is concentrated in one location, whereas VINI is spread out across a wide area network.

VINI and Emulab trade off realism for flexibility in three main ways.

Speed: In both testbeds packet processing and forwarding is done in software by a conventional CPU. This makes it easy to program a new service, but means it runs much slower than in a real network. Real networks in enterprises, data centers, college campuses and backbones are built from switches and routers based on ASICs. ASICs consistently outperform CPU-based devices in terms of data-rate, cost and power; for example, a single switching chip today can process over 600 Gb/s [2].

Scale: Because VINI and Emulab don’t run new networking protocols on real hardware, they must always exist as a parallel testbed, which limits their scale. It would, for example, be prohibitively expensive to build a VINI or Emulab testbed to evaluate data-center-scale experiments requiring thousands or tens of thousands of switches, each with a capacity of hundreds of gigabits per second. VINI’s geographic scope is limited by the locations willing to host special servers (42 today). Without enormous investment, it is unlikely to grow to global scale. Emulab can grow larger, as it is housed under one roof, but is still unlikely to grow to a size representative of a large network.

Technology transfer: An experiment running on a network of CPUs takes considerable effort to transfer to specialized hardware; the development styles are quite different, and the development cycle of hardware takes many years and requires many millions of dollars.

But perhaps the biggest limitation of a dedicated testbed is that it requires special infrastructure: equipment has to be developed, deployed, maintained and supported; and when the equipment is obsolete it needs to be updated. Networking testbeds rarely last more than one generation of technology, and so the immense engineering effort is quickly lost.

Our goal is to solve this problem. We set out to answer the following question: can we build a testbed that is embedded into every switch and router of the production network (in college campuses, data centers, WANs, enterprises, WiFi networks, and so on), so that the testbed would automatically scale with the global network, riding on its coat-tails with no additional hardware? If this were possible, then our college campus networks, for example, interconnected as they are by worldwide backbones, could be used simultaneously for production traffic and new WAN routing experiments; similarly, an existing data center with thousands of switches can be used to try out new routing schemes. Many of the goals of programs like GENI and FIRE could be met without needing dedicated network infrastructure.

In this paper, we introduce FlowVisor, which aims to turn the production network itself into a testbed (Figure 1). That is, FlowVisor allows experimenters to evaluate ideas directly in the production network (not running in a dedicated testbed alongside it) by “slicing” the hardware already installed. Experimenters try out their ideas in an isolated slice, without the need for dedicated servers or specialized hardware.

1.1 Contributions

We believe our work makes five main contributions:

Runs on deployed hardware and at real line-rates. FlowVisor introduces a software slicing layer between the forwarding and control planes on network devices. While FlowVisor could slice any control plane message format, in practice we implement the slicing layer with OpenFlow [16]. To our knowledge, no previously proposed slicing mechanism allows a user-defined control plane to control the forwarding in deployed production hardware. Note that this would not be possible with VLANs: while they crudely separate classes of traffic, they provide no means to control the forwarding plane. We describe the slicing layer in §2 and FlowVisor’s architecture in §3.

Allows real users to opt-in on a per-flow basis. FlowVisor has a policy language that maps flows to slices. By modifying this mapping, users can easily try new services, and experimenters can entice users to bring real traffic. We describe the rules for mapping flows to slices in §3.2.

Ports easily to non-sliced networks. FlowVisor (and its slicing) is transparent to both data and control planes, and therefore, the control logic is unaware of the slicing layer. This property provides a direct path for vendor adoption. In our OpenFlow-based implementation, neither the OpenFlow switches nor the controllers need be modified to interoperate with FlowVisor (§3.3).
Enforces strong isolation between slices. FlowVisor blocks and rewrites control messages as they cross the slicing layer. Actions of one slice are prevented from affecting another, allowing experiments to safely coexist with real production traffic. We describe the details of the isolation mechanisms in §4 and evaluate their effectiveness in §5.

Operates on deployed networks. FlowVisor has been deployed in our production campus network for the last 7 months. Our deployment consists of 20+ users, 40+ network devices, a production traffic slice, and four standing experimental slices. In §6, we describe our current deployment and future plans to expand into seven other campus networks and two research backbones in the coming year.

2 Slicing Control & Data Planes

On today’s commercial switches and routers, the control plane and data planes are usually logically distinct but physically co-located. The control plane creates and populates the data plane with forwarding rules, which the data plane enforces. In a nutshell, FlowVisor assumes that the control plane can be separated from the data plane, and it then slices the communication between them. This slicing approach can work several ways: for example, there might already be a clean interface between the control and data planes inside the switch. More likely, they are separated by a common protocol (e.g., OpenFlow [16] or ForCes [7]). In either case, FlowVisor sits between the control and data planes, and from this vantage point enables a single data plane to be controlled by multiple control planes, each belonging to a separate experiment.

With FlowVisor, each experiment runs in its own slice of the network. A researcher, Bob, begins by requesting a network slice from Alice, his network administrator. The request specifies his requirements including topology, bandwidth, and the set of traffic (defined by a set of flows, or flowspace) that the slice controls. Within his slice, Bob has his own control plane where he puts the control logic that defines how packets are forwarded and rewritten in his experiment. For example, imagine that Bob wants to create a new HTTP load-balancer to spread port 80 traffic over multiple web servers. He requests a slice: its topology should encompass the web servers, and its flowspace should include all flows with port 80. He is allocated a control plane where he adds his load-balancing logic to control how flows are routed in the data plane. He may […] new service so as to attract users. Interested […] “opt-in” by contacting their network administrator […] the flowspace of Bob’s […]. In this example, […] for Bob, and allows […]ers) in the data plane. […] flows (e.g. when a […] plane. FlowVisor […] allowing him to control switches within his slice.

[Figure 2: Classical network device architectures have distinct forwarding and control logic elements (left). By adding a transparent slicing layer between the forwarding and control elements, FlowVisor allows multiple control logics to manage the same forwarding element (middle). In implementation, FlowVisor uses OpenFlow and sits between an OpenFlow switch (the forwarding element) and multiple OpenFlow controllers (the control logic) (right). Panels: Classical Switch Architecture; Generic Sliced Switch Architecture; Sliced OpenFlow Switch Architecture.]
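Bob's slice request can be made concrete with a small sketch of a flowspace check. This is an illustration under stated assumptions, not FlowVisor's code: packets and flowspace regions are modeled as plain dictionaries of header fields, a field absent from a region acts as a wildcard, and the field names (`ip_src`, `tcp_dst`) and addresses are hypothetical.

```python
def matches(region, packet):
    """True if the packet agrees with every field the region specifies.

    Fields that the region does not mention are treated as wildcards.
    """
    return all(packet.get(field) == value for field, value in region.items())


def in_flowspace(flowspace, packet):
    """A flowspace is a list of regions; a packet belongs if any region matches."""
    return any(matches(region, packet) for region in flowspace)


# Hypothetical users who opted in to Bob's HTTP load-balancing experiment.
opted_in_users = ["10.0.0.5", "10.0.0.9"]

# One region per opted-in user: that user's traffic to TCP port 80.
bob_flowspace = [{"ip_src": ip, "tcp_dst": 80} for ip in opted_in_users]

if __name__ == "__main__":
    web = {"ip_src": "10.0.0.5", "ip_dst": "192.0.2.1", "tcp_dst": 80}
    ssh = {"ip_src": "10.0.0.5", "ip_dst": "192.0.2.1", "tcp_dst": 22}
    other = {"ip_src": "10.0.0.7", "ip_dst": "192.0.2.1", "tcp_dst": 80}
    for name, pkt in [("web", web), ("ssh", ssh), ("other user", other)]:
        print(name, in_flowspace(bob_flowspace, pkt))
```

Only the opted-in user's port 80 traffic falls inside Bob's flowspace; the same user's other traffic and other users' HTTP traffic stay outside the slice.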
FlowVisor slices the network along multiple dimensions, including topology, bandwidth, and forwarding table entries. Slices are isolated from each other, so that actions in one slice, be they faulty, malicious, or otherwise, do not impact other slices.

2.1 Slicing OpenFlow

While architecturally FlowVisor can slice any data plane/control plane communication channel, we built our prototype on top of OpenFlow.

OpenFlow [16, 18] is an open standard that allows researchers to directly control the way packets are routed in the network. As described above, in a classical network architecture, the control logic and the data path are co-located on the same device and communicate via an internal proprietary protocol and bus. In OpenFlow, the control logic is moved to an external controller (typically a commodity PC); the controller talks to the datapath (over the network itself) using the OpenFlow protocol (Figure 2, right). The OpenFlow protocol abstracts forwarding/routing directives as “flow entries”. A flow entry consists of a bit pattern, a list of actions, and a set of counters. Each flow entry states “perform this list of actions on all packets in this flow” where a typical action is “forward the packet out port X” and the flow is defined as the set of packets that match the given bit pattern. The collection of flow entries on a network device is called the “flow table”.

[Figure 3: FlowVisor allows users (Doug) to delegate control of subsets of their traffic to distinct researchers (Alice, Bob, Cathy). Each research experiment runs in its own, isolated network slice. Labels: VoIP, HTTP, Game; VoIP Server, WWW Cache, Detour Node, Game Server.]

When a packet arrives at a switch or router, the device looks up the packet in the flow table and performs the corresponding set of actions. If the packet doesn’t match any entry, the packet is queued and a new flow event is sent across the network to the OpenFlow controller. The controller responds by adding a new rule to the flow table to handle the queued packet. Subsequent packets in the same flow will be handled without contacting the controller. Thus, the external controller need only be contacted for the first packet in a flow; subsequent packets are forwarded at the switch’s full line rate.

Architecturally, OpenFlow exploits the fact that modern switches and routers already logically implement flow entries and flow tables, typically in hardware as TCAMs. As such, a network device can be made OpenFlow-compliant via firmware upgrade.

Note that while OpenFlow allows researchers to experiment with new network protocols on deployed hardware, only a single researcher can use/control an OpenFlow-enabled network at a time. As a result, without FlowVisor, OpenFlow-based research is limited to isolated testbeds, limiting its scope and realism. Thus, FlowVisor’s ability to slice a production network is an orthogonal and independent contribution to OpenFlow-like software-defined networks.

3 […] Design

[…] our main goal, FlowVisor aims to use the production network as a testbed. In operation, the FlowVisor […] the network by slicing each of the network’s corresponding packet forwarding devices (e.g., switches and routers) and links (Figure 3).

• […] the FlowVisor, resources are sliced in terms of their bandwidth, topology, forwarding table entries, and device CPU […].
• […] slice has control over a set of flows, called its flowspace. Users can arbitrarily add (opt-in) and remove […] their own flows from a slice’s flowspace at any time (§3.2).
• […] slice has its own distinct, programmable control […] that manages how packets are forwarded and […] for traffic in the slice’s flowspace. In practice, […] slice owner implements their slice-specific control […] as an OpenFlow controller. The FlowVisor interposes between data and control planes by proxying connections between OpenFlow switches and each slice controller (§3.3).
• Slices are defined using a slice definition policy language. The language specifies the slice’s resource limits, flowspace, and controller’s location in terms of IP and TCP port-pair (§3.4).

3.1 Slicing Network Resources

Slicing a network means correctly slicing all of the corresponding network resources. There are four primary slicing dimensions:

Topology. Each slice has its own view of network nodes (e.g., switches and routers) and the connectivity between them. In this way, slices can experience simulated network events such as link failure and forwarding loops.

Bandwidth. Each slice has its own fraction of bandwidth on each link. Failure to isolate bandwidth would allow one slice to affect, or even starve, another slice’s throughput.

Device CPU. Each slice is limited to the fraction of each device’s CPU that it can consume. Switches and routers typically have very limited general purpose computational resources. Without proper CPU slicing, switches will stop forwarding slow-path packets (§5.3.2), drop statistics requests, and, most importantly, will stop processing updates to the forwarding table.

Forwarding Tables. Each slice has a finite quota of forwarding rules. Network devices typically support a finite number of forwarding rules (e.g., TCAM entries). Failure to isolate forwarding entries between slices might allow one slice to prevent another from forwarding packets.

[Figure 4: The FlowVisor intercepts OpenFlow messages from guest controllers (1) and, using the user’s […] policy (2), transparently rewrites (3) the message to control a slice of the network. Messages from […] (4) […] forwarded only to guests if it matches their […]. Labels: Translation, Isolation Enforcement, Resource Allocation Policy; Alice’s, Bob’s and Cathy’s Slice Def.; Alice’s, Bob’s and Cathy’s Controller; FlowVisor; Switch.]

3.2 Flowspace and Opt-In

A […] subset […] form a well-defined (but not necessarily contiguous) subspace of the entire space of possible packet headers. Abstractly, if packet headers have n bits, then the set of all possible packet headers forms an n-dimensional space.
An arriving packet is a single point in that space representing all packets with the same header. Similar to the geometric representation used to describe access control lists for packet classification [14], we use this abstraction to partition the space into regions (flowspace) and map those regions to slices.

The flowspace abstraction helps us manage users who opt-in. To opt-in to a new experiment or service, users signal to the network administrator that they would like to add a subset of their flows to a slice’s flowspace. Users can precisely decide their level of involvement in an experiment. For example, one user might opt-in all of their traffic to a single experiment, while another user might just opt-in traffic for one application (e.g., port 80 for HTTP), or even just a specific flow (by exactly specifying all of the fields of a header). In our prototype the opt-in process is manual; but in an ideal system, the user would be authenticated and their request checked automatically against a policy.

For the purposes of testbed we concluded flow-level opt-in is adequate; in fact, it seems quite powerful. Another approach might be to opt-in individual packets, which would be more onerous.

3.3 […] Message Slicing

[…] is a slicing layer interposed be[…] control planes of each device in the network. […] FlowVisor acts as a transpar[…] OpenFlow-enabled network devices ([…] data planes) and multiple OpenFlow […] (acting as programmable control logic […]). […] OpenFlow messages between the switch […] are sent through FlowVisor. FlowVi[…] protocol to communicate upwards […] and downwards to OpenFlow […]. […] FlowVisor is transparent, the slice […] no modification and believe they are […] directly with the switches.

[…] the FlowVisor’s operation by extend[…] from §2 (Figure 4). Recall that a re[…] has created a slice that is an HTTP proxy […] spread all HTTP traffic over a set of web […] the controller will work on any HTTP […] FlowVisor policy slices the network so […] sees traffic from users that have opted-in […]. His slice controller doesn’t know the net[…] sliced, so doesn’t realize it only sees a […] HTTP traffic. The slice controller thinks […], i.e., insert flow entries for, all HTTP traf[…] user. When Bob’s controller sends a flow entry to the switches (e.g., to redirect HTTP traffic to a particular server), FlowVisor intercepts it (Figure 4-1), examines Bob’s slice policy (Figure 4-2), and rewrites the entry to include only traffic from the allowed source (Figure 4-3). Hence the controller is controlling only the flows it is allowed to, without knowing that the FlowVisor is slicing the network underneath. Similarly, messages that are sourced from the switch (e.g., a new flow event, Figure 4-4) are only forwarded to guest controllers whose flowspace matches the message. That is, it will only be forwarded to Bob if the new flow is HTTP traffic from a user that has opted-in to his slice.

Thus, FlowVisor enforces transparency and isolation between slices by inspecting, rewriting, and policing OpenFlow messages as they pass. Depending on the resource allocation policy, message type, destination, and content, the FlowVisor will forward a given message unchanged, translate it to a suitable message and forward, or “bounce” the message back to its sender in the form of an OpenFlow error message. For a message sent from slice controller to switch, FlowVisor ensures that the message acts only on traffic within the resources assigned to the slice. For a message in the opposite direction (switch to controller), the FlowVisor examines the message content to infer the corresponding slice(s) to which the message should be forwarded. Slice controllers only receive messages that are relevant to their […].

[Figure 5: FlowVisor can trivially recursively […] sliced network, creating hierarchies of […]. Labels: Switch (×5); FlowVisor (×3); Alice’s, Bob’s, Cathy’s and Eric’s Controller.]

Thus, from a slice controller’s perspective, FlowVisor appears as a switch (or a […]); from a switch’s perspective, FlowVisor […] controller. FlowVisor does not require a 1-to-1 mapping […].

3.4 […]

The slice policy defines the network resources, flowspace, and OpenFlow slice controller allocated to each slice. Each policy is described by a text configuration file, one file per slice. In terms of resources, the policy defines the fraction of total link bandwidth available to this slice (§4.3) and the budget for switch CPU and forwarding table entries. Network topology is specified as a list of network nodes and ports.

The flowspace for each slice is defined by an ordered list of tuples similar to firewall rules. Each rule description has an associated action, e.g., allow, read-only, or deny, and is parsed in the specified order, acting on the first matching rule. The rules define the flowspace a slice controls. Read-only rules allow slices to receive OpenFlow control messages and query switch statistics, but not to write entries into the forwarding table. Rules are allowed to overlap, as described in the example below.

Let’s take a look at an example set of rules. Alice, the network administrator, wants to allow Bob to conduct an HTTP load-balancing experiment. Bob has convinced some of his colleagues to opt-in to his experiment. Alice wants to maintain control of all traffic that is not part of Bob’s experiment. She wants to passively monitor all network performance, to keep an eye on Bob and the production network.

Here is a set of rules Alice could install in the FlowVisor:

Bob’s Experimental Network includes all HTTP traffic to/from users who opted into his experiment. Thus, his network is described by one rule per user: tcp port:80 and ip=user ip. […] messages from the switch matching any of […] rules are forwarded to Bob’s controller. Any flow […] that Bob tries to insert are modified to meet these […].

Production Network is the complement of Bob’s […]. For each user in Bob’s experiment, the production traffic network has a negative rule of the form: tcp port:80 and ip=user ip. The […] network would have a final rule that matches […] flows: Allow: all. […] only OpenFlow messages that do not go to Bob’s […] are sent to the production network controller. […] production controller is allowed to insert forwarding […] so long as they do not match Bob’s traffic.

Monitoring Network is allowed to see all traffic […] all slices. It has one rule, Read-only: all.

This rule-based policy, though simple, suffices for the […] and deployment described in this paper. We […] that future FlowVisor deployments will have more […] policy needs, and that researchers will create […] resource allocation policies.

4 FlowVisor Implementation

We implemented FlowVisor in approximately 8000 lines of C and the code is publicly available for download from […]. The notable parts of the implementation are the transparency and isolation mechanisms. Critical to its design, FlowVisor acts as a transparent slicing layer and enforces isolation between slices. In this section, we describe how FlowVisor rewrites control messages, both down to the forwarding plane and up to the control plane, to ensure both transparency and strong isolation. Because isolation mechanisms vary by resource, we describe each resource in turn: bandwidth, switch CPU, and forwarding table entries. In our deployment, we found that the switch CPU was the most constrained resource, so we devote particular care to describing its slicing mechanisms.

4.1 Messages to Control Plane

FlowVisor carefully rewrites messages from the OpenFlow switch to the slice controller to ensure transparency. First, FlowVisor only sends control plane messages to a slice controller if the source switch is actually in the slice’s topology. Second, FlowVisor rewrites OpenFlow feature negotiation messages so that the slice controller only sees the physical switch ports that appear in the slice. Third, OpenFlow port up/port down messages are similarly pruned and only forwarded to the affected slices. Using these message rewriting techniques, FlowVisor can easily simulate network events, such as link and node failures.

4.2 Messages to Forwarding Plane

In the opposite direction, FlowVisor also rewrites messages from the slice controller to the OpenFlow switch.
The most important messages to the forwarding plane were insertions and deletions to the forwarding table. Recall (§2.1) that in OpenFlow, forwarding rules consist of a flow rule definition, i.e., a bit pattern, and a set of actions. To ensure both transparency and isolation, the FlowVisor rewrites both the flow definition and the set of actions so that they do not violate the slice’s definition.

Given a forwarding rule modification, the FlowVisor rewrites the flow definition to intersect with the slice’s flowspace. For example, Bob’s flowspace gives him control over HTTP traffic for the set of users (e.g., users Doug and Eric) that have opted into his experiment. If Bob’s slice controller tried to create a rule that affected all of Doug’s traffic (HTTP and non-HTTP), then the FlowVisor would rewrite the rule to only affect the intersection, i.e., only Doug’s HTTP traffic. If the intersection between the desired rule and the slice definition is null, e.g., Bob tried to affect traffic outside of his slice such as Doug’s non-HTTP traffic, then the FlowVisor would drop the control message and return an error to Bob’s controller. Because flowspaces are not necessarily contiguous, the intersection between the desired rule and the slice’s flowspace may result in a single rule being expanded into multiple rules. For example, if Bob tried to affect all traffic in the system in a single rule, the FlowVisor would transparently expand the single rule into two rules: one for each of Doug’s and Eric’s HTTP traffic.

FlowVisor also rewrites the lists of actions in a forwarding rule. For example, if Bob creates a rule to send out all ports, the rule is rewritten to send to just the subset of ports in Bob’s slice. If Bob tries to send out a port that is not in his slice, the FlowVisor returns an “action is invalid” error (recall from above that Bob’s controller only discovers the ports that do exist in his slice, so only in error would he use a port outside his slice).

4.3 Bandwidth Isolation

Typically, even relatively
modest commodity network hardware has some capability for basic bandwidth isolation [13]. The most recent versions of OpenFlow expose native bandwidth slicing capabilities in the form of per-port queues. The FlowVisor creates a per-slice queue on each port on the switch. The queue is configured for a fraction of link bandwidth, as defined in the slice definition. To enforce bandwidth isolation, the FlowVisor rewrites all slice forwarding table additions from “send out port X” to “send out queue Y on port X”, where Y is a slice-specific queue ID. Thus, all traffic from a given slice is mapped to the traffic class specified by the resource allocation policy. While any queuing discipline can be used (weighted fair queuing, deficit round robin, strict partition, etc.), in implementation, FlowVisor uses minimum bandwidth queues. That is, a slice configured for X% of bandwidth will receive at least X% and possibly more if the link is under-utilized. We choose minimum bandwidth queues to avoid issues of bandwidth fragmentation. We evaluate the effectiveness of bandwidth isolation in §5.

4.4 Device CPU Isolation

CPUs on commodity network hardware are typically low-power embedded processors and are easily overloaded. The problem is that in most hardware, a highly-loaded switch CPU will significantly disrupt the network. For example, when a CPU becomes overloaded, hardware forwarding will continue, but the switch will stop responding to OpenFlow requests, which causes the forwarding tables to enter an inconsistent state where routing loops become possible, and the network can quickly become unusable.

Many of the CPU-isolation mechanisms presented are not inherent to FlowVisor’s design, but rather a workaround to deal with the existing hardware abstraction exposed by OpenFlow. A better long-term solution would be to expose the switch’s existing process scheduling and rate-limiting features via the hardware abstraction.
Some architectures, e.g., the HP ProCurve 5400, already use rate-limiters to enforce CPU isolation between OpenFlow and non-OpenFlow VLANs. Adding these features to OpenFlow is ongoing.

There are four main sources of load on a switch CPU: (1) generating new flow messages, (2) handling requests from the controller, (3) forwarding “slow path” packets, and (4) internal state keeping. Each of these sources of load requires a different isolation mechanism.

New Flow Messages. In OpenFlow, when a packet arrives at a switch that does not match an entry in the flow table, a new flow message is sent to the controller. This process consumes processing resources on a switch and if message generation occurs too frequently, the CPU resources can be exhausted. To prevent starvation, the FlowVisor rate limits the new flow message arrival rate. In implementation, the FlowVisor tracks the new flow message arrival rate for each slice, and if it exceeds some threshold, the FlowVisor inserts a forwarding rule to drop the offending packets for a short period. For example, the FlowVisor keeps a token-bucket style counter for each flow space rule (“Bob’s slice gets
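
The token-bucket style counter mentioned for new flow messages can be illustrated with a minimal sketch. The class name `TokenBucket`, its fields, and the chosen rate and burst values are hypothetical and are not taken from the FlowVisor code; the sketch only shows the general technique of admitting messages while tokens remain and signalling when a slice exceeds its budget.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    """Token bucket for one flowspace rule (hypothetical helper, not FlowVisor code)."""
    rate: float       # tokens added per second (allowed new flow messages per second)
    capacity: float   # maximum burst size, in messages
    tokens: float = 0.0
    last: float = 0.0  # time of the previous update, in seconds

    def allow(self, now: float) -> bool:
        """Return True if a new flow message arriving at time `now` is within budget."""
        # Refill according to the elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        # Over budget: a caller would react here, e.g. by installing a
        # short-lived drop rule for the offending packets.
        return False


# Example: a burst of four messages at t = 0, then two more at t = 0.5 s.
bucket = TokenBucket(rate=2.0, capacity=3.0, tokens=3.0)
decisions = [bucket.allow(t) for t in (0.0, 0.0, 0.0, 0.0, 0.5, 0.5)]
print(decisions)
```

In this example the bucket starts full with three tokens, so the fourth message of the burst is refused; half a second later one token has been refilled, which admits exactly one of the two later messages.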

2012 International Conference on Computer Networks and Communication Systems (CNCS 2012)
IPCSIT vol. 35 (2012) © (2012) IACSIT Press, Singapore

A Parallel Routing Algorithm for Torus NoCs

Khaled Day, Nasser Alzeidi, Bassel Arafeh, Abderezak Touzene
Department of Computer Science, Sultan Qaboos University, Muscat, Oman

Abstract. This paper proposes a parallel routing algorithm for routing multiple data streams over disjoint paths in the torus Network-on-Chip (NoC) architecture. We show how to construct a maximal set of disjoint paths between any two nodes of a torus network topology and then make use of the constructed paths for the simultaneous routing of multiple data streams between these nodes. Analytical performance evaluation results are obtained showing the effectiveness of the proposed parallel routing algorithm in reducing communication delays and increasing throughput when transferring large amounts of data in an NoC-based multi-core system.

Keywords: network-on-chip (NoC), 2D mesh, torus, multipath routing, disjoint paths

1. Introduction

With advances in technology, chips with hundreds of cores are expected to become a reality in the near future. Traditionally, communication between processing elements was based on buses. When the number of processing elements is large, the bus becomes a bottleneck from performance, scalability and power dissipation points of view [1]. A network-on-chip (NoC) is instead used to interconnect the processing elements. The topology of the network-on-chip has a major impact on the overall multi-core system performance [2]. Several topologies have been proposed and studied for NoCs including mesh-based and tree-based topologies [2]. Mesh-based topologies (especially the 2D mesh and the torus) have been the most popular of these topologies.
Their popularity is due to their modularity (they can be easily expanded by adding new nodes and links without modifying the existing structure), their ability to be partitioned into smaller meshes (a desirable feature for parallel applications), their simple XY routing strategy, and their ease of implementation. They also have a regular structure and short inter-switch wires. They have been used in many systems such as the RAW processor [3], the TRIPS processor [4], Intel's 80-node Teraflops research chip [5], and the 64-node chip multiprocessor from Tilera [6].

In this paper we contribute to the study of the torus topology by showing how to construct a maximal set of disjoint paths between any two nodes of a torus and how to use these paths for parallel routing. The proposed parallel routing algorithm (PRA) allows the transfer of multiple data streams between any two nodes in the torus over disjoint paths, resulting in faster transfer of large amounts of data and higher throughput. PRA can also be used for fault-tolerance purposes by sending multiple copies of critical data on disjoint paths. The critical data can still be delivered even if only one of the disjoint paths is fault-free and the others are faulty. Sources of communication faults in NoCs include crosstalk, faulty links and congested network areas.

2. Notations and Preliminaries

A 2D mesh NoC consists of k×k switches interconnecting IP nodes. One disadvantage of the 2D mesh topology is its large diameter, which has a negative effect on the communication latency. A torus NoC, illustrated in figure 1a, is basically the same as a 2D mesh NoC with the exception that the switches on the edges are connected with wrap-around links. Every switch in a torus has five active ports: one connected to the local IP node, and the other four connected to the four neighboring switches (left, right, up and down). The torus topology reduces the latency of the 2D mesh while keeping its simplicity.
In order to reduce the length of the wrap-around links, the torus can be folded as shown in figure 1b.

Fig. 1: (a) A Torus NoC Topology (k = 4) (b) A Folded Torus NoC Topology (k = 4)

We refer to a node in the torus topology by its pair of X-Y coordinates as illustrated in figure 1a. We show in the next section how to construct node-disjoint paths from a source node S to a destination node D. A path from S to D is a sequence of nodes starting at S and ending at D such that any two consecutive nodes in the sequence are neighbor nodes. We say two paths are node-disjoint if they do not have any common nodes other than the source node and the destination node. A path from S to D can be specified by the sequence of node-to-node moves that lead from S to D. There are four possible moves from a node to a neighbor node (right, left, up, and down). We denote these moves as +X, -X, +Y and -Y respectively. When a node is on the rightmost border of the torus, a move to the right uses the wrap-around link that leads to a node on the leftmost border of the torus. The same holds for nodes on the leftmost, top and bottom borders. In other words, all + and - operations on the X-Y coordinates are modulo k.

With respect to a given source node S and a given destination node D, all +X moves are called forward X moves (denoted F_X) if and only if an initial +X move from S decreases the distance to D along the X dimension. Otherwise all +X moves are called backward X moves and are denoted B_X. These F_X and B_X notations are only defined when S and D differ in the X dimension. Figure 2 provides more precise definitions for F_X and B_X in the form of two functions F_X(S, D) and B_X(S, D) which return the moves that correspond to F_X and B_X for a given source S = (x_S, y_S) and a given destination D = (x_D, y_D). The forward Y moves and backward Y moves (along the Y dimension) and their corresponding F_Y and B_Y notations are similarly defined.
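
Since Figure 2 is not reproduced here, the following minimal sketch shows one way the forward and backward moves could be computed directly from the textual definition. The function names `ring_distance` and `forward_backward` are illustrative assumptions and are not the paper's F_X(S, D) and B_X(S, D) listings; moves are encoded as +1 (a + move) and -1 (a - move) along a single dimension.

```python
from typing import Tuple


def ring_distance(a: int, b: int, k: int) -> int:
    """Shortest distance between coordinates a and b on a ring of k nodes."""
    d = abs(a - b) % k
    return min(d, k - d)


def forward_backward(s: int, d: int, k: int) -> Tuple[int, int]:
    """Return (forward, backward) step signs along one dimension.

    The + move is the forward move exactly when one + step from s
    decreases the ring distance to d; otherwise the - move is forward.
    Undefined (raises) when s == d, as in the text.
    """
    if s == d:
        raise ValueError("forward/backward moves are undefined for equal coordinates")
    plus_is_forward = ring_distance((s + 1) % k, d, k) < ring_distance(s, d, k)
    return (1, -1) if plus_is_forward else (-1, 1)


# In a 5x5 torus, going from x = 0 to x = 4 is shorter through the
# wrap-around link, so the forward X move is -X.
print(forward_backward(0, 2, 5))
print(forward_backward(0, 4, 5))
```

When k is even and the two coordinates are exactly k/2 apart, both directions shorten the distance; this sketch then reports +1 as the forward move, which agrees with the "if and only if" wording of the definition.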
The F_Y(S, D) and B_Y(S, D) functions are obtained in the same way.

3. Node-Disjoint Paths in a Torus

Let S = (x_S, y_S) and D = (x_D, y_D) be any source and destination nodes in the torus. There are at most four node-disjoint paths from S to D, corresponding to the four possible starting moves +X, -X, +Y and -Y from S. We now show how to construct a maximal set of four disjoint paths from S to D. Each of the constructed paths is defined by a sequence of moves that lead from S to D. In a path description we use a superscript notation to indicate the number of consecutive times a move is repeated. For example, +X^2 denotes a sequence of two consecutive +X moves. Let δx be the distance from S to D along the x dimension, i.e. δx = min(|x_D - x_S|, k - |x_D - x_S|), and let δy be the distance from S to D along the y dimension, i.e. δy = min(|y_D - y_S|, k - |y_D - y_S|). We distinguish the following three cases in the construction of disjoint paths from S to D:

Case 1: If x_S ≠ x_D and y_S ≠ y_D (S and D on different rows and different columns): Table 1 shows sequences of routing moves of four node-disjoint paths from S to D for Case 1 and Figure 3 illustrates these four paths for a 5×5 torus. The wrap-around links are not shown for clarity of the figure.

Table 1: Four Disjoint Paths from S to D for Case 1

Path    Sequence of Routing Moves
π11     F_X^δx, F_Y^δy
π12     F_Y^δy, F_X^δx
π13     B_X, F_Y^(δy+1), F_X^(δx+1), B_Y
π14     B_Y, F_X^(δx+1), F_Y^(δy+1), B_X

Fig. 3: Disjoint Paths for Case 1 (k = 5)

Case 2: If x_S = x_D and y_S ≠ y_D (S and D on the same column but different rows): Table 2 shows sequences of routing moves of four node-disjoint paths from S to D for Case 2 and Figure 4 illustrates these four paths for a 5×5 torus. The wrap-around links are not shown for clarity of the figure.

Table 2: Four Disjoint Paths from S to D for Case 2

Path    Sequence of Routing Moves
π21     F_Y^δy
π22     +X, F_Y^δy, -X
π23     -X, F_Y^δy, +X
π24     B_Y, +X^2, F_Y^(δy+2), -X^2, B_Y

Fig. 4: Disjoint Paths for Case 2 (k = 5)

Case 3: If x_S ≠ x_D and y_S = y_D (S and D on the same row but different columns): Symmetric to Case 2.

4. The Parallel Routing Algorithm (PRA)

We now propose a parallel routing algorithm (PRA) that allows any source node S in the torus to send to any destination node D a set of m packets in parallel over disjoint paths. Figure 5 outlines the operation of the parallel routing algorithm at a source node S that wants to send m packets p_1, p_2, …, p_m in parallel to a destination node D. The source node scatters the m packets over the disjoint paths in a round-robin fashion.

5. Performance Evaluation

In this section we derive performance characteristics of PRA. We first obtain the lengths of the constructed parallel paths. These lengths are readily obtained from Table 1 and Table 2. The lengths, d_ij, of the constructed four πij paths, 1 ≤ i ≤ 3, 1 ≤ j ≤ 4, are shown in Table 3.

Table 3: Lengths of the Constructed Parallel Paths

Path    Case 1         Case 2    Case 3
1       δx + δy        δy        δx
2       δx + δy        δy + 2    δx + 2
3       δx + δy + 4    δy + 2    δx + 2
4       δx + δy + 4    δy + 8    δx + 8

The proposed routing algorithm splits a message of size M flits over four disjoint paths, resulting in approximately M/4 flits sent on each path. The message latency can then be calculated as the maximum latency of transferring M/4 flits on the four disjoint paths. There are in total k^2(k^2 - 1) source-destination pairs (where the source and the destination are different), of which k^2(k - 1)^2 pairs correspond to Case 1, k^2(k - 1) pairs correspond to Case 2, and another k^2(k - 1) pairs correspond to Case 3.
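
The path constructions, the round-robin scattering step, and the lengths in Table 3 can be checked with a short sketch. All function names below (`disjoint_paths`, `walk`, `scatter`, and the helpers) are hypothetical. The third and fourth Case 1 paths are taken here as B_X, F_Y^(δy+1), F_X^(δx+1), B_Y and its mirror image, which is an assumption chosen to match the δx + δy + 4 lengths of Table 3; Case 3 is obtained from Case 2 by swapping the roles of X and Y.

```python
from typing import List, Sequence, Tuple

Node = Tuple[int, int]


def ring_distance(a: int, b: int, k: int) -> int:
    d = abs(a - b) % k
    return min(d, k - d)


def forward_sign(s: int, d: int, k: int) -> int:
    """+1 if a + move from s shortens the ring distance to d, else -1."""
    return 1 if ring_distance((s + 1) % k, d, k) < ring_distance(s, d, k) else -1


def walk(src: Node, plan, k: int) -> List[Node]:
    """Apply ((dx, dy), repeat) moves from src with modulo-k wrap-around."""
    path = [src]
    for (dx, dy), reps in plan:
        for _ in range(reps):
            x, y = path[-1]
            path.append(((x + dx) % k, (y + dy) % k))
    return path


def disjoint_paths(src: Node, dst: Node, k: int) -> List[List[Node]]:
    """Four paths from src to dst following Tables 1 and 2 (sketch only)."""
    if src == dst:
        raise ValueError("source and destination must differ")
    (xs, ys), (xd, yd) = src, dst
    if ys == yd:
        # Case 3: solve the mirrored Case 2 problem, then swap coordinates back.
        mirrored = disjoint_paths((ys, xs), (yd, xd), k)
        return [[(b, a) for (a, b) in p] for p in mirrored]
    dy = ring_distance(ys, yd, k)
    fy = (0, forward_sign(ys, yd, k))
    by = (0, -fy[1])
    if xs == xd:
        # Case 2 (Table 2).
        px, mx = (1, 0), (-1, 0)
        plans = [
            [(fy, dy)],
            [(px, 1), (fy, dy), (mx, 1)],
            [(mx, 1), (fy, dy), (px, 1)],
            [(by, 1), (px, 2), (fy, dy + 2), (mx, 2), (by, 1)],
        ]
    else:
        # Case 1 (Table 1; rows 3 and 4 are the assumed reconstruction).
        dx = ring_distance(xs, xd, k)
        fx = (forward_sign(xs, xd, k), 0)
        bx = (-fx[0], 0)
        plans = [
            [(fx, dx), (fy, dy)],
            [(fy, dy), (fx, dx)],
            [(bx, 1), (fy, dy + 1), (fx, dx + 1), (by, 1)],
            [(by, 1), (fx, dx + 1), (fy, dy + 1), (bx, 1)],
        ]
    return [walk(src, plan, k) for plan in plans]


def scatter(packets: Sequence, n_paths: int = 4) -> List[list]:
    """Round-robin assignment of packets p_1..p_m to the disjoint paths."""
    lanes: List[list] = [[] for _ in range(n_paths)]
    for i, packet in enumerate(packets):
        lanes[i % n_paths].append(packet)
    return lanes


paths = disjoint_paths((0, 0), (2, 2), 5)
print([len(p) - 1 for p in paths])   # hop counts of the four paths
print(scatter(["p1", "p2", "p3", "p4", "p5", "p6"]))
```

The sketch has only been exercised for k = 5, the size used in Figures 3 and 4. For very small tori (for example k = 4 with δy = 2) the longer detour paths would wrap onto themselves, so a practical implementation would need to treat such sizes separately.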
Therefore, the probability p_i of generating a message that belongs to Case i is given by the following formula:

$$p_i = \begin{cases} (k-1)/(k+1) & i = 1 \\ 1/(k+1) & i = 2 \\ 1/(k+1) & i = 3 \end{cases} \tag{1}$$

Averaging over the three possible cases, the average message latency can be calculated as:

$$\text{mean message latency} = \sum_{i=1}^{3} p_i \max\left(T_{i1}, T_{i2}, T_{i3}, T_{i4}\right) \tag{2}$$

In what follows we calculate $T_{ij}$. Under uniform traffic pattern, the channel arrival rate can be found by dividing the total channel arrival rates over the number of channels in the network. If each IP generates an average of $\lambda_g$ messages per network cycle then a total of $N\lambda_g$ messages will be generated in the network, where $N$ is the number of IP nodes. Since each message on path $\pi_{ij}$ traverses $d_{ij}$ hops and there are 4 output channels in each node, the rate of messages received by each channel in the path can be calculated as:

$$\lambda_{ij} = \frac{N \lambda_g d_{ij}}{4N} = \frac{\lambda_g d_{ij}}{4} \tag{3}$$

Since the traffic is uniform and the network is symmetric, all channels in the network have similar statistical characteristics. Therefore the message latency along path $\pi_{ij}$ is composed of the time to transmit the flits, $M/4$, the routing time, $d_{ij}$, and the blocking delay encountered at each hop along the path. We assume here that each flit takes one network cycle to be transmitted from one node to the next and the routing decision also takes one cycle. Hence the message latency along path $\pi_{ij}$ can be written as:

$$T_{ij} = \left[\frac{M}{4} + d_{ij} + d_{ij}\, w_{ij}\, PB_{ij}\right] \bar{V}_{ij} \tag{4}$$

where $\bar{V}_{ij}$ is the multiplexing factor and $w_{ij} PB_{ij}$ is the blocking delay calculated by multiplying the mean waiting time to acquire a channel by the blocking probability. The mean waiting time to acquire a channel can be approximated as the mean waiting time of an M/G/1 queue [7]:

$$w_{ij} = \frac{\lambda_{ij} T_{ij}^{2} \left[1 + \dfrac{\left(T_{ij} - M/4\right)^{2}}{T_{ij}^{2}}\right]}{2\left(1 - \lambda_{ij} T_{ij}\right)} \tag{5}$$

To calculate the blocking probability $PB_{ij}$ for path $\pi_{ij}$, we assume that $V$ virtual channels can be used per physical channel, where any available virtual channel can be selected to route the message to the next hop. Therefore a message will be blocked only when all virtual channels are busy.
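
Because $T_{ij}$ appears on both sides of the model (it enters the waiting time of equation (5) and the virtual channel occupancy of equations (6) and (7) that follow), the equations are naturally solved by fixed-point iteration. The sketch below is one possible way to do this under stated assumptions: the function name `path_latency`, the zero-load starting guess, the treatment of a saturated channel as infinite latency, and the use of the all-channels-busy probability as $PB_{ij}$ are choices made here for illustration, not details given in the paper.

```python
import math


def path_latency(d: int, lam_g: float, M: int, V: int,
                 tol: float = 1e-4, max_iter: int = 10_000) -> float:
    """Fixed-point estimate of T_ij for a path of d hops (equations (3)-(7)).

    d      path length d_ij in hops
    lam_g  message generation rate per IP node per cycle
    M      message length in flits (M/4 flits travel on this path)
    V      number of virtual channels per physical channel
    """
    lam = lam_g * d / 4.0          # equation (3)
    T = M / 4.0 + d                # zero-load latency as starting guess (assumption)
    for _ in range(max_iter):
        rho = lam * T
        if rho >= 1.0:
            return math.inf        # channel saturated: the model no longer applies
        # Equation (5): M/G/1 mean waiting time.
        w = lam * T ** 2 * (1.0 + (T - M / 4.0) ** 2 / T ** 2) / (2.0 * (1.0 - rho))
        # Equation (6): probability that v virtual channels are busy, v = 1..V.
        P = [(1.0 - rho) * rho ** v for v in range(1, V)] + [rho ** V]
        PB = P[-1]                 # blocked only when all V channels are busy
        # Equation (7): average multiplexing degree (taken as 1 for an idle channel).
        den = sum(v * p for v, p in enumerate(P, start=1))
        num = sum(v * v * p for v, p in enumerate(P, start=1))
        V_bar = num / den if den > 0.0 else 1.0
        T_new = (M / 4.0 + d + d * w * PB) * V_bar   # equation (4)
        if abs(T_new - T) < tol:
            return T_new
        T = T_new
    return T


# Latency of a Case 1 transfer with path lengths 4, 4, 8, 8 (delta_x = delta_y = 2):
# the message completes when the slowest of the four paths has delivered its flits.
case1 = max(path_latency(d, lam_g=0.001, M=32, V=2) for d in (4, 4, 8, 8))
print(case1)
```

The default tolerance mirrors the error bound of 0.0001 quoted in the text. Weighting such per-case maxima by the probabilities of equation (1) would give the mean latency of equation (2), provided representative values of δx and δy are chosen for each case.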
The probability that v virtual channels are busy at a physical channel can be determined using a Markovian model [8] as follows:

$$P_{v,ij} = \begin{cases} \left(1 - \lambda_{ij} T_{ij}\right)\left(\lambda_{ij} T_{ij}\right)^{v} & 1 \le v < V \\ \left(\lambda_{ij} T_{ij}\right)^{v} & v = V \end{cases} \tag{6}$$

When multiple virtual channels are used per physical channel they share the bandwidth in a time-multiplexed manner. Therefore the message latency has to be scaled by the average degree of multiplexing, $\bar{V}_{ij}$ (see equation (4) above), which takes place at a physical channel. This can be calculated as follows [8]:

$$\bar{V}_{ij} = \frac{\sum_{v=1}^{V} v^{2} P_{v,ij}}{\sum_{v=1}^{V} v P_{v,ij}} \tag{7}$$

An iterative technique with an error bound of 0.0001 has been used to evaluate the different variables of the above model. Figure 7 shows plots of the obtained message latency (in network cycles) against the traffic load (message generation rate) for different scenarios. It can be seen from this figure that the message latency is significantly reduced by the use of parallel routing of the message flits over the constructed disjoint paths. It should be mentioned, however, that an extra overhead will be needed to assemble the flits of a message at destination nodes. This also requires that flits carry sequence numbers to maintain the correct order of the message flits. The plots of figure 7 also reveal that using path 1 or path 2 only gives lower latency. This is an expected behavior as these two paths correspond to the optimal routing paths (i.e. shortest paths between any source and destination nodes). It is also clear from the plots that when PRA is used for routing in the torus

6. Conclusion

We have proposed a parallel routing algorithm (PRA) for transferring multiple data streams over disjoint paths in a torus NoC architecture. The algorithm is based on a construction of disjoint paths between network nodes. Analytical performance evaluation results have been obtained showing the effectiveness of the proposed parallel routing algorithm in reducing communication delays and increasing throughput.
The algorithm can be adapted to support fault-tolerant routing of multiple copies of critical data in a torus NoC over the multiple disjoint paths.

7. References

[1] L. Benini and G. De Micheli, Networks on Chips: A New SoC Paradigm, Computer, vol. 35, no. 1, Jan 2002, pp. 70-78.
[2] L. Benini and G. De Micheli, Networks on Chips: Technology and Tools, Morgan Kaufmann, 2006.
[3] M. B. Taylor, W. Lee, S. Amarasinghe, and A. Agarwal, Scalar Operand Networks: On-Chip Interconnect for ILP in Partitioned Architectures, International Symposium on High-Performance Computer Architecture (HPCA), pp. 341-353, Anaheim, California, 2003.
[4] P. Gratz, C. Kim, R. McDonald, S. Keckler, and D. Burger, Implementation and Evaluation of On-Chip Network Architectures, International Conference on Computer Design (ICCD), 2006.
[5] S. Vangal et al., An 80-Tile 1.28TFLOPS Network-on-Chip in 65nm CMOS, IEEE Int'l Solid-State Circuits Conference, Digest of Technical Papers (ISSCC), 2007.
[6] A. Agarwal, L. Bao, J. Brown, B. Edwards, M. Mattina, C.-C. Miao, C. Ramey, and D. Wentzlaff, Tile Processor: Embedded Multicore for Networking and Multimedia, Hot Chips 19, Stanford, CA, Aug. 2007.
[7] L. Kleinrock, Queuing Systems: Theory, vol. 1, New York: John Wiley, 1975.
[8] W. J. Dally, Virtual channel flow control, IEEE Transactions on Parallel and Distributed Systems, 3(2), 1992.