An Evaluation of the Pool Maintenance Overhead in Reliable Server Pooling Systems

Thomas Dreibholz, Erwin P. Rathgeb
University of Duisburg-Essen, Institute for Experimental Mathematics
Ellernstrasse 29, 45326 Essen, Germany
{thomas.dreibholz,erwin.rathgeb}@uni-due.de

Abstract

Reliable Server Pooling (RSerPool) is a protocol framework for server redundancy and session failover, currently still under standardization by the IETF RSerPool WG. An important property of RSerPool is its lightweight architecture: server pool and session management can be realized with small CPU power and memory requirements. That is, RSerPool-based services can also be managed and provided by embedded systems. There has already been some research on the performance of the data structures managing server pools, but a generic, application-independent performance analysis, in particular one including measurements in real system setups, is still missing. Therefore, after an outline of the RSerPool framework, an introduction to the pool management procedures and a description of our pool management approach, this paper first provides a detailed performance evaluation of the pool management structures themselves.
Afterwards, the performance of a prototype implementation is analysed in order to evaluate its applicability under real network conditions.

Keywords: RSerPool, Server Pools, Handlespace Management, SCTP, Performance

1 Introduction and Scope

Service availability is getting increasingly important in today's Internet. But in contrast to the telecommunications world, where availability is ensured by redundant links and devices [27], there had not been any generic, standardized approaches for the availability of Internet-based services. Each application had to realize its own solution and therefore to re-invent the wheel. This deficiency, which arose once more for the availability of SS7 (Signalling System No. 7 [23]) services over IP networks, had been the initial motivation for the IETF RSerPool WG to define the Reliable Server Pooling (RSerPool) framework. The basic ideas of RSerPool are not entirely new (see [1, 32]), but their combination into one application-independent framework is.

∗ Parts of this work have been funded by the German Research Foundation (Deutsche Forschungsgemeinschaft).

The Reliable Server Pooling (RSerPool) architecture currently under standardization by the IETF RSerPool WG is an overlay network framework to provide server replication and session failover capabilities to its applications [9]. In particular, server redundancy leads to the issues of load distribution and load balancing [22], which are also covered by RSerPool [13, 15, 19]. But in contrast to already available solutions in the area of GRID and high-performance computing [20], the RSerPool architecture is intended to be lightweight. That is, RSerPool may only introduce a small computation and memory overhead for the management of pools and sessions [6, 12]. In particular, this means a limitation to a single administrative domain and taking care only of pool and session management, but not of tasks like data synchronization, locking and user management (which are considered to be application-specific).
On the other hand, these restrictions allow for RSerPool components to be situated on embedded devices like routers or telecommunications equipment.

There has already been some research on the performance of RSerPool for applications like SCTP-based mobility [11], VoIP with SIP [4], web server pools [28], IP Flow Information Export (IPFIX) [10], real-time distributed computing [9, 13, 19] and battlefield networks [34, 35]. Furthermore, some ideas and rough performance estimations for the pool management have been described in our paper [12]. But up to now, a detailed performance analysis of these data structures, as well as an evaluation of the pool management overhead in a real system setup, have been missing. The goal of our work is therefore to provide these analyses. In particular, we intend to identify critical parameter spaces to provide guidelines for designing and provisioning efficient RSerPool systems.

2 The RSerPool Architecture

Figure 1 provides an illustration of the RSerPool architecture, as defined in [17, 26]; the protocol stack is presented in figure 2. RSerPool consists of three component classes. Servers of a pool are called pool elements (PE). A pool is identified by a unique pool handle (PH) in the handlespace, which is the set of all pools. The handlespace is managed by pool registrars (PR). PRs of an operation scope synchronize their view of the handlespace using the Endpoint haNdlespace Redundancy Protocol (ENRP [36]). In the operation scope, each PR is identified by a PR ID. An operation scope has a limited range, e.g. a company or organization; RSerPool does not intend to scale to the whole Internet. Nevertheless, it is assumed that PEs can be distributed globally, for their service to survive localized disasters [16].

Figure 1. The RSerPool Architecture

Figure 2. The RSerPool Protocol Stack

A PE can register into a pool at an arbitrary PR of the operation scope, using the Aggregate Server Access Protocol (ASAP [30]). In its pool, the PE will be identified by a random 32-bit identifier which is
denoted as PE ID. The PR chosen for registration becomes the Home-PR (PR-H) of the PE and is in particular also responsible for monitoring the PE's health by endpoint keep-alive messages. If a keep-alive is not acknowledged, the PE is assumed to be dead and removed from the handlespace. Furthermore, PUs may report unreachable PEs; if a certain threshold of such reports is reached, a PR may also remove the corresponding PE. The PE failure detection mechanism of a PU is application-specific. A non-PR-H only sets a lifetime expiration timer for each PE (owned and monitored by another PR). If not updated by its PR-H in time, a PE is simply removed from the local handlespace.

A client is called pool user (PU) in RSerPool terminology. To use the service of a pool given by its PH, a PU requests a PE selection, which is called handle resolution, from an arbitrary PR of the operation scope, again using ASAP [30]. The PR selects the requested list of PE identities using a pool-specific selection rule, called pool policy. The maximum number of selected entries per request is defined by the constant MaxHResItems. Adaptive and non-adaptive pool policies are defined in [33]; for a detailed discussion of these policies, see [13, 15, 19, 37, 38].
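As a minimal sketch of a handle resolution, the following Python fragment returns at most MaxHResItems PE identities from a pool and uses a plain round robin order as the example policy. The class name `RoundRobinPool`, the constant value and the use of a `deque` are illustrative assumptions and are not taken from the RSerPool specifications or from our implementation.

```python
from collections import deque

MAX_HRES_ITEMS = 3  # illustrative value; the real constant is a PR configuration setting


class RoundRobinPool:
    """Hypothetical pool that answers handle resolutions in round robin order."""

    def __init__(self, pe_ids):
        self._order = deque(pe_ids)  # current selection order of the PE IDs

    def handle_resolution(self, max_items=MAX_HRES_ITEMS):
        """Return up to max_items PE IDs and advance the round robin position."""
        count = min(max_items, len(self._order))
        selected = [self._order[i] for i in range(count)]
        self._order.rotate(-count)  # selected PEs move to the end of the order
        return selected


pool = RoundRobinPool([1, 2, 3, 4, 5])
first = pool.handle_resolution(2)
second = pool.handle_resolution(2)
```

Each request consumes a chain of consecutive entries, so two subsequent requests do not return the same PEs as long as the pool is larger than the number of requested items.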
Relevant for this paper are the non-adaptive policies Round Robin (RR) and Random (RAND) and the adaptive policy Least Used (LU). LU selects the least-used PE, according to up-to-date load information; the actual definition of load is application-specific. Round robin selection is applied among multiple least-loaded PEs [12].

The ASAP protocol also provides an optional Session Layer between a PU and a PE. That is, a PU establishes a logical session with a pool; ASAP takes care of the transport connection establishment, of the connection monitoring and of triggering a failover to a new PE in case of a failure (see [5, 14]). All associations among the three RSerPool component types (see also figure 2) are usually based on the Stream Control Transmission Protocol (SCTP [29]), which in particular allows for path multi-homing (see [24, 25] for details).

3 The Handlespace Management Approach

3.1 Requirements

The challenge of the handlespace management is to fulfil two important properties, with particular regard to the "lightweight" requirement of the RSerPool architecture: (1) server pools may get large (up to many thousands of PEs [8]) and (2) a handlespace may contain various pools, each of which may use a different policy for server selection [15] (and new applications may even introduce further policies [16, 19]). Clearly, in order to keep such a handlespace maintainable, it is necessary to use a unified storage structure (which is usable for all policies) and to realize it in an efficient way. Furthermore, the handlespace data structure has to support the following six operations: (1) Registration denotes the registration of a new PE. (2) Deregistration means the removal of a PE entry. (3) Re-Registration is an information update for an existing PE entry. In particular, a re-registration is necessary to update the policy information of an adaptive policy (e.g. the load state for LU [13]). (4) Handle Resolution denotes a PE selection operation. (5) Timer denotes scheduling and expiry of a handlespace timer. For a PR-H, this means
scheduling a keep-alive transmission time, handling its expiry, scheduling a timeout for the keep-alive and cancelling that timeout (on acknowledgement reception). For a non-PR-H, it denotes the scheduling of a registration's lifetime expiration and its cancellation (for an update). (6) Synchronization is the step-wise traversal of the complete handlespace, in order to obtain a block-wise copy for another PR.

3.2 Policy Realization

On the topic of supporting different policies, we have already proposed in [12] to realize the handlespace in form of multiple sets (as illustrated in figure 3): a handlespace is simply a set of pools (Pools Set); each pool contains a set of PE references sorted by PE ID (Index Set) and a second set of these references sorted by a policy-specific sorting order (Selection Set). In order to realize different policies, it is simply necessary to specify a sorting order for the Selection Set, as well as a selection procedure (which is usually to take the first PE). Upon selection of a PE entry, its position in the Selection Set is updated.

In [12], we have already shown the scalability of this approach for a specific example application scenario. However, a performance analysis for a broader parameter range has still been missing. Furthermore, our handlespace management approach had to be extended by more features, which are described in the following.

3.3 Timer Schedule

Scheduling and expiration of timers for PE entries is an additional task of the handlespace management. There are three types of timers: a keep-alive transmission timer schedules the transmission of an ASAP keep-alive to a PE; the keep-alive timeout timer schedules the timeout for the PE's answer; a lifetime expiry timer schedules the expiration of a PE entry on a non-PR-H. An important observation for these three timers is that at any given time exactly one of them is scheduled for each PE. That is, each PE entry only has to contain the type of the timer and the expiration time stamp. Then, the timer schedule is simply another set of
PE entries (sorted by time stamp, of course), as shown in figure 3.

3.4 Checksum and Ownership Set

The ENRP protocol takes care of the handlespace synchronization. In order to detect discrepancies in the handlespace views of different PRs, each PR calculates a checksum of its own PE entries (i.e. the PEs for which it is in the role of a PR-H). These checksums can be transmitted to other PRs, which can compare the value expected from their own handlespace view with the announced value. In case of a difference, a synchronization is necessary. The checksum algorithm used by ENRP is the 16-bit Internet Checksum [3], which allows for incremental updates [9].

The synchronization procedure requires traversing all PE entries belonging to a certain PR. This functionality can be realized by introducing the so-called Ownership Set, containing the PE references sorted by PR-H (see figure 3).

Figure 4. The Measurement Setup

4 The Measurement Setup

In [12], the pool management workload of a PR has already been examined for different implementation strategies of the Set datatype, but only for a very specific setup.

4.1 Data Structure Performance

However, a detailed analysis of the handlespace operations throughput is still missing. Therefore, this will be the first part of this paper. Our program for the corresponding measurements simply performs as many operations of the requested type as possible, in the pool built up in advance. Since registrations and deregistrations cannot be examined separately (the pool would either grow or shrink), these operations are examined in combination: a Registration/Deregistration operation simply performs the deregistration of a randomly selected element if the pool has the configured size; otherwise, a new PE is registered.
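The combined Registration/Deregistration operation described above can be sketched as follows. This is only an illustration of the measurement logic, not our measurement program: the pool is modelled as a plain Python dictionary instead of the tree-based sets, and the function names `reg_dereg_step` and `ops_per_pe_and_second` are hypothetical.

```python
import random
import time


def reg_dereg_step(pool, target_size, rng):
    """Perform one combined operation on `pool` (a dict keyed by PE ID).

    If the pool has reached the configured size, a randomly selected PE is
    deregistered; otherwise, a new PE with a random 32-bit PE ID is registered.
    """
    if len(pool) >= target_size:
        victim = rng.choice(list(pool))  # O(n) copy: acceptable for a sketch only
        del pool[victim]
        return "deregistration"
    pe_id = rng.getrandbits(32)
    while pe_id in pool:                 # avoid PE ID collisions
        pe_id = rng.getrandbits(32)
    pool[pe_id] = {"registered_at": time.monotonic()}
    return "registration"


def ops_per_pe_and_second(operations, seconds, pool_size):
    """Normalize a raw operation count to the unit used in the plots [1/PE*s]."""
    return operations / (seconds * pool_size)
```

The normalization makes pools of different sizes comparable: for example, 40,000 operations within one second on a pool of 20,000 PEs correspond to 2 operations per PE and second.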
The system used for the performance measurements uses a 1.3 GHz AMD Athlon CPU, which was state of the art in early 2001 (i.e. almost seven years ago) and whose performance seems to be realistic for upcoming router or embedded device generations (which could host a PR service). All measurements are repeated 18 times in order to provide statistical accuracy.

4.2 Real System Performance

While the operations throughput is useful to estimate the scalability of the handlespace management, the resulting question is clearly how a real system performs. In order to evaluate such a system, i.e. including real components, protocol stacks and network overhead, we have set up a lab scenario as shown in figure 4: it consists of a set of 10 PCs (each having a 2.4 GHz Pentium IV CPU and 1 GB of memory) connected by a gigabit switch to a Linux-based router. Two PRs (using the same CPU as for the data structure performance evaluation, see subsection 4.1) are connected to the router by Gigabit Ethernet. On each of the hosts, a configurable number of test PEs, PUs and PRs can be started.

All systems run Kubuntu Linux 6.10 "Edgy Eft", using kernel 2.6.17-11 and the kernel SCTP module provided by the distribution. Our RSerPool implementation RSPLIB [7, 9, 18], version 2.2.0, has been installed on all machines. Each measurement run is repeated 12 times to achieve statistical accuracy.

GNU R has been used for the statistical post-processing of our results, including the computation of 95% confidence intervals, and plotting. All result plots show the average values and their confidence intervals.

5 Performance Analysis

Our performance evaluation is subdivided into two parts.
The first part in subsection 5.1 provides a performance analysis of the handlespace management structure itself and constitutes the foundation of the real system evaluation in subsection 5.2.

5.1 Data Structure Performance

The most important operation for the PE side is the registration/deregistration (see subsection 3.1) at the PR. In [12], it has already been shown that deterministic policies can lead to systematic insertion and removal operations in the Selection Set (see subsection 3.2). On the other hand, randomized policies are not affected. Therefore, only a balanced tree structure is appropriate to base the Set datatype on. We have examined the scalability on the number of PEs for the two state-of-the-art representations of this datatype: the red-black tree [21] (a deterministic approach) and the treap [2] (a randomized approach).

The left-hand side of figure 5 shows the throughput of registration/deregistration operations per PE and second for both tree structures and classes of policies. While the performance difference between the two policy types is small, the treap has a slightly lower performance: using a deterministically balanced tree is, despite the greater complexity of the insertion and removal algorithms [21], the faster solution. For a pool of 20,000 PEs, it would be possible to register or deregister each PE about 2 times per second (red-black tree).

Clearly, this is more than sufficient in realistic scenarios. But while registration/deregistration operations (i.e. actual insertions of new or removals of existing PEs) are assumed to be rare, a re-registration (i.e. a registration update) of a PE occurs frequently, in particular if the policy is dynamic. For a dynamic policy (e.g.
LU), the position of the PE entry within the Selection Set changes (see also subsection 3.2). In order to show the impact on the performance of re-registration operations, the right-hand side of figure 5 presents the re-registration throughput per PE and second. For the adaptive policy (here: LU), each re-registration updates the load value with a random value. As expected, a significant difference between adaptive and non-adaptive policies is shown: for 20,000 PEs, the non-adaptive policy still achieves a throughput of about 5 operations per PE and second (red-black tree), while it drops to only about 3 in the adaptive case. That is, care has to be taken of the application behaviour, which actually has to decide when the policy information needs to be updated! Again, the performance for using a red-black tree is slightly better than using a treap.

The throughput of timer operations is depicted on the left-hand side of figure 6. Clearly, the two extreme cases for this operation are 0% and 100% of owned PEs. Therefore, the results of these two settings for both tree implementations are shown. However, the difference remains very small: re-scheduling a timer is quite inexpensive, since the CPU's cache helps to quickly re-insert the updated structure as described in subsection 3.3. As expected, the performance for a red-black tree is slightly better than for a treap.

Handle resolution is the operation relevant for the PUs. Its performance is influenced by two factors: MaxHResItems and the type of policy (randomized or deterministic). For a randomized policy, it is necessary to move down the Selection Set tree (whose depth is O(log n), with n the number of PEs, for both red-black tree and treap) in order to obtain a random PE [12], for each of the MaxHResItems entries.
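One way to obtain a random PE by descending a balanced tree is to store the subtree size in every node and to look up a random rank. The following sketch illustrates this idea under that assumption; it uses a statically built balanced binary search tree instead of a red-black tree or treap, and it is not necessarily the procedure of [12]. All names (`Node`, `build_balanced`, `select_by_rank`, `resolve_random`) are hypothetical.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    key: int                         # PE ID
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    size: int = 1                    # number of nodes in this subtree


def build_balanced(keys, lo=0, hi=None):
    """Build a balanced search tree from a sorted list of PE IDs."""
    if hi is None:
        hi = len(keys)
    if lo >= hi:
        return None
    mid = (lo + hi) // 2
    node = Node(keys[mid], build_balanced(keys, lo, mid), build_balanced(keys, mid + 1, hi))
    node.size = hi - lo
    return node


def select_by_rank(root, rank):
    """Descend from the root to the node with the given 0-based rank: O(depth)."""
    node = root
    while node is not None:
        left_size = node.left.size if node.left else 0
        if rank < left_size:
            node = node.left
        elif rank == left_size:
            return node.key
        else:
            rank -= left_size + 1
            node = node.right
    raise IndexError("rank out of range")


def resolve_random(root, max_items, rng):
    """Select up to max_items distinct PEs, one tree descent per selected entry."""
    n = root.size if root else 0
    ranks = rng.sample(range(n), min(max_items, n))
    return [select_by_rank(root, r) for r in ranks]
```

Each selected entry costs one descent of O(log n) steps, so a request for MaxHResItems entries costs MaxHResItems descents.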
Deterministic policies, on the other hand, simply allow for taking a complete chain of PE entries from the list (since their order is deterministic and therefore already defined by the sorting order, see subsection 3.2), i.e. the overall runtime is O(1) instead.

The throughput of handle resolution operations per PE and second is depicted on the right-hand side of figure 6. Clearly, it can be observed that the higher MaxHResItems, the lower the throughput: it drops from 13 at MaxHResItems h=1 to about 7.5 at h=3 for 10,000 PEs (deterministic policy, red-black tree). Furthermore, the performance for a randomized policy is clearly lower: 7 at h=1 vs. about 4 at h=3 for 10,000 PEs (red-black tree). Again, the performance for the treap is somewhat lower than for the red-black tree. In a real system, the frequency of handle resolutions strongly depends on the application's PU workload. For a PU with a high handle resolution frequency (e.g. a web proxy like [28]), it is possible to apply a handle resolution cache at the PU [13]. Furthermore, the handle resolution operation has an advantage over the previously examined operations: it can be performed independently of other PRs. That is, in case of a high handle resolution workload, the PUs could be distributed among multiple PRs.

The last operation is the synchronization, which only occurs when PRs detect an inconsistency or on PR startup. That is, the operation is quite rare (e.g. up to a few times per day only). However, the actual performance for a pool of 30,000 PEs allows for more than 100 operations per second, which is by orders of magnitude more than sufficient.
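The inconsistency detection that triggers a synchronization relies on the checksum of subsection 3.4. The following sketch shows how a 16-bit Internet Checksum over the owned PE entries can be maintained incrementally, so that a registration or deregistration does not require a recomputation over all entries. The byte representation of a PE entry and the class name `OwnershipChecksum` are assumptions made for illustration; the actual ENRP encoding is not reproduced here.

```python
def ones_add(a: int, b: int) -> int:
    """16-bit ones' complement addition with end-around carry."""
    s = a + b
    return (s & 0xFFFF) + (s >> 16)


def ones_sub(a: int, b: int) -> int:
    """Ones' complement subtraction: add the bitwise complement of b."""
    return ones_add(a, ~b & 0xFFFF)


def entry_sum(data: bytes) -> int:
    """Ones' complement sum over the 16-bit words of one serialized PE entry."""
    if len(data) % 2:
        data += b"\x00"              # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total = ones_add(total, (data[i] << 8) | data[i + 1])
    return total


class OwnershipChecksum:
    """Hypothetical running checksum over the PE entries owned by one PR."""

    def __init__(self):
        self.total = 0

    def register(self, entry: bytes) -> None:
        self.total = ones_add(self.total, entry_sum(entry))

    def deregister(self, entry: bytes) -> None:
        self.total = ones_sub(self.total, entry_sum(entry))

    def value(self) -> int:
        # 0x0000 and 0xFFFF both represent zero in ones' complement arithmetic.
        total = 0 if self.total == 0xFFFF else self.total
        return ~total & 0xFFFF
```

Two PRs that hold the same set of entries for a PR-H obtain the same value regardless of the order of the updates; a differing value indicates that a synchronization is necessary.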
Since synchronization is both rare and fast, a plot of its throughput has been omitted.

5.2 Real System Performance

While our RSerPool handlespace management approach, based on red-black trees, handles pools of 10,000 and more PEs, pools of up to a few hundreds of PEs seem to be most realistic for the application cases of RSerPool. Therefore, the following measurements focus on smaller pools, but with a high PR request frequency in order to fathom the limits.

Figure 5. The Scalability of the Registration/Deregistration and Re-Registration Operations (left panel: Registration Operations per PE and Second [1/PE*s]; right panel: Re-Registration Operations per PE and Second [1/PE*s]; x-axis: Number of Pool Elements [1])

Figure 6. The Scalability of the Timer Handling and Handle Resolution Operations (left panel: Timer Operations per PE and Second [1/PE*s]; right panel: Handle Resolution Operations per PE and Second [1/PE*s]; x-axis: Number of Pool Elements [1])

5.3 Pool Elements Scalability

In order to show the scalability on PEs, the number of PEs has been varied. The pool is using the RR policy (i.e. deterministic) and an inter-reregistration time between 250 ms and 1000 ms (such high rates may occur for dynamic policies). All ASAP (re-)registrations are performed on PR #1 (see figure 4); PR #2 is synchronized by ENRP only. That is, we have used the worst case here.

The CPU utilization of PR #1 and PR #2 is shown on the left-hand side of figure 7. Randomized policy results have been omitted, since the results do not differ significantly (see also subsection 5.1). Clearly, the workload on PR #1 is highest: it not only has to handle up to 3,000 simultaneous SCTP associations to PEs (for ASAP), but also has to send out an ENRP update to the other PR on every update of a PE entry. This leads to a load of about 90% for
2,000 PEs at an inter-reregistration time of a=250 ms. Extending this time to a=1000 ms, it is already possible to manage 3,000 PEs at a load of only about 25%. Obviously, the workload of PR #2 is significantly lower: it only has to maintain a single SCTP association to PR #1 to obtain the handlespace data. This results in a load of only about 15% for 2,000 PEs at a=250 ms, and about 25% for 3,000 PEs at a=1000 ms. It is therefore a clear recommendation to try to distribute the load among the PRs of the operation scope. In reality, this can be achieved using the automatic configuration feature of RSerPool [34]. However, care has to be taken of redundancy: in case of PR failure(s), there must be a sufficient number of other PRs! But what about the costs of the ENRP synchronization among PRs?

5.4 Registrars Scalability

In order to show the scalability on the number of PRs, we have again used PR #1 for the ASAP associations and PR #2 for ENRP synchronization only (as shown in figure 4). Further PRs have been started on the other PCs (since only the utilizations of PR #1 and PR #2 are relevant). For our measurement, we have used a pool of 1,000 PEs and inter-reregistration times of a=250 ms to a=1000 ms. The CPU utilization results for PR #1 and PR #2 are presented on the right-hand side of figure 7.

Figure 7. Registrar CPU Utilization for Pool Maintenance (left panel: Registrar's Perspective, x-axis: Pool Elements e [1]; right panel: Provider's Perspective, x-axis: Registrars G [1])

Clearly, the number of PRs does not significantly affect PR #2. While it has to maintain an association with each other PR of the operation scope, the actual workload, which remains constant, is only transported via the association with PR #1. On the other hand, the utilization for PR #1 is significantly increased with the number of PRs, in particular if the inter-reregistration time is small: e.g. from about 20% for a single PR to slightly more than 60% for 6 PRs (at a=250 ms). The bottleneck in this case is the interface between userland application (i.e. the PR) and the kernel's SCTP
API. For each PR, a separate ENRP message has to be passed to the kernel's SCTP API. Clearly, the context switching and memory copying for this operation is time-consuming, while the actual message transport (IP packets via Ethernet interface) is quite efficient (a recent system can transport hundreds of thousands of packets per second).

The analysis of the described userland/kernel bottleneck has led to the suggestion of an SCTP API extension: the SCTP_SENDALL option (see subsection 5.2.2 of [31]). Using this option, a message to all PRs is passed to the kernel only once and sent via all PR associations. But although the new option is already a part of the SCTP API standards document [31], it is not yet implemented in the current Linux kernel (version 2.6.20). Therefore, a performance evaluation using this option has to be part of future work.

In summary, using a reasonably small number of PRs (e.g. two or three are usually sufficient to achieve redundancy), the ENRP overhead remains in an acceptable range, with room for future improvement on the SCTP layer.

5.5 Pool Users Scalability

Finally, we have evaluated the scalability on the number of PUs for handle resolution operations using two PRs. Again, we have observed the CPU utilization of PR #1 and PR #2 (see figure 4) for a pool of 1,000 PEs using deterministic (solid lines) and randomized policies (dotted lines), an inter-reregistration time of 1000 ms and inter-handle-resolution times between 100 ms and 500 ms. For the first measurement, we have used PR #1 for both registrations and handle resolutions (left-hand side), while we have put the burden of handle resolutions on PR #2 for the second measurement (right-hand side).

Clearly, if using PR #1 for all operations, PR #2 only has to synchronize and therefore its load remains constant. Nevertheless, the CPU load of PR #1 only slightly exceeds 25% for 2,000 PUs and an inter-handle-resolution time of 500 ms. For a higher handle resolution rate, however, the CPU utilization quickly grows: at 100 ms, there is
already a load of more than 80% for 1,000 PUs. The performance difference between the two types of policies is small: even at 2,000 PEs, the CPU utilization of a randomized policy is higher by less than 5% (see subsection 5.1). That is, compared to the protocol overhead, the pool maintenance effort is small for this number of PEs.

So, with regard to these results, it is obviously a good idea to split up the workload of registration management and handle resolutions among the PRs. Therefore, PR #2 in the second measurement (right-hand side of figure 4) is responsible for all handle resolutions. Clearly, the system performance gets better now: at a CPU utilization below 80% (PR #2), it is now possible to serve 1,500 PUs with an inter-handle-resolution time of only 100 ms, at a workload of about 10% for PR #1. Splitting up the workload of both operations between the two PRs would clearly result in an even better performance. However, a redundant system should always be provisioned for the worst case, which is a failure of n−1 of the n PRs. That is, the sum of the workloads of both PRs must remain significantly lower than 100%!

Figure 8. Registrar CPU Utilization for Handle Resolution (left panel: Registrar's Perspective using PR #1; right panel: Registrar's Perspective using PR #2; x-axis: Pool Users u [1])

5.6 Results Summary

In summary, our handlespace performance analysis has shown that our approach of reducing the handlespace management to the storage of sets and operations on these sets is efficient if using a red-black tree to actually realize the sets. Critical operations are the re-registration (which may occur very frequently for adaptive policies) and the handle resolution. But in our real system performance analysis, we have shown that even a low-performance CPU is able to handle scenarios of significantly more than 1,000 PEs and PUs. As a general recommendation, it is useful to distribute the PEs and PUs to different PRs of the operation scope to achieve the highest performance. However, care has to be taken of sufficient
PR redundancy to cope with PR failures. Depending on the inter-reregistration and handle resolution frequency, much larger scenarios are also possible. Room for further performance improvement will be provided by the SCTP_SENDALL option of the SCTP stack, which will be realized in future SCTP implementations.

6 Conclusions

The analyses of this paper have shown that our handlespace realization is efficient: using a red-black tree as base structure to store the handlespace content, all handlespace operations can be reduced to the management of balanced trees. The performance of this approach is sufficient to maintain handlespaces of many thousands of PEs, even on a low-performance CPU that is realistic for upcoming routers and embedded systems. In the second part of this paper, we have also shown that our approach is applicable and efficient in reality: a system based on the same CPU is also capable of handling the ASAP/ENRP protocol overhead and the maintenance of SCTP associations.

As part of our future research, we are going to further evaluate our approach for certain RSerPool-based application scenarios. Such real-world scenarios set requirements on pool size and policy type as well as on re-registration and handle resolution frequency. In particular, we intend to estimate a lower threshold for the CPU performance needed to handle these application scenarios. This also includes tests with our implementation on Linux-based embedded systems.

References

[1] L. Alvisi, T. C. Bressoud, A. El-Khashab, K. Marzullo, and D. Zagorodnov. Wrapping Server-Side TCP to Mask Connection Failures. In Proceedings of the IEEE Infocom 2001, volume 1, pages 329–337, Anchorage, Alaska/U.S.A., Apr. 2001. ISBN 0-7803-7016-3.
[2] C. Aragon and R. Seidel. Randomized search trees. In Proceedings of the 30th IEEE Symposium on Foundations of Computer Science, pages 540–545, Oct. 1989.
[3] R. Braden, D. Borman, and C. Partridge. Computing the Internet Checksum. Standards Track RFC 1071, IETF, Sept. 1988.
[4] P. Conrad, A. Jungmaier, C. Ross, W.-C. Sim, and M. Tüxen. Reliable IP
Telephony Applications with SIP using RSerPool. In Proceedings of the State Coverage Initiatives, Mobile/Wireless Computing and Communication Systems II, volume X, Orlando, Florida/U.S.A., July 2002. ISBN 980-07-8150-1.
[5] T. Dreibholz. An Efficient Approach for State Sharing in Server Pools. In Proceedings of the 27th IEEE Local Computer Networks Conference (LCN), pages 348–352, Tampa, Florida/U.S.A., Oct. 2002. ISBN 0-7695-1591-6.
[6] T. Dreibholz. Policy Management in the Reliable Server Pooling Architecture. In Proceedings of the Multi-Service Networks Conference (MSN, Cosener's), Abingdon, Oxfordshire/United Kingdom, July 2004.
[7] T. Dreibholz. Das rsplib-Projekt – Hochverfügbarkeit mit Reliable Server Pooling. In Proceedings of the LinuxTag, Karlsruhe/Germany, June 2005.
[8] T. Dreibholz. Applicability of Reliable Server Pooling for Real-Time Distributed Computing. Internet-Draft Version 03, IETF, Individual Submission, June 2007. draft-dreibholz-rserpool-applic-distcomp-03.txt, work in progress.

Since SS7 signalling networks offer a very high degree of availability (e.g. at most 10 minutes downtime per year for any signalling relationship between two signalling endpoints; for more information see [1]), all links and components of the network devices must be redundant. When transporting signalling over IP networks, such redundancy concepts also have to be applied to achieve the required availability. Link redundancy in IP networks is supported using the Stream Control Transmission Protocol (SCTP [2], [3], details follow in section II); redundancy of network device components is supported by the SGP/ASP (signalling gateway process/application server process) concept [1]. However, this concept has some limitations: no support of dynamic addition and removal of components, limited ways of server selection, no specific failover procedures and inconsistent application to different SS7 adaptation layers.

Fig. 1. The RSerPool Architecture

B. Introduction

To cope with the challenge of creating a unified, lightweight, real-time, scalable and extendable redundancy solution (see [4] for details), the IETF Reliable Server Pooling Working Group was founded to specify and define the Reliable Server Pooling concept. An overview of the architecture currently under standardization and described by several Internet Drafts is shown in figure 1.

Fig. 2. Registration and Monitoring

Multiple server elements providing the same service belong to a server pool to provide both redundancy and scalability. Server pools are identified by a unique ID called pool handle (PH) within the set of all server pools, the handlespace. A server in a pool is called a pool element (PE) of the respective pool. The handlespace is managed by redundant registrars (PR). The registrars synchronize their view of the handlespace using the Endpoint haNdlespace Redundancy Protocol (ENRP [5]). PRs announce themselves using multicast mechanisms, i.e. it is not necessary (although possible) to pre-configure PR addresses into the components described in the
following.

PEs providing a specific service can register for a corresponding pool at an arbitrary PR using the Aggregate Server Access Protocol (ASAP [6]) as shown in figure 2. The home PR (PR-H) is the PR which was chosen by the PE for initial registration. It monitors the PE using SCTP heartbeats (layer 4, not shown in the figure; see section II) and ASAP Endpoint Keep Alives. RSerPool does not rely solely on the layer 4 heartbeat mechanism of SCTP here: the application itself could e.g. hang in an infinite loop while the system's kernel is still responding to the SCTP heartbeats. Using additional keep alives above SCTP therefore improves the monitoring reliability. The frequency of monitoring messages depends on the availability requirements of the provided service. When a PE becomes unavailable, it is immediately removed from the handlespace by its home PR. A PE can also intentionally de-register from the handlespace by an ASAP de-registration, allowing for dynamic reconfiguration of the server pools. PR failures are handled by requiring PEs to re-register regularly (and therefore choosing a new PR when necessary). Re-registration also makes it possible for the PEs to update their registration information (e.g.
transport addresses or policy states). The home PR, which registers, re-registers or de-registers a PE, propagates this information to all other PRs via ENRP. Therefore, it is not necessary for the PE to use any specific PR. In case of a failure of its home PR, a PE can simply use another arbitrarily chosen one.

C. Server Selection

When a client requests a service from a pool, it first asks an arbitrary PR to translate the pool handle to a list of PE identities selected by the pool’s selection policy (pool policy), e.g. round robin or least used (we show examples in section V-B; the standard policies are defined in [7], a quantitative policy performance comparison can be found in [8]). The client adds this list of PE identities to its local cache (denoted as PU-side cache) and again selects one entry from its cache by policy. To this selected PE, a connection is established, using the application’s protocol, to actually use the service. The client then becomes a pool user (PU) of the PE’s pool.

It has to be emphasized that there are two locations where a selection by pool policy is applied during this process:
1) at the PR when compiling the list of PEs and
2) in the local PU-side cache where the target PE is selected from the list.

If the connection to the selected PE fails, e.g. due to overload or failure of the PE, the PU selects another PE (i.e. directly from cache or by asking a PR first) and tries again. The PU may report a PE failure to a PR, which may decide to remove this PE from the handlespace.
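The two selection steps and the retry on failure described above can be illustrated with a minimal sketch. It is not the rsplib API: the classes `Registrar` and `PoolUser`, their methods and the `connect` callback are hypothetical names chosen for this illustration. Round robin is assumed as the pool policy at both locations, and the PR is assumed to always remove a reported PE (the text only says it may do so).

```python
class Registrar:
    """Hypothetical PR holding a handlespace: pool handle -> list of PE identities."""

    def __init__(self, handlespace):
        self.handlespace = {ph: list(pes) for ph, pes in handlespace.items()}

    def resolve(self, pool_handle, max_items=3):
        """Selection 1: compile a list of PEs by policy (round robin here)."""
        pes = self.handlespace.get(pool_handle, [])
        selected = pes[:max_items]
        # Rotate the pool so that the next resolution starts with other PEs.
        self.handlespace[pool_handle] = pes[len(selected):] + selected
        return selected

    def report_failure(self, pool_handle, pe):
        """A PR may decide to remove a reported PE; this sketch always does."""
        pes = self.handlespace.get(pool_handle, [])
        if pe in pes:
            pes.remove(pe)


class PoolUser:
    """Hypothetical PU with a PU-side cache of resolved PE identities."""

    def __init__(self, registrar):
        self.registrar = registrar
        self.cache = []

    def select(self, pool_handle):
        """Selection 2: pick the target PE from the cache, refilling it if empty."""
        if not self.cache:
            self.cache = self.registrar.resolve(pool_handle)
        return self.cache.pop(0) if self.cache else None

    def use_service(self, pool_handle, connect, max_attempts=5):
        """Try PEs until connect(pe) succeeds; report each failed PE to the PR."""
        for _ in range(max_attempts):
            pe = self.select(pool_handle)
            if pe is None:
                return None
            if connect(pe):
                return pe
            self.registrar.report_failure(pool_handle, pe)
        return None
```

In this sketch, `connect` stands in for establishing the application's connection to the selected PE; a PU whose first cached PE is unreachable simply continues with the next cache entry and only asks the PR again once the cache is empty.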
D. Failover Procedure

RSerPool supports optional client-based state synchronization [9] for failover: a PE can store its current state with respect to a specific connection in a state cookie which is sent to the corresponding PU. When a failover to a new PE is necessary, the PU can send this state cookie to the new PE, which can then restore the state and resume service at this point. However, RSerPool is not restricted to client-based state synchronization; any other application-specific failover procedure can be used as well.

E. The Protocol Stack

Figure 3 illustrates the RSerPool protocol stack. All components are based on SCTP over IPv4 and/or IPv6. For the PR, the application layer consists of ENRP and ASAP. While ENRP provides handlespace redundancy between multiple PRs, ASAP is used for registration, re-registration, de-registration and monitoring of PEs as well as for handle resolutions and failure reports by PUs. Between a PU and PE, ASAP becomes a session layer protocol that provides the client-based state synchronization as described in section I-D. This session layer communication, called control channel, is multiplexed with the application’s protocol, called data channel, over the same SCTP association. Optionally, PRs can announce themselves via ASAP and ENRP using multicast so that other PRs, PEs and PUs may be fully auto-configuring. This functionality has been omitted in the figure to enhance its readability.

Fig. 3. The RSerPool Protocol Stack

F. Applications

The lightweight, real-time, scalable and extendable architecture of RSerPool is not only applicable to the transport of SS7-based telephony signalling; other application scenarios include reliable SIP-based telephony [10], mobility management [11] and the management of distributed computing pools [12], [13].

Finally, load balancing using RSerPool is currently under discussion by the IETF RSerPool Working Group: due to its flexible server selection policies and pool management functionalities, it has many similarities to load balancer
protocols. A very common application for such load balancing systems is to distribute HTTP requests in web server farms. There is an ongoing effort to merge both the RSerPool framework and the Server/Application State Protocol (SASP [14], a contribution of IBM) for load balancers into one common architecture for highly-available server pool management and load distribution.

II. THE SCTP PROTOCOL

While the duty of RSerPool is to provide fault-tolerance against component failures, it relies on the SCTP transport protocol [2] to provide fault-tolerance against network failures. As explained in the introduction, SCTP allows multi-homing to fulfil the fault tolerance requirements of SS7. That is, two SCTP endpoints can be connected via two or more networks. When there are multiple disjoint paths between the two endpoints, SCTP can use another one when its primary path becomes unavailable. Such unavailability can be caused by network component and link failures or simply by long convergence times of inter-domain routing protocols (e.g. in the range of several minutes for BGP). SS7 requires a failover time of at most 800 ms and SCTP is able to satisfy this requirement [15].

From an endpoint’s view, each destination address is considered as a possible path, denoted as SCTP path [2], to transmit data over. SCTP uses path monitoring to check these paths for availability: in configurable intervals, SCTP sends control messages, called heartbeats, over each possible path. The peer endpoint, when receiving such a heartbeat, acknowledges it by sending a heartbeat acknowledgement. Paths on which acknowledgements are received are considered to be usable paths for data transport. When the actual data transport path (called primary path) becomes unavailable, a working one is selected and the data transmission is continued. The whole process of path monitoring and selection of a new primary path is transparent to the application layer. For details on the configuration of suitable heartbeat intervals and path
selection parameters, see [15].

SCTP has been designed to be independent of the underlying network layer protocol, i.e. it is not only possible to use IPv4 and IPv6 but also to adapt it to other or future protocols. In the view of SCTP, network layer protocols appear as SCTP paths to the multi-homing functionality. For example, an endpoint supports IPv4 and IPv6 and the peer endpoint is reachable via IPv4 and IPv6. Then, an association between these endpoints has two SCTP paths: one via IPv4 and one via IPv6. If there is a failure e.g. on the IPv4 path, it is therefore still possible to use the IPv6 path. Such multi-protocol setups are very likely in today’s networks, due to the growing IPv6 deployment in formerly IPv4-only networks.

In the area of telecommunications, associations are established for durations in the range of months or even years. Therefore, it has been necessary to define a dynamic address reconfiguration extension (abbreviated Add-IP, see [16]) allowing for the dynamic addition to and removal of transport addresses from an SCTP association without connection interruption. This especially allows interruption-free IPv6 site renumbering, i.e. changing the address prefix on a provider change to keep BGP routing tables small or even adding an additional provider for redundancy reasons. Furthermore, it even allows an association to be established in an IPv4-only network, being upgraded to IPv4+IPv6 and finally turned into IPv6 only, interruption-free and transparent to the upper layers.

III. THE RSERPOOL API

The programming API for RSerPool is currently actively being discussed by the IETF RSerPool WG. It will consist of two styles: the basic mode and the enhanced mode.

A. Basic Mode API

The basic mode provides only the fundamental RSerPool function calls for PEs to register, re-register or de-register and for PUs to resolve a pool handle and select a PE by policy. All session layer functionalities between PE and PU, especially failure detection and failover, have to be provided by the application
programs themselves. That is, a control channel is not supported here. The reason for having the basic mode API is to provide easy deployment of RSerPool functionality to existing applications, e.g. an FTP service application that supports download continuation using FTP’s reget functionality. An example for using the basic mode API can be found in [12].

B. Enhanced Mode API

Unlike the basic mode API, the enhanced mode API offers a complete session layer between PE and PU, including optional failover handling using client-based state synchronization.

That is, a PU establishes a session to a pool providing its desired service. The session layer provided by the enhanced mode API transparently handles pool handle resolutions, PE selections, association establishments, failure detection on the association using SCTP heartbeats, selecting a new PE when the former one becomes unreachable and optionally failover handling using state cookies via the control channel. For the application itself, this session layer can be completely transparent (see footnote 1). In fact, the pool appears to the user as one highly available server.
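The optional failover handling with state cookies can be pictured with a small simulation. This is only a sketch under stated assumptions, not the enhanced mode API: `PoolElement`, `next_chunk` and `download` are hypothetical names, the cookie is modelled as a dictionary holding a download offset, and returning the cookie from a function call stands in for sending it over the control channel.

```python
class PoolElement:
    """Hypothetical PE that serves `data` in chunks and can be made to fail."""

    def __init__(self, name, data, fail_after=None):
        self.name = name
        self.data = data
        self.fail_after = fail_after  # number of requests served before failing
        self.served = 0

    def next_chunk(self, cookie, size=4):
        """Return (chunk, state cookie); the cookie carries the download offset."""
        if self.fail_after is not None and self.served >= self.fail_after:
            raise ConnectionError(self.name + " is unreachable")
        offset = cookie["offset"] if cookie else 0
        chunk = self.data[offset:offset + size]
        self.served += 1
        return chunk, {"offset": offset + len(chunk)}


def download(pool_elements):
    """PU side: keep the latest cookie and replay it to a new PE on failover."""
    candidates = list(pool_elements)
    pe = candidates.pop(0)
    received, cookie = "", None
    while True:
        try:
            chunk, new_cookie = pe.next_chunk(cookie)
        except ConnectionError:
            if not candidates:
                raise
            pe = candidates.pop(0)  # failover to the next PE
            continue                # the retry sends the stored cookie
        if not chunk:
            return received
        received += chunk
        cookie = new_cookie
```

Because the new PE restores its state from the cookie it receives, the download resumes at the stored offset instead of restarting, which is the behaviour the session layer hides from the application.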
To provide easy adaptation of existing and new applications to RSerPool’s session layer functionality, the API mimics the Unix socket API. A pseudo-code example is shown in algorithm 1: similarly to creating a TCP socket, connecting it to a remote server and finally using the application’s protocol to do something, an RSerPool session is created, connected to a pool and the application’s protocol is used over the session. But unlike a simple TCP connection, RSerPool provides seamless service continuation in case of server failure, transparent to the application. The PE side of the enhanced mode API also looks similar to TCP-based servers, but instead of binding a socket to a port number, it is registered as PE under the service’s pool handle.

Note that applications do not have to care about any transport address when using the enhanced mode API. A PE is by default registered under all of its transport addresses, regardless of whether they are IPv4 or IPv6. Furthermore, using the Add-IP [16] extension of SCTP as described in section II, transport addresses may change at runtime, e.g. due to an IPv6 prefix change. At the PU side, transport association management and therefore handling of addresses is completely transparent to the application layer.

Currently, there are only two existing implementations of RSerPool: a closed source version by Motorola [17] and the authors’ own GPL-licensed Open Source prototype rsplib. This prototype will be explained in detail in the following section.

IV. THE rsplib PROTOTYPE

As part of our RSerPool research and to verify the results of our simulation model [8], [18] in real-life scenarios, we have created a complete prototype implementation [12], [19] of the RSerPool framework. It consists of a PR and a library, the rsplib, providing the PE and PU functionalities. Our implementation package, called the rsplib prototype, has been released [19] as Open Source under the GPL license.

Footnote 1: If the application uses a custom failover procedure, some interaction may be required.

Fig. 4. The rsplib Registrar

Fig. 5. The rsplib PU/PE Library

Elementary design criteria of our prototype have been platform independence and the support of both IPv4 and IPv6. To ensure platform independence, we have chosen C instead of C++ as implementation language, because C is more common on exotic devices. Although currently only Linux, FreeBSD and Darwin (MacOS X) are supported by our prototype, our long-term goal is to make it also available on embedded devices like PDAs and smartphones. A short-term goal is to extend our support to the Windows and Solaris platforms.

When we started the development of our prototype in 2002, the only stable SCTP implementation on our three main platforms (Linux, FreeBSD, Darwin) was our own Open Source userland SCTP implementation sctplib [20]. Meanwhile, the native SCTP support of these platforms has improved so that we also support the built-in kernel SCTP of Linux, FreeBSD (KAME stack) and Darwin. All afore-mentioned SCTP implementations, including our own sctplib, support the Add-IP extension for dynamic address reconfiguration.

The rsplib prototype is a complete implementation of RSerPool, also including the features being optional in the standards documents. In particular, we support both the basic and enhanced mode APIs, full auto-configuration by PR announcements via multicast (both for ASAP and ENRP) and all optional policies defined in the draft [7].
This draft is one contribution, based on our RSerPool research on policy performance [8], and has become a working group draft of the IETF RSerPool WG.

Algorithm 1 A PU Pseudo-Code Example
    rsp_connect(sd, "MyDownloadServerPool", ...);
    rsp_write(sd, "GET MyMovie.mpeg HTTP/1.0\r\n\r\n");
    while ((length = rsp_read(sd, buffer, ...)) > 0) {
        doSomething(buffer, length);
    }
    rsp_close(sd);

The building blocks of the rsplib prototype are shown in figure 4 (registrar) and figure 5 (PU/PE library). Both parts contain the Dispatcher component encapsulating the platform-dependent timer and file/socket event management as well as thread-safety functionality. On top of this component, the registrar realizes the ASAP and ENRP protocols. Their functionality is controlled by the Registrar Management, which forms the binding layer between the protocols and the registrar’s central component: the Handlespace Management. This component takes care of storing the handlespace’s content and providing access functionality for both ASAP (registration, re-registration, deregistration and monitoring of PEs, handle resolutions for PUs) and ENRP (handlespace synchronization between PRs). Authenticating and authorizing requests to the handlespace management is the duty of the Registrar Management. It also takes care of the optional transmission of ASAP and ENRP announcements via multicast.
For PUs and PEs, the Dispatcher is the foundation of the ASAP Instance component. The ASAP Instance consists of three sub-components: ASAP Protocol is the implementation of ASAP for communication to PRs and between PE and PU via the multiplexed control/data channel. For creating and parsing ASAP messages, it contains the sub-components ASAP Creator and ASAP Parser. In the Registrar Table, addresses of usable PRs, either statically configured or learned by the PRs’ announcements via multicast, are managed. When communication to a PR is necessary, this component also takes care of establishing a connection to a PR. The last sub-component of the ASAP Instance is the ASAP Cache, i.e. the PU-side cache for handle resolutions. For the implementation, the data structures and algorithms necessary to manage the cache are equal to the PR’s handlespace management. Therefore, its code can be reused here.

In an early version of our prototype, we realized the handlespace management using linear lists and provided only round robin as pool policy. This worked fine for simple lab scenarios; however, there has been a growing demand to realize additional policies like random selection or least used for the research on load distribution performance. Furthermore, pools of load balancing scenarios can become very large (hundreds of elements) and efficiency becomes crucial. Therefore, our simple approach became unsuitable and a more sophisticated handlespace management concept was necessary. We will explain our concept in the following section.

Fig. 6. Handlespace Management

V. HANDLESPACE MANAGEMENT

Before we describe our implementation of the handlespace management, we first define it as an abstract datatype: the handlespace is a set of $n$ pools ($n \in \mathbb{N}$), denoted by PH $h_1$ to $h_n$. Each pool $\pi$ contains a non-empty set of PE entries, denoted by their PE ID $i^{\pi}_k \in \{0, \ldots, 2^{32}-1\} \subset \mathbb{N}_0$. A PE entry includes the PE’s policy information $y^{\pi}_k$ (e.g.
the PE’s load in case of the least used (LU) policy) and the PE’s non-empty set of transport addresses $a^{\pi}_k$. The following operations must be possible on the handlespace datatype:
1) Insertion, lookup and removal of pools by pool handle;
2) Insertion, lookup and removal of PEs within a pool by PE ID;
3) Selection of PEs within a pool by policy;
4) Traversal of the handlespace for ENRP synchronization purposes.
Furthermore, it should be easily possible to add new selection policies for new applications.

A. Implementing the Handlespace

Implementing the abstract handlespace datatype becomes straightforward as illustrated in figure 6: there is a set of pools sorted by pool handle and each pool contains two sets of PE references, the first set sorted by PE ID (solid line), the second set sorted by a sorting order defined by the pool’s policy (dotted line). We will explain later how to actually implement a set. A policy-specific selection procedure implements the selection of a PE. In the default case, this simply means to select the first element from the set ordered by the sorting order. Using the structures above, it is only necessary to define a specific sorting order and selection procedure to implement a certain policy. Such definitions are the next step.

B. Implementing Policies

Before we define sorting order and selection procedure for some important policies defined in [7], we introduce two helper constructs for simplification.

For simplifying randomized selection, we define the following: to every PE entry $i$, a value $v_i \in \mathbb{R}$ may be mapped. It has to be possible to request the sum (called value sum)
$$V = \sum_{i} v_i$$
of all PE entries’ values within the set. Then, randomized selection is possible by choosing a number
$$r \in_R \{0, \ldots, V\} \subset \mathbb{R}.$$
Since the set is ordered, $r$ specifies the uniquely identifiable PE entry $j$ that satisfies the condition
$$\sum_{i=1}^{j-1} v_i < r \le v_j + \sum_{i=1}^{j-1} v_i.$$

Furthermore, to guarantee uniqueness of sorting orders, we add sequence numbers to pools and PE entries: each pool gets a pool sequence number and each PE a PE sequence number. Every time a
PE entry $i$ is inserted into the selection set or selected, its PE sequence number $seq_i$ is set to the pool sequence number of its pool. Finally, this pool sequence number is then incremented by 1.

Now, we can define some example policies and show how the helper constructs are used:

a) Round Robin: The only sorting key is the PE entries’ sequence number in ascending order and the selection procedure is the default one, i.e. getting the first element of the set. An example is given in table I: the upper block shows the pool “Example” before a selection. A selection returns PE entry ID-#1 (since it is the first entry of the set), its sequence number is set to the pool sequence number (4) and it is reinserted into the pool. Since it now has the highest sequence number, it is appended to the end of the set. Finally, the pool sequence number is incremented by one. A further selection will fetch PE ID-#2, then PE ID-#3, again PE ID-#1 and so on, providing the desired round robin behaviour.

b) Weighted Random: Since random selection cannot take elements from the top of the selection set but has to use values $v_i$ and their sum $V$, it is only necessary to ensure that the sorting keys in the set are unique. Using the PE sequence number as sorting key ensures this property. For the weighted random policy, the value $v_i$ of PE entry $i$ is set to the PE’s given weight constant. Table II shows an example: the pool consists of four PEs, where PEs ID-#6 and ID-#7 have weight 1 (therefore $v = 1$), PE ID-#2 has weight 3 (and therefore $v = 3$) and PE ID-#8 has weight 2 (and therefore $v = 2$). In the order of the set, the weight sum is therefore
$$V = 1 + 3 + 2 + 1 = 7.$$
For a selection, a random number
$$r \in_R \{0, \ldots, 7\} \subset \mathbb{R}$$
is chosen. Let $r = 5.75$. In this case, only $j = 3$ satisfies the condition
$$\sum_{i=1}^{j-1} v_i < 5.75 \le v_j + \sum_{i=1}^{j-1} v_i,$$
that is $1 + 3 < 5.75 \le 1 + 3 + 2$. Then, the third ($j$-th) PE of the set is selected: PE ID-#8.
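The two policies above can be reproduced with a minimal sketch that uses the sequence numbers and the value sum exactly as defined in the text. The class `Pool` and its method names are hypothetical and not taken from rsplib, a plain Python list stands in for the ordered set, and the handling of the boundary case $r = 0$ (which satisfies the condition for no $j$) is an assumption of this sketch: it falls back to the first entry.

```python
import random


class Pool:
    """Hypothetical pool whose selection set is ordered by PE sequence number."""

    def __init__(self):
        self.seq = 1        # pool sequence number
        self.entries = []   # PE entries in ascending order of their "seq"

    def insert(self, pe_id, value=1):
        self._append({"id": pe_id, "v": value})

    def _append(self, entry):
        # The entry takes the pool sequence number, which is then incremented;
        # the new number is the highest, so the entry belongs at the end.
        entry["seq"] = self.seq
        self.seq += 1
        self.entries.append(entry)

    def select_round_robin(self):
        entry = self.entries.pop(0)   # default selection: first element
        self._append(entry)
        return entry["id"]

    def select_weighted_random(self, r=None):
        total = sum(entry["v"] for entry in self.entries)   # value sum V
        if r is None:
            r = random.uniform(0, total)
        index, prefix = 0, 0
        for position, entry in enumerate(self.entries):
            # Condition from the text: prefix sum < r <= prefix sum + v_j
            if prefix < r <= prefix + entry["v"]:
                index = position
                break
            prefix += entry["v"]
        entry = self.entries.pop(index)
        self._append(entry)
        return entry["id"]
```

Inserting PEs 1, 2 and 3 and selecting by round robin yields the state of table I; inserting PEs 7, 2, 8 and 6 with weights 1, 3, 2 and 1 and calling `select_weighted_random(r=5.75)` selects PE 8, as in the worked example.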
Using a uniform distribution for choosing $r$, weighted random results in the desired behaviour of selecting PEs with a probability proportional to their weight constant.

c) Least Used: Using the least used policy, each PE’s policy information specifies the current server load as a value from 0% to 100%. The first part of the sorting key is this load value in ascending order. The second part is the PE sequence number in ascending order. We use the default selection procedure, i.e. taking the set’s first element. This selects the PE with the least load. If multiple PEs have the same least load, the PE sequence number as second part of the composite sorting key ensures round robin selection between these elements.

Arbitrary other policies can be expressed through the definition of a sorting order and a selection procedure. That is, our policy implementation concept offers a solid foundation for future and application-specific extensions.

C. Performance

After the definition of data structure and policies, the only remaining question of handlespace management is how to implement the datatype for the required sets. The naive solution is to simply use a linear list. A more efficient solution may be to use a binary tree, a red-black tree [21] (balanced tree) or a treap [22] (randomized tree). But does the effort for realistic pool sizes justify a more complicated structure?

To answer this question, we made a rough performance evaluation of our handlespace management implementation on an Athlon 1.3 GHz CPU. We have chosen this CPU since its power seems to be realistic for upcoming router CPU generations (see footnote 2); routers are devices on which a PR process could be started. For our handlespace performance evaluation, we are not interested in SCTP or network layer efficiency, therefore we omit it here.
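The least used policy and the role of the underlying set can be made concrete with a short sketch of one pool holding the two indices from figure 6: one by PE ID and one by the composite sorting key (load, PE sequence number). The class `LeastUsedPool` and its methods are hypothetical, and the sorted set is assumed to be an array-backed sorted list maintained with Python's `bisect` module, so insertion and removal shift elements much as a linear list would; a red-black tree or treap would replace only this container.

```python
import bisect


class LeastUsedPool:
    """Hypothetical pool with two indices: by PE ID and by (load, seq)."""

    def __init__(self):
        self.seq = 1          # pool sequence number
        self.by_id = {}       # PE ID -> current sorting key
        self.selection = []   # sorted list of (load, seq, PE ID)

    def register(self, pe_id, load):
        """Insert a PE, or re-register it with updated policy information."""
        self.deregister(pe_id)
        key = (load, self.seq, pe_id)
        self.seq += 1
        self.by_id[pe_id] = key
        bisect.insort(self.selection, key)

    def deregister(self, pe_id):
        """Remove a PE from both indices if it is registered."""
        key = self.by_id.pop(pe_id, None)
        if key is not None:
            del self.selection[bisect.bisect_left(self.selection, key)]

    def select(self):
        """Default selection: first element of the set; then renew its seq."""
        load, _, pe_id = self.selection[0]
        self.register(pe_id, load)
        return pe_id
```

With PEs 2 and 3 both at 20% load and PE 1 at 50%, repeated calls to `select()` alternate between PEs 2 and 3, which is the round robin behaviour among equally loaded PEs that the composite key is meant to provide.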
As test scenario, we assume two large pools, using the least used policy, in which we scale the average amount of PEs from 1 to 1000. Since pools map to specific applications, it is realistic to assume a small amount (e.g. less than 20). Furthermore, applications requiring significantly large pools are assumed to be rare (e.g. a web server farm or a distributed computing service). Therefore, two large pools seem to be realistic. We omit adding additional small pools (e.g. 2 to 5 PEs) here, since this would not significantly affect the results.

Footnote 2: The current Juniper ERX1400, a 300,000 US$ router, only contains a Pentium-III at 500 MHz.

TABLE I
ROUND ROBIN POLICY EXAMPLE

Pool “Example”, Policy RR, seq=4
    Pool Element ID-#1    seq=1
    Pool Element ID-#2    seq=2
    Pool Element ID-#3    seq=3

Pool “Example”, Policy RR, seq=5
    Pool Element ID-#2    seq=2
    Pool Element ID-#3    seq=3
    Pool Element ID-#1    seq=4

TABLE II
WEIGHTED RANDOM POLICY EXAMPLE

Pool “Example”, Policy WRR, seq=5, V=7
    Pool Element ID-#7    seq=1, v=1
    Pool Element ID-#2    seq=2, v=3
    Pool Element ID-#8    seq=3, v=2
    Pool Element ID-#6    seq=4, v=1

Fig. 7. Performance

Each PE is assumed to handle 10 PU requests/s (the more PEs, the more PU requests; adding servers only makes sense when there is more work to be done). That is, 10 handle resolutions per second and PE are required from the handlespace management. A PE stays registered for an average duration of 30 min (uniform distribution) and then deregisters. During its runtime, a re-registration is made every 30 s (default from [5]). When a PE is removed, a new PE is added to keep the average amount of PEs constant. Synchronization (this means traversal of the handlespace) is made every 5 minutes. The handlespace is a priori filled with the given amount of PEs; then, each test runs for 10 min. The more components are in the scenario, the more handlespace operations have to be executed. For statistical accuracy, each test has been repeated 5 times; the shown results are the average values and their 95% confidence intervals, being computed by R Project.

Figure 7 shows the CPU’s load as a fraction of the run-time (10 min) required for
handlespace operations in percent for the implementation of a set by linear list, binary tree, red-black tree and treap. On the x-axis, the total amount of PEs is shown (they are divided between the two pools). Obviously, balanced trees (red-black) and randomized trees (treap)
