Workload characterization of Java server applications on two PowerPC processors
Java Runtime Systems Characterization and Architectural Implications
Java Runtime Systems: Characterization and Architectural Implications

Ramesh Radhakrishnan, Member, IEEE, N. Vijaykrishnan, Member, IEEE, Lizy Kurian John, Senior Member, IEEE, Anand Sivasubramaniam, Member, IEEE, Juan Rubio, Member, IEEE, and Jyotsna Sabarinathan

Abstract: The Java Virtual Machine (JVM) is the cornerstone of Java technology and its efficiency in executing the portable Java bytecodes is crucial for the success of this technology. Interpretation, Just-In-Time (JIT) compilation, and hardware realization are well-known solutions for a JVM and previous research has proposed optimizations for each of these techniques. However, each technique has its pros and cons and may not be uniformly attractive for all hardware platforms. Instead, an understanding of the architectural implications of JVM implementations with real applications can be crucial to the development of enabling technologies for efficient Java runtime system development on a wide range of platforms. Toward this goal, this paper examines architectural issues from both the hardware and JVM implementation perspectives. The paper starts by identifying the important execution characteristics of Java applications from a bytecode perspective. It then explores the potential of a smart JIT compiler strategy that can dynamically interpret or compile based on associated costs and investigates the CPU and cache architectural support that would benefit JVM implementations. We also study the available parallelism during the different execution modes using applications from the SPECjvm98 benchmarks. At the bytecode level, it is observed that fewer than 45 out of the 256 bytecodes constitute 90 percent of the dynamic bytecode stream. Method sizes fall into a trinodal distribution with peaks of 1, 9, and 26 bytecodes across all benchmarks. The architectural issues explored in this study show that, when Java applications are executed with a JIT compiler, selective translation using good heuristics can improve performance, but the saving is only 10-15 percent at
best. The instruction and data cache performance of Java applications is seen to be better than that of C/C++ applications except in the case of data cache performance in the JIT mode. Write misses resulting from installation of JIT compiler output dominate the misses and deteriorate the data cache performance in JIT mode. A study on the available parallelism shows that Java programs executed using JIT compilers have parallelism comparable to C/C++ programs for small window sizes, but fall behind when the window size is increased. Java programs executed using the interpreter have very little parallelism due to the stack nature of the JVM instruction set, which is dominant in the interpreted execution mode. In addition, this work gives revealing insights and architectural proposals for designing an efficient Java runtime system.

Index Terms: Java, Java bytecodes, CPU and cache architectures, ILP, performance evaluation, benchmarking.

1 INTRODUCTION

The Java Virtual Machine (JVM) [1] is the cornerstone of Java technology, epitomizing the "write-once run-anywhere" promise. It is expected that this enabling technology will make it a lot easier to develop portable software and standardized interfaces that span a spectrum of hardware platforms. The envisioned underlying platforms for this technology include powerful (resource-rich) servers, network-based and personal computers, together with resource-constrained environments such as hand-held devices, specialized hardware/embedded systems, and even household appliances. If this technology is to succeed, it is important that the JVM provide an efficient execution/runtime environment across these diverse hardware platforms. This paper examines different architectural issues, from both the hardware and JVM implementation perspectives, toward this goal.

Applications in Java are compiled into the bytecode format to execute in the Java Virtual Machine (JVM). The core of the JVM implementation is the execution engine that executes the bytecodes. This can be
implemented in four different ways:

1. An interpreter is a software emulation of the virtual machine. It uses a loop which fetches, decodes, and executes the bytecodes until the program ends. Due to the software emulation, the Java interpreter has an additional overhead and executes more instructions than just the bytecodes.

2. A Just-in-time (JIT) compiler is an execution model which tries to speed up the execution of interpreted programs. It compiles a Java method into native instructions on the fly and caches the native sequence. On future references to the same method, the cached native method can be executed directly without the need for interpretation. JIT compilers have been released by many vendors, like IBM [2] and Symantec [3]. Compiling during program execution, however, inhibits aggressive optimizations because compilation must only incur a small overhead. Another disadvantage of JIT compilers is the two to three times increase in the object code, which becomes critical in memory constrained embedded systems. There are many ongoing projects in developing JIT compilers that aim to achieve C++-like performance, such as CACAO [4].

R. Radhakrishnan, L.K. John, and J. Rubio are with the Laboratory for Computer Architecture, Department of Electrical and Computer Engineering, University of Texas at Austin, Austin, TX 78712. E-mail: {radhakri, ljohn, jrubio}@.
N. Vijaykrishnan and A. Sivasubramaniam are with the Department of Computer Science and Engineering, 220 Pond Lab., Pennsylvania State University, University Park, PA 16802. E-mail: {vijay, anand}@.
J. Sabarinathan is with the Motorola Somerset Design Center, 6263 McNeil Dr. #1112, Austin, TX 78829. E-mail: jyotsna@.
Manuscript received 28 Apr. 2000; revised 16 Oct. 2000; accepted 31 Oct. 2000.
For information on obtaining reprints of this article, please send e-mail to: tc@, and reference IEEECS Log Number 112014.
0018-9340/01/$10.00 © 2001 IEEE

3. Off-line bytecode compilers can be classified into two types: those that generate native code and those that generate an intermediate language like
C. Harissa [5], TowerJ [6], and Toba [7] are compilers that generate C code from bytecodes. The choice of C as the target language permits the reuse of extensive compilation technology available in different platforms to generate the native code. In bytecode compilers that generate native code directly, like NET [8] and Marmot [9], portability becomes extremely difficult. In general, only applications that operate in a homogeneous environment and those that undergo infrequent changes benefit from this type of execution.

4. A Java processor is an execution model that implements the JVM directly on silicon. It not only avoids the overhead of translation of the bytecodes to another processor's native language, but also provides support for Java runtime features. It can be optimized to deliver much better performance than a general purpose processor for Java applications by providing special support for stack processing, multithreading, garbage collection, object addressing, and symbolic resolution. Java processors can be cost-effective to design and deploy in a wide range of embedded applications, such as telephony and web tops. The picoJava [10] processor from Sun Microsystems is an example of a Java processor.

It is our belief that no one technique will be universally preferred/accepted over all platforms in the immediate future. Many previous studies [11], [12], [13], [10], [14] have focused on enhancing each of the bytecode execution techniques. On the other hand, a three-pronged attack at optimizing the runtime system of all techniques would be even more valuable. Many of the proposals for improvements with one technique may be applicable to the others as well. For instance, an improvement in the synchronization mechanism could be useful for an interpreted or JIT mode of execution. Proposals to improve the locality behavior of Java execution could be useful in the design of Java processors, as well as in the runtime environment on general purpose processors. Finally, this three-pronged strategy can also help us design
environments that efficiently and seamlessly combine the different techniques wherever possible.

A first step toward this three-pronged approach is to gain an understanding of the execution characteristics of different Java runtime systems for real applications. Such a study can help us evaluate the pros and cons of the different runtime systems (helping us selectively use what works best in a given environment), isolate architectural and runtime bottlenecks in the execution to identify the scope for potential improvement, and derive design enhancements that can improve performance in a given setting. This study embarks on this ambitious goal, specifically trying to answer the following questions:

• Do the characteristics seen at the bytecode level favor any particular runtime implementation? How can we use the characteristics identified at the bytecode level to implement more efficient runtime implementations?
• Where does the time go in a JIT-based execution (i.e., in translation to native code or in executing the translated code)? Can we use a hybrid JIT-interpreter technique that can do even better? If so, what is the best we can hope to save from such a hybrid technique?
• What are the execution characteristics when executing Java programs (using an interpreter or JIT compiler) on a general-purpose CPU (such as the SPARC)? Are these different from those for traditional C/C++ programs? Based on such a study, can we suggest architectural support in the CPU (either general-purpose or a specialized Java processor) that can enhance Java executions?

To our knowledge, there has been no prior effort that has extensively studied all these issues in a unified framework for Java programs. This paper sets out to answer some of the above questions using applications drawn from the SPECjvm98 [15] benchmarks, available JVM implementations such as JDK 1.1.6 [16] and Kaffe VM 0.9.2 [17], and simulation/profiling tools on the Shade [18] environment.
All the experiments have been conducted on Sun UltraSPARC machines running SunOS 5.6.

1.1 Related Work

Studies characterizing Java workloads and performance analysis of Java applications are becoming increasingly important and relevant as Java increases in popularity, both as a language and software development platform. A detailed characterization of the JVM workload for the UltraSPARC platform was done in [19] by Barisone et al. The study included a bytecode profile of the SPECjvm98 benchmarks, characterizing the types of bytecodes present and their frequency distribution. In this paper, we start with such a study and extend it to characterize other metrics, such as locality and method sizes, as they impact the performance of the runtime environment very strongly. Barisone et al. use the profile information collected from the interpreter and JIT execution modes as an input to a mathematical model of a RISC architecture to suggest architectural support for Java workloads. Our study uses a detailed superscalar processor simulator and also includes studies on available parallelism to understand the support required in current and future wide-issue processors.
Romer et al. [20] studied the performance of interpreters and concluded that no special hardware support is needed for increased performance. Hsieh et al. [21] studied the cache and branch performance of interpreted Java code, the C/C++ version of the Java code, and native code generated by Caffeine (a bytecode to native code compiler) [22]. They identify the interpreter's inefficient use of the microarchitectural resources as a significant performance penalty and suggest that an offline bytecode to native code translator is a more efficient Java execution model. Our work differs from these studies in two important ways. First, we include a JIT compiler in this study, which is the most commonly used execution model presently. Second, the benchmarks used in our study are large real world applications, while the above-mentioned study uses microbenchmarks due to the unavailability of a Java benchmark suite at the time of their study. We see that the characteristics of the application used affect which execution mode is favored and, therefore, the choice of benchmarks used is important.

Other studies have explored possibilities of improving performance of the Java runtime system by understanding the bottlenecks in the runtime environment and ways to eliminate them. Some of these studies try to improve the performance through better synchronization mechanisms [23], [24], [25], more efficient garbage collection techniques [26], and understanding the memory referencing behavior of Java applications [27]. Improving the runtime system, tuning the architecture to better execute Java workloads, and better compiler/interpreter performance are all equally important to achieve efficient performance for Java applications.

The rest of this paper is organized as follows: The next section gives details on the experimental platform. In Section 3, the bytecode characteristics of the SPECjvm98 benchmarks are presented. Section 4 examines the relative performance of JIT and interpreter modes and explores the benefits of a
hybrid strategy. Section 5 investigates some of the questions raised earlier with respect to the CPU and cache architectures. Section 6 collates the implications and inferences that can be drawn from this study. Finally, Section 7 summarizes the contributions of this work and outlines directions for future research.

2 EXPERIMENTAL PLATFORM

We use the SPECjvm98 benchmark suite to study the architectural implications of a Java runtime environment. The SPECjvm98 benchmark suite consists of seven Java programs which represent different classes of Java applications. The benchmark programs can be run using three different inputs, which are named s100, s10, and s1. These problem sizes do not scale linearly, as the naming suggests. We use the s1 input set to present the results in this paper, and the effects of larger data sets, s10 and s100, have also been investigated. The increased method reuse with larger data sets results in increased code locality, reduced time spent in compilation as compared to execution, and other such issues as can be expected. The benchmarks are run at the command line prompt and do not include graphics, AWT (graphical interfaces), or networking. A description of the benchmarks is given in Table 1. All benchmarks except mtrt are single-threaded. Java is used to build applications that span a wide range, from applets at the low end to server-side applications at the high end. The observations cited in this paper hold for those subsets of applications which are similar to the SPECjvm98 benchmarks when run with the dataset used in this study.

Two popular JVM implementations have been used in this study: the Sun JDK 1.1.6 [16] and Kaffe VM 0.9.2 [17]. Both these JVM implementations support the JIT and interpreted modes. Since the source code for the Kaffe VM compiler was available, we could instrument it to obtain the behavior of the translation routines for the JIT mode in detail. Some of the data presented in Sections 4 and 5 are obtained from the instrumented translate routines
in Kaffe. The results using Sun's JDK are presented for the other sections and only differences, if any, from the Kaffe VM environment are mentioned. The use of two runtime implementations also gives us more confidence in our results, filtering out any noise due to the implementation details.

To capture architectural interactions, we have obtained traces using the Shade binary instrumentation tool [18] while running the benchmarks under different execution modes. Our cache simulations use the cachesim5 simulators available in the Shade suite, while branch predictors have been developed in-house. The instruction level parallelism studies are performed utilizing a cycle-accurate superscalar processor simulator. This simulator can be configured to a variety of out-of-order multiple issue configurations with desired cache and branch predictors.

3 CHARACTERISTICS AT THE BYTECODE LEVEL

We characterize bytecode instruction mix, bytecode locality, method locality, and method size in order to understand the benchmarks at the bytecode level. The first characteristic we examine is the bytecode instruction mix of the JVM, which is a stack-oriented architecture. To simplify the discussion, we classify the instructions into different types based on their inherent functionality, as shown in Table 2.

RADHAKRISHNAN ET AL.: JAVA RUNTIME SYSTEMS: CHARACTERIZATION AND ARCHITECTURAL IMPLICATIONS

TABLE 1. Description of the SPECjvm98 Benchmarks

Table 3 shows the resulting instruction mix for the SPECjvm98 benchmark suite. The total bytecode count ranges from 2 million for db to approximately a billion for compress. Most of the benchmarks show similar distributions for the different instruction types. Load instructions outnumber the rest, accounting for 35.5 percent of the total number of bytecodes executed on the average. Constant pool and method call bytecodes come next with average frequencies of 21 percent and 11 percent, respectively. From an architectural point of view, this implies that transferring data elements to and from the memory
space allocated for local variables and the Java stack dominates. Comparing this with the benchmark 126.gcc from the SPEC CPU95 suite, which has roughly 25 percent of memory access operations when run on a SPARC V9 architecture, it can be seen that the JVM places greater stress on the memory system. Consequently, we expect that techniques such as instruction folding proposed in [28] for Java processors and instruction combining proposed in [29] for JIT compilers can improve the overall performance of Java applications.

The second characteristic we examine is the dynamic size of a method (see footnote 1). Invoking methods in Java is expensive as it requires the setting up of an execution environment and a new stack for each new method [1]. Fig. 1 shows the method sizes for the different benchmarks. A trinodal distribution is observed, where most of the methods are either 1, 9, or 26 bytecodes long. This seems to be a characteristic of the runtime environment itself (and not of any particular application) and can be attributed to a frequently used library. However, the existence of single bytecode methods indicates the presence of wrapper methods to implement specific features of the Java language like private and protected methods or interfaces. These methods consist of a control transfer instruction which transfers control to an appropriate routine.

IEEE TRANSACTIONS ON COMPUTERS, VOL. 50, NO. 2, FEBRUARY 2001

TABLE 2. Classification of Bytecodes

TABLE 3. Dynamic Instruction Mix at the Bytecode Level

Footnote 1. A Java method is equivalent to a "function" or "procedure" in a procedural language.

Further analysis of the traces shows that a few unique bytecodes constitute the bulk of the dynamic bytecode stream. In most benchmarks, fewer than 45 distinct bytecodes constitute 90 percent of the executed bytecodes and fewer than 33 bytecodes constitute 80 percent of the executed bytecodes (Table 4). It is observed that memory access and memory allocation-related bytecodes dominate the bytecode stream of all the benchmarks. This also suggests that if the
instruction cache can hold the JVM interpreter code corresponding to these bytecodes (i.e., all the cases of the switch statement in the interpreter loop), the cache performance will be better.

Table 5 presents the number of unique methods and the frequency of calls to those methods. The number of methods and the dynamic calls are obtained at runtime by dynamically profiling the application. Hence, only methods that execute at least once have been counted. Table 5 also shows that the static size of the benchmarks remains constant across the different data sets (since the number of unique methods does not vary), although the dynamic instruction count increases for the bigger data sets (due to increased method calls). The number of unique calls has an impact on the number of indirect call sites present in the application. Looking at the three data sets, we see that there is very little difference in the number of methods across data sets.

Another bytecode characteristic we look at is the method reuse factor for the different data sets. The method reuse factor can be defined as the ratio of method calls to the number of methods visited at least once. It indicates the locality of methods. The method reuse factor is presented in Table 6. The performance benefits that can be obtained from using a JIT compiler are directly proportional to the method reuse factor since the cost of compilation is amortized over multiple calls in JIT execution.

The higher number of method calls indicates that the method reuse in the benchmarks for larger data sets would be substantially more. This would then lead to better performance for the JITs (as observed in the next section). In Section 5, we show that the instruction count when the benchmarks are executed using a JIT compiler is much lower than when using an interpreter for the s100 data set. Since there is higher method reuse in all benchmarks for the larger data sets, using a JIT results in better performance over an interpreter. The bytecode characteristics
described in this section help in understanding some of the issues involved in the performance of the Java runtime system (presented in the remainder of the paper).

Fig. 1. Dynamic method size.

TABLE 4. Number of Distinct Bytecodes that Account for 80 Percent, 90 Percent, and 100 Percent of the Dynamic Instruction Stream

TABLE 5. Total Number of Method Calls (Dynamic) and Unique Methods for the Three Data Sets

4 WHEN OR WHETHER TO TRANSLATE

Dynamic compilation has been popularly used [11], [30] to speed up Java executions. This approach avoids the costly interpretation of JVM bytecodes while sidestepping the issue of having to precompile all the routines that could ever be referenced (from both the feasibility and performance angles). Dynamic compilation techniques, however, pay the penalty of having the compilation/translation to native code falling in the critical path of program execution. Since this cost is expected to be high, it needs to be amortized over multiple executions of the translated code. Otherwise, performance can become worse than when the code is just interpreted. Knowing when to dynamically compile a method (using a JIT), or whether to compile at all, is extremely important for good performance.

To our knowledge, there has not been any previous study that has examined this issue in depth in the context of Java programs, though there have been previous studies [13], [31], [12], [4] examining efficiency of the translation procedure and the translated code. Most of the currently available execution environments, such as JDK 1.2 [16] and Kaffe [17], employ limited heuristics to decide on when (or whether) to JIT. They typically translate a method on its first invocation, regardless of how long it takes to interpret/translate/execute the method and how many times the method is invoked. It is not clear if one could do better (with a smarter heuristic) than what many of these environments provide. We investigate
these issues in this section using five SPECjvm98 [15] benchmarks (together with a simple HelloWorld program; see footnote 2) on the Kaffe environment.

Fig. 2 shows the results for the different benchmarks. All execution times are normalized with respect to the execution time taken by the JIT mode on Kaffe. The number on top of each JIT execution bar is the ratio of the time taken by this mode to the time taken for interpreting the program using Kaffe VM. As expected (from the method reuse characteristics for the various benchmarks), we find that translating (JIT-ing) the invoked methods significantly outperforms interpreting the JVM bytecodes for the SPECjvm98. The first bar, which corresponds to execution time using the default JIT, is further broken down into two components: the total time taken to translate/compile the invoked methods and the time taken to execute these translated (native code) methods. The considered workloads span the spectrum, from those in which the translation times dominate, such as hello and db (because most of the methods are neither time consuming nor invoked numerous times), to those in which the native code execution dominates, such as compress and jack (where the cost of translation is amortized over numerous invocations).

The JIT mode in Kaffe compiles a method to native code on its first invocation. We next investigate how well the smartest heuristic can do so that we compile only those methods that are time consuming (the translation/compilation cost is outweighed by the execution time) and interpret the remaining methods. This can tell us whether we should strive to develop a more intelligent selective compilation heuristic at all and, if so, what the performance benefit is that we can expect.

Let us say that a method $i$ takes $I_i$ time to interpret, $T_i$ time to translate, and $E_i$ time to execute the translated code. Interpreting all $n_i$ invocations costs $n_i I_i$, while translating first costs $T_i + n_i E_i$. Then, there exists a crossover point $N_i = T_i / (I_i - E_i)$, where it would be better to translate the method if the number of times a method is invoked $n_i > N_i$ and interpret it otherwise. We assume that an oracle supplies $n_i$ (the number of times a method is invoked) and $N_i$ (the ideal cut-off threshold for a method). If $n_i < N_i$, we interpret all invocations of the method, and otherwise translate it on the very first invocation. The second bar in Fig. 2 for each application shows the performance with this oracle, which we shall call opt. It can be observed that there is very little difference between the naive heuristic used by Kaffe and opt for compress and jack since most of the time is spent in the execution of the actual code anyway (very little time in translation or interpretation). As the translation component gets larger (applications like db, javac, or hello), the opt model suggests that some of the less time-consuming (or less frequently invoked) methods be interpreted to lower the execution time. This results in a 10-15 percent savings in execution time for these applications. It is to be noted that the exact savings would definitely depend on the efficiency of the translation routines, the translated code execution, and interpretation.

The opt results give useful insights. Fig. 2 shows that, by improving the heuristic that is employed to decide on when/whether to JIT, one can at best hope to trim 10-15 percent in the execution time. It must be observed that the 10-15 percent gains observed can vary with the amount of method reuse and the degree of optimization that is used. For example, we observed that the translation time for the Kaffe JVM accounts for a smaller portion of overall execution time with larger data sets (7.5 percent for the s10 dataset (shown in Table 7) as opposed to the 32 percent for the s1 dataset). Hence, reducing the translation overhead will be of lesser importance when execution time dominates translation time. However, as more aggressive optimizations are used, the translation time can consume a significant portion of execution time for even larger datasets. For instance, the base configuration of the translator in IBM's
Jalapeno VM [32] takes negligible translation time when using the s100 data set for javac. However, with more aggressive optimizations, about 30 percent of overall execution time is consumed in translation to ensure that the resulting code is executed much faster [32]. Thus, there exists a trade-off between the amount of time spent in optimizing the code and the amount of time spent in actually executing the optimized code.

Fig. 2. Dynamic compilation: How well can we do?

Footnote 2. While we do not draw any major conclusions based on this simple program, it serves to observe the behavior of the JVM implementation while loading and resolving system classes during system initialization.

TABLE 6. Method Reuse Factor for the Different Data Sets
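The translate-or-interpret rule of Section 4 and the method reuse factor of Section 3 can be made concrete with a small sketch. Writing I, T, and E for a method's per-invocation interpretation time, one-time translation time, and per-invocation native execution time, translating pays off once the invocation count n exceeds T / (I - E). The class and function names below are illustrative assumptions, not part of Kaffe or any JVM, and the timings in the usage note are made-up numbers rather than measurements from the paper.

```python
from dataclasses import dataclass


@dataclass
class MethodProfile:
    """Per-method profile; all field names are hypothetical."""
    name: str
    interpret_time: float   # I: time to interpret one invocation
    translate_time: float   # T: one-time cost of translating to native code
    execute_time: float     # E: time to run the translated code once
    invocations: int        # n: invocation count, assumed known (the oracle)


def crossover(m):
    """Return N = T / (I - E); infinite if native code is no faster."""
    gain = m.interpret_time - m.execute_time
    if gain <= 0:
        return float("inf")
    return m.translate_time / gain


def translate_always(m):
    """Naive policy: translate every method on its first invocation."""
    return True


def translate_opt(m):
    """Oracle policy: translate only if n exceeds the crossover point."""
    return m.invocations > crossover(m)


def total_time(methods, policy):
    """Total run time when `policy` decides translate vs. interpret."""
    total = 0.0
    for m in methods:
        if policy(m):
            total += m.translate_time + m.invocations * m.execute_time
        else:
            total += m.invocations * m.interpret_time
    return total


def method_reuse_factor(methods):
    """Ratio of method calls to methods invoked at least once."""
    called = [m for m in methods if m.invocations > 0]
    return sum(m.invocations for m in called) / len(called)
```

With a hot method (I=10, T=100, E=1, n=1000) and a cold one (same timings, n=2), `total_time` gives 1202.0 under `translate_always` and 1120.0 under `translate_opt`: the oracle saves only the translation cost wasted on the cold method, which is why its benefit shrinks as native execution time dominates.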
Workload Characterization of Java Server Applications on Two PowerPC Processors ∗

Pattabi Seshadri and Lizy K. John
Dept of Electrical and Computer Engr
The University of Texas at Austin
{seshadri,ljohn}@

Alex Mericas
IBM Corporation
mericas@

∗ © 2001 IEEE. Reprinted with permission from “Workload Characterization of Multithreaded Java Servers on Two PowerPC Processors” by Pattabi Seshadri and Alex Mericas, Proceedings of the Fourth Annual Workshop on Workload Characterization, Austin, Texas, December 2001, pp. 36-44.

Abstract

Java has become fairly popular on commercial servers in recent years. However, the behavior of Java server applications has not been studied extensively. We characterize two Java server benchmarks, SPECjbb2000 and VolanoMark 2.1.2, on two IBM PowerPC architectures, the RS64-III and the POWER3-II, and compare them to more traditional workloads as represented by selected benchmarks from SPECint2000. We find that our Java server benchmarks have generally the same characteristics on both platforms: in particular, high instruction cache, ITLB, and BTAC (Branch Target Address Cache) miss rates. These benchmarks also exhibit high L2 miss rates due mostly to data loads. Instruction cache and L2 misses are seen to be the primary contributors to CPI.

1. Introduction

Java, originally used extensively for web client software, is an emerging paradigm for server applications because of its portability and enhanced security features. However, while Java server applications are coming into wide use, their behavior is not yet well understood. Java client applications have been studied [17,9,15], but Java server applications differ significantly from client workloads, particularly in their need to maintain many concurrent client connections. Since in the current version of Java, I/O multiplexing, polling, and signals are not available, the only method available to Java programmers to maintain a large number of client connections is threads.
One or more separate threads are created to handle each client connection [12]. Therefore performance in the presence of a large number of concurrent threads is vital to a Java server application. This distinct characteristic of Java server applications could lead to differences with Java client workloads in terms of branch behavior, cache behavior, and other metrics that contribute to overall performance.

The aim of this study is to characterize the impact of multithreaded Java server applications on modern processor microarchitectures. To this end, we compare multithreaded Java server benchmarks with selected benchmarks from SPECint2000, a suite of more “traditional” workloads. We run these benchmarks on two IBM PowerPC microarchitectures, the RS64-III and the POWER3-II.

2. Related Work

Commercial workloads have been increasing in importance, and efforts have been made to understand their behavior [2,11,8,7,16,1]. Most of these studies have been focused on applications written in C or C++, in particular OLTP, DSS, and web server applications. Java has also been a popular subject of research. The majority of Java studies use SPECjvm98 [17,9,15], which is a client benchmark suite. SPECjvm98 has been observed to have as much as 31% kernel activity due for the most part to a TLB service routine, which indicates a high TLB miss rate. SPECjvm running on an interpreter has also been observed to have poor ILP and insensitivity to wider issue width [9]. However, it has better instruction cache performance than some C/C++ applications [15]. Commercial Java servers are emerging workloads and thus research has just begun on their behavior. Most of the research in this area has been on the effect of multithreading. Cain and Rajwar [6] studied branch prediction and cache behavior in SPECjbb2000 and TPC-W with the full-system simulation of a coarse-grained multithreaded processor. They found destructive interference between threads that degraded performance.
Luo and John [10] studied the impact of multithreading in Java server benchmarks on a Pentium Pro machine. They did see constructive interference in the instruction stream and branch prediction behavior, but these benefits were eventually overcome by increasing resource stalls as the number of threads grew large.

This paper focuses on the differences between Java server applications and more “traditional” workloads (represented by SPECint2000). We use two popular IBM PowerPC platforms that represent the state of the art in microprocessor design. Several performance metrics, such as cache behavior, branch behavior, dispatch behavior, CPI components, etc., are studied.

3. Methodology

This section describes the hardware platforms and benchmarks used in this study as well as the methods used to collect performance monitor data.

3.1. Platforms

We use two IBM PowerPC microarchitectures for our study: the RS64-III and the POWER3-II. Both are current microprocessor architectures, but they differ in many significant ways.

The RS64-III [4,5] is a 64-bit, superscalar, in order, speculative execution machine and is targeted specifically for commercial applications. It has one single cycle integer unit, one multiple cycle integer unit, one four stage pipelined floating point unit, one branch unit, and one load/store unit. The RS64-III can fetch, dispatch, and retire up to four instructions per cycle and has a five stage pipeline. It does not predict branches dynamically like the POWER3-II, but rather prefetches up to eight instructions from the branch target into a branch target buffer during normal execution, predicts the branch not taken, continues to fetch from the instruction stream and then, once the branch is resolved in the dispatch stage, either continues fetching from the current instruction stream with no penalty or flushes the instructions after the branch and begins fetching from the branch target buffer, with a penalty of at most one and often zero cycles.
The RS64-III has a 128KB, two-way set associative L1 instruction cache, a 128KB, two-way set associative L1 data cache, and a 4MB, four-way set associative unified L2 cache. It also has a 512-entry, four-way set associative unified TLB and a 64-entry instruction effective to real address translation buffer (IERAT) that allows fast address translation without the use of the TLB. The processor clock is 500 MHz.

The POWER3-II [13,14] is a 64-bit, superscalar, out-of-order, speculative execution machine. It has two single-cycle integer units, one multiple-cycle integer unit, one branch/condition register unit, two load/store units, and two three-stage pipelined floating point units. It can fetch, dispatch, and retire up to four instructions in the same cycle. It has a 256-entry branch target address cache (BTAC), which works like a branch target buffer, and a 2048-entry branch history table with 2 bits per entry for dynamic branch prediction. The POWER3-II has a 64KB, 128-way set associative, four-way interleaved L1 instruction cache, a 64KB, 128-way set associative, four-way interleaved L1 data cache, and an 8MB, four-way set associative unified off-chip L2 cache. It also has a 256-entry, two-way set associative instruction TLB and two 256-entry, two-way set associative data TLBs. The POWER3-II is designed with separate buses to memory and L2 for greater memory bandwidth. The POWER3-II also employs a data prefetching mechanism that detects sequential data access patterns and prefetches cache lines to match these patterns. The processor clock is 450 MHz.

Both of these processors are deployed in IBM p-series systems. The RS64-III system we use in the experiment is the M80 and the POWER3-II system we use is the 44p-170, both of which are configured as uniprocessor systems.
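The L1 instruction cache parameters listed above can be compared with a short sketch. This is only an illustration: the `Cache` class and its method name are hypothetical, and only the sizes and associativities come from the text. The number of sets is not computed because the cache line sizes are not stated here.

```python
from dataclasses import dataclass

KB = 1024


@dataclass(frozen=True)
class Cache:
    """Hypothetical helper holding only the parameters given in the text."""
    name: str
    size_bytes: int
    ways: int

    def bytes_per_way(self) -> int:
        """Capacity of a single way: total size divided by associativity."""
        return self.size_bytes // self.ways


# Sizes and associativities as stated in the platform description.
l1_instruction_caches = [
    Cache("RS64-III L1 I-cache", 128 * KB, 2),
    Cache("POWER3-II L1 I-cache", 64 * KB, 128),
]

for cache in l1_instruction_caches:
    print(f"{cache.name}: {cache.size_bytes // KB} KB, "
          f"{cache.ways}-way, {cache.bytes_per_way()} bytes per way")
```

Under these parameters the RS64-III instruction cache has twice the capacity, while the POWER3-II instruction cache spreads half that capacity over far more ways, so the two designs sit at opposite ends of the size versus associativity trade-off.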
Both systems have 2 GB of main memory and run AIX 4.3.3 and the IBM JDK version

3.2. Benchmarks

In this study, we characterize VolanoMark 2.1.2 and SPECjbb2000, both of which are Java server benchmarks.

VolanoMark 2.1.2 [20] is a Java server benchmark that simulates a chat server environment, as illustrated in Figure 1. The VolanoMark server accepts connections from the chat client, which simulates a specifiable number of chat users by creating a number of chat rooms. Each chat room contains a number of users that continuously send messages to the server and wait for the server to send the messages to the other users in the room. The VolanoChat server creates two threads for each client connection.

Figure 1. VolanoMark

SPECjbb2000 [19] is another Java server benchmark. As illustrated in Figure 2, it emulates a three-tier client/server system with emphasis on the middle tier, the business logic engine. The other tiers are emulated, and thus user emulation and a database are not required. SPECjbb is patterned after TPC-C in that it models a wholesale company with warehouses that serve a number of districts. The transactions generated in this system include new orders and order status requests (both customer-generated transactions), as well as processing orders, entering customer payments, and checking stock levels (company-generated transactions). Each warehouse, which is represented by 25MB of data stored in binary trees, is assigned one active customer. One thread is created for each warehouse. SPECjbb is a memory resident benchmark.

Figure 2. SPECjbb2000

In addition to these two Java server benchmarks, we run five SPECint2000 benchmarks [18] on the two platforms. This allows us to compare the multithreaded Java server applications to more traditional workloads. We use 255.vortex, 300.twolf, 176.gcc, 252.eon, and 186.crafty, which cover a wide range of application sizes and include the only SPECint2000 benchmark written in C++.

3.3. 
Measurements

We use the hardware performance monitors built into each microprocessor to make performance measurements. Each performance monitor has eight counters that can be programmed to count a variety of processor events. The list of countable events differs between the two machines, but many important events can be counted on both. We interface with the performance monitor using the IBM-supplied performance monitor API and pmcount (a utility that allows the user to interface with the performance monitor), both of which are AIX kernel extensions.

Since we only want to collect performance monitor counts for VolanoMark while client connections are being made and not during server startup or shutdown, we send signals to a wrapper that makes API calls to start counting after server startup and stop counting before server shutdown. Similarly, since we only want to do performance monitoring on SPECjbb during the two-minute “measurement period,” we instrument the code for SPECjbb, modifying it only to send signals to a wrapper that makes API calls to start counting at the beginning of the measurement period and stop counting at the end of the period. While pmcount is simpler to use, requiring only a list of events and the executable to count for as arguments, it does not allow this kind of selective counting. However, we do use pmcount for the SPECint benchmarks, since we count for the entire workload in those cases.

For VolanoMark, we run the client on a separate machine. Each chat room has 20 users, and the number of chat rooms is varied from 1 to 40, so the number of connections ranges from 20 to 800. Since VolanoMark creates two threads for every connection, the number of connection threads ranges from 40 to 1600. For SPECjbb, we vary the number of warehouses from 1 to 25. One thread is created for each warehouse.

4. 
Results

Table 1 and Table 2 compare the Java server benchmarks to the SPECint benchmarks on the RS64-III and POWER3-II, respectively. VolanoMark is run with 1, 10, and 30 chat rooms (indicated as vol01, vol10, and vol30), and SPECjbb is run with 1, 10, and 25 warehouses (indicated as jbb1, jbb10, and jbb25). The metrics collected are similar to those collected by Bhandarkar and Ding [3].

Table 1. Java servers vs. SPECint2000 (RS64-III)

Table 2. Java servers vs. SPECint2000 (POWER3-II)

As the tables indicate, VolanoMark spends a high proportion of its execution cycles in kernel mode (os cyc %). This is likely due both to the fact that it spends a great deal of time sending and receiving messages over the network and to the fact that the number of threads in VolanoMark is very large, requiring the OS to spend a significant amount of time in thread scheduling routines. The user code is concerned mainly with distributing messages, which is a relatively simple task. We can also see that VolanoMark exhibits a higher CPI than the SPECint benchmarks, which is understandable since OS code is known to have a higher CPI than user code [8]. Since SPECjbb2000 contains no network component, has far fewer threads than VolanoMark, and is memory resident and therefore does not generate many page faults, it spends a very small proportion of its cycles in kernel mode. The same is true for the SPECint benchmarks.

Table 1 and Table 2 also show the data references per instruction and the memory transactions per 1000 instructions for the Java server and SPECint workloads. On average, the Java server workloads generate fewer data references per instruction than the SPECint workloads, with some of the SPECint workloads far exceeding them, but the Java server workloads still generate considerably more memory transactions per instruction, by one to three orders of magnitude. This is an interesting observation that will be discussed later.

4.1. 
Dispatch Behavior

Both the RS64-III and the POWER3-II can dispatch up to four instructions per cycle (for the RS64-III, dispatch means the cycle in which the instruction is sent directly to the execution unit; for the POWER3-II, it means the cycle in which the instruction is sent to the execution unit reservation station). From Figure 3 it seems that our machines have more difficulty exploiting ILP in the Java server benchmarks than in the SPECint benchmarks. For almost all of the Java server benchmarks on the RS64-III, zero instructions are dispatched for over 50% of the execution cycles (the lone exception being sjbb1). Only one SPECint benchmark, twolf, has zero instructions dispatched for over 50% of the execution cycles. On the POWER3-II, the dispatch profile is similar (we show only the percentage of cycles with zero instructions dispatched because the other counts were not available on this machine). All of the Java server benchmarks on the POWER3-II have zero instructions dispatched for more than 60% of the execution cycles, while only twolf crosses this threshold among the SPECint workloads. The profile is almost identical for the percentage of zero-instructions-retired cycles on the POWER3-II, which is reasonable given that pipeline delays are being created in the dispatch stage.

Figure 3. Dispatch behavior

It should be noted that the dispatch stage in these machines is not the stage in which operands are read. In the POWER3-II, it is the stage in which the instructions are sent to the reservation stations, and in the RS64-III (which, being an in-order machine, has no reservation stations) it is the stage in which the instructions are sent directly to the execution units. In both machines, operands are read in later stages. Therefore, delays in dispatch in these machines are not necessarily due to dependencies between instructions that limit exploited ILP.
Nevertheless, dispatch in the RS64-III is stalled if the operand read stage (which directly follows the dispatch stage) is stalled due to instruction dependencies. In the POWER3-II, dispatch can be stalled if the execution unit reservation stations fill, which can occur if dependencies between instructions prevent instruction issue. Therefore, instruction dependencies do affect dispatch, and the above dispatch numbers are, to a degree, reflective of exploited ILP (more so in the RS64-III). These numbers seem to indicate that the processors cannot exploit as much ILP in the Java server workloads as they can in the SPECint workloads, which is, as mentioned above, an observed characteristic of SPECjvm running on a Java interpreter.

4.2. Cache and TLB Performance

As mentioned earlier, the Java server workloads generate significantly more memory transactions per instruction than the SPECint workloads. As one might expect from a higher number of memory accesses per instruction, Figure 4a and Figure 4c show that the Java server workloads exhibit poorer cache performance than the SPECint workloads on both machines, particularly in the instruction cache and L2 cache.

High instruction cache miss rates have also been observed in server applications written in C or C++ [1,2]. Just-in-Time compiling (which our JDK uses) might also contribute to higher instruction cache miss rates for Java applications. With a JIT, bytecode is dynamically compiled into native code, and as a result, code for consecutively called methods may not lie at contiguous addresses. Thus the spatial locality of the instruction stream can be expected to be poor, causing higher instruction cache miss rates. Also, not surprisingly, the instruction cache miss rates are higher on the POWER3-II for most of the workloads (since its instruction cache is 64KB as opposed to 128KB for the RS64-III), but for vortex and crafty the instruction cache miss rates are higher on the RS64-III.
This indicates that, for the Java server workloads and the other SPECint benchmarks, size is more important than associativity for instruction cache performance, while for vortex and crafty associativity (2 for the RS64-III and 128 for the POWER3-II) is more important than size.

Figure 3 panels: (a) Dispatch profile, RS64-III; (b) Percentage of zero-instructions-dispatched cycles, POWER3-II; percentage of zero-instructions-retired cycles, POWER3-II

Figure 4. Cache behavior

Figure 5 shows that the Java server benchmarks cause more instruction TLB misses than the SPECint benchmarks on the RS64-III, and a similar trend in TLB misses on the POWER3-II (the ITLB miss count is not available on the POWER3-II). This TLB performance data is further evidence that Java server benchmarks have a large or scattered instruction footprint. (Note: the RS64-III has a much smaller ITLB misses per instruction count than the POWER3-II’s TLB misses per instruction count because its Instruction Effective to Real Address Table (IERAT), which caches address translations and obviates the use of the ITLB if there is a hit, seldom misses.)

Figure 5. TLB behavior: (a) ITLB misses per 1000 instructions, RS64-III; (b) TLB misses per 1000 instructions, POWER3-II

Figure 4b shows the components (load misses, store misses, and instruction misses) of L2 misses for the RS64-III. (These counts were not available on the POWER3-II.) It is clear from this figure that most of the L2 misses for the Java server workloads are generated by load references. While twolf shows a data cache miss rate comparable to VolanoMark, its data appears to be L2 resident.

Figure 6. L2 miss ratio

Figure 6, which shows the L2 miss ratios (as opposed to misses per 1000 instructions) on each machine, confirms that the Java server benchmarks are putting more pressure on the L2 than the SPECint benchmarks.
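The distinction between the two L2 metrics used above (misses per 1000 instructions and miss ratio) can be made concrete with a minimal sketch. This is not the tooling used in the study: the function names are hypothetical, and the counter values are invented purely for illustration and are not measurements from either machine.

```python
def misses_per_kilo_instruction(misses: int, instructions: int) -> float:
    """Misses normalized by instruction count (misses per 1000 instructions)."""
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    return 1000.0 * misses / instructions


def miss_ratio(misses: int, references: int) -> float:
    """Misses normalized by the number of references made to the cache."""
    if references <= 0:
        raise ValueError("references must be positive")
    return misses / references


# Hypothetical counter values (illustrative only, not measured data).
# Workload A references the L2 rarely but misses often when it does;
# workload B references the L2 often but usually hits.
workload_a = {"instructions": 1_000_000, "l2_references": 2_000, "l2_misses": 1_000}
workload_b = {"instructions": 1_000_000, "l2_references": 20_000, "l2_misses": 1_000}

for name, counts in (("A", workload_a), ("B", workload_b)):
    mpki = misses_per_kilo_instruction(counts["l2_misses"], counts["instructions"])
    ratio = miss_ratio(counts["l2_misses"], counts["l2_references"])
    print(f"workload {name}: {mpki:.2f} L2 misses per 1000 instructions, "
          f"L2 miss ratio {ratio:.2f}")
```

In this made-up example both workloads show the same number of L2 misses per 1000 instructions, yet their miss ratios differ by a factor of ten, which is why it is useful to look at both normalizations.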
We cannot explain this behavior with certainty, but a reasonable explanation could be that the Java server benchmarks have a much larger data footprint than the SPECint workloads (though we cannot obtain the data set size for VolanoMark, we know that each warehouse in SPECjbb uses 25MB of data, while the SPEC workloads are for the most part L1 and at worst L2 resident) and therefore generate more capacity L2 misses (hence the much higher number of memory transactions per instruction seen in Table 1).

4.3. Branch Behavior

Figure 7a indicates that the POWER3-II’s branch prediction mechanism works as well for the Java server programs as for the SPECint benchmarks (branch prediction numbers for RS64-III not shown because it does not employ dynamic branch prediction). Figure 7b and Figure 7c show that the speculative factors (instructions dispatched/instructions executed) of the Java server benchmarks are within the range of SPECint2000, indicating that the two sets of benchmarks have much the same effect on speculative execution. However, the Java server benchmarks (with the exception of vol30) exhibit, on the average, worse BTAC (Branch Target Address Cache) performance than gcc, twolf, and vortex. This could indicate that the BTAC of the POWER3-II, which caches branch target addresses and does not store any target instructions, does not work very well for Java server code. Further, eon, which shows BTAC performance similar to the Java server benchmarks, is written in C++ and makes heavy use of virtual functions, which are also widely used in Java. Java programs are known to have poor branch target predictability due to indirect branches resulting from virtual function calls and code interpretation [15].

Figure 7. Branch behavior

4.4. 
CPI Components

Figure 8 compares the Java server benchmarks to the SPECint benchmarks on the RS64-III from another perspective: CPI components. (The stalls in Figure 8 do not comprise a comprehensive list, but they are the significant memory access related stalls on the machine. “Ideal CPI” refers to (total execution cycles - storage latency)/instructions executed. “Storage latency” is a single countable event on the RS64-III that indicates the non-overlapped total amount of storage related stalls (i.e., multiple storage related stalls in one cycle count as one stall). Thus “Ideal CPI” is an approximation of CPI in the absence of all storage related stalls. “Isync” and “Other sync” stalls are caused by various synchronizing PowerPC instructions.)

Figure 8. CPI components, RS64-III

It is clear, as could be predicted from the earlier discussion of cache misses, that the Java server benchmarks incur significantly more instruction cache stalls and L2 cache stalls than the SPECint benchmarks, and further, that these along with ideal CPI (which is determined by internal resource conflicts that we cannot count) are responsible for most of the total CPI. For SPECjbb, data cache miss stalls also play a large role in the CPI. In contrast, the SPECint benchmarks suffer from very little, if any, of the storage related stalls included in the figure. However, despite the large number of storage stall cycles for the Java server benchmarks, Figure 8 shows that the CPIs of the benchmarks are lower than the sum total of the CPI components, which indicates the effectiveness of the RS64-III’s superscalar pipelined architecture in hiding some of the storage latency.

5. Conclusion

We performed a comparison of two Java server benchmarks, SPECjbb2000 and VolanoMark 2.1.2, with selected benchmarks from SPECint2000 on two IBM PowerPC architectures, the RS64-III and the POWER3-II.
We find that our Java server applications differ from SPECint in several ways:

Instruction stream behavior is particularly poor for these Java server workloads. High instruction cache, ITLB, and BTAC miss rates are observed. These point toward a large or scattered instruction footprint. Instruction cache stalls make up a substantial component of the CPIs of these workloads, while they are nearly negligible in the SPECint workloads.

We also see that L2 performance is a major factor in overall performance for the Java server workloads. L2 misses per instruction and per L2 reference are significantly higher than those for SPECint2000. L2 load misses make up the vast majority of the Java server benchmarks’ L2 misses, possibly due to a large data footprint that causes a higher proportion of L2 capacity misses. Clearly, if one is to study the impact of Java server applications on modern processor architectures, L2 performance must not be neglected.

In addition, these Java server workloads have a high proportion of zero dispatch cycles, suggesting that ILP is not very easily exploited in these workloads.

Given the significant differences between our two PowerPC architectures, the RS64-III being an in-order execution machine with static branch prediction and the POWER3-II being a highly aggressive out-of-order execution machine, the fact that the above characteristics were found on both platforms suggests that they are real properties of the workload and not machine-dependent.

6. Acknowledgments

We would like to thank Steve Stevens of the IBM Austin PowerPC Performance group for his encouragement and support, Rick Eickemeyer of IBM Rochester for his assistance in calculating CPI components and advice on performance metrics, and Steve Kunkel and Frank O’Connell of IBM for their help in understanding the RS64-III and POWER3-II architectures.
Thanks also go to Yue Luo of the Laboratory for Computer Architecture at the University of Texas at Austin Department of Electrical and Computer Engineering for his helpful comments and suggestions.

This study was funded by a grant from the IBM Austin Center for Advanced Studies.

7. References

[1] A. Ailamaki, D.J. DeWitt, M.D. Hill, and D.A. Wood. DBMSs on a Modern Processor: Where Does Time Go? In Proceedings of the 25th VLDB Conference, Edinburgh, Scotland, 1999.
[2] L.A. Barroso, K. Gharachorloo, and E. Bugnion. Memory System Characterization of Commercial Workloads. In Proceedings of the 25th International Symposium on Computer Architecture, 1998, pp. 3-14.
[3] D. Bhandarkar and J. Ding. Performance Characterization of the Pentium Pro Processor. In Proceedings of the Third International Symposium on High-Performance Computer Architecture, 1997, pp. 288-297.
[4] J.M. Borkenhagen, R.J. Eickemeyer, R.N. Kalla, and S.R. Kunkel. A Multithreaded PowerPC Processor for Commercial Servers. IBM Journal of Research and Development, Vol. 44, No. 6, 2000, pp. 885-894.
[5] J. Borkenhagen and S. Storino. Fourth Generation 64-Bit PowerPC-Compatible Commercial Processor Design. White Paper, IBM Corporation, /resource/technology/nstar.html, January 1999.
[6] H.W. Cain, R. Rajwar, M. Marden, and M.H. Lipasti. An Architectural Evaluation of Java TPC-W. In Proceedings of the Seventh International Symposium on High-Performance Computer Architecture, 2001.
[7] Q. Cao, P. Trancoso, J.-L. Larriba-Pey, J. Torrellas, R. Knighten, and Y. Won. Detailed Characterization of a Quad Pentium Pro Server Running TPC-D. In Proceedings of International Conference on Computer Design, 1999.
[8] K. Keeton, D.A. Patterson, Y.Q. He, R.C. Raphael, and W.E. Baker. Performance Characterization of a Quad Pentium Pro SMP Using OLTP Workloads. In Proceedings of the 25th International Symposium on Computer Architecture, Barcelona, Spain, June 1998, pp. 15-26.
[9] T. Li, L.K. John, N. Vijaykrishnan, A. Sivasubramaniam, A. Murthy, and J.
Sabarinathan. Using Complete System Simulation to Characterize SPECjvm98 Benchmarks. In Proceedings of International Conference on Supercomputing, 2000, pp. 22-33.
[10] Y. Luo and L.K. John. Workload Characterization of Multithreaded Java Servers. Technical Report TR-010815-01, Department of Electrical and Computer Engineering, University of Texas at Austin, June 2001, /projects/ece/lca.
[11] A.M.G. Maynard, C.M. Donnelly, and B.R. Olszewski. Contrasting Characteristics and Cache Performance of Technical and Multi-User Commercial Workloads. In Proceedings of the 6th International Conference on Architectural Support for Programming Languages and Operating Systems, San Jose, October 1994, pp. 145-156.
[12] S. Oaks and H. Wong. Java Threads, 2nd Edition, O’Reilly and Associates, January 1999.
[13] F.P. O’Connell and S.W. White. POWER3: The Next Generation of PowerPC Processors. IBM Journal of Research and Development, Vol. 44, No. 6, 2000, pp. 873-884.
[14] M. Papermaster, R. Dinkjian, M. Mayfield, P. Lenk, B. Ciarfella, F. O’Connell, and R. DuPont. POWER3: Next Generation 64-bit PowerPC Processor Design. White Paper, IBM Corporation, 1998.
[15] R. Radhakrishnan, N. Vijaykrishnan, L.K. John, and A. Sivasubramaniam. Architectural Issues in Java Runtime Systems. In Proceedings of the Sixth International Conference on High Performance Computer Architecture, January 2000, pp. 387-398.
[16] P. Ranganathan, K. Gharachorloo, S.V. Adve, and L.A. Barroso. Performance of Database Workloads on Shared-Memory Systems with Out-of-Order Processors. In Proceedings of the 8th International Conference on Architectural Support for Programming Languages and Operating Systems, October 1998, pp. 307-318.
[17] B. Rychlik and J.P. Shen. Characterization of Value Locality in Java Programs, Workshop on