EDT Bandwidth Management – Practical Scenarios for Large SoC Designs
J. Janicki, J. Tyszer
W.-T. Cheng, Y. Huang
M. Kassab, N. Mukherjee, J. Rajski
Y. Dong, G. Giles
Poznań University of Technology, 60-965 Poznań, Poland
Mentor Graphics Corporation
Wilsonville, OR 97070, USA
Advanced Micro Devices Inc.
Sunnyvale, CA 94088, USA
Abstract
The paper discusses practical issues involved in applying scan bandwidth management to large industrial system-on-chip (SoC) designs deploying embedded test data compression. These designs pose significant challenges to the channel bandwidth management methodology itself, its flow, and tools. The paper introduces several test logic architectures that facilitate preemptive test scheduling for SoC circuits with EDT-based test data compression. Moreover, some recently proposed SoC test scheduling algorithms are refined accordingly by making provision for (1) setting up test configurations minimizing test time, (2) optimization of SoC pin allocation based on scan data volume, and (3) handling physical constraints in realistic applications. A detailed case study is illustrated with a variety of experiments that allow one to learn how to trade off different architectures and test scheduling approaches.
1. Introduction
One of the main reasons for the growing popularity of system-on-chip (SoC) designs is their ability to encapsulate many disparate types of complex IP cores running at different clock rates with different power requirements and multiple power-supply voltage levels. As it turns out, devising a proper test methodology under constraints imposed by these factors is no trivial pursuit. Many SoC-based test schemes proposed so far utilize dedicated instrumentation including test access mechanisms (TAM) and test wrappers [9], [11], [24]. TAMs are typically used to transfer test data between the SoC pins and the embedded cores, while test wrappers form the interface between the core and the SoC environment. Solutions involving both TAMs and wrappers accomplish such tasks as optimizing test interface architecture [27] or control logic [22] while addressing routing and layout constraints or hierarchy of cores [2], [6], scheduling test procedures [1], [20], [21], [23], [33], tests of identical cores [4], [8], [28], and minimizing power consumption [10], [18], [32]. Techniques proposed in [5], [11], [12], and [26] attempt to minimize SoC test time. The integrated scheme of [11] reduces test time for a given number of pins by optimizing dedicated TAMs and pin-count-aware test scheduling. Packet-switched networks-on-chip can also be used to test SoCs [26]. In this approach, test data are delivered through an on-chip communication infrastructure, with no dedicated TAMs and their associated routing overhead.
There are techniques addressing synergistically TAM and wrapper design as well as test data compression [3], [13], [14], [29], [30], [31]. In fact, test compression is gradually becoming an integral part of several SoC-based DFT schemes. For example, the approach of [19] encodes test vectors for each core separately through LFSR reseeding with time-multiplexed ATE channels delivering data to successive cores. Similarly, a scheme of [34] utilizes static reseeding to test SoCs, where all cores share a single on-chip LFSR. XOR test logic performs compression in [7], which further guides a TAM design process.
ATE channel bandwidth management for SoC designs can play a key role in increasing test data compression with no visible impact on test application time. For example, the approach presented in [15] encompasses (1) a solver capable of using input and output channels dynamically, (2) test scheduling algorithms, and (3) TAM design schemes, all devised for the Embedded Deterministic Test (EDT) environment [25]. It is assumed that all cores in the SoC are either heterogeneous modules or wrapped testable units, and they come with their individual EDT-based compression logic, which is subsequently interfaced with the ATE through an optimized number of channels. All these features enable flexible ATE channel bandwidth management with a very simple test flow. As a result, test scheduling and test access mechanisms can assign a fraction of the ATE interface capacity to each core that is handled as if it were tested alone on a smaller platform. This increases compression ratios and allows trade-offs between test application time, volume of test data, test pin count, and interface design complexity. The scheme of [15] is not confined to EDT-based compression but is applicable to any SoC-based test data reduction scheme capable of working with a varying number of input and output channels.

Traditionally, implementing a hierarchical DFT methodology for designs with a large number of cores poses significant challenges. First of all, the number of chip-level pins is limited and does not suffice to drive all cores in parallel. Given the pin limitations, it is impossible to determine the optimal allocation of pins to cores for the best compression. Furthermore, since a particular core can be reused in multiple designs, an optimal number of channel pins for this core when embedded in one design may invalidate test reuse in other designs.
Under such circumstances, the chip integrators collect data for all individual cores, examine the data along with all constraints for the design, and then manually determine test schedules. This may result in suboptimal test data volume and compromised test application time, especially because of some outlier blocks having large pattern counts.
Bandwidth management mitigates the dependency of core channels on the number of available chip-level pins, allows automatic scheduling of cores by making it transparent to the users, and significantly improves test planning at the core level. In essence, it provides better utilization of the chip-level channel pins [17], thereby guaranteeing the best possible data volume and test time reduction for the overall design. In this paper, we present a bandwidth management scheme for hierarchical designs that takes into consideration practical scenarios such as sharing of inputs for identical cores, trade-offs between fixed channel count allocation vs. scalable channel count allocation per core, and physical constraints to minimize the routing overhead associated with the TAM-based interconnection networks. In addition, several techniques to deliver the control data during test are studied and compared. To further optimize the scheme, the paper also presents a scheduling algorithm that allows changing the input and output channel allocations when switching the channel configurations.
2. Prior work
The SoC test environment (Fig. 1) deployed in this work comprises two switching networks, as originally introduced in [15], [16]. External ATE channels feed an input-switching network that reroutes compressed test data to different cores based on the control data produced by a test scheduler. Since the scan routing paths from the chip-level test pins to the core-level test pins are dynamically selected by patterns, this interconnection network is also referred to as a dynamic scan router (DSR). Identical modules share the same test data in the broadcast mode. In addition to individual EDT decompressors, each core is armed with X-masking logic protecting its response compactor against unknown states and connecting the core with an output-switching network. This network allows the compressed output streams from successive cores to reach the output channels, and to be sent back to the ATE.
As shown in Fig. 1, the input DSR consists of n demultiplexers, where n is the number of ATE input channels. Given a group of test patterns, each demultiplexer connects the corresponding channel to one of several destination cores, as indicated by the content of the address register. The number of ATE input channels cannot be smaller than the capacity of the largest single core in terms of its EDT inputs. That is to say, in the worst case, we can still test the largest cores, one at a time. Typically, low-order inputs of each core are used more frequently than others. Hence, suitably-sized OR gates [15] are deployed to assure that these inputs can receive data from more than a single ATE channel, which increases the flexibility of a test scheduler.
Given the ATE input channels, the associated demultiplexers, and all cores with OR gates driving their EDT inputs, the actual connections between these terminals are arranged as follows. Successive cores are connected with the demultiplexers in such a way that k EDT inputs of a given core (including inputs of associated OR gates) are linked with k different demultiplexers. Moreover, all demultiplexers have approximately the same number of outputs in use. Some final adjustments within a single module are also possible to simplify the resultant DSR layout and to avoid long connections. Note that the above method yields the actual size of the input demultiplexers, which also determines the size of their control registers.
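The wiring rule above can be sketched in a few lines of Python. The function name and the least-loaded tie-breaking heuristic are assumptions for illustration only, not the exact procedure used by the tool:

```python
# Hypothetical sketch of the input-DSR wiring rule: each of a core's
# k EDT inputs is driven by a different demultiplexer, and the
# demultiplexer fan-outs are kept approximately balanced.

def wire_input_dsr(num_channels, core_input_counts):
    """Return ({core: [demux indices]}, per-demux fan-out counts)."""
    fanout = [0] * num_channels          # outputs in use per demultiplexer
    wiring = {}
    for core, k in core_input_counts.items():
        if k > num_channels:
            raise ValueError("ATE channels must cover the largest core")
        # pick the k least-loaded demultiplexers, all distinct
        chosen = sorted(range(num_channels), key=lambda d: fanout[d])[:k]
        for d in chosen:
            fanout[d] += 1
        wiring[core] = sorted(chosen)
    return wiring, fanout
```

For example, `wire_input_dsr(4, {"A": 3, "B": 2, "C": 3})` spreads 8 core inputs over 4 demultiplexers with two outputs in use on each, so no demultiplexer (and hence no control register) grows disproportionately large.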
The output DSR (Fig. 1) interfaces all individual core outputs with the ATE by reducing the number of test response streams so that they fit into the number m of output ATE channels. It is made up of m multiplexers such that each multiplexer serves to connect several cores with a designated ATE output channel. Again, the address registers specify which cores are to be observed for a given group of test patterns. The output channel mapping is carried out in a manner similar to that of the input DSR. Note that all core outputs feature a user-defined fan-out in order to increase the flexibility of selecting observation points. Testing many cores in parallel by dynamically rerouting ATE channels to different cores in accordance with their needs may shorten test application time. The test preparation flow proceeds as follows:
∙ for each core, ATPG-produced test cubes are passed to a solver that merges and encodes them to arrive at compressed test patterns; the solver also determines the minimal number of EDT input channels required and groups patterns into classes having identical requirements with respect to both input channels and observation points,
∙ upon completion of the above operations for each core, all classes are passed to a test scheduler; given the architecture of both test access networks and various constraints (ATE channels, DSR layout, power consumption, and others), it yields the final allocation of ATE input/output channels to selected cores when applying successive test patterns.
Given both DSRs, one can now schedule test patterns for successive cores. Every test pattern can be characterized by its descriptor comprising the core that is to be exercised when applying the test and the channel capacity c. Test patterns with the same descriptor form a setup class x represented by its pattern count P(x). Grouping test patterns from the same class into contiguous segments helps to shorten the total test time and to reduce the volume of control data.

Test scheduling merges complementary setup classes to form clusters of cores that can be tested in parallel under constraints imposed by ATE channels and DSRs. Setup classes are complementary if they comprise different cores. The scheduler maintains a list L of setup classes; the class x having the largest value of c·P(x) seeds a base class, and the current result b of merging is repeatedly extended with the next complementary class from L that satisfies all constraints. The process of forming the base class terminates when either there are no more setup classes complementary with the base, or one of the constraints cannot be satisfied. The merging algorithm then removes the first element from list L and attempts to form another base cluster until the list of setup classes becomes empty, in which case the scheduling algorithm returns a list of base classes that determines the actual schedule, i.e., an order according to which cores will be tested, as well as actual channel allocations. It can be further reordered to group tests for the same core in contiguous time intervals, finally forming test configurations indicating which ATE channels are to connect with which EDT channels of which cores and for how long.
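A minimal sketch of this greedy merging is given below. It assumes a simplified constraint model in which only a total channel budget and core disjointness are checked; the actual scheduler also honors DSR layout, power, and other constraints:

```python
# Illustrative greedy base-class formation (simplified constraint model).

def schedule(setup_classes, ate_channels):
    """setup_classes: list of (core, channels, pattern_count).
    Returns a list of base classes (lists of merged setup classes)."""
    # sort so the class with the largest c * P(x) seeds each base
    L = sorted(setup_classes, key=lambda x: x[1] * x[2], reverse=True)
    bases = []
    while L:
        base = [L.pop(0)]                 # seed with the head of the list
        used_channels = base[0][1]
        used_cores = {base[0][0]}
        i = 0
        while i < len(L):
            core, c, p = L[i]
            # complementary: a different core, and it still fits the budget
            if core not in used_cores and used_channels + c <= ate_channels:
                base.append(L.pop(i))
                used_channels += c
                used_cores.add(core)
            else:
                i += 1
        bases.append(base)
    return bases
```

With four setup classes and a 4-channel budget, the seed class for core C1 occupies 3 channels on its own, while the 2-channel classes of C2 and C3 merge into the next base, mirroring the clustering behavior described above.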
3. Control data delivery
The approach summarized in the previous section does not make any specific provisions for the way the control data are delivered to the SoC test logic in order to set up test configurations. However, the number of test configurations, and hence the amount of control data one needs to employ and transfer between the ATE and the DSR address registers, may visibly impact test scheduling. In fact, ignoring control bits may lead to a test schedule with too many configurations and the amount of control bits growing to a point where the overall test time is compromised. Consequently, we begin our study by analyzing three alternative schemes that can be used to upload control bits and show how they determine the final SoC test logic architecture.
Using IJTAG. The IEEE P1687 working group is developing a standard for accessing on-chip test and debug features via the IEEE 1149.1 test access port (TAP). The purpose of this Internal JTAG (IJTAG) standard is to automate the way one can manage on-chip instruments, and to describe a language for communicating with them via the IEEE 1149.1 test data registers (TDR). Consider an IJTAG network with a DSR interfacing the ATE with two cores C1 and C2. The TAP can be instructed to enable a test path via the P1687 segment insertion bits (SIB). Every SIB either enables or disables the inclusion of an instrument into the TDI/TDO path. The TDR in C1 or C2 can be either bypassed or loaded with data putting both cores into specific test modes. The TDR in the DSR receives the control data indicating a core and the connections of its test channels with ATE channels.
The advantage of using the IJTAG network to deliver the DSR control data is a simple and easy-to-implement flow, as the network is frequently used to set the cores' TDRs. However, such an approach can only support a limited number of configurations. This is because the IJTAG shift clock is typically 10 to 20 times slower than the scan shift clock. Delivering a large volume of control data can incur an unacceptable test time overhead, and hence an increase in the total test time. In practice, we use this architecture only for fewer than 10 test configurations.
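The overhead argument can be checked with simple arithmetic. The 10x–20x slowdown figures come from the text; the per-configuration bit count below is an invented example:

```python
# Back-of-the-envelope control-delivery cost, expressed in scan-shift-clock
# cycle equivalents, for a control interface running `slowdown` times
# slower than the scan shift clock.

def control_overhead_cycles(control_bits, configurations, slowdown):
    """Total control upload time in scan-shift-cycle equivalents."""
    return control_bits * configurations * slowdown

# e.g. 200 control bits per configuration, 50 configurations,
# IJTAG clock 20x slower than the scan shift clock (assumed numbers)
cost = control_overhead_cycles(200, 50, 20)
```

The same 10,000 control bits delivered through a dedicated control chain at full scan shift speed (`slowdown = 1`) would cost 20 times fewer cycles, which is why the IJTAG option only pays off for a handful of configurations.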
Dedicated control chain. The SoC design of Fig. 3 uses two dedicated control chains to deliver the control data. In principle, these structures are obtained by daisy-chaining address registers of both DSRs. Let the design have n input test pins and m output test pins at the chip level. One can insert n control chains driven by n input test pins through n demultiplexers. The selector pin of each demultiplexer is controlled by a signal CFS. It is worth noting that if n ≥ m, then n control chains suffice to supervise both input and output sides. If n < m, one chain can be used to control one input channel and multiple output channels, since the number of control bits per channel is typically much smaller than the number of pattern shift cycles.
Once the control data are shifted in, flip-flop CF becomes asserted, and so does CFS. As a result, the regular scan test patterns can be applied. Given a test configuration, all patterns but the last one set CF and CFS to 1. The last scan pattern of one test configuration sets CF back to 0 and subsequently CFS is de-asserted. The next pattern uses the de-asserted CFS to upload the new control data into the control chain and to set up a new test configuration. The remaining configurations and test patterns are applied in a similar manner by toggling the values of CF and CFS.
The advantage of using this architecture is its ability to support as many test configurations as desired, since the control chain shift frequency is the same as that of conventional scan chains.
Using pipelines. One can also use the regular scan channels to deliver controls through pipelining stages, as shown in Fig. 4. For each channel, this approach concatenates n + m control bits, where n and m are the numbers of control bits used by the input demultiplexers and the output multiplexers, respectively. Moreover, each control bit is shadowed to avoid distorting test configurations in the middle of test data shifting. The shadow registers are updated at the end of each pattern upload. Thus, when test pattern p launches a new test configuration, the corresponding control bits are captured into the shadow registers only once p has been fully uploaded.

4. Test scheduling

Minimizing the number of test configurations may decrease the actual test time. Indeed, considering that one would need to transfer control data necessary to reconfigure DSRs many times, reducing this activity may translate into some significant test time savings. The additional test-scheduling step taking care of this phenomenon will be further referred to as conditional merging, and it can be detailed as follows. Assume we know the total number of control bits needed to reconfigure the input and output interconnection network(s). Let us also assume that we know the frequency of the IJTAG shift clock used when moving the control data. As a result, one can easily assess the time Tc spent on updating test configurations. This paves the way for a new test scheduling where an additional verification step precedes a possible merging (or bin-packing) of a new setup class with the existing base. It estimates the resultant impact on test time if we decide to proceed with merging. Clearly, if the setup class is not merged with the base, it may reduce test time as there is no need to send additional control data forming a new test configuration. On the other hand, as a part of the trade-off, one may observe a test time increase due to certain ATE channels not being utilized.
Consider a bin-packing diagram representing a part of a hypothetical test schedule, as shown in Fig. 5a. Here, two cores, C1 and C2, are to be tested. Core C1 requires 3 channels to apply t1 test patterns (it forms a setup class a), 2 channels to deliver t2 tests (setup class b), and then a single channel for the remaining t3 patterns (class c). Core C2 needs 2 channels to deliver t3 test patterns (class d), and a single channel to apply t2 patterns (setup class e). Let t1, t2, and t3 be much longer than the duration Tc of a test configuration setup. If classes b and e as well as classes c and d are merged as shown in the figure, then extra time is needed to reconfigure the network three times. More importantly, however, we reduce the actual test time since C2 is tested in parallel with C1. On the other hand, if these classes are not merged, then two test configurations suffice to complete tests of C1 and C2. Unfortunately, the test of C2 is delayed, thus visibly increasing the total test application time. As shown in Fig. 5a, the second scenario can be characterized by some unused space (in the schedule diagram) whose area A can be easily computed. Let k be the total number of SoC input channels. Then the total test time increase τ caused by not merging some of the setup classes can be estimated as τ = A / k. If τ < Tc, then one can reduce the total test time by actually not merging some of the classes. Consequently, the above condition can be used to dynamically decide whether given setup classes should be merged.
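The merge-or-not decision therefore reduces to comparing τ = A / k against Tc, which can be expressed directly:

```python
# Conditional-merging test from the text: skipping a merge leaves an unused
# area A in the bin-packing diagram, costing roughly tau = A / k extra shift
# time (k = total SoC input channels); merging instead costs one extra
# reconfiguration of duration t_config. Merge only when tau >= t_config.

def should_merge(unused_area, total_input_channels, t_config):
    tau = unused_area / total_input_channels
    return tau >= t_config   # True: merging pays off despite the control data
```

For instance, with k = 3 channels and Tc = 500 cycles, an unused area of 3000 cycle-channels (τ = 1000) justifies merging, whereas an area of 900 (τ = 300) does not.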
It is worth noting that optimizing SoC pin allocation based on scan data volume at the input and output sides can further enhance test scheduling. As typical SoC designs pack hundreds of cores – interfaced by strictly limited channel resources – into the space of a single chip, it is not unusual that the number of channels assigned to a single core as well as the ratio of EDT inputs to EDT outputs vary significantly. The same ratio also depends on a test pattern, as the solver always selects the minimal number of inputs that suffice to encode a given vector while still observing all outputs. This disproportion is even more pronounced when several identical cores receive the same data in the broadcast mode, whereas their outputs must be observed independently to guarantee high quality of fault detection and diagnosis. Consider, for example, a circuit comprising 8 identical cores with 5 inputs and 5 outputs. Such a design would require one to supply different subsets of 5 input channels while uninterruptedly monitoring all 40 outputs. The above observations clearly indicate that the DSRs should provide each core with channel resources adequate to its bandwidth requirements. Although the total number of ATE channels is typically fixed, most SoC pins are bidirectional and can be configured either as inputs or outputs.
Figure 6 Test data volume for industrial design
[Figure 5: bin-packing diagrams (a) and (b) contrasting the merged and non-merged test schedules of cores C1 and C2 (channels vs. test configurations, with reconfiguration intervals TC).]
According to our findings, the most suitable partitioning of channels can be obtained by comparing the corresponding scan data volumes (SDV) on both sides. The available SoC pins are assigned to the input and output DSRs in accordance with the ratio determined earlier. In the aforementioned test case, we would have two outputs per one input, on average. Occasionally, the desired ratio can also be determined by using the upper bound on the number of input channels assigned to a given core, especially when channel profiles similar to those of Fig. 6 are not readily available. The same design would then have the ratio 1 : 1.98. Experimental results indicating the importance of using the proper channel assignments are presented in Section 5.
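The pin split implied by the input/output data-volume ratio can be sketched as follows; the function name and the rounding policy are assumptions:

```python
# Hedged sketch of the pin-partitioning rule: bidirectional SoC test pins
# are split between the input and output DSRs in proportion to the input
# and output scan data volumes (IDV / ODV).

def split_pins(total_pins, idv, odv):
    """Return (input_pins, output_pins) for the given data volumes."""
    inputs = max(1, round(total_pins * idv / (idv + odv)))
    return inputs, total_pins - inputs

# e.g. ODV twice IDV, as in the broadcast example discussed in the text
ins, outs = split_pins(9, 1.0, 2.0)
```

With 9 pins and an ODV twice the IDV this yields 3 inputs and 6 outputs (two outputs per input); applying the 1 : 1.98 ratio mentioned above to the 477 available test pins would likewise split them roughly 160 : 317.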
Finally, the trend of producing integrated circuits at a finer scale and packing more transistors into the same area adds another constraint to the already complicated test scheduling process. As a result of increasing IC complexity, certain cores of a given SoC can only use a subset of I/O pins because of layout congestion. As shown in Fig. 7a, all SoC design pins and cores can be partitioned into a number (here four) of disjoint subsets. Clearly, a single core can only be interfaced through the subset of pins hosted by the corresponding group. This approach alleviates many practical problems, such as routing complexity, aggressive timing parameters, signal integrity, or power dissipation specifications. However, it significantly impacts test logic architecture and test scheduling algorithms. In particular, one has to employ a number of routers (see Fig. 7b) driven by subsets of input test channels and feeding subsets of output test channels. Although the test scheduler is aware of these constraints, it may not produce as optimal test scenarios as those of single-partition SoC designs due to their reduced flexibility (not every channel can be connected freely to a desired core) and unbalanced test resources.
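A per-quadrant feasibility check of the kind implied above might look as follows; the quadrant names and pin budgets are illustrative, not taken from the design:

```python
# Sketch of a partition-aware constraint check: each core may only draw
# channels from its own quadrant's pin budget, so a candidate cluster of
# parallel cores is feasible only if every quadrant's budget is respected.

def fits_quadrants(cluster, quadrant_of, pins_per_quadrant):
    """cluster: list of (core, channels_needed).
    True iff every quadrant's pin budget covers its assigned cores."""
    used = {}
    for core, c in cluster:
        q = quadrant_of[core]
        used[q] = used.get(q, 0) + c
    return all(used[q] <= pins_per_quadrant[q] for q in used)
```

A scheduler could call this check before merging a setup class into a base cluster, rejecting merges that overload any single quadrant even when total SoC pins would suffice.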
5. Experimental results
The experiments were performed on an industrial SoC design comprising 281 isolated cores. This number includes 121 unique cores plus 20 groups of identical cores, each having 8 modules. When testing identical cores in
one group, test stimuli are broadcast to the inputs of these 8 identical cores, while test responses from all outputs of the 8 cores are observed together. The core gate counts vary from a few million to more than 10M, while the core pattern counts range from a few thousand to over 36K and 190K for stuck-at and transition faults, respectively. In our experiments, when different cores are tested in parallel, we use the size of the longest scan chain among these cores to determine the total test time. In this case, there is no power constraint, but we have to follow the layout and physical constraints.
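This test-time accounting can be modeled directly; the chain lengths and pattern counts below are illustrative, not taken from the design:

```python
# Minimal model of the test-time rule used in the experiments: when cores
# run in parallel, the shift time per pattern is set by the longest scan
# chain in the cluster, and the cluster lasts until its largest pattern
# set has been applied.

def cluster_test_cycles(cores):
    """cores: list of (longest_chain, pattern_count) tested in parallel."""
    shift = max(chain for chain, _ in cores)
    patterns = max(p for _, p in cores)
    return shift * patterns
```

For example, pairing a core with 120-cell chains and 1000 patterns with one having 200-cell chains and 800 patterns costs 200 × 1000 shift cycles, which is why the scheduler benefits from grouping cores with similar chain lengths and pattern counts.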
477 out of 497 SoC digital pins were used for testing. Each core or group of identical cores can use only a subset of pins because of layout constraints. Following the scheme of Fig. 7, this particular design was partitioned into four quadrants Q1–Q4 with the corresponding allocation of cores and I/O pins into individual quadrants, as listed in Table I. The same table provides some vital statistics of the design that include the gate count, the number of scan cells, the total number of EDT inputs, the total number of EDT outputs (the last two numbers would define the interface size if no scheduling were used), and the corresponding ratios of inputs to outputs: the one estimated based on the EDT interface and the most preferable one, obtained as shown in Section 4 based on the input test data volume (IDV) and the output test data volume (ODV) given in the table for both stuck-at and transition fault tests, with a further data breakdown with respect to successive quadrants.
TABLE I. SoC CHARACTERISTICS
The baseline scheme picks cores to be tested in parallel within one configuration. If there are still SoC pins available, it picks more cores from each quadrant until either there are no more SoC pins or all cores have been covered. For example, the groups can be tested as follows: G1 = {C1, C5, C9, C13}, G2 = {C2, C6, C10, C14}, G3 = {C3, C7, C11, C15}, G4 = {C4, C8, C12, C16}, and we use JTAG to set up each configuration. The results obtained this way (see Table II) form a baseline to be compared against our dynamic SoC bandwidth management outcomes.
Table II lists the cores that ended up within each configuration, while a separate row provides the number of scan shift cycles per configuration, the number of EDT inputs, and the number of EDT outputs. The JTAG shift cycles deployed to set up each configuration are ignored because only a few configurations are used. The resultant test schedule is illustrated in Fig. 8 as a bin-packing diagram for both the input (the upper part) and output channels assigned to different cores or identical core groups. Some channels (oriented horizontally) are unused. It is worth noting that the baseline is a fully static channel assignment, as the connections between core test channels and SoC test pins (within each configuration) are fixed.
TABLE II. SoC GROUP SETUP
The test architectures of Section 3 and the test scheduling
methods of Section 4 were subsequently employed to run our dynamic SoC bandwidth management. Given stuck-at fault and transition fault test sets, the experimental results are summarized in Table III for eight test cases, out of which four correspond to a fixed channel count allocation, whereas the next four cases represent a scalable core-level channel count. In the fixed channel count allocation scenario, a given core is always assigned SoC pins for all of its EDT channels regardless of the actual test pattern requirements, although the connections may differ in different test configurations. The scalable channel count allocation scenario assumes that only the minimal number of SoC pins is allocated to a given core for a particular test pattern. Furthermore, we consider four test setups, as described in Section 3: two of them use the IJTAG architecture with a shift clock 10 and 20 times slower than the scan shift clock, while the remaining two employ the dedicated control chain and the pipeline-based delivery, respectively.
Figure 8 The baseline bin-packing diagram
For each test case, successive columns of Table III report the following results for the proposed schemes:
∙ the total number of test patterns and the resultant number of base classes,
∙ the channel fill rate indicating the actual usage of the ATE channels; it assumes the value of 1.0 only if all channels are used uninterruptedly till the end of test,
∙ the test application time reported as the total number of shift cycles needed to deliver all test data in terms of the actual test patterns and the accompanying control bits; moreover, the ratio of both quantities (control overhead) is also reported here,
∙ the test time reduction ratio relative to the original shift cycles given in Table II.
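The difference between the fixed and scalable allocation policies reported in Table III can be illustrated with toy numbers (all values below are assumed, not taken from the tables):

```python
# Toy comparison of the two allocation policies: fixed allocation always
# reserves a core's full EDT channel count, scalable allocation uses only
# what each pattern actually needs.

def data_volume(per_pattern_channels, shift_len, fixed_channels=None):
    """Shift data volume in bits for one core's pattern set."""
    if fixed_channels is not None:            # fixed channel count allocation
        return fixed_channels * shift_len * len(per_pattern_channels)
    return sum(c * shift_len for c in per_pattern_channels)  # scalable

needs = [4, 2, 1, 1, 2]        # channels actually required per pattern
fixed = data_volume(needs, 100, fixed_channels=4)   # 4 channels * 100 * 5
scalable = data_volume(needs, 100)                  # (4+2+1+1+2) * 100
```

Here the scalable policy halves the delivered data volume, and the freed channels can serve other cores in the same configuration, which is consistent with the larger reductions Table III reports for the scalable scenarios.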
As indicated in Table III, one can reduce the test application time by a factor of up to 3.26 compared to the baseline and still deliver all necessary test patterns, thus preserving the original test quality. The largest test application time reduction is observed for test scenarios based on per-pattern scalable channel allocation while employing dedicated chains to deliver all required control test data. These results are even more evident for transition fault test sets. Further experimental results, not presented in Table III, reveal an intrinsic relation between the ratio of input and output channels and the resultant test time reduction. This is illustrated in Fig. 9. As can be seen, the test time reduction clearly depends on the aforementioned ratio, and hence its careful selection, as shown in Section 4, may guarantee the shortest test application time.
Fig. 10 pictures a complete and optimized test schedule for input and output sides obtained by allocating test channels in a scalable channel count allocation scenario for 145 input (the upper part) and 332 output test channels (the lower part – both oriented horizontally). As can be seen, there is a well-pronounced difference between the test schedules of Fig. 8 and Fig. 10 – note that the time axis of Fig. 8 has been substantially compressed compared to the same axis of Fig. 10.