Different aspects of the design and implementation of an NoC architecture have been addressed in [2,5,11,25]. An NoC architecture primarily consists of three basic building blocks: IP cores, routers, and channels. These components may malfunction because of various transient or permanent faults [26, 27]. Radiation-induced soft errors, crosstalk, and voltage-induced delay errors often result in transient faults in an NoC system [28,29], whereas manufacturing defects, hardware aging, and thermal and physical stress may cause permanent faults such as logical stuck-at faults and shorts [5, 15]. To perform reliable computation with MPSoCs, the underlying NoC fabric must be guaranteed to be fault free.
Testing for faults in an NoC architecture should cover all three components: IP cores, routers, and channels. In other words, NoC testing is broadly classified into three areas: (a) testing of IP cores, (b) testing of routers, and (c) testing of communication channels. The testing of IP cores is based on the reuse of the NoC infrastructure as a test access mechanism (TAM) and has been studied in the past [30–32]. Likewise, test strategies for routers, covering the arbiter, routing logic blocks (RLBs), I/O ports, and first-in-first-out (FIFO) buffers, are well studied [6, 33–35]. In both cases, the test methods assume that the communication channels carry test data and test responses correctly. Therefore, the order in which the basic NoC components are tested matters and must be prioritized in the testing cycle. Channels must be tested before routers, and routers before cores, because channels are the primary means of transportation for both test and application data and occupy a significant portion of the network area [36]. One must therefore be assured of the correct functionality of the channels before using them as the test instrument for routers and cores. In this thesis, manufacturing faults in NoC channels are primarily considered. These channels suffer from poor observability and controllability because of their placement and density. Although all channels connected to a node can exercise the same test set, optimizing test time, fault coverage, and performance overhead while applying that test set at the node remains a challenge [37,38].
The design of a suitable test paradigm for permanent (manufacturing) faults in NoC channels has been a topic of research for quite some time. For example, Cota et al. [37, 38] proposed an off-line, high fault-coverage test model that addresses pairwise shorts in the channels of a 2×2 neighborhood. The four IP cores simultaneously transmit test sets to one another, with source and destination separated by four hops in the neighborhood, where a hop is the length of a single channel. The model can be used to test larger mesh NoCs for channel shorts by testing several 2×2 sub-meshes that together cover the network. Although the 2×2 model provides high fault coverage, it requires a long test time. The situation worsens when the model is applied in the on-line mode, as it needs more test iterations. The same test model is extended by Herve et al. [39]. The extended post-burning off-line test method accounts for co-existent short and stuck-at faults (CSSAFs) in the channels of a 2×2 network neighborhood. As before, the method incurs high test area overhead and test time because test responses are analyzed after traveling four hops. Similarly, the test time increases when the model is applied in the on-line mode. Moreover, because of its 2×2 test configuration, it works only with traditional mesh-type NoCs. Today, NoC-based systems often use both conventional topologies such as mesh networks and unconventional topologies such as octagon and spidergon networks. To account for channel faults in unconventional NoCs, the 2×2 model is not preferred; rather, one may employ the test model discussed in [40,41] for torus NoCs. In this scheme, channel-short faults are tested on a 2×1 neighborhood that consists of an interswitch channel and its adjacent local channels. Multiple instances of this neighborhood are applied concurrently to detect channel shorts on a larger torus network. In the on-line mode, one can iterate this 2×1 neighborhood to cover channel faults on a general network. Although the scheme scales irrespective of network size and type, it results in high test time and hardware area overhead on these networks.
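The idea of covering a larger mesh with several 2×2 sub-meshes can be illustrated with a minimal sketch. It is not the schedule used in [37, 38]; it only assumes a plain rows × cols mesh, slides a 2×2 window with stride one, and checks that every interswitch channel falls inside at least one chosen window. The function names (`mesh_channels`, `submesh_channels`, `cover_with_submeshes`) are hypothetical.

```python
from itertools import product


def mesh_channels(rows, cols):
    """Interswitch channels of a rows x cols mesh, as unordered router pairs."""
    channels = set()
    for r, c in product(range(rows), range(cols)):
        if c + 1 < cols:
            channels.add(frozenset({(r, c), (r, c + 1)}))  # horizontal channel
        if r + 1 < rows:
            channels.add(frozenset({(r, c), (r + 1, c)}))  # vertical channel
    return channels


def submesh_channels(r, c):
    """The four channels of the 2x2 sub-mesh whose top-left router is (r, c)."""
    tl, tr = (r, c), (r, c + 1)
    bl, br = (r + 1, c), (r + 1, c + 1)
    return {
        frozenset({tl, tr}),
        frozenset({bl, br}),
        frozenset({tl, bl}),
        frozenset({tr, br}),
    }


def cover_with_submeshes(rows, cols):
    """Slide a 2x2 window over the mesh (assumes rows, cols >= 2).

    A window is kept only if it contains a channel that no earlier
    window has covered. Returns the kept windows and any channels
    left uncovered.
    """
    uncovered = mesh_channels(rows, cols)
    chosen = []
    for r in range(rows - 1):
        for c in range(cols - 1):
            newly_covered = submesh_channels(r, c) & uncovered
            if newly_covered:
                chosen.append((r, c))
                uncovered -= newly_covered
    return chosen, uncovered


if __name__ == "__main__":
    windows, missing = cover_with_submeshes(4, 4)
    print(len(windows), "sub-meshes,", len(missing), "channels uncovered")
```

Under this simple stride-one covering, the number of sub-meshes grows with the mesh size, which is consistent with the observation that test time increases when the 2×2 model is applied to larger networks.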
Strano et al. [42] have proposed a self-diagnosis test method that detects stuck-at faults in the interswitch channels. In this method, a router and its neighbor routers form a neighborhood. A router in one direction transmits test sets to another router in the opposite direction in the neighborhood. The method incurs not only high test area overhead but also high test time to detect the fault. The area and time increase further when short faults are also considered. Kakoee et al. [6] have proposed an on-line test approach to address stuck-at faults in the interswitch channels of general NoCs. The approach sequentially selects a router to test its interswitch channels. The authors claim that it can also detect short faults on the interswitch channels. Although the approach scales irrespective of the network type, it is not cost-efficient for larger NoCs since the test time grows linearly with the NoC size. Further, as in the previous scheme [42], faults in local channels are not addressed, although faults on these channels are as likely as on the interswitch channels. Later, this sequential router-selection-based test methodology is extended in [7] to detect stuck-at faults in different components of NoC routers.
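To make the stuck-at and short fault models concrete, the following is a minimal behavioral sketch of a single channel. It does not reproduce the test sets of [6] or [42]. It assumes a walking-one test set plus the all-zeros and all-ones words, and it models a short between two wires as a wired-AND. The fault encoding and all function names are hypothetical.

```python
def apply_fault(word, width, fault):
    """Word observed at the receiver of a `width`-bit channel under one fault.

    Hypothetical fault encoding:
      ("sa0", i)       wire i stuck-at-0
      ("sa1", i)       wire i stuck-at-1
      ("short", i, j)  wires i and j shorted (wired-AND assumption)
    """
    mask = (1 << width) - 1
    kind = fault[0]
    if kind == "sa0":
        return word & ~(1 << fault[1]) & mask
    if kind == "sa1":
        return (word | (1 << fault[1])) & mask
    if kind == "short":
        i, j = fault[1], fault[2]
        both = (word >> i) & (word >> j) & 1  # AND of the two shorted wires
        cleared = word & ~((1 << i) | (1 << j)) & mask
        return cleared | (both << i) | (both << j)
    raise ValueError(f"unknown fault kind: {kind}")


def walking_one_test_set(width):
    """All-zeros, all-ones, and one word per wire with only that wire high."""
    return [0, (1 << width) - 1] + [1 << i for i in range(width)]


def detects(test_set, width, fault):
    """A fault is detected if any test word arrives altered."""
    return any(apply_fault(t, width, fault) != t for t in test_set)


def fault_coverage(width):
    """Return (detected, total) over single stuck-at and pairwise short faults."""
    faults = [("sa0", i) for i in range(width)]
    faults += [("sa1", i) for i in range(width)]
    faults += [("short", i, j) for i in range(width) for j in range(i + 1, width)]
    tests = walking_one_test_set(width)
    detected = sum(detects(tests, width, f) for f in faults)
    return detected, len(faults)


if __name__ == "__main__":
    detected, total = fault_coverage(8)
    print(f"{detected}/{total} modelled faults detected")
```

In this sketch the test set holds width + 2 words per channel, so testing channels one after another makes the total test time proportional to the number of channels tested, which mirrors the linear growth noted above for sequential schemes.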
Besides permanent faults, channels are also exposed to transient faults, e.g., crosstalk, and to faults due to aging, such as hot carrier injection (HCI). These faults are temporary and recoverable. The leading error correction code (ECC) techniques, such as automatic repeat request (ARQ), forward error correction (FEC), or a hybrid ARQ/FEC procedure that combines both, are followed in practice to tackle these temporary faults [26,43]. Many approaches consider these faults alongside the permanent faults in channels. For example, Amirali et al. [28] have presented an end-to-end (E2E) on-line fault detection methodology that accounts for transient faults in NoC interswitch channels. The method is based on an FEC scheme that corrects transient faults in the channels. Further, the observed FEC syndromes are reused to detect permanent faults in the channels. Liu et al. [44] have proposed an on-line fault detection technique for transient faults in NoC interconnects. The method works by sequentially selecting a channel shared by a router pair. The method is also presumed to detect permanent stuck-at and short faults in channels.
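The reuse of FEC syndromes to flag permanent faults can be sketched as follows. This is an illustration only: the code used in [28] is not stated here, so a Hamming(7,4) code is assumed, and the repeat-count `threshold` and all function names are hypothetical. The intuition is that a transient fault yields scattered syndromes, whereas a stuck wire keeps producing the same non-zero syndrome.

```python
from collections import Counter


def encode(data):
    """Hamming(7,4): data bits d1..d4 -> codeword at positions 1..7."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers positions 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # covers positions 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def syndrome(codeword):
    """Position (1..7) of a single flipped bit, or 0 if none is detected."""
    c = codeword
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    return s1 | (s2 << 1) | (s4 << 2)


def correct(codeword):
    """Correct a single-bit error; return (corrected codeword, syndrome)."""
    s = syndrome(codeword)
    fixed = list(codeword)
    if s:
        fixed[s - 1] ^= 1
    return fixed, s


def stuck_at(codeword, position, value):
    """Model one channel wire (1-based position) stuck at `value`."""
    received = list(codeword)
    received[position - 1] = value
    return received


def suspect_permanent(syndromes, threshold):
    """Wires whose syndrome recurs at least `threshold` times (assumed rule)."""
    counts = Counter(s for s in syndromes if s)
    return sorted(wire for wire, n in counts.items() if n >= threshold)


if __name__ == "__main__":
    words = [[1, 1, 0, 0], [0, 1, 1, 0], [1, 1, 1, 1], [0, 0, 0, 1]]
    observed = [correct(stuck_at(encode(w), 5, 0))[1] for w in words]
    print("syndromes:", observed)
    print("suspected wires:", suspect_permanent(observed, threshold=3))
```

Each individual error is still corrected by the FEC decoder; the syndrome history is only an additional, low-cost indication that the fault on a given wire may be permanent.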
Prior works may complement each other with respect to specific issues, but one needs an advanced approach that meets most of the quality characteristics in general. From the above literature survey, one may classify the limitations of the prior works in terms of the following quality characteristics:
• Maintaining System Reliability and Yield – This can be achieved by employing a cost-efficient test scheme.
• Test Size Reduction – A test mechanism should exhibit low test area overhead.
• Test Time Reduction – The overall test time should be as low as possible.
• Fault Coverage Metric – The test mechanism should reach a high fault coverage value, ideally up to 100%.
• Performance Overhead – Application of a test scheme should incur low performance overhead at run time.
• Method Scalability – The scope of a test scheme should not be confined to a particular topology, network size, or channel width.
• Fault Efficacy – Along with stuck-at and short faults, a test mechanism should be able to detect other faults, such as open faults.