Texas A&M University College Station - Engineering
The present disclosure relates to an example of a method for a first router to adaptively determine status within a network. The network may include the first router, a second router, and a third router. The method for the first router may comprise determining status information regarding the second router located in the network, and transmitting the status information to the third router located in the network. The second router and the third router may be indirectly coupled to one another.
Method and apparatus for congestion-aware routing in a computer interconnection network
The University of Texas at Austin
Texas A&M University
Associate Professor
Bryan/College Station
Texas Area
Texas A&M University
The University of Texas at Austin
IEEE
Doctor of Philosophy (PhD)
Designed the Last-level Cache and on-chip interconnect for the TRIPS processor system.
Electrical and Computer Engineering
Tau Beta Pi
The University of Texas at Austin
Bachelor of Science (BS)
Electrical Engineering
The University of Florida
Intel
Intel
Texas A&M University
Dept. of Electrical Engineering
College Station
Texas
Assistant Professor
Robert McDonald
Haiming Liu
Logic Design
High Performance Computing
Algorithms
Distributed Systems
Simulations
Parallel Computing
Verilog
Memory Design
VHDL
Computer Architecture
Microprocessors
LaTeX
Embedded Systems
VLSI
C
Interconnect
ModelSim
Xilinx
C++
Memory Test
Reena Panda
David Kadjo
2014 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)
For decades, the primary tools in alleviating the "Memory Wall" have been large cache hierarchies and data prefetchers. Both approaches become more challenging in modern Chip-multiprocessor (CMP) design. Increasing the last-level cache (LLC) size yields diminishing returns in terms of performance per Watt; given VLSI power scaling trends, this approach becomes hard to justify. These trends also impact hardware budgets for prefetchers. Moreover, in the context of CMPs running multiple concurrent processes, prefetching accuracy is critical to prevent cache pollution effects. These concerns point to the need for a light-weight prefetcher with high accuracy. Existing data prefetchers may generally be classified as low-overhead and low-accuracy (Next-n, Stride, etc.) or high-overhead and high-accuracy (STeMS, ISB). We propose B-Fetch: a data prefetcher driven by branch prediction and effective address value speculation. B-Fetch leverages control flow prediction to generate an expected future path of the executing application. It then speculatively computes the effective addresses of the load instructions along that path based upon a history of past register transformations. Detailed simulation using a cycle-accurate simulator shows a geometric mean speedup of 23.4% for single-threaded workloads, improving to 28.6% for multi-application workloads over a baseline system without prefetching. We find that B-Fetch outperforms an existing "best-of-class" light-weight prefetcher under single-threaded and multi-programmed workloads by 9% on average, with 65% less storage overhead.
B-Fetch: Branch Prediction Directed Prefetching for Chip-Multiprocessors
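The mechanism described above can be sketched in a few lines: follow the branch predictor down a predicted future path, then form prefetch addresses for the loads on that path from a history of base-register values. This is an illustrative Python sketch, not the paper's hardware design; the `predict_path`, `loads_in_block`, and `reg_history` structures are hypothetical stand-ins for its tables.

```python
def predict_path(branch_predictor, pc, depth):
    """Follow the branch predictor 'depth' branches ahead, returning
    the predicted sequence of basic-block start PCs."""
    path = [pc]
    for _ in range(depth):
        pc = branch_predictor(pc)  # predicted successor block
        path.append(pc)
    return path

def speculative_prefetch_addresses(path, loads_in_block, reg_history):
    """For each load on the predicted path, form base + offset using the
    last observed value (plus observed stride) of the base register."""
    addrs = []
    for block in path:
        for base_reg, offset in loads_in_block.get(block, []):
            last_value, stride = reg_history.get(base_reg, (None, 0))
            if last_value is not None:
                addrs.append(last_value + stride + offset)
    return addrs

# Toy example: block 0x100 branches to 0x200, which branches to 0x300.
successors = {0x100: 0x200, 0x200: 0x300}
path = predict_path(lambda pc: successors.get(pc, pc + 4), 0x100, 2)
# One load per block: (base register, immediate offset).
loads = {0x200: [("r1", 8)], 0x300: [("r2", 0)]}
# Register history: last seen value and observed stride per register.
history = {"r1": (0x8000, 64), "r2": (0x9000, 0)}
prefetches = speculative_prefetch_addresses(path, loads, history)
```

The sketch captures the key property claimed for B-Fetch: prefetch candidates come from the control-flow-predicted path rather than from address-stream patterns alone.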
Gwan Choi
This paper presents a bidirectional interconnect design which achieves significant reduction in area and power by allowing for simultaneous transmission and reception of signals on a single interconnect segment. The proposed interconnect design achieves twice the throughput with the same link width. We have modeled the bidirectional link on a 7×7 cycle-accurate NoC design. We have explored the latency for synthetic and realistic SPLASH-2 benchmarks. Synthetic benchmark results show that the bidirectional design performs exceedingly well under high congestion. Combinations of realistic benchmarks show that the bidirectional design achieves much lower latency whenever the injection level of the combined benchmark is higher.
Bidirectional interconnect design for low latency high bandwidth NoC
The energy cost of asymmetric cryptography, a vital component of modern secure communications, inhibits its widespread adoption within ultra-low energy regimes such as Implantable Medical Devices (IMDs), Wireless Sensor Networks (WSNs), and Radio Frequency Identification tags (RFIDs). Consequently, a gamut of hardware/software acceleration techniques exists to alleviate this energy burden. In this paper, we explore this design space, estimating the energy consumption for three levels of acceleration across the commercial security spectrum. First we examine an efficient baseline architecture centered around a pipelined RISC processor. We then include simple, yet beneficial instruction set extensions to our microarchitecture and evaluate the improvement in terms of energy per operation compared to baseline. Finally, we introduce a novel, dedicated accelerator to our microarchitecture and measure the energy per operation against the baseline and the ISA extensions. For ISA extensions, we show between a 1.28 and 1.41 factor improvement in energy efficiency over baseline, while for full acceleration we demonstrate a 4.36 to 6.45 factor improvement.
The design space of ultra-low energy asymmetric cryptography
Umit Ogras
As the core count in processor chips grows, so do the on-die, shared resources such as the on-chip communication fabric and shared cache, which are of paramount importance for chip performance and power. This paper presents a method for dynamic voltage/frequency scaling of networks-on-chip and last-level caches in multicore processor designs, where the shared resources form a single voltage/frequency domain. Several new techniques for monitoring and control are developed, and validated through full-system simulations on the PARSEC benchmarks. These techniques reduce energy-delay product by 56% compared to a state-of-the-art prior work.
Dynamic voltage and frequency scaling for shared resources in multicore processor designs
With the breakdown of Dennard scaling, future processor designs will be at the mercy of power limits as Chip Multi-Processor (CMP) designs scale out to many-cores. It is critical, therefore, that future CMPs be optimally designed in terms of performance efficiency with respect to power. A characterization analysis of future workloads is imperative to ensure maximum returns of performance per Watt consumed. Hence, a detailed analysis of emerging workloads is necessary to understand their characteristics with respect to hardware in terms of power and performance tradeoffs. In this paper, we conduct a limit study simultaneously analyzing the two dominant forms of parallelism exploited by modern computer architectures: Instruction Level Parallelism (ILP) and Thread Level Parallelism (TLP). This study gives insights into the upper bounds of performance that future architectures can achieve. Furthermore, it identifies the bottlenecks of emerging workloads. To the best of our knowledge, our work is the first study that combines the two forms of parallelism into one study with modern applications. We evaluate the PARSEC multithreaded benchmark suite using a specialized trace-driven simulator. We make several contributions describing the high-level behavior of next-generation applications. For example, we show these applications contain up to a factor of 929X more ILP than what is currently being extracted from real machines. We then show the effects of breaking the application into increasing numbers of threads (exploiting TLP), instruction window size, realistic branch prediction, realistic memory latency, and thread dependencies on exploitable ILP. Our examination shows that these benchmarks differ vastly from one another. As a result, we expect no single, homogeneous micro-architecture will work optimally for all, arguing for reconfigurable, heterogeneous designs.
ILP and TLP in shared memory applications: a limit study
Umit Ogras
Michael Kishinevsky
Jiang
H. J. Kim
Z. Xu
· Targeted an uninvestigated but promising computer architecture for power management
· Proposed a new but practical monitoring technique
· Employed a DVFS-based PID control policy to control the system
· Achieved around 33% dynamic energy savings with negligible performance degradation
· Implemented in C++
In-network Monitoring and Control Policy for DVFS of Networks-on-Chip and Last Level Caches in CMPs
Umit Ogras
Michael Kishinevsky
In chip design today and for the foreseeable future, the last-level cache and on-chip interconnect are not only performance critical but also substantial power consumers. This work focuses on employing dynamic voltage and frequency scaling (DVFS) policies for networks-on-chip (NoC) and shared, distributed last-level caches (LLC). In particular, we consider a practical system architecture where the distributed LLC and the NoC share a voltage/frequency domain that is separate from the core domain. This architecture enables the control of the relative speed between the cores and the memory hierarchy without introducing synchronization delays within the NoC. DVFS for this architecture is more complex than individual link/core-based DVFS since it involves spatially distributed monitoring and control. We propose an average memory access time (AMAT)-based monitoring technique and integrate it with DVFS based on PID control theory. Simulations on PARSEC benchmarks yield a 27% energy savings with a negligible impact on system performance.
In-network monitoring and control policy for DVFS of CMP networks-on-chip and last level caches
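The AMAT-driven control loop described above can be illustrated with a small sketch: an AMAT monitor feeds a discrete PID controller that nudges the shared uncore voltage/frequency level each epoch. The gains, the AMAT setpoint, and the frequency levels below are illustrative assumptions, not the tuned values from the work.

```python
class PIDController:
    """Discrete PID controller tracking an AMAT setpoint."""
    def __init__(self, kp, ki, kd, setpoint):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.setpoint = setpoint
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, measurement):
        error = measurement - self.setpoint  # positive: AMAT too high
        self.integral += error
        derivative = error - self.prev_error
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

def next_frequency_level(level, control, num_levels=4):
    """Move one discrete V/F step in the direction the controller asks."""
    if control > 0 and level < num_levels - 1:
        level += 1   # AMAT above target: speed the uncore up
    elif control < 0 and level > 0:
        level -= 1   # AMAT below target: save energy, slow down
    return level

pid = PIDController(kp=0.5, ki=0.1, kd=0.05, setpoint=20.0)  # target AMAT (cycles)
level = 1
for amat_sample in [25.0, 23.0, 21.0, 19.5]:   # epoch-by-epoch AMAT samples
    level = next_frequency_level(level, pid.update(amat_sample))
```

As the sampled AMAT sits above the setpoint, the controller steps the domain up toward its highest frequency level; when AMAT drops below target, the same loop would step it back down.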
Michael Kishinevsky
Umit Ogras
David Kadjo
2014 27th IEEE International System-on-Chip Conference (SOCC)
This paper presents a platform-level power management framework for mobile platforms. The proposed framework minimizes the overall platform energy while meeting system-level performance and power budget constraints. To this end, we construct analytical performance and power models using dynamic information collected via performance monitoring counters. Using these models, we design two different closed-loop controllers to ensure that both the performance and the power targets are achieved and maintained in the presence of dynamic workload variations. Experimental evaluations performed on an Android platform show up to 8% energy savings at the platform level and up to 15% CPU energy savings.
Towards platform level power management in mobile systems
David Kadjo
2013 IEEE 31st International Conference on Computer Design (ICCD)
We propose a novel technique to significantly reduce the leakage energy of last-level caches while mitigating any significant performance impact. In general, cache blocks are not ordered by their temporal locality within the sets; hence, simply power gating off a partition of the cache, as done in previous studies, may lead to considerable performance degradation. We propose a solution that migrates high temporal locality blocks to facilitate power gating, where blocks likely to be used in the future are migrated from the partition being shut down to the live partition at a negligible performance impact and hardware overhead. Our detailed simulations show energy savings of 66% with low performance degradation of 2.16%.
Power gating with block migration in chip-multiprocessor last-level caches
Moore's Law scaling is continuing to yield even higher transistor density with each succeeding process generation, leading to today's multi-core Chip Multi-Processors (CMPs) with tens or even hundreds of interconnected cores or tiles. Unfortunately, deep sub-micron CMOS process technology is marred by increasing susceptibility to wearout. Prolonged operational stress gives rise to accelerated wearout and failure, due to several physical failure mechanisms, including Hot Carrier Injection (HCI) and Negative Bias Temperature Instability (NBTI). Each failure mechanism correlates with different usage-based stresses, all of which can eventually generate permanent faults. While the wearout of an individual core in many-core CMPs may not necessarily be catastrophic for the system, a single fault in the inter-processor Network-on-Chip (NoC) fabric could render the entire chip useless, as it could lead to protocol-level deadlocks, or even partition away vital components such as the memory controller or other critical I/O. In this paper, we develop critical path models for HCI- and NBTI-induced wear due to the actual stresses caused by real workloads, applied onto the interconnect microarchitecture. A key finding from this modeling is that, counter to prevailing wisdom, wearout in the CMP on-chip interconnect is correlated with a lack of load observed in the NoC routers, rather than high load. We then develop a novel wearout-decelerating scheme in which routers under low load have their wearout-sensitive components exercised, without significantly impacting cycle time, pipeline depth, area or power consumption of the overall router. We subsequently show that the proposed design yields a 13.8x-65x increase in CMP lifetime.
Use it or lose it: wear-out and lifetime in future chip multiprocessors
With increasing core counts in Chip Multi-Processor (CMP) designs, the size of the on-chip communication fabric and shared Last-Level Caches (LLC), which we term the uncore here, is also growing, consuming as much as 30% of die area and a significant portion of the chip power budget. In this work, we focus on improving uncore energy-efficiency using dynamic voltage and frequency scaling. Previous approaches are mostly restricted to reactive techniques, which may respond poorly to abrupt workload and uncore utility changes. We find, however, there are predictable patterns in uncore utility which point towards the potential of a proactive approach to uncore power management. In this work, we utilize artificial intelligence principles to proactively leverage uncore utility pattern prediction via an Artificial Neural Network (ANN). ANNs, however, require training to produce accurate predictions. Architecting an efficient training mechanism without a priori knowledge of the workload is a major challenge. We propose a novel technique in which a simple Proportional Integral (PI) controller is used as a secondary classifier during ANN training, dynamically pulling the ANN up by its bootstraps to achieve accurate predictions. Both the ANN and the PI controller, then, work in tandem once the ANN training phase is complete. The advantage of using a PI controller to initially train the ANN is a dramatic acceleration of the ANN's initial learning phase. Thus, in a real system, this scenario allows quick power-control adaptation to rapid application phase changes and context switches during execution. We show that the proposed technique produces results comparable to those of pure offline training without a need for prerecorded training sets. Full system simulations using the PARSEC benchmark suite show that the bootstrapped ANN improves the energy-delay product of the uncore system by 27% versus existing state-of-the-art methodologies.
Up by their bootstraps: Online learning in artificial neural networks for CMP uncore power management
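The bootstrapping idea above, a PI controller supplying the training targets for the learner during warm-up, can be sketched compactly. A single linear neuron trained by online gradient descent stands in for the paper's ANN; the gains, setpoint, learning rate, and utilization samples are all hypothetical.

```python
class PI:
    """Simple proportional-integral controller used as the warm-up teacher."""
    def __init__(self, kp, ki, setpoint):
        self.kp, self.ki, self.setpoint = kp, ki, setpoint
        self.integral = 0.0

    def output(self, measurement):
        error = measurement - self.setpoint
        self.integral += error
        return self.kp * error + self.ki * self.integral

class OnlinePredictor:
    """One linear neuron trained by stochastic gradient descent; a tiny
    stand-in for the ANN."""
    def __init__(self, n_inputs, lr=0.05):
        self.w = [0.0] * n_inputs
        self.b = 0.0
        self.lr = lr

    def predict(self, x):
        return sum(wi * xi for wi, xi in zip(self.w, x)) + self.b

    def train(self, x, target):
        err = self.predict(x) - target
        self.w = [wi - self.lr * err * xi for wi, xi in zip(self.w, x)]
        self.b -= self.lr * err

pi = PI(kp=0.4, ki=0.05, setpoint=10.0)
ann = OnlinePredictor(n_inputs=1)

# Warm-up phase: the PI controller drives the decision AND labels the
# predictor's online training samples.
for utilization in [14.0, 12.0, 11.0, 10.5, 10.2]:
    target = pi.output(utilization)
    ann.train([utilization], target)

# Steady state: the now-trained predictor proposes the control action.
proposal = ann.predict([11.0])
```

The point of the sketch is the division of labor: no prerecorded training set is needed, because the controller's own outputs label the observations seen during execution.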
Jasson Casey
The Software Defined Networking (SDN) approach has numerous advantages, including the ability to program the network through simple abstractions, provide a centralized view of network state, and respond to changing network conditions. One of the main challenges in designing SDN-enabled switches is efficient packet classification in the data plane. As the complexity of SDN applications increases, the data plane becomes more susceptible to Denial of Service (DoS) attacks, which can result in increased delays and packet loss. Accordingly, there is a strong need for network architectures that operate efficiently in the presence of malicious traffic. In particular, there is a need to protect authorized flows from DoS attacks. In this work we utilize a probabilistic data structure to pre-classify traffic with the aim of decoupling likely legitimate traffic from malicious traffic by leveraging the locality of packet flows. We validate our approach by examining a fundamental SDN application: the software defined network firewall. For this application, our architecture dramatically reduces the impact of unknown/malicious flows on established/legitimate flows. We explore the effect of stochastic pre-classification in prioritizing data plane classification. We show how pre-classification can be used to increase the effective Quality of Service (QoS) for established flows and reduce the impact of adversarial traffic.
Stochastic Pre-classification for SDN Data Plane Matching
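A Bloom filter is one natural choice for the probabilistic pre-classifier; the sketch below assumes that choice (the work's exact structure may differ). Established flow keys are inserted; an arriving packet whose flow key hits in the filter is steered to the fast path, and everything else to a slow, rate-limited path. Filter size and hash construction are illustrative.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter over hashable flow keys."""
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # bit vector stored as a Python integer

    def _positions(self, key):
        # Derive k bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # False positives possible; false negatives are not.
        return all(self.bits >> pos & 1 for pos in self._positions(key))

established = BloomFilter()
established.add(("10.0.0.1", "10.0.0.2", 6, 443))  # a known-established flow

def classify(flow_key):
    """Pre-classify: likely-established flows get the fast path, the rest
    are demoted to a slow (rate-limited) classification path."""
    return "fast" if established.might_contain(flow_key) else "slow"
```

Because the filter never produces false negatives, established flows are never demoted; the occasional false positive only lets a stray packet onto the fast path, which the full classifier behind it still handles correctly.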
Gwan Choi
Yoonseok Yang
ACM Transactions on Design Automation of Electronic Systems (TODAES)
WaveSync is a network-on-chip architecture for globally asynchronous locally-synchronous (GALS) design. The WaveSync design facilitates low-latency communication leveraging the source-synchronous clock sent along with the data to time components in the datapath of a downstream router, reducing the number of synchronizations needed. WaveSync accomplishes this by partitioning the router components at each node into different clock domains, each synchronized with one of the orthogonal incoming source-synchronous clocks in a GALS 2D mesh network. The data and clock subsequently propagate through each node/router synchronously until the destination is reached, regardless of the number of hops this may take. As long as the data travels in the path of clock propagation and no congestion is encountered, it will be propagated without latching as if in a long combinatorial path, with both the clock and the data accruing delay at the same rate. The result is that the need for synchronization between the mesochronous nodes and/or the asynchronous control associated with the typical GALS network is completely eliminated. To further reduce the latency overhead of synchronization, for those occasions when synchronization is still required (when a flit takes a turn or arrives at the destination), we propose a novel less-than-one-cycle synchronizer. The proposed WaveSync network, synthesized using a 45nm CMOS library, outperforms conventional GALS networks by 87-90% in average latency.
WaveSync: Low-Latency Source-Synchronous Bypass Network-on-Chip Architecture
Firewalls are ubiquitous security functions and exist in almost all network-connected devices, whether protecting host stacks or providing transient packet filtering. Firewall performance, which is a key ingredient of network performance, can be greatly degraded by traffic crafted to exploit its filtering algorithms. These attacks can greatly reduce the Quality of Service (QoS) received by existing authorized flows in the firewall. This paper proposes a novel architecture that decouples this linkage between authorized flow QoS and adversarial traffic, marginalizing disruption caused by unauthorized flows, and ultimately improving overall performance of software defined firewalls. We show substantial improvements in throughput, packet loss, and latency over baseline software defined firewalls with varying ratios of attack traffic. All results are obtained using the cycle-accurate architecture simulator gem5, and Internet packet traces obtained from 10 Gbps interfaces of core Internet routers.
Stochastic Pre-Classification for Software Defined Firewalls
Cheng Li
Mark Browning
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
To meet energy-efficient performance demands, the computing industry has moved to parallel computer architectures, such as chip multiprocessors (CMPs), internally interconnected via networks-on-chip (NoC) to meet growing communication needs. Achieving scaling performance as core counts increase to the hundreds in future CMPs, however, will require high performance, yet energy-efficient interconnects. Silicon nanophotonics is a promising replacement for electronic on-chip interconnect due to its high bandwidth and low latency; however, prior techniques have required high static power for the laser and ring thermal tuning. We propose a novel nanophotonic NoC (PNoC) architecture, LumiNOC, optimized for high performance and power-efficiency. This paper makes three primary contributions: a novel nanophotonic architecture which partitions the network into subnets for better efficiency; a purely photonic, in-band, distributed arbitration scheme; and a channel sharing arrangement utilizing the same waveguides and wavelengths for arbitration as for data transmission. In a 64-node NoC under synthetic traffic, LumiNOC enjoys 50% lower latency at low loads and ~40% higher throughput per Watt, versus other reported PNoCs. LumiNOC reduces latency ~40% versus an electrical 2D mesh NoC on the PARSEC shared-memory, multithreaded benchmark suite.
LumiNOC: A Power-Efficient, High-Performance, Photonic Network-on-Chip
As processor chips become increasingly parallel, an efficient communication substrate is critical for meeting performance and energy targets. In this work, we target the root cause of network energy consumption through techniques that reduce link and router-level switching activity. We specifically focus on memory subsystem traffic, as it comprises the bulk of NoC load in a CMP. By transmitting only the flits that contain words predicted useful using a novel spatial locality predictor, our scheme seeks to reduce network activity. We aim to further lower NoC energy through microarchitectural mechanisms that inhibit datapath switching activity for unused words in individual flits. Using simulation-based performance studies and detailed energy models based on synthesized router designs and different link wire types, we show that 1) the prediction mechanism achieves very high accuracy, with an average rate of false-unused prediction of just 2.5 percent; 2) the combined NoC energy savings enabled by the predictor and microarchitectural support is 36 percent, on average, and up to 57 percent in the best case; and 3) there is no system performance penalty as a result of this technique.
Spatial Locality Speculation to Reduce Energy in Chip-Multiprocessor Networks-on-Chip
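The predictor's core idea, remembering which words of a fetched block were actually useful the last time a given load missed, can be sketched with a small PC-indexed table of used-word bitmasks. The table organization below is a hypothetical simplification of the hardware design, not its actual structure.

```python
WORDS_PER_BLOCK = 8  # illustrative: 64B block of 8-byte words

class SpatialLocalityPredictor:
    def __init__(self):
        self.table = {}  # pc -> bitmask of words observed useful

    def predict(self, pc):
        # Unknown PCs conservatively fetch the whole block.
        return self.table.get(pc, (1 << WORDS_PER_BLOCK) - 1)

    def train(self, pc, used_mask):
        # Record which words were actually touched before eviction.
        self.table[pc] = used_mask

def words_to_send(mask):
    """Expand a bitmask into the word indices worth transmitting."""
    return [w for w in range(WORDS_PER_BLOCK) if mask >> w & 1]

pred = SpatialLocalityPredictor()
cold = words_to_send(pred.predict(0x400))     # cold miss: all 8 words
pred.train(0x400, 0b00000011)                 # only words 0 and 1 were used
sent = words_to_send(pred.predict(0x400))     # next miss: send 2 words
```

A mispredicted-unused word would force a follow-up request in a real design, which is why the reported false-unused rate of 2.5 percent matters: the savings come almost entirely from correctly suppressed words.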
Reena Panda
IEEE Computer Architecture Letters
Computer architecture is beset by two opposing trends. Technology scaling and deep pipelining have led to high memory access latencies; meanwhile, power and energy considerations have revived interest in traditional in-order processors. In-order processors, unlike their superscalar counterparts, do not allow execution to continue around data cache misses. In-order processors, therefore, suffer a greater performance penalty in light of current high memory access latencies. Memory prefetching is an established technique to reduce the incidence of cache misses and improve performance. In this paper, we introduce B-Fetch, a new technique for data prefetching which combines branch-prediction-based deep-path lookahead speculation with effective address speculation, to efficiently improve performance in in-order processors. Our results show that B-Fetch improves performance 38.8% on SPEC CPU2006 benchmarks, beating a current, state-of-the-art prefetcher design at ~1/3 the hardware overhead.
B-Fetch: Branch Prediction Directed Prefetching for In-Order Processors
As modern CMPs scale to ever increasing core counts, Networks-on-Chip (NoCs) are emerging as the interconnection fabric, enabling communication between components. While NoCs provide high and scalable bandwidth, current routing algorithms, such as dimension-ordered routing, suffer from poor load balance, leading to reduced throughput and high latencies. Improving load balance, hence, is critical in future CMP designs where increased latency leads to wasted power and energy waiting for outstanding requests to resolve. Adaptive routing is a known technique to improve load balance; however, prior adaptive routing techniques use either local or regionally aggregated information to form their routing decisions. This paper proposes a new, light-weight, adaptive routing algorithm for on-chip routers based on global link state and congestion information: Global Congestion Awareness (GCA). GCA uses a simple, low-complexity route calculation unit to calculate paths to their destination without the myopia of local decisions, or the aggregation of unrelated status information, found in prior designs. In particular, GCA outperforms local adaptive routing by 26%, Regional Congestion Awareness (RCA) by 15%, and a recent competing adaptive routing algorithm, DAR, by 8% on average on realistic workloads.
GCA: Global congestion awareness for load balance in networks-on-chip
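In spirit, GCA's route computation picks, among the minimal (productive) next hops in a 2D mesh, the one whose estimated congestion along the remaining path is lowest. The sketch below is an illustrative simplification: a dimension-ordered cost walk over a hypothetical per-router congestion map, not the paper's actual link-state encoding or route calculation unit.

```python
def minimal_next_hops(cur, dst):
    """Productive neighbors that reduce distance to dst in a 2D mesh."""
    x, y = cur
    dx, dy = dst[0] - x, dst[1] - y
    hops = []
    if dx:
        hops.append((x + (1 if dx > 0 else -1), y))
    if dy:
        hops.append((x, y + (1 if dy > 0 else -1)))
    return hops

def path_cost(node, dst, congestion):
    """Sum of congestion estimates along a dimension-ordered (X then Y)
    path from node to dst."""
    x, y = node
    cost = congestion.get((x, y), 0)
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        cost += congestion.get((x, y), 0)
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        cost += congestion.get((x, y), 0)
    return cost

def gca_route(cur, dst, congestion):
    """Pick the minimal next hop with the cheapest estimated remaining path."""
    return min(minimal_next_hops(cur, dst),
               key=lambda hop: path_cost(hop, dst, congestion))

# A hot spot along the X-first path pushes traffic onto the Y-first hop.
congestion = {(1, 0): 9, (2, 0): 9}
choice = gca_route((0, 0), (2, 2), congestion)
```

The contrast with local adaptive routing is visible even in this toy: a purely local decision at (0, 0) sees no congestion at either neighbor, whereas the global view steers around the hot spot two hops away.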
Viacheslav Fedorov
ACM Transactions on Architecture and Code Optimization (TACO)
Decreasing the traffic from the CPU LLC to main memory is a very important issue in modern systems. Recent work focuses on cache misses, overlooking the impact of writebacks on the total memory traffic, energy consumption, IPC, and so forth. Policies that foster a balanced approach, between reducing write traffic to memory and improving miss rates, can increase overall performance and improve energy efficiency and memory system lifetime for NVM memory technologies, such as phase-change memory (PCM). We propose Adaptive Replacement and Insertion (ARI), an adaptive approach to last-level CPU cache management, optimizing the two parameters (miss rate and writeback rate) simultaneously. Our specific focus is to reduce writebacks as much as possible while maintaining or improving the miss rate relative to the conventional LRU replacement policy. ARI reduces LLC writebacks by 33%, on average, while also decreasing misses by 4.7%, on average. In a typical system, this boosts IPC by 4.9%, on average, while decreasing energy consumption by 8.9%. These results are achieved with minimal hardware overheads.
ARI: Adaptive LLC-memory traffic management
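The balance ARI strikes, trading a slightly less-recently-used clean victim against the writeback cost of evicting a dirty block, can be illustrated with a toy replacement policy for a single cache set. The search-window size and the policy below are hypothetical and far simpler than ARI's adaptive mechanism; they only show why preferring clean victims reduces write traffic.

```python
from collections import OrderedDict

class WritebackAwareSet:
    """One cache set: LRU order with a clean-victim preference window."""
    def __init__(self, num_ways=8, window=3):
        self.num_ways = num_ways
        self.window = window           # how deep to search for a clean victim
        self.blocks = OrderedDict()    # tag -> dirty flag; LRU end first

    def access(self, tag, write=False):
        if tag in self.blocks:
            dirty = self.blocks.pop(tag)
            self.blocks[tag] = dirty or write   # move to MRU end
            return "hit"
        if len(self.blocks) >= self.num_ways:
            self.evict()
        self.blocks[tag] = write
        return "miss"

    def evict(self):
        """Return True if the eviction cost a writeback."""
        lru_order = list(self.blocks)
        # Prefer a clean block within the LRU-end window: no writeback.
        for tag in lru_order[: self.window]:
            if not self.blocks[tag]:
                del self.blocks[tag]
                return False
        # Otherwise evict the true LRU block, paying the writeback if dirty.
        return self.blocks.pop(lru_order[0])

cache = WritebackAwareSet(num_ways=2, window=2)
cache.access("A", write=True)     # "A" inserted dirty
cache.access("B")                 # "B" inserted clean
victim_writeback = cache.evict()  # clean "B" chosen over dirty LRU "A"
```

Plain LRU would have evicted the dirty block "A" here and paid a writeback; the window lets the dirty block linger, which is the direction ARI optimizes, while ARI additionally adapts insertion and the miss/writeback tradeoff at runtime.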
Paul V. Gratz