Last update:
Tue Sep 30 09:26:37 MDT 2025
Brad Calder and
Dean Tullsen Introduction . . . . . . . . . . . . . . 1--2
W. Zhang and
J. S. Hu and
V. Degalahal and
M. Kandemir and
N. Vijaykrishnan and
M. J. Irwin Reducing instruction cache energy
consumption using a compiler-based
strategy . . . . . . . . . . . . . . . . 3--33
Nemanja Isailovic and
Mark Whitney and
Yatish Patel and
John Kubiatowicz and
Dean Copsey and
Frederic T. Chong and
Isaac L. Chuang and
Mark Oskin Datapath and control for quantum wires 34--61
Karthikeyan Sankaralingam and
Ramadass Nagarajan and
Haiming Liu and
Changkyu Kim and
Jaehyuk Huh and
Nitya Ranganathan and
Doug Burger and
Stephen W. Keckler and
Robert G. McDonald and
Charles R. Moore TRIPS: a polymorphous architecture for
exploiting ILP, TLP, and DLP . . . . . . 62--93
Kevin Skadron and
Mircea R. Stan and
Karthik Sankaranarayanan and
Wei Huang and
Sivakumar Velusamy and
David Tarjan Temperature-aware microarchitecture:
Modeling and implementation . . . . . . 94--125
Alex Aletà and
Josep M. Codina and
Antonio González and
David Kaeli Removing communications in clustered
microarchitectures through instruction
replication . . . . . . . . . . . . . . 127--151
Yu Bai and
R. Iris Bahar A low-power in-order/out-of-order issue
queue . . . . . . . . . . . . . . . . . 152--179
Philo Juang and
Kevin Skadron and
Margaret Martonosi and
Zhigang Hu and
Douglas W. Clark and
Philip W. Diodato and
Stefanos Kaxiras Implementing branch-predictor decay
using quasi-static memory cells . . . . 180--219
Oliverio J. Santana and
Alex Ramirez and
Josep L. Larriba-Pey and
Mateo Valero A low-complexity fetch architecture for
high-performance superscalar processors 220--245
Jin Lin and
Tong Chen and
Wei-Chung Hsu and
Pen-Chung Yew and
Roy Dz-Ching Ju and
Tin-Fook Ngai and
Sun Chan A compiler framework for speculative
optimizations . . . . . . . . . . . . . 247--271
Brian A. Fields and
Rastislav Bodik and
Mark D. Hill and
Chris J. Newburn Interaction cost and shotgun profiling 272--304
Karthik Sankaranarayanan and
Kevin Skadron Profile-based adaptation for cache decay 305--322
Fen Xie and
Margaret Martonosi and
Sharad Malik Intraprogram dynamic voltage scaling:
Bounding opportunities with analytic
modeling . . . . . . . . . . . . . . . . 323--367
A. Hartstein and
Thomas R. Puzak The optimum pipeline depth considering
both power and performance . . . . . . . 369--388
Adrián Cristal and
Oliverio J. Santana and
Mateo Valero and
José F. Martínez Toward kilo-instruction processors . . . 389--417
Haitham Akkary and
Ravi Rajwar and
Srikanth T. Srinivasan An analysis of a resource efficient
checkpoint architecture . . . . . . . . 418--444
Chia-Lin Yang and
Alvin R. Lebeck and
Hung-Wei Tseng and
Chien-Hao Lee Tolerating memory latency through push
prefetching for pointer-intensive
applications . . . . . . . . . . . . . . 445--475
Brad Calder and
Dean Tullsen Introduction . . . . . . . . . . . . . . 1--2
Yuanyuan Zhou and
Pin Zhou and
Feng Qin and
Wei Liu and
Josep Torrellas Efficient and flexible architectural
support for dynamic monitoring . . . . . 3--33
Chuanjun Zhang and
Frank Vahid and
Jun Yang and
Walid Najjar A way-halting cache for low-energy
high-performance systems . . . . . . . . 34--54
Jaume Abella and
Antonio González and
Xavier Vera and
Michael F. P. O'Boyle IATAC: a smart predictor to turn-off L2
cache lines . . . . . . . . . . . . . . 55--77
John W. Haskins, Jr. and
Kevin Skadron Accelerated warmup for sampled
microarchitecture simulation . . . . . . 78--108
Tao Li and
Ravi Bhargava and
Lizy Kurian John Adapting branch-target buffer to improve
the target predictability of Java code 109--130
Lingli Zhang and
Chandra Krintz The design, implementation, and
evaluation of adaptive code unloading
for resource-constrained devices . . . . 131--164
Prasad A. Kulkarni and
Stephen R. Hines and
David B. Whalley and
Jason D. Hiser and
Jack W. Davidson and
Douglas L. Jones Fast and efficient searches for
effective optimization-phase sequences 165--198
Esther Salamí and
Mateo Valero Dynamic memory interval test vs.
interprocedural pointer analysis in
multimedia applications . . . . . . . . 199--219
Yan Meng and
Timothy Sherwood and
Ryan Kastner Exploring the limits of leakage power
reduction in caches . . . . . . . . . . 221--246
María Jesús Garzarán and
Milos Prvulovic and
José María Llabería and
Víctor Viñals and
Lawrence Rauchwerger and
Josep Torrellas Tradeoffs in buffering speculative
memory state for thread-level
speculation in multiprocessors . . . . . 247--279
David Tarjan and
Kevin Skadron Merging path and gshare indexing in
perceptron branch prediction . . . . . . 280--300
Xiangyu Zhang and
Rajiv Gupta Whole execution traces and their
applications . . . . . . . . . . . . . . 301--334
Wankang Zhao and
David Whalley and
Christopher Healy and
Frank Mueller Improving WCET by applying a WC
code-positioning optimization . . . . . 335--365
George A. Reis and
Jonathan Chang and
Neil Vachharajani and
Ram Rangan and
David I. August and
Shubhendu S. Mukherjee Software-controlled fault tolerance . . 366--396
Jian Li and
José F. Martínez Power-performance considerations of
parallel computing on chip
multiprocessors . . . . . . . . . . . . 397--422
Saurabh Sharma and
Jesse G. Beu and
Thomas M. Conte Spectral prefetcher: An effective
mechanism for L2 cache prefetching . . . 423--450
Brad Calder and
Dean Tullsen Introduction . . . . . . . . . . . . . . 1--2
Lin Tan and
Brett Brotherton and
Timothy Sherwood Bit-split string-matching engines for
intrusion detection and prevention . . . 3--34
Priya Nagpurkar and
Hussam Mousa and
Chandra Krintz and
Timothy Sherwood Efficient remote profiling for
resource-constrained devices . . . . . . 35--66
Jin Lin and
Wei-Chung Hsu and
Pen-Chung Yew and
Roy Dz-Ching Ju and
Tin-Fook Ngai Recovery code generation for general
speculative optimizations . . . . . . . 67--89
Yoonseo Choi and
Hwansoo Han Optimal register reassignment for
register stack overflow minimization . . 90--114
Jingling Xue and
Qiong Cai A lifetime optimal algorithm for
speculative PRE . . . . . . . . . . . . 115--155
Joseph J. Sharkey and
Dmitry V. Ponomarev and
Kanad Ghose and
Oguz Ergin Instruction packing: Toward fast and
energy-efficient instruction scheduling 156--181
Luis Ceze and
Karin Strauss and
James Tuck and
Josep Torrellas and
Jose Renau CAVA: Using checkpoint-assisted value
prediction to hide L2 misses . . . . . . 182--208
Lixin Zhang and
Mike Parker and
John Carter Efficient address remapping in
distributed shared-memory systems . . . 209--229
Min Zhao and
Bruce R. Childers and
Mary Lou Soffa An approach toward profit-driven
optimization . . . . . . . . . . . . . . 231--262
Kim Hazelwood and
Michael D. Smith Managing bounded code caches in dynamic
binary optimization systems . . . . . . 263--294
Olivier Rochecouste and
Gilles Pokam and
André Seznec A case for a complexity-effective,
width-partitioned microarchitecture . . 295--326
Ahmad Zmily and
Christos Kozyrakis Block-aware instruction set architecture 327--357
Jedidiah R. Crandall and
S. Felix Wu and
Frederic T. Chong Minos: Architectural support for
protecting control data . . . . . . . . 359--389
Jaydeep Marathe and
Frank Mueller and
Bronis R. de Supinski Analysis of cache-coherence bottlenecks
with hybrid hardware/software techniques 390--423
Ilya Ganusov and
Martin Burtscher Future execution: a prefetching
mechanism that uses multiple cores to
speed up single threads . . . . . . . . 424--449
Michele Co and
Dee A. B. Weikle and
Kevin Skadron Evaluating trace cache energy efficiency 450--476
Shiwen Hu and
Madhavi Valluri and
Lizy Kurian John Effective management of multiple
configurable units using dynamic
optimization . . . . . . . . . . . . . . 477--501
Chris Bentley and
Scott A. Watterson and
David K. Lowenthal and
Barry Rountree Implicit array bounds checking on 64-bit
architectures . . . . . . . . . . . . . 502--527
Brad Calder and
Dean Tullsen Introduction . . . . . . . . . . . . . . 1:1--1:1
Kypros Constantinides and
Stephen Plaza and
Jason Blome and
Valeria Bertacco and
Scott Mahlke and
Todd Austin and
Bin Zhang and
Michael Orshansky Architecting a reliable CMP switch
architecture . . . . . . . . . . . . . . 2:1--2:37
Ruchira Sasanka and
Man-Lap Li and
Sarita V. Adve and
Yen-Kuang Chen and
Eric Debes ALP: Efficient support for all levels of
parallelism for complex media
applications . . . . . . . . . . . . . . 3:1--3:30
Yan Luo and
Jia Yu and
Jun Yang and
Laxmi N. Bhuyan Conserving network processor power
consumption by exploiting traffic
variability . . . . . . . . . . . . . . 4:1--4:26
Vassos Soteriou and
Noel Eisley and
Li-Shiuan Peh Software-directed power-aware
interconnection networks . . . . . . . . 5:1--5:40
Yuan-Shin Hwang and
Jia-Jhe Li Snug set-associative caches: Reducing
leakage power of instruction and data
caches with no performance penalties . . 6:1--6:28
Hongbo Rong and
Zhizhong Tang and
R. Govindarajan and
Alban Douillet and
Guang R. Gao Single-dimension software pipelining for
multidimensional loops . . . . . . . . . 7:1--7:44
Fred A. Bower and
Daniel J. Sorin and
Sule Ozev Online diagnosis of hard faults in
microprocessors . . . . . . . . . . . . 8:1--8:??
Pierre Michaud and
André Seznec and
Damien Fetis and
Yiannakis Sazeides and
Theofanis Constantinou A study of thread migration in
temperature-constrained multicores . . . 9:1--9:??
Yu Chen and
Fuxin Zhang Code reordering on limited branch offset 10:1--10:??
A. S. Terechko and
H. Corporaal Inter-cluster communication in VLIW
architectures . . . . . . . . . . . . . 11:1--11:??
Jialin Dou and
Marcelo Cintra A compiler cost model for speculative
parallelization . . . . . . . . . . . . 12:1--12:??
Wolfram Amme and
Jeffery von Ronne and
Michael Franz SSA-based mobile code: Implementation
and empirical evaluation . . . . . . . . 13:1--13:??
Xiaodong Li and
Ritu Gupta and
Sarita V. Adve and
Yuanyuan Zhou Cross-component energy management: Joint
adaptation of processor and memory . . . 14:1--14:??
Ron Gabor and
Shlomo Weiss and
Avi Mendelson Fairness enforcement in switch on event
multithreading . . . . . . . . . . . . . 15:1--15:??
Diego Andrade and
Basilio B. Fraguela and
Ramón Doallo Precise automatable analytical modeling
of the cache behavior of codes with
indirections . . . . . . . . . . . . . . 16:1--16:??
Kris Venstermans and
Lieven Eeckhout and
Koen De Bosschere Java object header elimination for
reduced memory consumption in 64-bit
virtual machines . . . . . . . . . . . . 17:1--17:??
Shu Xiao and
Edmund M.-K. Lai VLIW instruction scheduling for minimal
power variation . . . . . . . . . . . . 18:1--18:??
Sriraman Tallam and
Rajiv Gupta Unified control flow and data dependence
traces . . . . . . . . . . . . . . . . . 19:1--19:??
Engin Ipek and
Sally A. McKee and
Karan Singh and
Rich Caruana and
Bronis R. de Supinski and
Martin Schulz Efficient architectural design space
exploration via predictive modeling . . 1:1--1:??
Yunhe Shi and
Kevin Casey and
M. Anton Ertl and
David Gregg Virtual machine showdown: Stack versus
registers . . . . . . . . . . . . . . . 2:1--2:??
Jun Yan and
Wei Zhang Exploiting virtual registers to reduce
pressure on real registers . . . . . . . 3:1--3:??
Zoe C. H. Yu and
Francis C. M. Lau and
Cho-Li Wang Object co-location and memory reuse for
Java programs . . . . . . . . . . . . . 4:1--4:??
Chuanjun Zhang Reducing cache misses through
programmable decoders . . . . . . . . . 5:1--5:??
Amit Golander and
Shlomo Weiss Hiding the misprediction penalty of a
resource-efficient high-performance
processor . . . . . . . . . . . . . . . 6:1--6:??
Brad Calder and
Dean Tullsen Editorial . . . . . . . . . . . . . . . 1:1--1:??
Shashidhar Mysore and
Banit Agrawal and
Rodolfo Neuber and
Timothy Sherwood and
Nisheeth Shrivastava and
Subhash Suri Formulating and implementing profiling
over adaptive ranges . . . . . . . . . . 2:1--2:??
Antonia Zhai and
J. Gregory Steffan and
Christopher B. Colohan and
Todd C. Mowry Compiler and hardware support for
reducing the synchronization of
speculative threads . . . . . . . . . . 3:1--3:??
Jonathan A. Winter and
David H. Albonesi Addressing thermal nonuniformity in SMT
workloads . . . . . . . . . . . . . . . 4:1--4:??
Asadollah Shahbahrami and
Ben Juurlink and
Stamatis Vassiliadis Versatility of extended subwords and the
matrix register file . . . . . . . . . . 5:1--5:??
Zhi Guo and
Walid Najjar and
Betul Buyukkurt Efficient hardware code generation for
FPGAs . . . . . . . . . . . . . . . . . 6:1--6:??
Thomas Kotzmann and
Christian Wimmer and
Hanspeter Mössenböck and
Thomas Rodriguez and
Kenneth Russell and
David Cox Design of the Java HotSpot™ client
compiler for Java 6 . . . . . . . . . . 7:1--7:??
Ram Rangan and
Neil Vachharajani and
Guilherme Ottoni and
David I. August Performance scalability of decoupled
software pipelining . . . . . . . . . . 8:1--8:??
Jieyi Long and
Seda Ogrenci Memik and
Gokhan Memik and
Rajarshi Mukherjee Thermal monitoring mechanisms for chip
multiprocessors . . . . . . . . . . . . 9:1--9:??
Ajay Joshi and
Lieven Eeckhout and
Robert H. Bell, Jr. and
Lizy K. John Distilling the essence of proprietary
workloads into miniature benchmarks . . 10:1--10:??
Vincenzo Catania and
Maurizio Palesi and
Davide Patti Reducing complexity of multiobjective
design space exploration in VLIW-based
embedded systems . . . . . . . . . . . . 11:1--11:??
Jacob Leverich and
Hideho Arakida and
Alex Solomatnikov and
Amin Firoozshahian and
Mark Horowitz and
Christos Kozyrakis Comparative evaluation of memory models
for chip multiprocessors . . . . . . . . 12:1--12:??
Joseph J. Sharkey and
Jason Loew and
Dmitry V. Ponomarev Reducing register pressure in SMT
processors through L2-miss-driven early
register release . . . . . . . . . . . . 13:1--13:??
Mojtaba Mehrara and
Todd Austin Exploiting selective placement for
low-cost memory protection . . . . . . . 14:1--14:??
Hans Vandierendonck and
André Seznec Speculative return address stack
management revisited . . . . . . . . . . 15:1--15:??
Siddhartha Chhabra and
Brian Rogers and
Yan Solihin and
Milos Prvulovic Making secure processors OS- and
performance-friendly . . . . . . . . . . 16:1--16:??
Daniel A. Jiménez Generalizing neural branch prediction 17:1--17:??
Jinseong Jeon and
Keoncheol Shin and
Hwansoo Han Abstracting access patterns of dynamic
memory using regular expressions . . . . 18:1--18:??
Ghassan Shobaki and
Kent Wilken and
Mark Heffernan Optimal trace scheduling using
enumeration . . . . . . . . . . . . . . 19:1--19:??
Prasad A. Kulkarni and
David B. Whalley and
Gary S. Tyson and
Jack W. Davidson Practical exhaustive optimization phase
order exploration and evaluation . . . . 1:1--1:??
Manuel Hohenauer and
Felix Engel and
Rainer Leupers and
Gerd Ascheid and
Heinrich Meyr A SIMD optimization framework for
retargetable compilers . . . . . . . . . 2:1--2:??
Stijn Eyerman and
Lieven Eeckhout Memory-level parallelism aware fetch
policies for simultaneous multithreading
processors . . . . . . . . . . . . . . . 3:1--3:??
Lukasz Strozek and
David Brooks Energy- and area-efficient architectures
through application clustering and
architectural heterogeneity . . . . . . 4:1--4:??
Guru Venkataramani and
Ioannis Doudalis and
Yan Solihin and
Milos Prvulovic MemTracker: An accelerator for memory
debugging and monitoring . . . . . . . . 5:1--5:??
Ron Gabor and
Avi Mendelson and
Shlomo Weiss Service level agreement for
multithreaded processors . . . . . . . . 6:1--6:??
Wilson W. L. Fung and
Ivan Sham and
George Yuan and
Tor M. Aamodt Dynamic warp formation: Efficient MIMD
control flow on SIMD graphics hardware 7:1--7:??
Cheng-Kok Koh and
Weng-Fai Wong and
Yiran Chen and
Hai Li Tolerating process variations in large,
set-associative caches: The buddy cache 8:1--8:??
Lian Li and
Hui Feng and
Jingling Xue Compiler-directed scratchpad memory
management via graph coloring . . . . . 9:1--9:??
Amit Golander and
Shlomo Weiss Checkpoint allocation and release . . . 10:1--10:??
Weifeng Xu and
Russell Tessier Tetris-XL: a performance-driven spill
reduction technique for embedded VLIW
processors . . . . . . . . . . . . . . . 11:1--11:??
Timothy M. Jones and
Michael F. P. O'Boyle and
Jaume Abella and
Antonio González and
Oğuz Ergin Exploring the limits of early register
release: Exploiting compiler analysis 12:1--12:??
Timothy M. Jones and
Michael F. P. O'Boyle and
Jaume Abella and
Antonio González and
Oğuz Ergin Energy-efficient register caching with
compiler assistance . . . . . . . . . . 13:1--13:??
Weijia Li and
Youtao Zhang and
Jun Yang and
Jiang Zheng Towards update-conscious compilation for
energy-efficient code dissemination in
WSNs . . . . . . . . . . . . . . . . . . 14:1--14:??
Michal Wegiel and
Chandra Krintz The single-referent collector:
Optimizing compaction for the common
case . . . . . . . . . . . . . . . . . . 15:1--15:??
Samantika Subramaniam and
Gabriel H. Loh Design and optimization of the store
vectors memory dependence predictor . . 16:1--16:??
Xiaohang Wang and
Mei Yang and
Yingtao Jiang and
Peng Liu A power-aware mapping approach to map IP
cores onto NoCs under bandwidth and
latency constraints . . . . . . . . . . 1:1--1:??
Zhong-Ho Chen and
Alvin W. Y. Su A hardware/software framework for
instruction and data scratchpad memory
allocation . . . . . . . . . . . . . . . 2:1--2:??
Dong Hyuk Woo and
Joshua B. Fryman and
Allan D. Knies and
Hsien-Hsin S. Lee Chameleon: Virtualizing idle
acceleration cores of a heterogeneous
multicore processor for caching and
prefetching . . . . . . . . . . . . . . 3:1--3:??
Daniel Sanchez and
George Michelogiannakis and
Christos Kozyrakis An analysis of on-chip interconnection
networks for large-scale chip
multiprocessors . . . . . . . . . . . . 4:1--4:??
Xiuyi Zhou and
Jun Yang and
Marek Chrobak and
Youtao Zhang Performance-aware thermal management via
task scheduling . . . . . . . . . . . . 5:1--5:??
Arun Raghavan and
Colin Blundell and
Milo M. K. Martin Token tenure and PATCH: a
predictive/adaptive token-counting
hybrid . . . . . . . . . . . . . . . . . 6:1--6:??
Christian Wimmer and
Hanspeter Mössenböck Automatic feedback-directed object
fusing . . . . . . . . . . . . . . . . . 7:1--7:??
Benjamin C. Lee and
David Brooks Applied inference: Case studies in
microarchitectural design . . . . . . . 8:1--8:??
R. Rakvic and
Q. Cai and
J. González and
G. Magklis and
P. Chaparro and
A. González Thread-management techniques to maximize
efficiency in multicore and simultaneous
multithreaded microprocessors . . . . . 9:1--9:??
Derek Pao and
Wei Lin and
Bin Liu A memory-efficient pipelined
implementation of the Aho--Corasick
string-matching algorithm . . . . . . . 10:1--10:??
Xuejun Yang and
Ying Zhang and
Xicheng Lu and
Jingling Xue and
Ian Rogers and
Gen Li and
Guibin Wang and
Xudong Fang Exploiting the reuse supplied by
loop-dependent stream references for
stream processors . . . . . . . . . . . 11:1--11:??
Vijay Janapa Reddi and
Simone Campanoni and
Meeta S. Gupta and
Michael D. Smith and
Gu-Yeon Wei and
David Brooks and
Kim Hazelwood Eliminating voltage emergencies via
software-guided code transformations . . 12:1--12:??
Qin Zhao and
Ioana Cutcutache and
Weng-Fai Wong PiPA: Pipelined profiling and analysis
on multicore systems . . . . . . . . . . 13:1--13:??
Fei Guo and
Yan Solihin and
Li Zhao and
Ravishankar Iyer Quality of service shared cache
management in chip multiprocessor
architecture . . . . . . . . . . . . . . 14:1--14:??
Xiaoxia Wu and
Jian Li and
Lixin Zhang and
Evan Speight and
Ram Rajamony and
Yuan Xie Design exploration of hybrid caches with
disparate memory technologies . . . . . 15:1--15:??
Kornilios Kourtis and
Georgios Goumas and
Nectarios Koziris Exploiting compression opportunities to
improve SpMxV performance on shared
memory systems . . . . . . . . . . . . . 16:1--16:??
Betul Buyukkurt and
John Cortes and
Jason Villarreal and
Walid A. Najjar Impact of high-level transformations
within the ROCCC framework . . . . . . . 17:1--17:??
Yuan-Shin Hwang and
Tzong-Yen Lin and
Rong-Guey Chang DisIRer: Converting a retargetable
compiler into a multiplatform binary
translator . . . . . . . . . . . . . . . 18:1--18:??
Michael Boyer and
David Tarjan and
Kevin Skadron Federation: Boosting per-thread
performance of throughput-oriented
manycore architectures . . . . . . . . . 19:1--19:??
Grigori Fursin and
Olivier Temam Collective optimization: a practical
collaborative approach . . . . . . . . . 20:1--20:??
Fang Liu and
Yan Solihin Understanding the behavior and
implications of context switch misses 21:1--21:??
Stijn Eyerman and
Lieven Eeckhout Fine-grained DVFS using on-chip
regulators . . . . . . . . . . . . . . . 1:1--1:??
Chen-Yong Cher and
Eren Kursun Exploring the effects of on-chip thermal
variation on high-performance multicore
architectures . . . . . . . . . . . . . 2:1--2:??
Carole-Jean Wu and
Margaret Martonosi Adaptive timekeeping replacement:
Fine-grained capacity management for
shared CMP caches . . . . . . . . . . . 3:1--3:??
Lucas Vespa and
Ning Weng Deterministic finite automata
characterization and optimization for
scalable pattern matching . . . . . . . 4:1--4:??
Abhishek Bhattacharjee and
Gilberto Contreras and
Margaret Martonosi Parallelization libraries:
Characterizing and reducing overheads 5:1--5:??
Xiangyu Dong and
Yuan Xie and
Naveen Muralimanohar and
Norman P. Jouppi Hybrid checkpointing using emerging
nonvolatile memories for future exascale
systems . . . . . . . . . . . . . . . . 6:1--6:??
Jianjun Li and
Chenggang Wu and
Wei-Chung Hsu Efficient and effective misaligned data
access handling in a dynamic binary
translation system . . . . . . . . . . . 7:1--7:??
Guru Venkataramani and
Christopher J. Hughes and
Sanjeev Kumar and
Milos Prvulovic DeFT: Design space exploration for
on-the-fly detection of coherence misses 8:1--8:??
Jason D. Hiser and
Daniel W. Williams and
Wei Hu and
Jack W. Davidson and
Jason Mars and
Bruce R. Childers Evaluating indirect branch handling
mechanisms in software dynamic
translation systems . . . . . . . . . . 9:1--9:??
Xi E. Chen and
Tor M. Aamodt Hybrid analytical modeling of pending
cache hits, data prefetching, and MSHRs 10:1--10:??
Marios Kleanthous and
Yiannakis Sazeides CATCH: a mechanism for dynamically
detecting cache-content-duplication in
instruction caches . . . . . . . . . . . 11:1--11:??
Hans Vandierendonck and
André Seznec Managing SMT resource usage through
speculative instruction window weighting 12:1--12:??
Po-Han Wang and
Chia-Lin Yang and
Yen-Ming Chen and
Yu-Jung Cheng Power gating strategies on GPUs . . . . 13:1--13:??
Min Feng and
Chen Tian and
Changhui Lin and
Rajiv Gupta Dynamic access distance driven cache
replacement . . . . . . . . . . . . . . 14:1--14:??
Ahmad Samih and
Yan Solihin and
Anil Krishna Evaluating placement policies for
managing capacity sharing in CMP
architectures with private caches . . . 15:1--15:??
Chang-Ching Yeh and
Kuei-Chung Chang and
Tien-Fu Chen and
Chingwei Yeh Maintaining performance on power gating
of microprocessor functional units by
using a predictive pre-wakeup strategy 16:1--16:??
Hyunjin Lee and
Sangyeun Cho and
Bruce R. Childers DEFCAM: a design and evaluation
framework for defect-tolerant cache
memories . . . . . . . . . . . . . . . . 17:1--17:??
Per Stenström and
Koen De Bosschere Introduction to the special issue on
high-performance and embedded
architectures and compilers . . . . . . 18:1--18:??
Jorge Albericio and
Rubén Gran and
Pablo Ibáñez and
Víctor Viñals and
Jose María Llabería ABS: a low-cost adaptive controller for
prefetching in a banked shared
last-level cache . . . . . . . . . . . . 19:1--19:??
Ali Galip Bayrak and
Nikola Velickovic and
Paolo Ienne and
Wayne Burleson An architecture-independent instruction
shuffler to protect against side-channel
attacks . . . . . . . . . . . . . . . . 20:1--20:??
John Demme and
Simha Sethumadhavan Approximate graph clustering for program
characterization . . . . . . . . . . . . 21:1--21:??
Mihai Pricopi and
Tulika Mitra Bahurupi: a polymorphic heterogeneous
multi-core architecture . . . . . . . . 22:1--22:??
Jeroen V. Cleemput and
Bart Coppens and
Bjorn De Sutter Compiler mitigations for time attacks on
modern x86 processors . . . . . . . . . 23:1--23:??
Jason McCandless and
David Gregg Compiler techniques to improve dynamic
branch prediction for indirect jump and
call instructions . . . . . . . . . . . 24:1--24:??
Antonio García-Guirado and
Ricardo Fernández-Pascual and
Alberto Ros and
José M. García DAPSCO: Distance-aware partially shared
cache organization . . . . . . . . . . . 25:1--25:??
Zhenjiang Wang and
Chenggang Wu and
Pen-Chung Yew and
Jianjun Li and
Di Xu On-the-fly structure splitting for heap
objects . . . . . . . . . . . . . . . . 26:1--26:??
Dibyendu Das and
B. Dupont De Dinechin and
Ramakrishna Upadrasta Efficient liveness computation using
merge sets and DJ-graphs . . . . . . . . 27:1--27:??
George Patsilaras and
Niket K. Choudhary and
James Tuck Efficiently exploiting memory level
parallelism on asymmetric coupled cores
in the dark silicon era . . . . . . . . 28:1--28:??
Roman Malits and
Evgeny Bolotin and
Avinoam Kolodny and
Avi Mendelson Exploring the limits of GPGPU scheduling
in control flow bound applications . . . 29:1--29:??
Lois Orosa and
Elisardo Antelo and
Javier D. Bruguera FlexSig: Implementing flexible hardware
signatures . . . . . . . . . . . . . . . 30:1--30:??
Ruben Titos-Gil and
Manuel E. Acacio and
Jose M. Garcia and
Tim Harris and
Adrian Cristal and
Osman Unsal and
Ibrahim Hur and
Mateo Valero Hardware transactional memory with
software-defined conflicts . . . . . . . 31:1--31:??
Yongjoo Kim and
Jongeun Lee and
Toan X. Mai and
Yunheung Paek Improving performance of nested loops on
reconfigurable array processors . . . . 32:1--32:??
Madhura Purnaprajna and
Paolo Ienne Making wide-issue VLIW processors viable
on FPGAs . . . . . . . . . . . . . . . . 33:1--33:??
Petar Radojković and
Sylvain Girbal and
Arnaud Grasset and
Eduardo Quiñones and
Sami Yehia and
Francisco J. Cazorla On the evaluation of the impact of
shared resources in multithreaded COTS
processors in time-critical environments 34:1--34:??
Leonid Domnitser and
Aamer Jaleel and
Jason Loew and
Nael Abu-Ghazaleh and
Dmitry Ponomarev Non-monopolizable caches: Low-complexity
mitigation of cache side channel attacks 35:1--35:??
Alejandro Rico and
Felipe Cabarcas and
Carlos Villavieja and
Milan Pavlovic and
Augusto Vega and
Yoav Etsion and
Alex Ramirez and
Mateo Valero On the simulation of large-scale
architectures using multiple application
abstraction levels . . . . . . . . . . . 36:1--36:??
Selma Saidi and
Pranav Tendulkar and
Thierry Lepley and
Oded Maler Optimizing explicit data transfers for
data parallel applications on the Cell
architecture . . . . . . . . . . . . . . 37:1--37:??
Min Feng and
Changhui Lin and
Rajiv Gupta PLDS: Partitioning linked data
structures for parallelism . . . . . . . 38:1--38:??
Benoit Pradelle and
Alain Ketterlin and
Philippe Clauss Polyhedral parallelization of binary
code . . . . . . . . . . . . . . . . . . 39:1--39:??
Yaozu Dong and
Yu Chen and
Zhenhao Pan and
Jinquan Dai and
Yunhong Jiang ReNIC: Architectural extension to SR-IOV
I/O virtualization for efficient
replication . . . . . . . . . . . . . . 40:1--40:??
Tom M. Bruintjes and
Karel H. G. Walters and
Sabih H. Gerez and
Bert Molenkamp and
Gerard J. M. Smit Sabrewing: a lightweight architecture
for combined floating-point and integer
arithmetic . . . . . . . . . . . . . . . 41:1--41:??
Mario Kicherer and
Fabian Nowak and
Rainer Buchty and
Wolfgang Karl Seamlessly portable applications:
Managing the diversity of modern
heterogeneous systems . . . . . . . . . 42:1--42:??
Nathanael Premillieu and
Andre Seznec SYRANT: SYmmetric Resource Allocation on
Not-taken and Taken paths . . . . . . . 43:1--43:??
William Hasenplaugh and
Pritpal S. Ahuja and
Aamer Jaleel and
Simon Steely, Jr. and
Joel Emer The gradient-based cache partitioning
algorithm . . . . . . . . . . . . . . . 44:1--44:??
Javier Lira and
Timothy M. Jones and
Carlos Molina and
Antonio González The migration prefetcher: Anticipating
data promotion in dynamic NUCA caches 45:1--45:??
Kishore Kumar Pusukuri and
Rajiv Gupta and
Laxmi N. Bhuyan Thread Tranquilizer: Dynamically
reducing performance variation . . . . . 46:1--46:??
Dongsong Zhang and
Deke Guo and
Fangyuan Chen and
Fei Wu and
Tong Wu and
Ting Cao and
Shiyao Jin TL-plane-based multi-core
energy-efficient real-time scheduling
algorithm for sporadic tasks . . . . . . 47:1--47:??
Michael J. Lyons and
Mark Hempstead and
Gu-Yeon Wei and
David Brooks The accelerator store: a shared memory
framework for accelerator-based systems 48:1--48:??
Daniel Orozco and
Elkin Garcia and
Rishi Khan and
Kelly Livingston and
Guang R. Gao Toward high-throughput algorithms on
many-core architectures . . . . . . . . 49:1--49:??
Kevin Stock and
Louis-Noël Pouchet and
P. Sadayappan Using machine learning to improve
automatic vectorization . . . . . . . . 50:1--50:??
Kanit Therdsteerasukdi and
Gyungsu Byun and
Jason Cong and
M. Frank Chang and
Glenn Reinman Utilizing RF-I and intelligent
scheduling for better throughput/watt in
a mobile GPU memory system . . . . . . . 51:1--51:??
Frederick Ryckbosch and
Stijn Polfliet and
Lieven Eeckhout VSim: Simulating multi-server setups at
near native hardware speed . . . . . . . 52:1--52:??
Miao Zhou and
Yu Du and
Bruce Childers and
Rami Melhem and
Daniel Mossé Writeback-aware partitioning and
replacement for last-level caches in
phase change main memory systems . . . . 53:1--53:??
Qingping Wang and
Sameer Kulkarni and
John Cavazos and
Michael Spear A transactional memory with automatic
performance tuning . . . . . . . . . . . 54:1--54:??
Bartosz Bogdanski and
Sven-Arne Reinemo and
Frank Olaf Sem-Jacobsen and
Ernst Gunnar Gran sFtree: a fully connected and
deadlock-free switch-to-switch routing
algorithm for fat-trees . . . . . . . . 55:1--55:??
Walid J. Ghandour and
Haitham Akkary and
Wes Masri Leveraging Strength-Based Dynamic
Information Flow Analysis to Enhance
Data Value Prediction . . . . . . . . . 1:1--1:??
Jaekyu Lee and
Hyesoon Kim and
Richard Vuduc When Prefetching Works, When It Doesn't,
and Why . . . . . . . . . . . . . . . . 2:1--2:??
Bita Mazloom and
Shashidhar Mysore and
Mohit Tiwari and
Banit Agrawal and
Tim Sherwood Dataflow Tomography: Information Flow
Tracking For Understanding and
Visualizing Full Systems . . . . . . . . 3:1--3:??
Jung Ho Ahn and
Norman P. Jouppi and
Christos Kozyrakis and
Jacob Leverich and
Robert S. Schreiber Improving System Energy Efficiency with
Memory Rank Subsetting . . . . . . . . . 4:1--4:??
Xuejun Yang and
Li Wang and
Jingling Xue and
Qingbo Wu Comparability Graph Coloring for
Optimizing Utilization of
Software-Managed Stream Register Files
for Stream Processors . . . . . . . . . 5:1--5:??
Abhinandan Majumdar and
Srihari Cadambi and
Michela Becchi and
Srimat T. Chakradhar and
Hans Peter Graf A Massively Parallel, Energy Efficient
Programmable Accelerator for Learning
and Classification . . . . . . . . . . . 6:1--6:??
Stijn Eyerman and
Lieven Eeckhout Probabilistic modeling for job symbiosis
scheduling on SMT processors . . . . . . 7:1--7:??
Rachid Seghir and
Vincent Loechner and
Benoît Meister Integer affine transformations of
parametric $Z$-polytopes and
applications to loop nest optimization 8:1--8:??
Yi Yang and
Ping Xiang and
Jingfei Kong and
Mike Mantor and
Huiyang Zhou A unified optimizing compiler framework
for different GPGPU architectures . . . 9:1--9:??
Choonki Jang and
Jaejin Lee and
Bernhard Egger and
Soojung Ryu Automatic code overlay generation and
partially redundant code fetch
elimination . . . . . . . . . . . . . . 10:1--10:??
Zahra Abbasi and
Georgios Varsamopoulos and
Sandeep K. S. Gupta TACOMA: Server and workload management
in Internet data centers considering
cooling-computing power trade-off and
energy proportionality . . . . . . . . . 11:1--11:??
Andreas Lankes and
Thomas Wild and
Stefan Wallentowitz and
Andreas Herkersdorf Benefits of selective packet discard in
networks-on-chip . . . . . . . . . . . . 12:1--12:??
Yangchun Luo and
Antonia Zhai Dynamically dispatching speculative
threads to improve sequential execution 13:1--13:??
Huimin Cui and
Jingling Xue and
Lei Wang and
Yang Yang and
Xiaobing Feng and
Dongrui Fan Extendable pattern-oriented optimization
directives . . . . . . . . . . . . . . . 14:1--14:??
Adam Wade Lewis and
Nian-Feng Tzeng and
Soumik Ghosh Runtime energy consumption estimation
for server workloads based on chaotic
time-series approximation . . . . . . . 15:1--15:??
Alejandro Valero and
Julio Sahuquillo and
Salvador Petit and
Pedro López and
José Duato Combining recency of information with
selective random and a victim cache in
last-level caches . . . . . . . . . . . 16:1--16:??
Bin Li and
Li-Shiuan Peh and
Li Zhao and
Ravi Iyer Dynamic QoS management for chip
multiprocessors . . . . . . . . . . . . 17:1--17:??
Polychronis Xekalakis and
Nikolas Ioannou and
Marcelo Cintra Mixed speculative multithreaded
execution models . . . . . . . . . . . . 18:1--18:??
Mageda Sharafeddine and
Komal Jothi and
Haitham Akkary Disjoint out-of-order execution
processor . . . . . . . . . . . . . . . 19:1--19:??
Diego Andrade and
Basilio B. Fraguela and
Ramón Doallo Static analysis of the worst-case memory
performance for irregular codes with
indirections . . . . . . . . . . . . . . 20:1--20:??
Yang Chen and
Shuangde Fang and
Yuanjie Huang and
Lieven Eeckhout and
Grigori Fursin and
Olivier Temam and
Chengyong Wu Deconstructing iterative optimization 21:1--21:??
Apala Guha and
Kim Hazelwood and
Mary Lou Soffa Memory optimization of dynamic binary
translators for embedded systems . . . . 22:1--22:??
James R. Geraci and
Sharon M. Sacco A transpose-free in-place SIMD optimized
FFT . . . . . . . . . . . . . . . . . . 23:1--23:??
Bart Coppens and
Bjorn De Sutter and
Jonas Maebe Feedback-driven binary code
diversification to the special issue on
high-performance embedded architectures
and compilers . . . . . . . . . . . . . 24:1--24:??
Jeremy Fowers and
Greg Brown and
John Wernsing and
Greg Stitt A performance and energy comparison of
convolution on GPUs, FPGAs, and
multicore processors . . . . . . . . . . 25:1--25:??
Erven Rohou and
Kevin Williams and
David Yuste Vectorization technology to improve
interpreter performance . . . . . . . . 26:1--26:??
Jimmy Cleary and
Owen Callanan and
Mark Purcell and
David Gregg Fast asymmetric thread synchronization 27:1--27:??
Yong Li and
Rami Melhem and
Alex K. Jones PS-TLB: Leveraging page classification
information for fast, scalable and
efficient translation for future CMPs 28:1--28:??
Kristof Du Bois and
Stijn Eyerman and
Lieven Eeckhout Per-thread cycle accounting in multicore
processors . . . . . . . . . . . . . . . 29:1--29:??
Christian Wimmer and
Michael Haupt and
Michael L. Van De Vanter and
Mick Jordan and
Laurent Daynès and
Douglas Simon Maxine: an approachable virtual machine
for, and in, Java . . . . . . . . . . . 30:1--30:??
Malik Khan and
Protonu Basu and
Gabe Rudy and
Mary Hall and
Chun Chen and
Jacqueline Chame A script-based autotuning compiler
system to generate high-performance CUDA
code . . . . . . . . . . . . . . . . . . 31:1--31:??
Kenzo Van Craeynest and
Lieven Eeckhout Understanding fundamental design choices
in single-ISA heterogeneous multicore
architectures . . . . . . . . . . . . . 32:1--32:??
Samuel Antão and
Leonel Sousa The CRNS framework and its application
to programmable and reconfigurable
cryptography . . . . . . . . . . . . . . 33:1--33:??
Boubacar Diouf and
Can Hantas and
Albert Cohen and
Özcan Özturk and
Jens Palsberg A decoupled local memory allocator . . . 34:1--34:??
Huimin Cui and
Qing Yi and
Jingling Xue and
Xiaobing Feng Layout-oblivious compiler optimization
for matrix computations . . . . . . . . 35:1--35:??
Stephen Dolan and
Servesh Muralidharan and
David Gregg Compiler support for lightweight context
switching . . . . . . . . . . . . . . . 36:1--36:??
Pablo Abad and
Valentin Puente and
Jose-Angel Gregorio LIGERO: a light but efficient router
conceived for cache-coherent chip
multiprocessors . . . . . . . . . . . . 37:1--37:??
Jorge Albericio and
Pablo Ibáñez and
Víctor Viñals and
Jose María Llabería Exploiting reuse locality on inclusive
shared last-level caches . . . . . . . . 38:1--38:??
Paraskevas Yiapanis and
Demian Rosas-Ham and
Gavin Brown and
Mikel Luján Optimizing software runtime systems for
speculative parallelization . . . . . . 39:1--39:??
Cedric Nugteren and
Pieter Custers and
Henk Corporaal Algorithmic species: a classification of
affine loop nests for parallel
programming . . . . . . . . . . . . . . 40:1--40:??
Marco E. T. Gerards and
Jan Kuper Optimal DPM and DVFS for frame-based
real-time systems . . . . . . . . . . . 41:1--41:??
Zhichao Yan and
Hong Jiang and
Yujuan Tan and
Dan Feng An integrated pseudo-associativity and
relaxed-order approach to hardware
transactional memory . . . . . . . . . . 42:1--42:??
Doris Chen and
Deshanand Singh Profile-guided floating- to fixed-point
conversion for hybrid FPGA-processor
applications . . . . . . . . . . . . . . 43:1--43:??
Yan Cui and
Yingxin Wang and
Yu Chen and
Yuanchun Shi Lock-contention-aware scheduler: a
scalable and energy-efficient method for
addressing scalability collapse on
multicore systems . . . . . . . . . . . 44:1--44:??
Kishore Kumar Pusukuri and
Rajiv Gupta and
Laxmi N. Bhuyan ADAPT: a framework for coscheduling
multithreaded programs . . . . . . . . . 45:1--45:??
Michele Tartara and
Stefano Crespi Reghizzi Continuous learning of compiler
heuristics . . . . . . . . . . . . . . . 46:1--46:??
Grigorios Chrysos and
Panagiotis Dagritzikos and
Ioannis Papaefstathiou and
Apostolos Dollas HC-CART: a parallel system
implementation of data mining
classification and regression tree
(CART) algorithm on a multi-FPGA system 47:1--47:??
Jongwon Lee and
Yohan Ko and
Kyoungwoo Lee and
Jonghee M. Youn and
Yunheung Paek Dynamic code duplication with
vulnerability awareness for soft error
detection on VLIW architectures . . . . 48:1--48:??
Fabien Coelho and
François Irigoin API compilation for image hardware
accelerators . . . . . . . . . . . . . . 49:1--49:??
Carlos Luque and
Miquel Moreto and
Francisco J. Cazorla and
Mateo Valero Fair CPU time accounting in CMP+SMT
processors . . . . . . . . . . . . . . . 50:1--50:??
Pavlos M. Mattheakis and
Ioannis Papaefstathiou Significantly reducing MPI
intercommunication latency and power
overhead in both embedded and HPC
systems . . . . . . . . . . . . . . . . 51:1--51:??
Riyadh Baghdadi and
Albert Cohen and
Sven Verdoolaege and
Konrad Trifunović Improved loop tiling based on the
removal of spurious false dependences 52:1--52:??
Antoniu Pop and
Albert Cohen OpenStream: Expressiveness and data-flow
compilation of OpenMP streaming programs 53:1--53:??
Sven Verdoolaege and
Juan Carlos Juega and
Albert Cohen and
José Ignacio Gómez and
Christian Tenllado and
Francky Catthoor Polyhedral parallel code generation for
CUDA . . . . . . . . . . . . . . . . . . 54:1--54:??
Yu Du and
Miao Zhou and
Bruce Childers and
Rami Melhem and
Daniel Mossé Delta-compressed caching for overcoming
the write bandwidth limitation of hybrid
main memory . . . . . . . . . . . . . . 55:1--55:??
Suresh Purini and
Lakshya Jain Finding good optimization sequences
covering program space . . . . . . . . . 56:1--56:??
Mehmet E. Belviranli and
Laxmi N. Bhuyan and
Rajiv Gupta A dynamic self-scheduling scheme for
heterogeneous multiprocessor
architectures . . . . . . . . . . . . . 57:1--57:??
Anurag Negi and
Ruben Titos-Gil SCIN-cache: Fast speculative versioning
in multithreaded cores . . . . . . . . . 58:1--58:??
Thibaut Lutz and
Christian Fensch and
Murray Cole PARTANS: an autotuning framework for
stencil computation on multi-GPU systems 59:1--59:??
Chunhua Xiao and
M.-C. Frank Chang and
Jason Cong and
Michael Gill and
Zhangqin Huang and
Chunyue Liu and
Glenn Reinman and
Hao Wu Stream arbitration: Towards efficient
bandwidth utilization for emerging
on-chip interconnects . . . . . . . . . 60:1--60:??
Yunji Chen and
Tianshi Chen and
Ling Li and
Ruiyang Wu and
Daofu Liu and
Weiwu Hu Deterministic Replay Using Global Clock 1:1--1:??
Daniel Lustig and
Abhishek Bhattacharjee and
Margaret Martonosi TLB Improvements for Chip
Multiprocessors: Inter-Core Cooperative
Prefetchers and Shared Last-Level TLBs 2:1--2:??
Rong Chen and
Haibo Chen Tiled-MapReduce: Efficient and Flexible
MapReduce Processing on Multicore with
Tiling . . . . . . . . . . . . . . . . . 3:1--3:??
Michela Becchi and
Patrick Crowley A-DFA: a Time- and Space-Efficient DFA
Compression Algorithm for Fast Regular
Expression Evaluation . . . . . . . . . 4:1--4:26
Sheng Li and
Jung Ho Ahn and
Richard D. Strong and
Jay B. Brockman and
Dean M. Tullsen and
Norman P. Jouppi The McPAT Framework for Multicore and
Manycore Architectures: Simultaneously
Modeling Power, Area, and Timing . . . . 5:1--5:??
Angeliki Kritikakou and
Francky Catthoor and
George S. Athanasiou and
Vasilios Kelefouras and
Costas Goutis Near-Optimal Microprocessor and
Accelerators Codesign with Latency and
Throughput Constraints . . . . . . . . . 6:1--6:??
Lei Jiang and
Yu Du and
Bo Zhao and
Youtao Zhang and
Bruce R. Childers and
Jun Yang Hardware-Assisted Cooperative
Integration of Wear-Leveling and
Salvaging for Phase Change Memory . . . 7:1--7:??
Kyuseung Han and
Junwhan Ahn and
Kiyoung Choi Power-Efficient Predication Techniques
for Acceleration of Control Flow
Execution on CGRA . . . . . . . . . . . 8:1--8:??
Chao Wang and
Xi Li and
Junneng Zhang and
Xuehai Zhou and
Xiaoning Nie MP-Tomasulo: a Dependency-Aware
Automatic Parallel Execution Engine for
Sequential Programs . . . . . . . . . . 9:1--9:??
Anonymous TACO Reviewers 2012 . . . . . . . . . . 9:1--9:??
Eran Shifer and
Shlomo Weiss Low-latency adaptive mode transitions
and hierarchical power management in
asymmetric clustered cores . . . . . . . 10:1--10:??
Yosi Ben Asher and
Nadav Rotem Hybrid type legalization for a sparse
SIMD instruction set . . . . . . . . . . 11:1--11:??
Yuanwu Lei and
Yong Dou and
Lei Guo and
Jinbo Xu and
Jie Zhou and
Yazhuo Dong and
Hongjian Li VLIW coprocessor for IEEE-754
quadruple-precision elementary functions 12:1--12:??
Motohiro Kawahito and
Hideaki Komatsu and
Takao Moriyama and
Hiroshi Inoue and
Toshio Nakatani Idiom recognition framework using
topological embedding . . . . . . . . . 13:1--13:??
Ghassan Shobaki and
Maxim Shawabkeh and
Najm Eldeen Abu Rmaileh Preallocation instruction scheduling
with register pressure minimization
using a combinatorial optimization
approach . . . . . . . . . . . . . . . . 14:1--14:??
Dongrui She and
Yifan He and
Henk Corporaal An energy-efficient method of supporting
flexible special instructions in an
embedded processor with compact ISA . . 15:1--15:??
V. Krishna Nandivada and
Rajkishore Barik Improved bitwidth-aware variable packing 16:1--16:??
Jung Ho Ahn and
Young Hoon Son and
John Kim Scalable high-radix router
microarchitecture using a network switch
organization . . . . . . . . . . . . . . 17:1--17:??
Libo Huang and
Zhiying Wang and
Nong Xiao and
Yongwen Wang and
Qiang Dou Adaptive communication mechanism for
accelerating MPI functions in NoC-based
multicore processors . . . . . . . . . . 18:1--18:??
Avinash Malik and
David Gregg Orchestrating stream graphs using model
checking . . . . . . . . . . . . . . . . 19:1--19:??
Zheng Wang and
Michael F. P. O'Boyle Using machine learning to partition
streaming programs . . . . . . . . . . . 20:1--20:??
Ali Bakhoda and
John Kim and
Tor M. Aamodt Designing on-chip networks for
throughput accelerators . . . . . . . . 21:1--21:??
Michael R. Jantz and
Prasad A. Kulkarni Exploring single and multilevel JIT
compilation policy for modern machines 22:1--22:??
Xiangyu Dong and
Norman P. Jouppi and
Yuan Xie A circuit-architecture co-optimization
framework for exploring nonvolatile
memory hierarchies . . . . . . . . . . . 23:1--23:??
Jishen Zhao and
Guangyu Sun and
Gabriel H. Loh and
Yuan Xie Optimizing GPU energy efficiency with
3D die-stacking graphics memory and
reconfigurable memory interface . . . . 24:1--24:??
Chien-Chi Chen and
Sheng-De Wang An efficient multicharacter transition
string-matching engine based on the
Aho--Corasick algorithm . . . . . . . . 25:1--25:??
Yangchun Luo and
Wei-Chung Hsu and
Antonia Zhai The design and implementation of
heterogeneous multicore systems for
energy-efficient speculative thread
execution . . . . . . . . . . . . . . . 26:1--26:??
Dyer Rolán and
Basilio B. Fraguela and
Ramón Doallo Virtually split cache: an efficient
mechanism to distribute instructions and
data . . . . . . . . . . . . . . . . . . 27:1--27:??
Samantika Subramaniam and
Simon C. Steely and
Will Hasenplaugh and
Aamer Jaleel and
Carl Beckmann and
Tryggve Fossum and
Joel Emer Using in-flight chains to build a
scalable cache coherence protocol . . . 28:1--28:??
Daniel Sánchez and
Yiannakis Sazeides and
Juan M. Cebrián and
José M. García and
Juan L. Aragón Modeling the impact of permanent faults
in caches . . . . . . . . . . . . . . . 29:1--29:??
Sanghoon Lee and
James Tuck Automatic parallelization of
fine-grained metafunctions on a chip
multiprocessor . . . . . . . . . . . . . 30:1--30:??
Christophe Dubach and
Timothy M. Jones and
Edwin V. Bonilla Dynamic microarchitectural adaptation
using machine learning . . . . . . . . . 31:1--31:??
Long Chen and
Yanan Cao and
Zhao Zhang E$^3$CC: a memory error protection
scheme with novel address mapping for
subranked and low-power memories . . . . 32:1--32:??
Yingying Tian and
Samira M. Khan and
Daniel A. Jiménez Temporal-based multilevel correlating
inclusive cache replacement . . . . . . 33:1--33:??
Qixiao Liu and
Miquel Moreto and
Victor Jimenez and
Jaume Abella and
Francisco J. Cazorla and
Mateo Valero Hardware support for accurate per-task
energy metering in multicore systems . . 34:1--34:??
Sanyam Mehta and
Gautham Beeraka and
Pen-Chung Yew Tile size selection revisited . . . . . 35:1--35:??
Bogdan Prisacari and
German Rodriguez and
Cyriel Minkenberg and
Torsten Hoefler Fast pattern-specific routing for fat
tree networks . . . . . . . . . . . . . 36:1--36:??
Maximilien B. Breughe and
Lieven Eeckhout Selecting representative benchmark
inputs for exploring microprocessor
design spaces . . . . . . . . . . . . . 37:1--37:??
Christoph Kerschbaumer and
Eric Hennigan and
Per Larsen and
Stefan Brunthaler and
Michael Franz Information flow tracking meets
just-in-time compilation . . . . . . . . 38:1--38:??
Rupesh Nasre Time- and space-efficient flow-sensitive
points-to analysis . . . . . . . . . . . 39:1--39:??
Wenjia Ruan and
Yujie Liu and
Michael Spear Boosting timestamp-based transactional
memory by exploiting hardware cycle
counters . . . . . . . . . . . . . . . . 40:1--40:??
Tanima Dey and
Wei Wang and
Jack W. Davidson and
Mary Lou Soffa ReSense: Mapping dynamic workloads of
colocated multithreaded applications
using resource sensitivity . . . . . . . 41:1--41:??
Adrià Armejach and
Ruben Titos-Gil and
Anurag Negi and
Osman S. Unsal and
Adrián Cristal Techniques to improve performance in
requester-wins hardware transactional
memory . . . . . . . . . . . . . . . . . 42:1--42:??
Myeongjae Jeon and
Conglong Li and
Alan L. Cox and
Scott Rixner Reducing DRAM row activations with eager
read/write clustering . . . . . . . . . 43:1--43:??
Zhijia Zhao and
Michael Bebenita and
Dave Herman and
Jianhua Sun and
Xipeng Shen HPar: a practical parallel parser for
HTML --- taming HTML complexities for
parallel parsing . . . . . . . . . . . . 44:1--44:??
Ehsan Totoni and
Mert Dikmen and
María Jesús Garzarán Easy, fast, and energy-efficient object
detection on heterogeneous on-chip
architectures . . . . . . . . . . . . . 45:1--45:??
Viacheslav V. Fedorov and
Sheng Qiu and
A. L. Narasimha Reddy and
Paul V. Gratz ARI: Adaptive LLC-memory traffic
management . . . . . . . . . . . . . . . 46:1--46:??
Cecilia González-Álvarez and
Jennifer B. Sartor and
Carlos Álvarez and
Daniel Jiménez-González and
Lieven Eeckhout Accelerating an application domain with
specialized functional units . . . . . . 47:1--47:??
Xiaolin Wang and
Lingmei Weng and
Zhenlin Wang and
Yingwei Luo Revisiting memory management on
virtualized environments . . . . . . . . 48:1--48:??
Chuntao Jiang and
Zhibin Yu and
Hai Jin and
Chengzhong Xu and
Lieven Eeckhout and
Wim Heirman and
Trevor E. Carlson and
Xiaofei Liao PCantorSim: Accelerating parallel
architecture simulation through
fractal-based sampling . . . . . . . . . 49:1--49:??
Srdan Stipić and
Vesna Smiljković and
Osman Unsal and
Adrián Cristal and
Mateo Valero Profile-guided transaction
coalescing-lowering transactional
overheads by merging transactions . . . 50:1--50:??
Zhe Wang and
Shuchang Shan and
Ting Cao and
Junli Gu and
Yi Xu and
Shuai Mu and
Yuan Xie and
Daniel A. Jiménez WADE: Writeback-aware dynamic cache
management for NVM-based main memory
system . . . . . . . . . . . . . . . . . 51:1--51:??
Yong Li and
Yaojun Zhang and
Hai Li and
Yiran Chen and
Alex K. Jones C1C: a configurable, compiler-guided
STT-RAM L1 cache . . . . . . . . . . . . 52:1--52:??
Naznin Fauzia and
Venmugil Elango and
Mahesh Ravishankar and
J. Ramanujam and
Fabrice Rastello and
Atanas Rountev and
Louis-Noël Pouchet and
P. Sadayappan Beyond reuse distance analysis: Dynamic
analysis for characterization of data
locality potential . . . . . . . . . . . 53:1--53:??
Alen Bardizbanyan and
Magnus Själander and
David Whalley and
Per Larsson-Edefors Designing a practical data filter cache
to improve both energy efficiency and
performance . . . . . . . . . . . . . . 54:1--54:??
Andrei Hagiescu and
Bing Liu and
R. Ramanathan and
Sucheendra K. Palaniappan and
Zheng Cui and
Bipasa Chattopadhyay and
P. S. Thiagarajan and
Weng-Fai Wong GPU code generation for ODE-based
applications with phased shared-data
access patterns . . . . . . . . . . . . 55:1--55:??
Junghee Lee and
Chrysostomos Nicopoulos and
Hyung Gyu Lee and
Jongman Kim TornadoNoC: a lightweight and scalable
on-chip network architecture for the
many-core era . . . . . . . . . . . . . 56:1--56:??
Christos Strydis and
Robert M. Seepers and
Pedro Peris-Lopez and
Dimitrios Siskos and
Ioannis Sourdis A system architecture, processor, and
communication protocol for secure
implants . . . . . . . . . . . . . . . . 57:1--57:??
Wonsub Kim and
Yoonseo Choi and
Haewoo Park Fast modulo scheduler utilizing
patternized routes for coarse-grained
reconfigurable architectures . . . . . . 58:1--58:??
Dorit Nuzman and
Revital Eres and
Sergei Dyshel and
Marcel Zalmanovici and
Jose Castanos JIT technology with C/C++:
Feedback-directed dynamic recompilation
for statically compiled languages . . . 59:1--59:??
Thejas Ramashekar and
Uday Bondhugula Automatic data allocation and buffer
management for multi-GPU machines . . . 60:1--60:??
Hans Vandierendonck and
George Tzenakis and
Dimitrios S. Nikolopoulos Analysis of dependence tracking
algorithms for task dataflow execution 61:1--61:??
Yeonghun Jeong and
Seongseok Seo and
Jongeun Lee Evaluator-executor transformation for
efficient pipelining of loops with
conditionals . . . . . . . . . . . . . . 62:1--62:??
Rajkishore Barik and
Jisheng Zhao and
Vivek Sarkar A decoupled non-SSA global register
allocation using bipartite liveness
graphs . . . . . . . . . . . . . . . . . 63:1--63:??
Peter Gavin and
David Whalley and
Magnus Själander Reducing instruction fetch energy in
multi-issue processors . . . . . . . . . 64:1--64:??
Anonymous List of distinguished reviewers ACM TACO 65:1--65:??
Neeraj Goel and
Anshul Kumar and
Preeti Ranjan Panda Shared-port register file architecture
for low-energy VLIW processors . . . . . 1:1--1:32
Zheng Wang and
Georgios Tournavitis and
Björn Franke and
Michael F. P. O'Boyle Integrating profile-driven parallelism
detection and machine-learning-based
mapping . . . . . . . . . . . . . . . . 2:1--2:26
Mehrzad Samadi and
Amir Hormati and
Janghaeng Lee and
Scott Mahlke Leveraging GPUs using cooperative loop
speculation . . . . . . . . . . . . . . 3:1--3:26
Jue Wang and
Xiangyu Dong and
Yuan Xie and
Norman P. Jouppi Endurance-aware cache line management
for non-volatile caches . . . . . . . . 4:1--4:24
Lei Liu and
Zehan Cui and
Yong Li and
Yungang Bao and
Mingyu Chen and
Chengyong Wu BPM/BPM+: Software-based dynamic memory
partitioning mechanisms for mitigating
DRAM bank-/channel-level interferences
in multicore systems . . . . . . . . . . 5:1--5:28
Christian Häubl and
Christian Wimmer and
Hanspeter Mössenböck Trace transitioning and exception
handling in a trace-based JIT compiler
for Java . . . . . . . . . . . . . . . . 6:1--6:26
Yongbing Huang and
Licheng Chen and
Zehan Cui and
Yuan Ruan and
Yungang Bao and
Mingyu Chen and
Ninghui Sun HMTT: a hybrid hardware/software tracing
system for bridging the DRAM access
trace's semantic gap . . . . . . . . . . 7:1--7:25
Quan Chen and
Minyi Guo Adaptive workload-aware task scheduling
for single-ISA asymmetric multicore
architectures . . . . . . . . . . . . . 8:1--8:25
Gülfem Savrun-Yeniçeri and
Wei Zhang and
Huahan Zhang and
Eric Seckler and
Chen Li and
Stefan Brunthaler and
Per Larsen and
Michael Franz Efficient hosted interpreters on the JVM 9:1--9:24
Prashant J. Nair and
Chia-Chen Chou and
Moinuddin K. Qureshi Refresh pausing in DRAM memory systems 10:1--10:26
Komal Jothi and
Haitham Akkary Tuning the continual flow pipeline
architecture with virtual register
renaming . . . . . . . . . . . . . . . . 11:1--11:27
Thomas Carle and
Dumitru Potop-Butucaru Predicate-aware, makespan-preserving
software pipelining of scheduling tables 12:1--12:26
Angeliki Kritikakou and
Francky Catthoor and
Vasilios Kelefouras and
Costas Goutis A scalable and near-optimal
representation of access schemes for
memory management . . . . . . . . . . . 13:1--13:25
Hugh Leather and
Edwin Bonilla and
Michael O'Boyle Automatic feature generation for machine
learning--based optimising compilation 14:1--14:32
Theo Kluter and
Samuel Burri and
Philip Brisk and
Edoardo Charbon and
Paolo Ienne Virtual Ways: Low-Cost Coherence for
Instruction Set Extensions with
Architecturally Visible Storage . . . . 15:1--15:26
Bin Ren and
Todd Mytkowicz and
Gagan Agrawal A Portable Optimization Engine for
Accelerating Irregular Data-Traversal
Applications on SIMD Architectures . . . 16:1--16:??
Zhengwei Qi and
Jianguo Yao and
Chao Zhang and
Miao Yu and
Zhizhou Yang and
Haibing Guan VGRIS: Virtualized GPU Resource
Isolation and Scheduling in Cloud Gaming 17:1--17:25
Bor-Yeh Shen and
Wei-Chung Hsu and
Wuu Yang A Retargetable Static Binary Translator
for the ARM Architecture . . . . . . . . 18:1--18:??
Darío Suárez Gracia and
Alexandra Ferrerón and
Luis Montesano Del Campo and
Teresa Monreal Arnal and
Víctor Viñals Yúfera Revisiting LP--NUCA Energy Consumption:
Cache Access Policies and Adaptive Block
Dropping . . . . . . . . . . . . . . . . 19:1--19:??
Zhibin Liang and
Wei Zhang and
Yung-Cheng Ma Deadline-Constrained Clustered
Scheduling for VLIW Architectures using
Power-Gated Register Files . . . . . . . 20:1--20:26
Shuangde Fang and
Zidong Du and
Yuntan Fang and
Yuanjie Huang and
Yang Chen and
Lieven Eeckhout and
Olivier Temam and
Huawei Li and
Yunji Chen and
Chengyong Wu Performance Portability Across
Heterogeneous SoCs Using a Generalized
Library-Based Approach . . . . . . . . . 21:1--21:??
Abdulrahman Kaitoua and
Hazem Hajj and
Mazen A. R. Saghir and
Hassan Artail and
Haitham Akkary and
Mariette Awad and
Mageda Sharafeddine and
Khaleel Mershad Hadoop Extensions for Distributed
Computing on Reconfigurable Active SSD
Clusters . . . . . . . . . . . . . . . . 22:1--22:??
Jue Wang and
Xiangyu Dong and
Yuan Xie Preventing STT-RAM Last-Level Caches
from Port Obstruction . . . . . . . . . 23:1--23:??
M. A. Gonzalez-Mesa and
Eladio Gutierrez and
Emilio L. Zapata and
Oscar Plata Effective Transactional Memory Execution
Management for Improved Concurrency . . 24:1--24:??
Rakesh Kumar and
Alejandro Martínez and
Antonio González Efficient Power Gating of SIMD
Accelerators Through Dynamic Selective
Devectorization in an HW/SW Codesigned
Environment . . . . . . . . . . . . . . 25:1--25:??
Stefano Di Carlo and
Salvatore Galfano and
Marco Indaco and
Paolo Prinetto and
Davide Bertozzi and
Piero Olivo and
Cristian Zambelli FLARES: an Aging Aware Algorithm to
Autonomously Adapt the Error Correction
Capability in NAND Flash Memories . . . 26:1--26:??
Davide B. Bartolini and
Filippo Sironi and
Donatella Sciuto and
Marco D. Santambrogio Automated Fine-Grained CPU Provisioning
for Virtual Machines . . . . . . . . . . 27:1--27:??
Trevor E. Carlson and
Wim Heirman and
Stijn Eyerman and
Ibrahim Hur and
Lieven Eeckhout An Evaluation of High-Level Mechanistic
Core Models . . . . . . . . . . . . . . 28:1--28:??
Farrukh Hijaz and
Omer Khan NUCA-L1: a Non-Uniform Access Latency
Level-1 Cache Architecture for
Multicores Operating at Near-Threshold
Voltages . . . . . . . . . . . . . . . . 29:1--29:??
Andi Drebes and
Karine Heydemann and
Nathalie Drach and
Antoniu Pop and
Albert Cohen Topology-Aware and Dependence-Aware
Scheduling and Memory Allocation for
Task-Parallel Languages . . . . . . . . 30:1--30:??
Venkata Kalyan Tawa and
Ravi Kasha and
Madhu Mutyam EFGR: an Enhanced Fine Granularity
Refresh Feature for High-Performance
DDR4 DRAM Devices . . . . . . . . . . . 31:1--31:??
Gulay Yalcin and
Oguz Ergin and
Emrah Islek and
Osman Sabri Unsal and
Adrian Cristal Exploiting Existing Comparators for
Fine-Grained Low-Cost Error Detection 32:1--32:??
Pradeep Ramachandran and
Siva Kumar Sastry Hari and
Manlap Li and
Sarita V. Adve Hardware Fault Recovery for I/O
Intensive Applications . . . . . . . . . 33:1--33:??
Stijn Eyerman and
Pierre Michaud and
Wouter Rogiest Multiprogram Throughput Metrics: a
Systematic Approach . . . . . . . . . . 34:1--34:??
Cedric Nugteren and
Henk Corporaal Bones: an Automatic Skeleton-Based
C-to-CUDA Compiler for GPUs . . . . . . 35:1--35:??
Jue Wang and
Xiangyu Dong and
Yuan Xie Building and Optimizing MRAM-Based
Commodity Memories . . . . . . . . . . . 36:1--36:??
Rakesh Komuravelli and
Sarita V. Adve and
Ching-Tsun Chou Revisiting the Complexity of Hardware
Cache Coherence and Some Implications 37:1--37:??
Gabriel Rodríguez and
Juan Touriño and
Mahmut T. Kandemir Volatile STT--RAM Scratchpad Design and
Data Allocation for Low Energy . . . . . 38:1--38:??
Cristóbal Camarero and
Enrique Vallejo and
Ramón Beivide Topological Characterization of Hamming
and Dragonfly Networks and Its
Implications on Routing . . . . . . . . 39:1--39:??
Hanbin Yoon and
Justin Meza and
Naveen Muralimanohar and
Norman P. Jouppi and
Onur Mutlu Efficient Data Mapping and Buffering
Techniques for Multilevel Cell
Phase-Change Memories . . . . . . . . . 40:1--40:??
Nathanael Prémillieu and
André Seznec Efficient Out-of-Order Execution of
Guarded ISAs . . . . . . . . . . . . . . 41:1--41:??
Zheng Wang and
Dominik Grewe and
Michael F. P. O'Boyle Automatic and Portable Mapping of Data
Parallel Programs to OpenCL for
GPU-Based Heterogeneous Systems . . . . 42:1--42:??
Dan He and
Fang Wang and
Hong Jiang and
Dan Feng and
Jing Ning Liu and
Wei Tong and
Zheng Zhang Improving Hybrid FTL by Fully Exploiting
Internal SSD Parallelism with Virtual
Blocks . . . . . . . . . . . . . . . . . 43:1--43:??
Eri Rubin and
Ely Levy and
Amnon Barak and
Tal Ben-Nun MAPS: Optimizing Massively Parallel
Applications Using Device-Level Memory
Abstraction . . . . . . . . . . . . . . 44:1--44:??
Alessandro Cilardo and
Luca Gallo Improving Multibank Memory Access
Parallelism with Lattice-Based
Partitioning . . . . . . . . . . . . . . 45:1--45:??
Jan Kasper Martinsen and
Håkan Grahn and
Anders Isberg The Effects of Parameter Tuning in
Software Thread-Level Speculation in
JavaScript Engines . . . . . . . . . . . 46:1--46:??
Quentin Colombet and
Florian Brandner and
Alain Darte Studying Optimal Spilling in the Light
of SSA . . . . . . . . . . . . . . . . . 47:1--47:??
Jawad Haj-Yihia and
Yosi Ben Asher and
Efraim Rotem and
Ahmad Yasin and
Ran Ginosar Compiler-Directed Power Management for
Superscalars . . . . . . . . . . . . . . 48:1--48:??
Hong-Phuc Trinh and
Marc Duranton and
Michel Paindavoine Efficient Data Encoding for
Convolutional Neural Network application 49:1--49:??
Maximilien B. Breughe and
Stijn Eyerman and
Lieven Eeckhout Mechanistic Analytical Modeling of
Superscalar In-Order Processor
Performance . . . . . . . . . . . . . . 50:1--50:??
Vivek Seshadri and
Samihan Yedkar and
Hongyi Xin and
Onur Mutlu and
Phillip B. Gibbons and
Michael A. Kozuch and
Todd C. Mowry Mitigating Prefetcher-Caused Pollution
Using Informed Caching Policies for
Prefetched Blocks . . . . . . . . . . . 51:1--51:??
George Matheou and
Paraskevas Evripidou Architectural Support for Data-Driven
Execution . . . . . . . . . . . . . . . 52:1--52:??
Amir Morad and
Leonid Yavits and
Ran Ginosar GP--SIMD Processing-in-Memory . . . . . 53:1--53:??
Thomas Schaub and
Simon Moll and
Ralf Karrenberg and
Sebastian Hack The Impact of the SIMD Width on
Control-Flow and Memory Divergence . . . 54:1--54:??
Zhenman Fang and
Sanyam Mehta and
Pen-Chung Yew and
Antonia Zhai and
James Greensky and
Gautham Beeraka and
Binyu Zang Measuring Microarchitectural Details of
Multi- and Many-Core Memory Systems
through Microbenchmarking . . . . . . . 55:1--55:??
Chi Ching Chi and
Mauricio Alvarez-Mesa and
Ben Juurlink Low-Power High-Efficiency Video Decoding
using General-Purpose Processors . . . . 56:1--56:??
Fabio Luporini and
Ana Lucia Varbanescu and
Florian Rathgeber and
Gheorghe-Teodor Bercea and
J. Ramanujam and
David A. Ham and
Paul H. J. Kelly Cross-Loop Optimization of Arithmetic
Intensity for Finite Element Local
Assembly . . . . . . . . . . . . . . . . 57:1--57:??
Xing Zhou and
María J. Garzarán and
David A. Padua Optimal Parallelogram Selection for
Hierarchical Tiling . . . . . . . . . . 58:1--58:??
Leo Porter and
Michael A. Laurenzano and
Ananta Tiwari and
Adam Jundt and
William A. Ward, Jr. and
Roy Campbell and
Laura Carrington Making the Most of SMT in HPC: System-
and Application-Level Perspectives . . . 59:1--59:??
Xin Tong and
Toshihiko Koju and
Motohiro Kawahito and
Andreas Moshovos Optimizing Memory Translation Emulation
in Full System Emulators . . . . . . . . 60:1--60:??
Martin Kong and
Antoniu Pop and
Louis-Noël Pouchet and
R. Govindarajan and
Albert Cohen and
P. Sadayappan Compiler/Runtime Framework for Dynamic
Dataflow Parallelization of Tiled
Programs . . . . . . . . . . . . . . . . 61:1--61:??
Nicolas Melot and
Christoph Kessler and
Jörg Keller and
Patrick Eitschberger Fast Crown Scheduling Heuristics for
Energy-Efficient Mapping and Scaling of
Moldable Streaming Tasks on Manycore
Systems . . . . . . . . . . . . . . . . 62:1--62:??
Wenjia Ruan and
Yujie Liu and
Michael Spear Transactional Read-Modify-Write Without
Aborts . . . . . . . . . . . . . . . . . 63:1--63:??
Zia Ul Huda and
Ali Jannesari and
Felix Wolf Using Template Matching to Infer
Parallel Design Patterns . . . . . . . . 64:1--64:??
Heiner Litz and
Ricardo J. Dias and
David R. Cheriton Efficient Correction of Anomalies in
Snapshot Isolation Transactions . . . . 65:1--65:??
Helge Bahmann and
Nico Reissmann and
Magnus Jahre and
Jan Christian Meyer Perfect Reconstructability of Control
Flow from Demand Dependence Graphs . . . 66:1--66:??
Venmugil Elango and
Naser Sedaghati and
Fabrice Rastello and
Louis-Noël Pouchet and
J. Ramanujam and
Radu Teodorescu and
P. Sadayappan On Using the Roofline Model with Lower
Bounds on Data Movement . . . . . . . . 67:1--67:??
Anonymous List of Distinguished Reviewers ACM TACO
2014 . . . . . . . . . . . . . . . . . . 68:1--68:??
Christopher Zimmer and
Frank Mueller NoCMsg: a Scalable Message-Passing
Abstraction for Network-on-Chips . . . . 1:1--1:??
Beayna Grigorian and
Glenn Reinman Accelerating Divergent Applications on
SIMD Architectures Using Neural Networks 2:1--2:??
Anup Holey and
Vineeth Mekkat and
Pen-Chung Yew and
Antonia Zhai Performance-Energy Considerations for
Shared Cache Management in a
Heterogeneous Multicore Processor . . . 3:1--3:??
Jinho Suh and
Chieh-Ting Huang and
Michel Dubois Dynamic MIPS Rate Stabilization for
Complex Processors . . . . . . . . . . . 4:1--4:??
Naghmeh Karimi and
Arun Karthik Kanuparthi and
Xueyang Wang and
Ozgur Sinanoglu and
Ramesh Karri MAGIC: Malicious Aging in Circuits/Cores 5:1--5:??
Pablo De Oliveira Castro and
Chadi Akel and
Eric Petit and
Mihail Popov and
William Jalby CERE: LLVM-Based Codelet Extractor and
REplayer for Piecewise Benchmarking and
Optimization . . . . . . . . . . . . . . 6:1--6:??
Benedict R. Gaster and
Derek Hower and
Lee Howes HRF-Relaxed: Adapting HRF to the
Complexities of Industrial Heterogeneous
Memory Models . . . . . . . . . . . . . 7:1--7:??
Kevin Streit and
Johannes Doerfert and
Clemens Hammacher and
Andreas Zeller and
Sebastian Hack Generalized Task Parallelism . . . . . . 8:1--8:??
Hamed Tabkhi and
Gunar Schirner A Joint SW/HW Approach for Reducing
Register File Vulnerability . . . . . . 9:1--9:??
Arun Kanuparthi and
Ramesh Karri Reliable Integrity Checking in Multicore
Processors . . . . . . . . . . . . . . . 10:1--10:??
Do-Heon Lee and
Su-Kyung Yoon and
Jung-Geun Kim and
Charles C. Weems and
Shin-Dug Kim A New Memory-Disk Integrated System with
HW Optimizer . . . . . . . . . . . . . . 11:1--11:??
Morteza Mohajjel Kafshdooz and
Alireza Ejlali Dynamic Shared SPM Reuse for Real-Time
Multicore Embedded Systems . . . . . . . 12:1--12:??
Wenhao Jia and
Elba Garza and
Kelly A. Shaw and
Margaret Martonosi GPU Performance and Power Tuning Using
Regression Trees . . . . . . . . . . . . 13:1--13:??
Irshad Pananilath and
Aravind Acharya and
Vinay Vasista and
Uday Bondhugula An Optimizing Code Generator for a Class
of Lattice-Boltzmann Computations . . . 14:1--14:??
Shuangde Fang and
Wenwen Xu and
Yang Chen and
Lieven Eeckhout and
Olivier Temam and
Yunji Chen and
Chengyong Wu and
Xiaobing Feng Practical Iterative Optimization for the
Data Center . . . . . . . . . . . . . . 15:1--15:??
Tao Zhang and
Naifeng Jing and
Kaiming Jiang and
Wei Shu and
Min-You Wu and
Xiaoyao Liang Buddy SM: Sharing Pipeline Front-End for
Improved Energy Efficiency in GPGPUs . . 16:1--16:??
Hsiang-Yun Cheng and
Matt Poremba and
Narges Shahidi and
Ivan Stalev and
Mary Jane Irwin and
Mahmut Kandemir and
Jack Sampson and
Yuan Xie EECache: a Comprehensive Study on the
Architectural Design for
Energy-Efficient Last-Level Caches in
Chip Multiprocessors . . . . . . . . . . 17:1--17:??
Arjun Suresh and
Bharath Narasimha Swamy and
Erven Rohou and
André Seznec Intercepting Functions for Memoization:
a Case Study Using Transcendental
Functions . . . . . . . . . . . . . . . 18:1--18:??
Chung-Hsiang Lin and
De-Yu Shen and
Yi-Jung Chen and
Chia-Lin Yang and
Cheng-Yuan Michael Wang SECRET: a Selective Error Correction
Framework for Refresh Energy Reduction
in DRAMs . . . . . . . . . . . . . . . . 19:1--19:??
Doug Simon and
Christian Wimmer and
Bernhard Urban and
Gilles Duboscq and
Lukas Stadler and
Thomas Würthinger Snippets: Taking the High Road to a Low
Level . . . . . . . . . . . . . . . . . 20:1--20:??
Raghuraman Balasubramanian and
Vinay Gangadhar and
Ziliang Guo and
Chen-Han Ho and
Cherin Joseph and
Jaikrishnan Menon and
Mario Paulo Drumond and
Robin Paul and
Sharath Prasad and
Pradip Valathol and
Karthikeyan Sankaralingam Enabling GPGPU Low-Level Hardware
Explorations with MIAOW: an Open-Source
RTL Implementation of a GPGPU . . . . . 21:1--21:??
Quan Chen and
Minyi Guo Locality-Aware Work Stealing Based on
Online Profiling and Auto-Tuning for
Multisocket Multicore Architectures . . 22:1--22:??
Madan Das and
Gabriel Southern and
Jose Renau Section-Based Program Analysis to Reduce
Overhead of Detecting Unsynchronized
Thread Communication . . . . . . . . . . 23:1--23:??
Atieh Lotfi and
Abbas Rahimi and
Luca Benini and
Rajesh K. Gupta Aging-Aware Compilation for GP-GPUs . . 24:1--24:??
Brian P. Railing and
Eric R. Hein and
Thomas M. Conte Contech: Efficiently Generating Dynamic
Task Graphs for Arbitrary Parallel
Programs . . . . . . . . . . . . . . . . 25:1--25:??
Mahdad Davari and
Alberto Ros and
Erik Hagersten and
Stefanos Kaxiras The Effects of Granularity and
Adaptivity on Private/Shared
Classification for Coherence . . . . . . 26:1--26:??
Mark Gottscho and
Abbas BanaiyanMofrad and
Nikil Dutt and
Alex Nicolau and
Puneet Gupta DPCS: Dynamic Power/Capacity Scaling for
SRAM Caches in the Nanoscale Era . . . . 27:1--27:??
Pierre Michaud and
Andrea Mondelli and
André Seznec Revisiting Clustered Microarchitecture
for Future Superscalar Cores: a Case for
Wide Issue Clusters . . . . . . . . . . 28:1--28:??
Ragavendra Natarajan and
Antonia Zhai Leveraging Transactional Execution for
Memory Consistency Model Emulation . . . 29:1--29:??
Biswabandan Panda and
Shankar Balachandran CAFFEINE: a Utility-Driven Prefetcher
Aggressiveness Engine for Multicores . . 30:1--30:??
Jishen Zhao and
Sheng Li and
Jichuan Chang and
John L. Byrne and
Laura L. Ramirez and
Kevin Lim and
Yuan Xie and
Paolo Faraboschi Buri: Scaling Big-Memory Computing with
Hardware-Based Memory Expansion . . . . 31:1--31:??
Jan Lucas and
Michael Andersch and
Mauricio Alvarez-Mesa and
Ben Juurlink Spatiotemporal SIMT and Scalarization
for Improving GPU Efficiency . . . . . . 32:1--32:??
Subhasis Das and
Tor M. Aamodt and
William J. Dally Reuse Distance-Based Probabilistic Cache
Replacement . . . . . . . . . . . . . . 33:1--33:??
Etem Deniz and
Alper Sen MINIME-GPU: Multicore Benchmark
Synthesizer for GPUs . . . . . . . . . . 34:1--34:??
Li Tan and
Zizhong Chen and
Shuaiwen Leon Song Scalable Energy Efficiency with
Resilience for High Performance
Computing Systems: a Quantitative
Methodology . . . . . . . . . . . . . . 35:1--35:??
Kishore Kumar Pusukuri and
Rajiv Gupta and
Laxmi N. Bhuyan Tumbler: an Effective Load-Balancing
Technique for Multi-CPU Multicore
Systems . . . . . . . . . . . . . . . . 36:1--36:??
Erik Tomusk and
Christophe Dubach and
Michael O'Boyle Four Metrics to Evaluate Heterogeneous
Multicores . . . . . . . . . . . . . . . 37:1--37:??
Morteza Hoseinzadeh and
Mohammad Arjomand and
Hamid Sarbazi-Azad SPCM: The Striped Phase Change Memory 38:1--38:??
Chuntao Jiang and
Zhibin Yu and
Lieven Eeckhout and
Hai Jin and
Xiaofei Liao and
Chengzhong Xu Two-Level Hybrid Sampled Simulation of
Multithreaded Applications . . . . . . . 39:1--39:??
Sandeep D'souza and
Soumya J. and
Santanu Chattopadhyay Integrated Mapping and Synthesis
Techniques for Network-on-Chip
Topologies with Express Channels . . . . 40:1--40:??
Dimitrios Chasapis and
Marc Casas and
Miquel Moretó and
Raul Vidal and
Eduard Ayguadé and
Jesús Labarta and
Mateo Valero PARSECSs: Evaluating the Impact of Task
Parallelism in the PARSEC Benchmark
Suite . . . . . . . . . . . . . . . . . 41:1--41:??
Francisco Gaspar and
Luis Taniça and
Pedro Tomás and
Aleksandar Ilic and
Leonel Sousa A Framework for Application-Guided Task
Management on Heterogeneous Embedded
Systems . . . . . . . . . . . . . . . . 42:1--42:??
Ehsan K. Ardestani and
Rafael Trapani Possignolo and
Jose Luis Briz and
Jose Renau Managing Mismatches in Voltage Stacking
with CoreUnfolding . . . . . . . . . . . 43:1--43:??
Prashant J. Nair and
David A. Roberts and
Moinuddin K. Qureshi FaultSim: a Fast, Configurable
Memory-Reliability Simulator for
Conventional and $3$D-Stacked Systems 44:1--44:??
Byeongcheol Lee Adaptive Correction of Sampling Bias in
Dynamic Call Graphs . . . . . . . . . . 45:1--45:??
Andrew J. McPherson and
Vijay Nagarajan and
Susmit Sarkar and
Marcelo Cintra Fence Placement for Legacy
Data-Race-Free Programs via
Synchronization Read Detection . . . . . 46:1--46:??
Ding-Yong Hong and
Chun-Chen Hsu and
Cheng-Yi Chou and
Wei-Chung Hsu and
Pangfeng Liu and
Jan-Jan Wu Optimizing Control Transfer and Memory
Virtualization in Full System Emulators 47:1--47:??
Aravind Sukumaran-Rajam and
Philippe Clauss The Polyhedral Model of Nonlinear Loops 48:1--48:??
Prashant J. Nair and
David A. Roberts and
Moinuddin K. Qureshi Citadel: Efficiently Protecting Stacked
Memory from TSV and Large Granularity
Failures . . . . . . . . . . . . . . . . 49:1--49:??
Andrew Anderson and
Avinash Malik and
David Gregg Automatic Vectorization of Interleaved
Data Revisited . . . . . . . . . . . . . 50:1--50:??
Lihang Zhao and
Lizhong Chen and
Woojin Choi and
Jeffrey Draper A Filtering Mechanism to Reduce Network
Bandwidth Utilization of Transaction
Execution . . . . . . . . . . . . . . . 51:1--51:??
Olivier Serres and
Abdullah Kayi and
Ahmad Anbar and
Tarek El-Ghazawi Enabling PGAS Productivity with Hardware
Support for Shared Address Mapping: a
UPC Case Study . . . . . . . . . . . . . 52:1--52:??
Riccardo Cattaneo and
Giuseppe Natale and
Carlo Sicignano and
Donatella Sciuto and
Marco Domenico Santambrogio On How to Accelerate Iterative Stencil
Loops: a Scalable Streaming-Based
Approach . . . . . . . . . . . . . . . . 53:1--53:??
Unnikrishnan C and
Rupesh Nasre and
Y. N. Srikant Falcon: a Graph Manipulation Language
for Heterogeneous Systems . . . . . . . 54:1--54:??
Rajshekar Kalayappan and
Smruti R. Sarangi FluidCheck: a Redundant Threading-Based
Approach for Reliable Execution in
Manycore Processors . . . . . . . . . . 55:1--55:??
Jesse Elwell and
Ryan Riley and
Nael Abu-Ghazaleh and
Dmitry Ponomarev and
Iliano Cervesato Rethinking Memory Permissions for
Protection Against Cross-Layer Attacks 56:1--56:??
Amir Morad and
Leonid Yavits and
Shahar Kvatinsky and
Ran Ginosar Resistive GP-SIMD Processing-In-Memory 57:1--57:??
Yaohua Wang and
Dong Wang and
Shuming Chen and
Zonglin Liu and
Shenggang Chen and
Xiaowen Chen and
Xu Zhou Iteration Interleaving--Based SIMD Lane
Partition . . . . . . . . . . . . . . . 58:1--58:??
Tomi Äijö and
Pekka Jääskeläinen and
Tapio Elomaa and
Heikki Kultala and
Jarmo Takala Integer Linear Programming-Based
Scheduling for Transport Triggered
Architectures . . . . . . . . . . . . . 59:1--59:??
Qixiao Liu and
Miquel Moreto and
Jaume Abella and
Francisco J. Cazorla and
Daniel A. Jimenez and
Mateo Valero Sensible Energy Accounting with Abstract
Metering for Multicore Systems . . . . . 60:1--60:??
Miao Zhou and
Yu Du and
Bruce Childers and
Daniel Mosse and
Rami Melhem Symmetry-Agnostic Coordinated Management
of the Memory Hierarchy in Multicore
Systems . . . . . . . . . . . . . . . . 61:1--61:??
Amir Yazdanbakhsh and
Gennady Pekhimenko and
Bradley Thwaites and
Hadi Esmaeilzadeh and
Onur Mutlu and
Todd C. Mowry RFVP: Rollback-Free Value Prediction
with Safe-to-Approximate Loads . . . . . 62:1--62:??
Donghyuk Lee and
Saugata Ghose and
Gennady Pekhimenko and
Samira Khan and
Onur Mutlu Simultaneous Multi-Layer Access:
Improving $3$D-Stacked Memory Bandwidth
at Low Cost . . . . . . . . . . . . . . 63:1--63:??
Yeoul Na and
Seon Wook Kim and
Youngsun Han JavaScript Parallelizing Compiler for
Exploiting Parallelism from
Data-Parallel HTML5 Applications . . . . 64:1--64:??
Hiroyuki Usui and
Lavanya Subramanian and
Kevin Kai-Wei Chang and
Onur Mutlu DASH: Deadline-Aware High-Performance
Memory Scheduler for Heterogeneous
Systems with Hardware Accelerators . . . 65:1--65:??
Morteza Mohajjel Kafshdooz and
Mohammadkazem Taram and
Sepehr Assadi and
Alireza Ejlali A Compile-Time Optimization Method for
WCET Reduction in Real-Time Embedded
Systems through Block Formation . . . . 66:1--66:25
Konstantinos Koukos and
Alberto Ros and
Erik Hagersten and
Stefanos Kaxiras Building Heterogeneous Unified Virtual
Memories (UVMs) without the Overhead . . 1:1--1:22
Zhigang Wang and
Xiaolin Wang and
Fang Hou and
Yingwei Luo and
Zhenlin Wang Dynamic Memory Balancing for
Virtualization . . . . . . . . . . . . . 2:1--2:??
Xueyang Wang and
Sek Chai and
Michael Isnardi and
Sehoon Lim and
Ramesh Karri Hardware Performance Counter-Based
Malware Identification and Detection
with Adaptive Compressive Sensing . . . 3:1--3:??
Shoaib Akram and
Jennifer B. Sartor and
Kenzo Van Craeynest and
Wim Heirman and
Lieven Eeckhout Boosting the Priority of Garbage:
Scheduling Collection on Heterogeneous
Multicore Processors . . . . . . . . . . 4:1--4:??
Buse Yilmaz and
Baris Aktemur and
María J. Garzarán and
Sam Kamin and
Furkan Kiraç Autotuning Runtime Specialization for
Sparse Matrix-Vector Multiplication . . 5:1--5:??
Mingzhou Zhou and
Bo Wu and
Xipeng Shen and
Yaoqing Gao and
Graham Yiu Examining and Reducing the Influence of
Sampling Errors on Feedback-Driven
Optimizations . . . . . . . . . . . . . 6:1--6:??
Amanieu D'antras and
Cosmin Gorgovan and
Jim Garside and
Mikel Luján Optimizing Indirect Branches in Dynamic
Binary Translators . . . . . . . . . . . 7:1--7:??
Luiz G. A. Martins and
Ricardo Nobre and
João M. P. Cardoso and
Alexandre C. B. Delbem and
Eduardo Marques Clustering-Based Selection for the
Exploration of Compiler Optimization
Sequences . . . . . . . . . . . . . . . 8:1--8:??
Sang Wook Stephen Do and
Michel Dubois Power Efficient Hardware Transactional
Memory: Dynamic Issue of Transactions 9:1--9:??
Dmitry Evtyushkin and
Dmitry Ponomarev and
Nael Abu-Ghazaleh Understanding and Mitigating Covert
Channels Through Branch Predictors . . . 10:1--10:??
Hao Zhou and
Jingling Xue A Compiler Approach for Exploiting
Partial SIMD Parallelism . . . . . . . . 11:1--11:??
Gert-Jan Van Den Braak and
Henk Corporaal R-GPU: a Reconfigurable GPU Architecture 12:1--12:??
Peng Liu and
Jiyang Yu and
Michael C. Huang Thread-Aware Adaptive Prefetcher on
Multicore Systems: Improving the
Performance for Multithreaded Workloads 13:1--13:??
Cosmin Gorgovan and
Amanieu D'antras and
Mikel Luján MAMBO: a Low-Overhead Dynamic Binary
Modification Tool for ARM . . . . . . . 14:1--14:??
Panagiotis Theocharis and
Bjorn De Sutter A Bimodal Scheduler for Coarse-Grained
Reconfigurable Arrays . . . . . . . . . 15:1--15:??
Ahmad Anbar and
Olivier Serres and
Engin Kayraklioglu and
Abdel-Hameed A. Badawy and
Tarek El-Ghazawi Exploiting Hierarchical Locality in Deep
Parallel Architectures . . . . . . . . . 16:1--16:??
Cecilia González-Álvarez and
Jennifer B. Sartor and
Carlos Álvarez and
Daniel Jiménez-González and
Lieven Eeckhout MInGLE: an Efficient Framework for
Domain Acceleration Using Low-Power
Specialized Functional Units . . . . . . 17:1--17:??
Christian Andreetta and
Vivien Bégot and
Jost Berthold and
Martin Elsman and
Fritz Henglein and
Troels Henriksen and
Maj-Britt Nordfang and
Cosmin E. Oancea FinPar: a Parallel Financial Benchmark 18:1--18:??
Mickaël Dardaillon and
Kevin Marquet and
Tanguy Risset and
Jérôme Martin and
Henri-Pierre Charles A New Compilation Flow for
Software-Defined Radio Applications on
Heterogeneous MPSoCs . . . . . . . . . . 19:1--19:??
Jianwei Liao and
François Trahay and
Guoqiang Xiao Dynamic Process Migration Based on Block
Access Patterns Occurring in Storage
Servers . . . . . . . . . . . . . . . . 20:1--20:??
Amir Hossein Ashouri and
Giovanni Mariani and
Gianluca Palermo and
Eunjung Park and
John Cavazos and
Cristina Silvano COBAYN: Compiler Autotuning Framework
Using Bayesian Networks . . . . . . . . 21:1--21:??
Kypros Chrysanthou and
Panayiotis Englezakis and
Andreas Prodromou and
Andreas Panteli and
Chrysostomos Nicopoulos and
Yiannakis Sazeides and
Giorgos Dimitrakopoulos An Online and Real-Time Fault Detection
and Localization Mechanism for
Network-on-Chip Architectures . . . . . 22:1--22:??
Sanyam Mehta and
Pen-Chung Yew Variable Liberalization . . . . . . . . 23:1--23:??
Hsing-Min Chen and
Carole-Jean Wu and
Trevor Mudge and
Chaitali Chakrabarti RATT-ECC: Rate Adaptive Two-Tiered Error
Correction Codes for Reliable $3$D
Die-Stacked Memory . . . . . . . . . . . 24:1--24:??
Wenjie Chen and
Zhibin Wang and
Qin Wu and
Jiuzhen Liang and
Zhilei Chai Implementing Dense Optical Flow
Computation on a Heterogeneous FPGA SoC
in C . . . . . . . . . . . . . . . . . . 25:1--25:??
Nilay Vaish and
Michael C. Ferris and
David A. Wood Optimization Models for Three On-Chip
Network Problems . . . . . . . . . . . . 26:1--26:??
Somayeh Sardashti and
Andre Seznec and
David A. Wood Yet Another Compressed Cache: a Low-Cost
Yet Effective Compressed Cache . . . . . 27:1--27:??
Eduardo H. M. Cruz and
Matthias Diener and
Laércio L. Pilla and
Philippe O. A. Navaux Hardware-Assisted Thread and Data
Mapping in Hierarchical Multicore
Architectures . . . . . . . . . . . . . 28:1--28:??
Almutaz Adileh and
Stijn Eyerman and
Aamer Jaleel and
Lieven Eeckhout Maximizing Heterogeneous Processor
Performance Under Power Constraints . . 29:1--29:??
Bagus Wibowo and
Abhinav Agrawal and
Thomas Stanton and
James Tuck An Accurate Cross-Layer Approach for
Online Architectural Vulnerability
Estimation . . . . . . . . . . . . . . . 30:1--30:??
Manuel Acacio List of Distinguished Reviewers ACM TACO
2014 . . . . . . . . . . . . . . . . . . 31:1--31:??
Keval Vora and
Rajiv Gupta and
Guoqing Xu Synergistic Analysis of Evolving Graphs 32:1--32:??
Yunquan Zhang and
Shigang Li and
Shengen Yan and
Huiyang Zhou A Cross-Platform SpMV Framework on
Many-Core Architectures . . . . . . . . 33:1--33:??
Junwhan Ahn and
Sungjoo Yoo and
Kiyoung Choi AIM: Energy-Efficient Aggregation Inside
the Memory Hierarchy . . . . . . . . . . 34:1--34:??
Amir Kavyan Ziabari and
Yifan Sun and
Yenai Ma and
Dana Schaa and
José L. Abellán and
Rafael Ubal and
John Kim and
Ajay Joshi and
David Kaeli UMH: a Hardware-Based Unified Memory
Hierarchy for Systems with Multiple
Discrete GPUs . . . . . . . . . . . . . 35:1--35:??
Tom Spink and
Harry Wagstaff and
Björn Franke Hardware-Accelerated Cross-Architecture
Full-System Virtualization . . . . . . . 36:1--36:??
Qingchuan Shi and
George Kurian and
Farrukh Hijaz and
Srinivas Devadas and
Omer Khan LDAC: Locality-Aware Data Access Control
for Large-Scale Multicore Cache
Hierarchies . . . . . . . . . . . . . . 37:1--37:??
Fernando Fernandes and
Lucas Weigel and
Claudio Jung and
Philippe Navaux and
Luigi Carro and
Paolo Rech Evaluation of Histogram of Oriented
Gradients Soft Errors Criticality for
Automotive Applications . . . . . . . . 38:1--38:??
Saumay Dublish and
Vijay Nagarajan and
Nigel Topham Cooperative Caching for GPUs . . . . . . 39:1--39:??
Nikolaos Tampouratzis and
Pavlos M. Mattheakis and
Ioannis Papaefstathiou Accelerating Intercommunication in
Highly Parallel Systems . . . . . . . . 40:1--40:??
Hyukwoo Park and
Myungsu Cha and
Soo-Mook Moon Concurrent JavaScript Parsing for Faster
Loading of Web Apps . . . . . . . . . . 41:1--41:??
Dongliang Xiong and
Kai Huang and
Xiaowen Jiang and
Xiaolang Yan Memory Access Scheduling Based on
Dynamic Multilevel Priority in Shared
DRAM Systems . . . . . . . . . . . . . . 42:1--42:??
Daniele De Sensi and
Massimo Torquati and
Marco Danelutto A Reconfiguration Algorithm for
Power-Aware Parallel Applications . . . 43:1--43:??
Michael R. Jantz and
Forrest J. Robinson and
Prasad A. Kulkarni Impact of Intrinsic Profiling
Limitations on Effectiveness of Adaptive
Optimizations . . . . . . . . . . . . . 44:1--44:??
Marvin Damschen and
Lars Bauer and
Jörg Henkel Extending the WCET Problem to Optimize
for Runtime-Reconfigurable Processors 45:1--45:??
Zheng Li and
Fang Wang and
Dan Feng and
Yu Hua and
Jingning Liu and
Wei Tong MaxPB: Accelerating PCM Write by
Maximizing the Power Budget Utilization 46:1--46:??
Saurav Muralidharan and
Michael Garland and
Albert Sidelnik and
Mary Hall Designing a Tunable Nested Data-Parallel
Programming System . . . . . . . . . . . 47:1--47:??
Ismail Akturk and
Riad Akram and
Mohammad Majharul Islam and
Abdullah Muzahid and
Ulya R. Karpuzcu Accuracy Bugs: a New Class of
Concurrency Bugs to Exploit Algorithmic
Noise Tolerance . . . . . . . . . . . . 48:1--48:??
Erik Tomusk and
Christophe Dubach and
Michael O'Boyle Selecting Heterogeneous Cores for
Diversity . . . . . . . . . . . . . . . 49:1--49:??
Pierre Michaud Some Mathematical Facts About Optimal
Cache Replacement . . . . . . . . . . . 50:1--50:??
Wenlei Bao and
Changwan Hong and
Sudheer Chunduri and
Sriram Krishnamoorthy and
Louis-Noël Pouchet and
Fabrice Rastello and
P. Sadayappan Static and Dynamic Frequency Scaling on
Multicore CPUs . . . . . . . . . . . . . 51:1--51:??
Tiago M. Vale and
João A. Silva and
Ricardo J. Dias and
João M. Lourenço Pot: Deterministic Transactional
Execution . . . . . . . . . . . . . . . 52:1--52:??
Zhonghai Lu and
Yuan Yao Aggregate Flow-Based Performance
Fairness in CMPs . . . . . . . . . . . . 53:1--53:??
Yigit Demir and
Nikos Hardavellas Energy-Proportional Photonic
Interconnects . . . . . . . . . . . . . 54:1--54:??
Mehmet Can Kurt and
Sriram Krishnamoorthy and
Gagan Agrawal and
Bin Ren User-Assisted Store Recycling for
Dynamic Task Graph Schedulers . . . . . 55:1--55:??
Jawad Haj-Yihia and
Ahmad Yasin and
Yosi Ben Asher and
Avi Mendelson Fine-Grain Power Breakdown of Modern
Out-of-Order Cores and Its Implications
on Skylake-Based Systems . . . . . . . . 56:1--56:??
Alberto Scolari and
Davide Basilio Bartolini and
Marco Domenico Santambrogio A Software Cache Partitioning System for
Hash-Based Caches . . . . . . . . . . . 57:1--57:??
Lev Mukhanov and
Pavlos Petoumenos and
Zheng Wang and
Nikos Parasyris and
Dimitrios S. Nikolopoulos and
Bronis R. De Supinski and
Hugh Leather ALEA: a Fine-Grained Energy Profiling
Tool . . . . . . . . . . . . . . . . . . 1:1--1:??
Anuj Pathania and
Vanchinathan Venkataramani and
Muhammad Shafique and
Tulika Mitra and
Jörg Henkel Defragmentation of Tasks in Many-Core
Architecture . . . . . . . . . . . . . . 2:1--2:??
Darko Zivanovic and
Milan Pavlovic and
Milan Radulovic and
Hyunsung Shin and
Jongpil Son and
Sally A. McKee and
Paul M. Carpenter and
Petar Radojković and
Eduard Ayguadé Main Memory in HPC: Do We Need More or
Could We Live with Less? . . . . . . . . 3:1--3:??
Wenguang Zheng and
Hui Wu and
Qing Yang WCET-Aware Dynamic I-Cache Locking for a
Single Task . . . . . . . . . . . . . . 4:1--4:??
Byung-Sun Yang and
Jae-Yun Kim and
Soo-Mook Moon Exceptionization: a Java VM Optimization
for Non-Java Languages . . . . . . . . . 5:1--5:??
Rathijit Sen and
David A. Wood Pareto Governors for Energy-Optimal
Computing . . . . . . . . . . . . . . . 6:1--6:??
Mainak Chaudhuri and
Mukesh Agrawal and
Jayesh Gaur and
Sreenivas Subramoney Micro-Sector Cache: Improving Space
Utilization in Sectored DRAM Caches . . 7:1--7:??
Kyriakos Georgiou and
Steve Kerrison and
Zbigniew Chamski and
Kerstin Eder Energy Transparency for Deeply Embedded
Programs . . . . . . . . . . . . . . . . 8:1--8:??
Pengcheng Li and
Xiaoyu Hu and
Dong Chen and
Jacob Brock and
Hao Luo and
Eddy Z. Zhang and
Chen Ding LD: Low-Overhead GPU Race Detection
Without Access Monitoring . . . . . . . 9:1--9:??
Poovaiah M. Palangappa and
Kartik Mohanram CompEx++: Compression-Expansion Coding
for Energy, Latency, and Lifetime
Improvements in MLC/TLC NVMs . . . . . . 10:1--10:??
Dongwoo Lee and
Sangheon Lee and
Soojung Ryu and
Kiyoung Choi Dirty-Block Tracking in a Direct-Mapped
DRAM Cache with Self-Balancing Dispatch 11:1--11:??
Konstantinos Parasyris and
Vassilis Vassiliadis and
Christos D. Antonopoulos and
Spyros Lalis and
Nikolaos Bellas Significance-Aware Program Execution on
Unreliable Hardware . . . . . . . . . . 12:1--12:??
Gleison Mendonça and
Breno Guimarães and
Péricles Alves and
Márcio Pereira and
Guido Araújo and
Fernando Magno Quintão Pereira DawnCC: Automatic Annotation for Data
Parallelism and Offloading . . . . . . . 13:1--13:??
Rajeev Balasubramonian and
Andrew B. Kahng and
Naveen Muralimanohar and
Ali Shafiee and
Vaishnav Srinivas CACTI 7: New Tools for Interconnect
Exploration in Innovative Off-Chip
Memories . . . . . . . . . . . . . . . . 14:1--14:??
Vishwesh Jatala and
Jayvant Anantpur and
Amey Karkare Scratchpad Sharing in GPUs . . . . . . . 15:1--15:??
Tae Jun Ham and
Juan L. Aragón and
Margaret Martonosi Decoupling Data Supply from Computation
for Latency-Tolerant Communication in
Heterogeneous Architectures . . . . . . 16:1--16:??
Milan Stanic and
Oscar Palomar and
Timothy Hayes and
Ivan Ratkovic and
Adrian Cristal and
Osman Unsal and
Mateo Valero An Integrated Vector-Scalar Design on an
In-Order ARM Core . . . . . . . . . . . 17:1--17:??
Fernando A. Endo and
Arthur Perais and
André Seznec On the Interactions Between Value
Prediction and Compiler Optimizations in
the Context of EOLE . . . . . . . . . . 18:1--18:??
Aswinkumar Sridharan and
Biswabandan Panda and
Andre Seznec Band-Pass Prefetching: an Effective
Prefetch Management Mechanism Using
Prefetch-Fraction Metric in Multi-Core
Systems . . . . . . . . . . . . . . . . 19:1--19:??
Andrés Goens and
Sergio Siccha and
Jeronimo Castrillon Symmetry in Software Synthesis . . . . . 20:1--20:??
Sander Vocke and
Henk Corporaal and
Roel Jordans and
Rosilde Corvino and
Rick Nas Extending Halide to Improve Software
Development for Imaging DSPs . . . . . . 21:1--21:??
Nicklas Bo Jensen and
Sven Karlsson Improving Loop Dependence Analysis . . . 22:1--22:??
Stefan Ganser and
Armin Grösslinger and
Norbert Siegmund and
Sven Apel and
Christian Lengauer Iterative Schedule Optimization for
Parallelization in the Polyhedron Model 23:1--23:??
Wei Wei and
Dejun Jiang and
Jin Xiong and
Mingyu Chen HAP: Hybrid-Memory-Aware Partition in
Shared Last-Level Cache . . . . . . . . 24:1--24:??
Dongliang Xiong and
Kai Huang and
Xiaowen Jiang and
Xiaolang Yan Providing Predictable Performance via a
Slowdown Estimation Model . . . . . . . 25:1--25:??
Jing Pu and
Steven Bell and
Xuan Yang and
Jeff Setter and
Stephen Richardson and
Jonathan Ragan-Kelley and
Mark Horowitz Programming Heterogeneous Systems from
an Image Processing DSL . . . . . . . . 26:1--26:??
Ayman Hroub and
M. E. S. Elrabaa and
M. F. Mudawar and
A. Khayyat Efficient Generation of Compact
Execution Traces for Multicore
Architectural Simulations . . . . . . . 27:1--27:??
Nicolas Weber and
Michael Goesele MATOG: Array Layout Auto-Tuning for CUDA 28:1--28:??
Amir H. Ashouri and
Andrea Bignoli and
Gianluca Palermo and
Cristina Silvano and
Sameer Kulkarni and
John Cavazos MiCOMP: Mitigating the Compiler
Phase-Ordering Problem Using
Optimization Sub-Sequences and Machine
Learning . . . . . . . . . . . . . . . . 29:1--29:??
Erik Vermij and
Leandro Fiorin and
Rik Jongerius and
Christoph Hagleitner and
Jan Van Lunteren and
Koen Bertels An Architecture for Integrated Near-Data
Processors . . . . . . . . . . . . . . . 30:1--30:??
Andreas Diavastos and
Pedro Trancoso SWITCHES: a Lightweight Runtime for
Dataflow Execution of Tasks on
Many-Cores . . . . . . . . . . . . . . . 31:1--31:??
Rahul Jain and
Preeti Ranjan Panda and
Sreenivas Subramoney Cooperative Multi-Agent Reinforcement
Learning-Based Co-optimization of Cores,
Caches, and On-chip Network . . . . . . 32:1--32:??
Daniele De Sensi and
Tiziano De Matteis and
Massimo Torquati and
Gabriele Mencagli and
Marco Danelutto Bringing Parallel Patterns Out of the
Corner: The P$^3$ARSEC Benchmark Suite 33:1--33:??
Chencheng Ye and
Chen Ding and
Hao Luo and
Jacob Brock and
Dong Chen and
Hai Jin Cache Exclusivity and Sharing: Theory
and Optimization . . . . . . . . . . . . 34:1--34:??
Rahul Shrivastava and
V. Krishna Nandivada Energy-Efficient Compilation of
Irregular Task-Parallel Loops . . . . . 35:1--35:??
Julien Proy and
Karine Heydemann and
Alexandre Berzati and
Albert Cohen Compiler-Assisted Loop Hardening Against
Fault Attacks . . . . . . . . . . . . . 36:1--36:??
Christina Peterson and
Damian Dechev A Transactional Correctness Tool for
Abstract Data Types . . . . . . . . . . 37:1--37:??
Matteo Ferroni and
Andrea Corna and
Andrea Damiani and
Rolando Brondolin and
Juan A. Colmenares and
Steven Hofmeyr and
John D. Kubiatowicz and
Marco D. Santambrogio Power Consumption Models for
Multi-Tenant Server Infrastructures . . 38:1--38:??
Milad Mohammadi and
Tor M. Aamodt and
William J. Dally CG-OoO: Energy-Efficient Coarse-Grain
Out-of-Order Execution Near In-Order
Energy with Near Out-of-Order
Performance . . . . . . . . . . . . . . 39:1--39:??
Shivam Swami and
Poovaiah M. Palangappa and
Kartik Mohanram ECS: Error-Correcting Strings for
Lifetime Improvements in Nonvolatile
Memories . . . . . . . . . . . . . . . . 40:1--40:??
M. Waqar Azhar and
Per Stenström and
Vassilis Papaefstathiou SLOOP: QoS-Supervised Loop Execution to
Reduce Energy on Heterogeneous
Architectures . . . . . . . . . . . . . 41:1--41:??
Raghavendra Kanakagiri and
Biswabandan Panda and
Madhu Mutyam MBZip: Multiblock Data Compression . . . 42:1--42:??
Richard Neill and
Andi Drebes and
Antoniu Pop Fuse: Accurate Multiplexing of Hardware
Performance Counters Across Executions 43:1--43:??
Somayeh Sardashti and
David A. Wood Could Compression Be of General Use?
Evaluating Memory Compression across
Domains . . . . . . . . . . . . . . . . 44:1--44:??
Libo Huang and
Yashuai Lü and
Li Shen and
Zhiying Wang Improving the Efficiency of GPGPU
Work-Queue Through Data Awareness . . . 45:1--45:??
Alexandra Angerd and
Erik Sintorn and
Per Stenström A Framework for Automated and Controlled
Floating-Point Accuracy Reduction in
Graphics Applications on GPUs . . . . . 46:1--46:??
Jaime Arteaga and
Stéphane Zuckerman and
Guang R. Gao Generating Fine-Grain Multithreaded
Applications Using a Multigrain Approach 47:1--47:??
Ramyad Hadidi and
Lifeng Nai and
Hyojong Kim and
Hyesoon Kim CAIRO: a Compiler-Assisted Technique for
Enabling Instruction-Level Offloading of
Processing-In-Memory . . . . . . . . . . 48:1--48:??
Hongyeol Lim and
Giho Park Triple Engine Processor (TEP): a
Heterogeneous Near-Memory Processor for
Diverse Kernel Operations . . . . . . . 49:1--49:??
George Patsilaras and
James Tuck ReDirect: Reconfigurable Directories for
Multicore Architectures . . . . . . . . 50:1--50:??
Adarsh Patil and
Ramaswamy Govindarajan HAShCache: Heterogeneity-Aware Shared
DRAMCache for Integrated Heterogeneous
Systems . . . . . . . . . . . . . . . . 51:1--51:??
Christophe Alias and
Alexandru Plesco Optimizing Affine Control With Semantic
Factorizations . . . . . . . . . . . . . 52:1--52:??
George Matheou and
Paraskevas Evripidou Data-Driven Concurrency for High
Performance Computing . . . . . . . . . 53:1--53:??
Giorgis Georgakoudis and
Hans Vandierendonck and
Peter Thoman and
Bronis R. De Supinski and
Thomas Fahringer and
Dimitrios S. Nikolopoulos SCALO: Scalability-Aware Parallelism
Orchestration for Multi-Threaded
Workloads . . . . . . . . . . . . . . . 54:1--54:??
Toufik Baroudi and
Rachid Seghir and
Vincent Loechner Optimization of Triangular and Banded
Matrix Operations Using $2$d-Packed
Layouts . . . . . . . . . . . . . . . . 55:1--55:??
Hochan Lee and
Mansureh S. Moghaddam and
Dongkwan Suh and
Bernhard Egger Improving Energy Efficiency of
Coarse-Grain Reconfigurable Arrays
Through Modulo Schedule
Compression/Decompression . . . . . . . 1:1--1:??
Karthik Sangaiah and
Michael Lui and
Radhika Jagtap and
Stephan Diestelhorst and
Siddharth Nilakantan and
Ankit More and
Baris Taskin and
Mark Hempstead SynchroTrace: Synchronization-Aware
Architecture-Agnostic Traces for
Lightweight Multicore Simulation of CMP
and HPC Workloads . . . . . . . . . . . 2:1--2:??
Long Zheng and
Xiaofei Liao and
Hai Jin Efficient and Scalable Graph Parallel
Processing With Symbolic Execution . . . 3:1--3:??
Jae-Eon Jo and
Gyu-Hyeon Lee and
Hanhwi Jang and
Jaewon Lee and
Mohammadamin Ajdari and
Jangwoo Kim DiagSim: Systematically Diagnosing
Simulators for Healthy Simulations . . . 4:1--4:??
Sushant Kondguli and
Michael Huang A Case for a More Effective,
Power-Efficient Turbo Boosting . . . . . 5:1--5:??
Kuan-Chung Chen and
Chung-Ho Chen Enabling SIMT Execution Model on
Homogeneous Multi-Core System . . . . . 6:1--6:??
Mingzhe Zhang and
King Tin Lam and
Xin Yao and
Cho-Li Wang SIMPO: a Scalable In-Memory Persistent
Object Framework Using NVRAM for
Reliable Big Data Computing . . . . . . 7:1--7:??
Bobin Deng and
Sriseshan Srikanth and
Eric R. Hein and
Thomas M. Conte and
Erik DeBenedictis and
Jeanine Cook and
Michael P. Frank Extending Moore's Law via
Computationally Error-Tolerant Computing 8:1--8:??
Dave Dice and
Maurice Herlihy and
Alex Kogan Improving Parallelism in Hardware
Transactional Memory . . . . . . . . . . 9:1--9:??
Namhyung Kim and
Junwhan Ahn and
Kiyoung Choi and
Daniel Sanchez and
Donghoon Yoo and
Soojung Ryu Benzene: an Energy-Efficient Distributed
Hybrid Cache Architecture for Manycore
Systems . . . . . . . . . . . . . . . . 10:1--10:??
Yulong Ao and
Chao Yang and
Fangfang Liu and
Wanwang Yin and
Lijuan Jiang and
Qiao Sun Performance Optimization of the HPCG
Benchmark on the Sunway TaihuLight
Supercomputer . . . . . . . . . . . . . 11:1--11:??
Saeed Rashidi and
Majid Jalili and
Hamid Sarbazi-Azad Improving MLC PCM Performance through
Relaxed Write and Read for Intermediate
Resistance Levels . . . . . . . . . . . 12:1--12:??
Wenlai Zhao and
Haohuan Fu and
Jiarui Fang and
Weijie Zheng and
Lin Gan and
Guangwen Yang Optimizing Convolutional Neural Networks
on the Sunway TaihuLight Supercomputer 13:1--13:??
Dimitrios Mbakoyiannis and
Othon Tomoutzoglou and
George Kornaros Energy-Performance Considerations for
Data Offloading to FPGA-Based
Accelerators Over PCIe . . . . . . . . . 14:1--14:??
Zhen Lin and
Michael Mantor and
Huiyang Zhou GPU Performance vs. Thread-Level
Parallelism: Scalability Analysis and a
Novel Way to Improve TLP . . . . . . . . 15:1--15:??
Oleksandr Zinenko and
Stéphane Huot and
Cédric Bastoul Visual Program Manipulation in the
Polyhedral Model . . . . . . . . . . . . 16:1--16:??
Mustafa M. Shihab and
Jie Zhang and
Myoungsoo Jung and
Mahmut Kandemir ReveNAND: a Fast-Drift-Aware Resilient
$3$D NAND Flash Design . . . . . . . . . 17:1--17:??
Seyed Majid Zahedi and
Songchun Fan and
Benjamin C. Lee Managing Heterogeneous Datacenters with
Tokens . . . . . . . . . . . . . . . . . 18:1--18:??
Miquel Pericàs Elastic Places: an Adaptive Resource
Manager for Scalable and Portable
Performance . . . . . . . . . . . . . . 19:1--19:??
Matthew Benjamin Olson and
Joseph T. Teague and
Divyani Rao and
Michael R. Jantz and
Kshitij A. Doshi and
Prasad A. Kulkarni Cross-Layer Memory Management to Improve
DRAM Energy Efficiency . . . . . . . . . 20:1--20:??
Davide Zoni and
Luca Colombo and
William Fornaciari DarkCache: Energy-Performance
Optimization of Tiled Multi-Cores by
Adaptively Power-Gating LLC Banks . . . 21:1--21:??
Yang Zhang and
Dan Feng and
Wei Tong and
Yu Hua and
Jingning Liu and
Zhipeng Tan and
Chengning Wang and
Bing Wu and
Zheng Li and
Gaoxiang Xu CACF: a Novel Circuit Architecture
Co-optimization Framework for Improving
Performance, Reliability and Energy of
ReRAM-based Main Memory System . . . . . 22:1--22:??
Nicolai Stawinoga and
Tony Field Predictable Thread Coarsening . . . . . 23:1--23:??
Probir Roy and
Shuaiwen Leon Song and
Sriram Krishnamoorthy and
Abhinav Vishnu and
Dipanjan Sengupta and
Xu Liu NUMA-Caffe: NUMA-Aware Deep Learning
Neural Networks . . . . . . . . . . . . 24:1--24:??
Ahsen Ejaz and
Vassilios Papaefstathiou and
Ioannis Sourdis DDRNoC: Dual Data-Rate Network-on-Chip 25:1--25:??
Ying Cai and
Yulong Ao and
Chao Yang and
Wenjing Ma and
Haitao Zhao Extreme-Scale High-Order WENO
Simulations of $3$-D Detonation Wave
with 10 Million Cores . . . . . . . . . 26:1--26:??
Yannis Sfakianakis and
Christos Kozanitis and
Christos Kozyrakis and
Angelos Bilas QuMan: Profile-based Improvement of
Cluster Utilization . . . . . . . . . . 27:1--27:??
Engin Kayraklioglu and
Michael P. Ferguson and
Tarek El-Ghazawi LAPPS: Locality-Aware Productive
Prefetching Support for PGAS . . . . . . 28:1--28:??
Akrem Benatia and
Weixing Ji and
Yizhuo Wang and
Feng Shi BestSF: a Sparse Meta-Format for
Optimizing SpMV on GPU . . . . . . . . . 29:1--29:??
Pierre Michaud An Alternative TAGE-like Conditional
Branch Predictor . . . . . . . . . . . . 30:1--30:??
James Garland and
David Gregg Low Complexity Multiply-Accumulate Units
for Convolutional Neural Networks with
Weight-Sharing . . . . . . . . . . . . . 31:1--31:??
Hyojong Kim and
Ramyad Hadidi and
Lifeng Nai and
Hyesoon Kim and
Nuwan Jayasena and
Yasuko Eckert and
Onur Kayiran and
Gabriel Loh CODA: Enabling Co-location of
Computation and Data for Multiple GPU
Systems . . . . . . . . . . . . . . . . 32:1--32:??
Madhavan Manivannan and
Miquel Pericàs and
Vassilis Papaefstathiou and
Per Stenström Global Dead-Block Management for
Task-Parallel Programs . . . . . . . . . 33:1--33:??
Roman Gareev and
Tobias Grosser and
Michael Kruse High-Performance Generalized Tensor
Operations: a Compiler-Oriented Approach 34:1--34:??
Hervé Yviquel and
Lauro Cruz and
Guido Araujo Cluster Programming using the OpenMP
Accelerator Model . . . . . . . . . . . 35:1--35:??
Mohammad Khavari Tavana and
Amir Kavyan Ziabari and
David Kaeli Block Cooperation: Advancing Lifetime of
Resistive Memories by Increasing
Utilization of Error Correcting Codes 36:1--36:??
Hai Jin and
Bo Liu and
Wenbin Jiang and
Yang Ma and
Xuanhua Shi and
Bingsheng He and
Shaofeng Zhao Layer-Centric Memory Reuse and Data
Migration for Extreme-Scale Deep
Learning on Many-Core Architectures . . 37:1--37:??
Dani Voitsechov and
Arslan Zulfiqar and
Mark Stephenson and
Mark Gebhart and
Stephen W. Keckler Software-Directed Techniques for
Improved GPU Register File Utilization 38:1--38:??
Huanxin Lin and
Cho-Li Wang and
Hongyuan Liu On-GPU Thread-Data Remapping for Branch
Divergence Reduction . . . . . . . . . . 39:1--39:??
Stefan Kronawitter and
Christian Lengauer Polyhedral Search Space Exploration in
the ExaStencils Code Generator . . . . . 40:1--40:??
Jingheng Xu and
Haohuan Fu and
Wen Shi and
Lin Gan and
Yuxuan Li and
Wayne Luk and
Guangwen Yang Performance Tuning and Analysis for
Stencil-Based Applications on POWER8
Processor . . . . . . . . . . . . . . . 41:1--41:??
Jiajun Wang and
Reena Panda and
Lizy K. John SelSMaP: a Selective Stride Masking
Prefetching Scheme . . . . . . . . . . . 42:1--42:??
Xing Su and
Xiangke Liao and
Hao Jiang and
Canqun Yang and
Jingling Xue SCP: Shared Cache Partitioning for
High-Performance GEMM . . . . . . . . . 43:1--43:??
Fernando Magno Quintão Pereira and
Guilherme Vieira Leobas and
Abdoulaye Gamatié Static Prediction of Silent Stores . . . 44:1--44:??
Neal C. Crago and
Mark Stephenson and
Stephen W. Keckler Exposing Memory Access Patterns to
Improve Instruction and Memory
Efficiency in GPUs . . . . . . . . . . . 45:1--45:??
Feng Zhang and
Jingling Xue Poker: Permutation-Based SIMD Execution
of Intensive Tree Search by Path
Encoding . . . . . . . . . . . . . . . . 46:1--46:??
Nicolas Belleville and
Damien Couroussé and
Karine Heydemann and
Henri-Pierre Charles Automated Software Protection for the
Masses Against Side-Channel Attacks . . 47:1--47:??
Chao Yu and
Yuebin Bai and
Qingxiao Sun and
Hailong Yang Improving Thread-level Parallelism in
GPUs Through Expanding Register File to
Scratchpad Memory . . . . . . . . . . . 48:1--48:??
Lois Orosa and
Rodolfo Azevedo and
Onur Mutlu AVPP: Address-first Value-next Predictor
with Value Prefetching for Improving the
Efficiency of Load Value Prediction . . 49:1--49:??
Jun Zhang and
Rui Hou and
Wei Song and
Sally A. McKee and
Zhen Jia and
Chen Zheng and
Mingyu Chen and
Lixin Zhang and
Dan Meng RAGuard: an Efficient and
User-Transparent Hardware Mechanism
against ROP Attacks . . . . . . . . . . 50:1--50:??
Ping Wang and
Luke McHale and
Paul V. Gratz and
Alex Sprintson GenMatcher: a Generic Clustering-Based
Arbitrary Matching Framework . . . . . . 51:1--51:??
Ding-Yong Hong and
Jan-Jan Wu and
Yu-Ping Liu and
Sheng-Yu Fu and
Wei-Chung Hsu Processor-Tracing Guided Region
Formation in Dynamic Binary Translation 52:1--52:??
Yu Wang and
Victor Lee and
Gu-Yeon Wei and
David Brooks Predicting New Workload or CPU
Performance by Analyzing Public Datasets 53:1--53:??
Hyukwoo Park and
Sungkook Kim and
Jung-Geun Park and
Soo-Mook Moon Reusing the Optimized Code for
JavaScript Ahead-of-Time Compilation . . 54:1--54:??
Han Zhao and
Quan Chen and
Yuxian Qiu and
Ming Wu and
Yao Shen and
Jingwen Leng and
Chao Li and
Minyi Guo Bandwidth and Locality Aware
Task-stealing for Manycore Architectures
with Bandwidth-Asymmetric Memory . . . . 55:1--55:??
Stefan Ganser and
Armin Größlinger and
Norbert Siegmund and
Sven Apel and
Christian Lengauer Speeding up Iterative Polyhedral
Schedule Optimization with Surrogate
Performance Models . . . . . . . . . . . 56:1--56:??
Song Wu and
Fang Zhou and
Xiang Gao and
Hai Jin and
Jinglei Ren Dual-Page Checkpointing: an
Architectural Approach to Efficient Data
Persistence for In-Memory Applications 57:1--57:??
Mohsen Kiani and
Amir Rajabzadeh Efficient Cache Performance Modeling in
GPUs Using Reuse Distance Analysis . . . 58:1--58:??
Thomas Debrunner and
Sajad Saeedi and
Paul H. J. Kelly AUKE: Automatic Kernel Code Generation
for an Analogue SIMD Focal-Plane
Sensor-Processor Array . . . . . . . . . 59:1--59:??
You Zhou and
Fei Wu and
Zhonghai Lu and
Xubin He and
Ping Huang and
Changsheng Xie SCORE: a Novel Scheme to Efficiently
Cache Overlong ECCs in NAND Flash Memory 60:1--60:??
Francisco J. Andújar and
Salvador Coll and
Marina Alonso and
Pedro López and
Juan-Miguel Martínez POWAR: Power-Aware Routing in HPC
Networks with On/Off Links . . . . . . . 61:1--61:??
Rahim Mammadli and
Felix Wolf and
Ali Jannesari The Art of Getting Deep Neural Networks
in Shape . . . . . . . . . . . . . . . . 62:1--62:??
Stavros Tzilis and
Pedro Trancoso and
Ioannis Sourdis Energy-Efficient Runtime Management of
Heterogeneous Multicores using Online
Projection . . . . . . . . . . . . . . . 63:1--63:??
Matthew Kay Fei Lee and
Yingnan Cui and
Thannirmalai Somu and
Tao Luo and
Jun Zhou and
Wai Teng Tang and
Weng-Fai Wong and
Rick Siow Mong Goh A System-Level Simulator for RRAM-Based
Neuromorphic Computing Chips . . . . . . 64:1--64:??
Evangelos Vasilakis and
Vassilis Papaefstathiou and
Pedro Trancoso and
Ioannis Sourdis Decoupled Fused Cache: Fusing a
Decoupled LLC with a DRAM Cache . . . . 65:1--65:??
Peter Pirkelbauer and
Amalee Wilson and
Christina Peterson and
Damian Dechev Blaze-Tasks: a Framework for Computing
Parallel Reductions over Tasks . . . . . 66:1--66:??
Yukinori Sato and
Tomoya Yuki and
Toshio Endo An Autotuning Framework for Scalable
Execution of Tiled Code via Iterative
Polyhedral Compilation . . . . . . . . . 67:1--67:??
S.-Kazem Shekofteh and
Hamid Noori and
Mahmoud Naghibzadeh and
Hadi Sadoghi Yazdi and
Holger Fröning Metric Selection for GPU Kernel
Classification . . . . . . . . . . . . . 68:1--68:??
Angelos Bilas List of 2018 Distinguished Reviewers ACM
TACO . . . . . . . . . . . . . . . . . . 69:1--69:??
Ghassan Shobaki and
Austin Kerbow and
Christopher Pulido and
William Dobson Exploring an Alternative Cost Function
for Combinatorial
Register-Pressure-Aware Instruction
Scheduling . . . . . . . . . . . . . . . 1:1--1:??
Yu-Ping Liu and
Ding-Yong Hong and
Jan-Jan Wu and
Sheng-Yu Fu and
Wei-Chung Hsu Exploiting SIMD Asymmetry in ARM-to-x86
Dynamic Binary Translation . . . . . . . 2:1--2:??
Mohammad Sadrosadati and
Seyed Borna Ehsani and
Hajar Falahati and
Rachata Ausavarungnirun and
Arash Tavakkol and
Mojtaba Abaee and
Lois Orosa and
Yaohua Wang and
Hamid Sarbazi-Azad and
Onur Mutlu ITAP: Idle-Time-Aware Power Management
for GPU Execution Units . . . . . . . . 3:1--3:??
Halit Dogan and
Masab Ahmad and
Brian Kahne and
Omer Khan Accelerating Synchronization Using
Moving Compute to Data Model at
1,000-core Multicore Scale . . . . . . . 4:1--4:??
Leonid Azriel and
Lukas Humbel and
Reto Achermann and
Alex Richardson and
Moritz Hoffmann and
Avi Mendelson and
Timothy Roscoe and
Robert N. M. Watson and
Paolo Faraboschi and
Dejan Milojicic Memory-Side Protection With a Capability
Enforcement Co-Processor . . . . . . . . 5:1--5:??
Aamer Jaleel and
Eiman Ebrahimi and
Sam Duncan DUCATI: High-performance Address
Translation by Extending TLB Reach of
GPU-accelerated Systems . . . . . . . . 6:1--6:??
Yemao Xu and
Dezun Dong and
Weixia Xu and
Xiangke Liao SketchDLC: a Sketch on Distributed Deep
Learning Communication via Trace
Capturing . . . . . . . . . . . . . . . 7:1--7:??
Aristeidis Mastoras and
Thomas R. Gross Efficient and Scalable Execution of
Fine-Grained Dynamic Linear Pipelines 8:1--8:??
Tae Jun Ham and
Juan L. Aragón and
Margaret Martonosi Efficient Data Supply for Parallel
Heterogeneous Architectures . . . . . . 9:1--9:??
Savvas Sioutas and
Sander Stuijk and
Luc Waeijen and
Twan Basten and
Henk Corporaal and
Lou Somers Schedule Synthesis for Halide Pipelines
through Reuse Analysis . . . . . . . . . 10:1--10:??
Xiaoyuan Wang and
Haikun Liu and
Xiaofei Liao and
Ji Chen and
Hai Jin and
Yu Zhang and
Long Zheng and
Bingsheng He and
Song Jiang Supporting Superpages and Lightweight
Page Migration in Hybrid Memory Systems 11:1--11:??
Sahar Sargaran and
Naser Mohammadzadeh SAQIP: a Scalable Architecture for
Quantum Information Processors . . . . . 12:1--12:??
Prerna Budhkar and
Ildar Absalyamov and
Vasileios Zois and
Skyler Windh and
Walid A. Najjar and
Vassilis J. Tsotras Accelerating In-Memory Database
Selections Using Latency Masking
Hardware Threads . . . . . . . . . . . . 13:1--13:??
Heinrich Riebler and
Gavin Vaz and
Tobias Kenter and
Christian Plessl Transparent Acceleration for
Heterogeneous Platforms With Compilation
to OpenCL . . . . . . . . . . . . . . . 14:1--14:??
Xun Gong and
Xiang Gong and
Leiming Yu and
David Kaeli HAWS: Accelerating GPU Wavefront
Execution through Selective Out-of-order
Execution . . . . . . . . . . . . . . . 15:1--15:??
Yang Song and
Olivier Alavoine and
Bill Lin A Self-aware Resource Management
Framework for Heterogeneous Multicore
SoCs with Diverse QoS Targets . . . . . 16:1--16:??
Pedro Yebenes and
Jose Rocher-Gonzalez and
Jesus Escudero-Sahuquillo and
Pedro Javier Garcia and
Francisco J. Alfaro and
Francisco J. Quiles and
Crispín Gómez and
Jose Duato Combining Source-adaptive and Oblivious
Routing with Congestion Control in
High-performance Interconnects using
Hybrid and Direct Topologies . . . . . . 17:1--17:??
Mohammad Alshboul and
Hussein Elnawawy and
Reem Elkhouly and
Keiji Kimura and
James Tuck and
Yan Solihin Efficient Checkpointing with Recompute
Scheme for Non-volatile Main Memory . . 18:1--18:??
Zacharias Hadjilambrou and
Marios Kleanthous and
Georgia Antoniou and
Antoni Portero and
Yiannakis Sazeides Comprehensive Characterization of an
Open Source Document Search Engine . . . 19:1--19:??
Bingchao Li and
Jizeng Wei and
Jizhou Sun and
Murali Annavaram and
Nam Sung Kim An Efficient GPU Cache Architecture for
Applications with Irregular Memory
Access Patterns . . . . . . . . . . . . 20:1--20:??
Stephen I. Roberts and
Steven A. Wright and
Suhaib A. Fahmy and
Stephen A. Jarvis The Power-optimised Software Envelope 21:1--21:??
Ram Srivatsa Kannan and
Michael Laurenzano and
Jeongseob Ahn and
Jason Mars and
Lingjia Tang Caliper: Interference Estimator for
Multi-tenant Environments Sharing
Architectural Resources . . . . . . . . 22:1--22:??
Zhen Lin and
Hongwen Dai and
Michael Mantor and
Huiyang Zhou Coordinated CTA Combination and
Bandwidth Partitioning for GPU
Concurrent Kernel Execution . . . . . . 23:1--23:??
Keryan Didier and
Dumitru Potop-Butucaru and
Guillaume Iooss and
Albert Cohen and
Jean Souyris and
Philippe Baufreton and
Amaury Graillat Correct-by-Construction Parallelization
of Hard Real-Time Avionics Applications
on Off-the-Shelf Predictable Hardware 24:1--24:??
Pantea Zardoshti and
Tingzhe Zhou and
Pavithra Balaji and
Michael L. Scott and
Michael Spear Simplifying Transactional Memory Support
in C++ . . . . . . . . . . . . . . . . . 25:1--25:??
Jungwoo Park and
Myoungjun Lee and
Soontae Kim and
Minho Ju and
Jeongkyu Hong MH Cache: a Multi-retention
STT-RAM-based Low-power Last-level Cache
for Mobile Hardware Rendering Systems 26:1--26:??
Jakob Leben and
George Tzanetakis Polyhedral Compilation for
Multi-dimensional Stream Processing . . 27:1--27:??
Mohammad Sadegh Sadeghi and
Siavash Bayat Sarmadi and
Shaahin Hessabi Toward On-chip Network Security Using
Runtime Isolation Mapping . . . . . . . 28:1--28:??
Stephane Louise A First Step Toward Using Quantum
Computing for Low-level WCETs
Estimations . . . . . . . . . . . . . . 29:1--29:??
Artem Chikin and
Taylor Lloyd and
José Nelson Amaral and
Ettore Tiotto and
Muhammad Usman Memory-access-aware Safety and
Profitability Analysis for
Transformation of Accelerator-bound
OpenMP Loops . . . . . . . . . . . . . . 30:1--30:??
Sanghoon Cha and
Bokyeong Kim and
Chang Hyun Park and
Jaehyuk Huh Morphable DRAM Cache Design for Hybrid
Memory Systems . . . . . . . . . . . . . 31:1--31:??
Chao Luo and
Yunsi Fei and
David Kaeli Side-channel Timing Attack of RSA on a
GPU . . . . . . . . . . . . . . . . . . 32:1--32:??
Liang Yuan and
Chen Ding and
Wesley Smith and
Peter Denning and
Yunquan Zhang A Relational Theory of Locality . . . . 33:1--33:??
Arun Thangamani and
V. Krishna Nandivada Optimizing Remote Communication in X10 34:1--34:26
Sriseshan Srikanth and
Anirudh Jain and
Joseph M. Lennon and
Thomas M. Conte and
Erik DeBenedictis and
Jeanine Cook MetaStrider: Architectures for Scalable
Memory-centric Reduction of Sparse Data
Streams . . . . . . . . . . . . . . . . 35:1--35:26
Mostafa Koraei and
Omid Fatemi and
Magnus Jahre DCMI: a Scalable Strategy for
Accelerating Iterative Stencil Loops on
FPGAs . . . . . . . . . . . . . . . . . 36:1--36:24
Leeor Peled and
Uri Weiser and
Yoav Etsion A Neural Network Prefetcher for
Arbitrary Memory Access Patterns . . . . 37:1--37:27
Nicolas Vasilache and
Oleksandr Zinenko and
Theodoros Theodoridis and
Priya Goyal and
Zachary DeVito and
William S. Moses and
Sven Verdoolaege and
Andrew Adams and
Albert Cohen The Next 700 Accelerated Layers: From
Mathematical Expressions of Network
Computation Graphs to Accelerated GPU
Kernels, Automatically . . . . . . . . . 38:1--38:26
Wenbin Jiang and
Yang Ma and
Bo Liu and
Haikun Liu and
Bing Bing Zhou and
Jian Zhu and
Song Wu and
Hai Jin Layup: Layer-adaptive and Multi-type
Intermediate-oriented Memory
Optimization for GPU-based CNNs . . . . 39:1--39:23
Sergi Siso and
Wes Armour and
Jeyarajan Thiyagalingam Evaluating Auto-Vectorizing Compilers
through Objective Withdrawal of Useful
Information . . . . . . . . . . . . . . 40:1--40:23
Salonik Resch and
S. Karen Khatamifard and
Zamshed Iqbal Chowdhury and
Masoud Zabihi and
Zhengyang Zhao and
Jian-Ping Wang and
Sachin S. Sapatnekar and
Ulya R. Karpuzcu PIMBALL: Binary Neural Networks in
Spintronic Memory . . . . . . . . . . . 41:1--41:26
Zhen Hang Jiang and
Yunsi Fei and
David Kaeli Exploiting Bank Conflict-based
Side-channel Timing Leakage of GPUs . . 42:1--42:24
Kyle Daruwalla and
Heng Zhuo and
Rohit Shukla and
Mikko Lipasti BitSAD v2: Compiler Optimization and
Analysis for Bitstream Computing . . . . 43:1--43:25
Aristeidis Mastoras and
Thomas R. Gross Chunking for Dynamic Linear Pipelines 44:1--44:25
Manuel Selva and
Fabian Gruber and
Diogo Sampaio and
Christophe Guillon and
Louis-Noël Pouchet and
Fabrice Rastello Building a Polyhedral Representation
from an Instrumented Execution: Making
Dynamic Analyses of Nonaffine Programs
Scalable . . . . . . . . . . . . . . . . 45:1--45:26
Ahmad Yasin and
Jawad Haj-Yahya and
Yosi Ben-Asher and
Avi Mendelson A Metric-Guided Method for Discovering
Impactful Features and Architectural
Insights for Skylake-Based Processors 46:1--46:25
Jie Zhao and
Albert Cohen Flextended Tiles: a Flexible Extension
of Overlapped Tiles for Polyhedral
Compilation . . . . . . . . . . . . . . 47:1--47:25
Daniel Gerzhoy and
Xiaowu Sun and
Michael Zuzak and
Donald Yeung Nested MIMD--SIMD Parallelization for
Heterogeneous Microprocessors . . . . . 48:1--48:27
Chunwei Xia and
Jiacheng Zhao and
Huimin Cui and
Xiaobing Feng and
Jingling Xue DNNTune: Automatic Benchmarking DNN
Models for Mobile-cloud Computing . . . 49:1--49:26
Ian Briggs and
Arnab Das and
Mark Baranowski and
Vishal Sharma and
Sriram Krishnamoorthy and
Zvonimir Rakamarić and
Ganesh Gopalakrishnan FailAmp: Relativization Transformation
for Soft Error Detection in Structured
Address Generation . . . . . . . . . . . 50:1--50:21
Khalid Ahmad and
Hari Sundar and
Mary Hall Data-driven Mixed Precision Sparse
Matrix Vector Multiplication for GPUs 51:1--51:24
Larisa Stoltzfus and
Bastian Hagedorn and
Michel Steuwer and
Sergei Gorlatch and
Christophe Dubach Tiling Optimizations for Stencil
Computations Using Rewrite Rules in Lift 52:1--52:25
Michiel A. van der Vlag and
Georgios Smaragdos and
Zaid Al-Ars and
Christos Strydis Exploring Complex Brain-Simulation
Workloads on Multi-GPU Deployments . . . 53:1--53:25
Reem Elkhouly and
Mohammad Alshboul and
Akihiro Hayashi and
Yan Solihin and
Keiji Kimura Compiler-support for Critical Data
Persistence in NVM . . . . . . . . . . . 54:1--54:25
Lorenzo Chelini and
Oleksandr Zinenko and
Tobias Grosser and
Henk Corporaal Declarative Loop Tactics for
Domain-specific Optimization . . . . . . 55:1--55:25
Asif Ali Khan and
Fazal Hameed and
Robin Bläsing and
Stuart S. P. Parkin and
Jeronimo Castrillon ShiftsReduce: Minimizing Shifts in
Racetrack Memory 4.0 . . . . . . . . . . 56:1--56:23
Yuhao Li and
Dan Sun and
Benjamin C. Lee Dynamic Colocation Policies with
Reinforcement Learning . . . . . . . . . 1:1--1:25
Nikolaos Tampouratzis and
Ioannis Papaefstathiou and
Antonios Nikitakis and
Andreas Brokalakis and
Stamatis Andrianakis and
Apostolos Dollas and
Marco Marcon and
Emanuele Plebani A Novel, Highly Integrated Simulator for
Parallel and Distributed Systems . . . . 2:1--2:28
Lijuan Jiang and
Chao Yang and
Wenjing Ma Enabling Highly Efficient Batched Matrix
Multiplications on SW26010 Many-core
Processor . . . . . . . . . . . . . . . 3:1--3:23
Mustafa Cavus and
Resit Sendag and
Joshua J. Yi Informed Prefetching for Indirect Memory
Accesses . . . . . . . . . . . . . . . . 4:1--4:29
Yohann Uguen and
Florent de Dinechin and
Victor Lezaud and
Steven Derrien Application-Specific Arithmetic in
High-Level Synthesis Tools . . . . . . . 5:1--5:23
Yang Song and
Bill Lin Improving Memory Efficiency in
Heterogeneous MPSoCs through Row-Buffer
Locality-aware Forwarding . . . . . . . 6:1--6:26
Hao Wu and
Weizhi Liu and
Huanxin Lin and
Cho-Li Wang A Model-Based Software Solution for
Simultaneous Multiple Kernels on GPUs 7:1--7:26
Xuanhua Shi and
Wei Liu and
Ligang He and
Hai Jin and
Ming Li and
Yong Chen Optimizing the SSD Burst Buffer by
Traffic Detection . . . . . . . . . . . 8:1--8:26
Charu Kalra and
Fritz Previlon and
Norm Rubin and
David Kaeli ArmorAll: Compiler-based Resilience
Targeting GPU Applications . . . . . . . 9:1--9:24
Stefano Cherubin and
Daniele Cattaneo and
Michele Chiari and
Giovanni Agosta Dynamic Precision Autotuning with TAFFO 10:1--10:26
Ahmet Erdem and
Cristina Silvano and
Thomas Boesch and
Andrea Carlo Ornstein and
Surinder-Pal Singh and
Giuseppe Desoli Runtime Design Space Exploration and
Mapping of DCNNs for the Ultra-Low-Power
Orlando SoC . . . . . . . . . . . . . . 11:1--11:25
Amir Hossein Nodehi Sabet and
Junqiao Qiu and
Zhijia Zhao and
Sriram Krishnamoorthy Reliability Analysis for Unreliable FSM
Computations . . . . . . . . . . . . . . 12:1--12:23
Jiachen Xue and
T. N. Vijaykumar and
Mithuna Thottethodi Network Interface Architecture for
Remote Indirect Memory Access (RIMA) in
Datacenters . . . . . . . . . . . . . . 13:1--13:22
Qinggang Wang and
Long Zheng and
Jieshan Zhao and
Xiaofei Liao and
Hai Jin and
Jingling Xue A Conflict-free Scheduler for
High-performance Graph Processing on
Multi-pipeline FPGAs . . . . . . . . . . 14:1--14:26
Anita Tino and
Caroline Collange and
André Seznec SIMT-X: Extending Single-Instruction
Multi-Threading to Out-of-Order Cores 15:1--15:23
Dave Kaeli Editorial: a Message from the
Editor-in-Chief . . . . . . . . . . . . 16:1--16:2
Ram Rangan and
Mark W. Stephenson and
Aditya Ukarande and
Shyam Murthy and
Virat Agarwal and
Marc Blackstein Zeroploit: Exploiting Zero Valued
Operands in Interactive Gaming
Applications . . . . . . . . . . . . . . 17:1--17:26
Karel Adámek and
Sofia Dimoudi and
Mike Giles and
Wesley Armour GPU Fast Convolution via the
Overlap-and-Save Method in Shared Memory 18:1--18:20
Arnab Das and
Sriram Krishnamoorthy and
Ian Briggs and
Ganesh Gopalakrishnan and
Ramakrishna Tipireddy FPDetect: Efficient Reasoning About
Stencil Programs Using Selective Direct
Evaluation . . . . . . . . . . . . . . . 19:1--19:27
Tarek S. Abdelrahman Cooperative Software-hardware
Acceleration of $K$-means on a Tightly
Coupled CPU--FPGA System . . . . . . . . 20:1--20:24
Jaekyu Lee and
Yasuo Ishii and
Dam Sunwoo Securing Branch Predictors with
Two-Level Encryption . . . . . . . . . . 21:1--21:25
L. Cerina and
M. D. Santambrogio and
G. Franco and
C. Gallicchio and
A. Micheli EchoBay: Design and Optimization of Echo
State Networks under Memory and Time
Constraints . . . . . . . . . . . . . . 22:1--22:24
Savvas Sioutas and
Sander Stuijk and
Twan Basten and
Henk Corporaal and
Lou Somers Schedule Synthesis for Halide Pipelines
on GPUs . . . . . . . . . . . . . . . . 23:1--23:25
Muhammad Huzaifa and
Johnathan Alsop and
Abdulrahman Mahmoud and
Giordano Salvador and
Matthew D. Sinclair and
Sarita V. Adve Inter-kernel Reuse-aware Thread Block
Scheduling . . . . . . . . . . . . . . . 24:1--24:27
Syed M. A. H. Jafri and
Hasan Hassan and
Ahmed Hemani and
Onur Mutlu Refresh Triggered Computation: Improving
the Energy Efficiency of Convolutional
Neural Network Accelerators . . . . . . 2:1--2:29
Solomon Abera and
M. Balakrishnan and
Anshul Kumar Performance-Energy Trade-off in Modern
CMPs . . . . . . . . . . . . . . . . . . 3:1--3:26
Atefeh Mehrabi and
Aninda Manocha and
Benjamin C. Lee and
Daniel J. Sorin Bayesian Optimization for Efficient
Accelerator Synthesis . . . . . . . . . 4:1--4:25
Minsu Kim and
Jeong-Keun Park and
Soo-Mook Moon Irregular Register Allocation for
Translation of Test-pattern Programs . . 5:1--5:23
Negin Nematollahi and
Mohammad Sadrosadati and
Hajar Falahati and
Marzieh Barkhordar and
Mario Paulo Drumond and
Hamid Sarbazi-Azad and
Babak Falsafi Efficient Nearest-Neighbor Data Sharing
in GPUs . . . . . . . . . . . . . . . . 6:1--6:26
Lorenz Braun and
Sotirios Nikas and
Chen Song and
Vincent Heuveline and
Holger Fröning A Simple Model for Portable and Fast
Prediction of Execution Time and Power
Consumption of GPU Kernels . . . . . . . 7:1--7:25
Marcel Mettler and
Daniel Mueller-Gritschneder and
Ulf Schlichtmann A Distributed Hardware Monitoring System
for Runtime Verification on Multi-Tile
MPSoCs . . . . . . . . . . . . . . . . . 8:1--8:25
Yu Emma Wang and
Carole-Jean Wu and
Xiaodong Wang and
Kim Hazelwood and
David Brooks Exploiting Parallelism Opportunities
with Deep Learning Frameworks . . . . . 9:1--9:23
Sanket Tavarageri and
Alexander Heinecke and
Sasikanth Avancha and
Bharat Kaul and
Gagandeep Goyal and
Ramakrishna Upadrasta PolyDL: Polyhedral Optimizations for
Creation of High-performance DL
Primitives . . . . . . . . . . . . . . . 11:1--11:27
Sujay Yadalam and
Vinod Ganapathy and
Arkaprava Basu SGXL: Security and Performance for
Enclaves Using Large Pages . . . . . . . 12:1--12:25
Kleovoulos Kalaitzidis and
André Seznec Leveraging Value Equality Prediction for
Value Speculation . . . . . . . . . . . 13:1--13:20
Abhishek Singh and
Shail Dave and
Pantea Zardoshti and
Robert Brotzman and
Chao Zhang and
Xiaochen Guo and
Aviral Shrivastava and
Gang Tan and
Michael Spear SPX64: a Scratchpad Memory for
General-purpose Microprocessors . . . . 14:1--14:26
Paolo Sylos Labini and
Marco Cianfriglia and
Damiano Perri and
Osvaldo Gervasi and
Grigori Fursin and
Anton Lokhmotov and
Cedric Nugteren and
Bruno Carpentieri and
Fabiana Zollo and
Flavio Vella On the Anatomy of Predictive Models for
Accelerating GPU Convolution Kernels and
Beyond . . . . . . . . . . . . . . . . . 16:1--16:24
Nils Voss and
Bastiaan Kwaadgras and
Oskar Mencer and
Wayne Luk and
Georgi Gaydadjiev On Predictable Reconfigurable System
Design . . . . . . . . . . . . . . . . . 17:1--17:28
Anirudh Mohan Kaushik and
Gennady Pekhimenko and
Hiren Patel Gretch: a Hardware Prefetcher for Graph
Analytics . . . . . . . . . . . . . . . 18:1--18:25
Nhut-Minh Ho and
Himeshi De Silva and
Weng-Fai Wong GRAM: a Framework for Dynamically Mixing
Precisions in GPU Applications . . . . . 19:1--19:24
Arnab Kumar Biswas Cryptographic Software IP Protection
without Compromising Performance or
Timing Side-channel Leakage . . . . . . 20:1--20:20
Maxime France-Pillois and
Jérôme Martin and
Frédéric Rousseau A Non-Intrusive Tool Chain to Optimize
MPSoC End-to-End Systems . . . . . . . . 21:1--21:22
Pengyu Wang and
Jing Wang and
Chao Li and
Jianzong Wang and
Haojin Zhu and
Minyi Guo Grus: Toward Unified-memory-efficient
High-performance Graph Processing on GPU 22:1--22:25
Ramin Izadpanah and
Christina Peterson and
Yan Solihin and
Damian Dechev PETRA: Persistent Transactional
Non-blocking Linked Data Structures . . 23:1--23:26
Muhammad Hassan and
Chang Hyun Park and
David Black-Schaffer A Reusable Characterization of the
Memory System Behavior of SPEC2017 and
SPEC2006 . . . . . . . . . . . . . . . . 24:1--24:20
Sugandha Tiwari and
Neel Gala and
Chester Rebeiro and
V. Kamakoti PERI: a Configurable Posit Enabled
RISC-V Core . . . . . . . . . . . . . . 25:1--25:26
George Charitopoulos and
Dionisios N. Pnevmatikatos and
Georgi Gaydadjiev MC-DeF: Creating Customized CGRAs for
Dataflow Applications . . . . . . . . . 26:1--26:25
Jose M. Rodriguez Borbon and
Junjie Huang and
Bryan M. Wong and
Walid Najjar Acceleration of Parallel-Blocked $QR$
Decomposition of Tall-and-Skinny
Matrices on FPGAs . . . . . . . . . . . 27:1--27:25
Michael Stokes and
David Whalley and
Soner Onder Decreasing the Miss Rate and Eliminating
the Performance Penalty of a Data Filter
Cache . . . . . . . . . . . . . . . . . 28:1--28:22
Shoaib Akram Performance Evaluation of Intel Optane
Memory for Managed Workloads . . . . . . 29:1--29:26
Yashuai Lü and
Hui Guo and
Libo Huang and
Qi Yu and
Li Shen and
Nong Xiao and
Zhiying Wang GraphPEG: Accelerating Graph Processing
on GPUs . . . . . . . . . . . . . . . . 30:1--30:24
Hamza Omar and
Omer Khan PRISM: Strong Hardware Isolation-based
Soft-Error Resilient Multicore
Architecture with High Performance and
Availability at Low Hardware Overheads 31:1--31:25
Devashree Tripathy and
Amirali Abdolrashidi and
Laxmi Narayan Bhuyan and
Liang Zhou and
Daniel Wong PAVER: Locality Graph-Based Thread Block
Scheduling for GPUs . . . . . . . . . . 32:1--32:26
Wim Heirman and
Stijn Eyerman and
Kristof Du Bois and
Ibrahim Hur Automatic Sublining for Efficient Sparse
Memory Accesses . . . . . . . . . . . . 33:1--33:23
Mustafa Cavus and
Mohammed Shatnawi and
Resit Sendag and
Augustus K. Uht Fast Key-Value Lookups with Node Tracker 34:1--34:26
Weijia Song and
Christina Delimitrou and
Zhiming Shen and
Robbert Van Renesse and
Hakim Weatherspoon and
Lotfi Benmohamed and
Frederic De Vaulx and
Charif Mahmoudi CacheInspector: Reverse Engineering
Cache Resources in Public Clouds . . . . 35:1--35:25
Daniel Rodrigues Carvalho and
André Seznec Understanding Cache Compression . . . . 36:1--36:27
Daniel Thuerck and
Nicolas Weber and
Roberto Bifulco Flynn's Reconciliation: Automating the
Register Cache Idiom for
Cross-accelerator Programming . . . . . 37:1--37:26
João P. L. De Carvalho and
Braedy Kuzma and
Ivan Korostelev and
José Nelson Amaral and
Christopher Barton and
José Moreira and
Guido Araujo KernelFaRer: Replacing Native-Code
Idioms with High-Performance Library
Calls . . . . . . . . . . . . . . . . . 38:1--38:22
Ricardo Alves and
Stefanos Kaxiras and
David Black-Schaffer Early Address Prediction: Efficient
Pipeline Prefetch and Reuse . . . . . . 39:1--39:22
Kaustav Goswami and
Dip Sankar Banerjee and
Shirshendu Das Towards Enhanced System Efficiency while
Mitigating Row Hammer . . . . . . . . . 40:1--40:26
Jerzy Proficz All-gather Algorithms Resilient to
Imbalanced Process Arrival Patterns . . 41:1--41:22
Rui Xu and
Sheng Ma and
Yaohua Wang and
Xinhai Chen and
Yang Guo Configurable Multi-directional Systolic
Array Architecture for Convolutional
Neural Networks . . . . . . . . . . . . 42:1--42:24
Wonik Seo and
Sanghoon Cha and
Yeonjae Kim and
Jaehyuk Huh and
Jongse Park SLO-Aware Inference Scheduler for
Heterogeneous Processors in Edge
Platforms . . . . . . . . . . . . . . . 43:1--43:26
Yasir Mahmood Qureshi and
William Andrew Simon and
Marina Zapater and
Katzalin Olcoz and
David Atienza Gem5-X: a Many-core Heterogeneous
Simulation Platform for Architectural
Exploration and Optimization . . . . . . 44:1--44:27
Tina Jung and
Fabian Ritter and
Sebastian Hack PICO: a Presburger In-bounds Check
Optimization for Compiler-based Memory
Safety Instrumentations . . . . . . . . 45:1--45:27
Zhibing Sha and
Jun Li and
Lihao Song and
Jiewen Tang and
Min Huang and
Zhigang Cai and
Lianju Qian and
Jianwei Liao and
Zhiming Liu Low I/O Intensity-aware Partial GC
Scheduling to Reduce Long-tail Latency
in SSDs . . . . . . . . . . . . . . . . 46:1--46:25
Syed Asad Alam and
James Garland and
David Gregg Low-precision Logarithmic Number
Systems: Beyond Base-2 . . . . . . . . . 47:1--47:25
Candace Walden and
Devesh Singh and
Meenatchi Jagasivamani and
Shang Li and
Luyi Kang and
Mehdi Asnaashari and
Sylvain Dubois and
Bruce Jacob and
Donald Yeung Monolithically Integrating Non-Volatile
Main Memory over the Last-Level Cache 48:1--48:26
Matthew Tomei and
Shomit Das and
Mohammad Seyedzadeh and
Philip Bedoukian and
Bradford Beckmann and
Rakesh Kumar and
David Wood Byte-Select Compression . . . . . . . . 49:1--49:27
Cunlu Li and
Dezun Dong and
Shazhou Yang and
Xiangke Liao and
Guangyu Sun and
Yongheng Liu CIB-HIER: Centralized Input Buffer
Design in Hierarchical High-radix
Routers . . . . . . . . . . . . . . . . 50:1--50:21
Tobias Gysi and
Christoph Müller and
Oleksandr Zinenko and
Stephan Herhut and
Eddie Davis and
Tobias Wicky and
Oliver Fuhrer and
Torsten Hoefler and
Tobias Grosser Domain-Specific Multi-Level IR Rewriting
for GPU: The Open Earth Compiler for
GPU-accelerated Climate Simulation . . . 51:1--51:23
An Zou and
Huifeng Zhu and
Jingwen Leng and
Xin He and
Vijay Janapa Reddi and
Christopher D. Gill and
Xuan Zhang System-level Early-stage Modeling and
Evaluation of IVR-assisted Processor
Power Delivery System . . . . . . . . . 52:1--52:27
Aninda Manocha and
Tyler Sorensen and
Esin Tureci and
Opeoluwa Matthews and
Juan L. Aragón and
Margaret Martonosi GraphAttack: Optimizing Data Supply for
Graph Applications on In-Order Multicore
Architectures . . . . . . . . . . . . . 53:1--53:26
Joscha Benz and
Oliver Bringmann Scenario-Aware Program Specialization
for Timing Predictability . . . . . . . 54:1--54:26
Shounak Chakraborty and
Magnus Själander WaFFLe: Gated Cache-Ways with Per-Core
Fine-Grained DVFS for Reduced On-Chip
Temperature and Leakage Consumption . . 55:1--55:25
Sriseshan Srikanth and
Anirudh Jain and
Thomas M. Conte and
Erik P. Debenedictis and
Jeanine Cook SortCache: Intelligent Cache Management
for Accelerating Sparse Data Workloads 56:1--56:24
Paul Metzger and
Volker Seeker and
Christian Fensch and
Murray Cole Device Hopping: Transparent Mid-Kernel
Runtime Switching for Heterogeneous
Systems . . . . . . . . . . . . . . . . 57:1--57:25
Yu Zhang and
Da Peng and
Xiaofei Liao and
Hai Jin and
Haikun Liu and
Lin Gu and
Bingsheng He LargeGraph: an Efficient
Dependency-Aware GPU-Accelerated
Large-Scale Graph Processing . . . . . . 58:1--58:24
Hüsrev Cilasun and
Salonik Resch and
Zamshed I. Chowdhury and
Erin Olson and
Masoud Zabihi and
Zhengyang Zhao and
Thomas Peterson and
Keshab K. Parhi and
Jian-Ping Wang and
Sachin S. Sapatnekar and
Ulya R. Karpuzcu Spiking Neural Networks in Spintronic
Computational RAM . . . . . . . . . . . 59:1--59:21
Aditya Ukarande and
Suryakant Patidar and
Ram Rangan Locality-Aware CTA Scheduling for Gaming
Applications . . . . . . . . . . . . . . 1:1--1:26
Hongzhi Liu and
Jie Luo and
Ying Li and
Zhonghai Wu Iterative Compilation Optimization Based
on Metric Learning and Collaborative
Filtering . . . . . . . . . . . . . . . 2:1--2:25
Muhammad Aditya Sasongko and
Milind Chabbi and
Mandana Bagheri Marzijarani and
Didem Unat ReuseTracker: Fast Yet Accurate
Multicore Reuse Distance Analyzer . . . 3:1--3:25
Yaosheng Fu and
Evgeny Bolotin and
Niladrish Chatterjee and
David Nellans and
Stephen W. Keckler GPU Domain Specialization via Composable
On-Package Architecture . . . . . . . . 4:1--4:23
Daeyeal Lee and
Bill Lin and
Chung-Kuan Cheng SMT-Based Contention-Free Task Mapping
and Scheduling on 2D/3D SMART NoC
with Mixed Dimension-Order Routing . . . 5:1--5:21
Prasanth Chatarasi and
Hyoukjun Kwon and
Angshuman Parashar and
Michael Pellauer and
Tushar Krishna and
Vivek Sarkar Marvel: a Data-Centric Approach for
Mapping Deep Learning Operators on
Spatial Accelerators . . . . . . . . . . 6:1--6:26
Dennis Rieber and
Axel Acosta and
Holger Fröning Joint Program and Layout Transformations
to Enable Convolutional Operators on
Specialized Hardware Based on Constraint
Programming . . . . . . . . . . . . . . 7:1--7:26
Mengya Lei and
Fan Li and
Fang Wang and
Dan Feng and
Xiaomin Zou and
Renzhi Xiao SecNVM: an Efficient and Write-Friendly
Metadata Crash Consistency Scheme for
Secure NVM . . . . . . . . . . . . . . . 8:1--8:26
Bang Di and
Daokun Hu and
Zhen Xie and
Jianhua Sun and
Hao Chen and
Jinkui Ren and
Dong Li TLB-pilot: Mitigating TLB Contention
Attack on GPUs with
Microarchitecture-Aware Scheduling . . . 9:1--9:23
Gururaj Saileshwar and
Rick Boivie and
Tong Chen and
Benjamin Segal and
Alper Buyuktosunoglu HeapCheck: Low-cost Hardware Support for
Memory Safety . . . . . . . . . . . . . 10:1--10:24
M. Waqar Azhar and
Miquel Pericàs and
Per Stenström Task-RM: a Resource Manager for Energy
Reduction in Task-Parallel Applications
under Quality of Service Constraints . . 11:1--11:26
Cesar Gomes and
Maziar Amiraski and
Mark Hempstead CASHT: Contention Analysis in Shared
Hierarchies with Thefts . . . . . . . . 12:1--12:27
Yufei Wang and
Xiaoshe Dong and
Longxiang Wang and
Weiduo Chen and
Xingjun Zhang Optimizing Small-Sample Disk Fault
Detection Based on LSTM-GAN Model . . . 13:1--13:24
Franyell Silfa and
Jose Maria Arnau and
Antonio González E-BATCH: Energy-Efficient and
High-Throughput RNN Batching . . . . . . 14:1--14:23
Chen Ding and
Dong Chen and
Fangzhou Liu and
Benjamin Reber and
Wesley Smith CARL: Compiler Assigned Reference
Leasing . . . . . . . . . . . . . . . . 15:1--15:28
Christof Schlaak and
Tzung-Han Juang and
Christophe Dubach Memory-Aware Functional IR for
Higher-Level Synthesis of Accelerators 16:1--16:26
Kartik Lakshminarasimhan and
Ajeya Naithani and
Josué Feliu and
Lieven Eeckhout The Forward Slice Core: a
High-Performance, Yet Low-Complexity
Microarchitecture . . . . . . . . . . . 17:1--17:25
Sharanyan Srikanthan and
Sayak Chakraborti and
Princeton Ferro and
Sandhya Dwarkadas MAPPER: Managing Application Performance
via Parallel Efficiency Regulation . . . 18:1--18:26
Tziouvaras Athanasios and
Dimitriou Georgios and
Stamoulis Georgios Low-power Near-data Instruction
Execution Leveraging Opcode-based Timing
Analysis . . . . . . . . . . . . . . . . 19:1--19:26
Xingguo Jia and
Jin Zhang and
Boshi Yu and
Xingyue Qian and
Zhengwei Qi and
Haibing Guan GiantVM: a Novel Distributed Hypervisor
for Resource Aggregation with DSM-aware
Optimizations . . . . . . . . . . . . . 20:1--20:27
Mehrzad Nejat and
Madhavan Manivannan and
Miquel Pericàs and
Per Stenström Cooperative Slack Management: Saving
Energy of Multicore Processors by
Trading Performance Slack Between
QoS-Constrained Applications . . . . . . 21:1--21:27
Hugo Pompougnac and
Ulysse Beaugnon and
Albert Cohen and
Dumitru Potop Butucaru Weaving Synchronous Reactions into the
Fabric of SSA-form Compilers . . . . . . 22:1--22:25
Ghassan Shobaki and
Vahl Scott Gordon and
Paul McHugh and
Theodore Dubois and
Austin Kerbow Register-Pressure-Aware Instruction
Scheduling Using Ant Colony Optimization 23:1--23:23
Qihan Wang and
Zhen Peng and
Bin Ren and
Jie Chen and
Robert G. Edwards MemHC: an Optimized GPU Memory
Management Framework for Accelerating
Many-body Correlation . . . . . . . . . 24:1--24:26
Rakesh Kumar and
Mehdi Alipour and
David Black-Schaffer Dependence-aware Slice Execution to
Boost MLP in Slice-out-of-order Cores 25:1--25:28
Nandita Vijaykumar and
Ataberk Olgun and
Konstantinos Kanellopoulos and
F. Nisa Bostanci and
Hasan Hassan and
Mehrshad Lotfi and
Phillip B. Gibbons and
Onur Mutlu MetaSys: a Practical Open-source
Metadata Management System to Implement
and Evaluate Cross-layer Optimizations 26:1--26:29
Jing Chen and
Madhavan Manivannan and
Mustafa Abduljabbar and
Miquel Pericàs ERASE: Energy Efficient Task Mapping
and Resource Management for Work
Stealing Runtimes . . . . . . . . . . . 27:1--27:29
Chencheng Ye and
Yuanchao Xu and
Xipeng Shen and
Hai Jin and
Xiaofei Liao and
Yan Solihin Preserving Addressability Upon
GC-Triggered Data Movements on
Non-Volatile Memory . . . . . . . . . . 28:1--28:26
George Michelogiannakis and
Benjamin Klenk and
Brandon Cook and
Min Yee Teh and
Madeleine Glick and
Larry Dennison and
Keren Bergman and
John Shalf A Case For Intra-rack Resource
Disaggregation in HPC . . . . . . . . . 29:1--29:26
Ping Wang and
Fei Wen and
Paul V. Gratz and
Alex Sprintson SIMD-Matcher: a SIMD-based Arbitrary
Matching Framework . . . . . . . . . . . 30:1--30:20
Marcel Mettler and
Martin Rapp and
Heba Khdr and
Daniel Mueller-Gritschneder and
Jörg Henkel and
Ulf Schlichtmann An FPGA-based Approach to Evaluate
Thermal and Resource Management
Strategies of Many-core Processors . . . 31:1--31:24
Paschalis Mpeis and
Pavlos Petoumenos and
Kim Hazelwood and
Hugh Leather Object Intersection Captures on
Interactive Apps to Drive a
Crowd-sourced Replay-based Compiler
Optimization . . . . . . . . . . . . . . 32:1--32:25
Cunlu Li and
Dezun Dong and
Xiangke Liao MUA-Router: Maximizing the
Utility-of-Allocation for On-chip
Pipelining Routers . . . . . . . . . . . 33:1--33:23
Ziaul Choudhury and
Shashwat Shrivastava and
Lavanya Ramapantulu and
Suresh Purini An FPGA Overlay for CNN Inference with
Fine-grained Flexible Parallelism . . . 34:1--34:26
Diksha Moolchandani and
Anshul Kumar and
Smruti R. Sarangi Performance and Power Prediction for
Concurrent Execution on GPUs . . . . . . 35:1--35:27
Ali Jahanshahi and
Nanpeng Yu and
Daniel Wong PowerMorph: QoS-Aware Server Power
Reshaping for Data Center Regulation
Service . . . . . . . . . . . . . . . . 36:1--36:27
Peng Xu and
Nannan Zhao and
Jiguang Wan and
Wei Liu and
Shuning Chen and
Yuanhui Zhou and
Hadeel Albahar and
Hanyang Liu and
Liu Tang and
Zhihu Tan Building a Fast and Efficient LSM-tree
Store by Integrating Local Storage with
Cloud Storage . . . . . . . . . . . . . 37:1--37:26
Horng-Ruey Huang and
Ding-Yong Hong and
Jan-Jan Wu and
Kung-Fu Chen and
Pangfeng Liu and
Wei-Chung Hsu Accelerating Video Captioning on
Heterogeneous System Architectures . . . 38:1--38:25
David Corbalán-Navarro and
Juan L. Aragón and
Martí Anglada and
Joan-Manuel Parcerisa and
Antonio González Triangle Dropping: an Occluded-geometry
Predictor for Energy-efficient Mobile
GPUs . . . . . . . . . . . . . . . . . . 39:1--39:20
Shivam Kundan and
Theodoros Marinakis and
Iraklis Anagnostopoulos and
Dimitri Kagaris A Pressure-Aware Policy for Contention
Minimization on Multicore Systems . . . 40:1--40:26
Johnathan Alsop and
Weon Taek Na and
Matthew D. Sinclair and
Samuel Grayson and
Sarita Adve A Case for Fine-grain Coherence
Specialization in Heterogeneous Systems 41:1--41:26
Mohammadreza Soltaniyeh and
Richard P. Martin and
Santosh Nagarakatte An Accelerator for Sparse Convolutional
Neural Networks Leveraging Systolic
General Matrix--matrix Multiplication 42:1--42:26
Dharanidhar Dang and
Bill Lin and
Debashis Sahoo LiteCON: an All-photonic Neuromorphic
Accelerator for Energy-efficient Deep
Learning . . . . . . . . . . . . . . . . 43:1--43:22
Lokesh Siddhu and
Rajesh Kedia and
Shailja Pandey and
Martin Rapp and
Anuj Pathania and
Jörg Henkel and
Preeti Ranjan Panda CoMeT: an Integrated Interval Thermal
Simulation Toolchain for 2D, 2.5D, and
3D Processor-Memory Systems . . . . . 44:1--44:25
M. Ben Olson and
Brandon Kammerdiener and
Michael R. Jantz and
Kshitij A. Doshi and
Terry Jones Online Application Guidance for
Heterogeneous Memory Systems . . . . . . 45:1--45:27
Bruno Chinelato Honorio and
João P. L. De Carvalho and
Catalina Munoz Morales and
Alexandro Baldassin and
Guido Araujo Using Barrier Elision to Improve
Transactional Code Generation . . . . . 46:1--46:23
Jiansong Li and
Xueying Wang and
Xiaobing Chen and
Guangli Li and
Xiao Dong and
Peng Zhao and
Xianzhi Yu and
Yongxin Yang and
Wei Cao and
Lei Liu and
Xiaobing Feng An Application-oblivious Memory
Scheduling System for DNN Accelerators 47:1--47:??
Aditya Narayan and
Yvain Thonnart and
Pascal Vivet and
Ayse Coskun and
Ajay Joshi Architecting Optically Controlled Phase
Change Memory . . . . . . . . . . . . . 48:1--48:??
Chao Zhang and
Maximilian Bremer and
Cy Chan and
John Shalf and
Xiaochen Guo ASA: Accelerating Sparse Accumulation in
Column-wise SpGEMM . . . . . . . . . . . 49:1--49:??
Aart Bik and
Penporn Koanantakool and
Tatiana Shpeisman and
Nicolas Vasilache and
Bixia Zheng and
Fredrik Kjolstad Compiler Support for Sparse Tensor
Computations in MLIR . . . . . . . . . . 50:1--50:??
Pierre Michaud and
Anis Peysieux HAIR: Halving the Area of the Integer
Register File with Odd/Even Banking . . 51:1--51:??
Amirreza Yousefzadeh and
Jan Stuijt and
Martijn Hijdra and
Hsiao-Hsuan Liu and
Anteneh Gebregiorgis and
Abhairaj Singh and
Said Hamdioui and
Francky Catthoor Energy-efficient In-Memory Address
Calculation . . . . . . . . . . . . . . 52:1--52:??
Hwisoo So and
Moslem Didehban and
Yohan Ko and
Aviral Shrivastava and
Kyoungwoo Lee EXPERTISE: an Effective Software-level
Redundant Multithreading Scheme against
Hardware Faults . . . . . . . . . . . . 53:1--53:??
Tim Hartley and
Foivos S. Zakkak and
Andy Nisbet and
Christos Kotselidis and
Mikel Luján Just-In-Time Compilation on ARM --- a
Closer Look at Call-Site Code
Consistency . . . . . . . . . . . . . . 54:1--54:??
Erling Jellum and
Milica Orlandić and
Edmund Brekke and
Tor Johansen and
Torleiv Bryne Solving Sparse Assignment Problems on
FPGAs . . . . . . . . . . . . . . . . . 55:1--55:??
Yuhao Li and
Benjamin C. Lee Phronesis: Efficient Performance
Modeling for High-dimensional
Configuration Tuning . . . . . . . . . . 56:1--56:??
Chandrahas Tirumalasetty and
Chih Chieh Chou and
Narasimha Reddy and
Paul Gratz and
Ayman Abouelwafa Reducing Minor Page Fault Overheads
through Enhanced Page Walker . . . . . . 57:1--57:??
Lan Gao and
Jing Wang and
Weigong Zhang Adaptive Contention Management for
Fine-Grained Synchronization on
Commodity GPUs . . . . . . . . . . . . . 58:1--58:??
Ruobing Han and
Jaewon Lee and
Jaewoong Sim and
Hyesoon Kim COX: Exposing CUDA Warp-level Functions
to CPUs . . . . . . . . . . . . . . . . 59:1--59:??
Yiding Liu and
Xingyao Zhang and
Donglin Zhuang and
Xin Fu and
Shuaiwen Song DynamAP: Architectural Support for
Dynamic Graph Traversal on the Automata
Processor . . . . . . . . . . . . . . . 60:1--60:??
Changwei Zou and
Yaoqing Gao and
Jingling Xue Practical Software-Based Shadow Stacks
on x86-64 . . . . . . . . . . . . . . . 61:1--61:??
Thomas Luinaud and
J. M. Pierre Langlois and
Yvon Savaria Symbolic Analysis for Data Plane
Programs Specialization . . . . . . . . 1:1--1:??
Nilesh Rajendra Shah and
Ashitabh Misra and
Antoine Miné and
Rakesh Venkat and
Ramakrishna Upadrasta BullsEye: Scalable and Accurate
Approximation Framework for Cache Miss
Calculation . . . . . . . . . . . . . . 2:1--2:??
Mitali Soni and
Asmita Pal and
Joshua San Miguel As-Is Approximate Computing . . . . . . 3:1--3:??
Parth Shah and
Ranjal Gautham Shenoy and
Vaidyanathan Srinivasan and
Pradip Bose and
Alper Buyuktosunoglu TokenSmart: Distributed, Scalable Power
Management in the Many-core Era . . . . 4:1--4:??
Zhangyu Chen and
Yu Hua and
Luochangqi Ding and
Bo Ding and
Pengfei Zuo and
Xue Liu Lock-Free High-performance Hashing for
Persistent Memory via PM-aware Holistic
Optimization . . . . . . . . . . . . . . 5:1--5:??
Aristeidis Mastoras and
Sotiris Anagnostidis and
Albert-Jan N. Yzelman Design and Implementation for
Nonblocking Execution in GraphBLAS:
Tradeoffs and Performance . . . . . . . 6:1--6:??
Yemao Xu and
Dezun Dong and
Dongsheng Wang and
Shi Xu and
Enda Yu and
Weixia Xu and
Xiangke Liao SSD-SGD: Communication Sparsification
for Distributed Deep Learning Training 7:1--7:??
Ataberk Olgun and
Juan Gómez Luna and
Konstantinos Kanellopoulos and
Behzad Salami and
Hasan Hassan and
Oguz Ergin and
Onur Mutlu PiDRAM: a Holistic End-to-end FPGA-based
Framework for Processing-in-DRAM . . . . 8:1--8:??
Christos Sakalis and
Stefanos Kaxiras and
Magnus Själander Delay-on-Squash: Stopping
Microarchitectural Replay Attacks in
Their Tracks . . . . . . . . . . . . . . 9:1--9:??
Yi Liang and
Shaokang Zeng and
Lei Wang Quantifying Resource Contention of
Co-located Workloads with the
System-level Entropy . . . . . . . . . . 10:1--10:??
Hur Suyeon and
Seongmin Na and
Dongup Kwon and
Kim Joonsung and
Andrew Boutros and
Eriko Nurvitadhi and
Jangwoo Kim A Fast and Flexible FPGA-based
Accelerator for Natural Language
Processing Neural Networks . . . . . . . 11:1--11:??
Ashish Gondimalla and
Jianqiao Liu and
Mithuna Thottethodi and
T. N. Vijaykumar Occam: Optimal Data Reuse for
Convolutional Neural Networks . . . . . 12:1--12:??
Bo Peng and
Yaozu Dong and
Jianguo Yao and
Fengguang Wu and
Haibing Guan FlexHM: a Practical System for
Heterogeneous Memory with Flexible and
Efficient Performance Optimizations . . 13:1--13:??
Qiang Zhang and
Lei Xu and
Baowen Xu RegCPython: a Register-based Python
Interpreter for Better Performance . . . 14:1--14:??
Hai Jin and
Zhuo He and
Weizhong Qiang SpecTerminator: Blocking Speculative
Side Channels Based on Instruction
Classes on RISC-V . . . . . . . . . . . 15:1--15:??
Tuowen Zhao and
Tobi Popoola and
Mary Hall and
Catherine Olschanowsky and
Michelle Strout Polyhedral Specification and Code
Generation of Sparse Tensor Contraction
with Co-iteration . . . . . . . . . . . 16:1--16:??
Manuela Schuler and
Richard Membarth and
Philipp Slusallek XEngine: Optimal Tensor
Rematerialization for Neural Networks in
Heterogeneous Environments . . . . . . . 17:1--17:??
Ivan Korostelev and
João P. L. De Carvalho and
José Moreira and
José Nelson Amaral YaConv: Convolution with Low Cache
Footprint . . . . . . . . . . . . . . . 18:1--18:??
Furkan Eris and
Marcia Louis and
Kubra Eris and
José Abellán and
Ajay Joshi Puppeteer: a Random Forest Based Manager
for Hardware Prefetchers Across the
Memory Hierarchy . . . . . . . . . . . . 19:1--19:??
Nicolas Tollenaere and
Guillaume Iooss and
Stéphane Pouget and
Hugo Brunie and
Christophe Guillon and
Albert Cohen and
P. Sadayappan and
Fabrice Rastello Autotuning Convolutions Is Easier Than
You Think . . . . . . . . . . . . . . . 20:1--20:??
Víctor Pérez and
Lukas Sommer and
Victor Lomüller and
Kumudha Narasimhan and
Mehdi Goli User-driven Online Kernel Fusion for
SYCL . . . . . . . . . . . . . . . . . . 21:1--21:??
Vinicius Espindola and
Luciano Zago and
Hervé Yviquel and
Guido Araujo Source Matching and Rewriting for MLIR
Using String-Based Automata . . . . . . 22:1--22:??
Wenjing Ma and
Fangfang Liu and
Daokun Chen and
Qinglin Lu and
Yi Hu and
Hongsen Wang and
Xinhui Yuan An Optimized Framework for Matrix
Factorization on the New Sunway
Many-core Platform . . . . . . . . . . . 23:1--23:??
Sarabjeet Singh and
Neelam Surana and
Kailash Prasad and
Pranjali Jain and
Joycee Mekie and
Manu Awasthi HyGain: High-performance,
Energy-efficient Hybrid Gain Cell-based
Cache Hierarchy . . . . . . . . . . . . 24:1--24:??
Chandra Sekhar Mummidi and
Sandip Kundu ACTION: Adaptive Cache Block Migration
in Distributed Cache Architectures . . . 25:1--25:??
Qiaoyi Liu and
Jeff Setter and
Dillon Huff and
Maxwell Strange and
Kathleen Feng and
Mark Horowitz and
Priyanka Raina and
Fredrik Kjolstad Unified Buffer: Compiling Image
Processing and Machine Learning
Applications to Push-Memory Accelerators 26:1--26:??
Ahmet Caner Yüzügüler and
Canberk Sönmez and
Mario Drumond and
Yunho Oh and
Babak Falsafi and
Pascal Frossard Scale-out Systolic Arrays . . . . . . . 27:1--27:??
Francesco Minervini and
Oscar Palomar and
Osman Unsal and
Enrico Reggiani and
Josue Quiroga and
Joan Marimon and
Carlos Rojas and
Roger Figueras and
Abraham Ruiz and
Alberto Gonzalez and
Jonnatan Mendoza and
Ivan Vargas and
César Hernandez and
Joan Cabre and
Lina Khoirunisya and
Mustapha Bouhali and
Julian Pavon and
Francesc Moll and
Mauro Olivieri and
Mario Kovac and
Mate Kovac and
Leon Dragic and
Mateo Valero and
Adrian Cristal Vitruvius+: an Area-Efficient RISC-V
Decoupled Vector Coprocessor for High
Performance Computing Applications . . . 28:1--28:??
Hadjer Benmeziane and
Hamza Ouarnoughi and
Kaoutar El Maghraoui and
Smail Niar Multi-objective Hardware-aware Neural
Architecture Search with Pareto
Rank-preserving Surrogate Models . . . . 29:1--29:??
Dongwei Chen and
Dong Tong and
Chun Yang and
Jiangfang Yi and
Xu Cheng FlexPointer: Fast Address Translation
Based on Range TLB and Tagged Pointers 30:1--30:??
Jingwen Du and
Fang Wang and
Dan Feng and
Changchen Gan and
Yuchao Cao and
Xiaomin Zou and
Fan Li Fast One-Sided RDMA-Based State Machine
Replication for Disaggregated Memory . . 31:1--31:??
Abdul Rasheed Sahni and
Hamza Omar and
Usman Ali and
Omer Khan ASM: an Adaptive Secure Multicore for
Co-located Mutually Distrusting
Processes . . . . . . . . . . . . . . . 32:1--32:??
Sooraj Puthoor and
Mikko H. Lipasti Turn-based Spatiotemporal Coherence for
GPUs . . . . . . . . . . . . . . . . . . 33:1--33:??
Ruobing Chen and
Haosen Shi and
Jinping Wu and
Yusen Li and
Xiaoguang Liu and
Gang Wang Jointly Optimizing Job Assignment and
Resource Partitioning for Improving
System Throughput in Cloud Datacenters 34:1--34:??
Gokul Subramanian Ravi and
Tushar Krishna and
Mikko Lipasti TNT: a Modular Approach to Traversing
Physically Heterogeneous NOCs at
Bare-wire Latency . . . . . . . . . . . 35:1--35:??
Weizhi Xu and
Yintai Sun and
Shengyu Fan and
Hui Yu and
Xin Fu Accelerating Convolutional Neural
Network by Exploiting Sparsity on GPUs 36:1--36:??
Jin Zhao and
Yu Zhang and
Ligang He and
Qikun Li and
Xiang Zhang and
Xinyu Jiang and
Hui Yu and
Xiaofei Liao and
Hai Jin and
Lin Gu and
Haikun Liu and
Bingsheng He and
Ji Zhang and
Xianzheng Song and
Lin Wang and
Jun Zhou GraphTune: an Efficient Dependency-Aware
Substrate to Alleviate Irregularity in
Concurrent Graph Processing . . . . . . 37:1--37:??
Yufeng Zhou and
Alan L. Cox and
Sandhya Dwarkadas and
Xiaowan Dong The Impact of Page Size and
Microarchitecture on Instruction Address
Translation Overhead . . . . . . . . . . 38:1--38:??
Benjamin Reber and
Matthew Gould and
Alexander H. Kneipp and
Fangzhou Liu and
Ian Prechtl and
Chen Ding and
Linlin Chen and
Dorin Patru Cache Programming for Scientific Loops
Using Leases . . . . . . . . . . . . . . 39:1--39:??
Xinfeng Xie and
Peng Gu and
Yufei Ding and
Dimin Niu and
Hongzhong Zheng and
Yuan Xie MPU: Memory-centric SIMT Processor via
In-DRAM Near-bank Computing . . . . . . 40:1--40:??
Alexander Krolik and
Clark Verbrugge and
Laurie Hendren rNdN: Fast Query Compilation for NVIDIA
GPUs . . . . . . . . . . . . . . . . . . 41:1--41:??
Jiazhi Jiang and
Zijian Huang and
Dan Huang and
Jiangsu Du and
Lin Chen and
Ziguan Chen and
Yutong Lu Hierarchical Model Parallelism for
Optimizing Inference on Many-core
Processor via Decoupled 3D-CNN
Structure . . . . . . . . . . . . . . . 42:1--42:??
Yuwen Zhao and
Fangfang Liu and
Wenjing Ma and
Huiyuan Li and
Yuanchi Peng and
Cui Wang MFFT: a GPU Accelerated Highly Efficient
Mixed-Precision Large-Scale FFT
Framework . . . . . . . . . . . . . . . 43:1--43:??
Muhammad Waqar Azhar and
Madhavan Manivannan and
Per Stenström Approx-RM: Reducing Energy on
Heterogeneous Multicore Processors under
Accuracy and Timing Constraints . . . . 44:1--44:??
Dong Huang and
Dan Feng and
Qiankun Liu and
Bo Ding and
Wei Zhao and
Xueliang Wei and
Wei Tong SplitZNS: Towards an Efficient LSM-Tree
on Zoned Namespace SSDs . . . . . . . . 45:1--45:??
Jiangsu Du and
Jiazhi Jiang and
Jiang Zheng and
Hongbin Zhang and
Dan Huang and
Yutong Lu Improving Computation and Memory
Efficiency for Real-world Transformer
Inference on GPUs . . . . . . . . . . . 46:1--46:??
Hai Jin and
Bo Lei and
Haikun Liu and
Xiaofei Liao and
Zhuohui Duan and
Chencheng Ye and
Yu Zhang A Compilation Tool for Computation
Offloading in ReRAM-based CIM
Architectures . . . . . . . . . . . . . 47:1--47:??
Christian Menard and
Marten Lohstroh and
Soroush Bateni and
Matthew Chorlian and
Arthur Deng and
Peter Donovan and
Clément Fournier and
Shaokai Lin and
Felix Suchert and
Tassilo Tanneberger and
Hokeun Kim and
Jeronimo Castrillon and
Edward A. Lee High-performance Deterministic
Concurrency Using Lingua Franca . . . . 48:1--48:??
Donglei Wu and
Weihao Yang and
Xiangyu Zou and
Wen Xia and
Shiyi Li and
Zhenbo Hu and
Weizhe Zhang and
Binxing Fang Smart-DNN+: a Memory-efficient Neural
Networks Compression Framework for the
Model Inference . . . . . . . . . . . . 49:1--49:??
Syed Salauddin Mohammad Tariq and
Lance Menard and
Pengfei Su and
Probir Roy MicroProf: Code-level Attribution of
Unnecessary Data Transfer in
Microservice Applications . . . . . . . 50:1--50:??
Shiyi Li and
Qiang Cao and
Shenggang Wan and
Wen Xia and
Changsheng Xie gPPM: a Generalized Matrix Operation and
Parallel Algorithm to Accelerate the
Encoding/Decoding Process of Erasure
Codes . . . . . . . . . . . . . . . . . 51:1--51:??
Petros Anastasiadis and
Nikela Papadopoulou and
Georgios Goumas and
Nectarios Koziris and
Dennis Hoppe and
Li Zhong PARALiA: a Performance Aware Runtime for
Auto-tuning Linear Algebra on
Heterogeneous Systems . . . . . . . . . 52:1--52:??
Hui Yu and
Yu Zhang and
Jin Zhao and
Yujian Liao and
Zhiying Huang and
Donghao He and
Lin Gu and
Hai Jin and
Xiaofei Liao and
Haikun Liu and
Bingsheng He and
Jianhui Yue RACE: an Efficient Redundancy-aware
Accelerator for Dynamic Graph Neural
Network . . . . . . . . . . . . . . . . 53:1--53:??
Victor Ferrari and
Rafael Sousa and
Marcio Pereira and
João P. L. De Carvalho and
José Nelson Amaral and
José Moreira and
Guido Araujo Advancing Direct Convolution Using
Convolution Slicing Optimization and ISA
Extensions . . . . . . . . . . . . . . . 54:1--54:??
Bowen He and
Xiao Zheng and
Yuan Chen and
Weinan Li and
Yajin Zhou and
Xin Long and
Pengcheng Zhang and
Xiaowei Lu and
Linquan Jiang and
Qiang Liu and
Dennis Cai and
Xiantao Zhang DxPU: Large-scale Disaggregated GPU
Pools in the Datacenter . . . . . . . . 55:1--55:??
Shiqing Zhang and
Mahmood Naderan-Tahan and
Magnus Jahre and
Lieven Eeckhout Characterizing Multi-Chip GPU Data
Sharing . . . . . . . . . . . . . . . . 56:1--56:??
Jens Domke and
Emil Vatai and
Balazs Gerofi and
Yuetsu Kodama and
Mohamed Wahib and
Artur Podobas and
Sparsh Mittal and
Miquel Pericàs and
Lingqi Zhang and
Peng Chen and
Aleksandr Drozd and
Satoshi Matsuoka At the Locus of Performance: Quantifying
the Effects of Copious 3D-Stacked
Cache on HPC Workloads . . . . . . . . . 57:1--57:??
Satya Jaswanth Badri and
Mukesh Saini and
Neeraj Goel Mapi-Pro: an Energy Efficient Memory
Mapping Technique for Intermittent
Computing . . . . . . . . . . . . . . . 58:1--58:??
Miao Yu and
Tingting Xiang and
Venkata Pavan Kumar Miriyala and
Trevor E. Carlson Multiply-and-Fire: an Event-Driven
Sparse Neural Network Accelerator . . . 59:1--59:??
Ziaul Choudhury and
Anish Gulati and
Suresh Purini FlowPix: Accelerating Image Processing
Pipelines on an FPGA Overlay using a
Domain Specific Compiler . . . . . . . . 60:1--60:??
Zachary Susskind and
Aman Arora and
Igor D. S. Miranda and
Alan T. L. Bacellar and
Luis A. Q. Villon and
Rafael F. Katopodis and
Leandro S. de Araújo and
Diego L. C. Dutra and
Priscila M. V. Lima and
Felipe M. G. França and
Mauricio Breternitz Jr. and
Lizy K. John ULEEN: a Novel Architecture for
Ultra-low-energy Edge Neural Networks 61:1--61:??
Jia Wei and
Xingjun Zhang and
Longxiang Wang and
Zheng Wei Fastensor: Optimise the Tensor I/O Path
from SSD to GPU for Deep Learning
Training . . . . . . . . . . . . . . . . 62:1--62:??
Longfei Luo and
Dingcui Yu and
Yina Lv and
Liang Shi Critical Data Backup with Hybrid
Flash-Based Consumer Devices . . . . . . 1:1--1:??
Peng Chen and
Hui Chen and
Weichen Liu and
Linbo Long and
Wanli Chang and
Nan Guan DAG-Order: an Order-Based Dynamic DAG
Scheduling for Real-Time
Networks-on-Chip . . . . . . . . . . . . 2:1--2:??
Zhang Jiang and
Ying Chen and
Xiaoli Gong and
Jin Zhang and
Wenwen Wang and
Pen-Chung Yew JiuJITsu: Removing Gadgets with Safe
Register Allocation for JIT Code
Generation . . . . . . . . . . . . . . . 3:1--3:??
Hayfa Tayeb and
Ludovic Paillat and
Bérenger Bramas Autovesk: Automatic Vectorized Code
Generation from Unstructured Static
Kernels Using Graph Transformations . . 4:1--4:??
Xueying Wang and
Guangli Li and
Zhen Jia and
Xiaobing Feng and
Yida Wang Fast Convolution Meets Low Precision:
Exploring Efficient Quantized Winograd
Convolution on Modern CPUs . . . . . . . 5:1--5:??
Hao Fan and
Yiliang Ye and
Shadi Ibrahim and
Zhuo Huang and
Xingru Li and
Weibin Xue and
Song Wu and
Chen Yu and
Xuanhua Shi and
Hai Jin QoS-pro: a QoS-enhanced Transaction
Processing Framework for Shared SSDs . . 6:1--6:??
Yunping Zhao and
Sheng Ma and
Heng Liu and
Libo Huang and
Yi Dai SAC: an Ultra-Efficient Spin-based
Architecture for Compressed DNNs . . . . 7:1--7:??
Tong-Yu Liu and
Jianmei Guo and
Bo Huang Efficient Cross-platform Multiplexing of
Hardware Performance Counters via
Adaptive Grouping . . . . . . . . . . . 8:1--8:??
Lei Liu and
Xinglei Dou QuCloud+: a Holistic Qubit Mapping
Scheme for Single/Multi-programming on
2D/3D NISQ Quantum Computers . . . . 9:1--9:??
Lingxi Wu and
Minxuan Zhou and
Weihong Xu and
Ashish Venkat and
Tajana Rosing and
Kevin Skadron Abakus: Accelerating k-mer Counting
with Storage Technology . . . . . . . . 10:1--10:??
Seokwon Kang and
Jongbin Kim and
Gyeongyong Lee and
Jeongmyung Lee and
Jiwon Seo and
Hyungsoo Jung and
Yong Ho Song and
Yongjun Park ISP Agent: a Generalized
In-storage-processing Workload
Offloading Framework by Providing
Multiple Optimization Opportunities . . 11:1--11:??
Prasoon Mishra and
V. Krishna Nandivada COWS for High Performance: Cost Aware
Work Stealing for Irregular Parallel
Loop . . . . . . . . . . . . . . . . . . 12:1--12:??
Joongun Park and
Seunghyo Kang and
Sanghyeon Lee and
Taehoon Kim and
Jongse Park and
Youngjin Kwon and
Jaehyuk Huh Hardware-hardened Sandbox Enclaves for
Trusted Serverless Computing . . . . . . 13:1--13:??
Tyler Allen and
Bennett Cooper and
Rong Ge Fine-grain Quantitative Analysis of
Demand Paging in Unified Virtual Memory 14:1--14:??
Zhonghua Wang and
Yixing Guo and
Kai Lu and
Jiguang Wan and
Daohui Wang and
Ting Yao and
Huatao Wu Rcmp: Reconstructing RDMA-Based Memory
Disaggregation via CXL . . . . . . . . . 15:1--15:??
Linbo Long and
Shuiyong He and
Jingcheng Shen and
Renping Liu and
Zhenhua Tan and
Congming Gao and
Duo Liu and
Kan Zhong and
Yi Jiang WA-Zone: Wear-Aware Zone Management
Optimization for LSM-Tree on ZNS SSDs 16:1--16:??
Zhihua Fan and
Wenming Li and
Zhen Wang and
Yu Yang and
Xiaochun Ye and
Dongrui Fan and
Ninghui Sun and
Xuejun An Improving Utilization of Dataflow Unit
for Multi-Batch Processing . . . . . . . 17:1--17:??
Dunbo Zhang and
Qingjie Lang and
Ruoxi Wang and
Li Shen Extension VM: Interleaved Data Layout in
Vector Memory . . . . . . . . . . . . . 18:1--18:??
Can Firtina and
Kamlesh Pillai and
Gurpreet S. Kalsi and
Bharathwaj Suresh and
Damla Senol Cali and
Jeremie S. Kim and
Taha Shahroodi and
Meryem Banu Cavlak and
Joël Lindegger and
Mohammed Alser and
Juan Gómez Luna and
Sreenivas Subramoney and
Onur Mutlu ApHMM: Accelerating Profile Hidden
Markov Models for Fast and
Energy-efficient Genome Analysis . . . . 19:1--19:??
Khalid Ahmad and
Cris Cecka and
Michael Garland and
Mary Hall Exploring Data Layout for Sparse Tensor
Times Dense Matrix on GPUs . . . . . . . 20:1--20:??
Chandra Sekhar Mummidi and
Victor C. Ferreira and
Sudarshan Srinivasan and
Sandip Kundu Highly Efficient Self-checking Matrix
Multiplication on Tiled AMX Accelerators 21:1--21:??
Zhonghua Wang and
Chen Ding and
Fengguang Song and
Kai Lu and
Jiguang Wan and
Zhihu Tan and
Changsheng Xie and
Guokuan Li WIPE: a Write-Optimized Learned Index
for Persistent Memory . . . . . . . . . 22:1--22:??
Gino A. Chacon and
Charles Williams and
Johann Knechtel and
Ozgur Sinanoglu and
Paul V. Gratz and
Vassos Soteriou Coherence Attacks and Countermeasures in
Interposer-based Chiplet Systems . . . . 23:1--23:??
Yan Wei and
Zhang Xingjun A Concise Concurrent B+-Tree for
Persistent Memory . . . . . . . . . . . 24:1--24:??
Fareed Qararyah and
Muhammad Waqar Azhar and
Pedro Trancoso An Efficient Hybrid Deep Learning
Accelerator for Compact and
Heterogeneous CNNs . . . . . . . . . . . 25:1--25:??
Fernando Fernandes Dos Santos and
Luigi Carro and
Flavio Vella and
Paolo Rech Assessing the Impact of Compiler
Optimizations on GPUs Reliability . . . 26:1--26:??
Valentin Isaac-Chassande and
Adrian Evans and
Yves Durand and
Frédéric Rousseau Dedicated Hardware Accelerators for
Processing of Sparse Matrices and
Vectors: a Survey . . . . . . . . . . . 27:1--27:??
Benyi Xie and
Yue Yan and
Chenghao Yan and
Sicheng Tao and
Zhuangzhuang Zhang and
Xinyu Li and
Yanzhi Lan and
Xiang Wu and
Tianyi Liu and
Tingting Zhang and
Fuxin Zhang An Instruction Inflation Analyzing
Framework for Dynamic Binary Translators 28:1--28:??
Samuel Rac and
Mats Brorsson Cost-aware Service Placement and
Scheduling in the Edge-Cloud Continuum 29:1--29:??
Feng Xue and
Chenji Han and
Xinyu Li and
Junliang Wu and
Tingting Zhang and
Tianyi Liu and
Yifan Hao and
Zidong Du and
Qi Guo and
Fuxin Zhang Tyche: an Efficient and General
Prefetcher for Indirect Memory Accesses 30:1--30:??
Kunpeng Xie and
Ye Lu and
Xinyu He and
Dezhi Yi and
Huijuan Dong and
Yao Chen Winols: a Large-Tiling Sparse Winograd
CNN Accelerator on FPGAs . . . . . . . . 31:1--31:??
Ke Liu and
Kan Wu and
Hua Wang and
Ke Zhou and
Peng Wang and
Ji Zhang and
Cong Li SLAP: Segmented Reuse-Time-Label Based
Admission Policy for Content Delivery
Network Caching . . . . . . . . . . . . 32:1--32:??
Panagiotis Miliadis and
Dimitris Theodoropoulos and
Dionisios Pnevmatikatos and
Nectarios Koziris Architectural Support for Sharing,
Isolating and Virtualizing FPGA
Resources . . . . . . . . . . . . . . . 33:1--33:??
Haitao Du and
Yuhan Qin and
Song Chen and
Yi Kang FASA-DRAM: Reducing DRAM Latency with
Destructive Activation and Delayed
Restoration . . . . . . . . . . . . . . 34:1--34:??
Michael Canesche and
Vanderson Rosário and
Edson Borin and
Fernando Quintão Pereira The Droplet Search Algorithm for Kernel
Scheduling . . . . . . . . . . . . . . . 35:1--35:??
Asmita Pal and
Keerthana Desai and
Rahul Chatterjee and
Joshua San Miguel Camouflage: Utility-Aware Obfuscation
for Accurate Simulation of Sensitive
Program Traces . . . . . . . . . . . . . 36:1--36:??
Chengying Huan and
Yongchao Liu and
Heng Zhang and
Shuaiwen Song and
Santosh Pandey and
Shiyang Chen and
Xiangfei Fang and
Yue Jin and
Baptiste Lepers and
Yanjun Wu and
Hang Liu TEA+: a Novel Temporal Graph Random Walk
Engine with Hybrid Storage Architecture 37:1--37:??
Soojin Hwang and
Daehyeon Baek and
Jongse Park and
Jaehyuk Huh Cerberus: Triple Mode Acceleration of
Sparse Matrix and Vector Multiplication 38:1--38:??
Siddhartha Raman Sundara Raman and
Lizy John and
Jaydeep P. Kulkarni NEM-GNN: DAC/ADC-less, Scalable,
Reconfigurable, Graph and Sparsity-Aware
Near-Memory Accelerator for Graph Neural
Networks . . . . . . . . . . . . . . . . 39:1--39:??
Yan Chen and
Qiwen Ke and
Huiba Li and
Yongwei Wu and
Yiming Zhang xMeta: SSD-HDD-hybrid Optimization for
Metadata Maintenance of Cloud-scale
Object Storage . . . . . . . . . . . . . 40:1--40:??
Vidush Singhal and
Laith Sakka and
Kirshanthan Sundararajah and
Ryan Newton and
Milind Kulkarni Orchard: Heterogeneous Parallelism and
Fine-grained Fusion for Complex Tree
Traversals . . . . . . . . . . . . . . . 41:1--41:??
Hajar Falahati and
Mohammad Sadrosadati and
Qiumin Xu and
Juan Gómez-Luna and
Banafsheh Saber Latibari and
Hyeran Jeon and
Shaahin Hesaabi and
Hamid Sarbazi-Azad and
Onur Mutlu and
Murali Annavaram and
Masoud Pedram Cross-core Data Sharing for
Energy-efficient GPUs . . . . . . . . . 42:1--42:??
Ching-Jui Lee and
Tsung Tai Yeh ReSA: Reconfigurable Systolic Array for
Multiple Tiny DNN Tensors . . . . . . . 43:1--43:??
Ziheng Wang and
Xiaoshe Dong and
Yan Kang and
Heng Chen and
Qiang Wang An Example of Parallel Merkle Tree
Traversal: Post-Quantum Leighton--Micali
Signature on the GPU . . . . . . . . . . 44:1--44:??
Jiang Wu and
Zhuo Zhang and
Deheng Yang and
Jianjun Xu and
Jiayu He and
Xiaoguang Mao Knowledge-Augmented Mutation-Based Bug
Localization for Hardware Design Code 45:1--45:??
Chen Ding and
Jian Zhou and
Kai Lu and
Sicen Li and
Yiqin Xiong and
Jiguang Wan and
Ling Zhan D$^2$Comp: Efficient Offload of LSM-tree
Compaction with Data Processing Units on
Disaggregated Storage . . . . . . . . . 46:1--46:??
Zhuohao Wang and
Lei Liu and
Limin Xiao iSwap: a New Memory Page Swap Mechanism
for Reducing Ineffective I/O Operations
in Cloud Environments . . . . . . . . . 47:1--47:??
Junkaixuan Li and
Yi Kang GraphSER: Distance-Aware Stream-Based
Edge Repartition for Many-Core Systems 48:1--48:??
Ke Wu and
Dezun Dong and
Weixia Xu COER: a Network Interface Offloading
Architecture for RDMA and Congestion
Control Protocol Codesign . . . . . . . 49:1--49:??
Qunyou Liu and
Darong Huang and
Luis Costero and
Marina Zapater and
David Atienza Intermediate Address Space: virtual
memory optimization of heterogeneous
architectures for cache-resident
workloads . . . . . . . . . . . . . . . 50:1--50:??
Dongmoon Min and
Ilkwon Byun and
Gyu-Hyeon Lee and
Jangwoo Kim CoolDC: a Cost-Effective
Immersion-Cooled Datacenter with
Workload-Aware Temperature Scaling . . . 51:1--51:??
Hai Zhou and
Dan Feng Stripe-schedule Aware Repair in
Erasure-coded Clusters with
Heterogeneous Star Networks . . . . . . 52:1--52:??
Bobin Deng and
Bhargava Nadendla and
Kun Suo and
Yixin Xie and
Dan Chia-Tien Lo Fixed-point Encoding and Architecture
Exploration for Residue Number Systems 53:1--53:??
Yizhuo Wang and
Fangli Chang and
Bingxin Wei and
Jianhua Gao and
Weixing Ji Optimization of Sparse Matrix
Computation for Algebraic Multigrid on
GPUs . . . . . . . . . . . . . . . . . . 54:1--54:??
Luming Wang and
Xu Zhang and
Songyue Wang and
Zhuolun Jiang and
Tianyue Lu and
Mingyu Chen and
Siwei Luo and
Keji Huang Asynchronous Memory Access Unit:
Exploiting Massive Parallelism for Far
Memory Access . . . . . . . . . . . . . 55:1--55:??
Yunping Zhao and
Sheng Ma and
Hengzhu Liu and
Dongsheng Li SAL: Optimizing the Dataflow of
Spin-based Architectures for Lightweight
Neural Networks . . . . . . . . . . . . 56:1--56:??
Kai Lu and
Siqi Zhao and
Haikang Shan and
Qiang Wei and
Guokuan Li and
Jiguang Wan and
Ting Yao and
Huatao Wu and
Daohui Wang Scythe: a Low-latency RDMA-enabled
Distributed Transaction System for
Disaggregated Memory . . . . . . . . . . 57:1--57:??
Wangqi Peng and
Yusen Li and
Xiaoguang Liu and
Gang Wang Lavender: an Efficient Resource
Partitioning Framework for Large-Scale
Job Colocation . . . . . . . . . . . . . 58:1--58:??
Feng Zhang and
Fulin Nan and
Binbin Xu and
Zhirong Shen and
Jiebin Zhai and
Dmitrii Kaplun and
Jiwu Shu Achieving Tunable Erasure Coding with
Cluster-Aware Redundancy Transitioning 59:1--59:??
Ataberk Olgun and
F. Nisa Bostanci and
Geraldo Francisco de Oliveira Junior and
Yahya Can Tugrul and
Rahul Bera and
Abdullah Giray Yaglikci and
Hasan Hassan and
Oguz Ergin and
Onur Mutlu Sectored DRAM: a Practical
Energy-Efficient and High-Performance
Fine-Grained DRAM Architecture . . . . . 60:1--60:??
Xiaohui Wei and
Chenyang Wang and
Hengshan Yue and
Jingweijia Tan and
Zeyu Guan and
Nan Jiang and
Xinyang Zheng and
Jianpeng Zhao and
Meikang Qiu ReIPE: Recycling Idle PEs in CNN
Accelerator for Vulnerable Filters
Soft-Error Detection . . . . . . . . . . 61:1--61:??
Qiao Li and
Yu Chen and
Guanyu Wu and
Yajuan Du and
Min Ye and
Xinbiao Gan and
Jie Zhang and
Zhirong Shen and
Jiwu Shu and
Chun Xue Characterizing and Optimizing LDPC
Performance on $3$D NAND Flash Memories 62:1--62:??
Jiahong Xu and
Haikun Liu and
Zhuohui Duan and
Xiaofei Liao and
Hai Jin and
Xiaokang Yang and
Huize Li and
Cong Liu and
Fubing Mao and
Yu Zhang ReHarvest: an ADC Resource-Harvesting
Crossbar Architecture for ReRAM-Based
DNN Accelerators . . . . . . . . . . . . 63:1--63:??
Jiang Wu and
Zhuo Zhang and
Deheng Yang and
Jianjun Xu and
Jiayu He and
Xiaoguang Mao Time-Aware Spectrum-Based Bug
Localization for Hardware Design Code
with Data Purification . . . . . . . . . 64:1--64:??
Zhuoran Song and
Zhongkai Yu and
Xinkai Song and
Yifan Hao and
Li Jiang and
Naifeng Jing and
Xiaoyao Liang Environmental Condition Aware
Super-Resolution Acceleration Framework
in Server--Client Hierarchies . . . . . 65:1--65:??
Georgia Antoniou and
Davide Bartolini and
Haris Volos and
Marios Kleanthous and
Zhe Wang and
Kleovoulos Kalaitzidis and
Tom Rollet and
Ziwei Li and
Onur Mutlu and
Yiannakis Sazeides and
Jawad Haj Yahya Agile C-states: a Core C-state
Architecture for Latency Critical
Applications Optimizing both Transition
and Cold-Start Latency . . . . . . . . . 66:1--66:??
Xinbiao Gan and
Tiejun Li and
Feng Xiong and
Bo Yang and
Xinhai Chen and
Chunye Gong and
Shijie Li and
Kai Lu and
Qiao Li and
Yiming Zhang MST: Topology-Aware Message Aggregation
for Exascale Graph Processing of
Traversal-Centric Algorithms . . . . . . 67:1--67:??
Yujie Cui and
Wei Chen and
Xu Cheng and
Jiangfang Yi Hyperion: a Highly Effective Page and PC
Based Delta Prefetcher . . . . . . . . . 68:1--68:??
Jianhua Gao and
Weixing Ji and
Yizhuo Wang Optimization of Large-Scale Sparse
Matrix--Vector Multiplication on
Multi-GPU Systems . . . . . . . . . . . 69:1--69:??
Zhengding Hu and
Jingwei Sun and
Zhongyang Li and
Guangzhong Sun AG-SpTRSV: an Automatic Framework to
Optimize Sparse Triangular Solve on GPUs 70:1--70:??
Wenbo Zhang and
Yiqi Liu and
Tianhao Zang and
Zhenshan Bao EA4RCA: Efficient AIE accelerator design
framework for regular
Communication-Avoiding Algorithm . . . . 71:1--71:??
Arun Thangamani and
Vincent Loechner and
Stéphane Genaud A Survey of General-purpose Polyhedral
Compilers . . . . . . . . . . . . . . . 72:1--72:??
Junqing Lin and
Jingwei Sun and
Xiaolong Shi and
Honghe Zhang and
Xianzhi Yu and
Xinzhi Wang and
Jun Yao and
Guangzhong Sun LO-SpMM: Low-cost Search for
High-performance SpMM Kernels on GPUs 73:1--73:??
Chenglong Yi and
Jintong Liu and
Shenggang Wan and
Juntao Fang and
Bin Sun and
Liqiang Zhang Data Deduplication Based on Content
Locality of Transactions to Enhance
Blockchain Scalability . . . . . . . . . 74:1--74:??
Joshua Dennis Booth and
Phillip Lane A NUMA-Aware Version of an Adaptive
Self-Scheduling Loop Scheduler . . . . . 75:1--75:??
Yu Tang and
Qiao Li and
Lujia Yin and
Dongsheng Li and
Yiming Zhang and
Chenyu Wang and
Xingcheng Zhang and
Linbo Qiao and
Zhaoning Zhang and
Kai Lu DELTA: Memory-Efficient Training via
Dynamic Fine-Grained Recomputation and
Swapping . . . . . . . . . . . . . . . . 76:1--76:??
Zhenhua Tan and
Linbo Long and
Jingcheng Shen and
Renping Liu and
Congming Gao and
Kan Zhong and
Yi Jiang Optimizing Garbage Collection for ZNS
SSDs via In-storage Data Migration and
Address Remapping . . . . . . . . . . . 77:1--77:??
Xiang Li and
Qiong Chang and
Aolong Zha and
Shijie Chang and
Yun Li and
Jun Miyazaki An Optimized GPU Implementation for GIST
Descriptor . . . . . . . . . . . . . . . 78:1--78:??
Xiaobo Lu and
Jianbin Fang and
Lin Peng and
Chun Huang and
Zidong Du and
Yongwei Zhao and
Zheng Wang Mentor: a Memory-Efficient Sparse-dense
Matrix Multiplication Accelerator Based
on Column-Wise Product . . . . . . . . . 79:1--79:??
Yu Feng and
Weikai Lin and
Zihan Liu and
Jingwen Leng and
Minyi Guo and
Han Zhao and
Xiaofeng Hou and
Jieru Zhao and
Yuhao Zhu Potamoi: Accelerating Neural Rendering
via a Unified Streaming Architecture . . 80:1--80:??
Changxi Liu and
Alen Sabu and
Akanksha Chaudhari and
Qingxuan Kang and
Trevor E. Carlson Pac-Sim: Simulation of Multi-threaded
Workloads using Intelligent, Live
Sampling . . . . . . . . . . . . . . . . 81:1--81:??
Saurabh Raje and
Yufan Xu and
Atanas Rountev and
Edward F. Valeev and
P. Sadayappan CoNST: Code Generator for Sparse Tensor
Networks . . . . . . . . . . . . . . . . 82:1--82:??
Danlin Jia and
Geng Yuan and
Yiming Xie and
Xue Lin and
Ningfang Mi A Data-Loader Tunable Knob to Shorten
GPU Idleness for Distributed Deep
Learning . . . . . . . . . . . . . . . . 83:1--83:??
Shaobu Wang and
Guangyan Zhang and
Junyu Wei and
Yang Wang and
Jiesheng Wu and
Qingchao Luo Understanding Silent Data Corruption in
Processors for Mitigating its Effects 84:1--84:??
Yen-Yu Lu and
Chin-Hsien Wu and
Shih-Jen Li and
Cheng-Tze Lee and
Cheng-Yen Wu A Stable Idle Time Detection Platform
for Real I/O Workloads . . . . . . . . . 85:1--85:??
Lingyu Sun and
Xiaofeng Hou and
Chao Li and
Jiacheng Liu and
Xinkai Wang and
Quan Chen and
Minyi Guo $ A^2 $: Towards Accelerator Level
Parallelism for Autonomous Micromobility
Systems . . . . . . . . . . . . . . . . 86:1--86:??
Manojna Sistla and
Yiding Liu and
Xin Fu Towards High Performance QNNs via
Distribution-Based CNOT Gate Reduction 87:1--87:??
Fubing Mao and
Xu Liu and
Yu Zhang and
Haikun Liu and
Xiaofei Liao and
Hai Jin and
Wei Zhang and
Jian Zhou and
Yufei Wu and
Longyu Nie and
Yapu Guo and
Zihan Jiang and
Jingkang Liu PMGraph: Accelerating Concurrent Graph
Queries over Streaming Graphs . . . . . 88:1--88:??
Wentong Li and
Yina Lv and
Longfei Luo and
Yunpeng Song and
Liang Shi Access Characteristic-Guided Remote
Swapping Across Mobile Devices . . . . . 89:1--89:??
Yinan Zhang and
Shun Yang and
Huiqi Hu and
Chengcheng Yang and
Peng Cai and
Xuan Zhou SuccinctKV: a CPU-efficient LSM-tree
Based KV Store with Scan-based
Compaction . . . . . . . . . . . . . . . 90:1--90:??
Siyuan Ma and
Kaustubh Mhatre and
Jian Weng and
Bagus Hanindhito and
Zhengrong Wang and
Tony Nowatzki and
Lizy John and
Aman Arora PIMSAB: a Processing-In-Memory System
with Spatially-Aware Communication and
Bit-Serial-Aware Computation . . . . . . 91:1--91:??
Perry Gibson and
Jose Cano and
Elliot Crowley and
Amos Storkey and
Michael O'Boyle DLAS: a Conceptual Model for
Across-Stack Deep Learning Acceleration 1:1--1:??
Xinbiao Gan GraphService: Topology-aware Constructor
for Large-scale Graph Applications . . . 2:1--2:??
Renjun Zhang and
Tianming Zhang and
Zinuo Cai and
Dongmei Li and
Ruhui Ma and
Rajkumar Buyya MemoriaNova: Optimizing Memory-Aware
Model Inference for Edge Computing . . . 3:1--3:??
Andrea Lepori and
Alexandru Calotoiu and
Torsten Hoefler Iterating Pointers: Enabling Static
Analysis for Loop-based Pointers . . . . 4:1--4:??
Viktor Razilov and
Ipek Gecin and
Emil Matús and
Gerhard Fettweis Conflict Management in Vector Register
Files . . . . . . . . . . . . . . . . . 5:1--5:??
Jingle Xu and
Jiayu Fu and
Lin Gan and
Yaojian Chen and
Zhaoqi Sun and
Zhenchun Huang and
Guangwen Yang Leveraging the Hardware Resources to
Accelerate cryo-EM Reconstruction of
RELION on the New Sunway Supercomputer 6:1--6:??
Yuta Saito and
Kazunori Sakamoto and
Hironori Washizaki and
Yoshiaki Fukazawa Multiple Function Merging for Code Size
Reduction . . . . . . . . . . . . . . . 7:1--7:??
Peihua Zhang and
Chenggang Wu and
Hanzhi Hu and
Lichen Jia and
Mingfan Peng and
Jiali Xu and
Mengyao Xie and
Yuanming Lai and
Yan Kang and
Zhe Wang Shining Light on the Inter-procedural
Code Obfuscation: Keep Pace with
Progress in Binary Diffing . . . . . . . 8:1--8:??
Dengke Han and
Mingyu Yan and
Xiaochun Ye and
Dongrui Fan Characterizing and Understanding HGNN
Training on GPUs . . . . . . . . . . . . 9:1--9:??
Jingyu Wang and
Ruilong Ma and
Xiang Yang and
Qi Qi and
Zirui Zhuang and
Jing Wang and
Jianxin Liao and
Song Guo DeepZoning: Re-accelerate CNN Inference
with Zoning Graph for Heterogeneous Edge
Cluster . . . . . . . . . . . . . . . . 10:1--10:??
Chenghao Ouyang and
Jinhan Xin and
Siqi Zeng and
Guohui Li and
Jianjun Li and
Zhibin Yu Constructing a Supplementary Benchmark
Suite to Represent Android Applications
with User Interactions by using
Performance Counters . . . . . . . . . . 11:1--11:??
Xinglei Dou and
Lei Liu and
Limin Xiao An Intelligent Scheduling Approach on
Mobile OS for Optimizing UI Smoothness
and Power . . . . . . . . . . . . . . . 12:1--12:??
Kwanghoon Choi and
Igjae Kim and
Sunho Lee and
Jaehyuk Huh ShieldCXL: a Practical Obliviousness
Support with Sealed CXL Memory . . . . . 13:1--13:??
Yun Chen and
Ali Hajiabadi and
Romain Poussier and
Yaswanth Tavva and
Andreas Diavastos and
Shivam Bhasin and
Trevor E. Carlson PARADISE: Criticality-Aware Instruction
Reordering for Power Attack Resistance 14:1--14:??
Chunfeng Li and
Feng Shi and
Fei Yin and
Karim Soliman and
Jin Wei A High Scalability Memory NoC with
Shared-Inside Hierarchical-Groupings for
Triplet-Based Many-Core Architecture . . 15:1--15:??
Jin Zhao and
Yu Zhang and
Donghao He and
Qikun Li and
Weihang Yin and
Hui Yu and
Hao Qi and
Xiaofei Liao and
Hai Jin and
Haikun Liu and
Linchen Yu and
Zhang Zhan An Efficient ReRAM-based Accelerator for
Asynchronous Iterative Graph Processing 16:1--16:??
Xinyu Li and
Guangyao Guo and
Yanzhi Lan and
Feng Xue and
Chenji Han and
Gen Niu and
Fuxin Zhang Tiaozhuan: a General and Efficient
Indirect Branch Optimization for Binary
Translation . . . . . . . . . . . . . . 17:1--17:??
Jianhua Gao and
Zeming Liu and
Yizhuo Wang and
Weixing Ji RaNAS: Resource-Aware Neural
Architecture Search for Edge Computing 18:1--18:??
Adnan Hasnat and
Shoaib Akram SPIRIT: Scalable and Persistent
In-Memory Indices for Real-Time Search 19:1--19:??
Dezhong Yao and
Sifan Zhao and
Tongtong Liu and
Gang Wu and
Hai Jin ApSpGEMM: Accelerating Large-scale
SpGEMM with Heterogeneous Collaboration
and Adaptive Panel . . . . . . . . . . . 20:1--20:??
Weiduo Chen and
Xiaoshe Dong and
Fan Zhang and
Bowen Li and
Yufei Wang and
Qiang Wang ATP: Achieving Throughput Peak for DNN
Training via Smart GPU Memory Management 21:1--21:??
Zhuoran Song and
Jiabei Long and
Li Jiang and
Naifeng Jing and
Xiaoyao Liang GCNTrain+: a Versatile and Efficient
Accelerator for Graph Convolutional
Neural Network Training . . . . . . . . 22:1--22:??
Wenjie Qi and
Zhipeng Tan and
Ziyue Zhang and
Ying Yuan and
Dan Feng exZNS: Extending Zoned Namespace to
Support Byte-loggable Zones . . . . . . 23:1--23:??
Long Zheng and
Bing Zhu and
Pengcheng Yao and
Yuhang Zhou and
Chengao Pan and
Wenju Zhao and
Xiaofei Liao and
Hai Jin and
Jingling Xue PRAGA: a Priority-Aware
Hardware/Software Co-design for
High-Throughput Graph Processing
Acceleration . . . . . . . . . . . . . . 24:1--24:??
Yingshuai Dong and
Chencheng Ye and
Haikun Liu and
Liting Tang and
Xiaofei Liao and
Hai Jin and
Cheng Chen and
Yanjiang Li and
Yi Wang DTAP: Accelerating Strongly-Typed
Programs with Data Type-Aware Hardware
Prefetching . . . . . . . . . . . . . . 25:1--25:??
Xueliang Wei and
Dan Feng and
Wei Tong and
Bing Wu and
Xu Jiang COVER: Alleviating Crash-Consistency
Error Amplification in Secure Persistent
Memory Systems . . . . . . . . . . . . . 26:1--26:??
Xinqi Chen and
Erci Xu and
Dengyao Mo and
Ruiming Lu and
Haonan Wu and
Dian Ding and
Guangtao Xue MasterPlan: a Reinforcement Learning
Based Scheduler for Archive Storage . . 27:1--27:??
Brandon Kammerdiener and
J. Zach McMichael and
Michael Jantz and
Kshitij Doshi and
Terry Jones Flexible and Effective Object Tiering
for Heterogeneous Memory Systems . . . . 28:1--28:??
Zhiqiang Chen and
Yongwen Wang and
Hongwei Zhou and
Jian Zhang Steered Bubble: an Interposer-based
Deadlock Recovery Algorithm for
Multi-chiplet Systems . . . . . . . . . 29:1--29:??
Shruthi Karunakar and
Rajshekar Kalayappan and
Sandeep Chandran Consequence-based Clustered Architecture 30:1--30:??
Jiahui Yang and
Fulin Nan and
Zhirong Shen and
Zhisheng Chen and
Yuhui Cai and
Dmitrii Kaplun and
Xiaoli Wang and
Quanqing Xu and
Chuanhui Yang and
Jiwu Shu TPRepair: Tree-based Pipelined Repair in
Clustered Storage Systems . . . . . . . 31:1--31:25
Jianrong Yan and
Wenbin Jiang and
Dongao He and
Suyang Wen and
Yang Li and
Hai Jin and
Zhiyuan Shao RT-GNN: Accelerating Sparse Graph Neural
Networks by Tensor-CUDA Kernel Fusion 32:1--32:27
Yi Dai and
Kai Lu and
Sheng Ma and
Jinshu Su and
Dongsheng Li Bubble-Swap Flow Control . . . . . . . . 33:1--33:26
Dongjie Tang and
Zijun Wu and
Yun Wang and
Yicheng Gu and
Fangxin Liu and
Zhengwei Qi gCom: Fine-grained Compressors in
Graphics Memory of Mobile GPU . . . . . 34:1--34:25
Ruixing Zong and
Jiapeng Zhang and
Zhuo Tang and
Kenli Li IBing: an Efficient Interleaved
Bidirectional Ring All-Reduce Algorithm
for Gradient Synchronization . . . . . . 35:1--35:23
Quancheng Wang and
Ming Tang and
Ke Xu and
Han Wang Unveiling and Evaluating Vulnerabilities
in Branch Predictors via a Three-Step
Modeling Methodology . . . . . . . . . . 36:1--36:26
Pengyu Yang and
Weihao Cui and
Chunyu Xue and
Han Zhao and
Chen Chen and
Quan Chen and
Jing Yang and
Minyi Guo Taming Flexible Job Packing in Deep
Learning Training Clusters . . . . . . . 37:1--37:24
Zhenlin Wu and
Haosong Zhao and
Hongyuan Liu and
Wujie Wen and
Jiajia Li gHyPart: GPU-friendly End-to-End
Hypergraph Partitioner . . . . . . . . . 38:1--38:25
Mariano Benito and
Enrique Vallejo and
Ramón Beivide LIA: Latency-Improved Adaptive routing
for Dragonfly networks . . . . . . . . . 39:1--39:26
Yiming Gan and
Jingwen Leng and
Bo Yu and
Yuhao Zhu KINDRED: Heterogeneous Split-Lock
Architecture for Safe Autonomous
Machines . . . . . . . . . . . . . . . . 40:1--40:25
Tzung-Han Juang and
Christophe Dubach Maximizing Data and Hardware Reuse for
HLS with Early-Stage Symbolic
Partitioning . . . . . . . . . . . . . . 41:1--41:26
Cheng Xu and
Chao Li and
Xiaofeng Hou and
Junyi Mei and
Jing Wang and
Pengyu Wang and
Shixuan Sun and
Minyi Guo and
Baoping Hao Enhancing High-Throughput GPU Random
Walks Through Multi-Task Concurrency
Orchestration . . . . . . . . . . . . . 42:1--42:26
Qiong Chang and
Weimin Wang and
Jun Miyazaki Accelerating Nearest Neighbor Search in
3D Point Cloud Registration on GPUs . . 43:1--43:24
Yekang Zhan and
Xiangrui Yang and
Haichuan Hu and
Qiang Cao and
Yifan Zhang and
Jie Yao AIS: an Active Idleness I/O Scheduler to
Reduce Buffer-Exhausted Degradation of
Solid-State Drives . . . . . . . . . . . 44:1--44:26
Coby Soss and
Aravind Sukumaran Rajam and
Janet Layne and
Edoardo Serra and
Mahantesh Halappanavar and
Assefaw H. Gebremedhin ScaWL: Scaling $k$-WL
(Weisfeiler--Lehman) Algorithms in
Memory and Performance on Shared and
Distributed-Memory Systems . . . . . . . 45:1--45:25
Yiming Wang and
Weizhe Zhang and
Meng Hao and
Weizhi Kong and
Yuan Wen Dynamic Power Management Through
Multi-agent Deep Reinforcement Learning
for Heterogeneous Systems . . . . . . . 46:1--46:??
Xinyuan Wang and
Xingchen Li and
Yun Peng and
Hejiao Huang Comprehensive Evaluation and Opportunity
Discovery for Deterministic Concurrency
Control . . . . . . . . . . . . . . . . 47:1--47:??
Théophile Bastian and
Hugo Pompougnac and
Alban Dutilleul and
Fabrice Rastello CesASMe and Staticdeps: static detection
of memory-carried dependencies for code
analyzers . . . . . . . . . . . . . . . 48:1--48:??
Fuyu Wang and
Minghua Shen and
Yutong Lu and
Nong Xiao Ceiba: an Efficient and Scalable DNN
Scheduler for Spatial Accelerators . . . 49:1--49:??
Kelun Lei and
Shaokang Du and
Xin You and
Hailong Yang and
Zhongzhi Luan and
Yi Liu and
Depei Qian Exploiting Dynamic Regular Patterns in
Irregular Programs for Efficient
Vectorization . . . . . . . . . . . . . 50:1--50:??
Xueying Wang and
Shigang Li and
Hao Qian and
Fan Luo and
Zhaoyang Hao and
Tong Wu and
Ruiyuan Xu and
Huimin Cui and
Xiaobing Feng and
Guangli Li and
Jingling Xue OptiFX: Automatic Optimization for
Convolutional Neural Networks with
Aggressive Operator Fusion on GPUs . . . 51:1--51:??
Yifu He and
Han Zhao and
Weihao Cui and
Shulai Zhang and
Quan Chen and
Minyi Guo ARACHNE: Optimizing Distributed Parallel
Applications with Reduced Inter-Process
Communication . . . . . . . . . . . . . 52:1--52:??
Kailin Yang and
José F. Martínez VersaTile: Flexible Tiled Architectures
via Associative Processors . . . . . . . 53:1--53:??
Changqing Shi and
Yufei Sun and
Rui Chen and
Jiahao Wang and
Qiang Guo and
Chunye Gong and
Yicheng Sui and
Yutong Jin and
Yuzhi Zhang TransCL: an Automatic CUDA-to-OpenCL
Programs Transformation Framework . . . 54:1--54:??
Zhibo Xuan and
Xin You and
Tianyu Feng and
Hailong Yang and
Zhongzhi Luan and
Yi Liu and
Depei Qian SimTrace: Exploiting Spatial and
Temporal Sampling for Large-Scale
Performance Analysis . . . . . . . . . . 55:1--55:??
Congyong Chen and
Shengan Zheng and
Yuhang Zhang and
Linpeng Huang FusionFS: a Contention-Resilient File
System for Persistent CPU Caches . . . . 56:1--56:??
Jingcheng Shen and
Lang Yang and
Linbo Long and
Zhenhua Tan and
Congming Gao and
Kan Zhong and
Masao Okita and
Fumihiko Ino Overlapping Aware Data Placement
Optimizations for LSM Tree-Based Store
on ZNS SSDs . . . . . . . . . . . . . . 57:1--57:??
Minghua Shen and
Aoxiang Qin and
Nong Xiao ODGS: Dependency-Aware Scheduling for
High-Level Synthesis with Graph Neural
Network and Reinforcement Learning . . . 58:1--58:??
Gaoyang Zhao and
Qiuran Li and
Rongzhen Lin and
Yaohua Wang Shift-CIM: In-SRAM Alignment To Support
General-Purpose Bit-level Sparsity
Exploration in SRAM Multiplication . . . 59:1--59:??
Xin Cheng and
Jinpeng Ye and
Haoyu Deng and
Tingting Zhang and
Tianyi Liu and
Jian Wang LitTLS: Lightweight Thread-Level
Speculation on Little Cores . . . . . . 60:1--60:??
Chaoyang Jia and
Jingyu Liu and
Shi Chen and
Kai Lu and
Li Shen TSN Cache: Exploiting Data Localities in
Graph Computing Applications . . . . . . 61:1--61:??
Shantian Qin and
Zhihua Fan and
Wenming Li and
Zhen Wang and
Xuejun An and
Xiaochun Ye and
Dongrui Fan PANDA: Adaptive Prefetching and
Decentralized Scheduling for Dataflow
Architectures . . . . . . . . . . . . . 62:1--62:??
Yu Tang and
Lujia Yin and
Qiao Li and
Hongyu Zhu and
Hengjie Li and
Xingcheng Zhang and
Linbo Qiao and
Dongsheng Li and
Jiaxin Li Koala: Efficient Pipeline Training
through Automated Schedule Searching on
Domain-Specific Language . . . . . . . . 63:1--63:??
Yuting Li and
Yun Xu and
Pengcheng Wang and
Yonghui Xu and
Weiguang Wang A Lock-free RDMA-friendly Index in
CPU-parsimonious Environments . . . . . 64:1--64:??
Xueliang Wei and
Dan Feng and
Wei Tong and
Bing Wu and
Xu Jiang SEED: Speculative Security Metadata
Updates for Low-Latency Secure Memory 65:1--65:??
Xiaobo Lu and
Jianbin Fang and
Lin Peng and
Chun Huang and
Zixiao Yu and
Tiejun Li Gator: Accelerating Graph Attention
Networks by Jointly Optimizing Attention
and Graph Processing . . . . . . . . . . 66:1--66:??
Yacine Hakimi and
Riyadh Baghdadi and
Yacine Challal Supporting Dynamic Program Sizes in Deep
Learning-Based Cost Models for Code
Optimization . . . . . . . . . . . . . . 67:1--67:??
Yicheng Wang and
Lijie Xu and
Tian Guo and
Wensheng Dou and
Hongbin Zeng and
Wei Wang and
Jun Wei and
Tao Huang BridgeGC: an Efficient Cross-Level
Garbage Collector for Big Data
Frameworks . . . . . . . . . . . . . . . 68:1--68:??
Zhen Du and
Ying Liu and
Ninghui Sun and
Huimin Cui and
Xiaobing Feng and
Jiajia Li SRSparse: Generating Codes for
High-Performance Sparse Matrix-Vector
Semiring Computations . . . . . . . . . 69:1--69:??
Chenji Han and
Zifei Zhang and
Feng Xue and
Xinyu Li and
Yuxuan Wu and
Tingting Zhang and
Tianyi Liu and
Qi Guo and
Fuxin Zhang SnsBooster: Enhancing Sampling-based $
\mu $ Arch Evaluation Efficiency through
Online Performance Sensitivity Analysis 70:1--70:??
Amit Tiwari and
V. Krishna Nandivada Unleashing Parallelism with
Elastic-Barriers . . . . . . . . . . . . 71:1--71:??
Gia Bao Thieu and
Sven Gesper and
Guillermo Payá-Vayá DCMA: Accelerating Parallel DMA
Transfers with a Multi-Port Direct
Cached Memory Access in a
Massive-Parallel Vector Processor . . . 72:1--72:??
Aurélie Saulquin and
Mazdak Fatahi and
Pierre Boulet and
Samy Meftali ModNEF: an Open Source Modular
Neuromorphic Emulator for FPGA for
Low-Power In-Edge Artificial
Intelligence . . . . . . . . . . . . . . 73:1--73:??
Zhengding Hu and
Jingwei Sun and
Guangzhong Sun GNNPilot: a Holistic Framework for
High-Performance Graph Neural Network
Computations on GPUs . . . . . . . . . . 74:1--74:??
Jinghao Zhao and
Hongwei Yang and
Meng Hao and
Weizhe Zhang and
Hui He and
Desheng Wang HEngine: a High Performance Optimization
Framework on a GPU for Homomorphic
Encryption . . . . . . . . . . . . . . . 75:1--75:??
Wen Cheng and
Qianya Cheng and
Yi Liu and
Lingfang Zeng and
Andre Brinkmann and
Yang Wang 9Ring: a $3$D-Stacked Memory-Based
Accelerator for Flexible and Efficient
Deep CNN Applications . . . . . . . . . 76:1--76:26
Cunchen Hu and
Heyang Huang and
Liangliang Xu and
Xusheng Chen and
Chenxi Wang and
Jiang Xu and
Shuang Chen and
Hao Feng and
Sa Wang and
Yungang Bao and
Ninghui Sun and
Yizhou Shan ShuffleInfer: Disaggregate LLM Inference
for Mixed Downstream Workloads . . . . . 77:1--77:24
Suchita Pati and
Shaizeen Aga and
Nuwan Jayasena and
Matthew Sinclair GOLDYLOC: Global Optimizations &
Lightweight Dynamic Logic for
Concurrency . . . . . . . . . . . . . . 78:1--78:28
Yi Zhang and
Xiaomeng Yi and
Yu Huang and
Jingrui Yuan and
Chuangyi Gui and
Dan Chen and
Long Zheng and
Jianhui Yue and
Xiaofei Liao and
Hai Jin and
Jingling Xue Cheetah: Accelerating Dynamic Graph
Mining with Grouping Updates . . . . . . 79:1--79:26
Manolis Katsaragakis and
Christos Baloukas and
Lazaros Papadopoulos and
Francky Catthoor and
Dimitrios Soudris Performance, Energy and NVM
Lifetime-Aware Data Structure Refinement
and Placement for Heterogeneous Memory
Systems . . . . . . . . . . . . . . . . 80:1--80:27
Farui Wang and
Meng Hao and
Siyu Yang and
Weizhe Zhang Deep Learning Workload Mapping
Optimization on Jetson Platforms . . . . 81:1--81:23
Wenlong Mu and
Yue Tang and
Bo Huang and
Jianmei Guo AOBO: a Fast-Switching Online Binary
Optimizer on AArch64 . . . . . . . . . . 82:1--82:27
Konrad Moron and
Stefan Wallentowitz Benchmarking WebAssembly for Embedded
Systems . . . . . . . . . . . . . . . . 83:1--83:21
Qian Xiong and
Weiliang Ma and
Xuanhua Shi and
Yongluan Zhou and
Hai Jin and
Kaiyi Huang and
Haozhou Wang and
Zhengru Wang gECC: a GPU-based high-throughput
framework for Elliptic Curve
Cryptography . . . . . . . . . . . . . . 84:1--84:27
Haomin Li and
Fangxin Liu and
Zongwu Wang and
Ning Yang and
Shiyuan Huang and
Xiaoyao Liang and
Haibing Guan and
Li Jiang Attack and Defense: Enhancing Robustness
of Binary Hyper-Dimensional Computing 85:1--85:25
Chris Kjellqvist and
Lisa Wills and
Alvin Lebeck BigLittleMCA: a Spatially-Optimal Tiled
Hardware Accelerator for MCMC Image
Processing . . . . . . . . . . . . . . . 86:1--86:26
Chaoyang Jia and
Dunbo Zhang and
Qingjie Lang and
Ruoxi Wang and
Li Shen In-SRAM Parallel Data Shuffle . . . . . 87:1--87:24
Xinglei Dou and
Lei Liu and
Zhuohao Wang and
Pengyu Li LarQucut: a New Cutting and Mapping
Approach for Large-sized Quantum
Circuits in Distributed Quantum
Computing (DQC) Environments . . . . . . 88:1--88:24
Hao Ding and
Peiling Song and
Yelin Li and
Junyan Qian A Two-Stage Degradation-Based Topology
Reconfiguration Algorithm for
Fault-Tolerant Multiprocessor Arrays . . 89:1--89:26
Xiang Li and
Qiong Chang and
Yun Li and
Jun Miyazaki $3$D GNLM: Efficient $3$D Non-Local
Means Kernel with Nested Reuse
Strategies for Embedded GPUs . . . . . . 90:1--90:22
Yiming Sun and
Jie Zhang and
Huawei Cao and
Yuan Zhang and
Xuejun An and
Junying Huang and
Xiaochun Ye CGCGraph: Efficient CPU-GPU Co-execution
for Concurrent Dynamic Graph Processing 91:1--91:26
Zhanyuan Di and
Leping Wang and
Zhaojia Ma and
En Shao and
Jie Zhao and
Ziyi Ren and
Siyuan Feng and
Dingwen Tao and
Guangming Tan and
Ninghui Sun Accelerating Parallel Structures in DNNs
via Parallel Fusion and Operator
Co-Optimization . . . . . . . . . . . . 92:1--92:26
Ruihao Li and
Bagus Hanindhito and
Sanjana Yadav and
Qinzhe Wu and
Krishna Kavi and
Gayatri Mehta and
Neeraja J. Yadwadkar and
Lizy K. John Performance Implications of Pipelining
the Data Transfer in CPU-GPU
Heterogeneous Systems . . . . . . . . . 93:1--93:26
Haozhong Qiu and
Chuanfu Xu and
Jianbin Fang and
Jian Zhang and
Liang Deng and
Zhe Dai and
Yue Ding and
Yue Wang and
Zhimeng Han and
Yonggang Che and
Jie Liu DCSolver: Accelerating Sparse Iterative
Solvers via Divide-and-Conquer on GPUs 94:1--94:25
Yachun Liu and
Dan Feng and
Jianxi Chen and
Jing Hu and
Zhouxuan Peng and
Jinlei Hu ZNSFQ: an Efficient and High-Performance
Fair Queue Scheduling Scheme for ZNS
SSDs . . . . . . . . . . . . . . . . . . 95:1--95:27
Omar Shaaban Ibrahim Ali and
Juliette Fournis d'Albiat and
Isabel Piedrahita and
Vicenç Beltran and
Xavier Martorell and
Paul Carpenter and
Eduard Ayguadé and
Jesus Labarta Leveraging iterative applications to
improve the scalability of task-based
programming models on distributed
systems . . . . . . . . . . . . . . . . 96:1--96:27
Suhong Lee and
Boyeal Kim and
Yongseok Choi and
Hyuk-Jae Lee HopScotch: a Holistic Approach to Data
Layout-Aware Mapping on NPUs for
High-Performance DNN Inference . . . . . 97:1--97:26
Qiliang Li and
Min Lyu and
Tian Liu and
Liangliang Xu and
Wei Wang and
Yinlong Xu MetaEC: an Efficient and Resilient
Erasure-Coded KV Store on Disaggregated
Memory . . . . . . . . . . . . . . . . . 98:1--98:26
Han Zhao and
Weihao Cui and
Quan Chen and
Zijun Li and
Zhenhua Han and
Nan Wang and
Yu Feng and
Jieru Zhao and
Chen Chen and
Jingwen Leng and
Minyi Guo EDAS: Enabling Fast Data Loading for GPU
Serverless Computing . . . . . . . . . . 99:1--99:23
Mary Hall and
Cosmin E. Oancea and
Anne C. Elster and
Ari Rasch and
Sameeran Joshi and
Amir Mohammad Tavakkoli and
Richard Schulze Scheduling Language Chronology: Past,
Present, and Future . . . . . . . . . . 100:1--100:31
Zhibing Sha and
Shuaiwen Yu and
Chengyong Tang and
Zhigang Cai and
Peng Tang and
Ming Huang and
Jun Li and
Jianwei Liao Supports of Data Cache Division for
Computational Solid-state Drives . . . . 101:1--101:20
Lingxiao Jin and
Zinuo Cai and
Haoxin Wang and
Zongpu Zhang and
Ruhui Ma and
Haibing Guan and
Yuan Liu and
Buyya Rajkumar Ephemera: Accelerating I/O-Intensive
Serverless Workloads with a Harvested
In-memory File System . . . . . . . . . 102:1--102:24
Yulong Wu and
Yehan Ma and
Mingdong Xie and
Weizhe Zhang Partitioned Scheduling and Analysis for
a Typed DAG Task on Heterogeneous
Multi-Cores . . . . . . . . . . . . . . 103:1--103:24
Wei Niu and
Mengshu Sun and
Zhengang Li and
Jou-An Chen and
Jiexiong Guan and
Xipeng Shen and
Jun Liu and
Mei Zhang and
Yanzhi Wang and
Xue Lin and
Bin Ren Mobile-$3$DCNN: an Acceleration
Framework for Ultra-Real-Time Execution
of Large $3$D CNNs on Mobile Devices . . 104:1--104:22
Yudong Mu and
Zhihua Fan and
Wenming Li and
Zhiyuan Zhang and
Xuejun An and
Dongrui Fan and
Xiaochun Ye GenCNN: a Partition-Aware
Multi-Objective Mapping Framework for
CNN Accelerators Based on Genetic
Algorithm . . . . . . . . . . . . . . . 105:1--105:26
Neel Patel and
Ren Wang and
Mohammad Alian RACER: Avoiding End-to-End Slowdowns in
Accelerated Chip Multi-Processors . . . 106:1--106:22
Ziyue Xu and
Yichen Li and
Ranzhe Deng and
Liping Yi and
Yusen Li and
Gang Wang and
Xiaoguang Liu SampDedup: Sampling Prediction for
Efficient Inline Data Deduplication on
Non-volatile Memory . . . . . . . . . . 107:1--107:25
Hui Sun and
Qianli Yue and
Guanzhong Chen and
Yi Zou and
Yinliang Yue and
Xiao Qin HAKV: a Hotness-Aware Zone Management
Approach to Optimizing Performance of
LSM-tree-based Key-Value Stores . . . . 108:1--108:26
Lixiao Cui and
Kedi Yang and
Yusen Li and
Gang Wang and
Xiaoguang Liu Towards Optimizing Learned Index for
High Performance, Memory Efficiency and
NUMA Awareness . . . . . . . . . . . . . 109:1--109:26
Marcin Copik and
Lukas Möller and
Alexandru Calotoiu and
Torsten Hoefler Cppless: Single-Source and
High-Performance Serverless Programming
in C++ . . . . . . . . . . . . . . . . . 110:1--110:27
Yifan Zhang and
Xiaoyu Niu and
Hongzheng Tian and
Yanjun Zhang and
Bo Yu and
Shaoshan Liu and
Sitao Huang A Sparsity-Aware Autonomous Path
Planning Accelerator with HW\slash SW
Co-Design and Multi-Level Dataflow
Optimization . . . . . . . . . . . . . . 111:1--111:25
Xinbiao Gan TianheGraph: Topology-aware Graph
Processing . . . . . . . . . . . . . . . 112:1--112:24