Last update:
Fri Nov 22 08:32:24 MST 2024
Brad Calder and Dean Tullsen Introduction . . . . . . . . . . . . . . 1--2 W. Zhang and J. S. Hu and V. Degalahal and M. Kandemir and N. Vijaykrishnan and M. J. Irwin Reducing instruction cache energy consumption using a compiler-based strategy . . . . . . . . . . . . . . . . 3--33 Nemanja Isailovic and Mark Whitney and Yatish Patel and John Kubiatowicz and Dean Copsey and Frederic T. Chong and Isaac L. Chuang and Mark Oskin Datapath and control for quantum wires 34--61 Karthikeyan Sankaralingam and Ramadass Nagarajan and Haiming Liu and Changkyu Kim and Jaehyuk Huh and Nitya Ranganathan and Doug Burger and Stephen W. Keckler and Robert G. McDonald and Charles R. Moore TRIPS: a polymorphous architecture for exploiting ILP, TLP, and DLP . . . . . . 62--93 Kevin Skadron and Mircea R. Stan and Karthik Sankaranarayanan and Wei Huang and Sivakumar Velusamy and David Tarjan Temperature-aware microarchitecture: Modeling and implementation . . . . . . 94--125
Alex Aletà and Josep M. Codina and Antonio González and David Kaeli Removing communications in clustered microarchitectures through instruction replication . . . . . . . . . . . . . . 127--151 Yu Bai and R. Iris Bahar A low-power in-order/out-of-order issue queue . . . . . . . . . . . . . . . . . 152--179 Philo Juang and Kevin Skadron and Margaret Martonosi and Zhigang Hu and Douglas W. Clark and Philip W. Diodato and Stefanos Kaxiras Implementing branch-predictor decay using quasi-static memory cells . . . . 180--219 Oliverio J. Santana and Alex Ramirez and Josep L. Larriba-Pey and Mateo Valero A low-complexity fetch architecture for high-performance superscalar processors 220--245
Jin Lin and Tong Chen and Wei-Chung Hsu and Pen-Chung Yew and Roy Dz-Ching Ju and Tin-Fook Ngai and Sun Chan A compiler framework for speculative optimizations . . . . . . . . . . . . . 247--271 Brian A. Fields and Rastislav Bodik and Mark D. Hill and Chris J. Newburn Interaction cost and shotgun profiling 272--304 Karthik Sankaranarayanan and Kevin Skadron Profile-based adaptation for cache decay 305--322 Fen Xie and Margaret Martonosi and Sharad Malik Intraprogram dynamic voltage scaling: Bounding opportunities with analytic modeling . . . . . . . . . . . . . . . . 323--367
A. Hartstein and Thomas R. Puzak The optimum pipeline depth considering both power and performance . . . . . . . 369--388 Adrián Cristal and Oliverio J. Santana and Mateo Valero and José F. Martínez Toward kilo-instruction processors . . . 389--417 Haitham Akkary and Ravi Rajwar and Srikanth T. Srinivasan An analysis of a resource efficient checkpoint architecture . . . . . . . . 418--444 Chia-Lin Yang and Alvin R. Lebeck and Hung-Wei Tseng and Chien-Hao Lee Tolerating memory latency through push prefetching for pointer-intensive applications . . . . . . . . . . . . . . 445--475
Brad Calder and Dean Tullsen Introduction . . . . . . . . . . . . . . 1--2 Yuanyuan Zhou and Pin Zhou and Feng Qin and Wei Liu and Josep Torrellas Efficient and flexible architectural support for dynamic monitoring . . . . . 3--33 Chuanjun Zhang and Frank Vahid and Jun Yang and Walid Najjar A way-halting cache for low-energy high-performance systems . . . . . . . . 34--54 Jaume Abella and Antonio González and Xavier Vera and Michael F. P. O'Boyle IATAC: a smart predictor to turn-off L2 cache lines . . . . . . . . . . . . . . 55--77 John W. Haskins, Jr. and Kevin Skadron Accelerated warmup for sampled microarchitecture simulation . . . . . . 78--108
Tao Li and Ravi Bhargava and Lizy Kurian John Adapting branch-target buffer to improve the target predictability of Java code 109--130 Lingli Zhang and Chandra Krintz The design, implementation, and evaluation of adaptive code unloading for resource-constrained devices . . . . 131--164 Prasad A. Kulkarni and Stephen R. Hines and David B. Whalley and Jason D. Hiser and Jack W. Davidson and Douglas L. Jones Fast and efficient searches for effective optimization-phase sequences 165--198 Esther Salamí and Mateo Valero Dynamic memory interval test vs. interprocedural pointer analysis in multimedia applications . . . . . . . . 199--219
Yan Meng and Timothy Sherwood and Ryan Kastner Exploring the limits of leakage power reduction in caches . . . . . . . . . . 221--246 María Jesús Garzarán and Milos Prvulovic and José María Llabería and Víctor Viñals and Lawrence Rauchwerger and Josep Torrellas Tradeoffs in buffering speculative memory state for thread-level speculation in multiprocessors . . . . . 247--279 David Tarjan and Kevin Skadron Merging path and gshare indexing in perceptron branch prediction . . . . . . 280--300 Xiangyu Zhang and Rajiv Gupta Whole execution traces and their applications . . . . . . . . . . . . . . 301--334
Wankang Zhao and David Whalley and Christopher Healy and Frank Mueller Improving WCET by applying a WC code-positioning optimization . . . . . 335--365 George A. Reis and Jonathan Chang and Neil Vachharajani and Ram Rangan and David I. August and Shubhendu S. Mukherjee Software-controlled fault tolerance . . 366--396 Jian Li and José F. Martínez Power-performance considerations of parallel computing on chip multiprocessors . . . . . . . . . . . . 397--422 Saurabh Sharma and Jesse G. Beu and Thomas M. Conte Spectral prefetcher: An effective mechanism for L2 cache prefetching . . . 423--450
Brad Calder and Dean Tullsen Introduction . . . . . . . . . . . . . . 1--2 Lin Tan and Brett Brotherton and Timothy Sherwood Bit-split string-matching engines for intrusion detection and prevention . . . 3--34 Priya Nagpurkar and Hussam Mousa and Chandra Krintz and Timothy Sherwood Efficient remote profiling for resource-constrained devices . . . . . . 35--66 Jin Lin and Wei-Chung Hsu and Pen-Chung Yew and Roy Dz-Ching Ju and Tin-Fook Ngai Recovery code generation for general speculative optimizations . . . . . . . 67--89 Yoonseo Choi and Hwansoo Han Optimal register reassignment for register stack overflow minimization . . 90--114
Jingling Xue and Qiong Cai A lifetime optimal algorithm for speculative PRE . . . . . . . . . . . . 115--155 Joseph J. Sharkey and Dmitry V. Ponomarev and Kanad Ghose and Oguz Ergin Instruction packing: Toward fast and energy-efficient instruction scheduling 156--181 Luis Ceze and Karin Strauss and James Tuck and Josep Torrellas and Jose Renau CAVA: Using checkpoint-assisted value prediction to hide L2 misses . . . . . . 182--208 Lixin Zhang and Mike Parker and John Carter Efficient address remapping in distributed shared-memory systems . . . 209--229
Min Zhao and Bruce R. Childers and Mary Lou Soffa An approach toward profit-driven optimization . . . . . . . . . . . . . . 231--262 Kim Hazelwood and Michael D. Smith Managing bounded code caches in dynamic binary optimization systems . . . . . . 263--294 Olivier Rochecouste and Gilles Pokam and André Seznec A case for a complexity-effective, width-partitioned microarchitecture . . 295--326 Ahmad Zmily and Christos Kozyrakis Block-aware instruction set architecture 327--357
Jedidiah R. Crandall and S. Felix Wu and Frederic T. Chong Minos: Architectural support for protecting control data . . . . . . . . 359--389 Jaydeep Marathe and Frank Mueller and Bronis R. de Supinski Analysis of cache-coherence bottlenecks with hybrid hardware/software techniques 390--423 Ilya Ganusov and Martin Burtscher Future execution: a prefetching mechanism that uses multiple cores to speed up single threads . . . . . . . . 424--449 Michele Co and Dee A. B. Weikle and Kevin Skadron Evaluating trace cache energy efficiency 450--476 Shiwen Hu and Madhavi Valluri and Lizy Kurian John Effective management of multiple configurable units using dynamic optimization . . . . . . . . . . . . . . 477--501 Chris Bentley and Scott A. Watterson and David K. Lowenthal and Barry Rountree Implicit array bounds checking on 64-bit architectures . . . . . . . . . . . . . 502--527
Brad Calder and Dean Tullsen Introduction . . . . . . . . . . . . . . 1:1--1:1 Kypros Constantinides and Stephen Plaza and Jason Blome and Valeria Bertacco and Scott Mahlke and Todd Austin and Bin Zhang and Michael Orshansky Architecting a reliable CMP switch architecture . . . . . . . . . . . . . . 2:1--2:37 Ruchira Sasanka and Man-Lap Li and Sarita V. Adve and Yen-Kuang Chen and Eric Debes ALP: Efficient support for all levels of parallelism for complex media applications . . . . . . . . . . . . . . 3:1--3:30 Yan Luo and Jia Yu and Jun Yang and Laxmi N. Bhuyan Conserving network processor power consumption by exploiting traffic variability . . . . . . . . . . . . . . 4:1--4:26 Vassos Soteriou and Noel Eisley and Li-Shiuan Peh Software-directed power-aware interconnection networks . . . . . . . . 5:1--5:40 Yuan-Shin Hwang and Jia-Jhe Li Snug set-associative caches: Reducing leakage power of instruction and data caches with no performance penalties . . 6:1--6:28 Hongbo Rong and Zhizhong Tang and R. Govindarajan and Alban Douillet and Guang R. Gao Single-dimension software pipelining for multidimensional loops . . . . . . . . . 7:1--7:44
Fred A. Bower and Daniel J. Sorin and Sule Ozev Online diagnosis of hard faults in microprocessors . . . . . . . . . . . . 8:1--8:?? Pierre Michaud and André Seznec and Damien Fetis and Yiannakis Sazeides and Theofanis Constantinou A study of thread migration in temperature-constrained multicores . . . 9:1--9:?? Yu Chen and Fuxin Zhang Code reordering on limited branch offset 10:1--10:?? A. S. Terechko and H. Corporaal Inter-cluster communication in VLIW architectures . . . . . . . . . . . . . 11:1--11:?? Jialin Dou and Marcelo Cintra A compiler cost model for speculative parallelization . . . . . . . . . . . . 12:1--12:?? Wolfram Amme and Jeffery von Ronne and Michael Franz SSA-based mobile code: Implementation and empirical evaluation . . . . . . . . 13:1--13:??
Xiaodong Li and Ritu Gupta and Sarita V. Adve and Yuanyuan Zhou Cross-component energy management: Joint adaptation of processor and memory . . . 14:1--14:?? Ron Gabor and Shlomo Weiss and Avi Mendelson Fairness enforcement in switch on event multithreading . . . . . . . . . . . . . 15:1--15:?? Diego Andrade and Basilio B. Fraguela and Ramón Doallo Precise automatable analytical modeling of the cache behavior of codes with indirections . . . . . . . . . . . . . . 16:1--16:?? Kris Venstermans and Lieven Eeckhout and Koen De Bosschere Java object header elimination for reduced memory consumption in 64-bit virtual machines . . . . . . . . . . . . 17:1--17:?? Shu Xiao and Edmund M.-K. Lai VLIW instruction scheduling for minimal power variation . . . . . . . . . . . . 18:1--18:?? Sriraman Tallam and Rajiv Gupta Unified control flow and data dependence traces . . . . . . . . . . . . . . . . . 19:1--19:??
Engin Ipek and Sally A. McKee and Karan Singh and Rich Caruana and Bronis R. de Supinski and Martin Schulz Efficient architectural design space exploration via predictive modeling . . 1:1--1:?? Yunhe Shi and Kevin Casey and M. Anton Ertl and David Gregg Virtual machine showdown: Stack versus registers . . . . . . . . . . . . . . . 2:1--2:?? Jun Yan and Wei Zhang Exploiting virtual registers to reduce pressure on real registers . . . . . . . 3:1--3:?? Zoe C. H. Yu and Francis C. M. Lau and Cho-Li Wang Object co-location and memory reuse for Java programs . . . . . . . . . . . . . 4:1--4:?? Chuanjun Zhang Reducing cache misses through programmable decoders . . . . . . . . . 5:1--5:?? Amit Golander and Shlomo Weiss Hiding the misprediction penalty of a resource-efficient high-performance processor . . . . . . . . . . . . . . . 6:1--6:??
Brad Calder and Dean Tullsen Editorial . . . . . . . . . . . . . . . 1:1--1:?? Shashidhar Mysore and Banit Agrawal and Rodolfo Neuber and Timothy Sherwood and Nisheeth Shrivastava and Subhash Suri Formulating and implementing profiling over adaptive ranges . . . . . . . . . . 2:1--2:?? Antonia Zhai and J. Gregory Steffan and Christopher B. Colohan and Todd C. Mowry Compiler and hardware support for reducing the synchronization of speculative threads . . . . . . . . . . 3:1--3:?? Jonathan A. Winter and David H. Albonesi Addressing thermal nonuniformity in SMT workloads . . . . . . . . . . . . . . . 4:1--4:?? Asadollah Shahbahrami and Ben Juurlink and Stamatis Vassiliadis Versatility of extended subwords and the matrix register file . . . . . . . . . . 5:1--5:?? Zhi Guo and Walid Najjar and Betul Buyukkurt Efficient hardware code generation for FPGAs . . . . . . . . . . . . . . . . . 6:1--6:?? Thomas Kotzmann and Christian Wimmer and Hanspeter Mössenböck and Thomas Rodriguez and Kenneth Russell and David Cox Design of the Java HotSpot™ client compiler for Java 6 . . . . . . . . . . 7:1--7:??
Ram Rangan and Neil Vachharajani and Guilherme Ottoni and David I. August Performance scalability of decoupled software pipelining . . . . . . . . . . 8:1--8:?? Jieyi Long and Seda Ogrenci Memik and Gokhan Memik and Rajarshi Mukherjee Thermal monitoring mechanisms for chip multiprocessors . . . . . . . . . . . . 9:1--9:?? Ajay Joshi and Lieven Eeckhout and Robert H. Bell, Jr. and Lizy K. John Distilling the essence of proprietary workloads into miniature benchmarks . . 10:1--10:?? Vincenzo Catania and Maurizio Palesi and Davide Patti Reducing complexity of multiobjective design space exploration in VLIW-based embedded systems . . . . . . . . . . . . 11:1--11:??
Jacob Leverich and Hideho Arakida and Alex Solomatnikov and Amin Firoozshahian and Mark Horowitz and Christos Kozyrakis Comparative evaluation of memory models for chip multiprocessors . . . . . . . . 12:1--12:?? Joseph J. Sharkey and Jason Loew and Dmitry V. Ponomarev Reducing register pressure in SMT processors through L2-miss-driven early register release . . . . . . . . . . . . 13:1--13:?? Mojtaba Mehrara and Todd Austin Exploiting selective placement for low-cost memory protection . . . . . . . 14:1--14:?? Hans Vandierendonck and André Seznec Speculative return address stack management revisited . . . . . . . . . . 15:1--15:??
Siddhartha Chhabra and Brian Rogers and Yan Solihin and Milos Prvulovic Making secure processors OS- and performance-friendly . . . . . . . . . . 16:1--16:?? Daniel A. Jiménez Generalizing neural branch prediction 17:1--17:?? Jinseong Jeon and Keoncheol Shin and Hwansoo Han Abstracting access patterns of dynamic memory using regular expressions . . . . 18:1--18:?? Ghassan Shobaki and Kent Wilken and Mark Heffernan Optimal trace scheduling using enumeration . . . . . . . . . . . . . . 19:1--19:??
Prasad A. Kulkarni and David B. Whalley and Gary S. Tyson and Jack W. Davidson Practical exhaustive optimization phase order exploration and evaluation . . . . 1:1--1:?? Manuel Hohenauer and Felix Engel and Rainer Leupers and Gerd Ascheid and Heinrich Meyr A SIMD optimization framework for retargetable compilers . . . . . . . . . 2:1--2:?? Stijn Eyerman and Lieven Eeckhout Memory-level parallelism aware fetch policies for simultaneous multithreading processors . . . . . . . . . . . . . . . 3:1--3:?? Lukasz Strozek and David Brooks Energy- and area-efficient architectures through application clustering and architectural heterogeneity . . . . . . 4:1--4:??
Guru Venkataramani and Ioannis Doudalis and Yan Solihin and Milos Prvulovic MemTracker: An accelerator for memory debugging and monitoring . . . . . . . . 5:1--5:?? Ron Gabor and Avi Mendelson and Shlomo Weiss Service level agreement for multithreaded processors . . . . . . . . 6:1--6:?? Wilson W. L. Fung and Ivan Sham and George Yuan and Tor M. Aamodt Dynamic warp formation: Efficient MIMD control flow on SIMD graphics hardware 7:1--7:?? Cheng-Kok Koh and Weng-Fai Wong and Yiran Chen and Hai Li Tolerating process variations in large, set-associative caches: The buddy cache 8:1--8:??
Lian Li and Hui Feng and Jingling Xue Compiler-directed scratchpad memory management via graph coloring . . . . . 9:1--9:?? Amit Golander and Shlomo Weiss Checkpoint allocation and release . . . 10:1--10:?? Weifeng Xu and Russell Tessier Tetris-XL: a performance-driven spill reduction technique for embedded VLIW processors . . . . . . . . . . . . . . . 11:1--11:?? Timothy M. Jones and Michael F. P. O'Boyle and Jaume Abella and Antonio González and Oğuz Ergin Exploring the limits of early register release: Exploiting compiler analysis 12:1--12:??
Timothy M. Jones and Michael F. P. O'Boyle and Jaume Abella and Antonio González and Oğuz Ergin Energy-efficient register caching with compiler assistance . . . . . . . . . . 13:1--13:?? Weijia Li and Youtao Zhang and Jun Yang and Jiang Zheng Towards update-conscious compilation for energy-efficient code dissemination in WSNs . . . . . . . . . . . . . . . . . . 14:1--14:?? Michal Wegiel and Chandra Krintz The single-referent collector: Optimizing compaction for the common case . . . . . . . . . . . . . . . . . . 15:1--15:?? Samantika Subramaniam and Gabriel H. Loh Design and optimization of the store vectors memory dependence predictor . . 16:1--16:??
Xiaohang Wang and Mei Yang and Yingtao Jiang and Peng Liu A power-aware mapping approach to map IP cores onto NoCs under bandwidth and latency constraints . . . . . . . . . . 1:1--1:?? Zhong-Ho Chen and Alvin W. Y. Su A hardware/software framework for instruction and data scratchpad memory allocation . . . . . . . . . . . . . . . 2:1--2:?? Dong Hyuk Woo and Joshua B. Fryman and Allan D. Knies and Hsien-Hsin S. Lee Chameleon: Virtualizing idle acceleration cores of a heterogeneous multicore processor for caching and prefetching . . . . . . . . . . . . . . 3:1--3:?? Daniel Sanchez and George Michelogiannakis and Christos Kozyrakis An analysis of on-chip interconnection networks for large-scale chip multiprocessors . . . . . . . . . . . . 4:1--4:?? Xiuyi Zhou and Jun Yang and Marek Chrobak and Youtao Zhang Performance-aware thermal management via task scheduling . . . . . . . . . . . . 5:1--5:??
Arun Raghavan and Colin Blundell and Milo M. K. Martin Token tenure and PATCH: a predictive/adaptive token-counting hybrid . . . . . . . . . . . . . . . . . 6:1--6:?? Christian Wimmer and Hanspeter Mössenböck Automatic feedback-directed object fusing . . . . . . . . . . . . . . . . . 7:1--7:?? Benjamin C. Lee and David Brooks Applied inference: Case studies in microarchitectural design . . . . . . . 8:1--8:?? R. Rakvic and Q. Cai and J. González and G. Magklis and P. Chaparro and A. González Thread-management techniques to maximize efficiency in multicore and simultaneous multithreaded microprocessors . . . . . 9:1--9:?? Derek Pao and Wei Lin and Bin Liu A memory-efficient pipelined implementation of the Aho--Corasick string-matching algorithm . . . . . . . 10:1--10:?? Xuejun Yang and Ying Zhang and Xicheng Lu and Jingling Xue and Ian Rogers and Gen Li and Guibin Wang and Xudong Fang Exploiting the reuse supplied by loop-dependent stream references for stream processors . . . . . . . . . . . 11:1--11:?? Vijay Janapa Reddi and Simone Campanoni and Meeta S. Gupta and Michael D. Smith and Gu-Yeon Wei and David Brooks and Kim Hazelwood Eliminating voltage emergencies via software-guided code transformations . . 12:1--12:??
Qin Zhao and Ioana Cutcutache and Weng-Fai Wong PiPA: Pipelined profiling and analysis on multicore systems . . . . . . . . . . 13:1--13:?? Fei Guo and Yan Solihin and Li Zhao and Ravishankar Iyer Quality of service shared cache management in chip multiprocessor architecture . . . . . . . . . . . . . . 14:1--14:?? Xiaoxia Wu and Jian Li and Lixin Zhang and Evan Speight and Ram Rajamony and Yuan Xie Design exploration of hybrid caches with disparate memory technologies . . . . . 15:1--15:?? Kornilios Kourtis and Georgios Goumas and Nectarios Koziris Exploiting compression opportunities to improve SpMxV performance on shared memory systems . . . . . . . . . . . . . 16:1--16:??
Betul Buyukkurt and John Cortes and Jason Villarreal and Walid A. Najjar Impact of high-level transformations within the ROCCC framework . . . . . . . 17:1--17:?? Yuan-Shin Hwang and Tzong-Yen Lin and Rong-Guey Chang DisIRer: Converting a retargetable compiler into a multiplatform binary translator . . . . . . . . . . . . . . . 18:1--18:?? Michael Boyer and David Tarjan and Kevin Skadron Federation: Boosting per-thread performance of throughput-oriented manycore architectures . . . . . . . . . 19:1--19:?? Grigori Fursin and Olivier Temam Collective optimization: a practical collaborative approach . . . . . . . . . 20:1--20:?? Fang Liu and Yan Solihin Understanding the behavior and implications of context switch misses 21:1--21:??
Stijn Eyerman and Lieven Eeckhout Fine-grained DVFS using on-chip regulators . . . . . . . . . . . . . . . 1:1--1:?? Chen-Yong Cher and Eren Kursun Exploring the effects of on-chip thermal variation on high-performance multicore architectures . . . . . . . . . . . . . 2:1--2:?? Carole-Jean Wu and Margaret Martonosi Adaptive timekeeping replacement: Fine-grained capacity management for shared CMP caches . . . . . . . . . . . 3:1--3:?? Lucas Vespa and Ning Weng Deterministic finite automata characterization and optimization for scalable pattern matching . . . . . . . 4:1--4:?? Abhishek Bhattacharjee and Gilberto Contreras and Margaret Martonosi Parallelization libraries: Characterizing and reducing overheads 5:1--5:??
Xiangyu Dong and Yuan Xie and Naveen Muralimanohar and Norman P. Jouppi Hybrid checkpointing using emerging nonvolatile memories for future exascale systems . . . . . . . . . . . . . . . . 6:1--6:?? Jianjun Li and Chenggang Wu and Wei-Chung Hsu Efficient and effective misaligned data access handling in a dynamic binary translation system . . . . . . . . . . . 7:1--7:?? Guru Venkataramani and Christopher J. Hughes and Sanjeev Kumar and Milos Prvulovic DeFT: Design space exploration for on-the-fly detection of coherence misses 8:1--8:?? Jason D. Hiser and Daniel W. Williams and Wei Hu and Jack W. Davidson and Jason Mars and Bruce R. Childers Evaluating indirect branch handling mechanisms in software dynamic translation systems . . . . . . . . . . 9:1--9:??
Xi E. Chen and Tor M. Aamodt Hybrid analytical modeling of pending cache hits, data prefetching, and MSHRs 10:1--10:?? Marios Kleanthous and Yiannakis Sazeides CATCH: a mechanism for dynamically detecting cache-content-duplication in instruction caches . . . . . . . . . . . 11:1--11:?? Hans Vandierendonck and André Seznec Managing SMT resource usage through speculative instruction window weighting 12:1--12:?? Po-Han Wang and Chia-Lin Yang and Yen-Ming Chen and Yu-Jung Cheng Power gating strategies on GPUs . . . . 13:1--13:?? Min Feng and Chen Tian and Changhui Lin and Rajiv Gupta Dynamic access distance driven cache replacement . . . . . . . . . . . . . . 14:1--14:?? Ahmad Samih and Yan Solihin and Anil Krishna Evaluating placement policies for managing capacity sharing in CMP architectures with private caches . . . 15:1--15:?? Chang-Ching Yeh and Kuei-Chung Chang and Tien-Fu Chen and Chingwei Yeh Maintaining performance on power gating of microprocessor functional units by using a predictive pre-wakeup strategy 16:1--16:?? Hyunjin Lee and Sangyeun Cho and Bruce R. Childers DEFCAM: a design and evaluation framework for defect-tolerant cache memories . . . . . . . . . . . . . . . . 17:1--17:??
Per Stenström and Koen De Bosschere Introduction to the special issue on high-performance and embedded architectures and compilers . . . . . . 18:1--18:?? Jorge Albericio and Rubén Gran and Pablo Ibáñez and Víctor Viñals and Jose María Llabería ABS: a low-cost adaptive controller for prefetching in a banked shared last-level cache . . . . . . . . . . . . 19:1--19:?? Ali Galip Bayrak and Nikola Velickovic and Paolo Ienne and Wayne Burleson An architecture-independent instruction shuffler to protect against side-channel attacks . . . . . . . . . . . . . . . . 20:1--20:?? John Demme and Simha Sethumadhavan Approximate graph clustering for program characterization . . . . . . . . . . . . 21:1--21:?? Mihai Pricopi and Tulika Mitra Bahurupi: a polymorphic heterogeneous multi-core architecture . . . . . . . . 22:1--22:?? Jeroen V. Cleemput and Bart Coppens and Bjorn De Sutter Compiler mitigations for time attacks on modern x86 processors . . . . . . . . . 23:1--23:?? Jason Mccandless and David Gregg Compiler techniques to improve dynamic branch prediction for indirect jump and call instructions . . . . . . . . . . . 24:1--24:?? Antonio García-Guirado and Ricardo Fernández-Pascual and Alberto Ros and José M. García DAPSCO: Distance-aware partially shared cache organization . . . . . . . . . . . 25:1--25:?? Zhenjiang Wang and Chenggang Wu and Pen-Chung Yew and Jianjun Li and Di Xu On-the-fly structure splitting for heap objects . . . . . . . . . . . . . . . . 26:1--26:?? Dibyendu Das and B. Dupont De Dinechin and Ramakrishna Upadrasta Efficient liveness computation using merge sets and DJ-graphs . . . . . . . . 27:1--27:?? George Patsilaras and Niket K. Choudhary and James Tuck Efficiently exploiting memory level parallelism on asymmetric coupled cores in the dark silicon era . . . . . . . . 28:1--28:?? Roman Malits and Evgeny Bolotin and Avinoam Kolodny and Avi Mendelson Exploring the limits of GPGPU scheduling in control flow bound applications . . . 29:1--29:?? 
Lois Orosa and Elisardo Antelo and Javier D. Bruguera FlexSig: Implementing flexible hardware signatures . . . . . . . . . . . . . . . 30:1--30:?? Ruben Titos-Gil and Manuel E. Acacio and Jose M. Garcia and Tim Harris and Adrian Cristal and Osman Unsal and Ibrahim Hur and Mateo Valero Hardware transactional memory with software-defined conflicts . . . . . . . 31:1--31:?? Yongjoo Kim and Jongeun Lee and Toan X. Mai and Yunheung Paek Improving performance of nested loops on reconfigurable array processors . . . . 32:1--32:?? Madhura Purnaprajna and Paolo Ienne Making wide-issue VLIW processors viable on FPGAs . . . . . . . . . . . . . . . . 33:1--33:?? Petar Radojković and Sylvain Girbal and Arnaud Grasset and Eduardo Quiñones and Sami Yehia and Francisco J. Cazorla On the evaluation of the impact of shared resources in multithreaded COTS processors in time-critical environments 34:1--34:?? Leonid Domnitser and Aamer Jaleel and Jason Loew and Nael Abu-Ghazaleh and Dmitry Ponomarev Non-monopolizable caches: Low-complexity mitigation of cache side channel attacks 35:1--35:?? Alejandro Rico and Felipe Cabarcas and Carlos Villavieja and Milan Pavlovic and Augusto Vega and Yoav Etsion and Alex Ramirez and Mateo Valero On the simulation of large-scale architectures using multiple application abstraction levels . . . . . . . . . . . 36:1--36:?? Selma Saidi and Pranav Tendulkar and Thierry Lepley and Oded Maler Optimizing explicit data transfers for data parallel applications on the Cell architecture . . . . . . . . . . . . . . 37:1--37:?? Min Feng and Changhui Lin and Rajiv Gupta PLDS: Partitioning linked data structures for parallelism . . . . . . . 38:1--38:?? Benoit Pradelle and Alain Ketterlin and Philippe Clauss Polyhedral parallelization of binary code . . . . . . . . . . . . . . . . . . 39:1--39:?? Yaozu Dong and Yu Chen and Zhenhao Pan and Jinquan Dai and Yunhong Jiang ReNIC: Architectural extension to SR-IOV I/O virtualization for efficient replication . . . . . 
. . . . . . . . . 40:1--40:?? Tom M. Bruintjes and Karel H. G. Walters and Sabih H. Gerez and Bert Molenkamp and Gerard J. M. Smit Sabrewing: a lightweight architecture for combined floating-point and integer arithmetic . . . . . . . . . . . . . . . 41:1--41:?? Mario Kicherer and Fabian Nowak and Rainer Buchty and Wolfgang Karl Seamlessly portable applications: Managing the diversity of modern heterogeneous systems . . . . . . . . . 42:1--42:?? Nathanael Premillieu and Andre Seznec SYRANT: SYmmetric Resource Allocation on Not-taken and Taken paths . . . . . . . 43:1--43:?? William Hasenplaugh and Pritpal S. Ahuja and Aamer Jaleel and Simon Steely, Jr. and Joel Emer The gradient-based cache partitioning algorithm . . . . . . . . . . . . . . . 44:1--44:?? Javier Lira and Timothy M. Jones and Carlos Molina and Antonio González The migration prefetcher: Anticipating data promotion in dynamic NUCA caches 45:1--45:?? Kishore Kumar Pusukuri and Rajiv Gupta and Laxmi N. Bhuyan Thread Tranquilizer: Dynamically reducing performance variation . . . . . 46:1--46:?? Dongsong Zhang and Deke Guo and Fangyuan Chen and Fei Wu and Tong Wu and Ting Cao and Shiyao Jin TL-plane-based multi-core energy-efficient real-time scheduling algorithm for sporadic tasks . . . . . . 47:1--47:?? Michael J. Lyons and Mark Hempstead and Gu-Yeon Wei and David Brooks The accelerator store: a shared memory framework for accelerator-based systems 48:1--48:?? Daniel Orozco and Elkin Garcia and Rishi Khan and Kelly Livingston and Guang R. Gao Toward high-throughput algorithms on many-core architectures . . . . . . . . 49:1--49:?? Kevin Stock and Louis-Noël Pouchet and P. Sadayappan Using machine learning to improve automatic vectorization . . . . . . . . 50:1--50:?? Kanit Therdsteerasukdi and Gyungsu Byun and Jason Cong and M. Frank Chang and Glenn Reinman Utilizing RF-I and intelligent scheduling for better throughput/watt in a mobile GPU memory system . . . . . . . 51:1--51:?? 
Frederick Ryckbosch and Stijn Polfliet and Lieven Eeckhout VSim: Simulating multi-server setups at near native hardware speed . . . . . . . 52:1--52:?? Miao Zhou and Yu Du and Bruce Childers and Rami Melhem and Daniel Mossé Writeback-aware partitioning and replacement for last-level caches in phase change main memory systems . . . . 53:1--53:?? Qingping Wang and Sameer Kulkarni and John Cavazos and Michael Spear A transactional memory with automatic performance tuning . . . . . . . . . . . 54:1--54:?? Bartosz Bogdanski and Sven-Arne Reinemo and Frank Olaf Sem-Jacobsen and Ernst Gunnar Gran sFtree: a fully connected and deadlock-free switch-to-switch routing algorithm for fat-trees . . . . . . . . 55:1--55:??
Walid J. Ghandour and Haitham Akkary and Wes Masri Leveraging Strength-Based Dynamic Information Flow Analysis to Enhance Data Value Prediction . . . . . . . . . 1:1--1:?? Jaekyu Lee and Hyesoon Kim and Richard Vuduc When Prefetching Works, When It Doesn't, and Why . . . . . . . . . . . . . . . . 2:1--2:?? Bita Mazloom and Shashidhar Mysore and Mohit Tiwari and Banit Agrawal and Tim Sherwood Dataflow Tomography: Information Flow Tracking For Understanding and Visualizing Full Systems . . . . . . . . 3:1--3:?? Jung Ho Ahn and Norman P. Jouppi and Christos Kozyrakis and Jacob Leverich and Robert S. Schreiber Improving System Energy Efficiency with Memory Rank Subsetting . . . . . . . . . 4:1--4:?? Xuejun Yang and Li Wang and Jingling Xue and Qingbo Wu Comparability Graph Coloring for Optimizing Utilization of Software-Managed Stream Register Files for Stream Processors . . . . . . . . . 5:1--5:?? Abhinandan Majumdar and Srihari Cadambi and Michela Becchi and Srimat T. Chakradhar and Hans Peter Graf A Massively Parallel, Energy Efficient Programmable Accelerator for Learning and Classification . . . . . . . . . . . 6:1--6:??
Stijn Eyerman and Lieven Eeckhout Probabilistic modeling for job symbiosis scheduling on SMT processors . . . . . . 7:1--7:?? Rachid Seghir and Vincent Loechner and Benoît Meister Integer affine transformations of parametric $Z$-polytopes and applications to loop nest optimization 8:1--8:?? Yi Yang and Ping Xiang and Jingfei Kong and Mike Mantor and Huiyang Zhou A unified optimizing compiler framework for different GPGPU architectures . . . 9:1--9:?? Choonki Jang and Jaejin Lee and Bernhard Egger and Soojung Ryu Automatic code overlay generation and partially redundant code fetch elimination . . . . . . . . . . . . . . 10:1--10:?? Zahra Abbasi and Georgios Varsamopoulos and Sandeep K. S. Gupta TACOMA: Server and workload management in Internet data centers considering cooling-computing power trade-off and energy proportionality . . . . . . . . . 11:1--11:?? Andreas Lankes and Thomas Wild and Stefan Wallentowitz and Andreas Herkersdorf Benefits of selective packet discard in networks-on-chip . . . . . . . . . . . . 12:1--12:??
Yangchun Luo and Antonia Zhai Dynamically dispatching speculative threads to improve sequential execution 13:1--13:?? Huimin Cui and Jingling Xue and Lei Wang and Yang Yang and Xiaobing Feng and Dongrui Fan Extendable pattern-oriented optimization directives . . . . . . . . . . . . . . . 14:1--14:?? Adam Wade Lewis and Nian-Feng Tzeng and Soumik Ghosh Runtime energy consumption estimation for server workloads based on chaotic time-series approximation . . . . . . . 15:1--15:?? Alejandro Valero and Julio Sahuquillo and Salvador Petit and Pedro López and José Duato Combining recency of information with selective random and a victim cache in last-level caches . . . . . . . . . . . 16:1--16:?? Bin Li and Li-Shiuan Peh and Li Zhao and Ravi Iyer Dynamic QoS management for chip multiprocessors . . . . . . . . . . . . 17:1--17:?? Polychronis Xekalakis and Nikolas Ioannou and Marcelo Cintra Mixed speculative multithreaded execution models . . . . . . . . . . . . 18:1--18:?? Mageda Sharafeddine and Komal Jothi and Haitham Akkary Disjoint out-of-order execution processor . . . . . . . . . . . . . . . 19:1--19:?? Diego Andrade and Basilio B. Fraguela and Ramón Doallo Static analysis of the worst-case memory performance for irregular codes with indirections . . . . . . . . . . . . . . 20:1--20:?? Yang Chen and Shuangde Fang and Yuanjie Huang and Lieven Eeckhout and Grigori Fursin and Olivier Temam and Chengyong Wu Deconstructing iterative optimization 21:1--21:?? Apala Guha and Kim Hazelwood and Mary Lou Soffa Memory optimization of dynamic binary translators for embedded systems . . . . 22:1--22:?? James R. Geraci and Sharon M. Sacco A transpose-free in-place SIMD optimized FFT . . . . . . . . . . . . . . . . . . 23:1--23:??
Bart Coppens and Bjorn De Sutter and Jonas Maebe Feedback-driven binary code diversification to the special issue on high-performance embedded architectures and compilers . . . . . . . . . . . . . 24:1--24:?? Jeremy Fowers and Greg Brown and John Wernsing and Greg Stitt A performance and energy comparison of convolution on GPUs, FPGAs, and multicore processors . . . . . . . . . . 25:1--25:?? Erven Rohou and Kevin Williams and David Yuste Vectorization technology to improve interpreter performance . . . . . . . . 26:1--26:?? Jimmy Cleary and Owen Callanan and Mark Purcell and David Gregg Fast asymmetric thread synchronization 27:1--27:?? Yong Li and Rami Melhem and Alex K. Jones PS-TLB: Leveraging page classification information for fast, scalable and efficient translation for future CMPs 28:1--28:?? Kristof Du Bois and Stijn Eyerman and Lieven Eeckhout Per-thread cycle accounting in multicore processors . . . . . . . . . . . . . . . 29:1--29:?? Christian Wimmer and Michael Haupt and Michael L. Van De Vanter and Mick Jordan and Laurent Daynès and Douglas Simon Maxine: an approachable virtual machine for, and in, Java . . . . . . . . . . . 30:1--30:?? Malik Khan and Protonu Basu and Gabe Rudy and Mary Hall and Chun Chen and Jacqueline Chame A script-based autotuning compiler system to generate high-performance CUDA code . . . . . . . . . . . . . . . . . . 31:1--31:?? Kenzo Van Craeynest and Lieven Eeckhout Understanding fundamental design choices in single-ISA heterogeneous multicore architectures . . . . . . . . . . . . . 32:1--32:?? Samuel Antão and Leonel Sousa The CRNS framework and its application to programmable and reconfigurable cryptography . . . . . . . . . . . . . . 33:1--33:?? Boubacar Diouf and Can Hantas and Albert Cohen and Özcan Özturk and Jens Palsberg A decoupled local memory allocator . . . 34:1--34:?? Huimin Cui and Qing Yi and Jingling Xue and Xiaobing Feng Layout-oblivious compiler optimization for matrix computations . . . . . . . . 
35:1--35:?? Stephen Dolan and Servesh Muralidharan and David Gregg Compiler support for lightweight context switching . . . . . . . . . . . . . . . 36:1--36:?? Pablo Abad and Valentin Puente and Jose-Angel Gregorio LIGERO: a light but efficient router conceived for cache-coherent chip multiprocessors . . . . . . . . . . . . 37:1--37:?? Jorge Albericio and Pablo Ibáñez and Víctor Viñals and Jose María Llabería Exploiting reuse locality on inclusive shared last-level caches . . . . . . . . 38:1--38:?? Paraskevas Yiapanis and Demian Rosas-Ham and Gavin Brown and Mikel Luján Optimizing software runtime systems for speculative parallelization . . . . . . 39:1--39:?? Cedric Nugteren and Pieter Custers and Henk Corporaal Algorithmic species: a classification of affine loop nests for parallel programming . . . . . . . . . . . . . . 40:1--40:?? Marco E. T. Gerards and Jan Kuper Optimal DPM and DVFS for frame-based real-time systems . . . . . . . . . . . 41:1--41:?? Zhichao Yan and Hong Jiang and Yujuan Tan and Dan Feng An integrated pseudo-associativity and relaxed-order approach to hardware transactional memory . . . . . . . . . . 42:1--42:?? Doris Chen and Deshanand Singh Profile-guided floating- to fixed-point conversion for hybrid FPGA-processor applications . . . . . . . . . . . . . . 43:1--43:?? Yan Cui and Yingxin Wang and Yu Chen and Yuanchun Shi Lock-contention-aware scheduler: a scalable and energy-efficient method for addressing scalability collapse on multicore systems . . . . . . . . . . . 44:1--44:?? Kishore Kumar Pusukuri and Rajiv Gupta and Laxmi N. Bhuyan ADAPT: a framework for coscheduling multithreaded programs . . . . . . . . . 45:1--45:?? Michele Tartara and Stefano Crespi Reghizzi Continuous learning of compiler heuristics . . . . . . . . . . . . . . . 46:1--46:?? 
Grigorios Chrysos and Panagiotis Dagritzikos and Ioannis Papaefstathiou and Apostolos Dollas HC-CART: a parallel system implementation of data mining classification and regression tree (CART) algorithm on a multi-FPGA system 47:1--47:?? Jongwon Lee and Yohan Ko and Kyoungwoo Lee and Jonghee M. Youn and Yunheung Paek Dynamic code duplication with vulnerability awareness for soft error detection on VLIW architectures . . . . 48:1--48:?? Fabien Coelho and François Irigoin API compilation for image hardware accelerators . . . . . . . . . . . . . . 49:1--49:?? Carlos Luque and Miquel Moreto and Francisco J. Cazorla and Mateo Valero Fair CPU time accounting in CMP+SMT processors . . . . . . . . . . . . . . . 50:1--50:?? Pavlos M. Mattheakis and Ioannis Papaefstathiou Significantly reducing MPI intercommunication latency and power overhead in both embedded and HPC systems . . . . . . . . . . . . . . . . 51:1--51:?? Riyadh Baghdadi and Albert Cohen and Sven Verdoolaege and Konrad Trifunović Improved loop tiling based on the removal of spurious false dependences 52:1--52:?? Antoniu Pop and Albert Cohen OpenStream: Expressiveness and data-flow compilation of OpenMP streaming programs 53:1--53:?? Sven Verdoolaege and Juan Carlos Juega and Albert Cohen and José Ignacio Gómez and Christian Tenllado and Francky Catthoor Polyhedral parallel code generation for CUDA . . . . . . . . . . . . . . . . . . 54:1--54:?? Yu Du and Miao Zhou and Bruce Childers and Rami Melhem and Daniel Mossé Delta-compressed caching for overcoming the write bandwidth limitation of hybrid main memory . . . . . . . . . . . . . . 55:1--55:?? Suresh Purini and Lakshya Jain Finding good optimization sequences covering program space . . . . . . . . . 56:1--56:?? Mehmet E. Belviranli and Laxmi N. Bhuyan and Rajiv Gupta A dynamic self-scheduling scheme for heterogeneous multiprocessor architectures . . . . . . . . . . . . . 57:1--57:?? 
Anurag Negi and Ruben Titos-Gil SCIN-cache: Fast speculative versioning in multithreaded cores . . . . . . . . . 58:1--58:?? Thibaut Lutz and Christian Fensch and Murray Cole PARTANS: an autotuning framework for stencil computation on multi-GPU systems 59:1--59:?? Chunhua Xiao and M-C. Frank Chang and Jason Cong and Michael Gill and Zhangqin Huang and Chunyue Liu and Glenn Reinman and Hao Wu Stream arbitration: Towards efficient bandwidth utilization for emerging on-chip interconnects . . . . . . . . . 60:1--60:??
Yunji Chen and Tianshi Chen and Ling Li and Ruiyang Wu and Daofu Liu and Weiwu Hu Deterministic Replay Using Global Clock 1:1--1:?? Daniel Lustig and Abhishek Bhattacharjee and Margaret Martonosi TLB Improvements for Chip Multiprocessors: Inter-Core Cooperative Prefetchers and Shared Last-Level TLBs 2:1--2:?? Rong Chen and Haibo Chen Tiled-MapReduce: Efficient and Flexible MapReduce Processing on Multicore with Tiling . . . . . . . . . . . . . . . . . 3:1--3:?? Michela Becchi and Patrick Crowley A-DFA: a Time- and Space-Efficient DFA Compression Algorithm for Fast Regular Expression Evaluation . . . . . . . . . 4:1--4:26 Sheng Li and Jung Ho Ahn and Richard D. Strong and Jay B. Brockman and Dean M. Tullsen and Norman P. Jouppi The McPAT Framework for Multicore and Manycore Architectures: Simultaneously Modeling Power, Area, and Timing . . . . 5:1--5:??
Angeliki Kritikakou and Francky Catthoor and George S. Athanasiou and Vasilios Kelefouras and Costas Goutis Near-Optimal Microprocessor and Accelerators Codesign with Latency and Throughput Constraints . . . . . . . . . 6:1--6:?? Lei Jiang and Yu Du and Bo Zhao and Youtao Zhang and Bruce R. Childers and Jun Yang Hardware-Assisted Cooperative Integration of Wear-Leveling and Salvaging for Phase Change Memory . . . 7:1--7:?? Kyuseung Han and Junwhan Ahn and Kiyoung Choi Power-Efficient Predication Techniques for Acceleration of Control Flow Execution on CGRA . . . . . . . . . . . 8:1--8:?? Chao Wang and Xi Li and Junneng Zhang and Xuehai Zhou and Xiaoning Nie MP-Tomasulo: a Dependency-Aware Automatic Parallel Execution Engine for Sequential Programs . . . . . . . . . . 9:1--9:??
Anonymous TACO Reviewers 2012 . . . . . . . . . . 9:1--9:?? Eran Shifer and Shlomo Weiss Low-latency adaptive mode transitions and hierarchical power management in asymmetric clustered cores . . . . . . . 10:1--10:?? Yosi Ben Asher and Nadav Rotem Hybrid type legalization for a sparse SIMD instruction set . . . . . . . . . . 11:1--11:?? Yuanwu Lei and Yong Dou and Lei Guo and Jinbo Xu and Jie Zhou and Yazhuo Dong and Hongjian Li VLIW coprocessor for IEEE-754 quadruple-precision elementary functions 12:1--12:?? Motohiro Kawahito and Hideaki Komatsu and Takao Moriyama and Hiroshi Inoue and Toshio Nakatani Idiom recognition framework using topological embedding . . . . . . . . . 13:1--13:?? Ghassan Shobaki and Maxim Shawabkeh and Najm Eldeen Abu Rmaileh Preallocation instruction scheduling with register pressure minimization using a combinatorial optimization approach . . . . . . . . . . . . . . . . 14:1--14:?? Dongrui She and Yifan He and Henk Corporaal An energy-efficient method of supporting flexible special instructions in an embedded processor with compact ISA . . 15:1--15:?? V. Krishna Nandivada and Rajkishore Barik Improved bitwidth-aware variable packing 16:1--16:?? Jung Ho Ahn and Young Hoon Son and John Kim Scalable high-radix router microarchitecture using a network switch organization . . . . . . . . . . . . . . 17:1--17:?? Libo Huang and Zhiying Wang and Nong Xiao and Yongwen Wang and Qiang Dou Adaptive communication mechanism for accelerating MPI functions in NoC-based multicore processors . . . . . . . . . . 18:1--18:?? Avinash Malik and David Gregg Orchestrating stream graphs using model checking . . . . . . . . . . . . . . . . 19:1--19:?? Zheng Wang and Michael F. P. O'Boyle Using machine learning to partition streaming programs . . . . . . . . . . . 20:1--20:?? Ali Bakhoda and John Kim and Tor M. Aamodt Designing on-chip networks for throughput accelerators . . . . . . . . 21:1--21:??
Michael R. Jantz and Prasad A. Kulkarni Exploring single and multilevel JIT compilation policy for modern machines 22:1--22:?? Xiangyu Dong and Norman P. Jouppi and Yuan Xie A circuit-architecture co-optimization framework for exploring nonvolatile memory hierarchies . . . . . . . . . . . 23:1--23:?? Jishen Zhao and Guangyu Sun and Gabriel H. Loh and Yuan Xie Optimizing GPU energy efficiency with $3$D die-stacking graphics memory and reconfigurable memory interface . . . . 24:1--24:?? Chien-Chi Chen and Sheng-De Wang An efficient multicharacter transition string-matching engine based on the Aho--Corasick algorithm . . . . . . . . 25:1--25:?? Yangchun Luo and Wei-Chung Hsu and Antonia Zhai The design and implementation of heterogeneous multicore systems for energy-efficient speculative thread execution . . . . . . . . . . . . . . . 26:1--26:?? Dyer Rolán and Basilio B. Fraguela and Ramón Doallo Virtually split cache: an efficient mechanism to distribute instructions and data . . . . . . . . . . . . . . . . . 27:1--27:?? Samantika Subramaniam and Simon C. Steely and Will Hasenplaugh and Aamer Jaleel and Carl Beckmann and Tryggve Fossum and Joel Emer Using in-flight chains to build a scalable cache coherence protocol . . . 28:1--28:?? Daniel Sánchez and Yiannakis Sazeides and Juan M. Cebrián and José M. García and Juan L. Aragón Modeling the impact of permanent faults in caches . . . . . . . . . . . . . . . 29:1--29:?? Sanghoon Lee and James Tuck Automatic parallelization of fine-grained metafunctions on a chip multiprocessor . . . . . . . . . . . . . 30:1--30:?? Christophe Dubach and Timothy M. Jones and Edwin V. Bonilla Dynamic microarchitectural adaptation using machine learning . . . . . . . . . 31:1--31:?? Long Chen and Yanan Cao and Zhao Zhang E$^3$CC: a memory error protection scheme with novel address mapping for subranked and low-power memories . . . . 32:1--32:?? Yingying Tian and Samira M. Khan and Daniel A. 
Jiménez Temporal-based multilevel correlating inclusive cache replacement . . . . . . 33:1--33:?? Qixiao Liu and Miquel Moreto and Victor Jimenez and Jaume Abella and Francisco J. Cazorla and Mateo Valero Hardware support for accurate per-task energy metering in multicore systems . . 34:1--34:?? Sanyam Mehta and Gautham Beeraka and Pen-Chung Yew Tile size selection revisited . . . . . 35:1--35:?? Bogdan Prisacari and German Rodriguez and Cyriel Minkenberg and Torsten Hoefler Fast pattern-specific routing for fat tree networks . . . . . . . . . . . . . 36:1--36:?? Maximilien B. Breughe and Lieven Eeckhout Selecting representative benchmark inputs for exploring microprocessor design spaces . . . . . . . . . . . . . 37:1--37:?? Christoph Kerschbaumer and Eric Hennigan and Per Larsen and Stefan Brunthaler and Michael Franz Information flow tracking meets just-in-time compilation . . . . . . . . 38:1--38:?? Rupesh Nasre Time- and space-efficient flow-sensitive points-to analysis . . . . . . . . . . . 39:1--39:?? Wenjia Ruan and Yujie Liu and Michael Spear Boosting timestamp-based transactional memory by exploiting hardware cycle counters . . . . . . . . . . . . . . . . 40:1--40:?? Tanima Dey and Wei Wang and Jack W. Davidson and Mary Lou Soffa ReSense: Mapping dynamic workloads of colocated multithreaded applications using resource sensitivity . . . . . . . 41:1--41:?? Adrià Armejach and Ruben Titos-Gil and Anurag Negi and Osman S. Unsal and Adrián Cristal Techniques to improve performance in requester-wins hardware transactional memory . . . . . . . . . . . . . . . . . 42:1--42:?? Myeongjae Jeon and Conglong Li and Alan L. Cox and Scott Rixner Reducing DRAM row activations with eager read/write clustering . . . . . . . . . 43:1--43:?? Zhijia Zhao and Michael Bebenita and Dave Herman and Jianhua Sun and Xipeng Shen HPar: a practical parallel parser for HTML --- taming HTML complexities for parallel parsing . . . . . . . . . . . . 44:1--44:?? 
Ehsan Totoni and Mert Dikmen and María Jesús Garzarán Easy, fast, and energy-efficient object detection on heterogeneous on-chip architectures . . . . . . . . . . . . . 45:1--45:?? Viacheslav V. Fedorov and Sheng Qiu and A. L. Narasimha Reddy and Paul V. Gratz ARI: Adaptive LLC-memory traffic management . . . . . . . . . . . . . . . 46:1--46:?? Cecilia González-Álvarez and Jennifer B. Sartor and Carlos Álvarez and Daniel Jiménez-González and Lieven Eeckhout Accelerating an application domain with specialized functional units . . . . . . 47:1--47:?? Xiaolin Wang and Lingmei Weng and Zhenlin Wang and Yingwei Luo Revisiting memory management on virtualized environments . . . . . . . . 48:1--48:?? Chuntao Jiang and Zhibin Yu and Hai Jin and Chengzhong Xu and Lieven Eeckhout and Wim Heirman and Trevor E. Carlson and Xiaofei Liao PCantorSim: Accelerating parallel architecture simulation through fractal-based sampling . . . . . . . . . 49:1--49:?? Srdan Stipić and Vesna Smiljković and Osman Unsal and Adrián Cristal and Mateo Valero Profile-guided transaction coalescing-lowering transactional overheads by merging transactions . . . 50:1--50:?? Zhe Wang and Shuchang Shan and Ting Cao and Junli Gu and Yi Xu and Shuai Mu and Yuan Xie and Daniel A. Jiménez WADE: Writeback-aware dynamic cache management for NVM-based main memory system . . . . . . . . . . . . . . . . . 51:1--51:?? Yong Li and Yaojun Zhang and Hai Li and Yiran Chen and Alex K. Jones C1C: a configurable, compiler-guided STT-RAM L1 cache . . . . . . . . . . . . 52:1--52:?? Naznin Fauzia and Venmugil Elango and Mahesh Ravishankar and J. Ramanujam and Fabrice Rastello and Atanas Rountev and Louis-Noël Pouchet and P. Sadayappan Beyond reuse distance analysis: Dynamic analysis for characterization of data locality potential . . . . . . . . . . . 53:1--53:?? 
Alen Bardizbanyan and Magnus Själander and David Whalley and Per Larsson-Edefors Designing a practical data filter cache to improve both energy efficiency and performance . . . . . . . . . . . . . . 54:1--54:?? Andrei Hagiescu and Bing Liu and R. Ramanathan and Sucheendra K. Palaniappan and Zheng Cui and Bipasa Chattopadhyay and P. S. Thiagarajan and Weng-Fai Wong GPU code generation for ODE-based applications with phased shared-data access patterns . . . . . . . . . . . . 55:1--55:?? Junghee Lee and Chrysostomos Nicopoulos and Hyung Gyu Lee and Jongman Kim TornadoNoC: a lightweight and scalable on-chip network architecture for the many-core era . . . . . . . . . . . . . 56:1--56:?? Christos Strydis and Robert M. Seepers and Pedro Peris-Lopez and Dimitrios Siskos and Ioannis Sourdis A system architecture, processor, and communication protocol for secure implants . . . . . . . . . . . . . . . . 57:1--57:?? Wonsub Kim and Yoonseo Choi and Haewoo Park Fast modulo scheduler utilizing patternized routes for coarse-grained reconfigurable architectures . . . . . . 58:1--58:?? Dorit Nuzman and Revital Eres and Sergei Dyshel and Marcel Zalmanovici and Jose Castanos JIT technology with C/C++: Feedback-directed dynamic recompilation for statically compiled languages . . . 59:1--59:?? Thejas Ramashekar and Uday Bondhugula Automatic data allocation and buffer management for multi-GPU machines . . . 60:1--60:?? Hans Vandierendonck and George Tzenakis and Dimitrios S. Nikolopoulos Analysis of dependence tracking algorithms for task dataflow execution 61:1--61:?? Yeonghun Jeong and Seongseok Seo and Jongeun Lee Evaluator-executor transformation for efficient pipelining of loops with conditionals . . . . . . . . . . . . . . 62:1--62:?? Rajkishore Barik and Jisheng Zhao and Vivek Sarkar A decoupled non-SSA global register allocation using bipartite liveness graphs . . . . . . . . . . . . . . . . . 63:1--63:?? 
Peter Gavin and David Whalley and Magnus Själander Reducing instruction fetch energy in multi-issue processors . . . . . . . . . 64:1--64:?? Anonymous List of distinguished reviewers ACM TACO 65:1--65:??
Neeraj Goel and Anshul Kumar and Preeti Ranjan Panda Shared-port register file architecture for low-energy VLIW processors . . . . . 1:1--1:32 Zheng Wang and Georgios Tournavitis and Björn Franke and Michael F. P. O'Boyle Integrating profile-driven parallelism detection and machine-learning-based mapping . . . . . . . . . . . . . . . . 2:1--2:26 Mehrzad Samadi and Amir Hormati and Janghaeng Lee and Scott Mahlke Leveraging GPUs using cooperative loop speculation . . . . . . . . . . . . . . 3:1--3:26 Jue Wang and Xiangyu Dong and Yuan Xie and Norman P. Jouppi Endurance-aware cache line management for non-volatile caches . . . . . . . . 4:1--4:24 Lei Liu and Zehan Cui and Yong Li and Yungang Bao and Mingyu Chen and Chengyong Wu BPM/BPM+: Software-based dynamic memory partitioning mechanisms for mitigating DRAM bank-/channel-level interferences in multicore systems . . . . . . . . . . 5:1--5:28 Christian Häubl and Christian Wimmer and Hanspeter Mössenböck Trace transitioning and exception handling in a trace-based JIT compiler for Java . . . . . . . . . . . . . . . . 6:1--6:26 Yongbing Huang and Licheng Chen and Zehan Cui and Yuan Ruan and Yungang Bao and Mingyu Chen and Ninghui Sun HMTT: a hybrid hardware/software tracing system for bridging the DRAM access trace's semantic gap . . . . . . . . . . 7:1--7:25 Quan Chen and Minyi Guo Adaptive workload-aware task scheduling for single-ISA asymmetric multicore architectures . . . . . . . . . . . . . 8:1--8:25 Gülfem Savrun-Yeniçeri and Wei Zhang and Huahan Zhang and Eric Seckler and Chen Li and Stefan Brunthaler and Per Larsen and Michael Franz Efficient hosted interpreters on the JVM 9:1--9:24 Prashant J. Nair and Chia-Chen Chou and Moinuddin K. Qureshi Refresh pausing in DRAM memory systems 10:1--10:26 Komal Jothi and Haitham Akkary Tuning the continual flow pipeline architecture with virtual register renaming . . . . . . . . . . . . . . . . 
11:1--11:27 Thomas Carle and Dumitru Potop-Butucaru Predicate-aware, makespan-preserving software pipelining of scheduling tables 12:1--12:26 Angeliki Kritikakou and Francky Catthoor and Vasilios Kelefouras and Costas Goutis A scalable and near-optimal representation of access schemes for memory management . . . . . . . . . . . 13:1--13:25 Hugh Leather and Edwin Bonilla and Michael O'Boyle Automatic feature generation for machine learning--based optimising compilation 14:1--14:32
Theo Kluter and Samuel Burri and Philip Brisk and Edoardo Charbon and Paolo Ienne Virtual Ways: Low-Cost Coherence for Instruction Set Extensions with Architecturally Visible Storage . . . . 15:1--15:26 Bin Ren and Todd Mytkowicz and Gagan Agrawal A Portable Optimization Engine for Accelerating Irregular Data-Traversal Applications on SIMD Architectures . . . 16:1--16:?? Zhengwei Qi and Jianguo Yao and Chao Zhang and Miao Yu and Zhizhou Yang and Haibing Guan VGRIS: Virtualized GPU Resource Isolation and Scheduling in Cloud Gaming 17:1--17:25 Bor-Yeh Shen and Wei-Chung Hsu and Wuu Yang A Retargetable Static Binary Translator for the ARM Architecture . . . . . . . . 18:1--18:?? Darío Suárez Gracia and Alexandra Ferrerón and Luis Montesano Del Campo and Teresa Monreal Arnal and Víctor Viñals Yúfera Revisiting LP--NUCA Energy Consumption: Cache Access Policies and Adaptive Block Dropping . . . . . . . . . . . . . . . . 19:1--19:?? Zhibin Liang and Wei Zhang and Yung-Cheng Ma Deadline-Constrained Clustered Scheduling for VLIW Architectures using Power-Gated Register Files . . . . . . . 20:1--20:26 Shuangde Fang and Zidong Du and Yuntan Fang and Yuanjie Huang and Yang Chen and Lieven Eeckhout and Olivier Temam and Huawei Li and Yunji Chen and Chengyong Wu Performance Portability Across Heterogeneous SoCs Using a Generalized Library-Based Approach . . . . . . . . . 21:1--21:?? Abdulrahman Kaitoua and Hazem Hajj and Mazen A. R. Saghir and Hassan Artail and Haitham Akkary and Mariette Awad and Mageda Sharafeddine and Khaleel Mershad Hadoop Extensions for Distributed Computing on Reconfigurable Active SSD Clusters . . . . . . . . . . . . . . . . 22:1--22:??
Jue Wang and Xiangyu Dong and Yuan Xie Preventing STT-RAM Last-Level Caches from Port Obstruction . . . . . . . . . 23:1--23:?? M. A. Gonzalez-Mesa and Eladio Gutierrez and Emilio L. Zapata and Oscar Plata Effective Transactional Memory Execution Management for Improved Concurrency . . 24:1--24:?? Rakesh Kumar and Alejandro Martínez and Antonio González Efficient Power Gating of SIMD Accelerators Through Dynamic Selective Devectorization in an HW/SW Codesigned Environment . . . . . . . . . . . . . . 25:1--25:?? Stefano Di Carlo and Salvatore Galfano and Marco Indaco and Paolo Prinetto and Davide Bertozzi and Piero Olivo and Cristian Zambelli FLARES: an Aging Aware Algorithm to Autonomously Adapt the Error Correction Capability in NAND Flash Memories . . . 26:1--26:?? Davide B. Bartolini and Filippo Sironi and Donatella Sciuto and Marco D. Santambrogio Automated Fine-Grained CPU Provisioning for Virtual Machines . . . . . . . . . . 27:1--27:?? Trevor E. Carlson and Wim Heirman and Stijn Eyerman and Ibrahim Hur and Lieven Eeckhout An Evaluation of High-Level Mechanistic Core Models . . . . . . . . . . . . . . 28:1--28:?? Farrukh Hijaz and Omer Khan NUCA-L1: a Non-Uniform Access Latency Level-1 Cache Architecture for Multicores Operating at Near-Threshold Voltages . . . . . . . . . . . . . . . . 29:1--29:?? Andi Drebes and Karine Heydemann and Nathalie Drach and Antoniu Pop and Albert Cohen Topology-Aware and Dependence-Aware Scheduling and Memory Allocation for Task-Parallel Languages . . . . . . . . 30:1--30:?? Venkata Kalyan Tawa and Ravi Kasha and Madhu Mutyam EFGR: an Enhanced Fine Granularity Refresh Feature for High-Performance DDR4 DRAM Devices . . . . . . . . . . . 31:1--31:?? Gulay Yalcin and Oguz Ergin and Emrah Islek and Osman Sabri Unsal and Adrian Cristal Exploiting Existing Comparators for Fine-Grained Low-Cost Error Detection 32:1--32:?? Pradeep Ramachandran and Siva Kumar Sastry Hari and Manlap Li and Sarita V. 
Adve Hardware Fault Recovery for I/O Intensive Applications . . . . . . . . . 33:1--33:?? Stijn Eyerman and Pierre Michaud and Wouter Rogiest Multiprogram Throughput Metrics: a Systematic Approach . . . . . . . . . . 34:1--34:??
Cedric Nugteren and Henk Corporaal Bones: an Automatic Skeleton-Based C-to-CUDA Compiler for GPUs . . . . . . 35:1--35:?? Jue Wang and Xiangyu Dong and Yuan Xie Building and Optimizing MRAM-Based Commodity Memories . . . . . . . . . . . 36:1--36:?? Rakesh Komuravelli and Sarita V. Adve and Ching-Tsun Chou Revisiting the Complexity of Hardware Cache Coherence and Some Implications 37:1--37:?? Gabriel Rodríguez and Juan Touriño and Mahmut T. Kandemir Volatile STT--RAM Scratchpad Design and Data Allocation for Low Energy . . . . . 38:1--38:?? Cristóbal Camarero and Enrique Vallejo and Ramón Beivide Topological Characterization of Hamming and Dragonfly Networks and Its Implications on Routing . . . . . . . . 39:1--39:?? Hanbin Yoon and Justin Meza and Naveen Muralimanohar and Norman P. Jouppi and Onur Mutlu Efficient Data Mapping and Buffering Techniques for Multilevel Cell Phase-Change Memories . . . . . . . . . 40:1--40:?? Nathanael Prémillieu and André Seznec Efficient Out-of-Order Execution of Guarded ISAs . . . . . . . . . . . . . . 41:1--41:?? Zheng Wang and Dominik Grewe and Michael F. P. O'Boyle Automatic and Portable Mapping of Data Parallel Programs to OpenCL for GPU-Based Heterogeneous Systems . . . . 42:1--42:?? Dan He and Fang Wang and Hong Jiang and Dan Feng and Jing Ning Liu and Wei Tong and Zheng Zhang Improving Hybrid FTL by Fully Exploiting Internal SSD Parallelism with Virtual Blocks . . . . . . . . . . . . . . . . . 43:1--43:?? Eri Rubin and Ely Levy and Amnon Barak and Tal Ben-Nun MAPS: Optimizing Massively Parallel Applications Using Device-Level Memory Abstraction . . . . . . . . . . . . . . 44:1--44:?? Alessandro Cilardo and Luca Gallo Improving Multibank Memory Access Parallelism with Lattice-Based Partitioning . . . . . . . . . . . . . . 45:1--45:?? Jan Kasper Martinsen and Håkan Grahn and Anders Isberg The Effects of Parameter Tuning in Software Thread-Level Speculation in JavaScript Engines . . . . . . . . . . . 46:1--46:?? 
Quentin Colombet and Florian Brandner and Alain Darte Studying Optimal Spilling in the Light of SSA . . . . . . . . . . . . . . . . . 47:1--47:?? Jawad Haj-Yihia and Yosi Ben Asher and Efraim Rotem and Ahmad Yasin and Ran Ginosar Compiler-Directed Power Management for Superscalars . . . . . . . . . . . . . . 48:1--48:?? Hong-Phuc Trinh and Marc Duranton and Michel Paindavoine Efficient Data Encoding for Convolutional Neural Network application 49:1--49:?? Maximilien B. Breughe and Stijn Eyerman and Lieven Eeckhout Mechanistic Analytical Modeling of Superscalar In-Order Processor Performance . . . . . . . . . . . . . . 50:1--50:?? Vivek Seshadri and Samihan Yedkar and Hongyi Xin and Onur Mutlu and Phillip B. Gibbons and Michael A. Kozuch and Todd C. Mowry Mitigating Prefetcher-Caused Pollution Using Informed Caching Policies for Prefetched Blocks . . . . . . . . . . . 51:1--51:?? George Matheou and Paraskevas Evripidou Architectural Support for Data-Driven Execution . . . . . . . . . . . . . . . 52:1--52:?? Amir Morad and Leonid Yavits and Ran Ginosar GP--SIMD Processing-in-Memory . . . . . 53:1--53:?? Thomas Schaub and Simon Moll and Ralf Karrenberg and Sebastian Hack The Impact of the SIMD Width on Control-Flow and Memory Divergence . . . 54:1--54:?? Zhenman Fang and Sanyam Mehta and Pen-Chung Yew and Antonia Zhai and James Greensky and Gautham Beeraka and Binyu Zang Measuring Microarchitectural Details of Multi- and Many-Core Memory Systems through Microbenchmarking . . . . . . . 55:1--55:?? Chi Ching Chi and Mauricio Alvarez-Mesa and Ben Juurlink Low-Power High-Efficiency Video Decoding using General-Purpose Processors . . . . 56:1--56:?? Fabio Luporini and Ana Lucia Varbanescu and Florian Rathgeber and Gheorghe-Teodor Bercea and J. Ramanujam and David A. Ham and Paul H. J. Kelly Cross-Loop Optimization of Arithmetic Intensity for Finite Element Local Assembly . . . . . . . . . . . . . . . . 57:1--57:?? Xing Zhou and María J. Garzarán and David A. 
Padua Optimal Parallelogram Selection for Hierarchical Tiling . . . . . . . . . . 58:1--58:?? Leo Porter and Michael A. Laurenzano and Ananta Tiwari and Adam Jundt and William A. Ward, Jr. and Roy Campbell and Laura Carrington Making the Most of SMT in HPC: System- and Application-Level Perspectives . . . 59:1--59:?? Xin Tong and Toshihiko Koju and Motohiro Kawahito and Andreas Moshovos Optimizing Memory Translation Emulation in Full System Emulators . . . . . . . . 60:1--60:?? Martin Kong and Antoniu Pop and Louis-Noël Pouchet and R. Govindarajan and Albert Cohen and P. Sadayappan Compiler/Runtime Framework for Dynamic Dataflow Parallelization of Tiled Programs . . . . . . . . . . . . . . . . 61:1--61:?? Nicolas Melot and Christoph Kessler and Jörg Keller and Patrick Eitschberger Fast Crown Scheduling Heuristics for Energy-Efficient Mapping and Scaling of Moldable Streaming Tasks on Manycore Systems . . . . . . . . . . . . . . . . 62:1--62:?? Wenjia Ruan and Yujie Liu and Michael Spear Transactional Read-Modify-Write Without Aborts . . . . . . . . . . . . . . . . . 63:1--63:?? Zia Ul Huda and Ali Jannesari and Felix Wolf Using Template Matching to Infer Parallel Design Patterns . . . . . . . . 64:1--64:?? Heiner Litz and Ricardo J. Dias and David R. Cheriton Efficient Correction of Anomalies in Snapshot Isolation Transactions . . . . 65:1--65:?? Helge Bahmann and Nico Reissmann and Magnus Jahre and Jan Christian Meyer Perfect Reconstructability of Control Flow from Demand Dependence Graphs . . . 66:1--66:?? Venmugil Elango and Naser Sedaghati and Fabrice Rastello and Louis-Noël Pouchet and J. Ramanujam and Radu Teodorescu and P. Sadayappan On Using the Roofline Model with Lower Bounds on Data Movement . . . . . . . . 67:1--67:?? Anonymous List of Distinguished Reviewers ACM TACO 2014 . . . . . . . . . . . . . . . . . . 68:1--68:??
Christopher Zimmer and Frank Mueller NoCMsg: a Scalable Message-Passing Abstraction for Network-on-Chips . . . . 1:1--1:?? Beayna Grigorian and Glenn Reinman Accelerating Divergent Applications on SIMD Architectures Using Neural Networks 2:1--2:?? Anup Holey and Vineeth Mekkat and Pen-Chung Yew and Antonia Zhai Performance-Energy Considerations for Shared Cache Management in a Heterogeneous Multicore Processor . . . 3:1--3:?? Jinho Suh and Chieh-Ting Huang and Michel Dubois Dynamic MIPS Rate Stabilization for Complex Processors . . . . . . . . . . . 4:1--4:?? Naghmeh Karimi and Arun Karthik Kanuparthi and Xueyang Wang and Ozgur Sinanoglu and Ramesh Karri MAGIC: Malicious Aging in Circuits/Cores 5:1--5:?? Pablo De Oliveira Castro and Chadi Akel and Eric Petit and Mihail Popov and William Jalby CERE: LLVM-Based Codelet Extractor and REplayer for Piecewise Benchmarking and Optimization . . . . . . . . . . . . . . 6:1--6:?? Benedict R. Gaster and Derek Hower and Lee Howes HRF-Relaxed: Adapting HRF to the Complexities of Industrial Heterogeneous Memory Models . . . . . . . . . . . . . 7:1--7:?? Kevin Streit and Johannes Doerfert and Clemens Hammacher and Andreas Zeller and Sebastian Hack Generalized Task Parallelism . . . . . . 8:1--8:??
Hamed Tabkhi and Gunar Schirner A Joint SW/HW Approach for Reducing Register File Vulnerability . . . . . . 9:1--9:?? Arun Kanuparthi and Ramesh Karri Reliable Integrity Checking in Multicore Processors . . . . . . . . . . . . . . . 10:1--10:?? Do-Heon Lee and Su-Kyung Yoon and Jung-Geun Kim and Charles C. Weems and Shin-Dug Kim A New Memory-Disk Integrated System with HW Optimizer . . . . . . . . . . . . . . 11:1--11:?? Morteza Mohajjel Kafshdooz and Alireza Ejlali Dynamic Shared SPM Reuse for Real-Time Multicore Embedded Systems . . . . . . . 12:1--12:?? Wenhao Jia and Elba Garza and Kelly A. Shaw and Margaret Martonosi GPU Performance and Power Tuning Using Regression Trees . . . . . . . . . . . . 13:1--13:?? Irshad Pananilath and Aravind Acharya and Vinay Vasista and Uday Bondhugula An Optimizing Code Generator for a Class of Lattice-Boltzmann Computations . . . 14:1--14:?? Shuangde Fang and Wenwen Xu and Yang Chen and Lieven Eeckhout and Olivier Temam and Yunji Chen and Chengyong Wu and Xiaobing Feng Practical Iterative Optimization for the Data Center . . . . . . . . . . . . . . 15:1--15:?? Tao Zhang and Naifeng Jing and Kaiming Jiang and Wei Shu and Min-You Wu and Xiaoyao Liang Buddy SM: Sharing Pipeline Front-End for Improved Energy Efficiency in GPGPUs . . 16:1--16:?? Hsiang-Yun Cheng and Matt Poremba and Narges Shahidi and Ivan Stalev and Mary Jane Irwin and Mahmut Kandemir and Jack Sampson and Yuan Xie EECache: a Comprehensive Study on the Architectural Design for Energy-Efficient Last-Level Caches in Chip Multiprocessors . . . . . . . . . . 17:1--17:?? Arjun Suresh and Bharath Narasimha Swamy and Erven Rohou and André Seznec Intercepting Functions for Memoization: a Case Study Using Transcendental Functions . . . . . . . . . . . . . . . 18:1--18:?? Chung-Hsiang Lin and De-Yu Shen and Yi-Jung Chen and Chia-Lin Yang and Cheng-Yuan Michael Wang SECRET: a Selective Error Correction Framework for Refresh Energy Reduction in DRAMs . . . . . . . . . . . . . . 
. . 19:1--19:?? Doug Simon and Christian Wimmer and Bernhard Urban and Gilles Duboscq and Lukas Stadler and Thomas Würthinger Snippets: Taking the High Road to a Low Level . . . . . . . . . . . . . . . . . 20:1--20:?? Raghuraman Balasubramanian and Vinay Gangadhar and Ziliang Guo and Chen-Han Ho and Cherin Joseph and Jaikrishnan Menon and Mario Paulo Drumond and Robin Paul and Sharath Prasad and Pradip Valathol and Karthikeyan Sankaralingam Enabling GPGPU Low-Level Hardware Explorations with MIAOW: an Open-Source RTL Implementation of a GPGPU . . . . . 21:1--21:?? Quan Chen and Minyi Guo Locality-Aware Work Stealing Based on Online Profiling and Auto-Tuning for Multisocket Multicore Architectures . . 22:1--22:?? Madan Das and Gabriel Southern and Jose Renau Section-Based Program Analysis to Reduce Overhead of Detecting Unsynchronized Thread Communication . . . . . . . . . . 23:1--23:?? Atieh Lotfi and Abbas Rahimi and Luca Benini and Rajesh K. Gupta Aging-Aware Compilation for GP-GPUs . . 24:1--24:?? Brian P. Railing and Eric R. Hein and Thomas M. Conte Contech: Efficiently Generating Dynamic Task Graphs for Arbitrary Parallel Programs . . . . . . . . . . . . . . . . 25:1--25:??
Mahdad Davari and Alberto Ros and Erik Hagersten and Stefanos Kaxiras The Effects of Granularity and Adaptivity on Private/Shared Classification for Coherence . . . . . . 26:1--26:?? Mark Gottscho and Abbas BanaiyanMofrad and Nikil Dutt and Alex Nicolau and Puneet Gupta DPCS: Dynamic Power/Capacity Scaling for SRAM Caches in the Nanoscale Era . . . . 27:1--27:?? Pierre Michaud and Andrea Mondelli and André Seznec Revisiting Clustered Microarchitecture for Future Superscalar Cores: a Case for Wide Issue Clusters . . . . . . . . . . 28:1--28:?? Ragavendra Natarajan and Antonia Zhai Leveraging Transactional Execution for Memory Consistency Model Emulation . . . 29:1--29:?? Biswabandan Panda and Shankar Balachandran CAFFEINE: a Utility-Driven Prefetcher Aggressiveness Engine for Multicores . . 30:1--30:?? Jishen Zhao and Sheng Li and Jichuan Chang and John L. Byrne and Laura L. Ramirez and Kevin Lim and Yuan Xie and Paolo Faraboschi Buri: Scaling Big-Memory Computing with Hardware-Based Memory Expansion . . . . 31:1--31:?? Jan Lucas and Michael Andersch and Mauricio Alvarez-Mesa and Ben Juurlink Spatiotemporal SIMT and Scalarization for Improving GPU Efficiency . . . . . . 32:1--32:??
Subhasis Das and Tor M. Aamodt and William J. Dally Reuse Distance-Based Probabilistic Cache Replacement . . . . . . . . . . . . . . 33:1--33:?? Etem Deniz and Alper Sen MINIME-GPU: Multicore Benchmark Synthesizer for GPUs . . . . . . . . . . 34:1--34:?? Li Tan and Zizhong Chen and Shuaiwen Leon Song Scalable Energy Efficiency with Resilience for High Performance Computing Systems: a Quantitative Methodology . . . . . . . . . . . . . . 35:1--35:?? Kishore Kumar Pusukuri and Rajiv Gupta and Laxmi N. Bhuyan Tumbler: an Effective Load-Balancing Technique for Multi-CPU Multicore Systems . . . . . . . . . . . . . . . . 36:1--36:?? Erik Tomusk and Christophe Dubach and Michael O'Boyle Four Metrics to Evaluate Heterogeneous Multicores . . . . . . . . . . . . . . . 37:1--37:?? Morteza Hoseinzadeh and Mohammad Arjomand and Hamid Sarbazi-Azad SPCM: The Striped Phase Change Memory 38:1--38:?? Chuntao Jiang and Zhibin Yu and Lieven Eeckhout and Hai Jin and Xiaofei Liao and Chengzhong Xu Two-Level Hybrid Sampled Simulation of Multithreaded Applications . . . . . . . 39:1--39:?? Sandeep D'souza and Soumya J. and Santanu Chattopadhyay Integrated Mapping and Synthesis Techniques for Network-on-Chip Topologies with Express Channels . . . . 40:1--40:?? Dimitrios Chasapis and Marc Casas and Miquel Moretó and Raul Vidal and Eduard Ayguadé and Jesús Labarta and Mateo Valero PARSECSs: Evaluating the Impact of Task Parallelism in the PARSEC Benchmark Suite . . . . . . . . . . . . . . . . . 41:1--41:?? Francisco Gaspar and Luis Taniça and Pedro Tomás and Aleksandar Ilic and Leonel Sousa A Framework for Application-Guided Task Management on Heterogeneous Embedded Systems . . . . . . . . . . . . . . . . 42:1--42:?? Ehsan K. Ardestani and Rafael Trapani Possignolo and Jose Luis Briz and Jose Renau Managing Mismatches in Voltage Stacking with CoreUnfolding . . . . . . . . . . . 43:1--43:?? Prashant J. Nair and David A. Roberts and Moinuddin K. 
Qureshi FaultSim: a Fast, Configurable Memory-Reliability Simulator for Conventional and $3$D-Stacked Systems 44:1--44:?? Byeongcheol Lee Adaptive Correction of Sampling Bias in Dynamic Call Graphs . . . . . . . . . . 45:1--45:?? Andrew J. McPherson and Vijay Nagarajan and Susmit Sarkar and Marcelo Cintra Fence Placement for Legacy Data-Race-Free Programs via Synchronization Read Detection . . . . . 46:1--46:?? Ding-Yong Hong and Chun-Chen Hsu and Cheng-Yi Chou and Wei-Chung Hsu and Pangfeng Liu and Jan-Jan Wu Optimizing Control Transfer and Memory Virtualization in Full System Emulators 47:1--47:?? Aravind Sukumaran-Rajam and Philippe Clauss The Polyhedral Model of Nonlinear Loops 48:1--48:?? Prashant J. Nair and David A. Roberts and Moinuddin K. Qureshi Citadel: Efficiently Protecting Stacked Memory from TSV and Large Granularity Failures . . . . . . . . . . . . . . . . 49:1--49:?? Andrew Anderson and Avinash Malik and David Gregg Automatic Vectorization of Interleaved Data Revisited . . . . . . . . . . . . . 50:1--50:?? Lihang Zhao and Lizhong Chen and Woojin Choi and Jeffrey Draper A Filtering Mechanism to Reduce Network Bandwidth Utilization of Transaction Execution . . . . . . . . . . . . . . . 51:1--51:?? Olivier Serres and Abdullah Kayi and Ahmad Anbar and Tarek El-Ghazawi Enabling PGAS Productivity with Hardware Support for Shared Address Mapping: a UPC Case Study . . . . . . . . . . . . . 52:1--52:?? Riccardo Cattaneo and Giuseppe Natale and Carlo Sicignano and Donatella Sciuto and Marco Domenico Santambrogio On How to Accelerate Iterative Stencil Loops: a Scalable Streaming-Based Approach . . . . . . . . . . . . . . . . 53:1--53:?? Unnikrishnan C and Rupesh Nasre and Y. N. Srikant Falcon: a Graph Manipulation Language for Heterogeneous Systems . . . . . . . 54:1--54:?? Rajshekar Kalayappan and Smruti R. Sarangi FluidCheck: a Redundant Threading-Based Approach for Reliable Execution in Manycore Processors . . . . . . . . . . 55:1--55:??
Jesse Elwell and Ryan Riley and Nael Abu-Ghazaleh and Dmitry Ponomarev and Iliano Cervesato Rethinking Memory Permissions for Protection Against Cross-Layer Attacks 56:1--56:?? Amir Morad and Leonid Yavits and Shahar Kvatinsky and Ran Ginosar Resistive GP-SIMD Processing-In-Memory 57:1--57:?? Yaohua Wang and Dong Wang and Shuming Chen and Zonglin Liu and Shenggang Chen and Xiaowen Chen and Xu Zhou Iteration Interleaving--Based SIMD Lane Partition . . . . . . . . . . . . . . . 58:1--58:?? Tomi Äijö and Pekka Jääskeläinen and Tapio Elomaa and Heikki Kultala and Jarmo Takala Integer Linear Programming-Based Scheduling for Transport Triggered Architectures . . . . . . . . . . . . . 59:1--59:?? Qixiao Liu and Miquel Moreto and Jaume Abella and Francisco J. Cazorla and Daniel A. Jimenez and Mateo Valero Sensible Energy Accounting with Abstract Metering for Multicore Systems . . . . . 60:1--60:?? Miao Zhou and Yu Du and Bruce Childers and Daniel Mosse and Rami Melhem Symmetry-Agnostic Coordinated Management of the Memory Hierarchy in Multicore Systems . . . . . . . . . . . . . . . . 61:1--61:?? Amir Yazdanbakhsh and Gennady Pekhimenko and Bradley Thwaites and Hadi Esmaeilzadeh and Onur Mutlu and Todd C. Mowry RFVP: Rollback-Free Value Prediction with Safe-to-Approximate Loads . . . . . 62:1--62:?? Donghyuk Lee and Saugata Ghose and Gennady Pekhimenko and Samira Khan and Onur Mutlu Simultaneous Multi-Layer Access: Improving $3$D-Stacked Memory Bandwidth at Low Cost . . . . . . . . . . . . . . 63:1--63:?? Yeoul Na and Seon Wook Kim and Youngsun Han JavaScript Parallelizing Compiler for Exploiting Parallelism from Data-Parallel HTML5 Applications . . . . 64:1--64:?? Hiroyuki Usui and Lavanya Subramanian and Kevin Kai-Wei Chang and Onur Mutlu DASH: Deadline-Aware High-Performance Memory Scheduler for Heterogeneous Systems with Hardware Accelerators . . . 65:1--65:?? 
Morteza Mohajjel Kafshdooz and Mohammadkazem Taram and Sepehr Assadi and Alireza Ejlali A Compile-Time Optimization Method for WCET Reduction in Real-Time Embedded Systems through Block Formation . . . . 66:1--66:25
Konstantinos Koukos and Alberto Ros and Erik Hagersten and Stefanos Kaxiras Building Heterogeneous Unified Virtual Memories (UVMs) without the Overhead . . 1:1--1:22 Zhigang Wang and Xiaolin Wang and Fang Hou and Yingwei Luo and Zhenlin Wang Dynamic Memory Balancing for Virtualization . . . . . . . . . . . . . 2:1--2:?? Xueyang Wang and Sek Chai and Michael Isnardi and Sehoon Lim and Ramesh Karri Hardware Performance Counter-Based Malware Identification and Detection with Adaptive Compressive Sensing . . . 3:1--3:?? Shoaib Akram and Jennifer B. Sartor and Kenzo Van Craeynest and Wim Heirman and Lieven Eeckhout Boosting the Priority of Garbage: Scheduling Collection on Heterogeneous Multicore Processors . . . . . . . . . . 4:1--4:?? Buse Yilmaz and Baris Aktemur and María J. Garzarán and Sam Kamin and Furkan Kiraç Autotuning Runtime Specialization for Sparse Matrix-Vector Multiplication . . 5:1--5:?? Mingzhou Zhou and Bo Wu and Xipeng Shen and Yaoqing Gao and Graham Yiu Examining and Reducing the Influence of Sampling Errors on Feedback-Driven Optimizations . . . . . . . . . . . . . 6:1--6:?? Amanieu D'antras and Cosmin Gorgovan and Jim Garside and Mikel Luján Optimizing Indirect Branches in Dynamic Binary Translators . . . . . . . . . . . 7:1--7:?? Luiz G. A. Martins and Ricardo Nobre and João M. P. Cardoso and Alexandre C. B. Delbem and Eduardo Marques Clustering-Based Selection for the Exploration of Compiler Optimization Sequences . . . . . . . . . . . . . . . 8:1--8:?? Sang Wook Stephen Do and Michel Dubois Power Efficient Hardware Transactional Memory: Dynamic Issue of Transactions 9:1--9:?? Dmitry Evtyushkin and Dmitry Ponomarev and Nael Abu-Ghazaleh Understanding and Mitigating Covert Channels Through Branch Predictors . . . 10:1--10:?? Hao Zhou and Jingling Xue A Compiler Approach for Exploiting Partial SIMD Parallelism . . . . . . . . 11:1--11:?? Gert-Jan Van Den Braak and Henk Corporaal R-GPU: a Reconfigurable GPU Architecture 12:1--12:??
Peng Liu and Jiyang Yu and Michael C. Huang Thread-Aware Adaptive Prefetcher on Multicore Systems: Improving the Performance for Multithreaded Workloads 13:1--13:?? Cosmin Gorgovan and Amanieu D'antras and Mikel Luján MAMBO: a Low-Overhead Dynamic Binary Modification Tool for ARM . . . . . . . 14:1--14:??
Panagiotis Theocharis and Bjorn De Sutter A Bimodal Scheduler for Coarse-Grained Reconfigurable Arrays . . . . . . . . . 15:1--15:?? Ahmad Anbar and Olivier Serres and Engin Kayraklioglu and Abdel-Hameed A. Badawy and Tarek El-Ghazawi Exploiting Hierarchical Locality in Deep Parallel Architectures . . . . . . . . . 16:1--16:?? Cecilia González-Álvarez and Jennifer B. Sartor and Carlos Álvarez and Daniel Jiménez-González and Lieven Eeckhout MInGLE: an Efficient Framework for Domain Acceleration Using Low-Power Specialized Functional Units . . . . . . 17:1--17:?? Christian Andreetta and Vivien Bégot and Jost Berthold and Martin Elsman and Fritz Henglein and Troels Henriksen and Maj-Britt Nordfang and Cosmin E. Oancea FinPar: a Parallel Financial Benchmark 18:1--18:?? Mickaël Dardaillon and Kevin Marquet and Tanguy Risset and Jérôme Martin and Henri-Pierre Charles A New Compilation Flow for Software-Defined Radio Applications on Heterogeneous MPSoCs . . . . . . . . . . 19:1--19:?? Jianwei Liao and François Trahay and Guoqiang Xiao Dynamic Process Migration Based on Block Access Patterns Occurring in Storage Servers . . . . . . . . . . . . . . . . 20:1--20:?? Amir Hossein Ashouri and Giovanni Mariani and Gianluca Palermo and Eunjung Park and John Cavazos and Cristina Silvano COBAYN: Compiler Autotuning Framework Using Bayesian Networks . . . . . . . . 21:1--21:?? Kypros Chrysanthou and Panayiotis Englezakis and Andreas Prodromou and Andreas Panteli and Chrysostomos Nicopoulos and Yiannakis Sazeides and Giorgos Dimitrakopoulos An Online and Real-Time Fault Detection and Localization Mechanism for Network-on-Chip Architectures . . . . . 22:1--22:??
Sanyam Mehta and Pen-Chung Yew Variable Liberalization . . . . . . . . 23:1--23:?? Hsing-Min Chen and Carole-Jean Wu and Trevor Mudge and Chaitali Chakrabarti RATT-ECC: Rate Adaptive Two-Tiered Error Correction Codes for Reliable $3$D Die-Stacked Memory . . . . . . . . . . . 24:1--24:?? Wenjie Chen and Zhibin Wang and Qin Wu and Jiuzhen Liang and Zhilei Chai Implementing Dense Optical Flow Computation on a Heterogeneous FPGA SoC in C . . . . . . . . . . . . . . . . . . 25:1--25:?? Nilay Vaish and Michael C. Ferris and David A. Wood Optimization Models for Three On-Chip Network Problems . . . . . . . . . . . . 26:1--26:?? Somayeh Sardashti and Andre Seznec and David A. Wood Yet Another Compressed Cache: a Low-Cost Yet Effective Compressed Cache . . . . . 27:1--27:?? Eduardo H. M. Cruz and Matthias Diener and Laércio L. Pilla and Philippe O. A. Navaux Hardware-Assisted Thread and Data Mapping in Hierarchical Multicore Architectures . . . . . . . . . . . . . 28:1--28:?? Almutaz Adileh and Stijn Eyerman and Aamer Jaleel and Lieven Eeckhout Maximizing Heterogeneous Processor Performance Under Power Constraints . . 29:1--29:?? Bagus Wibowo and Abhinav Agrawal and Thomas Stanton and James Tuck An Accurate Cross-Layer Approach for Online Architectural Vulnerability Estimation . . . . . . . . . . . . . . . 30:1--30:?? Manuel Acacio List of Distinguished Reviewers ACM TACO 2014 . . . . . . . . . . . . . . . . . . 31:1--31:??
Keval Vora and Rajiv Gupta and Guoqing Xu Synergistic Analysis of Evolving Graphs 32:1--32:?? Yunquan Zhang and Shigang Li and Shengen Yan and Huiyang Zhou A Cross-Platform SpMV Framework on Many-Core Architectures . . . . . . . . 33:1--33:?? Junwhan Ahn and Sungjoo Yoo and Kiyoung Choi AIM: Energy-Efficient Aggregation Inside the Memory Hierarchy . . . . . . . . . . 34:1--34:?? Amir Kavyan Ziabari and Yifan Sun and Yenai Ma and Dana Schaa and José L. Abellán and Rafael Ubal and John Kim and Ajay Joshi and David Kaeli UMH: a Hardware-Based Unified Memory Hierarchy for Systems with Multiple Discrete GPUs . . . . . . . . . . . . . 35:1--35:?? Tom Spink and Harry Wagstaff and Björn Franke Hardware-Accelerated Cross-Architecture Full-System Virtualization . . . . . . . 36:1--36:?? Qingchuan Shi and George Kurian and Farrukh Hijaz and Srinivas Devadas and Omer Khan LDAC: Locality-Aware Data Access Control for Large-Scale Multicore Cache Hierarchies . . . . . . . . . . . . . . 37:1--37:?? Fernando Fernandes and Lucas Weigel and Claudio Jung and Philippe Navaux and Luigi Carro and Paolo Rech Evaluation of Histogram of Oriented Gradients Soft Errors Criticality for Automotive Applications . . . . . . . . 38:1--38:?? Saumay Dublish and Vijay Nagarajan and Nigel Topham Cooperative Caching for GPUs . . . . . . 39:1--39:?? Nikolaos Tampouratzis and Pavlos M. Mattheakis and Ioannis Papaefstathiou Accelerating Intercommunication in Highly Parallel Systems . . . . . . . . 40:1--40:?? Hyukwoo Park and Myungsu Cha and Soo-Mook Moon Concurrent JavaScript Parsing for Faster Loading of Web Apps . . . . . . . . . . 41:1--41:?? Dongliang Xiong and Kai Huang and Xiaowen Jiang and Xiaolang Yan Memory Access Scheduling Based on Dynamic Multilevel Priority in Shared DRAM Systems . . . . . . . . . . . . . . 42:1--42:?? Daniele De Sensi and Massimo Torquati and Marco Danelutto A Reconfiguration Algorithm for Power-Aware Parallel Applications . . . 43:1--43:?? Michael R. Jantz and Forrest J. 
Robinson and Prasad A. Kulkarni Impact of Intrinsic Profiling Limitations on Effectiveness of Adaptive Optimizations . . . . . . . . . . . . . 44:1--44:?? Marvin Damschen and Lars Bauer and Jörg Henkel Extending the WCET Problem to Optimize for Runtime-Reconfigurable Processors 45:1--45:?? Zheng Li and Fang Wang and Dan Feng and Yu Hua and Jingning Liu and Wei Tong MaxPB: Accelerating PCM Write by Maximizing the Power Budget Utilization 46:1--46:?? Saurav Muralidharan and Michael Garland and Albert Sidelnik and Mary Hall Designing a Tunable Nested Data-Parallel Programming System . . . . . . . . . . . 47:1--47:?? Ismail Akturk and Riad Akram and Mohammad Majharul Islam and Abdullah Muzahid and Ulya R. Karpuzcu Accuracy Bugs: a New Class of Concurrency Bugs to Exploit Algorithmic Noise Tolerance . . . . . . . . . . . . 48:1--48:?? Erik Tomusk and Christophe Dubach and Michael O'Boyle Selecting Heterogeneous Cores for Diversity . . . . . . . . . . . . . . . 49:1--49:?? Pierre Michaud Some Mathematical Facts About Optimal Cache Replacement . . . . . . . . . . . 50:1--50:?? Wenlei Bao and Changwan Hong and Sudheer Chunduri and Sriram Krishnamoorthy and Louis-Noël Pouchet and Fabrice Rastello and P. Sadayappan Static and Dynamic Frequency Scaling on Multicore CPUs . . . . . . . . . . . . . 51:1--51:?? Tiago M. Vale and João A. Silva and Ricardo J. Dias and João M. Lourenço Pot: Deterministic Transactional Execution . . . . . . . . . . . . . . . 52:1--52:?? Zhonghai Lu and Yuan Yao Aggregate Flow-Based Performance Fairness in CMPs . . . . . . . . . . . . 53:1--53:?? Yigit Demir and Nikos Hardavellas Energy-Proportional Photonic Interconnects . . . . . . . . . . . . . 54:1--54:?? Mehmet Can Kurt and Sriram Krishnamoorthy and Gagan Agrawal and Bin Ren User-Assisted Store Recycling for Dynamic Task Graph Schedulers . . . . . 55:1--55:?? 
Jawad Haj-Yihia and Ahmad Yasin and Yosi Ben Asher and Avi Mendelson Fine-Grain Power Breakdown of Modern Out-of-Order Cores and Its Implications on Skylake-Based Systems . . . . . . . . 56:1--56:?? Alberto Scolari and Davide Basilio Bartolini and Marco Domenico Santambrogio A Software Cache Partitioning System for Hash-Based Caches . . . . . . . . . . . 57:1--57:??
Lev Mukhanov and Pavlos Petoumenos and Zheng Wang and Nikos Parasyris and Dimitrios S. Nikolopoulos and Bronis R. De Supinski and Hugh Leather ALEA: a Fine-Grained Energy Profiling Tool . . . . . . . . . . . . . . . . . . 1:1--1:?? Anuj Pathania and Vanchinathan Venkataramani and Muhammad Shafique and Tulika Mitra and Jörg Henkel Defragmentation of Tasks in Many-Core Architecture . . . . . . . . . . . . . . 2:1--2:?? Darko Zivanovic and Milan Pavlovic and Milan Radulovic and Hyunsung Shin and Jongpil Son and Sally A. McKee and Paul M. Carpenter and Petar Radojkovi\'c and Eduard Ayguadé Main Memory in HPC: Do We Need More or Could We Live with Less? . . . . . . . . 3:1--3:?? Wenguang Zheng and Hui Wu and Qing Yang WCET-Aware Dynamic I-Cache Locking for a Single Task . . . . . . . . . . . . . . 4:1--4:?? Byung-Sun Yang and Jae-Yun Kim and Soo-Mook Moon Exceptionization: a Java VM Optimization for Non-Java Languages . . . . . . . . . 5:1--5:?? Rathijit Sen and David A. Wood Pareto Governors for Energy-Optimal Computing . . . . . . . . . . . . . . . 6:1--6:?? Mainak Chaudhuri and Mukesh Agrawal and Jayesh Gaur and Sreenivas Subramoney Micro-Sector Cache: Improving Space Utilization in Sectored DRAM Caches . . 7:1--7:?? Kyriakos Georgiou and Steve Kerrison and Zbigniew Chamski and Kerstin Eder Energy Transparency for Deeply Embedded Programs . . . . . . . . . . . . . . . . 8:1--8:?? Pengcheng Li and Xiaoyu Hu and Dong Chen and Jacob Brock and Hao Luo and Eddy Z. Zhang and Chen Ding LD: Low-Overhead GPU Race Detection Without Access Monitoring . . . . . . . 9:1--9:?? Poovaiah M. Palangappa and Kartik Mohanram CompEx++: Compression-Expansion Coding for Energy, Latency, and Lifetime Improvements in MLC/TLC NVMs . . . . . . 10:1--10:??
Dongwoo Lee and Sangheon Lee and Soojung Ryu and Kiyoung Choi Dirty-Block Tracking in a Direct-Mapped DRAM Cache with Self-Balancing Dispatch 11:1--11:?? Konstantinos Parasyris and Vassilis Vassiliadis and Christos D. Antonopoulos and Spyros Lalis and Nikolaos Bellas Significance-Aware Program Execution on Unreliable Hardware . . . . . . . . . . 12:1--12:?? Gleison Mendonça and Breno Guimarães and Péricles Alves and Márcio Pereira and Guido Araújo and Fernando Magno Quintão Pereira DawnCC: Automatic Annotation for Data Parallelism and Offloading . . . . . . . 13:1--13:?? Rajeev Balasubramonian and Andrew B. Kahng and Naveen Muralimanohar and Ali Shafiee and Vaishnav Srinivas CACTI 7: New Tools for Interconnect Exploration in Innovative Off-Chip Memories . . . . . . . . . . . . . . . . 14:1--14:?? Vishwesh Jatala and Jayvant Anantpur and Amey Karkare Scratchpad Sharing in GPUs . . . . . . . 15:1--15:?? Tae Jun Ham and Juan L. Aragón and Margaret Martonosi Decoupling Data Supply from Computation for Latency-Tolerant Communication in Heterogeneous Architectures . . . . . . 16:1--16:?? Milan Stanic and Oscar Palomar and Timothy Hayes and Ivan Ratkovic and Adrian Cristal and Osman Unsal and Mateo Valero An Integrated Vector-Scalar Design on an In-Order ARM Core . . . . . . . . . . . 17:1--17:?? Fernando A. Endo and Arthur Perais and André Seznec On the Interactions Between Value Prediction and Compiler Optimizations in the Context of EOLE . . . . . . . . . . 18:1--18:?? Aswinkumar Sridharan and Biswabandan Panda and Andre Seznec Band-Pass Prefetching: an Effective Prefetch Management Mechanism Using Prefetch-Fraction Metric in Multi-Core Systems . . . . . . . . . . . . . . . . 19:1--19:?? Andrés Goens and Sergio Siccha and Jeronimo Castrillon Symmetry in Software Synthesis . . . . . 20:1--20:??
Sander Vocke and Henk Corporaal and Roel Jordans and Rosilde Corvino and Rick Nas Extending Halide to Improve Software Development for Imaging DSPs . . . . . . 21:1--21:?? Nicklas Bo Jensen and Sven Karlsson Improving Loop Dependence Analysis . . . 22:1--22:?? Stefan Ganser and Armin Grösslinger and Norbert Siegmund and Sven Apel and Christian Lengauer Iterative Schedule Optimization for Parallelization in the Polyhedron Model 23:1--23:?? Wei Wei and Dejun Jiang and Jin Xiong and Mingyu Chen HAP: Hybrid-Memory-Aware Partition in Shared Last-Level Cache . . . . . . . . 24:1--24:?? Dongliang Xiong and Kai Huang and Xiaowen Jiang and Xiaolang Yan Providing Predictable Performance via a Slowdown Estimation Model . . . . . . . 25:1--25:?? Jing Pu and Steven Bell and Xuan Yang and Jeff Setter and Stephen Richardson and Jonathan Ragan-Kelley and Mark Horowitz Programming Heterogeneous Systems from an Image Processing DSL . . . . . . . . 26:1--26:?? Ayman Hroub and M. E. S. Elrabaa and M. F. Mudawar and A. Khayyat Efficient Generation of Compact Execution Traces for Multicore Architectural Simulations . . . . . . . 27:1--27:?? Nicolas Weber and Michael Goesele MATOG: Array Layout Auto-Tuning for CUDA 28:1--28:?? Amir H. Ashouri and Andrea Bignoli and Gianluca Palermo and Cristina Silvano and Sameer Kulkarni and John Cavazos MiCOMP: Mitigating the Compiler Phase-Ordering Problem Using Optimization Sub-Sequences and Machine Learning . . . . . . . . . . . . . . . . 29:1--29:?? Erik Vermij and Leandro Fiorin and Rik Jongerius and Christoph Hagleitner and Jan Van Lunteren and Koen Bertels An Architecture for Integrated Near-Data Processors . . . . . . . . . . . . . . . 30:1--30:?? Andreas Diavastos and Pedro Trancoso SWITCHES: a Lightweight Runtime for Dataflow Execution of Tasks on Many-Cores . . . . . . . . . . . . . . . 31:1--31:??
Rahul Jain and Preeti Ranjan Panda and Sreenivas Subramoney Cooperative Multi-Agent Reinforcement Learning-Based Co-optimization of Cores, Caches, and On-chip Network . . . . . . 32:1--32:?? Daniele De Sensi and Tiziano De Matteis and Massimo Torquati and Gabriele Mencagli and Marco Danelutto Bringing Parallel Patterns Out of the Corner: The P$^3$ARSEC Benchmark Suite 33:1--33:?? Chencheng Ye and Chen Ding and Hao Luo and Jacob Brock and Dong Chen and Hai Jin Cache Exclusivity and Sharing: Theory and Optimization . . . . . . . . . . . . 34:1--34:?? Rahul Shrivastava and V. Krishna Nandivada Energy-Efficient Compilation of Irregular Task-Parallel Loops . . . . . 35:1--35:?? Julien Proy and Karine Heydemann and Alexandre Berzati and Albert Cohen Compiler-Assisted Loop Hardening Against Fault Attacks . . . . . . . . . . . . . 36:1--36:?? Christina Peterson and Damian Dechev A Transactional Correctness Tool for Abstract Data Types . . . . . . . . . . 37:1--37:?? Matteo Ferroni and Andrea Corna and Andrea Damiani and Rolando Brondolin and Juan A. Colmenares and Steven Hofmeyr and John D. Kubiatowicz and Marco D. Santambrogio Power Consumption Models for Multi-Tenant Server Infrastructures . . 38:1--38:?? Milad Mohammadi and Tor M. Aamodt and William J. Dally CG-OoO: Energy-Efficient Coarse-Grain Out-of-Order Execution Near In-Order Energy with Near Out-of-Order Performance . . . . . . . . . . . . . . 39:1--39:?? Shivam Swami and Poovaiah M. Palangappa and Kartik Mohanram ECS: Error-Correcting Strings for Lifetime Improvements in Nonvolatile Memories . . . . . . . . . . . . . . . . 40:1--40:?? M. Waqar Azhar and Per Stenström and Vassilis Papaefstathiou SLOOP: QoS-Supervised Loop Execution to Reduce Energy on Heterogeneous Architectures . . . . . . . . . . . . . 41:1--41:?? Raghavendra Kanakagiri and Biswabandan Panda and Madhu Mutyam MBZip: Multiblock Data Compression . . . 42:1--42:?? 
Richard Neill and Andi Drebes and Antoniu Pop Fuse: Accurate Multiplexing of Hardware Performance Counters Across Executions 43:1--43:?? Somayeh Sardashti and David A. Wood Could Compression Be of General Use? Evaluating Memory Compression across Domains . . . . . . . . . . . . . . . . 44:1--44:?? Libo Huang and Yashuai Lü and Li Shen and Zhiying Wang Improving the Efficiency of GPGPU Work-Queue Through Data Awareness . . . 45:1--45:?? Alexandra Angerd and Erik Sintorn and Per Stenström A Framework for Automated and Controlled Floating-Point Accuracy Reduction in Graphics Applications on GPUs . . . . . 46:1--46:?? Jaime Arteaga and Stéphane Zuckerman and Guang R. Gao Generating Fine-Grain Multithreaded Applications Using a Multigrain Approach 47:1--47:?? Ramyad Hadidi and Lifeng Nai and Hyojong Kim and Hyesoon Kim CAIRO: a Compiler-Assisted Technique for Enabling Instruction-Level Offloading of Processing-In-Memory . . . . . . . . . . 48:1--48:?? Hongyeol Lim and Giho Park Triple Engine Processor (TEP): a Heterogeneous Near-Memory Processor for Diverse Kernel Operations . . . . . . . 49:1--49:?? George Patsilaras and James Tuck ReDirect: Reconfigurable Directories for Multicore Architectures . . . . . . . . 50:1--50:?? Adarsh Patil and Ramaswamy Govindarajan HAShCache: Heterogeneity-Aware Shared DRAMCache for Integrated Heterogeneous Systems . . . . . . . . . . . . . . . . 51:1--51:?? Christophe Alias and Alexandru Plesco Optimizing Affine Control With Semantic Factorizations . . . . . . . . . . . . . 52:1--52:?? George Matheou and Paraskevas Evripidou Data-Driven Concurrency for High Performance Computing . . . . . . . . . 53:1--53:?? Giorgis Georgakoudis and Hans Vandierendonck and Peter Thoman and Bronis R. De Supinski and Thomas Fahringer and Dimitrios S. Nikolopoulos SCALO: Scalability-Aware Parallelism Orchestration for Multi-Threaded Workloads . . . . . . . . . . . . . . . 54:1--54:?? 
Toufik Baroudi and Rachid Seghir and Vincent Loechner Optimization of Triangular and Banded Matrix Operations Using $2$d-Packed Layouts . . . . . . . . . . . . . . . . 55:1--55:??
Hochan Lee and Mansureh S. Moghaddam and Dongkwan Suh and Bernhard Egger Improving Energy Efficiency of Coarse-Grain Reconfigurable Arrays Through Modulo Schedule Compression/Decompression . . . . . . . 1:1--1:?? Karthik Sangaiah and Michael Lui and Radhika Jagtap and Stephan Diestelhorst and Siddharth Nilakantan and Ankit More and Baris Taskin and Mark Hempstead SynchroTrace: Synchronization-Aware Architecture-Agnostic Traces for Lightweight Multicore Simulation of CMP and HPC Workloads . . . . . . . . . . . 2:1--2:?? Long Zheng and Xiaofei Liao and Hai Jin Efficient and Scalable Graph Parallel Processing With Symbolic Execution . . . 3:1--3:?? Jae-Eon Jo and Gyu-Hyeon Lee and Hanhwi Jang and Jaewon Lee and Mohammadamin Ajdari and Jangwoo Kim DiagSim: Systematically Diagnosing Simulators for Healthy Simulations . . . 4:1--4:?? Sushant Kondguli and Michael Huang A Case for a More Effective, Power-Efficient Turbo Boosting . . . . . 5:1--5:?? Kuan-Chung Chen and Chung-Ho Chen Enabling SIMT Execution Model on Homogeneous Multi-Core System . . . . . 6:1--6:?? Mingzhe Zhang and King Tin Lam and Xin Yao and Cho-Li Wang SIMPO: a Scalable In-Memory Persistent Object Framework Using NVRAM for Reliable Big Data Computing . . . . . . 7:1--7:?? Bobin Deng and Sriseshan Srikanth and Eric R. Hein and Thomas M. Conte and Erik DeBenedictis and Jeanine Cook and Michael P. Frank Extending Moore's Law via Computationally Error-Tolerant Computing 8:1--8:?? Dave Dice and Maurice Herlihy and Alex Kogan Improving Parallelism in Hardware Transactional Memory . . . . . . . . . . 9:1--9:?? Namhyung Kim and Junwhan Ahn and Kiyoung Choi and Daniel Sanchez and Donghoon Yoo and Soojung Ryu Benzene: an Energy-Efficient Distributed Hybrid Cache Architecture for Manycore Systems . . . . . . . . . . . . . . . . 10:1--10:?? Yulong Ao and Chao Yang and Fangfang Liu and Wanwang Yin and Lijuan Jiang and Qiao Sun Performance Optimization of the HPCG Benchmark on the Sunway TaihuLight Supercomputer . . . 
. . . . . . . . . . 11:1--11:?? Saeed Rashidi and Majid Jalili and Hamid Sarbazi-Azad Improving MLC PCM Performance through Relaxed Write and Read for Intermediate Resistance Levels . . . . . . . . . . . 12:1--12:?? Wenlai Zhao and Haohuan Fu and Jiarui Fang and Weijie Zheng and Lin Gan and Guangwen Yang Optimizing Convolutional Neural Networks on the Sunway TaihuLight Supercomputer 13:1--13:?? Dimitrios Mbakoyiannis and Othon Tomoutzoglou and George Kornaros Energy-Performance Considerations for Data Offloading to FPGA-Based Accelerators Over PCIe . . . . . . . . . 14:1--14:?? Zhen Lin and Michael Mantor and Huiyang Zhou GPU Performance vs. Thread-Level Parallelism: Scalability Analysis and a Novel Way to Improve TLP . . . . . . . . 15:1--15:?? Oleksandr Zinenko and Stéphane Huot and Cédric Bastoul Visual Program Manipulation in the Polyhedral Model . . . . . . . . . . . . 16:1--16:??
Mustafa M. Shihab and Jie Zhang and Myoungsoo Jung and Mahmut Kandemir ReveNAND: a Fast-Drift-Aware Resilient 3D NAND Flash Design . . . . . . . . . 17:1--17:?? Seyed Majid Zahedi and Songchun Fan and Benjamin C. Lee Managing Heterogeneous Datacenters with Tokens . . . . . . . . . . . . . . . . . 18:1--18:?? Miquel Pericàs Elastic Places: an Adaptive Resource Manager for Scalable and Portable Performance . . . . . . . . . . . . . . 19:1--19:?? Matthew Benjamin Olson and Joseph T. Teague and Divyani Rao and Michael R. Jantz and Kshitij A. Doshi and Prasad A. Kulkarni Cross-Layer Memory Management to Improve DRAM Energy Efficiency . . . . . . . . . 20:1--20:?? Davide Zoni and Luca Colombo and William Fornaciari DarkCache: Energy-Performance Optimization of Tiled Multi-Cores by Adaptively Power-Gating LLC Banks . . . 21:1--21:?? Yang Zhang and Dan Feng and Wei Tong and Yu Hua and Jingning Liu and Zhipeng Tan and Chengning Wang and Bing Wu and Zheng Li and Gaoxiang Xu CACF: a Novel Circuit Architecture Co-optimization Framework for Improving Performance, Reliability and Energy of ReRAM-based Main Memory System . . . . . 22:1--22:?? Nicolai Stawinoga and Tony Field Predictable Thread Coarsening . . . . . 23:1--23:?? Probir Roy and Shuaiwen Leon Song and Sriram Krishnamoorthy and Abhinav Vishnu and Dipanjan Sengupta and Xu Liu NUMA-Caffe: NUMA-Aware Deep Learning Neural Networks . . . . . . . . . . . . 24:1--24:?? Ahsen Ejaz and Vassilios Papaefstathiou and Ioannis Sourdis DDRNoC: Dual Data-Rate Network-on-Chip 25:1--25:?? Ying Cai and Yulong Ao and Chao Yang and Wenjing Ma and Haitao Zhao Extreme-Scale High-Order WENO Simulations of 3-D Detonation Wave with 10 Million Cores . . . . . . . . . 26:1--26:??
Yannis Sfakianakis and Christos Kozanitis and Christos Kozyrakis and Angelos Bilas QuMan: Profile-based Improvement of Cluster Utilization . . . . . . . . . . 27:1--27:?? Engin Kayraklioglu and Michael P. Ferguson and Tarek El-Ghazawi LAPPS: Locality-Aware Productive Prefetching Support for PGAS . . . . . . 28:1--28:?? Akrem Benatia and Weixing Ji and Yizhuo Wang and Feng Shi BestSF: a Sparse Meta-Format for Optimizing SpMV on GPU . . . . . . . . . 29:1--29:?? Pierre Michaud An Alternative TAGE-like Conditional Branch Predictor . . . . . . . . . . . . 30:1--30:?? James Garland and David Gregg Low Complexity Multiply-Accumulate Units for Convolutional Neural Networks with Weight-Sharing . . . . . . . . . . . . . 31:1--31:?? Hyojong Kim and Ramyad Hadidi and Lifeng Nai and Hyesoon Kim and Nuwan Jayasena and Yasuko Eckert and Onur Kayiran and Gabriel Loh CODA: Enabling Co-location of Computation and Data for Multiple GPU Systems . . . . . . . . . . . . . . . . 32:1--32:?? Madhavan Manivannan and Miquel Pericás and Vassilis Papaefstathiou and Per Stenström Global Dead-Block Management for Task-Parallel Programs . . . . . . . . . 33:1--33:?? Roman Gareev and Tobias Grosser and Michael Kruse High-Performance Generalized Tensor Operations: a Compiler-Oriented Approach 34:1--34:?? Hervé Yviquel and Lauro Cruz and Guido Araujo Cluster Programming using the OpenMP Accelerator Model . . . . . . . . . . . 35:1--35:?? Mohammad Khavari Tavana and Amir Kavyan Ziabari and David Kaeli Block Cooperation: Advancing Lifetime of Resistive Memories by Increasing Utilization of Error Correcting Codes 36:1--36:?? Hai Jin and Bo Liu and Wenbin Jiang and Yang Ma and Xuanhua Shi and Bingsheng He and Shaofeng Zhao Layer-Centric Memory Reuse and Data Migration for Extreme-Scale Deep Learning on Many-Core Architectures . . 37:1--37:?? Dani Voitsechov and Arslan Zulfiqar and Mark Stephenson and Mark Gebhart and Stephen W. 
Keckler Software-Directed Techniques for Improved GPU Register File Utilization 38:1--38:?? Huanxin Lin and Cho-Li Wang and Hongyuan Liu On-GPU Thread-Data Remapping for Branch Divergence Reduction . . . . . . . . . . 39:1--39:??
Stefan Kronawitter and Christian Lengauer Polyhedral Search Space Exploration in the ExaStencils Code Generator . . . . . 40:1--40:?? Jingheng Xu and Haohuan Fu and Wen Shi and Lin Gan and Yuxuan Li and Wayne Luk and Guangwen Yang Performance Tuning and Analysis for Stencil-Based Applications on POWER8 Processor . . . . . . . . . . . . . . . 41:1--41:?? Jiajun Wang and Reena Panda and Lizy K. John SelSMaP: a Selective Stride Masking Prefetching Scheme . . . . . . . . . . . 42:1--42:?? Xing Su and Xiangke Liao and Hao Jiang and Canqun Yang and Jingling Xue SCP: Shared Cache Partitioning for High-Performance GEMM . . . . . . . . . 43:1--43:?? Fernando Magno Quintão Pereira and Guilherme Vieira Leobas and Abdoulaye Gamatié Static Prediction of Silent Stores . . . 44:1--44:?? Neal C. Crago and Mark Stephenson and Stephen W. Keckler Exposing Memory Access Patterns to Improve Instruction and Memory Efficiency in GPUs . . . . . . . . . . . 45:1--45:?? Feng Zhang and Jingling Xue Poker: Permutation-Based SIMD Execution of Intensive Tree Search by Path Encoding . . . . . . . . . . . . . . . . 46:1--46:?? Nicolas Belleville and Damien Couroussé and Karine Heydemann and Henri-Pierre Charles Automated Software Protection for the Masses Against Side-Channel Attacks . . 47:1--47:?? Chao Yu and Yuebin Bai and Qingxiao Sun and Hailong Yang Improving Thread-level Parallelism in GPUs Through Expanding Register File to Scratchpad Memory . . . . . . . . . . . 48:1--48:?? Lois Orosa and Rodolfo Azevedo and Onur Mutlu AVPP: Address-first Value-next Predictor with Value Prefetching for Improving the Efficiency of Load Value Prediction . . 49:1--49:?? Jun Zhang and Rui Hou and Wei Song and Sally A. McKee and Zhen Jia and Chen Zheng and Mingyu Chen and Lixin Zhang and Dan Meng RAGuard: an Efficient and User-Transparent Hardware Mechanism against ROP Attacks . . . . . . . . . . 50:1--50:?? Ping Wang and Luke McHale and Paul V. 
Gratz and Alex Sprintson GenMatcher: a Generic Clustering-Based Arbitrary Matching Framework . . . . . . 51:1--51:?? Ding-Yong Hong and Jan-Jan Wu and Yu-Ping Liu and Sheng-Yu Fu and Wei-Chung Hsu Processor-Tracing Guided Region Formation in Dynamic Binary Translation 52:1--52:?? Yu Wang and Victor Lee and Gu-Yeon Wei and David Brooks Predicting New Workload or CPU Performance by Analyzing Public Datasets 53:1--53:?? Hyukwoo Park and Sungkook Kim and Jung-Geun Park and Soo-Mook Moon Reusing the Optimized Code for JavaScript Ahead-of-Time Compilation . . 54:1--54:?? Han Zhao and Quan Chen and Yuxian Qiu and Ming Wu and Yao Shen and Jingwen Leng and Chao Li and Minyi Guo Bandwidth and Locality Aware Task-stealing for Manycore Architectures with Bandwidth-Asymmetric Memory . . . . 55:1--55:?? Stefan Ganser and Armin Größlinger and Norbert Siegmund and Sven Apel and Christian Lengauer Speeding up Iterative Polyhedral Schedule Optimization with Surrogate Performance Models . . . . . . . . . . . 56:1--56:?? Song Wu and Fang Zhou and Xiang Gao and Hai Jin and Jinglei Ren Dual-Page Checkpointing: an Architectural Approach to Efficient Data Persistence for In-Memory Applications 57:1--57:?? Mohsen Kiani and Amir Rajabzadeh Efficient Cache Performance Modeling in GPUs Using Reuse Distance Analysis . . . 58:1--58:?? Thomas Debrunner and Sajad Saeedi and Paul H. J. Kelly AUKE: Automatic Kernel Code Generation for an Analogue SIMD Focal-Plane Sensor-Processor Array . . . . . . . . . 59:1--59:?? You Zhou and Fei Wu and Zhonghai Lu and Xubin He and Ping Huang and Changsheng Xie SCORE: a Novel Scheme to Efficiently Cache Overlong ECCs in NAND Flash Memory 60:1--60:?? Francisco J. Andújar and Salvador Coll and Marina Alonso and Pedro López and Juan-Miguel Martínez POWAR: Power-Aware Routing in HPC Networks with On/Off Links . . . . . . . 61:1--61:?? Rahim Mammadli and Felix Wolf and Ali Jannesari The Art of Getting Deep Neural Networks in Shape . . . . . . . . . . . . . . . . 
62:1--62:?? Stavros Tzilis and Pedro Trancoso and Ioannis Sourdis Energy-Efficient Runtime Management of Heterogeneous Multicores using Online Projection . . . . . . . . . . . . . . . 63:1--63:?? Matthew Kay Fei Lee and Yingnan Cui and Thannirmalai Somu and Tao Luo and Jun Zhou and Wai Teng Tang and Weng-Fai Wong and Rick Siow Mong Goh A System-Level Simulator for RRAM-Based Neuromorphic Computing Chips . . . . . . 64:1--64:?? Evangelos Vasilakis and Vassilis Papaefstathiou and Pedro Trancoso and Ioannis Sourdis Decoupled Fused Cache: Fusing a Decoupled LLC with a DRAM Cache . . . . 65:1--65:?? Peter Pirkelbauer and Amalee Wilson and Christina Peterson and Damian Dechev Blaze-Tasks: a Framework for Computing Parallel Reductions over Tasks . . . . . 66:1--66:?? Yukinori Sato and Tomoya Yuki and Toshio Endo An Autotuning Framework for Scalable Execution of Tiled Code via Iterative Polyhedral Compilation . . . . . . . . . 67:1--67:?? S.-Kazem Shekofteh and Hamid Noori and Mahmoud Naghibzadeh and Hadi Sadoghi Yazdi and Holger Fröning Metric Selection for GPU Kernel Classification . . . . . . . . . . . . . 68:1--68:?? Angelos Bilas List of 2018 Distinguished Reviewers ACM TACO . . . . . . . . . . . . . . . . . . 69:1--69:??
Ghassan Shobaki and Austin Kerbow and Christopher Pulido and William Dobson Exploring an Alternative Cost Function for Combinatorial Register-Pressure-Aware Instruction Scheduling . . . . . . . . . . . . . . . 1:1--1:?? Yu-Ping Liu and Ding-Yong Hong and Jan-Jan Wu and Sheng-Yu Fu and Wei-Chung Hsu Exploiting SIMD Asymmetry in ARM-to-x86 Dynamic Binary Translation . . . . . . . 2:1--2:?? Mohammad Sadrosadati and Seyed Borna Ehsani and Hajar Falahati and Rachata Ausavarungnirun and Arash Tavakkol and Mojtaba Abaee and Lois Orosa and Yaohua Wang and Hamid Sarbazi-Azad and Onur Mutlu ITAP: Idle-Time-Aware Power Management for GPU Execution Units . . . . . . . . 3:1--3:?? Halit Dogan and Masab Ahmad and Brian Kahne and Omer Khan Accelerating Synchronization Using Moving Compute to Data Model at 1,000-core Multicore Scale . . . . . . . 4:1--4:?? Leonid Azriel and Lukas Humbel and Reto Achermann and Alex Richardson and Moritz Hoffmann and Avi Mendelson and Timothy Roscoe and Robert N. M. Watson and Paolo Faraboschi and Dejan Milojicic Memory-Side Protection With a Capability Enforcement Co-Processor . . . . . . . . 5:1--5:?? Aamer Jaleel and Eiman Ebrahimi and Sam Duncan DUCATI: High-performance Address Translation by Extending TLB Reach of GPU-accelerated Systems . . . . . . . . 6:1--6:??
Yemao Xu and Dezun Dong and Weixia Xu and Xiangke Liao SketchDLC: a Sketch on Distributed Deep Learning Communication via Trace Capturing . . . . . . . . . . . . . . . 7:1--7:?? Aristeidis Mastoras and Thomas R. Gross Efficient and Scalable Execution of Fine-Grained Dynamic Linear Pipelines 8:1--8:?? Tae Jun Ham and Juan L. Aragón and Margaret Martonosi Efficient Data Supply for Parallel Heterogeneous Architectures . . . . . . 9:1--9:?? Savvas Sioutas and Sander Stuijk and Luc Waeijen and Twan Basten and Henk Corporaal and Lou Somers Schedule Synthesis for Halide Pipelines through Reuse Analysis . . . . . . . . . 10:1--10:?? Xiaoyuan Wang and Haikun Liu and Xiaofei Liao and Ji Chen and Hai Jin and Yu Zhang and Long Zheng and Bingsheng He and Song Jiang Supporting Superpages and Lightweight Page Migration in Hybrid Memory Systems 11:1--11:?? Sahar Sargaran and Naser Mohammadzadeh SAQIP: a Scalable Architecture for Quantum Information Processors . . . . . 12:1--12:?? Prerna Budhkar and Ildar Absalyamov and Vasileios Zois and Skyler Windh and Walid A. Najjar and Vassilis J. Tsotras Accelerating In-Memory Database Selections Using Latency Masking Hardware Threads . . . . . . . . . . . . 13:1--13:?? Heinrich Riebler and Gavin Vaz and Tobias Kenter and Christian Plessl Transparent Acceleration for Heterogeneous Platforms With Compilation to OpenCL . . . . . . . . . . . . . . . 14:1--14:?? Xun Gong and Xiang Gong and Leiming Yu and David Kaeli HAWS: Accelerating GPU Wavefront Execution through Selective Out-of-order Execution . . . . . . . . . . . . . . . 15:1--15:?? Yang Song and Olivier Alavoine and Bill Lin A Self-aware Resource Management Framework for Heterogeneous Multicore SoCs with Diverse QoS Targets . . . . . 16:1--16:?? Pedro Yebenes and Jose Rocher-Gonzalez and Jesus Escudero-Sahuquillo and Pedro Javier Garcia and Francisco J. Alfaro and Francisco J. 
Quiles and Crispín Gómez and Jose Duato Combining Source-adaptive and Oblivious Routing with Congestion Control in High-performance Interconnects using Hybrid and Direct Topologies . . . . . . 17:1--17:?? Mohammad Alshboul and Hussein Elnawawy and Reem Elkhouly and Keiji Kimura and James Tuck and Yan Solihin Efficient Checkpointing with Recompute Scheme for Non-volatile Main Memory . . 18:1--18:?? Zacharias Hadjilambrou and Marios Kleanthous and Georgia Antoniou and Antoni Portero and Yiannakis Sazeides Comprehensive Characterization of an Open Source Document Search Engine . . . 19:1--19:??
Bingchao Li and Jizeng Wei and Jizhou Sun and Murali Annavaram and Nam Sung Kim An Efficient GPU Cache Architecture for Applications with Irregular Memory Access Patterns . . . . . . . . . . . . 20:1--20:?? Stephen I. Roberts and Steven A. Wright and Suhaib A. Fahmy and Stephen A. Jarvis The Power-optimised Software Envelope 21:1--21:?? Ram Srivatsa Kannan and Michael Laurenzano and Jeongseob Ahn and Jason Mars and Lingjia Tang Caliper: Interference Estimator for Multi-tenant Environments Sharing Architectural Resources . . . . . . . . 22:1--22:?? Zhen Lin and Hongwen Dai and Michael Mantor and Huiyang Zhou Coordinated CTA Combination and Bandwidth Partitioning for GPU Concurrent Kernel Execution . . . . . . 23:1--23:?? Keryan Didier and Dumitru Potop-Butucaru and Guillaume Iooss and Albert Cohen and Jean Souyris and Philippe Baufreton and Amaury Graillat Correct-by-Construction Parallelization of Hard Real-Time Avionics Applications on Off-the-Shelf Predictable Hardware 24:1--24:?? Pantea Zardoshti and Tingzhe Zhou and Pavithra Balaji and Michael L. Scott and Michael Spear Simplifying Transactional Memory Support in C++ . . . . . . . . . . . . . . . . . 25:1--25:?? Jungwoo Park and Myoungjun Lee and Soontae Kim and Minho Ju and Jeongkyu Hong MH Cache: a Multi-retention STT-RAM-based Low-power Last-level Cache for Mobile Hardware Rendering Systems 26:1--26:?? Jakob Leben and George Tzanetakis Polyhedral Compilation for Multi-dimensional Stream Processing . . 27:1--27:?? Mohammad Sadegh Sadeghi and Siavash Bayat Sarmadi and Shaahin Hessabi Toward On-chip Network Security Using Runtime Isolation Mapping . . . . . . . 28:1--28:?? Stephane Louise A First Step Toward Using Quantum Computing for Low-level WCETs Estimations . . . . . . . . . . . . . . 29:1--29:?? Artem Chikin and Taylor Lloyd and José Nelson Amaral and Ettore Tiotto and Muhammad Usman Memory-access-aware Safety and Profitability Analysis for Transformation of Accelerator-bound OpenMP Loops . . . . . . . . 
. . . . . . 30:1--30:?? Sanghoon Cha and Bokyeong Kim and Chang Hyun Park and Jaehyuk Huh Morphable DRAM Cache Design for Hybrid Memory Systems . . . . . . . . . . . . . 31:1--31:?? Chao Luo and Yunsi Fei and David Kaeli Side-channel Timing Attack of RSA on a GPU . . . . . . . . . . . . . . . . . . 32:1--32:?? Liang Yuan and Chen Ding and Wesley Smith and Peter Denning and Yunquan Zhang A Relational Theory of Locality . . . . 33:1--33:??
Arun Thangamani and V. Krishna Nandivada Optimizing Remote Communication in X10 34:1--34:26 Sriseshan Srikanth and Anirudh Jain and Joseph M. Lennon and Thomas M. Conte and Erik Debenedictis and Jeanine Cook MetaStrider: Architectures for Scalable Memory-centric Reduction of Sparse Data Streams . . . . . . . . . . . . . . . . 35:1--35:26 Mostafa Koraei and Omid Fatemi and Magnus Jahre DCMI: a Scalable Strategy for Accelerating Iterative Stencil Loops on FPGAs . . . . . . . . . . . . . . . . . 36:1--36:24 Leeor Peled and Uri Weiser and Yoav Etsion A Neural Network Prefetcher for Arbitrary Memory Access Patterns . . . . 37:1--37:27 Nicolas Vasilache and Oleksandr Zinenko and Theodoros Theodoridis and Priya Goyal and Zachary Devito and William S. Moses and Sven Verdoolaege and Andrew Adams and Albert Cohen The Next 700 Accelerated Layers: From Mathematical Expressions of Network Computation Graphs to Accelerated GPU Kernels, Automatically . . . . . . . . . 38:1--38:26 Wenbin Jiang and Yang Ma and Bo Liu and Haikun Liu and Bing Bing Zhou and Jian Zhu and Song Wu and Hai Jin Layup: Layer-adaptive and Multi-type Intermediate-oriented Memory Optimization for GPU-based CNNs . . . . 39:1--39:23 Sergi Siso and Wes Armour and Jeyarajan Thiyagalingam Evaluating Auto-Vectorizing Compilers through Objective Withdrawal of Useful Information . . . . . . . . . . . . . . 40:1--40:23 Salonik Resch and S. Karen Khatamifard and Zamshed Iqbal Chowdhury and Masoud Zabihi and Zhengyang Zhao and Jian-Ping Wang and Sachin S. Sapatnekar and Ulya R. Karpuzcu PIMBALL: Binary Neural Networks in Spintronic Memory . . . . . . . . . . . 41:1--41:26 Zhen Hang Jiang and Yunsi Fei and David Kaeli Exploiting Bank Conflict-based Side-channel Timing Leakage of GPUs . . 42:1--42:24 Kyle Daruwalla and Heng Zhuo and Rohit Shukla and Mikko Lipasti BitSAD v2: Compiler Optimization and Analysis for Bitstream Computing . . . . 43:1--43:25 Aristeidis Mastoras and Thomas R. 
Gross Chunking for Dynamic Linear Pipelines 44:1--44:25 Manuel Selva and Fabian Gruber and Diogo Sampaio and Christophe Guillon and Louis-Noël Pouchet and Fabrice Rastello Building a Polyhedral Representation from an Instrumented Execution: Making Dynamic Analyses of Nonaffine Programs Scalable . . . . . . . . . . . . . . . . 45:1--45:26 Ahmad Yasin and Jawad Haj-Yahya and Yosi Ben-Asher and Avi Mendelson A Metric-Guided Method for Discovering Impactful Features and Architectural Insights for Skylake-Based Processors 46:1--46:25 Jie Zhao and Albert Cohen Flextended Tiles: a Flexible Extension of Overlapped Tiles for Polyhedral Compilation . . . . . . . . . . . . . . 47:1--47:25 Daniel Gerzhoy and Xiaowu Sun and Michael Zuzak and Donald Yeung Nested MIMD--SIMD Parallelization for Heterogeneous Microprocessors . . . . . 48:1--48:27 Chunwei Xia and Jiacheng Zhao and Huimin Cui and Xiaobing Feng and Jingling Xue DNNTune: Automatic Benchmarking DNN Models for Mobile-cloud Computing . . . 49:1--49:26 Ian Briggs and Arnab Das and Mark Baranowski and Vishal Sharma and Sriram Krishnamoorthy and Zvonimir Rakamarić and Ganesh Gopalakrishnan FailAmp: Relativization Transformation for Soft Error Detection in Structured Address Generation . . . . . . . . . . . 50:1--50:21 Khalid Ahmad and Hari Sundar and Mary Hall Data-driven Mixed Precision Sparse Matrix Vector Multiplication for GPUs 51:1--51:24 Larisa Stoltzfus and Bastian Hagedorn and Michel Steuwer and Sergei Gorlatch and Christophe Dubach Tiling Optimizations for Stencil Computations Using Rewrite Rules in Lift 52:1--52:25 Michiel A. van der Vlag and Georgios Smaragdos and Zaid Al-Ars and Christos Strydis Exploring Complex Brain-Simulation Workloads on Multi-GPU Deployments . . . 53:1--53:25 Reem Elkhouly and Mohammad Alshboul and Akihiro Hayashi and Yan Solihin and Keiji Kimura Compiler-support for Critical Data Persistence in NVM . . . . . . . . . . . 
54:1--54:25 Lorenzo Chelini and Oleksandr Zinenko and Tobias Grosser and Henk Corporaal Declarative Loop Tactics for Domain-specific Optimization . . . . . . 55:1--55:25 Asif Ali Khan and Fazal Hameed and Robin Bläsing and Stuart S. P. Parkin and Jeronimo Castrillon ShiftsReduce: Minimizing Shifts in Racetrack Memory 4.0 . . . . . . . . . . 56:1--56:23
Yuhao Li and Dan Sun and Benjamin C. Lee Dynamic Colocation Policies with Reinforcement Learning . . . . . . . . . 1:1--1:25 Nikolaos Tampouratzis and Ioannis Papaefstathiou and Antonios Nikitakis and Andreas Brokalakis and Stamatis Andrianakis and Apostolos Dollas and Marco Marcon and Emanuele Plebani A Novel, Highly Integrated Simulator for Parallel and Distributed Systems . . . . 2:1--2:28 Lijuan Jiang and Chao Yang and Wenjing Ma Enabling Highly Efficient Batched Matrix Multiplications on SW26010 Many-core Processor . . . . . . . . . . . . . . . 3:1--3:23 Mustafa Cavus and Resit Sendag and Joshua J. Yi Informed Prefetching for Indirect Memory Accesses . . . . . . . . . . . . . . . . 4:1--4:29 Yohann Uguen and Florent De Dinechin and Victor Lezaud and Steven Derrien Application-Specific Arithmetic in High-Level Synthesis Tools . . . . . . . 5:1--5:23 Yang Song and Bill Lin Improving Memory Efficiency in Heterogeneous MPSoCs through Row-Buffer Locality-aware Forwarding . . . . . . . 6:1--6:26 Hao Wu and Weizhi Liu and Huanxin Lin and Cho-Li Wang A Model-Based Software Solution for Simultaneous Multiple Kernels on GPUs 7:1--7:26 Xuanhua Shi and Wei Liu and Ligang He and Hai Jin and Ming Li and Yong Chen Optimizing the SSD Burst Buffer by Traffic Detection . . . . . . . . . . . 8:1--8:26
Charu Kalra and Fritz Previlon and Norm Rubin and David Kaeli ArmorAll: Compiler-based Resilience Targeting GPU Applications . . . . . . . 9:1--9:24 Stefano Cherubin and Daniele Cattaneo and Michele Chiari and Giovanni Agosta Dynamic Precision Autotuning with TAFFO 10:1--10:26 Ahmet Erdem and Cristina Silvano and Thomas Boesch and Andrea Carlo Ornstein and Surinder-Pal Singh and Giuseppe Desoli Runtime Design Space Exploration and Mapping of DCNNs for the Ultra-Low-Power Orlando SoC . . . . . . . . . . . . . . 11:1--11:25 Amir Hossein Nodehi Sabet and Junqiao Qiu and Zhijia Zhao and Sriram Krishnamoorthy Reliability Analysis for Unreliable FSM Computations . . . . . . . . . . . . . . 12:1--12:23 Jiachen Xue and T. N. Vijaykumar and Mithuna Thottethodi Network Interface Architecture for Remote Indirect Memory Access (RIMA) in Datacenters . . . . . . . . . . . . . . 13:1--13:22 Qinggang Wang and Long Zheng and Jieshan Zhao and Xiaofei Liao and Hai Jin and Jingling Xue A Conflict-free Scheduler for High-performance Graph Processing on Multi-pipeline FPGAs . . . . . . . . . . 14:1--14:26 Anita Tino and Caroline Collange and André Seznec SIMT-X: Extending Single-Instruction Multi-Threading to Out-of-Order Cores 15:1--15:23
Dave Kaeli Editorial: a Message from the Editor-in-Chief . . . . . . . . . . . . 16:1--16:2 Ram Rangan and Mark W. Stephenson and Aditya Ukarande and Shyam Murthy and Virat Agarwal and Marc Blackstein Zeroploit: Exploiting Zero Valued Operands in Interactive Gaming Applications . . . . . . . . . . . . . . 17:1--17:26 Karel Adámek and Sofia Dimoudi and Mike Giles and Wesley Armour GPU Fast Convolution via the Overlap-and-Save Method in Shared Memory 18:1--18:20 Arnab Das and Sriram Krishnamoorthy and Ian Briggs and Ganesh Gopalakrishnan and Ramakrishna Tipireddy FPDetect: Efficient Reasoning About Stencil Programs Using Selective Direct Evaluation . . . . . . . . . . . . . . . 19:1--19:27 Tarek S. Abdelrahman Cooperative Software-hardware Acceleration of K-means on a Tightly Coupled CPU--FPGA System . . . . . . . . 20:1--20:24 Jaekyu Lee and Yasuo Ishii and Dam Sunwoo Securing Branch Predictors with Two-Level Encryption . . . . . . . . . . 21:1--21:25 L. Cerina and M. D. Santambrogio and G. Franco and C. Gallicchio and A. Micheli EchoBay: Design and Optimization of Echo State Networks under Memory and Time Constraints . . . . . . . . . . . . . . 22:1--22:24 Savvas Sioutas and Sander Stuijk and Twan Basten and Henk Corporaal and Lou Somers Schedule Synthesis for Halide Pipelines on GPUs . . . . . . . . . . . . . . . . 23:1--23:25 Muhammad Huzaifa and Johnathan Alsop and Abdulrahman Mahmoud and Giordano Salvador and Matthew D. Sinclair and Sarita V. Adve Inter-kernel Reuse-aware Thread Block Scheduling . . . . . . . . . . . . . . . 24:1--24:27
Syed M. A. H. Jafri and Hasan Hassan and Ahmed Hemani and Onur Mutlu Refresh Triggered Computation: Improving the Energy Efficiency of Convolutional Neural Network Accelerators . . . . . . 2:1--2:29 Solomon Abera and M. Balakrishnan and Anshul Kumar Performance-Energy Trade-off in Modern CMPs . . . . . . . . . . . . . . . . . . 3:1--3:26 Atefeh Mehrabi and Aninda Manocha and Benjamin C. Lee and Daniel J. Sorin Bayesian Optimization for Efficient Accelerator Synthesis . . . . . . . . . 4:1--4:25 Minsu Kim and Jeong-Keun Park and Soo-Mook Moon Irregular Register Allocation for Translation of Test-pattern Programs . . 5:1--5:23 Negin Nematollahi and Mohammad Sadrosadati and Hajar Falahati and Marzieh Barkhordar and Mario Paulo Drumond and Hamid Sarbazi-Azad and Babak Falsafi Efficient Nearest-Neighbor Data Sharing in GPUs . . . . . . . . . . . . . . . . 6:1--6:26 Lorenz Braun and Sotirios Nikas and Chen Song and Vincent Heuveline and Holger Fröning A Simple Model for Portable and Fast Prediction of Execution Time and Power Consumption of GPU Kernels . . . . . . . 7:1--7:25 Marcel Mettler and Daniel Mueller-Gritschneder and Ulf Schlichtmann A Distributed Hardware Monitoring System for Runtime Verification on Multi-Tile MPSoCs . . . . . . . . . . . . . . . . . 8:1--8:25 Yu Emma Wang and Carole-Jean Wu and Xiaodong Wang and Kim Hazelwood and David Brooks Exploiting Parallelism Opportunities with Deep Learning Frameworks . . . . . 9:1--9:23 Sanket Tavarageri and Alexander Heinecke and Sasikanth Avancha and Bharat Kaul and Gagandeep Goyal and Ramakrishna Upadrasta PolyDL: Polyhedral Optimizations for Creation of High-performance DL Primitives . . . . . . . . . . . . . . . 11:1--11:27 Sujay Yadalam and Vinod Ganapathy and Arkaprava Basu SGXL: Security and Performance for Enclaves Using Large Pages . . . . . . . 12:1--12:25 Kleovoulos Kalaitzidis and André Seznec Leveraging Value Equality Prediction for Value Speculation . . . . . . . . . . . 
13:1--13:20 Abhishek Singh and Shail Dave and Pantea Zardoshti and Robert Brotzman and Chao Zhang and Xiaochen Guo and Aviral Shrivastava and Gang Tan and Michael Spear SPX64: a Scratchpad Memory for General-purpose Microprocessors . . . . 14:1--14:26 Paolo Sylos Labini and Marco Cianfriglia and Damiano Perri and Osvaldo Gervasi and Grigori Fursin and Anton Lokhmotov and Cedric Nugteren and Bruno Carpentieri and Fabiana Zollo and Flavio Vella On the Anatomy of Predictive Models for Accelerating GPU Convolution Kernels and Beyond . . . . . . . . . . . . . . . . . 16:1--16:24
Nils Voss and Bastiaan Kwaadgras and Oskar Mencer and Wayne Luk and Georgi Gaydadjiev On Predictable Reconfigurable System Design . . . . . . . . . . . . . . . . . 17:1--17:28 Anirudh Mohan Kaushik and Gennady Pekhimenko and Hiren Patel Gretch: a Hardware Prefetcher for Graph Analytics . . . . . . . . . . . . . . . 18:1--18:25 Nhut-Minh Ho and Himeshi De Silva and Weng-Fai Wong GRAM: a Framework for Dynamically Mixing Precisions in GPU Applications . . . . . 19:1--19:24 Arnab Kumar Biswas Cryptographic Software IP Protection without Compromising Performance or Timing Side-channel Leakage . . . . . . 20:1--20:20 Maxime France-Pillois and Jérôme Martin and Frédéric Rousseau A Non-Intrusive Tool Chain to Optimize MPSoC End-to-End Systems . . . . . . . . 21:1--21:22 Pengyu Wang and Jing Wang and Chao Li and Jianzong Wang and Haojin Zhu and Minyi Guo Grus: Toward Unified-memory-efficient High-performance Graph Processing on GPU 22:1--22:25 Ramin Izadpanah and Christina Peterson and Yan Solihin and Damian Dechev PETRA: Persistent Transactional Non-blocking Linked Data Structures . . 23:1--23:26 Muhammad Hassan and Chang Hyun Park and David Black-Schaffer A Reusable Characterization of the Memory System Behavior of SPEC2017 and SPEC2006 . . . . . . . . . . . . . . . . 24:1--24:20
Sugandha Tiwari and Neel Gala and Chester Rebeiro and V. Kamakoti PERI: a Configurable Posit Enabled RISC-V Core . . . . . . . . . . . . . . 25:1--25:26 George Charitopoulos and Dionisios N. Pnevmatikatos and Georgi Gaydadjiev MC-DeF: Creating Customized CGRAs for Dataflow Applications . . . . . . . . . 26:1--26:25 Jose M. Rodriguez Borbon and Junjie Huang and Bryan M. Wong and Walid Najjar Acceleration of Parallel-Blocked QR Decomposition of Tall-and-Skinny Matrices on FPGAs . . . . . . . . . . . 27:1--27:25 Michael Stokes and David Whalley and Soner Onder Decreasing the Miss Rate and Eliminating the Performance Penalty of a Data Filter Cache . . . . . . . . . . . . . . . . . 28:1--28:22 Shoaib Akram Performance Evaluation of Intel Optane Memory for Managed Workloads . . . . . . 29:1--29:26 Yashuai Lü and Hui Guo and Libo Huang and Qi Yu and Li Shen and Nong Xiao and Zhiying Wang GraphPEG: Accelerating Graph Processing on GPUs . . . . . . . . . . . . . . . . 30:1--30:24 Hamza Omar and Omer Khan PRISM: Strong Hardware Isolation-based Soft-Error Resilient Multicore Architecture with High Performance and Availability at Low Hardware Overheads 31:1--31:25 Devashree Tripathy and Amirali Abdolrashidi and Laxmi Narayan Bhuyan and Liang Zhou and Daniel Wong PAVER: Locality Graph-Based Thread Block Scheduling for GPUs . . . . . . . . . . 32:1--32:26 Wim Heirman and Stijn Eyerman and Kristof Du Bois and Ibrahim Hur Automatic Sublining for Efficient Sparse Memory Accesses . . . . . . . . . . . . 33:1--33:23 Mustafa Cavus and Mohammed Shatnawi and Resit Sendag and Augustus K. Uht Fast Key-Value Lookups with Node Tracker 34:1--34:26 Weijia Song and Christina Delimitrou and Zhiming Shen and Robbert Van Renesse and Hakim Weatherspoon and Lotfi Benmohamed and Frederic De Vaulx and Charif Mahmoudi CacheInspector: Reverse Engineering Cache Resources in Public Clouds . . . . 35:1--35:25 Daniel Rodrigues Carvalho and André Seznec Understanding Cache Compression . . . . 
36:1--36:27 Daniel Thuerck and Nicolas Weber and Roberto Bifulco Flynn's Reconciliation: Automating the Register Cache Idiom for Cross-accelerator Programming . . . . . 37:1--37:26 João P. L. De Carvalho and Braedy Kuzma and Ivan Korostelev and José Nelson Amaral and Christopher Barton and José Moreira and Guido Araujo KernelFaRer: Replacing Native-Code Idioms with High-Performance Library Calls . . . . . . . . . . . . . . . . . 38:1--38:22 Ricardo Alves and Stefanos Kaxiras and David Black-Schaffer Early Address Prediction: Efficient Pipeline Prefetch and Reuse . . . . . . 39:1--39:22
Kaustav Goswami and Dip Sankar Banerjee and Shirshendu Das Towards Enhanced System Efficiency while Mitigating Row Hammer . . . . . . . . . 40:1--40:26 Jerzy Proficz All-gather Algorithms Resilient to Imbalanced Process Arrival Patterns . . 41:1--41:22 Rui Xu and Sheng Ma and Yaohua Wang and Xinhai Chen and Yang Guo Configurable Multi-directional Systolic Array Architecture for Convolutional Neural Networks . . . . . . . . . . . . 42:1--42:24 Wonik Seo and Sanghoon Cha and Yeonjae Kim and Jaehyuk Huh and Jongse Park SLO-Aware Inference Scheduler for Heterogeneous Processors in Edge Platforms . . . . . . . . . . . . . . . 43:1--43:26 Yasir Mahmood Qureshi and William Andrew Simon and Marina Zapater and Katzalin Olcoz and David Atienza Gem5-X: a Many-core Heterogeneous Simulation Platform for Architectural Exploration and Optimization . . . . . . 44:1--44:27 Tina Jung and Fabian Ritter and Sebastian Hack PICO: a Presburger In-bounds Check Optimization for Compiler-based Memory Safety Instrumentations . . . . . . . . 45:1--45:27 Zhibing Sha and Jun Li and Lihao Song and Jiewen Tang and Min Huang and Zhigang Cai and Lianju Qian and Jianwei Liao and Zhiming Liu Low I/O Intensity-aware Partial GC Scheduling to Reduce Long-tail Latency in SSDs . . . . . . . . . . . . . . . . 46:1--46:25 Syed Asad Alam and James Garland and David Gregg Low-precision Logarithmic Number Systems: Beyond Base-2 . . . . . . . . . 47:1--47:25 Candace Walden and Devesh Singh and Meenatchi Jagasivamani and Shang Li and Luyi Kang and Mehdi Asnaashari and Sylvain Dubois and Bruce Jacob and Donald Yeung Monolithically Integrating Non-Volatile Main Memory over the Last-Level Cache 48:1--48:26 Matthew Tomei and Shomit Das and Mohammad Seyedzadeh and Philip Bedoukian and Bradford Beckmann and Rakesh Kumar and David Wood Byte-Select Compression . . . . . . . . 
49:1--49:27 Cunlu Li and Dezun Dong and Shazhou Yang and Xiangke Liao and Guangyu Sun and Yongheng Liu CIB-HIER: Centralized Input Buffer Design in Hierarchical High-radix Routers . . . . . . . . . . . . . . . . 50:1--50:21 Tobias Gysi and Christoph Müller and Oleksandr Zinenko and Stephan Herhut and Eddie Davis and Tobias Wicky and Oliver Fuhrer and Torsten Hoefler and Tobias Grosser Domain-Specific Multi-Level IR Rewriting for GPU: The Open Earth Compiler for GPU-accelerated Climate Simulation . . . 51:1--51:23 An Zou and Huifeng Zhu and Jingwen Leng and Xin He and Vijay Janapa Reddi and Christopher D. Gill and Xuan Zhang System-level Early-stage Modeling and Evaluation of IVR-assisted Processor Power Delivery System . . . . . . . . . 52:1--52:27 Aninda Manocha and Tyler Sorensen and Esin Tureci and Opeoluwa Matthews and Juan L. Aragón and Margaret Martonosi GraphAttack: Optimizing Data Supply for Graph Applications on In-Order Multicore Architectures . . . . . . . . . . . . . 53:1--53:26 Joscha Benz and Oliver Bringmann Scenario-Aware Program Specialization for Timing Predictability . . . . . . . 54:1--54:26 Shounak Chakraborty and Magnus Själander WaFFLe: Gated Cache-Ways with Per-Core Fine-Grained DVFS for Reduced On-Chip Temperature and Leakage Consumption . . 55:1--55:25 Sriseshan Srikanth and Anirudh Jain and Thomas M. Conte and Erik P. Debenedictis and Jeanine Cook SortCache: Intelligent Cache Management for Accelerating Sparse Data Workloads 56:1--56:24 Paul Metzger and Volker Seeker and Christian Fensch and Murray Cole Device Hopping: Transparent Mid-Kernel Runtime Switching for Heterogeneous Systems . . . . . . . . . . . . . . . . 57:1--57:25 Yu Zhang and Da Peng and Xiaofei Liao and Hai Jin and Haikun Liu and Lin Gu and Bingsheng He LargeGraph: an Efficient Dependency-Aware GPU-Accelerated Large-Scale Graph Processing . . . . . . 58:1--58:24 Hüsrev Cilasun and Salonik Resch and Zamshed I. 
Chowdhury and Erin Olson and Masoud Zabihi and Zhengyang Zhao and Thomas Peterson and Keshab K. Parhi and Jian-Ping Wang and Sachin S. Sapatnekar and Ulya R. Karpuzcu Spiking Neural Networks in Spintronic Computational RAM . . . . . . . . . . . 59:1--59:21
Aditya Ukarande and Suryakant Patidar and Ram Rangan Locality-Aware CTA Scheduling for Gaming Applications . . . . . . . . . . . . . . 1:1--1:26 Hongzhi Liu and Jie Luo and Ying Li and Zhonghai Wu Iterative Compilation Optimization Based on Metric Learning and Collaborative Filtering . . . . . . . . . . . . . . . 2:1--2:25 Muhammad Aditya Sasongko and Milind Chabbi and Mandana Bagheri Marzijarani and Didem Unat ReuseTracker: Fast Yet Accurate Multicore Reuse Distance Analyzer . . . 3:1--3:25 Yaosheng Fu and Evgeny Bolotin and Niladrish Chatterjee and David Nellans and Stephen W. Keckler GPU Domain Specialization via Composable On-Package Architecture . . . . . . . . 4:1--4:23 Daeyeal Lee and Bill Lin and Chung-Kuan Cheng SMT-Based Contention-Free Task Mapping and Scheduling on 2D/3D SMART NoC with Mixed Dimension-Order Routing . . . 5:1--5:21 Prasanth Chatarasi and Hyoukjun Kwon and Angshuman Parashar and Michael Pellauer and Tushar Krishna and Vivek Sarkar Marvel: a Data-Centric Approach for Mapping Deep Learning Operators on Spatial Accelerators . . . . . . . . . . 6:1--6:26 Dennis Rieber and Axel Acosta and Holger Fröning Joint Program and Layout Transformations to Enable Convolutional Operators on Specialized Hardware Based on Constraint Programming . . . . . . . . . . . . . . 7:1--7:26 Mengya Lei and Fan Li and Fang Wang and Dan Feng and Xiaomin Zou and Renzhi Xiao SecNVM: an Efficient and Write-Friendly Metadata Crash Consistency Scheme for Secure NVM . . . . . . . . . . . . . . . 8:1--8:26 Bang Di and Daokun Hu and Zhen Xie and Jianhua Sun and Hao Chen and Jinkui Ren and Dong Li TLB-pilot: Mitigating TLB Contention Attack on GPUs with Microarchitecture-Aware Scheduling . . . 9:1--9:23 Gururaj Saileshwar and Rick Boivie and Tong Chen and Benjamin Segal and Alper Buyuktosunoglu HeapCheck: Low-cost Hardware Support for Memory Safety . . . . . . . . . . . . . 10:1--10:24 M. Waqar Azhar and Miquel Pericàs and Per Stenström Task-RM: a Resource Manager for Energy Reduction in Task-Parallel Applications under Quality of Service Constraints . . 11:1--11:26 Cesar Gomes and Maziar Amiraski and Mark Hempstead CASHT: Contention Analysis in Shared Hierarchies with Thefts . . . . . . . . 12:1--12:27 Yufei Wang and Xiaoshe Dong and Longxiang Wang and Weiduo Chen and Xingjun Zhang Optimizing Small-Sample Disk Fault Detection Based on LSTM-GAN Model . . . 13:1--13:24 Franyell Silfa and Jose Maria Arnau and Antonio González E-BATCH: Energy-Efficient and High-Throughput RNN Batching . . . . . . 14:1--14:23 Chen Ding and Dong Chen and Fangzhou Liu and Benjamin Reber and Wesley Smith CARL: Compiler Assigned Reference Leasing . . . . . . . . . . . . . . . . 15:1--15:28
Christof Schlaak and Tzung-Han Juang and Christophe Dubach Memory-Aware Functional IR for Higher-Level Synthesis of Accelerators 16:1--16:26 Kartik Lakshminarasimhan and Ajeya Naithani and Josué Feliu and Lieven Eeckhout The Forward Slice Core: a High-Performance, Yet Low-Complexity Microarchitecture . . . . . . . . . . . 17:1--17:25 Sharanyan Srikanthan and Sayak Chakraborti and Princeton Ferro and Sandhya Dwarkadas MAPPER: Managing Application Performance via Parallel Efficiency Regulation . . 18:1--18:26 Tziouvaras Athanasios and Dimitriou Georgios and Stamoulis Georgios Low-power Near-data Instruction Execution Leveraging Opcode-based Timing Analysis . . . . . . . . . . . . . . . . 19:1--19:26 Xingguo Jia and Jin Zhang and Boshi Yu and Xingyue Qian and Zhengwei Qi and Haibing Guan GiantVM: a Novel Distributed Hypervisor for Resource Aggregation with DSM-aware Optimizations . . . . . . . . . . . . . 20:1--20:27 Mehrzad Nejat and Madhavan Manivannan and Miquel Pericàs and Per Stenström Cooperative Slack Management: Saving Energy of Multicore Processors by Trading Performance Slack Between QoS-Constrained Applications . . . . . . 21:1--21:27 Hugo Pompougnac and Ulysse Beaugnon and Albert Cohen and Dumitru Potop Butucaru Weaving Synchronous Reactions into the Fabric of SSA-form Compilers . . . . . . 22:1--22:25 Ghassan Shobaki and Vahl Scott Gordon and Paul McHugh and Theodore Dubois and Austin Kerbow Register-Pressure-Aware Instruction Scheduling Using Ant Colony Optimization 23:1--23:23 Qihan Wang and Zhen Peng and Bin Ren and Jie Chen and Robert G. Edwards MemHC: an Optimized GPU Memory Management Framework for Accelerating Many-body Correlation . . . . . . . . . 24:1--24:26 Rakesh Kumar and Mehdi Alipour and David Black-Schaffer Dependence-aware Slice Execution to Boost MLP in Slice-out-of-order Cores 25:1--25:28 Nandita Vijaykumar and Ataberk Olgun and Konstantinos Kanellopoulos and F. Nisa Bostanci and Hasan Hassan and Mehrshad Lotfi and Phillip B. Gibbons and Onur Mutlu MetaSys: a Practical Open-source Metadata Management System to Implement and Evaluate Cross-layer Optimizations 26:1--26:29 Jing Chen and Madhavan Manivannan and Mustafa Abduljabbar and Miquel Pericàs ERASE: Energy Efficient Task Mapping and Resource Management for Work Stealing Runtimes . . . . . . . . . . . 27:1--27:29 Chencheng Ye and Yuanchao Xu and Xipeng Shen and Hai Jin and Xiaofei Liao and Yan Solihin Preserving Addressability Upon GC-Triggered Data Movements on Non-Volatile Memory . . . . . . . . . . 28:1--28:26 George Michelogiannakis and Benjamin Klenk and Brandon Cook and Min Yee Teh and Madeleine Glick and Larry Dennison and Keren Bergman and John Shalf A Case For Intra-rack Resource Disaggregation in HPC . . . . . . . . . 29:1--29:26
Ping Wang and Fei Wen and Paul V. Gratz and Alex Sprintson SIMD-Matcher: a SIMD-based Arbitrary Matching Framework . . . . . . . . . . . 30:1--30:20 Marcel Mettler and Martin Rapp and Heba Khdr and Daniel Mueller-Gritschneder and Jörg Henkel and Ulf Schlichtmann An FPGA-based Approach to Evaluate Thermal and Resource Management Strategies of Many-core Processors . . . 31:1--31:24 Paschalis Mpeis and Pavlos Petoumenos and Kim Hazelwood and Hugh Leather Object Intersection Captures on Interactive Apps to Drive a Crowd-sourced Replay-based Compiler Optimization . . . . . . . . . . . . . . 32:1--32:25 Cunlu Li and Dezun Dong and Xiangke Liao MUA-Router: Maximizing the Utility-of-Allocation for On-chip Pipelining Routers . . . . . . . . . . . 33:1--33:23 Ziaul Choudhury and Shashwat Shrivastava and Lavanya Ramapantulu and Suresh Purini An FPGA Overlay for CNN Inference with Fine-grained Flexible Parallelism . . . 34:1--34:26 Diksha Moolchandani and Anshul Kumar and Smruti R. Sarangi Performance and Power Prediction for Concurrent Execution on GPUs . . . . . . 35:1--35:27 Ali Jahanshahi and Nanpeng Yu and Daniel Wong PowerMorph: QoS-Aware Server Power Reshaping for Data Center Regulation Service . . . . . . . . . . . . . . . . 36:1--36:27 Peng Xu and Nannan Zhao and Jiguang Wan and Wei Liu and Shuning Chen and Yuanhui Zhou and Hadeel Albahar and Hanyang Liu and Liu Tang and Zhihu Tan Building a Fast and Efficient LSM-tree Store by Integrating Local Storage with Cloud Storage . . . . . . . . . . . . . 37:1--37:26 Horng-Ruey Huang and Ding-Yong Hong and Jan-Jan Wu and Kung-Fu Chen and Pangfeng Liu and Wei-Chung Hsu Accelerating Video Captioning on Heterogeneous System Architectures . . . 38:1--38:25 David Corbalán-Navarro and Juan L. Aragón and Martí Anglada and Joan-Manuel Parcerisa and Antonio González Triangle Dropping: an Occluded-geometry Predictor for Energy-efficient Mobile GPUs . . . . . . . . . . . . . . . . . . 
39:1--39:20 Shivam Kundan and Theodoros Marinakis and Iraklis Anagnostopoulos and Dimitri Kagaris A Pressure-Aware Policy for Contention Minimization on Multicore Systems . . . 40:1--40:26 Johnathan Alsop and Weon Taek Na and Matthew D. Sinclair and Samuel Grayson and Sarita Adve A Case for Fine-grain Coherence Specialization in Heterogeneous Systems 41:1--41:26 Mohammadreza Soltaniyeh and Richard P. Martin and Santosh Nagarakatte An Accelerator for Sparse Convolutional Neural Networks Leveraging Systolic General Matrix--matrix Multiplication 42:1--42:26 Dharanidhar Dang and Bill Lin and Debashis Sahoo LiteCON: an All-photonic Neuromorphic Accelerator for Energy-efficient Deep Learning . . . . . . . . . . . . . . . . 43:1--43:22 Lokesh Siddhu and Rajesh Kedia and Shailja Pandey and Martin Rapp and Anuj Pathania and Jörg Henkel and Preeti Ranjan Panda CoMeT: an Integrated Interval Thermal Simulation Toolchain for 2D, 2.5D, and 3D Processor-Memory Systems . . . . . 44:1--44:25 M. Ben Olson and Brandon Kammerdiener and Michael R. Jantz and Kshitij A. Doshi and Terry Jones Online Application Guidance for Heterogeneous Memory Systems . . . . . . 45:1--45:27 Bruno Chinelato Honorio and João P. L. De Carvalho and Catalina Munoz Morales and Alexandro Baldassin and Guido Araujo Using Barrier Elision to Improve Transactional Code Generation . . . . . 46:1--46:23
Jiansong Li and Xueying Wang and Xiaobing Chen and Guangli Li and Xiao Dong and Peng Zhao and Xianzhi Yu and Yongxin Yang and Wei Cao and Lei Liu and Xiaobing Feng An Application-oblivious Memory Scheduling System for DNN Accelerators 47:1--47:?? Aditya Narayan and Yvain Thonnart and Pascal Vivet and Ayse Coskun and Ajay Joshi Architecting Optically Controlled Phase Change Memory . . . . . . . . . . . . . 48:1--48:?? Chao Zhang and Maximilian Bremer and Cy Chan and John Shalf and Xiaochen Guo ASA: Accelerating Sparse Accumulation in Column-wise SpGEMM . . . . . . . . . . . 49:1--49:?? Aart Bik and Penporn Koanantakool and Tatiana Shpeisman and Nicolas Vasilache and Bixia Zheng and Fredrik Kjolstad Compiler Support for Sparse Tensor Computations in MLIR . . . . . . . . . . 50:1--50:?? Pierre Michaud and Anis Peysieux HAIR: Halving the Area of the Integer Register File with Odd/Even Banking . . 51:1--51:?? Amirreza Yousefzadeh and Jan Stuijt and Martijn Hijdra and Hsiao-Hsuan Liu and Anteneh Gebregiorgis and Abhairaj Singh and Said Hamdioui and Francky Catthoor Energy-efficient In-Memory Address Calculation . . . . . . . . . . . . . . 52:1--52:?? Hwisoo So and Moslem Didehban and Yohan Ko and Aviral Shrivastava and Kyoungwoo Lee EXPERTISE: an Effective Software-level Redundant Multithreading Scheme against Hardware Faults . . . . . . . . . . . . 53:1--53:?? Tim Hartley and Foivos S. Zakkak and Andy Nisbet and Christos Kotselidis and Mikel Luján Just-In-Time Compilation on ARM --- a Closer Look at Call-Site Code Consistency . . . . . . . . . . . . . . 54:1--54:?? Erling Jellum and Milica Orlandić and Edmund Brekke and Tor Johansen and Torleiv Bryne Solving Sparse Assignment Problems on FPGAs . . . . . . . . . . . . . . . . . 55:1--55:?? Yuhao Li and Benjamin C. Lee Phronesis: Efficient Performance Modeling for High-dimensional Configuration Tuning . . . . . . . . . . 56:1--56:??
Chandrahas Tirumalasetty and Chih Chieh Chou and Narasimha Reddy and Paul Gratz and Ayman Abouelwafa Reducing Minor Page Fault Overheads through Enhanced Page Walker . . . . . . 57:1--57:?? Lan Gao and Jing Wang and Weigong Zhang Adaptive Contention Management for Fine-Grained Synchronization on Commodity GPUs . . . . . . . . . . . . . 58:1--58:?? Ruobing Han and Jaewon Lee and Jaewoong Sim and Hyesoon Kim COX : Exposing CUDA Warp-level Functions to CPUs . . . . . . . . . . . . . . . . 59:1--59:?? Yiding Liu and Xingyao Zhang and Donglin Zhuang and Xin Fu and Shuaiwen Song DynamAP: Architectural Support for Dynamic Graph Traversal on the Automata Processor . . . . . . . . . . . . . . . 60:1--60:?? Changwei Zou and Yaoqing Gao and Jingling Xue Practical Software-Based Shadow Stacks on x86-64 . . . . . . . . . . . . . . . 61:1--61:??
Thomas Luinaud and J. M. Pierre Langlois and Yvon Savaria Symbolic Analysis for Data Plane Programs Specialization . . . . . . . . 1:1--1:?? Nilesh Rajendra Shah and Ashitabh Misra and Antoine Miné and Rakesh Venkat and Ramakrishna Upadrasta BullsEye: Scalable and Accurate Approximation Framework for Cache Miss Calculation . . . . . . . . . . . . . . 2:1--2:?? Mitali Soni and Asmita Pal and Joshua San Miguel As-Is Approximate Computing . . . . . . 3:1--3:?? Parth Shah and Ranjal Gautham Shenoy and Vaidyanathan Srinivasan and Pradip Bose and Alper Buyuktosunoglu TokenSmart: Distributed, Scalable Power Management in the Many-core Era . . . . 4:1--4:?? Zhangyu Chen and Yu Hua and Luochangqi Ding and Bo Ding and Pengfei Zuo and Xue Liu Lock-Free High-performance Hashing for Persistent Memory via PM-aware Holistic Optimization . . . . . . . . . . . . . . 5:1--5:?? Aristeidis Mastoras and Sotiris Anagnostidis and Albert-Jan N. Yzelman Design and Implementation for Nonblocking Execution in GraphBLAS: Tradeoffs and Performance . . . . . . . 6:1--6:?? Yemao Xu and Dezun Dong and Dongsheng Wang and Shi Xu and Enda Yu and Weixia Xu and Xiangke Liao SSD-SGD: Communication Sparsification for Distributed Deep Learning Training 7:1--7:?? Ataberk Olgun and Juan Gómez Luna and Konstantinos Kanellopoulos and Behzad Salami and Hasan Hassan and Oguz Ergin and Onur Mutlu PiDRAM: a Holistic End-to-end FPGA-based Framework for Processing-in-DRAM . . . . 8:1--8:?? Christos Sakalis and Stefanos Kaxiras and Magnus Själander Delay-on-Squash: Stopping Microarchitectural Replay Attacks in Their Tracks . . . . . . . . . . . . . . 9:1--9:?? Yi Liang and Shaokang Zeng and Lei Wang Quantifying Resource Contention of Co-located Workloads with the System-level Entropy . . . . . . . . . . 10:1--10:?? 
Hur Suyeon and Seongmin Na and Dongup Kwon and Kim Joonsung and Andrew Boutros and Eriko Nurvitadhi and Jangwoo Kim A Fast and Flexible FPGA-based Accelerator for Natural Language Processing Neural Networks . . . . . . . 11:1--11:?? Ashish Gondimalla and Jianqiao Liu and Mithuna Thottethodi and T. N. Vijaykumar Occam: Optimal Data Reuse for Convolutional Neural Networks . . . . . 12:1--12:?? Bo Peng and Yaozu Dong and Jianguo Yao and Fengguang Wu and Haibing Guan FlexHM: a Practical System for Heterogeneous Memory with Flexible and Efficient Performance Optimizations . . 13:1--13:?? Qiang Zhang and Lei Xu and Baowen Xu RegCPython: a Register-based Python Interpreter for Better Performance . . . 14:1--14:?? Hai Jin and Zhuo He and Weizhong Qiang SpecTerminator: Blocking Speculative Side Channels Based on Instruction Classes on RISC-V . . . . . . . . . . . 15:1--15:?? Tuowen Zhao and Tobi Popoola and Mary Hall and Catherine Olschanowsky and Michelle Strout Polyhedral Specification and Code Generation of Sparse Tensor Contraction with Co-iteration . . . . . . . . . . . 16:1--16:?? Manuela Schuler and Richard Membarth and Philipp Slusallek XEngine: Optimal Tensor Rematerialization for Neural Networks in Heterogeneous Environments . . . . . . . 17:1--17:?? Ivan Korostelev and João P. L. De Carvalho and José Moreira and José Nelson Amaral YaConv: Convolution with Low Cache Footprint . . . . . . . . . . . . . . . 18:1--18:?? Furkan Eris and Marcia Louis and Kubra Eris and José Abellán and Ajay Joshi Puppeteer: a Random Forest Based Manager for Hardware Prefetchers Across the Memory Hierarchy . . . . . . . . . . . . 19:1--19:??
Nicolas Tollenaere and Guillaume Iooss and Stéphane Pouget and Hugo Brunie and Christophe Guillon and Albert Cohen and P. Sadayappan and Fabrice Rastello Autotuning Convolutions Is Easier Than You Think . . . . . . . . . . . . . . . 20:1--20:?? Víctor Pérez and Lukas Sommer and Victor Lomüller and Kumudha Narasimhan and Mehdi Goli User-driven Online Kernel Fusion for SYCL . . . . . . . . . . . . . . . . . . 21:1--21:?? Vinicius Espindola and Luciano Zago and Hervé Yviquel and Guido Araujo Source Matching and Rewriting for MLIR Using String-Based Automata . . . . . . 22:1--22:?? Wenjing Ma and Fangfang Liu and Daokun Chen and Qinglin Lu and Yi Hu and Hongsen Wang and Xinhui Yuan An Optimized Framework for Matrix Factorization on the New Sunway Many-core Platform . . . . . . . . . . . 23:1--23:?? Sarabjeet Singh and Neelam Surana and Kailash Prasad and Pranjali Jain and Joycee Mekie and Manu Awasthi HyGain: High-performance, Energy-efficient Hybrid Gain Cell-based Cache Hierarchy . . . . . . . . . . . . 24:1--24:?? Chandra Sekhar Mummidi and Sandip Kundu ACTION: Adaptive Cache Block Migration in Distributed Cache Architectures . . . 25:1--25:?? Qiaoyi Liu and Jeff Setter and Dillon Huff and Maxwell Strange and Kathleen Feng and Mark Horowitz and Priyanka Raina and Fredrik Kjolstad Unified Buffer: Compiling Image Processing and Machine Learning Applications to Push-Memory Accelerators 26:1--26:?? Ahmet Caner Yüzügüler and Canberk Sönmez and Mario Drumond and Yunho Oh and Babak Falsafi and Pascal Frossard Scale-out Systolic Arrays . . . . . . . 27:1--27:?? 
Francesco Minervini and Oscar Palomar and Osman Unsal and Enrico Reggiani and Josue Quiroga and Joan Marimon and Carlos Rojas and Roger Figueras and Abraham Ruiz and Alberto Gonzalez and Jonnatan Mendoza and Ivan Vargas and César Hernandez and Joan Cabre and Lina Khoirunisya and Mustapha Bouhali and Julian Pavon and Francesc Moll and Mauro Olivieri and Mario Kovac and Mate Kovac and Leon Dragic and Mateo Valero and Adrian Cristal Vitruvius+: an Area-Efficient RISC-V Decoupled Vector Coprocessor for High Performance Computing Applications . . . 28:1--28:?? Hadjer Benmeziane and Hamza Ouarnoughi and Kaoutar El Maghraoui and Smail Niar Multi-objective Hardware-aware Neural Architecture Search with Pareto Rank-preserving Surrogate Models . . . . 29:1--29:?? Dongwei Chen and Dong Tong and Chun Yang and Jiangfang Yi and Xu Cheng FlexPointer: Fast Address Translation Based on Range TLB and Tagged Pointers 30:1--30:?? Jingwen Du and Fang Wang and Dan Feng and Changchen Gan and Yuchao Cao and Xiaomin Zou and Fan Li Fast One-Sided RDMA-Based State Machine Replication for Disaggregated Memory . . 31:1--31:??
Abdul Rasheed Sahni and Hamza Omar and Usman Ali and Omer Khan ASM: an Adaptive Secure Multicore for Co-located Mutually Distrusting Processes . . . . . . . . . . . . . . . 32:1--32:?? Sooraj Puthoor and Mikko H. Lipasti Turn-based Spatiotemporal Coherence for GPUs . . . . . . . . . . . . . . . . . . 33:1--33:?? Ruobing Chen and Haosen Shi and Jinping Wu and Yusen Li and Xiaoguang Liu and Gang Wang Jointly Optimizing Job Assignment and Resource Partitioning for Improving System Throughput in Cloud Datacenters 34:1--34:?? Gokul Subramanian Ravi and Tushar Krishna and Mikko Lipasti TNT: a Modular Approach to Traversing Physically Heterogeneous NOCs at Bare-wire Latency . . . . . . . . . . . 35:1--35:?? Weizhi Xu and Yintai Sun and Shengyu Fan and Hui Yu and Xin Fu Accelerating Convolutional Neural Network by Exploiting Sparsity on GPUs 36:1--36:?? Jin Zhao and Yu Zhang and Ligang He and Qikun Li and Xiang Zhang and Xinyu Jiang and Hui Yu and Xiaofei Liao and Hai Jin and Lin Gu and Haikun Liu and Bingsheng He and Ji Zhang and Xianzheng Song and Lin Wang and Jun Zhou GraphTune: an Efficient Dependency-Aware Substrate to Alleviate Irregularity in Concurrent Graph Processing . . . . . . 37:1--37:?? Yufeng Zhou and Alan L. Cox and Sandhya Dwarkadas and Xiaowan Dong The Impact of Page Size and Microarchitecture on Instruction Address Translation Overhead . . . . . . . . . . 38:1--38:?? Benjamin Reber and Matthew Gould and Alexander H. Kneipp and Fangzhou Liu and Ian Prechtl and Chen Ding and Linlin Chen and Dorin Patru Cache Programming for Scientific Loops Using Leases . . . . . . . . . . . . . . 39:1--39:?? Xinfeng Xie and Peng Gu and Yufei Ding and Dimin Niu and Hongzhong Zheng and Yuan Xie MPU: Memory-centric SIMT Processor via In-DRAM Near-bank Computing . . . . . . 40:1--40:?? Alexander Krolik and Clark Verbrugge and Laurie Hendren rNdN: Fast Query Compilation for NVIDIA GPUs . . . . . . . . . . . . . . . . . . 41:1--41:?? 
Jiazhi Jiang and Zijian Huang and Dan Huang and Jiangsu Du and Lin Chen and Ziguan Chen and Yutong Lu Hierarchical Model Parallelism for Optimizing Inference on Many-core Processor via Decoupled 3D-CNN Structure . . . . . . . . . . . . . . . 42:1--42:?? Yuwen Zhao and Fangfang Liu and Wenjing Ma and Huiyuan Li and Yuanchi Peng and Cui Wang MFFT: a GPU Accelerated Highly Efficient Mixed-Precision Large-Scale FFT Framework . . . . . . . . . . . . . . . 43:1--43:?? Muhammad Waqar Azhar and Madhavan Manivannan and Per Stenström Approx-RM: Reducing Energy on Heterogeneous Multicore Processors under Accuracy and Timing Constraints . . . . 44:1--44:?? Dong Huang and Dan Feng and Qiankun Liu and Bo Ding and Wei Zhao and Xueliang Wei and Wei Tong SplitZNS: Towards an Efficient LSM-Tree on Zoned Namespace SSDs . . . . . . . . 45:1--45:??
Jiangsu Du and Jiazhi Jiang and Jiang Zheng and Hongbin Zhang and Dan Huang and Yutong Lu Improving Computation and Memory Efficiency for Real-world Transformer Inference on GPUs . . . . . . . . . . . 46:1--46:?? Hai Jin and Bo Lei and Haikun Liu and Xiaofei Liao and Zhuohui Duan and Chencheng Ye and Yu Zhang A Compilation Tool for Computation Offloading in ReRAM-based CIM Architectures . . . . . . . . . . . . . 47:1--47:?? Christian Menard and Marten Lohstroh and Soroush Bateni and Matthew Chorlian and Arthur Deng and Peter Donovan and Clément Fournier and Shaokai Lin and Felix Suchert and Tassilo Tanneberger and Hokeun Kim and Jeronimo Castrillon and Edward A. Lee High-performance Deterministic Concurrency Using Lingua Franca . . . . 48:1--48:?? Donglei Wu and Weihao Yang and Xiangyu Zou and Wen Xia and Shiyi Li and Zhenbo Hu and Weizhe Zhang and Binxing Fang Smart-DNN+: a Memory-efficient Neural Networks Compression Framework for the Model Inference . . . . . . . . . . . . 49:1--49:?? Syed Salauddin Mohammad Tariq and Lance Menard and Pengfei Su and Probir Roy MicroProf: Code-level Attribution of Unnecessary Data Transfer in Microservice Applications . . . . . . . 50:1--50:?? Shiyi Li and Qiang Cao and Shenggang Wan and Wen Xia and Changsheng Xie gPPM: a Generalized Matrix Operation and Parallel Algorithm to Accelerate the Encoding/Decoding Process of Erasure Codes . . . . . . . . . . . . . . . . . 51:1--51:?? Petros Anastasiadis and Nikela Papadopoulou and Georgios Goumas and Nectarios Koziris and Dennis Hoppe and Li Zhong PARALiA: a Performance Aware Runtime for Auto-tuning Linear Algebra on Heterogeneous Systems . . . . . . . . . 52:1--52:?? Hui Yu and Yu Zhang and Jin Zhao and Yujian Liao and Zhiying Huang and Donghao He and Lin Gu and Hai Jin and Xiaofei Liao and Haikun Liu and Bingsheng He and Jianhui Yue RACE: an Efficient Redundancy-aware Accelerator for Dynamic Graph Neural Network . . . . . . . . . . . . . . . . 53:1--53:?? 
Victor Ferrari and Rafael Sousa and Marcio Pereira and João P. L. De Carvalho and José Nelson Amaral and José Moreira and Guido Araujo Advancing Direct Convolution Using Convolution Slicing Optimization and ISA Extensions . . . . . . . . . . . . . . . 54:1--54:?? Bowen He and Xiao Zheng and Yuan Chen and Weinan Li and Yajin Zhou and Xin Long and Pengcheng Zhang and Xiaowei Lu and Linquan Jiang and Qiang Liu and Dennis Cai and Xiantao Zhang DxPU: Large-scale Disaggregated GPU Pools in the Datacenter . . . . . . . . 55:1--55:?? Shiqing Zhang and Mahmood Naderan-Tahan and Magnus Jahre and Lieven Eeckhout Characterizing Multi-Chip GPU Data Sharing . . . . . . . . . . . . . . . . 56:1--56:?? Jens Domke and Emil Vatai and Balazs Gerofi and Yuetsu Kodama and Mohamed Wahib and Artur Podobas and Sparsh Mittal and Miquel Pericàs and Lingqi Zhang and Peng Chen and Aleksandr Drozd and Satoshi Matsuoka At the Locus of Performance: Quantifying the Effects of Copious 3D-Stacked Cache on HPC Workloads . . . . . . . . . 57:1--57:?? Satya Jaswanth Badri and Mukesh Saini and Neeraj Goel Mapi-Pro: an Energy Efficient Memory Mapping Technique for Intermittent Computing . . . . . . . . . . . . . . . 58:1--58:?? Miao Yu and Tingting Xiang and Venkata Pavan Kumar Miriyala and Trevor E. Carlson Multiply-and-Fire: an Event-Driven Sparse Neural Network Accelerator . . . 59:1--59:?? Ziaul Choudhury and Anish Gulati and Suresh Purini FlowPix: Accelerating Image Processing Pipelines on an FPGA Overlay using a Domain Specific Compiler . . . . . . . . 60:1--60:?? Zachary Susskind and Aman Arora and Igor D. S. Miranda and Alan T. L. Bacellar and Luis A. Q. Villon and Rafael F. Katopodis and Leandro S. de Araújo and Diego L. C. Dutra and Priscila M. V. Lima and Felipe M. G. França and Mauricio Breternitz Jr. and Lizy K. John ULEEN: a Novel Architecture for Ultra-low-energy Edge Neural Networks 61:1--61:??
Jia Wei and Xingjun Zhang and Longxiang Wang and Zheng Wei Fastensor: Optimise the Tensor I/O Path from SSD to GPU for Deep Learning Training . . . . . . . . . . . . . . . . 62:1--62:??
Longfei Luo and Dingcui Yu and Yina Lv and Liang Shi Critical Data Backup with Hybrid Flash-Based Consumer Devices . . . . . . 1:1--1:?? Peng Chen and Hui Chen and Weichen Liu and Linbo Long and Wanli Chang and Nan Guan DAG-Order: an Order-Based Dynamic DAG Scheduling for Real-Time Networks-on-Chip . . . . . . . . . . . . 2:1--2:?? Zhang Jiang and Ying Chen and Xiaoli Gong and Jin Zhang and Wenwen Wang and Pen-Chung Yew JiuJITsu: Removing Gadgets with Safe Register Allocation for JIT Code Generation . . . . . . . . . . . . . . . 3:1--3:?? Hayfa Tayeb and Ludovic Paillat and Bérenger Bramas Autovesk: Automatic Vectorized Code Generation from Unstructured Static Kernels Using Graph Transformations . . 4:1--4:?? Xueying Wang and Guangli Li and Zhen Jia and Xiaobing Feng and Yida Wang Fast Convolution Meets Low Precision: Exploring Efficient Quantized Winograd Convolution on Modern CPUs . . . . . . . 5:1--5:?? Hao Fan and Yiliang Ye and Shadi Ibrahim and Zhuo Huang and Xingru Li and Weibin Xue and Song Wu and Chen Yu and Xuanhua Shi and Hai Jin QoS-pro: a QoS-enhanced Transaction Processing Framework for Shared SSDs . . 6:1--6:?? Yunping Zhao and Sheng Ma and Heng Liu and Libo Huang and Yi Dai SAC: an Ultra-Efficient Spin-based Architecture for Compressed DNNs . . . . 7:1--7:?? Tong-Yu Liu and Jianmei Guo and Bo Huang Efficient Cross-platform Multiplexing of Hardware Performance Counters via Adaptive Grouping . . . . . . . . . . . 8:1--8:?? Lei Liu and Xinglei Dou QuCloud+: a Holistic Qubit Mapping Scheme for Single/Multi-programming on 2D/3D NISQ Quantum Computers . . . . 9:1--9:?? Lingxi Wu and Minxuan Zhou and Weihong Xu and Ashish Venkat and Tajana Rosing and Kevin Skadron Abakus: Accelerating k-mer Counting with Storage Technology . . . . . . . . 10:1--10:??
Seokwon Kang and Jongbin Kim and Gyeongyong Lee and Jeongmyung Lee and Jiwon Seo and Hyungsoo Jung and Yong Ho Song and Yongjun Park ISP Agent: a Generalized In-storage-processing Workload Offloading Framework by Providing Multiple Optimization Opportunities . . 11:1--11:?? Prasoon Mishra and V. Krishna Nandivada COWS for High Performance: Cost Aware Work Stealing for Irregular Parallel Loop . . . . . . . . . . . . . . . . . . 12:1--12:?? Joongun Park and Seunghyo Kang and Sanghyeon Lee and Taehoon Kim and Jongse Park and Youngjin Kwon and Jaehyuk Huh Hardware-hardened Sandbox Enclaves for Trusted Serverless Computing . . . . . . 13:1--13:?? Tyler Allen and Bennett Cooper and Rong Ge Fine-grain Quantitative Analysis of Demand Paging in Unified Virtual Memory 14:1--14:?? Zhonghua Wang and Yixing Guo and Kai Lu and Jiguang Wan and Daohui Wang and Ting Yao and Huatao Wu Rcmp: Reconstructing RDMA-Based Memory Disaggregation via CXL . . . . . . . . . 15:1--15:?? Linbo Long and Shuiyong He and Jingcheng Shen and Renping Liu and Zhenhua Tan and Congming Gao and Duo Liu and Kan Zhong and Yi Jiang WA-Zone: Wear-Aware Zone Management Optimization for LSM-Tree on ZNS SSDs 16:1--16:?? Zhihua Fan and Wenming Li and Zhen Wang and Yu Yang and Xiaochun Ye and Dongrui Fan and Ninghui Sun and Xuejun An Improving Utilization of Dataflow Unit for Multi-Batch Processing . . . . . . . 17:1--17:?? Dunbo Zhang and Qingjie Lang and Ruoxi Wang and Li Shen Extension VM: Interleaved Data Layout in Vector Memory . . . . . . . . . . . . . 18:1--18:?? Can Firtina and Kamlesh Pillai and Gurpreet S. Kalsi and Bharathwaj Suresh and Damla Senol Cali and Jeremie S. Kim and Taha Shahroodi and Meryem Banu Cavlak and Joël Lindegger and Mohammed Alser and Juan Gómez Luna and Sreenivas Subramoney and Onur Mutlu ApHMM: Accelerating Profile Hidden Markov Models for Fast and Energy-efficient Genome Analysis . . . . 19:1--19:?? 
Khalid Ahmad and Cris Cecka and Michael Garland and Mary Hall Exploring Data Layout for Sparse Tensor Times Dense Matrix on GPUs . . . . . . . 20:1--20:??
Chandra Sekhar Mummidi and Victor C. Ferreira and Sudarshan Srinivasan and Sandip Kundu Highly Efficient Self-checking Matrix Multiplication on Tiled AMX Accelerators 21:1--21:?? Zhonghua Wang and Chen Ding and Fengguang Song and Kai Lu and Jiguang Wan and Zhihu Tan and Changsheng Xie and Guokuan Li WIPE: a Write-Optimized Learned Index for Persistent Memory . . . . . . . . . 22:1--22:?? Gino A. Chacon and Charles Williams and Johann Knechtel and Ozgur Sinanoglu and Paul V. Gratz and Vassos Soteriou Coherence Attacks and Countermeasures in Interposer-based Chiplet Systems . . . . 23:1--23:?? Yan Wei and Zhang Xingjun A Concise Concurrent B+-Tree for Persistent Memory . . . . . . . . . . . 24:1--24:?? Fareed Qararyah and Muhammad Waqar Azhar and Pedro Trancoso An Efficient Hybrid Deep Learning Accelerator for Compact and Heterogeneous CNNs . . . . . . . . . . . 25:1--25:?? Fernando Fernandes Dos Santos and Luigi Carro and Flavio Vella and Paolo Rech Assessing the Impact of Compiler Optimizations on GPUs Reliability . . . 26:1--26:?? Valentin Isaac-Chassande and Adrian Evans and Yves Durand and Frédéric Rousseau Dedicated Hardware Accelerators for Processing of Sparse Matrices and Vectors: a Survey . . . . . . . . . . . 27:1--27:?? Benyi Xie and Yue Yan and Chenghao Yan and Sicheng Tao and Zhuangzhuang Zhang and Xinyu Li and Yanzhi Lan and Xiang Wu and Tianyi Liu and Tingting Zhang and Fuxin Zhang An Instruction Inflation Analyzing Framework for Dynamic Binary Translators 28:1--28:?? Samuel Rac and Mats Brorsson Cost-aware Service Placement and Scheduling in the Edge-Cloud Continuum 29:1--29:?? Feng Xue and Chenji Han and Xinyu Li and Junliang Wu and Tingting Zhang and Tianyi Liu and Yifan Hao and Zidong Du and Qi Guo and Fuxin Zhang Tyche: an Efficient and General Prefetcher for Indirect Memory Accesses 30:1--30:?? Kunpeng Xie and Ye Lu and Xinyu He and Dezhi Yi and Huijuan Dong and Yao Chen Winols: a Large-Tiling Sparse Winograd CNN Accelerator on FPGAs . . . . . . 
. . 31:1--31:?? Ke Liu and Kan Wu and Hua Wang and Ke Zhou and Peng Wang and Ji Zhang and Cong Li SLAP: Segmented Reuse-Time-Label Based Admission Policy for Content Delivery Network Caching . . . . . . . . . . . . 32:1--32:?? Panagiotis Miliadis and Dimitris Theodoropoulos and Dionisios Pnevmatikatos and Nectarios Koziris Architectural Support for Sharing, Isolating and Virtualizing FPGA Resources . . . . . . . . . . . . . . . 33:1--33:?? Haitao Du and Yuhan Qin and Song Chen and Yi Kang FASA-DRAM: Reducing DRAM Latency with Destructive Activation and Delayed Restoration . . . . . . . . . . . . . . 34:1--34:?? Michael Canesche and Vanderson Rosário and Edson Borin and Fernando Quintão Pereira The Droplet Search Algorithm for Kernel Scheduling . . . . . . . . . . . . . . . 35:1--35:?? Asmita Pal and Keerthana Desai and Rahul Chatterjee and Joshua San Miguel Camouflage: Utility-Aware Obfuscation for Accurate Simulation of Sensitive Program Traces . . . . . . . . . . . . . 36:1--36:?? Chengying Huan and Yongchao Liu and Heng Zhang and Shuaiwen Song and Santosh Pandey and Shiyang Chen and Xiangfei Fang and Yue Jin and Baptiste Lepers and Yanjun Wu and Hang Liu TEA+: a Novel Temporal Graph Random Walk Engine with Hybrid Storage Architecture 37:1--37:?? Soojin Hwang and Daehyeon Baek and Jongse Park and Jaehyuk Huh Cerberus: Triple Mode Acceleration of Sparse Matrix and Vector Multiplication 38:1--38:?? Siddhartha Raman Sundara Raman and Lizy John and Jaydeep P. Kulkarni NEM-GNN: DAC/ADC-less, Scalable, Reconfigurable, Graph and Sparsity-Aware Near-Memory Accelerator for Graph Neural Networks . . . . . . . . . . . . . . . . 39:1--39:?? Yan Chen and Qiwen Ke and Huiba Li and Yongwei Wu and Yiming Zhang xMeta: SSD-HDD-hybrid Optimization for Metadata Maintenance of Cloud-scale Object Storage . . . . . . . . . . . . . 40:1--40:?? 
Vidush Singhal and Laith Sakka and Kirshanthan Sundararajah and Ryan Newton and Milind Kulkarni Orchard: Heterogeneous Parallelism and Fine-grained Fusion for Complex Tree Traversals . . . . . . . . . . . . . . . 41:1--41:??
Hajar Falahati and Mohammad Sadrosadati and Qiumin Xu and Juan Gómez-Luna and Banafsheh Saber Latibari and Hyeran Jeon and Shaahin Hessabi and Hamid Sarbazi-Azad and Onur Mutlu and Murali Annavaram and Masoud Pedram Cross-core Data Sharing for Energy-efficient GPUs . . . . . . . . . 42:1--42:?? Ching-Jui Lee and Tsung Tai Yeh ReSA: Reconfigurable Systolic Array for Multiple Tiny DNN Tensors . . . . . . . 43:1--43:?? Ziheng Wang and Xiaoshe Dong and Yan Kang and Heng Chen and Qiang Wang An Example of Parallel Merkle Tree Traversal: Post-Quantum Leighton--Micali Signature on the GPU . . . . . . . . . . 44:1--44:?? Jiang Wu and Zhuo Zhang and Deheng Yang and Jianjun Xu and Jiayu He and Xiaoguang Mao Knowledge-Augmented Mutation-Based Bug Localization for Hardware Design Code 45:1--45:?? Chen Ding and Jian Zhou and Kai Lu and Sicen Li and Yiqin Xiong and Jiguang Wan and Ling Zhan D$^2$Comp: Efficient Offload of LSM-tree Compaction with Data Processing Units on Disaggregated Storage . . . . . . . . . 46:1--46:?? Zhuohao Wang and Lei Liu and Limin Xiao iSwap: a New Memory Page Swap Mechanism for Reducing Ineffective I/O Operations in Cloud Environments . . . . . . . . . 47:1--47:?? Junkaixuan Li and Yi Kang GraphSER: Distance-Aware Stream-Based Edge Repartition for Many-Core Systems 48:1--48:?? Ke Wu and Dezun Dong and Weixia Xu COER: a Network Interface Offloading Architecture for RDMA and Congestion Control Protocol Codesign . . . . . . . 49:1--49:?? Qunyou Liu and Darong Huang and Luis Costero and Marina Zapater and David Atienza Intermediate Address Space: virtual memory optimization of heterogeneous architectures for cache-resident workloads . . . . . . . . . . . . . . . 50:1--50:?? Dongmoon Min and Ilkwon Byun and Gyu-Hyeon Lee and Jangwoo Kim CoolDC: a Cost-Effective Immersion-Cooled Datacenter with Workload-Aware Temperature Scaling . . . 51:1--51:?? Hai Zhou and Dan Feng Stripe-schedule Aware Repair in Erasure-coded Clusters with Heterogeneous Star Networks . . . . . . 52:1--52:??
Bobin Deng and Bhargava Nadendla and Kun Suo and Yixin Xie and Dan Chia-Tien Lo Fixed-point Encoding and Architecture Exploration for Residue Number Systems 53:1--53:?? Yizhuo Wang and Fangli Chang and Bingxin Wei and Jianhua Gao and Weixing Ji Optimization of Sparse Matrix Computation for Algebraic Multigrid on GPUs . . . . . . . . . . . . . . . . . . 54:1--54:?? Luming Wang and Xu Zhang and Songyue Wang and Zhuolun Jiang and Tianyue Lu and Mingyu Chen and Siwei Luo and Keji Huang Asynchronous Memory Access Unit: Exploiting Massive Parallelism for Far Memory Access . . . . . . . . . . . . . 55:1--55:?? Yunping Zhao and Sheng Ma and Hengzhu Liu and Dongsheng Li SAL: Optimizing the Dataflow of Spin-based Architectures for Lightweight Neural Networks . . . . . . . . . . . . 56:1--56:?? Kai Lu and Siqi Zhao and Haikang Shan and Qiang Wei and Guokuan Li and Jiguang Wan and Ting Yao and Huatao Wu and Daohui Wang Scythe: a Low-latency RDMA-enabled Distributed Transaction System for Disaggregated Memory . . . . . . . . . . 57:1--57:?? Wangqi Peng and Yusen Li and Xiaoguang Liu and Gang Wang Lavender: an Efficient Resource Partitioning Framework for Large-Scale Job Colocation . . . . . . . . . . . . . 58:1--58:?? Feng Zhang and Fulin Nan and Binbin Xu and Zhirong Shen and Jiebin Zhai and Dmitrii Kaplun and Jiwu Shu Achieving Tunable Erasure Coding with Cluster-Aware Redundancy Transitioning 59:1--59:?? Ataberk Olgun and F. Nisa Bostanci and Geraldo Francisco de Oliveira Junior and Yahya Can Tugrul and Rahul Bera and Abdullah Giray Yaglikci and Hasan Hassan and Oguz Ergin and Onur Mutlu Sectored DRAM: a Practical Energy-Efficient and High-Performance Fine-Grained DRAM Architecture . . . . . 60:1--60:?? Xiaohui Wei and Chenyang Wang and Hengshan Yue and Jingweijia Tan and Zeyu Guan and Nan Jiang and Xinyang Zheng and Jianpeng Zhao and Meikang Qiu ReIPE: Recycling Idle PEs in CNN Accelerator for Vulnerable Filters Soft-Error Detection . . . . . . . . . . 61:1--61:??
Qiao Li and Yu Chen and Guanyu Wu and Yajuan Du and Min Ye and Xinbiao Gan and Jie Zhang and Zhirong Shen and Jiwu Shu and Chun Xue Characterizing and Optimizing LDPC Performance on $3$D NAND Flash Memories 62:1--62:?? Jiahong Xu and Haikun Liu and Zhuohui Duan and Xiaofei Liao and Hai Jin and Xiaokang Yang and Huize Li and Cong Liu and Fubing Mao and Yu Zhang ReHarvest: an ADC Resource-Harvesting Crossbar Architecture for ReRAM-Based DNN Accelerators . . . . . . . . . . . . 63:1--63:?? Jiang Wu and Zhuo Zhang and Deheng Yang and Jianjun Xu and Jiayu He and Xiaoguang Mao Time-Aware Spectrum-Based Bug Localization for Hardware Design Code with Data Purification . . . . . . . . . 64:1--64:??
Zhuoran Song and Zhongkai Yu and Xinkai Song and Yifan Hao and Li Jiang and Naifeng Jing and Xiaoyao Liang Environmental Condition Aware Super-Resolution Acceleration Framework in Server--Client Hierarchies . . . . . 65:1--65:?? Georgia Antoniou and Davide Bartolini and Haris Volos and Marios Kleanthous and Zhe Wang and Kleovoulos Kalaitzidis and Tom Rollet and Ziwei Li and Onur Mutlu and Yiannakis Sazeides and Jawad Haj Yahya Agile C-states: a Core C-state Architecture for Latency Critical Applications Optimizing both Transition and Cold-Start Latency . . . . . . . . . 66:1--66:?? Xinbiao Gan and Tiejun Li and Feng Xiong and Bo Yang and Xinhai Chen and Chunye Gong and Shijie Li and Kai Lu and Qiao Li and Yiming Zhang MST: Topology-Aware Message Aggregation for Exascale Graph Processing of Traversal-Centric Algorithms . . . . . . 67:1--67:?? Yujie Cui and Wei Chen and Xu Cheng and Jiangfang Yi Hyperion: a Highly Effective Page and PC Based Delta Prefetcher . . . . . . . . . 68:1--68:?? Jianhua Gao and Weixing Ji and Yizhuo Wang Optimization of Large-Scale Sparse Matrix--Vector Multiplication on Multi-GPU Systems . . . . . . . . . . . 69:1--69:?? Zhengding Hu and Jingwei Sun and Zhongyang Li and Guangzhong Sun AG-SpTRSV: an Automatic Framework to Optimize Sparse Triangular Solve on GPUs 70:1--70:?? Wenbo Zhang and Yiqi Liu and Tianhao Zang and Zhenshan Bao EA4RCA: Efficient AIE accelerator design framework for regular Communication-Avoiding Algorithm . . . . 71:1--71:?? Arun Thangamani and Vincent Loechner and Stéphane Genaud A Survey of General-purpose Polyhedral Compilers . . . . . . . . . . . . . . . 72:1--72:?? Junqing Lin and Jingwei Sun and Xiaolong Shi and Honghe Zhang and Xianzhi Yu and Xinzhi Wang and Jun Yao and Guangzhong Sun LO-SpMM: Low-cost Search for High-performance SpMM Kernels on GPUs 73:1--73:?? 
Chenglong Yi and Jintong Liu and Shenggang Wan and Juntao Fang and Bin Sun and Liqiang Zhang Data Deduplication Based on Content Locality of Transactions to Enhance Blockchain Scalability . . . . . . . . . 74:1--74:?? Joshua Dennis Booth and Phillip Lane A NUMA-Aware Version of an Adaptive Self-Scheduling Loop Scheduler . . . . . 75:1--75:?? Yu Tang and Qiao Li and Lujia Yin and Dongsheng Li and Yiming Zhang and Chenyu Wang and Xingcheng Zhang and Linbo Qiao and Zhaoning Zhang and Kai Lu DELTA: Memory-Efficient Training via Dynamic Fine-Grained Recomputation and Swapping . . . . . . . . . . . . . . . . 76:1--76:?? Zhenhua Tan and Linbo Long and Jingcheng Shen and Renping Liu and Congming Gao and Kan Zhong and Yi Jiang Optimizing Garbage Collection for ZNS SSDs via In-storage Data Migration and Address Remapping . . . . . . . . . . . 77:1--77:?? Xiang Li and Qiong Chang and Aolong Zha and Shijie Chang and Yun Li and Jun Miyazaki An Optimized GPU Implementation for GIST Descriptor . . . . . . . . . . . . . . . 78:1--78:?? Xiaobo Lu and Jianbin Fang and Lin Peng and Chun Huang and Zidong Du and Yongwei Zhao and Zheng Wang Mentor: a Memory-Efficient Sparse-dense Matrix Multiplication Accelerator Based on Column-Wise Product . . . . . . . . . 79:1--79:?? Yu Feng and Weikai Lin and Zihan Liu and Jingwen Leng and Minyi Guo and Han Zhao and Xiaofeng Hou and Jieru Zhao and Yuhao Zhu Potamoi: Accelerating Neural Rendering via a Unified Streaming Architecture . . 80:1--80:?? Changxi Liu and Alen Sabu and Akanksha Chaudhari and Qingxuan Kang and Trevor E. Carlson Pac-Sim: Simulation of Multi-threaded Workloads using Intelligent, Live Sampling . . . . . . . . . . . . . . . . 81:1--81:?? Saurabh Raje and Yufan Xu and Atanas Rountev and Edward F. Valeev and P. Sadayappan CoNST: Code Generator for Sparse Tensor Networks . . . . . . . . . . . . . . . . 82:1--82:?? 
Danlin Jia and Geng Yuan and Yiming Xie and Xue Lin and Ningfang Mi A Data-Loader Tunable Knob to Shorten GPU Idleness for Distributed Deep Learning . . . . . . . . . . . . . . . . 83:1--83:?? Shaobu Wang and Guangyan Zhang and Junyu Wei and Yang Wang and Jiesheng Wu and Qingchao Luo Understanding Silent Data Corruption in Processors for Mitigating its Effects 84:1--84:?? Yen-Yu Lu and Chin-Hsien Wu and Shih-Jen Li and Cheng-Tze Lee and Cheng-Yen Wu A Stable Idle Time Detection Platform for Real I/O Workloads . . . . . . . . . 85:1--85:?? Lingyu Sun and Xiaofeng Hou and Chao Li and Jiacheng Liu and Xinkai Wang and Quan Chen and Minyi Guo $ A^2 $: Towards Accelerator Level Parallelism for Autonomous Micromobility Systems . . . . . . . . . . . . . . . . 86:1--86:?? Manojna Sistla and Yiding Liu and Xin Fu Towards High Performance QNNs via Distribution-Based CNOT Gate Reduction 87:1--87:?? Fubing Mao and Xu Liu and Yu Zhang and Haikun Liu and Xiaofei Liao and Hai Jin and Wei Zhang and Jian Zhou and Yufei Wu and Longyu Nie and Yapu Guo and Zihan Jiang and Jingkang Liu PMGraph: Accelerating Concurrent Graph Queries over Streaming Graphs . . . . . 88:1--88:?? Wentong Li and Yina Lv and Longfei Luo and Yunpeng Song and Liang Shi Access Characteristic-Guided Remote Swapping Across Mobile Devices . . . . . 89:1--89:?? Yinan Zhang and Shun Yang and Huiqi Hu and Chengcheng Yang and Peng Cai and Xuan Zhou SuccinctKV: a CPU-efficient LSM-tree Based KV Store with Scan-based Compaction . . . . . . . . . . . . . . . 90:1--90:?? Siyuan Ma and Kaustubh Mhatre and Jian Weng and Bagus Hanindhito and Zhengrong Wang and Tony Nowatzki and Lizy John and Aman Arora PIMSAB: a Processing-In-Memory System with Spatially-Aware Communication and Bit-Serial-Aware Computation . . . . . . 91:1--91:??