Babak Falsafi
Full Professor
babak.falsafi@epfl.ch +41 21 693 55 92 http://parsa.epfl.ch/~falsafi
EPFL IC IINFCOM PARSA
INJ 233 (Bâtiment INJ)
Station 14
1015 Lausanne
+41 21 693 13 93
Office: INJ 233
EPFL > IC > IINFCOM > PARSA. Web site: https://parsa.epfl.ch/
EPFL > IC > IC-SIN > SIN-ENS. Web site: https://sin.epfl.ch
EPFL > IC > IC-SSC > SSC-ENS. Web site: https://ssc.epfl.ch
EPFL > SB > SB-SMA > SMA-ENS. Web site: https://sma.epfl.ch/
Biography
Babak is a Professor in the School of Computer and Communication Sciences and the founder of EcoCloud, an industrial/academic consortium at EPFL investigating scalable, sustainable information technology. He has made numerous contributions to computer system design and evaluation, including a scalable multiprocessor architecture that was prototyped by Sun Microsystems (now Oracle), snoop filters incorporated into multi-socket x86 servers and IBM BlueGene supercomputers, spatial and temporal memory streaming techniques that appear in ARM cores, and computer system performance evaluation methodologies that have been in use by AMD, HP and Google PerfKit. He has shown that hardware memory consistency models are neither necessary (in the 1990s) nor sufficient (a decade later) to achieve high performance in servers. These results eventually led to fence speculation in modern CPUs. His work on cloud-native CPUs laid the foundation for the first generation of Cavium ARM server CPUs, ThunderX. He is a recipient of an NSF CAREER award, IBM Faculty Partnership Awards, and an Alfred P. Sloan Research Fellowship. He is a fellow of ACM and IEEE.
NEWS
Online services are stuck in memory and DRAM is not scaling. AstriFlash at HPCA'23 presents a system to serve data directly out of Flash, reducing memory cost by 20x and meeting ms-scale SLOs for online services at 95% of the throughput of DRAM.
Network bandwidth is projected to grow at 20% a year for a decade thanks to optics. Logic density is lagging behind at 15% a year and slowing down, resulting in a "datacenter tax". Optimus Prime, a data transformation accelerator, NebuLA, a hardware-terminated network stack, and Cerebros, an RPC processor, are examples of how to mitigate the datacenter tax in the post-Moore era. Great to see that Google followed up with their own data transformation accelerator in 2022.
See our paper on "Rebooting Virtual Memory with Midgard" for a novel approach to future-proof virtual memory. Here is a news snippet.
Numerical training of DNNs is converging on fixed point with orders of magnitude improvement in logic, memory, power and bandwidth. See our blog.
RESEARCH
Data has emerged as a currency for modern society, and datacenters are now the backbone of IT, offering large-scale cloud services at low cost by exploiting economies of scale. With silicon efficiency scaling having dwindled since 2004 and silicon density scaling (Moore's Law) slowing down, future digital platforms will rely on heterogeneous logic and memory to allow for IT scalability. Meanwhile, the demand for large-scale cloud services has grown dramatically faster than conventional silicon scaling, making IT platform scalability a grand challenge. Future platforms will need hand-in-hand collaboration between application domain experts and platform designers to improve scalability. With many online services being in-memory and the minimum communication latency between the farthest nodes being microseconds, future server platforms will go through revolutionary changes in architecture and systems to enable seamless aggregation of logic and memory resources across nodes, breaking the conventional abstraction layers. Babak's research and educational activities center around post-Moore server design.
He investigates techniques to address these challenges in the context of the following projects:
- CloudSuite: A Benchmark Suite for Scale-Out Workloads
- ColTraIn: Co-Located Training and Inference DNN Accelerators
- HARNESS: Heterogeneous Architectures for Next-Generation Server Systems
- Midgard: Future-Proofing Virtual Memory
- QFlex: Fast, Full-System Open-Source Server Simulation/Emulation
- VISA: Cloud-Native CPUs
Selected Talks
Integration, Specialization and Approximation: the "ISA" of Post-Moore Servers
HPCA Keynote, 2022.
Post-Moore AI Infrastructure
Facebook SysML Talk, 2021.
Post-Moore Server Architecture
ICS Keynote, 2020 (Video on YouTube!).
Server Architecture for the Post-Moore Era
HotDC Keynote, 2017.
Publications
Infoscience publications
UrbanTwin: An urban digital twin for climate action
EcoCloud Annual Event on IT Sustainability 2024, Lausanne, Switzerland, 2024-10-08.
Electrical-Level Fault-Injection Attacks on FPGA-Based Systems
Lausanne, EPFL, 2024. DOI: 10.5075/epfl-thesis-10315.
Secure Interface Design Leveraging Hardware/Software Support
Lausanne, EPFL, 2024. DOI: 10.5075/epfl-thesis-9975.
What's Missing in Agile Hardware Design? Verification!
Journal of Computer Science and Technology. 2023. DOI: 10.1007/s11390-023-0005-3.
Scale-out Systolic Arrays
ACM Transactions on Architecture and Code Optimization. 2023. DOI: 10.1145/3572917.
AstriFlash: A Flash-Based System for Online Services
2023. The 29th IEEE International Symposium on High-Performance Computer Architecture (HPCA-29), Montreal, QC, Canada, Feb 25 – March 01, 2023. DOI: 10.1109/HPCA56546.2023.10070955.
Rebooting Virtual Memory with Midgard
Lausanne, EPFL, 2023. DOI: 10.5075/epfl-thesis-8864.
Cooperative Concurrency Control for Write-Intensive Key-Value Workloads
2023. The 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS'23), Vancouver, BC, Canada, March 25–29, 2023. p. 30 - 46. DOI: 10.1145/3567955.3567957.
Evaluating, Exploiting, and Hiding Power Side-Channel Leakage of Remote FPGAs
Lausanne, EPFL, 2023. DOI: 10.5075/epfl-thesis-9918.
Imprecise Store Exceptions
2023. The 50th Annual International Symposium on Computer Architecture (ISCA ’23), Orlando, FL, USA, June 17–21, 2023. DOI: 10.1145/3579371.3589087.
SecureCells: A Secure Compartmentalized Architecture
2023. 44th IEEE Symposium on Security and Privacy, San Francisco, USA, May 22-24, 2023. p. 2921 - 2939. DOI: 10.1109/SP46215.2023.00125.
Algorithms for Efficient and Robust Distributed Deep Learning
Lausanne, EPFL, 2022. DOI: 10.5075/epfl-thesis-8980.
Hardware and Software Support for RPC-Centric Server Architecture
Lausanne, EPFL, 2022. DOI: 10.5075/epfl-thesis-8017.
Cerebros: Evading the RPC Tax in Datacenters
2021. MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture, Virtual Event, Greece, October 18–22, 2021. p. 407 - 420. DOI: 10.1145/3466752.3480055.
Equinox: Training (for Free) on a Custom Inference Accelerator
2021. 54th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’21), Virtual Event, Greece, October 18–22, 2021. DOI: 10.1145/3466752.3480057.
Hardware-Software Co-Design of an RPC Processor
Lausanne, EPFL, 2021. DOI: 10.5075/epfl-thesis-7217.
Data transformer apparatus
US11748254; US2022327048; WO2021037341. 2021.
Rebooting Virtual Memory with Midgard
2021. ISCA 2021 48th International Symposium on Computer Architecture, Online conference, June 14-19, 2021. DOI: 10.1109/ISCA52012.2021.00047.
Exploiting Errors for Efficiency: A Survey from Circuits to Applications
ACM Computing Surveys. 2020. DOI: 10.1145/3394898.
ColTraIn: Co-located DNN training and inference
Lausanne, EPFL, 2020. DOI: 10.5075/epfl-thesis-10265.
SPARTA: A Divide and Conquer Approach to Address Translation for Accelerators
2020.
Optimus Prime: Accelerating Data Transformation in Servers
2020. Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, Lausanne, Switzerland, March 16–20, 2020. p. 1203 - 1216. DOI: 10.1145/3373376.3378501.
The NEBULA RPC-Optimized Architecture
2020. 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), Valencia, Spain, May 30 - June 3, 2020. p. 199 - 212. DOI: 10.1109/ISCA45697.2020.00027.
Distributed Logless Atomic Durability with Persistent Memory
2019. The 52nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-52), Columbus, OH, USA, October 12–16, 2019. DOI: 10.1145/3352460.3358321.
Analog Neural Networks with Deep-submicron Nonlinear Synapses
IEEE Micro. 2019. DOI: 10.1109/MM.2019.2931182.
SMoTherSpectre: Exploiting Speculative Execution through Port Contention
2019. The 26th ACM Conference on Computer and Communications Security - ACM CCS 2019, London, UK, November 11-15, 2019. p. 785 - 800. DOI: 10.1145/3319535.3363194.
Design Guidelines for High-Performance SCM Hierarchies
2018. 4th International Symposium on Memory Systems (MEMSYS), Old Town Alexandria, VA, USA, October 1-4, 2018. DOI: 10.1145/3240302.3240310.
Atomic object reads for in-memory rack-scale computing
US10929174; US2018173673. 2018.
Training DNNs with Hybrid Block Floating Point
2018. NeurIPS 2018 - 32nd Conference on Neural Information Processing Systems, Montreal, Canada, Dec 02-08, 2018.
Network-Compute Co-Design for Distributed In-Memory Computing
Lausanne, EPFL, 2018. DOI: 10.5075/epfl-thesis-8749.
Near-Memory Address Translation
Lausanne, EPFL, 2017. DOI: 10.5075/epfl-thesis-7875.
FPGAs versus GPUs in Data centers
IEEE Micro. 2017. DOI: 10.1109/MM.2017.19.
The Mondrian Data Engine
2017. The 44th International Symposium on Computer Architecture, Toronto, ON, Canada, June 24-28, 2017. DOI: 10.1145/3079856.3080233.
Unified prefetching into instruction cache and branch target buffer
US9996358; US2017090935. 2017.
Fat Caches For Scale-Out Servers
IEEE Micro. 2017. DOI: 10.1109/MM.2017.32.
Near-Memory Address Translation
2017. 26th International Conference on Parallel Architectures and Compilation Techniques (PACT), Portland, OR, Sep 09-13, 2017. p. 303 - 317. DOI: 10.1109/Pact.2017.56.
Rack-Scale Memory Pooling for Datacenters
Lausanne, EPFL, 2017. DOI: 10.5075/epfl-thesis-7612.
Towards Near-Threshold Server Processors
2016. Design, Automation and Test in Europe Conference (DATE '16), Dresden, Germany, March 14-18, 2016. p. 7 - 12.
Near-Memory Data Services
IEEE Micro. 2016. DOI: 10.1109/MM.2016.9.
Unlocking Energy
2016. 2016 USENIX Annual Technical Conference, Denver, Colorado, USA, June 22-24, 2016. p. 393 - 406.
SABRes: Atomic Object Reads for In-Memory Rack-Scale Computing
2016. 49th Annual IEEE/ACM International Symposium on Microarchitecture, Taipei, Taiwan, October 15-19, 2016. DOI: 10.1109/MICRO.2016.7783709.
The Case for RackOut: Scalable Data Serving Using Rack-Scale Systems
2016. ACM Symposium on Cloud Computing, Santa Clara, USA, October 05-07, 2016. DOI: 10.1145/2987550.2987577.
An Analysis of Load Imbalance in Scale-out Data Serving
2016. ACM SIGMETRICS, Antibes Juan-Les-Pins, France, June 14-18, 2016. p. 367 - 368. DOI: 10.1145/2896377.2901501.
Accelerators for Data Processing
Lausanne, EPFL, 2015. DOI: 10.5075/epfl-thesis-6710.
Memory Systems and Interconnects for Scale-Out Servers
Lausanne, EPFL, 2015. DOI: 10.5075/epfl-thesis-6682.
Sort vs. Hash Join Revisited for Near-Memory Execution
5th Workshop on Architectures and Systems for Big Data (ASBD 2015), Portland, Oregon, USA, June 13, 2015.
Sort vs. Hash Join Revisited for Near-Memory Execution
2015. 5th Workshop on Architectures and Systems for Big Data (ASBD 2015), Portland, Oregon, USA, June 13, 2015.
Manycore Network Interfaces for In-Memory Rack-Scale Computing
2015. 42nd International Symposium on Computer Architecture, Portland, Oregon, USA, June 13-17, 2015. DOI: 10.1145/2749469.2750415.
Multi-Gigabyte On-Chip DRAM Caches for Servers
Lausanne, EPFL, 2015. DOI: 10.5075/epfl-thesis-6631.
Asynchronous memory access chaining
Proceedings of the VLDB Endowment. 2015. DOI: 10.14778/2856318.2856321.
Confluence: unified instruction supply for scale-out servers
2015. The 48th International Symposium, Waikiki, Hawaii, 05-09 December 2015. p. 166 - 177. DOI: 10.1145/2830772.2830785.
Shared Frontend for Manycore Server Processors
Lausanne, EPFL, 2015. DOI: 10.5075/epfl-thesis-6669.
A Primer on Hardware Prefetching
Morgan & Claypool.
BuMP: Bulk Memory Access Prediction and Streaming
2014. 47th Annual IEEE/ACM International Symposium on Microarchitecture, December 13-17, 2014. p. 545 - 557. DOI: 10.1109/MICRO.2014.44.
Resolve: Enabling Accurate Parallel Monitoring under Relaxed Memory Models
2014.
Unison Cache: A Scalable and Effective Die-Stacked DRAM Cache
2014. 47th Annual IEEE/ACM International Symposium on Microarchitecture, Cambridge, UK, December 13-17, 2014. p. 25 - 37. DOI: 10.1109/MICRO.2014.51.
FADE: A Programmable Filtering Accelerator for Instruction-Grain Monitoring
2014. 20th IEEE International Symposium On High Performance Computer Architecture (HPCA-2014), Orlando, Florida, USA, February 15-19, 2014. p. 108 - 119. DOI: 10.1109/HPCA.2014.6835922.
Architectural Support to Accelerate Fine-Grain Program Monitoring
Lausanne, EPFL, 2014. DOI: 10.5075/epfl-thesis-6257.
Big Data
IEEE Micro. 2014. DOI: 10.1109/MM.2014.65.
Towards stable cloud performance
Lausanne, EPFL, 2014. DOI: 10.5075/epfl-thesis-6261.
Scale-Out NUMA
2014. Nineteenth International Conference on Architectural Support for Programming Languages and Operating Systems, Salt Lake City, Utah, USA, March 1-5, 2014. DOI: 10.1145/2541940.2541965.
A Case for Specialized Processors for Scale-Out Workloads
IEEE Micro. 2014. DOI: 10.1109/MM.2014.41.
DeSyRe: On-demand system reliability
Microprocessors and Microsystems - Embedded Hardware Design. 2013. DOI: 10.1016/j.micpro.2013.08.008.
Multi-Grain Coherence Directory
2013. 46th Annual IEEE/ACM International Symposium on Microarchitecture, Davis, CA, USA, December 7-11, 2013. DOI: 10.1145/2540708.2540739.
Top Picks from the 2012 Computer Architecture Conferences - Introduction
IEEE Micro. 2013. DOI: 10.1109/MM.2013.65.
Die-Stacked DRAM Caches for Servers: Hit Ratio, Latency, or Bandwidth? Have It All with Footprint Cache
2013. 40th International Symposium on Computer Architecture, Tel-Aviv, Israel, June 23-27, 2013. p. 404 - 415. DOI: 10.1145/2485922.2485957.
Scale-Out Processors
Lausanne, EPFL, 2013. DOI: 10.5075/epfl-thesis-5906.
SHIFT: Shared History Instruction Fetch for Lean-Core Server Processors
2013. 46th Annual IEEE/ACM International Symposium on Microarchitecture, Davis, CA, USA, December 7-11, 2013. DOI: 10.1145/2540708.2540732.
Meet the Walkers: Accelerating Index Traversals for In-Memory Databases
2013. 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'13), Davis, CA, USA, December 7-11, 2013. DOI: 10.1145/2540708.2540748.
Dark Silicon Accelerators for Database Indexing
Dark Silicon Workshop, Portland, Oregon, USA, June 10, 2012.
NOC-Out: Microarchitecting a Scale-Out Processor
2012. 45th International Symposium on Microarchitecture, Vancouver, BC, Canada, December 1-5, 2012. DOI: 10.1109/MICRO.2012.25.
Thermal Characterization of Cloud Workloads on a Power-Efficient Server-on-Chip
2012. 30th IEEE International Conference on Computer Design, Montreal, Quebec, Canada, September 30 - October 3, 2012. DOI: 10.1109/ICCD.2012.6378637.
Scale-Out Processors
2012.
BugSifter: A Generalized Accelerator for Flexible Instruction-Grain Monitoring
2012.
Dark Silicon Accelerators for Database Indexing
2012. 1st Dark Silicon Workshop, Portland, Oregon, USA, June 10, 2012.
Scale-Out Processors
2012. 39th Annual International Symposium on Computer Architecture, Portland, Oregon, USA, June 9-13, 2012. DOI: 10.1145/2366231.2337217.
Optimizing Data-Center TCO with Scale-Out Processors
IEEE Micro. 2012. DOI: 10.1109/MM.2012.71.
CCNoC: Specializing On-Chip Interconnects for Energy Efficiency in Cache-Coherent Servers
2012. 6th International Symposium on Networks-on-Chip, Lyngby, Denmark, May 9-11, 2012.
Clearing the Clouds: A Study of Emerging Scale-out Workloads on Modern Hardware
2012. Seventeenth International Conference on Architectural Support for Programming Languages and Operating Systems, London, UK, March 3-7, 2012.
Quantifying the Mismatch between Emerging Scale-Out Applications and Modern Processors
ACM Transactions on Computer Systems. 2012. DOI: 10.1145/2382553.2382557.
Proactive Instruction Fetch
2011. 44th Annual IEEE/ACM Symposium on Microarchitecture (MICRO 2011), Porto Alegre, Brazil, December 3-7. p. 152 - 162. DOI: 10.1145/2155620.2155638.
Reliability in the Dark Silicon Era
2011. 17th IEEE International On-Line Testing Symposium (IOLTS), Athens, Greece, Jul 13-15, 2011. p. V - V.
Cuckoo Directory: A Scalable Directory for Many-Core Systems
2011. HPCA 2011, San Antonio, Texas, USA, February 12-16, 2011. DOI: 10.1109/HPCA.2011.5749726.
Toward Dark Silicon in Servers
IEEE Micro. 2011. DOI: 10.1109/MM.2011.77.
CCNoC: On-Chip Interconnects for Cache-Coherent Manycore Server Chips
2011. Workshop on Energy-Efficient Design (WEED 2011), San Jose, California, USA, June 5, 2011.
Clearing the Clouds: A Study of Emerging Workloads on Modern Hardware
2011.
ParaLog: enabling and accelerating online parallel monitoring of multithreaded applications
2010. ASPLOS 2010, Pittsburgh, Pennsylvania, USA, March 13-17, 2010. p. 271 - 284. DOI: 10.1145/1736020.1736051.
Making Address-Correlated Prefetching Practical
IEEE Micro. 2010. DOI: 10.1109/MM.2010.21.
Near-Optimal Cache Block Placement with Reactive Nonuniform Cache Architectures
IEEE Micro. 2010. DOI: 10.1109/MM.2010.22.
TurboTag: Lookup Filtering to Reduce Coherence Directory Power
2010. 16th International Symposium on Low Power Electronics and Design (ISLPED 10), Austin, Texas, USA, August 18-20. p. 377 - 382. DOI: 10.1145/1840845.1840929.
Flexible Hardware Acceleration for Instruction-Grain Lifeguards
IEEE Micro Top Picks. 2009. DOI: 10.1109/MM.2009.6.
Reactive NUCA: Near-Optimal Block Placement and Replication in Distributed Caches
2009. 36th ACM/IEEE Annual International Symposium on Computer Architecture, Austin, TX. p. 184 - 195. DOI: 10.1145/1555754.1555779.
Spatio-Temporal Memory Streaming
2009. 36th ACM/IEEE Annual International Symposium on Computer Architecture, Austin, TX. p. 69 - 80. DOI: 10.1145/1555754.1555766.
ProtoFlex: Towards Scalable, Full-System Multiprocessor Simulations Using FPGAs
ACM Transactions on Reconfigurable Technology and Systems. 2009. DOI: 10.1145/1534916.1534925.
Chip-Level Redundancy in Distributed Shared-Memory Multiprocessors
2009. p. 195 - 201. DOI: 10.1109/PRDC.2009.39.
Practical Off-chip Meta-data for Temporal Memory Streaming
2009. 15th International Symposium on High-Performance Computer Architecture, Raleigh, NC. p. 79 - 90. DOI: 10.1109/HPCA.2009.4798239.
Workshop on Transactional Computing (TRANSACT 2008) - Introduction
ACM SIGPLAN Notices. 2008. DOI: 10.1145/1402227.1402233.
Temporal streams in commercial server applications
2008. IEEE International Symposium on Workload Characterization (IISWC), Seattle, WA, September. p. 99 - 108. DOI: 10.1109/IISWC.2008.4636095.
Predictor virtualization
2008. The 13th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Seattle, WA, March. p. 157 - 167. DOI: 10.1145/1346281.1346301.
Flexible hardware acceleration for instruction-grain program monitoring
2008. The 35th Annual International Symposium on Computer Architecture (ISCA), Beijing, China, June. p. 377 - 388. DOI: 10.1109/ISCA.2008.20.
A Complexity-Effective Architecture for Accelerating Full-System Multiprocessor Simulations Using FPGAs
2008. 16th International ACM/SIGDA Symposium on Field Programmable Gate Arrays (FPGA), Monterey, CA, February. p. 77 - 86. DOI: 10.1145/1344671.1344684.
Temporal instruction fetch streaming
2008. The 41st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Lake Como, Italy, November. p. 1 - 10. DOI: 10.1109/MICRO.2008.4771774.
Database Servers on Chip Multiprocessors: Limitations and Opportunities
2007.
Mechanisms for store-wait-free multiprocessors
2007. p. 266 - 277. DOI: 10.1145/1250662.1250696.
PROTOFLEX: FPGA-accelerated hybrid functional simulator
2007. DOI: 10.1109/IPDPS.2007.370516.
PAI: A lightweight mechanism for single-node memory recovery in DSM servers
2007. p. 298 - 305. DOI: 10.1109/PRDC.2007.53.
An Analysis of Database System Performance on Chip Multiprocessors
2007.
To Share or Not To Share?
2007. 33rd International Conference on Very Large Data Bases, Vienna, Austria, September. p. 351 - 362.
Last-touch correlated data streaming
2007. p. 105 - 115. DOI: 10.1109/ISPASS.2007.363741.
Multi-bit error tolerant caches using two-dimensional error coding
2007. p. 197 - 209. DOI: 10.1109/MICRO.2007.19.
Scheduling threads for constructive cache sharing on CMPs
2007. p. 105 - 115. DOI: 10.1145/1248377.1248396.
Spatial Memory Streaming
2006. p. 252 - 263. DOI: 10.1109/ISCA.2006.38.
Dynamic feature selection for hardware prediction
Journal of Systems Architecture. 2006. DOI: 10.1016/j.sysarc.2004.12.007.
Coarse-grain coherence tracking: RegionScout and region coherence arrays
IEEE Micro. 2006. DOI: 10.1109/MM.2006.8.
Reunion: Complexity-effective multicore redundancy
2006. p. 223 - 234. DOI: 10.1109/MICRO.2006.42.
ProtoFlex: Co-simulation for Component-wise FPGA Emulator Development
2006.
Log-based architectures for general-purpose monitoring of deployed code
2006. p. 63 - 65. DOI: 10.1145/1181309.1181319.
Statistical sampling of microarchitecture simulation
ACM Transactions on Modeling and Computer Simulation. 2006. DOI: 10.1145/1147224.1147225.
The Granularity of Soft-Error Containment in Shared-Memory Multiprocessors
2006.
Exploiting reference idempotency to reduce speculative storage overflow
ACM Transactions on Programming Languages and Systems. 2006. DOI: 10.1145/1152649.1152653.
Parallel depth first vs. work stealing schedulers on CMP architectures
2006. DOI: 10.1145/1148109.1148167.
Simulation sampling with live-points
2006. p. 2 - 12. DOI: 10.1109/ISPASS.2006.1620785.
A case for asymmetric-cell cache memories
IEEE Transactions on Very Large Scale Integration Systems. 2005. DOI: 10.1109/TVLSI.2005.850127.
Evaluating scheduling policies for fine-grain communication protocols on a cluster of SMPs
Journal of Parallel and Distributed Computing. 2005. DOI: 10.1016/j.jpdc.2004.11.011.
Understanding the performance of concurrent error detecting superscalar microarchitectures
2005. p. 13 - 18. DOI: 10.1109/ISSPIT.2005.1577062.
DBmbench: fast and accurate database workload representation on modern microarchitecture
2005. p. 254 - 267. DOI: 10.1145/1105634.1105653.
TRUSS: A Reliable, Scalable Server Architecture
IEEE Micro. 2005. DOI: 10.1109/MM.2005.122.
TurboSMARTS: Accurate microarchitecture simulation sampling in minutes
2005. p. 408 - 409. DOI: 10.1145/1064212.1064278.
ReCast: Boosting tag line buffer coverage in low-power high-level caches "for free"
2005. p. 609 - 616. DOI: 10.1109/ICCD.2005.90.
Temporal Streaming of Shared Memory
2005. p. 222 - 233. DOI: 10.1109/ISCA.2005.50.
Store-Ordered Streaming of Shared Memory
2005. p. 75 - 86. DOI: 10.1109/PACT.2005.37.
Accelerating Database Operations Using a Network Processor
2005.
Fingerprinting: Bounding the Soft-Error Detection Latency and Bandwidth
IEEE Micro. 2004. DOI: 10.1109/MM.2004.72.
An Evaluation of Stratified Sampling of Microarchitecture Simulations
2004.
SimFlex: a fast, accurate, flexible full-system simulation framework for performance evaluation of server architecture
Performance Evaluation Review. 2004. DOI: 10.1145/1054907.1054914.
Fingerprinting: Bounding the Soft-Error Detection Latency and Bandwidth
2004.
TurboSMARTS: Accurate Microarchitecture Simulation Sampling in Minutes
2004.
The Fourth International Workshop on Power-Aware Computer Systems. Revised Papers.
2004.
Efficient resource sharing in concurrent error detecting superscalar microarchitectures
2004. p. 257 - 268. DOI: 10.1109/MICRO.2004.19.
Memory coherence activity prediction in commercial workloads
2004. p. 37 - 45. DOI: 10.1145/1054943.1054949.
SORDS: Just-In-Time Streaming of Temporally-Correlated Shared Data
2004.
The Third International Workshop on Power-Aware Computer Systems. Revised Papers.
2004.
Accurate and complexity-effective spatial pattern prediction
2004. p. 276 - 287.
Performance and Energy Trade-Offs of Bitline Isolation in Nanoscale CMOS Caches
2003.
Implicitly-multithreaded processors
2003. p. 39 - 50. DOI: 10.1145/859618.859624.
The Second International Workshop on Power-Aware Computer Systems. Revised Papers.
2003.
Near-optimal precharging in high-performance nanoscale CMOS caches
2003. p. 67 - 78. DOI: 10.1109/MICRO.2003.1253184.
Speculative Sequential Consistency with Little Custom Storage
Journal of Instruction-Level Parallelism. 2003.
Optimizing traffic in DSM clusters: fine-grain memory caching versus page migration/replication
Theory of Computing Systems. 2002. DOI: 10.1007/s00224-002-1054-6.
Speculative sequential consistency with little custom storage
2002. p. 179 - 188. DOI: 10.1109/PACT.2002.1106016.
Exploiting choice in resizable cache design to optimize deep-submicron processor energy-delay
2002. p. 151 - 161. DOI: 10.1109/HPCA.2002.995706.
Gated Precharge: Using Temporal Locality of Subarrays to Save Deep-Submicron Cache Energy
2002.
Reducing leakage in a high-performance deep-submicron instruction cache
IEEE Transactions on Very Large Scale Integration (VLSI) Systems. 2001. DOI: 10.1109/92.920821.
Multiplex: Unifying conventional and speculative thread-level parallelism on a chip multiprocessor
2001. p. 368 - 380. DOI: 10.1145/377792.377863.
The First International Workshop on Power-Aware Computer Systems. Revised Papers.
2001.
An integrated circuit/architecture approach to reducing leakage in deep-submicron high-performance I-caches
2001. p. 147 - 157. DOI: 10.1109/HPCA.2001.903259.
Dual Use of Superscalar Datapath for Transient-Fault Detection and Recovery
2001. 34th Annual IEEE/ACM International Symposium on Microarchitecture, Austin, Texas, December 1-5, 2001. p. 214 - 224. DOI: 10.1109/MICRO.2001.991120.
Evaluating Opportunity and Effectiveness of Cache Resizing in Reducing Energy Dissipation
2001.
Dead-block prediction & dead-block correlating prefetchers
2001. p. 144 - 154. DOI: 10.1109/ISCA.2001.937443.
JETTY: Filtering snoops for reduced energy consumption in SMP servers
2001. p. 85 - 96. DOI: 10.1109/HPCA.2001.903254.
Reference idempotency analysis: A framework for optimizing speculative execution
2001. p. 2 - 11. DOI: 10.1145/379539.379547.
Reducing set-associative cache energy via way-prediction and selective direct-mapping
2001. p. 54 - 65.
Comparing the effectiveness of fine-grain memory caching against page migration/replication in reducing traffic in DSM clusters
2000. p. 79 - 88. DOI: 10.1145/341800.341811.
Address partitioning in DSM clusters with parallel coherence controllers
2000. p. 47 - 56. DOI: 10.1109/PACT.2000.888330.
Selective, accurate, and timely self-invalidation using last-touch prediction
2000. p. 139 - 148. DOI: 10.1109/ISCA.2000.854385.
Gated-Vdd: a circuit technique to reduce leakage in deep-submicron cache memories
2000. International Symposium on Low Power Electronics and Design (ISLPED), Rapallo, Italy, July. p. 90 - 95. DOI: 10.1109/LPE.2000.876763.
The Fourth International Workshop on Network-Based Parallel Computing. Communication, Architecture, and Applications. Revised Papers.
2000.
Low-Overhead and High-Performance Implementations of Sequential Consistency
2000.
Multiplex: Unifying Conventional and Speculative Thread-Level Parallelism on a Chip Multiprocessor
2000.
Dynamically Resizable Instruction Cache: A Design for an Energy-Efficient and High-Performance Deep-Submicron Instruction Cache
2000.
Wisconsin Wind Tunnel II: a fast, portable parallel architecture simulator
IEEE Concurrency. 2000. DOI: 10.1109/4434.895100.
Dynamic Feature Selection for Hardware Prediction
2000.
Memory sharing predictor: the key to a speculative coherent DSM
1999. p. 172 - 183. DOI: 10.1109/ISCA.1999.765949.
Is SC+ILP=RC?
1999. ISCA, Atlanta, GA, May. p. 162 - 171. DOI: 10.1109/ISCA.1999.765948.
Parallel Dispatch Queue: a queue-based programming abstraction to parallelize fine-grain communication protocols
1999. p. 182 - 192. DOI: 10.1109/HPCA.1999.744362.
Cacheable Interface Control Registers for High Speed Data Transfer
US5951657. 1999.
Is SC + ILP = RC?
ACM SIGARCH Computer Architecture News. 1999. DOI: 10.1145/307338.300993.
Sirocco: cost-effective fine-grain distributed shared memory
1998. p. 40 - 49. DOI: 10.1109/PACT.1998.727144.
Reactive NUMA: A design for unifying S-COMA and CC-NUMA
1997. p. 229 - 240. DOI: 10.1145/264107.264205.
Wisconsin Wind Tunnel II: A Fast and Portable Parallel Architecture Simulator
1997.
Fine-grain Access Control for Distributed Shared Memory
Distributed Shared Memory: Concepts and Systems; IEEE Computer Society Press, 1997.
Scheduling communication on an SMP node parallel machine
1997. p. 128 - 138. DOI: 10.1109/HPCA.1997.569649.
Modeling cost/performance of a parallel computer simulator
ACM Transactions on Modeling and Computer Simulation. 1997. DOI: 10.1145/244804.244808.
Coherent network interfaces for fine-grain communication
1996. p. 247 - 258. DOI: 10.1145/232973.232999.
Implementing Fine-grain Distributed Shared Memory on Commodity SMP Workstations
1996.
When does Dedicated Protocol Processing Make Sense?
1996.
Fine-grain access control for distributed shared memory
1994. ASPLOS'94. 6th International Conference on Architectural support for Programming Languages and Operating Systems, San Jose, CA, October. p. 297 - 306. DOI: 10.1145/195470.195575.
Application-specific protocols for user-level shared memory
1994. Supercomputing '94, Washington D.C., USA, November 14-18. p. 380 - 389. DOI: 10.1109/SUPERC.1994.344301.
Cost/performance of a parallel computer simulator
1994. p. 173 - 182.
Mechanisms for Cooperative Shared Memory
CMG Transactions. 1994. DOI: 10.1145/173682.165151.
Kernel support for the Wisconsin Wind Tunnel
1993. p. 73 - 89.
Mechanisms for cooperative shared memory
1993. 20th International Symposium on Computer Architecture, San Diego, CA, May. p. 156 - 167. DOI: 10.1145/165123.165151.
Component Labeling Algorithms on an Intel iPSC/2 Hypercube
1990. p. 159 - 164.Infoscience
UrbanTwin: An urban digital twin for climate action
EcoCloud Annual Event on IT Sustainability 2024, Lausanne, Switzerland, 2024-10-08.
Electrical-Level Fault-Injection Attacks on FPGA-Based Systems
Lausanne, EPFL, 2024. DOI : 10.5075/epfl-thesis-10315.
Secure Interface Design Leveraging Hardware/Software Support
Lausanne, EPFL, 2024. DOI : 10.5075/epfl-thesis-9975.
What's Missing in Agile Hardware Design? Verification!
Journal of Computer Science and Technology. 2023. DOI : 10.1007/s11390-023-0005-3.
Scale-out Systolic Arrays
ACM Transactions on Architecture and Code Optimization. 2023. DOI : 10.1145/3572917.
AstriFlash: A Flash-Based System for Online Services
2023. The 29th IEEE International Symposium on High-Performance Computer Architecture (HPCA-29), Montreal, QC, Canada, Feb 25 – March 01, 2023. DOI : 10.1109/HPCA56546.2023.10070955.
Rebooting Virtual Memory with Midgard
Lausanne, EPFL, 2023. DOI : 10.5075/epfl-thesis-8864.
Cooperative Concurrency Control for Write-Intensive Key-Value Workloads
2023. The 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS'23), Vancouver, BC, Canada, March 25–29, 2023. p. 30 - 46. DOI : 10.1145/3567955.3567957.
Evaluating, Exploiting, and Hiding Power Side-Channel Leakage of Remote FPGAs
Lausanne, EPFL, 2023. DOI : 10.5075/epfl-thesis-9918.
Imprecise Store Exceptions
2023. The 50th Annual International Symposium on Computer Architecture (ISCA ’23), Orlando, FL, USA, June 17–21, 2023. DOI : 10.1145/3579371.3589087.
SecureCells: A Secure Compartmentalized Architecture
2023. 44th IEEE Symposium on Security and Privacy, San Francisco, USA, May 22-24, 2023. p. 2921 - 2939. DOI : 10.1109/SP46215.2023.00125.
Algorithms for Efficient and Robust Distributed Deep Learning
Lausanne, EPFL, 2022. DOI : 10.5075/epfl-thesis-8980.
Hardware and Software Support for RPC-Centric Server Architecture
Lausanne, EPFL, 2022. DOI : 10.5075/epfl-thesis-8017.
Cerebros: Evading the RPC Tax in Datacenters
2021. MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture, Virtual Event, Greece, October 18–22, 2021. p. 407 - 420. DOI : 10.1145/3466752.3480055.
Equinox: Training (for Free) on a Custom Inference Accelerator
2021. 54th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’21), Virtual Event, Greece, October 18–22, 2021. DOI : 10.1145/3466752.3480057.
Hardware-Software Co-Design of an RPC Processor
Lausanne, EPFL, 2021. DOI : 10.5075/epfl-thesis-7217.
Data transformer apparatus
US11748254 ; US2022327048 ; WO2021037341 . 2021.
Rebooting Virtual Memory with Midgard
2021. ISCA 2021 48th International Symposium on Computer Architecture, Online conference, June 14-19, 2021. DOI : 10.1109/ISCA52012.2021.00047.
Exploiting Errors for Efficiency: A Survey from Circuits to Applications
ACM Computing Surveys. 2020. DOI : 10.1145/3394898.
ColTraIn: Co-located DNN training and inference
Lausanne, EPFL, 2020. DOI : 10.5075/epfl-thesis-10265.
SPARTA: A Divide and Conquer Approach to Address Translation for Accelerators
2020.
Optimus Prime: Accelerating Data Transformation in Servers
2020. Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, Lausanne, Switzerland, March 16–20, 2020. p. 1203 - 1216. DOI : 10.1145/3373376.3378501.
The NEBULA RPC-Optimized Architecture
2020. 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), Valencia, Spain, May, 30th - June, 3rd 2020. p. 199 - 212. DOI : 10.1109/ISCA45697.2020.00027.
Distributed Logless Atomic Durability with Persistent Memory
2019. The 52nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-52), Columbus, OH, USA, October 12–16, 2019. DOI : 10.1145/3352460.3358321.
Analog Neural Networks with Deep-submicron Nonlinear Synapses
IEEE Micro. 2019. DOI : 10.1109/MM.2019.2931182.
SMoTherSpectre: Exploiting Speculative Execution through Port Contention
2019. The 26th ACM Conference on Computer and Communications Security - ACM CCS 2019, London, UK, November 11-15, 2019. p. 785 - 800. DOI : 10.1145/3319535.3363194.
Design Guidelines for High-Performance SCM Hierarchies
2018. 4th International Symposium on Memory Systems (MEMSYS), Old Town Alexandria, VA, USA, October 1-4, 2018. DOI : 10.1145/3240302.3240310.
Atomic object reads for in-memory rack-scale computing
US10929174 ; US2018173673 . 2018.
Training DNNs with Hybrid Block Floating Point
2018. NeurIPS 2018 - 32nd Conference on Neural Information Processing Systems, Montreal, Canada, Dec 02-08, 2018.
Network-Compute Co-Design for Distributed In-Memory Computing
Lausanne, EPFL, 2018. DOI : 10.5075/epfl-thesis-8749.
Near-Memory Address Translation
Lausanne, EPFL, 2017. DOI : 10.5075/epfl-thesis-7875.
FPGAs versus GPUs in Data centers
IEEE Micro. 2017. DOI : 10.1109/MM.2017.19.
The Mondrian Data Engine
2017. The 44th International Symposium on Computer Architecture, Toronto, ON, Canada, June 24-28, 2017. DOI : 10.1145/3079856.3080233.
Unified prefetching into instruction cache and branch target buffer
US9996358 ; US2017090935 . 2017.
Fat Caches For Scale-Out Servers
IEEE Micro. 2017. DOI : 10.1109/MM.2017.32.
Near-Memory Address Translation
2017. 26th International Conference on Parallel Architectures and Compilation Techniques (PACT), Portland, OR, SEP 09-13, 2017. p. 303 - 317. DOI : 10.1109/Pact.2017.56.
Rack-Scale Memory Pooling for Datacenters
Lausanne, EPFL, 2017. DOI : 10.5075/epfl-thesis-7612.
Towards Near-Threshold Server Processors
2016. Design, Automation and Test in Europe Conference (DATE '16), Dresden, Germany, March 14-18, 2016. p. 7 - 12.
Near-Memory Data Services
IEEE Micro. 2016. DOI : 10.1109/MM.2016.9.
Unlocking Energy
2016. 2016 USENIX Annual Technical Conference, Denver, Colorado, USA, June 22-24, 2016. p. 393 - 406.
SABRes: Atomic Object Reads for In-Memory Rack-Scale Computing
2016. 49th Annual IEEE/ACM International Symposium on Microarchitecture, Taipei, Taiwan, October 15-19, 2016. DOI : 10.1109/MICRO.2016.7783709.
The Case for RackOut: Scalable Data Serving Using Rack-Scale Systems
2016. ACM Symposium on Cloud Computing, Santa Clara, USA, October 05-07, 2016. DOI : 10.1145/2987550.2987577.
An Analysis of Load Imbalance in Scale-out Data Serving
2016. ACM SIGMETRICS, Antibes Juan-Les-Pins, France, June 14-18, 2016. p. 367 - 368. DOI : 10.1145/2896377.2901501.
Accelerators for Data Processing
Lausanne, EPFL, 2015. DOI : 10.5075/epfl-thesis-6710.
Memory Systems and Interconnects for Scale-Out Servers
Lausanne, EPFL, 2015. DOI : 10.5075/epfl-thesis-6682.
Sort vs. Hash Join Revisited for Near-Memory Execution
5th Workshop on Architectures and Systems for Big Data (ASBD 2015), Portland, Oregon, USA, June 13, 2015.
Sort vs. Hash Join Revisited for Near-Memory Execution
2015. 5th Workshop on Architectures and Systems for Big Data (ASBD 2015), Portland, Oregon, USA, June 13, 2015.
Manycore Network Interfaces for In-Memory Rack-Scale Computing
2015. 42nd International Symposium on Computer Architecture, Portland, Oregon, USA, June 13-17, 2015. DOI : 10.1145/2749469.2750415.
Multi-Gigabyte On-Chip DRAM Caches for Servers
Lausanne, EPFL, 2015. DOI : 10.5075/epfl-thesis-6631.
Asynchronous memory access chaining
Proceedings of the VLDB Endowment. 2015. DOI : 10.14778/2856318.2856321.
Confluence: unified instruction supply for scale-out servers
2015. the 48th International Symposium, Waikiki, Hawaii, 05-09 December 2015. p. 166 - 177. DOI : 10.1145/2830772.2830785.
Shared Frontend for Manycore Server Processors
Lausanne, EPFL, 2015. DOI : 10.5075/epfl-thesis-6669.
A Primer on Hardware Prefetching
Morgan & Claypool.
BuMP: Bulk Memory Access Prediction and Streaming
2014. 47th Annual IEEE/ACM International Symposium on Microarchitecture, December 13-17, 2014. p. 545 - 557. DOI : 10.1109/MICRO.2014.44.
Resolve: Enabling Accurate Parallel Monitoring under Relaxed Memory Models
2014.
Unison Cache: A Scalable and Effective Die-Stacked DRAM Cache
2014. 47th Annual IEEE/ACM International Symposium on Microarchitecture, Cambridge, UK, December 13-17, 2014. p. 25 - 37. DOI : 10.1109/MICRO.2014.51.
FADE: A Programmable Filtering Accelerator for Instruction-Grain Monitoring
2014. 20th IEEE International Symposium On High Performance Computer Architecture (HPCA-2014), Orlando, Florida, USA, February 15-19, 2014. p. 108 - 119. DOI : 10.1109/HPCA.2014.6835922.
Architectural Support to Accelerate Fine-Grain Program Monitoring
Lausanne, EPFL, 2014. DOI : 10.5075/epfl-thesis-6257.
Big Data
IEEE Micro. 2014. DOI : 10.1109/MM.2014.65.
Towards stable cloud performance
Lausanne, EPFL, 2014. DOI : 10.5075/epfl-thesis-6261.
Scale-Out NUMA
2014. Nineteenth International Conference on Architectural Support for Programming Languages and Operating Systems, Salt Lake City, Utah, USA, March 1-5, 2014. DOI : 10.1145/2541940.2541965.
A Case for Specialized Processors for Scale-Out Workloads
IEEE Micro. 2014. DOI : 10.1109/MM.2014.41.
DeSyRe: On-demand system reliability
Microprocessors and Microsystems - Embedded Hardware Design. 2013. DOI : 10.1016/j.micpro.2013.08.008.
Multi-Grain Coherence Directory
2013. 46th Annual IEEE/ACM International Symposium on Microarchitecture, Davis, CA, USA, December 7-11, 2013. DOI : 10.1145/2540708.2540739.
Top Picks from the 2012 Computer Architecture Conferences Introduction
IEEE Micro. 2013. DOI : 10.1109/MM.2013.65.
Die-Stacked DRAM Caches for Servers: Hit Ratio, Latency, or Bandwidth? Have It All with Footprint Cache
2013. 40th International Symposium on Computer Architecture, Tel-Aviv, Israel, June 23-27, 2013. p. 404 - 415. DOI : 10.1145/2485922.2485957.
Scale-Out Processors
Lausanne, EPFL, 2013. DOI : 10.5075/epfl-thesis-5906.
SHIFT: Shared History Instruction Fetch for Lean-Core Server Processors
2013. 46th Annual IEEE/ACM International Symposium on Microarchitecture, Davis, CA, USA, December 7-11, 2013. DOI : 10.1145/2540708.2540732.
Meet the Walkers: Accelerating Index Traversals for In-Memory Databases
2013. 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'13), Davis, CA, USA, December 7-11, 2013. DOI : 10.1145/2540708.2540748.
Dark Silicon Accelerators for Database Indexing
Dark Silicon Workshop, Portland, Oregon, USA, June 10, 2012.
NOC-Out: Microarchitecting a Scale-Out Processor
2012. 45th International Symposium on Microarchitecture, Vancouver, BC, Canada, December 1-5, 2012. DOI : 10.1109/MICRO.2012.25.
Thermal Characterization of Cloud Workloads on a Power-Efficient Server-on-Chip
2012. 30th IEEE International Conference on Computer Design, Montreal, Quebec, Canada, September 30 - October 3, 2012. DOI : 10.1109/ICCD.2012.6378637.
Scale-Out Processors
2012.
BugSifter: A Generalized Accelerator for Flexible Instruction-Grain Monitoring
2012.
Dark Silicon Accelerators for Database Indexing
2012. 1st Dark Silicon Workshop, Portland, Oregon, USA, June 10, 2012.
Scale-Out Processors
2012. 39th Annual International Symposium on Computer Architecture, Portland, Oregon, USA, June 9-13, 2012. DOI : 10.1145/2366231.2337217.
Optimizing Data-Center TCO with Scale-Out Processors
IEEE Micro. 2012. DOI : 10.1109/MM.2012.71.
CCNoC: Specializing On-Chip Interconnects for Energy Efficiency in Cache-Coherent Servers
2012. 6th International Symposium on Networks-on-Chip, Lyngby, Denmark, May 9-11, 2012.
Clearing the Clouds: A Study of Emerging Scale-out Workloads on Modern Hardware
2012. Seventeenth International Conference on Architectural Support for Programming Languages and Operating Systems, London, UK, March 3-7, 2012.
Quantifying the Mismatch between Emerging Scale-Out Applications and Modern Processors
ACM Transactions on Computer Systems. 2012. DOI : 10.1145/2382553.2382557.
Proactive Instruction Fetch
2011. 44th Annual IEEE/ACM Symposium on Microarchitecture (MICRO 2011), Porto Alegre, Brazil, December 3-7. p. 152 - 162. DOI : 10.1145/2155620.2155638.
Reliability in the Dark Silicon Era
2011. 17th IEEE International On-Line Testing Symposium (IOLTS), Athens, Greece, Jul 13-15, 2011. p. V - V.
Cuckoo Directory: A Scalable Directory for Many-Core Systems
2011. HPCA 2011, San Antonio, Texas, USA, February 12-16, 2011. DOI : 10.1109/HPCA.2011.5749726.
Toward Dark Silicon in Servers
IEEE Micro. 2011. DOI : 10.1109/MM.2011.77.
CCNoC: On-Chip Interconnects for Cache-Coherent Manycore Server Chips
2011. Workshop on Energy-Efficient Design (WEED 2011), San Jose, California, USA, June 5, 2011.
Clearing the Clouds: A Study of Emerging Workloads on Modern Hardware
2011.
ParaLog: enabling and accelerating online parallel monitoring of multithreaded applications
2010. ASPLOS 2010, Pittsburgh, Pennsylvania, USA, March 13-17, 2010. p. 271 - 284. DOI : 10.1145/1736020.1736051.
Making Address-Correlated Prefetching Practical
IEEE Micro. 2010. DOI : 10.1109/MM.2010.21.
Near-Optimal Cache Block Placement with Reactive Nonuniform Cache Architectures
IEEE Micro. 2010. DOI : 10.1109/MM.2010.22.
TurboTag: Lookup Filtering to Reduce Coherence Directory Power
2010. 16th International Symposium on Low Power Electronics and Design (ISLPED 10), Austin, Texas, USA, August 18-20. p. 377 - 382. DOI : 10.1145/1840845.1840929.
Flexible Hardware Acceleration for Instruction-Grain Lifeguards
IEEE Micro Top Picks. 2009. DOI : 10.1109/MM.2009.6.
Reactive NUCA: Near-Optimal Block Placement and Replication in Distributed Caches
2009. 36th ACM/IEEE Annual International Symposium on Computer Architecture, Austin, TX. p. 184 - 195. DOI : 10.1145/1555754.1555779.
Spatio-Temporal Memory Streaming
2009. 36th ACM/IEEE Annual International Symposium on Computer Architecture, Austin, TX. p. 69 - 80. DOI : 10.1145/1555754.1555766.
ProtoFlex: Towards Scalable, Full-System Multiprocessor Simulations Using FPGAs
ACM Transactions on Reconfigurable Technology and Systems. 2009. DOI : 10.1145/1534916.1534925.
Chip-Level Redundancy in Distributed Shared-Memory Multiprocessors
2009. p. 195 - 201. DOI : 10.1109/PRDC.2009.39.
Practical Off-chip Meta-data for Temporal Memory Streaming
2009. 15th International Symposium on High-Performance Computer Architecture, Raleigh, NC. p. 79 - 90. DOI : 10.1109/HPCA.2009.4798239.
Workshop on Transactional Computing (TRANSACT 2008) - Introduction
ACM SIGPLAN Notices. 2008. DOI : 10.1145/1402227.1402233.
Temporal streams in commercial server applications
2008. IEEE International Symposium on Workload Characterization (IISWC), Seattle, WA, September. p. 99 - 108. DOI : 10.1109/IISWC.2008.4636095.
Predictor virtualization
2008. the 13th international conference on Architectural support for programming languages and operating systems (ASPLOS), Seattle, WA, March. p. 157 - 167. DOI : 10.1145/1346281.1346301.
Flexible hardware acceleration for instruction-grain program monitoring
2008. the 35th Annual International Symposium on Computer Architecture (ISCA), Beijing, China, June. p. 377 - 388. DOI : 10.1109/ISCA.2008.20.
A Complexity-Effective Architecture for Accelerating Full-System Multiprocessor Simulations Using FPGAs
2008. 16th international ACM/SIGDA symposium on Field programmable gate arrays (FPGA), Monterey, CA, February. p. 77 - 86. DOI : 10.1145/1344671.1344684.
Temporal instruction fetch streaming
2008. the 41st annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Lake Como, Italy, November. p. 1 - 10. DOI : 10.1109/MICRO.2008.4771774.
Database Servers on Chip Multiprocessors: Limitations and Opportunities
2007.
Mechanisms for store-wait-free multiprocessors
2007. p. 266 - 277. DOI : 10.1145/1250662.1250696.
PROTOFLEX: FPGA-accelerated hybrid functional simulator
2007. DOI : 10.1109/IPDPS.2007.370516.
PAI: A lightweight mechanism for single-node memory recovery in DSM servers
2007. p. 298 - 305. DOI : 10.1109/PRDC.2007.53.
An Analysis of Database System Performance on Chip Multiprocessors
2007.
To Share or Not To Share?
2007. 33rd International Conference on Very Large Data Bases, Vienna, Austria, September. p. 351 - 362.
Last-touch correlated data streaming
2007. p. 105 - 115. DOI : 10.1109/ISPASS.2007.363741.
Multi-bit error tolerant caches using two-dimensional error coding
2007. p. 197 - 209. DOI : 10.1109/MICRO.2007.19.
Scheduling threads for constructive cache sharing on CMPs
2007. p. 105 - 115. DOI : 10.1145/1248377.1248396.
Spatial Memory Streaming
2006. p. 252 - 263. DOI : 10.1109/ISCA.2006.38.
Dynamic feature selection for hardware prediction
Journal of Systems Architecture. 2006. DOI : 10.1016/j.sysarc.2004.12.007.
Coarse-grain coherence tracking: RegionScout and region coherence arrays
IEEE Micro. 2006. DOI : 10.1109/MM.2006.8.
Reunion: Complexity-effective multicore redundancy
2006. p. 223 - 234. DOI : 10.1109/MICRO.2006.42.
ProtoFlex: Co-simulation for Component-wise FPGA Emulator Development
2006.
Log-based architectures for general-purpose monitoring of deployed code
2006. p. 63 - 65. DOI : 10.1145/1181309.1181319.
Statistical sampling of microarchitecture simulation
ACM Transactions on Modeling and Computer Simulation. 2006. DOI : 10.1145/1147224.1147225.
The Granularity of Soft-Error Containment in Shared-Memory Multiprocessors
2006.
Exploiting reference idempotency to reduce speculative storage overflow
ACM Transactions on Programming Languages and Systems. 2006. DOI : 10.1145/1152649.1152653.
Parallel depth first vs. work stealing schedulers on CMP architectures
2006. DOI : 10.1145/1148109.1148167.
Simulation sampling with live-points
2006. p. 2 - 12. DOI : 10.1109/ISPASS.2006.1620785.
A case for asymmetric-cell cache memories
IEEE Transactions on Very Large Scale Integration Systems. 2005. DOI : 10.1109/TVLSI.2005.850127.
Evaluating scheduling policies for fine-grain communication protocols on a cluster of SMPs
Journal of Parallel and Distributed Computing. 2005. DOI : 10.1016/j.jpdc.2004.11.011.
Understanding the performance of concurrent error detecting superscalar microarchitectures
2005. p. 13 - 18. DOI : 10.1109/ISSPIT.2005.1577062.
DBmbench: fast and accurate database workload representation on modern microarchitecture
2005. p. 254 - 267. DOI : 10.1145/1105634.1105653.
TRUSS: A Reliable, Scalable Server Architecture
IEEE Micro. 2005. DOI : 10.1109/MM.2005.122.
TurboSMARTS: Accurate microarchitecture simulation sampling in minutes
2005. p. 408 - 409. DOI : 10.1145/1064212.1064278.
ReCast: Boosting tag line buffer coverage in low-power high-level caches "for free"
2005. p. 609 - 616. DOI : 10.1109/ICCD.2005.90.
Temporal Streaming of Shared Memory
2005. p. 222 - 233. DOI : 10.1109/ISCA.2005.50.
Store-Ordered Streaming of Shared Memory
2005. p. 75 - 86. DOI : 10.1109/PACT.2005.37.
Accelerating Database Operations Using a Network Processor
2005.
Fingerprinting: Bounding the Soft-Error Detection Latency and Bandwidth
IEEE Micro. 2004. DOI : 10.1109/MM.2004.72.
An Evaluation of Stratified Sampling of Microarchitecture Simulations
2004.
SimFlex: a fast, accurate, flexible full-system simulation framework for performance evaluation of server architecture
Performance Evaluation Review. 2004. DOI : 10.1145/1054907.1054914.
Fingerprinting: Bounding the Soft-Error Detection Latency and Bandwidth
2004.
TurboSMARTS: Accurate Microarchitecture Simulation Sampling in Minutes
2004.
The Fourth International Workshop on Power-Aware Computer Systems. Revised Papers
2004.
Efficient resource sharing in concurrent error detecting superscalar microarchitectures
2004. p. 257 - 268. DOI : 10.1109/MICRO.2004.19.
Memory coherence activity prediction in commercial workloads
2004. p. 37 - 45. DOI : 10.1145/1054943.1054949.
SORDS: Just-In-Time Streaming of Temporally-Correlated Shared Data
2004.
The Third International Workshop on Power-Aware Computer Systems. Revised Papers.
2004.
Accurate and complexity-effective spatial pattern prediction
2004. p. 276 - 287.
Performance and Energy Trade-Offs of Bitline Isolation in Nanoscale CMOS Caches
2003.
Implicitly-multithreaded processors
2003. p. 39 - 50. DOI : 10.1145/859618.859624.
The Second International Workshop on Power-Aware Computer Systems. Revised Papers.
2003.
Near-optimal precharging in high-performance nanoscale CMOS caches
2003. p. 67 - 78. DOI : 10.1109/MICRO.2003.1253184.
Speculative Sequential Consistency with Little Custom Storage
Journal of Instruction-Level Parallelism. 2003.
Optimizing traffic in DSM clusters: fine-grain memory caching versus page migration/replication
Theory of Computing Systems. 2002. DOI : 10.1007/s00224-002-1054-6.
Speculative sequential consistency with little custom storage
2002. p. 179 - 188. DOI : 10.1109/PACT.2002.1106016.
Exploiting choice in resizable cache design to optimize deep-submicron processor energy-delay
2002. p. 151 - 161. DOI : 10.1109/HPCA.2002.995706.
Gated Precharge: Using Temporal Locality of Subarrays to Save Deep-Submicron Cache Energy
2002.
Reducing leakage in a high-performance deep-submicron instruction cache
IEEE Transactions on Very Large Scale Integration (VLSI) Systems. 2001. DOI : 10.1109/92.920821.
Multiplex: Unifying conventional and speculative thread-level parallelism on a chip multiprocessor
2001. p. 368 - 380. DOI : 10.1145/377792.377863.
The First International Workshop on Power-Aware Computer Systems. Revised Papers.
2001.
An integrated circuit/architecture approach to reducing leakage in deep-submicron high-performance I-caches
2001. p. 147 - 157. DOI : 10.1109/HPCA.2001.903259.
Dual Use of Superscalar Datapath for Transient-Fault Detection and Recovery
2001. 34th Annual IEEE/ACM International Symposium on Microarchitecture, Austin, Texas, December 1-5, 2001. p. 214 - 224. DOI : 10.1109/MICRO.2001.991120.
Evaluating Opportunity and Effectiveness of Cache Resizing in Reducing Energy Dissipation
2001.
Dead-block prediction & dead-block correlating prefetchers
2001. p. 144 - 154. DOI : 10.1109/ISCA.2001.937443.
JETTY: Filtering snoops for reduced energy consumption in SMP servers
2001. p. 85 - 96. DOI : 10.1109/HPCA.2001.903254.
Reference idempotency analysis: A framework for optimizing speculative execution
2001. p. 2 - 11. DOI : 10.1145/379539.379547.
Reducing set-associative cache energy via way-prediction and selective direct-mapping
2001. p. 54 - 65.
Comparing the effectiveness of fine-grain memory caching against page migration/replication in reducing traffic in DSM clusters
2000. p. 79 - 88. DOI : 10.1145/341800.341811.
Address partitioning in DSM clusters with parallel coherence controllers
2000. p. 47 - 56. DOI : 10.1109/PACT.2000.888330.
Selective, accurate, and timely self-invalidation using last-touch prediction
2000. p. 139 - 148. DOI : 10.1109/ISCA.2000.854385.
Gated-Vdd: a circuit technique to reduce leakage in deep-submicron cache memories
2000. International Symposium on Low Power Electronics and Design (ISLPED), Rapallo, Italy, July. p. 90 - 95. DOI : 10.1109/LPE.2000.876763.
The Fourth International Workshop on Network-Based Parallel Computing. Communication, Architecture, and Applications. Revised Papers.
2000.
Low-Overhead and High-Performance Implementations of Sequential Consistency
2000.
Multiplex: Unifying Conventional and Speculative Thread-Level Parallelism on a Chip Multiprocessor
2000.
Dynamically Resizable Instruction Cache: A Design for an Energy-Efficient and High-Performance Deep-Submicron Instruction Cache
2000.
Wisconsin Wind Tunnel II: a fast, portable parallel architecture simulator
IEEE Concurrency. 2000. DOI : 10.1109/4434.895100.
Dynamic Feature Selection for Hardware Prediction
2000.
Memory sharing predictor: the key to a speculative coherent DSM
1999. p. 172 - 183. DOI : 10.1109/ISCA.1999.765949.
Is SC+ILP=RC?
1999. ISCA, Atlanta, GA, May. p. 162 - 171. DOI : 10.1109/ISCA.1999.765948.
Parallel Dispatch Queue: a queue-based programming abstraction to parallelize fine-grain communication protocols
1999. p. 182 - 192. DOI : 10.1109/HPCA.1999.744362.
Cacheable Interface Control Registers for High Speed Data Transfer
US5951657 . 1999.
Is SC + ILP = RC?
ACM SIGARCH Computer Architecture News. 1999. DOI : 10.1145/307338.300993.
Sirocco: cost-effective fine-grain distributed shared memory
1998. p. 40 - 49. DOI : 10.1109/PACT.1998.727144.
Reactive NUMA: A design for unifying S-COMA and CC-NUMA
1997. p. 229 - 240. DOI : 10.1145/264107.264205.
Wisconsin Wind Tunnel II: A Fast and Portable Parallel Architecture Simulator
1997.
Fine-grain Access Control for Distributed Shared Memory
Distributed Shared Memory: Concepts and Systems; IEEE Computer Society Press, 1997.
Scheduling communication on an SMP node parallel machine
1997. p. 128 - 138. DOI : 10.1109/HPCA.1997.569649.
Modeling cost/performance of a parallel computer simulator
ACM Transactions on Modeling and Computer Simulation. 1997. DOI : 10.1145/244804.244808.
Coherent network interfaces for fine-grain communication
1996. p. 247 - 258. DOI : 10.1145/232973.232999.
Implementing Fine-grain Distributed Shared Memory on Commodity SMP Workstations
1996.
When does Dedicated Protocol Processing Make Sense?
1996.
Fine-grain access control for distributed shared memory
1994. ASPLOS'94. 6th International Conference on Architectural support for Programming Languages and Operating Systems, San Jose, CA, October. p. 297 - 306. DOI : 10.1145/195470.195575.
Application-specific protocols for user-level shared memory
1994. Supercomputing '94, Washington D.C., USA, November 14-18. p. 380 - 389. DOI : 10.1109/SUPERC.1994.344301.
Cost/performance of a parallel computer simulator
1994. p. 173 - 182.
Mechanisms for Cooperative Shared Memory
CMG Transactions. 1994. DOI : 10.1145/173682.165151.
Kernel support for the Wisconsin Wind Tunnel
1993. p. 73 - 89.
Mechanisms for cooperative shared memory
1993. 20th International Symposium on Computer Architecture, San Diego, CA, May. p. 156 - 167. DOI : 10.1145/165123.165151.
Component Labeling Algorithms on an Intel iPSC/2 Hypercube
1990. p. 159 - 164.
Teaching & PhD
Teaching
Computer Science
Mathematics
Communication Systems