Babak Falsafi

Full Professor
babak.falsafi@epfl.ch +41 21 693 55 92 http://parsa.epfl.ch/~falsafi
EPFL IC IINFCOM PARSA
INJ 233 (Bâtiment INJ)
Station 14
CH-1015 Lausanne
+41 21 693 13 93
Office: INJ 233
EPFL > IC > IINFCOM > PARSA
Web site: https://parsa.epfl.ch/
EPFL > IC > IC-SIN > SIN-ENS
Web site: https://sin.epfl.ch
EPFL > IC > IC-SSC > SSC-ENS
Web site: https://ssc.epfl.ch
EPFL > SB > SB-SMA > SMA-ENS
Web site: https://sma.epfl.ch/
Biography
Babak is a Professor in the School of Computer and Communication Sciences and the founder of EcoCloud, an industrial/academic consortium at EPFL investigating scalable data-centric technologies. He has made numerous contributions to computer system design and evaluation, including a scalable multiprocessor architecture that was prototyped by Sun Microsystems (now Oracle), snoop filters and temporal memory streaming technologies that are incorporated into IBM BlueGene/P and Q, spatial memory streaming that appears in ARM cores, and computer system performance evaluation methodologies that have been in use by AMD, HP and Google PerfKit. He has shown that hardware memory consistency models are neither necessary (in the 90's) nor sufficient (a decade later) to achieve high performance in multiprocessor systems. These results eventually led to fence speculation in modern microprocessors. His latest work on workload-optimized server processors laid the foundation for the first generation of Cavium ARM server CPUs, ThunderX. He is a recipient of an NSF CAREER award, IBM Faculty Partnership Awards, and an Alfred P. Sloan Research Fellowship. He is a fellow of IEEE and ACM.
NEWS
Network bandwidth is projected to grow at 20% a year for a decade thanks to optics. Logic density is lagging behind at 15% a year and slowing down. Optimus Prime (ASPLOS'20), a transformer accelerator, NebuLA (ISCA'20), a hardware-terminated network stack, and Cerebros (MICRO'21), an RPC processor combining the two, help bridge the gap for uServices. Great to see that Google has followed up with their own transformer prototype this year.
See our paper on "Rebooting Virtual Memory with Midgard" for a novel approach to future-proof virtual memory. Here is a news snippet.
Numerical training of DNNs is converging on fixed point with orders of magnitude improvement in logic, memory, power and bandwidth. See our blog.
Cavium ThunderX, an ARM-based server processor, is the first scale-out processor which is workload-optimized based on our work in "Clearing the Clouds" and the first version of CloudSuite. See this article in EETimes from EEMBC and Cavium.
RESEARCH
Data has emerged as a currency for modern society, and datacenters are now the backbone of IT, offering large-scale cloud services at low cost by exploiting economies of scale. With silicon efficiency scaling having dwindled since 2004 and silicon density scaling (Moore's Law) slowing down, future digital platforms will rely on heterogeneous logic and memory to allow for IT scalability. Meanwhile, the demand for large-scale cloud services has grown dramatically faster than conventional silicon scaling, making IT platform scalability a grand challenge. Future platforms will need hand-in-hand collaboration of application domain experts and platform designers to improve scalability. With many online services being in-memory and the minimum communication latency between the farthest nodes being microseconds, future server platforms will go through revolutionary changes in architecture and systems to enable seamless aggregation of logic and memory resources across nodes, breaking the conventional abstraction layers. Babak's research and educational activities center around post-Moore server design.
He investigates techniques to address these challenges in the context of the following projects:
- CloudSuite: A Benchmark Suite for Scale-Out Workloads
- ColTraIn: Co-Located Training and Inference DNN Accelerators
- QFlex: Fast, Full-System Open-Source Server Simulation/Emulation
- Scale-Out NUMA: Rack-Scale Computer Architecture
- VISA: Server Processors for the Dark Silicon Era
Selected Talks
Silicon Heterogeneity in the Cloud
DATE, March 2019, PDF.
Datacenter for the Post-Moore Era
Euro-Par Keynote, August 2018, PDF.
Public Clouds will Subsume (Most of) HPC
HPC Summit,
May 2017, PDF.
Memory-Centric Server Architecture
Talks at Columbia, Edinburgh and HKUST,
2016, PDF.
Big Data & Dark Silicon: Taming Two IT Trends on a Collision Course
HiPEAC CSW & IEEE CloudNet Keynotes,
October 2014, PDF.
Reliability in the Dark Silicon Era
IOLTS 2011 Keynote,
July 2011, PDF.
Dark Silicon & Its Implications on Server Chip Design
Microsoft Research,
November 2010, PDF, Video.
TRUSS: Reliable, Scalable Server Architecture
Georgia Institute of Technology, College of Computing Colloquia,
April 2006, PDF.
Temporal Memory Streaming
University of Texas, Computer Science Department Colloquia,
December 2005, PDF.
Transactional Execution: Wait-Free Hardware Memory Ordering
Dagstuhl Seminar on "Hardware and Software Consistency
Models: Programmability and Performance",
October 2003, PDF.
Publications
Infoscience publications
AstriFlash: A Flash-Based System for Online Services
2022-12-04. The 29th IEEE International Symposium on High-Performance Computer Architecture (HPCA-29), Montreal, QC, Canada, Feb 25 – March 01, 2023.
Elimination of ringing artifacts by finite-element projection in FFT-based homogenization
Journal of Computational Physics. 2022-03-15. DOI : 10.1016/j.jcp.2021.110931.
Efficient Meso-Scale Modeling of Alkali-Silica-Reaction Damage in Concrete
Lausanne, EPFL, 2022. DOI : 10.5075/epfl-thesis-9591.
Hardware and Software Support for RPC-Centric Server Architecture
Lausanne, EPFL, 2022. DOI : 10.5075/epfl-thesis-8017.
Algorithms for Efficient and Robust Distributed Deep Learning
Lausanne, EPFL, 2022. DOI : 10.5075/epfl-thesis-8980.
Equinox: Training (for Free) on a Custom Inference Accelerator
2021-10-18. 54th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’21), Virtual Event, Greece, October 18–22, 2021. DOI : 10.1145/3466752.3480057.
Cerebros: Evading the RPC Tax in Datacenters
2021-10-18. MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture, Virtual Event, Greece, October 18–22, 2021. p. 407-420. DOI : 10.1145/3466752.3480055.
Hardware-Software Co-Design of an RPC Processor
Lausanne, EPFL, 2021. DOI : 10.5075/epfl-thesis-7217.
Rebooting Virtual Memory with Midgard
2021. ISCA 2021 48th International Symposium on Computer Architecture, Online conference, June 14-19, 2021. DOI : 10.1109/ISCA52012.2021.00047.
Data transformer apparatus
US2022327048; WO2021037341. 2021.
Exploiting Errors for Efficiency: A Survey from Circuits to Applications
ACM Computing Surveys. 2020-06-01. DOI : 10.1145/3394898.
ColTraIn: Co-located DNN training and inference
Lausanne, EPFL, 2020. DOI : 10.5075/epfl-thesis-10265.
The NEBULA RPC-Optimized Architecture
2020. 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), Valencia, Spain, May 30 - June 3, 2020. p. 199-212. DOI : 10.1109/ISCA45697.2020.00027.
Optimus Prime: Accelerating Data Transformation in Servers
2020. Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, Lausanne, Switzerland, March 16–20, 2020. p. 1203-1216. DOI : 10.1145/3373376.3378501.
SPARTA: A Divide and Conquer Approach to Address Translation for Accelerators
2020.
Distributed Logless Atomic Durability with Persistent Memory
2019-10-16. The 52nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-52), Columbus, OH, USA, October 12–16, 2019. DOI : 10.1145/3352460.3358321.
RPCValet: NI-Driven Tail-Aware Balancing of µs-Scale RPCs
2019-04-15. Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems - ASPLOS '19, Providence, Rhode Island, USA, April 13-17, 2019. p. 35-48. DOI : 10.1145/3297858.3304070.
Mitigating Load Imbalance in Distributed Data Serving with Rack-Scale Memory Pooling
ACM Transactions on Computer Systems. 2019-04-01. DOI : 10.1145/3309986.
SMoTherSpectre: Exploiting Speculative Execution through Port Contention
2019. The 26th ACM Conference on Computer and Communications Security - ACM CCS 2019, London, UK, November 11-15, 2019. p. 785–800. DOI : 10.1145/3319535.3363194.
Analog Neural Networks with Deep-submicron Nonlinear Synapses
IEEE Micro. 2019. DOI : 10.1109/MM.2019.2931182.
Design Guidelines for High-Performance SCM Hierarchies
2018-10-01. 4th International Symposium on Memory Systems (MEMSYS), Old Town Alexandria, VA, USA, October 1-4, 2018. DOI : 10.1145/3240302.3240310.
Atomic object reads for in-memory rack-scale computing
US10929174; US2018173673. 2018.
Training DNNs with Hybrid Block Floating Point
2018-01-01. NeurIPS 2018 - 32nd Conference on Neural Information Processing Systems, Montreal, Canada, Dec 02-08, 2018.
Network-Compute Co-Design for Distributed In-Memory Computing
Lausanne, EPFL, 2018. DOI : 10.5075/epfl-thesis-8749.
LTRF: Enabling High-Capacity Register Files for GPUs via Hardware/Software Cooperative Register Prefetching
2018. Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems - ASPLOS '18, Williamsburg, VA, USA, March 24th – March 28th, 2018. p. 489-502. DOI : 10.1145/3173162.3173211.
Near-Memory Address Translation
2017. 26th International Conference on Parallel Architectures and Compilation Techniques (PACT), Portland, OR, Sep 09-13, 2017. p. 303-317. DOI : 10.1109/Pact.2017.56.
Near-Memory Address Translation
Lausanne, EPFL, 2017. DOI : 10.5075/epfl-thesis-7875.
Fat Caches For Scale-Out Servers
IEEE Micro. 2017. DOI : 10.1109/MM.2017.32.
Rack-Scale Memory Pooling for Datacenters
Lausanne, EPFL, 2017. DOI : 10.5075/epfl-thesis-7612.
The Mondrian Data Engine
2017. The 44th International Symposium on Computer Architecture, Toronto, ON, Canada, June 24-28, 2017. DOI : 10.1145/3079856.3080233.
Unified prefetching into instruction cache and branch target buffer
US9996358; US2017090935. 2017.
FPGAs versus GPUs in Data centers
IEEE Micro. 2017. DOI : 10.1109/MM.2017.19.
Unlocking Energy
2016. 2016 USENIX Annual Technical Conference, Denver, Colorado, USA, June 22-24, 2016. p. 393-406.
The Case for RackOut: Scalable Data Serving Using Rack-Scale Systems
2016. ACM Symposium on Cloud Computing, Santa Clara, USA, October 05-07, 2016. DOI : 10.1145/2987550.2987577.
SABRes: Atomic Object Reads for In-Memory Rack-Scale Computing
2016. 49th Annual IEEE/ACM International Symposium on Microarchitecture, Taipei, Taiwan, October 15-19, 2016. DOI : 10.1109/MICRO.2016.7783709.
Near-Memory Data Services
IEEE Micro. 2016. DOI : 10.1109/MM.2016.9.
An Analysis of Load Imbalance in Scale-out Data Serving
2016. ACM SIGMETRICS, Antibes Juan-Les-Pins, France, June 14-18, 2016. p. 367–368. DOI : 10.1145/2896377.2901501.
Towards Near-Threshold Server Processors
2016. Design, Automation and Test in Europe Conference (DATE '16), Dresden, Germany, March 14-18, 2016. p. 7-12.
Scale-out non-uniform memory access
US9734063; US2015242324. 2015.
Asynchronous memory access chaining
Proceedings of the VLDB Endowment. 2015. DOI : 10.14778/2856318.2856321.
Confluence: unified instruction supply for scale-out servers
2015. the 48th International Symposium, Waikiki, Hawaii, 05-09 December 2015. p. 166-177. DOI : 10.1145/2830772.2830785.
Accelerators for Data Processing
Lausanne, EPFL, 2015. DOI : 10.5075/epfl-thesis-6710.
Memory Systems and Interconnects for Scale-Out Servers
Lausanne, EPFL, 2015. DOI : 10.5075/epfl-thesis-6682.
Multi-Gigabyte On-Chip DRAM Caches for Servers
Lausanne, EPFL, 2015. DOI : 10.5075/epfl-thesis-6631.
Shared Frontend for Manycore Server Processors
Lausanne, EPFL, 2015. DOI : 10.5075/epfl-thesis-6669.
Sort vs. Hash Join Revisited for Near-Memory Execution
2015. 5th Workshop on Architectures and Systems for Big Data (ASBD 2015), Portland, Oregon, USA, June 13, 2015.
Sort vs. Hash Join Revisited for Near-Memory Execution
5th Workshop on Architectures and Systems for Big Data (ASBD 2015), Portland, Oregon, USA, June 13, 2015.
Manycore Network Interfaces for In-Memory Rack-Scale Computing
2015. 42nd International Symposium in Computer Architecture, Portland, Oregon, USA, June 13-17, 2015. DOI : 10.1145/2749469.2750415.
Network-on-chip using request and reply trees for low-latency processor-memory communication
US9703707; US2014156929. 2014.
Big Data
IEEE Micro. 2014. DOI : 10.1109/MM.2014.65.
Unison Cache: A Scalable and Effective Die-Stacked DRAM Cache
2014. 47th Annual IEEE/ACM International Symposium on Microarchitecture, Cambridge, UK, December 13-17, 2014. p. 25-37. DOI : 10.1109/MICRO.2014.51.
Architectural Support to Accelerate Fine-Grain Program Monitoring
Lausanne, EPFL, 2014. DOI : 10.5075/epfl-thesis-6257.
BuMP: Bulk Memory Access Prediction and Streaming
2014. 47th Annual IEEE/ACM International Symposium on Microarchitecture, December 13-17, 2014. p. 545-557. DOI : 10.1109/MICRO.2014.44.
Towards stable cloud performance
Lausanne, EPFL, 2014. DOI : 10.5075/epfl-thesis-6261.
A Case for Specialized Processors for Scale-Out Workloads
IEEE Micro. 2014. DOI : 10.1109/MM.2014.41.
A Primer on Hardware Prefetching
Morgan & Claypool.
Resolve: Enabling Accurate Parallel Monitoring under Relaxed Memory Models
2014.
FADE: A Programmable Filtering Accelerator for Instruction-Grain Monitoring
2014. 20th IEEE International Symposium On High Performance Computer Architecture (HPCA-2014), Orlando, Florida, USA, February 15-19, 2014. p. 108-119. DOI : 10.1109/HPCA.2014.6835922.
Scale-Out NUMA
2014. Nineteenth International Conference on Architectural Support for Programming Languages and Operating Systems, Salt Lake City, Utah, USA, March 1-5, 2014. DOI : 10.1145/2541940.2541965.
DeSyRe: On-demand system reliability
Microprocessors and Microsystems - Embedded Hardware Design. 2013. DOI : 10.1016/j.micpro.2013.08.008.
Multi-Grain Coherence Directory
2013. 46th Annual IEEE/ACM International Symposium on Microarchitecture, Davis, CA, USA, December 7-11, 2013. DOI : 10.1145/2540708.2540739.
Meet the Walkers: Accelerating Index Traversals for In-Memory Databases
2013. 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'13), Davis, CA, USA, December 7-11, 2013. DOI : 10.1145/2540708.2540748.
SHIFT: Shared History Instruction Fetch for Lean-Core Server Processors
2013. 46th Annual IEEE/ACM International Symposium on Microarchitecture, Davis, CA, USA, December 7-11, 2013. DOI : 10.1145/2540708.2540732.
TOP PICKS FROM THE 2012 COMPUTER ARCHITECTURE CONFERENCES Introduction
IEEE Micro. 2013. DOI : 10.1109/MM.2013.65.
Scale-Out Processors
Lausanne, EPFL, 2013. DOI : 10.5075/epfl-thesis-5906.
Die-Stacked DRAM Caches for Servers: Hit Ratio, Latency, or Bandwidth? Have It All with Footprint Cache
2013. 40th International Symposium on Computer Architecture, Tel-Aviv, Israel, June 23-27, 2013. p. 404–415. DOI : 10.1145/2485922.2485957.
BugSifter: A Generalized Accelerator for Flexible Instruction-Grain Monitoring
2012.
Dark Silicon Accelerators for Database Indexing
2012. 1st Dark Silicon Workshop, Portland, Oregon, USA, June 10, 2012.
Thermal Characterization of Cloud Workloads on a Power-Efficient Server-on-Chip
2012. 30th IEEE International Conference on Computer Design, Montreal, Quebec, Canada, September 30 - October 3, 2012. DOI : 10.1109/ICCD.2012.6378637.
Quantifying the Mismatch between Emerging Scale-Out Applications and Modern Processors
ACM Transactions on Computer Systems. 2012. DOI : 10.1145/2382553.2382557.
NOC-Out: Microarchitecting a Scale-Out Processor
2012. 45th International Symposium on Microarchitecture, Vancouver, BC, Canada, December 1-5, 2012. DOI : 10.1109/MICRO.2012.25.
Optimizing Data-Center TCO with Scale-Out Processors
IEEE Micro. 2012. DOI : 10.1109/MM.2012.71.
Dark Silicon Accelerators for Database Indexing
Dark Silicon Workshop, Portland, Oregon, USA, June 10, 2012.
Scale-Out Processors
2012. 39th Annual International Symposium on Computer Architecture, Portland, Oregon, USA, June 9-13, 2012. DOI : 10.1145/2366231.2337217.
CCNoC: Specializing On-Chip Interconnects for Energy Efficiency in Cache-Coherent Servers
2012. 6th International Symposium on Networks-on-Chip, Lyngby, Denmark, May 9-11, 2012.
Scale-Out Processors
2012.
Clearing the Clouds: A Study of Emerging Scale-out Workloads on Modern Hardware
2012. Seventeenth International Conference on Architectural Support for Programming Languages and Operating Systems, London, UK, March 3-7, 2012.
Reliability in the Dark Silicon Era
2011. 17th IEEE International On-Line Testing Symposium (IOLTS), Athens, Greece, Jul 13-15, 2011. p. V-V.
Proactive Instruction Fetch
2011. 44th Annual IEEE/ACM Symposium on Microarchitecture (MICRO 2011), Porto Alegre, Brazil, December 3-7. p. 152-162. DOI : 10.1145/2155620.2155638.
Clearing the Clouds: A Study of Emerging Workloads on Modern Hardware
2011.
Toward Dark Silicon in Servers
IEEE Micro. 2011. DOI : 10.1109/MM.2011.77.
CCNoC: On-Chip Interconnects for Cache-Coherent Manycore Server Chips
2011. Workshop on Energy-Efficient Design (WEED 2011), San Jose, California, USA, June 5, 2011.
Cuckoo Directory: A Scalable Directory for Many-Core Systems
2011. HPCA 2011, San Antonio, Texas, USA, February 12-16, 2011. DOI : 10.1109/HPCA.2011.5749726.
ParaLog: enabling and accelerating online parallel monitoring of multithreaded applications
2010. ASPLOS 2010, Pittsburgh, Pennsylvania, USA, March 13-17, 2010. p. 271-284. DOI : 10.1145/1736020.1736051.
TurboTag: Lookup Filtering to Reduce Coherence Directory Power
2010. 16th International Symposium on Low Power Electronics and Design (ISLPED 10), Austin, Texas, USA, August 18-20. p. 377-382. DOI : 10.1145/1840845.1840929.
Near-Optimal Cache Block Placement with Reactive Nonuniform Cache Architectures
IEEE Micro. 2010. DOI : 10.1109/MM.2010.22.
Making Address-Correlated Prefetching Practical
IEEE Micro. 2010. DOI : 10.1109/MM.2010.21.
Chip-Level Redundancy in Distributed Shared-Memory Multiprocessors
2009. p. 195-201. DOI : 10.1109/PRDC.2009.39.
Flexible Hardware Acceleration for Instruction-Grain Lifeguards
IEEE Micro Top Picks. 2009. DOI : 10.1109/MM.2009.6.
ProtoFlex: Towards Scalable, Full-System Multiprocessor Simulations Using FPGAs
ACM Transactions on Reconfigurable Technology and Systems. 2009. DOI : 10.1145/1534916.1534925.
Spatio-Temporal Memory Streaming
2009. 36th ACM/IEEE Annual International Symposium on Computer Architecture, Austin, TX. p. 69-80. DOI : 10.1145/1555754.1555766.
Practical Off-chip Meta-data for Temporal Memory Streaming
2009. 15th International Symposium on High-Performance Computer Architecture, Raleigh, NC. p. 79-90. DOI : 10.1109/HPCA.2009.4798239.
Reactive NUCA: Near-Optimal Block Placement and Replication in Distributed Caches
2009. 36th ACM/IEEE Annual International Symposium on Computer Architecture, Austin, TX. p. 184-195. DOI : 10.1145/1555754.1555779.
Shore-MT: A Scalable Storage Manager for the Multicore Era
2009. 12th International Conference on Extending Database Technology, Saint Petersburg, Russia, March 24-26. p. 24-35. DOI : 10.1145/1516360.1516365.
Workshop on Transactional Computing (TRANSACT 2008) - Introduction
ACM SIGPLAN Notices. 2008. DOI : 10.1145/1402227.1402233.
A Complexity-Effective Architecture for Accelerating Full-System Multiprocessor Simulations Using FPGAs
2008. 16th International ACM/SIGDA Symposium on Field Programmable Gate Arrays (FPGA), Monterey, CA, February. p. 77–86. DOI : 10.1145/1344671.1344684.
Temporal instruction fetch streaming
2008. The 41st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Lake Como, Italy, November. p. 1-10. DOI : 10.1109/MICRO.2008.4771774.
Flexible hardware acceleration for instruction-grain program monitoring
2008. The 35th Annual International Symposium on Computer Architecture (ISCA), Beijing, China, June. p. 377-388. DOI : 10.1109/ISCA.2008.20.
Predictor virtualization
2008. The 13th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Seattle, WA, March. p. 157-167. DOI : 10.1145/1346281.1346301.
Temporal streams in commercial server applications
2008. IEEE International Symposium on Workload Characterization (IISWC), Seattle, WA, September. p. 99-108. DOI : 10.1109/IISWC.2008.4636095.
Teaching & PhD
Teaching
Computer Science
Mathematics
Communication Systems