Babak Falsafi

Full Professor

babak.falsafi@epfl.ch +41 21 693 55 92 http://parsa.epfl.ch/~falsafi

EPFL IC IINFCOM PARSA
INJ 233 (Bâtiment INJ)
Station 14
1015 Lausanne

+41 21 693 55 92
+41 21 693 13 93
Office: INJ 233
EPFL > IC > IINFCOM > PARSA

Web site: https://parsa.epfl.ch/

EPFL > IC > IC-SIN > SIN-ENS

Web site: https://sin.epfl.ch

EPFL > VPA-AVP-DLE > AVP-DLE-EDOC > EDIC-ENS

EPFL > IC > IC-SSC > SSC-ENS

Web site: https://ssc.epfl.ch

EPFL > SB > SB-SMA > SMA-ENS

Web site: https://sma.epfl.ch/

Fields of expertise

Computer architecture, datacenter systems, cloud-native server architecture.

Biography

Babak is a Professor in the School of Computer and Communication Sciences and the founder of EcoCloud, an industrial/academic consortium at EPFL investigating scalable, sustainable information technology. He has made numerous contributions to computer system design and evaluation, including a scalable multiprocessor architecture that was prototyped by Sun Microsystems (now Oracle), snoop filters incorporated into multi-socket x86 servers and IBM BlueGene supercomputers, spatial and temporal memory streaming techniques that appear in ARM cores, and computer system performance evaluation methodologies that have been in use by AMD, HP and Google PerfKit. He has shown that hardware memory consistency models are neither necessary (in the 1990s) nor sufficient (a decade later) to achieve high performance in servers. These results eventually led to fence speculation in modern CPUs. His work on cloud-native CPUs laid the foundation for the first generation of Cavium ARM server CPUs, ThunderX. He is a recipient of an NSF CAREER award, IBM Faculty Partnership Awards, and an Alfred P. Sloan Research Fellowship. He is a fellow of ACM and IEEE.

NEWS

Online services are stuck in memory and DRAM is not scaling. AstriFlash at HPCA'23 presents a system that serves data directly out of Flash, reducing memory cost by 20x while meeting ms-scale SLOs for online services at 95% of the throughput achieved with DRAM.

Network bandwidth is projected to grow at 20% a year for a decade thanks to optics. Logic density is lagging behind at 15% a year and slowing down, resulting in a "datacenter tax". Optimus Prime, a data transformation accelerator; NebuLA, a hardware-terminated network stack; and Cerebros, an RPC processor, are examples of how to mitigate the datacenter tax in the post-Moore era. Great to see that Google followed up with their own data transformation accelerator in 2022.

See our paper on "Rebooting Virtual Memory with Midgard" for a novel approach to future-proof virtual memory. Here is a news snippet.

Numerical training of DNNs is converging on fixed point with orders of magnitude improvement in logic, memory, power and bandwidth. See our blog.

RESEARCH

Data has emerged as a currency for modern society, and datacenters are now the backbone of IT, offering large-scale cloud services at low cost by exploiting economies of scale. With silicon efficiency scaling having dwindled since 2004 and silicon density scaling (Moore's Law) slowing down, future digital platforms will rely on heterogeneous logic and memory to allow for IT scalability. Meanwhile, the demand for large-scale cloud services has grown dramatically faster than conventional silicon scaling, making IT platform scalability a grand challenge. Future platforms will need hand-in-hand collaboration between application domain experts and platform designers to improve scalability. With many online services being in-memory and the minimum communication latency between the farthest nodes being on the order of microseconds, future server platforms will go through revolutionary changes in architecture and systems to enable seamless aggregation of logic and memory resources across nodes, breaking the conventional abstraction layers. Babak's research and educational activities center around post-Moore server design.

He investigates techniques to address these challenges in the context of the following projects:

CloudSuite: A Benchmark Suite for Scale-Out Workloads
ColTraIn: Co-Located Training and Inference DNN Accelerators
HARNESS: Heterogeneous Architectures for Next-Generation Server Systems
Midgard: Future-Proofing Virtual Memory
QFlex: Fast, Full-System Open-Source Server Simulation/Emulation
VISA: Cloud-Native CPUs

Selected Talks

Integration, Specialization and Approximation: the "ISA" of Post-Moore Servers
HPCA Keynote, 2022.

Post-Moore AI Infrastructure
Facebook SysML Talk, 2021.

Post-Moore Server Architecture
ICS Keynote, 2020 (Video on YouTube!).

Server Architecture for the Post-Moore Era
HotDC Keynote, 2017.

Awards

2015 : Elected Fellow of the Association for Computing Machinery (ACM)

2012 : Elected Fellow of the Institute of Electrical and Electronics Engineers (IEEE)

2004 : Sloan Research Fellowship, Alfred P. Sloan Foundation

Publications

Infoscience publications

Single-Address-Space FaaS with Jord

Y. Li; A. Bhattacharyya; M. Kumar; A. Bhattacharjee; Yoav Etsion et al.

2025. The 52nd Annual International Symposium on Computer Architecture, Tokyo, Japan, 2025-06-21 - 2025-06-25. p. 694 - 707. DOI : 10.1145/3695053.3731108.

QFlex 3.0: Fast and Accurate ARM Server Simulation

Avant-Garde: Empowering GPUs with Scaled Numeric Formats

Constrained bit allocation for neural networks

UrbanTwin: An urban digital twin for climate action

Silicon Efficiency in Post-Moore Servers

Server Architecture from Enterprise to Post-Moore

Electrical-Level Fault-Injection Attacks on FPGA-Based Systems

Secure Interface Design Leveraging Hardware/Software Support

What's Missing in Agile Hardware Design? Verification!

Scale-out Systolic Arrays

AstriFlash: A Flash-Based System for Online Services

Imprecise Store Exceptions

Cooperative Concurrency Control for Write-Intensive Key-Value Workloads

SecureCells: A Secure Compartmentalized Architecture

Evaluating, Exploiting, and Hiding Power Side-Channel Leakage of Remote FPGAs

Rebooting Virtual Memory with Midgard

Hardware and Software Support for RPC-Centric Server Architecture

Algorithms for Efficient and Robust Distributed Deep Learning

Cerebros: Evading the RPC Tax in Datacenters

Equinox: Training (for Free) on a Custom Inference Accelerator

Hardware-Software Co-Design of an RPC Processor

Data transformer apparatus

Exploiting Errors for Efficiency: A Survey from Circuits to Applications

Optimus Prime: Accelerating Data Transformation in Servers

The NEBULA RPC-Optimized Architecture

SPARTA: A Divide and Conquer Approach to Address Translation for Accelerators

ColTraIn: Co-located DNN training and inference

Distributed Logless Atomic Durability with Persistent Memory

Analog Neural Networks with Deep-submicron Nonlinear Synapses

SMoTherSpectre: Exploiting Speculative Execution through Port Contention

Design Guidelines for High-Performance SCM Hierarchies

Atomic object reads for in-memory rack-scale computing

Training DNNs with Hybrid Block Floating Point

Network-Compute Co-Design for Distributed In-Memory Computing

Rack-Scale Memory Pooling for Datacenters

The Mondrian Data Engine

Unified prefetching into instruction cache and branch target buffer

Near-Memory Address Translation

Fat Caches For Scale-Out Servers

FPGAs versus GPUs in Data centers

Unlocking Energy

Towards Near-Threshold Server Processors

Near-Memory Data Services

An Analysis of Load Imbalance in Scale-out Data Serving

SABRes: Atomic Object Reads for In-Memory Rack-Scale Computing

The Case for RackOut: Scalable Data Serving Using Rack-Scale Systems

Sort vs. Hash Join Revisited for Near-Memory Execution

Confluence: unified instruction supply for scale-out servers

Multi-Gigabyte On-Chip DRAM Caches for Servers

Manycore Network Interfaces for In-Memory Rack-Scale Computing

Memory Systems and Interconnects for Scale-Out Servers

Shared Frontend for Manycore Server Processors

Asynchronous memory access chaining

Accelerators for Data Processing

Resolve: Enabling Accurate Parallel Monitoring under Relaxed Memory Models

A Primer on Hardware Prefetching

Towards stable cloud performance

Unison Cache: A Scalable and Effective Die-Stacked DRAM Cache

FADE: A Programmable Filtering Accelerator for Instruction-Grain Monitoring

Architectural Support to Accelerate Fine-Grain Program Monitoring

Big Data

A Case for Specialized Processors for Scale-Out Workloads

Scale-Out NUMA

BuMP: Bulk Memory Access Prediction and Streaming

SHIFT: Shared History Instruction Fetch for Lean-Core Server Processors

DeSyRe: On-demand system reliability