Maud Ehrmann

EPFL CDH DHI DHLAB
INN 116 (Bâtiment INN)
Station 14
1015 Lausanne

Expertise

With a background in both natural language processing (NLP) and the humanities, my expertise is in the area of historical and multilingual NLP, with a particular focus on historical document processing, information extraction, named entity processing, multilingual and historical resource creation, NLP system evaluation, and large-scale infrastructure. In recent years, I have carried out and coordinated work on these topics in research projects at the intersection of computer science and cultural heritage - an interdisciplinary setting in which I have often acted as an intermediary between computer scientists, humanities scholars, engineers, and representatives of cultural heritage institutions.

Highlights:

impresso. Media Monitoring of the Past. How can newspaper archives help understand the past? How to explore them? This large-scale, impact-driven project aims to enable critical mining of newspaper archives by integrating robust content mining and innovative data visualisation and exploration into a powerful user interface that can support digital scholarship.
The HIPE Evaluation Campaigns. What is the ability of machines to recognise and disambiguate entities (e.g. people, places, organisations) in multilingual historical documents? The series of HIPE shared tasks aims to assess and advance the development of robust, adaptable and transferable approaches to named entity processing in historical documents to foster efficient semantic indexing of digitised cultural heritage collections. See the HIPE-2020 and HIPE-2022 websites, the HIPE-eval GitHub organisation, the HIPE-2022 dataset, and the DHLAB web page.

Maud Ehrmann is a research scientist and lecturer at the Digital Humanities Laboratory of the École Polytechnique Fédérale de Lausanne. She holds a PhD in Computational Linguistics from the Paris Diderot University (Paris 7) and has been engaged in a large number of scientific projects centred on information extraction and text analysis, both for present-time and historical documents. Before joining the DHLAB, she was at the Linguistics Computing Laboratory of the Sapienza University of Rome, where she worked on the BabelNet resource and contributed to the LIDER project (2013-2014). Prior to that, she worked at the European Commission's Joint Research Centre in Ispra, Italy, as a member of the OPTIMA unit (now the Text and Data Mining unit), which develops innovative and application-oriented solutions (Europe Media Monitor) for retrieving and extracting information from the Internet with a focus on high multilinguality (2009-2013). Previously, she worked at the Xerox Europe Research Centre in Grenoble, France (now Naver Labs Europe) in the Parsing and Semantics unit, first as a PhD candidate supported by a CIFRE grant (2005-2008), then as a post-doctoral researcher (2008-2009). There, her research focused mainly on the automatic processing and fine-grained analysis of entities of interest, specifically named entities and temporal expressions.

Education

PhD in Computational Linguistics


2008 – 2008 Paris 7 Diderot University, LaTTICE laboratory

Master in Computational Linguistics


2004 – 2004 University of Lorraine, France

Master in General Linguistics


2003 – 2003 University of Lorraine, France

Bachelor in History


2002 – 2002 University of Lorraine, France

Bachelor in Comparative Literature


2001 – 2001 University of Lorraine, France

Selected publications

Named Entity Recognition and Classification in Historical Documents: A Survey

Maud Ehrmann, Ahmed Hamdi, Elvys Linhares Pontes, Matteo Romanello, Antoine Doucet.
Published in ACM Computing Surveys (accepted)

Extended Overview of HIPE-2022: Named Entity Recognition and Linking in Multilingual Historical Documents

Maud Ehrmann, Matteo Romanello, Sven Najem-Meyer, Antoine Doucet, Simon Clematide.
Published in CLEF 2022 proceedings

Extended Overview of CLEF HIPE 2020: Named Entity Processing on Historical Newspapers

Maud Ehrmann, Matteo Romanello, Alex Flückiger, Simon Clematide.
Published in CLEF 2020 proceedings

Language Resources for Historical Newspapers: the Impresso Collection

Maud Ehrmann, Matteo Romanello, Simon Clematide, Phillip Benjamin Ströbel, Raphaël Barman.
Published in LREC 2020

Exploring Large Vision-Language Models for Historical Newspaper Segmentation

D. C. Papadopoulos

2025

Advisor(s) : M. Ehrmann, F. Kaplan, P. I. Conti, E. Boros

Investigating OCR-Sensitive Neurons to Improve Entity Recognition in Historical Documents

E. Boros, M. Ehrmann

Sustainability and Empowerment in the Context of Digital Libraries - 26th International Conference on Asia-Pacific Digital Libraries, ICADL 2024, Proceedings. 2025. 26th International Conference on Asia-Pacific Digital Libraries, Bandar Sunway, Malaysia, 2024-12-04 - 2024-12-06. p. 54 - 66.

DOI : 10.1007/978-981-96-0865-2_5.

Data Visualization Dashboard For Large-Scale Data Processing Monitoring And Quality Control

E. G. J. E. Garandel

2025

Advisor(s) : M. Ehrmann, P. I. Conti

Towards Chapterisation of Podcasts: Detection of Host and Structuring Questions in Radio Transcripts

M. Piguet

2024

Advisor(s) : M. Ehrmann

Post-correction of Historical Text Transcripts with Large Language Models: An Exploratory Study

E. Boros, M. Ehrmann, M. Romanello, S. Najem-Meyer, F. Kaplan

Proceedings of the 8th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature (LaTeCH-CLfL 2024). 2024. The 8th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, St Julian's, Malta, March 22, 2024. p. 133 - 159.

impresso Text Reuse at Scale. An interface for the exploration of text reuse data in semantically enriched historical newspapers

M. Düring, M. Romanello, M. Ehrmann, K. Beelen, D. Guido et al.

Frontiers in Big Data

2023

Vol. 6, num. Visualizing Big Culture and History Data.

DOI : 10.3389/fdata.2023.1249469

Where Did the News Come From? Detection of News Agency Releases in Historical Newspapers

L. Marxen

2023

Advisor(s) : M. Ehrmann, E. Boros, M. Duering, F. Kaplan

From Archival Sources to Structured Historical Information: Annotating and Exploring the "Accordi dei Garzoni"

M. Ehrmann, O. Topalov, F. Kaplan

Apprenticeship, Work, Society in Early Modern Venice; Abingdon: Routledge, Taylor & Francis Group,

2023.

DOI : 10.4324/9781003197195-6.

Computational Approaches to Digitised Historical Newspapers (Dagstuhl Seminar 22292)

M. Ehrmann, M. Düring, C. Neudecker, A. Doucet

2023

Digitised Historical Newspapers: A Changing Research Landscape (Introduction)

M. Ehrmann, E. Bunout, F. Clavert

Digitised Newspapers – A New Eldorado for Historians?; Berlin, Boston: De Gruyter Oldenbourg,

2022.

Digitised Newspapers – A New Eldorado for Historians? Reflections on Tools, Methods and Epistemology

Berlin: De Gruyter, 2022.

Introducing the HIPE 2022 Shared Task: Named Entity Recognition and Linking in Multilingual Historical Documents

M. Ehrmann, M. Romanello, A. Doucet, S. Clematide

Advances in Information Retrieval. 2022. 44th European Conference on IR Research, ECIR 2022, Stavanger, Norway, April 10-14, 2022. p. 347 - 354.

DOI : 10.1007/978-3-030-99739-7_44.

Automatic table detection and classification in large-scale newspaper archives

A. Vernet

2022

Advisor(s) : M. Ehrmann, S. Clematide, F. Kaplan

HIPE-2022 Shared Task Named Entity Datasets

M. Ehrmann, M. Romanello, A. Doucet, S. Clematide

2022.

ECCE: Entity-centric Corpus Exploration Using Contextual Implicit Networks

J. Schelb, M. Ehrmann, M. Romanello, A. O. Spitz

WWW ’22 Companion. 2022. The Web Conference (WWW'22), Lyon, France, April 25-29, 2022. p. 1 - 4.

DOI : 10.1145/3487553.3524237.

Overview of HIPE-2022: Named Entity Recognition and Linking in Multilingual Historical Documents

M. Ehrmann, M. Romanello, S. Najem-Meyer, A. Doucet, S. Clematide

Experimental IR Meets Multilinguality, Multimodality, and Interaction. 13th International Conference of the CLEF Association, CLEF 2022, Bologna, Italy, September 5–8, 2022, Proceedings. 2022. 13th Conference and Labs of the Evaluation Forum (CLEF 2022), Bologna, Italy, 5-8 September 2022. p. 423 - 446.

DOI : 10.1007/978-3-031-13643-6_26.

Extended Overview of HIPE-2022: Named Entity Recognition and Linking in Multilingual Historical Documents

M. Ehrmann, M. Romanello, S. Najem-Meyer, A. Doucet, S. Clematide

Proceedings of the Working Notes of CLEF 2022 - Conference and Labs of the Evaluation Forum. 2022. 13th Conference and Labs of the Evaluation Forum (CLEF 2022), Bologna, Italy, 5-8 Sept 2022.

DOI : 10.5281/zenodo.6979577.

Explorer la presse numérisée : le projet Impresso

M. Ehrmann

Revue Historique Vaudoise

2021

Vol. 129/2021.

Named Entity Recognition and Classification in Historical Documents: A Survey

M. Ehrmann, A. Hamdi, E. Linhares Pontes, M. Romanello, A. Doucet

ACM Computing Surveys

2021

Vol. 56, num. 2.

Datasets and Models for Historical Newspaper Article Segmentation

R. Barman, M. Ehrmann, S. Clematide, S. Ares Oliveira

2021.

Combining Visual and Textual Features for Semantic Segmentation of Historical Newspapers

R. Barman, M. Ehrmann, S. Clematide, S. Ares Oliveira, F. Kaplan

Journal of Data Mining & Digital Humanities

2021

Vol. 2021, num. Special Issue on HistoInformatics: Computational Approaches to History.

DOI : 10.5281/zenodo.4065271

Extended Overview of CLEF HIPE 2020: Named Entity Processing on Historical Newspapers

M. Ehrmann, M. Romanello, A. Flückiger, S. Clematide

CLEF 2020 Working Notes. Conference and Labs of the Evaluation Forum. 2020. 11th Conference and Labs of the Evaluation Forum (CLEF 2020), [Online event], 22-25 September, 2020.

DOI : 10.5281/zenodo.4117566.

Overview of CLEF HIPE 2020: Named Entity Recognition and Linking on Historical Newspapers

M. Ehrmann, M. Romanello, A. Flückiger, S. Clematide

Experimental IR meets multilinguality, multimodality, and interaction. 11th International Conference of the CLEF Association, CLEF 2020, Thessaloniki, Greece, September 22–25, 2020, Proceedings. 2020. 11th International Conference of the CLEF Association - CLEF 2020, Thessaloniki, Greece, September 22–25, 2020. p. 288 - .

DOI : 10.1007/978-3-030-58219-7_21.

Language Resources for Historical Newspapers: the Impresso Collection

M. Ehrmann, M. Romanello, S. Clematide, P. B. Ströbel, R. Barman

Proceedings of the 12th Language Resources and Evaluation Conference. 2020. 12th International Conference on Language Resources and Evaluation (LREC), Marseille, France, May 11-16 2020. p. 958 - 968.

DOI : 10.5281/zenodo.4641902.

Introducing the CLEF 2020 HIPE Shared Task: Named Entity Recognition and Linking on Historical Newspapers

M. Ehrmann, M. Romanello, S. Bircher, S. Clematide

Advances in Information Retrieval. ECIR 2020. 2020. ECIR 2020 : 42nd European Conference on Information Retrieval, Lisbon, Portugal, April 14-17, 2020. p. 524 - 532.

DOI : 10.1007/978-3-030-45442-5_68.

CLEF-HIPE-2020 - Shared Task Participation Guidelines

M. Ehrmann, M. Romanello, S. Clematide, A. Flückiger

2020

The impresso system architecture in a nutshell

M. Romanello, M. Ehrmann, S. Clematide, D. Guido

2020

CLEF-HIPE-2020 Shared Task Named Entity Datasets

M. Ehrmann, M. Romanello, S. Clematide, A. Flückiger

2020.

Historical Newspaper Content Mining: Revisiting the impresso Project's Challenges in Text and Image Processing, Design and Historical Scholarship

M. Ehrmann, E. Bunout, S. Clematide, M. Düring, A. Fickers et al.

DH2020 Book of Abstracts. 2020. Digital Humanities Conference (DH), Ottawa, Canada, July 20-24, 2020.

DOI : 10.5281/zenodo.4641894.

Impresso Named Entity Annotation Guidelines (CLEF-HIPE-2020)

M. Ehrmann, C. Watter, M. Romanello, C. Simon, A. Flückiger

2020

Historical Newspaper User Interfaces: A Review

M. Ehrmann, E. Bunout, M. Düring

[Proceedings of the 85th IFLA General Conference and Assembly]. 2019. 85th IFLA General Conference and Assembly, Athens, Greece, 24-30 August 2019. p. 1 - 24.

DOI : 10.5281/zenodo.3404155.

Named Entity Processing for Historical Texts

M. Ehrmann, M. Romanello, S. Clematide

2019.

The Past, Present and Future of Digital Scholarship with Newspaper Collections

M. Ridge, G. Colavizza, L. Brake, M. Ehrmann, J.-P. Moreux et al.

DH 2019 Book of Abstracts. 2019. Digital Humanities Conference, Utrecht, July 2019.

Historical newspaper semantic segmentation using visual and textual features

R. Barman

2019

Advisor(s) : M. Ehrmann, S. Ares Oliveira, S. Clematide

Index-Driven Digitization and Indexation of Historical Archives

G. Colavizza, M. Ehrmann, F. Bortoluzzi

Frontiers in Digital Humanities

2019

Vol. 6, num. 1-16.

DOI : 10.3389/fdigh.2019.00004

Beyond Keyword Search: Semantic Indexing and Exploration of Large Collections of Historical Newspapers

M. Ehrmann

Digital Humanities in the Nordic Countries, Copenhagen, Denmark, March 2019.

Survey of digitized newspaper interfaces (dataset and notebooks)

M. Ehrmann, E. Bunout, M. Duering

2019.

JRC-Names: Multilingual Entity Name variants and titles as Linked Data

M. Ehrmann, G. Jacquet, R. Steinberger

Semantic Web

2017

Vol. 8, num. 2.

DOI : 10.3233/SW-160228

Linked Lexical Knowledge Bases: Foundations and Applications

M. Ehrmann

Computational Linguistics

2017

Vol. 43, num. 2.

DOI : 10.1162/COLI_r_00289

A Method for Record Linkage with Sparse Historical Data

G. Colavizza, M. Ehrmann, Y. Rochat

2016. Digital Humanities Conference 2016, Krakow, Poland, July 11-16, 2016.

Cross-lingual Linking of Multi-word Entities and their corresponding Acronyms

G. Jacquet, M. Ehrmann, R. Steinberger, J. Väyrynen

Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016). 2016. 10th International Conference on Language Resources and Evaluation, Portorož, Slovenia, May 2016.

Named Entity Resources - Overview and Outlook

M. Ehrmann, D. Nouvel, S. Rosset

Proceedings of the 10th International Conference on Language Resources and Evaluation. 2016. 10th International Conference on Language Resources and Evaluation, Portorož, Slovenia, May 2016.

From Documents to Structured Data: First Milestones of the Garzoni Project

M. Ehrmann, G. Colavizza, O. Topalov, R. Cella, D. Drago et al.

DHCommons

2016

num. 2.

Diachronic Evaluation of NER Systems on Old Newspapers

M. Ehrmann, G. Colavizza, Y. Rochat, F. Kaplan

Proceedings of the 13th Conference on Natural Language Processing (KONVENS 2016). 2016. 13th Conference on Natural Language Processing (KONVENS 2016), Bochum, Germany, September 19-21, 2016. p. 97 - 107.

Navigating through 200 years of historical newspapers

Y. Rochat, M. Ehrmann, V. Buntinx, C. Bornet, F. Kaplan

2016. International Conference on Digital Preservation (IPRES), Bern, Switzerland, October 3-6, 2016.

Les entités nommées pour le traitement automatique des langues

D. Nouvel, M. Ehrmann, S. Rosset

ISTE editions, 2015.

Teaching & PhD

Courses

Historical Document and Media Processing

DH-400

This course introduces historical document processing, focusing on concepts and methods that enable the transformation of digitised materials into searchable information. Grounded in machine learning and document processing, it also covers data curation and copyright considerations.