ICDAR2017 Competition on Post-OCR Text Correction

Title: ICDAR2017 Competition on Post-OCR Text Correction
Publication type: Conference paper
Year of publication: 2018
Authors: Jean-Philippe Moreux, Guillaume Chiron, Antoine Doucet, Coustady, M
Conference name: 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR)
Meeting date: Nov. 2017
Conference location: Kyoto, Japan
Keywords: OCR; OCR errors

This paper describes the ICDAR2017 competition on post-OCR text correction and presents the different methods submitted by the participants. OCR has been an active research field for over 30 years, but results are still imperfect, especially for historical documents. The purpose of this competition is to compare and evaluate automatic approaches for correcting (denoising) OCR-ed texts. The challenge consists of two independent tasks: 1) error detection and 2) error correction. An original dataset of 12M OCR-ed symbols, along with an aligned ground truth, was provided to the participants, with 80% of the dataset dedicated to training and 20% to evaluation. Several sources were aggregated, comprising newspapers and monographs in two languages (English and French). Eleven teams submitted results, and the difficulty of the task was underlined by the fact that only half of the submitted methods were able to denoise the evaluation dataset on average. In any case, this competition, which counted 35 registrations, illustrates the strong interest of the community in this essential problem, which is key to any digitization process involving textual data.
