A Comparison of Two Unsupervised Table Recognition Methods from Digital Scientific Articles

Stefan Klampfl, Kris Jack, Roman Kern

Research output: Contribution to journal › Article

Abstract

In digital scientific articles, tables are a common way of presenting information in a structured form. However, the large variability of table layouts and the lack of structural information in digital document formats pose significant challenges for information retrieval and related tasks. In this paper we present two table recognition methods, based on unsupervised learning techniques and heuristics, that automatically detect both the location and the structure of tables within an article stored as PDF. For both algorithms, the table region detection first identifies the bounding boxes of individual tables from a set of labelled text blocks. In the second step, two different tabular structure detection methods extract a rectangular grid of table cells from the set of words contained in these table regions. We evaluate each stage of the algorithms separately and compare performance values on two data sets from different domains. We find that the table recognition performance is in line with state-of-the-art commercial systems and generalises to the non-scientific domain.
Original language: English
Journal: D-Lib Magazine
Volume: 20
Issue number: 11-12
Publication status: Published - 2014

Fields of Expertise

  • Information, Communication & Computing
