A Comparison of Two Unsupervised Table Recognition Methods from Digital Scientific Articles

Stefan Klampfl; Kris Jack; Roman Kern

doi:10.1045/november14-klampfl

Know-Center GmbH Research Center for Data-Driven Business & Big Data Analytics (98770)

Research output: Contribution to journal › Article

Abstract

In digital scientific articles, tables are a common form of presenting information in a structured way. However, the large variability of table layouts and the lack of structural information in digital document formats pose significant challenges for information retrieval and related tasks. In this paper we present two table recognition methods, based on unsupervised learning techniques and heuristics, which automatically detect both the location and the structure of tables within an article stored as PDF. For both algorithms, the table region detection first identifies the bounding boxes of individual tables from a set of labelled text blocks. In the second step, two different tabular structure detection methods extract a rectangular grid of table cells from the set of words contained in these table regions. We evaluate each stage of the algorithms separately and compare performance values on two data sets from different domains. We find that the table recognition performance is in line with state-of-the-art commercial systems and generalises to the non-scientific domain.

Original language	English
Journal	D-Lib Magazine
Volume	20
Issue number	11-12
DOIs	https://doi.org/10.1045/november14-klampfl
Publication status	Published - 2014

Fields of Expertise

Information, Communication & Computing

Access to Document

10.1045/november14-klampfl

Licence: Other

Cite this

@article{9be4470baaa044af8bdf56a21b50f3cd,

title = "A Comparison of Two Unsupervised Table Recognition Methods from Digital Scientific Articles",

abstract = "In digital scientific articles, tables are a common form of presenting information in a structured way. However, the large variability of table layouts and the lack of structural information in digital document formats pose significant challenges for information retrieval and related tasks. In this paper we present two table recognition methods, based on unsupervised learning techniques and heuristics, which automatically detect both the location and the structure of tables within an article stored as PDF. For both algorithms, the table region detection first identifies the bounding boxes of individual tables from a set of labelled text blocks. In the second step, two different tabular structure detection methods extract a rectangular grid of table cells from the set of words contained in these table regions. We evaluate each stage of the algorithms separately and compare performance values on two data sets from different domains. We find that the table recognition performance is in line with state-of-the-art commercial systems and generalises to the non-scientific domain.",

author = "Stefan Klampfl and Kris Jack and Roman Kern",

year = "2014",

doi = "10.1045/november14-klampfl",

language = "English",

volume = "20",

journal = "D-Lib Magazine",

number = "11-12",

}

TY - JOUR

T1 - A Comparison of Two Unsupervised Table Recognition Methods from Digital Scientific Articles

AU - Klampfl, Stefan

AU - Jack, Kris

AU - Kern, Roman

PY - 2014

Y1 - 2014

N2 - In digital scientific articles, tables are a common form of presenting information in a structured way. However, the large variability of table layouts and the lack of structural information in digital document formats pose significant challenges for information retrieval and related tasks. In this paper we present two table recognition methods, based on unsupervised learning techniques and heuristics, which automatically detect both the location and the structure of tables within an article stored as PDF. For both algorithms, the table region detection first identifies the bounding boxes of individual tables from a set of labelled text blocks. In the second step, two different tabular structure detection methods extract a rectangular grid of table cells from the set of words contained in these table regions. We evaluate each stage of the algorithms separately and compare performance values on two data sets from different domains. We find that the table recognition performance is in line with state-of-the-art commercial systems and generalises to the non-scientific domain.

AB - In digital scientific articles, tables are a common form of presenting information in a structured way. However, the large variability of table layouts and the lack of structural information in digital document formats pose significant challenges for information retrieval and related tasks. In this paper we present two table recognition methods, based on unsupervised learning techniques and heuristics, which automatically detect both the location and the structure of tables within an article stored as PDF. For both algorithms, the table region detection first identifies the bounding boxes of individual tables from a set of labelled text blocks. In the second step, two different tabular structure detection methods extract a rectangular grid of table cells from the set of words contained in these table regions. We evaluate each stage of the algorithms separately and compare performance values on two data sets from different domains. We find that the table recognition performance is in line with state-of-the-art commercial systems and generalises to the non-scientific domain.

UR - http://www.dlib.org/dlib/november14/klampfl/11klampfl.html

U2 - 10.1045/november14-klampfl

DO - 10.1045/november14-klampfl

M3 - Article

VL - 20

JO - D-Lib Magazine

JF - D-Lib Magazine

IS - 11-12

ER -
