Unsupervised document structure analysis of digital scientific articles

Stefan Klampfl; Michael Granitzer; Kris Jack; Roman Kern

doi:10.1007/s00799-014-0115-1

Stefan Klampfl^*, Michael Granitzer, Kris Jack, Roman Kern

^*Corresponding author for this work

Know-Center GmbH Research Center for Data-Driven Business & Big Data Analytics (98770)

Publication: Contribution to journal › Article › peer-review

Abstract

Text mining and information retrieval in large collections of scientific literature require automated processing systems that analyse the documents’ content. However, the layout of scientific articles varies widely across publishers, and common digital document formats are optimised for presentation, but lack structural information. To overcome these challenges, we have developed a processing pipeline that analyses the structure of a PDF document using a number of unsupervised machine learning techniques and heuristics. Apart from the meta-data extraction, which we reused from previous work, our system uses only information available from the current document and does not require any pre-trained model. First, contiguous text blocks are extracted from the raw character stream. Next, we determine geometrical relations between these blocks, which, together with geometrical and font information, are then used to categorize the blocks into different classes. Based on this resulting logical structure we finally extract the body text and the table of contents of a scientific article. We separately evaluate the individual stages of our pipeline on a number of different datasets and compare it with other document structure analysis approaches. We show that it outperforms a state-of-the-art system in terms of the quality of the extracted body text and table of contents. Our unsupervised approach could provide a basis for advanced digital library scenarios that involve diverse and dynamic corpora.
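
The stages named in the abstract (block extraction, classification from geometry and font information, table-of-contents extraction) can be illustrated with a toy sketch. This is not the authors' implementation: the `Line` type, the `gap_factor` threshold, the 0.5 pt font tolerance, and the "larger than the dominant font size means heading" rule are all assumptions chosen for illustration. The only property it shares with the described system is that it uses information from the current document alone and needs no pre-trained model.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass
class Line:
    """A text line with hypothetical layout attributes (units: points)."""
    text: str
    top: float        # y-coordinate of the line's top edge, growing downwards
    height: float     # line height
    font_size: float


def group_lines_into_blocks(lines: List[Line], gap_factor: float = 0.6) -> List[List[Line]]:
    """Group vertically adjacent lines into contiguous blocks.

    A line joins the previous block only if the vertical gap to the previous
    line is small relative to that line's height and the font size matches.
    """
    blocks: List[List[Line]] = []
    for line in sorted(lines, key=lambda l: l.top):
        if blocks:
            prev = blocks[-1][-1]
            gap = line.top - (prev.top + prev.height)
            same_font = abs(line.font_size - prev.font_size) < 0.5
            if gap <= gap_factor * prev.height and same_font:
                blocks[-1].append(line)
                continue
        blocks.append([line])
    return blocks


def label_blocks(blocks: List[List[Line]]) -> List[str]:
    """Label each block relative to the document's own dominant font size."""
    chars_per_size: Counter = Counter()
    for block in blocks:
        for line in block:
            chars_per_size[line.font_size] += len(line.text)
    body_size = chars_per_size.most_common(1)[0][0]

    labels = []
    for block in blocks:
        size = block[0].font_size
        if size > body_size:
            labels.append("heading")
        elif size == body_size:
            labels.append("body")
        else:
            labels.append("other")
    return labels


# Made-up input: two headings, each followed by body text.
lines = [
    Line("1 Introduction", top=100, height=14, font_size=14),
    Line("Text mining in large collections", top=120, height=10, font_size=10),
    Line("requires automated processing.", top=131, height=10, font_size=10),
    Line("2 Method", top=160, height=14, font_size=14),
    Line("Blocks are extracted first.", top=180, height=10, font_size=10),
]

blocks = group_lines_into_blocks(lines)
labels = label_blocks(blocks)

# Headings form a flat table of contents; body blocks form the body text.
toc = [b[0].text for b, lab in zip(blocks, labels) if lab == "heading"]
body_text = " ".join(l.text for b, lab in zip(blocks, labels) if lab == "body" for l in b)
```

Because the dominant font size is estimated from the document itself, the same code applies unchanged to layouts with different base font sizes, which is the kind of layout independence the abstract argues for.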

Original language	English
Pages (from-to)	83-99
Journal	International Journal on Digital Libraries
Volume	14
Issue number	3-4
DOIs	https://doi.org/10.1007/s00799-014-0115-1
Publication status	Published - 2014

Fields of Expertise

Information, Communication & Computing

Access to Document

10.1007/s00799-014-0115-1

Cite this

@article{21634e05993a4db7a02c9a70a4fc7a7e,

title = "Unsupervised document structure analysis of digital scientific articles",

abstract = "Text mining and information retrieval in large collections of scientific literature require automated processing systems that analyse the documents{\textquoteright} content. However, the layout of scientific articles varies widely across publishers, and common digital document formats are optimised for presentation, but lack structural information. To overcome these challenges, we have developed a processing pipeline that analyses the structure of a PDF document using a number of unsupervised machine learning techniques and heuristics. Apart from the meta-data extraction, which we reused from previous work, our system uses only information available from the current document and does not require any pre-trained model. First, contiguous text blocks are extracted from the raw character stream. Next, we determine geometrical relations between these blocks, which, together with geometrical and font information, are then used to categorize the blocks into different classes. Based on this resulting logical structure we finally extract the body text and the table of contents of a scientific article. We separately evaluate the individual stages of our pipeline on a number of different datasets and compare it with other document structure analysis approaches. We show that it outperforms a state-of-the-art system in terms of the quality of the extracted body text and table of contents. Our unsupervised approach could provide a basis for advanced digital library scenarios that involve diverse and dynamic corpora.",

author = "Stefan Klampfl and Michael Granitzer and Kris Jack and Roman Kern",

year = "2014",

doi = "10.1007/s00799-014-0115-1",

language = "English",

volume = "14",

pages = "83--99",

journal = "International Journal on Digital Libraries",

issn = "1432-1300",

publisher = "Springer Verlag",

number = "3-4",

}

TY - JOUR

T1 - Unsupervised document structure analysis of digital scientific articles

AU - Klampfl, Stefan

AU - Granitzer, Michael

AU - Jack, Kris

AU - Kern, Roman

PY - 2014

Y1 - 2014

N2 - Text mining and information retrieval in large collections of scientific literature require automated processing systems that analyse the documents’ content. However, the layout of scientific articles varies widely across publishers, and common digital document formats are optimised for presentation, but lack structural information. To overcome these challenges, we have developed a processing pipeline that analyses the structure of a PDF document using a number of unsupervised machine learning techniques and heuristics. Apart from the meta-data extraction, which we reused from previous work, our system uses only information available from the current document and does not require any pre-trained model. First, contiguous text blocks are extracted from the raw character stream. Next, we determine geometrical relations between these blocks, which, together with geometrical and font information, are then used to categorize the blocks into different classes. Based on this resulting logical structure we finally extract the body text and the table of contents of a scientific article. We separately evaluate the individual stages of our pipeline on a number of different datasets and compare it with other document structure analysis approaches. We show that it outperforms a state-of-the-art system in terms of the quality of the extracted body text and table of contents. Our unsupervised approach could provide a basis for advanced digital library scenarios that involve diverse and dynamic corpora.

AB - Text mining and information retrieval in large collections of scientific literature require automated processing systems that analyse the documents’ content. However, the layout of scientific articles varies widely across publishers, and common digital document formats are optimised for presentation, but lack structural information. To overcome these challenges, we have developed a processing pipeline that analyses the structure of a PDF document using a number of unsupervised machine learning techniques and heuristics. Apart from the meta-data extraction, which we reused from previous work, our system uses only information available from the current document and does not require any pre-trained model. First, contiguous text blocks are extracted from the raw character stream. Next, we determine geometrical relations between these blocks, which, together with geometrical and font information, are then used to categorize the blocks into different classes. Based on this resulting logical structure we finally extract the body text and the table of contents of a scientific article. We separately evaluate the individual stages of our pipeline on a number of different datasets and compare it with other document structure analysis approaches. We show that it outperforms a state-of-the-art system in terms of the quality of the extracted body text and table of contents. Our unsupervised approach could provide a basis for advanced digital library scenarios that involve diverse and dynamic corpora.

U2 - 10.1007/s00799-014-0115-1

DO - 10.1007/s00799-014-0115-1

M3 - Article

SN - 1432-1300

VL - 14

SP - 83

EP - 99

JO - International Journal on Digital Libraries

JF - International Journal on Digital Libraries

IS - 3-4

ER -
