Interactive labelling of a multivariate dataset for supervised machine learning using linked visualisations, clustering, and active learning

Mohammad Chegini; Jürgen Bernard; Philip Berger; Alexei Sourin; Keith Andrews; Tobias Schreck

doi:10.1016/j.visinf.2019.03.002

Research output: Contribution to journal › Article › peer-review

Abstract

Supervised machine learning techniques require labelled multivariate training datasets. Many approaches address the issue of unlabelled datasets by tightly coupling machine learning algorithms with interactive visualisations. Using appropriate techniques, analysts can play an active role in a highly interactive and iterative machine learning process to label the dataset and create meaningful partitions. While this principle has been implemented either for unsupervised, semi-supervised, or supervised machine learning tasks, the combination of all three methodologies remains challenging. In this paper, a visual analytics approach is presented, combining a variety of machine learning capabilities with four linked visualisation views, all integrated within the mVis (multivariate Visualiser) system. The available palette of techniques allows an analyst to perform exploratory data analysis on a multivariate dataset and divide it into meaningful labelled partitions, from which a classifier can be built. In the workflow, the analyst can label interesting patterns or outliers in a semi-supervised process supported by active learning. Once a dataset has been interactively labelled, the analyst can continue the workflow with supervised machine learning to assess to what degree the subsequent classifier has effectively learned the concepts expressed in the labelled training dataset. Using a novel technique called automatic dimension selection, interactions the analyst had with dimensions of the multivariate dataset are used to steer the machine learning algorithms. A real-world football dataset is used to show the utility of mVis for a series of analysis and labelling tasks, from initial labelling through iterations of data exploration, clustering, classification, and active learning to refine the named partitions, to finally producing a high-quality labelled training dataset suitable for training a classifier. 
The tool empowers the analyst with interactive visualisations including scatterplots, parallel coordinates, similarity maps for records, and a new similarity map for partitions.
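The labelling loop described in the abstract can be illustrated with a minimal pool-based active learning sketch. This is not the mVis implementation: the nearest-centroid classifier, the margin-based uncertainty measure, the per-dimension `weights` vector (a hypothetical stand-in for steering by the dimensions the analyst interacted with), the synthetic two-cluster data, and all function names are assumptions chosen for illustration. The `oracle` callback plays the role of the analyst who supplies a label when asked.

```python
import math
import random


def centroid(points):
    """Mean of a list of equal-length numeric tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def weighted_distance(a, b, weights):
    """Euclidean distance with per-dimension weights (hypothetical steering)."""
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))


def fit_centroids(records, labels):
    """Nearest-centroid 'classifier': one centroid per label seen so far."""
    by_label = {}
    for i, lbl in labels.items():
        by_label.setdefault(lbl, []).append(records[i])
    return {lbl: centroid(pts) for lbl, pts in by_label.items()}


def predict(record, centroids, weights):
    """Label of the closest centroid."""
    return min(centroids,
               key=lambda lbl: weighted_distance(record, centroids[lbl], weights))


def margin(record, centroids, weights):
    """Gap between the two nearest centroids; a small gap means uncertain.
    Assumes at least two labels are already present."""
    d = sorted(weighted_distance(record, c, weights) for c in centroids.values())
    return d[1] - d[0]


def active_label(records, seed_labels, oracle, weights, budget):
    """Repeatedly ask the oracle (the analyst) to label the unlabelled
    record that the current classifier is least sure about."""
    labels = dict(seed_labels)  # record index -> label
    for _ in range(budget):
        pool = [i for i in range(len(records)) if i not in labels]
        if not pool:
            break
        centroids = fit_centroids(records, labels)
        query = min(pool, key=lambda i: margin(records[i], centroids, weights))
        labels[query] = oracle(query)
    return labels


def demo():
    """Two synthetic, well-separated partitions; the oracle knows the truth."""
    rng = random.Random(0)
    records = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(20)]
    records += [(5 + rng.uniform(-1, 1), 5 + rng.uniform(-1, 1)) for _ in range(20)]
    truth = ["A"] * 20 + ["B"] * 20
    weights = (1.0, 1.0)  # equal weights; an analyst-driven scheme could differ
    labels = active_label(records, {0: "A", 20: "B"},
                          oracle=lambda i: truth[i], weights=weights, budget=6)
    centroids = fit_centroids(records, labels)
    correct = sum(predict(r, centroids, weights) == t
                  for r, t in zip(records, truth))
    return labels, correct / len(records)


if __name__ == "__main__":
    final_labels, accuracy = demo()
    print(len(final_labels), "records labelled; accuracy", accuracy)
```

In this sketch the two seed labels would come from an initial exploratory step such as clustering, and the final accuracy check corresponds to the supervised step in which the analyst assesses how well a classifier has learned the labelled partitions.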

Original language	English
Pages (from-to)	9 - 17
Number of pages	9
Journal	Visual Informatics
Volume	3
Issue number	1
DOIs	https://doi.org/10.1016/j.visinf.2019.03.002
Publication status	Published - 1 Mar 2019
Event	PacificVAST 2019 - Bangkok, Bangkok, Thailand Duration: 23 Apr 2019 → … http://research.cbs.chula.ac.th/pvis2019/PacificVAST.aspx

Keywords

Labelling, Clustering, Classification, Active learning, Multivariate data, Visualisation

ASJC Scopus subject areas

Food Science

Access to Document

10.1016/j.visinf.2019.03.002

http://www.sciencedirect.com/science/article/pii/S2468502X19300178

Cite this

@article{9adf46e6fbaf45ed901c8acf07ac7cc9,

title = "Interactive labelling of a multivariate dataset for supervised machine learning using linked visualisations, clustering, and active learning",

abstract = "Supervised machine learning techniques require labelled multivariate training datasets. Many approaches address the issue of unlabelled datasets by tightly coupling machine learning algorithms with interactive visualisations. Using appropriate techniques, analysts can play an active role in a highly interactive and iterative machine learning process to label the dataset and create meaningful partitions. While this principle has been implemented either for unsupervised, semi-supervised, or supervised machine learning tasks, the combination of all three methodologies remains challenging. In this paper, a visual analytics approach is presented, combining a variety of machine learning capabilities with four linked visualisation views, all integrated within the mVis (multivariate Visualiser) system. The available palette of techniques allows an analyst to perform exploratory data analysis on a multivariate dataset and divide it into meaningful labelled partitions, from which a classifier can be built. In the workflow, the analyst can label interesting patterns or outliers in a semi-supervised process supported by active learning. Once a dataset has been interactively labelled, the analyst can continue the workflow with supervised machine learning to assess to what degree the subsequent classifier has effectively learned the concepts expressed in the labelled training dataset. Using a novel technique called automatic dimension selection, interactions the analyst had with dimensions of the multivariate dataset are used to steer the machine learning algorithms. A real-world football dataset is used to show the utility of mVis for a series of analysis and labelling tasks, from initial labelling through iterations of data exploration, clustering, classification, and active learning to refine the named partitions, to finally producing a high-quality labelled training dataset suitable for training a classifier. 
The tool empowers the analyst with interactive visualisations including scatterplots, parallel coordinates, similarity maps for records, and a new similarity map for partitions.",

keywords = "Labelling, Clustering, Classification, Active learning, Multivariate data, Visualisation",
author = "Mohammad Chegini and J{\"u}rgen Bernard and Philip Berger and Alexei Sourin and Keith Andrews and Tobias Schreck",
note = "SI: Proceedings of PacificVAST 2019; PacificVAST 2019; Conference date: 23-04-2019",
year = "2019",
month = mar,
day = "1",
doi = "10.1016/j.visinf.2019.03.002",
language = "English",
volume = "3",
pages = "9--17",
journal = "Visual Informatics",
issn = "2468-502X",
publisher = "Elsevier B.V.",
number = "1",
url = "http://research.cbs.chula.ac.th/pvis2019/PacificVAST.aspx",
}

TY - JOUR
T1 - Interactive labelling of a multivariate dataset for supervised machine learning using linked visualisations, clustering, and active learning
AU - Chegini, Mohammad
AU - Bernard, Jürgen
AU - Berger, Philip
AU - Sourin, Alexei
AU - Andrews, Keith
AU - Schreck, Tobias
N1 - SI: Proceedings of PacificVAST 2019
PY - 2019/3/1
Y1 - 2019/3/1

AB - Supervised machine learning techniques require labelled multivariate training datasets. Many approaches address the issue of unlabelled datasets by tightly coupling machine learning algorithms with interactive visualisations. Using appropriate techniques, analysts can play an active role in a highly interactive and iterative machine learning process to label the dataset and create meaningful partitions. While this principle has been implemented either for unsupervised, semi-supervised, or supervised machine learning tasks, the combination of all three methodologies remains challenging. In this paper, a visual analytics approach is presented, combining a variety of machine learning capabilities with four linked visualisation views, all integrated within the mVis (multivariate Visualiser) system. The available palette of techniques allows an analyst to perform exploratory data analysis on a multivariate dataset and divide it into meaningful labelled partitions, from which a classifier can be built. In the workflow, the analyst can label interesting patterns or outliers in a semi-supervised process supported by active learning. Once a dataset has been interactively labelled, the analyst can continue the workflow with supervised machine learning to assess to what degree the subsequent classifier has effectively learned the concepts expressed in the labelled training dataset. Using a novel technique called automatic dimension selection, interactions the analyst had with dimensions of the multivariate dataset are used to steer the machine learning algorithms. A real-world football dataset is used to show the utility of mVis for a series of analysis and labelling tasks, from initial labelling through iterations of data exploration, clustering, classification, and active learning to refine the named partitions, to finally producing a high-quality labelled training dataset suitable for training a classifier. 
The tool empowers the analyst with interactive visualisations including scatterplots, parallel coordinates, similarity maps for records, and a new similarity map for partitions.

KW - Multivariate data
KW - Active learning
KW - Classification
KW - Visualisation
KW - Labelling
KW - Clustering
UR - http://www.scopus.com/inward/record.url?scp=85066328558&partnerID=8YFLogxK
U2 - 10.1016/j.visinf.2019.03.002
DO - 10.1016/j.visinf.2019.03.002
M3 - Article
SN - 2468-502X
VL - 3
SP - 9
EP - 17
JO - Visual Informatics
JF - Visual Informatics
IS - 1
T2 - PacificVAST 2019
Y2 - 23 April 2019
ER -
