More Efficient Manual Review of Automatically Transcribed Tabular Data
Permanent link
https://hdl.handle.net/10037/34528
Date
2024-04-04
Type
Journal article
Peer reviewed
Authors
Pedersen, Bjørn-Richard; Johansen, Rigmor Katrine; Holsbø, Einar Jakobsen; Sommerseth, Hilde Leikny; Bongo, Lars Ailo Aslaksen
Abstract
Any machine learning method for transcribing historical text requires manual verification and correction,
which is often time-consuming and expensive. Our aim is to make it more efficient. Previously, we
developed a machine learning model to transcribe 2.3 million handwritten occupation codes from the
Norwegian 1950 census. Here, we manually review the 90,000 codes (3%) for which our model had
the lowest confidence scores. We allocated these codes to human reviewers, who used our custom
annotation tool to review them. The reviewers agreed with the model's labels 31.9% of the time. They
corrected 62.8% of the labels, and 5.1% of the images were uncertain or assigned invalid labels. 9,000
images were reviewed by multiple reviewers, resulting in an agreement of 86.4% and a disagreement
of 9%. The results suggest that one reviewer per image is sufficient. We recommend that reviewers
indicate any uncertainty about the label they assign to an image by adding a flag to their label. Our
interviews show that the reviewers performed internal quality control and found our custom tool
to be useful and easy to operate. We provide guidelines for efficient and accurate transcription of
historical text by combining machine learning and manual review. We have open-sourced our custom
annotation tool and made the reviewed images open access.
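
The selection and agreement steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `select_for_review` and `agreement_rate`, the list-based data layout, and the toy values are all assumptions made for illustration. The sketch ranks predictions by model confidence, picks the lowest-scoring fraction for manual review, and then measures how often a reviewer kept the model's label.

```python
from typing import Sequence


def select_for_review(confidences: Sequence[float], fraction: float) -> list[int]:
    """Return the indices of the lowest-confidence predictions, lowest first.

    `fraction` is the share of all predictions to send to manual review
    (hypothetical parameter; the abstract reports reviewing about 3%).
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    n_review = int(len(confidences) * fraction)
    # Sort indices by ascending confidence so the least certain come first.
    ranked = sorted(range(len(confidences)), key=lambda i: confidences[i])
    return ranked[:n_review]


def agreement_rate(model_labels: Sequence[str], reviewer_labels: Sequence[str]) -> float:
    """Share of reviewed items where the reviewer kept the model's label."""
    if len(model_labels) != len(reviewer_labels):
        raise ValueError("label sequences must have the same length")
    if not model_labels:
        return 0.0
    matches = sum(m == r for m, r in zip(model_labels, reviewer_labels))
    return matches / len(model_labels)


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    confidences = [0.9, 0.2, 0.5, 0.1]
    to_review = select_for_review(confidences, fraction=0.5)
    print("indices to review:", to_review)

    model = ["111", "222", "333", "444"]
    reviewer = ["111", "229", "333", "440"]
    print("agreement:", agreement_rate(model, reviewer))
```

The same `agreement_rate` function could also compare two reviewers' labels for the images that were reviewed more than once, since it only counts matching pairs.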
Publisher
IISH
Citation
Pedersen, Johansen, Holsbø, Sommerseth, Bongo. More Efficient Manual Review of Automatically Transcribed Tabular Data. Historical Life Course Studies. 2024;14:3-15
Copyright 2024 The Author(s)