Lessons Learned Developing and Using a Machine Learning Model to Automatically Transcribe 2.3 Million Handwritten Occupation Codes

Pedersen, Bjørn-Richard; Holsbø, Einar; Andersen, Trygve; Shvetsov, Nikita; Ravn, Johan; Sommerseth, Hilde Leikny; Bongo, Lars Ailo

dc.contributor.author	Pedersen, Bjørn-Richard
dc.contributor.author	Holsbø, Einar
dc.contributor.author	Andersen, Trygve
dc.contributor.author	Shvetsov, Nikita
dc.contributor.author	Ravn, Johan
dc.contributor.author	Sommerseth, Hilde Leikny
dc.contributor.author	Bongo, Lars Ailo
dc.date.accessioned	2022-02-09T13:38:12Z
dc.date.available	2022-02-09T13:38:12Z
dc.date.issued	2022-01-06
dc.description.abstract	Machine learning approaches achieve high accuracy for text recognition and are therefore increasingly used for the transcription of handwritten historical sources. However, using machine learning in production requires a streamlined end-to-end pipeline that scales to the dataset size and a model that achieves high accuracy with few manual transcriptions. The correctness of the model results must also be verified. This paper describes our lessons learned developing, tuning and using the Occode end-to-end machine learning pipeline for transcribing 2.3 million handwritten occupation codes from the Norwegian 1950 population census. We achieve an accuracy of 97% for the automatically transcribed codes, and we send 3% of the codes for manual verification. We verify that the occupation code distribution found in our results matches the distribution found in our training data, which should be representative of the census as a whole. We believe our approach and lessons learned may be useful for other transcription projects that plan to use machine learning in production.	en_US
dc.identifier.citation	Pedersen B, Holsbø EJ, Andersen T, Shvetsov N, Ravn J, Sommerseth HL, Bongo LA. Lessons Learned Developing and Using a Machine Learning Model to Automatically Transcribe 2.3 Million Handwritten Occupation Codes. Historical Life Course Studies. 2022;11:1-17	en_US
dc.identifier.cristinID	FRIDAID 1986163
dc.identifier.doi	https://doi.org/10.51964/hlcs11331
dc.identifier.issn	2352-6343
dc.identifier.uri	https://hdl.handle.net/10037/24000
dc.language.iso	eng	en_US
dc.relation.journal	Historical Life Course Studies
dc.relation.projectID	info:eu-repo/grantAgreement/RCN/FORINFRA/225950/Norway/National Historical Population Register for Norway 1800-2024 (HPR) / Historisk befolkningsregister (HBR) 1800-2024//	en_US
dc.rights.accessRights	openAccess	en_US
dc.rights.holder	Copyright 2022 The Author(s)	en_US
dc.title	Lessons Learned Developing and Using a Machine Learning Model to Automatically Transcribe 2.3 Million Handwritten Occupation Codes	en_US
dc.type.version	publishedVersion	en_US
dc.type	Journal article	en_US
dc.type	Tidsskriftartikkel	en_US
dc.type	Peer reviewed	en_US

File(s) in this item

Name: article.pdf
Size: 1.723Mb
Format: PDF

This item appears in the following collection(s)

Artikler, rapporter og annet (arkeologi, historie, religionsvitenskap og teologi) [301]