Fine-tuning Large Language Models on historical causes of death data

Wilhelmsen, Kristoffer Berg

Permanent link

https://hdl.handle.net/10037/34160

Open

thesis.pdf (2.508Mb)

Date

2024-05-15

Type

Master thesis

Author

Wilhelmsen, Kristoffer Berg

Abstract

This thesis assesses the impact of fine-tuning and retrieval-augmented generation (RAG) on the accuracy of large language models (LLMs) in assigning ICD-10 codes to historical causes of death. Using funeral records from Trondheim, Norway (1830-1920), we fine-tuned Llama 3 and Mistral on 2000 records. Twelve experiments were conducted on 2000 additional records to evaluate the accuracy of each knowledge-injection technique, as well as a combination of the two. The results indicate that fine-tuning as a standalone knowledge-injection technique achieved the highest accuracy, generating 88% full matches and 2% partial matches for ICD-10 codes, compared with 58% full matches and 25% partial matches in previous research. However, concerns remain regarding memorization of training data due to the lack of diversity in the available dataset. Moreover, combining RAG with fine-tuning led to a decrease in accuracy, while a RAG-only approach reduced accuracy even further. These findings serve as a proof of concept for the automatic assignment of ICD-10 codes to historical causes of death, paving the way for future research.

Publisher

UiT Norges arktiske universitet
UiT The Arctic University of Norway

Collections

Master's theses in computer technology and computational engineering [82]

The following license file is associated with this item:

Original license

Unless otherwise stated, this item's license is described as Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)