Enhancing Investigative Journalism: Leveraging Large Language Models and Vector Databases

Ali, Muhammad Nauman

dc.contributor.advisor	Bongo, Lars Ailo
dc.contributor.advisor	Ricaud, Benjamin
dc.contributor.advisor	Bakkeli, Nicoali
dc.contributor.author	Ali, Muhammad Nauman
dc.date.accessioned	2024-06-13T05:35:06Z
dc.date.available	2024-06-13T05:35:06Z
dc.date.issued	2024-05-15	en
dc.description.abstract	Advances in Artificial Intelligence (AI) have transformed almost every field, and journalism is one of them, with prospective uses in investigative reporting and uncovering information. This project explores integrating Large Language Models (LLMs) with vector databases. It addresses two tasks: information retrieval, and the use of LLMs to summarize documents and find information of interest to journalists. We begin with an overview of related concepts and literature. Then, in the methodology, we propose a system based on that literature. The proposed system follows a Retrieval Augmented Generation (RAG) architecture that combines a vector database with an LLM. The vector database retrieves relevant documents efficiently, and the LLM condenses the information and identifies any irregularities in the cases. iTromsø provided a series of queries and prompts, with which the system was tested. iTromsø then evaluated the results, both the retrieved documents and the answers to the prompts. Document retrieval accuracy varied: some queries returned the most relevant documents, while others failed completely to retrieve the intended document. The quality of the answers also varied, as expected, with ChatGPT 4 outperforming ChatGPT 3.5 Turbo and GPT4All in answering the prompts accurately. Duplicate documents, special characters, and blank spaces in the text degraded document retrieval, so that in most cases the most desired document was not retrieved. The responses of ChatGPT 3.5 Turbo and GPT4All, but not those of ChatGPT 4, were also affected by special characters and white space.
Although the proposed system shows advantages in scalability and efficiency over traditional approaches when assisting journalists with the investigative process, the limitations in accurate document retrieval must be addressed by cleaning the text data.	en_US
dc.identifier.uri	https://hdl.handle.net/10037/33792
dc.language.iso	eng	en_US
dc.publisher	UiT Norges arktiske universitet	no
dc.publisher	UiT The Arctic University of Norway	en
dc.rights.holder	Copyright 2024 The Author(s)
dc.rights.uri	https://creativecommons.org/licenses/by-nc-sa/4.0	en_US
dc.rights	Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)	en_US
dc.subject.courseID	INF-3990
dc.subject	LLM	en_US
dc.subject	Journalism	en_US
dc.subject	RAG	en_US
dc.subject	Vector Database	en_US
dc.subject	ChatGPT	en_US
dc.subject	Summarization	en_US
dc.title	Enhancing Investigative Journalism: Leveraging Large Language Models and Vector Databases	en_US
dc.type	Mastergradsoppgave	no
dc.type	Master thesis	en

Associated file(s)

Name: thesis.pdf
Size: 2.620Mb
Format: PDF

Name: license.txt
Size: 1.093Kb
Format: Text file

This item appears in the following collection(s)

Mastergradsoppgaver i informatikk [135]

Unless otherwise stated, this item's license is described as Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)