Background

In the era of big data and rapid information growth, traditional information retrieval (IR) methods often struggle to effectively manage and extract relevant information from vast and diverse datasets. Conventional IR techniques rely heavily on keyword matching and statistical analysis, which can be insufficient for understanding the context and relationships between entities. As a result, users may face challenges in obtaining precise, contextually relevant information quickly.

Knowledge graphs (KGs) have emerged as a transformative technology in addressing these limitations. A knowledge graph represents a network of real-world entities and their interrelations, capturing both semantic meaning and contextual relationships. By leveraging structured data and semantic relationships, KGs enhance the ability to perform complex queries and deliver more accurate and contextually relevant results. They facilitate a deeper understanding of the data by linking related concepts and providing a richer context for queries.
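To make the idea concrete, a knowledge graph can be reduced to its simplest form: a set of (subject, predicate, object) triplets plus a traversal over them. The entities and relations below are invented purely for illustration:

```python
# Illustrative only: a tiny knowledge graph as (subject, predicate, object)
# triplets, with a traversal that follows relationships from an entity.
triplets = [
    ("Aspirin", "treats", "Headache"),
    ("Aspirin", "interacts_with", "Warfarin"),
    ("Warfarin", "is_a", "Anticoagulant"),
]

def neighbours(entity, triplets):
    """Return every (predicate, object) pair linked from the given entity."""
    return [(p, o) for s, p, o in triplets if s == entity]

# A query about "Aspirin" surfaces related entities and the nature of each
# relationship, rather than plain keyword matches.
print(neighbours("Aspirin", triplets))
# → [('treats', 'Headache'), ('interacts_with', 'Warfarin')]
```

Linking entities this way is what lets a KG answer relational questions ("what does aspirin interact with?") that keyword matching alone cannot.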

The integration of knowledge graphs into information retrieval systems can lead to significant improvements in search accuracy, relevancy, and user satisfaction.

Objective

The objective of this experiment is to design and implement algorithms for automatically generating knowledge graphs from unstructured data sources. This involves entity extraction, relationship identification and schema construction.

Furthermore, it aims to evaluate the effectiveness of Knowledge Graphs in improving data integration, query precision and contextual understanding within information retrieval systems. The goal is to demonstrate improvements in the quality and efficiency of user query responses compared to traditional keyword-based and vector database retrieval methods.

Business use cases and applications

The Knowledge Graph based retrieval system has several business use cases across various domains. Some of the key use cases include:

1. Enhanced data discovery: Knowledge graphs provide a structured representation of concepts and their relationships, enabling researchers to discover relevant data and resources more efficiently. By linking related topics and sources, researchers can uncover connections and insights that might be missed with traditional search methods.

2. Improved customer support: Improve customer support efficiency and effectiveness by leveraging Knowledge Graphs to provide agents with comprehensive and contextually relevant information.

3. Personalised e-commerce experiences: Enhance e-commerce platforms by using Knowledge Graphs to deliver more relevant product recommendations and improve search functionality.

4. Healthcare decision support: Utilise Knowledge Graphs to integrate and analyse diverse healthcare data sources, improving decision-making and patient outcomes.

5. Financial risk management: Enhance financial analysis and risk management by employing Knowledge Graphs to integrate and analyse financial data from multiple sources.

These use cases and applications illustrate how Knowledge Graphs can transform information retrieval and management across various industries, leading to more effective decision-making, personalised user experiences, and improved operational efficiencies.

Environment setup

  • Python: For documentation, exploratory data analysis, data preparation, model training and inference

  • Computing infrastructure: AWS EC2 instance (instance type: t2.xlarge)

  • Azure OpenAI: For generating the knowledge graph and for information retrieval (LLM: gpt-35-turbo-16k, Embedding: text-embedding-ada-002)

  • User interface: Streamlit app developed in Python

Experiment approach

Our approach to Knowledge Graph construction leverages GenAI.

Knowledge Graph Construction


To automate the creation of knowledge graphs from research papers in PDF format using large language models (LLMs), we followed these steps:

1. Extract and clean text: Implemented parsing tools to extract and refine text from the PDFs.

2. Segment text: Broke the processed text down into smaller chunks of 256 tokens each.

3. Generate queries and triplets: Leveraged the Azure OpenAI LLM model to create Cypher queries and triplets from these text chunks.

4. Construct and index knowledge graph: Utilised the generated Cypher queries and triplets to build and index the knowledge graph.
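Steps 2 and 4 above can be sketched in a few lines. Note this is a simplified illustration, not the experiment's actual code: the chunker splits on whitespace rather than a real tokeniser, and the triplet passed in stands in for what the LLM would produce:

```python
import re

def segment_text(text, chunk_size=256):
    """Split text into chunks of roughly `chunk_size` tokens.
    Whitespace splitting is a simple stand-in for a proper tokeniser."""
    tokens = text.split()
    return [" ".join(tokens[i:i + chunk_size])
            for i in range(0, len(tokens), chunk_size)]

def triplet_to_cypher(subject, predicate, obj):
    """Render one (subject, predicate, object) triplet as a Cypher MERGE
    statement; the predicate is normalised into a relationship type."""
    rel = re.sub(r"\W+", "_", predicate).upper()
    return (f"MERGE (s:Entity {{name: '{subject}'}}) "
            f"MERGE (o:Entity {{name: '{obj}'}}) "
            f"MERGE (s)-[:{rel}]->(o)")

chunks = segment_text("word " * 600)   # 600 tokens → 3 chunks
# A hand-written triplet standing in for LLM output on one chunk:
print(triplet_to_cypher("BERT", "is based on", "Transformer"))
```

Executing the generated MERGE statements against the graph database is what makes the construction idempotent: re-ingesting the same paper does not duplicate nodes or relationships.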

Query Engine


The process for handling a user query using a knowledge graph is explained below:

  • User query: User submits a query.

  • Extract query keywords: Applied the Azure OpenAI LLM to extract keywords from the user's query.

  • Triplets retriever engine: After extracting the keywords, we used the Triplets Retriever Engine to search the Knowledge Graph Index and identify relevant triplets.

  • Retrieve top-k similar triplets: We then retrieved the top-k most similar triplets and their associated text from the Knowledge Graph Index. Triplets are structured data units consisting of a subject, predicate, and object that represent relationships and attributes of entities.

  • Response generation: Leveraged the LLM to combine the user's query, the retrieved triplets, and additional contextual information to generate a response.

  • Output: The generated response is delivered to the user based on the retrieved context.
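The retrieval steps above can be sketched as follows. This is an illustrative approximation only: the experiment used an LLM for keyword extraction and an index for similarity search, whereas the sketch below uses a stopword filter and simple keyword-overlap scoring:

```python
def extract_keywords(query,
                     stopwords=frozenset({"what", "is", "the", "of",
                                          "a", "an", "how", "does"})):
    """Crude keyword extraction; the experiment used an LLM for this step."""
    return {w.lower().strip("?.,") for w in query.split()} - stopwords

def retrieve_top_k(keywords, triplets, k=2):
    """Score each triplet by keyword overlap and return the top-k matches."""
    def score(t):
        words = " ".join(t).lower().split()
        return sum(1 for w in words if w in keywords)
    ranked = sorted(triplets, key=score, reverse=True)
    return [t for t in ranked if score(t) > 0][:k]

triplets = [
    ("BERT", "is based on", "Transformer"),
    ("BERT", "was released by", "Google"),
    ("ResNet", "is used for", "image classification"),
]
kws = retrieve_top_k(extract_keywords("What is BERT based on?"), triplets)
print(kws)
# → [('BERT', 'is based on', 'Transformer'), ('BERT', 'was released by', 'Google')]
```

The retrieved triplets and their source text are then passed to the LLM as context for response generation.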

Choice of algorithms

  • Neo4j: Neo4j is a leading graph database management system designed to store, query, and analyse data in graph structures. Unlike traditional relational databases, which organise data into tables and rows, Neo4j represents data as nodes (entities) and relationships (connections) between them. This approach is particularly well-suited for applications that involve complex, interconnected data.

  • Nebula: Nebula Graph is an open-source, distributed graph database designed for managing and querying large-scale graph data. It is engineered for high-performance, real-time graph analytics and is particularly well-suited for applications involving complex, interconnected data.

  • Llamaindex: LlamaIndex (formerly known as GPT Index) is a versatile, open-source indexing and retrieval library designed to facilitate the integration of large language models (LLMs) with structured and unstructured data. Its core functionality is to index documents, build retrieval systems, and perform efficient information retrieval, making it easier to leverage LLMs for applications such as search engines, question-answering systems, and document summarisation.

  • Generative AI: Utilised the Azure OpenAI large language model (LLM) to extract entities and relationships and to generate responses to user queries.

Experiment outcomes

The experiment confirmed that using a knowledge graph for information retrieval and question answering offers significant advantages over traditional keyword-based search methods.

Additionally, we assessed the responses generated by the Knowledge Graph-based approach against those produced by RAG techniques utilising vector databases. The knowledge graph provided higher precision, faster response times, and greater contextual accuracy, resulting in an improved user experience. Its structured approach to data representation and querying enhanced the ability to handle complex and nuanced queries effectively.
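One common way to quantify such a comparison is precision@k: the fraction of the top-k retrieved items that are actually relevant. The relevance judgments and document IDs below are hypothetical, not the experiment's data:

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved items that are relevant."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k

# Hypothetical relevance judgments for one query (illustration only).
relevant = {"d1", "d3", "d5"}
kg_results     = ["d1", "d3", "d5", "d7"]   # knowledge-graph retrieval
vector_results = ["d1", "d2", "d4", "d3"]   # vector-database retrieval

print(precision_at_k(kg_results, relevant, 3))      # → 1.0
print(precision_at_k(vector_results, relevant, 3))  # → ~0.33
```

Averaging this metric over a set of evaluation queries gives a single number per retrieval method, which makes the comparison between the two approaches reproducible.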

Sample Knowledge Graph

[Figure: sample knowledge graph generated by the experiment]

Sample Output

[Figure: sample output from the query engine]

What's next ...

  • Enhance accuracy: Enhance the accuracy of the system's responses by fine-tuning the model on larger and custom datasets.

  • Incorporate feedback loop: Iterative improvements driven by user feedback can greatly improve system performance, as ongoing analysis of user interactions and adjustments based on identified issues and preferences help the system evolve and offer more precise answers.

  • Elevate user experience: Enhancing the user interface and experience can make the system more intuitive and accessible by creating a clear design, offering suggestions for unclear queries, and integrating features like autocomplete and spell-check to help users craft their questions more effectively.