Rag_with_knowledge_graph_neo4j

Technical Paper Extraction and Neo4j Knowledge Graph System

[Figure: System Architecture] [Figure: Example Graph]

Architecture Overview

This project implements a Natural Language Processing (NLP) pipeline for analyzing scientific documents and stores the extracted information in a Neo4j knowledge graph for structured retrieval. The system follows the Retrieval-Augmented Generation (RAG) paradigm to support semantic search and contextual querying across scientific literature.

System Architecture Details

Data Flow Pipeline:

PDF/DOCX parsing with text structure preservation

Scientific text normalization (equation handling, citation formatting)

Node creation with properties and metadata

Relationship establishment with weights/attributes

Embedding vector storage

Query processing and embedding
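The last step of the data flow can be illustrated with a minimal sketch. It assumes that stored document embeddings are plain lists of floats and that ranking uses cosine similarity; the source does not name a similarity measure, so that choice is an assumption. The names `cosine_similarity` and `rank_documents` are hypothetical and not taken from the project.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction; treat as no match
    return dot / (norm_a * norm_b)


def rank_documents(
    query_vec: List[float],
    doc_vecs: Dict[str, List[float]],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the top_k (document id, score) pairs, best match first."""
    scored = [
        (doc_id, cosine_similarity(query_vec, vec))
        for doc_id, vec in doc_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy two-dimensional embeddings (illustrative values only).
toy_docs = {"paper-a": [1.0, 0.0], "paper-b": [0.0, 1.0], "paper-c": [1.0, 1.0]}
top_matches = rank_documents([1.0, 0.0], toy_docs, top_k=2)
print(top_matches)
```

In a deployed system the query vector would come from the same embedding model used at indexing time, since vectors from different models are not comparable.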

Core Components

1. Document Processing Pipeline (ner.py)

The ScientificDocumentPipeline class implements a five-stage extraction workflow that processes both structured and unstructured scientific data.

Key Technical Components:

Processing Workflow:

 def process_document(self, file_path: str) -> "ProcessedDocument":
     """Run the five extraction stages on a single document."""
     # Stage 1: Extract content from PDFs, DOCX, or other sources.
     # Stage 2: Perform Named Entity Recognition (NER).
     # Stage 3: Extract relationships between entities.
     # Stage 4: Topic Modeling using BERTopic.
     # Stage 5: Generate document embeddings for vector search.
     ...  # stage implementations omitted in this outline
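How the stages might hand data to one another can be sketched as follows. This is not the project's implementation: the `ProcessedDocument` fields, the `run_stages` helper, and the stage callables are all hypothetical, and Stage 1 (content extraction) is assumed to have already produced plain text.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Entity = Tuple[str, str]          # (text, label)
Relation = Tuple[str, str, str]   # (head, relation_type, tail)


@dataclass
class ProcessedDocument:
    """Hypothetical result container; field names are assumptions."""
    text: str
    entities: List[Entity] = field(default_factory=list)
    relations: List[Relation] = field(default_factory=list)
    topics: List[str] = field(default_factory=list)
    embedding: List[float] = field(default_factory=list)


def run_stages(
    text: str,
    extract_entities: Callable[[str], List[Entity]],
    extract_relations: Callable[[str, List[Entity]], List[Relation]],
    assign_topics: Callable[[str], List[str]],
    embed: Callable[[str], List[float]],
) -> ProcessedDocument:
    """Apply Stages 2 to 5 in order; relation extraction sees the entities."""
    doc = ProcessedDocument(text=text)
    doc.entities = extract_entities(text)
    doc.relations = extract_relations(text, doc.entities)
    doc.topics = assign_topics(text)
    doc.embedding = embed(text)
    return doc


# Toy stand-ins for the real models (illustrative only).
doc = run_stages(
    "BERT improves NER.",
    extract_entities=lambda t: [("BERT", "MODEL"), ("NER", "TASK")],
    extract_relations=lambda t, ents: [("BERT", "IMPROVES", "NER")],
    assign_topics=lambda t: ["language models"],
    embed=lambda t: [0.1, 0.2, 0.3],
)
print(doc.entities, doc.relations)
```

Passing the stages in as callables keeps the sketch independent of any particular NER, topic, or embedding library.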

Performance Metrics:


2. Knowledge Graph Architecture

The Neo4j knowledge graph enables structured representation and querying of extracted document knowledge.

Graph Schema Design:

// Neo4j 5 syntax; the older ON ... ASSERT form was removed in Neo4j 5.
CREATE CONSTRAINT FOR (d:Document) REQUIRE d.id IS UNIQUE;
CREATE CONSTRAINT FOR (e:Entity) REQUIRE (e.text, e.label) IS UNIQUE;
CREATE CONSTRAINT FOR (t:Topic) REQUIRE t.id IS UNIQUE;
CREATE CONSTRAINT FOR (c:Claim) REQUIRE c.text IS UNIQUE;

Relationship Definitions:

// Schema patterns (property types shown for documentation; not executable Cypher)
(:Document)-[:CONTAINS_ENTITY {confidence: FLOAT}]->(:Entity)
(:Document)-[:HAS_TOPIC {weight: FLOAT}]->(:Topic)
(:Document)-[:MAKES_CLAIM {confidence: FLOAT}]->(:Claim)
(:Entity)-[:RELATES_TO {relation_type: STRING}]->(:Entity)
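One way to populate the CONTAINS_ENTITY relationship is a parameterized MERGE, sketched below. The helper `build_entity_merge` and the shape of the entity dictionaries are assumptions for illustration; the project's actual write path is not shown in the source. The function only builds the query and its parameters, so it runs without a database.

```python
from typing import Any, Dict, List, Tuple

# Cypher that upserts a document, its entities, and the linking relationship.
ENTITY_MERGE_QUERY = (
    "MERGE (d:Document {id: $doc_id}) "
    "WITH d "
    "UNWIND $entities AS ent "
    "MERGE (e:Entity {text: ent.text, label: ent.label}) "
    "MERGE (d)-[r:CONTAINS_ENTITY]->(e) "
    "SET r.confidence = ent.confidence"
)


def build_entity_merge(
    doc_id: str, entities: List[Dict[str, Any]]
) -> Tuple[str, Dict[str, Any]]:
    """Return (query, parameters) for linking a document to its entities.

    Each entity dict is assumed to carry 'text', 'label', and 'confidence'.
    """
    for ent in entities:
        missing = {"text", "label", "confidence"} - ent.keys()
        if missing:
            raise ValueError(f"entity is missing keys: {sorted(missing)}")
        if not 0.0 <= ent["confidence"] <= 1.0:
            raise ValueError("confidence must lie in [0, 1]")
    return ENTITY_MERGE_QUERY, {"doc_id": doc_id, "entities": entities}


query, params = build_entity_merge(
    "paper-1",
    [{"text": "BERT", "label": "MODEL", "confidence": 0.93}],
)
print(query)
print(params)
```

With the official Neo4j Python driver, the pair could then be passed to `session.run(query, params)`. Using MERGE rather than CREATE keeps the write idempotent, so reprocessing a paper does not violate the uniqueness constraints.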

Graph Statistics:


3. Retrieval-Augmented Generation (RAG) System (rag.py)

What This RAG System Does:

Why This System Outperforms General RAG Approaches:

Performance Metrics:


4. OpenAI Integration

The system uses the OpenAI API for embeddings and text completion. The models and sampling parameters are set in the configuration file.

Configuration (config.json)

{
    "neo4j": {
        "uri": "bolt://localhost:7687",
        "user": "neo4j",
        "password": "password"
    },
    "openai": {
        "OPENAI_API_KEY": "your-api-key-here",
        "embedding_model": "text-embedding-ada-002",
        "completion_model": "gpt-4o",
        "temperature": 0.3,
        "max_tokens": 500
    }
}
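A loader can catch common configuration mistakes before any connection is attempted. The sketch below is a hypothetical helper, not part of the project: it checks that the Neo4j URI uses one of the schemes accepted by the official drivers (port 7687 speaks Bolt, not HTTP) and that the OpenAI sampling values are in range.

```python
import json
from urllib.parse import urlparse

# URI schemes accepted by the official Neo4j drivers.
NEO4J_SCHEMES = {"bolt", "bolt+s", "bolt+ssc", "neo4j", "neo4j+s", "neo4j+ssc"}


def load_config(raw_json: str) -> dict:
    """Parse a config document and sanity-check it (hypothetical helper)."""
    config = json.loads(raw_json)

    scheme = urlparse(config["neo4j"]["uri"]).scheme
    if scheme not in NEO4J_SCHEMES:
        raise ValueError(f"unsupported Neo4j URI scheme: {scheme!r}")

    openai_cfg = config["openai"]
    if not 0.0 <= openai_cfg["temperature"] <= 2.0:
        raise ValueError("temperature must be between 0 and 2")
    if openai_cfg["max_tokens"] <= 0:
        raise ValueError("max_tokens must be positive")
    return config


example = """
{
    "neo4j": {"uri": "bolt://localhost:7687", "user": "neo4j", "password": "password"},
    "openai": {
        "OPENAI_API_KEY": "your-api-key-here",
        "embedding_model": "text-embedding-ada-002",
        "completion_model": "gpt-4o",
        "temperature": 0.3,
        "max_tokens": 500
    }
}
"""
config = load_config(example)
print(config["neo4j"]["uri"])
```

In practice it is safer to read the API key and database password from environment variables than to commit them in config.json.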

Usage Examples

Process a Batch of Papers

python process_documents.py --input-dir ./papers --config config.json

Evaluate RAG Performance

python evaluate_rag.py --test-questions ./questions.json --metrics rouge,bert-score
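To give a sense of what the rouge metric measures, here is a minimal ROUGE-1 F1 sketch based on clipped unigram overlap. It assumes lowercase whitespace tokenization with no stemming, so its scores may differ from those of a full ROUGE library; how evaluate_rag.py computes the metric is not shown in the source, and `rouge1_f1` is a hypothetical name.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """ROUGE-1 F1: harmonic mean of unigram precision and recall."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection clips each token to its minimum count in both texts.
    overlap = sum((cand_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Precision is 3/3 and recall is 3/6 for this pair.
score = rouge1_f1("the cat sat", "the cat sat on the mat")
print(round(score, 4))
```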

For more detailed usage or specific configurations, refer to CONTRIBUTING.md.


Future Enhancements

  1. Scientific-Domain Embeddings: Replace Sentence-BERT with specialized models like SciBERT or BioBERT for improved domain coverage.
  2. Neo4j Vector Indexing: Use Neo4j's native vector index, which provides HNSW-based approximate nearest neighbor search (available from Neo4j 5.11).
  3. Citation Network Analysis: Introduce graph algorithms to evaluate citation influence and co-citation patterns.
  4. Multi-Modal Processing: Extract data from figures, tables, and other non-textual elements.
  5. Incremental Learning: Continually update topic models and embeddings as new papers are added.

Additional Implementation Details


License

This project is licensed under the MIT License.

Acknowledgments