How We Process 10,000+ Books

Processing a single book—extracting its core insights, decomposing them into atomic nodes, mapping their relationships—is intellectually demanding work.

Processing ten thousand books is an engineering challenge.

This is the story of how we built a pipeline that scales human-quality knowledge curation to library-scale volumes.

The Core Challenge

Most people assume knowledge extraction is either:

  • Fully manual — subject matter experts read and annotate every book (high quality, doesn't scale)
  • Fully automated — algorithms extract keywords and summaries (scales beautifully, low quality)

We needed something in between: automation that amplifies human judgment, not replaces it.

Our pipeline combines machine learning, natural language processing, and expert curation into a multi-stage system we call KONCEP™ (Knowledge Organizational Network with Conceptual Extraction Protocol).

Stage 1: Acquisition & Normalization

Books arrive in multiple formats: EPUB, PDF, MOBI, plain text, even scanned images. The first challenge is normalizing them into a consistent structure.

Format Conversion

We use a combination of Calibre (for ebook formats) and custom parsers (for PDFs and scanned documents). OCR handles physical books and image-based PDFs via Tesseract with language-specific models.

Output: clean markdown with preserved structure (headings, lists, emphasis) but stripped formatting (fonts, colors, layout).
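The routing step above can be sketched as a simple dispatcher that maps file extensions to converter commands. This is a minimal illustration, not our production code: `ebook-convert` is Calibre's real CLI and `tesseract` is the OCR engine's, but the `pdf_parser` entry stands in for our custom parsers and the command shapes are assumptions.

```python
from pathlib import Path

# Hypothetical dispatcher: route each incoming file to the right converter.
CONVERTERS = {
    ".epub": "calibre",
    ".mobi": "calibre",
    ".pdf": "pdf_parser",    # stand-in for our custom PDF parsers
    ".png": "tesseract",
    ".jpg": "tesseract",
    ".txt": "passthrough",
}

def build_command(path: str) -> list:
    """Return the command that normalizes `path` toward clean text."""
    src = Path(path)
    tool = CONVERTERS.get(src.suffix.lower())
    if tool == "calibre":
        # Calibre's CLI converts between ebook formats.
        return ["ebook-convert", str(src), str(src.with_suffix(".txt"))]
    if tool == "tesseract":
        # Tesseract writes <outputbase>.txt; -l selects a language model.
        return ["tesseract", str(src), str(src.with_suffix("")), "-l", "eng"]
    if tool in ("pdf_parser", "passthrough"):
        return [tool, str(src)]
    raise ValueError(f"no converter registered for {src.suffix!r}")
```

In the real pipeline each command's output then goes through a markdown cleaner that keeps headings, lists, and emphasis but drops fonts and layout.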

Metadata Enrichment

Every book gets tagged with:

  • Bibliographic data (title, author, publication year, ISBN)
  • Category/genre classification
  • Subject tags from Library of Congress, Dewey Decimal, or custom taxonomies
  • Reading level and domain expertise required

This metadata becomes crucial later for contextual ranking and relationship discovery.
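The tagged fields can be pictured as a per-book record like the following sketch. Field names and defaults are illustrative assumptions, not our production schema.

```python
from dataclasses import dataclass, field

# Illustrative per-book metadata record covering the fields listed above.
@dataclass
class BookMetadata:
    title: str
    author: str
    year: int
    isbn: str
    genre: str = "uncategorized"
    subject_tags: list = field(default_factory=list)  # LoC, Dewey, or custom
    reading_level: str = "general"                    # e.g. "general", "expert"

record = BookMetadata(
    title="Example Title", author="A. Author", year=2020,
    isbn="0000000000000", genre="self-help", subject_tags=["habits"],
)
```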

Stage 2: Structural Decomposition

Now we have clean text. Next step: identify structural boundaries.

Chapter Detection

Sounds simple, but chapter markers vary wildly across publishers. Some use consistent heading levels (`# Chapter 1`). Others use page breaks, centered text, or custom styling.

We trained a classifier on 50,000 labeled books to detect chapter boundaries with 96% accuracy, even in poorly formatted documents.
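A regex baseline gives a feel for the easy cases the classifier must generalize beyond. This sketch only catches explicit markers like `# Chapter 1` or `CHAPTER II`; the trained model handles page breaks, centered text, and custom styling that no regex covers.

```python
import re

# Baseline heuristic: match explicit chapter/part headings, with or without
# a markdown prefix, numbered in arabic or roman numerals.
CHAPTER_RE = re.compile(
    r"^\s*(#{1,2}\s+)?(chapter|part)\s+([0-9]+|[ivxlc]+)\b",
    re.IGNORECASE,
)

def find_chapter_lines(text: str) -> list:
    """Return indices of lines that look like chapter boundaries."""
    return [i for i, line in enumerate(text.splitlines())
            if CHAPTER_RE.match(line)]
```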

Section Hierarchy

Beyond chapters, we map the full hierarchy: parts → chapters → sections → subsections. This tree structure becomes the scaffolding for later atomization.
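Building that tree from a flat list of headings is a classic stack walk: each heading nests under the nearest shallower heading above it. A minimal sketch, assuming headings arrive as `(depth, title)` pairs:

```python
def build_hierarchy(headings):
    """headings: list of (depth, title); depth 1 = part, 2 = chapter, ..."""
    root = {"title": "book", "children": []}
    stack = [(0, root)]  # (depth, node) of the current ancestry
    for depth, title in headings:
        # Pop until the top of the stack is a valid parent for this depth.
        while len(stack) > 1 and stack[-1][0] >= depth:
            stack.pop()
        node = {"title": title, "children": []}
        stack[-1][1]["children"].append(node)
        stack.append((depth, node))
    return root
```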

Paragraph Segmentation

At the finest grain, we segment into paragraphs and identify their roles:

  • Conceptual — introduces or explains an idea
  • Illustrative — provides examples or stories
  • Transitional — connects sections
  • Summative — recaps or concludes

Conceptual paragraphs become candidates for atomic insights. Illustrative ones provide supporting context.
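A cue-phrase heuristic shows the shape of this role classifier. The real system uses a trained model; these keyword lists are illustrative assumptions, with "conceptual" as the default when no cue fires.

```python
import re

# Toy cue phrases for three of the four roles; anything unmatched is treated
# as conceptual, i.e. a candidate for atomic insights.
ROLE_CUES = {
    "summative":    r"\b(in summary|to recap|in conclusion|overall)\b",
    "transitional": r"\b(now that|turning to|so far|in the next chapter)\b",
    "illustrative": r"\b(for example|for instance|consider|imagine)\b",
}

def classify_paragraph(text: str) -> str:
    lowered = text.lower()
    for role, pattern in ROLE_CUES.items():
        if re.search(pattern, lowered):
            return role
    return "conceptual"
```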

Stage 3: Candidate Extraction

Not every paragraph contains a reusable insight. Many are narrative glue, anecdotes, or transitions.

This stage filters text down to candidate insights—passages likely to contain atomic, actionable knowledge.

NLP Feature Detection

We look for linguistic markers:

  • Definitional language: "X is...", "X refers to..."
  • Causal claims: "X causes Y", "because of X"
  • Prescriptive statements: "You should...", "Always...", "Never..."
  • Framework introductions: Lists, numbered steps, matrices
  • Comparative structures: "Unlike X, Y...", "The difference between..."

Passages with these markers score higher as candidate insights.
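The marker scoring above can be sketched as a weighted regex scan. The patterns mirror the list, but the weights are made-up values for illustration, not our tuned parameters.

```python
import re

# One (pattern, weight) pair per marker family from the list above.
MARKERS = [
    (r"\b\w+ (is|are|refers to|means) ", 2.0),       # definitional
    (r"\b(causes?|because of|leads to)\b", 2.0),     # causal
    (r"\b(you should|always|never)\b", 1.5),         # prescriptive
    (r"^\s*\d+\.\s", 1.0),                           # numbered steps
    (r"\b(unlike|the difference between)\b", 1.5),   # comparative
]

def marker_score(passage: str) -> float:
    """Sum the weights of marker families present in the passage."""
    text = passage.lower()
    return sum(weight for pattern, weight in MARKERS
               if re.search(pattern, text, flags=re.MULTILINE))
```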

Semantic Density Scoring

We calculate "idea density"—how much conceptual content per sentence. High-density paragraphs (lots of unique concepts, low redundancy) are prioritized.
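A crude proxy for idea density is unique content words per sentence. This sketch (with a toy stopword list) captures the intuition; the production scorer is more sophisticated about what counts as a concept.

```python
import re

STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "are", "that", "it"}

def idea_density(paragraph: str) -> float:
    """Unique non-stopword tokens per sentence: higher means denser ideas."""
    sentences = [s for s in re.split(r"[.!?]+", paragraph) if s.strip()]
    words = re.findall(r"[a-z']+", paragraph.lower())
    concepts = {w for w in words if w not in STOPWORDS}
    return len(concepts) / max(len(sentences), 1)
```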

Citation Context

Passages that cite research, reference data, or quote experts get boosted. These are more likely to contain validated claims rather than opinions.

Stage 4: Atomization

Candidate passages are still too coarse. A paragraph might contain three separate insights that should be independent nodes.

Sentence-Level Parsing

We parse each candidate into dependency trees—grammatical structures showing how words relate. This reveals where distinct claims begin and end.

Coreference Resolution

Pronouns ("it", "this", "they") get linked to their referents. This ensures atomic insights don't lose meaning when extracted from context.

Before: "Habit loops have three parts. They consist of cue, routine, and reward."

After: "Habit loops consist of three parts: cue, routine, and reward."

Compression & Expansion

Some insights are too verbose (unnecessary qualifiers, redundant phrasing). Others are too terse (missing critical context).

We use GPT-4 fine-tuned on human-curated examples to:

  • Compress wordy passages while preserving nuance
  • Expand cryptic statements with necessary context
  • Convert passive constructions to active voice and improve clarity

Stage 5: Relationship Mapping

Atomic insights don't exist in isolation. The real value emerges from connections.

Embedding Generation

Every insight gets vectorized using sentence transformers—768-dimensional embeddings that capture semantic meaning.

Insights with similar embeddings are conceptually related, even if they use different words.

Similarity Clustering

We compute cosine similarity across all embeddings. Insights above a threshold (typically 0.75) become candidates for linkage.
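The linkage step reduces to pairwise cosine similarity with a cutoff. This toy version uses hand-made 3-dimensional vectors in place of the 768-dimensional transformer embeddings; the 0.75 threshold matches the figure quoted above.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def candidate_links(embeddings, threshold=0.75):
    """Return index pairs whose similarity clears the linkage threshold."""
    pairs = []
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            if cosine(embeddings[i], embeddings[j]) >= threshold:
                pairs.append((i, j))
    return pairs
```

In practice an all-pairs loop over 1.2M insights is infeasible, so this comparison runs against an approximate-nearest-neighbor index rather than exhaustively.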

Relationship Classification

Not all similarities are equal. We classify relationships as:

  • Synonymous — same concept, different phrasing
  • Complementary — related but distinct concepts
  • Contradictory — conflicting claims
  • Causal — one explains the other
  • Hierarchical — parent-child, general-specific
  • Sequential — temporal or procedural order

Cross-Book Linking

This is where it gets interesting. We don't just link insights within books—we find connections across books.

When Atomic Habits and The Power of Habit both discuss habit formation, their insights get linked. Users can compare perspectives across authors.

Stage 6: Human Curation

Automation gets us 80% of the way. The final 20% requires human expertise.

Editorial Review

Domain experts review:

  • Atomicity: Is each insight properly sized? (not too broad, not too narrow)
  • Accuracy: Does the insight faithfully represent the source?
  • Clarity: Is it comprehensible without extensive context?
  • Completeness: Are critical insights missing from the extraction?

Relationship Validation

Algorithmic linking suggests relationships; humans validate them. False positives get pruned. Missed connections get added.

Metadata Enhancement

Curators add:

  • Tags for discoverability
  • Difficulty ratings
  • Application contexts
  • Prerequisite concepts

Performance at Scale

Current pipeline statistics:

  • Processing speed: 15-45 minutes per book (depending on length and format)
  • Extraction rate: Average 120 atomic insights per book
  • Curator productivity: One expert can review 8-12 books per day
  • Storage efficiency: Compressed graph database, ~2MB per book
  • Accuracy: 94% precision, 89% recall on human-labeled test set

Continuous Improvement

Every human edit feeds back into model training. When curators split an insight, merge two insights, or correct a relationship, that becomes a training example.

The system gets smarter over time, reducing the curation workload as it learns from expert feedback.

The Result

A knowledge graph with:

  • 1.2M+ atomic insights from 10,000+ books
  • 8.7M+ relationships connecting concepts within and across sources
  • Semantic search that finds insights by meaning, not keywords
  • Real-time updates via NodeSync™ as new books are processed

Explore the graph yourself in Universe—the result of this pipeline running at scale.

Knowledge extraction is a solved problem. Knowledge extraction at scale, with quality, is an engineering triumph.