How We Process 10,000+ Books

Processing a single book—extracting its core insights, decomposing them into atomic nodes, mapping their relationships—is intellectually demanding work.

Processing ten thousand books is an engineering challenge.

This is the story of how we built a pipeline that scales human-quality knowledge curation to library-scale volumes.

The Core Challenge

Most people assume knowledge extraction is either:

  • Fully manual — subject matter experts read and annotate every book (high quality, doesn't scale)
  • Fully automated — algorithms extract keywords and summaries (scales beautifully, low quality)

We needed something in between: automation that amplifies human judgment, not replaces it.

Our pipeline combines machine learning, natural language processing, and expert curation into a multi-stage system we call KONCEP™ (Knowledge Organizational Network with Conceptual Extraction Protocol).

Stage 1: Acquisition & Normalization

Books arrive in multiple formats: EPUB, PDF, MOBI, plain text, even scanned images. The first challenge is normalizing them into a consistent structure.

Format Conversion

We use a combination of Calibre (for ebook formats) and custom parsers (for PDFs and scanned documents). OCR handles physical books and image-based PDFs via Tesseract with language-specific models.

Output: clean markdown with preserved structure (headings, lists, emphasis) but stripped formatting (fonts, colors, layout).
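The routing step above can be sketched as a simple dispatcher that maps file extensions to converter commands. This is a minimal illustration, not our production code: `ebook-convert` is Calibre's real CLI and `tesseract` is the OCR engine's, but the `pdf_parser` entry stands in for our custom parsers and the command shapes are assumptions.

```python
from pathlib import Path

# Hypothetical dispatcher: route each incoming file to the right converter.
CONVERTERS = {
    ".epub": "calibre",
    ".mobi": "calibre",
    ".pdf": "pdf_parser",    # stand-in for our custom PDF parsers
    ".png": "tesseract",
    ".jpg": "tesseract",
    ".txt": "passthrough",
}

def build_command(path: str) -> list:
    """Return the command that normalizes `path` toward clean text."""
    src = Path(path)
    tool = CONVERTERS.get(src.suffix.lower())
    if tool == "calibre":
        # Calibre's CLI converts between ebook formats.
        return ["ebook-convert", str(src), str(src.with_suffix(".txt"))]
    if tool == "tesseract":
        # Tesseract writes <outputbase>.txt; -l selects a language model.
        return ["tesseract", str(src), str(src.with_suffix("")), "-l", "eng"]
    if tool in ("pdf_parser", "passthrough"):
        return [tool, str(src)]
    raise ValueError(f"no converter registered for {src.suffix!r}")
```

In the real pipeline each command's output then goes through a markdown cleaner that keeps headings, lists, and emphasis but drops fonts and layout.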

Metadata Enrichment

Every book gets tagged with:

  • Bibliographic data (title, author, publication year, ISBN)
  • Category/genre classification
  • Subject tags from Library of Congress, Dewey Decimal, or custom taxonomies
  • Reading level and domain expertise required

This metadata becomes crucial later for contextual ranking and relationship discovery.
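The tagged fields can be pictured as a per-book record like the following sketch. Field names and defaults are illustrative assumptions, not our production schema.

```python
from dataclasses import dataclass, field

# Illustrative per-book metadata record covering the fields listed above.
@dataclass
class BookMetadata:
    title: str
    author: str
    year: int
    isbn: str
    genre: str = "uncategorized"
    subject_tags: list = field(default_factory=list)  # LoC, Dewey, or custom
    reading_level: str = "general"                    # e.g. "general", "expert"

record = BookMetadata(
    title="Example Title", author="A. Author", year=2020,
    isbn="0000000000000", genre="self-help", subject_tags=["habits"],
)
```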

Stage 2: Structural Decomposition

Now we have clean text. Next step: identify structural boundaries.

Chapter Detection

Sounds simple, but chapter markers vary wildly across publishers. Some use consistent heading levels (`# Chapter 1`). Others use page breaks, centered text, or custom styling.

We trained a classifier on 50,000 labeled books to detect chapter boundaries with 96% accuracy, even in poorly formatted documents.
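A regex baseline gives a feel for the easy cases the classifier must generalize beyond. This sketch only catches explicit markers like `# Chapter 1` or `CHAPTER II`; the trained model handles page breaks, centered text, and custom styling that no regex covers.

```python
import re

# Baseline heuristic: match explicit chapter/part headings, with or without
# a markdown prefix, numbered in arabic or roman numerals.
CHAPTER_RE = re.compile(
    r"^\s*(#{1,2}\s+)?(chapter|part)\s+([0-9]+|[ivxlc]+)\b",
    re.IGNORECASE,
)

def find_chapter_lines(text: str) -> list:
    """Return indices of lines that look like chapter boundaries."""
    return [i for i, line in enumerate(text.splitlines())
            if CHAPTER_RE.match(line)]
```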

Section Hierarchy

Beyond chapters, we map the full hierarchy: parts → chapters → sections → subsections. This tree structure becomes the scaffolding for later atomization.
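Building that tree from a flat list of headings is a classic stack walk: each heading nests under the nearest shallower heading above it. A minimal sketch, assuming headings arrive as `(depth, title)` pairs:

```python
def build_hierarchy(headings):
    """headings: list of (depth, title); depth 1 = part, 2 = chapter, ..."""
    root = {"title": "book", "children": []}
    stack = [(0, root)]  # (depth, node) of the current ancestry
    for depth, title in headings:
        # Pop until the top of the stack is a valid parent for this depth.
        while len(stack) > 1 and stack[-1][0] >= depth:
            stack.pop()
        node = {"title": title, "children": []}
        stack[-1][1]["children"].append(node)
        stack.append((depth, node))
    return root
```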

Paragraph Segmentation

At the finest grain, we segment into paragraphs and identify their roles:

  • Conceptual — introduces or explains an idea
  • Illustrative — provides examples or stories
  • Transitional — connects sections
  • Summative — recaps or concludes

Conceptual paragraphs become candidates for atomic insights. Illustrative ones provide supporting context.
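A cue-phrase heuristic shows the shape of this role classifier. The real system uses a trained model; these keyword lists are illustrative assumptions, with "conceptual" as the default when no cue fires.

```python
import re

# Toy cue phrases for three of the four roles; anything unmatched is treated
# as conceptual, i.e. a candidate for atomic insights.
ROLE_CUES = {
    "summative":    r"\b(in summary|to recap|in conclusion|overall)\b",
    "transitional": r"\b(now that|turning to|so far|in the next chapter)\b",
    "illustrative": r"\b(for example|for instance|consider|imagine)\b",
}

def classify_paragraph(text: str) -> str:
    lowered = text.lower()
    for role, pattern in ROLE_CUES.items():
        if re.search(pattern, lowered):
            return role
    return "conceptual"
```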

Stage 3: Candidate Extraction

Not every paragraph contains a reusable insight. Many are narrative glue, anecdotes, or transitions.

This stage filters text down to candidate insights—passages likely to contain atomic, actionable knowledge.

NLP Feature Detection

We look for linguistic markers:

  • Definitional language: "X is...", "X refers to..."
  • Causal claims: "X causes Y", "because of X"
  • Prescriptive statements: "You should...", "Always...", "Never..."
  • Framework introductions: Lists, numbered steps, matrices
  • Comparative structures: "Unlike X, Y...", "The difference between..."

Passages with these markers score higher as candidate insights.
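The marker scoring above can be sketched as a weighted regex scan. The patterns mirror the list, but the weights are made-up values for illustration, not our tuned parameters.

```python
import re

# One (pattern, weight) pair per marker family from the list above.
MARKERS = [
    (r"\b\w+ (is|are|refers to|means) ", 2.0),       # definitional
    (r"\b(causes?|because of|leads to)\b", 2.0),     # causal
    (r"\b(you should|always|never)\b", 1.5),         # prescriptive
    (r"^\s*\d+\.\s", 1.0),                           # numbered steps
    (r"\b(unlike|the difference between)\b", 1.5),   # comparative
]

def marker_score(passage: str) -> float:
    """Sum the weights of marker families present in the passage."""
    text = passage.lower()
    return sum(weight for pattern, weight in MARKERS
               if re.search(pattern, text, flags=re.MULTILINE))
```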

Semantic Density Scoring

We calculate "idea density"—how much conceptual content per sentence. High-density paragraphs (lots of unique concepts, low redundancy) are prioritized.
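A crude proxy for idea density is unique content words per sentence. This sketch (with a toy stopword list) captures the intuition; the production scorer is more sophisticated about what counts as a concept.

```python
import re

STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "are", "that", "it"}

def idea_density(paragraph: str) -> float:
    """Unique non-stopword tokens per sentence: higher means denser ideas."""
    sentences = [s for s in re.split(r"[.!?]+", paragraph) if s.strip()]
    words = re.findall(r"[a-z']+", paragraph.lower())
    concepts = {w for w in words if w not in STOPWORDS}
    return len(concepts) / max(len(sentences), 1)
```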

Citation Context

Passages that cite research, reference data, or quote experts get boosted. These are more likely to contain validated claims rather than opinions.

Stage 4: Atomization

Candidate passages are still too coarse. A paragraph might contain three separate insights that should be independent nodes.

Sentence-Level Parsing

We parse each candidate into dependency trees—grammatical structures showing how words relate. This reveals where distinct claims begin and end.

Coreference Resolution

Pronouns ("it", "this", "they") get linked to their referents. This ensures atomic insights don't lose meaning when extracted from context.

Before: "Habit loops have three parts. They consist of cue, routine, and reward."

After: "Habit loops consist of three parts: cue, routine, and reward."

Compression & Expansion

Some insights are too verbose (unnecessary qualifiers, redundant phrasing). Others are too terse (missing critical context).

We use GPT-4 fine-tuned on human-curated examples to:

  • Compress wordy passages while preserving nuance
  • Expand cryptic statements with necessary context
  • Convert passive constructions to active voice and improve clarity

Stage 5: Relationship Mapping

Atomic insights don't exist in isolation. The real value emerges from connections.

Embedding Generation

Every insight gets vectorized using sentence transformers—768-dimensional embeddings that capture semantic meaning.

Insights with similar embeddings are conceptually related, even if they use different words.

Similarity Clustering

We compute cosine similarity across all embeddings. Insights above a threshold (typically 0.75) become candidates for linkage.
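The linkage step reduces to pairwise cosine similarity with a cutoff. This toy version uses hand-made 3-dimensional vectors in place of the 768-dimensional transformer embeddings; the 0.75 threshold matches the figure quoted above.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def candidate_links(embeddings, threshold=0.75):
    """Return index pairs whose similarity clears the linkage threshold."""
    pairs = []
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            if cosine(embeddings[i], embeddings[j]) >= threshold:
                pairs.append((i, j))
    return pairs
```

In practice an all-pairs loop over 1.2M insights is infeasible, so this comparison runs against an approximate-nearest-neighbor index rather than exhaustively.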

Relationship Classification

Not all similarities are equal. We classify relationships as:

  • Synonymous — same concept, different phrasing
  • Complementary — related but distinct concepts
  • Contradictory — conflicting claims
  • Causal — one explains the other
  • Hierarchical — parent-child, general-specific
  • Sequential — temporal or procedural order

Cross-Book Linking

This is where it gets interesting. We don't just link insights within books—we find connections across books.

When Atomic Habits and The Power of Habit both discuss habit formation, their insights get linked. Users can compare perspectives across authors.

Stage 6: Human Curation

Automation gets us 80% of the way. The final 20% requires human expertise.

Editorial Review

Domain experts review:

  • Atomicity: Is each insight properly sized? (not too broad, not too narrow)
  • Accuracy: Does the insight faithfully represent the source?
  • Clarity: Is it comprehensible without extensive context?
  • Completeness: Are critical insights missing from the extraction?

Relationship Validation

Algorithmic linking suggests relationships; humans validate them. False positives get pruned. Missed connections get added.

Metadata Enhancement

Curators add:

  • Tags for discoverability
  • Difficulty ratings
  • Application contexts
  • Prerequisite concepts

Performance at Scale

Current pipeline statistics:

  • Processing speed: 15-45 minutes per book (depending on length and format)
  • Extraction rate: Average 120 atomic insights per book
  • Curator productivity: One expert can review 8-12 books per day
  • Storage efficiency: Compressed graph database, ~2MB per book
  • Accuracy: 94% precision, 89% recall on human-labeled test set

Continuous Improvement

Every human edit feeds back into model training. When curators split an insight, merge two insights, or correct a relationship, that becomes a training example.

The system gets smarter over time, reducing the curation workload as it learns from expert feedback.

The Result

A knowledge graph with:

  • 1.2M+ atomic insights from 10,000+ books
  • 8.7M+ relationships connecting concepts within and across sources
  • Semantic search that finds insights by meaning, not keywords
  • Real-time updates via NodeSync™ as new books are processed

Explore the graph yourself in Universe—the result of this pipeline running at scale.

Knowledge extraction is a solved problem. Knowledge extraction at scale, with quality, is an engineering triumph.