Theme

Machine Learning for Knowledge Extraction

Extracting knowledge from text isn't new. Information retrieval has existed for decades. But understanding text—capturing meaning, nuance, and relationships—requires modern machine learning.

This is a technical deep dive into the ML stack powering KONCEP™: the models we use, why we chose them, and how they work together to atomize knowledge at scale.

The NLP Pipeline Architecture

Our system combines multiple models in a pipeline, each specialized for a specific task:

  • Sentence segmentation — spaCy's rule-based sentence boundary detection
  • Tokenization — BPE (Byte-Pair Encoding) via Hugging Face tokenizers
  • Named entity recognition — fine-tuned BERT for domain-specific entities
  • Coreference resolution — NeuralCoref (spaCy extension)
  • Semantic embeddings — Sentence-BERT (SBERT) for dense vector representations
  • Classification — custom transformers for insight candidacy scoring
  • Relationship extraction — graph neural networks (GNNs) for link prediction

Let's break down each component.
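
The staged design above can be sketched as a chain of functions, each taking the running document state and returning an updated copy. This is an illustrative skeleton only, not our production code: the stage names mirror the list above, and the bodies are stand-in stubs for the real models.

```python
# Illustrative skeleton of a staged NLP pipeline. Each stage is a function
# that takes the running document state (a dict) and returns an updated copy.
# The stage bodies are stand-in stubs, not the real models.

def segment(doc):
    # Real system: spaCy sentence boundary detection. Stub: split on periods.
    doc["sentences"] = [s.strip() for s in doc["text"].split(".") if s.strip()]
    return doc

def embed(doc):
    # Real system: SBERT dense vectors. Stub: placeholder per sentence.
    doc["embeddings"] = [None for _ in doc["sentences"]]
    return doc

def classify(doc):
    # Real system: DistilBERT insight-candidacy scores. Stub: zeros.
    doc["scores"] = [0.0 for _ in doc["sentences"]]
    return doc

PIPELINE = [segment, embed, classify]

def run(text):
    doc = {"text": text}
    for stage in PIPELINE:
        doc = stage(doc)
    return doc

result = run("Habits form through repetition. The sky is blue.")
print(result["sentences"])
```

The point of the shape is composability: each specialized model reads the outputs of the stages before it, so stages can be swapped or retrained independently.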

Sentence Embeddings: The Foundation

Traditional NLP used bag-of-words or TF-IDF: represent text by word frequencies. Problem: "The cat sat on the mat" and "The mat sat on the cat" have identical representations.

Word order matters. Meaning matters. That's where embeddings come in.

Word2Vec to BERT

Early embeddings (Word2Vec, GloVe) captured word-level semantics: "king - man + woman ≈ queen." Impressive, but limited—they couldn't handle polysemy (words with multiple meanings).

BERT (Bidirectional Encoder Representations from Transformers) solved this with context-aware embeddings. "Apple" in "I ate an apple" gets a different vector than "Apple released new hardware."

But BERT embeddings aren't ideal for semantic similarity. BERT is trained on masked-word prediction, not sentence similarity.

Sentence-BERT: Purpose-Built Similarity

We use Sentence-BERT (SBERT), fine-tuned specifically for semantic textual similarity.

SBERT maps sentences to a 768-dimensional dense vector space where cosine similarity correlates with semantic similarity.

Example:

  • "Habits form through repetition" → vector A
  • "Repeated behaviors become automatic" → vector B
  • "The sky is blue" → vector C

Cosine similarity:

  • cos(A, B) = 0.89 (high similarity, same concept)
  • cos(A, C) = 0.12 (low similarity, unrelated)

This powers our relationship discovery: find insights with similar embeddings, and you've found conceptual connections.
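
The similarity computation itself is simple. The sketch below uses toy 3-dimensional vectors in place of real 768-dimensional SBERT embeddings; the vector values are made up for illustration, but the cosine formula is exactly what runs at scale.

```python
import math

def cosine(a, b):
    # Cosine similarity: dot product divided by the product of magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-d vectors standing in for 768-d SBERT embeddings.
A = (0.9, 0.4, 0.1)  # "Habits form through repetition"
B = (0.8, 0.5, 0.2)  # "Repeated behaviors become automatic"
C = (0.1, 0.2, 0.9)  # "The sky is blue"

print(round(cosine(A, B), 2))  # high: same concept
print(round(cosine(A, C), 2))  # low: unrelated
```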

Insight Candidacy Classification

Not every sentence is an insight. Most text is narrative glue, examples, or transitions.

We trained a binary classifier: insight vs. non-insight.

Training Data

We hand-labeled 50,000 sentences from 500 books:

  • 25,000 atomic insights (definitions, frameworks, causal claims, principles)
  • 25,000 non-insights (anecdotes, transitions, filler)

Model Architecture

Fine-tuned DistilBERT (lighter, faster than full BERT) with a classification head on top.

Input: tokenized sentence. Output: probability of being an insight.

Features Learned

The model picks up on linguistic patterns:

  • Definitional language: "X is defined as...", "X refers to..."
  • Causal structure: "X causes Y", "Because of X, Y happens"
  • Prescriptive phrasing: "You should...", "Always...", "Avoid..."
  • Generalization markers: "In general...", "Typically...", "Usually..."
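
As a rough illustration of these cue phrases, a hand-written keyword matcher over the same patterns might look like the sketch below. This is a simplification for exposition only: the actual classifier learns such features from labeled data rather than matching fixed rules.

```python
import re

# Hand-written stand-ins for patterns the classifier learns from data.
PATTERNS = {
    "definitional": r"\bis defined as\b|\brefers to\b",
    "causal": r"\bcauses\b|\bbecause of\b",
    "prescriptive": r"\byou should\b|\balways\b|\bavoid\b",
    "generalization": r"\bin general\b|\btypically\b|\busually\b",
}

def cue_features(sentence):
    # Return which insight-like cue categories a sentence triggers.
    s = sentence.lower()
    return {name: bool(re.search(rx, s)) for name, rx in PATTERNS.items()}

print(cue_features("Compound interest refers to interest earned on interest."))
```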

Performance

  • Precision: 91% (when it predicts "insight," it's right 91% of the time)
  • Recall: 87% (it catches 87% of actual insights)
  • F1 Score: 0.89
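
The F1 score follows directly from precision and recall as their harmonic mean; a quick check reproduces the 0.89 figure:

```python
precision = 0.91
recall = 0.87

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 2))  # 0.89
```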

This filters millions of sentences down to manageable candidate sets for human review.

Coreference Resolution: Context Preservation

Atomic insights must be self-contained. But pronouns ("it," "this," "they") create dependencies on prior context.

Example from a book:

"Habit loops consist of cue, routine, and reward. They form the foundation of automatic behavior."

Extract the second sentence alone, and "they" is meaningless.

NeuralCoref Solution

We use NeuralCoref to resolve pronouns to their referents:

  • Input: "Habit loops consist of cue, routine, and reward. They form the foundation..."
  • Coreference: "They" → "Habit loops"
  • Resolved: "Habit loops form the foundation of automatic behavior."

Now the insight is standalone, no pronoun ambiguity.
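
In spirit, the resolution step rewrites pronoun mentions with their antecedents. The toy function below mimics the output of that step using a hand-supplied mention map; a real coreference model like NeuralCoref predicts these mappings rather than being given them.

```python
def resolve(text, mention_map):
    # Replace each pronoun mention with its antecedent span.
    # mention_map is hand-supplied here; a coreference model predicts it.
    for pronoun, antecedent in mention_map.items():
        text = text.replace(pronoun, antecedent)
    return text

sentence = "They form the foundation of automatic behavior."
resolved = resolve(sentence, {"They": "Habit loops"})
print(resolved)  # "Habit loops form the foundation of automatic behavior."
```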

Named Entity Recognition: Domain Extraction

Standard NER models recognize "person," "organization," "location." But we need domain-specific entities:

  • Concepts: cognitive bias, compound interest, habit loop
  • Frameworks: Eisenhower Matrix, SMART goals, Jobs-to-be-Done
  • Metrics: NPS, CAC, LTV

Fine-Tuning for Domains

We fine-tuned spaCy's entity recognizer on domain corpora (business books, psychology texts, technical documentation).

This lets us tag entities like:

  • CONCEPT: "confirmation bias"
  • FRAMEWORK: "OKRs"
  • PRINCIPLE: "Pareto principle"

These become structured metadata for search and filtering.

Relationship Extraction: Building the Graph

Atomic insights are nodes. Relationships are edges. How do we discover which insights should link?

Similarity-Based Linking

Compute pairwise cosine similarity across all SBERT embeddings. Insights above a threshold (0.75) become link candidates.
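
A minimal sketch of that candidate-generation step, using toy low-dimensional vectors in place of real SBERT embeddings (the names and values are invented for illustration):

```python
import math
from itertools import combinations

def cosine(a, b):
    # Cosine similarity: dot product over the product of magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy 3-d embeddings standing in for 768-d SBERT vectors.
embeddings = {
    "habits_form": (0.9, 0.4, 0.1),
    "behaviors_automatic": (0.8, 0.5, 0.2),
    "sky_blue": (0.1, 0.2, 0.9),
}

THRESHOLD = 0.75  # the cutoff described above

# Every pair above the threshold becomes a link candidate.
candidates = [
    (a, b)
    for a, b in combinations(embeddings, 2)
    if cosine(embeddings[a], embeddings[b]) >= THRESHOLD
]
print(candidates)  # only the two habit insights survive the cutoff
```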

But raw similarity isn't enough. We also need to classify the type of each relationship.

Graph Neural Networks for Link Classification

We trained a GNN to classify relationship types:

  • Synonymous — same concept, different phrasing
  • Hierarchical — parent-child (general → specific)
  • Causal — X explains or causes Y
  • Contradictory — conflicting claims
  • Complementary — related but distinct

Input: two insight embeddings + contextual features (source metadata, entity overlap, citation patterns).

Output: relationship type + confidence score.
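
A heavily simplified stand-in for that classifier shows the shape of the problem: concatenate the two embeddings plus contextual features, score each relation type, and normalize with a softmax. This is a plain linear scorer, not a true GNN, and every weight here is an arbitrary placeholder.

```python
import math

RELATION_TYPES = ["synonymous", "hierarchical", "causal",
                  "contradictory", "complementary"]

def softmax(scores):
    # Normalize raw scores into a probability distribution.
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify_link(emb_a, emb_b, extra_features, weights):
    # Concatenate both embeddings plus contextual features, score each
    # relation type with a linear layer, and pick the argmax.
    x = list(emb_a) + list(emb_b) + list(extra_features)
    scores = [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in weights]
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return RELATION_TYPES[best], probs[best]  # relation type + confidence

# Toy 3-d embeddings and 2 contextual features -> input dimension 8.
# Placeholder weights; a trained model would learn these.
weights = [[0.1 * ((i + j) % 3 - 1) for j in range(8)] for i in range(5)]
label, conf = classify_link((0.9, 0.4, 0.1), (0.8, 0.5, 0.2), (0.7, 0.3), weights)
print(label, round(conf, 2))
```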

Active Learning Loop

Human curators review predicted links, correcting errors. Every correction feeds back into training data, improving the model.

Over 18 months, we reduced false positive link predictions from 34% to 11%.

Semantic Clustering: Discovering Themes

Beyond pairwise links, we cluster insights into thematic groups.

HDBSCAN for Density-Based Clustering

We use HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) on SBERT embeddings.

Why HDBSCAN over k-means?

  • No preset cluster count: Discovers natural groupings without arbitrary k
  • Handles noise: Outlier insights aren't forced into clusters
  • Hierarchical structure: Reveals sub-themes within themes

Example cluster from business books:

  • Core theme: Product-Market Fit
  • Sub-themes: customer validation, pivot strategies, early adopter targeting, value proposition testing

Transformer Fine-Tuning: Domain Adaptation

General-purpose models (GPT, BERT) are trained on web text, Wikipedia, books—but not optimized for our specific task.

Continued Pre-Training

We take pre-trained models and continue training on our corpus: 10,000+ non-fiction books.

This adapts the model to:

  • Domain-specific vocabulary (business jargon, psychology terms, technical concepts)
  • Authorial patterns (how experts explain ideas)
  • Conceptual density (non-fiction is denser than web text)

Task-Specific Fine-Tuning

After domain adaptation, we fine-tune for specific tasks:

  • Insight extraction: Trained on labeled insight vs. non-insight sentences
  • Relationship prediction: Trained on human-validated link pairs
  • Summarization: Trained to compress verbose passages while preserving meaning

Computational Costs & Optimization

Embeddings and transformers are expensive. Some optimizations:

Batch Processing

We process books in batches of 50, amortizing model loading and GPU overhead.

Quantization

Convert 32-bit float models to 8-bit integers (INT8 quantization). 4x smaller, 3x faster inference, <1% accuracy loss.
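
The core idea of INT8 quantization can be sketched in a few lines: map floats onto 8-bit integers via a per-tensor scale, then dequantize at use time. Real frameworks handle this per layer with calibration; this toy version just shows the round-trip error staying small.

```python
def quantize_int8(values):
    # Map floats to signed 8-bit integers with a symmetric per-tensor scale.
    scale = max(abs(v) for v in values) / 127 or 1.0
    q = [round(v / scale) for v in values]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.52, -1.30, 0.07, 0.99, -0.41]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(q)                  # integers in [-128, 127]
print(round(max_err, 4))  # round-trip error bounded by scale / 2
```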

Distillation

Where possible, use distilled models (DistilBERT, DistilRoBERTa)—smaller, faster, 95%+ of full model performance.

Caching

Embeddings are computed once per insight and cached. Similarity calculations reuse cached vectors rather than recomputing.
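
A minimal version of that cache is just a keyed memo around the embedding call. The `embed` function below is a stand-in stub for the real SBERT encoder; the call counter exists only to demonstrate that the second lookup never re-runs the model.

```python
_cache = {}
calls = 0

def embed(sentence):
    # Stand-in stub for the real SBERT encoder; counts calls for the demo.
    global calls
    calls += 1
    return [float(len(w)) for w in sentence.split()]  # toy "embedding"

def embed_cached(sentence):
    # Compute each sentence's embedding once; reuse the cached vector after.
    if sentence not in _cache:
        _cache[sentence] = embed(sentence)
    return _cache[sentence]

embed_cached("habits form through repetition")
embed_cached("habits form through repetition")  # served from cache
print(calls)  # 1
```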

The Human-in-the-Loop

ML models are probabilistic. They make mistakes. That's why we keep humans in the loop:

  • Candidate filtering: ML suggests insights; humans validate
  • Relationship review: ML predicts links; curators approve/reject
  • Quality checks: Random sampling for accuracy audits

This hybrid approach scales human judgment rather than replacing it.

Open Questions & Future Work

We're actively researching:

  • Multimodal embeddings: Incorporate diagrams, charts, tables from books
  • Fact verification: Auto-detect and flag unsupported claims
  • Argument extraction: Identify premises, conclusions, and logical structure
  • Cross-lingual transfer: Process books in multiple languages, link across translations

The Stack

For those curious about implementation details:

  • Frameworks: PyTorch, Hugging Face Transformers, spaCy
  • Models: Sentence-BERT, DistilBERT, RoBERTa, GPT-4 (for compression/expansion)
  • Infrastructure: NVIDIA A100 GPUs, distributed training across 8 nodes
  • Storage: Neo4j graph database for relationships, PostgreSQL for metadata

Experience the Output

All this ML powers what you see in NodeCore and Universe: atomic insights, semantic search, relationship graphs.

The tech is invisible. The experience is magic.

Machine learning doesn't replace human expertise. It amplifies it.