Extracting knowledge from text isn't new. Information retrieval has existed for decades. But understanding text—capturing meaning, nuance, and relationships—requires modern machine learning.
This is a technical deep dive into the ML stack powering KONCEP™: the models we use, why we chose them, and how they work together to atomize knowledge at scale.
The NLP Pipeline Architecture
Our system combines multiple models in a pipeline, each specialized for a specific task:
- Sentence segmentation — spaCy's rule-based sentence boundary detection
- Tokenization — BPE (Byte-Pair Encoding) via Hugging Face tokenizers
- Named entity recognition — fine-tuned BERT for domain-specific entities
- Coreference resolution — NeuralCoref (spaCy extension)
- Semantic embeddings — Sentence-BERT (SBERT) for dense vector representations
- Classification — custom transformers for insight candidacy scoring
- Relationship extraction — graph neural networks (GNNs) for link prediction
Let's break down each component.
Sentence Embeddings: The Foundation
Traditional NLP used bag-of-words or TF-IDF: represent text by word frequencies. Problem: "The cat sat on the mat" and "The mat sat on the cat" have identical representations.
Word order matters. Meaning matters. That's where embeddings come in.
Word2Vec to BERT
Early embeddings (Word2Vec, GloVe) captured word-level semantics: "king - man + woman ≈ queen." Impressive, but limited—they couldn't handle polysemy (words with multiple meanings).
BERT (Bidirectional Encoder Representations from Transformers) solved this with context-aware embeddings. "Apple" in "I ate an apple" gets a different vector than "Apple released new hardware."
But BERT embeddings aren't ideal for semantic similarity out of the box. BERT is trained on masked-token prediction (and next-sentence prediction), not on sentence-level similarity.
Sentence-BERT: Purpose-Built Similarity
We use Sentence-BERT (SBERT), fine-tuned specifically for semantic textual similarity.
SBERT maps sentences to a 768-dimensional dense vector space where cosine similarity correlates with semantic similarity.
Example:
- "Habits form through repetition" → vector A
- "Repeated behaviors become automatic" → vector B
- "The sky is blue" → vector C
Cosine similarity:
- cos(A, B) = 0.89 (high similarity, same concept)
- cos(A, C) = 0.12 (low similarity, unrelated)
This powers our relationship discovery: find insights with similar embeddings, and you've found conceptual connections.
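The similarity calculation itself is simple. Here's a minimal sketch, using small hypothetical vectors in place of real 768-dimensional SBERT embeddings:

```python
# Cosine similarity between embedding vectors. The vectors below are
# hypothetical 4-d stand-ins for the three sentences above; real SBERT
# embeddings are 768-dimensional.
import math

def cosine_similarity(a, b):
    """cos(a, b) = (a . b) / (|a| * |b|)"""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

vec_a = [0.8, 0.5, 0.1, 0.0]   # "Habits form through repetition"
vec_b = [0.7, 0.6, 0.2, 0.1]   # "Repeated behaviors become automatic"
vec_c = [0.0, 0.1, 0.9, 0.4]   # "The sky is blue"

print(cosine_similarity(vec_a, vec_b))  # high: same concept
print(cosine_similarity(vec_a, vec_c))  # low: unrelated
```

Because cosine similarity normalizes by vector length, it measures direction in the embedding space, which is what correlates with meaning.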
Insight Candidacy Classification
Not every sentence is an insight. Most text is narrative glue, examples, or transitions.
We trained a binary classifier: insight vs. non-insight.
Training Data
We hand-labeled 50,000 sentences from 500 books:
- 25,000 atomic insights (definitions, frameworks, causal claims, principles)
- 25,000 non-insights (anecdotes, transitions, filler)
Model Architecture
Fine-tuned DistilBERT (lighter, faster than full BERT) with a classification head on top.
Input: the tokenized sentence (the pooled representation feeds the classification head). Output: probability that the sentence is an insight.
Features Learned
The model picks up on linguistic patterns:
- Definitional language: "X is defined as...", "X refers to..."
- Causal structure: "X causes Y", "Because of X, Y happens"
- Prescriptive phrasing: "You should...", "Always...", "Avoid..."
- Generalization markers: "In general...", "Typically...", "Usually..."
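To make these cues concrete, here's a toy regex scorer over the same surface patterns. This is an illustration only: the real classifier is the fine-tuned DistilBERT described above, which learns far subtler signals than keyword matching.

```python
# Heuristic illustration of the linguistic cues listed above. The
# production classifier is a fine-tuned DistilBERT; this regex sketch
# only shows the kinds of surface patterns it tends to pick up on.
import re

INSIGHT_PATTERNS = [
    r"\bis defined as\b",
    r"\brefers to\b",
    r"\bcauses\b",
    r"\bbecause of\b",
    r"\byou should\b",
    r"\balways\b",
    r"\bavoid\b",
    r"\bin general\b",
    r"\btypically\b",
    r"\busually\b",
]

def pattern_score(sentence):
    """Count how many insight-like cues a sentence contains."""
    text = sentence.lower()
    return sum(1 for p in INSIGHT_PATTERNS if re.search(p, text))

print(pattern_score("A habit loop is defined as cue, routine, and reward."))
print(pattern_score("Then we drove to the conference."))
```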
Performance
- Precision: 91% (when it predicts "insight," it's right 91% of the time)
- Recall: 87% (it catches 87% of actual insights)
- F1 Score: 0.89
This filters millions of sentences down to manageable candidate sets for human review.
Coreference Resolution: Context Preservation
Atomic insights must be self-contained. But pronouns ("it," "this," "they") create dependencies on prior context.
Example from a book:
"Habit loops consist of cue, routine, and reward. They form the foundation of automatic behavior."
Extract the second sentence alone, and "they" is meaningless.
NeuralCoref Solution
We use NeuralCoref to resolve pronouns to their referents:
- Input: "Habit loops consist of cue, routine, and reward. They form the foundation..."
- Coreference: "They" → "Habit loops"
- Resolved: "Habit loops form the foundation of automatic behavior."
Now the insight is standalone, no pronoun ambiguity.
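The rewrite step itself looks like this. NeuralCoref finds the antecedent automatically; in this minimal stand-in we supply the mapping by hand just to show the transformation applied to the example above:

```python
# Minimal stand-in for the coreference rewrite step. NeuralCoref
# resolves "They" -> "Habit loops" automatically; here the antecedent
# mapping is supplied by hand to show the substitution.
import re

def resolve_pronouns(sentence, antecedents):
    """Replace each pronoun with its referent (whole-word matches only)."""
    for pronoun, referent in antecedents.items():
        sentence = re.sub(rf"\b{pronoun}\b", referent, sentence)
    return sentence

raw = "They form the foundation of automatic behavior."
resolved = resolve_pronouns(raw, {"They": "Habit loops"})
print(resolved)  # Habit loops form the foundation of automatic behavior.
```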
Named Entity Recognition: Domain Extraction
Standard NER models recognize "person," "organization," "location." But we need domain-specific entities:
- Concepts: cognitive bias, compound interest, habit loop
- Frameworks: Eisenhower Matrix, SMART goals, Jobs-to-be-Done
- Metrics: NPS, CAC, LTV
Fine-Tuning for Domains
We fine-tuned spaCy's entity recognizer on domain corpora (business books, psychology texts, technical documentation).
This lets us tag entities like:
- CONCEPT: "confirmation bias"
- FRAMEWORK: "OKRs"
- PRINCIPLE: "Pareto principle"
These become structured metadata for search and filtering.
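A toy gazetteer lookup is enough to show the output format. The production system uses a fine-tuned spaCy entity recognizer, not a dictionary, so treat this strictly as a sketch of the label scheme:

```python
# Toy gazetteer tagger illustrating domain-specific entity labels.
# The real system uses a fine-tuned spaCy NER model; a dictionary
# lookup suffices to show the (surface form, label) output.
DOMAIN_ENTITIES = {
    "confirmation bias": "CONCEPT",
    "okrs": "FRAMEWORK",
    "pareto principle": "PRINCIPLE",
}

def tag_entities(text):
    """Return (surface form, label) pairs found in the text."""
    lowered = text.lower()
    return [(phrase, label)
            for phrase, label in DOMAIN_ENTITIES.items()
            if phrase in lowered]

print(tag_entities("Confirmation bias interacts with the Pareto principle."))
```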
Relationship Extraction: Building the Graph
Atomic insights are nodes. Relationships are edges. How do we discover which insights should link?
Similarity-Based Linking
Compute pairwise cosine similarity across all SBERT embeddings. Insights above a threshold (0.75) become link candidates.
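In sketch form, with small hypothetical embeddings standing in for real SBERT vectors:

```python
# Pairwise similarity linking over a small set of hypothetical
# embeddings. Pairs above the 0.75 threshold become link candidates.
import math
from itertools import combinations

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

embeddings = {
    "insight_1": [0.9, 0.4, 0.1],
    "insight_2": [0.8, 0.5, 0.2],   # close to insight_1
    "insight_3": [0.1, 0.2, 0.95],  # unrelated
}

THRESHOLD = 0.75
candidates = [(i, j) for i, j in combinations(embeddings, 2)
              if cosine(embeddings[i], embeddings[j]) > THRESHOLD]
print(candidates)  # only the insight_1 / insight_2 pair survives
```

At corpus scale this brute-force loop is replaced by approximate nearest-neighbor search, but the thresholding logic is the same.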
But raw similarity isn't enough. We need to classify types of relationships.
Graph Neural Networks for Link Classification
We trained a GNN to classify relationship types:
- Synonymous — same concept, different phrasing
- Hierarchical — parent-child (general → specific)
- Causal — X explains or causes Y
- Contradictory — conflicting claims
- Complementary — related but distinct
Input: two insight embeddings + contextual features (source metadata, entity overlap, citation patterns).
Output: relationship type + confidence score.
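A sketch of the input side, i.e. how a pair of insights becomes one feature vector. The field names and the Jaccard-overlap choice here are assumptions for illustration, not the production feature set:

```python
# Sketch of the features a link classifier could see for one candidate
# pair: the two insight embeddings concatenated with contextual
# features. Feature choices here are illustrative assumptions.
def entity_overlap(entities_a, entities_b):
    """Jaccard overlap between the entity sets of two insights."""
    a, b = set(entities_a), set(entities_b)
    return len(a & b) / len(a | b) if a | b else 0.0

def pair_features(emb_a, emb_b, entities_a, entities_b, same_source):
    """Concatenate embeddings with contextual features."""
    return emb_a + emb_b + [
        entity_overlap(entities_a, entities_b),
        1.0 if same_source else 0.0,
    ]

features = pair_features(
    [0.1, 0.2], [0.3, 0.4],
    ["habit loop", "cue"], ["habit loop", "reward"],
    same_source=False,
)
print(len(features))  # 2 + 2 + 2 = 6 with these toy 2-d embeddings
```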
Active Learning Loop
Human curators review predicted links, correcting errors. Every correction feeds back into training data, improving the model.
Over 18 months, we reduced false positive link predictions from 34% to 11%.
Semantic Clustering: Discovering Themes
Beyond pairwise links, we cluster insights into thematic groups.
HDBSCAN for Density-Based Clustering
We use HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) on SBERT embeddings.
Why HDBSCAN over k-means?
- No preset cluster count: Discovers natural groupings without arbitrary k
- Handles noise: Outlier insights aren't forced into clusters
- Hierarchical structure: Reveals sub-themes within themes
Example cluster from business books:
- Core theme: Product-Market Fit
- Sub-themes: customer validation, pivot strategies, early adopter targeting, value proposition testing
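In code, the real call is essentially `hdbscan.HDBSCAN(min_cluster_size=...).fit_predict(embeddings)`. The simplified stand-in below mimics the two behaviors called out above on 1-d toy data: clusters emerge without a preset count, and isolated points are labeled noise (-1), as in HDBSCAN's output convention. It is not the HDBSCAN algorithm itself.

```python
# Simplified stand-in for density-based clustering, NOT the HDBSCAN
# algorithm: points within eps of a seed form a cluster, isolated
# points stay labeled -1 (noise), and no cluster count is preset.
def cluster_with_noise(points, eps=0.5):
    labels = [-1] * len(points)  # -1 = noise, as in HDBSCAN's output
    next_label = 0
    for i, p in enumerate(points):
        if labels[i] != -1:
            continue
        # Crude density neighborhood: everything within eps of p.
        group = [j for j, q in enumerate(points) if abs(p - q) <= eps]
        if len(group) > 1:  # dense enough to form a cluster
            for j in group:
                if labels[j] == -1:
                    labels[j] = next_label
            next_label += 1
    return labels

# 1-d toy "embeddings": two dense groups and one outlier.
points = [0.1, 0.2, 0.15, 5.0, 5.1, 9.9]
print(cluster_with_noise(points))  # [0, 0, 0, 1, 1, -1]
```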
Transformer Fine-Tuning: Domain Adaptation
General-purpose models (GPT, BERT) are trained on web text, Wikipedia, books—but not optimized for our specific task.
Continued Pre-Training
We take pre-trained models and continue training on our corpus: 10,000+ non-fiction books.
This adapts the model to:
- Domain-specific vocabulary (business jargon, psychology terms, technical concepts)
- Authorial patterns (how experts explain ideas)
- Conceptual density (non-fiction is denser than web text)
Task-Specific Fine-Tuning
After domain adaptation, we fine-tune for specific tasks:
- Insight extraction: Trained on labeled insight vs. non-insight sentences
- Relationship prediction: Trained on human-validated link pairs
- Summarization: Trained to compress verbose passages while preserving meaning
Computational Costs & Optimization
Embeddings and transformers are expensive. Some optimizations:
Batch Processing
We process books in batches of 50, amortizing model loading and GPU overhead.
Quantization
Convert 32-bit float models to 8-bit integers (INT8 quantization). 4x smaller, 3x faster inference, <1% accuracy loss.
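In practice this is a framework call (for PyTorch, `torch.quantization.quantize_dynamic` over the model's linear layers). The arithmetic underneath is just a scale factor mapping floats to 8-bit integers, sketched here on a toy weight vector:

```python
# The arithmetic behind INT8 quantization: map each float weight to an
# 8-bit integer via a scale factor, then dequantize. In practice the
# framework handles this (e.g. torch.quantization.quantize_dynamic).
def quantize(weights):
    """Symmetric INT8: w_q = round(w / scale), scale = max|w| / 127."""
    scale = max(abs(w) for w in weights) / 127
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

weights = [0.12, -0.53, 0.98, -0.07]
q, scale = quantize(weights)
restored = dequantize(q, scale)
max_err = max(abs(w - r) for w, r in zip(weights, restored))
print(q)        # integers in [-128, 127]
print(max_err)  # bounded by scale / 2: most precision survives
```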
Distillation
Where possible, use distilled models (DistilBERT, DistilRoBERTa)—smaller, faster, 95%+ of full model performance.
Caching
Embeddings are computed once per insight and cached. Similarity calculations reuse cached vectors rather than recomputing.
The Human-in-the-Loop
ML models are probabilistic. They make mistakes. That's why we keep humans in the loop:
- Candidate filtering: ML suggests insights; humans validate
- Relationship review: ML predicts links; curators approve/reject
- Quality checks: Random sampling for accuracy audits
This hybrid approach scales human judgment rather than replacing it.
Open Questions & Future Work
We're actively researching:
- Multimodal embeddings: Incorporate diagrams, charts, tables from books
- Fact verification: Auto-detect and flag unsupported claims
- Argument extraction: Identify premises, conclusions, and logical structure
- Cross-lingual transfer: Process books in multiple languages, link across translations
The Stack
For those curious about implementation details:
- Frameworks: PyTorch, Hugging Face Transformers, spaCy
- Models: Sentence-BERT, DistilBERT, RoBERTa, GPT-4 (for compression/expansion)
- Infrastructure: NVIDIA A100 GPUs, distributed training across 8 nodes
- Storage: Neo4j graph database for relationships, PostgreSQL for metadata
Experience the Output
All this ML powers what you see in NodeCore and Universe: atomic insights, semantic search, relationship graphs.
The tech is invisible. The experience is magic.
Machine learning doesn't replace human expertise. It amplifies it.