Which sentence is doing the most work in your favourite novel? I tried to find out.
Summary
An experiment applied machine learning to identify "load-bearing" sentences in novels, defined as those whose removal most alters a book's overall semantic "fingerprint." The method converts each sentence into a numerical embedding, averages the embeddings to represent the book, and then measures the semantic shift when each sentence is removed in turn. The technique was tested on five public-domain books. For Crime and Punishment, the top sentence was "What's the point of it?" from Raskolnikov's mother's letter, a moral trigger. For Pride and Prejudice, it was "But to live in ignorance on such a point was impossible…", which follows Lydia's elopement. For The Great Gatsby, it was a sentence from the novel's distinctive guest list. For Wuthering Heights, it was a descriptive sentence about Lockwood's room, a narrative trigger. For Frankenstein, it was a date stamp, reflecting the book's epistolary frame. The author notes that the method identifies semantically distinctive sentences, which may or may not align with literary importance.
Key takeaway
For data scientists or computational linguists analyzing large text corpora, this method offers a new way to flag sentences that may be structurally or narratively significant, complementing traditional literary analysis. You can adapt this embedding-based technique to pinpoint semantically distinctive elements in your own datasets, potentially revealing hidden structural patterns or critical information triggers. Consider applying it to legal documents, scientific papers, or historical texts to surface passages that may drive meaning or structure.
Key insights
Machine learning can identify semantically distinctive sentences in texts, revealing structural or narrative pivots.
Principles
- Semantic distinctiveness can indicate structural importance.
- Statistical methods offer novel literary analysis perspectives.
- Data preprocessing is crucial for meaningful ML results.
Method
Each sentence is converted into a numerical embedding, and the embeddings are averaged to form the book's fingerprint. Each sentence is then removed in turn and the fingerprint is recomputed; the sentence whose removal causes the largest change is the most "load-bearing."
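The leave-one-out procedure described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the original author's code: the summary does not name the embedding model, so the functions take precomputed sentence vectors, and cosine distance is an assumed measure of the fingerprint shift. The names `load_bearing_scores` and `rank_sentences` are hypothetical.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity; 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return 1.0 - dot / (norm_a * norm_b)


def load_bearing_scores(embeddings):
    """Score each sentence by how far the mean embedding moves without it.

    embeddings: list of equal-length lists of floats, one per sentence.
    The shift metric (cosine distance) is an assumption for illustration.
    """
    n = len(embeddings)
    if n < 2:
        raise ValueError("need at least two sentences")
    # Column-wise sum, so each leave-one-out mean costs O(dim), not O(n * dim).
    total = [sum(column) for column in zip(*embeddings)]
    fingerprint = [t / n for t in total]
    scores = []
    for vec in embeddings:
        # Mean of all sentence vectors except the current one.
        without = [(t - v) / (n - 1) for t, v in zip(total, vec)]
        scores.append(cosine_distance(fingerprint, without))
    return scores


def rank_sentences(sentences, embeddings):
    """Return (sentence, score) pairs, largest fingerprint shift first."""
    scores = load_bearing_scores(embeddings)
    return sorted(zip(sentences, scores), key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Toy 2-D vectors standing in for real sentence embeddings (made-up values).
    toy_sentences = ["s1", "s2", "s3", "outlier"]
    toy_embeddings = [[1.0, 0.0], [1.0, 0.1], [0.9, 0.0], [0.0, 1.0]]
    for sentence, score in rank_sentences(toy_sentences, toy_embeddings):
        print(f"{sentence}: {score:.4f}")
```

In real use, the toy vectors would be replaced by the output of whatever sentence-embedding model you choose. Computing each leave-one-out mean from the running sum avoids re-averaging the whole book for every sentence, which matters for novels with thousands of sentences.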
In practice
- Apply embedding-based analysis to identify text outliers.
- Use semantic distinctiveness to find narrative triggers.
- Explore structural elements in texts via ML methods.
Topics
- Natural Language Processing
- Sentence Embeddings
- Computational Linguistics
- Literary Analysis
- Text Structure
- Semantic Distinctiveness
Best for: Research Scientist, AI Scientist, Data Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Naturallanguageprocessing on Medium.