Proposal and study of statistical features for string similarity computation and classification

2026-05-14 · Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, quick

Summary

Researchers adapt co-occurrence matrix (COM) and run-length matrix (RLM) features, traditionally used in visual computing, to general string similarity computation across words, phrases, codes, and texts. The proposed features are purely statistical and language-agnostic, so they apply across diverse linguistic and grammatical structures. The study evaluates COM and RLM against established statistical measures such as longest common subsequence, maximal consecutive longest common subsequence, mutual information, and various edit distances. In synthetic experiments, COM and RLM features outperformed the other state-of-the-art statistical features, with statistically significant differences (p-value < 0.001) in three out of four cases. RLM features also achieved the best results on a real text plagiarism dataset.
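For readers less familiar with the baselines named above, here is a minimal Python sketch of two of them: longest common subsequence length and Levenshtein edit distance. These are the textbook dynamic-programming formulations, not necessarily the exact variants or normalizations used in the study, and the function names `lcs_length` and `levenshtein` are illustrative.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (textbook DP)."""
    prev = [0] * (len(b) + 1)  # DP row for the previous character of a
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Extend the best subsequence ending before both characters.
                curr.append(prev[j - 1] + 1)
            else:
                # Skip one character from either string.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions from a to b."""
    prev = list(range(len(b) + 1))  # distance from the empty prefix of a
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # delete ch_a
                            curr[j - 1] + 1,      # insert ch_b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(lcs_length("ABCBDAB", "BDCABA"))  # 4
    print(levenshtein("kitten", "sitting"))  # 3
```

Both measures compare two strings position by position. The COM and RLM features instead summarize each string's internal statistics first and compare the summaries afterward.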

Key takeaway

Research scientists developing string similarity algorithms should consider integrating co-occurrence matrix (COM) and run-length matrix (RLM) features. These features outperformed traditional statistical measures in synthetic experiments, and RLM performed best on a real plagiarism dataset. Because they are language-agnostic, they may offer more robust and versatile options for text analysis and matching applications.

Key insights

COM and RLM features, borrowed from visual computing, outperform traditional statistical measures for language-agnostic string similarity computation.

Principles

Statistical features can transcend domain boundaries.
Language-agnostic methods enhance string analysis.
COM and RLM improve string similarity metrics.

Method

Adapt co-occurrence matrix (COM) and run-length matrix (RLM) from visual computing to analyze string patterns for similarity, then compare against traditional statistical measures like edit distances.
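The summary does not give the study's exact construction, so the following Python sketch is only one plausible reading of the idea. It assumes a string is treated like a one-dimensional image whose "gray levels" are characters: the co-occurrence counts tally ordered character pairs at a fixed offset, and the run-length counts tally maximal runs of a repeated character. The function names, the `offset` parameter, and the use of cosine similarity to compare two count tables are all assumptions made for illustration, not details taken from the paper.

```python
import math
from collections import Counter


def cooccurrence_counts(s: str, offset: int = 1) -> Counter:
    """Sparse co-occurrence matrix: counts of ordered pairs (s[i], s[i + offset])."""
    return Counter((s[i], s[i + offset]) for i in range(len(s) - offset))


def run_length_counts(s: str) -> Counter:
    """Sparse run-length matrix: counts of (symbol, run length) over maximal runs."""
    counts = Counter()
    i = 0
    while i < len(s):
        j = i
        while j < len(s) and s[j] == s[i]:
            j += 1  # advance to the end of the current run
        counts[(s[i], j - i)] += 1
        i = j
    return counts


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count tables (assumed comparison step)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty table has no direction to compare
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    x, y = "the quick brown fox", "the quick brown dog"
    print(cosine_similarity(cooccurrence_counts(x), cooccurrence_counts(y)))
    print(cosine_similarity(run_length_counts(x), run_length_counts(y)))
```

Because both tables are built from raw symbol statistics, nothing in the sketch depends on tokenization, vocabulary, or grammar, which is the sense in which such features are language-agnostic. In image analysis, scalar descriptors (for example contrast or energy) are usually derived from these matrices. Whether and how the study derives such descriptors is not stated in this summary.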

In practice

Apply RLM for plagiarism detection.
Use COM/RLM for cross-language text analysis.
Integrate these features into string matching algorithms.

Topics

String Similarity Computation
Co-occurrence Matrix
Run-length Matrix
Statistical Features
Plagiarism Detection

Best for: Research Scientist, AI Scientist, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.