Text Analysis of Debt Collection Lawsuit Misspellings in Hamilton County, Ohio and Connecticut
eScholarship
Open Access Publications from the University of California

UCLA

UCLA Electronic Theses and Dissertations

Abstract

This study investigates the patterns in, and potential motivations behind, misspellings of plaintiff names in debt collection lawsuits in two datasets: Hamilton County, Ohio, and Connecticut. Using four fuzzy matching techniques (Levenshtein Distance, Damerau-Levenshtein Distance, Jaro-Winkler Similarity, and Euclidean Keyboard Distance), it quantifies the discrepancies between the raw and standardized plaintiff names in both datasets. The analysis reveals that a substantial proportion of name variations require minimal or no edits, with median distances of 0 for all metrics, suggesting that many of the names are already correctly formatted. However, the presence of outliers and higher mean distances, particularly in the Euclidean Keyboard Distance, suggests that some misspellings may be deliberate rather than accidental. Kernel Density Estimation (KDE) and Empirical Cumulative Distribution Function (ECDF) analyses further characterize the distribution of these discrepancies, and a temporal analysis reveals fluctuations in misspelling rates. Notably, spikes in misspelling rates during specific periods, such as the 2008 financial crisis, may indicate strategic behavior by plaintiffs to obscure their identities or to manipulate legal processes. These findings underscore the importance of standardized data cleaning and the need for further qualitative research to distinguish between accidental errors and deliberate misspellings in debt collection lawsuits.
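
To make the distance metrics named in the abstract concrete, the following is a minimal Python sketch, not the thesis's actual implementation. It shows Levenshtein distance, the optimal-string-alignment variant of Damerau-Levenshtein distance (which counts an adjacent transposition as one edit), and one possible keyboard distance. The plaintiff names are invented for illustration. The keyboard metric's definition (summing Euclidean distances between substituted keys on an unstaggered QWERTY grid, for aligned positions only) is an assumption, since the abstract does not define it. Jaro-Winkler similarity is omitted for brevity.

```python
import math
from statistics import mean, median


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment: Levenshtein plus adjacent transpositions."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            # Swapping two adjacent characters counts as a single edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[len(a)][len(b)]


# Assumed layout: plain QWERTY letter rows on an unstaggered grid.
QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
KEY_POS = {ch: (col, row)
           for row, keys in enumerate(QWERTY_ROWS)
           for col, ch in enumerate(keys)}


def keyboard_distance(a: str, b: str) -> float:
    """Hypothetical metric: sum of Euclidean key distances over aligned, differing letters."""
    total = 0.0
    for ca, cb in zip(a.lower(), b.lower()):
        if ca != cb and ca in KEY_POS and cb in KEY_POS:
            (x1, y1), (x2, y2) = KEY_POS[ca], KEY_POS[cb]
            total += math.hypot(x1 - x2, y1 - y2)
    return total


# Invented (raw, standardized) name pairs, for illustration only.
pairs = [
    ("ACME CREDIT LLC", "ACME CREDIT LLC"),
    ("ACME CREDIT LLC", "ACME CREDIT LLC"),
    ("ACME CERDIT LLC", "ACME CREDIT LLC"),  # adjacent transposition
    ("ACNE CREDIT LLC", "ACME CREDIT LLC"),  # neighbouring-key substitution
    ("ACME CREDIT LLC", "ACME CREDIT LLC"),
]
lev = [levenshtein(raw, std) for raw, std in pairs]
print("median:", median(lev), "mean:", mean(lev))
```

In this toy sample most raw names already match their standardized form, so the median Levenshtein distance is 0 while the mean is pulled above 0 by the few misspelled entries, mirroring the pattern the abstract describes.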