Text Analysis of Debt Collection Lawsuit Misspellings in Hamilton County, Ohio and Connecticut
- Xu, Cindy Ziyi
- Advisor(s): Montufar Cuartas, Guido Francisco
Abstract
This study investigates the patterns and potential motivations behind misspellings of plaintiff names in debt collection lawsuits in two datasets, one from Hamilton County, Ohio, and one from Connecticut. Using four fuzzy matching techniques (Levenshtein Distance, Damerau-Levenshtein Distance, Jaro-Winkler Similarity, and Euclidean Keyboard Distance), it quantifies the discrepancies between the raw and standardized plaintiff names in both datasets. The analysis reveals that a significant proportion of name variations require minimal edits, with median distances of 0 for all metrics, suggesting that many of the names are already correctly formatted. However, the presence of outliers and higher mean distances, particularly in the Euclidean Keyboard Distance, suggests that some misspellings may be deliberate rather than accidental. Kernel Density Estimation (KDE) and Empirical Cumulative Distribution Function (ECDF) analyses further characterize the distribution of these discrepancies, and a time-series analysis reveals fluctuations in misspelling rates over time. Notably, spikes in misspelling rates during specific periods, such as the 2008 financial crisis, may reflect strategic behavior by plaintiffs seeking to obscure their identities or manipulate legal processes. These findings underscore the importance of standardized data cleaning and the need for further qualitative research to distinguish accidental errors from deliberate misspellings in debt collection lawsuits.
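To make the distance metrics named above concrete, the following is a minimal Python sketch of three of them, not the study's own implementation. The function names, the unstaggered QWERTY grid, the restricted (optimal string alignment) form of Damerau-Levenshtein, and the per-position definition of keyboard distance are all assumptions made for illustration; Jaro-Winkler is omitted for brevity.

```python
from math import hypot


def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein: also counts an adjacent swap as one edit."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            # Adjacent transposition, e.g. "ti" typed as "it".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


# Assumed layout: letter keys on an unstaggered QWERTY grid, as (column, row).
QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
KEY_POS = {
    ch: (col, row)
    for row, keys in enumerate(QWERTY_ROWS)
    for col, ch in enumerate(keys)
}


def key_distance(c1: str, c2: str) -> float:
    """Euclidean distance between two keys on the assumed grid."""
    c1, c2 = c1.lower(), c2.lower()
    if c1 == c2:
        return 0.0
    if c1 not in KEY_POS or c2 not in KEY_POS:
        return 1.0  # assumption: flat penalty for differing non-letter characters
    (x1, y1), (x2, y2) = KEY_POS[c1], KEY_POS[c2]
    return hypot(x1 - x2, y1 - y2)


def keyboard_distance(a: str, b: str) -> float:
    """Sum of per-position key distances; defined here only for equal-length strings."""
    if len(a) != len(b):
        raise ValueError("this sketch only compares equal-length strings")
    return sum(key_distance(ca, cb) for ca, cb in zip(a, b))
```

Under these assumptions, a swapped pair of adjacent letters (for example "captial" for "capital") costs two edits under Levenshtein but one under the transposition-aware variant, while the keyboard distance separates a slip onto a neighboring key from a substitution between keys that are far apart.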