We introduce dCrypt, a novel machine-learning algorithm for categorizing and classifying unstructured text, together with its Python implementation. dCrypt refines the classic text categorization algorithm of Cavnar and Trenkle (Proc. SDAIR '94) by constructing label-dependent n-gram profiles and applying label-dependent, multi-criteria feature selection. The implementation is fully featured; a carefully designed application interface and modular Python components abstract the algorithmic details away from the user. A tuple-based dictionary design provides sparse storage of large vocabulary sets and efficient querying. Where scale prevents in-memory operation, we demonstrate the use of key-value stores such as Redis and weigh their pros and cons against SQLite. We discuss several applications of the algorithm, including automated mapping of product attributes from unstructured product descriptions and prediction of insurance and credit card fraud from case descriptions. We also discuss extensions of the algorithm to unsupervised and semi-supervised learning, and its adaptation to the MapReduce framework.
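For readers unfamiliar with the Cavnar and Trenkle baseline that dCrypt refines, the following is a minimal sketch of that general approach: one rank-ordered character n-gram profile per label, compared to a document's profile with the out-of-place distance. It is not dCrypt's implementation or API. The function names, the `n_max` and `top_k` defaults, the underscore padding, and the toy training data are all illustrative assumptions, and dCrypt's multi-criteria feature selection is not shown.

```python
from collections import Counter


def char_ngrams(text, n_max=3):
    """Yield character n-grams (lengths 1..n_max) from each padded token."""
    for token in text.lower().split():
        padded = f"_{token}_"  # underscore padding marks word boundaries (assumption)
        for n in range(1, n_max + 1):
            for i in range(len(padded) - n + 1):
                yield padded[i:i + n]


def build_profile(texts, top_k=300):
    """Map each of the top_k most frequent n-grams to its rank (0 = most frequent)."""
    counts = Counter()
    for text in texts:
        counts.update(char_ngrams(text))
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
    return {gram: rank for rank, (gram, _) in enumerate(ranked)}


def out_of_place(doc_profile, label_profile):
    """Sum of rank differences; n-grams missing from the label get a fixed maximum penalty."""
    max_penalty = len(label_profile)
    return sum(
        abs(rank - label_profile[gram]) if gram in label_profile else max_penalty
        for gram, rank in doc_profile.items()
    )


def classify(text, label_profiles, top_k=300):
    """Return the label whose profile is closest to the document's profile."""
    doc_profile = build_profile([text], top_k)
    return min(
        label_profiles,
        key=lambda label: out_of_place(doc_profile, label_profiles[label]),
    )


# Toy, made-up training data: one profile is built per label.
training = {
    "electronics": [
        "wireless bluetooth headphones with charging case",
        "usb charging cable for phone",
    ],
    "apparel": [
        "cotton crew neck t-shirt",
        "slim fit denim jeans with pockets",
    ],
}
label_profiles = {label: build_profile(texts) for label, texts in training.items()}
predicted = classify("bluetooth speaker with usb cable", label_profiles)
```

Storing each profile as a plain dictionary from n-gram to rank keeps lookups constant-time; a larger vocabulary could move the same mapping into an external store without changing the distance computation.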
Ruchir is a Data Scientist and holds a B.Tech. from IIT Kanpur. He develops solutions for unstructured and semi-structured text.