Language independent unsupervised learning of short message service dialect |
| |
Authors: | Sreangsu Acharyya, Sumit Negi, L. Venkata Subramaniam, Shourya Roy |
| |
Affiliation: | (1) State Key Laboratory on Intelligent Technology and Systems, Tsinghua National Laboratory for Information Science and Technology (TNList), Department of Automation, Tsinghua University, Beijing, 100084, China |
| |
Abstract: | Noise in textual data, such as that introduced by multilinguality, misspellings, abbreviations, deletions, phonetic spellings,
non-standard transliteration, etc., poses considerable problems for text mining. Such corruptions are very common in instant
messenger and short message service data, and they adversely affect off-the-shelf text mining methods. Most techniques address
this problem with supervised methods that make use of hand-labeled corrections. But these require human-generated labels and
corrections that are very expensive and time consuming to obtain because of the multilinguality and complexity of the corruptions.
While we do not champion unsupervised methods over supervised ones when quality of results is the singular concern, we demonstrate
that unsupervised methods can provide cost-effective results without the expensive human intervention that is necessary
to generate a parallel labeled corpus. A generative model based unsupervised technique is presented that maps non-standard
words to their corresponding conventional frequent form. A hidden Markov model (HMM) over a “subsequencized” representation
of words is used, where a word is represented as a bag of weighted subsequences. The approximate maximum likelihood inference
algorithm used is such that the training phase involves clustering over vectors and not the customary and expensive dynamic
programming (Baum–Welch algorithm) over sequences that is necessary for HMMs. A principled transformation of maximum likelihood
based “central clustering” cost function of Baum–Welch into a “pairwise similarity” based clustering is proposed. This transformation
makes it possible to apply “subsequence kernel” based methods that model delete and insert corruptions well. The novelty of
this approach lies in the fact that the expensive (Baum–Welch) iterations required for HMMs can be avoided through an approximation
of the loglikelihood function and by establishing a connection between the loglikelihood and a pairwise distance. Anecdotal
evidence of efficacy is provided on public and proprietary data. |
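As an illustration of the "bag of weighted subsequences" representation and the pairwise similarity the abstract refers to, the following is a minimal sketch, not the authors' implementation: it enumerates non-contiguous character subsequences with a gap-penalizing decay (function names, the `decay` parameter, and the span-based weighting are illustrative assumptions) and compares words by a normalized inner product, in the spirit of a subsequence kernel that tolerates deletions such as "tmrw" for "tomorrow".

```python
from itertools import combinations
from math import sqrt

def subsequence_bag(word, max_len=3, decay=0.8):
    # Represent a word as a bag of weighted character subsequences.
    # Each (possibly non-contiguous) subsequence of length <= max_len
    # contributes a weight that decays with the span it covers, so
    # subsequences with large internal gaps count less. The weighting
    # scheme here is an illustrative choice, not the paper's exact one.
    bag = {}
    for n in range(1, max_len + 1):
        for idx in combinations(range(len(word)), n):
            sub = ''.join(word[i] for i in idx)
            span = idx[-1] - idx[0] + 1  # gaps are penalized via the span
            bag[sub] = bag.get(sub, 0.0) + decay ** span
    return bag

def similarity(w1, w2):
    # Cosine-normalized inner product of the two bags: a simple
    # subsequence-kernel-style pairwise similarity suitable as input
    # to pairwise-similarity-based clustering.
    b1, b2 = subsequence_bag(w1), subsequence_bag(w2)
    dot = sum(v * b2.get(k, 0.0) for k, v in b1.items())
    n1 = sqrt(sum(v * v for v in b1.values()))
    n2 = sqrt(sum(v * v for v in b2.values()))
    return dot / (n1 * n2)
```

Because deleted characters only shrink the shared subsequence set rather than destroying alignment, an SMS shortening like "tmrw" remains measurably closer to "tomorrow" than an unrelated token is, which is what lets a clustering over these similarities group noisy variants with their conventional frequent form.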
| |
Keywords: | |
|