首页 | 官方网站   微博 | 高级检索  
     


Efficient implementation of Aho–Corasick pattern matching automata using Unicode
Authors:Janne Nieminen  Pekka Kilpeläinen
Affiliation:1. Department of Computer Science, University of Kuopio, P.O. Box 1627, FI‐70211 Kuopio, FinlandDepartment of Computer Science, University of Kuopio, P.O. Box 1627, FI‐70211 Kuopio, Finland;2. Department of Computer Science, University of Kuopio, P.O. Box 1627, FI‐70211 Kuopio, Finland
Abstract:We study different efficient implementations of an Aho–Corasick pattern matching automaton when searching for patterns in Unicode text. Much of the previous research has been based on the assumption of a relatively small alphabet, for example the 7‐bit ASCII. Our aim is to examine the differences in performance arising from the use of a large alphabet, such as Unicode that is widely used today. The main concern is the representation of the transition function of the pattern matching automaton. We examine and compare array, linked list, hashing, balanced tree, perfect hashing, hybrid, triple‐array, and double‐array representations. For perfect hashing, we present an algorithm that constructs the hash tables in expected linear time and linear space. We implement the Aho–Corasick automaton in Java using the different transition function representations, and we evaluate their performance. Triple‐array and double‐array performed best in our experiments, with perfect hashing, hashing, and balanced tree coming next. We discovered that the array implementation has a slow preprocessing time when using the Unicode alphabet. It seems that the use of a large alphabet can slow down the preprocessing time of the automaton considerably depending on the transition function representation used. Copyright © 2006 John Wiley & Sons, Ltd.
Keywords:string pattern matching  Aho–  Corasick  implementation  transition function
设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司    京ICP备09084417号-23

京公网安备 11010802026262号