Efficient way of storing and matching names against large data sets
For a data-loss-prevention-like tool, I need to look up different types of data such as driver's license numbers, social security numbers, names, etc. Most of these are pattern-based and can be found with regular expressions, but names are a very broad category: virtually any sequence of characters could form a name. To make the lookup meaningful, I think I should only match against a defined dictionary of names. Here is what I am thinking.
Provide a dictionary of names as a configuration item. This seems sensible because the names may differ by geographic region from one use case to the next. I am looking for best practices for doing this in Java. Basically, these are my questions:
What is a good data structure for storing the names? A Set comes to mind as the first option; are there better options, such as in-memory databases?
How should I go about searching for these names in large data sets? The data sets are really large, and I can only read them row by row.
You can do it with either full-text indexing or online search (scanning the text directly, without building an index first).
I would prefer full-text indexing, e.g. with Lucene. You will have to define how the indexer finds tokens in the text (by defining the token patterns and the don't-care patterns).
Known patterns (e.g. license numbers) should be annotated with their type at indexing time. Querying the index for an annotated type (e.g. license number) will then return all license numbers contained in the data.
Flexible patterns (like names) should be indexed as tokens. You can then iterate over the collection of legal names and query the index with each one.
This approach is not the most flexible, but it is very robust to changes to the set of data files (simply add the new file to the index) or to the set of names (simply query the index for the new name).
With this approach, how you store the set of names is not really relevant to performance.
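To make the indexing idea concrete, here is a minimal sketch in plain Java. It is a toy in-memory inverted index, not Lucene and not Lucene's API; the class name TinyIndex, the "ssn" type label, and the SSN-like regular expression are assumptions made up for illustration. Rows are fed in one at a time, known patterns are annotated with a type, and everything that looks like a word is indexed as a lowercase token.

```java
import java.util.*;
import java.util.regex.*;

/** Toy inverted index (hypothetical, for illustration only; a real setup would use Lucene). */
final class TinyIndex {
    // Assumed example of a "known pattern": digit groups shaped like 123-45-6789.
    private static final Pattern SSN = Pattern.compile("\\b\\d{3}-\\d{2}-\\d{4}\\b");
    // Token pattern: runs of letters. Everything else is treated as "don't care".
    private static final Pattern TOKEN = Pattern.compile("\\p{L}+");

    private final Map<String, Set<Integer>> tokenToRows = new HashMap<>();
    private final Map<String, Set<String>> typeToValues = new HashMap<>();

    /** Index a single row; call this while reading the data set row by row. */
    void addRow(int rowNumber, String row) {
        Matcher ssn = SSN.matcher(row);
        while (ssn.find()) {
            // Annotate the match with its type at indexing time.
            typeToValues.computeIfAbsent("ssn", k -> new TreeSet<>()).add(ssn.group());
        }
        Matcher token = TOKEN.matcher(row);
        while (token.find()) {
            String key = token.group().toLowerCase(Locale.ROOT);
            tokenToRows.computeIfAbsent(key, k -> new TreeSet<>()).add(rowNumber);
        }
    }

    /** Rows in which the given single-token name occurs (case-insensitive). */
    Set<Integer> rowsContaining(String name) {
        return tokenToRows.getOrDefault(name.toLowerCase(Locale.ROOT), Collections.emptySet());
    }

    /** All values that were annotated with the given type. */
    Set<String> valuesOfType(String type) {
        return typeToValues.getOrDefault(type, Collections.emptySet());
    }

    public static void main(String[] args) {
        TinyIndex index = new TinyIndex();
        index.addRow(1, "Alice paid 20 USD, SSN 123-45-6789");
        index.addRow(2, "nothing sensitive here");
        index.addRow(3, "contact: BOB and alice");

        // Iterate over the configured names and query the index with each one.
        for (String name : List.of("Alice", "Bob", "Carol")) {
            System.out.println(name + " -> rows " + index.rowsContaining(name));
        }
        System.out.println("ssn -> " + index.valuesOfType("ssn"));
    }
}
```

Note that the names are only used as query keys here, so a plain List or Set is enough for them. This sketch only handles single-token names; multi-word names would need phrase queries, which a real indexer such as Lucene provides.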
The other approach would be to search the data for multiple strings (the names) at once. Note that there are specialized search algorithms for multiple strings, and that most algorithms have a preferred range of parameters (pattern size, alphabet size, number of patterns to search for). You can get some impressions at StringBench.
This approach allows you more flexible string patterns.
However, it is not robust to modifications to the set of names (the complete search then has to be repeated).
Multi-string search algorithms usually accept a set of strings to search for, but they store this set in an algorithm-specific way (most use a trie).
Efficient search for multiple patterns/strings can be done with DFA-based automata.
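As one possible illustration of the multi-string approach, the sketch below builds an Aho-Corasick automaton (a trie with failure links) in plain Java. This is just one well-known algorithm of this kind, not necessarily the best fit for your parameters; the class name MultiNameMatcher and its methods are made up for this example. The names are passed in as an ordinary collection and stored internally as a trie, and each row is then scanned once, regardless of how many names there are.

```java
import java.util.*;

/** Hypothetical sketch of an Aho-Corasick matcher for a fixed set of names. */
final class MultiNameMatcher {
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        final List<String> outputs = new ArrayList<>(); // names ending at this state
        Node fail;                                      // longest proper suffix that is also a trie prefix
    }

    private final Node root = new Node();

    MultiNameMatcher(Collection<String> names) {
        // 1. Insert every name into the trie (matching is case-insensitive).
        for (String name : names) {
            if (name.isEmpty()) continue;
            Node node = root;
            for (char c : name.toLowerCase(Locale.ROOT).toCharArray()) {
                node = node.next.computeIfAbsent(c, k -> new Node());
            }
            node.outputs.add(name);
        }
        // 2. Compute failure links breadth-first, so shallower states are done first.
        Deque<Node> queue = new ArrayDeque<>();
        for (Node child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node node = queue.remove();
            for (Map.Entry<Character, Node> e : node.next.entrySet()) {
                Node child = e.getValue();
                Node f = node.fail;
                while (f != root && !f.next.containsKey(e.getKey())) f = f.fail;
                Node target = f.next.get(e.getKey());
                child.fail = (target != null) ? target : root;
                // Names that end at the failure state also end here.
                child.outputs.addAll(child.fail.outputs);
                queue.add(child);
            }
        }
    }

    /** Returns every configured name that occurs in the row, in order of where each match ends. */
    List<String> findIn(String row) {
        List<String> hits = new ArrayList<>();
        Node node = root;
        for (char c : row.toLowerCase(Locale.ROOT).toCharArray()) {
            while (node != root && !node.next.containsKey(c)) node = node.fail;
            node = node.next.getOrDefault(c, root);
            hits.addAll(node.outputs);
        }
        return hits;
    }

    public static void main(String[] args) {
        MultiNameMatcher matcher = new MultiNameMatcher(List.of("Ann", "Anna", "Bob"));
        // In the real tool this loop would run over rows read from the data set.
        for (String row : List.of("Anna met bob", "no names in this row")) {
            System.out.println(row + " -> " + matcher.findIn(row));
        }
    }
}
```

Be aware that this matches substrings, so "Ann" is also reported inside "Hannah"; if you only want whole words, you would have to check the characters around each match or tokenize first. Changing the set of names means rebuilding the automaton and rescanning the data, which is the robustness trade-off mentioned above.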