pelican_george February 2016

Lucene - custom analyzer/tokenizer to index JSON key/value pairs

I'm aiming to store and index JSON key/value pairs. Ideally I would store them under a constant field name (for simplicity's sake, "GRADES").

An example of the incoming JSON object:

    "Data": [{
        "Key": "DP01",
        "Value": "Excellent"
    }, {
        "Key": "DP02",
        "Value": "Average"
    }, {
        "Key": "DP03",
        "Value": "Negative"
    }]

The JSON object would be serialized and stored as is, but I would like to index it in a way that lets me search within that same field by key and value. The main idea is to search multiple values within the same Lucene field.

Any suggestions on how to structure the indexing? Let's imagine, for example, that I would like to search using the following query:

[GRADES: "key:DP01 UNIQUEIDasDELIMITER value:Excellent"]

How would a custom analyzer/tokenizer achieve this?

EDIT: An attempt to depict my goal more accurately.

Think of this typical relational type of structure (for simplicity's sake).

  • Each document is a website.

  • A website can have multiple images (and other important metadata).

  • Each image has multiple sets of free-form key/value pair properties:

    {
        "Key": "Scenery",
        "Value": "Nature"
    }, {
        "Key": "Style",
        "Value": "Vintage"
    }
    
  • Another set:

    {
        "Key": "Scenery",
        "Value": "Industrial"
    }, {
        "Key": "Style",
        "Value": "Vintage"
    }
    

My challenge is to start from a structure like this and index it in a way that enables me to build queries such as:

A website with an image of scenery:industrial and style:vintage.

I'm probably taking the wrong approach as indicated by Andy Pook. Any ideas how to efficiently flatten out these properties?

Answers


cris almodovar March 2016

How about storing the JSON Data in a multi-valued field, e.g. GRADES, like this:

GRADES: "Key DP01 Value Excellent"
GRADES: "Key DP02 Value Average"
GRADES: "Key DP03 Value Negative"

You could then run a query like this:

GRADES: ("Key DP01" AND "Value Excellent")
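To make the flattening step concrete, here is a minimal sketch in plain Java of how the key/value pairs could be turned into the strings shown above. The class and method names (`GradeValues`, `toFieldValues`) are made up for illustration, and the Lucene calls are left out on purpose: each returned string would be added to the document as one more value of the GRADES field.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class GradeValues {
    // Turns each key/value pair into one string such as "Key DP01 Value Excellent".
    // Each string is meant to become a separate value of the same field.
    static List<String> toFieldValues(Map<String, String> pairs) {
        List<String> values = new ArrayList<>();
        for (Map.Entry<String, String> e : pairs.entrySet()) {
            values.add("Key " + e.getKey() + " Value " + e.getValue());
        }
        return values;
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the pairs in insertion order.
        Map<String, String> data = new LinkedHashMap<>();
        data.put("DP01", "Excellent");
        data.put("DP02", "Average");
        data.put("DP03", "Negative");
        for (String value : toFieldValues(data)) {
            System.out.println("GRADES: \"" + value + "\"");
        }
    }
}
```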


Andy Pook March 2016

A common "problem" is to think of indexes and documents as having a consistent set of fields. A Lucene index is not like a relational database with tables of a fixed set of columns; each document can carry its own set of fields.

In a previous life I had an entity with a set of "attributes": a key/value collection (much like your grades).

Each document was created with fields named for each attribute, i.e. "attr-thing", with the value added as "NOT_ANALYZED".

So, in your example I'd create fields like

new Field("grade-"+gradeID, grade, Field.Store.NO, Field.Index.NOT_ANALYZED)

Then you can search with a query like "grade-DP01:excellent".
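As a rough sketch of the naming scheme (plain Java, no Lucene dependency), the field names and the matching query string could be derived like this. `AttributeFields`, `toFields` and `query` are hypothetical helpers. Lower-casing the value is an assumption added here: an untokenized value is indexed verbatim, so "Excellent" and "excellent" would otherwise be different terms.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

class AttributeFields {
    // Prefix taken from the "grade-" naming used above.
    static final String PREFIX = "grade-";

    // Maps each grade ID to a field name and a field value.
    // Values are lower-cased (an assumption) so that a lower-case
    // query term matches the untokenized value exactly.
    static Map<String, String> toFields(Map<String, String> grades) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : grades.entrySet()) {
            fields.put(PREFIX + e.getKey(), e.getValue().toLowerCase(Locale.ROOT));
        }
        return fields;
    }

    // Builds the matching query string, e.g. grade-DP01:excellent.
    static String query(String gradeId, String grade) {
        return PREFIX + gradeId + ":" + grade.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        Map<String, String> grades = new LinkedHashMap<>();
        grades.put("DP01", "Excellent");
        grades.put("DP02", "Average");
        // Each entry would become one untokenized field on the document.
        toFields(grades).forEach((name, value) -> System.out.println(name + " -> " + value));
        System.out.println(query("DP01", "Excellent"));
    }
}
```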

Alternatively, you can use a single fixed field name (similar to @cris-almodovar's suggestion) and set each value to something like "id=grade", again NOT_ANALYZED. Then search for "grade:DP01=excellent".

Either will work. I've used both approaches with success but typically prefer the first.

Addition in response to the edit...

I think I understand the problem... If you had "scenery=industrial style=vintage" and "scenery=nature style=modern" you wouldn't want it to match if you searched "nature AND vintage", right?

You could add an "imageType" field for each set with a value like "scenery=industrial style=vintage abc=xyz", analyzed with the WhitespaceAnalyzer (which just splits on whitespace; the KeywordAnalyzer would keep the whole value as a single token).

Then search with imageType:"scenery=industrial style=vintage"~2. A sloppy phrase only matches when all of its terms occur close together in the same field, and the slop allows the order to differ or extra values to sit in between. You'd have to work out the slop based on the number of properties you expect in each field. Simplistically, if you expect a maximum of N values, then the slop should be N too. If a document carries several imageType values, the analyzer's position increment gap also needs to be larger than the slop, otherwise a phrase could match across two different sets.
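A minimal sketch of how the per-image field value and the sloppy phrase query string could be assembled, in plain Java without Lucene. The names `ImageTypeQuery`, `fieldValue` and `sloppyPhrase` are hypothetical. It assumes keys and values contain no whitespace (otherwise a single property would be split into several tokens) and lower-cases everything so that index and query terms agree.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class ImageTypeQuery {
    // One whitespace-free token per property, e.g. "scenery=industrial".
    static String token(String key, String value) {
        return (key + "=" + value).toLowerCase(Locale.ROOT);
    }

    // Joins all properties of one image into a single field value,
    // separated by spaces so a whitespace tokenizer yields one token each.
    static String fieldValue(Map<String, String> properties) {
        List<String> tokens = new ArrayList<>();
        for (Map.Entry<String, String> e : properties.entrySet()) {
            tokens.add(token(e.getKey(), e.getValue()));
        }
        return String.join(" ", tokens);
    }

    // Builds a query string such as imageType:"scenery=industrial style=vintage"~2.
    static String sloppyPhrase(Map<String, String> wanted, int slop) {
        return "imageType:\"" + fieldValue(wanted) + "\"~" + slop;
    }

    public static void main(String[] args) {
        Map<String, String> wanted = new LinkedHashMap<>();
        wanted.put("Scenery", "Industrial");
        wanted.put("Style", "Vintage");
        // The slop is chosen by the caller, following the "max N values" rule of thumb.
        System.out.println(sloppyPhrase(wanted, 2));
    }
}
```

The slop stays a parameter because it depends on how many properties a single image can have in your data.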

