Jesper February 2016

How to handle "private" data in Elasticsearch

I need some input from anyone who has experience with or knowledge of how Elasticsearch and APIs work. I have a (very large) database with a lot of data for a lot of different items.

I need to make all of this data searchable through a public API, so that anyone can use it and query the API about data for specific items. I already have Elasticsearch up and running, and have populated an index in Elasticsearch with all of the data from the database. Elasticsearch is working fine and so is the API.

The challenge I now face is that some of the data in our database is "private" data which must not be publicly searchable. At the same time this private data must be searchable internally, which means that I need to make the API run in both a public mode and a private (user authenticated) mode. When a client that has not been authenticated queries the API, it should only get the public items, whereas an authenticated (private) client should get all possible results.

I don't have a problem with items where all of the data must not be publicly available. I can simply mark them with a flag and make sure that Elasticsearch does not return them when I serve the public client through the API. The challenge occurs when part of the data for an item is private and part of it is public.

I have thought about peeling off the private data before returning the result to the (public) client. This way the private data is not available directly through the API, but it is still exposed indirectly/implicitly. If, for instance, the client has searched for something that matches private data, and I "strip" the private data from the search result before returning it, the client will still get the document returned, indicating that the document was a "hit" for that specific query. However the specific query string from the client is nowhere to be found in the document that I return, thus

Answers


Peter Dixon-Moses February 2016

From your description, you clearly need two distinct views of your data:

  • PUBLIC: A subset of the documents in the collection, with certain fields neither searched nor returned.

  • PRIVATE: Entire collection, all fields searchable and visible.


You can accomplish two distinct views of the data by either having:

  1. One index / Two queries, one public, and one private (you can either implement this yourself, or have Shield manage this opaquely for you).
  2. Two indices / Two queries (one public, one private)

In the first case, your public query will filter out private documents as you mention, and only search/return the publicly visible fields, while the private query will not filter and will search/return all fields.
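As a rough sketch of the one index / two queries option, the two request bodies could be built along these lines. The field names (`title`, `description`, `internal_notes`) and the `is_private` flag are hypothetical and not taken from the question, and the sketch assumes every document carries the flag, since a `term` filter will not match documents where the field is missing.

```python
# Hypothetical field names; substitute the ones from your own mapping.
PUBLIC_FIELDS = ["title", "description"]
PRIVATE_FIELDS = ["internal_notes"]
PRIVATE_FLAG = "is_private"  # assumed boolean flag set on every document


def build_public_query(text):
    """Search only public fields, skip flagged documents, return only public fields."""
    return {
        "query": {
            "bool": {
                # Matching is restricted to public fields, so private text
                # can never be the reason a document is a hit.
                "must": {"multi_match": {"query": text, "fields": PUBLIC_FIELDS}},
                # Wholly private documents are filtered out.
                "filter": {"term": {PRIVATE_FLAG: False}},
            }
        },
        # Only public fields are returned in each hit.
        "_source": PUBLIC_FIELDS,
    }


def build_private_query(text):
    """Search every field on every document and return full documents."""
    return {
        "query": {
            "multi_match": {"query": text, "fields": PUBLIC_FIELDS + PRIVATE_FIELDS}
        }
    }
```

Because the public query never matches on private fields, a document cannot come back as a hit for text that only exists in its private part, which avoids the implicit leak described in the question. The API layer would pick one builder or the other depending on whether the caller is authenticated.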

In the second case, you would actually index your data into two separate indices, and explicitly have the public query run against the public index (containing only the public fields), and the private query run against the private index.
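For the two-index option, the split happens at indexing time rather than at query time. The following is only a sketch under assumptions: the index names `items_public` and `items_private`, the field names, and the `is_private` flag are all hypothetical. It shows how one source record could be turned into the bodies sent to each index.

```python
PUBLIC_INDEX = "items_public"    # hypothetical index names
PRIVATE_INDEX = "items_private"
PUBLIC_FIELDS = ["title", "description"]  # hypothetical public field names
PRIVATE_FLAG = "is_private"               # assumed flag for wholly private records


def split_for_indexing(doc):
    """Return (index_name, body) pairs describing where one record is indexed."""
    # The private index always receives the full record.
    targets = [(PRIVATE_INDEX, dict(doc))]
    # A record with no flag is treated as private, so a missing flag
    # cannot expose data by accident.
    if not doc.get(PRIVATE_FLAG, True):
        public_body = {k: doc[k] for k in PUBLIC_FIELDS if k in doc}
        targets.append((PUBLIC_INDEX, public_body))
    return targets
```

With this layout the public index physically contains no private fields, so a public query has nothing private to match against or return, whatever its shape.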


It's true that you can build a mechanism (or use Shield) to accomplish what you need on top of a single index. However, you might want to consider (2) the public/private indices option if:

  • You want to reduce the risk of inadvertently exposing your sensitive data through an oversight, or configuration change.
  • You'd like to reduce the coupling between the public features of your application and the private features of your application.
  • You anticipate the scaling characteristics of public usage to deviate significantly from private usage.

As an example of this last point, most freemium sites have a very skewed distribution of payin

Post Status

Asked in February 2016
Viewed 1,166 times
Voted 8
Answered 1 times
