jdotjdot February 2016

Faceting by geolocation in Elasticsearch (clustering)

I have a project that enables users to search for POIs using Elasticsearch, and they can filter by a number of different attributes, including location. I'd like to add faceting to all of the filters, most of which are categorical variables for which faceting is perfect. However, I also want users to be able to facet by location/city/metro area. Each location is currently a lat/long pair.

From my research, it seems that the best approach is to use k-means clustering of the lat/long pairs to get the most common groupings of locations for faceting. Once I have those groupings, I would want to provide the most commonly recognizable name for the area (e.g., even if "Brooklyn" was the center of a cluster, I'd want to provide the name "New York City").

(a) Can geo-clustering (k-means, or any other way) be done in Elasticsearch to allow faceting by location? If so, how? If not, can this be done in Postgres instead? (b) How can I make sure that I'm providing the most widely recognizable political name for any given region returned by the clustering?

Answers


Peter Dixon-Moses February 2016

Given a lat/long pair (or an address) as input, you can use the Google Maps Geocoding API to retrieve (and then index) hierarchically scoped labels for:

  • country
  • administrative_area_level_1 (state: in the US)
  • administrative_area_level_2 (county: in the US)
  • sublocality_level_1 (borough: in NYC)
  • administrative_area_level_3 (city: in the US)
  • locality (neighborhood: in the US)
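
To make the indexing step concrete, here is a minimal sketch of pulling those labels out of a Geocoding API JSON response before indexing each POI. It relies on the documented response shape (`results[].address_components[]` with `long_name` and `types`). The `extract_labels` helper and the sample response are hypothetical: the sample is hand-written for illustration, not real API output, and the HTTP call itself is left out.

```python
# Field names follow the address component types listed above.
FACET_TYPES = (
    "country",
    "administrative_area_level_1",
    "administrative_area_level_2",
    "sublocality_level_1",
    "administrative_area_level_3",
    "locality",
)


def extract_labels(geocode_response):
    """Map each facet type to the first matching long_name in the response."""
    labels = {}
    for result in geocode_response.get("results", []):
        for component in result.get("address_components", []):
            for component_type in component.get("types", []):
                if component_type in FACET_TYPES and component_type not in labels:
                    labels[component_type] = component["long_name"]
    return labels


# Hand-written sample, trimmed to the fields used here (not real API output).
sample_response = {
    "status": "OK",
    "results": [
        {
            "address_components": [
                {"long_name": "Brooklyn", "short_name": "Brooklyn",
                 "types": ["political", "sublocality", "sublocality_level_1"]},
                {"long_name": "Kings County", "short_name": "Kings County",
                 "types": ["administrative_area_level_2", "political"]},
                {"long_name": "New York", "short_name": "NY",
                 "types": ["administrative_area_level_1", "political"]},
                {"long_name": "United States", "short_name": "US",
                 "types": ["country", "political"]},
            ]
        }
    ],
}

# These labels would be stored on the POI document alongside its lat/long.
poi_location_labels = extract_labels(sample_response)
```

Types missing from a response are simply left out of the result, so the indexed document only carries the levels the geocoder actually returned.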

If you're building a Yelp- or AirBnB-like search interface with a zoomable map component, you can choose which location facet to display based on a diversity criterion.

For example, request all six term facets, but display only the one whose number of distinct terms falls in a useful range (say, 2-10 terms). If your zoom level (and bounding box) includes Brooklyn, Manhattan, and Staten Island, you'll see the following:

  • country (United States) ... ignore, too broad
  • administrative_area_level_1 (New York) ... ignore, too broad
  • administrative_area_level_2 (Kings County, New York County, Richmond County) ... ignore (just in the case of NYC where sublocality_level_1 is more commonly used)
  • sublocality_level_1 (Brooklyn, Manhattan, Staten Island) ... appropriately specific, show this!
  • administrative_area_level_3 (New York City) ... ignore, too broad
  • locality (<100s of neighborhoods>) ... ignore, too narrow
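
The selection logic above could be sketched roughly as follows. This assumes each label was indexed as a non-analyzed string field named after its type and that the POI coordinates live in a `geo_point` field called `location`; both are assumptions, as are the helper names, the preference order (used to break ties such as county vs. borough), and the made-up bucket counts. The request body uses standard `terms` aggregations and a `geo_bounding_box` filter; sending it to the cluster is left out.

```python
FACET_FIELDS = [
    "country",
    "administrative_area_level_1",
    "administrative_area_level_2",
    "sublocality_level_1",
    "administrative_area_level_3",
    "locality",
]

# Assumed tie-break order: when several facets qualify, take the first one.
PREFERENCE = [
    "sublocality_level_1",
    "administrative_area_level_3",
    "administrative_area_level_2",
    "administrative_area_level_1",
    "locality",
    "country",
]


def build_facet_request(top_left, bottom_right, max_terms=10):
    """Build a search body with one terms aggregation per location field."""
    return {
        "size": 0,  # only the aggregations are needed, not the hits
        "query": {"bool": {"filter": {"geo_bounding_box": {"location": {
            "top_left": {"lat": top_left[0], "lon": top_left[1]},
            "bottom_right": {"lat": bottom_right[0], "lon": bottom_right[1]},
        }}}}},
        # Ask for one extra bucket so "more than max_terms" is detectable.
        "aggs": {
            field: {"terms": {"field": field, "size": max_terms + 1}}
            for field in FACET_FIELDS
        },
    }


def choose_facet(aggregations, preference=PREFERENCE, min_terms=2, max_terms=10):
    """Return the first preferred field whose bucket count is in range."""
    for field in preference:
        buckets = aggregations.get(field, {}).get("buckets", [])
        if min_terms <= len(buckets) <= max_terms:
            return field, [bucket["key"] for bucket in buckets]
    return None, []


def _buckets(*keys):
    # Helper for the illustration only; doc_count values are invented.
    return {"buckets": [{"key": key, "doc_count": 1} for key in keys]}


request = build_facet_request(top_left=(40.92, -74.26), bottom_right=(40.49, -73.70))

# Simulated "aggregations" section of a response for the NYC example.
sample_aggregations = {
    "country": _buckets("United States"),
    "administrative_area_level_1": _buckets("New York"),
    "administrative_area_level_2": _buckets(
        "Kings County", "New York County", "Richmond County"),
    "sublocality_level_1": _buckets("Brooklyn", "Manhattan", "Staten Island"),
    "administrative_area_level_3": _buckets("New York City"),
    "locality": _buckets(*["neighborhood %d" % i for i in range(11)]),
}

chosen_field, chosen_terms = choose_facet(sample_aggregations)
```

With this sample, both the county and borough facets have three terms, so the preference order is what picks `sublocality_level_1`. A real interface would probably tune that order per region rather than hard-code it.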
