MetaCarta’s Geo Referencing Engine uses natural language processing (NLP) to identify and disambiguate geographic references within unstructured content. MetaCarta rapidly builds a searchable index that enables users to find content using a combination of keywords and a map.

MetaCarta's algorithms are built to process very large volumes of data quickly, but the heart of the platform is the reference data that lets it interpret place names the way a person would. The following information describes how the MetaCarta platform processes and indexes content.

Like other references in natural language text, geographic references are often under-specified and ambiguous. To take an extreme example, a reference to “Al Hamra” could point to any of 65 possible places in the world with that name, or might not be a place at all, since the phrase also means “red” in Arabic. The same applies to the more than two dozen U.S. towns named “Madison”. In fact, the majority of references to places are ambiguous in this way. To handle this ambiguity, MetaCarta leverages the data assets in its Geographic Data Modules (GDMs).
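
To make the ambiguity problem concrete, the following is a minimal Python sketch of one possible way to choose among several places that share a name. It is not MetaCarta's algorithm. The toy gazetteer, the prior weights, the context boost factor, and the function name disambiguate are all assumptions made for illustration, and the coordinates are approximate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    region: str
    lat: float
    lon: float
    prior: float  # hypothetical prior weight, e.g. derived from population


# Toy gazetteer with illustrative values only (not MetaCarta GDM data).
GAZETTEER = {
    "madison": [
        Candidate("Madison", "Wisconsin", 43.07, -89.40, 0.6),
        Candidate("Madison", "Alabama", 34.70, -86.75, 0.2),
        Candidate("Madison", "Indiana", 38.74, -85.38, 0.2),
    ],
}

CONTEXT_BOOST = 5.0  # assumed multiplier when the region is named nearby


def disambiguate(name, context_words):
    """Return (best candidate, normalized score), or None if the name is unknown."""
    candidates = GAZETTEER.get(name.lower(), [])
    if not candidates:
        return None
    context = {word.lower() for word in context_words}
    # Score each candidate by its prior, boosted if its region appears in context.
    scores = [
        c.prior * (CONTEXT_BOOST if c.region.lower() in context else 1.0)
        for c in candidates
    ]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], scores[best] / sum(scores)


place, score = disambiguate("Madison", ["flights", "to", "Alabama"])
print(place.region, round(score, 2))
```

With no helpful context, the sketch falls back to the candidate with the largest prior. A real system would also have to decide whether the phrase is a place reference at all, which this sketch does not attempt.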

Geo Referencing Engine - Processing

The MetaCarta platform extracts or computes the following items for each document:

  • Geographic references, which may be place names or other forms of geographic annotation, such as coordinates or military grid references
  • A latitude and longitude for each geographic reference
  • A “geoconfidence” score for each geographic reference, which is the estimated probability that the assigned latitude and longitude are correct
  • An emphasis score for each geographic reference and each keyword reference, which is the estimated prominence of the reference in the document
  • The document's first recognized temporal reference
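
The items listed above could be held in a per-document record such as the one sketched below. This is an illustrative data structure only: the class names, field names, default threshold, and sample values are assumptions, not MetaCarta's actual output schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class GeoReference:
    text: str             # the reference as it appears in the document
    latitude: float
    longitude: float
    geoconfidence: float  # estimated probability that the coordinates are correct
    emphasis: float       # estimated prominence of the reference in the document


@dataclass
class ProcessedDocument:
    doc_id: str
    geo_references: List[GeoReference] = field(default_factory=list)
    keyword_emphasis: Dict[str, float] = field(default_factory=dict)
    first_temporal_reference: Optional[str] = None

    def confident_references(self, threshold=0.5):
        """Keep only references whose geoconfidence meets the threshold."""
        return [g for g in self.geo_references if g.geoconfidence >= threshold]


# Sample values are made up for illustration.
doc = ProcessedDocument(
    doc_id="doc-1",
    geo_references=[
        GeoReference("Madison", 43.07, -89.40, geoconfidence=0.85, emphasis=0.7),
        GeoReference("Al Hamra", 23.1, 57.3, geoconfidence=0.30, emphasis=0.2),
    ],
    keyword_emphasis={"crime": 0.6},
    first_temporal_reference="2009-03-14",
)
print([g.text for g in doc.confident_references()])
```

Keeping the geoconfidence score next to each coordinate pair lets a downstream application choose its own cutoff rather than relying on a fixed one.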

Geo Referencing Engine - Indexing


The platform then builds an optimized text and spatial index that allows documents to be rapidly retrieved based on geographic and textual elements of interest.

Traditional text indices allow documents to be retrieved based on keywords. Spatial indices allow documents to be retrieved based on geographic factors. MetaCarta uses an index that contains both text and location data types to provide sub-second response.

To use both keywords and a geographic extent as filters, one must apply both filters to an indexed list of documents. Such operations become prohibitively slow on large collections of data. Given the importance of such keyword-plus-map filtering, MetaCarta developed a specialized index that can apply both filters with a time cost that grows only with the size of the requested result set, not the collection size.
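
One simple way to picture a combined index is to key each posting by both a keyword and a map grid cell, so that a query touches only postings that already satisfy both filters approximately. The Python sketch below illustrates that general idea under stated assumptions. It is not MetaCarta's index design, and the class name KeywordGridIndex, the one-degree cell size, and the sample documents are hypothetical.

```python
import math
from collections import defaultdict

CELL_DEGREES = 1.0  # hypothetical grid resolution


def cell_of(lat, lon):
    """Map a coordinate to the (row, column) of the grid cell containing it."""
    return (math.floor(lat / CELL_DEGREES), math.floor(lon / CELL_DEGREES))


class KeywordGridIndex:
    def __init__(self):
        # (keyword, cell) -> list of (doc_id, lat, lon) postings
        self._postings = defaultdict(list)

    def add(self, doc_id, keywords, lat, lon):
        cell = cell_of(lat, lon)
        for keyword in keywords:
            self._postings[(keyword.lower(), cell)].append((doc_id, lat, lon))

    def search(self, keyword, south, west, north, east):
        """Return sorted ids of documents matching the keyword inside the box."""
        lo_row, lo_col = cell_of(south, west)
        hi_row, hi_col = cell_of(north, east)
        hits = set()
        for row in range(lo_row, hi_row + 1):
            for col in range(lo_col, hi_col + 1):
                key = (keyword.lower(), (row, col))
                for doc_id, lat, lon in self._postings.get(key, ()):
                    # Cells on the box edge may hold points outside the box.
                    if south <= lat <= north and west <= lon <= east:
                        hits.add(doc_id)
        return sorted(hits)


index = KeywordGridIndex()
index.add("a", ["crime"], 40.7, -74.0)    # inside the query box
index.add("b", ["crime"], 51.5, -0.1)     # right keyword, outside the box
index.add("c", ["weather"], 41.9, -87.6)  # inside the box, wrong keyword
print(index.search("crime", 24.0, -125.0, 50.0, -66.0))
```

In this sketch the work per query depends on the number of grid cells the box covers and the postings found in them, not on the total number of indexed documents. It does not handle boxes that cross the antimeridian.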

MetaCarta's index contains the geographic reference names and their latitude and longitude coordinates, as well as confidence, relevance, geographic term position, and geographic term proximity. The MetaCarta Platform's query engine can handle hundreds of geographic text queries per second, which enables support of thousands of simultaneous users with sub-second response.

The success or failure of any search solution, including geographic search, depends on performance and accuracy. For example, suppose a user searches for “crime” in the United States using an integrated (non-MetaCarta) solution that has indexed a collection of 5 million documents. The solution would have to perform the following steps:

  • The text query string “crime” is sent to the text search engine, which returns a result set of 355,123 matching documents from the total collection.
  • The coordinates of an area that encloses the United States are sent to the spatial search engine, which returns a result set of 251,098 documents that fall within that area. The spatial search engine uses place name coordinates derived from a geographic entity meaning tool.
  • To evaluate relevance during the list merge, a total of 606,221 seeks (355,123 + 251,098) must be performed. At an average seek time of 4 milliseconds (4-5 ms is the average seek time on modern drives), this results in a query time of about 2,425 seconds. At this point, the solution has failed to provide the necessary performance: no user would be willing to wait over 40 minutes for the results.
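
The arithmetic in the example above can be checked with a few lines of Python. The figures are taken directly from the example. The variable names are illustrative, and the model assumes one disk seek per document in either result set, as the example does.

```python
text_hits = 355_123     # documents matching "crime"
spatial_hits = 251_098  # documents located within the United States area
seek_ms = 4             # average seek time in milliseconds

total_seeks = text_hits + spatial_hits
query_seconds = total_seeks * seek_ms / 1000
query_minutes = query_seconds / 60

print(total_seeks, round(query_seconds), round(query_minutes, 1))
# prints: 606221 2425 40.4
```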


As the example shows, a useful geographic text search must combine geographic and non-geographic components in a single query. With a list merge, the resulting response time would be unacceptable in any production environment, especially because the delay grows as the number of documents increases.

For More Information

More detailed technical information can be found in MetaCarta's white paper entitled "White Paper: Geographic Search and Referencing Platform".