
MetaCarta’s Geo Referencing Engine uses natural language processing (NLP) to identify and disambiguate geographic references within unstructured content.  MetaCarta rapidly builds a searchable index that enables users to find content using a combination of keywords and a map. 

While MetaCarta uses advanced algorithms capable of processing massive amounts of data at lightning speeds, the heart of the platform is the data that allows it to think like a person. The following information describes how the MetaCarta platform processes and indexes content.

Like other references in natural language text, geographic references are often under-specified and ambiguous. To take an extreme example, a reference to “Al Hamra” could name any of 65 possible places in the world, or might not name a place at all, since the phrase also means “red” in Arabic. The same applies to the more than two dozen U.S. towns named “Madison”. In fact, the majority of references to places are ambiguous in this way. To handle this ambiguity, MetaCarta leverages the data assets in our Geographic Data Modules (GDMs).

Geo Referencing Engine - Processing

The MetaCarta platform extracts or computes the following items for each document:

  • Geographic references, which may be placenames or other forms of geographic annotation such as coordinates, military grid references, etc.
  • A latitude and longitude for each geographic reference
  • A “geoconfidence” score for each geographic reference, which is the estimated probability that the assigned latitude and longitude are correct
  • An emphasis score for each geographic reference and each keyword reference, which is the estimated prominence of the reference in the document
  • The document's first recognized temporal reference
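The per-document output listed above could be represented roughly as follows. This is a minimal sketch only: the class names, field names, and the confident_references helper are hypothetical and are not MetaCarta's actual schema or API.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class GeoReference:
    """One geographic reference found in a document (hypothetical layout)."""
    text: str             # placename, coordinate string, grid reference, etc.
    latitude: float
    longitude: float
    geoconfidence: float  # estimated probability that the coordinates are correct
    emphasis: float       # estimated prominence of the reference in the document


@dataclass
class KeywordReference:
    """One keyword reference with its emphasis score (hypothetical layout)."""
    text: str
    emphasis: float


@dataclass
class DocumentAnnotations:
    """Everything extracted or computed for a single document."""
    geo_references: List[GeoReference] = field(default_factory=list)
    keyword_references: List[KeywordReference] = field(default_factory=list)
    first_temporal_reference: Optional[str] = None

    def confident_references(self, threshold: float) -> List[GeoReference]:
        """Return the geographic references at or above a geoconfidence threshold."""
        return [g for g in self.geo_references if g.geoconfidence >= threshold]
```

Keeping the geoconfidence and emphasis scores on each reference, rather than on the document as a whole, lets a consumer filter or rank individual references.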

Step 0: Manual Tagging (The Foundation)

Statistical machine learning relies on ground truth training examples generated by human annotators. This technique is used in many areas of machine learning. In natural language processing, the training examples are usually texts that annotators mark up with metadata indicating various properties of substrings in the document.

Annotators need guidelines that settle borderline cases. For example, when the phrase "the city of" precedes a location reference, should it be tagged as part of the reference? Generally, additional guidelines must be created for new languages and genres. Even with good guidelines, human taggers still disagree on how to tag some documents.

To ensure the quality of the ground truth corpora used for training and evaluating its machine learning models, MetaCarta passes each document through multiple manual taggers. Since many of MetaCarta's corpora involve specialized subjects, we retain experts capable of capturing domain-specific nuances in our ground truth.

To construct a GDM, MetaCarta's scientists train and evaluate statistical models on collections of ground truth documents.

Step 1: Assessing Candidate Georeferences

When processing a document, MetaCarta considers every substring of the document. For each substring, it computes the probability that the author intended it to refer to a location. To compute this probability, MetaCarta detects clues surrounding each substring. For example, in the English language, a reader might gather contextual information from the word "to" when it stands alone, separated from other words by whitespace or punctuation.

While the presence of "to" does not fully determine the geographic meaning of subsequent substrings, it carries location-oriented meaning in the minds of both reader and author. The weight of this location-oriented meaning can be measured by counting how frequently people use the word "to" immediately before a location reference.
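The counting described above can be sketched as follows. This is an illustration under stated assumptions, not MetaCarta's model: the corpus format (tokens paired with a flag set by human annotators), the toy sentences, and the function name cue_weight are all hypothetical.

```python
from typing import Iterable, List, Tuple

# Assumed format: each token is paired with True if annotators tagged it
# as the start of a location reference, and False otherwise.
TaggedToken = Tuple[str, bool]


def cue_weight(corpus: Iterable[List[TaggedToken]], cue: str) -> float:
    """Fraction of occurrences of `cue` immediately followed by a location reference."""
    followed = 0
    total = 0
    for sentence in corpus:
        for i, (token, _) in enumerate(sentence):
            if token.lower() != cue:
                continue
            total += 1
            # A cue at the end of a sentence counts as "not followed by a location".
            if i + 1 < len(sentence) and sentence[i + 1][1]:
                followed += 1
    return followed / total if total else 0.0


# Toy ground-truth corpus, invented for illustration.
toy_corpus = [
    [("flew", False), ("to", False), ("Madison", True)],
    [("want", False), ("to", False), ("sleep", False)],
    [("drove", False), ("to", False), ("Boston", True)],
    [("to", False), ("be", False)],
]

weight = cue_weight(toy_corpus, "to")
```

In this toy corpus "to" occurs four times and precedes a tagged location twice, so the estimated weight is 0.5. A real model would combine many such clues rather than rely on one.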

When a substring happens to match a known place name, MetaCarta uses this as another predicate: useful, but not conclusive. As indicated above, most place names can occur in non-geographic contexts. MetaCarta can also detect geographic references for which no coordinates are available. Such "out of gazetteer" names are useful to our internal processes.

Step 2: Resolution (a.k.a. Disambiguation)

For every substring with a non-negligible probability of being a location reference, MetaCarta looks up all possible location meanings in the GDM. For example, a reference to "Madison" can mean any of more than 22 places with that name. In a manner similar to Step 1, MetaCarta computes probabilities for each candidate meaning.
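As a rough illustration of assigning a probability to each candidate meaning, the sketch below normalizes per-candidate evidence scores so that they sum to one. Everything here is assumed for illustration: the Candidate type, the toy gazetteer, the rounded coordinates, and the scores are hypothetical, and the source does not describe how MetaCarta actually computes these probabilities.

```python
from typing import Dict, List, NamedTuple


class Candidate(NamedTuple):
    name: str
    latitude: float
    longitude: float
    score: float  # hypothetical, unnormalized evidence score


def resolve(candidates: List[Candidate]) -> Dict[str, float]:
    """Turn unnormalized candidate scores into probabilities that sum to 1."""
    total = sum(c.score for c in candidates)
    if total <= 0:
        return {}
    return {c.name: c.score / total for c in candidates}


# Toy stand-in for a gazetteer lookup; a real one would hold every "Madison".
toy_gazetteer = {
    "Madison": [
        Candidate("Madison, Wisconsin", 43.07, -89.40, 6.0),
        Candidate("Madison, Alabama", 34.70, -86.75, 3.0),
        Candidate("Madison, New Jersey", 40.76, -74.42, 1.0),
    ]
}

probabilities = resolve(toy_gazetteer["Madison"])
best = max(probabilities, key=probabilities.get)
```

In practice the scores would come from context in the document, such as other places mentioned nearby, rather than from fixed numbers.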

Some substrings refer to locations via a pattern, such as a street address or grid coordinate. The resolution process decodes these patterns using high-speed parsers and data-rich lookup tables.

Like street addresses, relative references combine patterned and named references. For example, resolving the meaning of "three quarters of a mile north of the Prudential" requires resolving a location for the Prudential building and translating 0.75 miles north.
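The translation step in the example above can be sketched with the standard great-circle destination formula on a spherical Earth. This is a minimal sketch, not MetaCarta's implementation: the function name is hypothetical, and the anchor coordinates are approximate, illustrative values for a building in Boston that stand in for whatever the resolution step would return for "the Prudential".

```python
from math import asin, atan2, cos, degrees, radians, sin
from typing import Tuple

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; spherical approximation
KM_PER_MILE = 1.609344


def destination_point(lat: float, lon: float,
                      distance_miles: float, bearing_deg: float) -> Tuple[float, float]:
    """Move `distance_miles` from (lat, lon) along `bearing_deg` (0 = north)."""
    d = distance_miles * KM_PER_MILE / EARTH_RADIUS_KM  # angular distance in radians
    lat1, lon1, theta = radians(lat), radians(lon), radians(bearing_deg)
    lat2 = asin(sin(lat1) * cos(d) + cos(lat1) * sin(d) * cos(theta))
    lon2 = lon1 + atan2(sin(theta) * sin(d) * cos(lat1),
                        cos(d) - sin(lat1) * sin(lat2))
    return degrees(lat2), degrees(lon2)


# Assumed anchor: approximate coordinates, for illustration only.
anchor_lat, anchor_lon = 42.3471, -71.0825

# "three quarters of a mile north of" the anchor
target_lat, target_lon = destination_point(anchor_lat, anchor_lon, 0.75, 0.0)
```

Moving due north leaves the longitude unchanged and raises the latitude by roughly 0.011 degrees. The harder part of the problem, deciding which "Prudential" is meant, is the named-reference resolution described earlier in this step.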

Step 3: Search Relevance

MetaCarta measures the relevance of each location referenced in a document. These scores are among the data points used to rank search results.

Geo Referencing Engine - Indexing


MetaCarta builds an optimized text and spatial index that allows documents to be rapidly retrieved based on geographic and textual elements of interest.

Traditional text indices allow documents to be retrieved based on keywords. Spatial indices allow documents to be retrieved based on geographic factors. MetaCarta uses an index that contains both text and location data types to provide sub-second response.

To use both keywords and a geographic extent as filters, one must apply both filters to an indexed list of documents. Such operations become prohibitively slow on large collections of data. Given the importance of such keyword-plus-map filtering, MetaCarta developed a specialized index that can apply both filters with a time cost that grows only with the size of the requested result set, not the collection size.
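To make the idea of a single index holding both data types concrete, the toy sketch below keys its postings on a (keyword, grid cell) pair, so a query reads only postings that already satisfy both filters. This is not MetaCarta's index design, which is not described here: the class, its methods, and the one-degree grid are assumptions. Its cost also depends on the number of grid cells examined, so it does not reproduce the scaling behavior described above.

```python
from collections import defaultdict
from math import floor
from typing import Dict, Iterable, Set, Tuple

Cell = Tuple[int, int]  # (row, column) of a latitude/longitude grid cell


class KeywordCellIndex:
    """Toy combined index: postings are keyed on (keyword, grid cell)."""

    def __init__(self, cell_degrees: float = 1.0) -> None:
        self.cell_degrees = cell_degrees
        self.postings: Dict[Tuple[str, Cell], Set[str]] = defaultdict(set)

    def _cell(self, lat: float, lon: float) -> Cell:
        return (floor(lat / self.cell_degrees), floor(lon / self.cell_degrees))

    def add(self, doc_id: str, keywords: Iterable[str],
            locations: Iterable[Tuple[float, float]]) -> None:
        cells = {self._cell(lat, lon) for lat, lon in locations}
        for keyword in keywords:
            for cell in cells:
                self.postings[(keyword.lower(), cell)].add(doc_id)

    def search(self, keyword: str, south: float, west: float,
               north: float, east: float) -> Set[str]:
        """Coarse search: returns documents in any cell overlapping the box.

        Boxes that cross the antimeridian are not handled.
        """
        low, high = self._cell(south, west), self._cell(north, east)
        hits: Set[str] = set()
        for row in range(low[0], high[0] + 1):
            for col in range(low[1], high[1] + 1):
                hits |= self.postings.get((keyword.lower(), (row, col)), set())
        return hits
```

The point of the sketch is that no separate keyword result list and spatial result list are ever built and merged. Documents matching only one of the two filters are never read.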

MetaCarta's index contains the geographic reference names and their latitude and longitude coordinates, as well as confidence, relevance, geographic term position, and geographic term proximity. The MetaCarta Platform's query engine is uniquely capable of handling hundreds of geographic text queries per second, enabling support of thousands of simultaneous users with sub-second response.

The success or failure of any search solution, including geographic search, is based on performance and accuracy. For example, if a user entered a query for “crime” in the United States in an integrated (non-MetaCarta) solution that has indexed a collection of 5 million documents, the solution would have to perform the following processes:

  • The text query string “crime” is sent to the text search engine, which returns a result set of 355,123 matching documents from the total collection.
  • The coordinates of an area that encloses the United States are sent to the spatial search engine, which returns a result set of 251,098 documents from the total collection that fall within that area. The spatial search engine uses place name coordinates derived from a geographic entity meaning tool.
  • To evaluate relevance during the list merge, a total of 606,221 seeks (355,123 + 251,098) must be performed. With an average seek time of 4 milliseconds (4-5 ms is the average seek time on modern drives), this results in a query time of about 2,425 seconds. At this point, the solution has failed to provide the necessary performance; no user would be willing to wait over 40 minutes for the results.
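The arithmetic in this example can be checked with a few lines. The figures are the ones quoted above; the variable names are only illustrative, and the one-seek-per-document cost model is the example's own simplification.

```python
# Figures quoted in the example above.
text_hits = 355_123     # documents matching the keyword "crime"
spatial_hits = 251_098  # documents located within the United States extent
seek_time_s = 0.004     # 4 ms average seek time per document

# One seek per document in either result list during the merge.
total_seeks = text_hits + spatial_hits
query_time_s = total_seeks * seek_time_s
query_time_min = query_time_s / 60

print(f"{total_seeks:,} seeks, about {query_time_s:,.0f} s "
      f"({query_time_min:.1f} minutes)")
```

The cost is driven by the sizes of the two intermediate lists, not by the number of documents that satisfy both filters, which is why merging separate text and spatial results scales poorly.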


As the example shows, a useful geographic text search must combine geographic and non-geographic components in a single query. Running the two searches separately and merging the results produces a response time that would be unacceptable for a product or any production environment, especially since the delay grows as the number of documents increases.

For More Information

More detailed technical information can be found in MetaCarta's white paper entitled "White Paper: Geographic Search and Referencing Platform".