Geographic Entity Resolution

Like other references in natural language text, geographic references are often under-specified and ambiguous. To take an extreme example, when encountering a reference to “Al Hamra”, the task is to determine which of the 65 possible places in the world with that name is meant, or even whether a place is referenced at all, since the phrase also means “red” in Arabic. The same applies to the more than two dozen U.S. towns named “Madison”. In fact, the majority of references to places are ambiguous in this way.

To handle this ambiguity, MetaCarta's geoparser leverages the data assets in our Geographic Data Modules (GDMs). While the geoparsing process utilizes advanced algorithms capable of processing massive amounts of data at lightning speeds, the heart of a geoparser is the data that allows it to think like a person. This data comes from many sources. This page describes that part of the data known as linguistic statistics.

Step 0: Manual Tagging (The Foundation)

Statistical machine learning relies on ground truth training examples generated by human annotators. This technique is used in many areas of machine learning. In natural language processing, the training examples are usually texts that annotators mark up with metadata indicating various properties of substrings in the document. For example, an annotator might identify all of the verbs in a document or all of the organization names.

To work together, a group of annotators must agree on a set of tagging guidelines. Such guidelines allow the group to produce a ground truth corpus with consistent metadata. For example, when the phrase "the city of" precedes a location reference, should it be tagged? Generally, additional guidelines must be created for new languages and genres. Even with good guidelines, human taggers still disagree on how to tag some documents.

To ensure the quality of the ground truth corpora that it uses for training and evaluating its machine learning models, MetaCarta passes each document through multiple manual taggers. Since many of MetaCarta's corpora involve specialized subjects, we retain experts capable of capturing domain-specific nuances in our ground truth.
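As a rough illustration of why multiple taggers help, disagreement between two annotators can be measured directly on their tagged spans. The sketch below is not MetaCarta's procedure; the function names `span_of` and `span_agreement`, the sample sentence, and the choice of exact-match Jaccard overlap are all assumptions made for the example.

```python
def span_of(text, phrase):
    """Return (start, end) character offsets of the first occurrence of phrase."""
    start = text.index(phrase)
    return (start, start + len(phrase))


def span_agreement(spans_a, spans_b):
    """Jaccard overlap of two annotators' exact-match span sets (1.0 = identical)."""
    a, b = set(spans_a), set(spans_b)
    if not a and not b:
        return 1.0  # both annotators tagged nothing, so they agree
    return len(a & b) / len(a | b)


text = "She flew from the city of Madison to Paris."

# Hypothetical annotators who differ on whether "the city of" belongs in the tag.
tagger_1 = [span_of(text, "Madison"), span_of(text, "Paris")]
tagger_2 = [span_of(text, "the city of Madison"), span_of(text, "Paris")]

agreement = span_agreement(tagger_1, tagger_2)
print(f"agreement = {agreement:.2f}")
```

Here the two taggers share only the "Paris" span out of three distinct spans, so a low score flags the document for review against the tagging guidelines.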

To construct a GDM, MetaCarta's scientists train and evaluate the geoparser's statistical models on corpora of ground truth documents.

Step 1: Assessing Candidate Georefs

When processing a document, the MetaCarta geoparser considers every substring of the document. For each substring, it computes the probability that the author intended for it to refer to a location. To compute this probability, the geoparser detects clues surrounding each substring. For example, in the English language, a reader might gather contextual information from the word "to" separated from other words by whitespace or punctuation.

While the presence of "to" does not fully determine the geographic meaning of subsequent substrings, it carries location-oriented meaning in the minds of both reader and author. The weight of this location-oriented meaning can be measured by counting how frequently people use the word "to" immediately before a location reference. We measure this frequency in ground truth corpora and store it in the GDM. The geoparser detects the substring "to" and other, more complicated predicates, and combines the weights of these statistical predicates for each substring.
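The counting step described above can be pictured with a small sketch. The corpus format (lists of `(word, is_location)` pairs), the function name `cue_rate`, and the three sample sentences are assumptions for illustration only; the actual GDM format is not described here.

```python
def cue_rate(corpus, cue):
    """Fraction of occurrences of `cue` that are immediately followed by a
    token tagged as a location in the ground truth."""
    hits = 0
    total = 0
    for sentence in corpus:
        # Walk over adjacent token pairs: (current token, next token).
        for (word, _), (_, next_is_location) in zip(sentence, sentence[1:]):
            if word.lower() == cue:
                total += 1
                if next_is_location:
                    hits += 1
    return hits / total if total else 0.0


# Tiny hand-tagged corpus (hypothetical): each token carries a location flag.
corpus = [
    [("We", False), ("drove", False), ("to", False), ("Madison", True)],
    [("I", False), ("want", False), ("to", False), ("sleep", False)],
    [("Flights", False), ("to", False), ("Paris", True), ("resumed", False)],
]

rate = cue_rate(corpus, "to")
print(f'"to" precedes a location in {rate:.0%} of its occurrences')
```

In this toy corpus "to" precedes a location in two of its three occurrences, which shows why the cue is informative without being decisive.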

When a substring happens to match a known place name, the geoparser uses this as another predicate: useful, but not conclusive. As indicated above, most place names can also occur in non-geographic contexts. Conversely, the geoparser can detect geographic references for which no coordinates are available. Such "out of gazetteer" names are useful to our internal processes.
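One common way to combine several inconclusive predicates, including a gazetteer match, is to add their log-likelihood ratios to a prior, in the style of naive Bayes. This is a swapped-in textbook technique, not a description of the geoparser's actual model; the function `combine_predicates`, the predicate names, and every number below are hypothetical.

```python
import math


def combine_predicates(prior, likelihood_ratios):
    """Combine independent predicates in log-odds space.

    prior: probability that an arbitrary substring is a location reference.
    likelihood_ratios: for each predicate that fired, how much more often it
    fires on true location references than on other substrings.
    """
    log_odds = math.log(prior / (1.0 - prior))
    log_odds += sum(math.log(ratio) for ratio in likelihood_ratios)
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical predicates that fired for the substring "Madison".
fired = {
    "preceded_by_to": 3.0,
    "matches_gazetteer": 8.0,
    "capitalized": 2.0,
}

p = combine_predicates(prior=0.05, likelihood_ratios=fired.values())
print(f"P(location reference) = {p:.3f}")
```

No single predicate settles the question, but together they can move a low prior to a confident estimate; a predicate with a ratio below 1 pulls the estimate down instead.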

Step 2: Resolution

For every substring with non-negligible probability of being a location reference, the geoparser looks up all possible location meanings in the GDM's gazetteer. For example, a reference to "Madison" can mean any of more than 22 places with that name. In a manner similar to Step 1, the geoparser computes probabilities for each candidate meaning.

Spatial correlations among geographic references that appear close together in the text are useful predicates in the resolution process. However, seemingly clear-cut spatial relationships are not always right. The diversity of ideas communicated in natural language prose presents many counterexamples that can only be captured by a rich collection of predicates and training on nuanced corpora.
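A minimal sketch of a spatial predicate: score each candidate meaning by its distance to other places already resolved in the same document, then normalize the scores. The coordinates are approximate, and the function names, the exponential decay, and the 500 km scale are assumptions chosen for illustration, not the geoparser's method.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def resolve(candidates, context_points, scale_km=500.0):
    """Return {candidate name: probability}, favoring candidates near the
    document's other resolved places. context_points must be non-empty."""
    scores = {}
    for name, point in candidates.items():
        nearest = min(haversine_km(point, ctx) for ctx in context_points)
        scores[name] = math.exp(-nearest / scale_km)
    total = sum(scores.values())
    return {name: score / total for name, score in scores.items()}


# Two of the many places named "Madison" (approximate coordinates).
candidates = {
    "Madison, Wisconsin": (43.07, -89.40),
    "Madison, Alabama": (34.70, -86.75),
}
# The same document also mentions Milwaukee, already resolved.
context = [(43.04, -87.91)]

probabilities = resolve(candidates, context)
best = max(probabilities, key=probabilities.get)
print(best, round(probabilities[best], 2))
```

Proximity alone would mislead on a sentence that contrasts distant places, which is why such a score is best treated as one predicate among many rather than as the deciding rule.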

Some substrings refer to locations via a pattern, such as a street address or grid coordinate. The resolution process decodes these patterns using high-speed parsers and data-rich lookup tables.
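To make the idea of a patterned reference concrete, here is a small sketch that decodes one simple pattern: decimal degrees with hemisphere letters. The regular expression and the function `parse_coordinate` are assumptions for the example; production parsers would cover many more formats, such as street addresses and military grid references.

```python
import re

# Hypothetical pattern: "<lat> N|S, <lon> E|W" in decimal degrees.
COORD = re.compile(
    r"(?<![\d.])(\d{1,2}(?:\.\d+)?)\s*°?\s*([NS])[,\s]+"
    r"(\d{1,3}(?:\.\d+)?)\s*°?\s*([EW])"
)


def parse_coordinate(text):
    """Return (lat, lon) in signed decimal degrees, or None if no valid match."""
    match = COORD.search(text)
    if match is None:
        return None
    lat = float(match.group(1)) * (1 if match.group(2) == "N" else -1)
    lon = float(match.group(3)) * (1 if match.group(4) == "E" else -1)
    if abs(lat) > 90 or abs(lon) > 180:
        return None  # matched the pattern but is not a real coordinate
    return lat, lon


print(parse_coordinate("The convoy stopped at 33.5N, 44.4E overnight."))
print(parse_coordinate("Station at 12.5 S 130.8 E reported rain."))
print(parse_coordinate("No coordinates in this sentence."))
```

Unlike a named reference, a pattern like this carries its own coordinates, so no gazetteer lookup is needed once the pattern is recognized.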

Like street addresses, relative references combine patterned and named references. For example, resolving the meaning of "three quarters of a mile north of the Prudential" requires resolving a location for the Prudential building and translating 0.75 miles north.
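The translation step in the example above can be sketched with the standard destination-point formula on a sphere. The anchor coordinates are an assumption (the text does not say which Prudential building is meant; Boston's Prudential Tower at roughly 42.347, -71.082 is used here), and `offset_point` is a hypothetical helper, not part of any MetaCarta API.

```python
import math

EARTH_RADIUS_MILES = 3958.7613  # mean Earth radius


def offset_point(lat, lon, bearing_deg, distance_miles):
    """Move distance_miles from (lat, lon) along an initial compass bearing
    (0 = north, 90 = east) on a spherical Earth."""
    phi1 = math.radians(lat)
    lam1 = math.radians(lon)
    theta = math.radians(bearing_deg)
    delta = distance_miles / EARTH_RADIUS_MILES  # angular distance in radians

    phi2 = math.asin(
        math.sin(phi1) * math.cos(delta)
        + math.cos(phi1) * math.sin(delta) * math.cos(theta)
    )
    lam2 = lam1 + math.atan2(
        math.sin(theta) * math.sin(delta) * math.cos(phi1),
        math.cos(delta) - math.sin(phi1) * math.sin(phi2),
    )
    return math.degrees(phi2), math.degrees(lam2)


# Assumed anchor: the named reference "the Prudential", resolved first.
prudential = (42.347, -71.082)

# "three quarters of a mile north of the Prudential"
lat, lon = offset_point(*prudential, bearing_deg=0.0, distance_miles=0.75)
print(f"{lat:.4f}, {lon:.4f}")
```

Moving due north changes only the latitude, here by about a hundredth of a degree; the named part of the reference still has to be resolved before the offset can be applied.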

Step 3: Search Relevance

The science of geographic information retrieval (GIR) explores different models of the relevance of each location referenced in a document. MetaCarta GTS utilizes the confidence scores generated by the geoparser as one of the inputs to the GIR models that it uses to rank search results.
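As a simple picture of how a confidence score can act as one input to ranking, the sketch below blends a text relevance score with the geoparser's confidence for the queried location. The field names, the linear blend, and the `geo_weight` parameter are all hypothetical; the GIR models actually used by MetaCarta GTS are not described here.

```python
def rank(results, geo_weight=0.5):
    """Sort results by a weighted blend of text relevance and geo confidence,
    both assumed to lie in [0, 1]."""
    def score(result):
        return ((1.0 - geo_weight) * result["text_relevance"]
                + geo_weight * result["geo_confidence"])
    return sorted(results, key=score, reverse=True)


# Hypothetical search results for a query about a location named Madison.
results = [
    {"doc": "A", "text_relevance": 0.9, "geo_confidence": 0.2},
    {"doc": "B", "text_relevance": 0.7, "geo_confidence": 0.9},
    {"doc": "C", "text_relevance": 0.4, "geo_confidence": 0.5},
]

ranked = [r["doc"] for r in rank(results)]
print(ranked)
```

Document B outranks A even though A matches the query text better, because the geoparser is far more confident that B actually refers to the queried place.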
