MetaCarta Geographic Data Modules (GDMs) make up the core of MetaCarta products and hosted solutions. A GDM is a knowledge base used to identify and disambiguate geographic references, assign latitude/longitude coordinates, and compute confidence scores and relevance rankings. Each MetaCarta GDM contains linguistic statistics, gazetteer data, and natural language processing (NLP) logic.
MetaCarta currently provides a Base GDM containing more than 13 million place name variations. In addition, MetaCarta provides several GDMs for language or industry-specific use.
With GDMs, MetaCarta has solved the problem of geographic entity resolution. Entity resolution takes a big step beyond entity extraction by providing an interpretation of the author's intended meaning. For example, within a document, instead of simply saying "this string is a name of a place," entity resolution would say "the author means this specific place" as identified by, say, an additional geographic reference within the same document.
Core Platform Component:
Optional Add-On GDMs:
GDM Technical Background
1) Linguistic Statistics
The technical discipline of NLP is an area of computer science dealing with languages developed by humans for communication with other humans. Natural languages tend to be ambiguous, especially compared to machine languages. Geographic meaning resolution is a specific subfield of NLP.
Every MetaCarta GDM contains NLP logic which is used to:
- Recognize the jargon and data types that represent geographic entities; this includes the handling of name variants and contextual rules
- Disambiguate names and establish GeoConfidence
- Establish GeoRelevance
MetaCarta's NLP model relies on linguistic statistics, which are measurements generated from manually tagged documents. MetaCarta's team of manual taggers establishes "ground truth" by annotating collections of documents in many genres and languages with geographic metadata. These carefully checked ground truth documents are used to generate the linguistic statistics that form the core of MetaCarta's GDM-based products.
Linguistic statistics allow MetaCarta solutions to go beyond simple entity extraction and move into entity resolution - a.k.a. "disambiguation." People have named a huge number of places on Earth, and even on other planets.
Conservative estimates indicate that hundreds of millions of places have colloquial and formal names. While some of these places are widely known, the vast majority are referenced less often, because fewer people know them. This type of pattern is known as a "long-tailed distribution." In statistics, a long-tailed distribution describes any process where a large number of seemingly rare events occur. To see this property of geographic references in text, one can plot a curve with the number of mentions on the Y-axis and the number of locations with that many references on the X-axis. Taken together, the less frequently mentioned places account for a huge number of references. Resolving geographic meanings from the long tail of geography requires large amounts of geo-linguistic data. With so many ways of referring to places, the geographic analogue of the translation table used in statistical machine translation must capture a wide spectrum of contexts.
Instead of aligning a translated text with the original, the statistics relevant to geographic resolution come from counting the co-occurrence of manually identified location references with linguistic and syntactic clues. That is, one takes a document that contains references to places, has a human mark up the document with metadata indicating which substrings refer to particular locations, and then has a training system count how frequently various clues co-occur with these location references. By repeating this process with many manually tagged documents, the training system develops nuanced co-occurrence statistics that embody how real human authors refer to places.
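The counting step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tiny `tagged_docs` corpus, the choice of "the preceding word" as the only clue, and the function names `count_clues` and `location_rate` are all hypothetical and are not taken from MetaCarta's training system.

```python
from collections import Counter

# Hypothetical ground truth: each document is a list of (token, is_location)
# pairs, where is_location is the label assigned by a human tagger.
tagged_docs = [
    [("flew", False), ("to", False), ("Madison", True), ("yesterday", False)],
    [("President", False), ("Madison", False), ("signed", False)],
    [("moved", False), ("to", False), ("Paris", True)],
]


def count_clues(docs):
    """Count how often each preceding word (the clue) is followed by a
    location reference versus any other token."""
    loc_counts, other_counts = Counter(), Counter()
    for doc in docs:
        for i in range(1, len(doc)):
            clue = doc[i - 1][0].lower()
            is_location = doc[i][1]
            if is_location:
                loc_counts[clue] += 1
            else:
                other_counts[clue] += 1
    return loc_counts, other_counts


def location_rate(clue, loc_counts, other_counts):
    """Fraction of occurrences of `clue` that preceded a location reference,
    or None if the clue was never seen in training."""
    total = loc_counts[clue] + other_counts[clue]
    return loc_counts[clue] / total if total else None


loc_counts, other_counts = count_clues(tagged_docs)
print(location_rate("to", loc_counts, other_counts))
print(location_rate("president", loc_counts, other_counts))
```

In this toy corpus, the word "to" always precedes a tagged location while "President" never does. A real system would presumably use far richer clues and far more documents; the sketch only shows the shape of the co-occurrence counting.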
MetaCarta generates linguistic statistics from documents that manual taggers (humans) have marked up with geotags. These geo-linguistic statistics are the foundation of MetaCarta's products.
Human beings have a remarkable ability to derive useful information from ambiguous and under-specified references using real-world knowledge and experience. They know, for example, that a reference to a place called “Madison”, in the absence of a state, is more likely to refer to “Madison, Wisconsin” than the smaller “Madison, Iowa”; and they know that James Madison and the Madison family do not refer to places at all.
MetaCarta imitates this human process using a combination of heuristics and data mining. We begin with the gazetteer (described below) and the enclosure relationships between regions and points. A given name may refer to several points or regions, or to a non-geographic concept. To deal with this ambiguity, for every potential reference of a name to a location point, we estimate the confidence that the written name really refers to that specific point. The relevance of the document to each mentioned location must also be determined, in order to present the results that best satisfy the need for both correctness and relevance to a query.
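As a rough illustration of the confidence estimate described above, the following Python sketch scores the candidate meanings of an ambiguous name. Everything in it is an assumption made for illustration: the `Candidate` class, the prior weights, the fixed boost applied when an enclosing region is mentioned, and the normalization are hypothetical and do not describe how GeoConfidence is actually computed. Relevance scoring is not covered.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    label: str       # e.g. "Madison, Wisconsin"
    container: str   # enclosing region; empty for a non-geographic sense
    prior: float     # hypothetical prior weight, e.g. learned from tagged text


def score_candidates(candidates, document_text, boost=10.0):
    """Return a confidence per candidate, normalized to sum to 1.

    A candidate's weight is multiplied by `boost` (an arbitrary illustrative
    value) when its enclosing region is mentioned in the same document.
    """
    text = document_text.lower()
    weights = []
    for c in candidates:
        w = c.prior
        if c.container and c.container.lower() in text:
            w *= boost
        weights.append(w)
    total = sum(weights)
    return {c.label: w / total for c, w in zip(candidates, weights)}


candidates = [
    Candidate("Madison, Wisconsin", "Wisconsin", 0.6),
    Candidate("Madison, Iowa", "Iowa", 0.1),
    Candidate("(not a place)", "", 0.3),
]

no_context = score_candidates(candidates, "The meeting was held in Madison.")
with_context = score_candidates(candidates, "She drove across Iowa to reach Madison.")
print(max(no_context, key=no_context.get))
print(max(with_context, key=with_context.get))
```

Without other clues, the candidate with the largest prior wins; a mention of an enclosing region elsewhere in the document shifts confidence toward the candidate it contains.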
2) Gazetteer
People have named hundreds of millions of locations. So far, civilization has only gathered a fraction of these names in digital collections called gazetteers. MetaCarta continuously gathers additional gazetteer data, because information about less commonly known locations tends to create additional insight for knowledge workers.
The MetaCarta gazetteer is a dictionary of geographic placenames and associated data about those placenames. A placename can refer to any natural or manmade object that has a known location, such as a continent, ocean, country, state, province, region, county, city, town, landmark, building or road. The MetaCarta gazetteer is one of the largest such collections in the world.
The gazetteer within any MetaCarta GDM contains:
- Name, e.g. "New York City"
- Name variants, e.g. "Big Apple"
- Latitude & Longitude, e.g. 40.71416855, -74.00639343
- Container information (e.g. county, state, country)
- Polygons (for map drawing)
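The fields listed above map naturally onto a simple record type. The Python sketch below is a hypothetical data structure, not MetaCarta's actual schema: the `GazetteerEntry` class, its field names, the `build_index` helper, and the container values given for New York City are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class GazetteerEntry:
    name: str                                        # primary name
    variants: list = field(default_factory=list)     # alternate names
    lat: float = 0.0                                 # latitude, decimal degrees
    lon: float = 0.0                                 # longitude, decimal degrees
    containers: list = field(default_factory=list)   # enclosing regions
    polygon: list = field(default_factory=list)      # (lat, lon) vertices for map drawing


def build_index(entries):
    """Map every lowercased name and variant to the entries that use it.

    A single name may map to several entries, which is where
    disambiguation becomes necessary.
    """
    index = {}
    for entry in entries:
        for name in [entry.name, *entry.variants]:
            index.setdefault(name.lower(), []).append(entry)
    return index


nyc = GazetteerEntry(
    name="New York City",
    variants=["Big Apple"],
    lat=40.71416855,
    lon=-74.00639343,
    containers=["New York", "United States"],  # illustrative values
)

index = build_index([nyc])
print(index["big apple"][0].name)
```

Indexing by variant as well as by primary name lets a lookup for "Big Apple" return the same entry, with the same coordinates, as a lookup for "New York City".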
MetaCarta takes full advantage of gazetteers like the NGA GEOnet Names Server and the USGS Geographic Names Information System. In addition, MetaCarta leverages other sources, including country gazetteers and lists of schools, hospitals, notable buildings, local landmarks, oil wells, platforms, fields, basins, government facilities, religious sites, and others.