Location Extractor

The “Location Extractor” node extracts mentions of geographic locations (also known as toponyms) from unstructured English text.

The location extraction algorithm performs several steps to recognize potential locations within a given text, followed by a disambiguation step. The disambiguation step checks hierarchical (containment) relations and identifies the correct locations by their proximity to other locations mentioned in the text (for example, it tries to distinguish between Paris, France and Paris, Texas based on the context in the given text input).
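
The proximity idea behind the disambiguation step can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Palladian's actual algorithm: the candidate list, the approximate coordinates, and the helper names `haversine_km` and `disambiguate` are all hypothetical, and the real node additionally uses hierarchical relations and further features.

```python
import math

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs in decimal degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

# Hypothetical gazetteer candidates for the ambiguous toponym "Paris"
# (coordinates are approximate and for illustration only).
CANDIDATES = {
    "Paris, France": (48.85, 2.35),
    "Paris, Texas": (33.66, -95.56),
}

def disambiguate(candidates, anchors):
    """Pick the candidate with the smallest summed distance to the anchor locations."""
    return min(
        candidates,
        key=lambda name: sum(haversine_km(candidates[name], a) for a in anchors),
    )

# Unambiguous locations assumed to be mentioned in the same text.
dallas = (32.78, -96.80)
lyon = (45.76, 4.84)

print(disambiguate(CANDIDATES, [dallas]))  # Paris, Texas
print(disambiguate(CANDIDATES, [lyon]))    # Paris, France
```

A text that also mentions Dallas pulls “Paris” towards Texas, while a text that mentions Lyon pulls it towards France.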

Each identified location in the text is returned; a location that occurs several times is returned once per occurrence. Extracted locations are classified into the following categories:

Location Type   Explanation, Example
CONTINENT       e.g. “Asia”
COUNTRY         e.g. “Japan”
CITY            e.g. “Tokyo”
ZIP             Zip code of a city
STREET          Name of a street
STREETNR        Number of a building within a street
UNIT            A political or administrative unit like a federal state, a county, or a city district (e.g. “California”, “Bavaria”, or “Manhattan”)
REGION          An area which is independent from or spanning multiple political or administrative units (e.g. “Midwest”)
POI             A human-made point of interest or a building, like hotels, museums, universities, monuments, etc. (e.g. “Stanford University” or “Tahrir Square”)
LANDMARK        Geographic features like rivers, canyons, lakes, islands, waterfalls, etc. (e.g. “Rocky Mountains”)
UNDETERMINED    An undetermined or unknown type

For each location, geographical coordinates with longitude and latitude values are provided. They are in WGS84 decimal degrees.

To use the “Location Extractor”, a “Location Source” (also known as a gazetteer) must be connected to the node’s input port. The Location Source is a database of real-world locations and meta information such as alternative names, population counts, coordinates, and hierarchical relations. Currently, the following location sources are available:

  • The “GeoNames Location Source” connects to the GeoNames REST API.
  • The “Local Location Source” is a locally hosted database.

This node uses Palladian’s location extraction mechanism – for more information see: “To Learn or to Rule: Two Approaches for Extracting Geographical Information from Unstructured Text”; Philipp Katz and Alexander Schill; Proc. of the 11th Australasian Data Mining & Analytics Conference (AusDM 2013).

Options

Input
The column in the input table which contains the text.
Disambiguation
The disambiguation method to use. Currently the following methods are supported:
  • ML (730-docs-10T): Machine-learning based disambiguation trained on 730 documents from different datasets using a sophisticated set of features.
  • ML (TUD-Loc-2013-10T): Machine-learning based disambiguation trained on the TUD-Loc-2013 dataset using a sophisticated set of features.
  • Heuristic: Disambiguation based on several rules.
Minimum Trust
Trust probability threshold in the range 0 … 1. It controls the precision/recall tradeoff. The lower the value, the more locations will be extracted, but the higher the probability of invalid extractions. With an increasing threshold, fewer locations will be extracted, but with a higher probability that all of them are correct.
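
The effect of the threshold can be illustrated with a small Python sketch. Everything here is hypothetical: the candidate names, trust scores, and correctness flags are made up, the `evaluate` helper is not part of the node, and it is an assumption that a candidate is kept when its score is greater than or equal to the threshold.

```python
# Hypothetical candidates: (name, trust score, is the extraction actually correct?)
CANDIDATES = [
    ("Tokyo", 0.95, True),
    ("Paris", 0.80, True),
    ("Mobile", 0.55, False),   # ordinary word mistaken for the city
    ("Bavaria", 0.45, True),
    ("Reading", 0.20, False),  # ordinary word mistaken for the town
]

def evaluate(candidates, min_trust):
    """Return (number kept, precision, recall) for a given trust threshold."""
    # Assumption: a candidate passes when its score is >= the threshold.
    kept = [c for c in candidates if c[1] >= min_trust]
    correct_kept = sum(1 for c in kept if c[2])
    total_correct = sum(1 for c in candidates if c[2])
    precision = correct_kept / len(kept) if kept else 1.0
    recall = correct_kept / total_correct
    return len(kept), precision, recall

for threshold in (0.1, 0.5, 0.7):
    kept, precision, recall = evaluate(CANDIDATES, threshold)
    print(f"min trust {threshold}: kept={kept} "
          f"precision={precision:.2f} recall={recall:.2f}")
```

On this toy data, a threshold of 0.1 keeps all five candidates (full recall, two wrong extractions), while 0.7 keeps only two, both correct, at the cost of missing “Bavaria”.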
Output
Specify how the extracted locations should be mapped to column values:
  • Rows: Create one new row for each location found in the text. In case there is more than one location, this will append multiple rows per input, or no row in case no match was found.
  • Rows or Missing: Same as “Rows”, but append a row with missing value cells when no location was found for an input row.
  • JSON: Append a JSON array with the extracted locations and detailed location information.
Output Column Prefix (*)
Set a prefix for the appended column names.
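
The difference between the three “Output” modes described above can be sketched in Python as follows. This is a simplified model, not the node's implementation: the `map_output` helper, the dictionary fields, and the sample texts are hypothetical, and it is an assumption that the JSON mode yields an empty array when no location is found.

```python
import json

MODES = {"Rows", "Rows or Missing", "JSON"}

def map_output(rows, mode):
    """Map (text, [location dicts]) pairs to output rows for the given mode."""
    if mode not in MODES:
        raise ValueError(f"unknown output mode: {mode}")
    out = []
    for text, locations in rows:
        if mode == "JSON":
            # One output row per input row; locations serialized as a JSON array.
            out.append((text, json.dumps(locations)))
        elif locations:
            # One output row per extracted location.
            out.extend((text, loc["name"]) for loc in locations)
        elif mode == "Rows or Missing":
            # No location found: keep the input row with a missing value.
            out.append((text, None))
        # Mode "Rows" with no location found: the input row is dropped.
    return out

# Hypothetical extraction results for two input rows.
rows = [
    ("Flew from Tokyo to Paris.",
     [{"name": "Tokyo", "type": "CITY"}, {"name": "Paris", "type": "CITY"}]),
    ("No places here.", []),
]

for mode in ("Rows", "Rows or Missing", "JSON"):
    print(mode, "->", len(map_output(rows, mode)), "output rows")
```

With these two inputs, “Rows” yields two output rows, “Rows or Missing” yields three (the last one with a missing value), and “JSON” yields one row per input.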

Input Ports

Connector to a Location Source.
Table with a column holding text from which to extract locations.

Output Ports

Table with rows for each extracted location from the provided text inputs. It provides columns with the normalized name of the location (e.g. the short form “L.A.” occurring in the text is returned by its full form “Los Angeles”), the type of the location, the geographical coordinates in WGS84 decimal degrees, and the population of the location (if applicable).

Views

This node has no views
