Analysis of geographical information in textual data

On April 28th, 2017, the CIST welcomed Taylor Arnold, professor in the Department of Statistics and Data Science at Yale University and visiting professor at Université Paris Diderot (LARCA/CLILLAC-ARP joint initiative).

Introducing the lecture, Claude Grasland centred his presentation on the spatial analysis of lists of places defined qualitatively by their names. He underlined that spatial entities are easier to recognise for states than for cities. He then developed two examples of the analysis of lists of places: on the one hand, the analysis of students' mental representations of the countries of the world (FP7 EuroBroadMap project, 2009-2013); on the other, the contribution of the analysis of RSS feeds from international media to the study of geopolitical power relations between the countries of the world (ANR Géomédia project, 2013-2016).

Cover of the book Humanities Data in R, Springer

Taylor Arnold gave a historical overview of exploratory data analysis, identifying the major actors in the field. He then underlined the turning point that the development of S, S+ and finally R represented for the spread of exploratory data analysis. Finally, he demonstrated in more detail the contribution of this type of analysis to digital humanities through the example of textual analysis and the CoreNLP package.

The presentation by Marianne Guérois and Malika Madelin was an exploratory analysis of Airbnb data (scraped by the InsideAirbnb platform). They used textual analysis to enrich the work done as part of the Grandes métropoles project. Their textual analysis is based on three fields, which are combined with location information: the title and the description of homes (written by hosts) and the comments (written by guests). The study tackles two issues. Firstly, it raises questions about the languages used in relation to the targeted customers and about the emergence of language communities. Secondly, the "naming" of home locations (by hosts) and their description (by hosts and guests) raise the question of possible mismatches between written and spatial locations. From a methodological point of view, one of the main difficulties lies in the recognition of geographic places. For instance, what does "Eiffel" refer to: proximity to the site, or a view of the tower from an apartment? Among the results, the map of words by Parisian district shows the hierarchy of places within the city, for example tourist areas. It also highlights the fact that tourist areas can be cited outside their "actual" zone. They ended their presentation by examining the words that express central location and how far they stretch out across the city, based on xy locations.
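
The idea of counting place names by district, and of spotting places cited outside their "actual" zone, can be illustrated with a minimal Python sketch. It is not the method used by the speakers: the listings, the district labels and the small gazetteer below are all hypothetical, and the matching is a plain substring search, which does not settle whether "Eiffel" means proximity or a view.

```python
from collections import Counter, defaultdict

# Hypothetical listings: (district, free text written by the host).
LISTINGS = [
    ("7th", "Bright studio with a view of the Eiffel Tower"),
    ("15th", "Quiet flat, ten minutes on foot from Eiffel"),
    ("18th", "Room in Montmartre, Eiffel visible from the balcony"),
    ("18th", "Cosy nest near Sacre-Coeur in Montmartre"),
]

# Hypothetical gazetteer: place keyword -> district where the place actually is.
GAZETTEER = {"eiffel": "7th", "montmartre": "18th"}


def mentions_by_district(listings, gazetteer):
    """Count, per district, how many listings mention each gazetteer keyword."""
    counts = defaultdict(Counter)
    for district, text in listings:
        lowered = text.lower()
        for keyword in gazetteer:
            # Naive substring match: an assumption made for illustration only.
            if keyword in lowered:
                counts[district][keyword] += 1
    return counts


def outside_mentions(counts, gazetteer):
    """List (keyword, district, n) where a place is cited outside its own district."""
    return sorted(
        (keyword, district, n)
        for district, counter in counts.items()
        for keyword, n in counter.items()
        if gazetteer[keyword] != district
    )


counts = mentions_by_district(LISTINGS, GAZETTEER)
for keyword, district, n in outside_mentions(counts, GAZETTEER):
    print(f"'{keyword}' cited {n} time(s) in the {district} district")
```

With real data, the district of each listing would be derived from its xy location, and the keyword list would have to handle spelling variants and several languages.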