Research outputs
Serere, H. N., & Resch, B. (2024). Understanding the impact of geotagging on location inference models for accurate generalization to non-geotagged datasets.
For a location inference model to be successful, the properties of the geotagged tweets on which the model is developed need to match those of the non-geotagged tweets to which it is applied. We investigated location mentions within the tweet text field of 3,953,166 geotagged and 2,783,609 non-geotagged tweets across five of the most prominent Twitter sources. Specifically, we compared the frequency and the location entity types used to infer locations in the two datasets. Overall, we found statistically significant differences in location mentions between the two datasets. However, thirteen of the fifteen analysed location entities, although statistically significant, showed low effect sizes. We conclude that location inference models trained on geotagged datasets can generalize to a non-geotagged dataset if special adjustments are made in the development of the models.
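As an illustration of the kind of comparison this study describes, the sketch below runs a chi-square test on a two-by-two contingency table of location-entity mentions and derives Cramér's V as an effect size. The counts, function names, and interpretation are illustrative assumptions, not the paper's actual data or code.

```python
# Hypothetical sketch: testing whether one location entity type is mentioned
# at different rates in a geotagged vs. a non-geotagged sample, and reporting
# an effect size alongside the p-value. Counts are made up for illustration.
import numpy as np
from scipy.stats import chi2_contingency

def cramers_v(table: np.ndarray) -> float:
    """Effect size for a chi-square test on a contingency table."""
    chi2, _, _, _ = chi2_contingency(table)
    n = table.sum()
    r, c = table.shape
    return float(np.sqrt(chi2 / (n * (min(r, c) - 1))))

# Rows: geotagged / non-geotagged; columns: entity mentioned / not mentioned.
table = np.array([[12_000, 3_941_166],
                  [ 9_500, 2_774_109]])

chi2, p, _, _ = chi2_contingency(table)
print(f"chi2={chi2:.1f}, p={p:.3g}, Cramer's V={cramers_v(table):.4f}")
# With samples in the millions, tiny rate differences become statistically
# significant; a small Cramér's V flags exactly the low-effect-size case
# the abstract reports for most entity types.
```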
Hu, X., Elßner, T., Zheng, S., Serere, H. N., Kersten, J., Klan, F., & Qiu, Q. (2024). DLRGeoTweet: A comprehensive social media geocoding corpus featuring fine-grained places. Information Processing & Management.
Every day, many short text messages on social media are generated in response to real-world events, providing a valuable resource for various domains such as emergency response and traffic management. Since users rarely attach exact coordinates to social media posts, accurately recognizing and resolving fine-grained place names, such as home addresses and Points of Interest, from these posts is crucial for understanding the precise locations of critical events, such as rescue requests. This task, known as geoparsing, involves toponym recognition and toponym resolution, or geocoding. However, existing social media datasets for evaluating geoparsing approaches often lack sufficient fine-grained place names with associated geo-coordinates or links to gazetteers, making it challenging to evaluate, compare, and train geocoding methods for such locations. Moreover, the absence of supportive annotation tools compounds this challenge. To address these gaps, we implemented a lightweight Python annotation tool leveraging Nominatim. Using this tool, we annotated a comprehensive X (formerly Twitter) geocoding corpus called DLRGeoTweet. The corpus underwent a rigorous cross-validation process to guarantee its quality. It includes a total of 7,364 tweets and 12,510 places, of which 6,012 are fine-grained, and comprises two global datasets encompassing worldwide events and three local datasets related to local events such as the 2017 Hurricane Harvey. The annotation process spanned over ten months and required approximately 1,000 person-hours to complete. We then evaluated 15 recent and representative geocoding approaches, many of them deep learning-based, on DLRGeoTweet. The results highlight the inherent challenges in resolving fine-grained places accurately. Despite increasing access constraints on Twitter data, our corpus's focus on short, informal text makes it a valuable resource for geocoding across multiple social media platforms.
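The paper describes a lightweight Python annotation tool built on Nominatim; the actual tool is not reproduced here, but the sketch below shows the kind of Nominatim lookup such a tool could build on, using geopy's Nominatim client. The user agent string and example query are placeholders.

```python
# Hypothetical sketch of a Nominatim lookup as an annotation tool might use it.
# "demo_agent" is a placeholder; Nominatim requires a real, identifying agent.
from geopy.geocoders import Nominatim

geocoder = Nominatim(user_agent="demo_agent")

def resolve_place(name: str):
    """Return (lat, lon, canonical address) for a place name, or None."""
    hit = geocoder.geocode(name, exactly_one=True, timeout=10)
    if hit is None:
        return None
    return hit.latitude, hit.longitude, hit.address

print(resolve_place("George R. Brown Convention Center, Houston"))
```

Nominatim's public endpoint enforces strict rate limits, so a real annotation workflow would need to throttle requests or run against a self-hosted instance.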
Serere, H. N., Resch, B., & Havas, C. R. (2023). Enhanced geocoding precision for location inference of tweet text using spaCy, Nominatim and Google Maps: A comparative analysis of the influence of data selection.
Twitter location inference methods are developed with the purpose of increasing the percentage of geotagged tweets by inferring locations for a non-geotagged dataset. For validation, these location inference methods are developed on a fully geotagged dataset in which the attached Global Navigation Satellite System coordinates are used as ground truth. Whilst a substantial number of location inference methods have been developed to date, questions arise pertaining to the generalizability of the developed models to a non-geotagged dataset. This paper proposes a high-precision location inference method for inferring a tweet's point of origin based on location mentions within the tweet text. We investigate the influence of data selection by comparing model performance on two datasets. For the first dataset, we use a proportionate sample of tweet sources from a geotagged dataset. For the second dataset, we use a modelled distribution of tweet sources following a non-geotagged dataset. Our results showed that the distribution of tweet sources influences the performance of location inference models. Using the first dataset, we outperformed state-of-the-art location extraction models by inferring 61.9%, 86.1% and 92.1% of the extracted locations within 1 km, 10 km and 50 km radius values, respectively. Using the second dataset, however, our precision values dropped to 45.3%, 73.1% and 81.0% for the same radius values.
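The radius-based evaluation described above can be made concrete with a short sketch: compute the great-circle distance between each inferred location and its geotagged ground truth, then report the share falling within 1, 10 and 50 km. The coordinates and function names below are illustrative, not the paper's pipeline.

```python
# Hypothetical sketch of precision-at-radius evaluation against GNSS
# ground truth. Example coordinates are arbitrary placeholders.
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two WGS84 points in kilometres."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def precision_at(radii_km, predicted, truth):
    """Fraction of predictions within each radius of the ground truth."""
    dists = [haversine_km(*p, *t) for p, t in zip(predicted, truth)]
    return {r: sum(d <= r for d in dists) / len(dists) for r in radii_km}

pred = [(47.80, 13.04), (40.71, -74.01)]
true = [(47.81, 13.05), (40.75, -73.99)]
print(precision_at([1, 10, 50], pred, true))
```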
Serere, H. N., Kanilmaz, U. N., Ketineni, S., & Resch, B. (2023). A Comparative Study of Geocoder Performance on Unstructured Tweet Locations.
Geocoding is the process of converting human-readable addresses into latitude and longitude coordinates. Whilst most geocoders tend to perform well on structured addresses, their performance drops significantly in the presence of unstructured addresses, such as locations written in informal language. In this paper, we make an extensive comparison of geocoder performance on unstructured location mentions within tweets. Using nine geocoders and a worldwide English-language Twitter dataset, we compare the geocoders' recall, precision, consensus and bias values. As in previous similar studies, Google Maps showed the highest overall performance. However, with the exception of Google Maps, we found that geocoders which use open data outperform those which do not. The open-data geocoders showed the least per-continent bias and the highest consensus with Google Maps. These results suggest the possibility of improving geocoder performance on unstructured locations by extending or enhancing the quality of openly available datasets.
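A minimal sketch of one of the measures compared here, pairwise consensus, is given below: the share of queries on which two geocoders resolve to points within a threshold distance of each other. The coordinates and the 10 km threshold are illustrative assumptions, not the study's parameters.

```python
# Hypothetical sketch of a pairwise consensus measure between two geocoders.
# Each result list holds one (lat, lon) tuple, or None, per query.
from geopy.distance import geodesic

def consensus(results_a, results_b, threshold_km=10.0):
    """Share of mutually resolved queries where the two results agree."""
    agree = resolved = 0
    for a, b in zip(results_a, results_b):
        if a is None or b is None:
            continue  # skip queries either geocoder failed to resolve
        resolved += 1
        if geodesic(a, b).km <= threshold_km:
            agree += 1
    return agree / resolved if resolved else 0.0

google = [(29.76, -95.37), (51.51, -0.13), None]
osm    = [(29.75, -95.36), (48.86, 2.35), (35.68, 139.69)]
print(f"consensus@10km = {consensus(google, osm):.2f}")
```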
Serere, H. N., & Resch, B. (2023). Syntactical Text Analysis to Disambiguate between Twitter Users' In-situ and Remote Location. First International Workshop on Geographic Information Extraction from Texts at ECIR 2023.
The precision of text-based location inference models, which aim to identify a tweet's point of origin by analysing the post's text, is strongly influenced by differing location mentions. This particularly concerns the description of remote locations, i.e., locations that do not coincide with the user's location when posting a tweet. To filter out remote location mentions, keyword filtering, temporal information matching and rule-based matching approaches have been used. However, these methods fail to take into account the tweets' syntax and hence produce low performance. We propose an advanced Named Entity Recognition model that not only extracts location entities but also distinguishes between remote and in-situ location mentions based on the text's surrounding grammatical cues. We train our algorithm on a base spaCy model, which exhibits moderate performance on a relatively small training size. Preliminary results show that our approach outperforms similar studies and suggest the possibility of distinguishing between in-situ and remote location mentions with higher precision upon further refinement of the study design.
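A hedged sketch of the general approach, fine-tuning a spaCy NER pipeline with separate labels for in-situ and remote mentions, is given below. The labels, training examples and blank English pipeline are illustrative assumptions; the paper's actual training data, base model and label scheme are not reproduced here.

```python
# Hypothetical sketch: training a spaCy NER component whose labels separate
# in-situ from remote location mentions. Labels and examples are invented.
import spacy
from spacy.training import Example

nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")
for label in ("IN_SITU_LOC", "REMOTE_LOC"):
    ner.add_label(label)

TRAIN = [
    ("Stuck in traffic on I-10 in Houston right now",
     {"entities": [(28, 35, "IN_SITU_LOC")]}),   # "right now" cues presence
    ("Praying for everyone in Puerto Rico tonight",
     {"entities": [(24, 35, "REMOTE_LOC")]}),    # "praying for" cues distance
]

optimizer = nlp.initialize()
for _ in range(20):  # a handful of epochs on a toy set
    for text, ann in TRAIN:
        example = Example.from_dict(nlp.make_doc(text), ann)
        nlp.update([example], sgd=optimizer)

doc = nlp("Heavy rain in Houston right now")
print([(ent.text, ent.label_) for ent in doc.ents])
```

The design point the abstract makes is that the surrounding grammatical cues, not the place name itself, carry the in-situ/remote signal, which is why the two labels share identical place names in different syntactic frames.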
Serere, H. N., Resch, B., & Havas, C. (2022). Extracting and Geocoding Locations in Social Media Posts: A Comparative Analysis.
Geo-social media have become an established data source for spatial analysis of geographic and social processes in various fields. However, only a small share of geo-social media data are explicitly georeferenced, which often compromises the reliability of the analysis results by excluding large volumes of data from the analysis. To increase the number of georeferenced tweets, inferred locations can be extracted from the texts of social media posts. We propose a customized workflow for location extraction from tweets and subsequent geocoding. We compare the results of two methods: DBpedia Spotlight (using linked Wikipedia entities), and spaCy combined with the geocoding methods of OpenStreetMap Nominatim. The results suggest that the workflow using spaCy and Nominatim identifies more locations than DBpedia Spotlight. For 50,616 tweets posted within California, USA, the granularity of the extracted locations is reasonable. However, several directions for future research were identified, including improved semantic analysis, the creation of a cascading workflow, and the need to integrate different data sources in order to increase reliability and spatial accuracy.
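The two-stage workflow compared in this study, named-entity extraction followed by Nominatim geocoding, can be sketched roughly as below. The pretrained model name, entity-label filter and user agent are assumptions for illustration, and the en_core_web_sm model must be installed separately.

```python
# Hypothetical sketch of a spaCy + Nominatim extract-then-geocode workflow.
# Requires: pip install spacy geopy && python -m spacy download en_core_web_sm
import spacy
from geopy.geocoders import Nominatim

nlp = spacy.load("en_core_web_sm")           # pretrained English pipeline
geocoder = Nominatim(user_agent="demo_agent")  # placeholder user agent

def geocode_tweet(text: str):
    """Yield (mention, lat, lon) for each resolvable location in the text."""
    for ent in nlp(text).ents:
        if ent.label_ in ("GPE", "LOC", "FAC"):   # place-like entity labels
            hit = geocoder.geocode(ent.text, timeout=10)
            if hit is not None:
                yield ent.text, hit.latitude, hit.longitude

for mention, lat, lon in geocode_tweet(
        "Wildfire smoke over Sacramento, California today"):
    print(mention, lat, lon)
```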
Serere, H. N., Ettema, J., & Jetten, V. (2018). Developing a worst-case Tropical Cyclone rainfall scenario for flooding on Dominica. (Master's thesis)
Freshwater flooding as a result of tropical cyclone rainfall is a hydrometeorological hazard that must be prepared for. When not adequately prepared for, freshwater flooding results in immense damage that disrupts economies, displaces settlements, and deepens poverty. To prepare for and mitigate tropical cyclone (TC) induced freshwater flooding, countries make use of design storms. One disadvantage, however, is that design storms differ considerably from actual storm events with respect to both spatial and temporal rainfall structure. Design storms tend to lose vital storm information, which influences the results of the simulated flood hazards. When dealing with extreme rainfall events, this simplification of storm traits may have major implications for flood mitigation decisions, owing to the differences in simulated flood characteristics between the design storm and the extreme rainfall event. This research evaluated the flood implications of simulating a worst-case tropical cyclone rainfall scenario against a design storm of comparable rainfall characteristics. Using the southern catchments of Dominica as a study area and the 2017 Atlantic basin TC Maria as a proof of concept, the research was carried out in two main steps. First, a method was developed to extract extreme rainfall pixels from the passage of a TC given temporal layers of precipitation images. Then, using the extracted extreme rainfall pixels and Dominica's 100-year design storm, the flood characteristics of the TC scenarios and the design storm were compared. Based on the flood characteristics of the worst-case rainfall scenario (the extreme TC rainfall pixel with the highest simulated flood characteristics) and the 100-year design storm, the economic flood implications of the two approaches were evaluated. Contrary to common perception, the analysis showed the extreme rainfall pixels of TC Maria to originate from its category 2 and 3 stages. Of the extreme TC rainfall pixels used as TC scenarios, the worst-case rainfall scenario resulted from a high-intensity pixel with a maximum intensity of 107 mm/hr, three peak intensity values, and a shortest distance of 10 km from the TC eye. Comparisons between the flood characteristics of the TC scenarios and the 100-year design storm showed the design storm to have overall shorter flood start times, higher flood volumes, larger flooded areas and greater flood heights. Based on these flood characteristics, the 100-year design storm was concluded to overestimate flood characteristics, which would imply overestimated flood mitigation measures.
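The first step described above, extracting extreme rainfall pixels from temporal layers of precipitation images, can be illustrated with a rough numpy sketch that ranks pixels by their peak intensity over the storm's passage. The array shapes, random test data and top-n selection rule are assumptions, not the thesis's method.

```python
# Hypothetical sketch: given a (time, rows, cols) stack of rainfall rasters
# covering a TC's passage, return the pixels with the highest peak intensity.
import numpy as np

def extreme_rainfall_pixels(stack: np.ndarray, top_n: int = 5):
    """stack: (time, rows, cols) rainfall intensities in mm/hr.
    Returns (row, col, peak_intensity) for the top_n pixels by peak value."""
    peak = stack.max(axis=0)                        # per-pixel maximum over time
    flat = np.argsort(peak, axis=None)[::-1][:top_n]  # flat indices, descending
    rows, cols = np.unravel_index(flat, peak.shape)
    return [(int(r), int(c), float(peak[r, c])) for r, c in zip(rows, cols)]

rng = np.random.default_rng(0)
stack = rng.gamma(shape=2.0, scale=10.0, size=(48, 100, 100))  # 48 hourly layers
for r, c, mm in extreme_rainfall_pixels(stack):
    print(f"pixel ({r},{c}) peaked at {mm:.0f} mm/hr")
```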