A Grading System for Evaluating Geospatial Entity Connectivity from Texts Using Co-occurrences, Semantic Similarity and Geodesic Distance
Eirini Katsadaki *
School of Rural, Surveying and Geoinformatics Engineering, National Technical University of Athens, Athens, Greece.
Georgios Bougas
School of Rural, Surveying and Geoinformatics Engineering, National Technical University of Athens, Athens, Greece.
Margarita Kokla
School of Rural, Surveying and Geoinformatics Engineering, National Technical University of Athens, Athens, Greece.
*Author to whom correspondence should be addressed.
Abstract
Extracting entity connectivity from texts is important for uncovering how places relate within real-world discourse. While structured data is informative, textual data captures rich contextual and semantic knowledge, making it possible to identify hidden networks of interdependence and thematic connections among geographic entities. Entity connectivity is not merely complementary to information retrieval; it is essential to activities such as event analysis, spatial decision support, urban studies, and knowledge graph development. This research proposes two versions of a grading system for evaluating connectivity between cities and other geopolitical entities, places, and events extracted from texts: System A, based on co-occurrences and semantic similarity, and System B, which adds geodesic distance as a further feature. The proposed grading systems may find practical application in domains such as large-scale geographic information extraction, place-based information retrieval, and knowledge graph construction from unstructured data sources.
The two systems are evaluated and compared using six machine learning algorithms: Random Forest, Gradient Boosting, Multi-Layer Perceptron (MLP), K-Nearest Neighbors (KNN), Decision Tree, and Support Vector Machine (SVM). The performance of the algorithms is analyzed by measuring accuracy, precision, recall, F1-score, and R². Decision Tree was the best-performing algorithm for System A, with an accuracy of 85%, while KNN was the best-performing algorithm for System B, with an accuracy of 77%. The results show that the system without geodesic distance performs better on general texts, indicating that geographic features can introduce noise in text-driven contexts where spatial proximity is implicit or semantically inferred, and should therefore be applied selectively.
Keywords: grading system, connectivity, semantic similarity, machine learning
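To make the three features named in the abstract concrete, the following is a minimal Python sketch of how a feature vector for a pair of entities might be assembled. It is an illustration under stated assumptions, not the authors' implementation: the function names, the sentence-level co-occurrence window, the toy embedding vectors, and the use of the haversine (spherical great-circle) formula as a stand-in for geodesic distance are all hypothetical choices made here.

```python
import math
from collections import Counter
from itertools import combinations


def cooccurrence_counts(sentence_entities):
    """Count how often each unordered entity pair shares a sentence.

    Assumption: co-occurrence is measured per sentence; the paper may use
    a different window.
    """
    counts = Counter()
    for entities in sentence_entities:
        # Deduplicate and sort so each pair has one canonical key.
        for pair in combinations(sorted(set(entities)), 2):
            counts[pair] += 1
    return counts


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0088):
    """Great-circle distance in km on a sphere of mean Earth radius.

    Assumption: a spherical approximation; a true ellipsoidal geodesic
    would differ slightly.
    """
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * radius_km * math.asin(min(1.0, math.sqrt(a)))


def pair_features(a, b, counts, vectors, coords, use_distance=False):
    """Features for one entity pair: System A style (two features) or,
    with use_distance=True, System B style (adds distance in km)."""
    key = tuple(sorted((a, b)))
    features = [counts.get(key, 0), cosine_similarity(vectors[a], vectors[b])]
    if use_distance:
        features.append(haversine_km(*coords[a], *coords[b]))
    return features


# Toy data, invented for illustration only.
sentences = [
    ["Athens", "Thessaloniki"],
    ["Athens", "Thessaloniki", "Patras"],
    ["Patras"],
]
vectors = {
    "Athens": [0.9, 0.1, 0.3],
    "Thessaloniki": [0.8, 0.2, 0.4],
    "Patras": [0.1, 0.9, 0.2],
}
coords = {
    "Athens": (37.9838, 23.7275),
    "Thessaloniki": (40.6401, 22.9444),
    "Patras": (38.2466, 21.7346),
}
counts = cooccurrence_counts(sentences)

print(pair_features("Athens", "Thessaloniki", counts, vectors, coords))
print(pair_features("Athens", "Thessaloniki", counts, vectors, coords,
                    use_distance=True))
```

Feature vectors of this kind could then be passed, together with connectivity grades, to any of the six classifiers listed in the abstract. Switching `use_distance` on or off mirrors the System A versus System B comparison.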