Applied Network Science
Volume 4, Issue 1, 2019
Network-theoretic information extraction quality assessment in the human trafficking domain (Article) (Open Access)
Kejriwal M.*, Kapoor R.
Information Sciences Institute, University of Southern California, 4676 Admiralty Way, Ste. 1001, Marina del Rey, CA, United States
Abstract
Information extraction (IE) is an important problem in the Natural Language Processing (NLP) and Web Mining communities. Recently, IE has been applied to online sex advertisements with the goal of powering search and analytics systems that can help law enforcement investigate human trafficking (HT). Extracting key attributes such as names, phone numbers and addresses from online sex ads is extremely challenging, since such webpages contain boilerplate, obfuscation, and extraneous text in unusual language models. Assessing the quality of an IE system is important but particularly difficult in this domain because gold standard datasets are lacking. Furthermore, building a robust ground truth from scratch is an expensive and time-consuming task for social scientists and law enforcement to undertake. In this article, we undertake the empirical challenge of analyzing the quality of IE outputs in the HT domain without laboriously annotated ground truths. Specifically, we use concepts from network science to construct and study an extraction graph from IE outputs collected over a corpus of online sex ads. Our studies show that network metrics, which require no labeled ground truths, share interesting and consistent correlations with IE accuracy metrics (e.g., precision and recall) that do require ground truths. Our methods can potentially be applied to compare the quality of different IE systems in the HT domain without access to ground truths. © 2019, The Author(s).
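The idea of building an extraction graph from IE outputs and reading label-free network metrics off it can be illustrated with a minimal sketch. The abstract does not specify how the graph is built or which metrics are used, so everything here is an assumption made for illustration: the bipartite ad-to-value layout, the input format, the function names, the toy data, and the choice of metrics (component count, largest component size, mean degree).

```python
from collections import defaultdict


def build_extraction_graph(extractions):
    """Build an undirected bipartite graph linking ads to extracted values.

    `extractions` maps an ad id to the attribute values an IE system
    extracted from that ad (hypothetical input format, not from the paper).
    """
    graph = defaultdict(set)
    for ad_id, values in extractions.items():
        ad_node = ("ad", ad_id)
        graph[ad_node]  # ads with no extractions still appear as isolated nodes
        for value in values:
            value_node = ("value", value)
            graph[ad_node].add(value_node)
            graph[value_node].add(ad_node)
    return dict(graph)


def connected_components(graph):
    """Return the connected components as a list of node sets (iterative DFS)."""
    seen = set()
    components = []
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        component = set()
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbor in graph[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        components.append(component)
    return components


def network_summary(graph):
    """Compute a few metrics that need no labeled ground truth."""
    components = connected_components(graph)
    num_nodes = len(graph)
    num_edges = sum(len(neighbors) for neighbors in graph.values()) // 2
    return {
        "nodes": num_nodes,
        "edges": num_edges,
        "components": len(components),
        "largest_component": max((len(c) for c in components), default=0),
        "mean_degree": (2 * num_edges / num_nodes) if num_nodes else 0.0,
    }


# Toy IE output: two ads share a phone number, one ad yields nothing.
toy_extractions = {
    "ad1": ["555-0100", "Alice"],
    "ad2": ["555-0100"],
    "ad3": ["Bob"],
    "ad4": [],
}
summary = network_summary(build_extraction_graph(toy_extractions))
```

Under these assumptions, summaries of this kind could be computed for the outputs of several IE systems on the same corpus and then compared against precision and recall wherever a small labeled sample happens to exist.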
Link
https://www.scopus.com/inward/record.uri?eid=2-s2.0-85068084380&doi=10.1007%2fs41109-019-0154-z&partnerID=40&md5=86712c2233ff70aef67fdbc1d591ab00
DOI: 10.1007/s41109-019-0154-z
ISSN: 2364-8228
Original Language: English