Using a Secure, Continually Updating, Web Source Processing Pipeline to Support the Real-Time Data Synthesis and Analysis of Scientific Literature: Development and Validation Study
Main author: | Vaghela, Uddhav |
Publication date: | 2021 |
Other authors: | Rabinowicz, Simon; Bratsos, Paris; Martin, Guy; Fritzilas, Epameinondas; Markar, Sheraz; Purkayastha, Sanjay; Stringer, Karl; Singh, Harshdeep; Llewellyn, Charlie; Dutta, Debabrata; Clarke, Jonathan M.; Howard, Matthew; Serban, Ovidiu; Kinross, James; Sá-Marta, Eduarda; et al. |
Document type: | Article |
Language: | eng |
Source title: | Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos) |
Full text: | http://hdl.handle.net/10316/105182 https://doi.org/10.2196/25714 |
Abstract: | The scale and quality of the global scientific response to the COVID-19 pandemic have unquestionably saved lives. However, the COVID-19 pandemic has also triggered an unprecedented "infodemic"; the velocity and volume of data production have overwhelmed many key stakeholders, such as clinicians and policy makers, as they have been unable to process structured and unstructured data for evidence-based decision making. Solutions that aim to alleviate this data synthesis-related challenge are unable to capture heterogeneous web data in real time for the production of concomitant answers and are not based on the high-quality information in responses to a free-text query. Objective: The main objective of this project is to build a generic, real-time, continuously updating curation platform that can support the data synthesis and analysis of a scientific literature framework. Our secondary objective is to validate this platform and the curation methodology for COVID-19–related medical literature by expanding the COVID-19 Open Research Dataset via the addition of new, unstructured data. Methods: To create an infrastructure that addresses our objectives, the PanSurg Collaborative at Imperial College London has developed a unique data pipeline based on a web crawler extraction methodology. This data pipeline uses a novel curation methodology that adopts a human-in-the-loop approach for the characterization of quality, relevance, and key evidence across a range of scientific literature sources. Results: REDASA (Realtime Data Synthesis and Analysis) is now one of the world’s largest and most up-to-date sources of COVID-19–related evidence; it consists of 104,000 documents. By capturing curators’ critical appraisal methodologies through the discrete labeling and rating of information, REDASA rapidly developed a foundational, pooled, data science data set of over 1400 articles in under 2 weeks. 
These articles provide COVID-19–related information and represent around 10% of all papers about COVID-19. Conclusions: This data set can act as ground truth for the future implementation of a live, automated systematic review. The three benefits of REDASA’s design are as follows: (1) it adopts a user-friendly, human-in-the-loop methodology by embedding an efficient, user-friendly curation platform into a natural language processing search engine; (2) it provides a curated data set in the JavaScript Object Notation format for experienced academic reviewers’ critical appraisal choices and decision-making methodologies; and (3) due to the wide scope and depth of its web crawling method, REDASA has already captured one of the world’s largest COVID-19–related data corpora for searches and curation. |
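The abstract notes that REDASA delivers its curated data set in the JavaScript Object Notation format, with curators' discrete labels and ratings captured per document. As a purely hypothetical illustration (these field names are assumptions for the sketch, not REDASA's published schema), one such curation record might be serialized like this:

```python
import json

# Hypothetical single human-in-the-loop curation record, sketching the idea
# of discrete labels and ratings per document described in the abstract.
# All field names and values below are illustrative assumptions.
record = {
    "document_id": "example-0001",               # assumed identifier
    "source_url": "https://example.org/paper",   # placeholder source URL
    "curator_labels": {
        "relevance": "high",                     # discrete relevance label
        "evidence_quality": 4,                   # rating on an assumed 1-5 scale
        "key_evidence": "illustrative extracted sentence",
    },
}

# Serialize the record to JSON, the interchange format named in the abstract.
serialized = json.dumps(record, indent=2)
print(serialized)
```

Serializing each appraisal as a small, self-describing JSON object is what makes such a corpus usable as machine-readable ground truth for downstream automation.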
id |
RCAP_7183774ec37009ca6e52d9415d3adc44 |
oai_identifier_str |
oai:estudogeral.uc.pt:10316/105182 |
network_acronym_str |
RCAP |
network_name_str |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos) |
repository_id_str |
7160 |
spelling |
Using a Secure, Continually Updating, Web Source Processing Pipeline to Support the Real-Time Data Synthesis and Analysis of Scientific Literature: Development and Validation Study
This work was supported by the Defence and Security Accelerator (grant ACC2015551), the Digital Surgery Intelligent Operating Room Grant, the National Institute for Health Research Long-limb Gastric Bypass RCT Study, the Jon Moulton Charitable Trust Diabetes Bariatric Surgery Grant, the National Institute for Health Research (grant II-OL-1116-10027), the National Institutes of Health (grant R01-CA204403-01A1), Horizon 2020 (ITN GROWTH), and the Imperial Biomedical Research Centre.
JMIR Publications Inc. 2021-05-06 info:eu-repo/semantics/publishedVersion info:eu-repo/semantics/article http://hdl.handle.net/10316/105182 https://doi.org/10.2196/25714 eng 1438-8871
Vaghela, Uddhav; Rabinowicz, Simon; Bratsos, Paris; Martin, Guy; Fritzilas, Epameinondas; Markar, Sheraz; Purkayastha, Sanjay; Stringer, Karl; Singh, Harshdeep; Llewellyn, Charlie; Dutta, Debabrata; Clarke, Jonathan M.; Howard, Matthew; Serban, Ovidiu; Kinross, James; Sá-Marta, Eduarda; et al.
info:eu-repo/semantics/openAccess reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos) instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação instacron:RCAAP
2023-04-06T10:20:18Z oai:estudogeral.uc.pt:10316/105182 Portal Agregador ONG https://www.rcaap.pt/oai/openaire opendoar:7160 2024-03-19T21:21:47.343251 Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação false |
dc.title.none.fl_str_mv |
Using a Secure, Continually Updating, Web Source Processing Pipeline to Support the Real-Time Data Synthesis and Analysis of Scientific Literature: Development and Validation Study |
title |
Using a Secure, Continually Updating, Web Source Processing Pipeline to Support the Real-Time Data Synthesis and Analysis of Scientific Literature: Development and Validation Study |
spellingShingle |
Using a Secure, Continually Updating, Web Source Processing Pipeline to Support the Real-Time Data Synthesis and Analysis of Scientific Literature: Development and Validation Study; Vaghela, Uddhav; structured data synthesis; data science; critical analysis; web crawl data; pipeline; database; literature; research; COVID-19; infodemic; decision making; data; data synthesis; misinformation; infrastructure; methodology; Data Interpretation, Statistical; Datasets as Topic; Humans; Internet; Longitudinal Studies; SARS-CoV-2; Search Engine; Natural Language Processing |
title_short |
Using a Secure, Continually Updating, Web Source Processing Pipeline to Support the Real-Time Data Synthesis and Analysis of Scientific Literature: Development and Validation Study |
title_full |
Using a Secure, Continually Updating, Web Source Processing Pipeline to Support the Real-Time Data Synthesis and Analysis of Scientific Literature: Development and Validation Study |
title_fullStr |
Using a Secure, Continually Updating, Web Source Processing Pipeline to Support the Real-Time Data Synthesis and Analysis of Scientific Literature: Development and Validation Study |
title_full_unstemmed |
Using a Secure, Continually Updating, Web Source Processing Pipeline to Support the Real-Time Data Synthesis and Analysis of Scientific Literature: Development and Validation Study |
title_sort |
Using a Secure, Continually Updating, Web Source Processing Pipeline to Support the Real-Time Data Synthesis and Analysis of Scientific Literature: Development and Validation Study |
author |
Vaghela, Uddhav |
author_facet |
Vaghela, Uddhav; Rabinowicz, Simon; Bratsos, Paris; Martin, Guy; Fritzilas, Epameinondas; Markar, Sheraz; Purkayastha, Sanjay; Stringer, Karl; Singh, Harshdeep; Llewellyn, Charlie; Dutta, Debabrata; Clarke, Jonathan M.; Howard, Matthew; Serban, Ovidiu; Kinross, James; Sá-Marta, Eduarda; et al. |
author_role |
author |
author2 |
Rabinowicz, Simon; Bratsos, Paris; Martin, Guy; Fritzilas, Epameinondas; Markar, Sheraz; Purkayastha, Sanjay; Stringer, Karl; Singh, Harshdeep; Llewellyn, Charlie; Dutta, Debabrata; Clarke, Jonathan M.; Howard, Matthew; Serban, Ovidiu; Kinross, James; Sá-Marta, Eduarda; et al. |
author2_role |
author author author author author author author author author author author author author author author author |
dc.contributor.author.fl_str_mv |
Vaghela, Uddhav; Rabinowicz, Simon; Bratsos, Paris; Martin, Guy; Fritzilas, Epameinondas; Markar, Sheraz; Purkayastha, Sanjay; Stringer, Karl; Singh, Harshdeep; Llewellyn, Charlie; Dutta, Debabrata; Clarke, Jonathan M.; Howard, Matthew; Serban, Ovidiu; Kinross, James; Sá-Marta, Eduarda; et al. |
dc.subject.por.fl_str_mv |
structured data synthesis; data science; critical analysis; web crawl data; pipeline; database; literature; research; COVID-19; infodemic; decision making; data; data synthesis; misinformation; infrastructure; methodology; Data Interpretation, Statistical; Datasets as Topic; Humans; Internet; Longitudinal Studies; SARS-CoV-2; Search Engine; Natural Language Processing |
topic |
structured data synthesis; data science; critical analysis; web crawl data; pipeline; database; literature; research; COVID-19; infodemic; decision making; data; data synthesis; misinformation; infrastructure; methodology; Data Interpretation, Statistical; Datasets as Topic; Humans; Internet; Longitudinal Studies; SARS-CoV-2; Search Engine; Natural Language Processing |
publishDate |
2021 |
dc.date.none.fl_str_mv |
2021-05-06 |
dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
dc.type.driver.fl_str_mv |
info:eu-repo/semantics/article |
format |
article |
status_str |
publishedVersion |
dc.identifier.uri.fl_str_mv |
http://hdl.handle.net/10316/105182 https://doi.org/10.2196/25714 |
url |
http://hdl.handle.net/10316/105182 https://doi.org/10.2196/25714 |
dc.language.iso.fl_str_mv |
eng |
language |
eng |
dc.relation.none.fl_str_mv |
1438-8871 |
dc.rights.driver.fl_str_mv |
info:eu-repo/semantics/openAccess |
eu_rights_str_mv |
openAccess |
dc.publisher.none.fl_str_mv |
JMIR Publications Inc. |
publisher.none.fl_str_mv |
JMIR Publications Inc. |
dc.source.none.fl_str_mv |
reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos) instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação instacron:RCAAP |
instname_str |
Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação |
instacron_str |
RCAAP |
institution |
RCAAP |
reponame_str |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos) |
collection |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos) |
repository.name.fl_str_mv |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação |
repository.mail.fl_str_mv |
|
_version_ |
1799134108715057152 |
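The record carries its own OAI-PMH provenance: the aggregator endpoint (https://www.rcaap.pt/oai/openaire) and the record identifier (oai:estudogeral.uc.pt:10316/105182). A minimal sketch of how this record could be re-harvested with a standard OAI-PMH GetRecord request — the URL is built here but not fetched:

```python
from urllib.parse import urlencode

# Endpoint and identifier taken verbatim from the record's provenance fields.
endpoint = "https://www.rcaap.pt/oai/openaire"
params = {
    "verb": "GetRecord",
    "identifier": "oai:estudogeral.uc.pt:10316/105182",
    "metadataPrefix": "oai_dc",  # Dublin Core, the mandatory OAI-PMH format
}

# Build the GetRecord URL; fetching it returns the XML for this one record.
request_url = f"{endpoint}?{urlencode(params)}"
print(request_url)
```

GetRecord with the `oai_dc` prefix is part of the OAI-PMH 2.0 baseline, so the same pattern works against any compliant repository, not just this aggregator.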