Finding the Critical Feature Dimension of Big Datasets

Detalhes bibliográficos
Autor(a) principal: Silva, José Miguel Parreira e
Data de Publicação: 2017
Tipo de documento: Dissertação
Idioma: eng
Título da fonte: Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
Texto Completo: http://hdl.handle.net/10316/82847
Resumo: Dissertação de Mestrado em Engenharia Informática apresentada à Faculdade de Ciências e Tecnologia
id RCAP_9266191c0405e38119e23f2f3304f3ca
oai_identifier_str oai:estudogeral.uc.pt:10316/82847
network_acronym_str RCAP
network_name_str Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository_id_str 7160
spelling Finding the Critical Feature Dimension of Big DatasetsProcura do Tamanho Crítico de Amostragem de Grandes Conjuntos de DadosBig DataCritical SampleData MiningBig DataCritical SampleData MiningDissertação de Mestrado em Engenharia Informática apresentada à Faculdade de Ciências e TecnologiaBig Data allied to the Internet of Things nowadays provides a powerful resource that various organizations are increasingly exploiting for applications ranging from decision support, predictive and prescriptive analytics, to knowledge and intelligence discovery. In analytics and data mining processes, it is usually desirable to have as much data as possible, though it is often more important that the data is of high quality thereby raising two of the most important problems when handling large datasets: sample and feature selection. This work addresses the sampling problem and presents a heuristic method to find the “critical sampling” of big datasets. The concept of the critical sampling size of a dataset is defined as the minimum number of examples that are required for a given data analytic task to achieve a satisfactory performance. The problem is very important in data mining, since the size of data sets directly relates to the cost of executing the data mining task. Since the problem of determining the optimal solution for the Critical Sampling Size problem is intractable, in this dissertation a heuristic method is tested, in order to infer its capability to find practical solutions. Results have shown an apparent Critical Sampling Size for all the tested datasets, which is rather smaller than the their original sizes. Further, the proposed heuristic method shows a promising utility, providing a practical solution to find a useful critical sample for data mining tasks.Big Data allied to the Internet of Things nowadays provides a powerful resource that various organizations are increasingly exploiting for applications ranging from decision support, predictive and prescriptive analytics, to knowledge and intelligence discovery. In analytics and data mining processes, it is usually desirable to have as much data as possible, though it is often more important that the data is of high quality thereby raising two of the most important problems when handling large datasets: sample and feature selection. This work addresses the sampling problem and presents a heuristic method to find the “critical sampling” of big datasets. The concept of the critical sampling size of a dataset is defined as the minimum number of examples that are required for a given data analytic task to achieve a satisfactory performance. The problem is very important in data mining, since the size of data sets directly relates to the cost of executing the data mining task. Since the problem of determining the optimal solution for the Critical Sampling Size problem is intractable, in this dissertation a heuristic method is tested, in order to infer its capability to find practical solutions. Results have shown an apparent Critical Sampling Size for all the tested datasets, which is rather smaller than the their original sizes. Further, the proposed heuristic method shows a promising utility, providing a practical solution to find a useful critical sample for data mining tasks.2017-07-14info:eu-repo/semantics/publishedVersioninfo:eu-repo/semantics/masterThesishttp://hdl.handle.net/10316/82847http://hdl.handle.net/10316/82847TID:202124010engSilva, José Miguel Parreira einfo:eu-repo/semantics/openAccessreponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informaçãoinstacron:RCAAP2020-02-11T10:24:25Zoai:estudogeral.uc.pt:10316/82847Portal AgregadorONGhttps://www.rcaap.pt/oai/openaireopendoar:71602024-03-19T21:04:42.655765Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informaçãofalse
dc.title.none.fl_str_mv Finding the Critical Feature Dimension of Big Datasets
Procura do Tamanho Crítico de Amostragem de Grandes Conjuntos de Dados
title Finding the Critical Feature Dimension of Big Datasets
spellingShingle Finding the Critical Feature Dimension of Big Datasets
Silva, José Miguel Parreira e
Big Data
Critical Sample
Data Mining
Big Data
Critical Sample
Data Mining
title_short Finding the Critical Feature Dimension of Big Datasets
title_full Finding the Critical Feature Dimension of Big Datasets
title_fullStr Finding the Critical Feature Dimension of Big Datasets
title_full_unstemmed Finding the Critical Feature Dimension of Big Datasets
title_sort Finding the Critical Feature Dimension of Big Datasets
author Silva, José Miguel Parreira e
author_facet Silva, José Miguel Parreira e
author_role author
dc.contributor.author.fl_str_mv Silva, José Miguel Parreira e
dc.subject.por.fl_str_mv Big Data
Critical Sample
Data Mining
Big Data
Critical Sample
Data Mining
topic Big Data
Critical Sample
Data Mining
Big Data
Critical Sample
Data Mining
description Dissertação de Mestrado em Engenharia Informática apresentada à Faculdade de Ciências e Tecnologia
publishDate 2017
dc.date.none.fl_str_mv 2017-07-14
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/masterThesis
format masterThesis
status_str publishedVersion
dc.identifier.uri.fl_str_mv http://hdl.handle.net/10316/82847
http://hdl.handle.net/10316/82847
TID:202124010
url http://hdl.handle.net/10316/82847
identifier_str_mv TID:202124010
dc.language.iso.fl_str_mv eng
language eng
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.source.none.fl_str_mv reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron:RCAAP
instname_str Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron_str RCAAP
institution RCAAP
reponame_str Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
collection Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository.name.fl_str_mv Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
repository.mail.fl_str_mv
_version_ 1799133939119423488