Finding the Critical Feature Dimension of Big Datasets
Principal author: | Silva, José Miguel Parreira e |
---|---|
Publication date: | 2017 |
Document type: | Dissertation |
Language: | eng |
Source title: | Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos) |
Full text: | http://hdl.handle.net/10316/82847 |
Abstract: | Master's dissertation in Informatics Engineering presented to the Faculdade de Ciências e Tecnologia |
id |
RCAP_9266191c0405e38119e23f2f3304f3ca |
---|---|
oai_identifier_str |
oai:estudogeral.uc.pt:10316/82847 |
network_acronym_str |
RCAP |
network_name_str |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos) |
repository_id_str |
7160 |
spelling |
Finding the Critical Feature Dimension of Big Datasets / Procura do Tamanho Crítico de Amostragem de Grandes Conjuntos de Dados. Keywords: Big Data; Critical Sample; Data Mining. Master's dissertation in Informatics Engineering presented to the Faculdade de Ciências e Tecnologia.
Abstract: Big Data, allied to the Internet of Things, now provides a powerful resource that organizations increasingly exploit for applications ranging from decision support and predictive and prescriptive analytics to knowledge and intelligence discovery. In analytics and data mining, it is usually desirable to have as much data as possible, though it is often more important that the data be of high quality. This raises two of the most important problems in handling large datasets: sample selection and feature selection. This work addresses the sampling problem and presents a heuristic method to find the “critical sampling” of big datasets. The critical sampling size of a dataset is defined as the minimum number of examples required for a given data analytic task to achieve satisfactory performance. The problem is important in data mining because the size of a dataset directly relates to the cost of executing the data mining task. Since determining the optimal solution to the Critical Sampling Size problem is intractable, this dissertation tests a heuristic method to assess its ability to find practical solutions. Results show an apparent Critical Sampling Size for all the tested datasets, which is rather smaller than their original sizes. Further, the proposed heuristic method shows promising utility, providing a practical way to find a useful critical sample for data mining tasks.
Record data: 2017-07-14; info:eu-repo/semantics/publishedVersion; info:eu-repo/semantics/masterThesis; http://hdl.handle.net/10316/82847; TID:202124010; eng; Silva, José Miguel Parreira e; info:eu-repo/semantics/openAccess; reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos); instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação; instacron:RCAAP; 2020-02-11T10:24:25Z; oai:estudogeral.uc.pt:10316/82847; Portal Agregador; ONG; https://www.rcaap.pt/oai/openaire; opendoar:7160; 2024-03-19T21:04:42.655765; Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação; false |
dc.title.none.fl_str_mv |
Finding the Critical Feature Dimension of Big Datasets / Procura do Tamanho Crítico de Amostragem de Grandes Conjuntos de Dados |
title |
Finding the Critical Feature Dimension of Big Datasets |
spellingShingle |
Finding the Critical Feature Dimension of Big Datasets; Silva, José Miguel Parreira e; Big Data; Critical Sample; Data Mining |
title_short |
Finding the Critical Feature Dimension of Big Datasets |
title_full |
Finding the Critical Feature Dimension of Big Datasets |
title_fullStr |
Finding the Critical Feature Dimension of Big Datasets |
title_full_unstemmed |
Finding the Critical Feature Dimension of Big Datasets |
title_sort |
Finding the Critical Feature Dimension of Big Datasets |
author |
Silva, José Miguel Parreira e |
author_facet |
Silva, José Miguel Parreira e |
author_role |
author |
dc.contributor.author.fl_str_mv |
Silva, José Miguel Parreira e |
dc.subject.por.fl_str_mv |
Big Data; Critical Sample; Data Mining |
topic |
Big Data; Critical Sample; Data Mining |
description |
Master's dissertation in Informatics Engineering presented to the Faculdade de Ciências e Tecnologia |
publishDate |
2017 |
dc.date.none.fl_str_mv |
2017-07-14 |
dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
dc.type.driver.fl_str_mv |
info:eu-repo/semantics/masterThesis |
format |
masterThesis |
status_str |
publishedVersion |
dc.identifier.uri.fl_str_mv |
http://hdl.handle.net/10316/82847; TID:202124010 |
url |
http://hdl.handle.net/10316/82847 |
identifier_str_mv |
TID:202124010 |
dc.language.iso.fl_str_mv |
eng |
language |
eng |
dc.rights.driver.fl_str_mv |
info:eu-repo/semantics/openAccess |
eu_rights_str_mv |
openAccess |
dc.source.none.fl_str_mv |
reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos); instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação; instacron:RCAAP |
instname_str |
Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação |
instacron_str |
RCAAP |
institution |
RCAAP |
reponame_str |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos) |
collection |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos) |
repository.name.fl_str_mv |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação |
repository.mail.fl_str_mv |
|
_version_ |
1799133939119423488 |