Semantic Similarity Match for Data Quality
Autor(a) principal: | |
---|---|
Data de Publicação: | 2007 |
Outros Autores: | , |
Tipo de documento: | Relatório |
Idioma: | por |
Título da fonte: | Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
Texto Completo: | http://hdl.handle.net/10451/14158 |
Resumo: | Data quality is a critical aspect of applications that support business operations. Often entities are represented more than once in data repositories. Since duplicate records do not share a common key, they are hard to detect. Duplicate detection over text is usually performed using lexical approaches, which do not capture text sense. The difficulties increase when the duplicate detection must be performed using the text sense. This work presents a semantic similarity approach, based on a text sense matching mechanism, that performs the detection of text units which are similar in sense. The goal of the proposed semantic similarity approach is therefore to perform the duplicate detection task in a data quality process |
id |
RCAP_33a12691348cb22b9dda534961712cbc |
---|---|
oai_identifier_str |
oai:repositorio.ul.pt:10451/14158 |
network_acronym_str |
RCAP |
network_name_str |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
repository_id_str |
7160 |
spelling |
Semantic Similarity Match for Data Qualitysemantic similaritydata cleaningdata qualitywordnetsimilarity matchData quality is a critical aspect of applications that support business operations. Often entities are represented more than once in data repositories. Since duplicate records do not share a common key, they are hard to detect. Duplicate detection over text is usually performed using lexical approaches, which do not capture text sense. The difficulties increase when the duplicate detection must be performed using the text sense. This work presents a semantic similarity approach, based on a text sense matching mechanism, that performs the detection of text units which are similar in sense. The goal of the proposed semantic similarity approach is therefore to perform the duplicate detection task in a data quality processDepartment of Informatics, University of LisbonRepositório da Universidade de LisboaMartins, FernandoFalcão, AndréCouto, Francisco M.2009-02-10T13:12:03Z2007-102007-10-01T00:00:00Zinfo:eu-repo/semantics/publishedVersioninfo:eu-repo/semantics/reportapplication/pdfhttp://hdl.handle.net/10451/14158porinfo:eu-repo/semantics/openAccessreponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informaçãoinstacron:RCAAP2023-11-08T15:59:48Zoai:repositorio.ul.pt:10451/14158Portal AgregadorONGhttps://www.rcaap.pt/oai/openaireopendoar:71602024-03-19T21:36:00.010636Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informaçãofalse |
dc.title.none.fl_str_mv |
Semantic Similarity Match for Data Quality |
title |
Semantic Similarity Match for Data Quality |
spellingShingle |
Semantic Similarity Match for Data Quality Martins, Fernando semantic similarity data cleaning data quality wordnet similarity match |
title_short |
Semantic Similarity Match for Data Quality |
title_full |
Semantic Similarity Match for Data Quality |
title_fullStr |
Semantic Similarity Match for Data Quality |
title_full_unstemmed |
Semantic Similarity Match for Data Quality |
title_sort |
Semantic Similarity Match for Data Quality |
author |
Martins, Fernando |
author_facet |
Martins, Fernando Falcão, André Couto, Francisco M. |
author_role |
author |
author2 |
Falcão, André Couto, Francisco M. |
author2_role |
author author |
dc.contributor.none.fl_str_mv |
Repositório da Universidade de Lisboa |
dc.contributor.author.fl_str_mv |
Martins, Fernando Falcão, André Couto, Francisco M. |
dc.subject.por.fl_str_mv |
semantic similarity data cleaning data quality wordnet similarity match |
topic |
semantic similarity data cleaning data quality wordnet similarity match |
description |
Data quality is a critical aspect of applications that support business operations. Often entities are represented more than once in data repositories. Since duplicate records do not share a common key, they are hard to detect. Duplicate detection over text is usually performed using lexical approaches, which do not capture text sense. The difficulties increase when the duplicate detection must be performed using the text sense. This work presents a semantic similarity approach, based on a text sense matching mechanism, that performs the detection of text units which are similar in sense. The goal of the proposed semantic similarity approach is therefore to perform the duplicate detection task in a data quality process |
publishDate |
2007 |
dc.date.none.fl_str_mv |
2007-10 2007-10-01T00:00:00Z 2009-02-10T13:12:03Z |
dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
dc.type.driver.fl_str_mv |
info:eu-repo/semantics/report |
format |
report |
status_str |
publishedVersion |
dc.identifier.uri.fl_str_mv |
http://hdl.handle.net/10451/14158 |
url |
http://hdl.handle.net/10451/14158 |
dc.language.iso.fl_str_mv |
por |
language |
por |
dc.rights.driver.fl_str_mv |
info:eu-repo/semantics/openAccess |
eu_rights_str_mv |
openAccess |
dc.format.none.fl_str_mv |
application/pdf |
dc.publisher.none.fl_str_mv |
Department of Informatics, University of Lisbon |
publisher.none.fl_str_mv |
Department of Informatics, University of Lisbon |
dc.source.none.fl_str_mv |
reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação instacron:RCAAP |
instname_str |
Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação |
instacron_str |
RCAAP |
institution |
RCAAP |
reponame_str |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
collection |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
repository.name.fl_str_mv |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação |
repository.mail.fl_str_mv |
|
_version_ |
1799134258574393344 |