Predição de carbono orgânico do solo por espectroscopia Vis-Nir
Main author: | Heinen, Taciara Zborowski Horst |
---|---|
Publication date: | 2021 |
Document type: | Thesis |
Language: | eng |
Source title: | Manancial - Repositório Digital da UFSM |
dARK ID: | ark:/26339/001300000h3kp |
Full text: | http://repositorio.ufsm.br/handle/1/24215 |
Abstract: | The development of large databases usually implies combining data collected for different purposes under different standards and methodologies, which often leaves the databases with disparate and inconsistent soil data. Despite the potential of visible and near-infrared (Vis-NIR) spectroscopy to predict soil organic carbon (SOC) from those databases, the effectiveness and consistency of the analytical methods used to produce the target data are seldom discussed. The main purpose of this research was to investigate the interplay among preprocessing techniques, model architectures, and especially the analytical methods used to produce the SOC target data. To that end, we set two specific objectives: i) evaluate the interplay among analytical methods, preprocessing techniques, and model architectures on SOC predictions; ii) assess whether this interplay can be translated into some form of hierarchy across validation metrics. The two chapters of this PhD thesis address these objectives. Chapter I presents how changes in the analytical method (dry combustion (SOCDC) and wet combustion with quantification by titrimetry (SOCWCt) or colorimetry (SOCWCc)) and in the preprocessing technique (smoothing (SMO), continuum removal (CRR), and Savitzky-Golay first derivative (SGD)) affect the empirical relationship captured by different machine learning algorithms (random forest, cubist, and partial least squares regression (PLSR)). Cross-validation metrics were used to compare the performance of the 27 resulting predictive models (3 analytical methods × 3 preprocessing techniques × 3 algorithms). The relationship between the covariate matrix and the target data is explored through variable importance. Chapter II shows how the interplay among those three factors can be translated into a hierarchy. A resampling technique was used to split the dataset into training and validation sets 100 times, both to obtain realistic performance estimates and to explore how predictive performance changed as the training set changed. Conditional inference tree analysis was performed to evaluate how the three factors influenced global validation metrics. The predictive performance in both studies varied with the SOC analytical method, preprocessing technique, and model architecture employed. Among the three analytical methods tested, DC and WCt provided a higher correlation between SOC and spectra than WCc and thus resulted in higher model performance. The model architecture had a larger influence on the validation metrics than the preprocessing technique or the analytical method. PLSR models were more influenced by the analytical method, whereas random forest and cubist models were more influenced by the preprocessing technique. Cubist models combined with CRR minimized the accuracy differences resulting from the SOC analytical method employed. However, this combination resulted in overfitted models and high prediction uncertainty. PLSR performed more consistently than random forest and cubist. Overall, SOC data produced using different analytical methods in a training dataset significantly affected the reliability, capability, and assessment of the predictions. These results will be useful either to guide the selection of an analytical method for new projects or to manage already available databases. They also highlight the need for transparent and precise documentation of spectroscopy modeling to enable a fair comparison between publications. |
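The experimental design summarized in the abstract can be illustrated with a short Python sketch. It is not the thesis's code. It only shows, under stated assumptions, how the 3 × 3 × 3 factor grid yields 27 models, what a Savitzky-Golay first derivative does to a spectrum, and how 100 repeated training/validation splits produce a distribution of a validation metric. The 5-point quadratic window, the 70/30 split ratio, the random seed, the function names, and the mean-only placeholder model are all assumptions made for illustration; the abstract does not specify them.

```python
import itertools
import random
import statistics

# Factor levels as named in the abstract.
ANALYTICAL_METHODS = ("DC", "WCt", "WCc")
PREPROCESSING = ("SMO", "CRR", "SGD")
MODELS = ("random forest", "cubist", "PLSR")


def factorial_design():
    """All (analytical method, preprocessing, model) combinations: 3 x 3 x 3 = 27."""
    return list(itertools.product(ANALYTICAL_METHODS, PREPROCESSING, MODELS))


def savgol_first_derivative(spectrum, spacing=1.0):
    """Savitzky-Golay first derivative with a 5-point window and quadratic fit.

    Window size and polynomial order are assumptions for illustration.
    Only interior points are returned (two points are lost at each edge).
    """
    coeffs = (-2, -1, 0, 1, 2)  # 5-point quadratic coefficients, normalized by 10
    return [
        sum(c * spectrum[i + k] for k, c in zip(range(-2, 3), coeffs)) / (10.0 * spacing)
        for i in range(2, len(spectrum) - 2)
    ]


def repeated_splits(n_samples, n_repeats=100, train_fraction=0.7, seed=0):
    """Yield (train, validation) index lists for repeated random hold-out splits.

    train_fraction and seed are hypothetical; the abstract only states 100 repeats.
    """
    rng = random.Random(seed)
    n_train = round(n_samples * train_fraction)
    for _ in range(n_repeats):
        shuffled = list(range(n_samples))
        rng.shuffle(shuffled)
        yield sorted(shuffled[:n_train]), sorted(shuffled[n_train:])


def rmse(observed, predicted):
    """Root mean squared error between two equal-length sequences."""
    return (sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)) ** 0.5


def baseline_rmse_distribution(soc, n_repeats=100, seed=0):
    """RMSE of a mean-only placeholder model on each validation split."""
    scores = []
    for train, valid in repeated_splits(len(soc), n_repeats=n_repeats, seed=seed):
        mean_soc = statistics.fmean([soc[i] for i in train])
        scores.append(rmse([soc[i] for i in valid], [mean_soc] * len(valid)))
    return scores


if __name__ == "__main__":
    rng = random.Random(1)
    synthetic_soc = [rng.uniform(0.5, 4.0) for _ in range(60)]  # synthetic values, not thesis data
    print(len(factorial_design()), "factor combinations")
    print("median baseline RMSE:", statistics.median(baseline_rmse_distribution(synthetic_soc)))
```

In a real workflow the placeholder model would be replaced by a fitted random forest, cubist, or PLSR model for each of the 27 combinations, and the resulting per-split metrics would become the input to an analysis such as the conditional inference trees mentioned in the abstract.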
id |
UFSM_a2cd57504021bbb754521865fd7c855d |
---|---|
oai_identifier_str |
oai:repositorio.ufsm.br:1/24215 |
network_acronym_str |
UFSM |
network_name_str |
Manancial - Repositório Digital da UFSM |
repository_id_str |
|
spelling |
Predição de carbono orgânico do solo por espectroscopia Vis-Nir Soil organic carbon prediction by diffuse reflectance spectroscopy: analytical methods, preprocessing techniques, and model architectures Pedometria Modelagem espectral Aprendizado de máquina Biblioteca espectral Quimiometria Spectral library Chemometrics Pedometrics Spectral modeling Machine learning CNPQ::CIENCIAS AGRARIAS::AGRONOMIA::CIENCIA DO SOLO Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - CAPES Universidade Federal de Santa Maria Brasil Agronomia UFSM Programa de Pós-Graduação em Ciência do Solo Centro de Ciências Rurais Dalmolin, Ricardo Simão Diniz http://lattes.cnpq.br/3735884911693854 Rosa, Alessandro Samuel Grunwald, Sabine ten Caten, Alexandre Souza, Deorgia Tayane Mendes de Pedron, Fabrício de Araújo Schenato, Ricardo Bergamo Heinen, Taciara Zborowski Horst 2022-04-28T15:17:09Z 2022-04-28T15:17:09Z 2021-12-21 info:eu-repo/semantics/publishedVersion info:eu-repo/semantics/doctoralThesis application/pdf http://repositorio.ufsm.br/handle/1/24215 ark:/26339/001300000h3kp eng Attribution-NonCommercial-NoDerivatives 4.0 International http://creativecommons.org/licenses/by-nc-nd/4.0/ info:eu-repo/semantics/openAccess reponame:Manancial - Repositório Digital da UFSM instname:Universidade Federal de Santa Maria (UFSM) instacron:UFSM 2022-04-28T15:17:09Z oai:repositorio.ufsm.br:1/24215 Biblioteca Digital de Teses e Dissertações https://repositorio.ufsm.br/ ONG https://repositorio.ufsm.br/oai/request atendimento.sib@ufsm.br||tedebc@gmail.com opendoar:2024-07-29T10:39:47.434007 Manancial - Repositório Digital da UFSM - Universidade Federal de Santa Maria (UFSM) false |
dc.title.none.fl_str_mv |
Predição de carbono orgânico do solo por espectroscopia Vis-Nir Soil organic carbon prediction by diffuse reflectance spectroscopy: analytical methods, preprocessing techniques, and model architectures |
title |
Predição de carbono orgânico do solo por espectroscopia Vis-Nir |
author |
Heinen, Taciara Zborowski Horst |
author_role |
author |
dc.contributor.none.fl_str_mv |
Dalmolin, Ricardo Simão Diniz http://lattes.cnpq.br/3735884911693854 Rosa, Alessandro Samuel Grunwald, Sabine ten Caten, Alexandre Souza, Deorgia Tayane Mendes de Pedron, Fabrício de Araújo Schenato, Ricardo Bergamo |
dc.contributor.author.fl_str_mv |
Heinen, Taciara Zborowski Horst |
dc.subject.por.fl_str_mv |
Pedometria Modelagem espectral Aprendizado de máquina Biblioteca espectral Quimiometria Spectral library Chemometrics Pedometrics Spectral modeling Machine learning CNPQ::CIENCIAS AGRARIAS::AGRONOMIA::CIENCIA DO SOLO |
publishDate |
2021 |
dc.date.none.fl_str_mv |
2021-12-21 2022-04-28T15:17:09Z 2022-04-28T15:17:09Z |
dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
dc.type.driver.fl_str_mv |
info:eu-repo/semantics/doctoralThesis |
format |
doctoralThesis |
status_str |
publishedVersion |
dc.identifier.uri.fl_str_mv |
http://repositorio.ufsm.br/handle/1/24215 |
dc.identifier.dark.fl_str_mv |
ark:/26339/001300000h3kp |
dc.language.iso.fl_str_mv |
eng |
dc.rights.driver.fl_str_mv |
Attribution-NonCommercial-NoDerivatives 4.0 International http://creativecommons.org/licenses/by-nc-nd/4.0/ info:eu-repo/semantics/openAccess |
rights_invalid_str_mv |
Attribution-NonCommercial-NoDerivatives 4.0 International http://creativecommons.org/licenses/by-nc-nd/4.0/ |
eu_rights_str_mv |
openAccess |
dc.format.none.fl_str_mv |
application/pdf |
dc.publisher.none.fl_str_mv |
Universidade Federal de Santa Maria Brasil Agronomia UFSM Programa de Pós-Graduação em Ciência do Solo Centro de Ciências Rurais |
dc.source.none.fl_str_mv |
reponame:Manancial - Repositório Digital da UFSM instname:Universidade Federal de Santa Maria (UFSM) instacron:UFSM |
instname_str |
Universidade Federal de Santa Maria (UFSM) |
instacron_str |
UFSM |
institution |
UFSM |
reponame_str |
Manancial - Repositório Digital da UFSM |
collection |
Manancial - Repositório Digital da UFSM |
repository.name.fl_str_mv |
Manancial - Repositório Digital da UFSM - Universidade Federal de Santa Maria (UFSM) |
repository.mail.fl_str_mv |
atendimento.sib@ufsm.br||tedebc@gmail.com |
_version_ |
1814439791642542080 |