Comparison of methods for defining the optimal number of groups in cluster analysis
Date
2012-02-02
Publisher
Universidade Federal de Viçosa
Abstract
Studies that use hierarchical cluster analysis struggle to determine the optimal number of groups because objective criteria are lacking. The technique is used in research that fits nonlinear models to growth or survival data, where the main interest is how many curves are needed to describe the behavior of the individuals analyzed. To support this decision, some researchers rely on the indices BSS (Between-group Sum of Squares), SPRSQ (Semi-partial R-Squared), RMSSTD (Root Mean Square Standard Deviation), and RS (R-Squared), and on the Mojena method. However, it is not known which of them is the best choice for determining that number. Comparing these statistics was the aim of this study. Throughout, Ward's method was used to cluster the observations, the von Bertalanffy model to fit the curves, and a purpose-built function, based on the law of cosines and on the idea of the Modified Maximum Curvature Method, to calculate the number of groups indicated by each index. Chapter 1 presents a real case study. The data set comprised seven animal growth curves that formed three groups. After the parameter estimates were clustered and the statistics calculated, only the SPRSQ index pointed to the correct number of groups. When a function that re-scales the index axis to match the number-of-groups axis was applied to improve the results, only RMSSTD failed to indicate the expected value. Chapter 2 describes a simulation study to find out which of these statistics most often identified the optimal number of groups in two scenarios. In the first, the observations came from a single generating curve; in the second, the individuals belonged to three different populations. For a single curve, the RS index pointed to the optimal number of groups in most cases. For three different populations, the Mojena method identified the correct number of groups most often. In these scenarios, the function that re-scales the axes did not improve the indices' percentage of correct results. Overall, the RS and SPRSQ indices proved the most suitable aids for determining the optimal number of groups.
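The workflow summarized in the abstract can be illustrated with a minimal Python sketch. It is not the dissertation's code. It assumes SciPy's Ward linkage, the textbook definitions of RS and SPRSQ, Mojena's stopping rule with k = 1.25 (a commonly cited value, assumed here), and a hypothetical `sharpest_bend` helper that only borrows the law-of-cosines idea; the author's own function and the re-scaling step are not reproduced. The toy data are invented, standing in for per-individual parameter estimates.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cut_tree


def within_ss(X, labels):
    """Within-group sum of squares for a given partition."""
    return sum(((X[labels == g] - X[labels == g].mean(axis=0)) ** 2).sum()
               for g in np.unique(labels))


def rs_and_sprsq(X, Z, max_groups):
    """RS(k) and SPRSQ(k) for k = 1..max_groups (textbook definitions)."""
    tss = ((X - X.mean(axis=0)) ** 2).sum()
    # One extra level so SPRSQ(max_groups) can be formed as a difference.
    rs = np.array([1.0 - within_ss(X, cut_tree(Z, n_clusters=k).ravel()) / tss
                   for k in range(1, max_groups + 2)])
    sprsq = rs[1:] - rs[:-1]  # R-squared lost when k+1 groups are merged into k
    return rs[:-1], sprsq


def mojena_groups(Z, k=1.25):
    """Mojena's rule: stop before the first fusion height above mean + k*sd."""
    heights = Z[:, 2]
    threshold = heights.mean() + k * heights.std(ddof=1)
    large = np.nonzero(heights > threshold)[0]
    n_obs = Z.shape[0] + 1
    # Before merge number i (0-based) there are n_obs - i groups.
    return 1 if large.size == 0 else int(n_obs - large[0])


def sharpest_bend(values):
    """Hypothetical elbow rule: k whose neighbours form the smallest angle."""
    ks = np.arange(1, len(values) + 1, dtype=float)
    pts = np.column_stack([ks, values])
    angles = []
    for i in range(1, len(pts) - 1):
        a = np.linalg.norm(pts[i] - pts[i - 1])
        b = np.linalg.norm(pts[i + 1] - pts[i])
        c = np.linalg.norm(pts[i + 1] - pts[i - 1])
        cos_angle = (a**2 + b**2 - c**2) / (2 * a * b)  # law of cosines
        angles.append(np.arccos(np.clip(cos_angle, -1.0, 1.0)))
    return int(np.argmin(angles)) + 2  # interior points start at k = 2


# Invented toy data: three tight groups of four points each.
offsets = np.array([[0.1, 0.0], [-0.1, 0.0], [0.0, 0.1], [0.0, -0.1]])
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + offsets for c in centers])

Z = linkage(X, method="ward")
rs, sprsq = rs_and_sprsq(X, Z, max_groups=6)
n_mojena = mojena_groups(Z)
n_elbow = sharpest_bend(rs)
```

On well-separated toy data like this, the criteria agree easily. The simulations described in the abstract concern the harder cases where they do not.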
Keywords
Simulation, Growth curves, Von Bertalanffy model, Mojena method
Citation
ALVES, Suelem Cristina. Comparison of methods for defining the optimal number of groups in cluster analysis. 2012. 74 f. Dissertation (Master's in Applied Statistics and Biometry) - Universidade Federal de Viçosa, Viçosa, 2012.