Tree modeling is a representative data mining method that performs inference and prediction by organizing decision rules into a tree structure. In data mining, analysis with a tree model combines data exploration with statistical model building, because a tree model provides good prediction accuracy as well as data visualization and model interpretation. CART, one of the most widely used tree algorithms, has two problems because it is based on exhaustive search. First, it is biased toward selecting split variables with more split candidates. Second, it incurs a considerable computational cost, particularly when the sample size is large or when categorical explanatory variables have many levels. GUIDE was proposed as one of the algorithms to overcome these problems. It divides split rule selection into two phases: split variable selection and split point (or set) selection. In this way, it achieves negligible selection bias and a reasonable computational cost.
As data complexity increases, there is a growing need for more accurate prediction and richer interpretation in multivariate data analysis, that is, analysis with multiple response variables. Tree-structured methods have usually been developed for univariate data, but they can be extended to multivariate data. In this thesis, we deal with the following two multivariate problems. First, a univariate response is considered, but multiple quantiles are estimated simultaneously, so the problem can be viewed as multivariate. Second, we deal with a system of multiple regression equations known as seemingly unrelated regression. We develop tree-structured methods for these multivariate problems that maximize visualization and interpretability; in addition, they have better predictive power than existing methods.
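The two-phase idea behind GUIDE can be sketched as follows. This is a deliberately simplified illustration, not the published algorithm: variable selection here cross-tabulates the sign of the node residuals against quartile bins of each predictor and picks the predictor with the strongest chi-square association, and only then is an exhaustive split-point search run along that single variable. All function names are hypothetical.

```python
import numpy as np

def chi2_stat(table):
    """Pearson chi-square statistic of a contingency table."""
    n = table.sum()
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    mask = expected > 0
    return ((table - expected) ** 2 / np.where(mask, expected, 1.0))[mask].sum()

def select_split_variable(X, y):
    """Phase 1 (illustrative GUIDE-style selection): associate the sign of
    the node residuals with quartile bins of each predictor; the predictor
    with the largest chi-square statistic is chosen to split on."""
    resid_sign = (y > y.mean()).astype(int)   # residuals from a constant fit
    best_var, best_stat = None, -np.inf
    for j in range(X.shape[1]):
        bins = np.quantile(X[:, j], [0.25, 0.5, 0.75])
        groups = np.digitize(X[:, j], bins)   # four quartile bins: 0..3
        table = np.zeros((2, 4))
        for s, g in zip(resid_sign, groups):
            table[s, g] += 1
        stat = chi2_stat(table)
        if stat > best_stat:
            best_var, best_stat = j, stat
    return best_var

def best_split_point(x, y):
    """Phase 2: exhaustive least-squares search for the cut point,
    but only along the one variable chosen in phase 1."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    best_cut, best_sse = None, np.inf
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue
        left, right = ys[:i], ys[i:]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_cut, best_sse = (xs[i - 1] + xs[i]) / 2.0, sse
    return best_cut
```

Because the exhaustive search runs over a single variable rather than all of them, the cost drops roughly by a factor of the number of predictors, and the chi-square test treats every predictor symmetrically, which is what keeps the selection bias small.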
In the first half of this thesis, we propose a unified non-crossing quantile regression tree. Quantile regression provides comprehensive and useful statistical information by exploring how explanatory variables relate to the conditional quantile functions of the response variable. However, the traditional linear quantile regression model can yield distorted and incorrect results when the relationship between the explanatory variables and the response is nonlinear. Tree-structured quantile regression relaxes the linearity assumption imposed on the conditional quantile function. The proposed method starts from the assumption that the tree structures corresponding to the multiple quantile functions can be expected to be quite similar to one another, since quantile functions of the same response are highly correlated.
The proposed method effectively solves the crossing problem in estimating multiple quantiles. It also avoids selecting incorrect split variables at extreme quantile levels. In addition, the algorithm for selecting the split point (or set) effectively reduces the computational burden. We investigate the performance and usefulness of the proposed method with simulated and real data.
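The two ingredients named above, the check (pinball) loss that defines a quantile estimate and a device for keeping estimated quantiles from crossing, can be sketched in a few lines. The rearrangement step below (sorting the per-level estimates) is a simple surrogate for the unified estimator proposed in the thesis, not the thesis's own algorithm; the function names are hypothetical.

```python
import numpy as np

def pinball_loss(y, q, tau):
    """Check (pinball) loss; its minimizer over q is the tau-quantile of y."""
    u = y - q
    return np.mean(np.maximum(tau * u, (tau - 1.0) * u))

def node_quantiles(y, taus):
    """Estimate several quantiles within one tree node, then enforce the
    non-crossing constraint q_{tau_1} <= q_{tau_2} for tau_1 < tau_2 by
    monotone rearrangement (sorting the per-level estimates)."""
    return np.sort(np.quantile(y, taus))
```

Within a single node the empirical quantiles are automatically ordered, so rearrangement matters only when each level is estimated by a separate model (e.g. separate linear fits), which is exactly the situation where crossing arises.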
In the second part, we propose a seemingly unrelated regression tree. Seemingly unrelated regression (SUR) is used to analyze multivariate data in fields such as biology and the social sciences as well as econometrics. Its defining feature is that a set of regression equations are linked because their error terms are in fact correlated, although the equations appear unrelated. SUR assumes a linear relationship between the explanatory variables and each response variable; however, this assumption can be too strong for real data analysis. The proposed method alleviates the linearity assumption. Furthermore, it is superior to existing nonparametric methods based on splines or polynomials when the relationship between the explanatory and response variables changes sharply. In addition, since the proposed method is fitted in a piecewise linear form, it offers revealing insights through model interpretation and data visualization. We investigate the performance and usefulness of the proposed method with simulated and real data.
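The classical linear SUR estimator that the proposed tree generalizes is Zellner's feasible GLS: fit each equation by OLS, estimate the cross-equation error covariance from the residuals, then re-estimate all equations jointly with that covariance. A minimal numpy sketch of this standard two-step procedure (not the tree method itself; function name is hypothetical) is:

```python
import numpy as np

def sur_fgls(X_list, y_list):
    """Zellner's feasible GLS for seemingly unrelated regressions.
    X_list, y_list hold the design matrix and response of each equation,
    all sharing the same sample size n."""
    m, n = len(X_list), X_list[0].shape[0]
    # Step 1: equation-by-equation OLS residuals.
    resid = np.column_stack([
        y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
        for X, y in zip(X_list, y_list)
    ])
    # Step 2: cross-equation error covariance estimate.
    sigma = resid.T @ resid / n
    # Step 3: joint GLS on the stacked system with Omega = sigma kron I_n.
    X_big = np.zeros((m * n, sum(X.shape[1] for X in X_list)))
    col = 0
    for i, X in enumerate(X_list):
        X_big[i * n:(i + 1) * n, col:col + X.shape[1]] = X
        col += X.shape[1]
    y_big = np.concatenate(y_list)
    omega_inv = np.kron(np.linalg.inv(sigma), np.eye(n))
    beta = np.linalg.solve(X_big.T @ omega_inv @ X_big,
                           X_big.T @ omega_inv @ y_big)
    return beta, sigma
```

Exploiting the residual correlation in step 3 is what makes joint estimation more efficient than fitting the equations separately; a tree-structured SUR applies this idea within each terminal node, yielding the piecewise linear fit described above.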