
Statistical Machine Learning in High Dimensions
Saturday, Aug. 26, 2023




University of Science and Technology of China
Title: Individual-Centered Partial Information in Social Networks
Abstract: In statistical network analysis, we often assume either the full network is available or multiple subgraphs can be sampled to estimate various global properties of the network. However, in a real social network, people frequently make decisions based on their local view of the network alone. Here, we consider a partial information framework that characterizes the local network centered at a given individual by path length L and gives rise to a partial adjacency matrix. Under L = 2, we focus on the problem of (global) community detection using the popular stochastic block model (SBM) and its degree-corrected variant (DCSBM). We derive general properties of the eigenvalues and eigenvectors from the signal term of the partial adjacency matrix and propose new spectral-based community detection algorithms that achieve consistency under appropriate conditions. Our analysis also allows us to propose a new centrality measure that assesses the importance of an individual’s partial information in determining global community structure. Using simulated and real networks, we demonstrate the performance of our algorithms and compare our centrality measure with other popular alternatives to show it captures unique nodal information. Our results illustrate that the partial information framework enables us to compare the viewpoints of different individuals regarding the global structure.
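As a rough illustration of the partial information framework described above, the following Python sketch builds a partial adjacency matrix for one individual under L = 2. The abstract does not spell out the construction, so the visibility rule used here (an edge is observed when at least one endpoint lies within distance L - 1 of the individual) is an assumption, and the function names `sample_sbm`, `bfs_distances`, and `partial_adjacency` are hypothetical. The final eigendecomposition only hints at the spectral step; it is not the speakers' algorithm.

```python
from collections import deque

import numpy as np


def sample_sbm(n, K, p_in, p_out, rng):
    """Draw a symmetric SBM adjacency matrix without self-loops (illustrative)."""
    labels = rng.integers(K, size=n)
    P = np.where(labels[:, None] == labels[None, :], p_in, p_out)
    upper = np.triu(rng.random((n, n)) < P, k=1)
    A = (upper | upper.T).astype(int)
    return A, labels


def bfs_distances(A, source):
    """Shortest-path distances from `source`; unreachable nodes stay at inf."""
    dist = np.full(A.shape[0], np.inf)
    dist[source] = 0
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in np.flatnonzero(A[u]):
            if np.isinf(dist[v]):
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def partial_adjacency(A, center, L=2):
    """Keep edge (u, v) only if u or v is within distance L - 1 of `center`.

    This visibility rule is an assumption made for the sketch.
    """
    near = bfs_distances(A, center) <= L - 1
    visible = near[:, None] | near[None, :]
    return A * visible


rng = np.random.default_rng(0)
A, labels = sample_sbm(n=200, K=2, p_in=0.3, p_out=0.05, rng=rng)
center = 0
B = partial_adjacency(A, center, L=2)

# Leading eigenvectors (largest absolute eigenvalues) of the partial matrix,
# which a spectral method could then cluster.
eigvals, eigvecs = np.linalg.eigh(B.astype(float))
top = np.argsort(np.abs(eigvals))[::-1][:2]
embedding = eigvecs[:, top]

print(B.sum() // 2, "of", A.sum() // 2, "edges are visible from node", center)
```

Changing `center` shows how much the observed matrix, and hence any spectral embedding computed from it, depends on whose local view is used.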
University of Science and Technology of China
Title: Transfer Learning by Optimal Model Averaging for Censored Data
Abstract: Transfer learning has gained significant attention in various domains as a way to address the limited data available from any single study for prediction. In this paper, we develop a transfer learning approach with model averaging to predict censored responses in the main model. Specifically, several helper models are formulated with shared parameters from other datasets, and the optimal weights for the averaging procedure are derived by minimizing a delete-one cross-validation criterion. The proposed approach allows the model forms to vary among helper models. We show that the proposed approach achieves the lowest prediction risk asymptotically when the main model is misspecified and attains model weight consistency when the main model is correctly specified. We further demonstrate that, asymptotically, the risk of the proposed approach is no larger than the risk of the equal weighting approach or of any single candidate model, regardless of whether the main model is correctly specified. The performance of the proposed procedure is illustrated and compared with existing methods in extensive simulation studies and on the Surveillance, Epidemiology, and End Results (SEER)-Medicare liver cancer data.
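The weight-selection step can be sketched in a much simpler setting than the one in the talk. The Python example below assumes uncensored responses and ordinary least squares candidates fitted on a single dataset, so it covers neither censoring nor the helper models built from other datasets. It only shows the idea of choosing averaging weights on the simplex by minimizing a delete-one cross-validation criterion. The Frank-Wolfe solver and the names `loo_predictions`, `cv_risk`, and `cv_weights` are choices made for this sketch, not the speakers' procedure.

```python
import numpy as np


def loo_predictions(X, y):
    """Leave-one-out OLS predictions via the hat-matrix identity."""
    H = X @ np.linalg.pinv(X)          # hat matrix of the OLS fit
    resid = y - H @ y
    return y - resid / (1.0 - np.diag(H))


def cv_risk(P, y, w):
    """Delete-one CV criterion of the weighted average of candidate models."""
    return np.mean((y - P @ w) ** 2)


def cv_weights(P, y, n_iter=500):
    """Minimize the CV criterion over the simplex with Frank-Wolfe steps."""
    M = P.shape[1]
    w = np.full(M, 1.0 / M)            # start from equal weights
    for _ in range(n_iter):
        resid = P @ w - y
        grad = P.T @ resid             # gradient up to a positive constant
        d = -w.copy()
        d[np.argmin(grad)] += 1.0      # move toward the best vertex
        Pd = P @ d
        denom = Pd @ Pd
        if denom == 0.0:
            break
        step = np.clip(-(resid @ Pd) / denom, 0.0, 1.0)  # exact line search
        w = w + step * d
    return w


rng = np.random.default_rng(1)
n = 150
Z = rng.normal(size=(n, 4))
y = 1.0 + Z[:, 0] - 2.0 * Z[:, 1] + rng.normal(size=n)

# Three nested candidate models (intercept plus a subset of covariates).
subsets = [[0], [0, 1], [0, 1, 2, 3]]
P = np.column_stack([
    loo_predictions(np.column_stack([np.ones(n), Z[:, s]]), y) for s in subsets
])

w = cv_weights(P, y)
equal = np.full(len(subsets), 1.0 / len(subsets))
print("weights:", np.round(w, 3))
print("CV risk (optimized vs equal):", cv_risk(P, y, w), cv_risk(P, y, equal))
```

Because each step uses an exact line search starting from equal weights, the optimized criterion in this sketch is never larger than the equal-weight criterion on the same data.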
University of Science and Technology of China
Title: Estimating the Number of Communities Based on Partial Networks
Abstract: We introduce a community detection method that utilizes partial network information and provides an estimate of the number of communities K from data. To achieve this, we construct a hypothesis test whose test statistic is built from singular values and eigenvalues of partitioned matrices obtained from a centered and rescaled partial adjacency matrix. Using results from random matrix theory, we establish the asymptotic null distribution of the test statistic and the consistency of the resulting estimate of K. Extensive simulations, including both directed and undirected graphs, and real data examples are conducted to confirm the effectiveness and usefulness of our proposed method.
University of Science and Technology of China
Title: Robust Estimation of Dimension Reduction Subspace under High-Dimensional and Heavy-Tailed Design
Abstract: Sufficient dimension reduction (SDR) is an elegant tool to tackle high dimensionality while maintaining the primary information of the prediction problem. A variety of high-dimensional SDR methods have been developed in the literature and have demonstrated great promise in many applications. The existing methods all rely on the assumption that the high-dimensional predictor is light-tailed. However, this assumption is frequently violated in practice. In this paper, we propose a novel high-dimensional SDR method that consistently estimates the central subspace in the presence of heavy tails. In this sense, our proposal is robust and also scales well to high dimensions. We achieve robustness by establishing a fundamental invariance result under the generalized regression model assumption and the elliptically-contoured design assumption. The invariance result connects the targeted central subspace to a surrogate subspace. It converts the original challenging SDR problem into the estimation of the surrogate subspace, which can be readily implemented using existing state-of-the-art high-dimensional SDR methods. Theoretically, our proposal is shown to be consistent, and its convergence rate is shown to be optimal. Empirically, the efficiency and effectiveness of our proposal are demonstrated by extensive simulation studies and real data examples.