Research Output

Yang, L., Huang, Y., & Chang, J. (2026). EMMS: Multi-Label Multi-Dimensional Selection. The Main Track of IJCAI 2026, in press.

Multi-label data are widely encountered in fields such as image recognition, text classification, and bioinformatics. Unlike traditional single-label data, each instance in a multi-label setting is typically associated with multiple labels. For example, a travel photo may be annotated with labels such as “beach”, “sunset” and “people”. However, real-world multi-label data often involve high dimensionality, outliers, and label noise. These issues can easily result in the curse of dimensionality and interfere with a model’s ability to learn from informative samples and reliable label information, thereby impairing the performance of downstream tasks such as classification and prediction.

To address the above problems, conventional methods usually select representative features, samples, or labels independently, while overlooking the interdependencies among them during the selection process. Moreover, these methods often assume that label annotations are noise-free, an assumption that is rarely valid in practical applications. To overcome these limitations, this paper proposes an evidence-theory-based multi-dimensional selection method for multi-label data, which jointly selects features, samples, and labels. Specifically, the proposed method employs a dual-projection framework with sparse constraints to map high-dimensional data first into a latent space and then into the label space. By explicitly modeling projection residuals, it identifies representative samples and thereby enables the joint selection of features, samples, and labels. Furthermore, evidence theory is introduced to integrate information from both the sample level and the label level, thereby improving the reliability of label learning and mitigating the adverse effects of noisy labels.

Extensive experimental results demonstrate that the proposed method outperforms existing state-of-the-art methods across multiple evaluation metrics. Case studies on multi-label image datasets further confirm its effectiveness.

常晋源, Fang, Q., Kolaczyk, E., MacDonald, P. & Yao, Q. (2026+). Autoregressive networks with dependent edges. Journal of the Royal Statistical Society Series B, in press.

We propose an autoregressive framework for modelling dynamic networks with dependent edges. It encompasses models that accommodate, for example, transitivity, degree heterogeneity, and other stylized features often observed in real network data. By assuming the edges of networks at each time are independent conditionally on their lagged values, the models, which exhibit a close connection with temporal exponential random graph models, facilitate both simulation and the maximum likelihood estimation (MLE) in a straightforward manner. Due to the possibly large number of parameters in the models, the natural MLEs may suffer from slow convergence rates. An improved estimator for each component parameter is proposed based on an iteration employing projection, which mitigates the impact of the other parameters. Leveraging a martingale difference structure, the asymptotic distribution of the improved estimator is derived without the assumption of stationarity. The limiting distribution is not normal in general, although it reduces to normal when the underlying process satisfies some mixing conditions. Illustration with a transitivity model was carried out in both simulation and a real network data set.

常晋源, 杨麟, Zha, M., & Zhou, W.-X. (2026+). Adapting to noise tails in private linear regression. Journal of the American Statistical Association, in press.

While the traditional goal of statistics is to infer population parameters, modern practice increasingly demands protection of individual privacy. One way to address this need is to adapt classical statistical procedures into privacy-preserving algorithms. In this paper, we develop differentially private tail-robust methods for linear regression. The trade-off among bias, privacy, and robustness is controlled by a tunable robustification parameter in the Huber loss. We implement noisy clipped gradient descent for low-dimensional settings and noisy iterative hard thresholding for high-dimensional sparse models. Under sub-Gaussian errors, our method achieves near-optimal convergence rates while relaxing several assumptions required in earlier work. For heavy-tailed errors, we explicitly characterize how the non-asymptotic convergence rate depends on the moment index, privacy parameters, sample size, and intrinsic dimension. Our analysis shows how the moment index influences the choice of robustification parameters and, in turn, the resulting statistical error and privacy cost. By quantifying the interplay among bias, privacy, and robustness, we extend classical perspectives on privacy-preserving robust regression. The proposed methods are evaluated through simulations and two real datasets.
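
To make the idea of noisy clipped gradient descent on the Huber loss concrete, here is a minimal sketch. It is not the authors' algorithm: the function name `noisy_clipped_gd`, the step size, the iteration count, and the noise level `sigma` are all hypothetical, and `sigma` is a placeholder rather than a noise scale calibrated to any differential privacy guarantee. The sketch only shows how the robustification parameter `tau` caps the influence of each residual before Gaussian noise is added to the averaged gradient.

```python
import numpy as np

def noisy_clipped_gd(X, y, tau, lr=0.5, steps=300, sigma=0.0, seed=0):
    """Gradient descent on the Huber loss with Gaussian noise added to each gradient.

    tau   : robustification parameter of the Huber loss.
    sigma : hypothetical noise level; NOT calibrated to a privacy budget.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    beta = np.zeros(d)
    for _ in range(steps):
        resid = y - X @ beta
        # Derivative of the Huber loss: residuals clipped to [-tau, tau].
        psi = np.clip(resid, -tau, tau)
        grad = -(X * psi[:, None]).mean(axis=0)
        # Perturb the averaged gradient with Gaussian noise, then take a step.
        beta = beta - lr * (grad + sigma * rng.standard_normal(d))
    return beta

# Simulated example with heavy-tailed (Student t, 2 degrees of freedom) errors.
rng = np.random.default_rng(1)
n, d = 2000, 3
X = rng.standard_normal((n, d))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.standard_t(df=2, size=n)
beta_hat = noisy_clipped_gd(X, y, tau=2.0, sigma=0.01)
print(np.round(beta_hat, 2))
```

A smaller `tau` bounds each observation's contribution more tightly, which is what makes a bounded-sensitivity noise mechanism possible, at the cost of more bias under asymmetric errors.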

常晋源, 杜悦, 何婧, & Yao, Q. (2026+). Testing independence and conditional independence in high dimensions via coordinatewise Gaussianization. Journal of the American Statistical Association, in press.

We propose new statistical tests, in high-dimensional settings, for testing the independence of two random vectors and their conditional independence given a third random vector. The key idea is simple, i.e., we first transform each component variable to the standard normal via its marginal empirical distribution, and we then test for independence and conditional independence of the transformed random vectors using appropriate L∞-type test statistics. While we are testing some necessary conditions of the independence or the conditional independence, the new tests outperform the 13 frequently used testing methods in a large scale simulation comparison. The advantage of the new tests can be summarized as follows: (i) they do not require any moment conditions, (ii) they allow arbitrary dependence structures of the components among the random vectors, and (iii) they allow the dimensions of random vectors to diverge at the exponential rates of the sample size. The critical values of the proposed tests are determined by a computationally efficient multiplier bootstrap procedure. Theoretical analysis shows that the sizes of the proposed tests can be well controlled by the nominal significance level, and the proposed tests are also consistent under certain local alternatives. The finite sample performance of the new tests is illustrated via extensive simulation studies and a real data application.
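
The coordinatewise Gaussianization step described above can be illustrated with a short sketch. This is an assumption-laden illustration rather than the paper's exact construction: the function name `gaussianize` is hypothetical, and rescaling the ranks by n + 1 (to keep the empirical distribution strictly inside (0, 1)) is one common convention that may differ from the authors' choice. Each column is replaced by standard normal quantiles of its rescaled ranks, so no moment conditions on the original data are needed.

```python
import numpy as np
from statistics import NormalDist

def gaussianize(x):
    """Map each column of x to normal scores via its marginal empirical distribution."""
    x = np.asarray(x, dtype=float)
    n = x.shape[0]
    # Ranks 1..n within each column (assumes continuous data, so no ties).
    ranks = x.argsort(axis=0).argsort(axis=0) + 1
    # Rescale by n + 1 so every value lies strictly between 0 and 1.
    u = ranks / (n + 1)
    # Apply the standard normal quantile function elementwise.
    inv_cdf = np.vectorize(NormalDist().inv_cdf)
    return inv_cdf(u)

# Cauchy data have no finite moments, yet the transformed columns are well behaved.
rng = np.random.default_rng(0)
heavy = rng.standard_cauchy(size=(200, 3))
z = gaussianize(heavy)
print(z.shape, float(np.abs(z).max()))
```

Because the transform depends only on ranks, it preserves the ordering within each coordinate and is unaffected by monotone rescaling of the original variables.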

常晋源, Tang, C. Y., & 朱元正 (2025). Bayesian penalized empirical likelihood and Markov Chain Monte Carlo sampling. Journal of the Royal Statistical Society: Series B, 87, 1127-1149.

In this study, we introduce a novel methodological framework called Bayesian Penalized Empirical Likelihood (BPEL), designed to address the computational challenges inherent in empirical likelihood (EL) approaches. Our approach has two primary objectives: (i) to enhance the inherent flexibility of EL in accommodating diverse model conditions, and (ii) to facilitate the use of well-established Markov Chain Monte Carlo (MCMC) sampling schemes as a convenient alternative to the complex optimization typically required for statistical inference using EL. To achieve the first objective, we propose a penalized approach that regularizes the Lagrange multipliers, significantly reducing the dimensionality of the problem while accommodating a comprehensive set of model conditions. For the second objective, our study designs and thoroughly investigates two popular sampling schemes within the BPEL context. We demonstrate that the BPEL framework is highly flexible and efficient, enhancing the adaptability and practicality of EL methods. Our study highlights the practical advantages of using sampling techniques over traditional optimization methods for EL problems, showing rapid convergence to the global optima of posterior distributions and ensuring the effective resolution of complex statistical inference challenges.

常晋源, 杜悦, 黄光麟 & Yao, Q. (2025+). Identification and estimation for matrix time series CP-factor models. The Annals of Statistics, in press.

We propose a new method for identifying and estimating the CP-factor models for matrix time series. Unlike the generalized eigenanalysis-based method of Chang et al. (2023) for which the convergence rates of the associated estimators may suffer from small eigengaps as the asymptotic theory is based on some matrix perturbation analysis, the proposed new method enjoys faster convergence rates which are free from any eigengaps. It achieves this by turning the problem into a joint diagonalization of several matrices whose elements are determined by a basis of a linear system, and by choosing the basis carefully to avoid near co-linearity (see Proposition 5 and Section 4.3). Furthermore, unlike Chang et al. (2023) which requires the two factor loading matrices to be full-ranked, the proposed new method can handle rank-deficient factor loading matrices. Illustration with both simulated and real matrix time series data shows the advantages of the proposed new method.

常晋源, Jiang, Q., McElroy, T., & Shao, X. (2025). Statistical inference for high-dimensional spectral density matrix. Journal of the American Statistical Association, 120, 1960-1974.

The spectral density matrix is a fundamental object of interest in time series analysis, and it encodes both contemporary and dynamic linear relationships between component processes of the multivariate system. In this article we develop novel inference procedures for the spectral density matrix in the high-dimensional setting. Specifically, we introduce a new global testing procedure to test the nullity of the cross-spectral density for a given set of frequencies and across pairs of component indices. For the first time, both Gaussian approximation and parametric bootstrap methodologies are employed to conduct inference for a high-dimensional parameter formulated in the frequency domain, and new technical tools are developed to provide asymptotic guarantees of the size accuracy and power for global testing. We further propose a multiple testing procedure for simultaneously testing the nullity of the cross-spectral density at a given set of frequencies. The method is shown to control the false discovery rate. Both numerical simulations and a real data illustration demonstrate the usefulness of the proposed testing methods. Supplementary materials for this article are available online, including a standardized description of the materials available for reproducing the work.

Zhang, X., 周玮, Liu, J., & Kang, J. (2026+). Statistical inference for mediation models with high dimensional exposures and mediators. Journal of the American Statistical Association, in press.

High-dimensional mediation analysis has gained increasing interest in various fields, particularly in genetic and medical research. Compared with existing works that focus mainly on high-dimensional mediators, this paper advocates a new framework of Partial Regularization-based Inference for Mediation Effects (PRIME) when both exposures and mediators are high-dimensional. Estimated direct and indirect effects are established using a group-wise partially penalized least squares method, incorporating a double-layer latent factor structure. F-type and Wald tests for the high-dimensional direct and indirect effects, respectively, are advocated based on the proposed estimators. Both theoretical and numerical performance of PRIME have been carefully studied. PRIME is also applied to investigating direct effects of genetic variants on Alzheimer’s disease (AD) and indirect effects of them mediated by changes in brain activity intensity.

周玮, Kang, X., Zhong, W., & Wang, J. (2025+). Efficient learning of DAG structures in heavy-tailed data. Statistica Sinica, in press.

Directed acyclic graph (DAG) models are widely used to discover causal relationships among random variables. However, most existing DAG learning algorithms are not directly applicable to heavy-tailed data, which are commonly observed in finance and other fields. In this article, we propose a two-step efficient algorithm based on topological layers, referred to as TopHeat, to learn linear DAGs with heavy-tailed error distributions which include Pareto, Fréchet, log-normal, Cauchy distributions, and so on. First, we reconstruct the topological layers hierarchically in a top-down fashion based on the new reconstruction criteria for heavy-tailed DAGs without assuming the popularly-employed faithfulness condition. Second, we recover the directed edges via the modified conditional independence testing for heavy-tailed distributions. We theoretically demonstrate the consistency of the exact DAG structures. Monte Carlo simulations validate the outstanding finite-sample performance of the proposed algorithm compared with competing methods. In the real data analysis, we analyze the exchange rates among 17 countries and uncover the source of financial contagion and the pathways, which indicates that the financial risk contagion effect became increasingly stable among European countries as the euro was introduced.

Zeng, D., Xu, Z., 刘史毓, Pan, Y., Wang, Q., & Tang, X. (2025). On the power of adaptive weighted aggregation in heterogeneous federated learning and beyond. International Conference on Artificial Intelligence and Statistics (AISTATS).

Federated averaging (FedAvg) is the most fundamental algorithm in federated learning (FL). Previous theoretical results assert that FedAvg convergence and generalization degenerate under heterogeneous clients. However, recent empirical results show that FedAvg can perform well in many real-world heterogeneous tasks. These results reveal an inconsistency between FL theory and practice that is not fully explained. In this paper, we show that common heterogeneity measures contribute to this inconsistency based on rigorous convergence analysis. Furthermore, we introduce a new measure, client consensus dynamics, and prove that FedAvg can effectively handle client heterogeneity when an appropriate aggregation strategy is used. Building on this theoretical insight, we present a simple and effective FedAvg variant termed FedAWARE. Extensive experiments on three datasets and two modern neural network architectures demonstrate that FedAWARE ensures faster convergence and better generalization in heterogeneous client settings. Moreover, our results show that FedAWARE can significantly enhance the generalization performance of advanced FL algorithms when used as a plugin module. The source code is available at https://github.com/dunzeng/FedAWARE.
