Research Output

周玮, He, X., Zhong, W., & Wang, J. (2022). Efficient learning of quadratic variance function directed acyclic graphs via topological layers. Journal of Computational and Graphical Statistics, 31, 1269-1279.

Directed acyclic graph (DAG) models are widely used to represent causal relationships among random variables in many application domains. This article studies a special class of non-Gaussian DAG models, where the conditional variance of each node given its parents is a quadratic function of its conditional mean. This class of non-Gaussian DAG models is fairly flexible and admits many popular distributions as special cases, including Poisson, Binomial, Geometric, Exponential, and Gamma. To facilitate learning, we introduce a novel concept of topological layers, and develop an efficient DAG learning algorithm. It first reconstructs the topological layers in a hierarchical fashion and then recovers the directed edges between nodes in different layers, which requires much less computational cost than most existing algorithms in the literature. Its advantage is also demonstrated in a number of simulated examples, as well as in applications to two real-life datasets: an NBA player statistics dataset and a cosmetic sales dataset collected by Alibaba. Supplementary materials for this article are available online.
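
As a brief illustration of the quadratic variance function property mentioned in the abstract, the display below writes the conditional variance as a quadratic in the conditional mean and lists the coefficients for the named distributions. The symbols \mu_j, \beta_0, and \beta_1, and the no-constant-term parametrization, are assumptions made here for illustration and may differ from the notation used in the article.

```latex
% Assumed notation: \mu_j is the conditional mean of node X_j given its parents.
\[
  \operatorname{Var}\bigl(X_j \mid X_{\mathrm{pa}(j)}\bigr)
    = \beta_0\,\mu_j + \beta_1\,\mu_j^{2},
  \qquad
  \mu_j = \mathbb{E}\bigl(X_j \mid X_{\mathrm{pa}(j)}\bigr).
\]
% Standard special cases (coefficients follow from the usual mean/variance formulas):
\[
\begin{array}{lll}
  \text{Poisson}                                    & \beta_0 = 1, & \beta_1 = 0 \\
  \text{Binomial}(N, p)                             & \beta_0 = 1, & \beta_1 = -1/N \\
  \text{Geometric (failures before first success)}  & \beta_0 = 1, & \beta_1 = 1 \\
  \text{Exponential}                                & \beta_0 = 0, & \beta_1 = 1 \\
  \text{Gamma (shape } \alpha)                      & \beta_0 = 0, & \beta_1 = 1/\alpha
\end{array}
\]
```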

陈骋, Guo, S., & Qiao, X. (2022). Functional linear regression: dependence and error contamination. Journal of Business & Economic Statistics, 40, 444-457.

Functional linear regression is an important topic in functional data analysis. It is commonly assumed that samples of the functional predictor are independent realizations of an underlying stochastic process, and are observed over a grid of points contaminated by iid measurement errors. In practice, however, the dynamical dependence across different curves may exist and the parametric assumption on the error covariance structure could be unrealistic. In this article, we consider functional linear regression with serially dependent observations of the functional predictor, when the contamination of the predictor by the white noise is genuinely functional with fully nonparametric covariance structure. Inspired by the fact that the autocovariance function of observed functional predictors automatically filters out the impact from the unobservable noise term, we propose a novel autocovariance-based generalized method-of-moments estimate of the slope function. We also develop a nonparametric smoothing approach to handle the scenario of partially observed functional predictors. The asymptotic properties of the resulting estimators under different scenarios are established. Finally, we demonstrate that our proposed method significantly outperforms possible competing methods through an extensive set of simulations and an analysis of a public financial dataset.

Zhong, W., 周玮, Fan, Q., & Gao, Y. (2022). Dummy endogenous treatment effect estimation using high‐dimensional instrumental variables. Canadian Journal of Statistics, 50, 795-819.

We develop a two-stage approach to estimate the treatment effects of dummy endogenous variables using high-dimensional instrumental variables (IVs). In the first stage, instead of using a conventional linear reduced-form regression to approximate the optimal instrument, we propose a penalized logistic reduced-form model to accommodate both the binary nature of the endogenous treatment variable and the high dimensionality of the IVs. In the second stage, we replace the original treatment variable with its estimated propensity score and run a least-squares regression to obtain a penalized logistic regression instrumental variables estimator (LIVE). We show theoretically that the proposed LIVE is root-n consistent with the true treatment effect and asymptotically normal. Monte Carlo simulations demonstrate that LIVE is more efficient than existing IV estimators for endogenous treatment effects. In applications, we use LIVE to investigate whether the Olympic Games facilitate the host nation’s economic growth and whether home visits from teachers enhance students’ academic performance. In addition, the R functions for the proposed algorithms have been developed in an R package naivereg.
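
The two-stage idea described in the abstract can be sketched on simulated data as follows. This is only a toy illustration under stated assumptions, not the authors' implementation and not the API of the naivereg package: it uses three instruments rather than a high-dimensional set, swaps in a ridge (L2) penalty fitted by plain gradient descent as a stand-in for the penalized logistic step, and all function names, parameter values, and the data-generating design are hypothetical.

```python
import math
import random


def sigmoid(t):
    # Numerically stable logistic function.
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def fit_ridge_logistic(Z, d, lam=1e-3, lr=1.0, n_iter=200):
    """Stage 1 (assumed form): L2-penalized logistic regression of the
    binary treatment d on instruments Z, fitted by gradient descent."""
    n, p = len(Z), len(Z[0])
    w, b = [0.0] * p, 0.0
    for _ in range(n_iter):
        gw, gb = [0.0] * p, 0.0
        for zi, di in zip(Z, d):
            r = sigmoid(b + sum(wj * zj for wj, zj in zip(w, zi))) - di
            gb += r
            for j in range(p):
                gw[j] += r * zi[j]
        b -= lr * gb / n
        w = [wj - lr * (gj / n + lam * wj) for wj, gj in zip(w, gw)]
    return w, b


def ols_slope(x, y):
    """Slope from a least-squares fit of y on x with an intercept."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (c - my) for a, c in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def simulate(n=3000, beta=2.0, seed=0):
    """Hypothetical design: the treatment follows a logistic model in Z, and
    the outcome error shares the latent shock v, which makes d endogenous."""
    rng = random.Random(seed)
    gamma = [1.0, -1.0, 1.0]
    Z, d, y = [], [], []
    for _ in range(n):
        z = [rng.gauss(0.0, 1.0) for _ in gamma]
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        v = math.log(u / (1.0 - u))  # standard logistic shock
        treat = 1.0 if sum(g * zz for g, zz in zip(gamma, z)) + v > 0 else 0.0
        eps = 0.5 * v + rng.gauss(0.0, 1.0)  # correlated with v
        Z.append(z)
        d.append(treat)
        y.append(beta * treat + eps)
    return Z, d, y


Z, d, y = simulate()
w, b = fit_ridge_logistic(Z, d)
# Stage 2: replace the treatment with its fitted propensity score, then run OLS.
p_hat = [sigmoid(b + sum(wj * zj for wj, zj in zip(w, zi))) for zi in Z]
beta_live = ols_slope(p_hat, y)
beta_naive = ols_slope(d, y)  # ignores endogeneity, shown for comparison
print("two-stage estimate:", beta_live)
print("naive OLS estimate:", beta_naive)
```

In this simulated design the true effect is set to 2.0, and the naive regression on the raw treatment would be expected to drift away from it because the outcome error is correlated with treatment assignment. The sketch is meant only to show the order of the two stages, not to reproduce the article's results.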

Ai, Q., He, L., 刘史毓, & Xu, Z. (2022). ByPE-VAE: Bayesian pseudocoresets exemplar VAE. Neural Information Processing Systems (NeurIPS).

Recent studies show that advanced priors play a major role in deep generative models. Exemplar VAE, as a variant of VAE with an exemplar-based prior, has achieved impressive results. However, due to the nature of model design, an exemplar-based model usually requires vast amounts of data to participate in training, which leads to huge computational complexity. To address this issue, we propose Bayesian Pseudocoresets Exemplar VAE (ByPE-VAE), a new variant of VAE with a prior based on Bayesian pseudocoreset. The proposed prior is conditioned on a small-scale pseudocoreset rather than the whole dataset for reducing the computational cost and avoiding overfitting. Simultaneously, we obtain the optimal pseudocoreset via a stochastic optimization algorithm during VAE training aiming to minimize the Kullback-Leibler divergence between the prior based on the pseudocoreset and that based on the whole dataset. Experimental results show that ByPE-VAE can achieve competitive improvements over the state-of-the-art VAEs in the tasks of density estimation, representation learning, and generative data augmentation. Particularly, on a basic VAE architecture, ByPE-VAE is up to 3 times faster than Exemplar VAE while almost holding the performance. Code is available at https://github.com/Aiqz/ByPE-VAE.

常晋源, Chen, S. X., Tang, C. Y., & Wu, T. T. (2021). High-dimensional empirical likelihood inference. Biometrika, 108, 127-147.

High-dimensional statistical inference with general estimating equations is challenging and remains little explored. We study two problems in the area: confidence set estimation for multiple components of the model parameters, and model specification tests. First, we propose to construct a new set of estimating equations such that the impact from estimating the high-dimensional nuisance parameters becomes asymptotically negligible. The new construction enables us to estimate a valid confidence region by empirical likelihood ratio. Second, we propose a test statistic as the maximum of the marginal empirical likelihood ratios to quantify data evidence against the model specification. Our theory establishes the validity of the proposed empirical likelihood approaches, accommodating over-identification and exponentially growing data dimensionality. Numerical studies demonstrate promising performance and potential practical benefits of the new methods.

Zheng, X., Guo, B., 何婧, & Chen, S. X. (2021). Effects of corona virus disease‐19 control measures on air quality in North China. Environmetrics, 32, e2673.

Corona virus disease-19 (COVID-19) has substantially reduced human activities and the associated anthropogenic emissions. This study quantifies the effects of COVID-19 control measures on six major air pollutants over 68 cities in North China by a Difference in Relative-Difference method that allows estimation of the COVID-19 effects while taking account of the general annual air quality trends, temporal and meteorological variations, and the spring festival effects. Significant COVID-19 effects on all six major air pollutants are found, with NO2 having the largest decline (−39.6%), followed by PM2.5 (−30.9%), O3 (−16.3%), PM10 (−14.3%), CO (−13.9%), and the least in SO2 (−10.0%), which shows the achievability of air quality improvement by a large reduction in anthropogenic emissions. The heterogeneity of effects among the six pollutants and different regions can be partly explained by coal consumption and industrial output data.

Zhong, W., Gao, Y., 周玮, & Fan, Q. (2021). Endogenous treatment effect estimation using high-dimensional instruments and double selection. Statistics & Probability Letters, 169, 108967.

We propose a double selection instrumental variable estimator for the endogenous treatment effects using both high-dimensional control variables and instrumental variables. It deals with the endogeneity of the treatment variable and reduces omitted variable bias due to imperfect model selection.

Chen, X., 张佳, & Zhou, W. (2022). High-dimensional elliptical sliced inverse regression in non-Gaussian distributions. Journal of Business & Economic Statistics, 40, 1204-1215.

常晋源, Kolaczyk, E. D. & Yao, Q. (2020). Discussion of ‘Network cross-validation by edge sampling’. Biometrika, 107, 277-280.

We thank the authors for their new contribution to network modelling. Data reuse, encompassing methods such as bootstrapping and cross-validation, is an area that to date has largely resisted obvious and rapid development in the network context. One of the major reasons is that mimicking the original sampling mechanisms is challenging if not impossible. To avoid deleting edges and destroying some of the network structure, the resampling strategy proposed in Li et al. (2020) based on splitting node pairs rather than nodes is therefore insightful and effective. Matrix completion is the key technique involved, with its use here providing a new perspective for network analysis.

Li, Q., 余关元, & Liu, Y. (2020). A deep multimodal generative and fusion framework for class-imbalanced multimodal data. Multimedia Tools and Applications, 79, 25023-25050.

The purpose of multimodal classification is to integrate features from diverse information sources to make decisions. The interactions between different modalities are crucial to this task. However, common strategies in previous studies have been to either concatenate features from various sources into a single compound vector or input them separately into several different classifiers that are then assembled into a single robust classifier to generate the final prediction. Both of these approaches weaken or even ignore the interactions among different feature modalities. In addition, in the case of class-imbalanced data, multimodal classification becomes troublesome. In this study, we propose a deep multimodal generative and fusion framework for multimodal classification with class-imbalanced data. This framework consists of two modules: a deep multimodal generative adversarial network (DMGAN) and a deep multimodal hybrid fusion network (DMHFN). The DMGAN is used to handle the class imbalance problem. The DMHFN identifies fine-grained interactions and integrates different information sources for multimodal classification. Experiments on a faculty homepage dataset show the superiority of our framework compared to several state-of-the-art methods.
