Statistics seminar
Title: Two kinds of over-dispersion affect regional DNA methylation patterns
Speaker: Celia Greenwood, McGill
Date and time: 14 Feb 2023, 11:00am - 12:00pm
Location: CLE C115
Abstract: DNA methylation is an epigenetic mark intrinsically involved in regulating the activity of DNA, and methylation levels are known to change with age, exposures, and disease status. DNA methylation can be measured with a sequencing technique that gives methylated and unmethylated counts at each targeted position in the genome, but the data are very noisy. I will describe an over-dispersed quasi-binomial model with functional smoothing to model DNA methylation patterns in small genomic regions, and how these patterns depend on covariates. Results will be illustrated with an analysis of DNA methylation and a biomarker strongly associated with rheumatoid arthritis.
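To make the idea of over-dispersion in methylated/unmethylated counts concrete, here is a minimal sketch of the standard Pearson dispersion estimate for a quasi-binomial model with a single common proportion. It is not the speaker's method: it ignores functional smoothing and covariates, and the function name and read counts are made up for illustration.

```python
def pearson_dispersion(methylated, total):
    """Pearson estimate of the quasi-binomial dispersion for an intercept-only model."""
    # Pooled methylation proportion across all sites.
    p_hat = sum(methylated) / sum(total)
    # Sum of squared Pearson residuals: (y - n p)^2 / (n p (1 - p)).
    chi2 = sum(
        (y - n * p_hat) ** 2 / (n * p_hat * (1 - p_hat))
        for y, n in zip(methylated, total)
    )
    # Divide by residual degrees of freedom (observations minus one fitted parameter).
    return chi2 / (len(total) - 1)


# Hypothetical read counts at six positions (illustrative values only).
methylated = [2, 18, 5, 19, 3, 16]
total = [20, 20, 20, 20, 20, 20]
phi = pearson_dispersion(methylated, total)
print(f"estimated dispersion: {phi:.2f}")
```

A value near 1 would be consistent with plain binomial variation; a value well above 1, as here, indicates over-dispersion that a plain binomial model would understate.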
Title: Valid inference after clustering with application to single-cell RNA-sequencing data
Speaker: Lucy Gao, UBC
Date and time: 07 Feb 2023, 11:00am - 12:00pm
Location: CLE C115
Abstract: In single-cell RNA-sequencing studies, researchers often model the variation between cells with a latent variable, such as cell type or pseudotime, and investigate associations between the genes and the latent variable. As the latent variable is unobserved, a two-step procedure seems natural: first estimate the latent variable, then test the genes for association with the estimated latent variable. However, if the same data are used for both of these steps, then standard methods for computing p-values in the second step will fail to control the type I error rate.
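The failure described in this abstract can be seen in a small simulation. The sketch below is an illustration under simplifying assumptions, not the speaker's procedure: there is one gene, no true groups, "clustering" is a split at the sample median, and the test is a large-sample z-test. All names and settings are hypothetical.

```python
import random
from statistics import NormalDist, mean, variance


def z_test_pvalue(a, b):
    """Two-sided large-sample z-test p-value for a difference in means."""
    se = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    z = (mean(a) - mean(b)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def rejection_rates(n_cells=100, n_reps=200, alpha=0.05, seed=1):
    """Rejection rates under the null for data-driven and data-independent groups."""
    rng = random.Random(seed)
    naive = control = 0
    half = n_cells // 2
    for _ in range(n_reps):
        # No true groups: every cell is drawn from the same N(0, 1).
        x = [rng.gauss(0, 1) for _ in range(n_cells)]
        # "Clustering" step: split the cells at the median of the same data.
        cut = sorted(x)[half]
        low = [v for v in x if v < cut]
        high = [v for v in x if v >= cut]
        naive += z_test_pvalue(low, high) < alpha
        # Control: groups defined by index, independently of the values.
        control += z_test_pvalue(x[:half], x[half:]) < alpha
    return naive / n_reps, control / n_reps


naive_rate, control_rate = rejection_rates()
print(f"reuse data for grouping and testing: {naive_rate:.2f}")
print(f"groups independent of the data:      {control_rate:.2f}")
```

When the groups are estimated from the same data that are then tested, the null hypothesis is rejected almost every time, even though no real difference exists.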
Title: Multivariate One-sided Tests for Nonlinear Mixed Effects Models with Censored Responses
Speaker: Lang Wu, UBC
Date and time: 31 Jan 2023, 11:00am - 12:00pm
Location: via Zoom
Abstract: Nonlinear mixed effects (NLME) models are commonly used to model longitudinal data such as pharmacokinetics and HIV viral dynamics. These models are often derived from the underlying data generation mechanisms, so their parameters often have meaningful physical interpretations and natural restrictions, such as some parameters being positive. Hypothesis testing for these parameters should incorporate these restrictions, leading to one-sided or constrained tests. Motivated by HIV viral dynamic models, in this article we propose multi-parameter one-sided or constrained tests for NLME models with censored responses, e.g., viral dynamic models with viral loads below detection limits. We propose approximate likelihood-based tests which are computationally efficient. We evaluate the tests via simulations and show that the proposed tests are more powerful than the corresponding two-sided or unrestricted tests. We apply the proposed tests to an AIDS dataset and report new findings.
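The power gain from respecting a sign restriction can be illustrated in the simplest possible setting. The sketch below is far simpler than the multi-parameter NLME tests in the talk: it assumes a single test statistic distributed as N(effect, 1) with a positive true effect, and the function name is hypothetical.

```python
from statistics import NormalDist


def z_test_power(effect, alpha=0.05):
    """Power of one- and two-sided z-tests when the statistic is N(effect, 1)."""
    std = NormalDist()
    z_one = std.inv_cdf(1 - alpha)      # one-sided critical value
    z_two = std.inv_cdf(1 - alpha / 2)  # two-sided critical value
    one_sided = 1 - std.cdf(z_one - effect)
    # The two-sided test rejects in either tail.
    two_sided = (1 - std.cdf(z_two - effect)) + std.cdf(-z_two - effect)
    return one_sided, two_sided


one_sided, two_sided = z_test_power(effect=2.0)
print(f"one-sided power: {one_sided:.3f}, two-sided power: {two_sided:.3f}")
```

At the same significance level, the one-sided test has the smaller critical value in the direction of the true effect, which is where its extra power comes from.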
Title: A novel machine learning approach for gene module identification and prediction via a co-expression network of single-cell sequencing data
Speaker: Li Xing, University of Saskatchewan
Date and time: 24 Jan 2023, 11:00am - 12:00pm
Location: CLE C115
Abstract: Gene co-expression network analysis is widely used in microarray and RNA sequencing data analysis. It groups genes based on their co-expression network, and genes within a group are inferred to be similar in function or coregulated in a pathway.
In the literature, the approaches to grouping genes are mainly unsupervised, which may introduce instability and variation across different datasets. Inspired by ensemble learning, we propose a novel approach that ensembles supervised and unsupervised learning techniques and simultaneously works on two tasks, gene module identification and phenotype prediction, during the data analysis process. The identified gene modules from this approach could suggest more candidate genes for the original pathway, and those genes are potential biomarkers for pathway-related diseases. In addition, the novel approach also improves the prediction accuracy for phenotypes.
The algorithm can be used as a general prediction algorithm. As it is specially designed to handle large samples, it is also suitable for single-cell data with many cells. We showcase the use of the algorithm in single-cell cell-type auto-annotation.
Title: Evaluation of Logrank, MaxCombo and Difference in Restricted Mean Survival Time in Immuno-Oncology (IO) trials - A retrospective analysis in patients treated with anti-PD1/PDL1 agents across solid tumors
Speaker: JiaBu Ye, MSD
Date and time: 29 Nov 2022, 1:30pm - 2:30pm
Location: via Zoom
Abstract: The log-rank test is considered the criterion standard for comparing two survival curves in pivotal registrational oncology trials. However, with novel immunotherapies that often violate the proportional hazards assumption over time, the log-rank test can lose power and may fail to detect treatment benefit. We performed a systematic review and meta-analysis of 63 studies comparing the log-rank, MaxCombo, and difference in restricted mean survival time (dRMST) tests. The findings of this review show that MaxCombo may provide a pragmatic alternative to log-rank when departure from proportional hazards is anticipated. Both tests resulted in the same statistical decision in most comparisons. Discordant studies had modest to meaningful improvements in treatment effect. The dRMST test provided no added sensitivity for detecting treatment differences over log-rank.
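For readers unfamiliar with restricted mean survival time, the following sketch computes it as the area under a Kaplan-Meier curve up to a truncation time and takes the difference between two arms. It gives only the point estimate, not the dRMST test itself (which also needs a variance), and the function name, truncation time, and data are made up for illustration.

```python
def km_rmst(times, events, tau):
    """Restricted mean survival time: area under the Kaplan-Meier curve on [0, tau]."""
    data = sorted(zip(times, events))
    at_risk = len(data)
    surv, last_t, area = 1.0, 0.0, 0.0
    i = 0
    while i < len(data) and data[i][0] <= tau:
        t = data[i][0]
        # Gather all subjects tied at time t (event = 1, censored = 0).
        deaths = removed = 0
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        # Add the rectangle under the current step of the curve.
        area += surv * (t - last_t)
        if deaths:
            surv *= 1 - deaths / at_risk
        at_risk -= removed
        last_t = t
    # Final rectangle from the last observed time up to tau.
    area += surv * (tau - last_t)
    return area


# Hypothetical follow-up times (months) and event indicators for two arms.
control = km_rmst([2, 4, 6, 8, 10], [1, 1, 1, 1, 1], tau=10)
treated = km_rmst([3, 5, 9, 12, 12], [1, 0, 1, 0, 0], tau=10)
drmst = treated - control
print(f"dRMST over 10 months: {drmst:.2f}")
```

Unlike a hazard ratio, this difference has a direct interpretation (average event-free time gained up to the truncation time) that does not rely on proportional hazards.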
Bio:
Jiabu Ye is a principal scientist of Biostatistics at MSD. He is the lead statistician for the endometrial indication, overseeing several late-phase endometrial clinical trials. Before joining MSD, Jiabu worked as a late-development trial statistician at AstraZeneca and led multiple late-phase trial developments. He is also a member of the NPH cross-pharma working group. He received his PhD in biostatistics from the University of Texas Health Science Center at Houston.
Title: Insights from a Statistician working in the BC Public Service
Speaker: Beiyan Ou, BC Ministry of Agriculture and Food
Date and time: 22 Nov 2022, 1:30pm - 2:30pm
Location: MAC D010
The speaker, Beiyan Ou, is an MSc alumna of our department who is now a senior manager in the BC Ministry of Agriculture and Food. In this seminar talk, students will learn about some of the analyses that she’s done to inform policy development during her career in the BC Public Service. Beiyan will also offer a glimpse into the typical work day of a statistician who is part of a multidisciplinary team of government agencies. This special talk is not focused on the development of novel methods. However, it will offer helpful insights for students considering working in government agencies after graduation.
Title: Extensions and Applications of Wasserstein Distance and Optimal Transport
Speaker: Lynn Lin, Duke University
Date and time: 15 Nov 2022, 1:30pm - 2:30pm
Location: via Zoom
Abstract: Optimal transport (OT) is a principled approach for matching, having achieved success in diverse applications such as tracking and cluster alignment. It is also the core computational problem in computing the Wasserstein metric between probability distributions, which has been increasingly used in machine learning. In this talk, I will present a new distance called Minimized Aggregated Wasserstein (MAW) for Gaussian mixture models. The definition of MAW exploits OT as a fundamental matching principle. We then extend OT by a new optimization formulation called Optimal Transport with Relaxed Marginal Constraints (OT-RMC). Specifically, we relax the marginal constraints by introducing a penalty on the deviation from the constraints. We demonstrate how MAW and OT-RMC can easily adapt to various tasks through applications in single-cell data analysis.
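As background on OT as a matching principle, the sketch below computes the Wasserstein-1 distance between two equal-size samples on the real line, where the optimal matching simply pairs sorted values. It does not implement MAW or OT-RMC; the function name and data are hypothetical.

```python
def wasserstein_1d(xs, ys):
    """Wasserstein-1 distance between two equal-size empirical samples on the real line."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    # In one dimension the optimal transport plan matches sorted values in order.
    return sum(abs(x - y) for x, y in zip(sorted(xs), sorted(ys))) / len(xs)


a = [0.0, 1.0, 3.0]
b = [5.0, 6.0, 8.0]
dist_ab = wasserstein_1d(a, b)
print(dist_ab)
```

Here the second sample is the first shifted by 5, so the optimal plan moves every point by 5 and the distance equals the size of the shift. In higher dimensions no such shortcut exists and the matching must be found by solving an optimization problem.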
Title: Nonparametric high-dimensional multi-sample tests based on graph theory
Speaker: Xiaoping Shi, UBC Okanagan
Date and time: 08 Nov 2022, 1:30pm - 2:30pm
Location: via Zoom
Zoom link: https://uvic.zoom.us/j/83114822200?pwd=bDY1RnFmb05wZXJRZk52THBGbDFYZz09
High-dimensional data pose unique challenges for data processing in an era of ever-increasing amounts of data availability. Graph theory can provide a structure for high-dimensional data. We introduce two key properties desirable for graphs in testing homogeneity. Roughly speaking, these properties may be described as: unboundedness of edge counts under the same distribution and boundedness of edge counts under different distributions. It turns out that the minimum spanning tree violates these properties but the shortest Hamiltonian path possesses them. Based on the shortest Hamiltonian path, we propose two combinations of edge counts in multiple samples to test the homogeneity. We give the permutation null distributions of proposed statistics when sample sizes go to infinity. The power is analyzed by assuming both sample sizes and dimensionality tend to infinity. Simulations show that our new tests behave very well overall in comparison with various competitors. Real data analysis of tumors and images further confirms the value of our proposed tests. Software implementing the test is available in the R package Relevance.
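To illustrate edge counts on a shortest Hamiltonian path, the sketch below finds the path by brute force for a handful of points and counts the edges that join observations from different samples. This is only a toy illustration: the statistics proposed in the talk are particular combinations of edge counts that are not reproduced here, brute force is infeasible beyond about ten points, and all names and data are hypothetical.

```python
from itertools import permutations
from math import dist


def shortest_hamiltonian_path(points):
    """Brute-force shortest Hamiltonian path; feasible only for a handful of points."""
    best_order, best_len = None, float("inf")
    for order in permutations(range(len(points))):
        if order[0] > order[-1]:
            continue  # a path and its reverse have the same length
        length = sum(dist(points[a], points[b]) for a, b in zip(order, order[1:]))
        if length < best_len:
            best_order, best_len = order, length
    return best_order


def between_sample_edges(points, labels):
    """Number of path edges whose endpoints carry different sample labels."""
    order = shortest_hamiltonian_path(points)
    return sum(labels[a] != labels[b] for a, b in zip(order, order[1:]))


# Two well-separated hypothetical samples in the plane.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels = [0, 0, 0, 1, 1, 1]
crossings = between_sample_edges(points, labels)
print(crossings)
```

With clearly different distributions the path visits one sample completely before crossing to the other, so the between-sample edge count stays small. When the samples come from the same distribution, the labels interleave along the path and the count grows with the sample size.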
Title: Efficient Bayesian inference for complex statistical models via annealed sequential Monte Carlo method
Speaker: Liangliang Wang, Simon Fraser University
Date and time: 01 Nov 2022, 1:30pm - 2:30pm
Location: MAC D010
In this talk, I will describe an "embarrassingly parallel" method for Bayesian inference, annealed Sequential Monte Carlo (ASMC) with an adaptive determination of annealing parameters. The ASMC method can efficiently provide an approximate posterior distribution and an unbiased estimator for the marginal likelihood. We can use this unbiasedness property to test the correctness of posterior simulation and the estimated marginal likelihood to conduct Bayesian model comparison. We have applied the annealed SMC method to two non-standard applications with complex statistical models: 1) Bayesian inference of phylogenetic trees and evolutionary parameters from biological sequence data; 2) Estimation of parameters in nonlinear ordinary differential equations and model selection. We illustrate our method by comparing it with other methods such as standard Markov chain Monte Carlo algorithms using simulation studies and real data analysis.
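The general shape of an annealed SMC sampler can be sketched on a toy conjugate model whose marginal likelihood is known exactly. The sketch departs from the talk in several ways, all of them assumptions made for illustration: the annealing schedule is a fixed linear grid rather than adaptively determined, resampling is multinomial at every step, and the model (an N(0, 1) prior with a single N(theta, 1) observation) and all names are hypothetical.

```python
import math
import random


def log_lik(theta, y):
    """Log-density of y under N(theta, 1)."""
    return -0.5 * math.log(2 * math.pi) - 0.5 * (y - theta) ** 2


def log_prior(theta):
    """Log-density of the N(0, 1) prior."""
    return -0.5 * math.log(2 * math.pi) - 0.5 * theta ** 2


def annealed_smc(y, n_particles=1000, n_steps=20, seed=0):
    """Return (log marginal likelihood estimate, approximate posterior sample)."""
    rng = random.Random(seed)
    particles = [rng.gauss(0, 1) for _ in range(n_particles)]  # draws from the prior
    log_z = 0.0
    for step in range(1, n_steps + 1):
        beta_prev, beta = (step - 1) / n_steps, step / n_steps
        # Incremental weights: likelihood raised to the temperature increment.
        log_w = [(beta - beta_prev) * log_lik(th, y) for th in particles]
        m = max(log_w)
        w = [math.exp(lw - m) for lw in log_w]
        # Accumulate the log of the average incremental weight.
        log_z += m + math.log(sum(w) / n_particles)
        # Multinomial resampling proportional to the incremental weights.
        particles = rng.choices(particles, weights=w, k=n_particles)
        # One random-walk Metropolis move targeting prior * likelihood^beta.
        moved = []
        for th in particles:
            prop = th + rng.gauss(0, 0.5)
            log_ratio = (log_prior(prop) + beta * log_lik(prop, y)) - (
                log_prior(th) + beta * log_lik(th, y)
            )
            accept = rng.random() < math.exp(min(0.0, log_ratio))
            moved.append(prop if accept else th)
        particles = moved
    return log_z, particles


y_obs = 1.5
log_z_hat, posterior_sample = annealed_smc(y_obs)
# Exact value for this toy model: marginally y ~ N(0, 2).
log_z_exact = -0.5 * math.log(2 * math.pi * 2) - y_obs ** 2 / 4
print(f"estimated log Z: {log_z_hat:.3f}, exact log Z: {log_z_exact:.3f}")
```

Comparing the estimate with a known answer, as in the last lines, is a small-scale version of using the marginal likelihood estimator to check the correctness of a posterior simulation. The per-particle weight and move computations are independent of one another, which is what makes the method easy to parallelize.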
Short bio:
Liangliang Wang is an Associate Professor and Graduate Program Chair in the Department of Statistics and Actuarial Science at Simon Fraser University, where she has been a faculty member since 2013. Dr. Wang completed her Ph.D. in statistics at the University of British Columbia and her master's degree in statistics at McGill University. Her research interests focus on computational statistics and statistical machine learning. Her favourite applications come from important scientific questions raised in genetics, biology, public health, and environmetrics. Dr. Wang is interested in tackling the computational issues in complex statistical models applied to large-scale data. She has published 50 papers in statistical journals and machine learning conferences.
Title: A Constrained Minimum Criterion for Regression Model Selection
Speaker: Min Tsao, University of Victoria
Date and time: 25 Oct 2022, 4:00pm - 5:00pm
Location: via Zoom
Zoom link: https://uvic.zoom.us/j/83114822200?pwd=bDY1RnFmb05wZXJRZk52THBGbDFYZz09
ABSTRACT: Although log-likelihood is widely used in model selection, the log-likelihood ratio has had few applications in this area. In this talk, I present a log-likelihood ratio based method for selecting regression models which focuses on the set of models deemed plausible by the likelihood ratio test. I show that when the sample size is large and the significance level of the test is small, there is a high probability that the smallest model in the set is the true model; thus, the method selects this smallest model. The significance level of the test serves as a tuning parameter that controls the balance between the false active rate and false inactive rate of the selected model. I consider three levels of this parameter in a simulation study and compare this method with the Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC) to demonstrate its excellent accuracy and adaptability to different sample sizes.
Model selection is an active area of research with a long history, a wide range of perspectives, and a rich collection of methods. For students unfamiliar with this area, this talk includes a review of key methods including the AIC, BIC and modern Lp penalty methods. The new method presented in this talk offers a frequentist perspective on the model selection problem. It is an alternative and a strong competitor to the AIC and BIC for selecting regression models.
Title: An AI + HI Hybrid Content Moderation Solution for Microsoft News and Feeds
Speaker: Lizhen Peng, Microsoft WebXT Content Services
Date and time: 18 Oct 2022, 1:30pm - 2:30pm
Location: MAC D010
Abstract: Content moderation is a key, fundamental component of any content service or platform: it lets the platform offer friendly, meaningful, and non-toxic content for consumers to enjoy and engage with, and lets users build an online community in which to interact and communicate. The moderation service is the safety gatekeeper on which other features are built, such as content recommendation and personalization, targeted advertising, and so on. However, we face many large practical challenges on a daily basis. In this talk, we will present trending solutions for content moderation in the tech industry that leverage both Artificial Intelligence (AI) and Human Intelligence (HI) to overcome multi-dimensional obstacles and achieve goals from multiple perspectives in real practice.
Title: Leveraging spatial transcriptomics data to recover cell locations in single-cell RNA-seq with CeLEry
Speaker: Qihuang Zhang, Department of Epidemiology, Biostatistics and Occupational Health, McGill University
Date and time: 11 Oct 2022, 1:30pm - 2:30pm
Location: MAC D010
Abstract: Single-cell RNA sequencing (scRNA-seq) has transformed our understanding of cellular heterogeneity in health and disease, but the lack of physical relationships among dissociated cells has limited its applications. In this talk, we present CeLEry, a supervised deep learning algorithm to recover the spatial origins of cells in scRNA-seq by leveraging gene expression and spatial location relationships learned from spatial transcriptomics. CeLEry has a data augmentation procedure via a variational autoencoder to enlarge the training sample size, which improves the robustness of the method and overcomes noise in scRNA-seq. CeLEry can infer the spatial origins of cells in scRNA-seq at multiple levels, including 2D location as well as the spatial domain or tissue layer of a cell. CeLEry also provides uncertainty estimates for the recovered locations. This framework can be applied to study changes in cell distribution in cerebral cortex layers during the progression of Alzheimer's disease.
Title: Meta-clustering of Genomic Data
Speaker: Yingying Wei, Department of Statistics, The Chinese University of Hong Kong
Date and time: 04 Oct 2022, 4:00pm - 5:00pm
Location: via Zoom
Zoom link: https://uvic.zoom.us/j/83114822200?pwd=bDY1RnFmb05wZXJRZk52THBGbDFYZz09
Abstract:
Just as traditional meta-analysis pools effect sizes across studies to improve statistical power, there is increasing interest in clustering jointly across datasets to identify disease subtypes for bulk genomic data and discover cell types for single-cell RNA-sequencing (scRNA-seq) data. Unfortunately, due to the prevalence of technical batch effects among high-throughput experiments, directly clustering samples from multiple datasets can lead to wrong results. Recently emerging meta-clustering approaches require all datasets to contain all subtypes, which is not feasible for many experimental designs.
In this talk, I will present our Batch-effects-correction-with-Unknown-Subtypes (BUS) framework. BUS corrects batch effects explicitly, groups samples that share similar characteristics into subtypes, identifies features that distinguish subtypes, and enjoys linear-order computational complexity. We prove the identifiability of BUS not only for bulk data but also for scRNA-seq data whose dropout events are missing not at random. We show mathematically that under two very flexible and realistic experimental designs, the “reference panel” and the “chain-type” designs, true biological variability can also be separated from batch effects. Moreover, despite the active research on analysis methods for scRNA-seq data, rigorous statistical methods to estimate treatment effects for scRNA-seq data (how an intervention or exposure alters the cellular composition and gene expression levels) are still lacking. Building upon our BUS framework, we further develop statistical methods to quantify treatment effects for scRNA-seq data.
Title: Ensembling Classification Models Based on Phalanxes of Variables with Applications in Drug Discovery
Speaker: Dr. Jabed Tomal, Department of Mathematics and Statistics, Thompson Rivers University
Date and time: 27 Sep 2022, 1:30pm - 2:30pm
Location: MAC D010
Abstract: Statistical detection of a rare class of objects in a two-class classification problem can pose several challenges. Because the class of interest is rare in the training data, there is relatively little information in the known class response labels for model building. At the same time the available explanatory variables are often moderately high dimensional. In the four assays of our drug-discovery application, compounds are active or not against a specific biological target, such as lung cancer tumor cells, and active compounds are rare. Several sets of chemical descriptor variables from computational chemistry are available to classify the active versus inactive class; each can have up to thousands of variables characterizing molecular structure of the compounds. The statistical challenge is to make use of the richness of the explanatory variables in the presence of scant response information. Our algorithm divides the explanatory variables into subsets adaptively and passes each subset to a base classifier. The various base classifiers are then ensembled to produce one model to rank new objects by their estimated probabilities of belonging to the rare class of interest. The essence of the algorithm is to choose the subsets such that variables in the same group work well together; we call such groups phalanxes.
Title: Deciphering tissue microenvironment from Next Generation Sequencing data
Speaker: Dr. Jian Hu, Department of Human Genetics, Emory School of Medicine
Date and time: 20 Sep 2022, 1:30pm - 2:30pm
Location: via Zoom
Zoom link: https://uvic.zoom.us/j/83114822200?pwd=bDY1RnFmb05wZXJRZk52THBGbDFYZz09
ABSTRACT: The advent of high-throughput next-generation sequencing (NGS) technologies has transformed our understanding of cell biology and human disease. Since NGS was first adopted by the scientific community, its use has become widespread, and the technology has improved rapidly. It is now common for laboratories to assay genome-wide transcriptomes of thousands of cells in a single scRNA-seq experiment. In addition, technologies that enable the measurement of new information, for example, chromatin accessibility, protein quantification, and spatial location, have been developed. To take full advantage of this multi-modality information when analyzing NGS data, new methods are needed. This seminar will introduce several machine learning algorithms for NGS data analysis with different aims, including cell type classification, spatial domain detection, and tumor microenvironment annotation.
KEYWORDS: single cell RNA sequencing (scRNA-seq), Spatial transcriptomics (ST), tumor microenvironment, machine learning