Spring 2022 Seminars:
D2K Lab Seminar Speaker: Sangwon Hyun
Wednesday, February 2, 12pm - 1pm (Central Time)
Abstract: Although microscopic, phytoplankton in the ocean are extremely important to all of life and are together responsible for as much photosynthesis as all plants on land combined. Today, oceanographers are able to collect flow cytometry data in real time while onboard a moving ship, providing them with fine-scale information about the distribution of phytoplankton across thousands of kilometers. We describe the application of a novel sparse multivariate mixture of experts model to estimate the time-varying phytoplankton subpopulations while simultaneously identifying the specific environmental covariates that are predictive of the observed changes to these subpopulations. After this research portion of the talk, the second part will be a short teaching talk about regression analysis.
Bio: Sangwon Hyun is a post-doctoral researcher in the Data Sciences and Operations department of the University of Southern California, advised by Jacob Bien. Sangwon received his PhD in statistics from Carnegie Mellon University in 2018, under the supervision of Ryan Tibshirani and Max G’Sell. Sangwon’s research spans three different topics: changepoint inference, infectious disease forecasting, and ocean data science. The common theme in his research is developing new statistical methodology for problems in scientific domains, and working closely with scientists on analyzing large and complex scientific data. In addition to statistical research, Sangwon is interested in improving data science education and has conducted pedagogy research about devising and improving assessment test questions in statistics.
D2K Lab Seminar Speaker: Andersen Chang
Monday, January 31, 12pm - 1pm
Abstract: The first portion of the talk will discuss research work on graphical models for functional neuronal connectivity. With modern calcium imaging technology, the activities of thousands of neurons can be recorded simultaneously in vivo. These experiments can potentially provide new insights into functional connectivity, defined as the statistical relationships between the spiking activity of neurons in the brain. As a commonly used tool for estimating conditional dependencies in high-dimensional settings, graphical models are a natural choice for analyzing calcium imaging data. However, raw neuronal activity recording data presents a unique challenge: the important information lies in the rare extreme value observations that indicate neuronal firing, as opposed to the non-extreme observations associated with inactivity. To address this issue, we develop a novel class of graphical models, called the extreme graphical model, which focuses on finding relationships between features with respect to the extreme values. We first establish the full form of the extreme graphical model and discuss theoretical properties of the joint distribution and estimation procedure. We then demonstrate the empirical performance of the extreme graphical model on several neuroscience data examples, in which we apply our method to real-world calcium imaging data sets to obtain functional connectivity estimates.
The second portion of the talk will focus on teaching background. It will highlight experience in teaching courses in data science. A short teaching demo covering concepts of data visualization will also be presented.
Bio: Andersen Chang is currently a Ph.D. candidate in the Department of Statistics at Rice University. He received his B.S. in Statistics and Master of Statistical Practice from Carnegie Mellon University. His current research interests include machine learning and statistical methodology for neuroscience applications. Additionally, he has served as an instructor for several experiential learning courses in the D2K Lab at Rice.
D2K Lab Seminar Speaker: Xinjie Lan
Monday, January 24, 12pm - 1pm
Abstract: In the teaching session, he will discuss model evaluation and selection as taught in a graduate-level machine learning course. It will cover some basic definitions, such as underfitting/overfitting, and an entry-level model selection technique, namely regularization. In the research session, he will present an information-theoretic approach to improve the generalization performance of Deep Neural Networks (DNNs). Specifically, the research presentation is summarized below.
DNNs have already achieved great success in various applications. However, over-parameterized DNNs have severe generalization problems, especially in the Out-of-Distribution (OOD) domain. To resolve the OOD generalization problem, he mainly makes two contributions. First, he proposes a mutual information trade-off to explain the generalization behavior of DNNs. Second, leveraging the proposed information-theoretic explanation, he designs a novel information-theoretic approach to regularize the min-max optimization for improving DNN generalization. Experimental results demonstrate that the proposed information-theoretic techniques outperform the state-of-the-art methods.
Bio: Xinjie Lan is an instructor at the University of Delaware. He has several years of teaching experience in machine learning and statistical signal processing. In addition, his research focuses on machine learning theory, eXplainable Artificial Intelligence (XAI), and computer vision. He is particularly interested in exploiting statistical theory and information theory to explain the internal mechanism of Deep Neural Networks (DNNs) and to propose new approaches to optimize DNN performance in various computer vision tasks. Recently, his work has focused on improving the generalization and robustness of DNNs using information-theoretic techniques.
Spring 2021 Seminars:
D2K Lab Seminar Speaker: Gabriel Zenarosa
Thursday, April 1, 12:00pm - 1:00pm (Central Time)
Gabriel Lopez Zenarosa (he/him/his) is a Research Associate under the National Research Council (NRC) Research Associateship Programs (RAP) of the National Academies of Sciences, Engineering, and Medicine (NASEM) collaborating with researchers in the Air Force Research Laboratory Munitions Directorate (AFRL/RW). He previously was an Assistant Professor of Systems Engineering and Engineering Management at UNC Charlotte, where he taught courses on computational methods (programming in C++/Java and data analytics in R), systems design and deployment, fundamentals of engineering management, fundamentals of stochastic system analysis, and special topics. Zenarosa received his Doctor of Philosophy in Industrial Engineering from the University of Pittsburgh in 2016, Master of Software Engineering from Carnegie Mellon University in 2005, Master of Science in Computer Science from Columbia University in 2002, and Bachelor of Science in Computer Science from the University of the Philippines in 1997. Zenarosa also held positions in industry for over eight years as software engineer, software quality assurance test engineer, client support engineer, and software process consultant.
Title: Mixed-integer Programs for Transfer Learning
Abstract: Transfer learning is an approach for leveraging previously learned knowledge on some problem domain to aid in learning knowledge, particularly a predictive function, in a new domain. For instance, convolutional neural networks trained to classify millions of images along a thousand or more labels, such as Google Inception and Deep Residual Networks, can be used to extract features of images to subsequently aid in classifying a relatively small set of new images to a handful of labels. In this case, the learning task reduces to finding the optimal mapping of image features to their output labels. Mixed-integer programming (MIP) is an approach for finding this optimal mapping, and recent advancements in the state-of-the-art MIP solvers afford modeling this optimization problem. Because the MIP models are large in scale and computationally difficult to solve, we demonstrate how some techniques from stochastic integer programming and importance sampling can provide efficient solution methods and reduce overfitting. This talk starts with an overview lecture of mathematical programming models and their relationships to some machine-learning models.
D2K Lab Seminar Speaker: Prince Afriyie
Wednesday, March 31, 12:00pm - 1:00pm (Central Time)
Prince Afriyie is an assistant professor in the University of Virginia’s Department of Statistics. He is also affiliated with the University of Texas’ Dana Center, where he helps develop special training courses for educators on teaching statistics. Dr. Afriyie received his PhD in Statistics from Temple University (2016), his master’s degree in Mathematics from Ball State University (2011), and his bachelor’s degree in Mathematics from Northern Kentucky University (2008). Prior to joining the University of Virginia, he was an assistant professor of statistics at California Polytechnic State University, San Luis Obispo.
Dr. Afriyie’s current research is focused on developing new and powerful methodologies for testing multiple hypotheses simultaneously, as well as on statistics and data science education. He has served on the Statistics Advisory Group for the University of Texas’ Dana Center, where he helped create learning outcomes for a college-level course in statistics that actively engages students from Black, Latinx, Asian, and Indigenous communities. Dr. Afriyie was recently appointed to the Advanced Placement (AP) Statistics Development Committee, where he will help write and review AP Statistics exam questions, develop course curriculum, determine the general content and ability level of each exam, and determine requirements for course syllabi.
Multiple Hypotheses Testing - Procedures Controlling the Tail Probability of the False Discovery Proportion
Multiple testing has been an area of active statistical research in the past decade, mainly because of its wide scope of applicability in modern scientific investigations. Currently, research in multiple testing is mainly focused on developing powerful methods even when the number of tests is very large. This talk briefly reviews modern multiple testing methodologies before focusing on its primary goal of making further contributions to the field of controlling the false discovery proportion (FDP). More specifically, we propose four newer step-up procedures controlling the γ-FDP, the probability of the FDP exceeding γ, given some γ ∈ [0,1). The first of these procedures is developed by modifying the Benjamini and Hochberg (1995, J. Roy. Statist. Soc., Ser. B) critical constants, and it controls the γ-FDP under both independent and positively dependent test statistics. The second one is a two-stage adaptive procedure developed from these modified Benjamini and Hochberg critical constants and controls the γ-FDP under independence. The third and fourth procedures are also two-stage adaptive procedures controlling the γ-FDP under independence, but developed using critical constants in Lehmann and Romano (2005, Ann. of Statist.) and Delattre and Roquain (2015, Ann. of Statist.), respectively. Results of simulation studies examining the performance of the proposed procedures relative to their relevant competitors will be presented. We also show the performance of our proposed procedures on high-throughput genomic data.
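For readers unfamiliar with step-up procedures, here is a minimal sketch of the classic Benjamini and Hochberg (1995) procedure, whose critical constants the first proposed method modifies. It is the original procedure for controlling the false discovery rate, not any of the four FDP-controlling procedures from the talk. The function name and the p-values are illustrative assumptions.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Classic BH step-up procedure: returns one reject/keep flag per hypothesis.

    Illustrative only; the talk's procedures use modified critical constants
    that are not reproduced here.
    """
    m = len(p_values)
    # Indices of the hypotheses, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Step-up: find the LARGEST rank k whose p-value is at or below its
    # critical constant k * alpha / m.
    k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            k = rank

    # Reject the k hypotheses with the smallest p-values.
    rejected = [False] * m
    for idx in order[:k]:
        rejected[idx] = True
    return rejected


# Made-up p-values chosen to show the step-up behaviour.
example_p = [0.001, 0.035, 0.036, 0.037, 0.9]
example_flags = benjamini_hochberg(example_p, alpha=0.05)
```

In this example the second and third p-values exceed their own critical constants (0.02 and 0.03), yet both are rejected because the fourth p-value falls below its constant (0.04). Rejecting everything up to the largest passing rank is what distinguishes a step-up procedure from a step-down one.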
Building a Logistic Regression Model to Predict the Outcome of an NBA Game
The discipline of Data Science addresses the fundamental challenge of drawing robust conclusions about the world around us using incomplete data. There are three core aspects of effective data analysis: exploratory data analysis, modeling and prediction, and inference. This talk focuses on one aspect of modeling and prediction - Logistic Regression. We will use data from the 2017-18 season of the National Basketball Association (NBA) to build a logistic regression model to predict the outcome – probability of a win – of NBA home games.
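As a rough illustration of the kind of model this talk describes, the sketch below fits a one-feature logistic regression by gradient descent. It does not use the 2017-18 NBA data. The data are synthetic, and the single feature (a hypothetical standardized "rating gap" between the home and away teams) and all coefficient values are assumptions made for the example.

```python
import math
import random


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.5, epochs=500):
    """Fit P(home win) = sigmoid(b0 + b1 * x) by batch gradient descent
    on the average log-loss."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(b0 + b1 * x) - y  # gradient of the log-loss w.r.t. the score
            g0 += err
            g1 += err * x
        b0 -= lr * g0 / n
        b1 -= lr * g1 / n
    return b0, b1


random.seed(0)
# Synthetic games: x is the hypothetical rating gap; the assumed "true" model
# has a small positive intercept (home advantage) and a positive slope.
xs = [random.gauss(0, 1) for _ in range(500)]
ys = [1 if random.random() < sigmoid(0.3 + 1.5 * x) else 0 for x in xs]

b0, b1 = fit_logistic(xs, ys)
p_strong = sigmoid(b0 + b1 * 2.0)   # home team rated well above the visitor
p_weak = sigmoid(b0 + b1 * -2.0)    # home team rated well below the visitor
```

The fitted model outputs a probability of a home win rather than a hard win/loss label, which matches the outcome described in the abstract. In practice a library such as scikit-learn or statsmodels would replace the hand-written fitting loop.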
D2K Lab Seminar Speaker: Tanmay Basu
Wednesday, March 24, 12:00pm - 1:00pm (Central Time)
Title: Machine Learning and NLP for Knowledge Discovery in Unstructured Text
Abstract: Natural Language Processing (NLP) is the process of using computer algorithms to identify, analyze and derive key elements in unstructured text in a smart and effective way. With the widespread use of online social media and electronic health records (EHRs), unstructured text is a veritable gold mine, and NLP is the best way to extract value from these resources. Some recent research and directions for further work will be discussed to demonstrate the effectiveness of machine learning and NLP for information extraction from electronic health records and social media to develop useful tools for healthcare. The merit of machine learning and NLP for knowledge discovery in scientific literature will be explained. Moreover, the basic idea of data classification and the method of decision tree classification will be presented in order to explore its implications in relevant domains.
Bio: Tanmay Basu is a research fellow in data science and biomedical informatics with interests in developing methods and tools using novel computational NLP, text mining and machine learning techniques for potential knowledge discovery in electronic health records, social media, scientific literature, and other types of text data. Tanmay obtained his MS and PhD degrees in Computer Science from Jadavpur University and the Indian Statistical Institute in Kolkata, India, respectively. He worked on developing novel text classification and text clustering techniques during his PhD. Since August 2019, he has been a research fellow on the Health Data Research UK grant at the Institute of Cancer and Genomic Sciences at the University of Birmingham, UK. Prior to joining the University of Birmingham, he worked as an assistant professor in the Department of Computer Science at Ramakrishna Mission Vivekananda University in West Bengal, India. Earlier, he worked as a postdoctoral fellow at the Department of Learning Health Sciences at the University of Michigan, Ann Arbor, and at the Division of Biomedical Informatics of Northwestern University Feinberg School of Medicine in Chicago. He has delivered invited talks on different research topics in biomedical NLP at Duke University, the University of Cincinnati, the LIMSI NLP Group in France, and IIT Kharagpur and ISI Kolkata in India. He loves teaching, travelling and various sports.
D2K Lab Seminar Speaker: Dr. Nidhi Rastogi
Monday, March 15, 12:00pm - 1:00pm (Central Time)
Dr. Nidhi Rastogi is a Research Scientist at Rensselaer Polytechnic Institute. Her research is at the intersection of cybersecurity, artificial intelligence, large-scale networks, graph analytics, and data privacy. For her contributions to cybersecurity and encouraging women in STEM, Dr. Rastogi was recognized in 2020 as an International Women in Cybersecurity by the Cyber Risk Research Institute. She was a speaker at the SANS cybersecurity summit and the Grace Hopper Conference. Before her Ph.D. from RPI, Dr. Rastogi worked in industry on heterogeneous wireless networks (cellular, 802.1x, 802.11) and network security in engineering and research positions at Verizon, GE Global Research Center, and GE Power.
Towards Contextual Security and Privacy preservation on AI-enabled platforms
The explosive growth of Internet-connected and AI-enabled devices and the data they produce has introduced significant threats. For example, malware intrusions (SolarWinds) have become perilous and extremely hard to discover, while data breaches continue to compromise user privacy (Zoom credentials exposed) and endanger personally identifiable information. My research takes a holistic approach towards systems and platforms to address these very concerns using contextual and explainable security models and federated learning. In this talk, I will present ongoing work and plans for two main research themes: (1) analysis and improvements in the cybersecurity posture of Internet-connected systems and devices using automated, trustworthy, and contextual AI systems; (2) preservation of user data privacy and protection against information leakage from AI models. Ongoing research in malware threat intelligence gathers diverse information from varied datasets: system and network logs, source code, and text. An open-source ontology (MALOnt) contextualizes threat intelligence by aggregating malware-related information into classes and relations. The knowledge graph TINKER, the first open-source malware knowledge graph, instantiates MALOnt classes and enables information extraction, reasoning, analysis, detection, classification, and cyber threat attribution. At present, I am addressing the trustworthiness of information sources and extractors. For data privacy, I am exploring local data collection from sensors in autonomous vehicles. I end the talk by sharing planned future directions for research.
Data Science in Cybersecurity
I will cover the goals of cybersecurity and the usage of data science for a malware detection problem.