Spring 2021 Seminars:
D2K Lab Seminar Speaker: Gabriel Zenarosa
Thursday, April 1, 12:00pm - 1:00pm (Central Time)
Gabriel Lopez Zenarosa (he/him/his) is a Research Associate under the National Research Council (NRC) Research Associateship Programs (RAP) of the National Academies of Sciences, Engineering, and Medicine (NASEM), collaborating with researchers in the Air Force Research Laboratory Munitions Directorate (AFRL/RW). He was previously an Assistant Professor of Systems Engineering and Engineering Management at UNC Charlotte, where he taught courses on computational methods (programming in C++/Java and data analytics in R), systems design and deployment, fundamentals of engineering management, fundamentals of stochastic system analysis, and special topics. Zenarosa received his Doctor of Philosophy in Industrial Engineering from the University of Pittsburgh in 2016, Master of Software Engineering from Carnegie Mellon University in 2005, Master of Science in Computer Science from Columbia University in 2002, and Bachelor of Science in Computer Science from the University of the Philippines in 1997. Zenarosa also held positions in industry for over eight years as a software engineer, software quality assurance test engineer, client support engineer, and software process consultant.
Title: Mixed-integer Programs for Transfer Learning
Abstract: Transfer learning is an approach for leveraging knowledge previously learned in one problem domain to aid in learning knowledge, particularly a predictive function, in a new domain. For instance, convolutional neural networks trained to classify millions of images into a thousand or more labels, such as Google Inception and Deep Residual Networks, can be used to extract image features that subsequently aid in classifying a relatively small set of new images into a handful of labels. In this case, the learning task reduces to finding the optimal mapping of image features to their output labels. Mixed-integer programming (MIP) is an approach for finding this optimal mapping, and recent advances in state-of-the-art MIP solvers make it feasible to model this optimization problem. Because the MIP models are large in scale and computationally difficult to solve, we demonstrate how some techniques from stochastic integer programming and importance sampling can provide efficient solution methods and reduce overfitting. This talk starts with an overview of mathematical programming models and their relationships to some machine-learning models.
D2K Lab Seminar Speaker: Prince Afriyie
Wednesday, March 31, 12:00pm - 1:00pm (Central Time)
Prince Afriyie is an assistant professor in the University of Virginia’s Department of Statistics. He is also affiliated with the University of Texas’ Dana Center, where he helps develop special training courses for educators on teaching statistics. Dr. Afriyie received his PhD in Statistics from Temple University (2016), master’s degree in Mathematics from Ball State University (2011), and bachelor’s degree in Mathematics from Northern Kentucky University (2008). Prior to joining the University of Virginia, he was an assistant professor of statistics at California Polytechnic State University, San Luis Obispo.
Dr. Afriyie’s current research focuses on developing new and powerful methodologies for testing multiple hypotheses simultaneously, as well as on statistics and data science education. He has served on the Statistics Advisory Group for the University of Texas’ Dana Center, where he helped create learning outcomes for a college-level course in statistics that actively engages students from Black, Latinx, Asian, and Indigenous communities. Dr. Afriyie was recently appointed to the Advanced Placement (AP) Statistics Development Committee, where he will help write and review AP Statistics exam questions, develop course curriculum, determine the general content and ability level of each exam, and determine requirements for course syllabi.
Title: Multiple Hypotheses Testing - Procedures Controlling the Tail Probability of the False Discovery Proportion
Abstract: Multiple testing has been an area of active statistical research in the past decade, mainly because of its wide scope of applicability in modern scientific investigations. Currently, research in multiple testing is mainly focused on developing powerful methods even when the number of tests is very large. This talk briefly reviews modern multiple testing methodologies before focusing on its primary goal of making further contributions to the field of controlling the false discovery proportion (FDP). More specifically, we propose four newer step-up procedures controlling the γ-FDP, the probability of FDP exceeding γ, given some γ ∈ [0,1). The first of these procedures is developed by modifying the Benjamini and Hochberg (1995, J. Roy. Statist. Soc., Ser. B) critical constants, and it controls the γ-FDP under both independent and positively dependent test statistics. The second is a two-stage adaptive procedure developed from these modified Benjamini and Hochberg critical constants, and it controls the γ-FDP under independence. The third and fourth procedures are also two-stage adaptive procedures controlling the γ-FDP under independence, but they are developed using critical constants in Lehmann and Romano (2005, Ann. of Statist.) and Delattre and Roquain (2015, Ann. of Statist.), respectively. Results of simulation studies examining the performance of the proposed procedures relative to their relevant competitors will be presented. We also show the performance of our proposed procedures on high-throughput genomic data.
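For readers unfamiliar with step-up procedures, the following is a minimal sketch of the classical Benjamini and Hochberg (1995) step-up procedure, which the abstract takes as its starting point. It is not the speaker's method: the abstract does not state the modified critical constants, so the code uses the standard constants i·α/m, which are designed to control the false discovery rate rather than the tail probability of the FDP. The function names `bh_critical_constants` and `step_up` are illustrative choices, not from any library.

```python
from typing import List, Sequence


def bh_critical_constants(m: int, alpha: float) -> List[float]:
    """Classical Benjamini-Hochberg critical constants: i * alpha / m, i = 1..m."""
    return [i * alpha / m for i in range(1, m + 1)]


def step_up(p_values: Sequence[float], critical: Sequence[float]) -> List[bool]:
    """Generic step-up procedure.

    Sort the p-values in increasing order, find the largest rank k such that
    the k-th smallest p-value is <= the k-th critical constant, and reject
    the k hypotheses with the smallest p-values. Returns one rejection flag
    per hypothesis, in the original input order.
    """
    m = len(p_values)
    if len(critical) != m:
        raise ValueError("need exactly one critical constant per hypothesis")
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    # Scan from the largest p-value downward; stop at the first success.
    for rank in range(m, 0, -1):
        if p_values[order[rank - 1]] <= critical[rank - 1]:
            k = rank
            break
    reject = [False] * m
    for idx in order[:k]:
        reject[idx] = True
    return reject


if __name__ == "__main__":
    # Hypothetical p-values chosen only to illustrate the mechanics.
    pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
    print(step_up(pvals, bh_critical_constants(len(pvals), alpha=0.05)))
```

Because `step_up` takes the critical constants as an argument, any other non-decreasing sequence of constants (such as the modified ones discussed in the talk) could be substituted without changing the procedure itself.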
Title: Building a Logistic Regression Model to Predict the Outcome of an NBA Game
Abstract: The discipline of Data Science addresses the fundamental challenge of drawing robust conclusions about the world around us using incomplete data. There are three core aspects of effective data analysis: exploratory data analysis, modeling and prediction, and inference. This talk focuses on one aspect of modeling and prediction: logistic regression. We will use data from the 2017-18 season of the National Basketball Association (NBA) to build a logistic regression model to predict the outcome (the probability of a win) of NBA home games.
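As a rough illustration of the kind of model this abstract describes, here is a minimal sketch of a one-predictor logistic regression fitted by gradient ascent in plain Python. Everything specific in it is an assumption: the data are synthetic (not the 2017-18 NBA data), the single predictor (a hypothetical home-minus-away rating difference) is invented for the example, and the talk's actual features and fitting software are not stated in the abstract.

```python
import math
import random


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic(xs, ys, lr=0.1, epochs=1000):
    """Fit P(y = 1 | x) = sigmoid(b0 + b1 * x) by batch gradient ascent
    on the mean log-likelihood. Returns the pair (b0, b1)."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = y - sigmoid(b0 + b1 * x)  # gradient contribution of one game
            g0 += err
            g1 += err * x
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1


# Synthetic toy data, NOT real NBA results. x is a hypothetical
# home-minus-away rating difference; y is 1 if the home team wins.
rng = random.Random(0)
ASSUMED_B0, ASSUMED_B1 = 0.4, 0.15  # made-up "true" coefficients for simulation
xs = [rng.gauss(0.0, 5.0) for _ in range(500)]
ys = [1 if rng.random() < sigmoid(ASSUMED_B0 + ASSUMED_B1 * x) else 0 for x in xs]

b0, b1 = fit_logistic(xs, ys)


def predict_home_win_prob(x: float) -> float:
    """Estimated probability that the home team wins, given rating difference x."""
    return sigmoid(b0 + b1 * x)


if __name__ == "__main__":
    for diff in (-10.0, 0.0, 10.0):
        print(diff, round(predict_home_win_prob(diff), 3))
```

The fitted model returns a probability rather than a hard win/loss label, which matches the abstract's framing of the outcome as the probability of a home win.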
D2K Lab Seminar Speaker: Tanmay Basu
Wednesday, March 24, 12:00pm - 1:00pm (Central Time)
Title: Machine Learning and NLP for Knowledge Discovery in Unstructured Text
Abstract: Natural Language Processing (NLP) is the process of using computer algorithms to identify, analyze, and derive key elements in unstructured text in a smart and effective way. With the widespread use of online social media and electronic health records (EHRs), unstructured text is a veritable gold mine, and NLP is the best way to extract value from these resources. Recent research and directions for future work will be discussed to demonstrate the effectiveness of machine learning and NLP for information extraction from electronic health records and social media to develop useful tools for healthcare. The merit of machine learning and NLP for knowledge discovery in scientific literature will be explained. Moreover, the basic idea of data classification and the method of decision tree classification will be presented in order to explore its implications in relevant domains.
Bio: Tanmay Basu is a research fellow in data science and biomedical informatics with interests in developing methods and tools using novel computational NLP, text mining, and machine learning techniques for potential knowledge discovery in electronic health records, social media, scientific literature, and other types of text data. Tanmay obtained MS and PhD degrees in Computer Science from Jadavpur University and the Indian Statistical Institute in Kolkata, India, respectively. He worked on developing novel text classification and text clustering techniques during his PhD. Since August 2019, he has been a research fellow on the Health Data Research UK grant at the Institute of Cancer and Genomic Sciences at the University of Birmingham, UK. Prior to joining the University of Birmingham, he worked as an assistant professor in the Department of Computer Science at Ramakrishna Mission Vivekananda University in West Bengal, India. Earlier, he worked as a postdoctoral fellow at the Department of Learning Health Sciences at the University of Michigan, Ann Arbor, and at the Division of Biomedical Informatics of Northwestern University Feinberg School of Medicine in Chicago. He has delivered invited talks on different research topics in biomedical NLP at Duke University, the University of Cincinnati, the LIMSI NLP Group in France, and IIT Kharagpur and ISI Kolkata in India. He loves teaching, travelling, and various sports.
D2K Lab Seminar Speaker: Dr. Nidhi Rastogi
Monday, March 15, 12:00pm - 1:00pm (Central Time)
Dr. Nidhi Rastogi is a Research Scientist at Rensselaer Polytechnic Institute. Her research is at the intersection of cybersecurity, artificial intelligence, large-scale networks, graph analytics, and data privacy. For her contributions to cybersecurity and encouraging women in STEM, Dr. Rastogi was recognized in 2020 as an International Women in Cybersecurity by the Cyber Risk Research Institute. She was a speaker at the SANS cybersecurity summit and the Grace Hopper Conference. Before her Ph.D. from RPI, Dr. Rastogi also worked in industry on heterogeneous wireless networks (cellular, 802.1x, 802.11) and network security through engineering and research positions at Verizon, GE Global Research Center, and GE Power.
Title: Towards Contextual Security and Privacy Preservation on AI-enabled Platforms
Abstract: The explosive growth of Internet-connected and AI-enabled devices and the data they produce has introduced significant threats. For example, malware intrusions (SolarWinds) have become perilous and extremely hard to discover, while data breaches continue to compromise user privacy (Zoom credentials exposed) and endanger personally identifiable information. My research takes a holistic approach towards systems and platforms to address these very concerns using contextual and explainable security models and federated learning. In this talk, I will present ongoing work and plans for two main research themes: (1) analysis and improvements in the cybersecurity posture of Internet-connected systems and devices using automated, trustworthy, and contextual AI systems; (2) preservation of user data privacy and protection against information leakage from AI models. Ongoing research in malware threat intelligence gathers diverse information from varied datasets: system and network logs, source code, and text. An open-source ontology (MALOnt) contextualizes threat intelligence by aggregating malware-related information into classes and relations. The knowledge graph TINKER, the first open-source malware knowledge graph, instantiates MALOnt classes and enables information extraction, reasoning, analysis, detection, classification, and cyber threat attribution. At present, I am addressing the trustworthiness of information sources and extractors. For data privacy, I am exploring local data collection from sensors in autonomous vehicles. I end the talk by sharing planned future directions for research.
Title: Data Science in Cybersecurity
Abstract: I will cover the goals of cybersecurity and the usage of data science for a malware detection problem.