(New!) Seminar Series on Trustworthy Data Science and AI

Past Data Science Seminars
Title: Can deep learning only be neural networks?
Speaker: Zhi-Hua Zhou, Nanjing University, China
Date, Time, Location: 10:30-11:30am, Dec 11, 2019
TASC I 9204
Abstract: The term "deep learning" is generally regarded as a synonym for "deep neural networks" (DNNs). In this talk, we will discuss the essentials of deep learning and argue that deep learning need not be realized by neural networks. We will then present an exploration of non-NN-style deep learning, where the building blocks are non-differentiable modules and the training process relies on neither backpropagation nor gradient-based adjustment.
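For readers wondering what a non-NN deep model might look like: one well-known instance of this line of work is Zhou et al.'s deep forest (gcForest). Below is a toy sketch of the cascade idea, assuming scikit-learn; the layer structure, estimator choices, and the use of training-set probabilities (rather than the cross-validated class vectors gcForest uses) are simplifications for illustration, not the talk's actual method.

```python
# Toy sketch of cascade-forest-style "deep learning without backprop":
# each layer is an ensemble of tree forests, and each layer's predicted
# class probabilities are appended to the raw features for the next layer.
# No gradients or backpropagation are involved anywhere.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

def fit_cascade(X_tr, y_tr, n_layers=3):
    """Fit a cascade; each layer sees raw features plus the previous
    layer's class probabilities. (A real system would use k-fold
    cross-validated probabilities here to avoid overfitting.)"""
    layers, aug = [], X_tr
    for _ in range(n_layers):
        layer = [RandomForestClassifier(n_estimators=100, random_state=0).fit(aug, y_tr),
                 ExtraTreesClassifier(n_estimators=100, random_state=0).fit(aug, y_tr)]
        layers.append(layer)
        probas = np.hstack([m.predict_proba(aug) for m in layer])
        aug = np.hstack([X_tr, probas])  # feature re-representation, no backprop

    return layers

def predict_cascade(layers, X):
    aug = X
    for layer in layers:
        probas = np.hstack([m.predict_proba(aug) for m in layer])
        aug = np.hstack([X, probas])
    # Average the last layer's forests and take the argmax class.
    n_models = len(layers[-1])
    return probas.reshape(len(X), n_models, -1).mean(axis=1).argmax(axis=1)

layers = fit_cascade(X_tr, y_tr)
print("cascade accuracy:", (predict_cascade(layers, X_te) == y_te).mean())
```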
Speaker Bio: Zhi-Hua Zhou is a Professor, Head of the Department of Computer Science and Technology, Dean of the School of Artificial Intelligence, and Founding Director of the LAMDA Group, Nanjing University. His main research interests are in artificial intelligence, machine learning and data mining. He authored the books "Ensemble Methods: Foundations and Algorithms (2012)", "Machine Learning (in Chinese, 2016)" and "Evolutionary Learning: Advances in Theories and Algorithms (2019)", and published more than 200 papers in top-tier international journals/conferences. According to Google Scholar, his publications have received more than 40,000 citations, with an H-index of 96. He also holds 24 patents and has rich experience in industrial applications. He has received various awards, including the National Natural Science Award of China, IEEE Computer Society Edward J. McCluskey Technical Achievement Award, ACML Distinguished Contribution Award, PAKDD Distinguished Contribution Award, IEEE ICDM Outstanding Service Award, etc. He serves as the Editor-in-Chief of Frontiers of Computer Science, Associate Editor-in-Chief of Science China Information Sciences, and Action/Associate Editor of Machine Learning, IEEE PAMI, ACM TKDD, etc. He founded ACML (Asian Conference on Machine Learning) and has served as Chair for many prestigious conferences, including Program Chair of AAAI 2019, General Chair of IEEE ICDM 2016, and Program Chair of the IJCAI 2015 Machine Learning track, and frequently serves as Area Chair for NeurIPS, ICML, KDD, etc. He is a Fellow of the ACM, AAAI, AAAS, IEEE, and IAPR.
Host: Jian Pei


Title: Multi-Facet Contextualized Graph Mining with Cube Networks
Speaker: Carl Ji Yang, University of Illinois at Urbana-Champaign
Date, Time, Location: 2-3pm, Dec 9, 2019
TASC I 9204 West
Abstract: Graph data are ubiquitous and indispensable in a variety of high-impact data mining problems and applications, due to their natural and unique modeling of interconnected objects. However, real-world graph data are often massive, complex, and noisy, challenging the design of both effective and efficient knowledge discovery frameworks. In this talk, I will present our recent progress on multi-facet contextualized graph mining, centered around the objective of multi-modal data integration across different domains. In particular, I will focus on (1) a new data model of cube networks, which organizes massive complex networks into controllable small subnetworks with clear structures and semantics under multi-facet contexts, and (2) a few algorithmic examples of what can be done on top of cube networks. Beyond that, I will also briefly give examples of how to construct cube networks from existing data models like attributed heterogeneous networks, and of the real-world impact cube networks can make in industry-level applications. Finally, I will conclude with some visions and future plans regarding learning with cube networks.
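The abstract does not spell out the cube-network model in detail; as a purely illustrative sketch of the general idea (indexing a large network into small, queryable subnetworks under multi-facet contexts), consider the following hypothetical data structure. All names, facets, and the API below are invented for illustration and are not the model presented in the talk.

```python
# Hypothetical sketch: cells of a "cube" over a big graph, keyed by a
# tuple of facet values (e.g. venue and year), each cell holding a small
# contextualized subnetwork (here, just an edge list).
from collections import defaultdict

class NetworkCube:
    def __init__(self, facets):
        self.facets = facets            # e.g. ("venue", "year")
        self.cells = defaultdict(list)  # facet-value tuple -> edge list

    def insert(self, u, v, context):
        """File an edge into the cell matching its context."""
        key = tuple(context[f] for f in self.facets)
        self.cells[key].append((u, v))

    def slice(self, **query):
        """Collect edges from every cell matching the given facet values."""
        edges = []
        for key, cell in self.cells.items():
            if all(key[self.facets.index(f)] == v for f, v in query.items()):
                edges.extend(cell)
        return edges

cube = NetworkCube(facets=("venue", "year"))
cube.insert("alice", "bob", {"venue": "KDD", "year": 2019})
cube.insert("bob", "carol", {"venue": "WWW", "year": 2019})
print(cube.slice(year=2019))  # a small, queryable subnetwork for that context
```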
Speaker Bio: Carl Yang is a final-year Ph.D. student advised by Jiawei Han in Computer Science at the University of Illinois at Urbana-Champaign. Before that, he received his B.Eng. in Computer Science at Zhejiang University under Xiaofei He in 2014. In his research, he develops data-driven, weakly supervised, and scalable techniques for knowledge discovery from massive, complex, and noisy network (graph) data. His interests span graph data mining, network data science, and applied machine learning, with a focus on designing novel graph analysis and deep learning frameworks for the construction, modeling, and application of real-world network data, towards tasks like conditional structure generation, contextualized network embedding, and graph-aided recommendation. Carl's first-author research results have been published and well cited in top conferences such as KDD, WWW, NeurIPS, ICDE, WSDM, ICDM, CIKM, ECML-PKDD, SDM, and ICML.
Host: Jian Pei


Title: The Importance of Domain Expertise in Data Science and Feature Engineering
Speaker: Pablo Duboue
Date, Time, Location: 3-4pm, Nov 28, 2019
TASC I 9204 West
Abstract: Do you need to be an expert in a domain to solve a data problem in it? If we lack expertise, what do we miss? The importance of domain expertise tends to be highlighted in machine learning tasks, but careful data analysis can often steer us away from bad assumptions and yield high-performing models without such expertise. If we possess domain knowledge, or can consult people who do, tapping into that expertise will get us much farther than having to reinvent solutions to problems already known in the target domain. This is particularly crucial for feature engineering, the process of representing a problem domain so as to make it amenable to learning techniques. In this talk I will briefly summarize feature engineering techniques and discuss two case studies from my upcoming book, one pursued with domain expertise and one without.
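As a concrete, hypothetical illustration of the kind of domain-informed feature engineering the talk concerns: a raw timestamp column is nearly useless to most learners, but a domain expert in, say, retail demand knows that day-of-week and hour of day capture the relevant periodicity. The data and feature names below are invented for illustration.

```python
# Minimal illustration: raw feature vs. expert-derived features from
# the same timestamp. (Toy example; not from the speaker's book.)
from datetime import datetime, timezone

def raw_features(ts):
    return [ts]                              # naive representation

def expert_features(ts):
    dt = datetime.fromtimestamp(ts, tz=timezone.utc)
    return [dt.weekday(),                    # 0 = Monday ... 6 = Sunday
            dt.hour,                         # hour of day
            1 if dt.weekday() >= 5 else 0]   # weekend flag

ts = 1574942400                              # 2019-11-28 12:00:00 UTC
print(raw_features(ts))      # [1574942400]
print(expert_features(ts))   # [3, 12, 0] -> Thursday, noon, weekday
```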
Speaker Bio: Dr. Duboue works on applied language technology and natural language generation. He is the director and owner of Textualization Software Ltd., a Vancouver-based AI consulting company. Holding a Ph.D. in Computer Science from Columbia University (NYC), he is passionate about improving society through language technology and splits his time between teaching, doing research, and contributing to free software projects. His first book, "The Art of Feature Engineering", is currently in copy editing and is due in Spring 2020 from Cambridge University Press. Career highlights include being a member of the IBM Watson team that beat the Jeopardy! champions in 2011, a best paper award in the Canadian AI conference industrial track in 2014, and consulting for a start-up acquired by Intel Corp. He has done joint research with more than fifty co-authors and has taught in three different countries.
Host: Jiannan Wang


Title: Probabilistic Databases: A Dichotomy Theorem and Limitations of DPLL Algorithms
Speaker: Dan Suciu, University of Washington, Seattle
Date, Time, Location: 2:30-3:30pm, Oct 24, 2019
ASB 10911 (Big Data Presentation Studio)
Abstract: Probabilistic databases (PDBs) extend traditional relational databases by annotating each record with a weight, or a probability. The query evaluation problem, "given a query (a first-order logic sentence), compute its probability", is an instance of the weighted model counting problem for Boolean formulas, and has applications to Markov Logic Networks and other statistical relational models. I will present two results in this talk. The first is a dichotomy theorem stating that, for each universally (or existentially) quantified sentence without negation, computing its probability is either #P-hard or in PTIME in the size of the probabilistic database. The second result is a limitation of Davis-Putnam-Logemann-Loveland (DPLL) algorithms: there exist FOL sentences whose probability can be computed in PTIME over probabilistic databases (using lifted inference), yet every DPLL algorithm, even extended with caching and with components, takes exponential time.
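To make the lifted-inference claim concrete, here is a worked toy example over a tuple-independent probabilistic database (the standard PDB semantics; the tables and numbers below are made up). For the safe query Q = "there exists x such that R(x) and S(x)", lifted inference computes P(Q) = 1 - prod_x (1 - pR(x) * pS(x)) in PTIME, whereas naive weighted model counting enumerates all 2^n possible worlds.

```python
# Toy query evaluation over a tuple-independent probabilistic database.
from itertools import product

R = {"a": 0.9, "b": 0.5}  # tuple -> marginal probability
S = {"a": 0.8, "b": 0.4}

def lifted(R, S):
    """PTIME lifted inference for Q = exists x: R(x) and S(x)."""
    p = 1.0
    for x in R.keys() & S.keys():
        p *= 1 - R[x] * S[x]   # probability that x fails to witness Q
    return 1 - p

def brute_force(R, S):
    """Sum the weights of all 2^n possible worlds in which Q holds."""
    tuples = [("R", x, p) for x, p in R.items()] + \
             [("S", x, p) for x, p in S.items()]
    total = 0.0
    for world in product([0, 1], repeat=len(tuples)):
        w, present = 1.0, set()
        for bit, (rel, x, p) in zip(world, tuples):
            w *= p if bit else 1 - p
            if bit:
                present.add((rel, x))
        if any(("R", x) in present and ("S", x) in present for x in R):
            total += w
    return total

print(lifted(R, S), brute_force(R, S))  # both print 0.776
```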
Speaker Bio: Dan Suciu is a Professor in Computer Science at the University of Washington. He received his Ph.D. from the University of Pennsylvania in 1995, was a principal member of the technical staff at AT&T Labs, and joined the University of Washington in 2000. Suciu conducts research in data management, with an emphasis on topics related to big data and data sharing, such as probabilistic data, data pricing, parallel data processing, and data security. He is a co-author of two books, "Data on the Web: From Relations to Semistructured Data and XML" (1999) and "Probabilistic Databases" (2011). He is a Fellow of the ACM, holds twelve US patents, received the best paper award at SIGMOD 2000, SIGMOD 2019, and ICDT 2013, the ACM PODS Alberto Mendelzon Test of Time Award in 2010 and in 2012, the 10 Year Most Influential Paper Award at ICDE 2013, and the VLDB Ten Year Best Paper Award in 2014, and is a recipient of the NSF CAREER Award and of an Alfred P. Sloan Fellowship. Suciu serves on the VLDB Board of Trustees and is an associate editor for the Journal of the ACM, the VLDB Journal, ACM TWEB, and Information Systems, and a past associate editor for ACM TODS and ACM TOIS. Suciu's Ph.D. students Gerome Miklau, Christopher Re, and Paris Koutris received the ACM SIGMOD Best Dissertation Award in 2006, 2010, and 2016, respectively, and Nilesh Dalvi was a runner-up in 2008.
Host: Jiannan Wang


Title: Get Your Data Together
Speaker: Erkang (Eric) Zhu, University of Toronto and Microsoft Research (Redmond)
Date, Time, Location: 10:30-11:30am, Oct 2, 2019
TASC I 9204 West
Abstract: Data lakes (e.g., enterprise data catalogs and Open Data portals) are data dumps if users cannot find and utilize the data in them. In this talk, I will present two problems in massive, dynamic data lakes: 1) searching for joinable tables to discover potential linkages, and 2) joining tables from different sources through auto-generated syntactic transformations on join values. I will also present two algorithmic solutions that scale to data lakes that are large both in the number of tables (millions) and in table sizes. The presented work has been published in SIGMOD and VLDB.
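For a concrete sense of the joinable-table search problem, here is a minimal sketch using the speaker's open-source datasketch library (assuming its documented MinHashLSHEnsemble interface for containment search). The column names and data are invented, and this illustrates the containment-search idea rather than the exact systems presented in the talk: each column in the lake is indexed as a set of values, and a query asks for columns that contain at least a threshold fraction of the query column's values, i.e., columns one could join on.

```python
# Joinable-column search via set containment with MinHash LSH Ensemble.
from datasketch import MinHash, MinHashLSHEnsemble

def minhash(values, num_perm=128):
    """Build a MinHash sketch of a column's value set."""
    m = MinHash(num_perm=num_perm)
    for v in values:
        m.update(v.encode("utf8"))
    return m

query_col = {"alice", "bob", "carol"}                     # column to join on
lake = {"users.name":  {"alice", "bob", "carol", "dave"}, # toy "data lake"
        "cities.name": {"berlin", "oslo", "bob"}}

# Index all lake columns once; entries are (key, sketch, set size).
index = MinHashLSHEnsemble(threshold=0.8, num_perm=128, num_part=32)
index.index([(key, minhash(col), len(col)) for key, col in lake.items()])

# Retrieve candidates whose containment of query_col is >= 0.8.
for key in index.query(minhash(query_col), len(query_col)):
    print("joinable candidate:", key)   # expect users.name only
```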
Speaker Bio: Erkang (Eric) Zhu is a recent Ph.D. graduate from the University of Toronto (U of T) and will join Microsoft Research as a researcher this fall. His supervisor at U of T was Prof. Renée J. Miller. His research focuses on data discovery, large-scale similarity search, and randomized algorithms (data sketches). He has published at top data management conferences including VLDB and SIGMOD. He created two popular open-source projects, "datasketch" and "SetSimilaritySearch", based on his research.
Host: Tianzheng Wang