Taming Big Data: Stream Summarization and its Many Applications
Amr El Abbadi
University of California at Santa Barbara
Abstract: During the past two decades we have seen an unprecedented increase in the amount of data that is generated from numerous internet-scale applications. As hundreds of millions to billions of users interact with these applications, there is a continuous flow of interactions that are collected by the internet companies hosting these applications. Before this data can be subject to modeling and analysis, it is often necessary to obtain summary statistics such as the number of unique visitors, frequency counts of users from different states or countries, and, in general, quantile information from the dataset. Efficient algorithms exist for computing the exact information over the data. Unfortunately, these algorithms require a considerable amount of time, scanning the data multiple times, or require additional storage that is linear in the size of the dataset itself. Approximation methods with guaranteed error bounds, developed in the context of streaming data, are extremely effective at extracting useful and relatively accurate knowledge from big data. In this talk, we will review the recent, and not so recent, advances in big data stream summarization. The main objective of this talk is to demonstrate the strong relationship between the mathematics of big data and the management of big data. We will discuss streaming data summarization, focusing on the heavy hitters problem in diverse settings, including recent advances for environments with both insertions and deletions, privacy challenges, and applications to caching in large-scale elastic cloud environments and to data analysis and monitoring in modern software-defined networks.
Biography: Amr El Abbadi is a Professor of Computer Science. He received his B. Eng. from Alexandria University, Egypt, and his Ph.D. from Cornell University. His research interests are in the fields of fault-tolerant distributed systems and databases, focusing recently on cloud data management, blockchain-based systems and privacy concerns. Prof. El Abbadi is an ACM Fellow, AAAS Fellow, and IEEE Fellow. He was Chair of the Computer Science Department at UCSB from 2007 to 2011. He served as Associate Graduate Dean at the University of California, Santa Barbara from 2021 to 2023. He served as a journal editor for several database journals, including The VLDB Journal, IEEE Transactions on Computers and The Computer Journal. He has been Program Chair for multiple database and distributed systems conferences, including most recently SIGMOD 2022. He served on the executive committee of the IEEE Technical Committee on Data Engineering (TCDE) and was a board member of the VLDB Endowment from 2002 to 2008. In 2007, Prof. El Abbadi received the UCSB Senate Outstanding Mentorship Award for his excellence in mentoring graduate students. In 2013, his student, Sudipto Das, received the SIGMOD Jim Gray Doctoral Dissertation Award. Prof. El Abbadi is also a co-recipient of the Test of Time Award at EDBT/ICDT 2015. Recently, papers he co-authored received an Outstanding Paper Award at NSDI (Networked Systems Design and Implementation) 2024 and the Test of Time Award from MDM (Mobile Data Management) 2024. He has published over 350 articles in databases and distributed systems and has supervised over 40 PhD students.
Keynote 2
Data+AI: An LLM-Powered Data Analytics System
Guoliang Li
Tsinghua University
Abstract: Data analytics systems for structured data are widely used and deployed. However, analyzing unstructured and heterogeneous data (such as data lakes) remains challenging due to the lack of semantic operators, intelligent data analytics pipeline generation, and effective reasoning capabilities. Fortunately, large language models (LLMs) offer powerful understanding, reasoning, semantic matching, and generation abilities, providing an opportunity to revolutionize data analytics systems. First, when it comes to structured data analytics, we can integrate LLMs as semantic operators within data analytics processes. Second, for unstructured data, LLMs can be employed to automatically generate execution pipelines for analysis. Third, for heterogeneous data, we demonstrate how to link disparate data types and fuse their execution plans. In this talk, I will discuss the challenges involved and propose solutions to tackle these issues. Additionally, I will highlight open challenges in heterogeneous data analytics.
Biography: Guoliang Li is a full professor in the Department of Computer Science at Tsinghua University, Beijing, China. His research interests include Data+AI Systems, AI4Data, Data4AI, and cloud-native database systems. He has received several awards, including the VLDB 2017 Early Research Contribution Award, the TCDE 2014 Early Career Award, and the SIGMOD 2024 Research Highlight Award, as well as best paper awards, such as the VLDB 2023 Best Industry Paper, the CIKM 2017 Best Paper Award, the DASFAA 2023 Best Paper Award, and best papers of SIGMOD 2023, VLDB 2020, KDD 2018, and ICDE 2018. Guoliang has served as the general co-chair of SIGMOD 2021, demo co-chair for VLDB 2022, industrial co-chair for ICDE 2022, tutorial co-chair for SIGMOD 2022, and program contest co-chair for SIGMOD 2024. He regularly serves as a (senior) PC member for conferences like SIGMOD, VLDB, and ICDE.
Keynote 3
Responsible AI and the Role of Data Engineering
Evaggelia Pitoura
University of Ioannina & Archimedes Research Unit of Athena RC
Abstract: As AI algorithms are deployed in domains that impact human lives, ensuring responsibility in their design and implementation has become critical. This talk will explore two key dimensions of responsible AI: fairness and explainability. I will present our recent research on addressing these challenges, with a particular focus on counterfactual explanations. Counterfactual explanations provide insights by identifying the minimal changes to input data that would alter the output of an algorithm, offering a powerful tool for enhancing both fairness and transparency. In addition, I will discuss the importance of responsibility in Retrieval-Augmented Generation (RAG) pipelines. The talk will emphasize how data engineering principles and techniques can be leveraged to enhance both quality and performance.
Biography: Evaggelia Pitoura is a Professor at the Department of Computer Science and Engineering at the University of Ioannina and a Lead Researcher at Archimedes Research Unit of Athena RC, Greece. She holds a BEng degree from the University of Patras, Greece, and an MS and PhD from Purdue University, USA. Her current research interests focus on two primary areas: responsible data management, with a focus on fairness, explainability, and their interplay; and graph exploration and analysis. For her work, she has received best paper awards, a Marie Curie Fellowship and two Recognition of Service Awards from ACM. She is an ACM Senior Member, chair of the Greek ACM-W event steering committee, chair of the Hellenic ACM SIGMOD chapter, and member of the sectorial scientific council of Greece's National Council for Research, Technology and Innovation.
41st IEEE International Conference on Data Engineering, Hong Kong SAR, China – May 19-23, 2025