I am an Assistant Professor in the Department of Statistics at the Federal University of São Carlos (UFSCar) in Brazil. I completed my Ph.D. in Mathematics at the Institute for Pure and Applied Mathematics (IMPA) in Rio de Janeiro, focusing on boosting methods and concentration of measure in machine learning. I also earned my M.Sc. and Bachelor’s degrees in Mathematics from the University of São Paulo (ICMC-USP).
Split conformal prediction (CP) is arguably the most popular CP method for uncertainty quantification, enjoying both academic interest and widespread deployment. However, the original theoretical analysis of split CP relies on the crucial assumption of data exchangeability, which hinders many real-world applications. In this paper, we present a novel theoretical framework based on concentration inequalities and decoupling properties of the data, proving that split CP remains valid for many non-exchangeable processes up to a small coverage penalty. Through experiments with both real and synthetic data, we show that our theoretical results translate to good empirical performance under non-exchangeability, e.g., for time series and spatiotemporal data. Compared to recent conformal algorithms designed to counter specific exchangeability violations, we show that split CP is competitive in terms of coverage and interval size, with the benefit of being extremely simple and orders of magnitude faster than alternatives.
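For readers unfamiliar with the procedure, the following is a minimal sketch of split conformal prediction for regression intervals. The model, synthetic data, and miscoverage level alpha are illustrative assumptions and do not reproduce the paper's experiments; under exchangeability the quantile step gives the usual marginal coverage guarantee, while the paper's analysis quantifies the extra penalty incurred by non-exchangeable data.

```python
# Minimal sketch of split conformal prediction (regression).
# Data, model, and alpha are illustrative assumptions, not the paper's setup.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.5, size=2000)

# Split: one half fits the model, the other calibrates the interval width.
X_train, X_calib, y_train, y_calib = train_test_split(
    X, y, test_size=0.5, random_state=0
)
model = LinearRegression().fit(X_train, y_train)

# Nonconformity scores on the calibration set: absolute residuals.
scores = np.abs(y_calib - model.predict(X_calib))

# Conformal quantile: under exchangeability this yields >= 1 - alpha coverage;
# for non-exchangeable data the guarantee degrades by a small penalty.
alpha = 0.1
n = len(scores)
q = np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n, method="higher")

# Prediction interval for a new point.
x_new = rng.normal(size=(1, 5))
pred = model.predict(x_new)[0]
interval = (pred - q, pred + q)
print(interval)
```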
BlockBoost: Scalable and Efficient Blocking through Boosting
Thiago Ramos, Rodrigo Loro Schuller, Alex Akira Okuno, Lucas Nissenbaum, Roberto I. Oliveira, and Paulo Orenstein
In Proceedings of the 27th International Conference on Artificial Intelligence and Statistics, 02–04 May 2024
As datasets grow larger, matching and merging entries from different databases has become a costly task in modern data pipelines. To avoid expensive comparisons between entries, blocking similar items is a popular preprocessing step. In this paper, we introduce BlockBoost, a novel boosting-based method that generates compact binary hash codes for database entries, through which blocking can be performed efficiently. The algorithm is fast and scalable, resulting in computational costs that are orders of magnitude lower than current benchmarks. Unlike existing alternatives, BlockBoost comes with associated feature importance measures for interpretability, and possesses strong theoretical guarantees, including lower bounds on critical performance metrics like recall and reduction ratio. Finally, we show that BlockBoost delivers great empirical results, outperforming state-of-the-art blocking benchmarks in terms of both performance metrics and computational cost.
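To illustrate how blocking with binary hash codes avoids the quadratic all-pairs comparison, here is a minimal sketch. It uses random-hyperplane hashing as a stand-in for the learned codes; BlockBoost itself learns the hash bits via boosting, which is not reproduced here, and the feature vectors and bit count are illustrative assumptions.

```python
# Minimal sketch of blocking via binary hash codes.
# Random-hyperplane hashing is a placeholder for learned codes, not BlockBoost itself.
from collections import defaultdict
import numpy as np

def binary_codes(X, n_bits=16, seed=0):
    """Map feature vectors to n_bits-bit binary codes using random hyperplanes."""
    rng = np.random.default_rng(seed)
    planes = rng.normal(size=(X.shape[1], n_bits))
    return (X @ planes > 0).astype(np.uint8)

def build_blocks(codes):
    """Group record indices by hash code; only within-block pairs are compared."""
    blocks = defaultdict(list)
    for i, code in enumerate(codes):
        blocks[code.tobytes()].append(i)
    return blocks

# Toy usage: database entries embedded as numeric feature vectors.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 32))
blocks = build_blocks(binary_codes(X))

# Candidate pairs come only from within each block, instead of all n*(n-1)/2 pairs.
n_candidates = sum(len(b) * (len(b) - 1) // 2 for b in blocks.values())
print(f"{len(blocks)} blocks, {n_candidates} candidate pairs "
      f"vs {10_000 * 9_999 // 2} total")
```

The recall/reduction-ratio trade-off mentioned in the abstract is governed by how the codes are chosen: coarser codes merge more true matches into the same block (higher recall) at the cost of more candidate comparisons (lower reduction ratio).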