Intrinsic dimensions, densities and the Information Imbalance: Remarkably simple yet effective tools for the data scientist
ABSTRACT
Data increasingly come in the form of very high-dimensional descriptors with hundreds or even thousands of coordinates, yet they typically lie on manifolds of much lower dimensionality that carry a rich set of hidden properties. In my talk, I will give an overview of some simple yet very effective numerical techniques for analysing fundamental characteristics of data manifolds. Specifically, I will describe estimators of intrinsic dimension [Macocco et al., PRL (2023)] and manifold density [Carli et al., arXiv (2024)], as well as methods to find informative coordinates [Glielmo et al., PNAS Nexus (2022); Camboulin et al., UniReps@NeurIPS (2024)]. I will support the theoretical explanation of the methods with practical demonstrations on toy datasets using the DADApy package [Glielmo et al., Patterns (2022)], available at https://github.com/sissa-data-science/DADApy.
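To make the idea of an intrinsic dimension concrete, here is a minimal sketch of a two-nearest-neighbour (TwoNN-style) maximum-likelihood estimate on a toy dataset. It is written in plain Python and is only an illustration: it is not necessarily the estimator presented in the talk, it does not use the DADApy API, and the function name and the toy embedding are hypothetical choices made for this example.

```python
import math
import random


def two_nn_intrinsic_dimension(points):
    """Estimate intrinsic dimension from ratios of 2nd to 1st neighbour distances."""
    log_ratios = []
    for i, p in enumerate(points):
        # Track the two smallest distances from p to any other point.
        r1 = r2 = math.inf
        for j, q in enumerate(points):
            if i == j:
                continue
            d = math.dist(p, q)
            if d < r1:
                r1, r2 = d, r1
            elif d < r2:
                r2 = d
        if r1 > 0:  # skip duplicated points, whose ratio is undefined
            log_ratios.append(math.log(r2 / r1))
    # Maximum-likelihood estimate assuming mu = r2 / r1 follows a Pareto law
    # whose exponent is the intrinsic dimension.
    return len(log_ratios) / sum(log_ratios)


# Hypothetical toy dataset: a 2D square linearly embedded in 5 coordinates.
rng = random.Random(0)
points = []
for _ in range(400):
    u, v = rng.random(), rng.random()
    points.append((u, v, u + v, u - v, 2.0 * u))

estimated_id = two_nn_intrinsic_dimension(points)
print(f"estimated intrinsic dimension: {estimated_id:.2f}")
```

Although each point has five coordinates, only two of them vary independently, so the estimate should come out close to 2 rather than 5. The brute-force neighbour search is quadratic in the number of points and is meant for small toy datasets only.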