Lectures

Lecture 1: “Algorithm Discovery using Reinforcement Learning 1/2”

Abstract: TBA

Lecture 2: “Algorithm Discovery using Reinforcement Learning 2/2”

Abstract: TBA

Matej Balog

Lecture 1: Transformers 1/2

Abstract: TBA

Lecture 2: Transformers 2/2

Abstract: TBA

Lecture 3: Large-Scale Pre-Training & Transfer in Computer Vision and Vision-Text Models 1/2

Abstract: TBA

Lecture 4: Large-Scale Pre-Training & Transfer in Computer Vision and Vision-Text Models 2/2

Abstract: TBA

Lucas Beyer

Lecture 1: PaLM-E: An Embodied Language Model

Abstract: TBA

Lecture 2: Efficiently Scaling Large Model Inference

Abstract: TBA

Aakanksha Chowdhery

Lecture 1: Graph Neural Networks 1/2

Abstract. Graph Neural Networks (GNNs) are an essential model class in the modern deep learning toolbox. They excel not only in classical machine learning tasks on graphs such as node classification, graph classification, and link prediction, but are becoming increasingly important for algorithmic reasoning tasks and for modeling various complex, interacting and dynamical systems – from predicting dynamics in social networks to learning accurate physical simulators.
This lecture will introduce GNNs from a message passing perspective, discuss the main representative GNN variants in use today, and give an overview of how GNNs are used in various graph representation learning tasks.
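The message passing view mentioned above can be illustrated with a minimal sketch. It assumes an undirected toy graph, sum aggregation, and a parameter-free update (own features plus aggregated messages, followed by ReLU). Practical GNN layers use learned message and update functions; the names `message_passing_step`, `graph_features`, and `graph_edges` are hypothetical and not taken from the lecture.

```python
def message_passing_step(features, edges):
    """One round of sum-aggregation message passing on an undirected graph.

    features: dict mapping node id -> list of floats (node feature vector)
    edges:    list of (u, v) pairs, each treated as undirected
    """
    dim = len(next(iter(features.values())))
    # Every node starts with an empty (all-zero) inbox.
    aggregated = {node: [0.0] * dim for node in features}
    for u, v in edges:
        # Undirected edge: each endpoint sends its features to the other.
        for src, dst in ((u, v), (v, u)):
            for k in range(dim):
                aggregated[dst][k] += features[src][k]
    # Update: combine own state with the aggregated messages, then apply ReLU.
    # (Assumption: no learned weights, to keep the sketch dependency-free.)
    return {
        node: [max(0.0, features[node][k] + aggregated[node][k]) for k in range(dim)]
        for node in features
    }


# Toy path graph a - b - c with 2-dimensional node features.
graph_features = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
graph_edges = [("a", "b"), ("b", "c")]
updated = message_passing_step(graph_features, graph_edges)
print(updated)
```

Stacking several such steps lets information travel along longer paths in the graph, since each round only exchanges features between direct neighbours.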

Lecture 2: Graph Neural Networks 2/2

Abstract. The second part of the lecture will focus on how GNNs can be used for modeling complex, dynamic interacting systems. We will cover how to learn to simulate the dynamics of complex interacting systems with GNNs and how to use GNNs to discover relations or interactions.

Lecture 3: Structured Scene Understanding 1/2

Abstract. The world around us is highly structured: our everyday environments contain a myriad of objects and other components that can be independently interacted with or reasoned about. A core challenge in perception is to learn to infer such a structured understanding of everyday scenes. This is reflected in computer vision tasks such as object detection, instance segmentation, or pose estimation.

This lecture will introduce methods for structured and object-centric scene understanding: we will discuss how object representations can be integrated into end-to-end deep learning architectures, giving rise to object-centric architectures such as the Detection Transformer (DETR). We will further cover how object-centric models such as Slot Attention can be trained without supervised object labels to discover objects in raw image data.

Lecture 4: Structured Scene Understanding 2/2

Abstract. The second part of this lecture will cover extensions of object-centric models to learn about 3D scenes, enabling use cases such as scene editing and novel view synthesis. Finally, we will discuss how this class of models can be used to learn about dynamics in scenes, to consistently track objects in scenes and to learn to simulate their dynamics forward in time.

Thomas Kipf

Lecture 1: “Zeroth-order Optimisation: Applications, Algorithms and Analysis, 1/2”

Abstract: TBA

Lecture 2: “Zeroth-order Optimisation: Applications, Algorithms and Analysis, 2/2”

Abstract: TBA

Tor Lattimore

Lecture 1: An Overview of the Principles of Parsimony and Self-Consistency: The Past, Present, and Future of Intelligence

Abstract: TBA

Lecture 2: An Introduction to Low-Dimensional Models and Deep Networks

Abstract: TBA

Lecture 3: Parsimony: White-box Deep Networks from Optimizing Rate Reduction

Abstract: TBA

Lecture 4: Self-Consistency: Closed-Loop Transcription of Low-Dimensional Structures via Maximin Rate Reduction

Abstract: TBA

Yi Ma

Lecture 4: Approaches to Increase Trustworthiness of Foundation Models

Abstract. The training data of language models may contain misogynistic, racist, or anti-religious texts, which are then reproduced by the model. Especially for dialog applications the output should be meaningful, specific and interesting, avoiding harmful suggestions and unfair bias, as well as false claims. The first step is a targeted preprocessing of the training data including deduplication and filtering of harmful content, which requires a lot of effort. After pre-training, the model has to be fine-tuned on controlled dialog data, possibly taking into account the documents retrieved by parallel retrieval operations. Explicit filters can be used in postprocessing to avoid unwanted content. In addition, the history of a dialog has to be saved and retrieved later to be taken into account during answer generation. Reinforcement learning with human feedback is used to generate text that is targeted to users’ prompts and produces the desired content. To prevent text-to-image models from delivering sexist or offensive depictions, the approaches must be extended to multimedia and multilingual domains. A final aspect is the explainability of the generated content, which increases the acceptance of the returned information. We discuss the level of trustworthiness achieved by the current approaches including our own OpenGPT-X model, and compare this with the proposed EU AI Act and other planned regulations.

Lecture 3: Combining Foundation Models with External Text Resources

Abstract. Language models such as GPT-4 have the ability to capture a lot of information and world knowledge contained in their training data. However, if the model’s prompt concerns very recent or very specific topics, there is often no relevant information in the training data. To avoid costly retraining of the model with up-to-date data, you can provide external information that the model should cover in the generated text. The retriever-reader scheme follows this path. The retriever employs dense retrieval to find texts matching the query. The reader is a pre-trained language model which is fine-tuned to combine the internal knowledge of the model with the retrieved texts and to generate a suitable answer. It has been shown that this approach improves the fraction of correct answers. In addition, retrieved documents can be added to the text as references. Similarly, other types of information, such as the contents of tables or databases, can be incorporated into a language model. We discuss the current accuracy improvements achieved by these models and new approaches for enhancement.

Lecture 2: Foundation Models for Retrieval Applications

Abstract. Traditional search engines rely on the matching of terms between the query and the documents. However, term-based retrieval systems have several limitations such as lack of robustness with respect to polysemy, synonymy, and paraphrasing between the query and the documents. Recently, Foundation Model techniques have been used to improve the representation of textual data and to enhance the ability of information retrieval systems to understand natural language queries. One approach is dense retrieval, where query and documents are expressed as embeddings and matched by nearest neighbor search. In addition, attention mechanisms have been employed to improve the ability of search engines to attend to important parts of the query and documents for matching. In the talk, we also discuss how to incorporate external knowledge during retrieval, such as knowledge graphs and information from different media like images. As the results for many benchmarks show, dense retrieval has significantly improved the performance of search engines. However, even with approximate nearest neighbor search, the cost of dense retrieval is higher than term-based retrieval and is an obstacle to widespread use. Nevertheless, all major commercial search engines claim to use language technology today.
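Dense retrieval as described above can be sketched in a few lines. The sketch assumes that query and documents have already been embedded; the three-dimensional vectors and the names `dense_retrieve`, `toy_docs`, and `toy_query` are made up for illustration. It uses exact (brute-force) nearest neighbour search with cosine similarity, whereas production systems typically rely on approximate search.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def dense_retrieve(query_vec, doc_vecs, k=1):
    """Return the ids of the k documents whose embeddings are closest to the query."""
    ranked = sorted(
        doc_vecs,
        key=lambda doc_id: cosine(query_vec, doc_vecs[doc_id]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical embeddings; a real system would obtain them from an encoder model.
toy_docs = {
    "doc_cars": [0.9, 0.1, 0.0],
    "doc_cooking": [0.0, 0.2, 0.9],
    "doc_engines": [0.8, 0.3, 0.1],
}
toy_query = [1.0, 0.2, 0.0]
top_two = dense_retrieve(toy_query, toy_docs, k=2)
print(top_two)
```

Because matching happens in embedding space, a document can be retrieved even if it shares no terms with the query, which is the robustness to synonymy and paraphrasing that the abstract refers to.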

Lecture 1: Introduction to Foundation Models

Abstract. Starting with the Transformer, the concept of self-attention was invented, which represents the meaning of tokens in a text by context-sensitive embedding vectors. Based on the correlation of embeddings of input tokens, each layer of the network generates more expressive embeddings, taking into account the relation to all tokens of the input text. These models are pre-trained on large collections of text documents with the task of predicting omitted tokens or the next token in the sentence. The models achieve an unprecedented accuracy for generating new text and can be adapted to new tasks by fine-tuning. If the models have a sufficient number of parameters, they can simply be prompted to perform a task without any fine-tuning. It turned out that the models can also be applied to other media like images, sound, video, etc. by partitioning these media into tokens and applying self-attention to capture their contents. They are called Foundation Models because they can be used as a basic architecture for a wide range of AI tasks, superseding prior models such as RNNs and CNNs. In this lecture we describe the basic architecture of BERT, GPT, and the Transformer and discuss the concept of transfer learning. We then explain token representations of various media and models simultaneously processing tokens from different media. Finally, we summarize the properties and potential impact of Foundation Models.
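The idea of context-sensitive embeddings computed from the correlation between token embeddings can be sketched as scaled dot-product self-attention. This is a simplified illustration, assuming a single head and identity query, key, and value projections (a Transformer layer learns these projections); `self_attention` and `toy_tokens` are hypothetical names.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Assumption: queries, keys, and values are the raw token embeddings,
    i.e. the learned projection matrices are replaced by the identity.
    """
    dim = len(tokens[0])
    outputs = []
    for query in tokens:
        # Correlation of this token with every token, scaled by sqrt(dim).
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        weights = softmax(scores)
        # New embedding: attention-weighted average of all token embeddings.
        outputs.append(
            [sum(w * value[j] for w, value in zip(weights, tokens)) for j in range(dim)]
        )
    return outputs


toy_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
contextual = self_attention(toy_tokens)
print(contextual)
```

Each output vector is a weighted mixture of all input embeddings, so the representation of a token depends on every other token in the input.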

Gerhard Paass

Lecture: “Diffusion capacity of single and interconnected networks”

This lecture addresses the significant challenge of comprehending diffusive processes in networks in the context of complexity. Networks possess a diffusive potential that depends on their topological configuration, but diffusion also relies on the process and initial conditions. The lecture introduces the concept of Diffusion Capacity, a measure of a node’s potential to diffuse information that incorporates a distance distribution considering both geodesic and weighted shortest paths and the dynamic features of the diffusion process. This concept provides a comprehensive depiction of individual nodes’ roles during the diffusion process and can identify structural modifications that may improve diffusion mechanisms. The lecture also defines Diffusion Capacity for interconnected networks and introduces Relative Gain, a tool that compares a node’s performance in a single structure versus an interconnected one. To demonstrate the concept’s utility, we apply the methodology to a global climate network formed from surface air temperature data, revealing a significant shift in diffusion capacity around the year 2000. This suggests a decline in the planet’s diffusion capacity, which may contribute to the emergence of more frequent climatic events. Our goal is to gain a deeper understanding of the complexities of diffusive processes in networks and the potential applications of the Diffusion Capacity concept.

Reference: Schieber, T.A., Carpi, L.C., Pardalos, P.M. et al. Diffusion capacity of single and interconnected networks. Nat Commun 14, 2217 (2023). https://doi.org/10.1038/s41467-023-37323-0

Panos M. Pardalos www.ise.ufl.edu/pardalos

Panos Pardalos

Lecture 1: Low-Dimensional and Nonconvex Models for Shallow Representation Learning

Abstract: Machine learning is transforming every field of science and engineering. However, as data is increasing in volume and dimension, the performance of modern machine learning methods is critically dependent on the choice of data representation. In the past decade, although we witnessed the revolutionary empirical success of many representation learning methods, from (convolutional) dictionary learning to deep learning, the underlying principles behind their success still largely remain a mystery, which hinders their further development and adoption to broader applications. One of the major challenges originates from the nonlinearity of the data representation models, which often results in complicated, highly nonconvex optimization problems; in the worst case, solving nonconvex problems can be NP-hard. Nonetheless, various empirical evidence suggests that the symmetric properties of the problem and intrinsic low-dimensional structures of the data often alleviate the hardness of these problems, so that simple heuristic nonconvex methods often work surprisingly well for learning succinct representations.

Lecture 2: Low-Dimensional Structures in Deep Representation Learning I

Abstract: This lecture focuses on the study of the low-dimensional structures appearing in the last-layer of deep networks. Recently, an intriguing phenomenon has been discovered in the final stages of network training for many classification problems. This phenomenon, known as Neural Collapse, has generated significant interest. It involves the collapse of the last-layer features and classifiers into elegant and simple mathematical structures, where all training inputs are mapped to class-specific points in feature space, and the last-layer classifier converges to the dual of the features’ class means while achieving the maximum possible margin. This phenomenon persists across various network architectures, datasets, and even data domains. The lecture explores the symmetry and geometry of Neural Collapse and develops a rigorous mathematical theory that explains when and why this low-dimensional structure of the last-layer representation occurs under the unconstrained feature model, and justifies its ubiquity across different network architectures, training losses, and problem formulations.
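As a small illustration of the kind of structure involved: in the Neural Collapse literature, the collapsed class means are commonly described as forming a simplex equiangular tight frame (ETF), which the abstract above does not name explicitly, so treating it as the target geometry here is an assumption. The sketch builds such a frame for K classes and checks that all vectors have unit norm and pairwise cosine -1/(K-1); `simplex_etf` is a hypothetical helper name.

```python
import math


def simplex_etf(num_classes):
    """Rows are the K vectors sqrt(K/(K-1)) * (e_k - (1/K) * ones)."""
    scale = math.sqrt(num_classes / (num_classes - 1))
    return [
        [scale * ((1.0 if i == k else 0.0) - 1.0 / num_classes) for i in range(num_classes)]
        for k in range(num_classes)
    ]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


num_classes = 4
etf = simplex_etf(num_classes)
norms = [math.sqrt(sum(x * x for x in row)) for row in etf]
pairwise = [
    cosine(etf[a], etf[b])
    for a in range(num_classes)
    for b in range(a + 1, num_classes)
]
print(norms)     # all close to 1
print(pairwise)  # all close to -1/(K-1)
```

Equal norms and equal, maximally negative pairwise angles mean the class directions are as far apart from each other as possible, which is one way to read the "maximum possible margin" statement.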

Lecture 3: Low-Dimensional Structures in Deep Representation Learning II

Abstract: In the second lecture, we delve deeper into the low-dimensional structures of representation in intermediate layers, building on the concepts covered in the previous lecture. Our findings indicate that as we move from shallow to deep layers of a learned deep network, there is a gradual collapse in feature variability often with a linear decay ratio. We established a theoretical explanation for this phenomenon using a multi-layer deep linear network. Our analysis shows that if a deep linear network is trained via gradient descent using small and orthogonal weights, the within-class variability measure undergoes linear decay as we go from shallow to deep layers. Moreover, we demonstrate that the rate of linear decay is determined by the weight initialization scale. Finally, we demonstrate how our study can be leveraged to provide guidelines for improving the generalizability and transferability of deep representations, leading to more efficient fine-tuning strategies for classification problems in vision.

Lecture 4: Robust Learning of Overparameterized Networks via Low-Dimensional Models

Abstract: In recent years, over-parameterized models with a higher number of parameters than the amount of available data have become dominant in the field of machine learning, leading to improved performances. However, when the training data is corrupted, over-parameterized models tend to overfit and fail to generalize. The third part of the lecture aims to tackle this issue through low-dimensional modeling. The approach involves leveraging the implicit regularization of gradient descent on overparameterized models and exploiting the incoherence between sparse corruption and low-rank structures to prevent overfitting during training. This is achieved by accurately separating noise from data using a method called Double Over-Parameterization (DOP). Contrary to classical wisdom, which suggests that more parameters exacerbate overfitting, DOP uses a specific choice of learning rates on different sets of model parameters to prevent overfitting. Empirical results show that DOP outperforms traditional methods when applied to tasks such as image recovery from corrupted measurements and image classification under label noise.

Qing Qu

Lecture 1: Shape-Constrained Kernel Machines and Their Applications

Abstract: TBA

Lecture 2: Beyond Mean Embedding: The Power of Cumulants in RKHSs

Abstract: TBA

Zoltan Szabo

Lecture 1: Reinforcement Learning

Abstract: TBA

Lecture 2: Deep Reinforcement Learning

Abstract: TBA

Lecture 3: Learning by Bootstrapping: Representation Learning

Abstract: TBA

Lecture 4: Learning by Bootstrapping: World Models

Abstract: TBA

Michal Valko

Tutorials

Lecture 1: Wonders of high-dimensions: the maths and physics of Machine Learning 1/3

The past decade has witnessed a surge in the development and adoption of machine learning algorithms to solve day-to-day computational tasks. Yet, a solid theoretical understanding of even the most basic tools used in practice is still lacking, as traditional statistical learning methods are unfit to deal with the modern regime in which the number of model parameters is of the same order as the quantity of data, a problem known as the curse of dimensionality. Curiously, this is precisely the regime studied by physicists since the mid-19th century in the context of interacting many-particle systems. This connection, which was first established in the seminal work of Elisabeth Gardner and Bernard Derrida in the 80s, is the basis of a long and fruitful marriage between these two fields.

The goal of this tutorial is to provide an in-depth overview of these connections. We will study the high-dimensional behaviour of different models from machine learning, such as linear and logistic regression, kernel methods and two-layer neural networks, and uncover some interesting phenomenology which is at the edge of our current theoretical understanding of modern ML. At the end of this tutorial, we expect the student to have a clear picture of the different tools available in the statistical physics toolbox, as well as their scope and limitations.

Note: no prior knowledge of statistical physics is expected.

Lecture 2: Wonders of high-dimensions: the maths and physics of Machine Learning 2/3

Note: no prior knowledge of statistical physics is expected.

Lecture 3: Wonders of high-dimensions: the maths and physics of Machine Learning 3/3

Note: no prior knowledge of statistical physics is expected.

Bruno Loureiro

Tutorial 1: Backpropagation Neural Tree

Simpler models tend to generalize better. This research presents a class of neural-inspired algorithms that are highly sparse in their architectural construction but perform highly accurately. In addition, they perform function approximation and feature selection simultaneously when solving machine learning tasks: classification, regression, and pattern recognition. This class comprises the Neural Tree Algorithms: Heterogeneous Neural Tree, Multi-Output Neural Tree, and Backpropagation Neural Tree. This research found that any such arbitrarily constructed neural tree, which is like an arbitrarily “thinned” neural network, has the potential to solve machine learning tasks with an equivalent or better degree of accuracy than a fully connected symmetric and systematic neural network architecture. The algorithm takes random repeated inputs through its leaves and imposes dendritic nonlinearities through its internal connections like a biological dendritic tree would do. The algorithm produces an ad hoc neural tree which is trained using a stochastic gradient descent optimizer. The algorithms produce high-performing and parsimonious models balancing the complexity with descriptive ability on a wide variety of machine learning problems.

Resources:
Ojha, V., & Nicosia, G. (2022). Backpropagation neural tree. Neural Networks, 149, 66-83: https://arxiv.org/pdf/2202.02248.pdf
Ojha, V., & Nicosia, G. (2020). Multiobjective optimization of multi-output neural trees. In 2020 IEEE Congress on Evolutionary Computation (CEC) (pp. 1-8). IEEE Press: https://arxiv.org/pdf/2010.04524.pdf

Tutorial 2: Sensitivity Analysis of Deep Learning and Optimization Algorithms

Sensitivity analysis offers the opportunity to explore the sensitivity (influence) of parameters on a model. This work applies global sensitivity analysis to deep learning and optimization algorithms for the analysis of the influence of their hyperparameters. For deep learning, we analyzed hyperparameters such as type of optimizers, learning rate, batch size, etc. We analyzed these hyperparameters for deep neural networks such as ResNet18, AlexNet, and GoogLeNet. For the optimization algorithms, we analyzed hyperparameters of two single-objective and two multi-objective state-of-the-art global optimization evolutionary algorithms as an algorithm configuration problem. We investigate the quality of influence hyperparameters have on the performance of algorithms in terms of their direct effect and interaction effect with other hyperparameters. Using three sensitivity analysis methods, Morris LHS, Morris, and Sobol, to systematically analyze tuneable hyperparameters, the framework reveals the behaviours of hyperparameters to sampling methods and performance metrics. That is, it answers questions such as what the influence patterns of the hyperparameters are, how they interact, how much they interact, and how much their direct influence is. Consequently, the ranking of hyperparameters suggests their order of tuning, and the pattern of influence reveals the stability of the algorithms.
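To give a flavour of how a screening method such as Morris ranks hyperparameters, the following sketch computes the mean absolute elementary effect of each input of a toy function. It is a simplified one-at-a-time variant under stated assumptions: inputs are scaled to [0, 1], base points are sampled uniformly at random rather than along Morris trajectories, and `toy_score` is a made-up stand-in for a model's validation score. It is not the framework used in the tutorial.

```python
import random


def elementary_effects(model, num_params, num_samples=50, delta=0.1, seed=0):
    """Mean absolute elementary effect per parameter (simplified one-at-a-time design).

    For each random base point x and each parameter i, the elementary effect is
    (model(x + delta * e_i) - model(x)) / delta.
    """
    rng = random.Random(seed)
    totals = [0.0] * num_params
    for _ in range(num_samples):
        # Keep base points in [0, 1 - delta] so the shifted point stays in [0, 1].
        base = [rng.uniform(0.0, 1.0 - delta) for _ in range(num_params)]
        base_value = model(base)
        for i in range(num_params):
            shifted = list(base)
            shifted[i] += delta
            totals[i] += abs(model(shifted) - base_value) / delta
    return [total / num_samples for total in totals]


def toy_score(x):
    # Hypothetical response: strong effect of x[0], weak effect of x[1], none of x[2].
    return 5.0 * x[0] + 0.5 * x[1]


mu_star = elementary_effects(toy_score, num_params=3)
ranking = sorted(range(3), key=lambda i: mu_star[i], reverse=True)
print(mu_star, ranking)
```

A larger mean absolute effect indicates a more influential input, which yields a ranking of the kind the abstract proposes as an order of tuning. Variance-based methods such as Sobol additionally separate direct effects from interaction effects.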

Resources:
Assessing Ranking and Effectiveness of Evolutionary Algorithm Hyperparameters Using Global Sensitivity Analysis Methodologies, Swarm and Evolutionary Computation: https://arxiv.org/pdf/2207.04820.pdf
Sensitivity Analysis for Deep Learning: Ranking Hyper-parameter Influence: https://ieeexplore.ieee.org/document/9643336

Varun Ojha