### Lectures

##### Algorithm Discovery using Reinforcement Learning

Abstract: TBA

##### Lecture 3: Large-Scale Pre-Training & Transfer in Computer Vision and Vision-Text Models 1/2

Abstract: TBA

##### Lecture 4: Large-Scale Pre-Training & Transfer in Computer Vision and Vision-Text Models 2/2

Abstract: TBA

##### Lecture 1: Transformers 1/2

Abstract: TBA

##### Lecture 2: Transformers 2/2

Abstract: TBA

##### Lecture 1: PaLM-E: An Embodied Language Model

Abstract: TBA

##### Lecture 2: Efficiently Scaling Large Model Inference

Abstract: TBA

##### Lecture 1: Graph Neural Networks 1/2

**Abstract.** Graph Neural Networks (GNNs) are an essential model class in the modern deep learning toolbox. They excel not only in classical machine learning tasks on graphs such as node classification, graph classification, and link prediction, but are becoming increasingly important for algorithmic reasoning tasks and for modeling various complex, interacting and dynamical systems – from predicting dynamics in social networks to learning accurate physical simulators.

This lecture will introduce GNNs from a message passing perspective, discuss the main representative GNN variants in use today, and give an overview of how GNNs are used in various graph representation learning tasks.
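As a rough illustration of the message passing perspective mentioned above, the following sketch runs one round of sum-aggregation message passing on a small graph. The function name `message_passing_step`, the adjacency-list format, and the scalar weights `w_self` and `w_neigh` are assumptions chosen for illustration; they are not taken from the lecture, and practical GNN layers use learned weight matrices instead of fixed scalars.

```python
def message_passing_step(features, neighbors, w_self=1.0, w_neigh=0.5):
    """One round of message passing with sum aggregation and a ReLU update.

    features:  dict node -> list of floats (node feature vector)
    neighbors: dict node -> list of neighbouring nodes
    w_self, w_neigh: hypothetical fixed scalars standing in for learned weights
    """
    updated = {}
    for node, h in features.items():
        # Aggregate: sum the feature vectors of all neighbours.
        agg = [0.0] * len(h)
        for nb in neighbors.get(node, []):
            for i, value in enumerate(features[nb]):
                agg[i] += value
        # Update: combine own features with the aggregated message, then ReLU.
        updated[node] = [max(0.0, w_self * a + w_neigh * m) for a, m in zip(h, agg)]
    return updated


# A path graph 0 - 1 - 2 with one-dimensional node features.
features = {0: [1.0], 1: [2.0], 2: [3.0]}
neighbors = {0: [1], 1: [0, 2], 2: [1]}
new_features = message_passing_step(features, neighbors)
```

Stacking several such rounds lets information travel over longer paths in the graph, since each round only reaches direct neighbours.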

##### Lecture 2: Graph Neural Networks 2/2

**Abstract.** The second part of the lecture will focus on how GNNs can be used for modeling complex, dynamic interacting systems. We will cover how to learn to simulate the dynamics of complex interacting systems with GNNs and how to use GNNs to discover relations or interactions.

##### Lecture 3: Structured Scene Understanding 1/2

**Abstract.** The world around us is highly structured: our everyday environments contain a myriad of objects and other components that can be independently interacted with or reasoned about. A core challenge in perception is to learn to infer such a structured understanding of everyday scenes. This is reflected in computer vision tasks such as object detection, instance segmentation, or pose estimation.

This lecture will introduce methods for structured and object-centric scene understanding: we will discuss how object representations can be integrated into end-to-end deep learning architectures, giving rise to object-centric architectures such as the Detection Transformer (DETR). We will further cover how object-centric models such as Slot Attention can be trained without supervised object labels to discover objects in raw image data.

##### Lecture 4: Structured Scene Understanding 2/2

**Abstract.** The second part of this lecture will cover extensions of object-centric models to learn about 3D scenes, enabling use cases such as scene editing and novel view synthesis. Finally, we will discuss how this class of models can be used to learn about dynamics in scenes, to consistently track objects in scenes and to learn to simulate their dynamics forward in time.

##### Lecture 1: An Overview of the Principles of Parsimony and Self-Consistency: The Past, Present, and Future of Intelligence

Abstract: TBA

##### Lecture 2: An Introduction to Low-Dimensional Models and Deep Networks

Abstract: TBA

##### Lecture 3: Parsimony: White-box Deep Networks from Optimizing Rate Reduction

Abstract: TBA

##### Lecture 4: Self-Consistency: Closed-Loop Transcription of Low-Dimensional Structures via Maximin Rate Reduction

Abstract: TBA

##### Lecture 4: Approaches to Increase Trustworthiness of Foundation Models

**Abstract**. The training data of language models may contain misogynistic, racist, or anti-religious texts, which are then reproduced by the model. Especially for dialog applications, the output should be meaningful, specific and interesting, avoiding harmful suggestions and unfair bias, as well as false claims. The first step is a targeted preprocessing of the training data including deduplication and filtering of harmful content, which requires a lot of effort. After pre-training, the model has to be fine-tuned on controlled dialog data, possibly taking into account the documents retrieved by parallel retrieval operations. Explicit filters can be used in postprocessing to avoid unwanted content. In addition, the history of a dialog has to be saved and retrieved later to be taken into account during answer generation. Reinforcement learning with human feedback is used to generate text that is targeted to users’ prompts and produces the desired content. To prevent text-to-image models from delivering sexist or offensive depictions, the approaches must be extended to multimedia and multilingual domains. A final aspect is the explainability of the generated content, which increases the acceptance of the returned information. We discuss the level of trustworthiness achieved by the current approaches, including our own OpenGPT-X model, and compare this with the proposed EU AI Act and other planned regulations.

##### Lecture 3: Combining Foundation Models with External Text Resources

**Abstract**. Language models such as GPT-4 have the ability to capture a lot of information and world knowledge contained in their training data. However, if the model’s prompt concerns very recent or very specific topics, there is often no information in the training data. To avoid costly retraining of the model with up-to-date data, you can provide external information that the model should cover in the generated text. The retriever-reader scheme follows this path. The retriever employs dense retrieval to find texts matching the query. The reader is a pre-trained language model which is fine-tuned to combine the internal knowledge of the model with the retrieved texts and to generate a suitable answer. It has been shown that this approach improves the fraction of correct answers. In addition, retrieved documents can be added to the text as references. Similarly, other types of information, such as the contents of tables or databases, can be incorporated into a language model. We discuss the current accuracy improvements achieved by these models and new approaches for enhancement.

##### Lecture 2: Foundation Models for Retrieval Applications

**Abstract**. Traditional search engines rely on the matching of terms between the query and the documents. However, term-based retrieval systems have several limitations such as lack of robustness with respect to polysemy, synonymy, and paraphrasing between the query and the documents. Recently, Foundation Model techniques have been used to improve the representation of textual data and to enhance the ability of information retrieval systems to understand natural language queries. One approach is dense retrieval, where query and documents are expressed as embeddings and matched by nearest neighbor search. In addition, attention mechanisms have been employed to improve the ability of search engines to attend to important parts of the query and documents for matching. In the talk, we also discuss how to incorporate external knowledge during retrieval, such as knowledge graphs and information from different media like images. As the results for many benchmarks show, dense retrieval has significantly improved the performance of search engines. However, even with approximate nearest neighbor search, the cost of dense retrieval is higher than term-based retrieval and is an obstacle to widespread use. Nevertheless, all major commercial search engines claim to use language technology today.
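To make the dense retrieval idea above concrete, here is a minimal sketch that matches a query embedding against document embeddings by exhaustive nearest neighbor search under cosine similarity. The embeddings are made-up three-dimensional vectors and the names `cosine_similarity` and `dense_retrieve` are hypothetical; a real system would obtain embeddings from a trained encoder and would typically use an approximate nearest neighbor index instead of scanning every document.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def dense_retrieve(query_embedding, doc_embeddings, k=2):
    """Return the ids of the k documents most similar to the query."""
    scored = [
        (cosine_similarity(query_embedding, emb), doc_id)
        for doc_id, emb in doc_embeddings.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]


# Toy embeddings (assumed values, for illustration only).
doc_embeddings = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.0, 1.0, 0.0],
    "doc_c": [0.7, 0.7, 0.1],
}
top = dense_retrieve([1.0, 0.0, 0.0], doc_embeddings, k=2)
```

The exhaustive scan costs one similarity computation per document, which hints at why the abstract describes cost as an obstacle at web scale.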

##### Lecture 1: Introduction to Foundation Models

**Abstract**. The concept of self-attention, introduced with the Transformer, represents the meaning of tokens in a text by context-sensitive embedding vectors. Based on the correlation of embeddings of input tokens, each layer of the network generates more expressive embeddings, taking into account the relation to all tokens of the input text. These models are pre-trained on large collections of text documents with the task of predicting omitted tokens or the next token in the sentence. The models achieve an unprecedented accuracy for generating new text and can be adapted to new tasks by fine-tuning. If the models have a sufficient number of parameters, they can simply be prompted to perform a task without any fine-tuning. It turned out that the models can also be applied to other media like images, sound, video, etc. by partitioning these media into tokens and applying self-attention to capture their contents. They are called Foundation Models because they can be used as a basic architecture for a wide range of AI tasks, superseding prior models such as RNNs and CNNs. In this lecture we describe the basic architecture of BERT, GPT, and the Transformer and discuss the concept of transfer learning. We then explain token representations of various media and models simultaneously processing tokens from different media. Finally, we summarize the properties and potential impact of foundation models.
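The way self-attention turns token embeddings into context-sensitive embeddings can be sketched in a few lines. The example below is a simplified, assumption-laden version: it uses scaled dot-product attention with identity query, key, and value projections (no learned weights, a single head), and the function names `softmax` and `self_attention` are chosen here for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(embeddings):
    """Scaled dot-product self-attention with identity Q/K/V projections.

    Each output vector is a weighted average of all input vectors, where the
    weights depend on the dot products between the token and every other token.
    """
    d = len(embeddings[0])
    outputs = []
    for query in embeddings:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
            for key in embeddings
        ]
        weights = softmax(scores)
        out = [sum(w * value[i] for w, value in zip(weights, embeddings)) for i in range(d)]
        outputs.append(out)
    return outputs


# Three toy token embeddings of dimension 2 (assumed values).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
contextual = self_attention(tokens)
```

In a Transformer layer, learned projection matrices would be applied before computing the scores, and several such heads would run in parallel.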

##### Lecture: “Diffusion capacity of single and interconnected networks”

*et al.* Diffusion capacity of single and interconnected networks. *Nat Commun* 14, 2217 (2023). https://doi.org/10.1038/s41467-023-37323-0

##### Lecture 1: Low-Dimensional and Nonconvex Models for Shallow Representation Learning

**Abstract:** Machine learning is transforming every field of science and engineering. However, as data is increasing in volume and dimension, the performance of modern machine learning methods is critically dependent on the choice of data representation. In the past decade, although we witnessed the revolutionary empirical success of many representation learning methods, from (convolutional) dictionary learning to deep learning, the underlying principles behind their success still largely remain a mystery, which hinders their further development and adoption in broader applications. One of the major challenges originates from the nonlinearity of the data representation models, which often results in complicated, highly nonconvex optimization problems; in the worst case, solving nonconvex problems can be NP-hard. Nonetheless, various empirical evidence suggests that the symmetric properties of the problem and intrinsic low-dimensional structures of the data often alleviate the hardness of these problems, so that simple heuristic nonconvex methods often work surprisingly well for learning succinct representations.

##### Lecture 2: Low-Dimensional Structures in Deep Representation Learning I

**Abstract:** This lecture focuses on the study of the low-dimensional structures appearing in the last layer of deep networks. Recently, an intriguing phenomenon has been discovered in the final stages of network training for many classification problems. This phenomenon, known as Neural Collapse, has generated significant interest. It involves the collapse of the last-layer features and classifiers into elegant and simple mathematical structures, where all training inputs are mapped to class-specific points in feature space, and the last-layer classifier converges to the dual of the features’ class means while achieving the maximum possible margin. This phenomenon persists across various network architectures, datasets, and even data domains. The lecture explores the symmetry and geometry of Neural Collapse and develops a rigorous mathematical theory that explains when and why this low-dimensional structure of the last-layer representation occurs under the unconstrained feature model, and justifies its ubiquity across different network architectures, training losses, and problem formulations.
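The statement that all training inputs are mapped to class-specific points can be checked numerically by comparing within-class and between-class variability of the features. The sketch below computes a simplified scalar proxy, the ratio of within-class to between-class squared deviation; the function name `within_between_ratio` and the toy feature values are assumptions, and this proxy is not claimed to be the exact metric used in the lecture or in the Neural Collapse literature.

```python
def within_between_ratio(features, labels):
    """Ratio of within-class to between-class squared deviation.

    A value of 0 means every feature coincides with its class mean,
    which is the collapsed configuration described above.
    """
    dim = len(features[0])
    n = len(features)
    global_mean = [sum(f[i] for f in features) / n for i in range(dim)]

    # Group feature vectors by class label.
    by_class = {}
    for f, y in zip(features, labels):
        by_class.setdefault(y, []).append(f)

    within = 0.0
    between = 0.0
    for members in by_class.values():
        mean = [sum(f[i] for f in members) / len(members) for i in range(dim)]
        within += sum((f[i] - mean[i]) ** 2 for f in members for i in range(dim))
        between += len(members) * sum((mean[i] - global_mean[i]) ** 2 for i in range(dim))
    return within / between


labels = [0, 0, 1, 1]
collapsed = within_between_ratio([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]], labels)
spread = within_between_ratio([[1.2, 0.0], [0.8, 0.0], [0.0, 1.2], [0.0, 0.8]], labels)
```

Tracking such a ratio over training epochs is one way to observe whether last-layer features are moving toward their class means.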

##### Lecture 3: Low-Dimensional Structures in Deep Representation Learning II

**Abstract:** In the second lecture, we delve deeper into the low-dimensional structures of representation in intermediate layers, building on the concepts covered in the previous lecture. Our findings indicate that as we move from shallow to deep layers of a learned deep network, there is a gradual collapse in feature variability often with a linear decay ratio. We established a theoretical explanation for this phenomenon using a multi-layer deep linear network. Our analysis shows that if a deep linear network is trained via gradient descent using small and orthogonal weights, the within-class variability measure undergoes linear decay as we go from shallow to deep layers. Moreover, we demonstrate that the rate of linear decay is determined by the weight initialization scale. Finally, we demonstrate how our study can be leveraged to provide guidelines for improving the generalizability and transferability of deep representations, leading to more efficient fine-tuning strategies for classification problems in vision.

##### Lecture 4: Robust Learning of Overparameterized Networks via Low-Dimensional Models

**Abstract:** In recent years, over-parameterized models with a higher number of parameters than the amount of available data have become dominant in the field of machine learning, leading to improved performance. However, when the training data is corrupted, over-parameterized models tend to overfit and fail to generalize. The third part of the lecture aims to tackle this issue through low-dimensional modeling. The approach involves leveraging the implicit regularization of gradient descent on overparameterized models and exploiting the incoherence between sparse corruption and low-rank structures to prevent overfitting during training. This is achieved by accurately separating noise from data using a method called Double Over-Parameterization (DOP). Contrary to classical wisdom, which suggests that more parameters exacerbate overfitting, DOP uses a specific choice of learning rates on different sets of model parameters to prevent overfitting. Empirical results show that DOP outperforms traditional methods when applied to tasks such as image recovery from corrupted measurements and image classification under label noise.

##### Lecture 1: Shape-Constrained Kernel Machines and Their Applications

Abstract: TBA

##### Lecture 2: Beyond Mean Embedding: The Power of Cumulants in RKHSs

Abstract: TBA

##### Lecture 1: Reinforcement Learning

Abstract: TBA

##### Lecture 2: Deep Reinforcement Learning

Abstract: TBA

##### Lecture 3: Learning by Bootstrapping: Representation Learning

Abstract: TBA

##### Lecture 4: Learning by Bootstrapping: World Models

Abstract: TBA

### Tutorials

##### Wonders of high-dimensions: the maths and physics of Machine Learning 1/10

**Wonders of high-dimensions: the maths and physics of Machine Learning**

*The past decade has witnessed a surge in the development and adoption of machine learning algorithms to solve day-to-day computational tasks. Yet, a solid theoretical understanding of even the most basic tools used in practice is still lacking, as traditional statistical learning methods are unfit to deal with the modern regime in which the number of model parameters is of the same order as the quantity of data, a problem known as the curse of dimensionality. Curiously, this is precisely the regime studied by physicists since the mid-19th century in the context of interacting many-particle systems. This connection, which was first established in the seminal work of Elisabeth Gardner and Bernard Derrida in the 80s, is the basis of a long and fruitful marriage between these two fields.*

*The goal of this tutorial is to provide an in-depth overview of these connections. We will study the high-dimensional behaviour of different models from machine learning, such as linear and logistic regression, kernel methods and two-layer neural networks, and uncover some interesting phenomenology which is at the edge of our current theoretical understanding of modern ML. At the end of this tutorial, we expect the student to have a good vision of the different tools available in the statistical physics toolbox, as well as their scope and limitations.*

*Note: no prior knowledge of statistical physics is expected.*

##### Wonders of high-dimensions: the maths and physics of Machine Learning 2/10

**Wonders of high-dimensions: the maths and physics of Machine Learning**

*Note: no prior knowledge of statistical physics is expected.*

##### Wonders of high-dimensions: the maths and physics of Machine Learning 3/10

**Wonders of high-dimensions: the maths and physics of Machine Learning**

*Note: no prior knowledge of statistical physics is expected.*

##### Wonders of high-dimensions: the maths and physics of Machine Learning 4/10

**Wonders of high-dimensions: the maths and physics of Machine Learning**

*Note: no prior knowledge of statistical physics is expected.*

##### Wonders of high-dimensions: the maths and physics of Machine Learning 5/10

**Wonders of high-dimensions: the maths and physics of Machine Learning**

*Note: no prior knowledge of statistical physics is expected.*

##### Wonders of high-dimensions: the maths and physics of Machine Learning 6/10

**Wonders of high-dimensions: the maths and physics of Machine Learning**

*Note: no prior knowledge of statistical physics is expected.*

##### Wonders of high-dimensions: the maths and physics of Machine Learning 7/10

**Wonders of high-dimensions: the maths and physics of Machine Learning**

*Note: no prior knowledge of statistical physics is expected.*

##### Wonders of high-dimensions: the maths and physics of Machine Learning 8/10

**Wonders of high-dimensions: the maths and physics of Machine Learning**

*Note: no prior knowledge of statistical physics is expected.*

##### Wonders of high-dimensions: the maths and physics of Machine Learning 9/10

**Wonders of high-dimensions: the maths and physics of Machine Learning**

*Note: no prior knowledge of statistical physics is expected.*

##### Wonders of high-dimensions: the maths and physics of Machine Learning 10/10

**Wonders of high-dimensions: the maths and physics of Machine Learning**

*Note: no prior knowledge of statistical physics is expected.*

##### Backpropagation Neural Tree

Simpler models generalize better. This research presents a class of neural-inspired algorithms that are highly sparse in their architectural construction yet highly accurate. In addition, they perform function approximation and feature selection simultaneously when solving machine learning tasks: classification, regression, and pattern recognition. This class comprises the Neural Tree Algorithms: Heterogeneous Neural Tree, Multi-Output Neural Tree, and Backpropagation Neural Tree. This research found that any such arbitrarily constructed neural tree, which is like an arbitrarily “thinned” neural network, has the potential to solve machine learning tasks with an equivalent or better degree of accuracy than a fully connected symmetric and systematic neural network architecture. The algorithm takes random repeated inputs through its leaves and imposes dendritic nonlinearities through its internal connections like a biological dendritic tree would do. The algorithm produces an ad hoc neural tree which is trained using a stochastic gradient descent optimizer. The algorithms produce high-performing and parsimonious models balancing complexity with descriptive ability on a wide variety of machine learning problems.
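To give a feel for the structure described above, the following sketch evaluates a small hand-built tree whose leaves read input features and whose internal nodes apply a nonlinearity to a weighted sum of their children. The tuple encoding, the sigmoid activation, and the name `evaluate_tree` are assumptions made for illustration; they are not taken from the referenced papers, and the stochastic gradient descent training step is not shown.

```python
import math


def evaluate_tree(node, x):
    """Recursively evaluate a neural tree on the input vector x.

    A leaf is ("leaf", feature_index).
    An internal node is ("node", bias, [(weight, child), ...]).
    """
    if node[0] == "leaf":
        return x[node[1]]
    _, bias, children = node
    total = bias + sum(w * evaluate_tree(child, x) for w, child in children)
    return 1.0 / (1.0 + math.exp(-total))  # sigmoid nonlinearity (assumed)


# Feature 0 is used twice (repeated input), feature 1 is never used
# (implicit feature selection), feature 2 is used once.
tree = ("node", 0.0, [
    (1.0, ("leaf", 0)),
    (-1.0, ("node", 0.0, [(2.0, ("leaf", 2)), (1.0, ("leaf", 0))])),
])
output = evaluate_tree(tree, [0.5, 9.0, -0.25])
same_output = evaluate_tree(tree, [0.5, -3.0, -0.25])  # only feature 1 changed
```

Because feature 1 appears in no leaf, changing it cannot affect the output, which is one simple way to read feature selection off the tree structure.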

Resources:

Ojha, V., & Nicosia, G. (2022). Backpropagation neural tree. Neural Networks, 149, 66-83: https://arxiv.org/pdf/2202.02248.pdf

Ojha, V., & Nicosia, G. (2020). Multiobjective optimization of multi-output neural trees. In 2020 IEEE Congress on Evolutionary Computation (CEC) (pp. 1-8). IEEE Press: https://arxiv.org/pdf/2010.04524.pdf

##### Sensitivity Analysis of Deep Learning and Optimization Algorithms

Sensitivity analysis offers the opportunity to explore the sensitivity (influence) of parameters on a model. This work applies global sensitivity analysis to deep learning and optimization algorithms to analyze the influence of their hyperparameters. For deep learning, we analyzed hyperparameters such as type of optimizer, learning rate, batch size, etc. We analyzed these hyperparameters for deep neural networks such as ResNet18, AlexNet, and GoogLeNet. For the optimization algorithms, we analyzed hyperparameters of two single-objective and two multi-objective state-of-the-art global optimization evolutionary algorithms as an algorithm configuration problem. We investigate the quality of influence hyperparameters have on the performance of algorithms in terms of their direct effect and interaction effect with other hyperparameters. Using three sensitivity analysis methods, Morris LHS, Morris, and Sobol, to systematically analyze tuneable hyperparameters, the framework reveals the behaviours of hyperparameters with respect to sampling methods and performance metrics. That is, it answers questions like what the influence patterns of hyperparameters are, how they interact, how much they interact, and how much their direct influence is. Consequently, the ranking of hyperparameters suggests their order of tuning, and the pattern of influence reveals the stability of the algorithms.
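The ranking idea can be illustrated with a simplified one-at-a-time screening in the spirit of the Morris method: perturb one parameter at a time from random base points and average the absolute change in the output. Everything in the sketch below is an assumption for illustration, including the name `elementary_effects`, the uniform sampling of base points (the Morris method proper uses trajectories on a grid), and the toy function `toy_score`, which merely stands in for an expensive training run.

```python
import random


def elementary_effects(func, num_params, num_points=20, delta=0.1, seed=0):
    """Mean absolute elementary effect per parameter on the unit hypercube."""
    rng = random.Random(seed)
    totals = [0.0] * num_params
    for _ in range(num_points):
        # Sample a base point that leaves room for the +delta step.
        base = [rng.uniform(0.0, 1.0 - delta) for _ in range(num_params)]
        f_base = func(base)
        for i in range(num_params):
            shifted = list(base)
            shifted[i] += delta  # perturb only parameter i
            totals[i] += abs(func(shifted) - f_base) / delta
    return [t / num_points for t in totals]


def toy_score(x):
    """Hypothetical performance surface: x[0] dominates, x[0] and x[2] interact."""
    return 5.0 * x[0] + 0.5 * x[1] + x[0] * x[2]


mu_star = elementary_effects(toy_score, num_params=3)
# Parameters sorted from most to least influential.
ranking = sorted(range(3), key=lambda i: mu_star[i], reverse=True)
```

The effect of `x[2]` varies with the value of `x[0]` at each base point, so its spread across base points signals an interaction, whereas the purely additive `x[1]` has a constant effect.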

Resources:

Assessing Ranking and Effectiveness of Evolutionary Algorithm Hyperparameters Using Global Sensitivity Analysis Methodologies, Swarm and Evolutionary Computation: https://arxiv.org/pdf/2207.04820.pdf

Sensitivity Analysis for Deep Learning: Ranking Hyper-parameter Influence: https://ieeexplore.ieee.org/document/9643336