Lectures
Lecture 1: “Algorithm Discovery using Reinforcement Learning 1/2”
Abstract: TBA
Lecture 2: “Algorithm Discovery using Reinforcement Learning 2/2”
Abstract: TBA
Lecture 1: Transformers 1/2
Abstract: TBA
Lecture 2: Transformers 2/2
Abstract: TBA
Lecture 3: Large-Scale Pre-Training & Transfer in Computer Vision and Vision-Text Models 1/2
Abstract: TBA
Lecture 4: Large-Scale Pre-Training & Transfer in Computer Vision and Vision-Text Models 2/2
Abstract: TBA
Lecture 1: PaLM-E: An Embodied Language Model
Abstract: TBA
Lecture 2: Efficiently Scaling Large Model Inference
Abstract: TBA
Lecture 1: Graph Neural Networks 1/2
Abstract. Graph Neural Networks (GNNs) are an essential model class in the modern deep learning toolbox. They excel not only in classical machine learning tasks on graphs such as node classification, graph classification, and link prediction, but are becoming increasingly important for algorithmic reasoning tasks and for modeling various complex, interacting and dynamical systems – from predicting dynamics in social networks to learning accurate physical simulators.
This lecture will introduce GNNs from a message passing perspective, discuss the main representative GNN variants in use today, and give an overview of how GNNs are used in various graph representation learning tasks.
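To make the message passing view concrete, here is a minimal sketch of a single message-passing layer in NumPy; the weight matrices, toy graph, and tanh update are illustrative choices, not a particular GNN variant from the lecture.

```python
# Minimal sketch of one message-passing GNN layer (NumPy only).
# h: node features (n_nodes, d); edges: list of directed (src, dst) pairs.
import numpy as np

def message_passing_layer(h, edges, W_msg, W_upd):
    agg = np.zeros_like(h)
    for src, dst in edges:
        # Message from src to dst: a linear transform of the sender's features.
        agg[dst] += h[src] @ W_msg
    # Update: combine each node's own features with its aggregated messages.
    return np.tanh(h @ W_upd + agg)

# Toy usage on a 3-node path graph with 4-dimensional features.
rng = np.random.default_rng(0)
h = rng.normal(size=(3, 4))
edges = [(0, 1), (1, 0), (1, 2), (2, 1)]
W_msg = rng.normal(size=(4, 4)) * 0.1
W_upd = rng.normal(size=(4, 4)) * 0.1
print(message_passing_layer(h, edges, W_msg, W_upd).shape)  # (3, 4)
```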
Lecture 2: Graph Neural Networks 2/2
Abstract. The second part of the lecture will focus on how GNNs can be used for modeling complex, dynamic interacting systems. We will cover how to learn to simulate the dynamics of complex interacting systems with GNNs and how to use GNNs to discover relations or interactions.
Lecture 3: Structured Scene Understanding 1/2
Abstract. The world around us is highly structured: our everyday environments contain a myriad of objects and other components that can be independently interacted with or reasoned about. A core challenge in perception is to learn to infer such a structured understanding of everyday scenes. This is reflected in computer vision tasks such as object detection, instance segmentation, or pose estimation.
This lecture will introduce methods for structured and object-centric scene understanding: we will discuss how object representations can be integrated into end-to-end deep learning architectures, giving rise to object-centric architectures such as the Detection Transformer (DETR). We will further cover how object-centric models such as Slot Attention can be trained without supervised object labels to discover objects in raw image data.
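As a rough illustration of the object-centric idea, the following simplified sketch performs a Slot Attention-style update in NumPy, where slots compete for input features via attention normalized over the slot axis; the full model described in the lecture additionally uses layer normalization, a GRU, and an MLP per iteration, which are omitted here.

```python
# Simplified Slot Attention-style update (NumPy only); illustrative, not the
# published architecture.
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def slot_attention_step(slots, inputs, W_q, W_k, W_v):
    # slots: (K, d) object slots; inputs: (N, d) image features.
    q, k, v = slots @ W_q, inputs @ W_k, inputs @ W_v
    logits = k @ q.T / np.sqrt(q.shape[-1])            # (N, K)
    attn = softmax(logits, axis=1)                     # slots compete per input
    attn = attn / (attn.sum(axis=0, keepdims=True) + 1e-8)
    return attn.T @ v                                  # weighted mean -> new slots

rng = np.random.default_rng(0)
d, K, N = 8, 4, 16
slots, inputs = rng.normal(size=(K, d)), rng.normal(size=(N, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
for _ in range(3):                                     # iterative refinement
    slots = slot_attention_step(slots, inputs, W_q, W_k, W_v)
print(slots.shape)  # (4, 8)
```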
Lecture 4: Structured Scene Understanding 2/2
Abstract. The second part of this lecture will cover extensions of object-centric models to learn about 3D scenes, enabling use cases such as scene editing and novel view synthesis. Finally, we will discuss how this class of models can be used to learn about dynamics in scenes, to consistently track objects in scenes and to learn to simulate their dynamics forward in time.
Lecture 1: “Zeroth-order Optimisation: Applications, Algorithms and Analysis, 1/2”
Abstract: TBA
Lecture 2: “Zeroth-order Optimisation: Applications, Algorithms and Analysis, 2/2”
Abstract: TBA
Lecture 1: An Overview of the Principles of Parsimony and Self-Consistency: The Past, Present, and Future of Intelligence
Abstract: TBA
Lecture 2: An Introduction to Low-Dimensional Models and Deep Networks
Abstract: TBA
Lecture 3: Parsimony: White-box Deep Networks from Optimizing Rate Reduction
Abstract: TBA
Lecture 4: Self-Consistency: Closed-Loop Transcription of Low-Dimensional Structures via Maximin Rate Reduction
Abstract: TBA
Lecture 4: Approaches to Increase Trustworthiness of Foundation Models
Abstract. The training data of language models may contain misogynistic, racist, or anti-religious texts, which are then reproduced by the model. Especially for dialog applications, the output should be meaningful, specific, and interesting, avoiding harmful suggestions, unfair bias, and false claims. The first step is a targeted preprocessing of the training data, including deduplication and filtering of harmful content, which requires considerable effort. After pre-training, the model has to be fine-tuned on controlled dialog data, possibly taking into account documents retrieved by parallel retrieval operations. Explicit filters can be used in postprocessing to avoid unwanted content. In addition, the history of a dialog has to be saved and retrieved later so that it can be taken into account during answer generation. Reinforcement learning with human feedback is used to generate text that is targeted to users’ prompts and produces the desired content. To prevent text-to-image models from delivering sexist or offensive depictions, these approaches must be extended to multimedia and multilingual domains. A final aspect is the explainability of the generated content, which increases the acceptance of the returned information. We discuss the level of trustworthiness achieved by current approaches, including our own OpenGPT-X model, and compare this with the proposed EU AI Act and other planned regulations.
Lecture 3: Combining Foundation Models with External Text Resources
Abstract. Language models such as GPT-4 capture a large amount of information and world knowledge contained in their training data. However, if the model’s prompt concerns very recent or very specific topics, there is often no corresponding information in the training data. To avoid costly retraining of the model with up-to-date data, one can provide external information that the model should cover in the generated text. The retriever-reader scheme follows this path. The retriever employs dense retrieval to find texts matching the query. The reader is a pre-trained language model which is fine-tuned to combine the internal knowledge of the model with the retrieved texts and to generate a suitable answer. It has been shown that this approach improves the fraction of correct answers. In addition, retrieved documents can be added to the generated text as references. Similarly, other types of information, such as the contents of tables or databases, can be incorporated into a language model. We discuss the accuracy improvements achieved by these models and new approaches for further enhancement.
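The retriever-reader scheme can be sketched schematically as follows; dense_retrieve() and reader_generate() are hypothetical placeholders for a dense retriever and a fine-tuned reader language model.

```python
# Schematic retriever-reader pipeline with placeholder components.
def dense_retrieve(query: str, k: int = 3) -> list[str]:
    # Placeholder: a real retriever returns the k passages whose embeddings
    # are nearest neighbors of the query embedding.
    return [f"passage {i} relevant to: {query}" for i in range(k)]

def reader_generate(prompt: str) -> str:
    # Placeholder: a real reader is a pre-trained LM fine-tuned to combine
    # its internal knowledge with the retrieved passages.
    return "[answer grounded in]\n" + prompt

def answer(query: str) -> str:
    passages = dense_retrieve(query)
    prompt = ("Answer the question using the context.\n"
              "Context:\n" + "\n".join(passages) +
              f"\nQuestion: {query}\nAnswer:")
    return reader_generate(prompt)

print(answer("What is the retriever-reader scheme?"))
```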
Lecture 2: Foundation Models for Retrieval Applications
Abstract. Traditional search engines rely on the matching of terms between the query and the documents. However, term-based retrieval systems have several limitations, such as a lack of robustness with respect to polysemy, synonymy, and paraphrasing between the query and the documents. Recently, Foundation Model techniques have been used to improve the representation of textual data and to enhance the ability of information retrieval systems to understand natural language queries. One approach is dense retrieval, where the query and the documents are expressed as embeddings and matched by nearest neighbor search. In addition, attention mechanisms have been employed to improve the ability of search engines to attend to the important parts of the query and documents for matching. In the talk, we also discuss how to incorporate external knowledge during retrieval, such as knowledge graphs and information from other media like images. As the results on many benchmarks show, dense retrieval has significantly improved the performance of search engines. However, even with approximate nearest neighbor search, the cost of dense retrieval is higher than that of term-based retrieval and is an obstacle to widespread use. Nevertheless, all major commercial search engines claim to use language technology today.
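A minimal sketch of dense retrieval in NumPy: documents and the query are mapped to embedding vectors and matched by nearest neighbor search. The random encoder is a hypothetical stand-in for a trained dual encoder, and the exact search shown here would be replaced by an approximate nearest neighbor index at scale.

```python
# Dense retrieval sketch (NumPy only); encode() is a placeholder encoder.
import numpy as np

rng = np.random.default_rng(0)

def encode(text: str) -> np.ndarray:
    v = rng.normal(size=128)                 # placeholder for a real text encoder
    return v / np.linalg.norm(v)

docs = ["doc about transformers", "doc about retrieval", "doc about GNNs"]
index = np.stack([encode(d) for d in docs])  # (num_docs, dim) embedding index

def search(query: str, k: int = 2) -> list[str]:
    q = encode(query)
    scores = index @ q                       # cosine similarity of unit vectors
    top = np.argsort(scores)[::-1][:k]       # exact NN; ANN libraries scale this up
    return [docs[i] for i in top]

print(search("how do search engines use embeddings?"))
```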
Lecture 1: Introduction to Foundation Models
Abstract. The Transformer introduced the concept of self-attention, which represents the meaning of tokens in a text by context-sensitive embedding vectors. Based on the correlations between the embeddings of input tokens, each layer of the network generates more expressive embeddings, taking into account the relation to all tokens of the input text. These models are pre-trained on large collections of text documents with the task of predicting omitted tokens or the next token in a sentence. The models achieve unprecedented accuracy in generating new text and can be adapted to new tasks by fine-tuning. If the models have a sufficient number of parameters, they can simply be prompted to perform a task without any fine-tuning. It turned out that the models can also be applied to other media such as images, sound, and video by partitioning these media into tokens and applying self-attention to capture their contents. They are called Foundation Models because they can be used as a basic architecture for a wide range of AI tasks, superseding prior models such as RNNs and CNNs. In this lecture we describe the basic architecture of BERT, GPT, and the Transformer and discuss the concept of transfer learning. We then explain token representations of various media and models that simultaneously process tokens from different media. Finally, we summarize the properties and potential impact of foundation models.
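The core self-attention operation can be sketched in a few lines of NumPy; real Transformer layers add multiple heads, residual connections, layer normalization, and feed-forward sublayers, which are omitted here.

```python
# Scaled dot-product self-attention (NumPy only), single head, no masking.
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    # X: (seq_len, d_model) token embeddings.
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(Q.shape[-1])                   # token-token correlations
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax per token
    return weights @ V                                        # context-sensitive embeddings

rng = np.random.default_rng(0)
d = 16
X = rng.normal(size=(5, d))                                   # 5 tokens
W_q, W_k, W_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)                 # (5, 16)
```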
Lecture: “Diffusion capacity of single and interconnected networks”
Lecture 1: Low-Dimensional and Nonconvex Models for Shallow Representation Learning
Abstract: Machine learning is transforming every field of science and engineering. However, as data increases in volume and dimension, the performance of modern machine learning methods depends critically on the choice of data representation. In the past decade, although we have witnessed the revolutionary empirical success of many representation learning methods, from (convolutional) dictionary learning to deep learning, the underlying principles behind their success still largely remain a mystery, which hinders their further development and adoption in broader applications. One of the major challenges originates from the nonlinearity of the data representation models, which often results in complicated, highly nonconvex optimization problems — in the worst case, solving nonconvex problems can be NP-hard. Nonetheless, various empirical evidence suggests that the symmetric properties of the problem and intrinsic low-dimensional structures of the data often alleviate the hardness of these problems, so that simple heuristic nonconvex methods often work surprisingly well for learning succinct representations.
Lecture 2: Low-Dimensional Structures in Deep Representation Learning I
Abstract: This lecture focuses on the study of the low-dimensional structures appearing in the last-layer of deep networks. Recently, an intriguing phenomenon has been discovered in the final stages of network training for many classification problems. This phenomenon, known as Neural Collapse, has generated significant interest. It involves the collapse of the last-layer features and classifiers into elegant and simple mathematical structures, where all training inputs are mapped to class-specific points in feature space, and the last-layer classifier converges to the dual of the features’ class means while achieving the maximum possible margin. This phenomenon persists across various network architectures, datasets, and even data domains. The lecture explores the symmetry and geometry of Neural Collapse and develops a rigorous mathematical theory that explains when and why this low-dimensional structure of the last-layer representation occurs under the unconstrained feature model, and justifies its ubiquity across different network architectures, training losses, and problem formulations.
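One common way to quantify this collapse is the within-class variability measure NC1 = trace(Σ_W Σ_B⁺)/K, which approaches zero under Neural Collapse; the sketch below computes it in NumPy on random stand-in features (actual last-layer features of a trained network would be used in practice).

```python
# Within-class variability measure (NC1) on placeholder features (NumPy only).
import numpy as np

def nc1(features, labels):
    classes = np.unique(labels)
    K = len(classes)
    global_mean = features.mean(axis=0)
    Sigma_W = np.zeros((features.shape[1],) * 2)   # within-class covariance
    Sigma_B = np.zeros_like(Sigma_W)               # between-class covariance
    for c in classes:
        f_c = features[labels == c]
        mu_c = f_c.mean(axis=0)
        Sigma_W += ((f_c - mu_c).T @ (f_c - mu_c)) / len(features)
        Sigma_B += np.outer(mu_c - global_mean, mu_c - global_mean) / K
    return np.trace(Sigma_W @ np.linalg.pinv(Sigma_B)) / K

rng = np.random.default_rng(0)
feats = rng.normal(size=(300, 32))                 # stand-in for last-layer features
labels = rng.integers(0, 3, size=300)
print(nc1(feats, labels))                          # small values indicate collapse
```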
Lecture 3: Low-Dimensional Structures in Deep Representation Learning II
Abstract: In the second lecture, we delve deeper into the low-dimensional structures of representations in intermediate layers, building on the concepts covered in the previous lecture. Our findings indicate that as we move from shallow to deep layers of a learned deep network, there is a gradual collapse in feature variability, often with a linear decay ratio. We establish a theoretical explanation for this phenomenon using a multi-layer deep linear network. Our analysis shows that if a deep linear network is trained via gradient descent using small and orthogonal weights, the within-class variability measure undergoes linear decay as we go from shallow to deep layers. Moreover, we demonstrate that the rate of linear decay is determined by the weight initialization scale. Finally, we demonstrate how our study can be leveraged to provide guidelines for improving the generalizability and transferability of deep representations, leading to more efficient fine-tuning strategies for classification problems in vision.
Lecture 4: Robust Learning of Overparameterized Networks via Low-Dimensional Models
Abstract: In recent years, over-parameterized models, with more parameters than the amount of available data, have become dominant in machine learning, leading to improved performance. However, when the training data is corrupted, over-parameterized models tend to overfit and fail to generalize. The third part of the lecture aims to tackle this issue through low-dimensional modeling. The approach involves leveraging the implicit regularization of gradient descent on over-parameterized models and exploiting the incoherence between sparse corruption and low-rank structures to prevent overfitting during training. This is achieved by accurately separating noise from data using a method called Double Over-Parameterization (DOP). Contrary to classical wisdom, which suggests that more parameters exacerbate overfitting, DOP uses a specific choice of learning rates on different sets of model parameters to prevent overfitting. Empirical results show that DOP outperforms traditional methods when applied to tasks such as image recovery from corrupted measurements and image classification under label noise.
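A hedged sketch of the double over-parameterization idea on a toy robust matrix recovery problem, with hand-written gradient descent in NumPy; the initialization scales and learning rates are illustrative, and the choice of the two rates plays the role of the regularization discussed above.

```python
# Toy double over-parameterization sketch (NumPy only): observe Y = L* + S*
# with L* low-rank and S* sparse; parameterize L = U V^T and s = g*g - h*h,
# then run gradient descent with different rates on the two parameter groups.
import numpy as np

rng = np.random.default_rng(0)
n, r = 30, 2
L_star = rng.normal(size=(n, r)) @ rng.normal(size=(r, n))       # low-rank part
S_star = rng.normal(size=(n, n)) * (rng.random((n, n)) < 0.05)   # sparse corruption
Y = L_star + S_star

U = 0.01 * rng.normal(size=(n, n)); V = 0.01 * rng.normal(size=(n, n))
g = 0.01 * np.ones((n, n)); h = 0.01 * np.ones((n, n))
lr_uv, lr_gh = 0.5 / n, 1.0 / n      # discrepant rates for the two groups (illustrative)

for _ in range(2000):
    R = U @ V.T + g * g - h * h - Y                  # residual of the fit
    U, V = U - lr_uv * (R @ V), V - lr_uv * (R.T @ U)
    g, h = g - lr_gh * (2 * R * g), h + lr_gh * (2 * R * h)

# Relative error of the recovered low-rank component.
print(np.linalg.norm(U @ V.T - L_star) / np.linalg.norm(L_star))
```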
Lecture 1: Shape-Constrained Kernel Machines and Their Applications
Abstract: TBA
Lecture 2: Beyond Mean Embedding: The Power of Cumulants in RKHSs
Abstract: TBA
Lecture 1: Reinforcement Learning
Abstract: TBA
Lecture 2: Deep Reinforcement Learning
Abstract: TBA
Lecture 3: Learning by Bootstrapping: Representation Learning
Abstract: TBA
Lecture 4: Learning by Bootstrapping: World Models
Abstract: TBA
Tutorials
Lecture 1: Wonders of high-dimensions: the maths and physics of Machine Learning 1/3
Lecture 2: Wonders of high-dimensions: the maths and physics of Machine Learning 2/3
Lecture 3: Wonders of high-dimensions: the maths and physics of Machine Learning 3/3
Tutorial 1: Backpropagation Neural Tree
Simpler models generalize better. This research presents a class of neural-inspired algorithms that are highly sparse in their architectural construction but perform highly accurately. In addition, they perform simultaneous function approximation and feature selection when solving machine learning tasks: classification, regression, and pattern recognition. This class of algorithms comprises the Neural Tree algorithms: Heterogeneous Neural Tree, Multi-Output Neural Tree, and Backpropagation Neural Tree. This research found that any such arbitrarily constructed neural tree, which is like an arbitrarily “thinned” neural network, has the potential to solve machine learning tasks with an equivalent or better degree of accuracy than a fully connected, symmetric, and systematic neural network architecture. The algorithm takes random repeated inputs through its leaves and imposes dendritic nonlinearities through its internal connections, as a biological dendritic tree would do. The algorithm produces an ad hoc neural tree which is trained using a stochastic gradient descent optimizer. The algorithms produce high-performing and parsimonious models, balancing complexity with descriptive ability on a wide variety of machine learning problems.
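A minimal sketch of an arbitrarily constructed neural tree's forward pass in NumPy, with leaves reading randomly assigned (possibly repeated) input features and internal nodes applying a nonlinearity to a weighted sum of their children; the training loop and the exact node functions of the published algorithms are omitted.

```python
# Forward pass of a random neural tree (NumPy only); illustrative structure.
import numpy as np

rng = np.random.default_rng(0)

def random_tree(depth, n_features, max_children=3):
    if depth == 0:
        return {"leaf": int(rng.integers(n_features))}       # repeated inputs allowed
    children = [random_tree(depth - 1, n_features, max_children)
                for _ in range(int(rng.integers(2, max_children + 1)))]
    return {"w": rng.normal(size=len(children)), "b": rng.normal(),
            "children": children}

def forward(node, x):
    if "leaf" in node:
        return x[node["leaf"]]
    acts = np.array([forward(c, x) for c in node["children"]])
    return np.tanh(node["w"] @ acts + node["b"])              # dendritic nonlinearity

tree = random_tree(depth=3, n_features=4)
x = rng.normal(size=4)
print(forward(tree, x))                                       # scalar tree output
```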
Resources:
Ojha, V., & Nicosia, G. (2022). Backpropagation neural tree. Neural Networks, 149, 66-83: https://arxiv.org/pdf/2202.02248.pdf
Ojha, V., & Nicosia, G. (2020). Multiobjective optimization of multi-output neural trees. In 2020 IEEE Congress on Evolutionary Computation (CEC) (pp. 1-8). IEEE Press: https://arxiv.org/pdf/2010.04524.pdf
Tutorial 2: Sensitivity Analysis of Deep Learning and Optimization Algorithms
Sensitivity analysis offers the opportunity to explore the sensitivity (influence) of parameters on a model. This work applies global sensitivity analysis to deep learning and optimization algorithms to analyze the influence of their hyperparameters. For deep learning, we analyzed hyperparameters such as the type of optimizer, the learning rate, and the batch size for deep neural networks such as ResNet18, AlexNet, and GoogleNet. For the optimization algorithms, we analyzed the hyperparameters of two single-objective and two multi-objective state-of-the-art global optimization evolutionary algorithms as an algorithm configuration problem. We investigate the influence hyperparameters have on the performance of the algorithms in terms of their direct effects and their interaction effects with other hyperparameters. Using three sensitivity analysis methods, Morris LHS, Morris, and Sobol, to systematically analyze tunable hyperparameters, the framework reveals how hyperparameters behave with respect to sampling methods and performance metrics. That is, it answers questions such as which hyperparameters are influential, how they interact, how much they interact, and how large their direct influence is. Consequently, the ranking of hyperparameters suggests their order of tuning, and the pattern of influence reveals the stability of the algorithms.
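A rough sketch of such an analysis with Sobol indices using the SALib Python library (not the exact framework used in the papers below); score_model() is a hypothetical placeholder for training and evaluating a network at the sampled hyperparameter values.

```python
# Sobol sensitivity analysis of hyperparameters with SALib; the objective is a
# placeholder for an actual training-and-evaluation run.
import numpy as np
from SALib.sample import saltelli
from SALib.analyze import sobol

problem = {
    "num_vars": 3,
    "names": ["learning_rate", "batch_size", "momentum"],
    "bounds": [[1e-4, 1e-1], [16, 256], [0.0, 0.99]],
}

def score_model(lr, batch_size, momentum):
    # Placeholder objective: in the actual study this would be, e.g., the
    # validation accuracy of ResNet18 trained with these hyperparameters.
    return -np.log10(lr) + 0.01 * batch_size + momentum

X = saltelli.sample(problem, 256)                 # Saltelli sampling scheme
Y = np.array([score_model(*row) for row in X])
Si = sobol.analyze(problem, Y)
print(Si["S1"])   # first-order (direct) effects per hyperparameter
print(Si["ST"])   # total effects, including interactions
```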
Resources:
Assessing Ranking and Effectiveness of Evolutionary Algorithm Hyperparameters Using Global Sensitivity Analysis Methodologies, Swarm and Evolutionary Computation: https://arxiv.org/pdf/2207.04820.pdf
Sensitivity Analysis for Deep Learning: Ranking Hyper-parameter Influence: https://ieeexplore.ieee.org/document/9643336