Chair: Ron S. Kenett (KPA Group and Samuel Neaman Institute, Technion, Israel)
Date: 10th January 2024, at 14:00-17:00 CET
In the era of machine learning, data analysis pipelines have grown in complexity to accommodate larger quantities and a greater variety of data. This trend has made software engineering increasingly important in ensuring that we can extract the information we need from the data reliably and in a timely fashion. First, we will provide an overview of how the software lifecycle and the typical machine learning workflow combine in a practical setting to produce a pipeline that follows best practices from software engineering and data science. In particular, we will introduce the key concepts involved in the following modules:
• Project scoping and producing a baseline implementation.
• Data ingestion and data preparation.
• Model training and experiment tracking.
• Monitoring, logging and reporting.
These modules arise from the interplay between the cyclic nature of software development, which leads to the iterative refining of the code, and that of machine learning and data science practice, which extracts actionable information from data as they become available.
We will then discuss an example that uses directed acyclic graphs to implement a machine learning pipeline programmatically. For this purpose, we will sketch small code examples using production-grade software and use them to illustrate some of the modules. Finally, we will list a number of important trade-offs and best practices that should be considered when designing a machine learning pipeline, ranging from hardware choices to coding and documentation practices.
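To give a flavour of the directed-acyclic-graph idea, here is a minimal sketch in Python. It is not the software used in the session: it assumes the standard-library `graphlib` module (Python 3.9+) as a stand-in for a production-grade tool, and the stage names, toy data and `run_pipeline` helper are all hypothetical, chosen only to echo the modules listed above.

```python
from graphlib import TopologicalSorter

# Hypothetical stages: each receives the results of the stages run so far.
def ingest(results):
    # Toy data standing in for a real data source.
    return [1.0, 2.0, 3.0, 4.0]

def prepare(results):
    # Centre the data around its mean.
    data = results["ingest"]
    mean = sum(data) / len(data)
    return [x - mean for x in data]

def train(results):
    # Placeholder "model": a single summary statistic of the prepared data.
    data = results["prepare"]
    return {"spread": max(data) - min(data)}

def report(results):
    return f"model summary: {results['train']}"

STAGES = {"ingest": ingest, "prepare": prepare, "train": train, "report": report}

# Each stage maps to the set of stages it depends on (the edges of the DAG).
DEPENDENCIES = {
    "ingest": set(),
    "prepare": {"ingest"},
    "train": {"prepare"},
    "report": {"train"},
}

def run_pipeline(stages, dependencies):
    """Run every stage after its dependencies; a cycle raises graphlib.CycleError."""
    results = {}
    order = list(TopologicalSorter(dependencies).static_order())
    for name in order:
        results[name] = stages[name](results)
    return order, results

order, results = run_pipeline(STAGES, DEPENDENCIES)
print(" -> ".join(order))
print(results["report"])
```

Declaring dependencies separately from the stage code lets the execution order be derived rather than hard-coded, which is the property that dedicated pipeline tools build on.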
Marco Scutari is a Senior Researcher at Istituto Dalle Molle di Studi sull'Intelligenza Artificiale (IDSIA), Switzerland. He has held positions in statistics, statistical genetics and machine learning in the UK and Switzerland since completing his PhD in Statistics in 2011. His research focuses on the theory of Bayesian networks and their applications to biological and clinical data, as well as statistical computing and software engineering.
Mauro Malvestio is a senior technologist based in Milan, Italy, with more than 15 years of experience in software engineering and IT operations in consulting and product companies as a CTO. His research focuses on software engineering, machine learning systems, embedded systems and cloud computing.