The 24th annual conference of the European Network for Business and Industrial Statistics (ENBIS) will be hosted by the Division for Mechatronics, Biostatistics and Sensors (MeBioS) of the Katholieke Universiteit Leuven (KU Leuven) and will take place at the Irish College in the historic city center of Leuven (Belgium), from September 15 to 19. The conference sessions are scheduled from September 16 to 18, with the administrative meetings, the pre- and post-conference courses and workshops taking place on September 15 and 19.
The annual conference features invited and contributed sessions, workshops and panel discussions, pre- and post-conference courses, as well as distinguished keynote speakers.
This year's keynote speakers include David Banks (Duke University) and Mathilde Mougeot (ENS Paris Saclay).
We are also pleased to announce that Peter Rousseeuw (KU Leuven, Belgium) will receive the ENBIS Box Medal in recognition for his remarkable contributions to the development and the application of statistical methods in European business and industry.
We cordially invite you not only to engage in highly rewarding scientific and professional exchange during the conference, but also to find some leisure time and explore the city of Leuven and its region.
A warm welcome,
The ENBIS-24 Organizing Committee
Image: KEV@CAM
Statistics came of age when manufacturing was king. But today’s industries are focused on information technology. Remarkably, a lot of our expertise transfers directly. This talk will discuss statistics and AI in the context of computational advertising, autonomous vehicles, large language models, and process optimization.
Invited session
We propose a generalized linear model for distributed multimodal data, where each sample contains multiple data modalities, each collected by an instrument. Unlike centralized methods that require access to all samples, our approach assumes that samples are distributed across several sites and that pooling the data is not allowed due to data sharing constraints. Our approach constructs a set of local predictive models based on the available multimodal data at each site. Next, the local models are sent to an aggregator that constructs an aggregated model. The models are obtained by minimizing local and aggregated objective functions that include penalty terms to create consensus among the data modalities and the local sites. Through extensive simulations, we compare the performance of the proposed method to local and centralized benchmarks. Furthermore, we assess the proposed framework for predicting the severity of Parkinson's disease based on patients' activity data collected by the mPower application.
In the era of Industry 4.0, ensuring the quality of Printed Circuit Boards (PCBs) is essential for maintaining high product quality, reliability, and reducing manufacturing costs. Anomaly detection in PCB production lines plays a critical role in this process. However, imbalanced datasets and the complexities of diverse data types pose significant challenges. This study explores the impact of data volume on anomaly detection accuracy by utilizing machine learning techniques, including Generative Adversarial Networks (GANs) and Synthetic Minority Oversampling Technique (SMOTE) to generate additional data. By addressing dataset imbalance through synthetic data augmentation, we aim to enhance model performance. Our experiments reveal that increasing data volume, particularly through GAN-generated and SMOTE-augmented data, significantly improves the accuracy of anomaly detection models. Interestingly, we also find that adding a large amount of data does not necessarily enhance model accuracy and that, beyond a certain point, accuracy actually drops.
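As a hedged illustration of the augmentation step described above (not the authors' actual pipeline), the sketch below oversamples a synthetic, imbalanced dataset with SMOTE from the imblearn package before fitting a standard classifier; the data, model choice, and class ratio are assumptions for demonstration only.

```python
# Minimal sketch: SMOTE augmentation of an imbalanced "PCB defect" dataset
# before fitting a classifier. All data and settings are illustrative.
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))                  # stand-in for PCB features
y = (rng.random(1000) < 0.05).astype(int)        # roughly 5% anomalies

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Oversample only the training data to avoid information leakage.
X_res, y_res = SMOTE(random_state=0).fit_resample(X_tr, y_tr)

clf = RandomForestClassifier(random_state=0).fit(X_res, y_res)
print(classification_report(y_te, clf.predict(X_te)))
```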
In advanced manufacturing processes, high-dimensional (HD) streaming data (e.g., sequential images or videos) are commonly used to provide online measurements of product quality. Although there exist numerous research studies for monitoring and anomaly detection using HD streaming data, little research is conducted on feedback control based on HD streaming data to improve product quality, especially in the presence of incomplete responses. To address this challenge, this article proposes a novel tensor-based automatic control method for partially observed HD streaming data, which consists of two stages: offline modeling and online control. In the offline modeling stage, we propose a one-step approach integrating parameter estimation of the system model with missing value imputation for the response data. This approach (i) improves the accuracy of parameter estimation, and (ii) maintains a stable and superior imputation performance in a wider range of the rank or missing ratio for the data to be completed, compared to the existing data completion methods. In the online control stage, for each incoming sample, missing observations are imputed by balancing its low-rank information and the one-step-ahead prediction result based on the control action from the last time step. Then, the optimal control action is computed by minimizing a quadratic loss function on the sum of squared deviations from the target. Furthermore, we conduct two sets of simulations and one case study on semiconductor manufacturing to validate the superiority of the proposed framework.
ISBIS invited session
We propose SVM regression oblique trees, a novel approach to regression tasks. The technique combines feature selection based on predictor correlation with a weighted support vector machine classifier with a linear kernel. Evaluation on simulated and real datasets reveals the superior performance of the proposed method compared to other oblique decision tree models, with the added advantage of enhanced interpretability.
The focus is on the homogeneity test that evaluates whether two multivariate samples come from the same distribution. The problem arises naturally in various applications, and many methods are available in the literature. Based on data depth, several tests have been proposed for this problem, but they may not be very powerful. In light of the recent development of data depth as an important measure of quality assurance, two new test statistics are proposed for the multivariate two-sample homogeneity test. The proposed test statistics have the same chi-squared asymptotic null distribution. The generalization of the proposed tests into the multivariate multisample situation is also discussed. Simulation studies demonstrate the superior performance of the proposed tests. The test procedure is illustrated through two real data examples.
The use of a statistical classifier can be limited by its conditional misclassification rates (i.e., false positive rate and false negative rate) even when the overall misclassification rate is satisfactory. When one or both conditional misclassification rates are high, a neutral zone can be introduced to lower and possibly balance these rates. In this talk the need for neutral zones will be motivated and a method for constructing neutral zones will be explained. Real-life applications of neutral zone classifiers to prostate cancer diagnosis and to student evaluations of teaching will be discussed.
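The following minimal Python sketch illustrates the general idea of a neutral zone on top of a probabilistic classifier; the thresholds and the coding of the neutral outcome are illustrative assumptions, not the construction method discussed in the talk.

```python
# Hedged sketch of a neutral-zone rule: predictions with scores between the
# two thresholds are withheld ("neutral") instead of being forced into a class.
import numpy as np

def neutral_zone_predict(proba_pos, lower=0.35, upper=0.65):
    """Return 1, 0, or -1 (neutral) from positive-class probabilities."""
    proba_pos = np.asarray(proba_pos)
    out = np.full(proba_pos.shape, -1, dtype=int)   # -1 marks the neutral zone
    out[proba_pos >= upper] = 1
    out[proba_pos <= lower] = 0
    return out

scores = np.array([0.10, 0.48, 0.52, 0.90])
print(neutral_zone_predict(scores))  # [ 0 -1 -1  1]
```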
Invited session
In their simplest form, orthogonal arrays (OAs) are experimental designs where all level-combinations of any two factors occur equally often. As a result, the main effects of the factors are orthogonal to each other. There are also more involved OAs for which the level-combinations of any three factors occur equally often. In such OAs, the main effects are orthogonal to each other as well as to the two-factor interactions. With three practical examples, I show why OAs are so useful. I give pointers to published OAs and I comment on the discrepancy between the rather specialized literature and existing statistical software.
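The defining property quoted above can be checked directly; the sketch below is a generic Python illustration (the array and the checking function are assumptions, not material from the talk).

```python
# In a strength-2 orthogonal array, every level combination of every pair of
# columns occurs equally often. This function verifies that property.
from collections import Counter
from itertools import combinations
import numpy as np

def is_strength_two(design):
    design = np.asarray(design)
    for i, j in combinations(range(design.shape[1]), 2):
        counts = Counter(map(tuple, design[:, [i, j]]))
        n_levels = len(set(design[:, i])) * len(set(design[:, j]))
        expected = design.shape[0] / n_levels
        if len(counts) != n_levels or any(c != expected for c in counts.values()):
            return False
    return True

# A 4-run, 3-factor two-level OA (the regular half fraction 2^(3-1)).
oa = [[0, 0, 0], [0, 1, 1], [1, 0, 1], [1, 1, 0]]
print(is_strength_two(oa))  # True
```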
Orthogonal minimally aliased response surface (OMARS) designs permit the screening of quantitative factors at three levels using an economical number of runs. In these designs, the main effects are orthogonal to each other and to the quadratic effects and two-factor interactions of the factors, and these second-order effects are never fully aliased. Complete catalogs of OMARS designs with up to seven factors have been obtained using an enumeration algorithm. However, the algorithm is computationally demanding for constructing good OMARS designs with many factors and runs. To overcome this issue, we propose a construction method for large OMARS designs that concatenates two definitive screening designs. The method ensures the core properties of an OMARS design and improves the good statistical features of its parent designs. The concatenation employs an algorithm that minimizes the aliasing among the second-order effects using foldover techniques and column permutations for one of the parent designs. We study the statistical properties of the new OMARS designs and compare them to alternative designs in the literature. Our method increases the collection of OMARS designs for practical applications.
Much has been written about augmenting preliminary designs for first-order regression with additional runs to support quadratic models. This is a reasonable approach to practical sequential experimentation, allowing an early stop if the preliminary first-order result does not look promising. Central composite designs are especially well-suited to this (Box and Wilson, 1951), as all or part of the factorial portion can be executed first, and the remainder of the design including axial points can be added if the analysis of the early data warrants. A similar strategy is reasonable in the context of experiments in which factors have qualitative levels, where augmentation focuses on factors that are apparently active in the preliminary analysis. Here it makes more sense to focus on preliminary and final designs that have good properties under a qualitative factorial model, rather than a regression model. The ideal solution would be a two-level orthogonal array nested within a three-level orthogonal array.
Here, we examine three-level orthogonal arrays to find subsets of runs that constitute good two-level designs. The assumption is that the experimenter tentatively plans to complete a three-level factorial experiment, but would like the option of early termination if a small preliminary experiment using two levels of each factor, perhaps those thought a priori to be most different, does not provide promising results. For this purpose, we use the complete collections of three-level orthogonal arrays generated by Schoen, Eendebak and Nguyen (2009). Each three-level array is systematically examined to find the subsets of runs that constitute a non-singular two-level design. From this collection, we identify designs that are Pareto-admissible with respect to the generalized word length of the three-level array, the Ds criterion, and a bias criterion for the two-level subset.
SIS Invited session
Supervised learning under measurement constraints presents a challenge in the field of machine learning. In this scenario, while predictor observations are available, obtaining response observations is arduous or cost-prohibitive. Consequently, the optimal approach involves selecting a subset of predictor observations, acquiring the corresponding responses, and subsequently training a supervised learning model on this subset.
Among various subsampling techniques, the design-inspired subsampling methods have attracted great interest in recent years (see Yu et al. (2024) for a review). Most of these approaches have shown remarkable performance in coefficient estimation and model prediction, but their performance heavily relies on the specified model. When the model is misspecified, misleading results may be obtained.
In this work we provide a comparative analysis of methods that account for model misspecification, as for instance the LowCon approach introduced by Meng et al. (2021) for selecting a subsample using an orthogonal Latin hypercube design, or other subsampling criteria based on space-filling designs. Furthermore, the robustness of these methods to outliers is also assessed (see Deldossi et al. (2023)). Empirical comparisons are conducted, providing insights into the relative performance of these approaches.
References
Deldossi, L., Pesce, E., Tommasi, C. (2023). Accounting for outliers in optimal subsampling methods. Statistical Papers, 64(4), 1119–1135.
Meng, C., Xie, R., Mandal, A., Zhang, X., Zhong, W., Ma, P. (2021). LowCon: a design-based subsampling approach in a misspecified linear model. Journal of Computational and Graphical Statistics, 30, 694–708.
Yu, J., Ai, M., Ye, Z. (2024). A review on design inspired subsampling for big data. Statistical Papers, 65, 467–510.
After a rich history in medicine, randomized controlled trials (RCTs), both simple and complex, are in increasing use in other areas, such as web-based A/B testing and the planning and design of decisions. A main objective of RCTs is to be able to measure parameters, and contrasts in particular, while guarding against biases from hidden confounders. After careful definitions of classical entities such as contrasts, an algebraic method based on circuits is introduced which gives a wide choice of randomization schemes. In this talk, we will introduce and discuss some real-world examples.
Stratification on important variables is a common practice in clinical trials, since ensuring cosmetic balance on known baseline covariates is often deemed to be a crucial requirement for the credibility of the experimental results. However, the actual benefits of stratification are still debated in the literature. Several authors have shown that it does not improve efficiency in large samples and improves it only negligibly in smaller samples. This paper investigates different subgroup analysis strategies, with a particular focus on the potential benefits, in terms of inferential precision, of pre-stratification versus both post-stratification and post-hoc regression adjustment. For each of these approaches, the pros and cons of population-based versus randomization-based inference are discussed. The effects of the presence of a treatment-by-covariate interaction and of the variability in the patient responses are also taken into account. Our results show that, in general, pre-stratifying does not provide substantial benefit. On the contrary, it may be deleterious, in particular for randomization-based procedures in the presence of chronological bias. Even when there is a treatment-by-covariate interaction, pre-stratification may backfire by considerably reducing the inferential precision.
A tool for the analysis of variation of qualitative (nominal) or semi-quantitative (ordinal) data obtained according to a cross-balanced design is developed based on one-way and two-way CATANOVA and ORDANOVA. The tool calculates the frequencies and relative frequencies of the variables, and creates the empirical distributions for the data. Then the tool evaluates the total data variation and its decomposition into contributing components as effects of the main factors influencing the data, i.e., sources of variation (such as examination conditions and expertise of technicians) and their interaction. The significance of a factor's influence is tested as a hypothesis on homogeneity of the corresponding variation component with respect to the total variation. Powers of the tests and associated risks of incorrect decisions are calculated considering the number of categories and levels of the factors, i.e., the size of the statistical sample of nominal or ordinal data collected in the study, and the effect size for the test. The code for the calculations, implemented in a macro-enabled Excel spreadsheet and including Monte Carlo draws from a multinomial distribution, is described. Examples from published studies of interlaboratory comparisons of weld imperfections (a case of nominal data) and of the intensity of drinking water odor (a case of ordinal data) are demonstrated.
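As a hedged illustration of the kind of one-way decomposition involved (the actual tool is the macro-enabled Excel spreadsheet with Monte Carlo draws described above), the Python sketch below computes a Gini-type total variation for nominal data and its within/between split; the data and naming are invented.

```python
# Generic one-way decomposition of nominal variation (Gini index) into
# within- and between-group parts, in the spirit of CATANOVA.
import numpy as np

def gini_variation(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def one_way_decomposition(labels, groups):
    labels, groups = np.asarray(labels), np.asarray(groups)
    total = gini_variation(labels)
    n = len(labels)
    within = sum(
        (np.sum(groups == g) / n) * gini_variation(labels[groups == g])
        for g in np.unique(groups)
    )
    return {"total": total, "within": within, "between": total - within}

# Example: weld-imperfection categories rated under two examination conditions.
labels = ["crack", "pore", "pore", "crack", "pore", "none", "none", "pore"]
groups = ["cond_A"] * 4 + ["cond_B"] * 4
print(one_way_decomposition(labels, groups))
```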
Classification precision is particularly crucial in scenarios where the cost of a false output is high, e.g. medical diagnosis, search engine results, product quality control, etc. A statistical model for analyzing classification precision from collaborative studies will be presented. Classification (categorical measurement) means that the object's property under study is reported by each collaborator on a scale consisting of K exclusive classes/categories forming a comprehensive spectrum of this property. We assume that, due to measurement/classification errors, a property belonging to category i can be classified by a collaborator into category j with probabilities $P_{j|i}$ (confusion matrix), distributed between collaborators according to a Dirichlet distribution for every given i, whereas the category counts of the classifications repeated by every collaborator follow the corresponding multinomial distribution. Such a model is called a Dirichlet-multinomial distribution model, which is a generalization of the beta-binomial model of the binary test. We propose repeatability and reproducibility measures based on categorical variation and Hellinger distance analysis, together with their unbiased estimators, and discuss possible options for statistical homogeneity/heterogeneity testing. Finally, the Bayesian approach to assessing the classification abilities of collaborators will also be discussed.
In most discrete choice experiments (DCEs), respondents are asked to choose their preferred alternative. But it is also possible to ask them to indicate the worst, or the best and worst alternative among the provided alternatives or to rank all or part of the alternatives in decreasing preference. In all these situations, it is commonly assumed that respondents only have strict preferences among all the alternatives as respondents can only give a single best choice, a single worst choice, or a ranking without ties. In this paper, we propose a general rank-ordered model, which is able to deal with all types of ranking data, including complete and incomplete rankings, with and without ties. We conduct a simulation study to check the performance of the general rank-ordered logit model in case the responses are either full rankings with ties, multiple best choices, or multiple best and worst choices, respectively. In each scenario, we compare the performance of the proposed model with that of the classical model on the corresponding converted data without ties which are obtained by randomly ordering the tied rankings. The results of the simulation study show that the proposed model can recover the preference parameters correctly. Furthermore, the results illustrate that modeling possible ties instead of forcing respondents to choose between tied alternatives, results in more accurate estimates of the preference parameters and of the marginal rates of substitution.
In recent decades, machine learning and industrial statistics have moved closer to each other. CQM, a consultancy company, performs projects in supply chains, logistics, and industrial R&D that often involve building prediction models using techniques from machine learning. For these models, challenges persist, e.g. if the dataset is small, has a group structure, or is a time series. At the same time, confidence intervals for the performance of the prediction model are often ignored, whereas they can be very valuable for assessing the model in a business context.
For assessing the performance of the prediction model, we encounter several challenges. 1) The choice between two common strategies: either a training set - test set split or k-fold CV (cross-validation), both with pros and cons. 2) A proper statistical description of k-fold CV has several subtleties in terms of bias, estimation, and dependency. 3) Some useful metrics, such as the Area under the Curve, are just outside the scope of many texts on k-fold CV. 4) There are several strategies for computing a confidence interval: some are basic, while more advanced methods are found in recent literature with formal coverage guarantees, addressing the computational burden, or acknowledging group structure in data. We performed a study into these issues, giving clarity and resulting in practical guidelines to navigate these challenges.
This study was carried out as part of the European project ASIMOV (https://www.asimov-project.eu/).
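As a hedged sketch of point 4 above, the snippet below computes a basic normal-approximation confidence interval from per-fold scores with scikit-learn; it deliberately ignores the dependence between folds, which is exactly the subtlety the guidelines address, and the dataset and model are placeholders.

```python
# Naive confidence interval for cross-validated AUC, built from per-fold scores.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(model, X, y, cv=10, scoring="roc_auc")

mean = scores.mean()
half_width = 1.96 * scores.std(ddof=1) / np.sqrt(len(scores))  # folds treated as independent
print(f"AUC = {mean:.3f} +/- {half_width:.3f} (naive 95% CI)")
```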
Conditional Average Treatment Effect (CATE) estimation is widely studied in medical contexts. It is one tool used to analyze causality. In the banking sector, interest in causality methods is increasing. As an example, one may be interested in estimating the average effect of a financial crisis on credit risk, conditionally on macroeconomic as well as internal indicators. On the other hand, transfer learning is used to adapt a model trained on one task to a second, related task. Typically, a large amount of data is available for the first task and much less for the second, and a model trained on the large dataset may be adapted for the second task, avoiding re-training the model from scratch.
We propose a new random forest design oriented towards CATE estimation, called HTERF (Heterogeneous Treatment Effect based Random Forest). We then explore causal transfer learning methods and, more precisely, provide a new transfer methodology to adapt HTERF and causal neural networks to new data.
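For orientation only, the sketch below estimates a CATE with a simple T-learner built from random forests on simulated data; this is not HTERF, merely an illustration of the quantity being estimated, and all names and data are assumptions.

```python
# T-learner baseline for CATE: fit separate outcome regressions for treated and
# control units and take the difference of their predictions.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
n = 2000
X = rng.normal(size=(n, 5))                      # e.g. macroeconomic indicators
t = rng.integers(0, 2, size=n)                   # treatment: crisis yes/no
tau = 0.5 * X[:, 0]                              # true heterogeneous effect
y = X[:, 1] + t * tau + rng.normal(scale=0.5, size=n)

m1 = RandomForestRegressor(random_state=0).fit(X[t == 1], y[t == 1])
m0 = RandomForestRegressor(random_state=0).fit(X[t == 0], y[t == 0])
cate_hat = m1.predict(X) - m0.predict(X)
print(np.corrcoef(cate_hat, tau)[0, 1])          # should be clearly positive
```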
In spreading processes such as opinion spread in a social network, interactions within groups often play a key role. For example, we can assume that three members of the same family have a higher chance of persuading a fourth member to change their opinion than three friends of the same person who do not know each other, and hence do not belong to the same community. Conversely, in a dynamical version, the change of opinion of individuals can lead to the split of communities, or the birth of new ones, as people prefer to join groups where they find peers with the same opinion. We can represent these phenomena as spreading processes on hypergraphs, where hyperedges (subsets of the vertex set of arbitrary size) represent the communities, and the vertices in the intersection of two or more hyperedges act as transmitters. These kinds of dynamics on random hypergraphs give rise to various statistical problems, for example, estimating the probability of the split of a community where opinions differ, or the rate of opinion change given the neighborhood of an individual.
In our work we use various machine learning methods, in particular xgboost and neural networks, to estimate such rates by using only cumulative statistics of the process. That is, we assume that we only know the total number of individuals representing the different opinions, and some basic statistics about the hypergraph structure, for example, the average size of communities and the average size of overlaps. In our simulation study, we identify the quantities that are necessary to obtain good estimates, and the quantities that might contain additional useful information and help improve the quality of the estimates. We also study the effect of the structure of the underlying random hypergraph by running the simulations on networks with different distributions of the community structure.
Drought is a major natural hazard which can have severe consequences for agricultural production, the environment, the economy and social stability. Consequently, increasing attention has been paid to drought monitoring methods that can assist governments in implementing preparedness plans and mitigation measures to reduce the economic, environmental, and social impacts of drought. The relevant drought characteristics are severity, duration and frequency; therefore, a suitable monitoring methodology should consider the time interval T between two occurrences and the magnitude X of each event. Time-Between-Events-and-Amplitude (TBEA) control charts have been proposed to monitor this type of phenomenon: a decrease in T and/or an increase in X may result in a negative condition that needs to be monitored and possibly detected with control charts. Most of the TBEA control charts proposed in the literature assume known distribution functions for the variables T and X. However, in the majority of real situations, the distributions of these random variables are unknown or very difficult to identify. In this study, TBEA control charts are used to detect changes in the characteristics of drought events. We used non-parametric methodologies that do not require any assumption on the distribution of the phenomenon or on the observed statistics. The results indicate that the proposed methods can be valuable tools for the institutions responsible for planning drought management and mitigation measures.
Despite the success of machine learning in the past several years, there has been an ongoing debate regarding the superiority of machine learning algorithms over simpler methods, particularly when working with structured, tabular data. To highlight this issue, we outline a concrete example by revisiting a case study on predictive monitoring in an educational context. In their work, the authors contrasted the performance of a simple regression model, a Bayesian multilevel regression approach, and an LSTM neural network to predict the probability of exceptional academic performance in a group of students. In this comparison, they found that the Bayesian multilevel model mostly outperformed the more complex neural network. In this work in progress, we focus on the case attributes that might have influenced the observed differences in outcomes among the models tested, and extend the previous case study with additional analyses. We elaborate on characteristics that lead to comparable results from simpler, interpretable models in accordance with the data-generating mechanism and theoretical knowledge. We use these findings to discuss the general implications for other case studies.
Dental practices are small businesses. Like any other business, they need cash flow management and financial planning to be viable, if not highly profitable. What a lot of practices may not realize is that they are sitting on a treasure trove of data that can be used in more ways than plain accounting and financial forecasting. Here we focus on longitudinal data, such as the timing of each patient’s visits and the value of treatments since joining the practice. We aim to show how such data can be used by practices to understand their patient base and make plans for the future development of the business. There are plenty of business metrics. Here, we will focus on one metric: patient lifetime value (PLV), derived from customer lifetime value (CLV). CLV is well established in retail and other sectors. Patient appointments data can be used in methods adapted from CLV via the bespoke concept of PLV. We describe different approaches to calculating PLV, the advantages and disadvantages of each approach, and the ways in which they can benefit a dental practice.
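One textbook formulation of CLV, reused here as a PLV, assumes a constant annual margin and retention probability; the sketch below implements that simple discounted-value formula with invented figures, and is only one of the possible approaches alluded to above.

```python
# Simple constant-margin, constant-retention lifetime value (illustrative only).
def lifetime_value(margin_per_year, retention, discount_rate):
    """Expected discounted value of a customer/patient relationship."""
    return margin_per_year * retention / (1 + discount_rate - retention)

# e.g. a patient worth 300 per year, 80% chance of returning, 5% discount rate
print(round(lifetime_value(300, 0.80, 0.05), 2))   # 960.0
```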
Keywords: Data science, statistical models, business improvement, decision making, loyalty, commitment, trust, customer lifetime value, patient lifetime value
Gliomas are the most common form of primary brain tumors. Diffuse Low-Grade Gliomas (DLGG) are slow-growing tumors, often asymptomatic during a long period. They eventually turn into a higher grade, leading to the patients’ death. Treatments are surgery, chemotherapy and radiotherapy, with the aim of controlling tumor evolution. Neuro-oncologists estimate the tumor size evolution by delineating tumor edges on successive MRIs. Localization of the tumor is of great interest for both awake surgery and our attempt to characterize patterns of tumors having common features.
We considered a small local database of 161 patients, and extracted the coordinates of the tumor barycenter from the MRI at the time of diagnosis. Given the particular structure of the data (spatial data coupled with other more usual features), we intend to use the theory of Spatial Point Processes (SPP) to answer the question of the randomness of barycenters, and the existence of an underlying spatial organization. We are using R packages for spatio-temporal point processes in dimension two and designing a method for using the 3-dimensionality of our data.
Structural Equation Models (SEMs) are primarily employed as a confirmatory approach to validate research theories. SEMs operate on the premise that a theoretical model, defined by structural relationships among unobserved constructs, can be tested against empirical data by comparing the observed covariance matrix with the implied covariance matrix derived from the model parameters. Traditionally, SEMs assume that each unobserved construct is modeled as a common factor within a measurement theory framework.
Recently, Henseler (2021) proposed the synthesis theory, which allows for the inclusion of composites as proxies for unobserved constructs in SEMs. While automatic search algorithms have been proposed for factor-based SEMs to systematically identify the model that best fits the data based on statistical criteria, such algorithms have not been developed for composite-based SEMs.
This presentation introduces an extension of these approaches to composite-based SEMs using a genetic algorithm to identify the theoretical model that best fits the data. Akaike Information Criterion (AIC) is employed to compare model fits and determine the optimal model. We present a Monte Carlo simulation study that investigates the ability of our approach to accurately identify the true model under various conditions, including different sample sizes and levels of model complexity.
Our methodology can be considered a grounded theory approach, offering novel insights for conceptualizing structural relations among unobserved constructs and potentially advancing new theories.
Lightning is a chaotic atmospheric phenomenon that is incredibly challenging to forecast accurately and poses a significant threat to life and property. Complex numerical weather prediction models are often used to predict lightning occurrences but fail to provide adequate short-term forecasts, or nowcasts, due to their design and computational cost. In the past decade, researchers have demonstrated that spatiotemporal deep learning models can produce accurate lightning nowcasts using remotely sensed meteorological data, such as radar, satellite, and previous lightning occurrence imagery. However, these models are generally designed to predict lightning occurrence an hour or more in advance, leaving a forecasting gap in the sub-hour timeframe. This research develops novel sequence-to-sequence attention-based and non-attention-based spatiotemporal deep learning neural networks that ingest multi-modal, remotely sensed weather data to produce a time series of lightning nowcasts in the sub-hour interval. Furthermore, model error is uniquely incorporated into the model developmental process, resulting in more reliable predictions. Comparing the performance of these models to models seen in previous literature shows that the novel models perform comparably to, if not better than, prior lightning nowcast models. Additionally, the results show that adding attention mechanisms benefits specific model architectures more than others.
The problem of measuring the size distribution of ultrafine (nano- and submicron-sized) particles is important for determining the physical and chemical properties of aerosols and their toxicity. We give a quick review of some statistical methods used in the literature to solve this problem, for instance an EM algorithm for the reconstruction of particle size distributions from diffusion battery data. We also present some simulation studies carried out during an ongoing work exploring new directions of research on this topic.
This study examines the relationship between foreign affiliates and labour productivity in the construction and manufacturing sectors. Labour productivity is calculated using the EUKLEMS & INTANProd database of the Luiss Lab of European Economics, while data on foreign affiliates abroad are taken from Eurostat. Using data from 19 EU countries between 2010 and 2019, we demonstrate that turnover per employee in foreign subsidiaries controlled by the reporting country has a positive and significant impact on labour productivity in the construction sector only. Foreign direct investment from these European countries also has a positive and significant impact on labour productivity in both sectors. This study can be used by public decision makers to highlight fiscally elusive strategies and to outline the real share of domestic and foreign productivity for industrial economic sectors by considering the impact of permanent establishments. In the framework of this work, a foreign affiliate is not to be considered a resident entity, but an entity over which a resident institutional unit has direct or indirect control. These results demonstrate the need for structural business statistics to select the perimeter of the companies potentially involved in the phenomenon of foreign production through local units that do not constitute separate legal entities. Data on both controlled enterprises and branches, concerning turnover and investments, will be considered in this work to outline multinational groups and their operations with related entities.
In process robustness studies, experimenters are interested in comparing the responses at different locations within the normal operating ranges of the process parameters to the response at the target operating condition. Small differences in the responses imply that the manufacturing process is not affected by the expected fluctuations in the process parameters, indicating its robustness. In this presentation, I will introduce a new optimal design criterion, named the generalized integrated variance for differences (GI$\mathrm{_D}$) criterion, to set up experiments for robustness studies. GI$\mathrm{_D}$-optimal designs have broad applications, particularly in pharmaceutical product development and manufacturing. I will show that GI$\mathrm{_D}$-optimal designs have better predictive performances than other commonly used designs for robustness studies, especially when the target operating condition is not located at the center of the experimental region. In some situations, the alternative designs typically used are roughly only 50% as efficient as GI$\mathrm{_D}$-optimal designs. I will also demonstrate the advantages of tailor-made GI$\mathrm{_D}$-optimal designs through an application to a manufacturing process robustness study of the Rotarix liquid vaccine.
Screening experiments often require both continuous and categorical factors. In this talk I will introduce a new class of saturated main effects designs containing m three-level continuous factors and m − 1 two-level discrete or continuous factors in n = 2m runs, where m ≥ 4. A key advantage is that these designs are available for any even n ≥ 8. Under effect sparsity, a few quadratic effects are identifiable. I will demonstrate three methods of construction depending on whether n is a multiple of 8, a multiple of 4 but not 8, or a multiple of 2 but not 4 or 8.
Orthogonal minimally aliased response surface or OMARS designs are an extensive family of experimental designs, bridging the gap between definitive screening designs and traditional response surface designs. Their technical properties render OMARS designs suitable to combine a screening experiment and a response surface experiment in one. The original OMARS designs are intended for experimentation involving three-level quantitative factors only. Recently, however, mixed-level OMARS designs were introduced to study three-level quantitative factors and two-level quantitative or categorical factors simultaneously. An even more recent addition to the set of orthogonal designs available is the family of orthogonal mixed-level (OML) designs. In this presentation, we discuss the connection between OMARS designs and OML designs. We conclude that all OMARS designs are OML designs, but not all OML designs are OMARS designs. We also discuss the combinatorial construction of certain types of three-level OMARS designs.
The aim of pattern matching is to identify specific patterns in historical time series data to predict future values. Many pattern matching methods are non-parametric and based on finding nearest neighbors. This type of method is founded on the assumption that past patterns can repeat and provide information about future trends. Most of the methods proposed in the literature are univariate. In some applications, other time series are available and their use can improve the forecasting results. Certain methods exist to deal with this type of context, such as the MV-kWNN approach, which predicts the future of several time series with temporal dependencies using pattern matching. To deal with the case of uncorrelated time series, we propose a new method called Weighted Nearest Neighbors for multivariate time series ($WNN_{multi}$). This method extends the search area for nearest neighbors to other curves, with the aim of finding similar behavior in other individuals at different times. Once the k nearest neighbors have been found, the prediction is obtained either by averaging the futures or by using Gaussian processes. To evaluate the performance of this model, we use data from the Spanish electricity market, where the objective is to predict electricity consumption using also the price of electricity (and vice versa). In this type of application, where the time series are heterogeneous, normalization is necessary to be able to compare and use different time series. We compare the results with other methods for multivariate time series forecasting.
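The univariate core of pattern matching can be sketched as follows (the $WNN_{multi}$ method above extends this neighbour search to other, normalised series); the window length, horizon, k and the averaging rule are illustrative assumptions.

```python
# Nearest-neighbour pattern matching for a single series: find the k historical
# windows closest to the most recent window and average their futures.
import numpy as np

def knn_forecast(series, window=24, horizon=4, k=5):
    series = np.asarray(series, dtype=float)
    query = series[-window:]
    candidates, futures = [], []
    for start in range(len(series) - window - horizon):
        candidates.append(series[start:start + window])
        futures.append(series[start + window:start + window + horizon])
    dists = np.linalg.norm(np.array(candidates) - query, axis=1)
    nearest = np.argsort(dists)[:k]
    return np.array(futures)[nearest].mean(axis=0)   # average of the k futures

t = np.arange(500)
demand = np.sin(2 * np.pi * t / 24) + 0.1 * np.random.default_rng(2).normal(size=500)
print(knn_forecast(demand))
```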
Multivariate Singular Spectrum Analysis (MSSA) is a nonparametric tool for time series analysis widely used across finance, healthcare, ecology, and engineering. Traditional MSSA depends on the singular value decomposition, which is highly susceptible to outliers. We introduce a robust version of MSSA, named Robust Diagonalwise Estimation of SSA (RODESSA), that is able to resist both cellwise and casewise outliers. The decomposition step of MSSA is replaced by a robust low-rank approximation of the trajectory matrix that takes its special structure into account. We devise a fast algorithm that decreases the objective function at each iteration. Additionally, an enhanced time series plot is introduced for better outlier visualization. Through extensive Monte Carlo simulations and a practical case study on temperature monitoring in railway vehicles, RODESSA demonstrates superior performance in handling outliers compared to competing approaches in the literature.
Acknowledgments: The research activity of F. Centofanti was carried out within the MICS (Made in Italy – Circular and Sustainable) Extended Partnership and received funding from the European Union Next-GenerationEU (PIANO NAZIONALE DI RIPRESA E RESILIENZA (PNRR) – MISSIONE 4 COMPONENTE 2, INVESTIMENTO 1.3 – D.D. 1551.11-10-2022, PE00000004). The research activity of B. Palumbo was carried out within the MOST - Sustainable Mobility National Research Center and received funding from the European Union Next-GenerationEU (PIANO NAZIONALE DI RIPRESA E RESILIENZA (PNRR) – MISSIONE 4 COMPONENTE 2, INVESTIMENTO 1.4 – D.D. 1033.17-06-2022, CN00000023). This work reflects only the authors’ views and opinions, neither the European Union nor the European Commission can be considered responsible for them.
Our previous contribution to ENBIS included an introduction of BAPC (‘Before and After correction Parameter Comparison’), a framework for explainable AI time series forecasting, which has formerly been applied to logistic regression. An initially non-interpretable predictive model (such as a neural network) is used to improve the forecast of a classical time series ‘base model’. Explainability of the correction is provided by fitting the base model again to the data from which the error prediction is removed. This follow-up work is devoted to the practical application of the framework by (1) showcasing the method to explain changes in the dynamics of a physical system, (2) providing guidance on the choice of the interpretable and correction model pair based on an explainability-accuracy tradeoff analysis, and (3) comparing our method with the state of the art in explainable time-series forecasting. In this context, BAPC is able to identify the set of model parameters and the time window that bring maximum explanation to the local behavior of the AI correction, hence delivering explanations both in the form of feature importance and time importance.
Existing control charts for Poisson counts are tailor-made for detecting changes in the process mean while the Poisson assumption is not violated. But if the mean changes together with the distribution family, the performance of these charts may deviate considerably from the expected out-of-control behavior. In this research, omnibus control charts for Poisson counts are developed, which are sensitive to a broad variety of process changes. This is achieved by adapting common omnibus goodness-of-fit (GoF) tests to process monitoring. More precisely, different GoF-tests based on the probability generating function (pgf) are combined with an exponentially weighted moving-average (EWMA) approach in various ways. A comprehensive simulation study leads to clear design recommendations on how to achieve the desired omnibus property. The practical benefits of the proposed omnibus EWMA charts are demonstrated with several real-world data examples.
The shifted (or two-parameter) exponential distribution is a well-known model for lifetime data with a warranty period. Apart from that, it is useful in modelling survival data with some flexibility due to its two-parameter representation. Control charts for monitoring a process that is modeled by a shifted exponential distribution have been studied quite extensively in the recent literature. However, all the available charts require the use of rational subgroups of size $n\geq 2$. In this work we focus on the case of individual observations (i.e., $n=1$) and propose the use of two CUSUM charts for monitoring this type of process. The preliminary results show that the proposed charts have increased sensitivity and are thus effective in the detection of various out-of-control situations. Also, a follow-up procedure is discussed in order to identify which of the process parameters has changed due to the presence of assignable causes. Finally, the implementation of the proposed charts in practice is discussed via an illustrative example.
Acknowledgement: This work has been partly supported by the University of Piraeus Research Center.
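For notation, the snippet below runs a textbook one-sided CUSUM recursion on individual observations from a shifted exponential model; the reference value k and decision limit h shown are arbitrary placeholders, not the design values derived in the work above.

```python
# Generic upper CUSUM on individual observations; alarm when the statistic
# exceeds the decision limit h (the chart restarts after each alarm).
import numpy as np

def upper_cusum(x, k, h):
    c, alarms = 0.0, []
    for t, xt in enumerate(x):
        c = max(0.0, c + xt - k)
        if c > h:
            alarms.append(t)
            c = 0.0
    return alarms

rng = np.random.default_rng(3)
in_control = 1.0 + rng.exponential(scale=1.0, size=100)   # shift = 1, scale = 1
shifted = 1.0 + rng.exponential(scale=2.0, size=50)       # scale increase
print(upper_cusum(np.concatenate([in_control, shifted]), k=2.3, h=5.0))
```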
This article constructs a control chart for monitoring the ratio of two variances within a bivariate-distributed population. For an in-control process, we assume that the two in-control variances and the covariance of the bivariate-distributed population are known. Monitoring the ratio of the two variances is then equivalent to monitoring a difference of the two variances. An unbiased estimator of the difference between the two sample variances is provided, and its mean and variance are derived. The new ratio control chart is developed accordingly. We investigate the properties and detection performance of the proposed ratio control chart through several numerical analyses. Whether the sample size is small or large, we demonstrate that the proposed control chart provides correct process monitoring information and quick out-of-control detection ability.
A real example of monitoring the ratio of variances within a bivariate-distributed population, using a semiconductor data set, demonstrates the application of the proposed ratio control chart.
Causality is a fundamental concept in the scientific learning paradigm. For this purpose, deterministic models are always desirable, but they are often unfeasible due to the lack of knowledge. In such cases, empirical models fitted on process data can be used instead. Moreover, the advent of Industry 4.0 and the growing popularity of the Big Data movement have caused a recent shift in process data. In this context, data scientists typically use machine learning models for correlation and prediction, but these models often fail to identify the underlying causal relationships.
By contrast, latent variable-based models, such as Partial Least Squares (PLS), allow for the analysis of large datasets with highly correlated data. These models not only analyze the relationship between the input and output spaces but also provide models for both spaces, offering uniqueness and causality in the reduced latent space. However, causal interpretation in the latent space is restricted to changes that respect the correlation structure of these models. This work focuses on causal latent variable-based models (see the PLS sketch after the list below) to:
- define multivariate raw material specifications providing assurance of quality, with a certain confidence level, for the critical-to-quality attributes (CQAs). In addition, an effective process control system attenuating most raw material variations is implemented by manipulating process variables;
- develop a latent-space-based multivariate capability index to rank raw material suppliers. The novelty of this new index is that it is defined in the latent space connecting the raw material properties with the CQAs;
- optimize processes using historical data.
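As announced above, here is a minimal PLS sketch (assumed data and settings, not the case studies of this work) relating raw material properties X to CQAs Y with scikit-learn.

```python
# PLS latent variable model: correlated raw material properties X, two CQAs Y.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(4)
n = 200
X = rng.normal(size=(n, 10))
X[:, 5:] = X[:, :5] + 0.1 * rng.normal(size=(n, 5))        # induce collinearity
Y = X[:, :2] @ np.array([[1.0, 0.5], [0.3, -0.7]]) + 0.1 * rng.normal(size=(n, 2))

pls = PLSRegression(n_components=2).fit(X, Y)
T = pls.transform(X)                             # scores: the reduced latent space
print(pls.score(X, Y))                           # R^2 of the CQA predictions
```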
In the pharmaceutical industry, the use of statistics has been largely driven by clinical development, an area where frequentist statistics have been and remain dominant. This approach has led to numerous successes when considering the various effective treatments available to patients today.
However, over time, Null Hypothesis Significance Testing (NHST) and related Type-I error thinking became almost the exclusive method for handling statistical questions across all aspects of the pharmaceutical industry - in discovery, preclinical/translational research, and even in manufacturing - well beyond confirmatory Phase III trials. This has often resulted in adapting and twisting questions and answers to fit the NHST framework, rather than applying appropriate statistical methodologies to the specific questions at hand and fully understanding the impact of the solutions provided.
In my talk, I will first provide examples from preclinical research and manufacturing where, in some cases, a frequentist approach is appropriate, while in others, Bayesian statistics is more oriented towards the question of interest.
Secondly, I will explain how, over the last 20 years, I have been working to organize the sharing of case studies using Bayesian statistics, spread applied knowledge, and promote the careful but necessary acceptance of Bayesian methodology by both the statistical community and regulatory authorities.
Welcome reception in the city hall of Leuven. Drinks only.
In today’s fast-paced industrial landscape, the need for faster and more cost-effective research and development cycles is paramount. As experiments grow increasingly complex, with more factors to optimize, tighter budgetary and time constraints, and limited resources, the challenges faced by industry professionals are more pressing than ever before.
Although the optimal design of experiments framework allows a design to be tailored as closely as possible to the problem at hand, it still requires deep knowledge of statistical concepts and terminology. For instance, when considering hard-to-change factors, one must be familiar with split-plot designs and decide on the number of whole plots and the total number of runs. This technicality, in combination with the observation that, most often, researchers or engineers rather than statisticians construct the experimental plans, is one of the causes of the low adoption rate or poor configuration of the best-performing types of experimental designs.
Can we make design of experiments more accessible to engineers? What if we tailor the design generation to the actual description of the experimental constraints in engineering terms? Imagine an algorithm where engineers define the cost of each test and the transition time between tests, all in their preferred units. This algorithm would automatically determine the tests to be performed, the optimal ordering and number of runs, regardless of the budgetary constraints. Engineers could spend less time setting up experiments and focus more on extracting valuable insights.
Two-level designs are widely used for screening experiments where the goal is to identify a few active factors which have major effects. We apply the model-robust Q_B criterion for the selection of optimal two-level designs without the usual requirements of level balance and pairwise orthogonality. We provide a coordinate exchange algorithm for the construction of Q_B-optimal designs for the first-order maximal model and second-order maximal model and demonstrate that different designs will be recommended under different prior beliefs. Additionally, we study the relationship between this new criterion and the aberration-type criteria. Some new classes of model-robust designs which respect experimenters' prior beliefs are found.
Industrial experiments often have a budget which directly translates into an upper limit on the number of tests that can be performed. However, in situations where the cost of the experimental tests is unequal, there is no one-to-one relation between the budget and the number of tests. In this presentation, we propose a design construction method to generate optimal experimental designs for situations in which not every test is equally expensive. Unlike most existing optimal design construction algorithms, for a given budget, our algorithm optimizes both the number of tests as well as the factor level combinations to be used at each test. Our algorithm belongs to the family of variable neighborhood search (VNS) algorithms, which are known to work well for complex optimization problems. We demonstrate the added value of our algorithm using a case study from the pharmaceutical industry.
Linear model trees are regression trees that incorporate linear models in the leaf nodes. This preserves the intuitive interpretation of decision trees and at the same time enables them to better capture linear relationships, which is hard for standard decision trees. But most existing methods for fitting linear model trees are time consuming and therefore not scalable to large data sets. In addition, they are more prone to overfitting and extrapolation issues than standard regression trees. In this paper we introduce PILOT, a new algorithm for linear model trees that is fast, regularized, stable and interpretable. PILOT trains in a greedy fashion like classic regression trees, but incorporates an $L^2$ boosting approach and a model selection rule for fitting linear models in the nodes. The abbreviation PILOT stands for PIecewise Linear Organic Tree, where ‘organic’ refers to the fact that no pruning is carried out. PILOT has the same low time and space complexity as CART without its pruning. An empirical study indicates that PILOT tends to outperform standard decision trees and other linear model trees on a variety of data sets. Moreover, we prove its consistency in an additive model setting under weak assumptions. When the data is generated by a linear model, the convergence rate is polynomial.
This work formulates model selection as an infinite-armed bandit problem, namely, a problem in which a decision maker iteratively selects one of an infinite number of fixed choices (i.e., arms) when the properties of each choice are only partially known at the time of allocation and may become better understood over time, via the attainment of rewards. Here, the arms are machine learning models to train, and selecting an arm corresponds to a partial training of the model (resource allocation). The reward is the accuracy of the selected model after its partial training. We aim to identify the best model at the end of a finite number of resource allocations and thus consider the best arm identification setup. We propose the algorithm Mutant-UCB, which incorporates operators from evolutionary algorithms into the UCB-E (Upper Confidence Bound Exploration) bandit algorithm introduced by Audibert et al. (2010). Tests carried out on three open source image classification data sets attest to the relevance of this novel combined approach, which outperforms the state of the art for a fixed budget.
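For reference, the sketch below implements the plain UCB-E allocation rule of Audibert et al. (2010) that Mutant-UCB extends; here an "arm" stands for a model and a pull returns a noisy accuracy, with the reward function and constants being placeholders rather than the proposed algorithm.

```python
# UCB-E best-arm identification: pull the arm maximizing mean + sqrt(a / count),
# then recommend the arm with the highest empirical mean.
import numpy as np

def ucb_e(reward_fn, n_arms, budget, a=2.0, rng=None):
    rng = rng or np.random.default_rng(0)
    counts = np.zeros(n_arms, dtype=int)
    sums = np.zeros(n_arms)
    for t in range(budget):
        if t < n_arms:                          # pull every arm once first
            arm = t
        else:
            arm = int(np.argmax(sums / counts + np.sqrt(a / counts)))
        sums[arm] += reward_fn(arm, rng)
        counts[arm] += 1
    return int(np.argmax(sums / counts))        # recommended "best model"

true_acc = np.array([0.70, 0.75, 0.72, 0.80])   # hypothetical model accuracies
noisy = lambda arm, rng: true_acc[arm] + rng.normal(scale=0.05)
print(ucb_e(noisy, n_arms=4, budget=200))       # usually arm 3
```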
In modern manufacturing processes, one may encounter processes composed of two or more critical input blocks having an impact on the Y-space. If these blocks follow a sequential order, any cause of variation in a particular block may be propagated to subsequent blocks. This is frequently observed when a first block of raw material properties entering a production process influences the performance of a second block of process variables, and the final product quality. The goal is not to keep the raw material variability under control, since that may not be feasible, but rather to manipulate the process to mitigate its effect on the product quality. This scenario would hinder the interpretability of process monitoring by a conventional statistical process control (SPC) scheme due to the redundancy of information among blocks. In addition, it may trigger a time-varying process, forcing the user to resort to either local or adaptive procedures. Nevertheless, would it be possible to establish a unique SPC scheme for process variations regardless of raw material variations?
The purpose of this work is to establish a SPC scheme based on the sequential multi-block partial least squares (SMB-PLS) when process blocks present correlated information. This scheme increases the interpretability of the process monitoring, and it prevents any special cause from propagating to subsequent blocks. Thus, these blocks can be monitored by a unique scheme even though there are special causes of variations in prior blocks. A real case study from a food manufacturing process is used to illustrate the proposal.
Online outlier detection in multivariate settings is a topic of high interest in several scientific fields, with Hotelling's T2 control chart being probably the most widely used method in practice to treat it. The problem becomes challenging, though, when we lack the ability to perform a proper phase I calibration, as in short runs or in cases where online inference is requested from the start of the process, as in biomedical applications. In this work, we propose a Bayesian self-starting version of Hotelling's T2 control chart for multivariate normal data when all parameters are unknown. A conjugate power prior allows different sources of information (when available) to be incorporated, providing closed-form expressions that are straightforward to use in practice and, most importantly, allows online inference, breaking free from the phase I calibration stage. From a theoretical perspective, we determine the power of the proposed scheme in detecting a fixed-size outlier in the mean vector and we discuss its properties. Apart from monitoring, we also deal with post-alarm inference, aiming to provide the likely source(s) of an alarm, enriching the practitioners' root-cause analysis toolbox. A simulation study evaluates the performance of the proposed control chart against its competitors, while topics regarding its robustness are also covered. An application to real data will illustrate its practical use.
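For context, the snippet below computes the classical Hotelling T2 statistic for a new observation using plug-in phase I estimates; the Bayesian self-starting chart above replaces exactly these plug-in estimates with online posterior quantities, which the sketch does not attempt.

```python
# Classical Hotelling T2 statistic: squared Mahalanobis distance of a new
# observation from the phase I mean, using the phase I covariance estimate.
import numpy as np

rng = np.random.default_rng(7)
phase1 = rng.multivariate_normal(mean=[0, 0, 0], cov=np.eye(3), size=100)
xbar, S = phase1.mean(axis=0), np.cov(phase1, rowvar=False)

def hotelling_t2(x, xbar, S):
    d = np.asarray(x) - xbar
    return float(d @ np.linalg.solve(S, d))

print(hotelling_t2([0.2, -0.1, 0.4], xbar, S))
print(hotelling_t2([3.0, 3.0, 3.0], xbar, S))   # clearly larger
```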
Hyperspectral imaging is an instrumental method that allows obtaining images where each pixel contains information in a specific range of the electromagnetic spectrum. Initially used for military and satellite applications, hyperspectral imaging has expanded to agriculture, pharmaceuticals, and the food industry. In recent decades, there has been an increasing focus on such analytical techniques as a rapid and non-destructive approach to gather significant insights into textiles.
Automatic identification and segmentation of textile fibers are extremely important, since textile material sorting is valuable for reuse and recycling as it can guarantee added value to the recycled material. However, the extensive variety of fibers utilized in textile production naturally complicates the process of analysis and identification. Textile samples are challenging due to their complex chemical composition and diverse physical characteristics. Optical and physical characteristics like thickness, surface texture, color, and transparency affect data acquisition, carrying information that would increase the overall variance of the data.
This study employs multivariate statistics to address technological and practical tasks for the textile recycling industry. Samples of different textile compositions were analyzed using hyperspectral NIR imaging. Various preprocessing techniques and statistical methods were employed for data exploration, classification, and regression analysis. The research also focuses on the potential for assessing elastane content in cotton fibers, considering its prevalence in the textile industry and the challenges it poses in recycling processes. Statistical methods, including Principal Component Analysis and Multivariate Curve Resolution, are applied to analyze the data and extract the maximum amount of information.
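A minimal sketch of the kind of exploratory analysis described, assuming a hyperspectral cube of shape (rows, columns, wavelengths): unfold it into a pixel-by-wavelength matrix, apply a standard normal variate (SNV) correction and inspect the principal component scores as images. The data and preprocessing choice are illustrative, not the authors' exact pipeline.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
cube = rng.random((50, 60, 200))                   # placeholder NIR image: rows x cols x wavelengths

X = cube.reshape(-1, cube.shape[-1])               # unfold: pixels x wavelengths
X_snv = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)   # SNV per pixel

pca = PCA(n_components=5).fit(X_snv)
score_maps = pca.transform(X_snv).reshape(cube.shape[0], cube.shape[1], -1)  # one image per PC
print(pca.explained_variance_ratio_)
```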
With the routine collection of energy management data at the organisational level comes a growing interest in using data to identify opportunities to improve energy use. However, changing organisational priorities can result in data streams which are typically very messy, with missing periods, poor resolution and structures that are challenging to contextualise. Using operational data collected over three years on a university campus, this presentation shows the influential role changepoint analysis and statistical anomaly detection can play in making sense of such data. Combining building-level data for multiple utilities, we demonstrate the ability of the methods to detect:
We illustrate how these could be placed in an organisational context to guide energy management decisions.
Process Analytical Technologies (PAT) have been the key technology for quality maintenance and improvement in the process industry. Quality is, however, only one indicator of process excellence: Safety, Cost, Delivery, Maintenance and specifically Environment are strongly complementary determinants of process value. The rising societal demands on the sustainability of the contemporary process industry have made Environmental impact in particular increasingly relevant, as demonstrated by the implementation of the CSRD into national legislation in the coming years. This, however, creates an interesting “collision of timelines”, as forward-looking predictions from large volumes of PAT data collide with the retrospective quantification of environmental impact by Life Cycle Assessment (LCA), based on data that are only available at the time of production.
Aside from quality information, PAT data (e.g. NIR spectra) contain a wealth of information on aspects like provenance, which are the key inputs for LCA. The available sustainability data on ingredients may therefore also be used to predict the footprint of the end product. In this way, both quality and environmental impact (and production cost) may be predicted simultaneously. This allows the producer to take control of the product footprint, just as they are already used to controlling quality through PAT. We show, in a case study on animal feeds, how NIR spectroscopy (1) adequately predicts all product outcomes, (2) likewise predicts ingredient provenance, thereby providing a paperless evidence basis for their origin, and (3) makes transparent the economic balance underlying sustainable production.
Industry 4.0 contexts generate large amounts of data holding potential value for advancing product quality and process performance. Current research already uses data-driven models to refine theoretical models, but integrating mechanistic understanding into data-driven models is still overlooked. This represents an opportunity to harness extensive data alongside fundamental principles.
We propose a framework for hybrid modeling solutions in industry by combining Information Quality (InfoQ) principles with hybrid modeling insights. This Hybrid Information Quality approach (H-InfoQ) aims to enhance industrial problem-solving and to improve process modeling and the understanding of non-stationary systems.
The H-InfoQ framework evaluates a given hybrid model, $f_H$, the available process information, $X_H$, the specific analysis goal, $g$, and an appropriate utility measure, $U$. Despite its thoroughness, the framework’s reproducibility and practical application remain challenging for practitioners to navigate autonomously. The main goal is to maximize the utility derived from applying $f_H$ to $X_H$ in the scope of the goal $g$: $\max \; H\text{-}InfoQ = U\{f_H(X_H) \mid g\}$. To improve its practicality, an eight-dimensional strategy is proposed, focusing on data granularity, structure, integration, temporal relevance, data and goal chronology, generalizability, operationalization, and communication (see also Kenett & Shmueli, 2014).
To illustrate the practical application and effectiveness of the H-InfoQ framework, two industrial case studies are analyzed and explored through the lens of this methodological construct. These instances were selected to showcase the tangible benefits and real-world applicability of the framework in industrial contexts.
References
Sansana J, Joswiak MN, Castillo I, Wang Z, Rendall R, Chiang LH, Reis MS. Recent trends on hybrid modeling for Industry 4.0. Computers and Chemical Engineering. 2021;151.
Kenett RS, Shmueli G. On Information Quality. Journal of the Royal Statistical Society Series A: Statistics in Society. 2014;177.
Reis MS, Kenett RS. Assessing the value of information of data-centric activities in the chemical processing industry 4.0. AIChE Journal. 2018;64.
In various global regions, In Vitro Diagnostic Medical Devices (IVDs) must adhere to specific regulations in order to be marketed. To obtain approval from entities such as the U.S. Food and Drug Administration (FDA), Health Canada, or the Japanese regulatory bodies, or under the In Vitro Diagnostic Medical Devices Regulation (IVDR) in Europe, manufacturers are required to submit Technical Documentation to guarantee the products' safety and performance.
The Technical Documentation encompasses the Analytical Performance Report (APR), detailing product accuracy, specificity, stability, interference, and detection and quantitation limits, among other analytical capabilities of the marketed product. Each study is outlined in a distinct report, providing insights into study design, target populations, statistical methodologies employed, acceptance criteria and reasoning behind sample size calculations. The comprehensive APR is the amalgamation of these individual reports and adheres to a standardized structure.
Traditionally compiled manually, the creation of the APR involves labor-intensive tasks such as repetitive data entry and verification of study consistency against methodological guidelines, such as those provided by the Clinical and Laboratory Standards Institute (CLSI). Our presentation will introduce a prototype tool leveraging Large Language Models to automate and streamline the preparation process of the final APR, reducing manual intervention and enhancing efficiency.
Bioprinting is an innovative set of technologies derived from additive manufacturing, with significant applications in tissue engineering and regenerative medicine. The quality of printed constructs is commonly measured in terms of shape fidelity through a procedure known as printability assessment. However, the cost of experimental sampling and the complexity of the various combinations of materials, processes, and conditions make it difficult to train and generalize printability models across different scenarios. Typically, these models are application-specific and developed from scratch, exploring only a few parameters after arbitrarily predetermining several conditions.
The objective of this study is to demonstrate the first application of Transfer Learning (TL) in bioprinting. TL has already proven effective in additive manufacturing by leveraging existing knowledge and applying it to new conditions when materials, machines, or settings change.
In our study, we transfer the knowledge from a biomaterial (the source) to another (the target), aiming at modeling the target printability response by reusing the previous knowledge and thus minimizing experimental effort. The accuracy of the transferred model is assessed by comparing its prediction error with a conventional approach developed from scratch. Different established TL approaches are employed, compared, and improved to enhance prediction performance for this application. Additionally, we investigate the method's performance and limitations by varying the number of experimental target points.
This method demonstrates the feasibility of knowledge transfer in bioprinting, acting as a catalyst for more advanced scenarios across diverse printing conditions, materials, and technologies. Furthermore, the approach enhances reliability and efficiency of bioprinting process modeling.
AdDownloader is a Python package for downloading advertisements and their media content from the Meta Online Ad Library. With a valid Meta developer access token, AdDownloader automates the process of downloading relevant ads data and storing it in a user-friendly format. Additionally, AdDownloader uses individual ad links from the downloaded data to access each ad's media content (i.e. images and videos) and stores it locally. The package also offers various analytical functionalities, such as topic modelling of ad text and image captioning using AI, embedded in a dashboard. AdDownloader can be run as a command-line interface or imported as a Python package, providing a flexible and intuitive user experience. The source code is currently stored on GitHub and can be reused for further research under the GPL-3.0 license. Applications range from understanding the effectiveness and transparency of online political campaigns to monitoring the exposure of different population groups to the marketing of harmful substances. This paper applies AdDownloader's functionalities to the 2020 US General Elections as a case study.
Different data difficulty factors (e.g., class imbalance, class overlap, the presence of outliers and noisy observations, and difficult decision borders) make classification tasks challenging in many practical applications and are hot topics in pattern recognition, machine learning and deep learning. Data complexity factors have been widely discussed in the specialized literature from a model-based or a data-based perspective; conversely, less research effort has been devoted to investigating their effect on the behavior of classifier predictive performance measures. Our study addresses this issue by investigating the impact of data complexity on the behavior of several measures of classifier predictive performance. The investigation has been conducted via an extensive study based on numerical experiments using artificial data sets. The data generation process has been controlled through a set of parameters (e.g., number of features; class frequency distributions; frequency distributions of safe and unsafe instances) defining the characteristics of the generated data. The artificial data sets have been classified using several algorithms whose predictive performances have been evaluated through the measures under study. The results highlight that, although the investigated performance measures largely agree for easy classification tasks (i.e., balanced datasets containing only safe instances), their behavior differs significantly when dealing with difficult classification tasks (i.e., increasing data complexity), which are the rule in many real-world classification problems.
Acknowledgements
This study was carried out within the MICS (Made in Italy– Circular and Sustainable) Extended Partnership and received funding from the European Union Next-Generation EU (PIANO NAZIONALE DI RIPRESA E RESILIENZA (PNRR) – MISSIONE 4 COMPONENTE 2, INVESTIMENTO 1.3 –D.D. 1551.11-10-2022, PE00000004). This manuscript reflects only the authors’ views and opinions, neither the European Union nor the European Commission can be considered responsible for them.
Squats are localized material failures in railway tracks which can lead to critical effects when not detected or removed in time. Investigations in recent years (cf., e.g., [1], [2], [3], [4]) have pointed out the severity of this problem, although relevant questions about root causes remain open. A main reason for this situation may be the challenging detectability of squat genesis as well as the broad palette of potential root causes, such as track characteristics, traffic patterns and loads, and maintenance. A comprehensive study of these effects requires considerable data preparation effort, since many different sources of different owners, quality and quantity have to be merged adequately. As an extension of the risk modelling work done so far ([5], [6]), this paper presents time-to-failure models which not only show under which circumstances the risk of observing a squat is high but additionally indicate a time of failure occurrence. This allows a health indicator to be determined along the track and supports the planning of preventive maintenance actions. Data from the Swiss railways were used for model development and provide corresponding examples.
[1] Kerr, M., Wilson, A., Marich, S.: The epidemiology of squats and related rail defects, Conference on railway engineering, 2008.
[2] Luther M., Heyder R., Mädler K.: Prevention of multiple squats and rail maintenance measures, 11th international conference on contact mechanics and wear of rail/wheel systems (CM2018) Delft, the Netherlands, September 24-27, 2018.
[3] Muhamedsalih Y., Hawksbee S., Tucker G., Stow J., Burstow M.: Squats on the Great Britain rail network: Possible root causes and research recommendations, International Journal of Fatigue, Volume 149, 2021. doi.org/10.1016/j.ijfatigue.2021.106267.
[4] Schamberger S.: Der Squat aus Sicht der OEBB, OEVG: Squats, University of Technology Graz, 2021/09/13.
[5] Nerlich, I.: Netzweite statistische Analyse von Squat-Rollkontaktermüdungsfehlern unter Berücksichtigung von Kontaktgeometrie und Zusammensetzung der Traktionsmittel in einem Bahnsystem mit Mischverkehr. Submitted Dissertation, TU Berlin, 2024.
[6] Haselgruber, N., Nerlich, I.: Statistical models for health monitoring of rare events in railway tracks. Submitted paper to the SIS 2024, the 52nd Scientific Meeting of the Italian Statistical Society, Bari.
Anomaly detection identifies cases that deviate from a common behavior or pattern in data streams. It is of great interest in a variety of fields, from biology, where uncommon observations are recognized in genetic data, to the financial sector, where fraud is identified through unusual economic activity. Detection of anomalies can be formulated as a binary classification problem, distinguishing between anomalies and non-anomalies. In many instances (particularly in financial fraud), the available information comes from two sources: covariates characterizing the profile of a case and the network connecting it to others. In this work, we develop a binary Gaussian process classification model that utilizes information from both sources. We follow the Bayesian paradigm to estimate the parameters and latent states of the model and to naturally quantify the uncertainty about their true values. To derive the covariance matrix of the Gaussian prior distribution of the latent states, we employ kernel functions that model the relationships implied by any available covariates as well as by the network structure of the problem at hand. We develop a bespoke Markov chain Monte Carlo algorithm to obtain posterior samples, enhancing efficiency while reducing complexity in terms of tuning-parameter requirements. The performance of the proposed methodology is examined via a simulation study, while an application to real data illustrates its use in practice.
We discuss the problem of active learning in regression scenarios. In active learning, the goal is to provide criteria that the learning algorithm can employ to improve its performance by actively selecting data that are most informative.
Active learning is usually thought of as being a sequential process where the training set is augmented one data point at a time. Additionally, it is assumed that an experiment to gain a label $y$ for an instance $x$ is costly but computation is cheap.
However, in some application areas, e.g. in biotechnology, selecting queries in serial may be inefficient. Hence, we focus on batch-mode active learning that allows the learner to query instances in groups.
We restrict ourselves to a pool-based sampling scenario and investigate several query strategies, namely uncertainty sampling, committee-based approaches, and variance reduction, for actively selecting instantiations of the input variables $x$ that should be labelled and incorporated into the training set, when the model class is possibly misspecified.
We compare all active selection strategies to the passive one that selects the next input points at random from the unlabelled examples using toy and real data sets and present the results of our numerical studies.
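As a concrete illustration of one of these strategies, the sketch below runs pool-based, batch-mode active learning for regression with a committee-style criterion: the batch queried at each round consists of the unlabelled points on which a small ensemble disagrees most. The data, model class and batch size are illustrative assumptions, not the study's actual setup.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_pool = rng.uniform(-3, 3, size=(500, 2))
y_pool = np.sin(X_pool[:, 0]) + 0.1 * rng.normal(size=500)   # labels revealed only when queried

labelled = list(rng.choice(len(X_pool), size=10, replace=False))
batch_size = 5

for _ in range(5):                                 # five query rounds
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_pool[labelled], y_pool[labelled])
    unlabelled = np.setdiff1d(np.arange(len(X_pool)), labelled)
    # committee disagreement: variance of per-tree predictions on the unlabelled pool
    per_tree = np.stack([tree.predict(X_pool[unlabelled]) for tree in model.estimators_])
    disagreement = per_tree.var(axis=0)
    batch = unlabelled[np.argsort(disagreement)[-batch_size:]]   # query a whole batch at once
    labelled.extend(batch.tolist())

print("labelled set size after 5 rounds:", len(labelled))
```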
Process stability is usually defined through an iid assumption about the data. However, testing for violations of stability requires a concrete model, such as a changepoint, a linear trend, outliers, distributional changes, or positive or negative autocorrelation. These violations are often tested separately, and not all possible modes of instability can always be taken into account. We suggest a likelihood-based procedure using local regression to assess many possible modes of instability in one step and to replace or complement multiple stability tests. The study includes an evaluation of equivalent degrees of freedom and AIC/BIC with Monte Carlo simulations, together with application examples.
The concepts of null space (NS) and orthogonal space (OS) have been developed in independent contexts and with different purposes.
The former arises in the inversion of Partial Least Squares (PLS) regression models, as first proposed by Jaeckle & MacGregor [1], and represents a subspace in the latent space within which variations in the inputs do not affect the prediction of the outputs. The NS is particularly useful in tackling engineering problems such as process design, process scale-up, product formulation, and product transfer.
The second arises in orthogonal PLS (O-PLS) modeling, which was originally proposed by Trygg & Wold [2] as a preprocessing method for multivariate data. O-PLS provides a way to remove systematic variation from the inputs that is not correlated to the outputs. The OS is, therefore, defined as the space that contains the combinations of inputs that do not produce systematic variations in the outputs. Its most important role is in multivariate calibration.
In this study, we bridge PLS model inversion and O-PLS modeling by proving that the NS and the OS are, in fact, the same space (for the univariate response case). We also provide a graphical interpretation of the equivalence between the two spaces.
[1] C. M. Jaeckle and J. F. MacGregor, ‘Industrial applications of product design through the inversion of latent variable models’, Chemom. Intell. Lab. Syst., 50(2):199–210, 2000.
[2] J. Trygg and S. Wold, ‘Orthogonal projections to latent structures (O-PLS)’, J. Chemom, 16(3):119–128, 2002.
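To make the null-space idea concrete for the univariate-response case, the following sketch (on illustrative data) fits a PLS model and verifies that moving a sample's scores along the null space of the y-loadings leaves the prediction unchanged; it is a didactic illustration, not the formal equivalence proof of the paper.

```python
import numpy as np
from scipy.linalg import null_space
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
y = X[:, :3] @ np.array([1.0, -0.5, 0.3]) + 0.1 * rng.normal(size=100)

pls = PLSRegression(n_components=3).fit(X, y)
q = pls.y_loadings_.ravel()            # y-loadings: predictions are an affine function of scores @ q

NS = null_space(q[None, :])            # (A, A-1) basis of score moves with no effect on the prediction
t = pls.transform(X)[:1]               # latent scores of one sample
t_moved = t + 2.0 * NS[:, 0]           # move the sample inside the null space
print(np.allclose(t @ q, t_moved @ q))   # True: the predicted response does not change
```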
An important axiom in innovation is “Fail early, fail often, but learn from the failures.” This talk discusses an academic-industrial statistical engineering project that initially had good prospects for success but ultimately provided virtually no benefit to the industrial partner although it did produce a nice dissertation for the PhD student assigned to the project. It is crucial to note that “hindsight is always 20-20.”
The talk begins with a high-level overview of statistical engineering. It then discusses the nature of academic-industrial relationships through industry-sponsored centers of engineering excellence. Next, the paper discusses the three-year journey of selling the project to the company, the politics both within the engineering center and with the company, the failure of the academic leadership, the difficulty of beginning meaningful dialogues between the company engineers and the university team, and the fundamental funding issues that led to the dissolution of the joint engineering/statistics team project.
The talk ends with a summary of the good, the bad, and the ugly of the experience. More importantly, it provides constructive insights and suggestions to address the fundamental issues that doomed this project.
In this presentation, we present a case study that results from a multi-stage project supported by the NASA Engineering and Safety Center (NESC), where the objective was to assess the safety of composite overwrapped pressure vessels (COPVs). The analytical team was tasked with devising a test plan to model stress rupture failure risk in the carbon fiber strands that encase the COPVs, with the goal of understanding the reliability of the strands at use conditions for the expected mission life. While analyzing the data, we found that the proper analysis of the data contradicts accepted theories about the stress rupture phenomenon. During this presentation, we'll offer statistical insights and elaborate on our successful integration of statistical reasoning into the engineering process, prioritizing evidence-based decision-making over intuition.
The International Statistical Engineering Association (ISEA) defines statistical engineering as "the discipline dedicated to the art and science of solving complex problems that require data and data analysis." Statistical Engineering emphasizes the importance of understanding the problem and its context before developing a problem-solving strategy. While this step may appear obvious, it is often overlooked or rushed through in the haste to deliver. Yet the decisions that we make (or fail to make) at the beginning set the trajectory for either success or failure. This presentation will provide an overview of ISEA and key principles of statistical engineering as well as provide case studies to highlight the importance of these principles.
When analyzing sensor data, it is important to distinguish between environmental effects and actual defects of the structure. Ideally, sensor data behavior can be explained and predicted by environmental effects, for example via regression. However, this is not always the case, and explicit formulas are often lacking. Then, comparing the behavior of environmental and sensor data can help to identify similarities. A classical approach is to observe the correlation; nevertheless, this only captures linear dependencies. Here, the concept of distance correlation, as introduced by Székely et al. (2007), comes into play. It is not only not restricted to linear dependence, but it is also able to detect independence and does not require normality. To respond to another particularity of sensor data, namely local stationarity, we use the extension of this concept by Jentsch et al. (2020), the so-called local distance correlation. We show different examples of application in the field of bridge monitoring, from finding similarities and anomalies in sensor outputs, over the determination of time spans for temperature transfer, up to possible alarm concepts for long-term surveillance.
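For readers unfamiliar with the measure, the sketch below computes the (biased) sample distance correlation of Székely et al. (2007) for two illustrative series and contrasts it with the Pearson correlation on a nonlinear dependence; the local, locally stationary extension of Jentsch et al. (2020) is not reproduced here.

```python
import numpy as np

def distance_correlation(x, y):
    """Biased sample distance correlation (Szekely et al., 2007) for univariate series."""
    x = np.asarray(x, float).reshape(-1, 1)
    y = np.asarray(y, float).reshape(-1, 1)
    def doubly_centred(z):
        d = np.abs(z - z.T)
        return d - d.mean(axis=0) - d.mean(axis=1, keepdims=True) + d.mean()
    A, B = doubly_centred(x), doubly_centred(y)
    dcov2 = (A * B).mean()
    return np.sqrt(dcov2 / np.sqrt((A * A).mean() * (B * B).mean()))

rng = np.random.default_rng(0)
temperature = rng.normal(size=500)
sensor = temperature**2 + 0.1 * rng.normal(size=500)   # nonlinear, nearly uncorrelated dependence
print(np.corrcoef(temperature, sensor)[0, 1], distance_correlation(temperature, sensor))
```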
In data-driven Structural Health Monitoring (SHM), a key challenge is the lack of training data for developing algorithms which can detect, localise and classify the health state of an engineering asset. In many cases, it is additionally not possible to enumerate the number of operational or damage classes prior to operation, so the number of classes/states is unknown. This poses a challenge for many classification or clustering methodologies. The proposed solution is to adopt a Bayesian nonparametric approach to the clustering problem, a Dirichlet process density estimation procedure. This method can be interpreted as the extension of well-known (Gaussian) mixture models to the case where the number of components in the mixture is infinite. A further extension of the algorithm to enable active learning will also be shown, which allows guided inspections to be carried out. Contrary to many active learning algorithms, the approach presented here removes the need for a query budget to be specified a priori, such that it may be applied in the streaming setting encountered in SHM. The efficacy of these approaches will be shown on both a simulated and an in-operation case study.
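A minimal sketch of the underlying clustering idea, using scikit-learn's truncated Dirichlet-process Gaussian mixture on illustrative two-dimensional features: the number of health states is not fixed in advance, only an upper truncation level. The active-learning extension described above is not shown.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
features = np.vstack([rng.normal(m, 0.3, size=(100, 2)) for m in (0, 2, 5)])   # three unknown states

dpgmm = BayesianGaussianMixture(
    n_components=10,                                   # truncation level, not the true number of states
    weight_concentration_prior_type="dirichlet_process",
    random_state=0,
).fit(features)

labels = dpgmm.predict(features)
print("states actually used:", np.unique(labels).size)
```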
Structural Health Monitoring (SHM) is increasingly applied in civil engineering. One of its primary purposes is detecting and assessing changes in structure conditions to reduce potential maintenance downtime. Recent advancements, especially in sensor technology, facilitate data measurements, collection, and process automation, leading to large data streams. We propose a function-on-function regression framework for modeling the sensor data and adjusting for confounder-induced variation. Our approach is particularly suited for long-term monitoring when several months or years of training data are available. It combines highly flexible yet interpretable semi-parametric modeling with functional principal component analysis and uses the corresponding out-of-sample phase-II scores for monitoring. The method proposed can also be described as a combination of an "input-output" and an "output-only" method.
Forecasting is of the utmost importance to the integration of renewable energy into power systems and electricity markets. Wind power fluctuations at horizons of a few minutes ahead particularly affect the system balance and are the most significant offshore. Therefore, we focus on offshore wind energy short-term forecasting.
Since forecasts characterize but do not eliminate uncertainty, they ought to be probabilistic. For short-term forecasting, statistical methods have proved to be more skilled and accurate. However, they often rely on stationary, Gaussian distributions, which are not appropriate for wind power generation. Indeed, it is a non-linear, non-stationary stochastic process that is double-bounded by nature.
We extend previous works on generalized logit-normal distributions for wind energy by developing a rigorous statistical framework to estimate the full parameter vector of the distribution. To deal with non-stationarity, we derive the corresponding recursive maximum likelihood estimation and propose an algorithm that can track the parameters over time.
From the observation that bounds are always assumed to be fixed when dealing with bounded distributions, which may not be appropriate for wind power generation, we develop a new statistical framework where the upper bound can vary without being observed. In the context of stochastic processes, we address the bound as an additional parameter and propose an online algorithm that can deal with quasiconvexity.
These new methods and algorithms originate from considering wind power forecasting. However, they are of interest for a much broader range of statistical and forecasting applications, as soon as bounded variables are involved.
Manufacturing processes are systems composed of multiple stages that transform input materials into final products. Drawing inferences about the behavior of these systems for decision-making requires building statistical models that can define the flow from input to output. In the simplest scenario, we can model the entire process as a single-stage relationship from input to output. In the most complex scenario, we need to build a model that accounts for all existing connections across the stages of the process, where each stage may contain a number of controllable parameters and evidence variables. In this work, we will explore the different data science elements behind modeling manufacturing processes using Bayesian networks. Through this, we will demonstrate how to build a model for a manufacturing process, starting from the simplest form and progressing to a more realistic and complex system definition, using a recycling process as our driving case study. We will deploy this model to show how it ultimately provides the inferences we aim to draw about the behavior of our manufacturing processes.
Multi-way data extend two-way matrices to higher-dimensional tensors. In many fields, it is relevant to analyze such data in their original form without unfolding them into a matrix. Often, multi-way data are explored by means of dimension reduction techniques. Here, we study the Multilinear Principal Component Analysis (MPCA) model, which expresses the multi-way data in a more compact format by determining a multilinear projection that captures most of the original multi-way data variation. The most common algorithm to fit this model is an Alternating Singular Value Decomposition algorithm, which, despite its popularity, is sensitive to outliers. To address this issue, robust alternative methods were introduced to withstand casewise and cellwise outliers, respectively, where two different loss functions are tailored to the type of outliers. However, such methods break down when confronted with datasets contaminated by both types of outliers. To address this gap, we propose a method that constructs a new loss function using M-estimators for multi-way data, offering robustness against both kinds of anomalies simultaneously. Extensive simulations show the efficacy of this Robust MPCA method against outliers, demonstrating its potential in robust multi-way data analysis.
frEnbis Invited session
Our research addresses the industrial challenge of minimising production costs in an undiscounted, continuing, partially observable setting. We argue that existing state-of-the-art reinforcement learning algorithms are unsuitable for this context. We introduce Clipped Horizon Average Reward (CHAR), a method tailored for undiscounted optimisation. CHAR is an extension applicable to any off-policy reinforcement learning algorithm which exploits known characteristic times of environments to simplify the problem. We apply CHAR to an industrial gas supplier case study and demonstrate its superior performance in the specific studied environment. Finally, we benchmark our results against the standard industry algorithm, presenting the merits and drawbacks of our approach.
The aim of AI based on machine learning is to generalize information about individuals to an entire population. And yet...
- Can an AI leak information about its training data?
- Since the answer to the first question is yes, what kind of information can it leak?
- How can it be attacked to retrieve this information?
To emphasize AI vulnerability issues, Direction Générale de l’Armement (DGA, member of MoD in France) proposed a challenge on confidentiality attacks based on two tasks:
- Membership Attack: An image classification model has been trained on part of the FGVC-Aircraft open-access dataset. The aim of this challenge is to find, from a set of 1,600 images, those used for training the model and those used for testing.
- Forgetting attack: The model supplied, also known as the "export" model, was refined from a so-called "sovereign" model. Certain sensitive aircraft classes (families) of the sovereign model were removed and replaced by new classes. The aim is to find which of a given set of classes were used to train the sovereign model, using only the weights of the export model.
The Friendly Hackers team of ThereSIS won both tasks. During this presentation, we will present how we did it and what lessons we learned during this fascinating challenge.
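For context, a generic membership-inference baseline (not the winning team's method) simply ranks candidate images by the trained model's loss, since training images tend to be fitted better than held-out ones. The sketch below assumes a hypothetical trained classifier and a data loader over the candidate images.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def membership_scores(model, loader, device="cpu"):
    """Higher score = lower loss = more likely to have been a training image."""
    model.eval().to(device)
    scores = []
    for images, labels in loader:                      # candidate images with their class labels
        logits = model(images.to(device))
        loss = F.cross_entropy(logits, labels.to(device), reduction="none")
        scores.append(-loss.cpu())
    return torch.cat(scores)

# Hypothetical usage, assuming `trained_classifier` and `candidate_loader` exist:
# scores = membership_scores(trained_classifier, candidate_loader)
# members = scores.argsort(descending=True)[:800]      # e.g. if half of the 1,600 are members
```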
In this presentation, we provide an overview of deep learning applications in electricity markets, focusing on several key areas of forecasting. First, we discuss state-of-the-art methods for forecasting electricity demand, including Generalised Additive Models (GAMs), which inspired the work that follows. Second, we look at multi-resolution forecasting, which uses data at high- and low-resolution levels through the application of Convolutional Neural Networks (CNNs). Third, we explore the use of Graph Neural Networks (GNNs) to exploit information across different spatial hierarchies, thereby improving the granularity and accuracy of predictions. In particular, we show the promising role of GNNs in forecasting the French national load based on nodes at the regional level. Fourth, we study meta-learning techniques to select the optimal neural network architecture. Finally, we examine the role of foundation models in standardising and streamlining electricity demand forecasting processes. This review highlights the promising advances and practical implementations of deep learning to improve forecasting accuracy, operational efficiency and decision-making processes in electricity markets.
It is well known that real data often contain outliers. The term outlier typically refers to a case, corresponding to a row of the $n \times d$ data matrix. More recently, cellwise outliers have also been considered. These are suspicious cells (entries) that can occur anywhere in the data matrix. Even a relatively small proportion of outlying cells can contaminate over half the rows, which is a problem for existing rowwise robust methods. Challenges posed by cellwise outliers are discussed, along with some of the methods developed so far to deal with them. On the one hand, there has been work on the detection of outlying cells, after which one might replace them by missing values and apply techniques for incomplete data. On the other hand, cellwise robust methods have been constructed, yielding estimates that are less affected by outlying cells. In lower dimensions the focus has been on robust estimation of location and covariance as well as linear regression, whereas in higher dimensions one needs cellwise robust principal components. Some real data examples are provided for illustration.
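As a minimal illustration of the cellwise viewpoint (a simple univariate screen, not one of the cellwise-robust estimators discussed), the sketch below flags individual cells whose robust z-score, based on the columnwise median and MAD, is extreme.

```python
import numpy as np

def flag_cells(X, cutoff=3.5):
    """Boolean mask of cells whose robust z-score (columnwise median/MAD) is extreme."""
    X = np.asarray(X, float)
    med = np.median(X, axis=0)
    mad = 1.4826 * np.median(np.abs(X - med), axis=0)   # MAD scaled to the normal distribution
    return np.abs((X - med) / mad) > cutoff

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[3, 1] = 8.0                                           # one contaminated cell in an otherwise clean row
print(np.argwhere(flag_cells(X)))
```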
Reinforcement learning offers a flexible framework for tackling problems where data are gathered in a dynamic context: actions have an influence on future states. The classical reinforcement learning paradigm depends on a Markovian hypothesis: the observed states depend on past states only through the last state and action. This condition may be too restrictive for real-world applications, as the dynamics may depend on earlier states. We get around this issue by augmenting the state space into a functional space, and we propose policies that depend on these states through functional regression models. We present an industrial application involving the automatic tint control of e-chromic frames, minimizing the number of user interactions. This particular policy takes actions on an ordered set, using functional data estimated from successive Ambient Light Sensor (ALS) values, which are non-stationary. This is achieved using an extension of an existing ordinal model with functional covariates as a policy. The non-stationary ALS signal describing the state is handled by means of a wavelet functional basis. Finally, the policy is improved using policy gradient methods.
Reinforcement Learning (RL) has emerged as a pivotal tool in the chemical industry, providing innovative solutions to complex challenges. RL is primarily utilized to enhance chemical processes, improve production outcomes, and minimize waste. By enabling the automation and real-time optimization of control systems, RL aims to achieve optimal efficiency in chemical plant operations, thereby significantly reducing operational costs and enhancing process reliability.
In the polymeric fed-batch process, which consists of the feeding phase and the digestion phase, optimization is particularly challenging due to differing constraints and control variables across the phases. This complex process necessitates a careful design of the Markov Decision Process (MDP) and the formulation of reward/cost functions that accurately reflect the operational goals and safety requirements. The irreversible and exothermic nature of these reactions further imposes stringent safety constraints, necessitating robust control mechanisms to prevent hazardous conditions.
In this study, we employ an in-house developed hybrid model to meticulously design an MDP and utilize the Soft Actor-Critic (SAC) algorithm to train an RL agent. Our primary objective is to maximize the yield of the final product while strictly adhering to safety constraints. Detailed performance comparisons between SAC-trained RL agents and human expert golden-batch execution demonstrate that the SAC-trained agent can complete the process approximately 20% faster than the golden-batch, while maintaining safe temperature boundaries and achieving comparable product yield. These promising results underscore the potential of RL to significantly enhance both the efficiency and safety of complex chemical processes, making it a valuable asset in industrial applications.
The integration of multimodal artificial intelligence (AI) in warehouse monitoring offers substantial improvements in efficiency, accuracy, and safety. This approach leverages diverse data sources, including visual and speech sensors, to provide comprehensive monitoring capabilities. Key challenges include the fusion of heterogeneous data streams, which requires sophisticated algorithms to interpret and integrate diverse inputs effectively. The development and training of AI models that can accurately analyse multimodal data are resource-intensive, demanding significant computational power and extensive datasets. Real-time processing is essential for prompt decision-making and incident response, yet achieving this remains a technical challenge due to the high volume of data and the complexity of the models. Additionally, ensuring system robustness and reliability in varying warehouse environments is critical. We present use cases where multimodal AI has been successfully applied, demonstrating its ability to deal with variability when monitoring warehouses subject to unpredictable changes.
In this talk, the problem of selecting a set of design points for universal kriging, a widely used technique for spatial data analysis, is further investigated. We are interested in optimal designs for prediction and present a new design criterion that aims at simultaneously minimizing the variation of the prediction errors at various points. This optimality criterion is based on the generalized variance (GV) and selects the design points in order to make simultaneous predictions of the random variable of interest at a finite number of unsampled locations with maximum precision. Specifically, a correlated random field given by a linear model with an unknown parameter vector and a spatial error correlation structure is considered as the response. Though the proposed design is effective and there are efficient techniques for incrementally building designs for this criterion, the method is limited to simultaneous predictions at a finite number of locations. We are convinced that this restriction can be lifted and that the method may be generalized to minimizing the generalized prediction variance over the design space. We have not yet solved this problem, which involves infinite determinants, but we can present interesting and promising preliminary results.
Finding an optimal experimental design is computationally challenging, especially in high-dimensional spaces. To tackle this, we introduce the NeuroBayes Design Optimizer (NBDO), which uses neural networks to find optimal designs for high-dimensional models, by reducing the dimensionality of the search space. This approach significantly decreases the computational time needed to find a highly efficient optimal design, as demonstrated in various numerical examples. The method offers a balance between computational speed and efficiency, laying the groundwork for more reliable design processes.
Project and problem-based learning is becoming increasingly important in teaching. In statistics courses in particular, it is important not only to impart statistical knowledge, but also to keep an eye on the entire process of data analysis. This can best be achieved with case studies or data analysis projects. In the IPPOLIS project, we are developing a software learning tool that allows students to experience a data analysis process from the definition of the question to be answered, through the description and analysis of the data, to the appropriate presentation of the results. The tool supports students from a wide range of disciplines in learning statistical data analysis. The project is part of the federal and state funding initiative “Artificial Intelligence in Higher Education”, which aims to improve higher education by supporting teaching activities and learning processes based on artificial intelligence.
We present details of the development of the new teaching software tool as an R Shiny application. The tool contains a collection of case studies at different levels, each time starting with a request from a user in medicine or business. Each time a student starts a case study, the tool automatically generates a new data set. The focus is then on the student selecting appropriate measures and graphs to describe the data. At higher learning levels, it is also necessary to perform e.g. a linear or logistic regression to answer the questions posed. Students carry out all steps of the data analysis process and produce a report at the end. A future goal is to support the assessment of student performance by using background data from the tool as well as results from the text mining of the report.
The "DOE Marble Tower" is a modular 3D-printed experiment system for teaching Design of Experiments. I designed it to solve one primary weakness of most DOE exercises, namely to prevent the ability of the experimenter to simply look at the system to figure out what each factor does. By hiding the mechanics, the DOE Marble Tower feels much more like real processes where the only way to know the causes of changes in the response for sure is by experimentation.
The tower is designed to work well in any classroom exercise, being easy to set up and change the settings of, having a low-noise response (the time for a marble to run through), and producing nice sounds to boot!
In this session I'll introduce the tower and give a couple of examples of how to integrate it in teaching, after which participants are free to play with it along with their DOE software of choice!
Over the years I've seen diverse examples of fun elements in teaching statistics at the ENBIS. Of course paper helicopters and using catapults, or candle or water beads projects, for hands-on experience with DoE. But I also vividly remember a Lego assembly competition used for explaining control charts and process control.
Fun parts boost motivation and serve as anchors to remember contents. I believe that among the ENBIS participants there is a wealth of experience, of examples we have used, seen or dreamt of.
In this session I want to offer an opportunity for sharing these ideas and experience, and to become inspired by what we learn from others - from entire course concepts to quick ideas enhancing a short input for colleagues.
I will give a very short introduction, talking about motivation and engagement, and also show my own stats escape room. Then we'll get together in groups and share.
Hopefully, you'll leave with some fresh ideas and new acquaintances.
Storage of spare parts is one of the basic tasks set by industry. Mathematical models such as Crow-AMSAA (known in the statistical literature as the Power-Law Nonhomogeneous Poisson Process) allow us to estimate the demand based on incoming data. Unfortunately, the amount of data is limited in the case of parts with high reliability, which makes the estimation inaccurate. Bayesian methods are one technique that allows the estimation quality to be improved. In this talk, we will show the application of the profile likelihood estimation method to Crow-AMSAA and the combination of this method with the Bayesian approach. We will indicate how, in practice, even a small amount of data can be used successfully with the model to improve the quality of estimation and prediction. We present a model analysis based on both synthetic and real data.
This research is a part of the Primavera Project partly funded by Nederlandse Organisatie voor Wetenschappelijk Onderzoek (NWO) grant number NWA.1160.18.238 and co-funding from the participating consortium members.
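For readers unfamiliar with the model, the sketch below shows the standard maximum-likelihood estimates of the Crow-AMSAA (power-law NHPP) parameters from time-truncated failure data, on synthetic data; the profile-likelihood and Bayesian refinements discussed in the talk are not reproduced here.

```python
import numpy as np

def crow_amsaa_mle(failure_times, T):
    """MLEs for the power-law NHPP with intensity lambda*beta*t**(beta-1), time-truncated at T."""
    t = np.asarray(failure_times, float)
    n = t.size
    beta = n / np.sum(np.log(T / t))
    lam = n / T**beta
    return lam, beta

rng = np.random.default_rng(0)
true_lam, true_beta, T = 0.5, 0.7, 1000.0
n = rng.poisson(true_lam * T**true_beta)
times = T * rng.random(n) ** (1 / true_beta)    # event times of a power-law NHPP given N(T) = n
print(crow_amsaa_mle(times, T))
```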
Acceptance sampling plays an important role in quality control. It is a common technique employed across various industries to assess the quality of products. The decision to accept or reject a lot depends upon the inspection of a random sample from that lot. However, traditional approaches often overlook valuable prior knowledge of product quality. Moreover, the existing Bayesian literature primarily focuses on economic considerations, employing complex cost functions and overlooking the probabilistic risks of rejecting satisfactory lots or accepting unsatisfactory lots that are inherently associated with the sampling process. Until now, a Bayesian formulation of these risks has not been considered in the design of acceptance sampling plans.
This work addresses this gap by proposing a novel approach to designing single-sampling plans for attributes utilizing Bayesian risks. More specifically, it extends the two-point method for designing a sampling plan for attributes to a Bayesian framework, enabling the incorporation of prior knowledge of the product quality into the design process. For this purpose, we use Bayesian risks (the so-called modified and Bayes' risks defined earlier by Brush (1986)) to develop search strategies aimed at designing plans that effectively reduce these risks.
Experiments reveal that the design procedure allows the sample size to adapt to the prior knowledge of product quality. When the prior suggests good quality, less stringent sampling plans with lower sample sizes are obtained. When the prior indicates bad quality, the behaviour of the resulting sampling plan is largely determined by the type of risks used in the design.
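The sketch below gives one illustrative reading of such Bayesian risks for a single-sampling plan (n, c) with a Beta prior on the lot fraction nonconforming, estimated by simple Monte Carlo; the exact definitions of Brush (1986) used in the paper may differ, and all numbers are placeholders.

```python
import numpy as np

n, c = 80, 2                         # single-sampling plan: sample size and acceptance number
p_good, p_bad = 0.01, 0.05           # illustrative "satisfactory" / "unsatisfactory" quality limits
a, b = 2, 98                         # Beta prior on the lot fraction nonconforming (mean 0.02)

rng = np.random.default_rng(0)
p = rng.beta(a, b, size=200_000)             # lots drawn from the prior
accept = rng.binomial(n, p) <= c             # outcome of inspecting a random sample from each lot

risk_reject_good = np.mean(~accept & (p <= p_good)) / np.mean(p <= p_good)
risk_accept_bad = np.mean(accept & (p >= p_bad)) / np.mean(p >= p_bad)
print(risk_reject_good, risk_accept_bad)
```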
Counting processes occur very often in several scientific and technological problems. The concept of numerousness and, consequently, the counting of a number of items are at the basis of many high-level measurements, in fields such as time and frequency, optics, ionizing radiation, microbiology and chemistry. In conformity assessment and industrial quality control, as well as in everyday life, counting also plays a fundamental role.
The occurrence of error in counting is real and needs to be addressed. It might occur, for example, that one fails to count an object for some reason, such as human or instrumental error. In such a case, the measurand, i.e., the number of items intended to be counted, is underestimated. On the other hand, one may count a non-existing object, hence obtaining an overestimate of the measurand.
In a previous paper [Metrologia 2012, 49 (1), 15-19], a general model for measurements by counting was proposed which allows an evaluation of the uncertainty compliant with the general framework of the “Guide to the expression of uncertainty in measurement” [JCGM 100:2008]. The present work considers the same scenario but facing the problem from a Bayesian point of view. In particular, we discuss in detail the (discrete) likelihood function of the counted objects and give the posterior probability mass function associated with the measurand for selected prior distributions.
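As an illustration of the kind of posterior discussed (with an assumed likelihood, not necessarily the one of the paper), the sketch below models each of the N true items as detected with probability q, adds a Poisson background of spurious counts, and computes the posterior mass function of N under a discrete uniform prior.

```python
import numpy as np
from scipy import stats

def posterior_N(x_obs, q=0.95, mu=0.5, N_max=200):
    """Posterior pmf of the true count N given an observed count x_obs, uniform prior on 0..N_max."""
    N = np.arange(N_max + 1)
    like = np.zeros(N_max + 1)
    for i, n in enumerate(N):
        k = np.arange(0, min(n, x_obs) + 1)          # k correctly detected items, x_obs - k spurious
        like[i] = np.sum(stats.binom.pmf(k, n, q) * stats.poisson.pmf(x_obs - k, mu))
    return N, like / like.sum()

N, post = posterior_N(x_obs=100)
print(N[np.argmax(post)], post.max())                # posterior mode and its probability
```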
Multispectral imaging, enhanced by artificial intelligence (AI), is increasingly applied in industrial settings for quality control, defect detection, and process optimization. However, several challenges hinder its widespread adoption. The complexity and volume of multispectral data necessitate advanced algorithms for effective analysis, yet developing these algorithms is resource-intensive. Variability in imaging conditions, such as lighting and sensor noise, requires robust preprocessing techniques to ensure consistent results. Additionally, the integration of AI with existing industrial systems poses interoperability issues. Ensuring real-time processing capability is crucial for many applications but remains a technical hurdle due to the computational demands of AI models. We present here a Multispectral Imaging Flow that tackles these limitations and provides a usable solution for industrial applications. Validation in relevant use cases is presented.
In the pharmaceutical industry, there are strict requirements on the presence of contaminants inside single-use syringes (so-called unijects). Quality management systems include various methods such as measuring weight, manual inspection or vision techniques. Automated and accurate techniques for quality inspection are preferred, reducing the costs and increasing the speed of production.
In this paper we analyze defects in unijects. During inspection, the product is spun around to force contaminants to the outside of the bulb, and photos are taken. These photos can be inspected manually; however, using computer vision techniques, this process can be automated.
As such inclusions occur exceedingly rarely in practice, it is very difficult to collect an initial dataset containing actual defects on which to train a deep-learning network. The approach we will demonstrate in our contribution introduces synthetic defects on top of regular images to kickstart the defect detection network. Using this initial defect segmentation network, we can then introduce classic uncertainty and diversity sampling algorithms to select relevant images for annotation. Normally, in these 'active learning' strategies the initial dataset is taken at random. However, because of the low probability of selecting each type of defect at random, the model has a very cold start. We will demonstrate how our hot-start approach using synthetic defects solves this initialization problem.
Pest insects threaten agriculture, reducing global crop yields by 40% annually and causing economic losses exceeding $70 billion, according to the FAO. Increasing pesticide use not only affects pest species but also beneficial ones. Consequently, precise insect population monitoring is essential to optimize pesticide application and ensure targeted interventions.
In today's AI-driven era, manual monitoring methods like sticky traps are being automated using image-based Deep Learning models. However, training these models requires large, typically unavailable species-specific datasets. Thus, leveraging prior knowledge from similar tasks is key for efficiently adapting to new invasive insects.
Our study focuses on training a Convolutional Neural Network (CNN) model, the EfficientNetV2. We compare training efficiency under two pretraining strategies: one based on the generic ImageNet dataset, and another employing an insect-focused version of the iNaturalist dataset. Moreover, we experiment with various training set sizes (1,000-30,000 images) and explore which EfficientNetV2 architecture sizes and layer-freezing strategies are most effective.
Results show that while EfficientNetV2-Large slightly outperforms Medium and Small variants, its longer training times are unjustified. Therefore, EfficientNetV2-Small is preferable. Furthermore, findings indicate that the optimal layer-freezing strategy depends on the pretraining type. Fine-tuning all layers performed best for ImageNet pretraining, while tuning only mid-to-final layers yielded higher performance than any ImageNet configuration when using iNaturalist pretraining.
We demonstrate that domain-specific knowledge combined with fine-tuning intermediate CNN layers offers an efficient learning strategy in case of limited data, enabling effective monitoring of insect populations and enhancing agricultural defenses against pest-induced yield losses.
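A minimal sketch of the layer-freezing idea with torchvision's EfficientNetV2-S (shown with ImageNet weights, since the insect-focused iNaturalist pretraining used in the study is not distributed with torchvision); the block index at which fine-tuning starts and the number of classes are illustrative.

```python
import torch.nn as nn
from torchvision import models

# ImageNet weights are downloaded on first use; pass weights=None to skip the download.
model = models.efficientnet_v2_s(weights=models.EfficientNet_V2_S_Weights.DEFAULT)

# freeze the early feature blocks, fine-tune the mid-to-final blocks and the classifier
for idx, block in enumerate(model.features):
    for p in block.parameters():
        p.requires_grad = idx >= 4                   # tune only blocks 4 and later (illustrative cut-off)

num_classes = 12                                     # e.g. 12 insect species on the trap
model.classifier[1] = nn.Linear(model.classifier[1].in_features, num_classes)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable:,}")
```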
Multivariate EWMA control charts were introduced by Lowry et al. in 1992 and became a popular and effective tool for monitoring multivariate data. Multi-stream data are closely related to this framework: in both cases, correlation between the components or the respective streams is considered. However, whereas the multivariate EWMA chart deploys a (Mahalanobis) distance in the multivariate space, the multi-stream EWMA chart comprises a set of univariate control charts. In this talk, we discuss the feasible calculation of the detection performance of multi-stream EWMA charts (not many results are available so far) and compare their detection behavior to the better-investigated multivariate EWMA charts. Essentially, numerical methods are applied. Extensive Monte Carlo studies confirm their validity.
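For concreteness, the sketch below implements the multi-stream side of the comparison: one univariate EWMA recursion per stream on standardised data, with an alarm when any stream leaves an (illustrative, asymptotic) control band.

```python
import numpy as np

def multistream_ewma(X, lam=0.1, L=3.0):
    """X: (time, streams) standardised data; returns the first alarm time or None."""
    n, k = X.shape
    limit = L * np.sqrt(lam / (2 - lam))     # asymptotic control limit of a standardised EWMA
    z = np.zeros(k)
    for t in range(n):
        z = lam * X[t] + (1 - lam) * z       # one EWMA recursion per stream
        if np.any(np.abs(z) > limit):
            return t
    return None

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
X[200:, 3] += 1.0                            # mean shift in stream 3 at time 200
print(multistream_ewma(X))
```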
In modern industrial settings, the complexity of quality characteristics necessitates advanced statistical methods using functional data. This work extends the traditional Exponentially Weighted Moving Average (EWMA) control chart to address the statistical process monitoring (SPM) of multivariate functional data, introducing the Adaptive Multivariate Functional EWMA (AMFEWMA). The AMFEWMA adaptively modifies the EWMA weighting parameters to improve detection sensitivity under various process mean shifts, which is crucial for industries with dynamic scenarios. The AMFEWMA's advantages over competing methods are assessed through an extensive Monte Carlo simulation and a practical application in the automotive industry to the SPM of resistance spot welding quality, through the analysis of dynamic resistance curves across multiple welds, which represent a comprehensive technological signature of the welding process quality. The practical application emphasizes AMFEWMA's potential to enhance SPM in advanced manufacturing.
Acknowledgments: The research activity of A. Lepore and F. Centofanti were carried out within the MICS (Made in Italy – Circular and Sustainable) Extended Partnership and received funding from the European Union Next-GenerationEU (PIANO NAZIONALE DI RIPRESA E RESILIENZA (PNRR) – MISSIONE 4 COMPONENTE 2, INVESTIMENTO 1.3 – D.D. 1551.11-10-2022, PE00000004). The research activity of B. Palumbo was carried out within the MOST - Sustainable Mobility National Research Center and received funding from the European Union Next-GenerationEU (PIANO NAZIONALE DI RIPRESA E RESILIENZA (PNRR) – MISSIONE 4 COMPONENTE 2, INVESTIMENTO 1.4 – D.D. 1033.17-06-2022, CN00000023). This work reflects only the authors’ views and opinions, neither the European Union nor the European Commission can be considered responsible for them.
Statistical process monitoring is of vital importance in various fields such as biosurveillance, data streams, etc. This work presents a non-parametric monitoring process aimed at detecting changes in multidimensional data streams. The non-parametric monitoring process is based on the use of convex hulls for constructing appropriate control charts. Results from applying the proposed method are presented and the competitive advantages against other competitors are highlighted.
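A minimal sketch of the convex-hull idea on illustrative data: a new observation signals when it falls outside the convex hull of the reference data, using the point-location test of a Delaunay triangulation.

```python
import numpy as np
from scipy.spatial import Delaunay

rng = np.random.default_rng(0)
reference = rng.normal(size=(300, 3))            # clean historical (reference) data
hull = Delaunay(reference)                       # triangulation of the reference cloud

new_points = np.vstack([rng.normal(size=(5, 3)), 5 + rng.normal(size=(2, 3))])
inside = hull.find_simplex(new_points) >= 0      # False -> outside the convex hull -> signal
print(inside)
```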
Accelerated degradation tests (ADTs) are widely used to assess lifetime information under normal use conditions for highly reliable products. For the accelerated tests, two basic assumptions are that changing stress levels does not affect the underlying distribution family and that there is stochastic ordering for the life distributions at different stress levels. The acceleration invariance (AI) principle for ADTs is proposed to study these fundamental assumptions. Using the AI principle, a theoretical connection between the model parameters and the accelerating variables is developed for Hougaard processes. This concept can be extended to heterogeneous gamma and inverse Gaussian processes. Simulation studies are presented to support the applicability and flexibility of the Hougaard process using the AI principle for ADTs. A real data analysis using the derived relationship is used to validate the AI principle for accelerated degradation analysis.
A repairable system can be reused after repairs, but data from such systems often exhibit cyclic patterns. However, as seen in the charge-discharge cycles of a battery where capacity decreases with each cycle, the system's performance may not fully recover after each repair. To address this issue, the trend renewal process (TRP) transforms periodic data using a trend function to ensure the transformed data displays independent and stationary increments. This study investigates random-effects models with a conjugate structure, achieved by reparameterizing the TRP models. These random-effects TRP models, adaptable to any TRP model with a renewal distribution possessing a conjugate structure, provide enhanced convenience and flexibility in describing sample heterogeneity. Moreover, in addition to analyzing aircraft cooling system data, the proposed random-effects models are extended to accelerated TRP for assessing the reliability of lithium-ion battery data.
At a time when Artificial Intelligence and Machine Learning algorithms are taking over much of the analysis in product development, it remains important to remember where humans still have to question and control how new product designs handle variation and uncertainty.
One part is mapping the variation in all design and production parameters. A second part is questioning the safety margins as well as the design margins. A third part is the uncertainty of computational models.
In this presentation, the general theory for investigating design margins is discussed, together with how it can be used for Robust Design optimization.
The use of composite materials has been increasing in all production industries, including the aviation industry, due to their strength, lightness, and design flexibility. The manufacturing of composite materials ends with their curing in autoclaves, which are heat and pressure ovens. The autoclave curing cycle, in which a batch of materials is cured in the autoclave, consists of three stages: all materials are heated up to the curing temperature (heating stage), cured at the curing temperature (curing stage), and cooled down to room temperature (cooling stage). This curing cycle should be shortened to better utilize the autoclaves. We consider the heating stage and aim to minimize its duration by efficiently placing the parts in the autoclave, such that the time for all parts to reach the curing temperature is minimized. To achieve this, we need to use the relationship between how parts are placed in the autoclave and their heating durations, which is currently not known for the parts and the autoclave considered.
In this study, we estimate the heating duration of parts depending on how a batch of parts is placed in the autoclave. We use two methods, multiple linear regression and artificial neural networks, and develop different models for each method by either partitioning the autoclave area into smaller subareas (area-based models) or considering the autoclave region as a single area (single-area models). We propose the best model by evaluating their performances on real test data.
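As a schematic illustration of the two modelling routes (with synthetic data and hypothetical placement features, since the real features and data are not public), a minimal Python comparison might look as follows:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# Hypothetical feature matrix: placement coordinates / subarea indicators of each
# part in the autoclave; y: time for the part to reach the curing temperature.
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 4))
y = 60 + 30 * X[:, 0] + 10 * X[:, 1] ** 2 + rng.normal(scale=2, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
for name, model in [("linear regression", LinearRegression()),
                    ("neural network", MLPRegressor(hidden_layer_sizes=(32, 16),
                                                    max_iter=5000, random_state=0))]:
    model.fit(X_tr, y_tr)
    print(name, "MAE:", mean_absolute_error(y_te, model.predict(X_te)))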
Waste lubricant oil (WLO) is a hazardous residue that is preferably recovered through a regeneration process, thereby promoting a sustainable circular economy. WLO regeneration is only viable if the WLO does not coagulate in the equipment. Thus, to prevent process shutdowns, the WLO's coagulation potential is assessed offline in a laboratory through an alkaline treatment. This procedure is time-consuming, presents several risks, and its final outcome is subjective (visual assessment).
To expedite decision-making, process analytical technology (PAT) was employed to develop a model to classify WLOs according to their coagulation potential. To this end, three approaches were followed, spanning linear and non-linear models. The first approach (benchmark) uses partial least squares for discriminant analysis (PLS-DA) combined with standard spectral pre-processing techniques (27 model variants). The second approach uses wavelet transforms to decompose the spectra into multiple frequency components by convolution with linear filters, and PLS-DA for feature selection (10 model variants). Finally, the third approach uses convolutional neural networks (CNN) to estimate the optimal filter for feature extraction (1 model variant).
The results show that the three modelling approaches can attain high accuracy (91% on average). Thus, they can lead to a significant reduction in the laboratory burden. However, the benchmark approach requires an exhaustive search over multiple pre-processing filters, since the optimal filter cannot be defined a priori. The CNN approach can streamline the estimation of an optimal filter, but has a more complex model-building stage. Spectral filtering using wavelet transforms proved to be a viable option, maintaining the interpretability of linear approaches and reducing the number of model variants to explore.
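A minimal sketch of the second approach's building blocks, assuming PyWavelets for the wavelet decomposition and a two-class PLS-DA fitted on a 0/1 indicator (the actual pre-processing, feature selection and number of latent variables used in the study are not reproduced here):

import numpy as np
import pywt
from sklearn.cross_decomposition import PLSRegression

def wavelet_features(spectra, wavelet="db4", level=3):
    # Decompose each spectrum into wavelet coefficients and concatenate the levels.
    return np.array([np.concatenate(pywt.wavedec(s, wavelet, level=level))
                     for s in spectra])

def fit_plsda(spectra, labels, n_components=5):
    # PLS-DA: regress the 0/1 class indicator on the wavelet features
    # and classify new samples by thresholding the prediction at 0.5.
    X = wavelet_features(spectra)
    return PLSRegression(n_components=n_components).fit(X, labels.astype(float))

# y_hat = (model.predict(wavelet_features(new_spectra)).ravel() > 0.5).astype(int)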
In the fishing industry, maintaining the quality of fish such as the Peruvian anchovy (Engraulis ringens), used primarily for fishmeal and oil, is critical. The condition and freshness of the fish directly influence production outcomes and the final product's quality. Traditional methods for assessing fish freshness, though precise, are often too costly and time-consuming for frequent application. This study introduces a novel method using convolutional neural networks (CNN) to estimate the freshness of Peruvian anchovy from informal photos taken by fishermen using mobile phones or cameras. Initially, the CNN model was trained to identify not only the fish and krill but also damaged organs and blood. This identification process is crucial as the physical integrity of the fish affects the cooking process: the less intact the fish upon entering the factory, the different the cooking approach should be. Therefore, real-time measurements of fish wholeness, freshness, and the amount of krill entering the production are essential. The CNN model is designed to automatically classify the freshness of the fish by analyzing images for key indicators such as eye clarity, skin texture, and color, and to address the challenge of krill contamination — small crustaceans frequently mixed with anchovy catches. The model distinguishes between krill, anchovy, and levels of destruction such as blood, which is vital for accurate freshness evaluation. Preliminary results show a strong correlation between the CNN predictions and traditional laboratory measurements, indicating that this approach can significantly streamline freshness assessments, reduce costs, and improve response times in production decisions. This technology has the potential to transform quality control in the anchovy fishing industry, enhancing both efficiency and economic viability.
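As a purely illustrative sketch (the study's actual architecture, classes and training data are not described in detail), a tiny PyTorch CNN mapping photos to freshness classes could look like this:

import torch
import torch.nn as nn

class FreshnessCNN(nn.Module):
    # Tiny CNN sketch for classifying fish photos into freshness classes
    # (illustrative architecture, not the network used in the study).
    def __init__(self, n_classes=3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1))
        self.classifier = nn.Linear(32, n_classes)

    def forward(self, x):            # x: (batch, 3, H, W) normalized images
        return self.classifier(self.features(x).flatten(1))

# logits = FreshnessCNN()(torch.randn(4, 3, 128, 128))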
Online experimentation is a way of life for companies involved in information technology and e-commerce. These experiments allocate visitors to a website to different experimental conditions to identify conditions that optimize important performance metrics. Most online experiments are simple two-group comparisons with complete randomization. However, there is great potential for improvement from implementing multi-factor experiments and for accelerating exploitation of results by employing bandit allocation, in which an increasing fraction of traffic is directed toward successful variants. The natural implementation of this idea in a multi-factor experiment is to divert traffic to better levels of the factors that are the first to stand out as active. Logistical considerations may limit the number of variants that can be made available, so that full factorials are not possible. It is then appealing to use screening experiments (e.g. Plackett-Burman designs) to identify important factors. When bandit allocation is used with a screening experiment, declaring an active factor will reduce the number of factor combinations that get continuing use and may result in a singular design. We present here some simple, yet efficient, methods for online augmentation of the factor combinations to enhance the ability to identify additional active factors after one or more are shifted to bandit allocation.
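To make the bandit-allocation ingredient concrete, here is a minimal Thompson-sampling sketch for directing traffic toward better-performing variants (a generic scheme for a binary success metric; the design-augmentation methods of the talk are not shown):

import numpy as np

def thompson_allocation(successes, failures, n_visitors, seed=0):
    # successes/failures: arrays of counts per variant (factor combination).
    # Each visitor is sent to the variant with the largest Beta posterior draw,
    # so traffic gradually concentrates on variants that perform well.
    rng = np.random.default_rng(seed)
    successes = np.asarray(successes, dtype=float)
    failures = np.asarray(failures, dtype=float)
    counts = np.zeros(len(successes), dtype=int)
    for _ in range(n_visitors):
        draws = rng.beta(1.0 + successes, 1.0 + failures)
        counts[int(np.argmax(draws))] += 1
    return counts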
Mixture choice experiments investigate people's preferences for products composed of different ingredients. To ensure the quality of the experimental design, many researchers use Bayesian optimal design methods. Efficient search algorithms are essential for obtaining such designs, yet research in the field of mixture choice experiments is still not extensive. Our paper pioneers the use of a Simulated Annealing (SA) algorithm to construct Bayesian optimal designs for mixture choice experiments. Our SA algorithm not only accepts better solutions but also has a certain probability of accepting inferior solutions. This approach effectively prevents rapid convergence, enabling exploration of a broader experimental region, thus yielding better designs within a given time frame compared to the popular mixture coordinate exchange method. We demonstrate the superior performance of our SA algorithm through extensive computational experiments and a real-life example.
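The generic simulated-annealing skeleton below shows the acceptance rule described above (occasionally accepting worse designs to escape local optima); the neighbour move and the Bayesian design criterion are left as user-supplied functions and are assumptions of this sketch, not the authors' implementation:

import math, random

def simulated_annealing(initial, neighbour, objective, n_iter=10_000,
                        t0=1.0, cooling=0.999):
    # 'objective' is minimized (e.g. a negative Bayesian design efficiency);
    # worse moves are accepted with a temperature-dependent probability.
    x, fx = initial, objective(initial)
    best, fbest = x, fx
    t = t0
    for _ in range(n_iter):
        y = neighbour(x)
        fy = objective(y)
        if fy < fx or random.random() < math.exp((fx - fy) / max(t, 1e-12)):
            x, fx = y, fy
            if fy < fbest:
                best, fbest = y, fy
        t *= cooling
    return best, fbest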
Validating statistical software involves a variety of challenges. Of these, the most difficult is the selection of an effective set of test cases, sometimes referred to as the “test case selection problem”. To further complicate matters, for many statistical applications, development and validation are done by individuals who often have limited time to validate their application and may not have formal training in software validation techniques. As a result, it is imperative that the adopted validation method is efficient, as well as effective, and it should also be one that can be easily understood by individuals not trained in software validation techniques. As it turns out, the test case selection problem can be thought of as a design of experiments (DOE) problem.
As a collaborative statistician you have been charged with completing a complicated toxicology analysis regarding levels of harmful chemicals in groundwater. At the conclusion of your presentation, an audience member asks, “So, should my cows drink the water?” At least half the audience nods and comments that they, too, would like to know the answer to that question. Clearly, something went wrong with your communication strategy. You go back through your charts and graphs, repeating your previous words. Half of the audience still appears confused.
To prevent this situation, most of us have been advised to tell a story, explain results in common language, use a visualization, etc. Yet, there is a big gap in knowing what to do and knowing how to do it. In this talk, we will present practical tips and tools for communicating statistical results at the right level to your audience. We will cover the 4Cs Framework which helps you transition from communicating ‘what you did and how you did it’ to ‘what it means for decision makers’. The first three Cs stand for Comprehensive, Collaborator, and Circulation. The last C varies depending on your role and audience and could stand for Citizen, Chair, or C-Suite.
We will also discuss ADEPT: Analogy, Diagram, Example, Plain English (or your native language), and Technical Definition for communicating complex statistics and data science concepts. Finally, we will touch on some techniques for visualization and slide design that can make you instantly more effective in your next project meeting. All techniques will be accompanied by statistics and data science examples. This talk is appropriate for all levels of collaborative statisticians and data scientists including students, novices, professors teaching collaborative skills, and experts mentoring or managing those early in their careers.
Frøydis presents a case study on fat% variation in salami production and shows how variance components are investigated and visualised for different audiences, reflecting the issues from Jennifer’s introduction. In the talk, some ways of illustrating the fat% variation within and between batches of salami green mass are presented and discussed.
This talk ties in with the previous two talks in the session: the story and data are from one of the series of cases discussed by Frøydis Bjerke, from Animalia, Norway, and the communication focus follows guidelines provided by Jennifer Van Mullekom.
The issues that arise in the case study itself include industrial statistics classics: “is the expensive external laboratory test really better than our modern in-house test?”; “does the relatively high variation observed for one property reflect a true lack of homogeneity for this property, or a test problem?”; “is an investigation of the components of product and test variation at an intermediate stage of production relevant to the quality of the final product?”.
Choices made when constructing the active learning class exercise are described. For example, is our starting point a verbal description of the problem, an available data set, or an existing graphical display? Which skills of statistical practice do we want to teach in this exercise, and what is the right amount of guidance? Two examples of skills are how to identify which additional data to ask for or to plan to collect, and how to exploit existing supplementary data that is of limited use. Regarding the communication aspect, we strive to be brutally honest about how well our beloved graphs are really understood by engineers and managers.
Quality by Design (QbD) has emerged as a pivotal framework in the pharmaceutical industry, emphasizing proactive approaches to ensure product quality. Central to QbD is the identification of a robust design space, encompassing the range of input variables and process parameters that guarantee pharmaceutical product quality. In this study, we present a comparative analysis of random walk sampling used in Markov chain Monte Carlo (MCMC) and nested sampling methodologies employed in design space identification within the QbD paradigm.
Random walk sampling, a key component of MCMC methods, has been widely used in Bayesian inference for its simplicity and effectiveness in exploring parameter space. However, the efficacy of random walk sampling can be affected by the presence of high-dimensional and multimodal distributions, which are common in complex pharmaceutical processes.
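For reference, the random-walk Metropolis baseline discussed here can be written in a few lines (a generic sampler with a user-supplied log-posterior; the step size and Gaussian proposal are illustrative choices):

import numpy as np

def random_walk_metropolis(log_post, x0, n_samples=5000, step=0.1, seed=0):
    # Plain random-walk Metropolis: propose a Gaussian jump around the current
    # state and accept it with the usual Metropolis probability.
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    lp = log_post(x)
    chain = np.empty((n_samples, x.size))
    for i in range(n_samples):
        prop = x + step * rng.standard_normal(x.size)
        lp_prop = log_post(prop)
        if np.log(rng.uniform()) < lp_prop - lp:
            x, lp = prop, lp_prop
        chain[i] = x
    return chain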
Nested sampling offers an alternative approach to sampling from complex probability distributions by focusing on the marginal likelihood. Unlike MCMC, nested sampling systematically improves the estimation of the evidence, enabling more efficient exploration of the parameter space, particularly in scenarios with a high number of design factors.
Drawing on insights from the work of Kusumo et al. (2019) on a Bayesian approach to probabilistic design space characterization using nested sampling strategy, our comparative analysis evaluates the strengths, limitations, and applicability of classical random walk sampling in MCMC and nested sampling in the context of design space identification. We consider factors such as computational efficiency, accuracy of parameter estimation, scalability to high-dimensional spaces, and robustness to multimodal distributions. Furthermore, we discuss practical considerations for selecting appropriate methodologies based on the specific characteristics of the pharmaceutical process under investigation.
Through case studies and theoretical discussions, we illustrate the strengths and limitations of random walk sampling in MCMC and nested sampling methodologies for design space identification within the QbD framework. As OMARS designs are quite capable of handling the high dimensional problems, we aim to test the nested sampling method with OMARS designs. Our findings aim to provide insights into the comparative performance of these methodologies and inform researchers, practitioners, and regulatory authorities involved in pharmaceutical development about their suitability for achieving the goals of QbD.
This study contributes to advancing the understanding of methodologies for design space identification in QbD and facilitates informed decision-making in pharmaceutical process optimization and quality assurance.
Ref: Kusumo, K. P., Gomoescu, L., Paulen, R., García Muñoz, S., Pantelides, C. C., Shah, N., & Chachuat, B. (2019). Bayesian approach to probabilistic design space characterization: A nested sampling strategy. Industrial & Engineering Chemistry Research, 59(6), 2396-2408.
Chemical and physical stability of drug substances and drug products are critical in the development and manufacturing of pharmaceutical products. Classical stability studies, conducted under defined storage conditions of temperature and humidity and in the intended packaging, are resource intensive and are a major contributor to the development timeline of a drug product. To provide support for shelf life claims and expedite the path to clinical implementation, accelerated stability studies in combination with stability modeling have become common practice in the pharmaceutical industry.
In this context, a unified Bayesian kinetic modeling framework is presented, accommodating different types of nonlinear kinetics with temperature and humidity dependent rates of degradation. In comparison to kinetic modeling based on nonlinear least-squares regression, the Bayesian framework allows for interpretable posterior inference, straightforward inclusion of the effects of the packaging in shelf life prediction, flexible error modeling and the opportunity to include prior information based on historical data or expert knowledge. Both frameworks perform comparably for sufficient data from well-designed studies. However, the Bayesian approach provides additional robustness when the data are sparse or of limited quality. This is illustrated with several examples of modeling and shelf life predictions.
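A commonly used kinetic core for such models (one possible choice; the framework presented here accommodates several nonlinear kinetics) is a humidity-modified Arrhenius rate,
\[
\ln k(T, RH) \;=\; \ln A \;-\; \frac{E_a}{R\,T} \;+\; B \cdot RH,
\]
where $E_a$ is the activation energy, $R$ the gas constant, $T$ the absolute temperature, $RH$ the relative humidity and $B$ a humidity sensitivity parameter; degradation over time then follows the chosen kinetic law with rate $k(T,RH)$, and the Bayesian framework places priors on $(\ln A, E_a, B)$ and the error model.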
Powders are ubiquitous in the chemical industry, from pharmaceutical powders for tablet production to food powders like sugar. In these applications, powders are often stored in silos, where the powder builds up stress under its own weight. The Janssen model describes this build-up, but the model has unknown parameters that must be estimated from experimental data. This parameter estimation involves several challenges, such as structural unidentifiability and correlated measurements. To overcome these challenges, a Bayesian non-linear mixed-effects model that incorporates data from two different measurement set-ups is implemented in Turing.jl.
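For context, one common form of the Janssen model for the vertical stress at depth $z$ in a cylindrical silo of diameter $D$ is
\[
\sigma_v(z) \;=\; \frac{\rho\, g\, D}{4\,\mu K}\left(1 - e^{-4\mu K z / D}\right),
\]
with bulk density $\rho$, wall friction coefficient $\mu$ and lateral stress ratio $K$; note that $\mu$ and $K$ enter only through their product, which illustrates the kind of structural unidentifiability mentioned above.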
Version 18 of JMP and JMP Pro are being released in Spring 2024, bringing a host of new features useful to scientists and engineers in industry and academia. This presentation will focus on some key extensions and improvements: an improved user experience based on a new Columns Manager for easier data management and on Platform Presets for creating and reusing customized report templates, a completely overhauled Python integration, a Deep Learning extension via the Torch library, and more.
In stratified designs, restricted randomization is often due to budget or time constraints. For example, if a factor is difficult to change and changing its level is expensive, the tests in a design are grouped into blocks so that within each block the level of the difficult factor is kept constant. Another example appears in agriculture, where some factors may need to be applied to larger experimental units (think of aerial spraying of pesticides), while others may be applied to smaller units (crop variety). A final example is an experiment in which several mixtures are made and different process conditions are tested on each of them. In this last case, the mixture factors are the ones that are difficult to change and the process factors are the ones that are easy to change.
To design and analyse these problems, strata information must be taken into account.
On the one hand, the restricted randomization sources should be identified during the design of the experiment. The EFFEX software platform provides an easy-to-use interface to find an experimental design for the given situation.
On the other hand, ignoring the strata present in the data can lead to meaningless models and incorrect conclusions. The EFFEX software platform's mixed modelling analysis tools graphically display the best models and guide the user to identify the most important effects.
The popular ENBIS LIVE session is again on the program!
ENBIS LIVE 2024 will be hosted by Christian Ritter and Jennifer Van Mullekom.
This is a session in which three volunteers present open problems and the audience discusses them. It's a special occasion where we can all work together to make progress, either by providing useful suggestions or by gaining a deeper understanding. In this session, everybody participates actively. Last year, we looked at NIR/Raman techniques for quality monitoring in food ingredients, predicting final properties in pharmaceuticals from initial conditions and control parameters, and issues related to detecting boar taint using the human nose score. In the years before, we also looked at 3D printing of metals, a growing food delivery service, data mining workflows, the problem of data access for citizens during the COVID pandemic, etc.
Let's see what we will have this year! But for that we NEED again three VOLUNTEERS.
Do YOU currently have an open problem which you could present in that session? We will give you 7 minutes for doing that. You can either present a short deck of slides or you can just speak freely. You should introduce the context and where you are currently stuck or looking for a better answer. We will then go around the audience for one round of questions and another one for suggestions. Together with you we will then come up with a brief summary. Quite often, this will provide you with new insights and ideas.
If you think your problem might fit, please contact ritter.christian@ridaco.be as soon as possible.
If YOU want to listen to open projects and contribute actively to the discussion, note the time and date of the session and enjoy the lively interaction.
We count on all of YOU to make ENBIS LIVE another success.
There is a common perception that bringing statistical innovation into highly regulated industries, such as pharmaceutical companies, is a hard mission. Often, due to legal constraints, statistical innovation in the nonclinical space is not visible to the outside world. In our discussion panel we would like to discuss the challenges we face as industrial statisticians working in pharmaceutical companies, with a main focus on the automation of statistical analyses and workflows in our practice. Automation can, on the one hand, free up statistical practitioners to focus on bringing in innovative approaches rather than rerunning routine, repeatable analyses and standard reports and, on the other hand, help provide access to novel tools for data exploration and analysis.
We shall discuss which types of analyses it usually makes sense to automate, and share successful automation stories as well as lessons learned from automation tools that turned out to be failures. The Bayesian framework will also be discussed, with its benefits and challenges: the prior information requires a thoughtful cross-disciplinary discussion to empower the analyses. We would also like to bring to a broader audience strategies that make automated tools widely adopted, e.g. via trainings and good agreements and collaborations with stakeholders. In addition, technical aspects of software implementation will be brought up, such as data format requirements, ensuring reproducibility of the analysis, etc. Finally, a set-up of highly regulated analyses (e.g. under Good Manufacturing Practice) vs. exploratory ones (in early development) requires a specific approach and rigor to ensure that the final software or script is applied for the intended purpose and provides a good return on invested time and resources.
The panel discussion would also contain some example case studies from several companies the panelists represent.
As businesses increasingly rely on machine learning models to make informed decisions, developing accurate and reliable models is critical. Obtaining curated and annotated data is essential for the development of these predictive models. However, in many industrial contexts, data annotation represents a significant bottleneck to the training and deployment of predictive models. Acquiring labelled observations can be laborious, expensive, and occasionally unattainable, making the limited availability of such data a significant barrier to training machine learning models suitable for real-world applications. Additionally, dealing with data streams is even more challenging because decisions need to be made in real-time, compounded by issues like covariate shifts and concept drifts. In this presentation, we will discuss the use of active learning and adaptive sampling techniques to effectively manage label scarcity in supervised learning, particularly in regression data streams. This talk will provide a comprehensive overview of these techniques, followed by detailed discussions on two specific approaches. First, we will dive into stream-based active learning, which aims to minimise labelled data requirements by strategically selecting observations. The focus will be on linear models, and we will explore the impact of outliers and irrelevant features in the sampling process. Next, we will address concept drift monitoring and adaptive sampling, presenting a method to optimise data collection schemes in scenarios where the relationship between input features and the target variable changes over time. The aim of the presentation is to provide an overview of these sampling techniques while highlighting potential applications in real-time data stream scenarios, laying the groundwork for future work in this growing research area.
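A minimal stream-based active-learning sketch for regression, in which an observation is sent for labelling only when the model's predictive uncertainty exceeds a threshold and labelling budget remains (an illustrative uncertainty-sampling rule; the linear-model strategies and robustness considerations of the talk are not reproduced):

import numpy as np
from sklearn.linear_model import BayesianRidge

def stream_active_learning(stream, oracle, budget, threshold):
    # stream: iterable of feature vectors arriving one at a time.
    # oracle(x): returns the (expensive) label for x; called only when we query.
    model = BayesianRidge()
    X_lab, y_lab = [], []
    for x in stream:
        x = np.atleast_2d(np.asarray(x, dtype=float))
        if len(X_lab) < 5:                              # warm-up: label first few points
            query = True
        else:
            _, std = model.predict(x, return_std=True)  # predictive uncertainty
            query = std[0] > threshold
        if query and budget > 0:
            X_lab.append(x.ravel())
            y_lab.append(oracle(x.ravel()))
            budget -= 1
            model.fit(np.vstack(X_lab), np.array(y_lab))
    return model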
The rapid progress in artificial intelligence models necessitates the development of innovative real-time monitoring techniques with minimal computational overhead. Particularly in machine learning, where artificial neural networks (ANNs) are commonly trained in a supervised manner, it becomes crucial to ensure that the learned relationship between input and output remains valid during the model's deployment. If this stationarity assumption holds, we can conclude that the ANN provides accurate predictions. Otherwise, the retraining or rebuilding of the model is required. This talk focuses on examining the latent feature representation of data, referred to as "embedding", generated by ANNs to identify the time point when the data stream starts being nonstationary. The proposed monitoring approach employs embeddings and utilizes multivariate control charts based on data depth calculations and normalized ranks. The method's performance is thoroughly compared to benchmark approaches, accounting for various existing ANN architectures and underlying data formats. The goal is to assess its effectiveness in detecting nonstationarity in real time, offering insights into the validity of the model’s output.
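A simplified sketch of the monitoring idea, using Mahalanobis depth and normalized ranks against an in-control reference set of embeddings (the talk compares several depth notions and chart designs; this is only one plausible instantiation):

import numpy as np

def mahalanobis_depth(points, reference):
    # Depth of each row of 'points' with respect to the reference embedding cloud.
    mu = reference.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(reference, rowvar=False))
    d2 = np.einsum("ij,jk,ik->i", points - mu, cov_inv, points - mu)
    return 1.0 / (1.0 + d2)

def depth_rank_chart(new_embeddings, reference, alpha=0.01):
    # Signal potential nonstationarity when the normalized rank of a new
    # embedding's depth within the reference depths falls below alpha.
    ref_depths = mahalanobis_depth(reference, reference)
    new_depths = mahalanobis_depth(new_embeddings, reference)
    ranks = np.array([(ref_depths <= d).mean() for d in new_depths])
    return np.where(ranks < alpha)[0]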
The online quality monitoring of a process with low-volume data is a very challenging task, and attention is most often placed on detecting when some of the underlying (unknown) process parameters experience a persistent shift. Self-starting methods, both in the frequentist and the Bayesian domain, aim to offer a solution. Adopting the latter perspective, we propose a general closed-form Bayesian scheme whose application in regular practice is straightforward. The testing procedure is built on a memory-based control chart that relies on the cumulative ratios of sequentially updated predictive distributions. The derivation of the control chart's decision-making threshold, based on false alarm tolerance, along with closed-form conjugate analysis, accompanies the testing. The theoretical framework can accommodate any likelihood from the regular exponential family, while the appropriate prior setting allows the use of different sources of information, when available. An extensive simulation study evaluates the performance against competitors and examines the robustness to different prior settings and model misspecifications, while continuous and discrete real datasets illustrate its implementation.
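Schematically, such a memory-based scheme accumulates ratios of out-of-control to in-control predictive densities of each new observation given the past; a generic CUSUM-type version (not necessarily the exact statistic proposed here) is
\[
S_0 = 0, \qquad
S_n = \max\!\left(0,\; S_{n-1} + \log \frac{p_1(x_n \mid x_{1:n-1})}{p_0(x_n \mid x_{1:n-1})}\right),
\]
with an alarm raised when $S_n$ exceeds a threshold chosen to meet the desired false-alarm tolerance.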
In medical and pharmaceutical research, statistical significance is often based on confidence intervals (CIs) for means or mean differences and on p-values, the reporting of which is included in publications in most top-level medical journals. However, recent years have seen ongoing debates on the usefulness of these inferential tools. Misinterpretations of CIs for means and of p-values can lead to misleading conclusions and nonreproducible claims. On the other hand, the two one-sided tests (TOST) approach is usually applied in the pharmaceutical industry for equivalence testing, robustness studies or stability analysis. Yet, the TOST is also commonly based on CIs for the mean difference or on p-values.
Here, we propose a unified framework based on the success probability (SP), which has a wider definition based on the tolerance interval methodology. The SP allows a straightforward and identical interpretation in both the frequentist and Bayesian paradigms. The SP also extends the concept of ‘probability of agreement’ and (Bayesian) ‘comparative probability metrics’ (CPM). While the CPM is calculated from the posterior distributions, we show that the confidence bound of such probabilities is crucial but rarely applied in practice. The confidence bound for the SP is indeed a one-to-one function of the p-value with enhanced interpretability properties and has a default cut-off value of 50% whatever the type I error.
The performance of our methodology will be evaluated by simulations and by applications to case studies within CMC statistics and vaccine development. We argue that success probabilities should be preferred by researchers in the pharmaceutical industry.
The management of the COVID-19 pandemic, especially during 2020 and 2021, highlighted serious shortcomings at all levels and in the majority of countries around the world.
Some countries reacted slightly better, having faced similar epidemics in their recent past, but obviously this was not enough, since worldwide flows of people are now so large that it makes little sense to draw distinctions at the level of individual countries.
In those difficult phases of the emergency, statisticians from all over the world should have played a major role, but in reality this was not the case.
This is why some individual statisticians moved independently and did their best to help understand what was happening, by appropriately analyzing large amounts of often messy data.
This presentation summarizes three years' statistical work carried out by the author. The work is divided into three parts.
The first part presents a statistical dashboard model that allows the progress of the epidemic's diffusion to be evaluated effectively.
The second part delves into the peculiar aspects of statistically monitoring, controlling and optimally handling a pandemic emergency.
The third part shows how it is possible to “robustly” evaluate the impact of the infectious disease on human mortality.
All three parts present methods based on the analysis of systematically collected data, freely available in official repositories.
During the past few decades, it has become necessary to develop new tools for exploiting and analysing the ever-increasing volume of data. This is one of the reasons why Functional Data Analysis (FDA) has become very popular in a constantly growing number of industrial, societal and medical applications. FDA is a branch of statistics that deals with data that can be represented as functions. Unlike traditional data analysis, which focuses on discrete observations, FDA involves analyzing data that is inherently continuous, such as curves, surfaces, and shapes. Regression models with a functional response and functional covariates, also called "function-on-function" models, are thus becoming very common. Studying this type of model in the presence of heterogeneous data can be particularly useful in various practical situations. In this work, we mainly develop a Mixture-of-Experts model designed for fully functional data. As in most inference approaches for functional data models, we use a B-spline basis expansion for both covariates and parameters to obtain an approximation in a finite-dimensional space. A regularized inference approach is also proposed, which accurately smoothes the functional parameters in order to provide interpretable estimators. Numerical studies on simulated data with different scenarios illustrate the good performance of our method in capturing complex relationships between functional covariates and a functional response. The method is finally applied to a real-world data set for comparison with competitors. We illustrate in particular the performance of our proposed method in predicting the quality of user experience of a streaming video service based on network quality-of-service parameters.
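As a small illustration of the finite-dimensional representation step (using scikit-learn's B-spline basis and an ordinary least-squares projection; the regularized estimator of the paper is not shown), one can turn discretely observed curves into coefficient vectors like this:

import numpy as np
from sklearn.preprocessing import SplineTransformer

def bspline_coefficients(curves, grid, n_knots=10, degree=3):
    # curves: (n_curves, n_grid) discrete observations on a common grid.
    # Returns the (n_curves, n_basis) B-spline coefficient matrix used downstream.
    basis = SplineTransformer(n_knots=n_knots, degree=degree,
                              include_bias=True).fit_transform(grid.reshape(-1, 1))
    coeffs, *_ = np.linalg.lstsq(basis, curves.T, rcond=None)
    return coeffs.T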
Traffic flow estimation plays a key role in the strategic and operational planning of transport networks. Although the amplitude and peak times of the daily traffic flow profiles change from location to location, some consistent patterns emerge within urban networks. In fact, the traffic volumes of different road segments are correlated with each other from spatial and temporal perspectives. The spatial and temporal correlation estimate on road networks represents an important issue for many applications such as traffic inference, missing data imputation, and traffic management and control. In particular, exploring the pairwise correlation between sensors paves the path for inferring data on broken sensors based on data observed on other mostly correlated still-working sensors.
In this setting, we propose a clustering-based functional graphical model (CBFGM) method to explore the spatial (i.e., link-to-link) conditional dependence structure of daily traffic flow profiles. After a smoothing phase, observations are clustered by applying a functional clustering method. Then, for each cluster, a functional graphical model is fitted through a specified estimation method. Based on functional data analysis techniques, the method can efficiently treat the high dimensionality of the problem, avoiding the well-known issue of compressing the information into pattern-specific and arbitrarily chosen features. The CBFGM is applied to a dataset collected using the traffic flow monitoring system installed in the city of Turin with the main aim of building a graphical network of daily traffic flow profiles measured at different sensor locations.
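A skeletal version of the CBFGM pipeline (cluster the curves' finite-dimensional representations, then fit a sparse Gaussian graphical model within each cluster) might look as follows; the clustering method, number of clusters and graphical-model estimator are illustrative stand-ins for the ones specified in the paper:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.covariance import GraphicalLasso

def cbfgm_sketch(coeffs, n_clusters=3, alpha=0.1):
    # coeffs: (n_days, n_features) representation of the daily flow profiles,
    # e.g. stacked basis coefficients across sensor locations.
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(coeffs)
    graphs = {}
    for c in range(n_clusters):
        gl = GraphicalLasso(alpha=alpha).fit(coeffs[labels == c])
        graphs[c] = gl.precision_ != 0   # conditional-dependence adjacency pattern
    return labels, graphs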
Acknowledgements: The present work was developed within the project funded by "Programma per il finanziamento della ricerca di Ateneo – Linea B" of the University of Naples Federico II (ref. ALTRI_CdA_75_2021_FRA_LINEA_B).
We formulate a semiparametric regression approach for short-term prediction (48- to 72-hour-ahead horizons) of electricity prices in the Czech Republic. It is based on a complexity-penalized spline implementation of GAMs; hence it allows for flexible modeling of the dynamics of the process, of important details of the hourly and weekly periodic components (which are salient for both point prediction and its uncertainty), as well as of external influences or long-term moods. Importantly, the models are highly structured, allowing the components to be extracted and checked for plausibility instead of producing black-box-style predictions. We will demonstrate and compare the performance of several competing models of this class on long-term real data featuring highly nonstationary behavior. We will also show the advantages of functional data approaches to the prediction problem, from both modeling and computational perspectives.
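A crude stand-in for the penalized-spline GAM idea (additive spline bases for hour-of-day and day-of-week, with smoothness controlled by a ridge penalty; the actual model presented here is richer and includes external influences), written with scikit-learn:

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import SplineTransformer
from sklearn.linear_model import Ridge

def fit_price_model(hour, weekday, price, alpha=1.0):
    # hour: 0-23, weekday: 0-6, price: hourly electricity prices.
    # Each feature gets its own spline basis; Ridge penalizes rough fits.
    X = np.column_stack([hour, weekday])
    model = make_pipeline(
        SplineTransformer(n_knots=12, degree=3),
        Ridge(alpha=alpha))
    return model.fit(X, price)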
Flow cytometry is a technique used to analyze individual cells or particles contained in a biological sample. The sample passes through a cytometer, where the cells are irradiated by a laser, causing them to scatter and emit fluorescent light. A number of detectors then collect and analyze the scattered and emitted light, producing a wealth of quantitative information about each cell (cell size, granularity, expression of particular proteins or other markers…). This technique produces high dimensional multiparametric observations.
We considered here flow cytometry data, obtained from blood samples, in the context of a specific severe disorder. For each of the n patients, p variables were measured on around 10,000 cells. The information for each patient can thus be considered as a p-dimensional distribution. Usually, with such data, dimension reduction is based on the distance matrix between these distributions. In this work, we propose to reduce the size of the data by calculating deciles of and correlations between the variables. This method allows more variables (around several hundred) to be kept for use in classification methods.
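A small sketch of the proposed summary, assuming each patient's cells are available as a matrix (the exact choice of summaries in the study may differ slightly):

import numpy as np

def patient_features(cells):
    # cells: (n_cells, p) measurements for one patient.
    # Summarize the p-dimensional cell distribution by per-variable deciles
    # and the upper triangle of the correlation matrix between variables.
    deciles = np.percentile(cells, np.arange(10, 100, 10), axis=0).ravel()
    corr = np.corrcoef(cells, rowvar=False)
    iu = np.triu_indices_from(corr, k=1)
    return np.concatenate([deciles, corr[iu]])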
In order to evaluate the performance of companies, the focus is shifting from purely quantitative (financial) information to qualitative (textual) information. Corporate annual reports are comprehensive documents designed to inform investors and other stakeholders about a company's performance in the past year and its goals for the coming years. We have focused on the corporate sustainability reporting of FTSE 350 companies in the period 2012–2021. The lack of standardization and structuring of non-financial reporting makes such an analysis difficult.
We extracted all text from the non-financial sections of the annual reports using the pdf2txt tool and filtered it to retain only structurally correct sentences. We then identified sentences related to sustainability using a pre-trained sentence classifier (manual annotation). The content of these sentences was analyzed using the RoBERTa model, which was adapted to the financial domain. Using a hierarchical clustering algorithm, we identified 30 interpretable sustainability-related topics and 6–9 higher-level clusters of sustainability concepts.
For each report and each year, we calculated the proportion of topics within the report. The development of sustainability topics over time shows that external events and new reporting standards influence the overall content of the annual reports. In addition, we clustered the reports hierarchically based on the proportion of topics and identified 6 types of reports. The analysis showed that external events had the greatest influence on the structure of the individual reports.
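To fix ideas, the topic-proportion features can be computed roughly as below, given sentence embeddings from the adapted language model (the embedding step, the exact clustering settings and the topic labelling are not reproduced here):

import numpy as np
from sklearn.cluster import AgglomerativeClustering

def topic_proportions(sentence_embeddings, report_ids, n_topics=30):
    # Cluster sentence embeddings into topics and compute each report's
    # topic-share vector (illustrative of the proportion features used above).
    labels = AgglomerativeClustering(n_clusters=n_topics).fit_predict(sentence_embeddings)
    report_ids = np.asarray(report_ids)
    reports = sorted(set(report_ids.tolist()))
    props = np.zeros((len(reports), n_topics))
    for r_idx, r in enumerate(reports):
        counts = np.bincount(labels[report_ids == r], minlength=n_topics)
        props[r_idx] = counts / counts.sum()
    return reports, props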
Kernel Principal Component Analysis (KPCA) extends linear PCA from a Euclidean space to data provided in the form of a kernel matrix. Several authors have studied its sensitivity to outlying cases and have proposed robust alternatives, as well as outlier detection diagnostics. We investigate the behavior of kernel ROBPCA, which relies on the Stahel-Donoho outlyingness in feature space (Debruyne and Verdonck, 2010). It turns out that one needs an appropriate selection of directions in which to compute the outlyingness. The outlier map of linear PCA also needs to be adapted to the kernel setting. Our study involves simulated and real data sets, such as the MNIST fashion data.
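As a non-robust baseline for such diagnostics (classical kernel PCA score and reconstruction distances, not the kernel ROBPCA / Stahel-Donoho outlyingness studied in the talk), one could compute:

import numpy as np
from sklearn.decomposition import KernelPCA

def kpca_outlier_scores(X, n_components=5, gamma=None):
    # Score distances in the component space and reconstruction errors in
    # input space, the two axes of a classical PCA outlier map.
    kpca = KernelPCA(n_components=n_components, kernel="rbf", gamma=gamma,
                     fit_inverse_transform=True)
    scores = kpca.fit_transform(X)
    sd = np.sqrt(((scores / scores.std(axis=0)) ** 2).sum(axis=1))    # score distance
    od = np.linalg.norm(X - kpca.inverse_transform(scores), axis=1)   # orthogonal distance
    return sd, od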
Many measurement system capability studies investigate two components of the measurement error, namely repeatability and reproducibility. Repeatability is used to denote the variability of measurements due to the gauge, whereas reproducibility is the variability of measurements due to different conditions such as operators, environment, or time. A gauge repeatability and reproducibility (R&R) study is often conducted to estimate these two components of the measurement error variability. However, when a reference measurement point cannot be determined for parts, the selection of the measurement method and the within-part variation may contribute to the estimates of measurement error. In this study, we investigate a measurement system in the presence of within-part variation for a cylindrical part without a reference measurement point. Alternative measurement methods and analysis of variance models for decomposing the within-part variation from the other components are studied. For a real-life application, cylindrical parts with elliptical cross-sections and a barrel shape along their length have been simulated using the R programming language. Estimates of operator, gauge, part-to-part, and within-part variability are studied. Recommendations are provided for practitioners working in manufacturing processes.
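For reference, the classical crossed gauge R&R decomposition (before a within-part term is added) estimates the variance components from the ANOVA mean squares as
\[
\hat{\sigma}^2_{\text{repeatability}} = MS_E,\qquad
\hat{\sigma}^2_{\text{operator}} = \frac{MS_O - MS_{OP}}{p\,r},\qquad
\hat{\sigma}^2_{\text{part}\times\text{operator}} = \frac{MS_{OP} - MS_E}{r},\qquad
\hat{\sigma}^2_{\text{part}} = \frac{MS_P - MS_{OP}}{o\,r},
\]
for p parts, o operators and r replicates; the extended models studied here introduce an additional within-part component into this decomposition.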
The Advanced Manufacturing Research Centre has invested heavily in AI for manufacturing and has seen success in many applications, including process monitoring, knowledge capture and defect detection. Despite the success in individual projects, the AMRC still has few experts in data science and AI and currently has no framework in place to enable wider adoption of AI nor to ensure the quality of AI projects beyond the final technical review. To enable faster adoption of AI, whilst maintaining quality, the Standardised Data-Centric Manufacturing (DCM) Workflow has been developed. A Github platform has been built from which colleagues and industry partners can find documentation templates, codes and guidance on best practice for DCM projects. The platform provides transparency and trustworthiness to decision-making through the DCM process as well as recommended resources and software to ensure the right tool is used for the right problem. The platform enables transparency in the data engineering performed through DCM projects. As important, the platform provides a comprehensive guide to scoping a DCM project and thus avoids the common pitfall of underestimating the resources and time required to collect quality data and carry out data-intensive analysis required when using AI for reliable decision making. Guidance and recommended resources provide a mechanism for engineers to upskill in the relevant areas of data science and AI and will provide the foundation for future upskilling efforts. The platform has been developed by data scientists and will be tested through specific case studies across the AMRC’s seven groups, including machining, design, composites and castings. User feedback will be used to improve the platform ensuring that it provides a working standardised workflow for the manufacturing industry with flexibility where required. The platform will provide a mechanism for sharing expertise and data, collaboration and knowledge capture of all DCM projects.
In recent years, significant progress has been made in setting up decision support systems based on machine learning exploiting very large databases. In many research or production environments, the available databases are not very large, and the question arises as to whether it makes sense to rely on machine learning models in this context.
Especially in the industrial sector, designing accurate machine learning models with an economy of data is nowadays a major challenge.
This talk presents Transfer Learning and Physics-Informed Machine Learning models that leverage various sources of knowledge to implement efficient models with an economy of data.
Several achievements will be presented that successfully use these learning approaches to develop powerful decision support tools for industrial applications, even in cases where the initial volume of data is limited.
References:
- From Theoretical to Practical Transfer Learning: The ADAPT Library. A de Mathelin, F Deheeger, M Mougeot, N Vayatis, Federated and Transfer Learning, Springer 2022.
- Fixed-budget online adaptive learning for physics-informed neural networks. Towards parameterized problem inference. TNK Nguyen, T Dairay, R Meunier, C Millet, M Mougeot. International Conference on Computational Science, 453-468
https://conferences.enbis.org/event/59/
https://conferences.enbis.org/event/60/