Data Analytics Activities

CAMS Data Analytics

The insights and discussions from the Data Analytics State of the Nation event have significantly contributed to shaping the CAMS Data Analytics thematic area. CAMS has identified several gaps between industry and academia. To address these challenges, the event working group has proposed the establishment of two steering committees to drive both short-term and long-term projects:

Data Analytics Teaching and Training (DATT) Steering Committee
Data Analytics Network and Infrastructure (DANI) Steering Committee

DANI - Data Analytics Webinar Series

The next webinar will be hosted in September. Keep watching for more information soon!

DATT Steering Committee

Data Analytics Teaching and Training (DATT)

What have we achieved in 2024?

1) Invited partners and members to form the committee, ensuring a balanced mix of academics and industry professionals.

2) Held discussions to identify focus areas and tasks, and determined the best approach to address them.

3) Distributed a survey to gather insights into the current state of data analytics in the industry, identify opportunities and skills needed, and understand where we can help tackle these challenges.

The Committee

Dr Lucy M. Morgan

Senior Scientist in Analytical Chemistry (and Data Science)

Pfizer Ltd.

Lucy has 6 years' experience teaching chemistry and data analytics in academia and 3 years' experience training in analytical chemistry and data science in the pharmaceutical industry. Her background is in batteries and pharmaceuticals, with skills in molecular dynamics and DFT modelling, predictive modelling (LC, Raman, CCS, UV), and coding in Python.

Allyson McIntyre

Principal Scientist

AstraZeneca

Allyson works in and around the area of data analytics and is keen to ensure we have strong links between industry & academia to enable improved training and guidance in this area. She will help to bring an industrial perspective to what we do and to the skills we need to improve, and will work alongside other members of DATT to deliver improvements in this area.

Diane C. Turner PhD FRSC

Chair of Trustees of the Analytical Chemistry Trust Fund, Director & Senior Consultant Anthias Consulting Ltd

Anthias Consulting Ltd

Through my role as a consultant and trainer in analytical sciences across most industries globally, I teach how to use software from many manufacturers for data analysis, alongside how to plan, collect and use the data that is needed for a project through to data analytics. I have been using chemometrics for most of my career, including for my PhD in disease diagnosis. I am using this knowledge and experience within DATT to look at how to improve teaching and training in data analytics.

Claire White

Chemist

Selden Research Ltd

With over 17 years' experience working in industry laboratories, I have helped train and mentor numerous students and new employees in the field of chemistry, including the analysis of data. This background equips me with practical insights on how to enhance teaching and training in data analytics, helping bridge the gap between academia and industry.

DANI Steering Committee

Data Analytics Network and Infrastructure (DANI)

What have we achieved in 2024?

1) Invited partners and members to form the committee, ensuring a balanced mix of academics and industry professionals.

2) Held discussions to review focus areas and confirm committee members.

3) Developed a regular webinar series on Data Analytics.

4) Created a clear landing page on the CAMS website to provide resources on good data and infrastructure practices.

The Committee

Dr Drupad Trivedi

Lecturer in Analytical and Measurement Sciences

The University of Manchester | Analytical Chemistry, Chemometrics, Metabolomics

Dr Drupad Trivedi is a CAMS lecturer and data analytics MSI co-chair. His research expertise spans mass spectrometry techniques, data analytics, and point-of-use sensor development for health and disease monitoring as well as prediction. His current research focuses on translating laboratory assays into wearable sensor technologies, utilizing data-driven approaches to decode complex signals. With nearly a decade of research and leadership experience, he has built a multidisciplinary research program focused on signal processing, data analysis, and big data modelling in analytical research. His work has been strengthened through active international collaborations, industry consultancy and academic collaborations, contributing to the advancement of analytical measurement sciences.

Rebecca Ingle

Lecturer

University College London

Rebecca's research involves the development and application of advanced spectroscopic techniques to problems in molecular photochemistry and new applications in the analytical sciences. Many of her experiments involve dealing with large, multidimensional datasets and often require extensive post-processing and statistical analysis. She is particularly interested in how better standardisation of experimental techniques and analysis methods can improve the value of data for the scientific community.

Martin Strachon

Digital Scientist, Analytical R&D

Pfizer

In my role within analytical R&D, I focus on streamlining laboratory workflows to enhance data management and analysis, while also enabling further data analytics. This expertise contributes to the steering committee's understanding of scientists' needs and software standards in an industry setting.

Chiara Giorio

Professor of Atmospheric Chemistry

Yusuf Hamied Department of Chemistry, University of Cambridge

My expertise is in multivariate statistical analysis, chemometrics, mass spectrometry, and source apportionment.

Alex Henderson

Senior Technical Specialist (Data Systems Architect)

The University of Manchester

Data Analytics Webinar Series - Part 1

Kate Kemsley

Kate’s early academic career at the UK's Institute of Food Research focused on infrared sensor design. Her PhD research, on chemometric analysis of the large datasets produced by infrared spectroscopy, led to a wider interest in the emerging disciplines of computational statistics and machine learning. Since then, she has published widely on the analysis of large ‘chemical profile’ datasets (FTIR, NMR, Raman) as well as image and time domain signals. Key areas of application have been natural product integrity issues, and plant and human metabolomics studies. She leads on the Centre of Expertise in Food Authenticity at the University of East Anglia (UEA), and since 2023 has been a Scientific Director at Mestrelab Research SL, a leading producer of scientific software for analytical instrumentation.

AI, Chemometrics & co. for handling large analytical datasets

Advances in AI are transforming the way large analytical datasets are processed and interpreted. This talk will focus on recent developments in predictive modelling at the molecular sub-structure level, in particular its role in improving the assignment and verification of NMR spectra. Drawing on real-world examples from the food and drugs sectors, I will also touch on the use of machine learning as well as traditional chemometric approaches for treating classification problems, contrasting their strengths and limitations and how these might impact on decision-making in analytical chemistry.

Missed our first webinar in the Data Analytics series? No worries! The recording is now available, allowing you to revisit Kate's presentation and our panel discussion.

Access the recordings here. Don’t miss out!

Q&A summary

How did Kate come about her current role and what funding mechanisms were used?

Kate explained that she has worked with a foot in industry for about 10 years, initially at the Quadram Institute and the Institute of Food Research. She collaborated closely with Oxford Instruments on an Innovate UK project, which led to her dual role. She continued working as a consultant via QIB Extra for Oxford Instruments until COVID-19 brought that work to an abrupt end. Later, she took on a consultancy with Mestrelab, which eventually supported her fully as a Scientific Director and in a role at the University of East Anglia.

How good does analytical data need to be for AI to be successful?

Kate emphasized the importance of robust statistics to handle errors in large datasets. She mentioned using trimmed means and medians instead of actual means or standard deviations to avoid the influence of outlying values. Additionally, she highlighted the importance of understanding and eyeballing the data to spot errors.
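
As a rough illustration of the point about robust statistics, the sketch below compares the ordinary mean with the median and a trimmed mean. The readings are made-up values with one gross error, and the `trimmed_mean` helper and its 10% trimming proportion are assumptions chosen for the example, not something specified in the talk.

```python
import statistics


def trimmed_mean(values, proportion=0.1):
    """Mean after discarding the lowest and highest `proportion` of values."""
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # number of values dropped at each end
    kept = ordered[k:len(ordered) - k]
    return statistics.fmean(kept)


# Hypothetical replicate measurements; 55.0 stands in for a gross error
readings = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 10.1, 55.0]

mean_value = statistics.fmean(readings)           # pulled upwards by the outlier
median_value = statistics.median(readings)        # barely affected by it
trimmed_value = trimmed_mean(readings, 0.1)       # drops one value at each end

print(f"mean={mean_value:.2f} median={median_value:.2f} trimmed={trimmed_value:.2f}")
```

With these invented numbers the single outlier shifts the mean well away from the bulk of the data, while the median and trimmed mean stay close to it.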

How can we overcome the reluctance to adopt machine learning in industry?

Kate suggested that explainable AI is crucial for industry adoption. She mentioned the importance of independent test data and proving the model's performance on completely novel substances. She also noted that the complexity of models, which makes them effective, can be a challenge in unravelling their workings.

Sustainability of AI methods in industry:

Kate clarified that her work does not involve generative AI, which is computationally expensive. For her models, predictions are instantaneous and not an issue for sustainability. The training phase is computationally intensive, but they have made progress in streamlining it by reorganizing training data to reduce redundancy.

Deciding on variable input for FTIR spectrum in machine learning:

Kate advised against including noisy data in neural networks. She recommended cutting out regions of the spectral baseline that are not informative. She mentioned that random forests can work with raw data, but it depends on the specific application and the amount of work one wants to do.
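
One simple way to cut uninformative regions out of a spectrum before modelling is to keep only selected wavenumber windows. The sketch below assumes a spectrum stored as two parallel lists; the `keep_regions` helper, the placeholder absorbance values and the chosen windows are all hypothetical, and real windows would depend on the samples and the application.

```python
def keep_regions(wavenumbers, absorbances, regions):
    """Return only the points whose wavenumber lies inside one of `regions`.

    `regions` is a list of (low, high) wavenumber bounds, both inclusive.
    """
    kept_wn, kept_abs = [], []
    for wn, a in zip(wavenumbers, absorbances):
        if any(low <= wn <= high for low, high in regions):
            kept_wn.append(wn)
            kept_abs.append(a)
    return kept_wn, kept_abs


# Hypothetical spectrum sampled every 100 cm^-1 from 4000 down to 600
wavenumbers = list(range(4000, 500, -100))
absorbances = [0.01 * (i % 7) for i in range(len(wavenumbers))]  # placeholder values

# Assumed informative windows in cm^-1 (illustrative only)
informative = [(2800, 3000), (900, 1800)]

selected_wn, selected_abs = keep_regions(wavenumbers, absorbances, informative)
print(f"kept {len(selected_wn)} of {len(wavenumbers)} points")
```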

Skills crucial for analytics professionals:

Kate highlighted the importance of experience and practice, noting that there are plenty of data repositories and free platforms like Python for learning. She emphasized the need for a numerate background and the ability to think in multidimensional spaces.

Addressing data openness and privacy concerns:

Kate mentioned that UEA is committed to data sharing, especially for publicly funded projects, and aims to place collected data in the public domain after publication. She acknowledged the challenges of combining data from different sources and the proprietary nature of some business data. David added that instrument vendors might be more willing to share data, while software vendors might find it difficult due to the proprietary value of their datasets.

Data Analytics Webinar Series - Part 2

CAMS Data Analytics Webinar: Tips, Traps and Trepidations for Multi-Modal Data Integration and Visualisation

Joram Matthias Posma

Senior Lecturer in Biomedical Informatics

Imperial College London

Dr Joram Posma is Senior Lecturer in Biomedical Informatics at Imperial, a former Health Data Research (HDR) UK Fellow, and co-leads the Data Science stream of the MRes in Biomedical Research alongside teaching AI/machine learning to undergraduates. Seven students have completed their PhDs under his supervision, and he has supervised 50 master's and BSc students on their research projects. His team currently consists of 3 postdocs and 4 PhD students. His background is in chemistry and engineering (undergraduate and master's), and he obtained his PhD in bioinformatics and statistics from Imperial in 2014. His group's research focuses on advancing methodologies for integrating diverse data types, including omic, medical imaging, and biomedical text data. Areas include development of multivariate regression and classification algorithms, software for the interactive and immersive visualisation of metabolic reaction networks, statistical spectroscopy methods for metabolite identification, bioinformatic and statistical workflows for the analysis of metabolic phenotyping data, integration of multi-omics data, development of domain-specific large language models, and computer vision applied to radiology. His main application area of interest is around cardiometabolic diseases and cancer, and their interface with human nutrition.

Multi-modal data integration presents an exciting and powerful opportunity to gain more comprehensive insights by combining diverse datasets (molecular, omics, imaging, text). However, the computational analysis of such data can present potential pitfalls that can lead to results that look "too good to be true".

This talk delves into common errors in statistical experimental design, presentation of results and visualisation practices that skew outcomes across different domains. It will cover the importance of selecting appropriate error metrics, correct data splitting and scaling, avoiding data leakage and overtraining, and how to present interpretable results from multi-omics data.

This talk will contain anecdotes "beyond the published paper(s)" of elements that often happen while analysing large data sets.

If you missed it, you can watch the recording here.

Q&A summary

Data Quality for Machine Learning:

Question: What data quality in terms of accuracy and reliability should be aimed for in machine learning to tackle analytical data challenges?
Response: Joram explained that the required data quality depends on the specific context. For example, in idiopathic pulmonary fibrosis, a 70% accuracy would be a significant improvement over the current 55% accuracy of pulmonologists. However, for cancer diagnosis, higher accuracy is crucial.

Training, Validation, and Test Set Split:

Question: What is a good percentage split between training, validation, and test sets?
Response: Joram mentioned that 80/20 is a common split, but it depends on the dataset size. For small datasets, a higher training set percentage is recommended. He suggested splits like 70/10/20 or 85/5/10, depending on the data size.
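
A split such as 70/10/20 can be sketched as a shuffle of the sample indices followed by slicing. The `split_indices` helper, the fixed seed and the sample count of 200 below are assumptions made for illustration. The sketch also ignores stratification and grouping, for example keeping replicates of the same sample in one set, which a real dataset may need.

```python
import random


def split_indices(n_samples, fractions=(0.7, 0.1, 0.2), seed=0):
    """Shuffle sample indices and split them into train, validation and test sets."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible

    n_train = round(n_samples * fractions[0])
    n_val = round(n_samples * fractions[1])
    train = indices[:n_train]
    validation = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # whatever remains goes to the test set
    return train, validation, test


# Hypothetical dataset of 200 samples split 70/10/20
train_idx, val_idx, test_idx = split_indices(200, (0.7, 0.1, 0.2))
print(len(train_idx), len(val_idx), len(test_idx))
```

Calling `split_indices(200, (0.85, 0.05, 0.10))` gives the larger training share suggested for smaller datasets.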

Leave-One-Out Cross-Validation:

Question: Is leave-one-out cross-validation suitable for creating training and test data?
Response: Joram stated that it depends on the dataset size. For small datasets, leave-one-out cross-validation is beneficial. For larger datasets, K-fold or Monte Carlo cross-validation might be better to ensure variability and predictive ability.
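
The difference between the two schemes can be seen in how the train and test indices are generated. The sketch below is a minimal standard-library illustration; the helper names and the sample counts are hypothetical, and model fitting is left out.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    for fold in range(k):
        test = indices[fold::k]  # every k-th shuffled index forms one fold
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test


def leave_one_out_indices(n_samples):
    """Yield (train, test) index lists where each sample is held out once."""
    for i in range(n_samples):
        yield [j for j in range(n_samples) if j != i], [i]


# Hypothetical sizes: 5-fold on 10 samples, leave-one-out on 4 samples
folds = list(k_fold_indices(10, 5))
loo = list(leave_one_out_indices(4))
print(len(folds), "folds;", len(loo), "leave-one-out splits")
```

Leave-one-out needs one model fit per sample, which is one practical reason it is usually reserved for small datasets.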

Metrics for Data Analytics Success:

Question: What metrics are used to measure the success of data analytics initiatives?
Response: Joram mentioned metrics like accuracy, confusion matrix, F1 scores, precision, recall, and sensitivity for classification problems. For regression problems, different metrics are needed. He emphasized understanding the limitations of each metric.
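
The classification metrics listed above can all be derived from the four counts in a two-class confusion matrix. The sketch below uses invented labels for an imbalanced problem to show how accuracy alone can look better than recall suggests. Note that recall and sensitivity are two names for the same quantity. The `binary_metrics` helper is hypothetical.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Confusion-matrix counts and common metrics for a two-class problem."""
    pairs = list(zip(y_true, y_pred))
    if not pairs:
        raise ValueError("no samples to score")
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0  # also called sensitivity
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Invented labels: 2 positives among 10 samples, one of which is missed
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Here the accuracy is high because negatives dominate, yet half of the positive cases are missed, which is the kind of limitation worth checking for each metric.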

Combining Multimodal Data:

Question: How to combine multimodal datasets (text, images, time series) for a single model and ensure no data type is given preference?
Response: Joram suggested using neural networks to give different weights to different datasets and creating a latent variable space for shared feature explanation. He also mentioned transforming data into a common format for fair comparison.

Balance Between Human Expertise and AI:

Question: How do you see the balance between human expertise and AI-driven analytics in academic research in the next 5-10 years?
Response: Joram highlighted the importance of human expertise in using AI tools correctly and understanding their limitations. He noted that while AI has advanced rapidly, human knowledge and responsibility remain crucial.