Are Swimming’s Fitness and Competitive Industries Data-Fit for AI? – Part 1

Published on February 11, 2025
Edited on May 29, 2025
Introduction
Data-driven insights have revolutionized many sports, enabling precise training plans, improved injury prevention, and real-time performance feedback. Yet, in the realm of swimming—a sport where milliseconds matter—the quality and structure of data remain significant challenges. How can AI and ML help us unlock better outcomes, and what risks arise when data quality is ignored?
This first instalment of our two-part series offers a literature-based review on preparing data for AI in sports, with references drawn from AI/ML research fields and applied to swimming-specific scenarios. Our goal is to bridge the gap between what AI systems need and how swimming can provide it. We’ll explore the foundations of data quality, the dangers of poor data management, and the key pillars necessary for building robust, AI-ready datasets. By the end of this review, you’ll understand why well-structured, high-quality data is essential to building a foundation for advanced analytics, enabling better decision-making and performance gains in the pool.
Sections Covered in Part 1:
- Section 1: Why Data Quality Is Essential for ML/AI
We outline the core reasons why high-quality, well-managed data is indispensable for AI and ML applications, especially in performance-critical sports like swimming.
- Section 2: The Barriers, Pitfalls, and Challenges of Poor-Quality Data
This section highlights the practical consequences of poor data practices, including biased models, flawed training strategies, and wasted resources.
- Section 3: Core Foundations for Ensuring High-Quality Data in AI/ML
We present the key pillars of reliable data management, from intrinsic and contextual data quality to ethical compliance, all of which are crucial for creating trustworthy AI outcomes.
Section 1: Why Data Quality Is Essential for ML/AI — “The Engine of AI”
Imagine you’re fuelling an engine: if the fuel is low-grade or contaminated, you’ll never get peak performance. Data works the same way for Machine Learning (ML) and Artificial Intelligence (AI). In the world of sports, especially swimming, accurate data is the lifeblood powering modern analytics, performance tracking, and decision-making. Poor-quality or incomplete data can mislead even the most advanced AI systems, potentially derailing training plans and competitive outcomes.
Below are key reasons why data quality is vital for any AI-driven application:
- Model Accuracy and Reliability
High-quality data ensures AI models deliver precise, reliable predictions. In swimming, consistent and accurate data on metrics like stroke counts, lap splits, and heart-rate variability enables coaches and athletes to trust AI-generated insights. On the other hand, poor data can lead to unreliable models and flawed training regimens (Priestley et al., 2023; Qayyum et al., 2020).
- Avoidance of Data Cascades
Data errors can propagate throughout the ML pipeline, creating a cascade effect where small initial mistakes amplify into larger problems. For instance, consistently misrecording lap times may distort pace analysis, fatigue predictions, and race strategies, leading to costly inefficiencies (Sambasivan et al., 2021; Polyzotis et al., 2018).
- Bias and Fairness
Biased or incomplete data, especially in competitive sports, can result in skewed insights and inequitable outcomes. For example, training data limited to certain swimmer demographics or conditions may exclude key factors, creating models that favour some athletes over others. Ensuring diverse, representative data helps reduce bias and improve generalization (Zhou et al., 2024; Qayyum et al., 2020).
- Data Cleaning and Preparation
Effective data cleaning removes noise, corrects inconsistencies, and addresses missing values. Think of it as maintaining a pool’s water quality—without proper cleaning, swimmers’ performance and AI insights suffer. Clean data ensures models can adapt to new and evolving conditions (Polyzotis et al., 2018; Priestley et al., 2023).
- Domain-Specific Requirements
Each sport comes with unique metrics and requirements. In swimming, monitoring metrics like stroke frequency, rest intervals, and underwater phases is essential. Tailoring data quality checks to these specifics ensures AI outputs address real-world performance needs (Priestley et al., 2023; Rangineni, 2023).
- Continuous Monitoring and Management
Data collection doesn’t stop after a model is trained. Swimmers’ performance evolves, new athletes join programs, and sensors may change over time. Ongoing monitoring of incoming data ensures AI tools remain accurate and relevant (Bangad et al., 2024; Zhou et al., 2024).
- Comprehensive Data Quality Management
Managing large volumes and varieties of training data—such as lap counts, biometric readings, and video analytics—requires robust, scalable processes. A clear data quality strategy addresses volume, variety, and velocity to maintain consistency across the ML lifecycle (Rangineni, 2023; Priestley et al., 2023).
- Ethical and Legal Considerations
Collecting performance and health metrics raises ethical concerns, especially around privacy and compliance. High data quality standards, secure management, and adherence to ethical guidelines help organizations meet legal obligations (Qayyum et al., 2020; Zhou et al., 2024).
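The kind of intake-time quality check described above can be sketched in a few lines of Python. This is a minimal illustration rather than a production pipeline: the field names (`lap_time_s`, `stroke_rate_spm`, `hr_bpm`) and the plausibility bounds are hypothetical placeholders that a real program would replace with thresholds agreed with coaches and sports scientists.

```python
# Minimal data-quality check for swim training records.
# Field names and bounds below are illustrative, not a real standard.

# Plausible physical bounds per metric; real values come from domain experts.
BOUNDS = {
    "lap_time_s": (10.0, 120.0),      # seconds per lap
    "stroke_rate_spm": (10.0, 80.0),  # strokes per minute
    "hr_bpm": (30.0, 220.0),          # heart rate
}

def check_record(record):
    """Return a list of quality issues found in one training record."""
    issues = []
    for field, (lo, hi) in BOUNDS.items():
        value = record.get(field)
        if value is None:
            issues.append(f"{field}: missing")
        elif not (lo <= value <= hi):
            issues.append(f"{field}: {value} outside [{lo}, {hi}]")
    return issues

records = [
    {"lap_time_s": 31.2, "stroke_rate_spm": 42.0, "hr_bpm": 158.0},
    {"lap_time_s": 3.1, "stroke_rate_spm": None, "hr_bpm": 161.0},  # bad row
]

for i, rec in enumerate(records):
    for issue in check_record(rec):
        print(f"record {i}: {issue}")
```

Running checks like this at ingestion time, before records ever reach model training, is what stops a single bad sensor reading from quietly distorting downstream analysis.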
Data quality is the foundation of successful ML/AI systems. Accurate, comprehensive, and well-managed data drives more reliable models, fostering trust among coaches, athletes, and stakeholders. Treating data as the “fuel” of AI applications ensures more equitable outcomes, whether in training facilities, research labs, or global competitions.
Section 2: The Barriers, Pitfalls, and Challenges of Poor-Quality Data
In sports analytics, poor data quality is more than just a minor setback—it can derail training programs, waste valuable resources, and erode trust in AI-driven insights. From coaches tracking turn times to sports scientists analysing large sensor datasets, understanding these key pitfalls is crucial to ensuring reliable outcomes.
- Model Performance Degradation
AI models rely on accurate, complete data to learn and make predictions. When fed missing or incorrect data—such as inaccurate lap splits or mislogged stroke counts—models produce unreliable predictions. This can result in suboptimal pacing strategies or even increased injury risk if athletes are pushed beyond safe limits (Priestley et al., 2023; Qayyum et al., 2020).
- Data Cascades
Small data errors at the start of the pipeline can snowball into larger issues downstream. For example, a heart-rate monitor that incorrectly records frequent spikes could trigger “false alarms” about an athlete’s health, leading to unnecessary changes in training plans. These cascades reduce confidence in AI systems and can compromise athlete well-being (Sambasivan et al., 2021; Polyzotis et al., 2018).
- Bias and Fairness Issues
Poor data quality often stems from incomplete datasets that fail to represent diverse athlete populations. When models are trained on limited data—such as metrics from only elite swimmers—they may produce advice that is irrelevant or even harmful for youth or masters-level athletes. Inclusive and representative data collection is key to mitigating bias (Zhou et al., 2024; Qayyum et al., 2020).
- Lack of Standardized Metrics
Without standardized methods for recording key metrics (e.g., stroke rate or lap segment times), comparing data across teams or studies becomes difficult. Inconsistent definitions can create confusion when adopting AI solutions, slowing progress and amplifying errors across applications (Priestley et al., 2023).
- Data Poisoning and Security Risks
When data is poorly managed, it becomes vulnerable to tampering or malicious attacks. In sports, altered performance data could mislead scouts, skew rankings, or even affect betting markets. Implementing robust validation and security measures helps prevent such data poisoning risks (Qayyum et al., 2020).
- Resource Constraints and Documentation Issues
Under-resourced teams and unclear data collection protocols often lead to avoidable errors. For instance, poorly documented sensor calibration procedures can result in mislabelled data, which later requires extensive effort to correct. Over time, these resource gaps compound inefficiencies (Sambasivan et al., 2021).
- Ethical and Legal Challenges
Handling sensitive athlete data—including biometric or health-related metrics—requires strict compliance with privacy regulations. Sloppy data management could lead to non-compliance, legal issues, and damage to the trust between athletes and staff (Qayyum et al., 2020; Zhou et al., 2024).
- Operational Inefficiencies
Poor data quality can significantly slow progress by requiring constant cleanup and validation. Time spent “firefighting” bad data could be better used to develop advanced training strategies or run additional experiments (Priestley et al., 2023).
- Training and Education Gaps
Many sports organizations lack proper training in data collection, management, and ethics. Without this foundational knowledge, teams may inadvertently introduce errors into datasets, creating further challenges in scaling AI solutions (Zhou et al., 2024).
- Generalization and Representativeness
Models trained on narrow datasets often struggle to generalize across different contexts. For example, a model trained exclusively on elite swimmers may offer little value for youth or masters athletes, necessitating expensive data collection and retraining (Priestley et al., 2023; Rangineni, 2023).
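To make the cascade effect above concrete, the following back-of-the-envelope sketch shows how a small systematic timing error compounds once it feeds a race projection. All numbers are illustrative, not real athlete data.

```python
# How a small systematic timing error cascades into a race-time projection.
# All numbers are illustrative.

true_lap_s = 16.0   # actual average 25 m lap time
bias_s = 0.3        # systematic recording error per lap
laps_1500m = 60     # 1500 m swum in a 25 m pool

recorded_lap_s = true_lap_s + bias_s

true_projection = true_lap_s * laps_1500m        # 960.0 s
biased_projection = recorded_lap_s * laps_1500m  # 978.0 s

print(f"projection error: {biased_projection - true_projection:.1f} s")
# A 0.3 s per-lap error grows to an 18.0 s error over 1500 m,
# enough to misjudge the pacing for an entire race plan.
```

The point is not the arithmetic itself but the scaling: an error too small to notice on any single lap becomes decisive once it is multiplied across a whole race, a season, or a training dataset.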
Poor data quality presents significant challenges for AI adoption in sports. From degraded model performance and ethical risks to operational delays, these pitfalls underscore the need for robust, well-documented, and secure data pipelines. By addressing these challenges, organizations can ensure that coaches, scientists, and support staff can trust AI insights—ultimately leading to better training strategies and more equitable outcomes.
Section 3: Core Foundations for Ensuring High-Quality Data in AI/ML
Achieving high-quality data is no accident—it requires intentional strategies and meticulous processes. In sports, especially swimming, data comes from a variety of sources such as lap times, stroke counts, and physiological metrics. To ensure AI models deliver reliable insights, each data point must be accurate, relevant, and contextually meaningful. Below are the key pillars supporting effective data collection, management, and use.
- Intrinsic Data Quality
Intrinsic quality focuses on ensuring the data itself is accurate, consistent, and complete. In swimming, even a small inaccuracy—such as a misrecorded lap time—can distort training recommendations and affect athletes’ outcomes. To achieve high intrinsic quality, sensors like timing pads and wearable devices should undergo regular calibration. Periodic spot checks, such as comparing automated data with video reviews, help validate the accuracy of key metrics. Automated systems that flag outliers, like stroke rates exceeding physical limits, are also critical (Priestley et al., 2023; Rangineni, 2023). These combined measures ensure the data remains trustworthy for AI analysis.
- Contextual Quality
Contextual quality ensures that data is relevant, timely, and suitable for its intended AI task. For example, training data gathered from short-course pools may not be applicable to open-water swimming, making segmentation essential. To maintain contextual relevance, teams should clearly define data collection objectives, such as improving starts, turns, or overall endurance. Data should be classified based on conditions like pool size or altitude to provide contextually meaningful insights. Moreover, as training needs evolve, so should data collection processes to keep them aligned with current goals (Priestley et al., 2023; Zhou et al., 2024).
- Representational Quality
Representational quality focuses on consistent and interpretable data formats across teams and systems. Without standardization, performance data can be misinterpreted—such as when different teams label a 50-meter lap as “50 Free” or “FC_50.” Adopting standardized naming conventions and maintaining a shared data schema across teams help mitigate these issues. Teams should also use metadata to document details about when and how data was collected (Priestley et al., 2023). These measures prevent confusion and improve collaboration between internal and external stakeholders.
- Accessibility
Accessibility ensures that data is available to authorized users while safeguarding privacy. Coaches, sports scientists, and athletes often need real-time access to performance data to adjust training. Secure cloud-based systems with role-based access control can provide access without compromising security. Additionally, user-friendly dashboards designed for non-technical users allow for broader accessibility. For sensitive athlete data, encryption should be enforced to meet privacy regulations (Zhou et al., 2024). These measures help balance data availability and privacy while supporting effective decision-making.
- Data Lifecycle Management
Data lifecycle management oversees data from collection to processing, storage, analysis, and eventual archiving or deletion. Traceability is key—without it, errors can be introduced into the AI pipeline unnoticed. Maintaining thorough documentation, including details such as collection dates and sensor calibration logs, helps preserve data integrity. Periodic reviews are essential to remove outdated or irrelevant data while maintaining focus on quality datasets (Rangineni, 2023; Priestley et al., 2023). Backup and disaster recovery strategies further ensure long-term data reliability.
- Ethical and Legal Compliance
Ethical and legal compliance is crucial when handling sensitive data, particularly in sports where biometric and health data are involved. Athletes trust that their personal information will be protected and used responsibly. To uphold this trust, teams should anonymize athlete data where possible and ensure that data usage complies with relevant laws, such as the GDPR. Obtaining informed consent from athletes before collecting and using their data is also essential (Qayyum et al., 2020; Zhou et al., 2024). Failure to adhere to these guidelines risks legal repercussions and reputational harm.
- Continuous Monitoring and Improvement
Continuous monitoring ensures that data quality is maintained over time as performance data evolves. Swimming programs often introduce new metrics and technologies, making ongoing validation important. Automated validation scripts can detect anomalies, such as unusually short or long lap times, before they affect analyses. Periodic audits help maintain completeness and integrity, while feedback loops involving coaches and athletes allow for the prompt resolution of discrepancies (Bangad et al., 2024; Zhou et al., 2024). This proactive approach helps maintain a dynamic and reliable data pipeline.
- Integration of Domain Knowledge
Domain knowledge integration leverages the expertise of coaches, sports scientists, and athletes to interpret and validate data effectively. Anomalies, such as a sudden spike in heart rate, may have simple explanations like sensor malfunctions or environmental conditions. Domain experts can distinguish between real issues and equipment errors, preventing unnecessary model adjustments. Collaborating with coaches on data collection protocols and validating AI-driven recommendations against real-world experiences enhances the reliability of the insights generated (Rangineni, 2023). This iterative process ensures that data-driven decisions align with practical experience.
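Two of the pillars above, representational consistency and automated validation, can be illustrated with a short sketch. The label spellings, the canonical event name, and the lap-time bounds here are hypothetical examples, not an established standard; a real alias table and real thresholds would be agreed across teams.

```python
# Sketch: normalize inconsistent event labels and flag implausible lap times
# before data enters an analytics pipeline. Labels and bounds are examples.

# Map team-specific labels onto one canonical schema.
EVENT_ALIASES = {
    "50 Free": "freestyle_50m",
    "FC_50": "freestyle_50m",
    "50m FR": "freestyle_50m",
}

def normalize_event(label):
    """Return the canonical event name, or raise on an unknown label."""
    try:
        return EVENT_ALIASES[label]
    except KeyError:
        raise ValueError(f"unknown event label: {label!r}")

def flag_lap_times(lap_times_s, lo=10.0, hi=120.0):
    """Return the indices of lap times outside a plausible range."""
    return [i for i, t in enumerate(lap_times_s) if not lo <= t <= hi]

# Different teams' labels now resolve to the same event.
assert normalize_event("FC_50") == normalize_event("50 Free")

# The third lap time is physically implausible and gets flagged.
print(flag_lap_times([31.4, 30.9, 2.5, 31.1]))
```

Raising an error on unknown labels, rather than silently passing them through, is a deliberate choice: it forces the alias table to be extended explicitly, which keeps the shared schema the single source of truth.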
By focusing on these core foundations—intrinsic and contextual quality, representational consistency, accessibility, lifecycle management, compliance, continuous monitoring, and domain expertise—organizations can establish trustworthy data pipelines. For swimming professionals, this translates into better training regimens, more accurate athlete feedback, stronger engagement, fewer injuries, and superior competitive performance.
Summary
In this first part, we’ve explored the core principles of data quality and shown how poor data can derail even the most advanced AI projects. Sloppy or incomplete records don’t just stall innovation—they can actively mislead coaches, athletes, and analysts. But how do these concepts apply to swimming’s current data landscape?
In the next instalment, we’ll dive into the practical realities of managing swimming training session data, highlighting areas where the industry excels and where improvements are needed. We’ll also discuss the opportunity for a unified framework designed to enhance data management across all levels of the sport. Finally, we’ll answer the key question: Is the swimming fitness and competitive industry data fit for AI? Stay tuned for a closer look at how we can harness AI to drive better outcomes for swimmers at every level.
References:
Priestley, Maria & O’Donnell, Fionntán & Simperl, Elena. (2023). A Survey of Data Quality Requirements That Matter in ML Development Pipelines. Journal of Data and Information Quality. 15. 10.1145/3592616.
Bangad, Nikhil & Jayaram, Vivekananda & Sughaturu Krishnappa, Manjunatha & Banarse, Amey & Bidkar, Darshan & Nagpal, Akshay & Parlapalli, Vidyasagar. (2024). A Theoretical Framework for AI-Driven Data Quality Monitoring in High-Volume Data Environments. International Journal of Computer Engineering & Technology. 15. 618-636. 10.5281/zenodo.13878755.
Zhou, Yuhan & Tu, Fengjiao & Sha, Kewei & Ding, Junhua & Chen, Haihua. (2024). A Survey on Data Quality Dimensions and Tools for Machine Learning (Invited Paper). 120-131. 10.1109/AITest62860.2024.00023.
Polyzotis, Neoklis & Roy, Sudip & Whang, Steven & Zinkevich, Martin. (2018). Data Lifecycle Challenges in Production Machine Learning: A Survey. ACM SIGMOD Record. 47. 17-28. 10.1145/3299887.3299891.
Qayyum, Adnan & Qadir, Junaid & Bilal, Muhammad & Al-Fuqaha, Ala. (2020). Secure and Robust Machine Learning for Healthcare: A Survey. IEEE Reviews in Biomedical Engineering. PP. 1-1. 10.1109/RBME.2020.3013489.
Neutatz, Felix & Chen, Binger & Abedjan, Ziawasch & Wu, Eugene. (2021). From Cleaning before ML to Cleaning for ML.
Sambasivan, Nithya & Kapania, Shivani & Highfill, Hannah & Akrong, Diana & Paritosh, Praveen & Aroyo, Lora. (2021). “Everyone wants to do the model work, not the data work”: Data Cascades in High-Stakes AI. 1-15. 10.1145/3411764.3445518.
Roh, Yuji & Heo, Geon & Whang, Steven. (2019). A Survey on Data Collection for Machine Learning: A Big Data - AI Integration Perspective. IEEE Transactions on Knowledge and Data Engineering. PP. 1-1. 10.1109/TKDE.2019.2946162.
Whang, Steven & Roh, Yuji & Song, Hwanjun & Lee, Jae-Gil. (2023). Data collection and quality challenges in deep learning: a data-centric AI perspective. The VLDB Journal. 32. 10.1007/s00778-022-00775-9.
Rangineni, Sandeep. (2023). An Analysis of Data Quality Requirements for Machine Learning Development Pipelines Frameworks. International Journal of Computer Trends and Technology. 71. 16-27. 10.14445/22312803/IJCTT-V71I8P103.