Domain D6: Lawful, quality, traceable data for AI

What is Data Governance and Management?

AI data governance is the discipline of ensuring that data used to train, fine-tune, evaluate, or operate AI systems is lawful in source, fit for purpose in quality, traceable in lineage, retained appropriately, and protected from unauthorised access throughout its lifecycle.

AI failures more often trace back to data than to model architecture. Bias in outputs reflects bias in training data; poor performance reflects unrepresentative data; legal exposure reflects unclear consent or licensing. AI data governance applies general data governance disciplines (lineage, quality, retention, access control) with extra emphasis on issues that matter most for AI: provenance, consent for training use, representativeness across groups, and protection of sensitive inputs to operational AI.

Specific obligations have hardened. Article 10 of the EU AI Act sets quality criteria for training, validation, and test data. GDPR governs personal data throughout. Sector rules add further duties (financial services, healthcare). For LLMs, copyright and licensing of training corpora have become a live legal question. Operational data fed to AI (e.g. user prompts) raises confidentiality and retention issues that traditional data governance often did not anticipate.

In the Veridio framework, D6 contains six principles covering training data provenance and consent, data quality and representativeness, data minimisation for AI, data lineage and reproducibility, retention and deletion, and access control. It sits primarily at tier 1 and tier 2 because data quality is foundational at every maturity level.

Frequently asked

Common questions about data governance & management

What does AI data provenance mean?

It means being able to show, for every dataset used to train, fine-tune, or evaluate an AI system: the source it came from; the date of collection; the lawful basis for processing (for personal data); the licensing terms; any consent obtained; and the chain of custody to the version actually used. Provenance is the precondition for defensible AI systems under the EU AI Act, GDPR, and copyright law.
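The elements listed above map naturally onto a structured record. The following is a minimal sketch, assuming a Python dataclass; the class name `ProvenanceRecord`, its field names, and the `missing_fields` check are all hypothetical illustrations and are not taken from the Veridio templates.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class ProvenanceRecord:
    """Hypothetical provenance record; field names are illustrative only."""
    dataset_name: str
    version: str
    source: str
    collected_on: date
    licence_terms: str
    contains_personal_data: bool
    lawful_basis: Optional[str] = None        # expected only for personal data
    consent_reference: Optional[str] = None   # pointer to consent evidence, if any
    chain_of_custody: Tuple[str, ...] = ()    # ordered handling steps up to this version

    def missing_fields(self) -> List[str]:
        """List the provenance elements that are still undocumented."""
        missing = []
        for name in ("dataset_name", "version", "source", "licence_terms"):
            if not getattr(self, name).strip():
                missing.append(name)
        if self.contains_personal_data and not self.lawful_basis:
            missing.append("lawful_basis")
        if not self.chain_of_custody:
            missing.append("chain_of_custody")
        return missing


# Example with made-up values: personal data but no lawful basis recorded yet.
record = ProvenanceRecord(
    dataset_name="support-tickets",
    version="2024-03",
    source="internal CRM export",
    collected_on=date(2024, 3, 1),
    licence_terms="internal use only",
    contains_personal_data=True,
)
print(record.missing_fields())  # ['lawful_basis', 'chain_of_custody']
```

A check like this could run before a dataset is admitted to a training pipeline, so that gaps in provenance surface early rather than at audit time.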

How do you ensure training data is representative?

Define the population the AI system will operate on; analyse training data against that population for under- or over-representation across relevant groups (demographic, geographic, behavioural); document the gaps; and either re-balance the data, restrict the system's use to populations where it performs well, or add controls (lower confidence thresholds, human review) where representativeness is poor.
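The comparison step can be sketched as a simple share-versus-target check. This is an illustrative example only: the function name `representation_gaps`, the 5 percentage point tolerance, and the regional groups and figures are assumptions made for the sketch, not values from the framework.

```python
from collections import Counter
from typing import Dict, Iterable


def representation_gaps(
    samples: Iterable[str],
    target_shares: Dict[str, float],
    tolerance: float = 0.05,  # assumed threshold; set per use case
) -> Dict[str, float]:
    """Return observed share minus target share for groups outside tolerance.

    Positive values mean over-representation, negative values mean
    under-representation. Groups absent from target_shares are ignored.
    """
    counts = Counter(samples)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no samples supplied")
    gaps = {}
    for group, target in target_shares.items():
        observed = counts.get(group, 0) / total
        diff = observed - target
        if abs(diff) > tolerance:
            gaps[group] = round(diff, 3)
    return gaps


# Made-up example: group label of each training row vs. the operating population.
training_groups = ["north"] * 60 + ["south"] * 28 + ["east"] * 12
population = {"north": 0.50, "south": 0.30, "east": 0.20}
gaps = representation_gaps(training_groups, population)
print(gaps)  # {'north': 0.1, 'east': -0.08}
```

The returned gaps are what would then be documented and acted on: by re-balancing, by restricting use, or by adding controls for the under-represented groups.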

Can personal data be used to train AI under GDPR?

Yes, but only with a lawful basis (typically legitimate interests with appropriate safeguards, or consent), with transparency information provided to data subjects, with appropriate technical and organisational measures, and with respect for the data minimisation principle. Special category data (health, biometric, etc.) also requires an Article 9 condition. Document the analysis before training.

How long should AI training data be retained?

Long enough to satisfy reproducibility and audit requirements (typically the lifetime of the model plus a defined post-retirement period), but no longer than necessary for the original purpose. Personal data must follow the GDPR storage limitation principle. Many organisations retain training datasets for the model lifetime plus three years.
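As a worked example of the "model lifetime plus a defined post-retirement period" rule, the sketch below computes a review date from a model's retirement date. The function names and the three-year default are assumptions drawn from the common practice mentioned above; personal data may need to be deleted earlier under the storage limitation principle, which this sketch does not model.

```python
from datetime import date


def retention_end(model_retired_on: date, post_retirement_years: int = 3) -> date:
    """Date on which the training dataset becomes due for deletion review."""
    target_year = model_retired_on.year + post_retirement_years
    try:
        return model_retired_on.replace(year=target_year)
    except ValueError:
        # 29 February in a non-leap target year: fall back to 28 February.
        return model_retired_on.replace(year=target_year, day=28)


def is_due_for_deletion(
    model_retired_on: date, today: date, post_retirement_years: int = 3
) -> bool:
    """True once the post-retirement retention period has elapsed."""
    return today >= retention_end(model_retired_on, post_retirement_years)


# Made-up example: a model retired on 30 June 2025.
print(retention_end(date(2025, 6, 30)))                         # 2028-06-30
print(is_due_for_deletion(date(2025, 6, 30), date(2028, 6, 29)))  # False
```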

What templates support AI data governance?

The D6 bundle includes the AI Data Provenance Record, Training Data Quality Standard, AI Data Retention Schedule, AI Access Control Matrix, and the AI Data Processing Impact Assessment. The templates are available individually or as a bundle at templates.veridio.co.uk.
