Sensitive information such as personal data presents data analysts with legal and technical challenges. Preparing data in compliance with data protection regulations – from anonymisation to test data management – is becoming a key skill. This article highlights proven strategies, tools and technologies for preparing sensitive data for use in AI projects whilst ensuring compliance with the GDPR.
Good data quality forms the foundation of every successful AI application. Poor-quality data inevitably leads to flawed models and unreliable results. Experience shows that experts still spend around 80% of their time on data preparation and only 20% on analysis and decision-making. The data pre-processing process can be extensive and complex, consisting of several steps.
By using AI-supported tools, this effort can be drastically reduced, leaving more time for important analysis tasks. Modern AI pipelines typically go through the following steps: data collection, profiling, cleansing, harmonisation and transformation. Typical tasks include collecting raw data, identifying and removing duplicates or erroneous entries, and standardising formats. Each of these sub-processes increases the reliability of the data for machine learning models. If these steps are implemented consistently, the company lays the foundation for predictive analytics and AI-supported automation.
When processing sensitive data, companies face several challenges relating to data protection and data quality. Firstly, personal data (e.g. name, address, national insurance number) carries legal risks. Under the GDPR, this data must be either anonymised or at least pseudonymised in order to be processed further.
Anonymisation vs. pseudonymisation
These are two distinct methods for reducing personal identifiers in datasets, each with different legal and technical implications:
Furthermore, a large proportion of the data collected remains unused within the organisation (dark data). Gartner estimates that information which is routinely generated but not used (e.g. old log files or emails) accounts for more than half of data holdings.
Furthermore, poor data quality undermines AI results. Common issues include missing values, typos or format inconsistencies.
Ensuring consistent, complete datasets is therefore essential. When dealing with sensitive data, data engineers must also ensure that subsequent merges do not lead to re-identification. Overall, handling sensitive data requires particular care: in addition to the significant effort involved in data cleansing, data protection entails technical and organisational hurdles.
A wide variety of anonymisation methods are used when handling sensitive data. Important methods include, for example:
Modern anonymisation tools often combine these techniques to optimise data protection and data quality. It is always important to strike an appropriate balance: the data must be anonymised sufficiently so that individuals cannot be traced back, yet remain meaningful for AI analyses.
For test data management and data masking, it is best to use specialised tools. These offer dedicated functions designed for the task and thus speed up the work. One such tool is, for example, IRI Voracity (JetSoftware). It is an integrated data management platform that offers specific functions for test data management (TDM) and data protection. It combines data discovery (profiling, classification), data integration, data migration, data cleansing, data governance and analytics within an Eclipse-based development environment (Fig. 1).
Voracity covers all key methods of test data management, i.e. from pure data synthesis through to subsetting and data virtualisation. Within the platform, developers can create intelligently distributed test data sets without having to access original data. The following use cases and methods are supported:
These methods allow test databases to be populated securely and realistically for functional tests, stress tests or continuous integration processes. The data can originate from a wide variety of sources, ranging from mainframe systems and relational database systems (Oracle, SQL Server, DB2, PostgreSQL) to modern big data formats (Kafka, NoSQL). Users employ the visual IRI Workbench (Eclipse) to design jobs, which are then executed without programming via the engines running in the background.
| Tool | Task |
| IRI Field | Shield for structured data (databases, CSV/fixed-file tables), offers profiling, detection and a variety of masking algorithms (e.g. format-preserving encryption, pseudonymisation, permutation, randomisation) |
| IRI DarkShield | for semi-structured and unstructured data sources (text, documents, JSON, BLOBs, images), identifies personal data (e.g. names, account numbers) in files and media, and removes or anonymises it. |
| IRI CellShield | for Excel tables masks or encrypts personal data columns and generates audit trails of changes |
Table 1: Data masking tools in Voracity.
The tools operate using metadata and rules. Fields can be automatically located through scanning and classification. Suitable transformations can then be selected for each field type (e.g. hashing, replacement with fictitious data, obfuscation). The result is anonymous datasets that are statistically and structurally similar to the original data but no longer contain any sensitive information. In practice, this means: production data is mirrored, for example, for a development copy, whereby all personal data fields are replaced according to defined masking strategies. The anonymised copy can then be securely shared or tested. A key feature is the preservation of referential integrity during masking.
Referential integrity and data masking
In a database context, referential integrity refers to the consistency of relationships between tables – typically via foreign keys. If, for example, a customer number in an order table refers to an entry in the customer table, this relationship must not become invalid as a result of data changes. In the context of data masking, this means: When sensitive data (such as customer numbers, ID card numbers or National Insurance numbers) is replaced, encrypted or pseudonymised, these transformations must be applied consistently across all affected tables. Otherwise, foreign key relationships would ‘break’ – for example, an order table might refer to a customer entry that no longer exists or has been masked differently. An example:
Referential integrity in masking means that the relational logic of the original data is preserved even after anonymisation – a crucial aspect for test data, migrations or data analysis.
In this way, customer or social security numbers, for example, can be pseudonymised consistently without ‘breaking’ databases. Further key features of data masking in Voracity include auditing and compliance support. The applications log which masking rules have been applied to demonstrate compliance with audit requirements.
In this section, we provide an insight into the tasks and typical approach to data masking. The tool used here is IRI FieldShield, typically via the IRI Workbench (an Eclipse-based development environment). The graphical user interface enables users to visually identify sensitive fields in structured data sources and transform them in a targeted manner. Using wizards and context menus, masking rules such as encryption, hashing, tokenisation or pseudonymisation can be configured on a field-by-field basis – without the need for in-depth programming knowledge. Particularly convenient is the automatic detection of personal data fields (e.g. names, addresses, account numbers), which are identified and marked by rule-based classifiers. Typical masking functions include, for example:
The masking jobs created are saved as reusable scripts and can be executed directly from the Workbench or integrated into automated workflows (Fig. 3).
In addition, FieldShield offers a Java and .NET SDK, which can be used to integrate masking functions into custom applications, microservices or middleware. This allows data masking to be implemented both manually via the GUI and fully automatically as part of ETL processes (ETL = Extract, Transform, Load) or CI/CD pipelines. An example of data masking could be illustrated as shown in Table 2 & 3.
| Data source: customers.csv | ||
| Column overview: | ||
| Customer number | => | Tokenisation |
| Name | => | Pseudonymisation |
| Address | => | Masking (partial string) |
| IBAN | => | AES-256 encryption |
| Customer status | => | No masking |
Preview
| Customer number | Name | Address | IBAN |
| KX34Z | Peter M. | ***** Street 10 | ************78 |
| MF22Q | Julia R. | ***** Avenue 55 | ************56 |
Tables 2 & 3: Example of data masking for customer data.
The graphical user interface allows for the targeted selection and transformation of columns. In the preview, the user can immediately see what the masked data looks like.
In the EU, the GDPR regulates the handling of personal data very strictly. For AI projects, this means: Privacy by Design and Privacy by Default. It must be ensured right from the design stage that as little personal information as possible is processed. Pseudonymised data is still considered personal data – only true anonymisation removes the data protection obligations. The Federal Data Protection Act and the GDPR require an explicit legal basis (e.g. consent or contract) for every use of data during the development process.
Particularly in software development and testing, the Federal Commissioner for Data Protection and Freedom of Information (BfDI)[4] emphasises that no special rules apply here: under Article 25 of the GDPR, processing must be designed in such a way “that, as a rule, no personal data or less personal data is processed than during processing in production”. In other words: DevOps teams should, as a rule, only use anonymised or depersonalised data. All data protection principles (purpose limitation, data minimisation, storage limitation, etc.) remain fully applicable. Therefore, best practice involves: always identifying whether data can be anonymised; maintaining audit logs of data origins; storing test data in encrypted form; and restricting access to the data. Only in this way can AI projects comply not only with technical but also with legal requirements. Data protection authorities emphasise that, when it comes to test data, the most data-protection-friendly option must always be chosen. According to the BfDI, the following sequence of checks applies: First, it should be checked whether tests can be carried out using anonymous data (completely devoid of any personal reference). If this is not possible, pseudonymised data should be used. Only if neither anonymised nor pseudonymised data is feasible may unaltered personal data be used in exceptional cases. In practice, this means that synthetic or depersonalised test data should always be used wherever possible.
A test data catalogue is also recommended as best practice: teams define once and for all the fields that need to be obfuscated in the test and specify which data exchange formats (CSV, JSON, etc.) are to be used. Automated pipelines generate test sets according to the same scheme as required. Such standardised test data processes enable the iterative development of CI/CD pipelines in which automated testing is also possible in compliance with data protection regulations. Careful administration of the test data (versioning, encryption of the test database) rounds off the management process.
Incomplete or inconsistent data sets cause incorrect AI results (“garbage in, garbage out”). Thus, inadequate cleaning inevitably leads to faulty models and decisions. Even seemingly harmless changes can be critical: crude anonymisation methods that remove essential data points severely limit the usefulness of the data. Sources of error can arise from typos, missing attributes, delayed updates or unnoticed format changes. Outdated metadata (dark data) often conceals information that is missing from analyses. Furthermore, manual steps or unnoticed transformations carry the risk that data will no longer be reproducible.
To avoid these errors, a number of best practices have been established: Firstly, a clear plan and defined objectives for data maintenance are required. Regular reviews and continuous data maintenance (monitoring) prevent inconsistent data from entering the system. Automated cleansing tools detect typical problems such as duplicates or invalid values and resolve them routinely. Maintaining data profiles is helpful in this regard: analysis tools such as the profiling functions in Voracity capture statistical metrics (min/max, proportion of zeroes, etc.) and flag outliers.
It is also important to maintain referential integrity during transformation (see above). This prevents violations of relationships in data schemas. Finally, comprehensive documentation and transparency are essential; that is, all data quality rules, anonymisation procedures and data flows should be documented and versioned. Organisations benefit from implementing data governance standards – ensuring that all stakeholders (developers, data scientists, data protection officers) are aware of and can control the state of the data.
Data that is both ‘clean’ and anonymised is the decisive factor for the success of AI projects. Only those who place a high value on data quality can fully exploit its potential. The use of AI in data preparation makes it possible to automate time-consuming routine tasks – allowing developers and data scientists to focus more on analysis, interpretation and strategic issues. Companies that take data governance, data protection and data maintenance seriously right from the start of a project can develop their AI models more efficiently whilst complying with legal requirements. In short: data protection and data quality are not obstacles, but rather enable precisely the trustworthy and high-performing data on which successful AI initiatives are built.