How to prepare for data anonymization in non-production environments

Oct 25, 2024 | data anonymization

One of the key tasks during the test design stage is developing test cases based on the acceptance criteria defined in the documentation. To cover all of these cases properly, we often need specific resources, including relevant test data.

What is test data? In short, it is all the information, as described in the specification of the tested application, that helps us verify whether the application meets all predefined expectations.

The scope of the generated data should enable testing of all positive, alternative and negative process flows, in order to ensure full test coverage. Good-quality test data makes tests more effective, helps to reveal possible problems and may contribute to higher software quality.

In some situations, test environments are fed with production data or data from combined domain systems. In such cases we often have to deal with sensitive data. Therefore, we should ensure ‘processing of personal data in such a manner that the personal data can no longer be attributed to a specific data subject without the use of additional information…’ (this is the definition of pseudonymisation in Art. 4(5) of the GDPR; full anonymization goes further and removes the link to the data subject irreversibly), which poses additional challenges.

Methods of test data generation

Data generated manually using simple generators

For a small number of test cases, we can generate the required test data set manually, using fake information. This can be done with simple tools available online, such as number generators, which produce various types of identifiers according to the applicable rules (e.g. NIP, REGON, PESEL, ID number, IBAN).
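To illustrate what such a number generator does, here is a minimal Python sketch that builds a syntactically valid PESEL for a fictitious person. It relies on the publicly documented PESEL layout (date of birth with a century offset on the month, a serial part, a sex digit and a control digit). The function names `fake_pesel` and `is_valid_pesel` are our own and do not come from any particular online tool.

```python
import random
from datetime import date

# Weights applied to the first ten digits when computing the control digit.
PESEL_WEIGHTS = (1, 3, 7, 9, 1, 3, 7, 9, 1, 3)

# Offset added to the month number to encode the century of birth.
CENTURY_MONTH_OFFSET = {18: 80, 19: 0, 20: 20, 21: 40, 22: 60}


def pesel_check_digit(first_ten: str) -> int:
    """Return the control digit for the first ten PESEL digits."""
    total = sum(int(d) * w for d, w in zip(first_ten, PESEL_WEIGHTS))
    return (10 - total % 10) % 10


def fake_pesel(birth_date: date, male: bool, rng: random.Random) -> str:
    """Build a syntactically valid PESEL for a fictitious person."""
    month = birth_date.month + CENTURY_MONTH_OFFSET[birth_date.year // 100]
    serial = rng.randrange(1000)  # three random serial digits
    sex_digit = rng.choice("13579" if male else "02468")  # odd = male
    first_ten = (
        f"{birth_date.year % 100:02d}{month:02d}{birth_date.day:02d}"
        f"{serial:03d}{sex_digit}"
    )
    return first_ten + str(pesel_check_digit(first_ten))


def is_valid_pesel(pesel: str) -> bool:
    """Check length, digits and the control digit (not the date itself)."""
    return (
        len(pesel) == 11
        and pesel.isdigit()
        and pesel_check_digit(pesel[:10]) == int(pesel[10])
    )


rng = random.Random(0)  # fixed seed so that test runs are repeatable
customer_pesel = fake_pesel(date(2001, 3, 15), male=True, rng=rng)
```

A number generated this way passes format validation in the tested application, but it may coincide with a number that really exists, so it should still be treated as test-only data.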

This way we can create e.g. a fake customer. We must remember, however, that generating individual resources (e.g. customers) manually is time-consuming and in many cases insufficient to ensure full test coverage, especially for very complex functionalities or systems. Moreover, if we lose access to the data, the environment is updated or errors occur, we will have to generate the data again.

To avoid errors and repeated generation, we should consider recording the manually generated test data and tracking its use. This will allow us to manage our valuable test data better.

Automated data generation

When the amount of test data required for full coverage of the test scenarios is much larger, and we plan to deploy or have already deployed automated testing tools, we should consider using these tools for automated test data generation.

Depending on the type of tests performed and the test automation tool, we can rely on dedicated solutions that address the problem of test data provisioning. Below you can find examples from several of the most popular tools.

Use of API (Application Programming Interface) testing in test environments

When using Postman, we can apply Random Key. This solution offers 24 classes of realistic-looking synthetic data. The generated classes include basic information, e.g. numbers, dates and times (in predefined formats), as well as more complex, specific data, such as credit card numbers, social security numbers, regional information (depending on the country), locations and names.

When we use Selenium, we may apply ready-made Java or C# classes. This is a simple solution in which a constant sequence of characters is supplemented by several random characters, thus creating unique data. It is not a perfect solution, as such data does not faithfully reproduce its real-life counterpart. One way to solve this problem is to use a ready-made Java or C# library, e.g. jfairy, which can generate unique data based on information sourced from Wikipedia.
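The "constant characters plus random characters" technique described above can be sketched in a few lines. The example below is written in Python rather than Java or C#, purely for illustration. The helper name `unique_value` and the `test_user_` prefix are assumptions, not part of Selenium or of any library mentioned here.

```python
import secrets
import string


def unique_value(prefix: str, suffix_length: int = 6) -> str:
    """Append random lowercase letters and digits to a constant prefix.

    The result is unique only with high probability; a longer suffix
    lowers the chance of a collision between test runs.
    """
    alphabet = string.ascii_lowercase + string.digits
    suffix = "".join(secrets.choice(alphabet) for _ in range(suffix_length))
    return f"{prefix}{suffix}"


# Values that a UI test could type into a registration form.
login = unique_value("test_user_")
email = f"{login}@example.com"
```

The constant prefix also makes such records easy to find and clean up in the test database afterwards.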

Postman and Selenium may also use the data generated manually in a file, or imported from a database, which will be uploaded into the test script.
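As a rough illustration of feeding a test script from a manually prepared file, the Python sketch below reads rows from CSV content and runs one check per row. The column names, the in-memory stand-in for the file and the `register_customer` function are all hypothetical; in a real project the rows would drive Postman requests or Selenium steps instead.

```python
import csv
import io

# Stand-in for a manually prepared file such as customers.csv (hypothetical).
CSV_CONTENT = """first_name,last_name,city,expected_status
Anna,Nowak,Gdansk,accepted
Jan,Kowalski,,rejected
"""


def load_test_rows(source):
    """Return each CSV row as a dict keyed by the header names."""
    return list(csv.DictReader(source))


def register_customer(first_name, last_name, city):
    """Hypothetical system under test: the city field is mandatory."""
    return "accepted" if city else "rejected"


rows = load_test_rows(io.StringIO(CSV_CONTENT))
for row in rows:
    actual = register_customer(row["first_name"], row["last_name"], row["city"])
    # Each row carries its own expected result, so one loop covers
    # both the positive and the negative flow.
    assert actual == row["expected_status"], row
```

Keeping the expected result next to the input data means that new scenarios can be added by editing the file, without touching the script.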

We need to remember that during the development of the tested application such solutions require constant monitoring and maintenance of scripts, which may be laborious and time-consuming.

Is it safe to use such test data?

The above solutions are absolutely safe, provided we use only synthesized fake data.

Sometimes, however, such data may be insufficient. In such cases we use e.g. the information stored in a database, including production data, in order to test various combinations of the designed scenarios. If the data in the database is real, we bear a significant risk in the event of a leak. Current regulations define which data is considered sensitive and specify the duties related to information security. These challenges are addressed by companies offering comprehensive test data management tools.

Complex test data management solutions

Tools of this kind contain numerous features and allow testers to generate various types of test data, covering a wide range of required test resources. The larger and more complex the project, the higher the demand for test data. Using the right tools streamlines the processes and increases both the effectiveness and the quality of the testers’ work.

Test data management applications offer a wide range of possibilities, starting from data generation and masking, including directly in databases. They provide access to online test data management, automation and deployment platforms. These enable the creation of database subsets that include only selected records while preserving referential integrity in various types of databases, such as CRM, ERP or financial systems.
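To make the idea of a database subset concrete, here is a minimal Python sketch using the standard `sqlite3` module. It copies selected customers together with the orders that reference them, so every foreign key in the subset still points at an existing row. The two-table schema and the `build_subset` function are assumptions made for this example and do not describe how any commercial tool works internally.

```python
import sqlite3

SCHEMA = """
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(id),
    amount REAL NOT NULL
);
"""


def build_subset(source: sqlite3.Connection, customer_ids: list[int]) -> sqlite3.Connection:
    """Copy the chosen customers and only their orders into a new database."""
    target = sqlite3.connect(":memory:")
    target.execute("PRAGMA foreign_keys = ON")  # enforce references in the subset
    target.executescript(SCHEMA)
    marks = ",".join("?" * len(customer_ids))
    customers = source.execute(
        f"SELECT id, name FROM customers WHERE id IN ({marks})", customer_ids
    ).fetchall()
    orders = source.execute(
        f"SELECT id, customer_id, amount FROM orders WHERE customer_id IN ({marks})",
        customer_ids,
    ).fetchall()
    # Parent rows go in first so that the child rows always find their customer.
    target.executemany("INSERT INTO customers VALUES (?, ?)", customers)
    target.executemany("INSERT INTO orders VALUES (?, ?, ?)", orders)
    target.commit()
    return target


# Small made-up source database standing in for a larger system.
source = sqlite3.connect(":memory:")
source.executescript(SCHEMA)
source.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "Customer A"), (2, "Customer B"), (3, "Customer C")],
)
source.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(10, 1, 99.0), (11, 1, 15.5), (12, 2, 40.0), (13, 3, 7.25)],
)
source.commit()

subset = build_subset(source, [1, 3])
```

In a real schema with many related tables, the same principle applies along every foreign key chain, which is why dedicated tools automate the traversal.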

One such comprehensive test data management solution is the Global Anonymisation Linked Loader – G.A.L.L., developed by the Soflab team.

Soflab G.A.L.L. is the most versatile and complete solution for generation of relevant amounts of test data

Soflab G.A.L.L. is an advanced, innovative and versatile tool for safe and effective large-scale data management. It enables generation of the necessary amounts of test data, based on production data, and feeding it into test environments, as well as masking (anonymization, pseudonymization) of data in production environments, including data migrated to the cloud. All of this is achieved in compliance with the GDPR.

G.A.L.L. offers an autodiscovery feature which enables automated identification of sensitive data. It helps to define the data scopes which require anonymization. The users may decide for themselves which data will be modified during the anonymization process, and to what extent. Masking preserves data integrity: the masking mechanism replaces selected real data with other data while maintaining its structure and keeping it valid. A PESEL number, even a synthesized one, will follow the original pattern. What is more, it will correspond to the new, synthesized date of birth. An insurance policy number will still be a policy number, and the city or place of residence will be replaced by another of similar size. Data integrity will be preserved in technical, substantive and business terms.
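The general idea of masking that keeps the structure of a value can be shown with a short Python sketch. It is not the G.A.L.L. mechanism, which the article does not describe in detail. It is a simplified, assumed technique in which every digit is replaced by a digit and every letter by a letter, with the replacement derived from a keyed hash so that the same input always yields the same masked output. The function name `mask_preserving_format` and the sample policy number are hypothetical.

```python
import hashlib
import hmac
import string


def mask_preserving_format(value: str, key: bytes) -> str:
    """Replace digits with digits and letters with letters, keep the rest.

    The replacement characters come from an HMAC of the whole value, so the
    mapping is deterministic for a given key: the same original value is
    masked identically in every table, which keeps joins working.
    """
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).digest()
    masked = []
    for position, char in enumerate(value):
        # Values longer than the digest reuse its bytes; fine for a sketch.
        byte = digest[position % len(digest)]
        if char.isdigit():
            masked.append(string.digits[byte % 10])
        elif char.isalpha():
            letters = string.ascii_uppercase if char.isupper() else string.ascii_lowercase
            masked.append(letters[byte % 26])
        else:
            masked.append(char)  # separators such as "-" or "/" stay in place
    return "".join(masked)


secret_key = b"demo-key"  # assumption: in practice a secret kept outside the test environment
masked_policy = mask_preserving_format("POL-2024/001234", secret_key)
```

This sketch keeps only the length and character classes. Values with built-in validation rules, such as a PESEL with its control digit and encoded date of birth, would additionally need to be regenerated according to those rules rather than masked character by character.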

The tool may be used in varied contexts and situations – from the needs of an individual department to general criteria applicable to the entire organization. The user may anonymize data in an individual database or in numerous databases and platforms across the board.

Data maintains both technical and substantive integrity. It may be successfully used by the BI teams for business analysis, as it also maintains all statistical relations.

G.A.L.L. ensures fast generation, anonymization and pseudonymization of large amounts of data, regardless of whether the process covers a part of a database, the entire database or a number of databases. The process runs at the maximum speed that a given technology allows.

Owing to JDBC drivers, Soflab G.A.L.L. may connect to any type of relational database, e.g. MySQL, SQL Server, PostgreSQL or MS Access. It may also perform anonymization based only on CSV files or via dedicated Python Connector drivers.

Soflab offers fast and effective launch of the tool, owing to our team of experienced BI specialists who have confirmed their skills and expertise during work in numerous complex test and production environments.


How to select the optimum method of data generation?

There are several methods of test data generation, and each of them suits different situations. If the amount of necessary test data is small, generating it manually and keeping it in e.g. a text file or spreadsheet may be the best option. When the demand for data is higher, we may use automated scripts, which generate larger volumes of data faster and can e.g. store them automatically in a text file. We must remember, however, that in both cases the generated data might not reflect the actual data. To address this problem, we should consider comprehensive commercial solutions, such as Soflab G.A.L.L., which prove useful in large, complex projects. They offer fast deployment, automated identification and provisioning of test data, and quick results. Such tools give developers and testers reliable data without the risk of compromising confidential production data. This, in turn, minimizes the time and effort required to secure the data, so the teams can focus on other critical aspects of the software development process.

