While anonymization is a term familiar to most people, pseudonymization is a newer one that entered public discussion with GDPR, the EU regulation concerning the handling of sensitive data. What are the differences between anonymization and pseudonymization, and where did this distinction in GDPR come from?
When GDPR was adopted last year, interest in the security of collected, stored and processed data started to grow. Operational databases are used for many purposes, ranging from scientific research, patient treatment and the operation of law firms to the testing of new software.
Developing software or adding new functionality to an existing environment requires running the entire process under test conditions to verify how the system handles the load of a large database.
The process should take place in conditions that resemble the production environment as closely as possible. It is therefore best to develop applications, their extensions and other modifications against a database that faithfully reflects the actual situation.
Most companies pay a lot of attention to securing production environments against data leakage and put less emphasis on testing environments. Since testing environments often contain copies of parts of the production data, they become a natural target and an entry point for illegally obtaining personal data. Anonymization secures testing (TST) and development (DEV) environments.
The concept of anonymization does not exist in GDPR
The provisions on the processing of particularly important, i.e. sensitive, data make no mention of data anonymization. Instead, the term pseudonymization appears. This is logical, because anonymized sensitive data is in fact just a set of features, with no possibility of identifying the specific owners of this information.
Since no natural person can be identified from such a data collection, a database organized on these principles is not subject to personal data protection rules, and GDPR does not apply to it.
Let us explain the main difference between these two terms. First, though, we need to answer a question: when does information become personal data?
Well, we deal with personal data when we are in possession of information sufficient to identify a given natural person.
Personal data anonymization – what is it?
The Act on Electronic Services in Poland defines anonymization of personal data as an irreversible action that makes it impossible to identify the person whose data is in question.
Such an action involves deleting any information on specific personal features from the database so that there is no effective method to use them in order to identify a natural person.
Personal data can be anonymized at various stages of its handling. When anonymization is carried out at the very beginning, at the moment the data is acquired, identification data is simply never entered into the system.
A second mechanism is data separation across databases, which supports anonymization by ensuring that the relations between individual records are not stored in the same set.
A third mechanism applies anonymization methods to data downloaded from the system (shared data). Sensitive data is then removed from the full database as it is downloaded by authorized users. Such data filtering requires a properly secured system.
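The third mechanism, filtering on download, can be illustrated with a minimal Python sketch. The field names and the choice of which fields count as identifying are assumptions made for this illustration only; in practice that list comes out of an analysis of the specific database.

```python
# Hypothetical list of identifying fields; an assumption for this example.
IDENTIFYING_FIELDS = {"name", "surname", "pesel", "email"}


def export_record(record):
    """Return a copy of the record with the identifying fields removed."""
    return {k: v for k, v in record.items() if k not in IDENTIFYING_FIELDS}


def export_all(records):
    """Filter every record before it leaves the system."""
    return [export_record(r) for r in records]


# Made-up sample data.
patients = [
    {"name": "Anna", "surname": "Kowalska", "pesel": "00000000000",
     "age": 47, "diagnosis": "J45"},
]

exported = export_all(patients)
print(exported)
```

The original records stay untouched in the source system; only the filtered copies are handed to the user who downloads them.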
When is anonymization needed?
Anonymization is needed when we deal with personal data, that is, with a set of information containing identifiers that allow a natural person to be identified, rather than with mere information. We can distinguish unique identifiers, which are not created in the system and identify a user directly beyond any single system implemented in the organization (in Poland, the PESEL and NIP numbers were supposed to function in a similar way), and identifiers of an inconclusive nature.
For instance, a set of information including sex, eye color, age and country of origin is theoretically not sufficient to identify a person. However, if we add the name, height or job, the probability of identifying this person increases dramatically.
We can, however, imagine a situation in which even the basic inconclusive information is so distinctive that it allows successful identification. It all depends on the context in which we explore the database.
Example: a blue-eyed man, born in Poland, 47 years old, is very easy to identify in... Congo.
What does this mean? It shows that the boundary between information and personal data is not always clear and obvious. Hence, anonymization must be preceded by an analysis that determines the scope of data to be processed, the acceptable risk profile, and the input parameters.
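One way to make this analysis concrete is to count how many records share each combination of inconclusive attributes; a combination that occurs only once points to a single person. The Python sketch below illustrates that idea on made-up data. The attribute names, including the `residence` field that stands in for the context, are assumptions for this example. The same counting underlies what is commonly called k-anonymity.

```python
from collections import Counter


def group_sizes(records, attributes):
    """Count how many records share each combination of the given attributes."""
    return Counter(tuple(r[a] for a in attributes) for r in records)


def unique_records(records, attributes):
    """Return records whose attribute combination occurs exactly once."""
    sizes = group_sizes(records, attributes)
    return [r for r in records
            if sizes[tuple(r[a] for a in attributes)] == 1]


# Made-up sample data.
people = [
    {"sex": "M", "eyes": "blue", "age": 47, "born": "Poland", "residence": "Poland"},
    {"sex": "M", "eyes": "blue", "age": 47, "born": "Poland", "residence": "Poland"},
    {"sex": "M", "eyes": "blue", "age": 47, "born": "Poland", "residence": "Congo"},
]

basic = ["sex", "eyes", "age", "born"]
print(len(unique_records(people, basic)))                  # nobody stands out
print(len(unique_records(people, basic + ["residence"])))  # one person stands out
```

With the basic attributes alone all three records look the same; once the context is taken into account, one of them becomes unique and therefore identifiable.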
The reason for and the method of anonymization ought to be clearly stated, as anonymization should serve a clear and specific purpose. The process may, to some extent, limit the valuable and coherent information available in the data set: as the degree of anonymization increases, the usability of the data set decreases.
The organization must therefore decide on a compromise between the acceptable (or expected) usability and the reduction of risk. Poorly conducted anonymization may destroy the business coherence of the data, especially when testing solutions used for client segmentation or profiling, marketing campaigns, etc.
Pseudonymization is one of the anonymization methods, but it provides much weaker protection against the correlation of sensitive data, so pseudonymized data remains protected personal data.
This is one of the reasons why GDPR addresses pseudonymization rather than full anonymization. Once the latter has been applied, we can no longer speak of personal data processing.
The notion of pseudonymization was officially introduced in Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data.
Pseudonymization – what is it?
It is an activity that involves replacing identification data with code names (pseudonyms), which for data such as name and surname may be just initials or numbers.
In fact, this process may also be referred to as “cryptonymization”, but in practice the term pseudonymization is used (in line with the terminology of GDPR, as explained above).
As a result of pseudonymization, we receive a sequence of information (still in the form of personal data), on the basis of which it is not possible to decipher or identify any individual without a key on which the entire pseudonymization process has been based.
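The role of the key can be illustrated with a minimal Python sketch, under the assumption that pseudonyms are random codes and that the key is a lookup table kept separately from the data. The function names `pseudonymize` and `reidentify` and the `P-` prefix are hypothetical and chosen for this example only.

```python
import secrets


def pseudonymize(records, field):
    """Replace the values of `field` with random pseudonyms.

    Returns the pseudonymized records and the key (pseudonym -> original),
    which must be stored separately from the data.
    """
    key = {}       # pseudonym -> original value
    assigned = {}  # original value -> pseudonym (same person, same pseudonym)
    result = []
    for record in records:
        original = record[field]
        if original not in assigned:
            pseudonym = "P-" + secrets.token_hex(4)
            while pseudonym in key:  # regenerate on the unlikely collision
                pseudonym = "P-" + secrets.token_hex(4)
            assigned[original] = pseudonym
            key[pseudonym] = original
        result.append({**record, field: assigned[original]})
    return result, key


def reidentify(records, field, key):
    """Reverse the pseudonymization; only possible with the key."""
    return [{**r, field: key[r[field]]} for r in records]


# Made-up sample data.
clients = [
    {"name": "Anna Kowalska", "purchase": "laptop"},
    {"name": "Jan Nowak", "purchase": "phone"},
    {"name": "Anna Kowalska", "purchase": "mouse"},
]

pseudonymized, key = pseudonymize(clients, "name")
print(pseudonymized)
```

Both purchases of the same client carry the same pseudonym, so the data can still be analyzed per person, while the names can be restored only by someone who holds the key.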
What does pseudonymization mean?
Thus, without the key we cannot find the owner of the data. Unlike in anonymization, the data is not irreversibly altered or blacked out; in this case it is only encrypted. This method of data protection also relates to the GDPR requirements for securing data against cyberattacks.
How to determine whether a given person can be identified despite data pseudonymization?
Unfortunately, in this case the provisions are less precise, as they refer to all means reasonably likely to be used for identification. The assessment criteria may include the cost, time and technology needed to achieve this aim.
Pseudonymization in practice
An advantage of pseudonymization is that it is easy to apply regardless of the conditions. There are many pseudonymization methods, and the choice depends on the level of privacy required.
Pseudonymization preserves the correlation between the various data assigned to a given individual while keeping that individual’s identity hidden.
Moreover, pseudonymization brings other benefits: the regulations concerning pseudonymized data are far less restrictive than those for plain sensitive data (not subject to pseudonymization). Lower requirements mean a higher level of legal security and lower cost.
Pseudonymization methods include:
encrypting, provided that the decryption key (algorithm) is stored in another database,
tokenization, which uses the stream of input symbols to generate tokens,
replacing parts of data with a sequence of symbols (familiar from, for instance, credit card numbers),
modifying data so that it shows only approximate values.
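The last two methods on the list can be sketched in a few lines of Python. The function names, the number of visible digits and the width of the age band are assumptions chosen for this illustration; encryption is left out here because it would require a third-party library.

```python
def mask_card_number(number, visible=4):
    """Replace all digits except the last `visible` ones with '*'.

    Separators such as spaces or dashes are kept as they are.
    """
    total_digits = sum(c.isdigit() for c in number)
    hidden = total_digits - visible
    seen = 0
    out = []
    for c in number:
        if c.isdigit():
            out.append("*" if seen < hidden else c)
            seen += 1
        else:
            out.append(c)
    return "".join(out)


def generalize_age(age, band=10):
    """Replace an exact age with the range it falls into."""
    low = (age // band) * band
    return f"{low}-{low + band - 1}"


print(mask_card_number("4111 1111 1111 1234"))  # **** **** **** 1234
print(generalize_age(47))                       # 40-49
```

Both functions keep the data usable for testing (the card number keeps its format, the age keeps its order of magnitude) while hiding the exact value.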
Reversibility is the basic attribute that distinguishes pseudonymization from anonymization. Anonymization is an irreversible process, while pseudonymization is not.
How does a data anonymizer work?
In software development, efficient data anonymization and export are crucial both for data security and for the effectiveness of the solutions.
It is worth using solutions available on the market and relying on the experience of the experts who design them. This allows programmers, developers, and testers to work with credible data without exposing confidential production data.
IT solutions for anonymization and pseudonymization have many advantages. The algorithms of Soflab GALL used for personal data anonymization and pseudonymization support the identification of sensitive data and its random replacement, creating mixed data that prevents identification.
Increasing data volume
Interestingly, Soflab GALL also allows generating test cases. The tool can take one cohesive data sample, which can then be processed and multiplied. This creates new items in the database with random attributes (name, age, job, etc.). The volume of data can thus be increased, so that software can be tested under heavy load.
This operation can be performed quickly and repeatedly, and although it is repeatable, each run produces a different result while preserving data coherence.
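The general idea of multiplying a sample can be shown with a short Python sketch. This is not how Soflab GALL works internally, which the article does not describe; the function name `multiply_sample`, the value lists and the field names are all assumptions made for the illustration.

```python
import random

# Hypothetical replacement values; an assumption for this example.
FIRST_NAMES = ["Alex", "Sam", "Kim", "Robin", "Chris"]
JOBS = ["teacher", "driver", "nurse", "engineer"]


def multiply_sample(sample, copies, seed=None):
    """Create `copies` randomized variants of every record in the sample.

    Personal attributes are drawn at random; all other fields are copied
    unchanged, so the business structure of the data is preserved.
    With seed=None each run gives a different result.
    """
    rng = random.Random(seed)
    generated = []
    for _ in range(copies):
        for record in sample:
            generated.append({
                **record,
                "id": len(generated) + 1,
                "name": rng.choice(FIRST_NAMES),
                "age": rng.randint(18, 90),
                "job": rng.choice(JOBS),
            })
    return generated


# Made-up sample data.
sample = [
    {"id": 1, "name": "Anna", "age": 47, "job": "lawyer", "segment": "premium"},
    {"id": 2, "name": "Jan", "age": 31, "job": "clerk", "segment": "standard"},
]

big = multiply_sample(sample, copies=1000)
print(len(big))  # 2000
```

Passing a fixed `seed` makes a run reproducible, which can be handy when a failing test needs to be repeated on exactly the same data.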
Polish data anonymization software
Testing and development environments are among the places where data leakage is very common. Soflab GALL is designed so that anonymization cannot be reversed by applying the same rules to data that has already undergone it. This increases data security in the non-production environment and prevents identification.
Anonymization and pseudonymization of sensitive data using Soflab GALL make it possible to create fully functioning databases containing credible information that may be used in non-production environments.
The open architecture of Soflab GALL adapts easily to the organization’s environment and to existing threats. Anonymization can be performed on CSV files containing the contents of system tables, or via a direct connection to databases whose engines offer JDBC drivers or a Python connector.
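To give a feel for file-based anonymization, here is a minimal sketch using Python’s standard `csv` module. It does not show the interface of Soflab GALL; the function `anonymize_csv`, the column names and the list of replacement names are assumptions made for this example.

```python
import csv
import io
import random

# Hypothetical replacement values; an assumption for this example.
REPLACEMENT_NAMES = ["Alex", "Sam", "Kim", "Robin"]


def anonymize_csv(source, target, name_column="name",
                  drop_columns=("pesel",), seed=None):
    """Copy CSV rows from `source` to `target`, dropping the identifier
    columns and replacing names with random values."""
    rng = random.Random(seed)
    reader = csv.DictReader(source)
    kept = [c for c in (reader.fieldnames or []) if c not in drop_columns]
    writer = csv.DictWriter(target, fieldnames=kept)
    writer.writeheader()
    for row in reader:
        out = {c: row[c] for c in kept}
        if name_column in out:
            out[name_column] = rng.choice(REPLACEMENT_NAMES)
        writer.writerow(out)


# In-memory files keep the example self-contained; made-up data.
source = io.StringIO(
    "name,pesel,city\n"
    "Anna,00000000000,Warsaw\n"
    "Jan,11111111111,Gdansk\n"
)
target = io.StringIO()
anonymize_csv(source, target)
print(target.getvalue())
```

With real files, the same function can be called on objects returned by `open(path, newline="")`, as the `csv` module documentation recommends.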
An effective approach to personal data protection and confidentiality requires good tools and experience. Ensuring that these solutions comply with legal regulations should be a priority. It is worth pseudonymizing or anonymizing all stored databases, even when they are not shared with external entities (e.g. for software system development). This helps ensure compliance with GDPR.
The right to be forgotten, whether it results from legal provisions or from an individual client’s request, is exercised by removing all sensitive data. Anonymization makes it possible to comply with the regulations on the right to be forgotten while retaining non-sensitive data as a set of anonymous information about clients’ behavior. Such data can then be used to better adjust processes within the organization and to improve its procedures and offer.
If you are interested in increasing data security in the non-production environment (testing or development), contact us. We will explain how to use Polish algorithms to increase database security without damaging its statistical features, increase database volume, and fulfil the right to be forgotten with no loss of valuable information, all in compliance with GDPR.