Data Anonymization

Data Anonymization – An Overview

Have you ever heard of the hacktivist group Anonymous? What is the first thing that comes to mind? Probably a group of people who hide in obscurity and carry out targeted attacks on established authorities, most times to demand accountability. You don’t know them and you have never seen them, but you do know they exist. It is like the James Bond movies, where the identity of the agent (often the main character) has to be kept hidden. Anonymization is the sum of the procedures and precautions used to keep such identities secret. In both cases, revealing who these people are would jeopardize their safety, so the best way to keep them safe enough to achieve their aim is to have them work from the shadows.

It is no news that data, in today’s world, translates to power. Just like in the movies, everyone will do anything to hold on to it. From business to warfare to basic human activities, everything depends on the availability of data. Data is the metaphorical James Bond of internet operations; in this case, however, the safety of real-life humans is at stake. The growing importance of data and the constant attacks on database systems spurred an evolution in cybersecurity techniques, and data anonymization is one of the results.

Essentially, data anonymization is the process of irreversibly removing anything that links a data set to the person it describes. In more technical terms, it is the process of altering data, by removing, encoding or encrypting key identifiers, to make identification harder and to allow safer movement of data between systems. For instance, a bank may encrypt the names, addresses and biometric data of its customers before transferring a compilation of their daily savings. That way, an attacker who gets hold of the data set cannot easily trace the records to actual persons.

You don’t expect that there is absolutely no way around this, do you? Not at all; even James Bond gets discovered at times. De-anonymization is the polar opposite of data anonymization. It happens when attackers cross-reference an anonymized data set with other relevant sources and match the records back to the data subjects and their key identifiers. At that point, the data is no longer protected. However, there are different methods of anonymization, and there are procedures that make de-anonymization of these data sets harder.
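To make the cross-referencing idea concrete, here is a minimal sketch of how such a match could happen. Everything in it is an assumption made up for illustration: the field names (`zip`, `birth_year`, `gender`), the two tiny data sets and the `link_records` helper. Real attacks are far messier, but the principle is the same.

```python
# Hypothetical "anonymized" release: names removed, other attributes kept.
released = [
    {"zip": "10001", "birth_year": 1985, "gender": "F", "balance": 5200},
    {"zip": "10002", "birth_year": 1990, "gender": "M", "balance": 310},
]

# Hypothetical public source that still carries names.
public = [
    {"name": "Alice Example", "zip": "10001", "birth_year": 1985, "gender": "F"},
    {"name": "Bob Example", "zip": "10002", "birth_year": 1990, "gender": "M"},
    {"name": "Carol Example", "zip": "10002", "birth_year": 1971, "gender": "F"},
]

# Attributes that are not names but can still single a person out.
QUASI_IDENTIFIERS = ("zip", "birth_year", "gender")


def link_records(released_rows, public_rows, keys=QUASI_IDENTIFIERS):
    """Return (name, released_row) pairs where exactly one public row matches."""
    matches = []
    for row in released_rows:
        candidates = [p for p in public_rows if all(p[k] == row[k] for k in keys)]
        if len(candidates) == 1:  # a unique match re-identifies the record
            matches.append((candidates[0]["name"], row))
    return matches


reidentified = link_records(released, public)
for name, row in reidentified:
    print(name, "->", row["balance"])
```

In this toy example both released records are matched to a name, even though the release itself contains no names. That is why fields such as postcode and year of birth deserve as much attention as the obvious identifiers.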

Data anonymization can be quite a reliable way of securing data, especially when it is combined with other aspects of database management. You have little to lose by adopting it.

What Data Should Be Anonymized

Of course, not all data sets should be subject to anonymization; anonymizing everything defeats the purpose of keeping the data in the first place. It is important that every database administrator can tell which data sets to anonymize and which ones they should not bother with.

The answer to ‘what data should be anonymized’ may appear quite simple: the obvious answer would be “sensitive data”. However, this leaves room for subjectivity. What counts as sensitive differs by sector and by individual; contact information may not be private to a marketing agency’s CEO, but it may be very private to senior security personnel.

By standard practice, however, Personally Identifiable Information (PII) is meant to be kept safe, which makes it a better candidate for anonymization than any other type of data. This doesn’t take away the issue of subjectivity. As a matter of fact, the term PII is not only subject to industry interpretation; the law also has a strong influence on what counts as PII.

However, there are some sets of data that are generally deemed PII (at least by the vast majority) irrespective of industry or legal influence. Let’s consider some examples:

  1. NAME – In whatever context we look at it, this is the most important key identifier in a data set, because it narrows a record down to a specific person. With names in the hands of an attacker, the source of a record can be traced easily even if the remaining fields are encoded. Names should therefore be properly anonymized.
  2. CREDIT CARD DETAILS – This covers items such as the credit card number, the card PIN and the card token. They are very personal and unique to individuals. In any context, they should be protected.
  3. MOBILE NUMBER – Access to a mobile number could mean access to other data sets that reveal the identity of an individual. To an attacker, this could mean everything. In any case, this data should be anonymized.
  4. PHOTOGRAPH – What better way is there to identify someone? Photographs are often collected for identity verification and security. They should be protected at all costs, which makes them a perfect subject for anonymization.
  5. PASSWORDS – Depending on context, passwords could be very important. An attacker could conveniently impersonate anyone by logging in with that person’s password. In a backend built to store them, you should seriously consider encrypting these values (or, better still, storing only salted hashes).
  6. SECURITY QUESTIONS – The answers to security questions are also very important identifiers. With many web applications using them as a means of granting user access, you may want to consider encrypting them too.

As mentioned above, the list may grow further depending on the industry involved. In the end, it is left to company policy and the discretion of the database administrator to decide which data sets count as sensitive and which do not. After prioritizing your data sets, you can choose the ones at the top of your list to anonymize.
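One simple way to put such a policy into practice is to write the list of sensitive fields down once and let code do the sorting. The sketch below is only an illustration: the field names in `PII_FIELDS` and the `split_record` helper are hypothetical, and your own policy will name different fields.

```python
# Hypothetical policy: field names this organization treats as PII.
PII_FIELDS = {
    "name",
    "card_number",
    "mobile_number",
    "photo",
    "password",
    "security_answer",
}


def split_record(record, pii_fields=PII_FIELDS):
    """Split one record into (fields to anonymize, fields that can stay as they are)."""
    sensitive = {k: v for k, v in record.items() if k in pii_fields}
    other = {k: v for k, v in record.items() if k not in pii_fields}
    return sensitive, other


record = {
    "name": "Jane Doe",
    "mobile_number": "555-0100",
    "city": "Lagos",
    "daily_savings": 120.5,
}
to_anonymize, to_keep = split_record(record)
print(to_anonymize)
print(to_keep)
```

Keeping the policy in one place means that adding a new sensitive field is a one-line change rather than a hunt through every export script.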

What Are The Data Anonymization Methods?

As indicated earlier, there are different means to the same end, which is data anonymization. Each technique works on different principles and suits different forms of data. Let’s examine these methods:

DATA MASKING

The name is self-explanatory: to mask data is to hide it by giving it another apparent identity. In data masking, you create a duplicate (a mirror version) of a data set in which the original values are hidden behind inauthentic ones, for example by substituting values or shuffling them. This makes de-anonymization a tad more difficult for attackers. There are different types of data masking; let’s examine some of them:

  • Deterministic data masking – A given value is always replaced by the same substitute value, so there is always an original data set and a corresponding masked one. For instance, you could always replace BOY with A XY wherever it appears in a database.
  • Dynamic data masking – The data is masked as it is streamed from one system to another; the masked subset is consumed directly by the receiving system rather than being stored.
  • On-the-fly masking – This is similar to dynamic data masking but differs in destination: the masked subsets are moved to a new dev/test environment on a secondary storage system.
  • Static data masking – All sensitive data in a copy of the database is altered, the masked copy is shared to a new destination, and the original version is retained as a backup.
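As a rough illustration of deterministic masking applied statically to one column, consider the sketch below. It uses a keyed hash (HMAC-SHA256 from Python’s standard library) as the substitution rule, which is one possible choice and not the only one. The key value, the helper names `mask_value` and `mask_column`, and the sample records are all assumptions made up for this example.

```python
import hashlib
import hmac

# Assumption: a secret key kept outside the database (hypothetical value).
SECRET_KEY = b"replace-with-a-real-secret"


def mask_value(value, key=SECRET_KEY, length=12):
    """Deterministically map a string to a fixed-length hex token."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:length]


def mask_column(rows, column):
    """Return masked copies of the rows; the original rows are left untouched."""
    return [{**row, column: mask_value(row[column])} for row in rows]


customers = [
    {"name": "Jane Doe", "savings": 120},
    {"name": "John Roe", "savings": 75},
    {"name": "Jane Doe", "savings": 40},
]
masked = mask_column(customers, "name")
for row in masked:
    print(row)
```

Because the mapping is deterministic, both "Jane Doe" rows receive the same token, so records can still be grouped per customer after masking. The original list is kept unchanged, much like the backup retained in static masking.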

SYNTHETIC DATA

As the name suggests, synthetic data sets are created by an algorithm and have no direct relationship to real events or persons. With the help of statistical models, they are built as synthetic prototypes of the original data sets that need protection. Some administrators consider this cleaner than making alterations to the original data sets.
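A very small sketch of the idea follows, assuming the simplest possible statistical model: a normal distribution fitted to a single numeric column. Real synthetic data tools use much richer models; the `synthesize` helper and the sample savings figures are hypothetical.

```python
import random
import statistics


def synthesize(values, n, seed=None):
    """Draw n synthetic values from a normal distribution fitted to the originals."""
    rng = random.Random(seed)
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    return [rng.gauss(mean, stdev) for _ in range(n)]


original_savings = [120.0, 75.0, 40.0, 210.0, 95.0, 160.0]
synthetic_savings = synthesize(original_savings, n=1000, seed=42)

print("original mean: ", round(statistics.mean(original_savings), 2))
print("synthetic mean:", round(statistics.mean(synthetic_savings), 2))
```

None of the synthetic values belongs to a real customer, yet the overall shape of the column (its average and spread) stays close to the original. Note that a plain normal model can produce negative amounts, so a real pipeline would need a model that respects such constraints.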

DATA SWAPPING

Data swapping is as simple as the name implies. It is the process of shuffling the values of a field among records so that they no longer line up with the original records. This mismatch makes the resulting data set difficult for attackers to de-anonymize.
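Here is a minimal sketch of swapping applied to a single column. The `swap_column` helper and the sample records are made up for illustration, and the example assumes the records fit in memory as a list of dictionaries.

```python
import random


def swap_column(rows, column, seed=None):
    """Return copies of the rows with one column's values shuffled across records."""
    rng = random.Random(seed)
    values = [row[column] for row in rows]
    rng.shuffle(values)
    return [{**row, column: value} for row, value in zip(rows, values)]


records = [
    {"name": "Jane Doe", "age": 34, "savings": 120},
    {"name": "John Roe", "age": 51, "savings": 75},
    {"name": "Mary Major", "age": 28, "savings": 210},
    {"name": "Sam Minor", "age": 45, "savings": 40},
]
swapped = swap_column(records, "savings", seed=7)
for row in swapped:
    print(row)
```

The set of savings values is unchanged, so totals and averages for the column stay exactly the same; only the link between a person and a value is broken. A plain shuffle may leave some values in their original position, so it is worth checking the result on small data sets.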

DATA PERTURBATION

This method applies to numerical data. To do this, you modify the values slightly with a specific value and operation, typically by rounding them to a chosen base or by adding random noise. For instance, you could use 5 as the base of your operation and round every value to the nearest multiple of 5. Applying one fixed operation to every value, such as adding 8 or dividing by 5, is easy to reverse once the rule is known, so it gives little protection on its own.
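The sketch below shows two common ways to perturb numbers: rounding to a base and adding random noise. The helper names `round_to_base` and `add_noise`, the base of 5, the noise range and the sample ages are all assumptions chosen for illustration.

```python
import random


def round_to_base(value, base=5):
    """Round a number to the nearest multiple of base."""
    return base * round(value / base)


def add_noise(value, spread=2.0, rng=random):
    """Add uniform random noise in the range [-spread, +spread]."""
    return value + rng.uniform(-spread, spread)


ages = [23, 37, 41, 58]

rounded_ages = [round_to_base(age) for age in ages]
print(rounded_ages)  # [25, 35, 40, 60]

rng = random.Random(0)  # seeded only to make the demo repeatable
noisy_ages = [add_noise(age, spread=2.0, rng=rng) for age in ages]
print(noisy_ages)
```

The size of the base or of the noise is a trade-off: too small and the values stay recognizable, too large and the data stops being useful for analysis.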

DATA GENERALIZATION

This method epitomizes the underlying principle of data anonymization. Here, some detail is deliberately removed to make the data less recognizable. You could, for example, replace exact values with a range (an age of 34 becomes 30-39) or drop part of an identifier, such as the house number in an address, while keeping the data accurate enough to be useful.
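A short sketch of generalization could look like this. The helpers `generalize_age` and `generalize_postcode`, the bucket width of 10 and the sample postcodes are hypothetical choices for the example.

```python
def generalize_age(age, width=10):
    """Replace an exact age with a range such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def generalize_postcode(postcode, keep=3):
    """Keep the first characters of a postcode and blank out the rest."""
    return postcode[:keep] + "*" * (len(postcode) - keep)


people = [
    {"age": 34, "postcode": "100242"},
    {"age": 37, "postcode": "100318"},
]
generalized = [
    {"age": generalize_age(p["age"]), "postcode": generalize_postcode(p["postcode"])}
    for p in people
]
print(generalized)
```

After generalization the two people look identical (`30-39`, `100***`), which is the point: the more records share the same generalized values, the harder it is to single one person out.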

PSEUDONYMIZATION

In this method, you replace private identifiers with fake ones (pseudonyms). Because only the identifiers change, the method hides identities while preserving statistical accuracy and the value of the data, which makes it well suited to analytics, training and demo environments. Unlike the other methods, pseudonymization encodes data with re-identification in mind: whoever holds the mapping between pseudonyms and real values can decode the records. It is an easy way to keep personally identifiable information (PII) secret and separate from the rest of the data set, and the records are much faster to cross-reference with their sources and decode than with other methods of anonymization.
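The following sketch shows one way to pseudonymize identifiers while keeping the ability to re-identify them. The `Pseudonymizer` class, the `user_` prefix and the in-memory lookup tables are assumptions made for illustration; in practice the mapping would live in a separate, well-protected store.

```python
import secrets


class Pseudonymizer:
    """Replace identifiers with random pseudonyms and keep the lookup table."""

    def __init__(self):
        self._forward = {}  # real value -> pseudonym
        self._reverse = {}  # pseudonym -> real value

    def pseudonymize(self, value):
        """Return the pseudonym for a value, creating one on first use."""
        if value not in self._forward:
            pseudonym = "user_" + secrets.token_hex(4)
            while pseudonym in self._reverse:  # retry on the rare collision
                pseudonym = "user_" + secrets.token_hex(4)
            self._forward[value] = pseudonym
            self._reverse[pseudonym] = value
        return self._forward[value]

    def reidentify(self, pseudonym):
        """Look up the real value behind a pseudonym (needs the mapping)."""
        return self._reverse[pseudonym]


p = Pseudonymizer()
alias = p.pseudonymize("Jane Doe")
print(alias)
print(p.reidentify(alias))
```

Anyone who sees only the pseudonymized records learns nothing from `alias`, while the holder of the mapping can decode it in a single lookup. This is also why the mapping itself must be guarded as carefully as the original data.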

What Are The Advantages And Disadvantages Of Anonymization?

There are obvious benefits attached to adopting data anonymization methods; on the surface, we can tell that it means an extra layer of security at the very least. Let’s, however, check out some of the more intricate benefits of data anonymization:

  1. It is a measure against data misuse – No matter how much you trust your employees, you cannot always guarantee their intentions. To limit the damage from possible collaboration between insiders and external attackers, you could always employ data anonymization.
  2. It serves the dual function of being a wall and a damage control measure – Aside from adding a layer of security to the database, it could help curtail the damage of data loss in case of a breach. If an outside attacker does gain access, the data sets would still be incomprehensible to them, provided they are properly anonymized.
  3. The safer, the better – No organization can get anything done without a safe and coherent database. Data anonymization allows you to carry out your operations with less fear of database invasion. This indirectly contributes to productivity.

What possible disadvantage could be associated with the methods of data anonymization?

  1. You will have to request permission and that takes time – According to legal stipulations and standard practices, you need to request the permission of the users to collect and perform any operations on sensitive data sets like the ones in question.
  2. Data may be less coherent and meaningful – Data, when anonymized, appears less meaningful even to you as an administrator. The quality of the insights you can derive from an anonymized data set depends largely on which method you apply and how well you apply it. Even then, you cannot get the full insights you would get from an untouched data set.

Conclusion

From the overall appraisal, if you really value your organization’s reputation and your users’ safety, you may want to consider anonymizing the sensitive data sets in your database. After all, you need all the layers of security you can get, so what do you have to lose?

You should, however, be careful not to overdo it; if overdone, the data sets lose all relevance. That is exactly why you may want to consider the help of a professional.
