Data anonymization is the process of protecting private or sensitive information by erasing or encrypting the identifiers that connect an individual to stored data, such as personally identifiable information in a database. Its purpose is to protect the private activities of an individual or company while maintaining the integrity of the data gathered and shared. Data anonymization is also known as “data obfuscation,” “data masking,” or “data de-identification.”
Techniques of Anonymization
Attribute Suppression: Attribute suppression is the removal of an entire attribute (also referred to as a “column” in databases and spreadsheets) from a dataset. This technique is used when an attribute is not required in the anonymised dataset. For example, the Name column is dropped from a table.
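As a minimal sketch of attribute suppression, the snippet below drops a column from a small list of records. The sample records, the field names, and the helper `suppress_attributes` are hypothetical and exist only for illustration.

```python
# Hypothetical sample records; the field names are illustrative only.
records = [
    {"name": "Alice Tan", "age": 34, "city": "Singapore"},
    {"name": "Bob Lee", "age": 41, "city": "Penang"},
]


def suppress_attributes(rows, attributes):
    """Return copies of the rows without the listed attributes."""
    drop = set(attributes)
    return [{k: v for k, v in row.items() if k not in drop} for row in rows]


# Drop the "name" column; the original records are left untouched.
anonymised = suppress_attributes(records, ["name"])
print(anonymised)
```

The function builds new dictionaries rather than deleting keys in place, so the source data stays intact for whoever is authorised to keep it.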
Record Suppression: Record suppression is the removal of an entire record from a dataset. In contrast to most other techniques, this technique affects multiple attributes at the same time. It can be used to remove outlier records or when sampling data.
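Record suppression can be sketched as a filter over the rows. The records, the helper `suppress_records`, and the rule that ages above 90 count as outliers are all assumptions made for this example, not thresholds taken from any standard.

```python
# Hypothetical sample records; the third one is an outlier by age.
records = [
    {"id": 1, "age": 34, "salary": 52000},
    {"id": 2, "age": 41, "salary": 58000},
    {"id": 3, "age": 97, "salary": 61000},
]


def suppress_records(rows, is_outlier):
    """Drop every row for which is_outlier(row) is true."""
    return [row for row in rows if not is_outlier(row)]


# Assumed rule for this sketch: ages above 90 are treated as outliers.
kept = suppress_records(records, lambda row: row["age"] > 90)
print(kept)
```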
Character Masking: Character masking replaces some characters of a data value with a constant symbol (e.g. “*” or “x”). Masking is typically partial, i.e. applied only to some characters of an attribute such as a mobile number. It is used when hiding part of the value provides sufficient anonymity.
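A minimal sketch of partial masking follows. The helper `mask` and its parameters `visible` and `symbol` are hypothetical names chosen for this example; it keeps only the last few characters of a value readable.

```python
def mask(value, visible=4, symbol="*"):
    """Replace all but the last `visible` characters with `symbol`."""
    visible = max(visible, 0)
    hidden = max(len(value) - visible, 0)
    return symbol * hidden + value[hidden:]


print(mask("9876543210"))  # ******3210
```

Note that in this sketch a value no longer than `visible` is returned unmasked, so a real implementation would need a rule for short values.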
Pseudonymization: Pseudonymization, also referred to as coding, is the replacement of identifying data with made-up values. Pseudonyms can be irreversible, where the original values are properly disposed of and the pseudonymisation is done in a non-repeatable fashion, or reversible (by the owner of the original data), where the original values are securely kept and can be retrieved and linked back to the pseudonym should the need arise. Persistent pseudonyms allow linkage by using the same pseudonym values to represent the same individual across different data sets. On the other hand, different pseudonyms may be used to represent the same individual in different data sets to prevent linking of those data sets. Pseudonyms can be randomly or deterministically generated. Records still need to be distinguishable from each other in the anonymised data set, but no part of the original attribute value can be retained.
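The two generation styles described above can be sketched with Python's standard library. This is an illustration under stated assumptions, not a vetted implementation: `random_pseudonymiser` and `keyed_pseudonym` are hypothetical helpers, the key is a placeholder, and truncating the digest to 16 hex characters is an arbitrary choice for readability.

```python
import hashlib
import hmac
import secrets


def random_pseudonymiser():
    """Return (pseudonymise, table); each new value gets a fresh random pseudonym."""
    table = {}

    def pseudonymise(value):
        if value not in table:
            table[value] = secrets.token_hex(8)
        return table[value]

    return pseudonymise, table


def keyed_pseudonym(value, key):
    """Deterministic pseudonym: HMAC-SHA256 of the value under a secret key."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]


names = ["Alice Tan", "Bob Lee", "Alice Tan"]

# Random pseudonyms: whoever holds `table` can link a pseudonym back to the name.
pseudonymise, table = random_pseudonymiser()
random_ids = [pseudonymise(name) for name in names]

# Deterministic pseudonyms: the same key gives the same pseudonym in every data set.
key = b"example-key-only"  # placeholder; a real key must be generated and stored securely
keyed_ids = [keyed_pseudonym(name, key) for name in names]

print(random_ids)
print(keyed_ids)
```

Keeping the lookup table corresponds to the reversible case and discarding it to the irreversible case. Using a different key per data set yields different pseudonyms for the same individual, which prevents linking across data sets.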
Generalisation: Generalisation, also referred to as recoding, is a deliberate reduction in the precision of data, e.g. converting a person’s age into an age range, or a precise location into a less precise one. Attributes are modified to be less precise but still useful, for example a date range instead of an exact date.
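As a small illustration of generalisation, the sketch below turns an exact age into a range and an exact date into a year and month. The helpers `age_to_range` and `date_to_month` and the default bucket width of 10 are assumptions made for this example.

```python
from datetime import date


def age_to_range(age, width=10):
    """Map an exact age to a range such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def date_to_month(value):
    """Keep only the year and month of a date."""
    return value.strftime("%Y-%m")


print(age_to_range(34))                  # 30-39
print(date_to_month(date(2021, 3, 17)))  # 2021-03
```

A wider bucket hides more but also reduces how useful the attribute is for analysis, so the width is a trade-off to be chosen per data set.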
Swapping: The purpose of swapping is to rearrange data in the data set such that the individual attribute values are still represented in the data set but generally do not correspond to the original records. This technique is also referred to as shuffling or permutation. Swapping is applied when there is no need to analyse relationships between attributes at record level.
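Swapping a single attribute can be sketched as shuffling one column while leaving the others in place. The records and the helper `swap_attribute` are hypothetical; the `seed` parameter exists only to make the example repeatable.

```python
import random

# Hypothetical sample records.
records = [
    {"id": 1, "city": "Singapore", "salary": 52000},
    {"id": 2, "city": "Penang", "salary": 58000},
    {"id": 3, "city": "Jakarta", "salary": 61000},
    {"id": 4, "city": "Manila", "salary": 47000},
]


def swap_attribute(rows, attribute, seed=None):
    """Return copies of the rows with one attribute's values shuffled across rows."""
    rng = random.Random(seed)
    values = [row[attribute] for row in rows]
    rng.shuffle(values)
    return [{**row, attribute: value} for row, value in zip(rows, values)]


swapped = swap_attribute(records, "salary", seed=7)
print(swapped)
```

The set of salaries is unchanged, so column-level statistics such as the mean are preserved, but a given salary generally no longer sits next to its original record. A random shuffle can leave some values in their original position, which a stricter implementation might want to rule out.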
Data Perturbation: The values from the original data set are modified to be slightly different. This technique is applied when slight modifications to the attributes are acceptable.
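One simple way to perturb numeric values is to add a small bounded random offset, sketched below. The helper `perturb`, the bound of 2, and the sample ages are assumptions for this example; other schemes, such as rounding to a base, are equally valid.

```python
import random


def perturb(value, max_delta, rng):
    """Shift an integer by a random amount between -max_delta and +max_delta."""
    return value + rng.randint(-max_delta, max_delta)


rng = random.Random(42)  # seeded only so the sketch is repeatable
ages = [34, 41, 29, 56]
perturbed_ages = [perturb(age, 2, rng) for age in ages]
print(perturbed_ages)
```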
Synthetic Data: Synthetic data is a large amount of made-up data, similar in nature to the original data, generated for purposes such as testing or mock reporting.
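A deliberately naive sketch of synthetic data generation is shown below. It assumes that matching the observed age range and the observed set of cities is enough for the intended test; the records and the helper `synthesise` are hypothetical, and the sketch makes no attempt to preserve relationships between attributes.

```python
import random

# Hypothetical original records used only to derive value ranges.
records = [
    {"age": 34, "city": "Singapore"},
    {"age": 41, "city": "Penang"},
    {"age": 29, "city": "Singapore"},
]


def synthesise(rows, count, seed=None):
    """Generate made-up rows that mimic the age range and cities of the originals."""
    rng = random.Random(seed)
    ages = [row["age"] for row in rows]
    cities = sorted({row["city"] for row in rows})
    return [
        {
            "id": f"synthetic-{i}",
            "age": rng.randint(min(ages), max(ages)),
            "city": rng.choice(cities),
        }
        for i in range(1, count + 1)
    ]


synthetic = synthesise(records, 100, seed=1)
print(synthetic[:3])
```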
Data Aggregation: Data aggregation converts a data set from a list of records into summarised values. This technique is applied when individual records are not required and aggregated data is sufficient for the purpose.
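Aggregation can be sketched as grouping records and keeping only a count and a mean per group. The records and the helper `aggregate` are hypothetical and only illustrate the idea.

```python
from collections import defaultdict

# Hypothetical sample records.
records = [
    {"city": "Singapore", "salary": 50000},
    {"city": "Singapore", "salary": 60000},
    {"city": "Penang", "salary": 40000},
]


def aggregate(rows, group_by, value):
    """Summarise `value` per group as a count and a mean."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_by]].append(row[value])
    return {
        key: {"count": len(values), "mean": sum(values) / len(values)}
        for key, values in groups.items()
    }


summary = aggregate(records, "city", "salary")
print(summary)
```

A group containing a single record still exposes that individual's value, as the Penang group does here, so very small groups may need to be merged or suppressed before the summary is shared.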
Why Anonymize Data?
- To protect sensitive data.
- To promote integrity in data sharing.
- To adhere to GDPR and other compliance rules.