Privacy-Preserving Data Publishing

Updated on May 29, 2026

Abstract

Privacy is an important issue when one wants to make use of data that involves individuals’ sensitive information. Research on protecting the privacy of individuals and the confidentiality of data has received contributions from many fields, including computer science, statistics, economics, and social science.

This area addresses the problem of how an organization, such as a hospital, government agency, or insurance company, can release data to the public without violating the confidentiality of personal information. Data in its original form typically contains sensitive information about individuals, and publishing such data would violate individual privacy. Privacy-preserving data publishing (PPDP) provides methods and tools for publishing useful information while preserving data privacy. The focus is on privacy criteria that provide formal safety guarantees, algorithms that sanitize data to make it safe for release while preserving useful information, and ways of analyzing the sanitized data. Many challenges still remain.

The Anonymization Approach

In the most basic form of PPDP, the data publisher has a table of the form D(Explicit Identifier, Quasi Identifier, Sensitive Attributes, Non-Sensitive Attributes), where Explicit Identifier is a set of attributes, such as name and social security number (SSN), containing information that explicitly identifies record owners; Quasi Identifier (QID) is a set of attributes that could potentially identify record owners; Sensitive Attributes consists of sensitive person-specific information such as disease, salary, and disability status; and Non-Sensitive Attributes contains all attributes that do not fall into the previous three categories. The four sets of attributes are disjoint. Most works assume that each record in the table represents a distinct record owner.

Anonymization refers to the PPDP approach that seeks to hide the identity and/or the sensitive data of record owners, assuming that sensitive data must be retained for data analysis.
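As a minimal sketch of this schema, the Python snippet below assigns each attribute of a small example table to one of the four disjoint sets and removes the explicit identifiers before release. The attribute names (name, ssn, zip, birth_date, sex, disease, visit_count) and the records are hypothetical and chosen only for illustration.

```python
# Hypothetical attribute roles for an example patient table D.
# These names are illustrative assumptions, not a standard schema.
EXPLICIT_IDENTIFIER = {"name", "ssn"}
QUASI_IDENTIFIER = {"zip", "birth_date", "sex"}
SENSITIVE = {"disease"}
NON_SENSITIVE = {"visit_count"}

ROLES = [EXPLICIT_IDENTIFIER, QUASI_IDENTIFIER, SENSITIVE, NON_SENSITIVE]


def roles_are_disjoint(roles):
    """Check that no attribute is assigned to more than one set."""
    seen = set()
    for attrs in roles:
        if seen & attrs:
            return False
        seen |= attrs
    return True


def drop_explicit_identifiers(records):
    """Return copies of the records without the explicit identifier attributes."""
    return [
        {a: v for a, v in rec.items() if a not in EXPLICIT_IDENTIFIER}
        for rec in records
    ]


# Made-up records, one per record owner.
D = [
    {"name": "Alice", "ssn": "000-00-0001", "zip": "47677",
     "birth_date": "1980-02-11", "sex": "F", "disease": "flu", "visit_count": 2},
    {"name": "Bob", "ssn": "000-00-0002", "zip": "47602",
     "birth_date": "1985-07-30", "sex": "M", "disease": "hepatitis", "visit_count": 1},
]

released = drop_explicit_identifiers(D)
print(roles_are_disjoint(ROLES))
print(sorted(released[0]))
```

The sensitive attribute is kept in the released records on purpose, since the approach assumes it is needed for data analysis.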

Clearly, explicit identifiers of record owners must be removed. Even with all explicit identifiers removed, however, a combination of the remaining attributes, called the quasi-identifier, often singles out a unique record or a small number of records. The owner of a record can then be re-identified by linking on his quasi-identifier. To perform such linking attacks, the attacker needs two pieces of prior knowledge: that the victim’s record is in the released data, and the quasi-identifier of the victim. Such knowledge can be obtained by observation. For example, the attacker noticed that his boss was hospitalized, and therefore knew that his boss’s medical record would appear in the released patient database. Also, it was not difficult for the attacker to obtain his boss’s zip code, date of birth, and sex, which could serve as the quasi-identifier in linking attacks.
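The linking attack described above can be sketched in a few lines of Python. The released table, the victim's attribute values, and the helper name linked_records are all made up for illustration; the only assumption taken from the text is that zip code, date of birth, and sex act as the quasi-identifier.

```python
QID = ("zip", "birth_date", "sex")

# Released patient table with explicit identifiers already removed (made-up data).
released = [
    {"zip": "47677", "birth_date": "1965-03-14", "sex": "M", "disease": "heart disease"},
    {"zip": "47677", "birth_date": "1980-02-11", "sex": "F", "disease": "flu"},
    {"zip": "47602", "birth_date": "1985-07-30", "sex": "M", "disease": "hepatitis"},
]

# What the attacker learned about the victim by observation (made-up data).
victim = {"name": "boss", "zip": "47677", "birth_date": "1965-03-14", "sex": "M"}


def linked_records(victim, table, qid=QID):
    """Return every record in the table whose QID value equals the victim's."""
    key = tuple(victim[a] for a in qid)
    return [rec for rec in table if tuple(rec[a] for a in qid) == key]


matches = linked_records(victim, released)
for rec in matches:
    print(victim["name"], "->", rec["disease"])
```

In this toy table the victim's quasi-identifier matches exactly one record, so the link is unambiguous even though no name or SSN was published.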

To prevent linking attacks, the data publisher provides an anonymous table T(QID', Sensitive Attributes, Non-Sensitive Attributes), where QID' is an anonymous version of the original QID obtained by applying anonymization operations to the attributes in QID in the original table D. Anonymization operations hide some detailed information so that several records become indistinguishable with respect to QID'. Consequently, if a person is linked to a record through QID', that person is also linked to all other records that have the same value for QID', making the linking ambiguous.
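One possible anonymization operation is generalization, which replaces specific values with coarser ones. The sketch below is an illustration under assumed rules (mask the last two zip digits, reduce the birth date to a decade), not a prescribed method; the records and the generalize helper are hypothetical.

```python
from collections import Counter

QID = ("zip", "birth_date", "sex")


def generalize(rec):
    """Coarsen the QID attributes; all other attributes are kept unchanged."""
    out = dict(rec)
    out["zip"] = rec["zip"][:3] + "**"                 # "47677" -> "476**"
    out["birth_date"] = rec["birth_date"][:3] + "0s"   # "1980-02-11" -> "1980s"
    return out


# Original table D without explicit identifiers (made-up data).
D = [
    {"zip": "47677", "birth_date": "1980-02-11", "sex": "F", "disease": "flu"},
    {"zip": "47602", "birth_date": "1985-07-30", "sex": "F", "disease": "heart disease"},
    {"zip": "47678", "birth_date": "1982-11-05", "sex": "M", "disease": "hepatitis"},
    {"zip": "47605", "birth_date": "1988-01-23", "sex": "M", "disease": "flu"},
]

T = [generalize(rec) for rec in D]

# Records that share a generalized QID value are indistinguishable on QID.
group_sizes = Counter(tuple(rec[a] for a in QID) for rec in T)
for qid_value, size in group_sizes.items():
    print(qid_value, size)
```

After generalization each anonymous QID value is shared by two records, so a link through the quasi-identifier points to a group rather than to a single record.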

Attack Models and Privacy Models

A very stringent definition of privacy protection requires that access to the published data should not enable the attacker to learn anything extra about any target victim compared to no access to the database, even in the presence of any background knowledge the attacker has obtained from other sources. Privacy models can be grouped into two categories according to their attack principles. The first category considers that a privacy threat occurs when an attacker is able to link a record owner to a record in a published data table, to a sensitive attribute in a published data table, or to the published data table itself.

We call these record linkage, attribute linkage, and table linkage, respectively. In all three types of linkages, it is assumed that the attacker knows the QID of the victim. In record and attribute linkages, we further assume that the attacker knows that the victim’s record is in the released table, and seeks to identify the victim’s record and/or sensitive information from the table. In table linkage, the attacker seeks to determine the presence or absence of the victim’s record in the released table. A data table is considered to be privacy preserving if it can effectively prevent the attacker from successfully performing these linkages.

The second category aims at achieving the uninformative principle: the published table should provide the attacker with little additional information beyond the background knowledge. If the attacker has a large variation between the prior and posterior beliefs, we call it the probabilistic attack.

Record Linkage

In the attack of record linkage, some value qid on QID identifies a small number of records in the released table T, called a group. If the victim’s QID matches the value qid, the victim is vulnerable to being linked to the small number of records in the group.

k-Anonymity. To prevent record linkage through QID, k-anonymity requires that if one record in the table has some value qid, at least k − 1 other records also have the value qid. In other words, the minimum group size on QID is at least k. A table satisfying this requirement is called k-anonymous. In a k-anonymous table, each record is indistinguishable from at least k − 1 other records with respect to QID.

Consequently, the probability of linking a victim to a specific record through QID is at most 1/k. k-Anonymity cannot be replaced by the privacy models in attribute linkage. A k-anonymous T can still effectively prevent record linkage without considering the sensitive information. In contrast, the privacy models in attribute linkage assume the existence of sensitive attributes in T.

The k-anonymity model assumes that QID is known to the data publisher. Most work considers a single QID containing all attributes that can be potentially used in the quasi-identifier. The more attributes included in QID, the more protection k-anonymity would provide. On the other hand, this also implies that more distortion is needed to achieve k-anonymity because the records in a group have to agree on more attributes.
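The k-anonymity requirement and the 1/k bound can be checked mechanically. The sketch below is a minimal illustration on a made-up, already generalized table; the function names and attribute names are hypothetical, and the table is assumed to be non-empty. It also shows the trade-off just described: the same table satisfies 2-anonymity for a small QID but not once more attributes are included.

```python
from collections import Counter


def group_sizes(records, qid):
    """Count how many records share each combination of values on qid."""
    return Counter(tuple(rec[a] for a in qid) for rec in records)


def is_k_anonymous(records, qid, k):
    """True if every qid group contains at least k records."""
    return all(size >= k for size in group_sizes(records, qid).values())


def max_linking_probability(records, qid):
    """Upper bound 1 / (minimum group size) on linking a victim to one record.

    Assumes the table is non-empty.
    """
    return 1 / min(group_sizes(records, qid).values())


# Made-up table whose QID attributes have already been generalized.
T = [
    {"zip": "476**", "sex": "F", "age": "2*", "disease": "flu"},
    {"zip": "476**", "sex": "F", "age": "3*", "disease": "hepatitis"},
    {"zip": "476**", "sex": "M", "age": "2*", "disease": "flu"},
    {"zip": "476**", "sex": "M", "age": "3*", "disease": "heart disease"},
]

# Growing the QID shrinks the groups, so further generalization would be
# needed to keep the table 2-anonymous.
for qid in [("zip",), ("zip", "sex"), ("zip", "sex", "age")]:
    print(qid, is_k_anonymous(T, qid, k=2), max_linking_probability(T, qid))
```

With QID = (zip, sex) the minimum group size is 2, so the linking probability is at most 1/2; adding age makes every record unique on QID and the bound degrades to 1.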

Minimality Attack on Anonymous Data

Most privacy models assume that the attacker knows the QID of a target victim and/or the presence of the victim’s record in the published data. In addition to this background knowledge, the attacker can possibly determine the privacy requirement, the anonymization operations used to achieve the privacy requirement, and the detailed mechanism of the anonymization algorithm. The attacker can possibly determine the privacy requirement and anonymization operations by examining the published data or its documentation, and can learn the mechanism of the anonymization algorithm. Such additional background knowledge can lead to extra information that facilitates an attack to compromise data privacy. This is called the minimality attack.

Many anonymization algorithms follow an implicit minimality principle. For example, when a table is generalized bottom-up to achieve k-anonymity, the table is not further generalized once it minimally meets the k-anonymity requirement. The minimality attack exploits this minimality principle to reverse the anonymization operations and filter out the impossible versions of the original table. To thwart the minimality attack, a privacy model called m-confidentiality was proposed that limits the probability of the linkage from any record owner to any sensitive value set in the sensitive attribute.
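The minimality principle itself (not the attack) can be illustrated with a toy bottom-up generalization that stops at the first level satisfying k-anonymity. This is a sketch under simplifying assumptions: the QID is a single zip code attribute, all zip codes have the same length, the input is non-empty, and generalization means masking trailing digits. The function names are hypothetical.

```python
from collections import Counter


def mask_zip(zip_code, level):
    """Replace the last `level` digits of the zip code with '*'."""
    keep = len(zip_code) - level
    return zip_code[:keep] + "*" * level


def minimal_generalization(zips, k):
    """Generalize bottom-up and stop at the first level that is k-anonymous.

    Returns (level, generalized_zips), or None if even full masking
    cannot reach k (fewer than k records).
    """
    for level in range(len(zips[0]) + 1):
        generalized = [mask_zip(z, level) for z in zips]
        if min(Counter(generalized).values()) >= k:
            return level, generalized
    return None


zips = ["47677", "47678", "47602", "47605"]
level, generalized = minimal_generalization(zips, k=2)
print(level, generalized)
```

Because the procedure stops as early as possible, an observer who knows the algorithm and k can infer that the ungeneralized data did not satisfy k-anonymity. Inferences of this kind are what the minimality attack builds on.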

This type of minimality attack is applicable to both optimal and minimal anonymization algorithms that employ generalization, suppression, anatomization, or permutation to achieve privacy models, including, but not limited to, l-diversity. To avoid the minimality attack on l-diversity, one proposed method first k-anonymizes the table; then, for each qid group in the k-anonymous table that violates l-diversity, it distorts the sensitive values to satisfy l-diversity.
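The second step of this method needs to find the qid groups that violate l-diversity. The sketch below assumes the simplest reading of l-diversity, often called distinct l-diversity, in which every qid group must contain at least l distinct sensitive values. The distortion step is not specified here and is not implemented; the table, attribute names, and function name are hypothetical.

```python
from collections import defaultdict


def groups_violating_l_diversity(records, qid, sensitive, l):
    """Return the qid values whose group has fewer than l distinct sensitive values.

    Uses distinct l-diversity, an assumption made for this illustration.
    """
    values = defaultdict(set)
    for rec in records:
        values[tuple(rec[a] for a in qid)].add(rec[sensitive])
    return [key for key, vals in values.items() if len(vals) < l]


# A made-up 2-anonymous table: each qid group contains two records.
T = [
    {"zip": "476**", "sex": "F", "disease": "flu"},
    {"zip": "476**", "sex": "F", "disease": "flu"},
    {"zip": "476**", "sex": "M", "disease": "flu"},
    {"zip": "476**", "sex": "M", "disease": "hepatitis"},
]

violating = groups_violating_l_diversity(T, ("zip", "sex"), "disease", l=2)
print(violating)
```

The first group is 2-anonymous yet every record in it carries the same disease, so linking a victim to the group reveals the sensitive value; this is the kind of group whose sensitive values would be distorted.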

Conclusion

Information sharing has become part of the routine activity of many individuals, companies, organizations, and government agencies. Privacy-preserving data publishing is a promising approach to information sharing that preserves individual privacy and protects sensitive information. In this survey, we reviewed the recent developments in the field. The general objective is to transform the original data into some anonymous form to prevent attackers from inferring its record owners’ sensitive information. We presented our views on the difference between privacy-preserving data publishing and privacy-preserving data mining, and gave a list of desirable properties of a privacy-preserving data publishing method. We reviewed and compared existing methods in terms of privacy models, anonymization operations, information metrics, and anonymization algorithms. Most of these approaches assumed a single release from a single publisher, and thus only protected the data up to the first release or the first recipient.

References

[1] Benjamin C. M. Fung, Ke Wang, Rui Chen, Privacy-Preserving Data Publishing: A Survey of Recent Developments, ACM Computing Surveys, Vol. 42, No. 4, Article 14, June 2010.

[2] Bee-Chung Chen, Daniel Kifer, Kristen LeFevre, Ashwin Machanavajjhala, Privacy-Preserving Data Publishing, Foundations and Trends in Databases, Vol. 2, Nos. 1–2 (2009), 1–167.

[3] Ninghui Li, Tiancheng Li, t-Closeness: Privacy Beyond k-Anonymity and l-Diversity.