data privacy

Choquet integral for record linkage

Publication Type:

Journal Article

Source:

Annals of Operations Research, Springer US, Volume 195, Issue 1, p.97-110 (2012)

URL:

http://www.springerlink.com/index/10.1007/s10479-011-0989-x

Abstract:

Record linkage is used in data privacy to evaluate the disclosure risk of protected data. It models potential attacks in which an intruder attempts to link records from the protected data to the original data. In this paper we introduce a novel distance-based record linkage method which uses the Choquet integral to compute the distance between records. We use a fuzzy measure to weight each subset of variables from each record. This allows us to improve on standard record linkage and to provide insightful information about the re-identification risk of each variable and the interactions between variables. To determine the optimal fuzzy measure for the linkage, we use a supervised learning approach.
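The Choquet-integral distance described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the fuzzy measure `mu` (a mapping from subsets of variable indices to weights) and the use of per-variable absolute differences are our own assumptions.

```python
# Minimal sketch of a Choquet-integral distance between two numeric records.
# The fuzzy measure mu maps frozensets of variable indices to weights in [0, 1],
# with mu(all variables) = 1; interactions between variables are captured by
# giving a subset more (or less) weight than the sum of its parts.

def choquet_distance(a, b, mu):
    # Per-variable absolute differences, sorted in increasing order.
    diffs = sorted(enumerate(abs(x - y) for x, y in zip(a, b)),
                   key=lambda t: t[1])
    total, prev = 0.0, 0.0
    for rank, (_, d) in enumerate(diffs):
        # Subset of variables whose difference is at least the current one.
        subset = frozenset(i for i, _ in diffs[rank:])
        total += (d - prev) * mu[subset]
        prev = d
    return total
```

With an additive measure (each subset weighted by the sum of its variables' weights) this reduces to a weighted mean of the differences; non-additive measures let a subset of variables count more or less than its parts, which is where the interaction information comes from.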

Improving record linkage with supervised learning for disclosure risk assessment

Publication Type:

Journal Article

Source:

Information Fusion, Volume 13, Issue 4, p.274-284 (2012)

URL:

http://www.sciencedirect.com/science/article/pii/S1566253511000352

Abstract:

In data privacy, record linkage can be used as an estimator of the disclosure risk of protected data. To model the worst-case scenario, one normally attempts to link records from the original data to the protected data. In this paper we introduce a parametrization of record linkage in terms of a weighted mean and its weights, and provide a supervised learning method to determine the optimum weights for the linkage process, that is, the parameters yielding a maximal record linkage between the protected and original data. We compare our method to standard record linkage on data from several protection methods widely used in statistical disclosure control, and evaluate the results in terms of both linkage performance and computational effort.
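As a rough illustration of the parametrization above (the function names and the nearest-neighbour linkage rule are our own simplifications, not the paper's exact procedure):

```python
# Sketch of distance-based record linkage with a weighted-mean distance.
# The weight vector w is the parameter the paper's supervised learning step
# would tune so as to maximize the number of correct links.

def weighted_distance(a, b, w):
    return sum(wi * abs(x - y) for wi, x, y in zip(w, a, b))

def link_weighted(original, protected, w):
    # For each original record, the index of the nearest protected record.
    return [min(range(len(protected)),
                key=lambda j: weighted_distance(r, protected[j], w))
            for r in original]
```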

Document Sanitization: Measuring Search Engine Information Loss and Risk of Disclosure for the Wikileaks cables

Publication Type:

Conference Paper

Source:

Privacy in Statistical Databases, 2012, Springer LNCS, Volume 7556, Palermo, Sicily, p.308-321 (2012)

Keywords:

document sanitization; data privacy; information retrieval

Abstract:

In this paper we evaluate the effect of a document sanitization process on a set of information retrieval metrics, in order to measure information loss and risk of disclosure. As an example document set, we use a subset of the Wikileaks cables, made up of documents relating to five key news items which were revealed by the cables. To sanitize the documents we have developed a semi-automatic anonymization process following the guidelines of Executive Order 13526 (2009) of the US Administration, by (i) identifying and anonymizing specific person names and data, and (ii) generalizing concepts based on WordNet categories, in order to deal with words categorized as classified. Finally, we manually revise the text from a contextual point of view to eliminate complete sentences, paragraphs and sections where necessary. We show that a significant sanitization can be applied while maintaining the relevance of the documents to the queries corresponding to the five key news items.
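Steps (i) and (ii) of the process can be sketched as follows; the name list and the hypernym map are illustrative stand-ins for the paper's named-entity and WordNet-based components.

```python
import re

# Toy sketch of the sanitization steps: replace known person names with a
# tag, then generalize sensitive terms via a WordNet-style hypernym map.
# A real pipeline would use named-entity recognition and WordNet lookups
# rather than fixed lists.

def sanitize(text, names, hypernyms):
    for name in names:
        text = re.sub(re.escape(name), "[PERSON]", text)
    return " ".join(hypernyms.get(word.lower(), word) for word in text.split())
```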

An Evolutionary Optimization Approach for Categorical Data Protection

Publication Type:

Conference Paper

Source:

Privacy and Anonymity in the Information Society 2012, Berlin (2012)

Keywords:

genetic algorithms; data privacy; categorical data; data mining; information loss; disclosure risk

Abstract:

The continuously growing amount of publicly available sensitive data has increased the risk of breaching the privacy of the people or institutions present in those datasets. Many protection methods have been developed to address this problem by either distorting or generalizing data, while taking into account the difficult trade-off between data utility (information loss) and protection against disclosure (disclosure risk).
In this paper we present an optimization approach for data protection based on an evolutionary algorithm guided by a combination of information loss and disclosure risk measures. In this way, state-of-the-art protection methods are combined to obtain new data protections with a better trade-off between these two measures. The paper presents several experimental results that assess the performance of our approach.
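The evolutionary search described above can be sketched as a generic loop; here the candidates are plain numbers and `score`/`mutate` are stand-ins for the combined information-loss/disclosure-risk objective and the recombination of protection methods.

```python
import random

# Toy sketch of an evolutionary search guided by a combined score, e.g.
# alpha * information_loss + (1 - alpha) * disclosure_risk. The best half of
# the population survives each generation (elitism) and is mutated to
# produce the other half.

def evolve(population, score, mutate, generations=100):
    for _ in range(generations):
        population.sort(key=score)                 # lower combined score is better
        parents = population[: len(population) // 2]
        population = parents + [mutate(random.choice(parents)) for _ in parents]
    return min(population, key=score)
```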

Analysis of On-line Social Networks Represented as Graphs – Extraction of an Approximation of Community Structure Using Sampling

Publication Type:

Conference Paper

Source:

MDAI 2012, Springer-Verlag, Volume 7647, Girona, Catalunya., p. 149-160 (2012)

Abstract:

In this paper we benchmark two distinct algorithms for extracting community structure from social networks represented as graphs, considering how to representatively sample an OSN graph while maintaining its community structure. We also evaluate the extraction algorithms' optimum value (modularity) for the number of communities using five well-known benchmarking datasets, two of which represent real online OSN data, and we consider how the filtering and sampling criteria are assigned for each dataset. We find that the extraction algorithms work well for finding the major communities in both the original and the sampled datasets. The quality of the results is measured using an NMI (Normalized Mutual Information) type metric to quantify the degree of correspondence between the communities generated from the original data and those generated from the sampled data. We find that a representative sampling is possible which preserves the key community structures of an OSN graph, significantly reducing the computational cost and making the resulting graph structure easier to visualize. Finally, by comparing the communities generated by each algorithm, we quantify their degree of correspondence.
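The NMI-type comparison of two community partitions can be illustrated with a standard NMI over partition labels (which may differ in detail from the exact metric used in the paper):

```python
import math
from collections import Counter

# Normalized mutual information between two partitions of the same node set,
# given as lists of community labels (one label per node). NMI is 1 when the
# partitions agree up to relabeling and 0 when they are independent.

def nmi(labels_a, labels_b):
    n = len(labels_a)
    pa, pb = Counter(labels_a), Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    mi = sum(c / n * math.log(c * n / (pa[a] * pb[b]))
             for (a, b), c in joint.items())
    ha = -sum(c / n * math.log(c / n) for c in pa.values())
    hb = -sum(c / n * math.log(c / n) for c in pb.values())
    return 1.0 if ha + hb == 0 else 2 * mi / (ha + hb)
```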

Information Loss Evaluation based on Fuzzy and Crisp Clustering of Graph Statistics

Publication Type:

Conference Paper

Source:

IEEE World Congress on Computational Intelligence (WCCI) 2012, Brisbane, Australia (2012)

Keywords:

data privacy; fuzzy clustering; graphs

Abstract:

In this paper we apply two types of clustering, fuzzy (fuzzy c-means) and crisp (k-means), to graph statistical data in order to evaluate the information loss due to perturbation as part of the anonymization process for a data privacy application. We place special emphasis on two major node types: hubs, which are nodes with a high relative degree value, and bridges, which act as connecting nodes between different regions of the graph. By clustering the graph's statistical data before and after perturbation, we can measure the change in characteristics and therefore the information loss. We partition the nodes into three groups: hubs/global bridges, local bridges, and all other nodes. We suspect that these partitions are best represented in fuzzy form, especially for nodes in frontier regions of the graph, which may have an ambiguous assignment.
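The ambiguous assignment of frontier nodes that fuzzy clustering captures can be sketched with the fuzzy c-means membership rule, here for fixed cluster centers and fuzzifier m = 2 (a simplification of the full iterative algorithm):

```python
# Fuzzy c-means membership of a single statistic value x (e.g. a node's
# degree) in each cluster, for fixed cluster centers. A node halfway between
# two centers gets memberships near 0.5 each, capturing the ambiguity of
# frontier nodes; crisp k-means would force a hard 0/1 assignment.

def fcm_memberships(x, centers, m=2.0):
    d = [abs(x - c) or 1e-12 for c in centers]      # avoid division by zero
    return [1.0 / sum((d[k] / d[j]) ** (2.0 / (m - 1.0))
                      for j in range(len(centers)))
            for k in range(len(centers))]
```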

Supervised learning using Mahalanobis distance for record linkage

Publication Type:

Conference Proceedings

Source:

6th International Summer School on Aggregation Operators-AGOP2011, Lulu.com, Univ. of Sannio, Benevento, Italy, p.223-228 (2011)

ISBN:

978-1-4477-7019-0

URL:

http://agop2011.ciselab.org/proceedings

Keywords:

data privacy; record linkage; disclosure risk; Mahalanobis distance; fuzzy measure; Choquet integral

Abstract:

In data privacy, record linkage is a well-known technique used to evaluate the disclosure risk of protected data. The main idea is to link records of different databases which refer to the same individuals. In this paper we introduce a new parametrized variation of record linkage relying on the Mahalanobis distance, and a supervised learning method to determine the optimum simulated covariance matrix for the linkage process. We evaluate and compare our proposal with other parametrized and non-parametrized variations of record linkage studied previously, such as the weighted mean or the Choquet integral with a learned optimal fuzzy measure.
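A two-dimensional sketch of Mahalanobis-based linkage follows; the paper learns the covariance matrix, whereas here an assumed matrix S is simply plugged in.

```python
# Record linkage with a Mahalanobis-style distance in two dimensions.
# The matrix S plays the role of the (learned) covariance; changing it
# rescales how much each variable, and their correlation, matters.

def inv2x2(S):
    (a, b), (c, d) = S
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def mahalanobis2(x, y, S_inv):
    dx, dy = x[0] - y[0], x[1] - y[1]
    q = (dx * (S_inv[0][0] * dx + S_inv[0][1] * dy)
         + dy * (S_inv[1][0] * dx + S_inv[1][1] * dy))
    return q ** 0.5

def link_mahalanobis(original, protected, S):
    # Nearest protected record for each original one under the distance.
    S_inv = inv2x2(S)
    return [min(range(len(protected)),
                key=lambda j: mahalanobis2(r, protected[j], S_inv))
            for r in original]
```

With S the identity this is plain Euclidean linkage; with S = [[100, 0], [0, 1]], differences in the first variable are heavily discounted, which can change which protected record each original record links to.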

Fuzzy methods for database protection

Publication Type:

Conference Proceedings

Source:

7th conference of the European Society for Fuzzy Logic and Technology (EUSFLAT-2011) and LFA-2011, Atlantis Press, Volume 1-1, Aix-les-Bains, France, p.439 - 443 (2011)

ISBN:

978-90-78677-00-0

URL:

http://www.atlantis-press.com/php/paper-details.php?from=author+index&id=2328&querystr=

Keywords:

Data privacy; fuzzy clustering; fuzzy measures; fuzzy integrals

Abstract:

Data privacy has become an important topic of research. Ubiquitous databases and the emergence of web technologies ease access to information. This information can be related to individuals, and thus sensitive information about users can easily be accessed by interested parties. Data privacy focuses on tools and methods to protect the privacy of respondents and data owners. In recent years, a large number of methods have been developed for data privacy, some of them based on fuzzy sets and systems. In this position paper we present a review of some of our results in this area. In particular, we focus on the use of fuzzy sets for data protection, for measuring information loss and for measuring disclosure risk. The techniques used in this field and reviewed in this paper range from fuzzy clustering to fuzzy integrals.

On the declassification of confidential documents

Publication Type:

Conference Proceedings

Source:

Modeling Decisions for Artificial Intelligence, MDAI 2011, Springer, Volume 6820, Changsha, China, p.235-246 (2011)

URL:

http://www.springerlink.com/content/tg81j807q42x8837/

Keywords:

declassification; anonymity; privacy preserving information retrieval; semantic; data privacy; information retrieval; pattern classification; named-entity recognition

Abstract:

We introduce the anonymization of unstructured documents to lay the groundwork for the automatic declassification of confidential documents. Departing from known ideas and methods of data privacy, we introduce the main issues of unstructured document anonymization and propose the use of named entity recognition techniques from natural language processing and information extraction to identify the entities of the document that need to be protected.

Clustering-based Information Loss for Data Protection Methods of Categorical Data

Publication Type:

Thesis

Source:

Universitat Autònoma de Barcelona, Bellaterra (Barcelona), Spain, p.24 (2011)

Keywords:

Data Privacy; Information Loss; Disclosure Risk; Clustering

Abstract:

Data privacy has always been a very important issue, but it became much more important with the expansion of the Internet: the number of public datasets available for statistical studies keeps growing, so the amount of sensitive data available on the Internet increases every day. This makes it very important to assess the performance of the methods used to mask those datasets. Two kinds of measures exist for checking this performance: information loss and disclosure risk. This performance assessment becomes even more important when protecting categorical data, which allows only very limited manipulation.
In this thesis I present an information loss analysis of categorical data protection methods based on cluster-specific measures, that is, measures specifically defined for the case in which the user will apply clustering to the data. I also compare the obtained results with those known from general information loss analyses.
