The Predictive Privacy Project

Data Protection in the Context of Big Data and AI

What is Predictive Privacy?

Big Data and Artificial Intelligence pose a new challenge to the traditional understanding of privacy. These techniques can be used to make predictions – for example about human behaviour, the progression of a disease, security risks or purchasing behaviour. The basis for such predictions is a comparison of behavioural data (e.g. usage, tracking or activity data) of the individual concerned with the data of many other individuals. When machine learning and data analytics technology is used to predict future behaviour or unknown information about individuals by pattern matching in large data sets, I refer to this as “predictive analytics”.

Predictive analytics is frequently associated with useful applications that improve, for example, our health care. However, the potential for misuse is just as great: predictive analytics also makes it possible to infer sensitive attributes such as gender, sexual orientation, predisposition to disease, mental health or political attitudes without those concerned realising it. Such estimates are used, for example, to determine insurance premiums, creditworthiness, advertising and product prices for each individual user.

Predictive Privacy in 15 minutes

Session 13 of the Introduction to the Ethics of AI 2022/23

The concept

I use the concept of “predictive privacy” to research data protection and informational privacy in the context of predictive analytics. This approach specifically addresses the risk of inferred information being misused. A person’s predictive privacy also covers information that can be guessed about them by (algorithmically) matching their data against the data of many other people. Predictive privacy is thus violated when sensitive information about a person is predicted without their knowledge and against their will. Predictive privacy is potentially violated by data analytics and machine learning applications in risk scoring, credit scoring, automated job selection, differential pricing, algorithmic triage, etc.

Research paper

  1. Mühlhoff, Rainer. 2021. “Predictive Privacy: Towards an Applied Ethics of Data Analytics”. Ethics and Information Technology. doi:10.1007/s10676-021-09606-x.

Collective privacy: Data protection is not a private decision

Predictive privacy not only extends the traditional and familiar concept of (informational) privacy, it also implies a collectivist ethical approach to data protection. The term “data protection” refers to the legal norms and regulations aimed at protecting the fundamental rights of individuals and groups against possible violations caused by data processing. The idea of data protection is to mitigate the power imbalance created by the use of data technology between data processing organisations and citizens.

Given the violations of predictive privacy enabled by modern predictive analytics technologies, data protection as implemented by the EU’s General Data Protection Regulation (GDPR) faces a fundamental obstacle. The data on which predictive models are trained are usually collected legally, either with user consent or as anonymous data – which is still suitable for training machine learning algorithms that find correlations between, for example, behavioural data and sensitive attributes.

Predictive analytics operates precisely in the blind spot of the individualistic Western notion of privacy: it exploits the masses of data (big data) voluntarily disclosed by individual users who decide for themselves that they “have nothing to hide”. While the individual decision to reveal information, such as when using a digital service, often seems marginal or irrelevant to the user in terms of loss of privacy, on a large scale the data collected through millions of such decisions reveal predictive knowledge about all of us. How this predictive knowledge may be used is poorly regulated, and many applications are detrimental to the individuals concerned or to society.

The individualistic conception of Western privacy legislation is therefore facing a dead end, and to protect predictive privacy, we need a collectivist interpretation of data protection. Predictive analytics can be used to derive sensitive information about a data subject based on the information disclosed by many other individuals. That is, the data you disclose potentially helps to discriminate against others, and the data others disclose about themselves can be used to make predictions about you.

Example: Guessing intimate information from Facebook likes

For a data company like Facebook, it is possible to build predictive models that infer the sexual orientation or relationship status of Facebook users based on their “likes”; researchers have shown that only a few likes from a user are enough (Kosinski et al. 2013). To train such a model, Facebook may proceed as follows: a small share of users, for example only 5%, explicitly state their sexual orientation or relationship status in their Facebook profile. With a total of 2.8 billion users worldwide, even this 5% amounts to a very large cohort (around 140 million people) for which Facebook has both the Facebook likes (proxy variable) and the information on sexual orientation or relationship status (target variable).

Through “supervised learning”, a predictive model is then trained on the data of these users; it learns to predict the target variable from the proxy variable. Once such a model has been trained, it can infer the sexual orientation or relationship status of all other users based solely on their Facebook likes, even though they have not explicitly provided this information. Facebook can therefore classify almost all its users according to these sensitive parameters. Users are unaware of having been classified in this way, and the classification takes place even if they have deliberately chosen not to share this information on their profiles.
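The training procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a description of any system Facebook actually runs: the users, pages and like probabilities are synthetic, the hidden attribute is a generic binary value, and the function names (`simulate_users`, `train_naive_bayes`, `predict`) as well as the choice of a naive Bayes classifier are hypothetical choices made for this sketch. Only 5% of the simulated users disclose the attribute; the model trained on them is then applied to everyone else.

```python
import math
import random


def simulate_users(n_users, n_pages, rng):
    """Create synthetic users: (list of likes, hidden binary attribute)."""
    # Assumption for illustration: each page is liked with a different
    # probability depending on the hidden attribute (0 or 1).
    p_like = [(rng.uniform(0.05, 0.5), rng.uniform(0.05, 0.5))
              for _ in range(n_pages)]
    users = []
    for _ in range(n_users):
        attribute = int(rng.random() < 0.5)
        likes = [int(rng.random() < p_like[j][attribute])
                 for j in range(n_pages)]
        users.append((likes, attribute))
    return users


def train_naive_bayes(labelled):
    """Estimate per-class like probabilities from users who disclosed the attribute."""
    n_pages = len(labelled[0][0])
    # Start counts at 1 and totals at 2 (Laplace smoothing), so that no
    # estimated probability is exactly 0 or 1.
    counts = {0: [1] * n_pages, 1: [1] * n_pages}
    totals = {0: 2, 1: 2}
    for likes, attribute in labelled:
        totals[attribute] += 1
        for j, liked in enumerate(likes):
            counts[attribute][j] += liked
    n_all = totals[0] + totals[1]
    return {
        "prior": {c: totals[c] / n_all for c in (0, 1)},
        "p_like": {c: [counts[c][j] / totals[c] for j in range(n_pages)]
                   for c in (0, 1)},
    }


def predict(model, likes):
    """Return the more probable attribute value (0 or 1) given a user's likes."""
    scores = {}
    for c in (0, 1):
        score = math.log(model["prior"][c])
        for j, liked in enumerate(likes):
            p = model["p_like"][c][j]
            score += math.log(p if liked else 1.0 - p)
        scores[c] = score
    return max(scores, key=scores.get)


rng = random.Random(0)  # fixed seed so the run is reproducible
users = simulate_users(n_users=5000, n_pages=40, rng=rng)

# Only 5% of users "disclose" the attribute (target variable);
# for everyone else the model sees only the likes (proxy variable).
n_labelled = len(users) * 5 // 100
labelled, unlabelled = users[:n_labelled], users[n_labelled:]

model = train_naive_bayes(labelled)

# The true attribute of the non-disclosing users is used here only to
# measure how often the inference is correct.
correct = sum(predict(model, likes) == attribute
              for likes, attribute in unlabelled)
accuracy = correct / len(unlabelled)
print(f"trained on {n_labelled} users, "
      f"inferred attribute for {len(unlabelled)} others, "
      f"accuracy {accuracy:.2f}")
```

On synthetic data of this kind, the model typically recovers the hidden attribute far more often than chance for users who never disclosed it. This is the collective effect at issue: the disclosures of a few determine what can be predicted about the many.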

Further sensitive information that can be determined from Facebook likes includes the user’s ethnic background, religious and political views, psychological personality traits, intelligence, “happiness”, addictive behaviour, having grown up with divorced parents, age and gender (Kosinski et al. 2013). Other studies show that numerous health issues can be inferred from Facebook data, including self-harm, depression, anxiety disorders, psychosis, diabetes and hypertension (Merchant et al. 2019).

Talk: Predictive Privacy, CAIS Kolloquium, Bochum, 14 December 2022.

Research articles on Purpose Limitation for Models

  1. Mühlhoff, Rainer, and Hannah Ruschemeier. 2024. “Updating Purpose Limitation for AI: A Normative Approach from Law and Philosophy”. SSRN Preprint, January.
  2. Mühlhoff, Rainer, and Hannah Ruschemeier. 2024. “Regulating AI via Purpose Limitation for Models”. AI Law and Regulation.
  3. Mühlhoff, Rainer. 2024. “Das Risiko der Sekundärnutzung trainierter Modelle als zentrales Problem von Datenschutz und KI-Regulierung im Medizinbereich”. In KI und Robotik in der Medizin – interdisziplinäre Fragen, edited by Hannah Ruschemeier and Björn Steinrötter. Nomos. doi:10.5771/9783748939726-27.

Research articles on Predictive Privacy

  1. Mühlhoff, Rainer. 2023. “Predictive Privacy: Collective Data Protection in the Context of AI and Big Data”. Big Data & Society, 1–14. doi:10.1177/20539517231166886.
  2. Mühlhoff, Rainer. 2021. “Predictive Privacy: Towards an Applied Ethics of Data Analytics”. Ethics and Information Technology. doi:10.1007/s10676-021-09606-x.
  3. Mühlhoff, Rainer, and Hannah Ruschemeier. 2022. “Predictive Analytics und DSGVO: Ethische und rechtliche Implikationen”. In Telemedicus – Recht der Informationsgesellschaft, Tagungsband zur Sommerkonferenz 2022, edited by Hans-Christian Gräfe and Telemedicus e.V., 38–67. Deutscher Fachverlag.
  4. Mühlhoff, Rainer, and Theresa Willem. 2023. “Social Media Advertising for Clinical Studies: Ethical and Data Protection Implications of Online Targeting”. Big Data & Society, 1–15. doi:10.1177/20539517231156127.
  5. Mühlhoff, Rainer. 2022. “Prädiktive Privatheit: Kollektiver Datenschutz im Kontext von Big Data und KI”. In Künstliche Intelligenz, Demokratie und Privatheit, edited by Michael Friedewald, Alexander Roßnagel, Jessica Heesen, Nicole Krämer, and Jörn Lamla, 31–58. Nomos. doi:10.5771/9783748913344-31.
  6. Mühlhoff, Rainer. 2020. “Prädiktive Privatheit: Warum wir alle »etwas zu verbergen haben«”. In #VerantwortungKI – Künstliche Intelligenz und gesellschaftliche Folgen, edited by Christoph Markschies and Isabella Hermann. Vol. 3/2020. Berlin-Brandenburgische Akademie der Wissenschaften.

Essays on predictive privacy

  1. Mühlhoff, Rainer. 2020. “We Need to Think Data Protection Beyond Privacy: Turbo-Digitalization after COVID-19 and the Biopolitical Shift of Digital Capitalism”. Medium, March. doi:10.2139/ssrn.3596506.
  2. Mühlhoff, Rainer. 2020. “Digitale Grundrechte nach Corona: Warum wir gerade jetzt eine Debatte über Datenschutz brauchen”. 31 March 2020.
  3. Mühlhoff, Rainer. 2020. “Die Illusion der Anonymität: Big Data im Gesundheitssystem”. Blätter für Deutsche und Internationale Politik 8: 13–16.