VoxEU Column Frontiers of economic research Industrial organisation Microeconomic regulation

Can we have too much data?

18 Nov 2019

The Cambridge Analytica scandal highlighted the sophisticated ways social media platforms can allow companies to infer information about users and non-users from shared data. This column shows how correlations between platform users’ and non-users’ characteristics mean companies can obtain data at below equilibrium prices, implying welfare inefficiencies for individuals. The authors suggest regulations that could reduce these data-sharing inefficiencies for both users and non-users of the platforms.

Authors

Asuman Ozdaglar

Ali Makhdoumi

Azarakhsh Malekian

Daron Acemoglu

Billions of users are currently using social media platforms and sharing their data (Facebook alone has over 2.5 billion monthly active users). These data are processed with increasingly sophisticated machine learning and AI methods to provide online services and personalised advertising by both social media platforms and third parties. Most economists and technologists emphasise the benefits of data both to users and to society at large, via improved consumer choice and as an input into more and better innovation.

There is a dark side to all of these data, however. Critically, data-sharing on social media and other platforms compromises not only the privacy of the users sharing the data, but also that of others who are not actively engaged in such data-sharing (see Pasquale 2015 and Zuboff 2019 on the privacy and other adverse implications of government and corporate surveillance).

The Cambridge Analytica scandal illustrates some costs of mass scale data collection and sharing. Facebook allowed Cambridge Analytica to acquire the private information of millions of individuals from data shared by about 270,000 Facebook users, who voluntarily downloaded an app for mapping their personality traits, called “This is your digital life”. The app accessed users' news feed, timeline, posts, and messages, and revealed information about other Facebook users these 270,000 individuals were connected to. Cambridge Analytica was ultimately able to infer detailed information about more than 50 million Facebook users. The company then deployed these data for designing personalised political messages and advertising on behalf of the Leave campaign in the Brexit referendum and for the Republican nominee, Donald J Trump, in the 2016 US presidential election (Gang 2018, Granville 2018).

Cambridge Analytica is the tip of a much larger iceberg involving similar practices throughout the industry. Facebook itself and other third parties engage in analogous strategies. More fundamentally, the very nature of predictive big data approaches is to forecast the behaviour or characteristics of groups of individuals from data shared by samples. Advocates of the benefits of these approaches emphasise how information shared by an individual about their preferences or health problems can be useful for understanding the behaviour and diseases affecting others with similar characteristics. But the same logic extends to privacy concerns as well. When Facebook or other companies can predict the behaviour of individuals who haven’t shared their data, this amounts to a violation of privacy to which these individuals have not consented. Consider, for example, the case of data from Facebook and other social media platforms being used for predicting who will take part in protests against the government. Less extreme but no less relevant is the ability of companies to predict the behaviour of individuals depending on their location, nationality, age, and sexual orientation. This may generate a loss of privacy as well as potential benefits (imagine, for instance, companies forecasting which restaurant or bar you will go to on which evening).

Even if these concerns are present, many in the tech industry (as well as many experts) may still argue that they are not significant enough to outweigh the benefits obtained from data, on the presumption that privacy concerns are unimportant. This presumption is based on existing studies (e.g. Laudon 1996, Varian 2002, and Athey et al. 2017) that find a relatively low willingness to pay by most users to protect their privacy. Yet this inference implicitly assumes that these revealed willingness-to-pay measures reflect the true value of privacy. When one’s information is revealed by others, this need not be the case.

Our new paper develops a conceptual framework for investigating these issues (Acemoglu et al. 2019). We present a model in which a monopoly platform or a set of competing platforms can purchase data from users (either explicitly by paying for data or implicitly by offering services for free in exchange for their data). Critically, the data of an individual are informative not only about their own characteristics but also about the characteristics of other users (and potentially non-users). More specifically, the information structure represents a network in which a link between two individuals captures the correlation between their information. Individuals also differ in the value they attach to privacy. Information enables the platform or third parties to estimate the underlying characteristics of an individual, and more accurate estimates create greater value for the platform. Conversely, more accurate estimates lead to more compromised privacy from the viewpoint of the individual.
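To make the network structure concrete, here is a minimal sketch rather than the paper's exact specification. It assumes that characteristics are jointly Gaussian with a known covariance matrix, that a user who shares reveals her characteristic without noise, and that leaked information is measured as the reduction in the variance of the platform's best estimate. The function names `solve` and `leaked_information` and the three-user covariance matrix are hypothetical, chosen only for illustration.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination.

    Assumes `a` is square and non-singular (true for a positive-definite
    covariance sub-matrix).
    """
    n = len(b)
    m = [row[:] + [b[k]] for k, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[k][n] / m[k][k] for k in range(n)]


def leaked_information(cov, i, sharers):
    """Variance reduction about user i once the users in `sharers` share.

    Assumption (not from the column): sharing is noise-free, so a user who
    shares leaks her whole variance.
    """
    s = sorted(sharers)
    if not s:
        return 0.0
    if i in s:
        return cov[i][i]
    cov_ss = [[cov[j][k] for k in s] for j in s]   # covariance among sharers
    cov_si = [cov[j][i] for j in s]                # covariance of sharers with i
    weights = solve(cov_ss, cov_si)                # best linear predictor of i
    return sum(w * c for w, c in zip(weights, cov_si))


# Hypothetical network: user 1 is linked to users 0 and 2; users 0 and 2 are not linked.
cov = [
    [1.0, 0.8, 0.0],
    [0.8, 1.0, 0.5],
    [0.0, 0.5, 1.0],
]
leak_about_1 = leaked_information(cov, 1, {0, 2})
leak_about_2 = leaked_information(cov, 2, {0})
```

In this illustration user 1 has shared nothing, yet about 0.89 of her unit variance is already leaked once her two neighbours share, so her own data add little for the platform. User 2, who is uncorrelated with user 0, loses nothing when only user 0 shares.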

When an individual’s data are correlated only with her characteristics, preferences, or actions, market prices accurately reflect the value of privacy and balance the costs and benefits from data-sharing. But this is no longer the case when the information of different users is correlated. The next example illustrates this in a simple fashion.

Consider a platform with two users, as depicted in Figure 1. The platform can acquire or buy the data of a user in order to better estimate her characteristics, preferences, or actions. The relevant data of the two users are correlated, which means that the data of one user enable the platform to estimate the characteristics of the other user more accurately. The objective of the platform is to minimise the estimation error of user characteristics or, equivalently, to maximise the amount of leaked information about them. Suppose that the valuation (in monetary terms) of the platform for the users’ leaked information is one, while the value that the first user attaches to her privacy, again in terms of leaked information about her, is 1/2, and for the second user it is v > 0. The platform offers prices (either explicitly by paying or implicitly by offering services) to the users in exchange for their data. Each user can choose whether or not to accept the price offered by the platform. In the absence of any restrictions on data markets or transaction costs, the first user will always sell her data (because her valuation of privacy, 1/2, is less than the value of information to the platform, one). But given the correlation between the characteristics of the two users, this implies that the platform will already have a fairly good estimate of the second user’s characteristics. Suppose, for illustration, that the correlation between the data of the two users is very high. In this case, the platform will know almost everything relevant about user 2 from user 1’s data, and this undermines the willingness of user 2 to protect her data. In fact, since user 1 is revealing almost everything about user 2, user 2 would be willing to sell her own data for a very low price (approximately zero). But once user 2 is selling her data, this also reveals user 1’s data, so user 1 can only charge a very low price for her data.

Therefore, in this simple example, the platform will be able to acquire both users’ data at approximately zero price, even though both users have privacy concerns. Data prices depressed below the value of privacy have obvious distributional implications: the platform benefits from cheap data, and users receive virtually no compensation for their data. When v ≤ 1, the equilibrium is still efficient because data are socially beneficial: the benefits to the platform exceed the disutility to users due to loss of privacy. In contrast, when v is above one and large, the equilibrium is inefficient, and in fact it can be arbitrarily so. This is because the first user, by selling her data, imposes a negative externality on the second user.

Figure 1 Valuing data correlations between two users
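The arithmetic behind the two-user example can be sketched in a few lines. This is a stylised reading of the example, not the paper's model: it assumes unit-variance characteristics with correlation `rho`, noise-free data, a platform valuation of one per unit of leaked information, and prices that leave each user just indifferent when the other user shares. The function names and the numbers `rho = 0.99` and `v = 3` are illustrative assumptions.

```python
def leaked(i, sharers, rho):
    """Leaked information about user i (0 or 1) given the set of sharing users.

    Assumption: unit variances and noise-free data, so a sharing user leaks 1
    and a non-sharing user leaks rho**2 when the other user shares.
    """
    if i in sharers:
        return 1.0
    if sharers:
        return rho ** 2
    return 0.0


def indifference_prices(values, rho):
    """Price at which each user is just willing to share, given the other shares.

    The price compensates only the *additional* leak caused by the user's own
    data: value of privacy times (1 - rho**2).
    """
    both = {0, 1}
    prices = []
    for i, v in enumerate(values):
        extra_leak = leaked(i, both, rho) - leaked(i, both - {i}, rho)
        prices.append(v * extra_leak)
    return prices


def social_surplus(values, sharers, rho, platform_value=1.0):
    """Platform's gain minus users' privacy losses (prices are transfers and cancel)."""
    return sum(
        (platform_value - v) * leaked(i, sharers, rho)
        for i, v in enumerate(values)
    )


values = (0.5, 3.0)   # user 1 values privacy at 1/2, user 2 at v = 3 (illustrative)
rho = 0.99            # "very high" correlation (illustrative)
prices = indifference_prices(values, rho)
surplus_both_share = social_surplus(values, {0, 1}, rho)
surplus_no_sharing = social_surplus(values, set(), rho)
```

With these illustrative numbers both prices are below 0.06, far under either user's value of privacy, while total surplus under full sharing is negative and therefore lower than with no sharing at all. Lowering `v` below one makes the full-sharing surplus positive again, in line with the efficient case described in the text.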

This example captures two of the most important conclusions from our analysis. First, data-sharing by an individual always creates negative externalities on others whose information is revealed. These negative effects may not be large enough to overturn the benefits from the platform’s use of these data. But even in this case, they create distributional effects (they benefit the platform at the expense of users). However, when some of the other users value their privacy highly, these negative effects may outweigh the benefits, leading to too much data-sharing. Second, and perhaps more subtly, data-sharing by an individual changes both the value of data to the platform and the value of privacy to other users. This is because these data enable the platform to better estimate the characteristics of other users, hence the platform itself will have less use for the data of other users. Analogously, once their information is leaked, these users may no longer choose to protect their own data. Hence, they may themselves share their own data even though they value their privacy greatly. This reiterates that, in the presence of data-sharing externalities, the value of users’ privacy cannot be inferred from their revealed data-sharing decisions.

Our analysis generalises the insights from this example and presents additional new findings.

First, we provide fairly weak conditions under which equilibrium in data markets is necessarily inefficient.

In particular, both with and without competition between platforms, correlation between users with low and high values for privacy leads to inefficiencies. Though this conclusion is fairly general, there are a few exceptions and elaborations that should be mentioned. In some circumstances, such correlation may not be sufficient for inefficiency because the benefits from data-sharing for low-value users are so large that a utilitarian social planner may prefer to sacrifice the privacy of other users with high values of privacy. Conversely, even without such correlation, inefficiencies may result because low-value users may attempt to avoid the negative externalities by distorting their platform choices or other decisions. More importantly, inefficiency also results when only high-value users are correlated with each other, and each would prefer not to share her data. But the platform (or platforms) can exploit the first mover advantage conferred on them by their ability to set prices before data-sharing decisions. They can then induce all users to share their data (and this is made possible by the fact that when others share their data, the value of data to a high-value user is depressed because more of her information is already being leaked).

Second, beyond inefficiency, we also show that data markets may destroy surplus under certain circumstances, so that (utilitarian) social welfare would be greater if data markets were shut down.

This happens when there are sufficiently many high-value users whose privacy is compromised by the data-sharing decisions of other users.

Third, we show that, paradoxically, competition between platforms need not redress these inefficiencies, and in fact, such competition may reduce welfare.

Finally, we also propose new ideas for the regulation of data sharing.

Existing technological approaches focus on anonymised data, which is useful for limiting the exposure of the user sharing her data. Such schemes are not useful, however, when it is not the privacy of the user sharing data herself that is at risk, but that of individuals correlated with her. To deal with this problem, we propose a new regulation scheme in which data transactions are mediated in a way that reduces their correlation with the data of other users. The main idea is to achieve “de-correlation” by removing the correlation between an individual’s data and the information of those who are not actively sharing their own data. For example, instead of sharing their data directly with the platform, users can reveal their data to an intermediary, who then purges the component of their information that is correlated with their demographic group before sharing the remainder with the platform. This mediated data-sharing arrangement can limit the extent to which the platform can learn about others in the group.
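A minimal simulation can illustrate what such purging might look like for two users. It assumes a trusted intermediary that observes both users' characteristics and knows their correlation `rho`, and that passes on `x1 - rho * x2`, the part of user 1's data that is linearly unrelated to user 2. This particular transformation, the function names, and the parameter values are illustrative assumptions and not necessarily the scheme proposed in the paper.

```python
import math
import random


def correlation(xs, ys):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def simulate_mediation(rho, n_draws=20000, seed=0):
    """Compare raw and mediated data from user 1 in terms of what they reveal about user 2.

    Returns (correlation of raw x1 with x2, correlation of mediated signal with x2).
    """
    rng = random.Random(seed)
    raw, other, mediated = [], [], []
    for _ in range(n_draws):
        x2 = rng.gauss(0.0, 1.0)                       # non-sharing user
        noise = rng.gauss(0.0, 1.0)
        x1 = rho * x2 + math.sqrt(1.0 - rho ** 2) * noise  # sharing user
        raw.append(x1)
        other.append(x2)
        # Hypothetical intermediary step: strip the component predictable from x2.
        mediated.append(x1 - rho * x2)
    return correlation(raw, other), correlation(mediated, other)


raw_corr, mediated_corr = simulate_mediation(rho=0.9)
```

In this sketch the raw data are strongly correlated with the non-sharing user's characteristic, while the mediated signal is approximately uncorrelated with it. The platform still learns something about user 1, but only the part of her information that is specific to her. The sketch also makes the trust issue visible: the intermediary has to see the data of the user it is protecting.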

Our paper is a contribution to the nascent but growing literature on data markets and the economics of privacy. There are several interesting future directions in this broad area.

First, much more work is needed on the effects of competition over data between different platforms and online companies, particularly in order to clarify the conditions under which competition limits or exacerbates data externalities.

Second, modelling how online platforms can exploit information about users for price discrimination and specific types of advertising is an important area (see, for example, Bergemann et al. 2019). In this context, one interesting direction is to investigate whether applications of personal data for designing personalised services can be unbundled from their use for intrusive marketing, price discrimination, or misleading advertising.

Third, much more work is necessary on realistic schemes for limiting the correlation between users sharing their data and others (or other methods for limiting data externalities). The specific mechanism we proposed is meant to be suggestive, and the limits of what can be designed and implemented need to be investigated much more systematically. A related issue is how to ensure the trustworthiness of data intermediaries.

Finally, and most importantly, our emphasis that market prices and current user actions for protecting privacy do not reveal the value users attach to their privacy highlights the need for careful empirical analysis documenting and estimating the value of data to platforms and the value of privacy to users in the presence of data externalities.

References

Acemoglu, D, A Makhdoumi, A Malekian, and A Ozdaglar (2019), “Too much data: prices and inefficiencies in data markets”, NBER Working Paper No. 26296.

Athey, S, C Catalini, and C Tucker (2017), “The digital privacy paradox: Small money, small costs, small talk”, NBER Working Paper No. 23488.

Bergemann, D, A Bonatti, and T Gan (2019), “The economics of social data”, Cowles Foundation Discussion Paper.

Gang, A (2018), “The Facebook and Cambridge Analytica scandal, explained with a simple diagram”, Vox.com, 2 May.

Granville, K (2018), “Facebook and Cambridge Analytica: What You Need to Know as Fallout Widens”, New York Times, 19 March.

Laudon, K C (1996), “Markets and privacy”, Communications of the ACM, 39 (9), 92–104.

Pasquale, F (2015), The black box society, Harvard University Press.

Varian, H (2002), “Economic aspects of personal privacy”, in W H Lehr and L M Pupillo (eds), Cyber Policy and Economics in an Internet Age, Springer, 127–137.

Zuboff, S (2019), The age of surveillance capitalism: The fight for a human future at the new frontier of power, PublicAffairs.
