2015 11 12 WS 261 Big Data for Development: Privacy Risks and Opportunities? Workshop Room 8 FINISHED

The following are the outputs of the real-time captioning taken during the Tenth Annual Meeting of the Internet Governance Forum (IGF) in João Pessoa, Brazil, from 10 to 13 November 2015. Although it is largely accurate, in some cases it may be incomplete or inaccurate due to inaudible passages or transcription errors. It is posted as an aid to understanding the proceedings at the event, but should not be treated as an authoritative record.

***

>> MARCIN DE KAMINSKI: Good morning everyone. We're just waiting a few more minutes for people to drop in, just like two minutes, and then we're -- thank you.

Good morning everyone, and welcome to this session, this workshop on big data for development, privacy risks and opportunities. This is a session that is co-hosted by SIDA, the Swedish International Development Cooperation Agency, and the United Nations Global Pulse. I'm Marcin de Kaminski and I'm a policy specialist on freedom of expression and ICT issues, and I will be moderating this session. With me today I have Mila Romanoff, who is the data privacy and data specialist from United Nations Global Pulse. I also have Rohan Samarajiva of LIRNEasia from Sri Lanka. I have Natasha Jackson, who is head of content policy at GSMA, from the U.K. I have Dr. Danilo Doneda, consultant of the National Consumer Secretariat at the Ministry of Justice in Brazil. I have Lorenzo Pupillo, who is the executive director in the Public and Regulatory Affairs Unit of Telecom Italia, Italy.

And remotely we have Drudeisha Madhub, the Data Protection Commissioner from the Republic of (?).

This is very timely as data has revolutionized our understanding of the world and from SIDA's perspective I'd like to give you an idea how big data could be used in development and that is mainly from our perspective. According to the sustainable development goals that we have all agreed upon, data will apparently have a really big -- it will be really important in terms of both implementing (?) but also to evaluate them and both technology and data is something that we need to be able to take care of and use properly but also understand the risks and opportunities that data brings.

In the work of UN Global Pulse, SIDA has been engaged in supporting the UN Global Pulse Lab in Uganda, in Sao Paulo, and we're part of -- me -- I'm part of the privacy advisory group, which I'm sure Mila will tell you more about, and this is a way to try to highlight the importance of treating these issues seriously but also to have a transparent and inclusive process when it comes to big data in development. So this session will try to highlight the risks, opportunities, challenges, discuss both the possibilities with big data and development but also discuss the issues that need to be discussed when it comes to privacy. I will give the floor to the panel -- or give the word to the panel, and then I will take questions from the floor and then we'll try to have as inclusive a discussion as possible. So please, Mila, if you would begin.

>> MILA ROMANOFF: Thank you. Thank you, Marcin. Good morning, everyone. So we'll start by talking about my organization first just to introduce a little bit of our work. We are a special initiative of the Secretary-General, and the vision that we share is that big data is being harnessed responsibly as a public good. Our mission is to assist UN agencies and development organizations, humanitarian organizations in adopting innovative uses of information, big data uses, for humanitarian and development causes. We have a twofold strategy as an innovation driver and ecosystem catalyst. As an innovation driver we perform research projects and feasibility studies to understand how realtime information could be used in assisting humanitarian and development missions. We also do advocacy and policy work, and work with UN agencies and other organizations in academia and private stakeholders in establishing or developing best practices on uses of big data, including access to big data, data sharing, and an initiative on data philanthropy is also part of our work.

As the data revolution, what is data revolution? The independent expert advisory group to the Secretary-General on data revolution in its report on data revolution in 2013 has suggested that big data is a new source of information that could support traditional sources of information, and assist sustainable development and humanitarian missions. However, such information cannot be utilized unless there are established policy and regulatory frameworks, and the private -- the right to privacy is respected.

What are the opportunities presented by -- by big data? That's what we're trying to explore as part of our work. And as you know, the private sector utilizes big data for market research or enhancement of products, receiving consumer feedback. Global Pulse and the research community that participates in similar work utilize big data for development causes and humanitarian action. One of the projects that you see on the screen right now is actually exploring how mobile data could be used to assist humanitarian missions. This was conducted with the World Food Programme to understand how mobile call detail records could be used to understand how people are acting before and after the flood. So on the left side you see the normal, regular behavior, but then suddenly during the floods you see a spike in call transactions. This could show us there is certainly an unusual behavior and alert humanitarian agencies to act fast to respond to the emergency crises.
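The spike in call transactions described for the flood example can be illustrated with a minimal sketch. It is not the method used in the project mentioned above; it assumes a hypothetical list of aggregated daily call counts for one area and flags days that sit far above a trailing baseline. The function name, the seven-day window, and the threshold of three standard deviations are all assumptions made for illustration.

```python
from statistics import mean, stdev

def find_spikes(daily_calls, baseline_days=7, threshold=3.0):
    """Return indices of days whose call count is more than `threshold`
    standard deviations above the mean of the preceding `baseline_days`."""
    spikes = []
    for i in range(baseline_days, len(daily_calls)):
        window = daily_calls[i - baseline_days:i]
        mu = mean(window)
        sigma = stdev(window)
        if sigma == 0:
            # A perfectly flat baseline gives no scale to compare against.
            continue
        if (daily_calls[i] - mu) / sigma > threshold:
            spikes.append(i)
    return spikes

# Hypothetical aggregated counts: a normal pattern, then a sudden jump on day 9.
calls = [100, 104, 98, 101, 99, 103, 97, 102, 100, 250, 240, 105]
print(find_spikes(calls))  # [9]
```

Only the first day of the jump is flagged here, because the jump itself then enters the trailing baseline and inflates it. A real analysis would likely compare against a seasonal baseline (same weekday, same hour) rather than the previous few days.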

This is another project we're using now social media data, actually Twitter, to understand and correlate how people are talking or communicating about fires in Indonesia and how this behavior could correlate to the real -- to the real behavior or ground cases -- cases on the ground. And this project has actually explored what people are talking about. And as you see on this map here is suddenly when the fires are happening, people are starting tweeting, and in Indonesia there was certainly a big spike in tweets, and people were talking about government support, where the hot spots are and what the level of crime. Those were the most -- the most tweeted content during the -- during this crisis.

Another -- another project is using mobile data for food security. And actually this one was conducted in an African country in collaboration with the World Food Programme mainly to understand how -- basically how much people are spending on their calls could be used to understand food security. And on the left you see the data based on a survey conducted on -- on the vital products, and then the map on the right shows actually -- could be correlated similarly on how people are spending on realtime purchase records and how this correlates with the World Food Programme survey. So definitely projects showed that realtime information can help us understand food security problems and also estimate poverty indicators.

So of course with the opportunities there are challenges and in our work we noticed that in order to utilize big data, we have to consider many of the -- of the risks, right, and there are also a lot of gaps in -- and one of them is actually fragmented regulatory landscape on regulating big data including personal data when it comes to personal data uses, then actually understanding what data is in certain case scenarios. There is no unified consensus on that. Issues of consent, when do you need to seek consent, when it comes to emergencies, or data repurposing.

And then we talk about awareness, lack of awareness in understanding of the risks but also understanding the benefits of using big data. Everyone is -- there is either a group of people who are talking about the opportunities and how great big data is, but there is another group of people who is very alarmed by big data uses. So the two groups need to come together and discuss risks and opportunities and how these could be actually mitigated and assessed. We need to properly understand, and this is a topic of our panel, what the risks -- what the right threshold of the risk is.

So at Global Pulse, we have privacy principles and the main idea is when we use information we need to make sure that if it's personal information it cannot be used without consent. However, when it comes to, for example, uses of -- call records, we need to make sure that we never access the content of communications in such call transactions, or if the data is anonymous we can never re-identify such information -- such data. We also work with private stakeholders and we need to perform due diligence and understand that our private stakeholders who give us access to such information or that work on such information to provide us with the insights are diligent in their work and that they collect data in the first place legitimately and fairly.

So as Marcin has mentioned, to build trust, we need to make sure the process is inclusive and transparent. For these purposes we have a privacy advisory group and actually a few members here of our privacy advisory group are on this panel as well. It's Rohan Deneelan (?) Morson, and Drudeisha Madhub, the data protection commissioner who is participating remotely, is also part of our privacy advisory group. The topic of the panel is risks and utility, so the main question in our work is to actually understand how we can mitigate risks. And one of the questions that we usually ask ourselves is what is the risk of not using the data, and that's what we are trying -- that's what I would like you to think about as well, of -- we have the risks and harms that could be caused by the uses of big data but we need to also think what are the risks when we're not using such data and how such risks could be mitigated.

This is a project that we conducted with the Massachusetts Institute of Technology to specifically understand how legitimate purpose, justified purpose, and the principle of proportionality could be used and applied in big data applications when it comes to humanitarian and development response. On the map, the objective of the project was to understand what is the level of detail that you need in call detail records to understand -- to achieve the objective of your mission, how much information do you need in order to achieve -- to understand how people, for example, are moving during the disaster response, and yet not lose that utility while still protecting privacy.

And the -- what the project has shown is that while there are risks of re-identification if you aggregate, let's say, by ZIP code or by municipality, of course the risks increase for re-identification the smaller the area is, however the utility does not always decrease, and it highly depends on the context. So for these purposes I think what is important to know is that we need to understand what is the risk threshold, how the threshold should be mitigated, what is the risk of using big data versus not using such data, and there is a need for awareness of bringing such points to the public, to the public stakeholders and private stakeholders. Thank you.
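The point that re-identification risk grows as the aggregation area shrinks can be illustrated with a small sketch. It is not the method of the project described above; it is a simple small-group count in the spirit of k-anonymity, and the records, area codes, and the choice of k are all hypothetical.

```python
from collections import Counter

def share_in_small_groups(user_to_area, k=3):
    """Fraction of users who fall in an area that contains fewer than k users.
    Users in such small groups are easier to single out."""
    sizes = Counter(user_to_area.values())
    at_risk = sum(1 for area in user_to_area.values() if sizes[area] < k)
    return at_risk / len(user_to_area)

# Hypothetical pseudonymized users mapped to a fine area (ZIP-like code).
by_zip = {
    "u1": "A1", "u2": "A1", "u3": "A1", "u4": "A2",
    "u5": "B1", "u6": "B1", "u7": "B2", "u8": "B2",
}
# The same users mapped to a coarser area (municipality = first letter).
by_municipality = {user: zip_code[0] for user, zip_code in by_zip.items()}

print(share_in_small_groups(by_zip))           # 0.625
print(share_in_small_groups(by_municipality))  # 0.0
```

With the finer areas, five of the eight users sit in groups smaller than three; with the coarser areas, none do. Whether the coarser level still answers the operational question is the utility side of the trade-off, which this sketch does not measure.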

>> MARCIN DE KAMINSKI: Thank you so much. Rohan? Please.

>> ROHAN SAMARAJIVA: Good morning. Next. Okay. Thanks. I come from an organization which describes itself -- you may see there, as pro-poor, pro-market. Everything that we do has to have implications for -- not only for the rich in society. So when we were looking at -- from 2012 we've been -- I've been part of a team of data scientists, domain experts, actually doing big data conceptualizing and executing big data research in the global (?). We may be the only entity that fits these criteria. We looked at what were the comprehensive data sets, so these are some of the countries that we work in. And you can see that even when it -- the last column, this would be on your left, is Facebook users per hundred, where, for example, Thailand has 49 Facebook users per 100 people, but still that is below 50%, whereas the first column, which is mobile SIMs per hundred, you see that even in Myanmar, which basically started getting into ICT connectivity this year, in 2014-2015, the number is already at 50 per hundred. So it was not a question as to what data sets we would work with, which included the poor, and that is mobile SIMs -- mobile network data. So we are not relying on smartphone data or anything like that but just mobile network data, which is generated by even the simplest phone.

So when we look at our data, we don't think that the mobile data by itself can give all the answers, so some of the answers -- for example, because it tracks how people move through time and space, it gives incredible insights in terms of traffic management and urban development, and those are the slides I'm going to show to you, but in the other cases we are working with the (?) offices and various other entities to supplement, to validate and to cross-check these data sources. We work only with pseudonymized data and we cover about 60% of the population that we are working with in the areas that we report our results, from multiple mobile operators, so that even which operator it comes from cannot be -- cannot be figured out. So some of the things that we do will obviously -- what we're interested in is things that have public implications for public purposes, to see what the public -- that the public policies are working, things of that nature, that will allow even policy experimentation, but we do understand that there are some of these things that are -- will use where they could also be used for private purposes.

What you see before you is a temporal mapping of how people move, we take where they're located at midnight as their home location and the blue areas show where -- so the top middle is a weekday noontime slot, collapsed over multiple months, where we show where people have left, that is the density has decreased, where those areas serve as sources, and the red and the yellow are the sinks where they have moved to. So this is quite useful for urban planning and traffic management kinds of issues.
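The source-and-sink idea described here can be sketched in a few lines. This is an illustration only, not the team's actual pipeline: it assumes two hypothetical lookups from pseudonymized subscriber to cell area, one at midnight (taken as home) and one at weekday noon, and reports the net change in people per area.

```python
from collections import Counter

def net_flow(home_cell, noon_cell):
    """Net change in head count per area between midnight (home) and noon.
    Negative values are sources (people have left), positive values are sinks."""
    # Only compare subscribers observed at both times.
    common = home_cell.keys() & noon_cell.keys()
    home = Counter(home_cell[u] for u in common)
    noon = Counter(noon_cell[u] for u in common)
    return {area: noon[area] - home[area] for area in sorted(set(home) | set(noon))}

# Hypothetical pseudonymized subscribers and area labels.
home_cell = {"a": "suburb", "b": "suburb", "c": "suburb", "d": "centre"}
noon_cell = {"a": "centre", "b": "centre", "c": "suburb", "d": "centre"}

print(net_flow(home_cell, noon_cell))  # {'centre': 2, 'suburb': -2}
```

In this toy data the suburb is a source and the centre is a sink. Averaging such counts over many weekdays, as described in the talk, would smooth out individual days.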

And we have cross-checked it against conventional transport studies. Ours -- even if we put all the hardware costs in, et cetera, would cost less than $30,000. The other study that is done every three or four years costs $300,000 or $400,000, and ours can be done every 15 minutes if one wishes.

We also have been looking at things where we don't really need to know anything about the individuals, it's just the loading factors, how much traffic is carried on base stations. The two sort of profiles here, one is for a commercial area, because you can see that the weekend and weekday behaviors are very different, fundamentally different. The weekends are the blues and the greens, and the other one where there is no big difference between weekday and weekend and the peak is about 7:00 p.m., whereas in the other one the peak is at about 12:00. So from this we are able to identify where -- so you have three figures here for the metropolitan -- the city of Colombo, and two of the figures are the official planning maps which show what land use is, and ours shows that quite a lot of the areas that were demarcated as being residential have already converted to commercial, according to the signatures.
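The two load signatures described here (a midday weekday peak with quiet weekends, versus an evening peak with little weekday/weekend difference) can be turned into a rough rule-of-thumb classifier. The sketch below is an assumption-laden illustration, not the method behind the Colombo maps: the function name, the 0.8 similarity cutoff, and the hour ranges are invented for the example.

```python
def classify_station(weekday, weekend, similarity_cutoff=0.8):
    """Label a base station from its 24 hourly average loads.
    weekday, weekend: lists of 24 non-negative numbers (hour 0 to hour 23)."""
    peak_hour = max(range(24), key=lambda h: weekday[h])
    # Ratio near 1 means weekends look like weekdays in total volume.
    ratio = sum(weekend) / sum(weekday)
    if ratio < similarity_cutoff and 9 <= peak_hour <= 15:
        return "commercial-like"
    if ratio >= similarity_cutoff and peak_hour >= 18:
        return "residential-like"
    return "unclear"

# Hypothetical profiles: a triangular load curve peaking at a given hour.
def triangle(peak):
    return [max(0, 10 - abs(h - peak)) for h in range(24)]

office_weekday = triangle(12)
office_weekend = [0.3 * v for v in office_weekday]   # much quieter weekends
housing_weekday = triangle(19)
housing_weekend = list(housing_weekday)              # same pattern all week

print(classify_station(office_weekday, office_weekend))    # commercial-like
print(classify_station(housing_weekday, housing_weekend))  # residential-like
```

Because only station-level traffic totals are used, nothing about individual subscribers is needed, which is the point made in the talk. A real study would more likely cluster full profiles than apply fixed cutoffs.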

So we -- there are a lot more, but given time constraints I will just give you a flavor of the kind of work that can be done. We do understand that these are data that are generated by competitive suppliers. There are competition issues involved here. There's obviously, when you talk about data, even historical, of people moving through time and space, and sort of the metadata of who they call or are called by, there are significant harms that could come, marginalization harms of not being included in the data sets at all and therefore being ignored by public policy, competition harms in the sense for small companies, new entrants as against those who already have the data, and finally what I generally call privacy harms, which is a rather undefined category because there are many things that come under the heading of privacy, but we have gone through and looked at all the harms that could be caused, and we are working with the data suppliers to develop language that will specifically address the harms and will therefore reduce the transaction costs of sharing this data. We're also working with them on simplifying the pseudonymization processes because it does take a lot of their computing resources. So I leave it at that for an introduction.

>> MARCIN DE KAMINSKI: Thank you so much.

>> NATASHA JACKSON: Thank you and good morning. I'm Natasha Jackson and I work for the GSMA, which is the global association for mobile phone operators, so all the mobile phone operators, around 850 of them around the world, are members of ours.

So we -- we have looked at big data also from an Internet of Things perspective, but increasingly are looking at it from the socioeconomic perspective, and we have recognized, as we've heard before on the panel, all of the socioeconomic benefits that can be derived from it. We have a number of members who have used CDRs and others to better understand, particularly around spreads of diseases, and I have a few examples on a slide here of some of our operators who have worked on these issues. Back in 2013 Orange, part of France Telecom, worked in the Ivory Coast looking at the spread of malaria, Telefonica have done similar work in Mexico on swine flu, and Telenor in Pakistan looked at the spread of dengue. But there are other studies, mapping populations, like Rohan described, and also predictions for crime hot spots in different localities, and Millicom, also up there, has also now developed a partnership with Flowminder to look particularly at health across all of the territories that they operate in, so that would be with sort of local authorities, they would then work with ministries, health ministries and other agencies.

So I mean, as more and more people are using mobile phones and their data can potentially become more revealing about them, really the important issue around privacy has come up, and this is something that the GSMA has taken very seriously for a while in a general sense. We developed and started working on privacy issues back in 2011 when we developed a set of mobile privacy principles, and these govern how we believe that mobile privacy should be respected by mobile operators and others in the ecosystem at a very high level. So top-level principles.

We then developed those further into looking in specific areas such as app privacy and how apps should handle data. All of these are available on our webs, trustme.com if you're interested. Last year in 2014 we worked specifically on Ebola and looking at the response to the countries affected and how the mobile operators could help their work through the provision of big data. So we worked with both data scientists and our operators in those countries during the crisis to develop a set of guidelines that would really guide how the data could be accessed and how it would be used, and the -- one of the principal reasons of this was also to reassure the local governments that these privacy risks and implications had been thought about up front before any of this work started.

One of the fundamental principles that we have in the guidelines is that the CDR data will never leave the custody of the mobile operators, and the mobile operators are very strong on this particular issue. So, for example, Telenor in Pakistan followed the same guidelines and made sure the analysis on their data was done in their own premises in their country and that data did not leave it. They only accessed as part of the -- data would only be accessed by authorized personnel. There would be an auditable process of who had access to the data at the time, and the analysis of the data itself was also only performed on the premises, and under the operator's supervision at all times, and only by approved research agencies. So it was only really the aggregated data, the maps and the statistics that would then leave the mobile operator's custody, and those would go to approved third parties and those under legal contracts.

In terms of what we learned from the Ebola experience last year, I mean, one of the key issues that came up was this ambiguity in the law particularly around telecoms data. So when we think about privacy we tend to think about data protection rules but in many countries those rules may not exist or there may not be omnibus laws. Telecom operators operate under licenses, and these may not apply to other entities but they will apply there. So in the Ebola case, for example, the laws may in the countries -- there was more than one country involved, but the laws may not explicitly allow the use of big data in this -- or personal data in this sense, but there were certainly penalties there that would be imposed on the operators if any data confidentiality was breached at all, so they were very nervous about some of those conditions that they had. So this ambiguity in law could be a disincentive to mobile operators.

The other thing we learned from Ebola was that we don't necessarily have all the skills or the local capacity in the countries where the data resides, and if you're only processing on those premises you need to have those skills there, so building up local capacity is very important.

And the other challenge we had was around, not necessarily on Ebola but generally, competing agencies and research agencies, who do you -- how do you know and how do you judge which entity has addressed privacy risks or which agency is going to be the lead agency.

So sort of in conclusion, our takeaway is that, of course, we always have to consider the risks of the data and we take a risk-based approach right from the beginning and before doing any of the work, we need to think about the way that the privacy -- the data might be privacy impactful or harmful, both for individuals and for groups and communities of people. We need to think very carefully about the regulations and new regulations in particular, make sure that there's nothing in there that would preclude the use of big data in the future, because we don't know how we may want to use some of these big data, what the use cases may be.

And one of the ways that we might look at that is by establishing a sort of public good interest in new data protection regulations and laws. And this particularly, I think, could incentivize mobile operators who haven't worked in this area, and especially those who are in countries where there are ambiguities in law or they're worried about some of the penalties that they have under general telecoms regulations. And of course we need really clear rules and methodologies for data, so there are these guidelines we developed, others that Mila talked about before, and we need to work on them all together, so whether that's companies, chief privacy officers, data scientists, regulators, policy makers and civil society, we need to work on those together because if we don't develop a set of rules and guidelines that people can trust, then there will be increasing concerns and scrutiny by regulators over what we do in this area, and that is the risk that we don't use the opportunities that big data can provide. Thank you very much.

>> MARCIN DE KAMINSKI: Thank you. Danilo?

>> DANILO DONEDA: Thank you. Good morning. A check note from perspective issues is likely different than the other one from the data controller's perspective from data user's perspective, a check from regulator's perspective. I've worked for several years in the Ministry of Justice in Brazil, and now I'm currently a counselor to the Ministry of Justice on data protection issues. And as many of you may be aware, here in Brazil we don't have a general data protection regulation. There is a pending regulation, there is a proposal which has been drawn up by the Ministry of Justice on data protection law. There are some proposals in Congress, but currently we only have some pieces, some slices of regulations in some sectors. We have the consumer law, we have the Internet Bill of Rights, Marco Civil, but the first question, checking a bit of the speech made by Natasha, is what is the panorama of (?) data protection regulation and we just need to sever our big data use. Is it good not to have a special general regulation on data protection? On one side some may think that it's interesting not to have major regulatory obstacles on using big data, but that is much more wishful thinking than anything else. Because if you are not sure what your boundaries are when dealing with personal data, with aggregated data, there may be no clear, no concrete legal basis for personal data treatment. There will be no legal certainty. There will be a situation where responsible players may not have the incentive to use big data, in favor of other players who are not so, let's say -- maybe not so fair.

What is the situation of Brazil right now? For instance, check the purpose principle, which is can for big data use for secondary user of big data and so on. The purpose principle is not present in any general statute in Brazil. The purpose principle is mentioned in a credit law -- it's present on the Marco Civil (?) Internet Bill of Rights but not maybe in a very direct way, but something that will be similar to purpose principle can be interpreted from the similar law for the constitution. So it is very hard to say if secondary use of public data, even for public interest, will be deemed legal under circumstances right now.

Let's take, for instance, what happens when -- let's say a traditional big data product is used in Brazil. We have a major telecom operator, who one year ago announced a project using aggregated geolocation data to sell reports of information on people in urban areas and so on. That caught the attention of some in government who asked how it was done, was it transparent for mobile phone users? Did the telecom operator gather -- consent -- protect it. The consent was included in the contract with the operator and so it caused the telecom operator to redefine, it seemed that the telecom operator postponed for one year, more or less, their launch of the product, and until now with this current investigation, maybe it's not easy for the -- for the users, but it's not an easy task for the regulators, as it seems that some boundaries of that big data use are not clear and the regulator has to deal with people's rights and expectations, and also with the fact that some public interest use may be deemed illegal on some occasions because of uncertainty or the lack of a real legal basis.

In the proposal of data protection regulation, which was drafted by the Ministry of Justice, there are some points on which you can form a prior legal basis for big data use, which is mainly, among others, the legitimate interests of data -- data controllers, but also a section for the public interest, which are all exceptions to the requirement of consent, and can inform -- not only -- limits that justify the treatment of personal data. I believe these questions, these issues can be taken -- must be taken into consideration together with the issue of consent, which is maybe a bit anachronistic -- seems to be anachronistic sometimes when you deal with big data, but we believe consent can have and will have a role in big data on some occasions, and I wish to talk about this a bit more later.

>> MARCIN DE KAMINSKI: Thank you. Your answer, please?

>> LORENZO PUPILLO: Good morning. First of all thank you very much to the organizers for inviting Telecom Italia to this workshop. I believe that the importance of mobile communication for development has been well established in terms of impact on GDP and so on. All of us know very well, this has pressure. Like now (?) can -- farmer can have information on (?) their crops are directed through the phone. But probably now we start to realize that we don't have any more only the phone, but each user with a (?) phone has a computer. This is a completely new ball game. It is a great potential, especially because innovation, digital innovation is (?) the world is not just based on a big -- big new innovation, but it's a combination of existing pieces of innovation and it's not just the phone. It means there is a lot that can be done.

Big data is an application of that, and so it's important to look at the potential for development of big data. Telecom Italia promoted the development of big data and we have -- we have done (?) initiative to do that. This is what I was talking about today. We have promoted what we call big data challenge. In other words, we promote our competition, a contest on open innovation that put in touch big data owners with academic and the professionals that have the competence to explore this big data. In other words, we collect this competition -- this competition collect innovative big data project that represents business opportunity in the short and medium term.

In 2015 it was called big data for competitive boost. We had about 1100 participants from more than 20 countries. We examined more than 100 projects, and these projects basically were using data from seven large Italian cities, two months of data, more than 100 gigabytes of data. This data, of course, this is important, it was mentioned before, was heterogeneous, anonymized big data.

So the participants had data from energy consumption, car GPS position, (?) studies, mobile network, so traffic sensor data, (?) formation. They developed a bunch of projects. They kept, of course, the intellectual property on the ideas that they presented.

There were two tracks of projects: The academic track for advanced innovative solution. The targets were student research academic teams. And then there was an industrial track for market (?) solution. Target is small, medium (?) enterprise and venture capital teams. There was a special committee of academia distinguished personalities, and the best project got a prize of 10,000 euro. Like I could say before, the company -- no, the data set were coming from different companies, like team, they mentioned statistic institute in Italy, insurance company, Twitter for social network and so on. Also the university from abroad, MIT, the university participate through the -- through this contest. And I think was quite successful initially. It probably should be replicated everywhere because it's way of allowing to some extent the matching of demand and supply of using of big data.

A couple of additional comments. I think that from what I understand, and I believe big data can play an important role, for instance, in the -- in making smart cities more resilient. Traditionally smart cities have been considered like a way of offering digital services, okay, but I think to make them more resilient, we should use big data available from the offering of these services to have a better governance of the cities. And so here comes the issue of how we use the big data, how the data sets talk to each other, and so how we open this data and create a standard, make an interoperable platform to do that.

And I think we should also make clear that data should follow some clear rule. Basically data can be considered under three conditions. We have a personal data, common interest data, also data that can be offered for open monetization.

I think for the first category we should follow, you know, the rules of data protection that each company is developing. Common interest data, they should be made completely anonymized, but they should be made available because we allow the development of new services.

And the third category, there are also data that can be offered for open monetization because there is a market for that, as long as there are clear rules to use. Thank you.

>> MARCIN DE KAMINSKI: Thank you so much. Now we have our remote participant, Drudeisha. So Drudeisha will appear on the screen.

>> DRUDEISHA MADHUB: Hello. Good morning. Thank you for having invited me to participate --

>> MARCIN DE KAMINSKI: When Drudeisha has presented her input to this panel, there will be room for questions from the floor, so please try also to prepare for questions that are brief, on target and not too long timewise. Thank you.

>> DRUDEISHA MADHUB: Okay. Thank you. I'll try. Thank you. So can I start the presentation?

>> MARCIN DE KAMINSKI: Yes.

>> DRUDEISHA MADHUB: Okay. So I would focus my introduction on big data challenges, and try to bring some recommendations for healthy use of big data in the developmental context. The question is how do we analyze and quantify what is still unknown with big data in a developmental context. This is, I think, a big challenge because the data often relied on is gathered from mainly external sources, either public or private. So we must really work towards closing the divide between what is and what could be the possible harms and risks to people. I'm confident that U.N. Global Pulse is handling these challenges, but I'm not confident on the quality of big data we have now on a large scale. For me the challenge is how to (?) between reliable and non-reliable data. So we're dealing with a situation where at the source the data may be potentially corrupt, and we need to adopt preventive and predictive techniques which will seriously allow the detection of such things before critical use of that data.

There are obviously undeniable benefits to be derived from good quality big data, but a lack of data protection regulations, more in developing countries than developed countries, is a big concern in a developmental context. In order (?) related to big data adoption, policy makers should ensure various enabling conditions for the creation, availability and use of data. The lack of big data-related skills and competency also underscores the importance of moving the focus beyond the numbers of technological devices to the strengthening of national or international technology capacity to use big data. Collaboration and cooperation among stakeholders is thus essential to foster a proper data ecosystem. The value of information no longer resides solely in the primary (?) of big data. It is now in the secondary uses that big data is put to. Strikingly, in this big data age, most innovative secondary uses haven't yet been imagined when the data is first collected. So how can (?) give informed consent to the unknown?

The alternative, which is asking users to agree to any possible future use of their data at the time of collection, I don't think is too helpful either. Such a wholesale permission emasculates the very notion of informed consent in the context of big data. The tried and trusted concept of notice and consent is often either too restrictive to (?) data's latent value or too empty to protect individuals' privacy. If everyone's information is in a data set, even choosing to opt out may leave a trace. Let's take Google Street View. Unfortunately, big data with its increase in the quantity and variety of information also facilitates re-identification.

So what is to be done? We should envision a different privacy framework for the big data age, one focused less on individual consent at the time of collection and more on holding data users accountable for what they do. Running a form of big data use assessment, such as a PIA, a privacy impact assessment, correctly and implementing its findings securely offers tangible benefits to data users. They would be free to pursue secondary uses of personal data in many instances without having to go back to individuals to get their express consent. On the other hand, sloppy assessments or poor implementation of safeguards will expose data users to legal liability and regulatory actions.

So data accountability only works when it has (?). Shifting the burden of responsibility from the public to the users of data makes sense for a number of reasons. They know much more than anybody else, and certainly more than data subjects or regulators, about how they intend to use the data. By conducting PIAs, for example, they will avoid the problem of revealing confidential strategies to outsiders. Perhaps most important, the data users reap most of the benefits of secondary use. So it's only fair to hold them accountable for their actions and place the burden for this review on them.

I will go straight because it's just an introduction. I don't have too much time. And what I think is my main recommendation going to be. There is a big need for international guidelines, policies, rules on big data. It's a must to ensure that all countries can speak the same language on the same parity level. Otherwise this digital divide may lead to further multiple and different uses of the same data, depending on the context which it has been used in, and this is quite dangerous because it is actually making that data something which it wasn't meant for. So this is my small conclusion to the introduction. Thank you very much for giving me this opportunity. Thank you.

>> MARCIN DE KAMINSKI: Thank you very much. And please stay with us, Drudeisha. Before letting the audience in, I actually have a question for Drudeisha. So if we could get her back on the screen? Thank you. I would ask Drudeisha, so -- can you hear me, Drudeisha?

>> DRUDEISHA MADHUB: Yes.

>> MARCIN DE KAMINSKI: So you began your presentation discussing the possibility that the data actually also could be corrupt, so I have just a quick follow-up question on that. I mean, how can we harness the trust issues in this long chain of actors, from -- from data collectors to data brokers, to users, and governments as well? Do you have any ideas on that?

>> DRUDEISHA MADHUB: Well, actually -- it's actually a very long chain. That's the issue. The longer the chain, the more the potential that the data may be actually get corrupt in the process. So what was initially collected may have been, let's say, anonymized at one point in time and then re-identified as far as it is possible, and then we give a different perspective to that data. And this is the risk that the (?) runs, because the data is taking so many forms and so many dimensions that the individual himself is not aware of, and we actually don't really inform him of the multiple, let's say, facets of -- that the data has taken.

So in this sense this data, which initially belonged to the individual, no longer actually belongs to him because it is something which he has completely changed. So how far -- I mean, how far are we being realistic with the use of data? Let's say not only big data but data itself, because this is the data that will rely in a big data context. So how far are we being realistic and how far are we being (?) and how far are we really putting the focus on the quality of the data that we are dealing with? So I think this is what I would like to say on that.

>> MARCIN DE KAMINSKI: Thank you. Any questions from the audience in the room? And please stay with us, Drudeisha, throughout the session.

>> DRUDEISHA MADHUB: Yes, thank you.

>> MARCIN DE KAMINSKI: Apparently I need to run with the mic, so I will do that. Please.

Thank you. I'm from IT (?) bureau. I think there have been a couple of issues here. One is the value of big data which was demonstrated and how it is valuable increasingly for development, and that we would agree. And second is the harm it may do in terms of privacy and if you're talking about it. But the political economic question about who benefits from the value of that data. This discussion I think has not quite started. Now, like Rohan is working on big data sets which he gets from the telephone operators, which they may be sharing now, but I also heard Telecom Italia talk about big data for competitive advantage and they are going to soon realize that's one of the big resources they hold, and at that point the question is that data which has been collected from the people, who is -- I don't think they'll keep on giving it free for whatever may be called public use. So who determines the ownership and who gets the value of the data that is collected by the people? So what's the political (?) of data that has to connect to the privacy issues around data. Just a comment. And I would like to hear your views on it.

>> MARCIN DE KAMINSKI: Thank you.

>> LORENZO PUPILLO: I have to say, I am not an expert in privacy protection, but the kind of breakdown that they did, you know, basically private personal data, okay, common interest data and data offered for open monetization. I think made it clear what you say. In other words, be create some market for, okay? For the three products. Let's put it this way. Personal data, of course if you don't have a consent you cannot use. I mean, it's -- they have to follow the rules of the country related to data protection. So of course if your consumer does not give you the authorization to do it, you cannot use.

Then there are the common interest data. I don't think that, for instance -- I think there is a general benefit if I know what are the flow of -- in the morning of traffic going from one point, from A to B. Yeah, it's all a bunch of -- first there are recommendations from (?) public sector information. This becomes public sector information. It can be used here. So maybe the government can -- can offer this data, make available, and then the private sector can create the service for.

Okay, and then there is the third area. Maybe we should understand better what it means. You know, the offer for open monetization but there can be a market for that too.

>> MARCIN DE KAMINSKI: We have a number of questions from -- we have one there first, and then we have there. So we have a number, but we do also have a gender bias in the question, people asking questions. So actually, if there's a woman also wanting to have a question, I will put that first, between the guys. Please, if you begin.

>> Hello there. I'm Rodrigo from Dynamo, Brazil. As Drudeisha said, one of my concerns is also on the privacy, what we consider today, what's anonymous data. So I just want to give some information for you guys. I work with innovation startups, and Microsoft released two years ago a paper doing research on big data collected for the gyroscopes from the cell phones. So it could generate with information they collected from the phones that everybody has, like a fingerprint of usage. So data that was collected for usage of other things was reused by Microsoft to generate a fingerprint of this data, and made like -- if you got my phone and used it, Microsoft would know who was using my phone wasn't me, it was you, just by using the phone, using it in your pocket and walking around.

So how do you treat what we consider today anonymous data, in the future? How does regulation treat that so we can be secure that what we gave them the data for is not being used for another thing. Thank you.

>> MARCIN DE KAMINSKI: Thank you. Any answers to the question?

>> I have a response to the previous question as well. Would you prefer that we collect some questions and then we answer or do you want me to go --

>> Let's do that. We have remote question to begin with, because we need to let the Internet in as well, if it works.

So thank you. My first question is for -- for the -- for India, for (?) companies for presentation. So it's just related to users' mobility for traffic management. So is that available -- for example, taking the basic trajectory for every user's movements in the city and how we can -- how we can know the user's behavior in such situation, how we can (?) user's behavior, for example, maybe my navigation and other personal navigation may be sometimes different, and that's also difficult for privacy issue.

And the other question for -- for the whole experts, when it comes to big data management nowadays there are a lot of money surrounding on big data management because telecom companies, Internet companies, they make a lot of data on that project. They are making a lot of money. So when I came to this, they are using all my full data. I don't have any income because I'm paid for the Internet, but they are using my data, they are selling my data. So what is the future direction for prospective users of telecom companies and use of Internet companies? Are they going to participate -- are they going to be getting some feedback or getting some income? Are they going to know who is using their data? Because I'm not getting who is selling or buying the data. Thank you.

>> MARCIN DE KAMINSKI: Thank you. Was it possible to get a question from remote?

>> Yes. Thank you, Chair. We have a question from Pat Walsh from the United Kingdom, and he asks in times of humanitarian crises people move across borders. Their data moves with them. How to ensure -- how do we ensure protection across borders?

>> MARCIN DE KAMINSKI: We have a question here in the middle too. You in the white shirt.

Thank you. I'd like to start by saying that in my opinion data is not relevant by itself, so we don't -- I don't think you have to differentiate less or more relevant data. You can turn irrelevant data into relevant data by just crossing databases. So we're saying in privacy issues that transparency can solve this, but can we think that transparency can be misdirected too? Like we can influence people to give data by saying that you're going to use it for some stuff and then just cross it with another database and use it for other stuff? Can you understand what I'm saying? So do you think that transparency is enough for the privacy issues of big data?

>> MARCIN DE KAMINSKI: Thank you. I think that we have three more questions waiting to be asked, so there and there and there as well, and then we will go to answering the questions.

>> Thank you. Derrick Cogburn from American University. I just wanted to follow up on a question, I think it was the first question in the room, is as we develop principles related to big data, I think that we need to pay particularly close attention to the potential for de-anonymization of data, particularly as we aggregate data from multiple sources, it becomes possible to -- I mean, there have been lots of studies that have shown how it's possible to identify people -- identify persons in data that were previously thought to be anonymous or de-identified. And I think that is a particular problem as we start to aggregate data from multiple sources. So I just -- as you develop the principles, I think Mila, you talked a lot about that, I think that's something we have to consider.

>> MARCIN DE KAMINSKI: Thank you.

>> Good morning, I'm from the ITS and (?) for society and study from Rio de Janeiro. I would like to follow up on one question, the remote had maybe a question on the issue of (?). You mentioned that the big data is become really important asset for humanitarian action, and I see a problem here, at least it needs to discuss this cross-violation on humanitarian action, especially I'd say getting some ideas on Natasha's comments on consistency and the need to harmonize rules. And here in the case of military action we have different stakeholders, processing data, exchanging data, so we have the private sector, Civil Society, industry, national governance, let's say -- I could add a different actor, which is international organizations and then we see that the international organizations (?) national laws on their protection. How could you -- could you think of addressing this issue of international organizations as data controllers and should we discuss at least a minimum set of rules for the processing of data, especially considering the role international organizations play in humanitarian action, maybe a kind of international agreement or a regulation for the UN. That's my point here.

>> MARCIN DE KAMINSKI: Thank you.

>> STEFAAN VERHULST: Thank you very much. My name is Stefaan Verhulst. I'm GovLab at NYU. I have a question with regard to the data value chain, meaning to a large extent sometimes data is being discussed as a thing, and then ultimately data only gets valued through a process. And so there are a variety of decision points through the process that are being made that either can increase the value but also can increase the harm, and it was already mentioned one element was of quality, but there are many other elements that are affected because data is to a large extent a process. So how do you deal with that from a data governance point of view?

>> MARCIN DE KAMINSKI: Thank you so much. Good questions. A lot of questions are discussing the issues of borders, physical borders or database borders, for instance, and Natasha, could I direct that question to you, maybe, as you are working with so many actors in so many countries?

>> NATASHA JACKSON: Yeah, sure. I mean, the issue of cross-border data is not just in big data we struggle with this and our group companies struggle with these where they work across regions in Europe those issues are getting addressed by the GDPR but in Asia and others. But I think one of the key issues here is the fact that there are so many multiple parties involved, so we need to really involve telecoms regulators as well, and that's really important in terms of capacity building because a lot of the regulators in some countries may not have those of these. And in some places equally the opportunities that could be used by these data haven't really been understood and sort of distilled into the thought process that much. So a lot of people are thinking about locking down the data rather than thinking about the opportunities it can present.

And then we need to get those telecom regulators together with data protection authorities, so there's quite a lot of up front work we need to do in order to start the discussions on them, and that needs to be the first thing, and actually it's events like this and some of the capacity building that GSMA does on these issues that's really important as a stepping-stone.

>> MARCIN DE KAMINSKI: Thank you so much. Go on, please.

>> ROHAN SAMARAJIVA: One of the most important things that I would like to emphasize is that before we make rules, before we try to legislate about the subject we should try to understand exactly what is going on. So, for example, abstractly you can talk about data crossing borders and being matched and leaving government control. In actual fact, in some countries you have international gateway operators. That means that those particular CDRs do not show what the B number or the code number on the other side is. All you get is a tandem switch or an operator's code. You are receiving the data from the operators pseudonymized. That is, before it leaves the operator's premises it is pseudonymized. So the script that has been run on data in country A or for operator A will give a certain set of random numbers. When I said pseudonymized, it means the same random string will be associated with a particular caller over time. But that particular caller's number in another network's network, in the custody of another operator, or on the other side of a tandem switch or an international gateway cannot be connected at all because the pseudonymized script that is run on the other side is completely different. So, for example, we would like to know what the international calling patterns of our customer -- the data set that we have, but we can't.

Now, people can sit in rooms and legislate about things that really cannot be done on the ground. So that is why I think it's a very, very important thing for lawyers, who like to make up rules for imagined scenarios, to actually talk to the researchers who are doing the research, and for the researchers who are doing this work to be able to engage with the domain experts and the people who look at rights and obligations.

Related to this is, for example, the GSMA says that their principle is that the data will remain in -- at all times with the operator. Now, the whole point of what is going on now is a democratization of data analytics. When I started work on this subject in the 1990s, only people with (?) computers, Cray supercomputers, the National Security Agency and American Express and people like that could do this kind of analytics. They were being done. Nobody knew about it. I was one of the few people who were writing about it. But today we can do that with $20,000 of computer hardware. We do it in batch mode. If we do it in batch mode, you are able to actually innovate. You can actually come up with new applications and lower-cost applications.

If you are limited to keeping all the data in the operator's premises you will not be able to innovate because they don't do batch mode. For them their computer resources for running the company, their key performance indicators are about churn analysis. That is what they use this data for mostly. And then we have to beg and grovel to do public interest research when there is off time on the computers. So the reality of it, if you really want to do third-party research for the public interest is that people who are making up these rules have to understand what the actual practice of data analytics analysis is, and they have to understand that this is a rapidly changing area, that what we knew two years ago is obsolete and that we now know new ways of handling the data and new models and new ways of improving the quality.

The first question that came up was about enforcing a property rights frame on to data. I think Stefaan was correct. That is to think of data as a thing and to impose an obsolete concept of property on to this. We have seen the damage that has been done in the area of Intellectual Property by enforcing an inappropriate framework of property on to data, on to knowledge. This data stream is all about secondary uses. It's about making valuable insights, extracting valuable insights where none existed before. So when that -- that means -- you have to think of it as a flow, and if -- obviously I agree, there will be value extracted from it. The companies are not going to give it away. It is co-created data issues by the way. It's not created by the company or the operator, it's co-creation. So when you have co-creation you have difficulty with imposing a property framework on it anyway. But the entity which has got possession, and we say possession is nine-tenths, is the operator. They will insist on extracting some of the value, and it doesn't have to be -- that will have to be done in terms of revenue or revenue shares because they don't even know what the value of this thing is today. It doesn't have to be done that way.

I'll give you one example of how this kind of thing plays out. In the city of New York, the city of New York is trying to extract data from their internal systems as well as other information about where people congregate, where there's traffic so small businesses can locate their businesses in the most appropriate places. Because their argument is the big places, the Wal-Marts, all these people are doing it anyway. They've got data analytics. They are locating on the basis of evidence. The small guys are locating on the basis of guesswork.

Now, if you put all these restrictions, what you will do is that the Wal-Marts will continue to do what they do and the small guys will be deprived of this kind of valuable information to be more competitive and to compete with the big companies. Thank you.

>> MARCIN DE KAMINSKI: Thank you so much. Mila, a question for you, maybe, because there was a question about transparency and trust. I mean, the UN has a bigger responsibility than most of the actors in the panel, so to speak. How would you say -- is transparency enough to guarantee privacy in these issues?

>> MILA ROMANOFF: Marcin, thank you so much for the question. I will first ask -- answer this question. I think transparency is not enough. Of course it is not enough. In addition to transparency we also need strong security measures, and we also need respect for privacy rights. We also need to consider harms and we need to understand risks. And as I noted in my presentation before, understanding risks of using information and also understanding the -- how the rights could be harmed if the data that we want to use is not being used, for example, in emergency.

Going back to another comment on anonymization, I would like to also add that while we're saying that transparency is not enough, there's certainly a big lack in developing methodologies on anonymization, or at least the same approach, the unified approach to anonymizing big data, their after use for development and humanitarian causes. So there's certainly a big need in researchers, engineers, policy makers, including the UN, practitioners and humanitarian workers to come together and work on this question. And it's not only an issue of anonymization, lack of anonymization, there is a question of lack of regulatory frameworks. I know that a lot of you have mentioned and there were a lot of comments with regard to establishing one big regulatory framework. There is a big debate with regard to whether we need a Geneva-type convention or an international for use of personal information, on the right to privacy.

I think in order to answer this question and in order to tackle it from the right perspective, I will also echo what Rohan mentioned. We also need not only lawyers participating in this process but we also need researchers and we need realtime practitioners, those who apply data in the end.

Coming back to the point of transparency, for Global Pulse we established the privacy advisory group, which serves as a transparent mechanism but also it's an inclusive mechanism. So in addition to transparency, establishing minimum standards for using big data, for development and humanitarian purposes, we also need to build awareness. And the privacy advisory group, which is comprised of researchers and engineers, lawyers, as well as humanitarian and development practitioners, serves as the basis for these people to come together and to actually think of the ways of how big data could be utilized properly.

When we talk of actually -- about privacy by itself, and understanding what constitutes personal data, you know, work I think -- in general work for using big data for development, it's very important to understand what is personal data. Currently internationally there is no consensus of what is considered private. So I think -- or even understanding a definition of privacy. When we talk about establishing minimum standards for anonymization, data aggregation, if you take one country and you transfer data from one country to another, how are we going to determine what is personal data and what regulation should apply to that if we don't understand what constitutes personal data in one country versus another. One can say that pseudonymized call detail records actually -- actually constitute personal data, and in a different country, pseudonymized data constitutes nonpersonal information.

Coming back now to the question now of linkability. When you work in one jurisdiction where information is not considered personal, you go into another one and you connect with another data set, such information may become personal. How do we regulate that? And I think the question -- the answer is to explore more frameworks on anonymization to be indeed more transparent on the practices so the public and those who we're trying to benefit, those we're trying to protect are indeed aware of the risks, harms and also benefits that -- of the big data use. And I think that's a very important component.
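
[Editorial illustration, not part of the spoken record.] The linkability problem raised here, where records that carry no names become personal once they are connected with a second data set, can be sketched as a simple join on shared attributes. Everything in the example is an assumption made for illustration: the records are invented, and the field names and the function `link` are hypothetical.

```python
# Data set A: pseudonymized records that carry no names (invented data).
mobility = [
    {"pseudonym": "a91f", "home_district": "North", "birth_year": 1980, "gender": "F"},
    {"pseudonym": "77c2", "home_district": "North", "birth_year": 1975, "gender": "M"},
    {"pseudonym": "0be4", "home_district": "South", "birth_year": 1980, "gender": "F"},
]

# Data set B: a separate register that does carry names (invented data).
register = [
    {"name": "Person One", "home_district": "North", "birth_year": 1980, "gender": "F"},
    {"name": "Person Two", "home_district": "South", "birth_year": 1990, "gender": "M"},
]

# Attributes present in both data sets; none of them is a name on its own.
QUASI_IDENTIFIERS = ("home_district", "birth_year", "gender")


def link(records_a, records_b, keys=QUASI_IDENTIFIERS):
    """Map each pseudonym in A to a name in B when the shared attributes match uniquely."""
    index = {}
    for row in records_b:
        index.setdefault(tuple(row[k] for k in keys), []).append(row["name"])

    matches = {}
    for row in records_a:
        names = index.get(tuple(row[k] for k in keys), [])
        if len(names) == 1:  # a unique match re-identifies the record
            matches[row["pseudonym"]] = names[0]
    return matches


reidentified = link(mobility, register)
```

Here one of the three pseudonymized records matches exactly one named entry, so that record is re-identified even though neither data set alone links the pseudonym to a name.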

I would also like to echo what Drudeisha said during her presentation on the privacy impact assessment. I think it's important when we talk -- there is no one answer that fits everything. We need to think of applications of big data and its value on a case-by-case basis and for these purposes we need to understand the context. And within the context we will be able to determine what are the risks and what are the likelihood of the harms that can occur when we're using the data in one country versus another. In one context versus another. And I think performing such assessment before every project is a necessary -- is a necessary tool.

So I think this is it for me, Marcin. Thank you.

>> MARCIN DE KAMINSKI: Thank you so much. Danilo, I have a question for you, but when you're done, I think that we have time for one or two more questions, so please prepare a good question to end with. But a lot of these discussions are about regulation, and Brazil has the Marco Civil. What do you see as the biggest benefits or maybe challenges for Marco Civil to actually be used wisely in this territory of discussion?

>> DANILO DONEDA: Marco Civil indeed has a big hold on data protection framework in Brazil, but it's not enough. We must that Marco Civil in this Article 3 remembers that data protection is a principle of Internet use but must be taken according to the law, but the question is which law does Marco Civil refer to? We believe that it lacks a general framework on data protection that must be complimented with general data protection view, for which Marco Civil (?) only to make a brief comment on the question by Rodrigo, by the representative of American University, regarding anonymization (?) data and digital fingerprint. It depends on the -- on the proposed regulation on data protection, you must stress that we try to consider anonymized data when it can refer to an individual, which is when it becomes a profile of an individual, under (?) on its life, on its choices. This data must be considered as personal data.

>> MARCIN DE KAMINSKI: Thank you. Questions? More questions? Otherwise, I have questions, of course, but that's boring.

>> Thank you. I come from India. I am working with (?) foundation. I have a question. Like we are talking about laws and everything, but do you think privacy-enhancing technologies or security enhancing technologies would be the answer to privacy? I mean privacy (?)

>> MARCIN DE KAMINSKI: Sir, could you just repeat the question?

>> Privacy enhancing technologies or security enhancing technologies could be the answer?

>> MARCIN DE KAMINSKI: Any takers?

>> ROHAN SAMARAJIVA: Even though we have just completed one report where we are looking at different ways in which the data can be shielded from re-identification or de-anonymization, as was stated before, the important thing to remember about de-anonymization is we have to place it in context. In most of the developing world where particularly the mobile data is of great value, what we have are prepaid accounts, and I think it's relatively well-known that despite the government's enormous efforts trying to associate human beings with the -- with the accounts, the failure rate is, I would say -- accuracy is less than 50%, to be generous. Operators in my country have said that they believe that about 70% of the (?) are used by people other than the people they are registered to. So when you come to that, and you're talking about de-anonymize, and you place it in context, and in many countries you have multiple data sets, datafied, data sets. In many developing countries you have very few datafied data sets. So while it may be possible to identify somebody using supermarket data and mobile data, that won't be a large number of the population because a large number of the population does not use supermarkets, and they are not giving credit card information or anything like that because credit card use is less than 5%. So we have to place it in context.

Now, the important thing is that as scientists we take this as a continuing problem. We don't think it's solved. We don't think it's useful to exaggerate it. We don't think it's useful to take the Netflix case study from the United States and try to impose it and say it works exactly like that in India, but on the other hand we think it's a changing field and we are looking at, for example, seeding the data sets with additional information so that when it's reconverted it will not reconvert to the original data set. So there's a lot of scientific work being done, and part of what we do is that we follow the ways in which technological means can be used to safeguard against the re-identification problem, which we believe is a real problem, but it should not be exaggerated.

>> MARCIN DE KAMINSKI: Thank you. Drudeisha, remotely? Are you there still? Hopefully.

(pause.)

So do you hear us, Drudeisha? Not really.

>> DRUDEISHA MADHUB: Can I speak?

>> MARCIN DE KAMINSKI: Yes. Great. Have you been following the questions as well?

>> DRUDEISHA MADHUB: Yes, I have been following all the questions, yes.

>> MARCIN DE KAMINSKI: There are many questions about data protection, basically, when data travels or people travel as well. How do you see your role as a data privacy commissioner or data (?) commissioner when it comes to these issues?

>> DRUDEISHA MADHUB: From a regulatory perspective it's very important that we -- we really understand how to apply the laws on data protection. As Rohan said, it's very difficult to apply data protection rules right now in such a difficult and technological context, and really we need to make laws in a way where everybody is actually involved, where, starting from data scientists, all the major relevant stakeholders are involved in making, I would say, some rules. Not to say that the data protection principles that we have across the world are not sound, but we need to really innovate in the legal sphere because there are many, many issues that we can't tackle with the current existing regulatory frameworks, which are so multifaceted in different parts of the world and so different from each other. So it doesn't make sense, I would say, that we need to apply different rules to perhaps the same data, which is going across borders, and that different legal treatment is being given to this data, which is essentially the same data.

So I think this is a very important point and we need to really focus on how actually we'll work together and make sure that we are actually implementing very sound and reasonable rules.

So this is my first point.

And my second point would be, we are actually in limbo, because we have the developing context -- we have the underdeveloped world and (?) block countries, and we have obviously very wide technological gaps, and technological infrastructure is really different from one country to the other. So how do we really keep pace on big data with these technological tools that we have, or technological structures, infrastructures that we have? Are we really giving respect to the data by applying, you know, restricted technologies or outdated technologies, which do not really treat the data the way it should be treated in some contexts, while in other contexts it's being done in a way that is really good? So we need to really work on these issues and try to see how we can actually work together and make sure that as far as big data is concerned, because it's really big, we are actually helping each other and making sure that we're giving quality to data. So that would be really my two points. Thank you.

>> MARCIN DE KAMINSKI: Thank you so much. Mila, some concluding remarks, maybe?

>> MILA ROMANOFF: (?) the concluding remarks, but I would like to follow up and clarify the use of aggregated data or pseudonymized information in our context, the development and humanitarian context. Based on the feasibility study we've done, as I previously mentioned, with the Massachusetts Institute of Technology, I think it's important that not in every case scenario do you need individual data, lateral data. As I said before, it depends on the context. We're looking at community-level data. Sometimes it's enough to even have data aggregated by ZIP code or even municipality. And the assessment of those uses needs to be done prior to the projects. But it's important to understand that it's not in every case scenario you will have a specifically unified -- I'm sorry, identifiable individual. No. Ethics need to be established, to commit to not re-identify that information, and also proper encryption mechanisms would have to be used and must be used in order to protect that information when it moves from one location to another. But again, please keep in mind that for development and humanitarian causes we don't need individual data, and I think it's an important point to consider when we move forward and talk about big data uses. Thank you.

>> MARCIN DE KAMINSKI: Thank you. We're actually out of time, which means that we are happy with this discussion. It's so interesting, because this is a discussion where many stakeholders meet, apparently, both in the room and in the panel, in the discussions outside of IGF, and obviously there are a lot of interesting and very competent and diverse questions still to be asked and to be answered as well. I hope that this discussion can continue after this session as well, online, in our stakeholder meetings, in the groups that are connected to this issue. Thank you so much for this time in this session.

(Applause)

(end of session)