Best Practices for Working Creatively with Personal Data

Anonymization

As explained in the previous section, data that has been de-identified or anonymized is exempt from privacy laws. In academic research, secondary use of data—use that is different from the original purpose for which the data was collected—is exempt from ethics review when it is applied to anonymous datasets (TCPS 2 2018). Consent is not required for the use of anonymized datasets since personal information has either been removed or obscured in such a way that re-identification is considered difficult. However, anonymization has become less reliable in an age of big data, smart devices, and social media. It is important to understand that anonymized datasets no longer offer the protections they once did. Smart devices and social media make a wealth of information publicly available (Cooper and Coetzee 2020; Parks 2021), and this big data can undermine the methods for protecting human subjects represented in anonymous datasets (Rocher, Hendrickx, and de Montjoye 2019). Furthermore, because machine learning algorithms work by finding patterns in data, there is no assurance (or even way of knowing) whether anonymized datasets have been cross-referenced and thus re-identified.

Thus, anonymous datasets that can be legally and ethically used in research can now breach a core tenet of ethical research because of the capabilities of machine learning, facial recognition, and big data (Hassan 2021). The potential for harm to the re-identified data subject is not equal for all data subjects and depends on the kind of data, its age, and a myriad of other local social, legal, political, and economic factors. In some parts of the world, disclosure of health status could prevent someone from accessing health insurance and thus health care; it could also impact them socially and/or economically by inviting discrimination and stigmatization. The possibility of re-identification is particularly troubling when we consider that technology giants such as Google and Facebook are establishing lines of business in the health care domain. During the KTVR e-Symposium in May 2021, Katrina Ingram and Fahim Hassan presented a workshop about the risks of aggregating health data, with a special focus on Google (Hassan 2021; Ingram 2021). The possibility of aggregating and sharing data across platforms is pushing the boundaries of new business models that create a myriad of risks. For example, 23andMe is a Google venture (via Alphabet) that became a publicly traded company in 2021 and is now using the DNA of millions of Americans to produce pharmaceuticals (Brown 2021). 23andMe is part of a Google health portfolio that includes insurance companies, medical record apps, and home health monitoring technologies that collect biometric data. Google’s Project Nightingale, which gave Google access to health care data through research partnerships, has already raised privacy concerns and lawsuits (Schneble, Elger, and Shaw 2020).

In his discussion of the inadequacy of de-identification to protect the identity of data subjects, legal scholar Mark Rothstein notes four categories of potential harm: group harm, objectionable uses, commercial exploitation, and undermining trust (2010, 6–8). Group harm refers to linking specific populations to stigmatized conditions. Objectionable use refers to the use of de-identified data in ways that the data subject had not anticipated and did not consent to. Commercial exploitation refers to monetary gain that has not been shared with the data subject. Undermining trust refers to the potential long-term impacts on the artist and affiliated organizations that could be difficult to recover from if the public senses that there has been unethical conduct in the use of data. An example of undermining trust is the infamous case of Henrietta Lacks (Skloot 2010). Lacks was a cancer patient who unwittingly provided the source material for an important medical breakthrough known as the HeLa cell line. These cells were the first to survive in a lab environment for more than a few days, which made them extremely valuable for medical research. For many decades, neither Lacks, who died soon after her cells were harvested, nor her family received any compensation. It was only after over fifty years of fighting and intense media attention that a donation was finally made by a biomedical research organization, the Howard Hughes Medical Institute, to the Henrietta Lacks Foundation (Witze 2020).

Metadata

In addition to the primary content of a data file, there can also be metadata associated with the data. Metadata is data about data. For example, when you take a photograph, information about the type of camera used, the location where the photograph was taken (GPS coordinates), a time stamp, the size of the file, and other information is automatically recorded by the camera. Digital technologies leave a digital footprint, which becomes another form of data that may also need to be anonymized to protect a data subject’s privacy.
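To make this concrete, embedded photo metadata (commonly stored as EXIF tags) can be inspected, and stripped, with widely used image libraries. Below is a minimal sketch using the Python library Pillow; the file names are placeholders.

```python
# A minimal sketch: read the EXIF metadata embedded in a photograph and
# save a copy without it. File names are placeholders.
from PIL import Image
from PIL.ExifTags import TAGS

img = Image.open("photo.jpg")

# Print each EXIF tag by its human-readable name (e.g. Model, DateTime, GPSInfo).
for tag_id, value in img.getexif().items():
    print(TAGS.get(tag_id, tag_id), value)

# Saving the pixel data without passing the exif argument drops the
# embedded metadata from the new file.
img.save("photo_without_metadata.jpg")
```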

To evaluate the risks associated with working with anonymized data, it is important to understand the main ways in which data privacy is enacted through privacy-preserving methods of anonymization, pseudonymization, and de-identification. Understanding these methods can also help shed light on the issue of re-identification. 

How is data anonymized?

There are different ways in which datasets can be anonymized. The most common ways are de-identification, pseudonymization, and data-cloaking techniques such as data scrambling and defacing. For big datasets, there are also statistical privacy-preserving methods such as generalization, perturbation, and randomization. These methods are not as applicable to medical scan data but may be pertinent to other datasets used by creators. The issue of re-identification is also related to data anonymization and is an increasing concern as datasets are reused and combined in novel ways that present new risks to privacy.

De-identification

De-identification refers to the removal of personally identifying information such as name, address, or unique identifiers (e.g., health care number) from a dataset. In principle, this technique should mean that there is no way to relink the data to the data subject, although true de-identification or anonymization is very difficult to achieve.

In medical scan images, direct-identifying data is often found as metadata in the header of the file (not visible in the scans). Radiologists recommend either a file conversion or removal of the header information that contains personal identifying information as standard methods to de-identify medical scans (Parker et al. 2021). However, there are limitations to both methods. File conversion can result in the loss of data, while removal of the header information may be insufficient if the vendor’s software systems retain identifiers. Historically, removing personal information was considered adequate for protecting the privacy of data subjects. However, recent studies have shown that with the development of big data, data subjects in datasets with fifteen or more demographic data points can be re-identified (Rocher, Hendrickx, and de Montjoye 2019), and in fact, studies dating back to the early 2000s have pointed to issues with the standards for de-identification of personal data (Rothstein 2010, 5–6). Chapter 5 of Canada’s Tri-Council Policy Statement notes that “technological developments have increased the ability to access, store and analyze large volumes of data. These activities may heighten risks of re-identification…Various factors can affect the risks of re-identification, and researchers and REBs should be vigilant in their efforts to recognize and reduce these risks” (TCPS 2 2018, 59). As Rothstein suggests, “responsible researchers should consider whether, in the context of their particular research, additional measures are needed to protect de-identified health information and biological specimens and demonstrate respect for the individuals from whom the information and specimens were obtained. Those who engage in research ought to be as thoughtful and meticulous about their relations with the human subjects of their research as they are about designing their experiments and analyzing their data” (2010, 9).
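For the common DICOM format, this header clean-up can be scripted. The sketch below uses the open-source pydicom library; the file names are placeholders, and the handful of tags blanked out is illustrative rather than a complete de-identification profile, for the reasons discussed above.

```python
# A minimal sketch of blanking direct identifiers in a DICOM header with
# pydicom. The tag list is illustrative, not a complete de-identification
# profile; identifiers may persist elsewhere (e.g. in private vendor tags).
import pydicom

ds = pydicom.dcmread("scan.dcm")  # placeholder file name

for keyword in ["PatientName", "PatientID", "PatientBirthDate",
                "PatientAddress", "ReferringPhysicianName"]:
    if keyword in ds:
        setattr(ds, keyword, "")

# Private (vendor-specific) tags can also carry identifying details.
ds.remove_private_tags()

ds.save_as("scan_deidentified.dcm")
```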

Pseudonymization

Pseudonymized data has had identifying information removed from the dataset and replaced by a random or artificial identifier, or pseudonym (fake name). Pairing the pseudonym with a key allows for the re-identification of the data subject in the dataset (University College London 2019). Pseudonymization is very common across medical research, probably more so than plain de-identification, because the master key is needed to track outcomes.
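A minimal sketch of the idea, using invented records, is shown below: each identifier is swapped for a random pseudonym, and the mapping between the two (the master key) is stored separately and securely.

```python
# A minimal pseudonymization sketch with invented records: identifiers are
# replaced by random pseudonyms, and the master key linking pseudonyms back
# to the originals is kept apart from the shared dataset.
import secrets

records = [
    {"health_number": "AB-1234", "age": 47, "diagnosis": "..."},
    {"health_number": "CD-5678", "age": 62, "diagnosis": "..."},
]

master_key = {}  # pseudonym -> original identifier; store securely, never share

for record in records:
    pseudonym = "P-" + secrets.token_hex(4)
    master_key[pseudonym] = record.pop("health_number")
    record["pseudonym"] = pseudonym

print(records)     # pseudonymized records, suitable for sharing
print(master_key)  # whoever holds this key can re-identify the subjects
```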

Data scrambling 

Data scrambling, or image cloaking, is another method increasingly used to anonymize image data and avoid unauthorized facial recognition. The software Fawkes from the University of Chicago (Shan, Wenger, and Zhang 2020), for example, adds pixel-level changes, or “cloaks,” to digital images that are imperceptible to the human eye but stop the images from being identified by facial recognition models. Essentially, this adds noise to the data so that it cannot be read, analyzed, and potentially re-identified by a machine.
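As a toy illustration of the general idea (and not the Fawkes algorithm itself, which computes carefully targeted, model-aware perturbations), the sketch below adds small random shifts to the pixel values of an image; the file names are placeholders.

```python
# A toy illustration of perturbing pixel values with small random noise.
# This is NOT the Fawkes cloaking algorithm, which optimizes targeted,
# model-aware perturbations; it only shows what "adding noise" to image
# data looks like in practice.
import numpy as np
from PIL import Image

img = Image.open("portrait.jpg")  # placeholder file name
pixels = np.asarray(img).astype(np.int16)

noise = np.random.randint(-3, 4, size=pixels.shape)  # shifts of at most +/-3
perturbed = np.clip(pixels + noise, 0, 255).astype(np.uint8)

Image.fromarray(perturbed).save("portrait_perturbed.png")
```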

Generalization

Generalization (also known as data blurring) involves making specific attributes of the dataset more broadly characterized. A very simple example might be replacing a person’s exact age with an age range: instead of saying someone is forty-seven years old, the dataset might indicate they are between the ages of forty and fifty. This technique is known more specifically as binning. Another generalization technique, called shortening, involves reducing the amount of information, for example going from a six-character postal code to its first three characters, or a geographical generalization such as stating the province instead of the city.
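Both techniques are simple to express in code. The sketch below, with invented values, bins an exact age into a ten-year range and shortens a Canadian postal code to its first three characters.

```python
# A minimal sketch of two generalization techniques: binning an exact age
# into a range, and shortening a postal code. Example values are invented.
def bin_age(age, width=10):
    low = (age // width) * width
    return f"{low}-{low + width}"

def shorten_postal_code(postal_code):
    return postal_code.replace(" ", "")[:3]

print(bin_age(47))                     # "40-50"
print(shorten_postal_code("T6G 2E1"))  # "T6G"
```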

Randomization

Randomization does not strictly fall under anonymization but more generally under privacy preservation, and it is often used as an alternative to anonymization. Randomization involves altering the relationship between variables in the data so they are less likely to identify an individual. This technique is often referred to as “adding noise” to a dataset or “perturbing” the data. Randomizing techniques used in machine learning with big data include differential privacy, permutation, and substitution.
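One widely cited example is the Laplace mechanism used in differential privacy, sketched below with invented numbers: noise drawn from a Laplace distribution is added to a true value, and the epsilon parameter controls how much.

```python
# A minimal sketch of "adding noise" in the spirit of differential privacy's
# Laplace mechanism. Smaller epsilon means more noise (more privacy, less
# accuracy). The count and parameters below are invented.
import numpy as np

def noisy_count(true_count, epsilon, sensitivity=1.0):
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise

true_count = 128  # e.g. how many people in a dataset share some attribute

print(noisy_count(true_count, epsilon=0.1))  # heavily perturbed
print(noisy_count(true_count, epsilon=5.0))  # close to the true value
```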

Typically, there are trade-offs between improving privacy and retaining utility of the data. Striking the right balance in determining which technique(s) to use and what trade-offs are acceptable is part of the ethical choice involved in applying these privacy-preservation methods.

When Marilène Oliver uploaded a 3D rendering of her MR scan data to Facebook, it was instantly tagged as belonging to her. Facial recognition software has the ability to identify people from their medical scans (Parker et al. 2021). In their white paper on the de-identification of medical imaging, Parker et al. (2021) explain that as well as capturing facial identity, medical scans can retain identifying features both beneath and above the skin’s surface, such as moles, fillings, pins, and hip and knee replacements. The anonymized Melanix dataset, which is included with the DICOM viewing software OsiriX, contains several such possible identifiers: artifacts from dental fillings, a mole, and even the indentation of a wedding ring. The Melanix dataset, which has technically been de-identified and pseudonymized, is thus an example of a dataset that has the potential of being re-identified.

The risk of re-identification with facial recognition software has led to the creation of defacing software, which automatically identifies and crops away the face from head scans. For diagnostic purposes, this method is reported to work well, but in aesthetic terms it seems disturbingly aggressive toward the human figure, and dehumanizing.

Screenshot of Facebook automatically identifying a 3D rendering of an MR scan
Image courtesy of Marilène Oliver

Blurring

Blurring or pixelation has historically been a common way of rendering an individual in a photograph anonymous. However, this technique has been criticized as inappropriate because it dehumanizes the individual (Nutbrown 2011, 8). This sentiment is echoed in the Guidelines for Ethical Visual Research Methods, where the authors acknowledge that blurring “reduces the authenticity” of the visual and/or risks dehumanizing participants, or denies participants the ability or right to make an informed choice about revealing their identity (Cox et al. 2014, 20).
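Technically, blurring or pixelating a region of an image is straightforward. The sketch below pixelates a rectangular region using the Python library Pillow; the file name and region coordinates are placeholders, and in practice the region might come from a face detector or be chosen by hand.

```python
# A minimal sketch of pixelating a rectangular region of a photograph with
# Pillow. The file name and coordinates are placeholders.
from PIL import Image

img = Image.open("photo.jpg")
box = (100, 80, 220, 200)  # (left, upper, right, lower) around the face

region = img.crop(box)
# Shrink, then scale back up with nearest-neighbour resampling to pixelate.
small = region.resize((max(1, region.width // 12), max(1, region.height // 12)),
                      Image.NEAREST)
img.paste(small.resize(region.size, Image.NEAREST), box)

img.save("photo_pixelated.jpg")
```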

Blurring human faces has been used by several artists as a way to memorialize and respect their subjects. For Monument (Odessa), the French artist Christian Boltanski worked with an image of a group of Jewish girls in France from 1939. By altering their likenesses through re-photographing, enlarging, and editing the photos, the identities of the individuals were lost, but references to their humanity and youth were retained. The images were further manipulated through the mode of their presentation—the altered images were installed in dark rooms, alongside glowing lights, and with reflective surfaces. The artwork was thus able to use the likeness of the children without infringing on their privacy and while reinforcing a memorial aspect (Boltanski 1989). In another example, Canadian artist Sandra Brewster used the technique of blurring to empower her subjects and to explore the “layered experiences of identity—ones that may bridge relationships to Canada and elsewhere, as well as to the present and the past” (Brewster 2019). Brewster achieved the blur in her 2017 series of photographs Blur by directing her subjects to move, to “evoke the self in motion, embodying time and space, and channeling cultures and stories passed down from generation to generation.” When asked why she decided to title her solo exhibition of 2019 Blur, Brewster explained,

“Blur” plays with and was inspired by all of the interpretations I mentioned: [how the works] explored movement and referenced migration and how the effects of migration may influence and inspire the formation of one’s identity here—whether the person was born elsewhere or is the child of a person born somewhere else. The intention of the blur is also to represent individuals as layered and complex: to not see people solely in one dimension; [and to be] aware that a person is made up of [both] who they are tangibly, and so many other intangible things, which includes their experiences with time and location—whether they access this on their own or through generational storytelling. (Price 2019)

In her long-term project self-Less, artist Dana Dal Bo collects hundreds of screen-captured selfies (naked people taking their own picture in bathroom mirrors with smartphones) and manually renders them unrecognizable through digital photo manipulation techniques such as painting, erasing, and pixel cloning (where one part of the image is copied to another). The manipulated images are then re-posted to a dedicated Instagram feed (Witze 2020). Dal Bo also made a series of prints of the manipulated self-Less images using the antique and resistant technique of carbon printing (Dal Bo 2017). The title of the series, Carbon Copy, refers both to the printing technique and to the “CC” now used in emails, pointing to the permanence, reproducibility, and recursion of images posted online. In her 2021 artist presentation “ArtAIfacts: Co-Creation with Non-Human Systems,” Dal Bo reflected that although the original images were clearly intimate and meant for a specific person, once they were posted online they became available to anyone, anywhere, at any time.

Dana Dal Bo, Ass sink from self-Less series, 2014–ongoing, digital image, dimensions variable.
Image courtesy of the artist.

Zach Blas, Fag Face Mask, 2012, vacuum formed, painted plastic.
Image courtesy of the artist.

A growing number of artists are highlighting the dangers of facial recognition technology and speculating about ways to evade it. In his 2012 project Facial Weaponization Suite, American artist Zach Blas created a series of four amorphous vacuum-formed plastic masks based on aggregated facial data of marginalized groups (queer, Black, female, and Mexican). When the “collective” masks are worn, the wearer is protected from facial recognition systems. Fag Face Mask, for example, was generated from the biometric facial data of queer men’s faces. In an essay on the project, Blas explains that the mask is “a response to gay face and fag face scientific studies that link the successful determination of sexual orientation through rapid facial recognition techniques…The mask is not a denial of sexuality nor a return to the closet; rather, it is a collective and autonomous self-determination of sexuality, a styling and imprinting of the face that evades identificatory regulation” (2013). Each mask in the series is motivated by instances of social and political abuse of facial recognition software, such as the failings of the technology to detect Black faces, veil legislation in France targeted at Islamic women, and the abuse of biometric surveillance at the Mexico-US border (Blas 2012).

American and German artist and software developer Adam Harvey also works with facial recognition software to raise awareness of the existence and accuracy of large, publicly gathered facial recognition training datasets, and to develop anonymization tools that protect individuals from exposing their own or others’ faces to facial recognition software when posting images to social media platforms. In his 2017 work MegaPixels, Harvey created a photobooth in which audience members could have their faces matched to a face within the MegaFace (V2) dataset, at the time the largest publicly available and widely used facial recognition training dataset (Harvey 2017). MegaPixels produces an image of the participant’s face next to the face it has been identified as, along with the accuracy of the identification, which can be thermally printed and taken away by the audience member. The MegaFace (V2) dataset, which contains 4.2 million images, was created from Flickr images without consent and is being used in research projects in the US, China, and Russia to train facial recognition systems.

A more recent work by Harvey, DFACE.app, is a web-based application that masks or redacts faces in photographs. There are several different “redaction effects,” including blur, mosaic, emoji, fuzzy, and colour fill. Although these effects are fun and playful, DFACE.app is motivated by the unwarranted use of surveillance technologies at protests and large public gatherings (Harvey 2018).

Example of image “dfaced” with Adam Harvey’s DFACE app.

Anonymization Discussion Questions

• Has identifying information been removed from the data? Has the identifying information been destroyed, or has the link between the identifying data and the dataset been destroyed? Is it possible for someone to relink the data? 

• Does the scan data contain information such as facial structures that facial recognition software could identify? Is it appropriate to render those features unidentifiable? What methods of censoring or scrambling the data are appropriate? In the past, black bars were used to obscure parts of the face or body, but this strategy can be perceived as a violent way to anonymize participants. 

• Will the way in which the artwork is disseminated put the data subjects at risk of re-identification? Is social media being used in any part of the work? What kinds of algorithms does social media expose the data to? Should images of the artwork be uploaded to social media platforms or not? Should the data be rendered illegible to algorithms using cloaking software such as Fawkes?

• Since the dataset was created, how have changes in technology challenged or undermined privacy or consent, and how might they do so in the future? Could demographic data be used to re-identify the individual?