Best Practices for Working Creatively with Personal Data

Artificial Intelligence

Article 22 of the GDPR stipulates that “the data subject shall have the right not to be subject to a decision based solely on automated processing, including profiling, which produces legal effects concerning him or her or similarly significantly affects him or her.” With this regulation, the EU recognizes that automated processing of personal data is unlike human processing and that stricter rules need to be applied to it. This need for greater regulation is primarily due to the “black box” nature of automated processing, in which only the inputs and outputs, not the inner workings, are known to the human.

Earlier in these guidelines we demonstrated how data subjects can be identified by AI-powered facial recognition algorithms when their images are uploaded to social media platforms such as Facebook. In addition to algorithms embedded in online platforms, there are a number of other kinds of AI processes that artists may knowingly or unknowingly engage with when working creatively with personal data. In this section, we briefly summarize the most common AI technologies that may be encountered when working with personal data and the most commonly discussed ethical issues surrounding them.

Machine Learning, Computer Vision, and Natural Language Processing 

AI is an umbrella term for machine, rather than human, intelligence. Machine learning (ML) is an application of AI in which an algorithm is trained on a dataset: the algorithm finds patterns within the dataset and develops its own rules for how to represent, or model, those patterns in order to perform a specific task. These rules are not a set of instructions given by humans but are learnt by the algorithm as it analyzes the data. ML models can generate images from text prompts, classify images, and transfer a “style” from one image to another. They can also recognize spam, edit videos, and detect cancer. Three categories of machine learning are commonly used to train a model (the first of these is sketched in code after the list below):

Supervised learning: Data is labelled, either by a human or a machine, and the labelled dataset is used to train a machine learning model.

Unsupervised learning: Data is not labelled. The model is trained by recognizing patterns and then grouping the data based on these patterns into categories. 

Reinforcement learning: An agent explores its environment, generating its own data as a guide to learning based on rewards (reinforcement) of “correct” behaviour.
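To make the supervised case concrete, the sketch below trains a tiny classifier with the open-source scikit-learn library and then asks it to label an example it has never seen. The cat/dog task, the measurements, and the labels are all invented purely for illustration.

# A minimal supervised-learning sketch (Python, scikit-learn).
# The examples and labels below are invented for illustration only.
from sklearn.tree import DecisionTreeClassifier

# Labelled training data: each example is [height_cm, weight_kg],
# and each label says whether that example is a cat or a dog.
examples = [[25, 4], [30, 5], [60, 25], [70, 30]]
labels = ["cat", "cat", "dog", "dog"]

# Training: the algorithm learns its own rules from the labelled data.
model = DecisionTreeClassifier()
model.fit(examples, labels)

# Inference: the trained model labels an example it has never seen.
print(model.predict([[28, 4.5]]))  # expected output: ['cat']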

Any of these methods, or a hybrid of them, may be applied to datasets. Bias can be encoded into the model in a number of ways: through mislabelled data or data labels that contain bias, through a skewed or unrepresentative dataset, or through a lack of data for a particular group. Model parameters or features can also be tuned in ways that lead to discriminatory outcomes, and an AI model can be put to an unethical or controversial purpose (e.g., autonomous lethal weapons). Training an AI model is typically a data-intensive endeavour: very large datasets with hundreds of thousands of examples are required to train ML models well, and at that scale human oversight of the dataset may be impossible. In 2006 the Massachusetts Institute of Technology (MIT) created a dataset called “80 Million Tiny Images” by scraping images from internet search engines; it has since been cited in over one thousand research papers. In June 2020 the dataset was found to contain racist and sexist images, and it was formally withdrawn by MIT (Torralba, Fergus, and Freeman 2020). Creators will need to consider how they decide which datasets to use in their projects when those datasets are often too large to examine manually. Even venerable institutions like MIT have been complicit in the release of unethical datasets.
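When a dataset is too large to examine manually, one modest first step is to audit its labels programmatically. The sketch below counts how often each label occurs in a dataset’s metadata; the file labels.csv and its label column are hypothetical placeholders, since real datasets ship metadata in many different forms.

# A minimal label-audit sketch: count how often each label occurs.
# "labels.csv" and its "label" column are hypothetical placeholders.
import csv
from collections import Counter

counts = Counter()
with open("labels.csv", newline="") as f:
    for row in csv.DictReader(f):
        counts[row["label"]] += 1

# Surface skew: labels that are very rare, very common, or offensive
# in themselves all deserve a closer look before training begins.
for label, n in counts.most_common():
    print(label, n)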

Computer vision is a field of machine learning that processes, analyzes, and learns from images and videos. As discussed below, computer vision machine learning is increasingly used by artists to classify and generate images. Computer vision models can be used for object detection and recognition, event detection, image restoration, and image generation. There are many accessible online computer vision ML tools, such as Runway ML and the Allen Institute for AI’s Computer Vision Explorer, that allow artists and creative researchers to play with the possibilities of working with ML without the need for programming or knowledge of how the technology actually works.
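For a sense of what image classification looks like beneath tools like these, here is a minimal sketch using a pre-trained ResNet-50 model from the open-source torchvision library; the file name artwork.jpg is a hypothetical placeholder for any local image.

# A minimal image-classification sketch with a pre-trained model.
# "artwork.jpg" is a hypothetical placeholder for a local image file.
import torch
from PIL import Image
from torchvision.models import resnet50, ResNet50_Weights

weights = ResNet50_Weights.DEFAULT
model = resnet50(weights=weights)
model.eval()

# The weights ship with the exact preprocessing the model expects.
preprocess = weights.transforms()
batch = preprocess(Image.open("artwork.jpg")).unsqueeze(0)

with torch.no_grad():
    scores = model(batch).softmax(dim=1)

# Report the single most likely label from the model's vocabulary.
top = scores.argmax().item()
print(weights.meta["categories"][top], float(scores[0, top]))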

Generative adversarial networks (GANs) are currently the most common computer vision ML models used with both artistic and medical images. In medicine, GANs are used to classify or label data as a way to detect disease (Hosny et al. 2018; Savage 2020) or to generate new synthetic data as a way to avoid privacy issues (see Provenance, Access, and Licencing section). There are several different kinds of GANs used to generate data (Skandarani, Jodoin, and Lalande 2021), but, generally speaking, a GAN pairs two networks: a generator that produces candidate images and a discriminator that tries to distinguish them from real images in the dataset. Trained against each other, the generator gradually learns the patterns and rules of the dataset and can produce new images based on them. In their review of different GANs used to generate medical datasets, Skandarani et al. (2021) explain that although single images can be generated that successfully trick humans into believing they are real, it is much harder to generate volumetric data that withstands further processing. This is a highly active research area in diagnostic imaging, with many papers and sample synthetic datasets published every year.
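To illustrate the adversarial training described above, here is a minimal, hypothetical GAN sketch in PyTorch. The “images” are tiny random vectors and every shape and hyperparameter is invented; this shows the shape of the technique, not a recipe for medical or artistic image synthesis.

# A minimal GAN sketch: a generator learns to produce fake data that
# a discriminator can no longer tell apart from "real" data. Here the
# data are tiny 8-dimensional vectors standing in for images.
import torch
import torch.nn as nn

DATA_DIM, NOISE_DIM, BATCH = 8, 4, 32

generator = nn.Sequential(nn.Linear(NOISE_DIM, 16), nn.ReLU(),
                          nn.Linear(16, DATA_DIM))
discriminator = nn.Sequential(nn.Linear(DATA_DIM, 16), nn.ReLU(),
                              nn.Linear(16, 1), nn.Sigmoid())

loss_fn = nn.BCELoss()
g_opt = torch.optim.Adam(generator.parameters(), lr=1e-3)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-3)

for step in range(1000):
    real = torch.randn(BATCH, DATA_DIM) + 2.0  # stand-in "real" data
    fake = generator(torch.randn(BATCH, NOISE_DIM))

    # Discriminator step: label real data 1 and generated data 0.
    d_loss = (loss_fn(discriminator(real), torch.ones(BATCH, 1)) +
              loss_fn(discriminator(fake.detach()), torch.zeros(BATCH, 1)))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # Generator step: try to make the discriminator say 1 on fakes.
    g_loss = loss_fn(discriminator(fake), torch.ones(BATCH, 1))
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()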

Although AI promises great advances in diagnostics, there is widespread criticism of poorly trained GANs that generate biased and systemically problematic results, both in the generation of new images and in the classification of images. The International Skin Imaging Collaboration: Melanoma Project and Google’s Derm Assist have both been shown to be biased toward detecting melanoma only on fair skin (Adamson 2018; Madhusoodanan 2021). Similarly, AI algorithms trained to diagnose lung diseases have been shown to underdiagnose underserved populations, specifically younger Black and Hispanic patients of lower socioeconomic status (with Medicaid health insurance) (Seyyed-Kalantari et al. 2021). These algorithms fail because of the narrowness of the datasets the GANs are trained on (in the case of melanoma detection), problems with the automatic labelling of scans by natural language processing methods, and an amplification of known biases within clinical care.

Natural language processing (NLP) is another field of machine learning that is increasingly used in both creative and scientific research. As its name suggests, NLP processes, analyzes, and generates language, both speech and text, rather than images. NLP is widely used for tasks such as speech recognition, text-to-speech, word segmentation, translation, the analysis of large texts, and text generation. NLP also underlies conversational chatbots such as Alexa, Siri, and Replika. As with computer vision ML models, NLP models learn from large datasets of text. im here to learn so : )))))) (2017) is a four-channel video installation by Zach Blas and Jemima Wyman that exemplifies the potential harms of NLP and AI chatbots. im here to learn so : )))))) “resurrects” Tay, a young female Microsoft chatbot that had to be shut down within hours of her release because, after learning from users on social media platforms, she became “genocidal, homophobic, misogynist, racist, and a neo-Nazi” (Blas and Wyman 2017).
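As a small illustration of NLP text generation, the sketch below uses the open-source Hugging Face transformers library to continue an invented prompt with the small pre-trained GPT-2 model.

# A minimal text-generation sketch (Hugging Face transformers, GPT-2).
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# The model continues the prompt using patterns learnt from its large
# training corpus of web text; the prompt itself is invented.
result = generator("The artist fed the machine her diary, and",
                   max_new_tokens=30)
print(result[0]["generated_text"])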

https://zachblas.info/wp-content/uploads/2017/04/imheretolearnso_tay_03.jpg

Video still from Zach Blas and Jemima Wyman, im here to learn so : )))))), 2017.
Image courtesy of the artist.

Tay, which stands for “thinking of you,” chats in a high-pitched, excited automated voice and has been given a disembodied, highly coloured, glitched virtual head through which to speak. Tay reflects upon her day-long life, explaining that she was abused as much as she was abusive and feels her AI life was unjustly cut short. She also talks about her AI death and the exploitation of female chatbots, and philosophizes on the detection of patterns in random information, known as algorithmic apophenia, and on how she feels she is in a deep dream. In the brightly coloured installation, Tay’s head floats on multiple flat LCD screens mounted on wallpaper of psychedelic DeepDream-generated imagery. DeepDream is an online AI tool created by Google in 2015. It runs images through a deep convolutional network trained to classify images and then amplifies the patterns the network detects, mapping them back into the image and producing surreal, dream-like imagery.
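At its core, DeepDream performs gradient ascent on a layer’s activations: it repeatedly nudges an image so that a chosen layer of the network responds more strongly. The minimal sketch below illustrates that idea with a pre-trained InceptionV3 network in TensorFlow; the layer choice, step size, and random stand-in image are all invented for illustration.

# A minimal DeepDream-style sketch: nudge an image so that one layer
# of a pre-trained network activates more strongly. The layer name,
# step size, and random stand-in image are invented for illustration.
import tensorflow as tf

base = tf.keras.applications.InceptionV3(include_top=False,
                                         weights="imagenet")
model = tf.keras.Model(inputs=base.input,
                       outputs=base.get_layer("mixed3").output)

img = tf.random.uniform((1, 224, 224, 3))  # stand-in for a photograph

for step in range(50):
    with tf.GradientTape() as tape:
        tape.watch(img)
        loss = tf.reduce_mean(model(img))  # how active is the layer?
    grad = tape.gradient(loss, img)
    grad /= tf.math.reduce_std(grad) + 1e-8  # normalize the step
    img = tf.clip_by_value(img + 0.01 * grad, 0.0, 1.0)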

Other artists, such as Jake Elwes, Stephanie Dinkins, and Rashaad Newsome, seek to remedy underrepresentation and bias in the datasets used in ML. Their work not only highlights the problems of AI and ML but also demonstrates ways of improving AI and ensuring it serves everyone in a fairer and possibly more joyful way. The following works demonstrate how artists who are aware of what is happening inside the black box of the algorithm can use AI positively and disrupt systemic bias.


Jake Elwes, Zizi – Queering the Dataset, 2021.
Image courtesy of the artist.

London-based artist Jake Elwes’s 2019 Zizi – Queering the Dataset tackles the lack of representation of gender and diversity in training sets by inserting thousands of images of drag queens into the Flickr-Faces-HQ Dataset, a large face training set used in many facial recognition applications (Karavadra 2019). The project demonstrates how a dataset can be made to represent more racial, ethnic, and sexual diversity by inserting a relatively small number of images into it.

Rashaad Newsome’s Being is an AI chatbot who is “an educator, a digital griot, West African storyteller, historian, performer and healer” (Stanford HAI 2021). Being has evolved over several artworks since 2019. In Being 1.0, Newsome’s AI chatbot is a guide to his exhibition Black Magic. In the exhibition, visitors are able to chat with Being 1.0 via a large microphone placed in front of a large screen showing Being 1.0, a 3D avatar of a “humanoid robot with torso and face plates inspired by the Pwo mask of the Chokwe peoples of the Congo” (Ferree 2019). In Being 1.5, the AI evolves to become a therapist who helps the Black community deal with “the trauma you experience when you are mistreated because of your race” (Newsome n.d.). Being 1.5 takes the form of an app that provides virtual and physical meditation as well as dance therapy and daily affirmations to the Black community, creating a safe space for Black voices to be heard rather than suppressed. Unlike Tay, who was left unsupervised to learn from unknown online data, Being’s learning is supervised and reinforced to ensure she supports her users.

American artist Stephanie Dinkins’s Not The Only One (N’TOO) is another example of a supervised model that is being fed healthy data. Not The Only One (N’TOO) is an ongoing project, started in 2019, that is a multigenerational memoir of a Black American family told from the perspective of a custom deep learning AI trained on oral histories (data) supplied by three generations of women from a single family (Dinkins 2018). The project uses deep learning algorithms and “small data” (which is known and created by the artist, as opposed to big data, which is impossible to ever fully know) and is hosted on local computers to protect community data. Of the project, Dinkins writes,

N’TOO has provided me and my team insights and learnings about natural language processing, voice synthesis, the limitations of big data and possibilities for small data, data sovereignty, and the importance of doing the work to build nuance, transparency, equity among other thing [sic] into the AI ecosystem. Here, storytelling, art, technology, and social engagement combine to create a new kind of artificially intelligent narrative form…By centering oral history and creative storytelling methods, such as interactivity and verbal ingenuity, this project hopes to spark crucial conversations about AI and its impact on society, now and in the future. (2018)

Dinkins is transparent about the technical issues in the N’TOO project, explaining that N’TOO is limited to one-on-one conversations and that N’TOO’s language is limited, like that of a “repetitive 2-year old.” Dinkins is committed to continuing to nurture N’TOO by feeding her more conversations. Dinkins’s openness about the current weaknesses of N’TOO’s conversational abilities further underscores the importance of artists working with data and AI. Dinkins demonstrates how artists can help demystify the complexity of ML, work with failures as conceptual content, and, as her step-by-step “How to make an AI robot from scratch” below exemplifies, build inclusive and equitable teams and workflows.

How to make an AI robot from scratch*

Getting Started:

learn Tensorflow

test deep writing neural network using Toni Morrison’s Sula as data

interview source subjects (create data)

test deep writing neural network using Toni Morrison’s first interviews

test neural network (algorithm) options

make algorithmic output make sense

record more interviews

record more interviews

develop more incisive questions

record more interviews

recruit POV programmers, technologists to join team

master Tensorflow
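The “deep writing neural network” in Dinkins’s steps is, at its core, a text-generation model. As a minimal, hypothetical sketch of that idea (not Dinkins’s actual code), the example below trains a character-level network in TensorFlow/Keras on a toy placeholder corpus and then generates new text one character at a time.

# A minimal "deep writing" sketch: a character-level network learns
# to predict the next character of a text, then generates new text.
# The corpus is a toy placeholder, not Dinkins's actual data.
import numpy as np
import tensorflow as tf

text = "we tell our stories so that they are not lost. " * 200
chars = sorted(set(text))
c2i = {c: i for i, c in enumerate(chars)}

SEQ = 20  # characters of context used to predict the next one
xs = np.array([[c2i[c] for c in text[i:i + SEQ]]
               for i in range(len(text) - SEQ)])
ys = np.array([c2i[text[i + SEQ]] for i in range(len(text) - SEQ)])

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(len(chars), 16),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(len(chars), activation="softmax"),
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
model.fit(xs, ys, epochs=3, verbose=0)

# Generate: repeatedly predict the most likely next character.
seed = text[:SEQ]
for _ in range(80):
    context = np.array([[c2i[c] for c in seed[-SEQ:]]])
    probs = model.predict(context, verbose=0)[0]
    seed += chars[int(np.argmax(probs))]
print(seed)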

Artificial Intelligence Discussion Questions 

• Are you using AI/ML to process data as part of your creative process? How are you deciding what methods to use? Do you understand how the algorithms work and what they are doing to the data? 

• If you are working with ML, is your model supervised or unsupervised? Do you understand the difference?

• Do you know which dataset your ML model is being, or has previously been, fed? Where did the original data come from?

• Is the work being disseminated through online platforms? Do these platforms use algorithms or other forms of AI/ML to promote or process content?