Best Practices for Working Creatively with Personal Data

Glossary

aggregate data: Data from multiple sources that is compiled into a single data summary.

anonymization: The process of removing, or the condition of having removed, identifying data from the dataset in a way that makes it impossible to relink the identifying data with the data subjects; anonymizing and/or anonymized data. Secondary use of anonymized data is not regulated under GDPR, UK GDPR, or HIPAA.

anonymous data: Data for which identifying information was never collected.

author or co-author (in scientific research): A researcher who has made substantial contributions to the conception or design of the work; or the acquisition, analysis, or interpretation of data; or the creation of new software used in the work; or who has drafted the work or substantively revised it; and who has approved the submitted version (and any substantially modified version that involves the author’s contribution to the study); and who has agreed both to be personally accountable for the author’s own contributions and to ensure that questions related to the accuracy or integrity of any part of the work, even ones in which the author was not personally involved, are appropriately investigated and resolved, with the resolution documented in the literature.

avatar: A computer-generated image or model by which an individual represents themselves on a communications network or in a virtual community, such as a chatroom or multiplayer game. In Hinduism an avatar is a god appearing on earth in bodily form.

big data: Extremely large, diverse sets of information that typically need to be analyzed computationally. Big data is generally sourced via data mining and comes in multiple formats. Big data typically focuses on human behaviour and interactions (Segal 2022).

biometric data: A measurement of an individual’s physical traits that can be used to verify identity. Biometric data includes fingerprints, face recognition, iris recognition, voice recognition, handwriting, and gait (the way a person walks or moves).

binning: see data generalization

blurring: see data generalization

cloud computing: Internet-based computing, whereby shared resources, software, and information are provided to computers and other devices on-demand.

collaborator (in social sciences research): An artist or researcher who makes a significant contribution to the intellectual direction of an artwork or research project, and who plays a significant role in the conduct of the research or research-related activity (TCPS2).

computer vision: A field of machine learning that processes, analyzes, and learns from images and videos.

consent: The way that parties agree to participate in a proposed activity, event, or arrangement. In the context of research, consent is a particularly important topic as it respects the autonomy of the data subject to voluntarily engage in the research process.

contributor (to a research project): A research assistant, researcher, or artist who contributes valuable resources and input to the research project but does not actually contribute to the creation of an artwork or writing/editing of a research paper (Enago Academy 2021).

CT scan: A kind of medical imaging that combines X-rays taken from multiple angles to create detailed cross-sectional images that essentially map tissue density. CT scans expose the subject to ionizing radiation and are therefore only acquired for clinical research.

conversational AI: Technologies, such as chatbots or virtual agents, that users can talk to. They use large volumes of data, machine learning, and natural language processing to help imitate human interactions, recognizing speech and text inputs and translating their meanings across various languages (IBM Cloud Education 2020).

cookies: Text files stored on the user’s device by a website. Cookies are normally used to provide a more personalized experience and to remember a user’s profile without the need for a specific login. Cookies can also be placed by third parties to track users as they move across different websites associated with that third party.

Creative Commons license: One of several public copyright licenses that enable the free distribution of an otherwise copyrighted work. A Creative Commons license may be used by individual researchers, authors, and artists, as well as by large institutions and companies, to give other people the right to share, use, and build upon a work that they have created. Creative Commons licenses are increasingly being used in the sharing of medical scan datasets.

data generalization (also known as blurring): The transformation of a value into a less precise one. One data generalization technique is binning, in which every value that falls within a range is replaced by that range; another is to provide a less specific value. For instance, a date of birth could be “blurred” to become a month of birth.
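As an illustration only, the two techniques named above could be sketched in Python as follows. The function names, the ten-year bin width, and the year-and-month output format are assumptions made for this example and are not drawn from any particular anonymization tool.

```python
from datetime import date


def bin_age(age: int, width: int = 10) -> str:
    """Binning: replace an exact age with the range (bin) that contains it."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


def blur_birth_date(dob: date) -> str:
    """Less specific value: keep the year and month of birth, drop the day."""
    return dob.strftime("%Y-%m")


print(bin_age(37))                          # 30-39
print(blur_birth_date(date(1984, 6, 21)))   # 1984-06
```

Wider bins give stronger protection but less analytic detail, so the bin width is a judgment call for each dataset.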

data minimization: A principle described in the GDPR (Article 5(1)(c)) that states that data collection should be limited to what is directly relevant and necessary to accomplish a specified purpose. It also suggests that data be retained only for as long as is necessary to fulfil that purpose.

data mining: The process of analyzing data from different sources. Data mining uncovers patterns and other information from large datasets or big data.

data randomization: The processing of large datasets so that key identifiers are randomly masked.

data scrambling: A process to obfuscate or remove sensitive data by replacing the values of certain fields, such as uppercase characters or numbers, with “dummy data.” This is also known as image cloaking.
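A minimal sketch of scrambling a text field, assuming that uppercase letters and digits are the characters to be replaced and that everything else should keep its place so the field retains its format. The function name and the sample value are hypothetical.

```python
import random
import string


def scramble(value: str, rng: random.Random) -> str:
    """Replace uppercase letters and digits with random dummy characters.

    All other characters are copied unchanged, so the field keeps its format.
    """
    out = []
    for ch in value:
        if ch in string.ascii_uppercase:
            out.append(rng.choice(string.ascii_uppercase))
        elif ch in string.digits:
            out.append(rng.choice(string.digits))
        else:
            out.append(ch)
    return "".join(out)


# Hypothetical record identifier; the output differs on every run.
print(scramble("AB-1234 (ward 7)", random.Random()))
```

Because each replacement is drawn at random, a character can occasionally be replaced by itself. The `random` module is also not suitable for security-sensitive use, so this sketch should not be treated as protection for real data.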

data segmentation: The process of organizing data into defined groups, so that it can be ordered and viewed more easily. In the case of medical scan data, data segmentation typically refers to the selection of certain tissue types (such as muscle, bone, or fat).

dataset: A collection of related sets of information (such as medical scans) that is composed of separate elements but can be manipulated as a unit by a computer.

data subject: A person about whom a researcher holds personal data and who can be identified, directly or indirectly, by reference to that personal data.

de-identification (also called anonymization): The removal of identifying information from a dataset.

DeepDream: A computer vision program created by Google that uses deep learning to create dream-like, psychedelic images.

deep fake: An image or video of a person in which their face or body has been digitally altered using deep learning techniques so that they appear to be a different person.

deep learning: A type of machine learning based on artificial neural networks in which more than three layers of processing are used to learn from data.

defacing: A common procedure required to anonymize brain scans. The procedure masks out the face in scans by blurring or deleting voxels, so that the subject cannot be identified when the image is volume rendered or the face surface is extracted. Examples of scan defacing software include pydeface and mri-deface.

DICOM (Digital Imaging and Communications in Medicine): The standard for the communication and management of medical imaging information and related data.

fMRI (Functional Magnetic Resonance Imaging): A type of MR scanning of the brain that is used to image brain function by tracking blood flow in the brain to identify which parts of the brain are active during certain activities.

generative adversarial network (GAN): A machine learning model that learns from a set of training data. GANs consist of two neural networks, the generator and the discriminator, which compete against each other. The generator is trained to produce fake data, and the discriminator is trained to distinguish the generator’s fake data from real examples. If the generator produces fake data that the discriminator can easily recognize as implausible, such as an image that is clearly not a face, the generator is penalized. Over time, the generator learns to generate more plausible examples. GANs can be used for image improvement, generation, labelling, and identification (S. Lewis 2019; Wood 2020).

image cloaking: see data scrambling

incidental findings: Unexpected observations, results, or other findings that may arise in research that are considered beyond the scope of the project, and often beyond the expertise of the researcher.

longitudinal data: A collection of repeated observations of the same subjects, taken from a larger population, over a long period of time.

machine learning: The use of computational systems that learn and adapt without following explicit instructions, by using algorithms and statistical models to analyze and find patterns in data. There are three main kinds of machine learning: supervised learning, where data is labelled, either by a human or a machine, and the labelled dataset is used to train a machine learning model; unsupervised learning, where data is not labelled and the model is trained by recognizing patterns and then grouping the data into categories based on these patterns; and reinforcement learning, where an agent explores its environment, generating its own data as a guide to learning based on rewards (reinforcement) of “correct” behaviour.

metadata: Data that provides information about data.

MRI (Magnetic Resonance Imaging): A kind of medical imaging that combines a magnetic field and computer-generated radio waves. The magnetic field temporarily realigns hydrogen atoms in the body’s water molecules. Radio waves cause these aligned atoms to produce faint signals, which are used to create cross-sectional MRI images of organs and tissues. MRI is non-invasive and can be used for non-clinical research.

multimodal data: Data from multiple modalities. An example of a multimodal medical scan dataset could include MRI, CT, and ultrasound from the same data subject.

natural language processing (NLP): A field of machine learning that processes, analyzes, and generates language (speech and text) rather than images. NLP is widely used for tasks such as speech recognition, text-to-speech, word segmentation, translation, analyzing large texts, and text generation.

neural network: A computation system used in machine learning that is modelled on the human brain and nervous system. Neural networks are used in AI applications such as speech and image recognition, spam email filtering, finance, and medical diagnosis.

normative bias: A tendency to assume that anything going against an established norm is not effective or appropriate.

OCAP®: Stands for ownership, control, access, and possession and is an educational resource created by the First Nations Information Governance Centre (FNIGC) to help First Nations communities in Canada control data collection processes in their own communities and how information is used.

open access: Free and open online access to academic resources such as publications, data, and software. A resource is “open access” when there are no financial, legal, or technical barriers to accessing it.

open source: Describes software for which the original source code is made freely available and may be shared and modified. The term open source also designates a set of “open-source values” based on principles of open exchange, collaborative participation, rapid prototyping, transparency, meritocracy, and community-oriented development (Opensource, n.d.).

personal data: Information that relates to an identified or identifiable individual. Different jurisdictions define personal data differently, but generally personal data includes identifiers such as name, date of birth, address, identification numbers, location data, Internet Protocol (IP) addresses, telephone numbers, and data held by hospitals or doctors.

PET (Positron Emission Tomography) scan: A kind of scan that uses a radioactive tracer (typically ingested or injected) to track metabolic or biochemical activity. The radioactive tracer concentrates in areas with higher metabolic or biochemical activity. Often paired with CT or MRI scans.

primary use of data: The purpose for which the data was originally collected by researchers. Data collection of personal information is strictly regulated in most Western countries.

pseudonymization (also called de-identification): The process of removing all identifying data from a dataset and identifying each data subject by a randomized signifier, usually a number. Pseudonymized data can be reidentified by relinking the dataset with the identifying data using a key. Protection of the identifying data and of the key for relinking is strictly regulated in most Western countries.
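The relationship between the pseudonymized dataset and the relinking key can be sketched as follows. This is a simplified illustration, not a compliant implementation: the function names, the `subject_id` field, the eight-digit signifier, and the sample record are all assumptions made for the example.

```python
import secrets


def pseudonymize(records, identifying_fields):
    """Split records into a pseudonymized dataset and a separate relinking key."""
    pseudonymized, key = [], {}
    for record in records:
        # Randomized signifier; redraw in the unlikely event of a collision.
        subject_id = f"{secrets.randbelow(10**8):08d}"
        while subject_id in key:
            subject_id = f"{secrets.randbelow(10**8):08d}"
        key[subject_id] = {f: record[f] for f in identifying_fields}
        rest = {f: v for f, v in record.items() if f not in identifying_fields}
        pseudonymized.append({"subject_id": subject_id, **rest})
    return pseudonymized, key


def relink(row, key):
    """Reidentify one pseudonymized row using the key."""
    data = {f: v for f, v in row.items() if f != "subject_id"}
    return {**key[row["subject_id"]], **data}


# Fictional example record.
records = [{"name": "Jane Doe", "date_of_birth": "1984-06-21", "scan": "mri_001"}]
dataset, key = pseudonymize(records, ["name", "date_of_birth"])
print(dataset)                   # no name or date of birth, only subject_id and scan
print(relink(dataset[0], key))   # the original fields are recovered with the key
```

The dataset can be shared with collaborators while the key is stored separately under stricter access controls. Anyone who holds both can reidentify the data subjects, which is why the data is pseudonymized rather than anonymized.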

research-creation: Research that combines creative and academic research practices, and supports the development of knowledge and innovation through artistic expression, scholarly investigation, and experimentation (Social Sciences and Humanities Research Council of Canada, n.d.).

research participants: Individuals whose data, biological materials, or responses to interventions, stimuli, or questions by the researcher are relevant to answering a research question(s) (TCPS2).

secondary use of data: When data is anonymized and then used by researchers for purposes that differ from those at the time of collection. For data to be anonymous, researchers conducting secondary use of the data must not be able to relink the data subjects with identifying information. Secondary use of anonymized data is not regulated under GDPR, UK GDPR, or HIPAA. The TCPS exempts researchers from obtaining consent from data subjects for secondary use of anonymized datasets, but REB approval is still required (TCPS2).

sensitive data: A special category of personal data. The definition of sensitive data differs by jurisdiction but typically includes data that reveals racial or ethnic origin, political opinions, religious or philosophical beliefs, genetic data, biometric data, health data, or data concerning sexual orientation. Medical scan data is a type of sensitive data.

synthetic data: Data generated from computer simulations or algorithms that provide alternatives to real-world data.

ultrasound: A kind of scanning that uses high-frequency, low-power sound waves to image soft tissue (it cannot be used on bone or where there is gas, e.g., in the lungs).

Visible Human Project (VHP): A database of publicly available complete, anatomically detailed, three-dimensional representations of a human male body and a human female body created by the National Library of Medicine (US). Specifically, the VHP provides a public-domain library of cross-sectional cryosection, CT, and MRI images obtained from one male cadaver and one female cadaver. The Visible Man dataset was publicly released in 1994 and the Visible Woman in 1995 (National Library of Medicine 2019).
