Seen by Machine: Computational Spectatorship in the BBC television archive

Long paper
Authorship
  1. Daniel Alberto Chavez Heras, King's College London
  2. Tobias Blanke, King's College London
  3. Tim Cowlishaw, British Broadcasting Corporation (BBC)
  4. Jakub Fiala, British Broadcasting Corporation (BBC)
  5. Amaya Herranz Donnan, British Broadcasting Corporation (BBC)
  6. David Man, British Broadcasting Corporation (BBC)


Summary
This paper is a report and critical account of research undertaken in the project Made by Machine: When AI Met the Archive (https://www.bbc.co.uk/programmes/b0bhwk3p). In this project we used three computational approaches to analyse and automatically browse the BBC television archive; we then proposed a novel way of combining these approaches through machine learning by fitting their outputs to a recurrent neural network as a time-series.

Through these methods, we were able to generate new sequences out of a dataset of short clips of archive footage. These sequences were then edited and packaged into a television programme broadcast on BBC Four in September 2018. This is, to our knowledge, the first time machine learning has been used in this way to recast archive footage into original prime-time content for television.
In this paper, we first frame Made by Machine as a project inscribed in the political economy of datafied cultural production. We then describe the technological approaches we used to traverse archive space, learn and extract features from video, and model their relations through time. Finally, we introduce the idea of computational spectatorship as a concept for analysing the objects and practices of automated seeing/editing of moving imagery through machine learning.

The television archive as cultural big data
In public discourse we are constantly reminded that how we learn, what we enjoy, and the ways we conceive and exercise power are all consistently mediated by imagery. In various academic disciplines this pre-eminence of imagery in all spheres of human activity has been referred to as visual culture or the visual turn (Mirzoeff, 2002; Jay, 2002). It has been further recognised that digital and networked technologies have dramatically increased the role imagery plays in society. According to one estimate, 23.2% of the files available online in 2003 were already images (Lyman and Varian, 2003), and this was before the rise of YouTube and Netflix in the mid-2000s. A more recent report by CISCO predicts that by 2022 video will account for more than 80% of the world’s IP traffic; if so, by then we will collectively be watching the equivalent of 10 billion DVDs per month through Video-on-Demand (VoD) systems (CISCO, 2018: 2).
Arguably, out of all cognitive-cultural production today, moving imagery is the most semantically ubiquitous, the largest in size, and one of the most semiotically complex, insofar as it combines visual, linguistic, and aural modes of expression. In their abundance and complexity, moving images are one of the most challenging aspects of contemporary culture in today’s networked societies, and one that increasingly demands social and academic attention. In this context, while the volume of digital video grows and its formats continue to diversify, the use of computational techniques to access, analyse, produce and reproduce vast collections of digital moving imagery becomes a pressing issue for organisations and archives that deal with audio-visual production (see: Kuhn et al., 2015; and Ward and Barker, 2013).
One such collection is the BBC television archive, which amounts to approximately 700,000 hours of television distributed across about 400,000 programmes in various formats, including items originally stored on film and magnetic tape that have since been digitised, and new programming preserved as born-digital data (British Film Institute, 2018; Lee, 2014). A significant portion of this archive is now available through the internal Redux system, which allows on-demand access to high-resolution video stream files through an online interface (Butterworth, 2008).

In the rapidly changing landscape of the digital entertainment industries, where large corporations like Netflix or Amazon are producing original video content and are, in industry parlance, leveraging their own moving image collections as profitable datasets, the BBC is increasingly pushed to recognise the value of its archives as data. In 2017, BBC R&D signed a five-year partnership with eight UK universities to “unlock the potential of data in the media” by creating a framework to make BBC data available for research (Chadwick in BBC R&D, 2017). Unlocking such potential can be understood, in this context, as extracting value from this data; but unlike Amazon or Netflix, the BBC is a public corporation funded largely by taxpayers to fulfil public purposes (https://www.bbc.com/aboutthebbc/governance/charter). The British broadcaster therefore needs to grapple with the questions of what types of data and what kinds of value it extracts from its television archives. These questions require the broader intellectual frameworks of the humanities to place Made by Machine into context.

Seen by Machine
Made by Machine was a project commissioned under a simple premise: to create a TV programme out of archive footage using machine learning. The project was conceived, produced and delivered between May and August 2018, and it aired under the category of documentary on BBC Four in September as part of a two-night experimental programming block called “AITV on BBC 4.1” (https://www.bbc.co.uk/programmes/p06jt9ng).

Most of Made by Machine was developed internally by a small group of technologists at BBC R&D, in collaboration with an external researcher from King's College London. This team designed the computational methods to analyse archive footage and produce new video sequences out of it; the sequences were then packaged and delivered as a documentary by a producer, who commissioned a 3D-rendered talking head and a presenter whose commentary bookends the sequences.

First, a Support Vector Machine (SVM) was trained on existing programme metadata “to predict whether or not a given programme was broadcast on BBC Four” (Cowlishaw, 2018). Using this classifier, an ad hoc subset of programmes was selected and then segmented into several thousand clips, ranging from tens of seconds to several minutes, using an algorithm developed internally at BBC R&D (Cowlishaw, 2018). These clips were then used as a dataset to be explored and manipulated using four methods (a minimal sketch of the SVM pre-selection step appears after the list below):
1. Object detection
2. Subtitle analysis
3. Visual energy
4. Mixed (generative) model
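
To make the pre-selection step concrete, a minimal sketch of such a metadata classifier is given below. The metadata fields, the TF-IDF features and the use of scikit-learn are assumptions for illustration, not a description of the actual BBC R&D implementation.

    # Minimal sketch: predict whether a programme was broadcast on BBC Four
    # from its textual metadata. Field names ('title', 'synopsis', 'channel'),
    # the TF-IDF features and scikit-learn itself are assumptions.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    def train_bbc_four_classifier(programmes):
        """programmes: list of dicts with hypothetical 'title', 'synopsis', 'channel' keys."""
        texts = [p["title"] + " " + p["synopsis"] for p in programmes]
        labels = [1 if p["channel"] == "bbcfour" else 0 for p in programmes]
        model = make_pipeline(TfidfVectorizer(min_df=2), LinearSVC())
        model.fit(texts, labels)
        return model

    # The fitted classifier can then score the whole catalogue; high-scoring
    # programmes form the ad hoc subset that is segmented into clips.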
Object detection
The Densecap system (https://cs.stanford.edu/people/karpathy/densecap/) was used to automatically annotate the dataset of clips, and these annotations were then used as metadata to generate sequences by similarity. Densecap is a computer vision system designed to “localize and describe salient regions in images in natural language” (Johnson et al., 2015, p.1). It is trained on the Visual Genome dataset (https://visualgenome.org/), a crowd-sourced set of densely annotated images.
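
As an illustration of how such caption annotations could drive sequence generation by similarity, the sketch below treats each clip's set of region captions as one document and compares clips by cosine similarity over TF-IDF vectors. Both the annotation format and this particular similarity measure are assumptions.

    # Sketch: rank clips by the similarity of their DenseCap-style captions.
    # Each clip's captions are joined into one document; similarity is cosine
    # similarity over TF-IDF vectors (an assumed measure, for illustration).
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def rank_similar_clips(clip_captions, query_index, top_k=10):
        """clip_captions: list of lists of caption strings, one list per clip."""
        docs = [" ".join(captions) for captions in clip_captions]
        vectors = TfidfVectorizer().fit_transform(docs)
        scores = cosine_similarity(vectors[query_index], vectors).ravel()
        ranked = scores.argsort()[::-1]
        return [i for i in ranked if i != query_index][:top_k]

    # Chaining nearest neighbours from a seed clip yields a candidate sequence.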

Subtitle analysis
A pre-trained Word2Vec language model (https://code.google.com/archive/p/word2vec/) was used to create word embeddings of the programmes' subtitles; term frequency–inverse document frequency (TF-IDF) was then used to retrieve clips and generate sequences based on word salience and similarity.
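
A minimal sketch of this retrieval step is given below: each clip's subtitles are embedded as a TF-IDF-weighted average of pre-trained Word2Vec vectors, and clips are retrieved by cosine similarity. The gensim loading step and the weighting scheme are assumptions about how the published model might be combined with TF-IDF, not the project's actual code.

    # Sketch: embed each clip's subtitles as a TF-IDF-weighted average of
    # pre-trained Word2Vec vectors, then retrieve clips by cosine similarity.
    # The gensim loading step and the weighting scheme are assumptions.
    import numpy as np
    from gensim.models import KeyedVectors
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def embed_subtitles(subtitle_texts, w2v_path="GoogleNews-vectors-negative300.bin"):
        w2v = KeyedVectors.load_word2vec_format(w2v_path, binary=True)
        tfidf = TfidfVectorizer()
        weights = tfidf.fit_transform(subtitle_texts)
        vocab = tfidf.get_feature_names_out()
        embeddings = np.zeros((len(subtitle_texts), w2v.vector_size))
        for i in range(len(subtitle_texts)):
            row, total = weights[i].tocoo(), 0.0
            for j, w in zip(row.col, row.data):
                if vocab[j] in w2v:
                    embeddings[i] += w * w2v[vocab[j]]
                    total += w
            if total > 0:
                embeddings[i] /= total
        return embeddings

    def most_similar_clips(embeddings, query_index, top_k=10):
        scores = cosine_similarity(embeddings[query_index:query_index + 1], embeddings).ravel()
        return [i for i in scores.argsort()[::-1] if i != query_index][:top_k]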

Visual energy
MPEG-2 video encoding was used to extract a frame-to-frame motion vector of pixel colour difference. This signal was then used to rank every clip in the dataset by its video dynamics, while preserving the variation information at the frame level.
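
The sketch below approximates this measure on decoded frames with OpenCV, using the mean absolute pixel difference between consecutive frames. The original pipeline derived the signal from the MPEG-2 encoding itself, so this decoded-frame approximation is an assumption for illustration.

    # Sketch: approximate 'visual energy' as the mean absolute colour difference
    # between consecutive decoded frames (OpenCV). The original pipeline used
    # MPEG-2 motion information; this approximation is an assumption.
    import cv2
    import numpy as np

    def visual_energy(video_path):
        cap = cv2.VideoCapture(video_path)
        prev, per_frame = None, []
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            if prev is not None:
                per_frame.append(float(cv2.absdiff(frame, prev).mean()))
            prev = frame
        cap.release()
        # Per-frame values preserve frame-level variation; the mean ranks the clip.
        return float(np.mean(per_frame)), per_frame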
Mixed (generative) model
The features learned and extracted in the previous three steps were then used as inputs to train a long short-term memory (LSTM) network (Hochreiter and Schmidhuber, 1997; Gers et al., 2000) to learn a model of the relations between the features across time. This model was later used to generate new, similar sequences.
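
A minimal sketch of such a sequence model is given below: an LSTM trained to predict the next clip's combined feature vector from the preceding ones, with generation by matching predictions back to the nearest clip. PyTorch, the layer sizes, the loss and the nearest-neighbour decoding step are assumptions for illustration.

    # Sketch: an LSTM that learns to predict the next clip's combined feature
    # vector (caption, subtitle and visual-energy features concatenated) from
    # the preceding ones. PyTorch, layer sizes and the loss are assumptions.
    import torch
    import torch.nn as nn

    class SequenceModel(nn.Module):
        def __init__(self, feature_dim, hidden_dim=256):
            super().__init__()
            self.lstm = nn.LSTM(feature_dim, hidden_dim, batch_first=True)
            self.head = nn.Linear(hidden_dim, feature_dim)

        def forward(self, x):          # x: (batch, time, feature_dim)
            out, _ = self.lstm(x)
            return self.head(out)      # predicted features for the next step

    def train_step(model, optimiser, batch):
        # Teach the model to map the features at time t to those at t + 1.
        inputs, targets = batch[:, :-1, :], batch[:, 1:, :]
        loss = nn.functional.mse_loss(model(inputs), targets)
        optimiser.zero_grad()
        loss.backward()
        optimiser.step()
        return loss.item()

    # At generation time, each predicted feature vector is matched to its
    # nearest clip in the dataset, yielding a new sequence of archive clips.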
Through these four techniques the team aimed to represent digital video in terms of three of its dimensions: depiction (visual), dialogue (linguistic) and motion (plastic). One significant contribution of our approach is the attempt to model moving imagery as a multi-dimensional time-series problem: analysing video streams as sequences of para-symbolic units whose ordering in time is integral to the creation of meaning.

Computational spectatorship
Finally, we conceptualise this family of machine-learning approaches as machine-seers. These machine-seers mediate our relationship with large audio-visual collections and provide novel ways of automatic browsing as a form of automated editing.
We argue that machine-seers enable a new and specific modality of vision, one which depends heavily on machine-learned representations not only of digital video features but, crucially, of spectatorship. Machine-seers are built on top of existing computer vision and other machine learning techniques, but we conceive them as a second generation of systems designed to learn and replicate complex viewing practices in and across media; they activate in audiences new visual dispositions and capacities, and as such they enable particular ways of seeing (see: Berger, 1972).
One significant advantage of framing a project like Made by Machine in this way is that theories of spectatorship enable us to think of the imagery produced through machine learning systems as images of the systems themselves; their aesthetics include the technological and social conditions for their existence.

We therefore argue that Made by Machine can be understood as an early instance of a new aesthetic modality: the datamatic (Ikeda et al., 2012). Unlike the cinematic or the televisual before it, the datamatic is first and foremost a form of computational spectatorship: more than novel ways of creating imagery, machine-seers afford us novel ways of enjoying imagery; they fetishise calculation and turn the datafication of society into its own form of spectacle. Through this spectaculum ex computatio, datamatic watching allows us, potentially, to enjoy sequencing without continuity, narrative without authorship and, ultimately, presence without subject.

Bibliography
BBC R&D (2017) Data Science Research Partnership [online]. Available from: https://www.bbc.co.uk/rd/projects/data-science-research-partnership (Accessed 28 September 2018).
Berger, J. (1972) Ways of Seeing. 2008 edition. Penguin UK.
British Film Institute (2018) Key archives and what they do [online]. Available from: https://www.bfi.org.uk/archive-collections/archive-projects/artist-archive/key-archives-what-they-do (Accessed 15 October 2018).
Butterworth, B. (2008) History of the ‘BBC Redux’ project [online]. Available from: http://www.bbc.co.uk/blogs/bbcinternet/2008/10/history_of_the_bbc_redux_proje.html (Accessed 15 October 2018).
CISCO (2018) Cisco Visual Networking Index: Forecast and Trends [online]. Available from: https://www.cisco.com/c/m/en_us/solutions/service-provider/vni-forecast-highlights.html (Accessed 22 January 2019).
Cowlishaw, T. (2018) Using Artificial Intelligence to Search the Archive. BBC R&D Blog [online]. Available from: https://www.bbc.co.uk/rd/blog/2018-10-artificial-intelligence-archive-television-bbc4 (Accessed 27 September 2018).
Gers, F. A. et al. (2000) Learning to Forget: Continual Prediction with LSTM. Neural Computation. [Online] 12 (10), 2451–2471.
Hochreiter, S. & Schmidhuber, J. (1997) Long Short-Term Memory. Neural Computation. [Online] 9 (8), 1735–1780.
Ikeda, R. et al. (2012) Ryoji Ikeda: Datamatics. Charta.
Jay, M. (2002) Cultural relativism and the visual turn. Journal of Visual Culture. [Online] 1 (3), 267–278.
Johnson, J. et al. (2015) DenseCap: Fully Convolutional Localization Networks for Dense Captioning. arXiv:1511.07571 [cs] [online]. Available from: http://arxiv.org/abs/1511.07571 (Accessed 23 November 2018).
Kuhn, V. et al. (2015) ‘The VAT: Enhanced Video Analysis’, in Proceedings of the 2015 XSEDE Conference: Scientific Advancements Enabled by Enhanced Cyberinfrastructure. XSEDE ’15. pp. 11:1–11:4 [online]. Available from: http://doi.acm.org/10.1145/2792745.2792756 (Accessed 23 November 2018).
Lee, A. (2014) An interview with Adam Lee, BBC archive expert [online]. Available from: http://www.bbc.co.uk/archive/tv_archive.shtml (Accessed 15 October 2018).
Lyman, P. & Varian, H. R. (2003) How Much Information [online]. Available from: http://groups.ischool.berkeley.edu/archive/how-much-info-2003/ (Accessed 23 November 2018).
Mirzoeff, N. (2002) The Visual Culture Reader. Psychology Press.
Ward, J. S. & Barker, A. (2013) Undefined By Data: A Survey of Big Data Definitions. arXiv:1309.5821 [cs] [online]. Available from: http://arxiv.org/abs/1309.5821 (Accessed 17 October 2018).
