From Visual Storytelling (Huang et al. 2016):

“Abstract: We introduce the first dataset for sequential vision-to-language, and explore how this data may be used for the task of visual storytelling. The first release of this dataset, SIND v.1, includes 81,743 unique photos in 20,211 sequences, aligned to both descriptive (caption) and story language. We establish several strong baselines for the storytelling task, and motivate an automatic metric to benchmark progress. Modelling concrete description as well as figurative and social language, as provided in this dataset and the storytelling task, has the potential to move artificial intelligence from basic understandings of typical visual scenes towards more and more human-like understanding of grounded event structure and subjective expression.”

Visual Storytelling (Huang et al. 2016) builds on related work in vision-to-language that examined “image captioning (Lin et al., 2014; Karpathy and Fei-Fei, 2015; Vinyals et al., 2015; Xu et al., 2015; Chen et al., 2015; Young et al., 2014; Elliott and Keller, 2013), question answering (Antol et al., 2015; Ren et al., 2015; Gao et al., 2015; Malinowski and Fritz, 2014), visual phrases (Sadeghi and Farhadi, 2011), video understanding (Ramanathan et al., 2013), and visual concepts (Krishna et al., 2016; Fang et al., 2015)” to progress from literal description to narration. But realistic storytelling is only one aspect of human intelligence.

The crowd-sourcing methods for the Cooperative Poem (CoPo) project are similar to those outlined in Visual Storytelling. Beyond producing a collection of co-authored poems, the project yields a dataset, the CoPo dataset, that can be analyzed to inform an AI model of ekphrastic human expression.

According to the Poetry Foundation, ekphrastic poetry is “a vivid description of a scene or, more commonly, a work of art. Through the imaginative act of narrating and reflecting on the ‘action’ of a painting or sculpture, the poet may amplify and expand its meaning.” The expansion of meaning occurs because we are capable of more than narrating a visual scene. Our ability to reflect on the past, present, and future implications of a visual scene, and its associated narrative, allows our imagination to create a unique narrative.