VTPPG: End-to-end Video-Transformed Persona Poem Generation

Model Parameters and Training

As shown in Fig. 3, the character action recognition module has 1 convolution layer, 1 pooling layer, 5 residual layers, and 1 fully-connected classifier layer with 79 classes.

The character emotion recognition module has 3 convolution layers, 3 separable convolution layers, 2 pooling layers, 1 softmax layer, and 1 fully-connected classifier layer with 6 classes.

The character relationship recognition module has 5 convolution layers, 2 fully-connected layers, and 1 fully-connected classifier layer with 16 classes. These 16 specific relationships are first simplified into five domains; after we merge the two domains that are most similar in ancient poetry, they are finally divided into four classes.

The scenery emotion recognition module has 3 convolution layers, 2 pooling layers, 2 dropout layers, 2 dense layers, 1 flatten layer, 1 softmax layer, and 1 fully-connected classifier layer with 3 classes.
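The layer counts above can be collected into a small specification table for quick sanity checks. The following is only a bookkeeping sketch: the names `ModuleSpec`, `MODULES`, and `summarize` are hypothetical, and the code records the counts stated in the text rather than the authors' actual network definitions.

```python
from dataclasses import dataclass


@dataclass
class ModuleSpec:
    """Layer inventory of one recognition module (hypothetical helper type)."""
    name: str
    layers: dict       # layer type -> count, as listed in the text
    num_classes: int   # output size of the final classifier layer

    def total_layers(self) -> int:
        return sum(self.layers.values())


# Counts copied from the description above; nothing here is measured or trained.
MODULES = [
    ModuleSpec("character_action",
               {"conv": 1, "pool": 1, "residual": 5, "fc_classifier": 1}, 79),
    ModuleSpec("character_emotion",
               {"conv": 3, "separable_conv": 3, "pool": 2, "softmax": 1,
                "fc_classifier": 1}, 6),
    # The classifier outputs 16 relationships, later reduced to four classes.
    ModuleSpec("character_relationship",
               {"conv": 5, "fc": 2, "fc_classifier": 1}, 16),
    ModuleSpec("scenery_emotion",
               {"conv": 3, "pool": 2, "dropout": 2, "dense": 2, "flatten": 1,
                "softmax": 1, "fc_classifier": 1}, 3),
]


def summarize(modules):
    """Return {module name: (total listed layers, number of classes)}."""
    return {m.name: (m.total_layers(), m.num_classes) for m in modules}


if __name__ == "__main__":
    for name, (n_layers, n_classes) in summarize(MODULES).items():
        print(f"{name}: {n_layers} listed layers, {n_classes} classes")
```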

We ran the model on a cloud server configured with 11 GB of video memory, 6 × Xeon E5-2678 v3 CPU, 62 GB of memory, and a 100 GB hard disk. For the training of the poetry generation model, we set the beam size, mini-batch size, word embedding size, latent feature size, hidden state size, and context vector size to 20, 128, 256, 256, 512, and 512, respectively. We choose leaky ReLU as the activation function and adopt dropout with a ratio of 0.15 together with L2-norm regularization to avoid overfitting. The learning rate decays during training, and its range is between \(8*^\) and \(5*^\). The denoising autoencoder trains for 10 epochs, the classifier trains for 10 epochs, the Mix Train stage with all samples trains for 12 epochs, and the Mix Train stage with labelled samples trains for 18 epochs. For generation, we keep the beam size at 20 and randomly select one poem from the top-3 candidates to ensure quality.
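The generation step described above (beam search followed by a random pick among the three best candidates) can be sketched as follows. This is a minimal illustration under stated assumptions: `GenerationConfig` and `pick_from_top_k` are hypothetical names, candidates are assumed to arrive as (poem, score) pairs with higher scores being better, and the beam search itself is not shown.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class GenerationConfig:
    """Hyperparameters quoted in the text (field names are assumptions)."""
    beam_size: int = 20
    batch_size: int = 128
    word_embedding_dim: int = 256
    latent_dim: int = 256
    hidden_dim: int = 512
    context_dim: int = 512
    dropout: float = 0.15
    top_k: int = 3


def pick_from_top_k(scored, k, rng):
    """Pick one poem uniformly at random from the k best-scoring candidates.

    scored: list of (poem, score) pairs; a higher score means a better candidate.
    """
    if not scored:
        raise ValueError("no candidates to choose from")
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return rng.choice(ranked[:k])[0]


if __name__ == "__main__":
    cfg = GenerationConfig()
    # Placeholder beam output: 20 dummy candidates with decreasing scores.
    beam = [(f"candidate-{i}", -float(i)) for i in range(cfg.beam_size)]
    print(pick_from_top_k(beam, cfg.top_k, random.Random(0)))
```

Sampling among the top three rather than always taking the best candidate trades a small amount of score for variety across runs.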

Evaluation

We evaluate the proposed poem generation approach (VTPPG) against state-of-the-art baselines: the ground truth (human-created poetry) and two text-based baselines, Seq2Seq2 [31] and Jiu Ge [20] [29] [30], the latter being a THUNLP online poetry generation system that integrates the MixPoet model [20]. The image-based baseline is not included because its git repository is unavailable. Given 50 people-centric videos, VTPPG generates 50 persona poems. Since VTPPG takes videos as input and, to the best of our knowledge, related work is scarce, we can only compare with Jiu Ge, which is driven by text-based inputs. In the comparison, the character keywords corresponding to each video used by VTPPG serve as the inputs for Jiu Ge. Following the widely used human evaluation protocol [20], we focus on eight criteria to assess the quality of the generated poems:

1. Character Actions Detection: Does the poem embody any action?
2. Character Relationships Detection: Does the poem manifest any relationship?
3. Character Emotions Detection: Does the poem represent any character feeling?
4. Scenery Emotions Detection: Does the poem represent any background mood?
5. Fluency: Is the poem fluent and well-formed?
6. Coherence: Is the poem coherently and logically structured?
7. Meaningfulness: Does the poem convey specific messages?
8. Overall Quality: What is the general impression of the poem?

Each criterion is rated on a 5-point scale (1–5).

Figure 5 and Table 1 compare our approach with the other four models based on human evaluation. In Table 1, the average score of our system is 5% higher than that of Jiu Ge, which indicates better performance in several aspects, especially in the detection of CR, CE, and SE. However, contrary to expectations, VTPPG performs relatively poorly in coherence. The most likely cause is uncommon combinations, such as angrily riding a horse. Since readers are subjective, such rare combinations cause bewilderment when they lack historical context or relevant allusions. A more detailed discussion with examples follows in the next paragraphs.
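To make the scoring procedure concrete, the sketch below averages 5-point ratings per criterion and compares two systems. It is an illustration only: the function names and the short criterion labels are hypothetical, the numbers in the usage example are placeholders rather than values from Table 1, and reading "5% higher" as a relative difference (rather than percentage points) is an assumption.

```python
from statistics import mean

# Short labels for the eight criteria (labels are assumptions for this sketch).
CRITERIA = ("CA", "CR", "CE", "SE",
            "Fluency", "Coherence", "Meaningfulness", "Overall")


def criterion_means(ratings):
    """Average each criterion over all rating sheets.

    ratings: list of dicts, one per (rater, poem), mapping criterion -> score 1..5.
    """
    for sheet in ratings:
        for criterion in CRITERIA:
            if not 1 <= sheet[criterion] <= 5:
                raise ValueError(f"{criterion} score out of the 1-5 range")
    return {c: mean(sheet[c] for sheet in ratings) for c in CRITERIA}


def average_score(means):
    """Unweighted mean over the eight criterion means."""
    return mean(means.values())


def relative_gain(ours, baseline):
    """Relative difference of two average scores, e.g. 0.05 for '5% higher'."""
    return (ours - baseline) / baseline


if __name__ == "__main__":
    # Placeholder rating sheets, not data from the paper.
    sheets = [{c: 4 for c in CRITERIA}, {c: 2 for c in CRITERIA}]
    print(average_score(criterion_means(sheets)))
    print(relative_gain(4.2, 4.0))
```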

Fig. 5 Comparison of Poetry Generation by Multiple Algorithms

Table 1 Qualitative Evaluation

In-depth Comparison of the Works by Jiu Ge and VTPPG

Our analysis, which covers diverse types of CA, CR, CE, and SE, consists of four common scenarios of social dynamics (Fig. 6): (A) friends playing their instruments sadly in a festive scene (36 s); (B) kings and soldiers riding horses on a battlefield full of terror and brutality (57 s); (C) a couple happily rowing a boat (42 s); (D) a mother and daughter on a mountain outing on a sunny day (45 s). We analyze the poem generation results and compare them with those of Jiu Ge. The scenarios include both aligned and reversed relationships between CE and SE. Table 2 details the four scenarios.
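One way to read "aligned" versus "reversed" is to compare the polarity of the character emotion (CE) with that of the scenery emotion (SE). The sketch below does this for Scenarios A to C, using the feature values reported in the scenario descriptions; Scenario D is left out because its CE and SE labels are not stated explicitly. The polarity mapping and the function name `ce_se_relation` are assumptions made for illustration.

```python
# Assumed polarity of each emotion label: +1 positive, 0 neutral, -1 negative.
POLARITY = {
    "happy": 1, "positive": 1,
    "neutral": 0,
    "sad": -1, "fear": -1, "negative": -1,
}

# Feature values as described for Scenarios A-C in the text.
SCENARIOS = {
    "A": {"CA": "playing an instrument", "CR": "reciprocity",
          "CE": "sad", "SE": "positive"},
    "B": {"CA": "riding a horse", "CR": "hierarchical",
          "CE": "fear", "SE": "negative"},
    "C": {"CA": "boating", "CR": "mating",
          "CE": "happy", "SE": "neutral"},
}


def ce_se_relation(features):
    """Classify the CE/SE pair as 'aligned', 'reversed', or 'neutral'."""
    product = POLARITY[features["CE"]] * POLARITY[features["SE"]]
    if product > 0:
        return "aligned"
    if product < 0:
        return "reversed"
    return "neutral"  # at least one side carries no polarity


if __name__ == "__main__":
    for key, features in SCENARIOS.items():
        print(key, ce_se_relation(features))
```

Under this mapping, Scenario A is the reversed case (sad characters in a festive scene) and Scenario B the aligned one.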

Fig. 6 Comparative Analysis between Jiu Ge and VTPPG. The four features from the videos are annotated in four colors, respectively. Ticks and crosses indicate whether the features are satisfied or not

Table 2 Four features of the four scenarios (A–D): Comparing Jiu Ge, VTPPG, and Human Poet

Scenario A: Dream of the Red Chamber, Ep.31 (1987)

The scenario depicts two characters and their actions: Baoyu listening to Daiyu playing an instrument. VTPPG successfully embodies the four features in the poem together with the characters and their contexts: the characters’ emotion is sad and their action is playing a musical instrument, while the scenery emotion is positive and the relationship of the characters is reciprocity. We elaborate on the poems as follows. The first line of the VTPPG-generated poem describes the action of playing the instrument with “琴” (the qin, a Chinese zither). In particular, the phrase “高山墨水” (High Mountains and Ink Waters) in the third line adapts the allusion “高山流水” (High Mountains and Flowing Waters), which celebrates the friendship between Yu Boya and Zhong Ziqi, by replacing “流” (flowing) with “墨” (ink). The allusion implies the relationship between Lin Daiyu and Jia Baoyu. The characters’ emotion is sad, and the fourth line, which carries a philosophical context, reflects the protagonists’ feeling of “愁” (melancholy). Furthermore, the scene’s emotion is positive, and “富” (affluence) reflects the happy scenery of the video. In contrast, Jiu Ge cannot reflect the characters subtly and hence neglects SE, CA, and CR. As a result, it is unable to achieve character-centric poetry.

Scenario B: Romance of the Three Kingdoms, Ep.70 (2010)

VTPPG parses the video of a battle scene, identifying fear as the characters’ emotion, a negative scenery emotion, and a hierarchical relationship between the characters. Jiu Ge misses the characters’ emotion and relationship but successfully describes the horse-riding action, with a limited scene description of the expedition, “出京畿” (departing from the capital’s outskirts). VTPPG uses “帝” (emperor) and “将” (general) to accurately reflect the relationship between the characters as emperor and commanders (in green color). Additionally, “贫” (poverty) reflects the negative character emotion of fear, i.e., the frightened mood connected to the battlefield.

Scenario C: Bu Bu Jing Xin, Ep.15 (2011)

VTPPG fully reflects the four features: the happy emotion of the characters, the neutral emotion of the scene, the mating relationship of the characters, and the characters’ action of boating. “涓” (a gurgling stream) in the first line reflects the sound of flowing water and vividly describes the boating environment. “爱” (love) in the second line depicts the characters’ relationship of love, and “渔” (fishing) also matches the scene around the rowing boat. “晴” (sunny weather) in the third line connects to the happy mood of the characters. Finally, “花屋” (a flower-decked hut) in the fourth line evokes a harmonious and bright picture. In comparison, Jiu Ge delivers only two elements in the poem. Its first line contains “水自流” (the water flows on its own), which reflects the action of boating, and its third line has “鹧鸪” (the partridge), which highlights the meaning of love and thus the relationship between the characters. The remaining two features are neglected by Jiu Ge.
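The tick-and-cross judgement for a poem can be mimicked by checking whether annotated keywords actually occur in the stated lines. The sketch below applies this to the VTPPG poem for Scenario C as quoted in this paper. The keyword-to-feature assignment is our reading of the discussion above and is an assumption, as are the names `coverage` and `ANNOTATIONS`; the actual evaluation was done by human raters, not by string matching.

```python
FEATURES = ("CA", "CR", "CE", "SE")

# VTPPG's poem for Scenario C, one string per line of the quatrain.
VTPPG_POEM_C = [
    "暮立怒林涓祝酒",
    "断无未爱户能渔",
    "阴诗风雨晴人个",
    "花屋期期道减锄",
]

# (keyword, 1-based line number, feature); the feature column is an assumption.
ANNOTATIONS = [
    ("涓", 1, "CA"),    # water sound around the boat
    ("爱", 2, "CR"),    # love between the characters
    ("晴", 3, "CE"),    # sunny weather, linked to the happy mood
    ("花屋", 4, "SE"),  # harmonious, bright setting
]


def coverage(poem_lines, annotations):
    """Return {feature: True/False}, True if an annotated keyword is found."""
    covered = set()
    for keyword, line_no, feature in annotations:
        if keyword in poem_lines[line_no - 1]:
            covered.add(feature)
    return {feature: feature in covered for feature in FEATURES}


if __name__ == "__main__":
    for feature, ok in coverage(VTPPG_POEM_C, ANNOTATIONS).items():
        print(feature, "tick" if ok else "cross")
```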

Scenario D: Jin Fen Shi Jia, Ep.1 (2003)

The video shows a hiking activity involving the protagonist Leng Qingqiu and her mother. VTPPG satisfies all features, while Jiu Ge reflects only the character relationship and actions. Because Jiu Ge generates poems from text-based input (keywords), it achieves good results on relationship and actions in such a straightforward scene. For example, Jiu Ge represents relatives with “亲心” (kinship), while VTPPG uses “祖郊” (ancestral suburbs) to represent both the family and the hiking. In Jiu Ge, a separate phrase, “青山” (green mountains), describes the characters’ action of hiking in an implicit manner. In contrast, VTPPG describes the characters, relationships, and emotions in a subtle way, i.e., without using the keywords explicitly; for instance, the green mountains are rendered as “夏梢” (summer treetops), denoting flourishing trees in summer.

In-depth Emotion Analysis

Finally, we present an additional case to reveal how VTPPG adapts to emotions. We switch the characters’ emotion between happy and sad for an identical scene of a dancing couple. Specifically, CA is dancing, CR is lovers, and SE is neutral. For both emotions, the poems share the same first two lines: “银白祖郊雨见河|发蟠溪女处多蛾|” (Silver-white ancestral suburbs, rain reveals a river; By the coiled stream, the maiden’s brows are delicate like moths). With the happy emotion, the second half of the poem is “十时及隐生圆入|正是春风 (spring breeze) 在水河|” (At the tenth hour, secrecy gives way to fulfillment; It is precisely the spring breeze upon the river), in which “春风” conveys the characters’ great pleasure. In contrast, the sad emotion gives “海魂后昼心何得|生有风尘 (wind and dust) 竹月歌|” (After the sea-soul fades at daylight, where can the heart find solace? Life is fraught with worldly toils, a bamboo-and-moonlight song). In the third line, “心何得” (where can the heart find solace) is a rhetorical question, evoking a sense of loss and melancholy. In the fourth line, “风尘” (wind and dust) paints a picture of worldly desolation, connecting to the sorrowful characters.
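The observation that the two poems share their opening and diverge only in the second half can be checked mechanically. The sketch below uses the two quatrains quoted above, with the inline English glosses removed; the helper names `split_lines` and `shared_opening` are hypothetical.

```python
# The two emotion-conditioned quatrains, with "|" marking line ends as in the text.
HAPPY = "银白祖郊雨见河|发蟠溪女处多蛾|十时及隐生圆入|正是春风在水河|"
SAD = "银白祖郊雨见河|发蟠溪女处多蛾|海魂后昼心何得|生有风尘竹月歌|"


def split_lines(poem):
    """Split a '|'-delimited poem into its non-empty lines."""
    return [line.strip() for line in poem.split("|") if line.strip()]


def shared_opening(poem_a, poem_b):
    """Count how many leading lines the two poems have in common."""
    count = 0
    for line_a, line_b in zip(split_lines(poem_a), split_lines(poem_b)):
        if line_a != line_b:
            break
        count += 1
    return count


if __name__ == "__main__":
    print("shared opening lines:", shared_opening(HAPPY, SAD))
    print("happy ending:", split_lines(HAPPY)[2:])
    print("sad ending:", split_lines(SAD)[2:])
```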

Human Poet Comparison and Analysis

To comprehensively evaluate the performance of the VTPPG model and gain insight into human-authored compositions, we invited several contemporary poets to view video clips A, B, C, and D and create corresponding quatrains based on their impressions. Below are the poets’ creations and their analyses from a first-person perspective, aiming to provide a fair comparison with VTPPG’s outputs in terms of creative motivation, emotional integration, and literary conception.

Case A: Corresponding to Scenario A in Fig. 6

Human Poet’s Work: 独坐幽轩理玉琴, 弦凝指咽寄情深。知音不见空山晚, 唯有泠泠月下寻。 (Alone she sits in a quiet lodge, tuning the jade zither’s strings; The paused strings, muted fingers, convey a feeling deep. Where is the soulmate? Empty mountains greet the dusk; Only the clear, cold moonlight seeks her in its sweep.)

Poet’s Creative Analysis: “Watching the video, I was first struck by the image of the woman alone in a room, focused on playing her instrument. The first line, ‘Alone she sits…’, aims to establish a solitary atmosphere. The phrase ‘The paused strings, muted fingers’ is my attempt to translate the visual of the performer’s brief pauses and subtle facial expressions into an auditory stagnation, hinting at a burden in her heart. The third line, ‘Where is the soulmate?’, is the finishing touch; it stemmed from my observation of the woman occasionally looking up into the void with a gaze of searching and expectation, leading me to infer that her heart’s melody yearns to be understood. The final line, ‘Only the clear, cold moonlight…’, naturally transitions the scene from indoors to outside, using the cold moonlight to echo the loneliness of no response, rendering the emotional resonance more ethereal and profound. The poem generated by the VTPPG model, with lines like ‘子千知中画思琴’, accurately captures the core action of ‘playing the zither’ and metaphorically describes the scene as a ‘painting’, demonstrating its fundamental capability for scene translation. However, the human creation focuses more on excavating and sublimating the underlying ‘emotional motivation’ (e.g., the longing for a soulmate) from subtle movements and expressions, which is a direction for further exploration in AI’s pursuit of deep emotional resonance.”

Case B: Corresponding to Scenario B in Fig. 6

Human Poet’s Work: 蹴踏风沙意气豪, 弓鸣箭啸掠平皋。蓦然回望烟尘处, 一点惊魂逐雁高。 (Kicking up sand and wind, his spirit proud and high; Bowstrings hum, arrows whiz, sweeping past the plain. Suddenly, he turns to gaze where dust clouds lie; A trace of startled soul follows wild geese in the sky.)

Poet’s Creative Analysis: “The video conveys not merely heroic courage, but an adventure mixed with tension and a hint of fear. The first two lines, ‘Kicking up sand and wind’ and ‘Bowstrings hum, arrows whiz’, intensely depict the fierceness of the scene and the rider’s boldness, setting the stage for a turn. The third line, ‘Suddenly he turns to gaze’, is key, capturing a sudden glance back from the rider during the gallop, with a flash of alertness or fear in his eyes. Therefore, in the final line, I use ‘A trace of startled soul follows wild geese in the sky’ to depict this subtle psychology: that momentary fear, like a startled wild goose, breaks away from the heroic backdrop and soars. The juxtaposition of these complex emotions is the source of the poem’s tension. VTPPG’s generated line ‘未先冠帝尚明将, 贫碧期期骨散梁’ directly describes the characters and emotion, demonstrating a good ability to associate characters with emotional dynamics. However, the strength of human creation lies in constructing ‘layers and transitions’ in emotion, mapping the external ‘proud’ to the internal ‘startled’, thereby shaping a more three-dimensional character psychology. Capturing such rapidly shifting complex emotions remains an area for VTPPG to deepen.”

Case C: Corresponding to Scenario C in Fig. 6

Human Poet’s Work: 叶轻剪碧波, 桃云蘸水影婆娑。 何须更问武陵事, 此处春光已最多。(A leaf-like boat cuts through the emerald waves; Peach blossoms kiss the water, shadows sway. Why ask again about the Peach Blossom Spring? Here, the essence of spring is on full display.)

Poet’s Creative Analysis: “This video is filled with leisurely joy. I decided to use bright, lively imagery to capture this feeling. The first line, ‘cuts through the emerald waves’, animates the static scene, depicting the lightness of the boat as it parts the spring water. The second line, ‘Peach blossoms kiss the water’, captures a classic visual from the video: the banks of peach blossoms (like clouds) reflecting in the water, as if the clouds were dipping into the clear water, striving for novel and vivid imagery. For the latter two lines, I employed the allusion to ‘Peach Blossom Spring’ (武陵源, Wuling Yuan) but deliberately negated it with ‘Why ask again?’, intending to express that the real-world beauty in the video already surpasses the idyllic imagination, representing an immediate, genuine joy. My approach can be described as a realistic sublimation of visual content, transitioning from a scene to a focused emotional expression. In stark contrast, the poem generated by VTPPG for the same video demonstrates a significantly more sophisticated capacity for narrative construction. The lines ‘暮立怒林涓祝酒, 断无未爱户能渔’ (At dusk I stand, toasting with a trickle against the furious woods; Severed, without a past love, how could a home sustain fishing?) transcend mere description of visual elements such as boating or flowers. This output indicates that VTPPG operates beyond simple scene translation; it performs a profound narrative interpretation. The model appears to have synthesised the input features, specifically the character action (CA: Boating) and the character relationship (CR: Mating), to abstract and articulate a complex psychological theme: the intricate anxieties preceding the commitment to a relationship.
It constructs an internal drama of tension: the protagonist, facing the overpowering force of nature (furious woods) at dusk, makes a tentative pledge (toasting with a trickle) and expresses a resolute acknowledgement of an inaccessible, ordinary life (fishing). This ability to transmute visual cues into a rich narrative of internal conflict represents a significant advancement in generative AI for poetry. The subsequent lines, ‘阴诗风雨晴人个, 花屋期期道减锄’ (A shaded poem of storm, rain, and a solitary clear sky; The flower-hut, awaited season after season, sees the path overgrown with disuse.), further deepen this complexity. They hint at interpersonal turmoil (‘storm, rain’) contrasted with individual solitude (‘a solitary clear sky’), and the gradual fading of promised bliss (‘flower-hut’) through prolonged waiting (‘season after season’). This technique of juxtaposing a pleasant scene with profound emotional undertones creates a powerful artistic tension that far exceeds a simple ‘Happy’ emotion classification. It showcases the model’s advanced capability to grasp and generate content reflecting emotional complexity. In summary, while my poem offers a ‘realistic sublimation’ of the video content, VTPPG’s generation is an exemplary piece of ‘narrative creation’. It does not merely describe the scenery but invents a story imbued with psychological depth and narrative tension. This comparison strongly validates the VTPPG model’s advanced ability to interpret latent character dynamics and generate poetry with significant narrative depth and emotional layering, underscoring its reliability and sophistication.”

Case D: Corresponding to Scenario D in Fig. 6

Human Poet’s Work: 携亲步翠微, 笑语共霞晖。林静鸟声悦, 心随暮色归。 (With kin I tread the emerald slopes; Our laughter blends with sunset’s glow. In hushed woods, birds sing in delight, My heart returns with twilight’s light.)

Poet’s Creative Analysis: “My composition aims to capture the essence of familial harmony through concise, vivid imagery. The first line, ‘携亲步翠微’ (With family, we tread the emerald slopes), immediately establishes the core relationship (携亲) and activity (步翠微). The second line, ‘笑语共霞晖’ (Our laughter blends with the sunset glow), reinforces the joyful emotion (笑语) and integrates the natural scenery (霞晖) to create a warm atmosphere. The third line, ‘林静鸟声悦’ (In the quiet woods, birds sing joyfully), uses auditory imagery to enhance the serene yet happy mood. The final line, ‘心随暮色归’ (Our hearts return with the twilight), provides a gentle conclusion, suggesting a harmonious union with nature and a satisfying completion of the outing. My approach emphasises emotional consistency and scenic cohesion, creating a straightforward yet evocative portrayal of the family outing.

The opening line of VTPPG’s poem, ‘赋依金向闻祖郊’, immediately transcends literal description. The term ‘赋’ (to compose poetry) adds a layer of literary refinement, while ‘祖郊’ (ancestral outskirts) imbues the outing with a sense of cultural and generational continuity, framing it as a symbolic return to roots. This shows VTPPG’s ability to infer latent cultural narratives from visual cues. The second line, ‘歇是香人士夏梢’, depicts a person resting under fragrant summer trees, effectively capturing a serene moment. The third line, ‘一点行高贞醉馨’, artfully combines the physical ascent (‘行高’) with emotional steadfastness (‘贞’) and sensory immersion (‘醉馨’), demonstrating the integration of multimodal features. The concluding line, ‘轻虚成月去梁茅’, transitions to a moonlit thatched lodge, creating an ethereal and philosophically resonant ending. VTPPG constructs a symbolic narrative arc, moving from poetic invocation, through a sensory journey, to a tranquil culmination, that interprets the family outing as a meaningful ritual. This highlights its strength in narrative abstraction and cultural reasoning, generating poetry rich in connotation beyond direct scene description.

While my poem provides a clear, emotionally consistent depiction, VTPPG’s generation excels in creating a layered, culturally embedded narrative. This contrast underscores VTPPG’s advanced capability in interpreting complex social dynamics (CR: Attachment) and generating poetry with profound symbolic depth, validating its innovative approach to persona poem generation.”

In summary, by introducing comparative creations and analyses of these four poems from several human poets, we enable a more refined comparison at the level of creative mechanism. The VTPPG model demonstrates powerful capabilities in scene element recognition, core emotional theme summarisation, and basic poetic language organisation, and its outputs are notably reasonable and reliable. The advantages of human creation lie in the deep interpretation of subtle expressions and actions, the layered construction of complex, contradictory emotions, and the rigorous logic by which all imagery serves a unified theme. This indicates the direction for VTPPG’s future development: progressing from ‘accurately describing scenes and expressing emotions’ towards ‘narrating and expressing emotions with depth, layers, and logic,’ thereby continuously approaching the rich connotation and artistic heights of human poetry while maintaining its existing advantages. This comparative study also validates the value of incorporating multi-faceted human creation as an evaluation benchmark.
