Video is worth a thousand images: exploring the latest trends in long video generation

Waseem, Faraz; Shahzad, Muhammad

Waseem, F. and Shahzad, M. ORCID: https://orcid.org/0009-0002-9394-343X (2025) Video is worth a thousand images: exploring the latest trends in long video generation. ACM Computing Surveys. ISSN 1557-7341 (In Press)

Text - Accepted Version
· Restricted to Repository staff only
· The Copyright of this document has not been checked yet. This may affect its availability.
15MB

It is advisable to refer to the publisher's version if you intend to cite from this work.

Abstract/Summary

An image may convey a thousand words, but a video, composed of hundreds or thousands of image frames, tells a more intricate story. Despite significant progress in multimodal large language models (MLLMs), generating extended videos remains a formidable challenge. As of this writing, OpenAI’s Sora [1], the current state-of-the-art system, is still limited to producing videos of up to one minute in length. This limitation stems from the complexity of long video generation, which requires more than generative AI techniques for approximating density functions: critical elements such as planning, narrative construction, and spatiotemporal continuity pose significant challenges of their own. Integrating generative AI with a divide-and-conquer approach could improve scalability for longer videos while offering greater control. In this survey, we examine the current landscape of long video generation, covering foundational techniques such as GANs and diffusion models, video generation strategies, large-scale training datasets, quality metrics for evaluating long videos, and future research areas that could address the limitations of existing video generation capabilities. We believe this survey will serve as a comprehensive foundation, offering extensive information to guide future advancements and research in the field of long video generation.

Item Type:	Article
Refereed:	Yes
Divisions:	Science > School of Mathematical, Physical and Computational Sciences > Department of Computer Science
ID Code:	124876
Publisher:	ACM

University of Reading

CentAUR: Central Archive at the University of Reading
