Joseph Wang

172 Summer Program for Undergraduates in Data Science Long-Form Visual Understanding in Multimodal Large Language Models

Authors:

Joseph Wang, Mengyu Wang

Date Created:

2025-01-01

Course Title:
Professor:

Not specified

About Paper:

Recent advancements in multimodal large language models (MLLMs), particularly the representation of visual inputs as additional tokens alongside text, have enabled impressive progress in MLLM open-world visual understanding. These developments offer the potential of assistive technologies for visually impaired individuals. However, significant challenges arise when these models are applied to long-form video content, where the number of visual tokens vastly exceeds MLLM limits. Uniform frame sampling serves as the state-of-the-art practice for this issue but sometimes loses crucial context in videos. In our study, we benchmarked four open-source MLLMs on the Ego4D dataset, comparing model-generated video summaries from uniform frame sampling against ground-truth summaries using standard captioning metrics. Then, we fine-tuned the model to observe performance shifts. Our results indicated that base model performance in basic semantic accuracy benchmarks is low, with 28.47% for BLEU-1, 27.67% for METEOR, and 31.56% for ROUGE-1, while more advanced semantic alignment scores all remain under 15%. To improve performance, we applied supervised Low-Rank Adaptation (LoRA) fine-tuning to the InternVL3-8B MLLM, and preliminary LoRA results show improved semantic accuracy, though fine-grained consistency remains uneven. To assess subtle temporal understanding, we will then evaluate the fully LoRA fine-tuned model on natural language query tasks. Ultimately, preliminary results suggest that standard uniform frame sampling often omits critical frames, creating gaps in contextual memory and limiting the model’s ability to truthfully understand certain events. From these findings, we propose a paradigm shift for future work: a reinforcement-learning agent for adaptive frame selection, representing a meaningful step toward scalable, context-aware MLLMs for robust video comprehension across diverse environments.
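
To illustrate why uniform frame sampling can omit critical frames, the following is a minimal sketch and not the authors' pipeline. The function names, the choice of segment midpoints, and the frame counts are assumptions made for illustration only.

```python
def uniform_frame_indices(total_frames: int, num_samples: int) -> list[int]:
    """Return evenly spaced frame indices (the midpoint of each equal segment).

    Hypothetical helper: the abstract does not specify how frames are picked.
    """
    if total_frames <= 0 or num_samples <= 0:
        return []
    # Never request more frames than the video contains.
    num_samples = min(num_samples, total_frames)
    segment = total_frames / num_samples
    return [int(segment * i + segment / 2) for i in range(num_samples)]


def event_is_sampled(indices: list[int], start: int, end: int) -> bool:
    """True if at least one sampled index falls inside the event [start, end)."""
    return any(start <= i < end for i in indices)


if __name__ == "__main__":
    # Assumed toy video: 100 frames, a budget of 4 sampled frames.
    indices = uniform_frame_indices(100, 4)
    print(indices)                            # [12, 37, 62, 87]
    print(event_is_sampled(indices, 40, 60))  # False: this event is skipped entirely
```

With a fixed frame budget, any event shorter than the spacing between samples can fall between two sampled frames, which is the kind of context gap the abstract attributes to uniform sampling.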

Source:

Harvard / Harvard College | Adams House | Statistics | 2027 / 2025

Topics:

model, mllm, frame, visual, understanding, language, semantic, performance, lora, fine, video, uniform

Professor Score
92.5
Verified