-
Pretraining Multilingual Foundation Models
Multilingual pretraining in large language models (LLMs) may confer cognitive benefits similar to those observed in multilingual humans, including enhanced reasoning and cross-linguistic generalization. Models often learn better when exposed to linguistic diversity than when trained on monolingual data alone, but the optimal balance remains unclear. In this study, we systematically investigate the impact of multilingual exposure by training LLaMA 3.2-1B models on varying ratios of English-Chinese data, characterize the performance changes across multiple benchmarks, and find that 25% multilingual exposure yields optimal results—improving logical reasoning and code synthesis by up to 130% while preventing catastrophic forgetting, though with some trade-offs in fairness metrics.
-
Using NeRF and Foundational Models to Create Distilled Feature Fields
Modern robotics applications require powerful visual systems to interact with the physical world. For example, consider a warehouse robot working at Amazon, tasked with grabbing a product from a bin filled with other products, using a mechanical arm, a camera, and the product description. Such a robot would need several capabilities: the semantic capability to identify the product in the bin from its description, and the spatial capability to successfully grasp it. Researchers have tackled this very problem by training computer vision models both to represent the 3D space captured by images of a scene and to represent the semantics of the objects within it. These two capabilities are achieved separately through Neural Radiance Fields (NeRF) and image features from foundational models, respectively; when combined, the resulting approach is called Distilled Feature Fields (DFFs). Here, we describe the inner workings of NeRF, foundational model features, and DFFs, along with other approaches that have since been developed, and their advantages and drawbacks.
-
A Survey of LERF-TOGO and Related Works
In this article, we examine LERF-TOGO, a zero-shot task-oriented grasper: a system that can grasp unseen objects by the correct part depending on its task. We first give an overview of two previous state-of-the-art approaches. Then, we discuss the various foundation models used to construct the LERF-TOGO pipeline. Afterwards, we assess the advantages and limitations of LERF-TOGO. Finally, we briefly discuss trends visible across years of research in task-oriented grasping: namely, the increasing prominence of advanced foundation models in the development of novel task-oriented grasping systems.
-
Neural Game Engine
This report explores recent advances in neural game engines, with a focus on generative models that simulate interactive game environments. Based on the prior presentation of GameNGen, a diffusion-based video model trained on DOOM, I review and compare several state-of-the-art approaches including DIAMOND, MineWorld, IRIS, GameGAN, and the original World Models. I also analyze their differences in architecture, visual fidelity, speed, and controllability, highlighting the trade-offs each design makes. Finally, I conclude with a discussion of future directions for building responsive, efficient, and generalizable neural simulators for reinforcement learning and interactive media.
-
Pong on Fire: Evaluating LLaVA's Potential as a Gaming Adversary
In this study, I take a deep dive into LLaVA, a state-of-the-art large vision-language model with robust general-purpose visual and language understanding in a chat context, and its capabilities in Pong to extrapolate the potential of using generalist LMMs as gaming agents. This first involves an investigation into LLaVA’s general-purpose capabilities, followed by an exploration of LLaVA-family models, with a focus on LLARVA, a LLaVA-inspired model for robot learning. Upon examination, it is clear that LLaVA demonstrates strong capabilities across a variety of domains, and LLaVA-inspired models similarly demonstrate enhanced capabilities when further designed around and fine-tuned for downstream tasks. I thus evaluate LLaVA on its ability to provide accurate inputs for the video game Pong based on a set of gameplay frames. I successfully used a pre-trained LLaVA model to provide action responses when prompted, and though compute resources and time constrained my ability to generate more test data, a manual evaluation of the results demonstrated that LLaVA can correctly assess where to move its paddle and, additionally, reason about why it made the action it did. While these results are not definitive, this work showcases the potential for vision-language models to be performant at tasks involving a variety of stimuli and complex control.
-
Survey on Vision-Language-Action Models for the Digital World
Vision-Language-Action (VLA) models are emerging as generalist agents that can see, read, and act within graphical user interfaces. These models bridge computer vision, natural language, and reinforcement learning to enable AI systems to perceive screen content and execute UI actions. This survey reviews the latest developments in VLA models for digital environments, from early text-based approaches to today’s fully vision-driven agents with advanced reasoning, and discusses their benchmarks, innovations, results, and open challenges.
-
Survey on Foundation Models for Embodied Decision Making
Foundation models are reshaping embodied AI – from robots that manipulate the physical world to agents that navigate virtual environments. These models leverage vast datasets and high-capacity architectures to learn generalizable policies for perception, reasoning, and action across diverse tasks and domains. This survey reviews recent advances in vision-language-action (VLA) models for embodied decision making, covering their architectures, training pipelines, datasets, benchmarks, reasoning abilities, and current limitations, with an emphasis on cross-domain generalization and strategies for grounding abstract knowledge in embodied action.
-
Vision, Language, Action - The Robotics Trifecta Leveling Up AI Control
Vision–Language–Action (VLA) models have advanced robotic control by integrating multimodal understanding with action generation. This report systematically examines five representative VLA models—RT-1, RT-2, OpenVLA, TinyVLA, and diffusion-based policies—focusing on their architectures, datasets, and inference strategies. We highlight their strengths and limitations in areas such as real-time control, generalization, and scalability. Drawing on a comparative analysis, we identify ongoing challenges including data diversity, hierarchical reasoning, and safety evaluation. Finally, we propose future directions to improve VLA models’ robustness, efficiency, and applicability in real-world robotics. Our synthesis provides a roadmap for researchers aiming to develop scalable, interpretable, and reliable embodied AI systems.
-
Toward Unified Diffusion–MLLM Systems in Biomedical AI - A Survey of Integration Strategies and Research Directions
Diffusion models and multimodal large language models (MLLMs) have become pivotal to biomedical AI, excelling at generating high-quality medical images and interpreting clinical texts, respectively. Although these approaches have complementary strengths, their integration in biomedical applications remains limited. This review systematically analyzes over 20 studies that employ diffusion and MLLM techniques for tasks such as medical image synthesis, report creation, visual question answering, and cross-modal retrieval. We highlight emerging general-domain integration frameworks that offer promising approaches toward closer integration. Based on this analysis, we propose a taxonomy of four integration approaches and evaluate existing biomedical systems across multiple key dimensions. Our findings reveal a persistent gap between current modular implementations and the unified architectures needed for seamless clinical reasoning. Finally, we outline critical obstacles related to data fragmentation, architectural design, clinical validation, and evaluation protocols; we suggest research avenues to advance integrated foundation models for end-to-end multimodal reasoning within clinical workflows.
-
Unpacking Llama 3 - Meta's Next Leap in Open-Source AI
Meta’s Llama 3 series offers the open-source world a GPT-4-level model family—raising the bar on what public AI models can do across instruction-following, coding, multilinguality, and long-context reasoning.
-
The Frontier of Image and Video Generation Models
Recent advances in deep learning have revolutionized image generation, from traditional Generative Adversarial Network (GAN)-based models to diffusion-based and more recent black-box models, producing increasingly photorealistic visuals. These advances allow for the generation of faithful, realistic, high-resolution visual content from textual instructions, and have found success in domains like digital art, advertising, and prototyping. Furthermore, to achieve utility in more dynamic storytelling and immersive content creation, researchers have studied video generation for creating coherent, temporally consistent motion content directly from data. This paves the way for new applications in filmmaking, virtual reality, marketing, and beyond. However, generating videos introduces challenges far beyond those of static image synthesis: models must not only render realistic frames, but also capture motion dynamics, long-range temporal dependencies, and scene coherence. Moreover, obtaining large-scale, high-quality video data for training remains a bottleneck, and quality control becomes more complex when both spatial and temporal dimensions are involved. Along similar lines, incorporating audio with video generation has also become an emerging research field. As research pushes the boundary of what’s possible, video generation has become one of the most exciting—and demanding—frontiers in AI-driven content creation.
-
Cosmos-Transfer1: Conditional World Generation
In this report, we present Cosmos-Transfer1, a conditional world generation model that enables spatiotemporal control for state-of-the-art world models. We first introduce several prior works, including world models and controllable generation, which laid the foundation for this work. Then, we study how Cosmos-Transfer1 integrates these two lines of work to achieve controllable world generation. Through experiments, we draw key insights about the model and its performance, with visualizations from reproducing their released model. Finally, we discuss future work and potential improvements for Cosmos-Transfer1.
-
Navigation World Models
In vision-based robotics, Navigation World Models (NWMs) offer a unified framework for perception, planning, and control by either learning policies directly or by predicting future observations to inform trajectory optimization. Their goal is to help mobile agents reach targets, explore unknown spaces, and adapt to novel environments from real-time visual inputs. We mainly discuss the following four state-of-the-art NWMs: GNM [1], which trains a cross-robot navigation policy on multi-robot datasets; ViNT [2], a Transformer-based NWM for image-goal navigation with zero-shot generalizability; NoMaD [3], which unifies goal-conditioned and undirected exploration in a single diffusion-based policy; and NWM [4], a large-scale Conditional Diffusion Transformer world model that simulates video trajectories for MPC motion planning. In this study, we discuss the core techniques they employ, and we also find common challenges: heavy reliance on large, diverse datasets; limited planning horizons; computational overhead from diffusion decoding; and the high model and compute demands of video prediction. This survey provides a detailed comparison of these strengths and weaknesses.
-
Safeguarding Stereotypes - an Exploration of Cultural and Social Bias Mitigation in Large Language Models
Large Language Models (LLMs) have become central to modern AI applications, from education to customer service and content generation. Yet with their widespread use comes growing concern about how they encode and reproduce racial, gender, religious, and cultural stereotypes. This paper explores the presence of social biases in LLMs through a review of definitional frameworks, statistical and benchmark-based evaluation techniques, and bias mitigation strategies. Key benchmarks such as StereoSet, CrowS-Pairs, and BLEnD are examined for their effectiveness in identifying and quantifying stereotypical behavior. Mitigation strategies—including data filtering, Reinforcement Learning from Human Feedback (RLHF), and Anthropic’s Constitutional AI—are evaluated across leading models like ChatGPT, Gemini, and Claude. Finally, an experiment using stereotype-sensitive prompt completions reveals significant differences in how these three models respond to socially loaded questions. The findings suggest that while technical safeguards are increasingly effective at identifying stereotypes, the definition of a “valid” response differs across models. This work provides a high-level comparative lens on how today’s most widely used LLMs handle stereotypes, both in theory and in practice.
-
The Evolving Landscape of AI in Virtual Agent Design: Architectures, Reasoning, Learning, and the Influence of Foundation Models
The design of virtual agents is undergoing a significant transformation, driven by the advanced capabilities of foundation models (FMs), particularly Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs). These FMs now serve as the cognitive core for agents, enabling sophisticated understanding, reasoning, planning, and interaction. This literature review delves into the AI design principles and architectures underpinning modern virtual agents, focusing on their cognitive architectures, reasoning and planning mechanisms, learning and adaptation capabilities, and the pivotal influence of FMs. We explore key innovations such as advanced memory systems (e.g., MemGPT, A-MEM) that address context limitations, sophisticated reasoning frameworks (e.g., ReAct, Reflexion, Tree-of-Thoughts, Case-Based Reasoning integration) that enhance decision-making, and paradigms for agent self-improvement and autonomous evolution (e.g., SICA, ADAS). The review also examines the AI design of multi-agent systems (MAS), including collaborative architectures (e.g., AutoGen, MetaGPT) and the emergence of collective intelligence. A central theme is the trend towards agents that are not only more autonomous but are also capable of evolving their own cognitive processes and architectural blueprints, largely enabled by the meta-cognitive capabilities of recent FMs. However, this rapid progress is accompanied by significant challenges in ensuring robustness, reliability, safety, and ethical alignment, particularly as agents become more integrated into real-world applications. The increasing complexity and resource demands of FM-powered agents also highlight the growing importance of system-level support, such as that proposed by AIOS, to efficiently manage cognitive resources. The field is maturing towards a more holistic approach, integrating AI-aware system design with advanced cognitive capabilities, while emphasizing the critical need for rigorous evaluation and responsible development to harness the full potential of AI-driven virtual agents.
-
Balancing Reasoning and Efficiency in LLMs - Insights from Qwen3-8B
Recent advances in large language models (LLMs) have led to remarkable improvements in reasoning abilities. However, excessive reasoning can incur high computational costs and sometimes harm accuracy on simple tasks. In this report, we trace the evolution of LLM reasoning techniques from early prompting to built-in reasoning modes in state-of-the-art models like Qwen3. We conduct experiments on Qwen3-8B to evaluate the impact of different reasoning strategies, including truncated reasoning, self-consistency voting, and no-thinking prompts. Our findings highlight the trade-offs between reasoning depth and efficiency, and suggest that dynamic reasoning control is essential to optimize LLM performance across tasks.
-
Understanding GPT-4o: Capabilities, Trade-offs, and Real-World Utility
This blog aims to provide a comprehensive analysis of GPT-4o by examining both its technical architecture and real-world usability. It explores GPT-4o’s multimodal capabilities—spanning text, image, and audio processing—and situates them within the broader evolution of reasoning-enhanced and multimodal language models. Through a combination of literature review, system-level speculation, and a user survey of early adopters, the blog investigates how GPT-4o is actually used in practice, what trade-offs it introduces in terms of speed, reasoning fidelity, and interactivity, and whether its multimodal design represents a genuine leap or a transitional phase in LLM development. Ultimately, it seeks to bridge the gap between technical potential and lived user experience, offering insights into the future trajectory of multimodal AI systems.
-
Recent Developments in GUI Web Agents
Web agents are a new class of multi-modal agents that interact with the web using text, images, and other modalities: they can navigate pages, search for information, and perform tasks on a user's behalf. Since 2024, we have seen a surge in the development of web agents, with many new agents being developed and released. In this blog post, we survey recent developments in the field of web agents, particularly GUI agents, and provide a comprehensive overview of the state of the art. We review core benchmarks (WebArena, VisualWebArena, Mind2Web, and AssistantBench) that have enabled systematic measurement of these capabilities. We discuss the backbone vision-language models that power these agents, as well as recent advances in reasoning.
-
Discrimination and Fairness in Language Model Decisions
In this study, we trace the evolution of research on discrimination in large language model (LLM) decisions, highlighting a shift from representational bias in early word embeddings to allocative harms in modern LLM outputs. Foundational studies such as Bolukbasi et al. (2016) [4] and Caliskan et al. (2017) [5] revealed gender and racial associations in pretrained models. As LLMs entered high-stakes decision-making contexts, researchers like Sheng et al. (2019) [6] and Zhao et al. (2021) [7] explored bias in prompt-based outputs and counterfactual reasoning. Anthropic’s paper in 2023 [3] marked a turning point by introducing a large-scale, mixed-effects framework to evaluate demographic discrimination across realistic decision prompts, revealing systematic disparities tied to race, gender, and age. Recent work builds on this with tools like FairPair (causal perturbation analysis) [11], BiasAlert (knowledge-aligned bias detection) [12], and CalibraEval (fairness in model evaluations) [13], while multilingual efforts like SHADES [14], CultureLLM [15], and MAPS [16] broaden the scope to culturally and linguistically diverse contexts. Together, these contributions signal a growing commitment to auditing and mitigating discrimination in LLMs from both technical and ethical perspectives.
-
Exploring Various State-of-the-Art Techniques in Diffusion Models
In this study, we discuss and analyze various state-of-the-art works on diffusion models. From our research, we can see that diffusion models are continuously being improved. In addition, we found that previous works inspire new ones, pushing the bounds of diffusion models even further. With this blog article, we hope to inspire interest and enthusiasm in the blooming field of diffusion models.
-
Survey on Foundation Models for Robotics
Foundation models (FMs) have demonstrated remarkable capabilities in natural language processing (NLP) and computer vision (CV). Their success stems from pre-training on massive, diverse datasets using self-supervised learning, enabling generalization across diverse tasks as well as efficient fine-tuning on downstream tasks. This paradigm shift, largely driven by advances in large language models and vision transformers, holds great potential for robotics. While conventional robotic systems often rely on task-specific models requiring extensive, domain-specific data and expert engineering, foundation models provide the opportunity to design robots with greater autonomy, adaptability, and generalized intelligence. In this study, we aim to provide a comprehensive survey of foundation models in robotics, covering the main challenges of foundation models for robotics and their main use cases.
-
On the Evolution of Reasoning in Large Language Models
Large language models (LLMs) have demonstrated remarkable performance across a wide range of tasks, increasingly attributed to their ability to perform multi-step reasoning. This paper surveys the evolution of reasoning in LLMs, organizing the literature into four main categories: internal representations, structured prompting, reinforcement learning, and supervised fine-tuning. We explore how reasoning can emerge from scale, be encouraged through prompt design, be enhanced through interaction and reward signals, and be explicitly taught through labeled reasoning traces. We discuss the advantages, limitations, and trade-offs of each method and analyze how these strategies influence model performance, generalization, interpretability, and scalability. Together, these advances reflect a growing understanding of how to build LLMs that not only generate fluent text but also reason through complex problems in a structured and effective manner.