• Representation and Prediction for Generalizable Robot Control

    Vision-based learning has become a central paradigm for enabling robots to operate in complex, unstructured environments. Rather than relying on hand-engineered perception pipelines or task-specific supervision, recent work increasingly leverages large-scale video data to learn transferable visual representations and predictive models for control. This survey reviews a sequence of recent approaches that illustrate this progression: learning control directly from video demonstrations, pretraining universal visual representations, incorporating predictive dynamics through visual point tracking, and augmenting learning with synthetic visual data. Together, these works highlight how representation learning and prediction from video are enabling increasingly generalizable robot manipulation capabilities.

  • Efficient Super-Resolution: Bridging Quality and Computation

    Super-resolution has long faced a fundamental tension: the highest-quality models require billions of operations, while edge devices demand sub-100ms inference. This article examines three recent methods—SPAN, EFDN, and DSCLoRa—that challenge this tradeoff through architectural innovation, training-time tricks, and efficient adaptation. We’ll see how rethinking the upsampling operation, leveraging structural reparameterization, and applying low-rank decomposition can each dramatically improve efficiency without sacrificing output quality.
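
    As a concrete illustration of the structural reparameterization idea mentioned above, the sketch below (a simplified toy example, not EFDN's actual block) trains a 3x3 convolution alongside a parallel 1x1 branch and then folds both into a single 3x3 convolution for inference, so the deployed model pays for only one convolution per block.

    ```python
    # Hedged sketch of structural reparameterization: two training-time branches
    # (a 3x3 conv and a 1x1 conv) are folded into one 3x3 conv for inference.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class RepBlock(nn.Module):
        def __init__(self, channels):
            super().__init__()
            self.conv3 = nn.Conv2d(channels, channels, 3, padding=1)
            self.conv1 = nn.Conv2d(channels, channels, 1)

        def forward(self, x):                 # training time: two parallel branches
            return self.conv3(x) + self.conv1(x)

        def fuse(self):                       # inference time: a single 3x3 conv
            fused = nn.Conv2d(self.conv3.in_channels, self.conv3.out_channels, 3, padding=1)
            w1 = F.pad(self.conv1.weight, [1, 1, 1, 1])  # embed the 1x1 kernel at the center of a 3x3 kernel
            fused.weight.data = self.conv3.weight.data + w1
            fused.bias.data = self.conv3.bias.data + self.conv1.bias.data
            return fused

    block = RepBlock(16)
    x = torch.randn(1, 16, 32, 32)
    assert torch.allclose(block(x), block.fuse()(x), atol=1e-5)  # same output, one conv
    ```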

  • Membership Inference Attacks against Vision Deep Learning Models

    Membership Inference Attacks (MIAs) are privacy attacks that aim to predict whether or not a given data point was used to train a machine learning model. This blog looks at three different MIA methods applied against three different types of models.

  • Streetview Semantic Segmentation

    [Project Track: Project 8] Semantic segmentation models achieve high overall accuracy on urban datasets, yet systematically fail on thin structures and object boundaries critical for safety-relevant perception. This project presents a diagnostic-driven framework for understanding segmentation failures on Cityscapes and introduces a SAM3-guided boundary supervision method that injects geometric priors into SegFormer. By combining cross-model difficulty analysis with geometry-aware auxiliary training, we demonstrate targeted improvements on thin and boundary-sensitive classes without increasing inference cost.
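
    To make the boundary-supervision idea above concrete, here is a minimal, hedged sketch: an auxiliary boundary target is derived from segmentation masks (ground-truth or SAM3-produced) via a morphological gradient, and a small boundary head is supervised against it with BCE. Names and details are illustrative, not the project's actual code.

    ```python
    # Hedged sketch of geometry-aware auxiliary supervision: build a soft boundary
    # target from a class-label map and apply BCE to an auxiliary boundary head.
    import torch
    import torch.nn.functional as F

    def boundary_map(labels, kernel=3):
        """labels: (B, H, W) integer class map (ignore labels remapped beforehand)."""
        onehot = F.one_hot(labels.clamp(min=0)).permute(0, 3, 1, 2).float()
        pad = kernel // 2
        dilated = F.max_pool2d(onehot, kernel, stride=1, padding=pad)
        eroded = -F.max_pool2d(-onehot, kernel, stride=1, padding=pad)
        return (dilated - eroded).max(dim=1, keepdim=True).values  # ~1 near class edges

    def boundary_aux_loss(boundary_logits, labels, weight=1.0):
        """boundary_logits: (B, 1, H, W) from an auxiliary head on the segmentation model."""
        return weight * F.binary_cross_entropy_with_logits(boundary_logits, boundary_map(labels))
    ```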

  • Computer Vision for Medium/Heavy-Duty Vehicle Detection via Satellite Imagery

    We fine-tuned a ResNet50-FPN detector on the DOTA aerial dataset and applied it to Google satellite tiles to locate medium/heavy-duty truck parking clusters for EV charging planning.

  • XAI in Facial Recognition

    Facial Recognition (FR) systems are increasingly used in high-stakes environments, but their decision-making processes remain opaque, raising concerns regarding trust, bias, and robustness. Traditional methods such as occlusion sensitivity or saliency maps (e.g., Grad-CAM) often fail to capture the causal mechanisms driving verification decisions or to diagnose reliance on shortcuts. This report analyzes three modern paradigms that shift Explainable AI (XAI) from passive visualization to active, feature-level interrogation. We examine FastDiME [5], which utilizes generative diffusion models to create counterfactuals for detecting shortcut learning; Feature Guided Gradient Backpropagation (FGGB) [3], which mitigates vanishing gradients to produce similarity and dissimilarity maps; and Frequency Domain Explainability [2], which introduces Frequency Heat Plots (FHPs) to diagnose biases in CNNs. By synthesizing these approaches, we examine how modern XAI tools can assess model reliance on noise versus structural identity, with the goal of offering a pathway toward more robust and transparent biometric systems.

  • From Classifiers to Assistants: The Evolution of Visual Question Answering

    Visual Question Answering (VQA) represents a fundamental challenge in artificial intelligence: the ability to understand both visual content and natural language, then reason across these modalities to produce meaningful answers. This report traces the evolution of VQA from its formal definition as a classification task in 2015, through the era of sophisticated attention mechanisms, to its modern integration into Large Multimodal Models. We analyze three papers that define this trajectory, revealing how VQA transformed from a specialized benchmark into a core capability of general-purpose AI assistants.

  • Semantic Segmentation of Coral Reefs: Evaluating SegFormer, DeepLab, and SAM3

    [Project Track: Project 1] Deep learning has become a standard tool for dense prediction in environmental monitoring. In this project, we focus on semantic segmentation of coral reef imagery using the CoralScapes dataset. Starting from a pretrained SegFormer-B5 model, we design a coral-specific training pipeline that combines tiling, augmentations, and a CE+Dice loss. This yields a modest but consistent improvement in mIoU and qualitative boundary sharpness over the original checkpoint. We also run exploratory experiments with DeepLabv3 and SAM3, and discuss practical limitations due to the absence of coral-specific pretraining and limited compute.
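
    For reference, a minimal sketch of a CE+Dice objective of the kind described above (the project's exact class handling, weighting, and smoothing may differ):

    ```python
    # Hedged sketch of a combined cross-entropy + soft Dice loss for segmentation.
    import torch
    import torch.nn.functional as F

    def ce_dice_loss(logits, labels, dice_weight=1.0, eps=1e-6):
        """logits: (B, C, H, W); labels: (B, H, W) integer class map."""
        ce = F.cross_entropy(logits, labels)
        probs = F.softmax(logits, dim=1)
        onehot = F.one_hot(labels, num_classes=logits.shape[1]).permute(0, 3, 1, 2).float()
        inter = (probs * onehot).sum(dim=(0, 2, 3))          # per-class overlap
        union = probs.sum(dim=(0, 2, 3)) + onehot.sum(dim=(0, 2, 3))
        dice = 1.0 - ((2 * inter + eps) / (union + eps)).mean()
        return ce + dice_weight * dice
    ```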

  • Evolution of Human Pose Estimation - From Deep Regression to YOLO-Pose

    Human pose estimation is a fundamental problem in computer vision that focuses on localizing and identifying key body joints, such as elbows, knees, wrists, and shoulders, of a person in images or video. By predicting these keypoints, or predefined body landmarks, models can infer a structured, skeleton-like representation of the human body, which enables further exploration into understanding human posture, motion, and interactions with the environment. As such, this field is a crucial area of research used in various real-world applications like action recognition and healthcare. In this project, we study a variety of deep learning approaches to 2D human pose estimation, beginning with early end-to-end regression models and progressing towards more structured and context-aware architectures. In particular, we delve deeper into how modeling choices around global context, spatial precision, and body structure influence pose estimation performance.

  • Food Detection

    Food detection is a subset of image classification concerned with recognizing specific foods; as such, it requires learning fine-grained details to differentiate similar dishes. In this paper, we investigate three models, DeepFood, WISeR, and Noisy-ViT, each built upon state-of-the-art (at the time) image classification models and applied to food detection, along with Food-101, a dataset built for this task. On this dataset, DeepFood achieved 77.40% accuracy, WISeR achieved 90.27% accuracy, and Noisy-ViT achieved 99.50% accuracy.

  • A Comparison of Recent Deepfake Video Detection Methods

    Rapid recent improvements in generative AI models have created an era of hyper-realistic generated images and videos, rendering traditional, artifact-based detection methods obsolete. As synthetic media becomes multimodal and increasingly realistic, new methods are needed to distinguish generated videos from real media. This report examines three new methodologies designed to counter these advanced threats: Next-Frame Feature Prediction, which leverages temporal anomalies to identify manipulation; FPN-Transformers, which utilize feature pyramids for precise temporal localization; and RL-Based Adaptive Data Augmentation, which employs reinforcement learning to improve model generalization against unseen forgery techniques.

  • Comparison of Approaches to Human Pose Estimation

    This paper presents a comparative analysis of prominent deep learning approaches for 2D human pose estimation, the task of locating key anatomical joints in images and videos. We examine the core methodologies, architectures, and performance metrics of three seminal models: the bottom-up OpenPose (2019), the top-down AlphaPose (2022), and the top-down ViTPose (2022), which leverages a Vision Transformer backbone. We then introduce Sapiens (2024), a recent foundation model that pushes state-of-the-art accuracy by adopting a massive MAE-pretrained transformer, high-resolution inputs, and significantly denser whole-body keypoint annotations. The comparison highlights the shift from complex, hand-engineered systems like OpenPose, to efficient refinement-based methods like AlphaPose, and finally to powerful yet simple transformer models like ViTPose and Sapiens.

  • Global Human Mesh Recovery

    In this blog post, we introduce and discuss recent advancements in global human mesh recovery, a challenging computer vision problem involving the extraction of human meshes in a global coordinate system from videos where the motion of the camera is unknown.

  • From Labeling to Prompting: The Paradigm Shift in Image Segmentation

    The evolution from Mask R-CNN to SAM represents a paradigm shift in computer vision segmentation, moving from supervised specialists constrained by fixed vocabularies to promptable generalists that operate class-agnostically. We examine the technical innovations that distinguish these approaches, including SAM’s decoupling of spatial localization from semantic classification and its ambiguity-aware prediction mechanism, alongside future directions in image segmentation.

  • Visuomotor Policy

    Visuomotor Policy Learning studies how an agent can map high-dimensional visual observations (e.g., camera images) to motor commands in order to solve sequential decision-making tasks. In this project, we focus on settings motivated by autonomous driving and robotic manipulation, and survey modern learning-based approaches—primarily imitation learning (IL) and reinforcement learning (RL)—with an emphasis on methods that improve sample efficiency through policy/representation pretraining.

  • Unsupervised Domain Adaptation with GTA -> Cityscapes

  • Instance Segmentation Paper Synthesis: Evolution and New Frontiers

    Instance segmentation is a fundamental task in computer vision that detects and separates individual object instances at the pixel level. Several recent developments in computer vision have led to improvements in instance segmentation performance and new applications of instance segmentation. In this paper report, we discuss and analyze the Segment Anything Model, Mask2Former, and Relation3D for image and point cloud instance segmentation.

  • Machine Learning for Studio Photo Retouching: Object Removal, Background Inpainting, and Lighting/Shadow Preservation

    Studio photography aims to produce aesthetically polished images. However, even in controlled environments, unwanted objects such as chairs, props, and wires often appear in the scene. Further, lighting is altered tremendously by the addition or removal of these objects. Traditionally, such objects have been removed manually, requiring careful reconstruction of the background and its lighting conditions. This paper looks at modern models aimed at making this process easier.

  • Advances in Medical Image Segmentation

    This paper reviews advances in medical image segmentation technology over the past few years. With the increasing popularity of deep learning, there has been growing innovation and application of these techniques in the medical space. Through an analysis of these approaches, we can see a clear progression in innovation and the breadth of their applications.

  • Center Pillars - Anchor-Free Object Detection in 3D

    [Project Track: Project 8] This project implements a LiDAR-based 3D object detection pipeline that uses PointPillars to encode raw point clouds into a bird’s-eye-view (BEV) pseudo-image, enabling efficient convolutional feature extraction. On top of this representation, the CenterPoint framework decodes BEV features by predicting object centers and regressing bounding box attributes in an anchor-free manner. This design removes the need for predefined anchors while maintaining accurate spatial localization and computational efficiency.
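
    The step that turns encoded pillars into a BEV pseudo-image can be sketched in a few lines; the snippet below is an illustrative simplification of the pillar-scatter operation, not the project's actual implementation.

    ```python
    # Hedged sketch: scatter per-pillar feature vectors back onto a BEV grid,
    # producing the pseudo-image consumed by the 2D backbone and CenterPoint heads.
    import torch

    def scatter_pillars(pillar_feats, coords, H, W):
        """pillar_feats: (P, C) features from the pillar encoder;
           coords: (P, 2) integer (row, col) BEV cell indices of each pillar."""
        P, C = pillar_feats.shape
        canvas = pillar_feats.new_zeros(C, H * W)
        flat_idx = coords[:, 0] * W + coords[:, 1]
        canvas[:, flat_idx] = pillar_feats.t()     # empty cells stay zero
        return canvas.view(C, H, W)                # (C, H, W) BEV pseudo-image
    ```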

  • Introduction to Camera Pose Estimation

    Camera pose estimation is an important component of computer vision used in robotics, AR/VR, 3D reconstruction, and more. It involves determining the camera’s 3D position and orientation, also known as its “pose”, in various environments and scenes. PoseNet, MeNet, and JOG3R are deep learning techniques used to accomplish camera pose estimation; there are also sensor-based tracking approaches such as LED markers and particle filters. We focus on geometric methods, specifically Structure-from-Motion (SfM).

  • Vision Language Action Models for Robotics

    At its core, computer vision for robotics uses deep learning to allow robots to perceive, understand, and interact with the physical world. This report explores Vision Language Action (VLA) models that combine visual input, language instruction, and robot actions in end-to-end architectures.

  • MLP-based Architectures for Computer Vision

    In recent years, the field of computer vision has been dominated by attention-based and convolution-based architectures. As a result, MLP-based architectures were largely overshadowed due to their lack of inherent inductive bias. This paper reviews and compares three MLP-based architectures: MLP-Mixer, ResMLP, and Caterpillar. We explore their architectural designs, inductive biases, accuracy, and computational efficiency in comparison to attention-based and convolutional architectures. Across standard ImageNet benchmarks, these models achieved accuracies close to state-of-the-art models while outperforming them in computational efficiency. Although each model introduces locality through different mechanisms, they all rely primarily on dense matrix multiplications, which can be easily parallelized on modern GPU and TPU hardware. These findings demonstrate that MLP-based architectures can achieve high accuracy while remaining computationally efficient, making them a promising area for future computer vision research.
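
    To ground the discussion, here is a minimal sketch of a Mixer-style block in the spirit of MLP-Mixer: one MLP mixes information across tokens (patches) and another across channels, and both reduce to dense matrix multiplications. Hidden sizes are illustrative.

    ```python
    # Hedged sketch of a Mixer-style block: token-mixing MLP + channel-mixing MLP.
    import torch
    import torch.nn as nn

    class MixerBlock(nn.Module):
        def __init__(self, num_tokens, dim, token_hidden=256, channel_hidden=1024):
            super().__init__()
            self.norm1 = nn.LayerNorm(dim)
            self.token_mlp = nn.Sequential(
                nn.Linear(num_tokens, token_hidden), nn.GELU(), nn.Linear(token_hidden, num_tokens))
            self.norm2 = nn.LayerNorm(dim)
            self.channel_mlp = nn.Sequential(
                nn.Linear(dim, channel_hidden), nn.GELU(), nn.Linear(channel_hidden, dim))

        def forward(self, x):                 # x: (B, num_tokens, dim)
            # mix across tokens (patches), then across channels, each with a residual
            x = x + self.token_mlp(self.norm1(x).transpose(1, 2)).transpose(1, 2)
            return x + self.channel_mlp(self.norm2(x))
    ```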

  • Facial Emotion Recognition

    This post details the current landscape of the Facial Emotion Recognition (FER) field. We discuss why this field is important and the current challenges it faces. We then discuss two datasets and two deep learning models for FER, one building off of ResNet and another extending ConvNeXt. Finally, we compare the two approaches and summarize our view on the research area.

  • Deep Learning for Image Super-Resolution

    Image super-resolution, the task of recovering a high-resolution, clean image from a low-resolution, degraded one, is a natural yet ill-posed computer vision problem. In this post, we survey three recent methods for image super-resolution, each highly distinct with its own advantages and disadvantages. Finally, we conduct an experiment in modifying the structure of one of these methods.

  • Streetview Semantics

    [Project Track] Street-level semantic segmentation is a core capability for autonomous driving systems, yet performance is often dominated by severe class imbalance, where large categories such as roads and skies overwhelm safety-critical but rare classes like bikes, motorcycles, and poles. Using the BDD100K dataset, this study systematically examines how architectural choices, loss design, and training strategies affect segmentation quality beyond misleading pixel-level accuracy. Starting from a DeepLabV3-ResNet50 baseline, we demonstrate that high pixel accuracy (~94%) can coincide with extremely poor mIoU (~4%) under imbalance. We then introduce class-weighted and combined Dice-Cross-Entropy/Focal losses, auxiliary supervision, differential learning rates, and gradient clipping, achieving a 10x improvement in mIoU. Next, we propose a targeted optimization strategy that remaps the task to six safety-critical small classes and leverages higher resolution, aggressive augmentation, and boosted class weights for thin and small objects. This approach significantly improves IoU for bicycles, motorcycles, and poles, highlighting practical trade-offs between accuracy, resolution, and computational cost; however, the higher resolution substantially increased training time per epoch, limiting how long we could train. Our last contribution is a boundary-aware auxiliary supervision strategy that explicitly promotes boundary preservation for thin and small objects while maintaining architectural simplicity. Overall, the work provides an empirically grounded blueprint for addressing class imbalance and small-object segmentation in urban scene understanding.
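
    As one concrete example of the imbalance-oriented losses discussed above, here is a hedged sketch of a class-weighted focal loss for segmentation; the gamma value, class weights, and the way it is combined with Dice in the project are illustrative assumptions.

    ```python
    # Hedged sketch of a class-weighted focal loss for semantic segmentation.
    import torch
    import torch.nn.functional as F

    def focal_loss(logits, labels, class_weights=None, gamma=2.0, ignore_index=255):
        """logits: (B, C, H, W); labels: (B, H, W) with ignore_index for unlabeled pixels."""
        ce = F.cross_entropy(logits, labels, weight=class_weights,
                             ignore_index=ignore_index, reduction="none")
        pt = torch.exp(-ce)                 # probability of the true class (approximate when class weights are used)
        loss = ((1 - pt) ** gamma) * ce     # down-weight easy, well-classified pixels
        return loss[labels != ignore_index].mean()
    ```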

  • Unsupervised Domain Adaptation for Semantic Segmentation

    Data annotation is widely considered a major bottleneck in semantic segmentation. It leads to domain gaps between labeled source data and unlabeled target data and stresses the need for unsupervised domain adaptation (UDA) methods. This post covers DAFormer, a recent transformer-based UDA method which significantly improved state-of-the-art performance, as well as two more recent performance-improving approaches to UDA (HRDA and MIC).

  • Scale Up VLM for Embodied Scene Understanding

    [Project Track: Self-Propose-Topic] Spatial reasoning and object-centric perception are central to deploying Vision–Language Models (VLMs) in embodied systems, where agents must interpret fine-grained scene structure to support action and interaction. However, current VLMs continue to struggle with fine-grained visual reasoning, particularly in object-centric tasks that demand dense visual perception. We introduce VLIMA, a guided fine-tuning framework that enhances VLMs by incorporating auxiliary visual supervision from external self-supervised vision encoders. Specifically, VLIMA adds an auxiliary alignment loss that encourages intermediate VLM representations to match features from encoders such as DINOv2, which exhibit emergent object-centricity and strong spatial correspondence. By transferring these spatially precise and object-aware inductive biases into the VLM representation space, VLIMA improves object-centric embodied scene understanding without calling external tools or modifying the VLM’s core architecture.
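
    A minimal sketch of the auxiliary alignment idea, assuming a simple linear projection and a cosine objective against frozen DINOv2 patch features; module names and shapes are illustrative, not VLIMA's actual implementation.

    ```python
    # Hedged sketch: align projected intermediate VLM patch features with frozen
    # DINOv2 patch features via a mean cosine-distance auxiliary loss.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class AlignmentHead(nn.Module):
        def __init__(self, vlm_dim, dino_dim):
            super().__init__()
            self.proj = nn.Linear(vlm_dim, dino_dim)

        def forward(self, vlm_patch_feats, dino_patch_feats):
            """Both inputs: (B, num_patches, dim); DINOv2 features are detached (frozen teacher)."""
            pred = F.normalize(self.proj(vlm_patch_feats), dim=-1)
            target = F.normalize(dino_patch_feats.detach(), dim=-1)
            return (1.0 - (pred * target).sum(dim=-1)).mean()   # mean cosine distance

    # total_loss = lm_loss + lambda_align * align_head(vlm_feats, dino_feats)
    ```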

  • From Paris to Seychelles - Deep Learning Techniques for Global Image Geolocation

    Image geolocation—the task of predicting geographic coordinates from visual content alone—has evolved significantly with advances in deep learning. This survey examines four landmark approaches that have shaped the field. We begin with PlaNet (2016), which pioneered the geocell classification framework using CNNs and adaptive spatial partitioning based on photo density. We then explore TransLocator (2022), which leverages Vision Transformers and semantic segmentation maps to capture global context and improve robustness across varying conditions. Next, we analyze PIGEON (2023), which introduces semantic geocells respecting administrative boundaries, Haversine smoothing loss to penalize geographically distant predictions less harshly, and CLIP-based pre-training to achieve human-competitive performance on GeoGuessr. Finally, we examine ETHAN (2024), a prompting framework that applies chain-of-thought reasoning to large vision-language models, enabling interpretable geographic deduction without task-specific training. Through this progression, we trace the architectural evolution from convolutional networks to transformers to foundation models, highlighting key innovations in spatial partitioning strategies, loss function design, and the integration of semantic reasoning for worldwide image localization.
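
    To illustrate the Haversine smoothing idea, the sketch below builds distance-decayed soft geocell labels from great-circle distances and uses them in a soft cross-entropy; the temperature and the softmax normalization are illustrative choices rather than PIGEON's exact formulation.

    ```python
    # Hedged sketch of Haversine-smoothed geocell targets: nearby cells receive
    # more label mass, so geographically close mistakes are penalized less.
    import torch

    EARTH_RADIUS_KM = 6371.0

    def haversine_km(lat1, lon1, lat2, lon2):
        """Great-circle distance; all inputs in radians, broadcastable tensors."""
        dlat, dlon = lat2 - lat1, lon2 - lon1
        a = torch.sin(dlat / 2) ** 2 + torch.cos(lat1) * torch.cos(lat2) * torch.sin(dlon / 2) ** 2
        return 2 * EARTH_RADIUS_KM * torch.asin(torch.sqrt(a))

    def smoothed_geocell_loss(logits, cell_latlon, true_latlon, tau=75.0):
        """logits: (B, G); cell_latlon: (G, 2) geocell centroids; true_latlon: (B, 2); radians."""
        d = haversine_km(true_latlon[:, None, 0], true_latlon[:, None, 1],
                         cell_latlon[None, :, 0], cell_latlon[None, :, 1])   # (B, G) distances
        target = torch.softmax(-d / tau, dim=-1)                             # distance-decayed soft labels
        return -(target * torch.log_softmax(logits, dim=-1)).sum(-1).mean()
    ```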

  • Street-view Semantic Segmentation

    [Project Track: Project 8] In this project, we develop models that apply semantic segmentation to fine-grained urban structures, building on a pretrained SegFormer model. We explore three approaches to enhance model performance and analyze their results. You can find the code here

  • Human Pose Estimation

    In this paper, I will be discussing the fundamentals and workings of deep learning for human pose estimation. There has been a great deal of research and many recent breakthroughs in this area, and I hope this deep dive brings some clarity and new information about how it all works!

  • Novel View Synthesis

    In computer graphics and vision, novel view synthesis is the task of generating images of a scene from new viewpoints, given a set of images of the same scene taken from different perspectives. In this report, we introduce three important papers attempting to solve this task.

  • Exploring Modern Novel View Generation Methods

    [Project Track] Historically, novel view synthesis (NVS) has relied on volumetric radiance field approaches, such as NeRF. While effective, these methods are often computationally expensive to train and prohibitively slow to render for real-time applications. To address these limitations, researchers have developed new architectures that reduce computational costs while maintaining or exceeding visual fidelity. This report examines two distinct solutions to this challenge: 3D Gaussian Splatting (3DGS) and the Large View Synthesis Model (LVSM).

  • Self-Supervised Learning

    Self-supervised learning is a way for models to learn useful features without relying on labeled data: the model creates its own learning targets from the structure of the data. This approach has become popular in computer vision and many other fields because it makes use of large amounts of unlabeled data and can produce strong representations for downstream tasks. In this paper, we introduce the basic ideas behind self-supervised learning and discuss several common methods and why they are effective.

  • Camera Pose Estimation

    Camera pose estimation is a fundamental computer vision task that aims to determine the position and orientation of a camera relative to a scene using image or video data. Our project evaluates three camera pose estimation methods: COLMAP, VGGSfM, and depth-based pose estimation with ICP.

  • Human Pose Estimation in Robotics Simulation

    [Project Track] Human pose estimation is the task of detecting and localizing key human joints from 2D images or video. These joints are typically represented as keypoints connected by a skeletal structure, forming a pose representation. Pose estimation has found applications in areas such as physiotherapy, animation, sports analytics, and robotics.

  • Optical Flow

    Optical flow is the problem of estimating motion in images, with real-world applications in fields such as autonomous driving. As such, there have been many different approaches to this problem. We compare three of these approaches: FlowNet, RAFT, and UFlow. We explore each of these models in depth before moving on to a comparative analysis and discussion of the three. This analysis highlights the key differences between each approach, when they are most applicable, and how each handles common problems in the optical flow field, such as the lack of available training data.

  • Diffusion Models for Image Editing: A Study of SDEdit, Prompt-to-Prompt, and InstructPix2Pix

    Image editing stands at the heart of computer vision applications and enables object attribute modifications, style changes, or transformations in general appearance while retaining much of the structural information about an image. Classic deep learning methods, in particular GAN-based approaches, struggle with this balancing act: they either introduce artifacts, distort salient features, or fail to preserve the original content when performing edits. Recently, diffusion models have emerged as a powerful alternative. Instead of generating an image with a single forward pass, diffusion models progressively remove noise over a series of denoising steps. This iterative structure makes them particularly suitable for editing: partial noise levels can preserve the content, cross-attention layers control which regions change, and textual instructions guide the model to make targeted changes. For this reason, diffusion-based editing methods are among the most flexible and reliable tools in modern image manipulation. In this report, I investigate three diffusion-based image editing methods (SDEdit, Prompt-to-Prompt, and InstructPix2Pix), each representing a different stage in the evolution of editing techniques. SDEdit showcases how diffusion is able to maintain structure during edits, Prompt-to-Prompt introduces fine-grained control through prompt manipulation, and InstructPix2Pix allows for natural-language-driven edits without large-scale model retraining. Together, these works highlight the versatility of diffusion models and illustrate how iterative denoising can support a wide range of editing tasks.
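
    A minimal sketch of the SDEdit-style mechanism described above, assuming a hypothetical epsilon-prediction model `denoiser` and a precomputed noise schedule: the input is noised partway along the forward process, then a simple deterministic reverse pass is run from there, so coarse structure is preserved while fine details are regenerated.

    ```python
    # Hedged sketch of SDEdit-style editing: partial forward noising followed by a
    # DDIM-like deterministic reverse pass. `denoiser(x, t)` is a hypothetical
    # epsilon-prediction model, not a real library API.
    import torch

    def sdedit(image, denoiser, alphas_cumprod, strength=0.6):
        """image: (B, C, H, W) in [-1, 1]; alphas_cumprod: (T,) cumulative noise schedule."""
        T = alphas_cumprod.shape[0]
        t0 = int(strength * (T - 1))                        # how far to noise the input ("edit strength")
        a = alphas_cumprod[t0]
        x = a.sqrt() * image + (1 - a).sqrt() * torch.randn_like(image)
        for t in range(t0, -1, -1):                         # simple deterministic reverse pass
            a_t = alphas_cumprod[t]
            a_prev = alphas_cumprod[t - 1] if t > 0 else alphas_cumprod.new_tensor(1.0)
            eps = denoiser(x, t)                            # predicted noise at step t
            x0 = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()  # current estimate of the clean image
            x = a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps
        return x
    ```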

  • Medical Image Segmentation

    Medical image segmentation is an important component of the medical field, supporting patient diagnosis, treatment planning, and disease monitoring. Segmentation in machine learning is the process of partitioning data into meaningful groups for annotation and deeper analysis. Medical image segmentation, a combination of the two, has grown in importance, but it remains a challenging problem due to the large variability in modalities, anatomical structures, and use cases. This report examines three recent approaches, MedSAM, UniverSeg, and GenSeg, that aim to address these limitations by improving generalization and adaptability.

  • Camera Pose Estimation: Recent Developments and Robustness

    3D camera pose estimation has become a widely used tool for recovering camera motion and scene structure from video, with applications spanning robotics, AR/VR, and 3D mapping. In this project, we compare a classic SfM baseline (COLMAP) with modern learning-based pipelines (VGGSfM and ViPE) across multiple real-world sequences, evaluating both trajectory quality and stability using qualitative visualizations and simple translation-variability metrics.

  • Street-View Semantic Segmentation

    [Project Track: Street-View Semantic Segmentation] In this project, we implemented and evaluated semantic segmentation models on the Cityscapes dataset to enhance pixel-level understanding of urban scenes for autonomous driving; we also built a car to evaluate our models in real-world scenarios.
