-
Medical Image Segmentation
Medical image segmentation automatically divides a medical image into multiple distinct regions of interest corresponding to different organs, tissues, or pathological areas. This technique allows healthcare professionals to interpret and analyze medical images like X-rays, ultrasounds, and CT scans much more efficiently than labeling areas by hand.
-
Audio-Visual Sentiment Analysis
Our group explored the cutting-edge field of Audio-Visual Sentiment Analysis (AVSA) and how multimodal data (comprising visual, audio, and textual inputs) can be combined to better understand human emotion.
-
Peek-A-Boo, Occlusion-Aware Visual Perception through Active Exploration
In this study, we present a framework for enabling robots to locate and focus on objects that are partially or fully occluded within their environment. We split robotic tasks into two stages: Localization, where the robot searches for objects of interest, and Task Completion, where the robot completes the task after finding the object. We propose Peekaboo, a solution to the Localization stage that finds partially or even fully occluded objects. We train a reinforcement learning algorithm that teaches the robot to actively reposition its camera to optimize visibility of occluded objects. The key features include engineering a reward function that incentivizes effective object localization and setting up a comprehensive training environment. We develop a randomized simulation environment so the agent learns to localize from numerous initial viewpoints. Our approach also includes a vision encoder for processing visual input, which allows the robot to interpret and respond to objects and occlusions. We design metrics to quantify the model's performance, demonstrating its capability to handle occlusions without human intervention. The results of this work showcase the potential for robotic systems to actively improve their perception in cluttered or obstructed environments.
-
Dynamic Hand Gesture Classification
In this report, we review three deep learning models in the domain of static and dynamic hand gesture classification.
-
Occlusion Removal Using DesnowNet
Occlusion removal in computer vision restores images by addressing obstructions, including removing weather conditions such as rain, fog, snow, and haze. These weather-induced occlusions hinder object detection and thus can impact the performance of computer vision models. Deep learning-based techniques such as convolutional neural networks (CNNs) and generative adversarial networks (GANs) can be used to remove weather artifacts from image content. These models exploit spatial and temporal features, preserving key features of images while removing occlusions from the background. Applications range from autonomous driving to surveillance, where improved image clarity under challenging weather conditions enhances safety and accuracy. In our project, we attempt to create an implementation of DesnowNet. View our code here.
-
Image Super Resolution
This post compares and contrasts three methods of performing image super resolution: Enhanced Deep Residual Networks (EDSR), Residual Channel Attention Networks (RCAN), and Residual Dense Networks (RDN). In addition, we experiment with fine-tuning one of these networks on the MiniPlaces dataset.
-
Text-to-Image Generation
Text-to-image generation is the task of generating images from input text: a textual prompt is passed in, and the model outputs an image based on that prompt. We will be exploring the architecture and results of four text-to-image approaches: DALL-E, Imagen, Stable Diffusion, and GANs. We will also be running Stable Diffusion with WebUI and implementing subject-driven fine-tuning on diffusion models with DreamBooth's method.
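As a quick illustration of the workflow, here is a minimal sketch of running Stable Diffusion programmatically through the `diffusers` library instead of the WebUI; the checkpoint name and prompt are illustrative assumptions, not the exact setup used in the project.

```python
# A minimal sketch of text-to-image generation with Stable Diffusion via
# the `diffusers` library; checkpoint and prompt are illustrative.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "a watercolor painting of a fox in a forest"  # hypothetical prompt
image = pipe(prompt, num_inference_steps=50, guidance_scale=7.5).images[0]
image.save("fox.png")
```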
-
Anti-Facial Recognition Technology
In this article, we examine various anti-facial recognition techniques and assess their effectiveness. We begin by introducing facial recognition and providing a high-level overview of its pipeline. Next, we explore how Fawkes software exploits vulnerabilities in image cloaking by testing it on the PubFig database. We then discuss MTCNN-Attack, which prevents models from recognizing facial features by overlaying grayscale patches on individuals’ faces. Finally, we present a method that adds noise to images, rendering them unlearnable to models while remaining visually indistinguishable to the naked eye.
-
Fashion Image Editing
Fashion image editing involves modifying model images with specific target garments. We examine various approaches to fashion image editing based on latent diffusion, generative adversarial networks, and transformers.
-
Zero Shot Learning
Zero-Shot Learning (ZSL) enables models to classify unseen classes, addressing scalability challenges in traditional supervised learning, which requires extensive labeled data. By leveraging auxiliary information like semantic attributes, ZSL facilitates knowledge transfer to new, dynamic tasks. OpenAI's CLIP exemplifies ZSL, aligning image and text embeddings through contrastive learning for flexible classification. In this study, we explore optimizing CLIP's pre-trained model using prompt engineering. By testing various prompt formulations, from generic ("A photo of a {}") to highly specific ("A hand showing a [rock/paper/scissors]"), we aim to enhance classification accuracy, demonstrating ZSL's potential for scalable and adaptable vision models.
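To make the prompt-engineering experiment concrete, here is a minimal sketch of CLIP zero-shot classification with a swappable prompt template, assuming the HuggingFace `transformers` CLIP checkpoint below; the image path and template wording are illustrative, not the exact ones used in the study.

```python
# A sketch of CLIP zero-shot classification under different prompt templates.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

classes = ["rock", "paper", "scissors"]
template = "A hand showing {}"  # swap in "A photo of a {}" to compare prompts
prompts = [template.format(c) for c in classes]

image = Image.open("hand.jpg")  # hypothetical input image
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)  # similarities -> class probabilities
print(dict(zip(classes, probs[0].tolist())))
```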
-
Text-To-Image Generation - A Study of CV Models Shaping the Newest Tool in Creative Expression
The goal of this study is to analyze different approaches to text-to-image generation. We specifically looked at innovations being made with GANs and diffusion models. This study explores the implementations of StackGAN, latent diffusion models, and Stable Diffusion XL (SDXL).
-
Vehicle Trajectory Prediction
This project explores advanced vehicle trajectory prediction methods, a critical component for safe and efficient autonomous driving. By analyzing models like STA-LSTM, Convolutional Social Pooling, and CRAT-Pred, it highlights their unique approaches to handling spatial and temporal complexities, as well as their applications in structured and unstructured traffic environments.
-
Contrastive Language–Image Pre-training Applications and Extensions
With the strongly data-driven nature of training computer vision models, the demand for reliably annotated data is quite high. Manually labeling datasets is very time-consuming and a bottleneck to progress. Another bottleneck comes from training large, specialized models from scratch on these big datasets, which is computationally expensive. In the mid-2010s, the popular method was to pretrain deep ConvNets on ImageNet before fine-tuning on a specific downstream image classification task. This effectively set a baseline of visual features that newer models could build on. However, multi-modal relationships between image and text weren't strong enough, and models still needed to be fine-tuned. We focus this blog on exploring CLIP (Contrastive Language-Image Pre-training). The purpose of CLIP is to perform zero/few-shot learning (classifying images having seen little to no prior examples), so that fine-tuning would not be compulsory. CLIP achieved state-of-the-art zero-shot results. CLIP has also been extended and improved, which we'll detail below. CLIP opened the door for CLIPScore, a unique and preferred evaluation metric for image captioning. CLIP's rich text encoder is also used in latent diffusion models (LDMs) for text conditioning. FALIP also brought improvements and variations to the original CLIP.
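To ground the pre-training idea, here is a sketch of CLIP's symmetric contrastive objective, following the pseudocode published in the CLIP paper; the encoder outputs and batch tensors here are placeholders rather than any particular implementation.

```python
# A sketch of CLIP's symmetric contrastive (InfoNCE-style) objective.
import torch
import torch.nn.functional as F

def clip_loss(image_features, text_features, temperature=0.07):
    # L2-normalize embeddings so dot products are cosine similarities
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Pairwise similarity matrix for the batch, scaled by temperature
    logits = image_features @ text_features.t() / temperature

    # Matched image-text pairs sit on the diagonal
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy over image->text and text->image directions
    loss_i = F.cross_entropy(logits, targets)
    loss_t = F.cross_entropy(logits.t(), targets)
    return (loss_i + loss_t) / 2
```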
-
Image-to-Image Style Transfer
Recent advances in deep generative models have enabled unprecedented control over image synthesis and transformation, from translating between visual domains to precisely controlling spatial structure or appearance in generated images. This report traces a few seminal developments in architectures and training methodologies that have made these capabilities possible, from GAN-based approaches like CycleGAN to modern diffusion-based techniques that leverage pre-trained Stable Diffusion models to enable fine-grained control through image conditioning.
-
Bird's Eye View Segmentation
In recent years, deep learning methods have advanced at an incredible pace, and in this paper we will dive into some of the state-of-the-art segmentation approaches utilized within vehicle and robotic navigation systems. We will study and discuss three different approaches to Bird's Eye View segmentation: the Lift, Splat, and Shoot method; the PointBeV method; and the BeVSegFormer method. We also walk through our own replication of Lift, Splat, and Shoot and show the capabilities of this deep learning method.
-
Face Detection: From Neural Networks to Dense Detectors
The field of face detection has evolved significantly from early neural network approaches to modern deep learning architectures. This article traces this evolution, focusing particularly on Deep Dense Face Detector (DDFD). We examine how DDFD built upon earlier foundations like Rowley’s neural networks and the Viola-Jones framework while introducing innovations that enabled face detection across multiple views without requiring pose annotations. The paper analyzes DDFD’s architectural choices, training methodology, and performance characteristics compared to contemporary approaches like R-CNN. We highlight DDFD’s practical applications through case studies, including its integration into face detection and tagging systems achieving 85% accuracy.
-
CNN Loss Function Advances for Deep Facial Recognition
In this report, we focus on analyzing loss functions used in Convolutional Neural Networks (CNNs) for Deep Face Recognition, specifically comparing A-Softmax, CosFace, and ArcFace, and examining their performances.
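For reference, here is a minimal sketch of the ArcFace-style margin logits under the standard s·cos(θ + m) formulation; the scale and margin values shown are common defaults, not necessarily the exact settings compared in the report.

```python
# A sketch of ArcFace's additive angular margin applied to the target class.
import torch
import torch.nn.functional as F

def arcface_logits(embeddings, weight, labels, s=64.0, m=0.5):
    # Cosine similarity between L2-normalized features and class weights
    cosine = F.normalize(embeddings) @ F.normalize(weight).t()
    theta = torch.acos(cosine.clamp(-1 + 1e-7, 1 - 1e-7))

    # Add the angular margin m only to the ground-truth class angle
    one_hot = F.one_hot(labels, num_classes=weight.size(0)).bool()
    theta = torch.where(one_hot, theta + m, theta)

    # Rescale by s; train with ordinary cross-entropy on these logits:
    # loss = F.cross_entropy(arcface_logits(emb, W, y), y)
    return s * torch.cos(theta)
```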
-
On Exploring Modern Facial Recognition Models with Bounding Box Detection
This project explores the application of four modern object detection models—Faster R-CNN, SSD, YOLOv5, and EfficientDet—for facial recognition tasks. Each model was evaluated based on its performance, speed, and scalability using a curated dataset of face images. Faster R-CNN exhibited high accuracy, making it ideal for precision-focused tasks, but its slower inference speed limited real-time applicability. SSD offered a balanced trade-off between speed and accuracy, suitable for diverse scenarios. YOLOv5 excelled in real-time performance, demonstrating a strong balance of speed and precision. EfficientDet showcased scalability and multi-scale detection capabilities but faced limitations in computational efficiency for face-specific tasks. These evaluations highlight the trade-offs between accuracy, speed, and resource constraints, providing insights for model selection in real-world facial recognition applications.
-
3D Semantic Segmentation
3D semantic segmentation is a cornerstone of modern computer vision, enabling understanding of our physical world for applications ranging from embodied intelligence and robotics to autonomous driving. In 3D semantic segmentation, our goal is to assign a semantic label to every point in a LiDAR point cloud. Compared to the pixel grids of 2D images, data from 3D sensors are complex, irregular, and sparse, lacking the niceties and biases we often exploit in processing 2D images. We will discuss three deep learning-based approaches pioneering the field, namely PointNet, PointTransformerV3, and Panoptic-Polarnet.
-
FaceNet
Our project is about FaceNet, how it works, and its use cases. We had a lot of fun playing around with it and learning about it. We hope you do too!
-
Image Translation
An exploration of three different approaches to image-to-image translation.
-
Panoptic Segmentation – From Foundational Tasks to Modern Advances
In this post, I will discuss recent advancements in image segmentation, including panoptic segmentation.
-
Super Resolution
Super Resolution is an image processing technique that enhances/restores a low-resolution image to a higher resolution. Researchers have proposed and implemented many methods of tackling this classical computer vision task over the years, and improvements have been rapid in the last decade with the boom in deep learning. However, one of the major pitfalls of both CNN and standard transformer-based models is an inability to capture a wide range of spatial information from the input image. We will look at the design and architecture of one of the current cutting-edge models, HAT, which combats this problem by utilizing channel attention, self-attention, and cross-attention. Then we will apply the HAT model to novel input images to test its performance.
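As background for HAT's attention mechanisms, here is a sketch of the channel attention block that RCAN-style and HAT-style super-resolution models build on: global pooling followed by a bottleneck that rescales each channel. The layer sizes are illustrative, not HAT's exact configuration.

```python
# A sketch of a channel attention block (squeeze-and-excitation style).
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.body = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                       # squeeze: per-channel statistics
            nn.Conv2d(channels, channels // reduction, 1),  # bottleneck down-projection
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),  # up-projection
            nn.Sigmoid(),                                   # excitation: per-channel weights
        )

    def forward(self, x):
        return x * self.body(x)                             # rescale feature maps channel-wise
```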
-
Hand Pose Fugl Meyer
We explore three deep learning-based approaches to hand pose estimation and build a proof-of-concept RNN algorithm that uses hand pose estimation for an important application: Fugl-Meyer Assessment evaluation.
-
Visual Question Answering
(Open-answer) visual question answering (VQA for short) is a computer vision task: given an image and a natural-language question about the image, return an accurate and human-like natural-language response to the query using information in the image. Formally, the open-answer VQA task is: given an image-question pair (I, q), output a sequence of characters (of arbitrary length).
-
Visual Question Answering
Visual Question Answering (VQA) combines computer vision and natural language processing to enable AI systems to answer questions about images. This project explores and compares models like LSTM-CNN, SAN, and CLIP, evaluating their performance on datasets such as VQA v2, CLEVR, GQA, and DAQUAR. Using accuracy metrics and attention map visualizations, we uncover how these models process visual and textual data, highlighting their strengths and identifying areas for improvement.
-
Team49: Linguistic Binding in Diffusion Models
This project explores improving text-to-image diffusion models, focusing on the problem of linguistic binding in Stable Diffusion models. Text-to-image generation often suffers from attribute misbinding, omissions, and semantic leakage, which can lead to mismatches between textual prompts and generated images. Building on the SynGen method, our team proposes a new loss function that introduces an extra entropy term. During the denoising process, this entropy term refines the attention maps so that the relationship between modifiers and their corresponding entities is more accurate. This method improves the correspondence of attributes in the generated image compared to SynGen.
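The sketch below shows one illustrative reading of such an entropy term: treating each token's cross-attention map as a distribution over spatial locations and penalizing its entropy so that attention concentrates. This is our hedged interpretation, not the team's exact implementation.

```python
# A hedged sketch of an entropy regularizer over cross-attention maps;
# how it is weighted and combined with the SynGen loss is not shown.
import torch

def attention_entropy(attn_maps, eps=1e-8):
    # attn_maps: (tokens, H, W) cross-attention maps for the prompt tokens
    p = attn_maps.flatten(1)
    p = p / (p.sum(dim=-1, keepdim=True) + eps)    # normalize to distributions
    entropy = -(p * (p + eps).log()).sum(dim=-1)   # Shannon entropy per token
    return entropy.mean()                           # candidate extra loss term
```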
-
Medical Image Segmentation
This report covers medical image segmentation using U-Net, U-Net++, and PSPNet. These models are run on an ISIC challenge dataset from 2017.
-
Controlling Images with Diffusion Models
Image diffusion models generate novel images by starting from pure noise and gradually denoising. This process is inherently random, and much work has been put into controlling the generation of image diffusion models. We will discuss three papers that introduced innovative methods to control the generation process beyond standard text-guided generation: InstructPix2Pix, DreamBooth, and ControlNet.
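For context, here is a minimal sketch of the iterative denoising these methods build on, following the standard DDPM reverse step; `model` and the noise schedule `betas` are placeholders rather than any specific paper's network.

```python
# A sketch of the DDPM reverse (sampling) loop: start from noise, denoise stepwise.
import torch

@torch.no_grad()
def ddpm_sample(model, shape, betas):
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)                          # start from pure noise
    for t in reversed(range(len(betas))):
        eps = model(x, torch.tensor([t]))           # predicted noise at step t
        coef = (1 - alphas[t]) / (1 - alpha_bars[t]).sqrt()
        mean = (x - coef * eps) / alphas[t].sqrt()  # posterior mean estimate
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + betas[t].sqrt() * noise          # gradually denoise
    return x
```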
-
Anomaly Detection
Anomaly detection is a critical task in domains such as cybersecurity, healthcare, and fraud detection. It involves identifying patterns in data that deviate significantly from the norm. This report compares three state-of-the-art approaches to anomaly detection: a clustering-based method, a GAN-based method, and a reinforcement learning (RL)-based method. Each approach leverages unique architectures and methodologies to address the challenge of detecting anomalies in various datasets. This comparative analysis evaluates these models across key dimensions: datasets, architectures, and results.
-
Point Cloud
PointNet introduced a groundbreaking approach to processing 3D point cloud data directly, bypassing the need for voxelization or other preprocessing techniques. Its core innovation lies in its ability to handle unordered point sets while maintaining permutation invariance and learning robust features for tasks such as classification and segmentation.
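A sketch of that core idea follows: a shared pointwise MLP followed by a symmetric max-pool, which makes the global encoding invariant to point ordering. Channel widths are illustrative, and PointNet's T-Net alignment modules are omitted.

```python
# A sketch of PointNet's permutation-invariant global feature encoder.
import torch
import torch.nn as nn

class PointNetEncoder(nn.Module):
    def __init__(self, out_dim=1024):
        super().__init__()
        # 1x1 convolutions apply the same MLP to every point independently
        self.mlp = nn.Sequential(
            nn.Conv1d(3, 64, 1), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.ReLU(),
            nn.Conv1d(128, out_dim, 1),
        )

    def forward(self, points):                      # points: (batch, 3, n_points)
        features = self.mlp(points)                 # per-point features
        return features.max(dim=2).values           # max-pool: order-invariant global feature
```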
-
Medical Image Segmentation
Medical image segmentation leverages deep learning to partition medical images into meaningful regions like organs, tissues, and abnormalities. This report explores key segmentation models such as U-Net, U-Net++, and nnU-Net, detailing their architectures, challenges, comparative performance, and practical applications in clinical and research settings.
-
ASL Fingerspelling
We evaluate and compare three different implementations of Continuous Sign Language Recognition (CSLR): MiCT-RANet, a spatio-temporal approach; C2ST, which takes advantage of textual contextual information; and an approach based on OpenPose landmarks. We implement the MiCT-RANet approach for ASL fingerspelling recognition and attempt to train it from scratch.
-
Image Super-Resolution: A Brief Overview
Image Super-Resolution (SR) is a technique in computer vision that reconstructs a high-resolution (HR) image from one or more low-resolution (LR) images. In this blog post, we aim to provide an overview of both fundamental and recent state-of-the-art (SOTA) machine learning models within this field.
-
Novel View Synthesis with 3D Gaussian Splatting
Gaussian Splatting is a novel 3D reconstruction algorithm that uses 3D Gaussians to radically improve generation time compared to NeRFs. In this project, we review its implementation as well as several applications of this more capable 3D reconstruction algorithm.
-
Image to Image Translation
This post delves into cutting-edge methods for image-to-image translation and generative modeling, including Pix2Pix, CycleGAN, and FreeControl.
-
Image Retrieval
-
Sign Language Recognition
Sign language recognition is a task that can be solved by applying computer vision principles. This post will explore various methods and an implementation of a solution.
-
Super-resolution
-
CamoNet
We used a GAN approach to generate optimal camouflage patterns for individual scenes.
-
Exploring Object Detection
Object tracking is a core computer vision task that aims to identify and track objects across a sequence of frames. Its applications span from surveillance to autonomous driving to medical imaging. In this blog-style post, we explore object tracking advancements across different frameworks and promising architectures.
-
Hierarchical Label Explainability
Understanding how hierarchical labels affect saliency maps can unlock new pathways for model transparency and interpretability. This post explores the motivation, existing methods, and implementation of our project on hierarchical label explainability.