-
Medical Image Segmentation
Medical image segmentation automatically divides a medical image into multiple distinct regions of interest corresponding to different organs, tissues, or pathological areas. This technique allows healthcare professionals to interpret and analyze medical images like X-rays, ultrasounds, and CT scans much more efficiently than labeling areas by hand.
-
Audio-Visual Sentiment Analysis
Our group explored the cutting-edge field of Audio-Visual Sentiment Analysis (AVSA) and how multimodal data (comprising visual, audio, and textual inputs) can be combined to better understand human emotion.
-
Peek-A-Boo, Occlusion-Aware Visual Perception through Active Exploration
In this study, we present a framework for enabling robots to locate and focus on objects that are partially or fully occluded within their environment. We split robotic tasks into two stages: Localization, where the robot searches for objects of interest, and Task Completion, where the robot completes the task after finding the object. We propose Peekaboo, a solution to the Localization stage that finds partially or even fully occluded objects. We train a reinforcement learning algorithm that teaches the robot to actively reposition its camera to optimize visibility of occluded objects. The key features include engineering a reward function that incentivizes effective object localization and setting up a comprehensive training environment. We develop a randomized simulation environment so the agent learns to localize from numerous initial viewpoints. Our approach also includes a vision encoder for processing visual input, which allows the robot to interpret and respond to objects and occlusions. We design metrics to quantify the model's performance, demonstrating its capability to handle occlusions without human intervention. The results of this work showcase the potential for robotic systems to actively improve their perception in cluttered or obstructed environments.
-
Dynamic Hand Gesture Classification
In this report, we review three deep learning models in the domain of static and dynamic hand gesture classification.
-
Occlusion Removal Using DesnowNet
Occlusion removal in computer vision restores images by addressing obstructions, including removing weather conditions such as rain, fog, snow, and haze. These weather-induced occlusions hinder object detection and thus can impact the performance of computer vision models. Deep learning-based techniques such as convolutional neural networks (CNNs) and generative adversarial networks (GANs) can be used to remove weather artifacts from image content. These models exploit spatial and temporal features, preserving key features of images while removing occlusions from the background. Applications range from autonomous driving to surveillance, where improved image clarity under challenging weather conditions enhances safety and accuracy. In our project, we attempt to create an implementation of DesnowNet. View our code here.
-
Image Super Resolution
This post compares and contrasts three methods of performing image super resolution: Enhanced Deep Residual Networks (EDSR), Residual Channel Attention Networks (RCAN), and Residual Dense Networks (RDN). In addition, we experiment with fine-tuning one of these networks on the MiniPlaces dataset.
-
Text-to-Image Generation
Text-to-image generation is the task of generating images from input text: a textual prompt is passed in, and the model outputs an image based on that prompt. We will be exploring the architecture and results of four text-to-image approaches: DALL-E, Imagen, Stable Diffusion, and GANs. We will also be running Stable Diffusion with WebUI and implementing subject-driven fine-tuning on diffusion models with DreamBooth's method.
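As a quick illustration of the workflow, here is a minimal sketch of running Stable Diffusion programmatically through the `diffusers` library instead of the WebUI; the checkpoint name and prompt are illustrative assumptions, not the exact setup used in the project.

```python
# A minimal sketch of text-to-image generation with Stable Diffusion via
# the `diffusers` library; checkpoint and prompt are illustrative.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "a watercolor painting of a fox in a forest"  # hypothetical prompt
image = pipe(prompt, num_inference_steps=50, guidance_scale=7.5).images[0]
image.save("fox.png")
```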
-
Anti-Facial Recognition Technology
In this article, we examine various anti-facial recognition techniques and assess their effectiveness. We begin by introducing facial recognition and providing a high-level overview of its pipeline. Next, we explore how Fawkes software exploits vulnerabilities in image cloaking by testing it on the PubFig database. We then discuss MTCNN-Attack, which prevents models from recognizing facial features by overlaying grayscale patches on individuals’ faces. Finally, we present a method that adds noise to images, rendering them unlearnable to models while remaining visually indistinguishable to the naked eye.
-
Fashion Image Editing
Fashion image editing involves modifying model images with specific target garments. We examine various approaches to fashion image editing based on latent diffusion, generative adversarial networks, and transformers.
-
Zero Shot Learning
Zero-Shot Learning (ZSL) enables models to classify unseen classes, addressing scalability challenges in traditional supervised learning, which requires extensive labeled data. By leveraging auxiliary information like semantic attributes, ZSL facilitates knowledge transfer to new, dynamic tasks. OpenAI's CLIP exemplifies ZSL, aligning image and text embeddings through contrastive learning for flexible classification. In this study, we explore optimizing CLIP's pre-trained model using prompt engineering. By testing various prompt formulations, from generic ("A photo of a {}") to highly specific ("A hand showing a [rock/paper/scissors]"), we aim to enhance classification accuracy, demonstrating ZSL's potential for scalable and adaptable vision models.
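To make the prompt-engineering experiment concrete, here is a minimal sketch of CLIP zero-shot classification with a swappable prompt template, assuming the HuggingFace `transformers` CLIP checkpoint below; the image path and template wording are illustrative, not the exact ones used in the study.

```python
# A sketch of CLIP zero-shot classification under different prompt templates.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

classes = ["rock", "paper", "scissors"]
template = "A hand showing {}"  # swap in "A photo of a {}" to compare prompts
prompts = [template.format(c) for c in classes]

image = Image.open("hand.jpg")  # hypothetical input image
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)  # similarities -> class probabilities
print(dict(zip(classes, probs[0].tolist())))
```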
-
Text-To-Image Generation - A Study of CV Models Shaping the Newest Tool in Creative Expression
The goal of this study is to analyze different approaches to text-to-image generation. We specifically looked at innovations being made with GANs and diffusion models. This study explores the implementations of StackGAN, latent diffusion models, and Stable Diffusion XL (SDXL).
-
Vehicle Trajectory Prediction
This project explores advanced vehicle trajectory prediction methods, a critical component for safe and efficient autonomous driving. By analyzing models like STA-LSTM, Convolutional Social Pooling, and CRAT-Pred, it highlights their unique approaches to handling spatial and temporal complexities, as well as their applications in structured and unstructured traffic environments.
-
Contrastive Language–Image Pre-training Applications and Extensions
With the strongly data-driven nature of training computer vision models, the demand for reliably annotated data is quite high. Manually labeling datasets is very time-consuming and a bottleneck to progress. Another bottleneck comes from training large, specialized models from scratch on these big datasets, which is computationally expensive. In the mid-2010s, the popular method was to pretrain deep ConvNets on ImageNet before fine-tuning on a specific downstream image classification task. This effectively set a baseline of visual features that newer models could build on. However, multi-modal relationships between image and text weren't strong enough, and models still needed to be fine-tuned. We focus this blog on exploring CLIP (Contrastive Language-Image Pre-training). The purpose of CLIP is to perform zero/few-shot learning (classifying images having seen little to no prior examples), so that fine-tuning would not be compulsory. CLIP achieved state-of-the-art zero-shot results. CLIP has also been extended and improved, which we'll detail below. CLIP opened the door for CLIPScore, a unique and preferred evaluation metric for image captioning. CLIP's rich text encoder is also used in latent diffusion models (LDMs) for text conditioning. FALIP also brought improvements and variations to the original CLIP.
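To ground the pre-training idea, here is a sketch of CLIP's symmetric contrastive objective, following the pseudocode published in the CLIP paper; the encoder outputs and batch tensors here are placeholders rather than any particular implementation.

```python
# A sketch of CLIP's symmetric contrastive (InfoNCE-style) objective.
import torch
import torch.nn.functional as F

def clip_loss(image_features, text_features, temperature=0.07):
    # L2-normalize embeddings so dot products are cosine similarities
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Pairwise similarity matrix for the batch, scaled by temperature
    logits = image_features @ text_features.t() / temperature

    # Matched image-text pairs sit on the diagonal
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy over image->text and text->image directions
    loss_i = F.cross_entropy(logits, targets)
    loss_t = F.cross_entropy(logits.t(), targets)
    return (loss_i + loss_t) / 2
```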
-
Image-to-Image Style Transfer
Recent advances in deep generative models have enabled unprecedented control over image synthesis and transformation, from translating between visual domains to precisely controlling spatial structure or appearance in generated images. This report traces a few seminal developments in architectures and training methodologies that have made these capabilities possible, from GAN-based approaches like CycleGAN to modern diffusion-based techniques that leverage pre-trained Stable Diffusion models to enable fine-grained control through image conditioning.
-
Bird's Eye View Segmentation
In recent years, deep learning methods have advanced at an incredible pace, and in this paper we will dive into some of the state-of-the-art segmentation approaches utilized within vehicle and robotic navigation systems. We will study and discuss three different approaches to Bird's Eye View segmentation: the Lift, Splat, and Shoot method; the PointBeV method; and the BeVSegFormer method. We also walk through our own replication of Lift, Splat, and Shoot and show the capabilities of this deep learning method.
-
Face Detection: From Neural Networks to Dense Detectors
The field of face detection has evolved significantly from early neural network approaches to modern deep learning architectures. This article traces this evolution, focusing particularly on Deep Dense Face Detector (DDFD). We examine how DDFD built upon earlier foundations like Rowley’s neural networks and the Viola-Jones framework while introducing innovations that enabled face detection across multiple views without requiring pose annotations. The paper analyzes DDFD’s architectural choices, training methodology, and performance characteristics compared to contemporary approaches like R-CNN. We highlight DDFD’s practical applications through case studies, including its integration into face detection and tagging systems achieving 85% accuracy.
-
CNN Loss Function Advances for Deep Facial Recognition
In this report, we focus on analyzing loss functions used in Convolutional Neural Networks (CNNs) for Deep Face Recognition, specifically comparing A-Softmax, CosFace, and ArcFace, and examining their performances.
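For reference, here is a minimal sketch of the ArcFace-style margin logits under the standard s·cos(θ + m) formulation; the scale and margin values shown are common defaults, not necessarily the exact settings compared in the report.

```python
# A sketch of ArcFace's additive angular margin applied to the target class.
import torch
import torch.nn.functional as F

def arcface_logits(embeddings, weight, labels, s=64.0, m=0.5):
    # Cosine similarity between L2-normalized features and class weights
    cosine = F.normalize(embeddings) @ F.normalize(weight).t()
    theta = torch.acos(cosine.clamp(-1 + 1e-7, 1 - 1e-7))

    # Add the angular margin m only to the ground-truth class angle
    one_hot = F.one_hot(labels, num_classes=weight.size(0)).bool()
    theta = torch.where(one_hot, theta + m, theta)

    # Rescale by s; train with ordinary cross-entropy on these logits:
    # loss = F.cross_entropy(arcface_logits(emb, W, y), y)
    return s * torch.cos(theta)
```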
-
On Exploring Modern Facial Recognition Models with Bounding Box Detection
This project explores the application of four modern object detection models—Faster R-CNN, SSD, YOLOv5, and EfficientDet—for facial recognition tasks. Each model was evaluated based on its performance, speed, and scalability using a curated dataset of face images. Faster R-CNN exhibited high accuracy, making it ideal for precision-focused tasks, but its slower inference speed limited real-time applicability. SSD offered a balanced trade-off between speed and accuracy, suitable for diverse scenarios. YOLOv5 excelled in real-time performance, demonstrating a strong balance of speed and precision. EfficientDet showcased scalability and multi-scale detection capabilities but faced limitations in computational efficiency for face-specific tasks. These evaluations highlight the trade-offs between accuracy, speed, and resource constraints, providing insights for model selection in real-world facial recognition applications.
-
3D Semantic Segmentation
3D semantic segmentation is a cornerstone of modern computer vision, enabling understanding of our physical world for applications ranging from embodied intelligence and robotics to autonomous driving. In 3D semantic segmentation, our goal is to assign a semantic label to every point in a LiDAR point cloud. Compared to the pixel grids of 2D images, data from 3D sensors are complex, irregular, and sparse, lacking the niceties and biases we often exploit in processing 2D images. We will discuss three deep learning-based approaches pioneering the field, namely PointNet, PointTransformerV3, and Panoptic-Polarnet.
-
FaceNet
Our project is about FaceNet, how it works, and its use cases. We had a lot of fun playing around with it and learning about it. We hope you do too!
-
Image Translation
An exploration of three different approaches to image-to-image translation.
-
Panoptic Segmentation – From Foundational Tasks to Modern Advances
In this post, I will discuss recent advancements in image segmentation, including panoptic segmentation.
-
Super Resolution
Super Resolution is an image processing technique that enhances/restores a low-resolution image to a higher resolution. Researchers have proposed and implemented many methods of tackling this classical computer vision task over the years, and improvements have been rapid in the last decade with the boom in deep learning. However, one of the major pitfalls of both CNN and standard transformer-based models is an inability to capture a wide range of spatial information from the input image. We will look at the design and architecture of one of the current cutting-edge models, HAT, which combats this problem by utilizing channel attention, self-attention, and cross-attention. Then we will apply the HAT model to novel input images to test its performance.
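As background for HAT's attention mechanisms, here is a sketch of the channel attention block that RCAN-style and HAT-style super-resolution models build on: global pooling followed by a bottleneck that rescales each channel. The layer sizes are illustrative, not HAT's exact configuration.

```python
# A sketch of a channel attention block (squeeze-and-excitation style).
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.body = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                       # squeeze: per-channel statistics
            nn.Conv2d(channels, channels // reduction, 1),  # bottleneck down-projection
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),  # up-projection
            nn.Sigmoid(),                                   # excitation: per-channel weights
        )

    def forward(self, x):
        return x * self.body(x)                             # rescale feature maps channel-wise
```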
-
Hand Pose Fugl Meyer
We explore three deep learning-based approaches to hand pose estimation and build a proof-of-concept RNN algorithm that uses hand pose estimation for an important application: Fugl-Meyer Assessment evaluation.
-
Visual Question Answering
(Open-answer) visual question answering (VQA for short) is a computer vision task: given an image and a natural-language question about the image, return an accurate and human-like natural-language response to the query using information in the image. Formally, the open-answer VQA task is: given an image-question pair (I, q), output a sequence of characters (of arbitrary length).
-
Visual Question Answering
Visual Question Answering (VQA) combines computer vision and natural language processing to enable AI systems to answer questions about images. This project explores and compares models like LSTM-CNN, SAN, and CLIP, evaluating their performance on datasets such as VQA v2, CLEVR, GQA, and DAQUAR. Using accuracy metrics and attention map visualizations, we uncover how these models process visual and textual data, highlighting their strengths and identifying areas for improvement.
-
Team49: Linguistic Binding in Diffusion Models
This project explores improving text-to-image diffusion models, focusing on the problem of linguistic binding in Stable Diffusion models. Text-to-image generation often suffers from attribute misbinding, omissions, and semantic leakage, which can lead to mismatches between textual prompts and generated images. Building on the SynGen method, our team proposes a new loss function that introduces an extra entropy term. During the denoising process, this entropy term refines the attention maps so that the relationship between modifiers and their corresponding entities is more accurate. This method improves the correspondence of attributes in the generated image compared to SynGen.
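The sketch below shows one illustrative reading of such an entropy term: treating each token's cross-attention map as a distribution over spatial locations and penalizing its entropy so that attention concentrates. This is our hedged interpretation, not the team's exact implementation.

```python
# A hedged sketch of an entropy regularizer over cross-attention maps;
# how it is weighted and combined with the SynGen loss is not shown.
import torch

def attention_entropy(attn_maps, eps=1e-8):
    # attn_maps: (tokens, H, W) cross-attention maps for the prompt tokens
    p = attn_maps.flatten(1)
    p = p / (p.sum(dim=-1, keepdim=True) + eps)    # normalize to distributions
    entropy = -(p * (p + eps).log()).sum(dim=-1)   # Shannon entropy per token
    return entropy.mean()                           # candidate extra loss term
```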
-
Medical Image Segmentation
This report covers medical image segmentation using U-Net, U-Net++, and PSPNet. These models are run on an ISIC challenge dataset from 2017.
-
Controlling Images with Diffusion Models
Image diffusion models generate novel images by starting from pure noise and gradually denoising. This process is inherently random, and much work has been put into controlling the generation of image diffusion models. We will discuss three papers that introduced innovative methods to control the generation process beyond standard text-guided generation: InstructPix2Pix, DreamBooth, and ControlNet.
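For context, here is a minimal sketch of the iterative denoising these methods build on, following the standard DDPM reverse step; `model` and the noise schedule `betas` are placeholders rather than any specific paper's network.

```python
# A sketch of the DDPM reverse (sampling) loop: start from noise, denoise stepwise.
import torch

@torch.no_grad()
def ddpm_sample(model, shape, betas):
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)                          # start from pure noise
    for t in reversed(range(len(betas))):
        eps = model(x, torch.tensor([t]))           # predicted noise at step t
        coef = (1 - alphas[t]) / (1 - alpha_bars[t]).sqrt()
        mean = (x - coef * eps) / alphas[t].sqrt()  # posterior mean estimate
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + betas[t].sqrt() * noise          # gradually denoise
    return x
```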
-
Anomaly Detection
Anomaly detection is a critical task in domains such as cybersecurity, healthcare, and fraud detection. It involves identifying patterns in data that deviate significantly from the norm. This report compares three state-of-the-art approaches to anomaly detection: a clustering-based method, a GAN-based method, and a reinforcement learning (RL)-based method. Each approach leverages unique architectures and methodologies to address the challenge of detecting anomalies in various datasets. This comparative analysis evaluates these models across key dimensions: datasets, architectures, and results.
-
Point Cloud
PointNet introduced a groundbreaking approach to processing 3D point cloud data directly, bypassing the need for voxelization or other preprocessing techniques. Its core innovation lies in its ability to handle unordered point sets while maintaining permutation invariance and learning robust features for tasks such as classification and segmentation.
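A sketch of that core idea follows: a shared pointwise MLP followed by a symmetric max-pool, which makes the global encoding invariant to point ordering. Channel widths are illustrative, and PointNet's T-Net alignment modules are omitted.

```python
# A sketch of PointNet's permutation-invariant global feature encoder.
import torch
import torch.nn as nn

class PointNetEncoder(nn.Module):
    def __init__(self, out_dim=1024):
        super().__init__()
        # 1x1 convolutions apply the same MLP to every point independently
        self.mlp = nn.Sequential(
            nn.Conv1d(3, 64, 1), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.ReLU(),
            nn.Conv1d(128, out_dim, 1),
        )

    def forward(self, points):                      # points: (batch, 3, n_points)
        features = self.mlp(points)                 # per-point features
        return features.max(dim=2).values           # max-pool: order-invariant global feature
```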
-
Medical Image Segmentation
Medical image segmentation leverages deep learning to partition medical images into meaningful regions like organs, tissues, and abnormalities. This report explores key segmentation models such as U-Net, U-Net++, and nnU-Net, detailing their architectures, challenges, comparative performance, and practical applications in clinical and research settings.
-
ASL Fingerspelling
We evaluate and compare three different implementations of Continuous Sign Language Recognition (CSLR): MiCT-RANet, a spatio-temporal approach; C2ST, which takes advantage of textual contextual information; and an approach based on OpenPose landmarks. We implement the MiCT-RANet approach for ASL fingerspelling recognition and attempt to train it from scratch.
-
Image Super-Resolution: A Brief Overview
Image Super-Resolution (SR) is a technique in computer vision that reconstructs a high-resolution (HR) image from one or more low-resolution (LR) images. In this blog post, we aim to provide an overview of both fundamental and recent state-of-the-art (SOTA) machine learning models within this field.
-
Novel View Synthesis with 3D Gaussian Splatting
Gaussian Splatting is a novel 3D reconstruction algorithm that uses 3D Gaussians to radically improve generation time compared to NeRFs. In this project, we review its implementation as well as several applications of this more capable 3D reconstruction algorithm.
-
Image to Image Translation
This post delves into cutting-edge methods for image-to-image translation and generative modeling, including Pix2Pix, CycleGAN, and FreeControl.
-
Image Retrieval
-
Sign Language Recognition
Sign language recognition is a task that can be solved by applying computer vision principles. This post will explore various methods and an implementation of a solution.
-
Super-resolution
-
CamoNet
We used a GAN approach to generate optimal camouflage patterns for individual scenes.
-
Exploring Object Detection
Object tracking is a core computer vision task that aims to identify and track objects across a sequence of frames. Its applications span from surveillance to autonomous driving to medical imaging. In this blog-style post, we explore object tracking advancements across different frameworks and promising architectures.
-
Hierarchical Label Explainability
Understanding how hierarchical labels affect saliency maps can unlock new pathways for model transparency and interpretability. This post explores the motivation, existing methods, and implementation of our project on hierarchical label explainability.