  • Super-resolution via diffusion methods

    Super-resolution enhances image resolution from low to high, and modern techniques such as convolutional neural networks and diffusion models like SR3 have significantly improved image detail and quality. This post explores a simplified implementation of SR3. View the code [Here].
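
    As a rough illustration of the idea behind SR3, the sketch below shows a single denoising-diffusion training step in which the noise predictor is conditioned on a bicubically upsampled low-resolution image via channel-wise concatenation. The module name, network size, and noise schedule are illustrative assumptions rather than the code from the post (the real SR3 also conditions on the noise level).

    ```python
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SRDenoiser(nn.Module):
        """Toy noise predictor: sees the noisy HR image concatenated with the upsampled LR image."""
        def __init__(self, channels=3, hidden=64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(2 * channels, hidden, 3, padding=1), nn.ReLU(),
                nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(),
                nn.Conv2d(hidden, channels, 3, padding=1),
            )

        def forward(self, x_noisy, lr_up):
            return self.net(torch.cat([x_noisy, lr_up], dim=1))

    def sr3_training_step(model, hr, lr, alphas_cumprod):
        """One training step: predict the noise added to the HR image, conditioned on the LR input."""
        b = hr.shape[0]
        t = torch.randint(0, len(alphas_cumprod), (b,))
        a_bar = alphas_cumprod[t].view(b, 1, 1, 1)
        noise = torch.randn_like(hr)
        x_noisy = a_bar.sqrt() * hr + (1 - a_bar).sqrt() * noise   # forward diffusion q(x_t | x_0)
        lr_up = F.interpolate(lr, size=hr.shape[-2:], mode="bicubic", align_corners=False)
        return F.mse_loss(model(x_noisy, lr_up), noise)

    # Usage sketch
    betas = torch.linspace(1e-4, 0.02, 1000)
    alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
    hr, lr = torch.randn(2, 3, 64, 64), torch.randn(2, 3, 16, 16)
    loss = sr3_training_step(SRDenoiser(), hr, lr, alphas_cumprod)
    ```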

  • NeRFs for Synthesizing Novel Views of 3D Scenes

    In the domain of generative images, NeRFs are a powerful tool for generating novel views of 3D scenes with an extremely high degree of detail. Here, we will review the basics of Neural Radiance Field (NeRF) models, look at how they can be used to generate novel views, and investigate an optimization to the original design, the KiloNeRF model, to see how we can improve on some of its shortcomings.
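
    Before diving in, the following minimal sketch shows two pieces the post builds on: the positional encoding NeRF applies to its inputs and the discrete volume-rendering sum used to composite samples along a ray. Shapes and function names are assumptions for illustration, not the post's actual code.

    ```python
    import torch

    def positional_encoding(x, num_freqs=10):
        """NeRF-style encoding: map each coordinate to sin/cos features at increasing frequencies."""
        freqs = (2.0 ** torch.arange(num_freqs)) * torch.pi
        features = [x]
        for f in freqs:
            features += [torch.sin(f * x), torch.cos(f * x)]
        return torch.cat(features, dim=-1)

    def volume_render(densities, colors, deltas):
        """Discrete NeRF rendering: alpha-composite per-sample colors along each ray."""
        alphas = 1.0 - torch.exp(-densities * deltas)               # opacity of each sample
        trans = torch.cumprod(torch.cat([torch.ones_like(alphas[..., :1]),
                                         1.0 - alphas + 1e-10], dim=-1), dim=-1)[..., :-1]
        weights = trans * alphas                                    # contribution of each sample
        return (weights.unsqueeze(-1) * colors).sum(dim=-2)         # expected color per ray

    # Usage sketch: 4 rays, 64 samples per ray
    rgb = volume_render(torch.rand(4, 64), torch.rand(4, 64, 3), torch.full((4, 64), 0.01))
    ```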

  • Deep Learning Based Image-to-Image Translation Techniques

    The goal of this project is to explore and understand the problem of image-to-image translation. Two approaches addressing this topic will be analyzed: CycleGAN and FreeControl. An implementation of CycleGAN is also discussed.
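
    To make the CycleGAN discussion concrete, here is a minimal sketch of the cycle-consistency term that, together with the adversarial losses, drives training. The generators G (X to Y) and F (Y to X) are stand-ins, and the weight of 10 follows the common default rather than anything specific to this project.

    ```python
    import torch
    import torch.nn as nn

    def cycle_consistency_loss(G, F, real_x, real_y, lam=10.0):
        """L_cyc = ||F(G(x)) - x||_1 + ||G(F(y)) - y||_1, scaled by lambda."""
        l1 = nn.L1Loss()
        forward_cycle = l1(F(G(real_x)), real_x)    # x -> G(x) -> F(G(x)) should return to x
        backward_cycle = l1(G(F(real_y)), real_y)   # y -> F(y) -> G(F(y)) should return to y
        return lam * (forward_cycle + backward_cycle)

    # Usage sketch with trivial 1x1-conv "generators" just to exercise the function
    G, F = nn.Conv2d(3, 3, 1), nn.Conv2d(3, 3, 1)
    loss = cycle_consistency_loss(G, F, torch.randn(2, 3, 128, 128), torch.randn(2, 3, 128, 128))
    ```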

  • Exploring CLIP for Zero-Shot Classification

    In this blog article, we'll delve into CLIP (Contrastive Language-Image Pre-training), focusing primarily on its application in zero-shot classification. Rather than examining a pre-trained version, we will train CLIP ourselves, crafting the core components of the model and employing a distinct, smaller dataset for training. We'll also introduce custom loss functions and dissect specific elements of CLIP, such as its contrastive loss mechanism. This article aims to examine the architecture and its implications, making minor adjustments to better grasp what drives its effectiveness.
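
    As a preview of the contrastive loss mechanism discussed in the article, the sketch below shows the symmetric InfoNCE objective CLIP optimizes over a batch of paired image and text embeddings; the temperature value and tensor shapes are illustrative assumptions.

    ```python
    import torch
    import torch.nn.functional as F

    def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
        """Symmetric contrastive loss: matching image/text pairs sit on the diagonal
        of the similarity matrix and are treated as the correct class."""
        image_emb = F.normalize(image_emb, dim=-1)
        text_emb = F.normalize(text_emb, dim=-1)
        logits = image_emb @ text_emb.t() / temperature          # (B, B) cosine similarities
        targets = torch.arange(logits.shape[0], device=logits.device)
        loss_i2t = F.cross_entropy(logits, targets)              # match each image to its caption
        loss_t2i = F.cross_entropy(logits.t(), targets)          # match each caption to its image
        return (loss_i2t + loss_t2i) / 2

    # Usage sketch: a batch of 8 paired 512-dimensional embeddings
    loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
    ```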

  • Pose Estimation

    The problem of Human Pose Estimation is widely applicable in computer vision—almost any task involving human interaction could benefit from pose estimation. As such, we explore the techniques and developments in this field by discussing three works relevant and reflective of these advances.

  • Human Mesh Recovery

    Human Mesh Recovery (HMR) is a computer vision task that involves reconstructing a detailed 3D mesh model of the human body from 2D images or videos. In the 2D realm, we have seen solutions that extract 2D keypoints; HMR aims to go a step further by capturing the shape and pose of the human body. This post focuses on WHAM (Reconstructing World-grounded Humans with Accurate 3D Motion).

  • Visuomotor Policy Learning

    In visuomotor policy learning, an agent learns to excel at a sequential decision-making task involving visual inputs and motor control. Two important applications include autonomous driving and robotics control.

  • A Comparison of Point Cloud 3D Object Detection Methods

    3D object detection is an important task that is critical to many current problems. It has numerous applications in automotive technology, including obstacle avoidance and autonomous driving. Another valuable application is medical imaging, specifically brain tumor segmentation. Our paper explores recent advances in 3D object detection using point clouds, acknowledging that work in this area is less mature than 2D object detection. We analyze performance and model design, evaluating the prominent 3D object detection models VoxelNet, PointRCNN, SE-SSN, and GLENet on the widely used KITTI dataset as a common benchmark. In our examination, we found that VoxelNet, as one of the earlier models in 3D object detection, performed worse than the later advancements; PointRCNN performs next best, then SE-SSN, and then GLENet. With each development, we discuss how differences in design decisions and architecture contribute to improved average precision and inference times. These advancements in 3D object detection show considerable promise for future computer vision applications.

  • Text-to-Video Generation

    In this paper, we will discuss diffusion-based video generation models. We will first do a preliminary exploration of diffusion, then extend it to video generation by examining Video Diffusion Models by Jonathan Ho et al. We will then follow this with a refinement of video diffusion models by conducting a deep dive into Imagen Video, a high-definition video generation model developed by researchers at Google. Through this paper, we aim to provide an overview of diffusion-based video generation, as well as rigorously cover a high-definition refinement of the basic video diffusion model.

  • 3D Scene Representation from Sparse 2D Images

    In this report, we first introduce the topic of 3D scene representation as a whole. We briefly go over classical approaches and explain some of the common issues preventing them from rendering a high-quality scene. Then, we discuss three deep learning based approaches in more depth, taking Neural Radiance Fields (NeRFs) as our starting point. Instant NGP improves upon the original NeRF paper by introducing a hash encoding of the MLP inputs that speeds up training by at least 200x. Zero-1-to-3 combines NeRF and diffusion to offer extraordinary zero-shot scene synthesis abilities. Lastly, we give the most attention to 3D Gaussian Splatting, which represents the entire scene as a set of 3D Gaussians, enabling efficient training and real-time scene rendering.
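
    As a taste of the hash-encoding idea behind Instant NGP, the sketch below hashes integer grid coordinates into a small learnable feature table. It is deliberately simplified to a single resolution level with no trilinear interpolation, and the table size, primes, and resolution are illustrative assumptions.

    ```python
    import torch

    def hash_encode(points, table, resolution):
        """Single-level spatial hash: snap points in [0, 1]^3 to a grid and look up features."""
        coords = (points * resolution).long()                      # integer cell coordinates
        primes = (1, 2654435761, 805459861)                        # per-dimension hashing primes
        h = (coords[..., 0] * primes[0]) ^ (coords[..., 1] * primes[1]) ^ (coords[..., 2] * primes[2])
        return table[h % table.shape[0]]                           # learnable feature vectors

    # Usage sketch: 2^14-entry table with 2 features per entry, fed into a small MLP afterwards
    table = torch.nn.Parameter(torch.randn(2 ** 14, 2) * 1e-4)
    features = hash_encode(torch.rand(1024, 3), table, resolution=128)   # (1024, 2)
    ```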

  • U-Net for Biomedical Image Segmentation

    This report dives into biomedical image segmentation using U-Net and 3D U-Net. U-Net revolutionized the field with its unique U-shaped architecture designed for precise semantic segmentation, while 3D U-Net tackled volumetric images, improving efficiency and accuracy in 3D segmentation compared to the regular U-Net.
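
    To make the U-shaped architecture concrete, here is a minimal one-level sketch of the encoder-bottleneck-decoder pattern with a skip connection; the channel counts are illustrative, and 3D U-Net follows the same pattern with Conv3d/MaxPool3d in place of the 2D operations.

    ```python
    import torch
    import torch.nn as nn

    def conv_block(in_ch, out_ch):
        """Two 3x3 convolutions with ReLU, the basic U-Net building block."""
        return nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        )

    class TinyUNet(nn.Module):
        """One-level U-Net: encoder, bottleneck, and decoder joined by a skip connection."""
        def __init__(self, in_ch=1, num_classes=2):
            super().__init__()
            self.enc = conv_block(in_ch, 64)
            self.down = nn.MaxPool2d(2)
            self.bottleneck = conv_block(64, 128)
            self.up = nn.ConvTranspose2d(128, 64, 2, stride=2)
            self.dec = conv_block(128, 64)            # 128 = 64 (skip) + 64 (upsampled)
            self.head = nn.Conv2d(64, num_classes, 1)

        def forward(self, x):
            e = self.enc(x)                                    # high-resolution features
            b = self.bottleneck(self.down(e))                  # low-resolution context
            d = self.dec(torch.cat([e, self.up(b)], dim=1))    # skip connection restores detail
            return self.head(d)                                # per-pixel class logits

    # Usage sketch
    logits = TinyUNet()(torch.randn(1, 1, 128, 128))   # (1, 2, 128, 128)
    ```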

  • Polyp Segmentation for Colorectal Cancer

  • Deep Learning for Prostate Segmentation

    We aim to analyze how we can use deep learning techniques for prostate image segmentation. Prostate cancer is the second most common form of cancer for men worldwide and the fifth leading cause of death for men globally [3]. However, this is a statistic that can be considerably changed with early-stage detection. In fact, the five-year survival rate approaches 100% when the cancer is caught early. To this end, we explore how we can use existing deep learning architectures to help with prostate image segmentation to catch early prostate cancer in patients.

  • Multimodal Vision-Language Models: Applications to LaTeX OCR

    LaTeX is widely utilized in scientific and mathematical fields. Our objective was to develop a pipeline capable of transforming handwritten equations into LaTeX code. To achieve this, we devised a two-step model. First, we employed an R-CNN model to delineate bounding boxes around equations on a standard ruled piece of paper, using a custom dataset we generated. We then passed these selected regions into a TrOCR model pre-trained on Im2LaTeX-100k, a dataset comprising rendered LaTeX images. We further fine-tuned the model on a handwritten mathematical expressions dataset from Kaggle, a collection of the CROHME handwritten mathematical expression recognition competition datasets from three years [6] [7] [8]. Our model generated the ground-truth LaTeX exactly for 4 of the 8 hand-drawn examples we produced; for the remaining 4, it produced LaTeX similar to the ground truth, albeit with minor errors.
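
    The recognition stage of such a pipeline can be sketched with the Hugging Face transformers API as below. The detection stage and our fine-tuned weights are not shown; the checkpoint name here is a generic handwritten TrOCR model used as a stand-in, so treat the exact model id and box format as assumptions.

    ```python
    from PIL import Image
    from transformers import TrOCRProcessor, VisionEncoderDecoderModel

    # Stand-in checkpoint; the report instead fine-tunes a model pre-trained on Im2LaTeX-100k.
    processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
    model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")

    def region_to_latex(page: Image.Image, box):
        """Crop one detected equation region (left, top, right, bottom) and decode it with TrOCR."""
        crop = page.crop(box).convert("RGB")
        pixel_values = processor(images=crop, return_tensors="pt").pixel_values
        generated_ids = model.generate(pixel_values, max_new_tokens=128)
        return processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

    # Usage sketch: boxes would come from the R-CNN detection stage
    # latex = region_to_latex(Image.open("page.jpg"), (50, 80, 400, 160))
    ```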

  • Navigating the Future: A Comparative Analysis of Trajectory Prediction Models

    Trajectory prediction is a challenging task due to the multimodal nature of human behavior and the complexity of multi-agent systems. In this technical report we explore three machine learning approaches that aim to tackle these challenges: Social GAN, Social-STGCNN, and EvolveGraph. Social GAN uses a variety loss to generate diverse trajectories and a pooling module to model subtle social cues. Social-STGCNN models social interactions explicitly through a graphical structure. EvolveGraph establishes a framework for forecasting the evolution of the interaction graph. We compare the advantages and disadvantages of these approaches at the end of the report.
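
    For reference, Social GAN's variety (best-of-k) loss can be written in a few lines: sample k candidate futures per agent and back-propagate only through the one closest to the ground truth, which encourages the generator to cover multiple plausible modes. Shapes and the sample count are illustrative assumptions.

    ```python
    import torch

    def variety_loss(pred_trajectories, gt_trajectory):
        """Best-of-k L2 loss. pred_trajectories: (k, T, 2) sampled futures; gt_trajectory: (T, 2)."""
        errors = ((pred_trajectories - gt_trajectory.unsqueeze(0)) ** 2).sum(dim=-1).mean(dim=-1)  # (k,)
        return errors.min()          # penalize only the closest sample

    # Usage sketch: k = 20 samples of a 12-step future
    loss = variety_loss(torch.randn(20, 12, 2), torch.randn(12, 2))
    ```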

  • Road Agent Behavior and Trajectory Prediction

    In this paper, we review deep learning models capable of predicting the future actions of various road actors.

  • Final Report - Hand Gesture Recognition

    Hand gesture recognition aims to identify hand gestures in the context of space and time. It is a subset of computer vision whose goal is to learn and classify hand gestures.

  • Image Inpainting

    Image inpainting is the task of reconstructing missing regions in an image, which is important in many computer vision applications. Some of these applications include restoration of damaged artwork, object removal, and image compression.

  • Novel View Synthesis

    In this article, we take a deep dive into the problem of novel view synthesis and discuss three different representative methods. First, we examine the classical method of light field rendering, which involves learning a 4D representation of a scene and applying multiple compression techniques. We then jump into Neural Radiance Fields (NeRFs), a modern deep learning approach that has achieved much better visual clarity and memory usage by implicitly modeling scene representations with a neural network. Our discussion of NeRFs’ drawbacks will lead us to the emergence of 3D Gaussian Splatting (3D-GS), an effective rasterization technique that explicitly encodes a scene with thousands of anisotropic Gaussians.

  • Text to Image Generation

  • Image Style Translation

    Image style translation is the process of converting an image from one style to another. We will explore two deep learning approaches to image style translation: CycleGAN and Stable Diffusion, and compare their performance on the task of converting realistic images to Monet-style paintings.

  • Trajectory Prediction

    Estimating the future positions of people or objects (agents) is a crucial capability that can serve diverse applications. These range from self-driving cars, as pursued by Waymo and Tesla, to Hawk-Eye systems that determine in-or-out calls in tennis, to analyzing crowd behavior to ensure no human crush occurs. Knowing the past, present, and, most importantly, future trajectory of agents is clearly an important tool.

  • Conditional Control of Text-to-Image Diffusion: ControlNet and FreeControl

    In the realm of artificial intelligence, text-to-image (T2I) generation has become a focal point of research. ControlNet stands out by offering users precise spatial control over T2I diffusion models. On the other hand, FreeControl represents a paradigm shift by granting users unparalleled control over T2I generation without the need for extensive training. This blog aims to provide an in-depth comparison between ControlNet and FreeControl.

  • Monocular Depth Estimation

    This document discusses the CS 188 report on various image generation methods, including traditional GANs, conditional GANs, and the Pix2Pix method. It explains the concepts, loss functions, architectures, and performance of each method. The document also covers the implementation details, training process, and post-processing techniques used in the project. The results and discussion highlight the strengths and limitations of the Pix2Pix method in generating realistic images.

  • Depth Estimation

    Depth estimation is the computer vision task of calculating the distance from a camera to objects in a scene. It has become increasingly important, with a wide variety of applications in many fields, including autonomous vehicles, medicine, and human-computer interaction. Recent deep learning methods have enhanced the accuracy, robustness, and efficiency of depth estimation in both images and videos. We discuss two different deep learning approaches to depth estimation: an unsupervised CNN and Depth Anything. We compare and contrast these approaches and expand on the existing code by combining it with other effective architectures to further enhance depth estimation capabilities.

  • Human Pose Estimation

    Human pose estimation (HPE), a pivotal task in computer vision, seeks to deduce the configuration of human body parts from images or sequences of video frames. In this post, I will examine two approaches to human pose estimation: MMPose, the most common library for HPE, and OpenPifPaf, a more lightweight, efficient pose estimation model.

  • Object Tracking

    Object tracking has been a prevalent computer vision task for many years, with many different deep learning approaches. In this report, we will explore the inner workings of two different approaches, DeepSORT for multiple object tracking and SiamRPN++ for single object tracking, comparing and contrasting their capabilities. We also briefly look at ODTrack, a more recent tracking algorithm. Finally, we include a short demo of DeepSORT with YOLOv8.
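
    A demo of this kind might be wired up roughly as follows, pairing a YOLOv8 detector with the deep-sort-realtime tracker. The package names, call signatures, and detection format are assumptions to verify against each library's documentation, not code taken from the report.

    ```python
    import cv2
    from ultralytics import YOLO
    from deep_sort_realtime.deepsort_tracker import DeepSort

    detector = YOLO("yolov8n.pt")         # lightweight YOLOv8 detector
    tracker = DeepSort(max_age=30)        # DeepSORT tracker with appearance features

    cap = cv2.VideoCapture("input.mp4")
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        result = detector(frame, verbose=False)[0]
        detections = []
        for box in result.boxes:
            x1, y1, x2, y2 = box.xyxy[0].tolist()
            # deep-sort-realtime expects ([left, top, width, height], confidence, class)
            detections.append(([x1, y1, x2 - x1, y2 - y1], float(box.conf[0]), int(box.cls[0])))
        for track in tracker.update_tracks(detections, frame=frame):
            if not track.is_confirmed():
                continue
            l, t, r, b = map(int, track.to_ltrb())
            cv2.rectangle(frame, (l, t), (r, b), (0, 255, 0), 2)
            cv2.putText(frame, f"ID {track.track_id}", (l, t - 5),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)
        cv2.imshow("DeepSORT + YOLOv8", frame)
        if cv2.waitKey(1) == 27:          # Esc to quit
            break
    cap.release()
    cv2.destroyAllWindows()
    ```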

  • Trajectory Prediction

    Trajectory prediction plays a crucial role in the autonomous vehicle (AV) pipeline. In this report we’ll explore two different machine learning approaches to trajectory prediction for AVs: a Conv-based architecture which uses rasterized semantic maps, and a GNN-based architecture which uses a vector-based representation of the scene.

  • UCLA Human Pose Estimation + Trajectory Prediction for AVs

    This report explores the topic of Human Pose Estimation - specifically its history, approaches, and application to the realm of Autonomous Driving. We discuss some of the overarching approaches taken when tackling HPE and then go in depth on some exciting developments in the space. We also connect our research to work done in Trajectory Prediction, also with a focus on autonomous driving.

  • Deep Neural Networks for Facial Recognition

    Facial recognition is the technology of identifying human beings by analyzing their faces in pictures, video footage, or in real time. Facial recognition remained a difficult problem for computer vision until recently. The introduction of deep learning techniques that can learn from large collections of face data and analyze rich, complex images of faces has made the task far more tractable, enabling new systems that are efficient and have since become even better than human vision at facial recognition.

  • 3D Model Generation through Machine Learning

    Ever since the advent of 3D graphics, computer graphics engineers have been looking for better, more streamlined ways to create models (data representing a 3D object), historically a tedious and long process. The goal of this post is to trace the history of 3D model generation through AI vision algorithms and suggest a potential method for not just generating 3D models but pre-preparing them for use in animation and game development.

    To be clear, we will be discussing how AI is leveraged to generate models in a format usable by the wider 3D community. We will not be discussing algorithms that represent 3D data in a form unusable by traditional animation software such as Blender, Maya, or Unity (e.g., unconverted NeRF data).

  • Computer Vision for Brain Tumor Detection

    In our exploration of brain tumor classification and segmentation, we examine both 2D and 3D convolutional neural networks. Brain tumor detection is a paramount issue: brain cancer claims hundreds of thousands of lives each year, and early detection from MRI scans is integral to prevention and to aiding cancer treatment. Brain tumor classification is a machine learning problem that researchers have attacked in many ways, as we elaborate on below.

  • Putting a Name to a Face – Deep Learning for Facial Classification

    Our report examines the chronological development and mechanisms behind the evolution of deep learning architectures used for facial recognition, as well as industry approaches and applications.

  • Fine-tuning Stable Diffusion with Dreambooth

    Stable Diffusion is an extremely powerful text-to-image model; however, it struggles with generating images of specific subjects. We address this by exploring the state-of-the-art fine-tuning method DreamBooth to evaluate its ability to create images with custom faces, as well as its ability to replicate custom environments.

  • Natural Image Classification

    Natural image classification involves developing and training machine learning models capable of accurately assigning labels to images depicting real-world objects, landscapes, and scenes at a scale decipherable to humans. In this project, we cover traditional approaches as well as deep learning approaches to tackle this problem.

  • Galaxy Morphology

    In this blog, we take a look at how deep learning has impacted the astronomy community through deep representation learning. It is striking that transferring representations learned on one dataset to unseen related tasks is more effective than learning everything from scratch. For astronomers, these tasks include identifying galaxies with similar morphology to a query galaxy, detecting interesting anomalies, and adapting models to new tasks with only a small number of newly labeled galaxies. In this post, we share our own results and insights about the paper “Practical Galaxy Morphology Tools from Deep Supervised Representation Learning” and its contributions, which remain meaningful for understanding neural networks today.

  • Anomaly Detection for Semantic Segmentation

    When semantic segmentation models are used for safety-critical applications such as autonomous vehicles, they have to handle unusual objects that do not fall into any of the predefined categories they were trained on. We discuss two state-of-the-art methods that attempt to detect and localize such anomalies, one using kNN and the other using generative normalizing flow models.
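
    The kNN-based idea can be sketched generically: embed pixels or patches with a segmentation backbone, then score each test embedding by its distance to the nearest in-distribution training embeddings. This is the general recipe rather than either paper's exact scoring function; shapes and the threshold are illustrative assumptions.

    ```python
    import torch

    def knn_anomaly_score(test_embeddings, train_embeddings, k=5):
        """Mean distance to the k nearest in-distribution embeddings; larger means more anomalous."""
        dists = torch.cdist(test_embeddings, train_embeddings)    # (N_test, N_train)
        knn_dists, _ = dists.topk(k, dim=-1, largest=False)       # k smallest distances per query
        return knn_dists.mean(dim=-1)                             # (N_test,)

    # Usage sketch: 256-dimensional per-patch embeddings
    scores = knn_anomaly_score(torch.randn(100, 256), torch.randn(5000, 256))
    anomalous = scores > scores.mean() + 2 * scores.std()         # illustrative threshold
    ```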

  • 3D Bounding Box Estimation Using Deep Learning and Geometry

    This blog delves into techniques for estimating 3D bounding boxes from monocular images, examining common datasets and evaluating three prominent methods. Additionally, it investigates enhancing the Deep3DBox model’s performance by incorporating temporal and stereo information, assessing the scalability of its geometric insights.

  • Mitigating Biases in Computer Vision

    In this blog post, we investigate the issue of biases and model leakage in computer vision, specifically in classification tasks. We discuss traditional and deep learning approaches to prevent biases in classification models.

  • Image Generation

    In the realm of image generation, deep learning technologies have revolutionized traditional methods, enabling advanced applications like text-to-image generation, style transfer, and image translation with unprecedented effectiveness. This blog highlights the transformative capabilities of deep learning for generating high-quality images. We will compare and contrast Generative Adversarial Networks (GANs) and diffusion models and show the strengths and applications of each.

  • Text Guided Image Editing using Diffusion

    In this study, we explore the advancements in image generation, particularly focusing on DiffEdit, an innovative approach leveraging text-conditioned diffusion models for semantic image editing (Couairon et al., 2022 [1]). Semantic image editing aims to modify images in response to textual prompts, enabling precise and context-aware alterations. A thorough comparison between DiffEdit and various traditional and deep learning-based methodologies has been conducted, highlighting its consistency and effectiveness in semantic editing tasks. Additionally, we introduce an interactive framework that integrates DiffEdit with BLIP and other text-to-image models to create a comprehensive end-to-end generation and editing pipeline. Moreover, we delve into a novel technique for text-guided mask generation within DiffEdit, proposing a method for object segmentation based solely on textual queries. We emphasize the importance of integrating AI safety and ethical considerations, ensuring our advancements are both technologically groundbreaking and responsibly implemented.
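
    The text-guided mask generation idea can be sketched as follows: noise the image and compare the denoiser's noise predictions under the reference prompt versus the query prompt, since they disagree most where the edit should happen. The noising step, normalization, and denoiser interface below are simplified assumptions, not DiffEdit's exact procedure.

    ```python
    import torch

    def diffedit_mask(denoiser, image_latent, ref_emb, query_emb, timestep, threshold=0.5, n=10):
        """Rough DiffEdit-style mask: average |eps_ref - eps_query| over several noise draws.
        `denoiser(x_t, t, cond)` stands in for a text-conditioned diffusion U-Net."""
        diffs = []
        for _ in range(n):
            x_t = image_latent + torch.randn_like(image_latent)           # simplified noising step
            eps_ref = denoiser(x_t, timestep, ref_emb)
            eps_query = denoiser(x_t, timestep, query_emb)
            diffs.append((eps_ref - eps_query).abs().mean(dim=1, keepdim=True))  # channel average
        diff = torch.stack(diffs).mean(dim=0)
        diff = (diff - diff.min()) / (diff.max() - diff.min() + 1e-8)     # normalize to [0, 1]
        return (diff > threshold).float()                                 # binary edit mask
    ```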

  • Exploring different techniques for Facial Expression Recognition (FER)

    Facial expression recognition (FER) is a pivotal task in computer vision with applications spanning from human-computer interaction to affective computing. In this project, we conduct a comparative analysis of three prominent model architectures for FER: a feature decomposition and feature reconstruction network (FDRL), a cross-fusion transformer-based network (POSTERV2), and YOLOv5. These architectures represent diverse approaches to leveraging deep learning techniques for facial expression analysis. We evaluate the performance of each model architecture on RAF-DB, which encompasses a wide range of facial expressions in various contexts. The evaluation metrics include accuracy and number of parameters, which give comprehensive insights into the models’ ability to recognize facial expressions accurately. Our evaluations show that the POSTERV2 model outperforms the other models in terms of accuracy. We also present a demonstration of the YOLOv5 model running on a webcam and train a custom model to recognize “awake” and “sleep” expressions. Our findings provide valuable insights into the strengths and limitations of different model architectures for FER, which can guide the selection of appropriate models for specific applications.

  • Vision-Language Alignment in Large Vision-Language Models: Advances and Challenges

    One recent advancement and trend at the intersection of computer vision and natural language processing is the development of large vision-language models like GPT-4V. Aligning vision and language therefore becomes increasingly crucial for models to develop higher-level joint understanding and reasoning capabilities over multi-modal information. We explore the progress and limitations of vision-language models by first giving an introduction to CLIP (Contrastive Language-Image Pre-training), an important work connecting text and images that has influenced many state-of-the-art vision-language models today. We then discuss models influenced by CLIP, and some limitations they share, which point to promising directions for future exploration.

  • The ViViT Model for Video Classification

  • Image Generation

    In this project I will provide a brief overview of the history of generative modeling with respect to computer vision and dive into an application of the transformer architecture to the field.

  • A Deep Dive Comparison of 3D Object Detection Methodologies

    This report delves into 3D object detection, a major topic in computer vision. The demand for sophisticated perception systems is especially relevant given recent developments in autonomous vehicles, robotics, and augmented reality. The ability to detect and localize objects in three-dimensional space has become increasingly important due to the high precision and accuracy these applications require, especially in the case of autonomous vehicles, where lives are at stake. This report provides an overview of 3D object detection, explaining many different methods in depth. We go over three vastly different approaches, point-based, voxel-based, and transformer-based methods, aiming to explain and compare the performance, strengths and weaknesses, and architecture of each.

  • Facial-Action-Detection

  • Mitigating Spurious Correlations in Deep Learning

    Spurious correlations pose a major challenge in training deep learning models that can generalize well to arbitrary real life data. Models often exploit easy-to-learn features that are correlated with prediction(s) in the training set but not semantically meaningful in the real world. In this post, we provide an overview of the spurious correlation problem by characterizing it mathematically, discuss some of its causes, and describe three key methods for mitigating such correlations: GroupDRO, GEORGE, and SPARE, along with some empirical results for each method.
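
    For concreteness, the GroupDRO objective mentioned above can be sketched in a few lines: keep a running weight per group, up-weight groups whose current loss is high, and minimize the weighted (approximately worst-group) loss. The step size, group count, and bookkeeping are illustrative assumptions rather than the reference implementation.

    ```python
    import torch
    import torch.nn.functional as F

    def group_dro_loss(logits, labels, group_ids, group_weights, step_size=0.01):
        """Weighted worst-group objective with an exponentiated-gradient update on group weights."""
        per_sample = F.cross_entropy(logits, labels, reduction="none")
        group_losses = torch.stack([
            per_sample[group_ids == g].mean() if (group_ids == g).any() else per_sample.new_zeros(())
            for g in range(len(group_weights))
        ])
        with torch.no_grad():                                   # weights are updated, not differentiated
            group_weights *= torch.exp(step_size * group_losses)
            group_weights /= group_weights.sum()
        return (group_weights * group_losses).sum()

    # Usage sketch: 4 groups; group_weights persists across training steps
    group_weights = torch.ones(4) / 4
    loss = group_dro_loss(torch.randn(16, 2), torch.randint(0, 2, (16,)),
                          torch.randint(0, 4, (16,)), group_weights)
    ```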