• Contrastive Language–Image Pre-training Benchmarks and Explainability

    The manual labelling of high-quality datasets for training Computer Vision models remains one of the most time-consuming tasks in Computer Vision research. Consequently, alternative methods of extracting label information from existing images and visual data are an area of focus for recent research. In this blog we explore one of the state-of-the-art approaches to pre-training image classification models, namely CLIP (Contrastive Language–Image Pre-training). Extracting latent labels from images already paired with text widely available on the internet is a promising way to fast-track the training of Computer Vision models using text and image encoders. These models demonstrate the power of contrastive pre-training to perform well at “zero-shot” classification, i.e. classifying images the model has never been trained on or seen before.
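
    As a rough illustration of the zero-shot setup described above, the sketch below scores an image against a set of candidate text labels by comparing the embeddings from CLIP's image and text encoders. It assumes OpenAI's open-source `clip` package; the image path and label set are placeholders, not from the original post.

    ```python
    # Minimal zero-shot classification sketch using OpenAI's open-source CLIP package.
    # "photo.jpg" and the label list are illustrative placeholders.
    import torch
    import clip
    from PIL import Image

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)
    labels = ["a photo of a dog", "a photo of a cat", "a photo of a car"]
    text = clip.tokenize(labels).to(device)

    with torch.no_grad():
        image_features = model.encode_image(image)
        text_features = model.encode_text(text)
        # Cosine similarity between the image embedding and each label embedding
        image_features /= image_features.norm(dim=-1, keepdim=True)
        text_features /= text_features.norm(dim=-1, keepdim=True)
        probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

    print(dict(zip(labels, probs[0].tolist())))
    ```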

  • Galaxy Detection/Classification with Computer Vision

    In this blog, I will share my experience in using a machine learning model (based on YOLO) that detects and classifies galaxies from public datasets from the Sloan Digital Sky Survey (SDSS) and Galaxy Zoo while taking CS 188: Deep Learning for Computer Vision at UCLA.

  • Object Detection Algorithms in MMDetection

    Topic: Object Detection Algorithms. This project uses MMDetection to compare object detection algorithms from the YOLO and Fast R-CNN families of models.

  • Generating Images with Diffusion Models

    This project focuses on generating images from text using diffusion models.

  • Mushroom Classification Project Proposal

    Mushrooms can be delicious, symbols in popular culture, and found nearly anywhere, but dealing with them safely can be tricky: while many look the same to the untrained eye, some mushrooms are extremely poisonous. Here, we attempted to classify 1394 types of mushrooms, with as few as 3 images available for some species, using various deep learning methods.

  • An Introduction to NeRFs

    This project is an introduction to the Neural Radiance Field (NeRF) model for generating novel views of 3D scenes. It explains what a NeRF is, how it works, and provides a complete example of how to build the original NeRF architecture using PyTorch in Google Colab.

  • Skin Lesion Classification

    It is crucial to catch skin cancer at an early stage. This can be done using image classification techniques. Specifically, this post aims to explore using different methods to handle class imbalance while classifying skin lesions using the HAM10000 dataset.
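
    One common way to handle the class imbalance mentioned above is to reweight the loss by inverse class frequency. The snippet below is a minimal PyTorch sketch of that idea; the class counts are made-up placeholders, not the real HAM10000 statistics, and this is only one of the methods such a project might compare.

    ```python
    # Minimal sketch: inverse-frequency class weights for an imbalanced 7-class problem.
    # The counts below are illustrative placeholders, not actual HAM10000 class sizes.
    import torch
    import torch.nn as nn

    class_counts = torch.tensor([4000.0, 600.0, 500.0, 300.0, 200.0, 100.0, 80.0])
    weights = class_counts.sum() / (len(class_counts) * class_counts)
    criterion = nn.CrossEntropyLoss(weight=weights)

    logits = torch.randn(8, 7)           # batch of 8 predictions over 7 lesion classes
    targets = torch.randint(0, 7, (8,))  # random ground-truth labels for the sketch
    loss = criterion(logits, targets)
    ```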

  • Trajectory Prediction

    Trajectory prediction for pedestrians and vehicles uses information about subjects' previous locations to predict where they will be in the future. It is important in contexts such as autonomous driving, but it can be complex, since subjects may respond unpredictably in a real-time environment. We examine two approaches to trajectory prediction, a stepwise goal approach with SGNet [1] [2] and a graph approach with PGP [3] [4], while also briefly examining a third model, Trajectron++ [5] [6], as a comparison. We work with the ETH / UCY (obtained through the Trajectron++ repository [6]) and nuScenes [7] datasets in these studies.

  • NeRF Models

    This project delves into the cutting-edge technology behind image-generative AIs, particularly Neural Radiance Fields (NeRF). We will investigate the research and methodologies behind NeRF and its applications in 3D scene reconstruction and rendering. Additionally, we will provide a hands-on, toy implementation using PyTorch and Google Colab.

  • Video Grounding

    Topic: Video Grounding

  • Melanoma and Skin Cancer Detection

    Skin cancer is one of the most prevalent cancers in the US and is often misdiagnosed or underdiagnosed globally. Benign and malignant lesions often appear visually similar and are only distinguishable after intensive tests. However, deep learning models have proven to be extremely accurate in the classification of melanoma and other skin cancers. Recently, convolutional neural network (CNN) models such as ResNet-50 have achieved over 85% classification accuracy on the ISIC binary melanoma classification datasets, and ensemble classifiers have reached over 95% accuracy. In this project, we explore the relevant high-performing CNN models and their efficacy when used for skin cancer classification. We run various experiments on these models to explore performance-related differences and potential issues with currently available datasets. We find that pre-trained CNNs are already quite robust in their ability to categorize the seven skin lesion categories of the ISIC 2018 dataset. However, the specific errors of the networks depend significantly on the data, demonstrating potential input-related biases and issues with the data currently available.
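
    To make the setup concrete, a minimal sketch of adapting a pretrained ResNet-50 to seven lesion categories might look like the following. This is an illustrative torchvision recipe under assumed defaults, not the exact configuration used in the project.

    ```python
    # Illustrative sketch: fine-tune a pretrained ResNet-50 for 7 skin-lesion classes.
    # Hyperparameters and the dummy batch are placeholders for demonstration only.
    import torch
    import torch.nn as nn
    from torchvision import models

    model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
    model.fc = nn.Linear(model.fc.in_features, 7)  # replace the head for 7 classes

    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    criterion = nn.CrossEntropyLoss()

    # One dummy training step to show the shape of the training loop.
    images = torch.randn(4, 3, 224, 224)
    labels = torch.randint(0, 7, (4,))
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    ```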

  • Exploring The Effectiveness Of Classification Models In Autonomous Driving Simulators Through Imitation Learning

    In this blog, we explore the effectiveness of supervised learning (classification) models that are trained by imitation learning in autonomous driving simulators.

  • Object Detection Algorithms

    This project explores different R-CNN-based deep learning algorithms used for object detection. We analyze the performance of 5 different R-CNN-based models for transfer learning on a new dataset, using OpenMMLab’s MMDetection toolbox to fine-tune these 5 models, which were pretrained on COCO, on the Aerial Maritime target dataset.

  • Human Action Recognition using Pose Estimation and Transformers

    This post is a review of different models for pose estimation and their applications to human action recognition (HAR) when combined with Transformers.

  • Adversarial Examples for DNNs

    DNN-driven image recognition has been used in many real-world scenarios, such as detecting objects or people on the road. However, DNNs can be vulnerable to adversarial examples (AEs), which are crafted by attackers and can mislead the model into predicting incorrect outputs while being hardly distinguishable by human eyes. This blog aims to introduce how to generate AEs and how to defend against these attacks (a minimal attack sketch follows the demo link below).

    YouTube link of the demo: https://youtu.be/3G267xLYzPU
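
    As a concrete example of generating an AE, the fast gradient sign method (FGSM) perturbs an input along the sign of the loss gradient with respect to the pixels. Below is a minimal PyTorch sketch of this idea; the model, input tensor, target label, and epsilon are placeholders, and the blog's own attacks may differ.

    ```python
    # Minimal FGSM sketch: perturb an input along the sign of the loss gradient.
    # The model, image tensor, label, and epsilon are illustrative placeholders.
    import torch
    import torch.nn as nn
    from torchvision import models

    model = models.resnet18(weights="IMAGENET1K_V1").eval()
    image = torch.rand(1, 3, 224, 224, requires_grad=True)  # placeholder input
    label = torch.tensor([207])                              # placeholder target class

    loss = nn.CrossEntropyLoss()(model(image), label)
    loss.backward()

    epsilon = 8 / 255
    adv_image = (image + epsilon * image.grad.sign()).clamp(0, 1).detach()
    ```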

  • Anime ViT-StyleGAN2

    A generative adversarial network (GAN) is a type of generative neural network capable of various tasks such as image creation, super-resolution, and image classification [1]. This project will explore the use of GAN models in the specific domain of anime characters and try to find improvements over current GAN models by using more advanced discriminators.

  • Trajectory Prediction in Autonomous Vehicles

    In recent years, the computer vision community has become more involved in autonomous vehicles. With ever-growing hardware support for modeling more complicated interactions between agents and street objects, trajectory prediction of traffic flows is yielding more promising results. The introduction of the Transformer has also galvanized the deep learning community, and our team will explore the application of Transformers to trajectory prediction.

  • Text Guided Image Generation

    My project investigates GLIDE, a state-of-the-art text-to-image diffusion model by OpenAI that uses classifier-free guidance. I investigated how a filtered version of this model performed by giving it different kinds of text input and tuning different hyperparameters to observe the generated output. I found that while GLIDE is definitely quite powerful at generating images, there are still quite a number of issues, including bias, image quality, novel images, and image types.

  • Deep Learning-Based Single Image Super-Resolution Techniques

    Image super-resolution is a process used to upscale low-resolution images to higher-resolution images while preserving texture and semantic data. We will outline how state-of-the-art techniques have evolved over the last decade and compare each model to its predecessor. We will also show PyTorch implementations for some of the described models.

  • Human Activity Classification Using Transformer Encoders

  • Driving Simulators

    Self-driving is a hot topic in deep learning for vision. One way of training driving models is imitation learning. In this work, we focus on reproducing the findings from “End-to-end Driving via Conditional Imitation Learning.” To do so, we utilize CARLA, a driving simulator, and emulate the models created in that paper using their provided dataset.

  • Analysis of Panoptic Image Segmentation Performance

    Panoptic segmentation is a type of image segmentation that unifies instance and semantic segmentation. In this project we compare three different panoptic segmentation models: Panoptic FPN, MaskFormer, and Mask2Former. This includes descriptions of their architectures and objective and subjective analyses of the models. We evaluated each model on the COCO validation dataset with MMDetection, compared differences in model segmentation, and visualized the attention of MaskFormer and Mask2Former with Detectron2 and HuggingFace.

  • Object Detection

    In this project, we wish to dive deep and examine the YOLO v7 (You Only Look Once) object detection model while exploring any possible improvements. YOLO v7 is one of the fastest object detection models in the world today which provides highly accurate real-time object detection. During our examination, we will inspect the layers and algorithms within YOLO v7, test the pre-trained YOLO v7 model provided by the YOLO team, and train and test YOLO v7 against different datasets.

  • Panoptic Scene Graph Generation and Panoptic Segmentation

    In this post, we’ll explore what panoptic scene graph generation and panoptic segmentation are, their implementation, and potential applications.

  • Facial Detection and Recognition


  • GAN Network Application towards Face Restoration

    Analyzing the Application of Generative Adversarial Networks For Face Restoration

  • Object Detection and Classification of Cars by Make using MMDetection

  • The Influence of Deep Learning in Dance

    Below is our setup for part 1…

  • Visual Physical Reasoning

    Our project centers on the topic of visual physical reasoning. Essentially, visual physical reasoning deals with whether computers can answer “common sense” queries about an image, such as “how many chairs are at the table?” or “where is the red cube in relation to the purple ball?”

  • Image Similarity

    Image Similarity has important applications in areas such as retrieval, classification, change detection, quality evaluation and registration. Here, we aim to improve the performance of models to pair images based on their similarities.

  • Exploring CLIP

    Abstract: This project explores CLIP (Contrastive Language-Image Pre-training), a pre-training technique that jointly trains image and text encoders to map images and text into the same feature space. This project reproduces CLIP’s zero-shot ability and the performance of a linear classifier trained on CLIP’s features. It also proposes a method to initialize few-shot classifiers, addressing the previously observed problem that few-shot classifiers can underperform the zero-shot classifier. It is found that there is a special type of overfitting in few-shot classification, which poses a significant challenge to the concept of few-shot classification. This project also trains an image generation model and a semantic segmentation model on CLIP’s features, which gives us insight into how much information CLIP’s features contain.
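
    For context, a linear probe on frozen CLIP features can be sketched as below: encode images with the frozen image encoder, then fit an ordinary logistic-regression classifier on those embeddings. The data loading here is a placeholder, and this is only the standard linear-probe baseline, not the project's proposed few-shot initialization.

    ```python
    # Minimal linear-probe sketch: fit logistic regression on frozen CLIP image features.
    # The random tensors stand in for a real labelled training split.
    import clip
    import numpy as np
    import torch
    from sklearn.linear_model import LogisticRegression

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    def extract_features(images):
        """Encode a batch of preprocessed images with the frozen CLIP image encoder."""
        with torch.no_grad():
            feats = model.encode_image(images.to(device))
        return feats.cpu().numpy()

    train_images = torch.randn(32, 3, 224, 224)           # placeholder images
    train_labels = np.random.randint(0, 10, size=32)      # placeholder labels

    classifier = LogisticRegression(max_iter=1000)
    classifier.fit(extract_features(train_images), train_labels)
    ```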

  • Language Representation for Computer Vision

    In this blog I will investigate natural language representations as they are used in computer vision. A foray into the predominant language architecture, Transformers, will be linked to tasks in image captioning and art generation.

  • Art Generation using Cycle GANS

  • Action Localization

  • CLIP and Some Applications

    CLIP (Contrastive Language-Image Pre-training) is a neural network model developed by OpenAI that combines the strengths of both computer vision and natural language processing to improve image recognition and object classification. CLIP has been shown to achieve state-of-the-art results on a wide range of image recognition benchmarks, and it has been used in various applications such as image captioning and image search. I found an open-source SimpleCLIP model online and conducted some experiments to reconstruct the model. Using ideas and methods from some papers, I attempted to address its overfitting issue.

  • Pose Estimation

    Pose estimation research has been growing rapidly and recent advances have allowed us to accurately detect the various joints in the human body from just a photo. Convolutional Neural Networks have been the medium for obtaining high performance models and in this post we explore the novel model HRNet. Applications of pose estimation include identifying classes of actions undertaken by individuals, their poses, and animation.

  • Dataset Analysis for Deepfake Detection Using Mesonet

    Abstract

    Deepfake detection models and algorithms are the future in preventing malicious attacks by entities that wish to use deepfakes to circumvent modern security measures, sway the opinions of groups of people, or simply deceive others. This analysis of a deepfake detection model can potentially bolster our confidence in future detection models. At the core of every model’s effectiveness is the data used to train and test it. Good data (generally) creates better models that are more trustworthy for deployment in real applications. In this study, the MesoNet model is used for training, testing, and evaluation. Of the three datasets we investigated, the model trained on Dataset 1 (found in Dataset Control) performed the best. We then chose various properties of an image to analyze in order to compare the three datasets: RGB colors, contrast, brightness, and entropy. After comparing the mean and standard deviation across the image properties, we found that the mean values from Dataset 1 fall between those of the other two datasets, and that Dataset 1 has higher variance in 2 of the 4 image properties observed.

  • Text Guided Image Generation

    Text-to-image models are machine learning models which take as input a natural language description and produce an image matching the given description. We will explore the architecture and design of one such model, Stable Diffusion, and explore extensions of Stable Diffusion through experiments with prompt-to-prompt editing and model finetuning.
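
    For orientation, a minimal text-to-image generation sketch with Hugging Face’s `diffusers` library is shown below; the checkpoint, prompt, and sampling settings are illustrative assumptions, not necessarily what the post itself uses, and a CUDA GPU is assumed.

    ```python
    # Minimal Stable Diffusion text-to-image sketch using the diffusers library.
    # Checkpoint, prompt, and sampler settings are illustrative placeholders.
    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    )
    pipe = pipe.to("cuda")  # assumes a CUDA GPU is available

    prompt = "a watercolor painting of a fox in a snowy forest"
    image = pipe(prompt, num_inference_steps=50, guidance_scale=7.5).images[0]
    image.save("fox.png")
    ```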

  • Deep Fake Generation

    The use of deep learning methods in deepfake generation has contributed to the rise of fake images, which raises some very serious ethical dilemmas. We will look at two different ways to generate deepfake pictures and videos, and will then focus on Image-to-Image Translation. CycleGAN and StarGAN are two models we will be studying to create deepfake images using Image-to-Image Translation.

  • Generating Images with Diffusion Models

    This project explores the latest technology behind image-generative AIs such as DALL-E 2 and Imagen. Specifically, we’ll be going over the research and techniques behind Diffusion Models, and a toy implementation in PyTorch/Colab.

  • Deepfake Generation

    This blog details Shrea Chari and Dhakshin Suriakannu’s project for the course CS 188: Computer Vision. This project discusses the topic of Deepfake Generation and takes a deep dive into three of the most popular models: Cycle-GAN, StarGAN and StyleGAN.

  • DDIM Inversion and Latent Space Manipulation

    We explore the inversion and latent space manipulation of diffusion models, particularly the denoising diffusion implicit model (DDIM), a variant of the denoising diffusion probabilistic model (DDPM) with deterministic (and acceleratable) sampling and thus a meaningful mapping from the latent space \(\mathcal{Z}\) to the image space \(\mathcal{X}\). We implement and compare optimization-based, learning-based, and hybrid inversion methods adapted from GAN inversion, and we find that optimization-based methods work well, but learning-based and hybrid methods run into obstacles fundamental to diffusion models. We also perform latent space interpolation to show that the DDIM latent space is continuous and meaningful, just like that of GANs. Lastly, we apply GAN semantic feature editing methods to DDIMs, visualizing binary attribute decision boundaries to showcase the unique interpretability of the diffusion latent space.
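
    For reference, the deterministic DDIM update that makes this inversion well defined can be written (in the DDIM paper’s notation, where \(\alpha_t\) denotes the cumulative product of the noise schedule and \(\epsilon_\theta\) the learned noise predictor, with the stochastic term \(\sigma_t\) set to 0) as:

    \[
    x_{t-1} = \sqrt{\alpha_{t-1}}\left(\frac{x_t - \sqrt{1-\alpha_t}\,\epsilon_\theta(x_t, t)}{\sqrt{\alpha_t}}\right) + \sqrt{1-\alpha_{t-1}}\,\epsilon_\theta(x_t, t)
    \]

    Because this step is deterministic, it can be run backwards to map an image to a latent \(x_T\), which is what enables the inversion and latent manipulation experiments described above.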

  • Investigation on the Robustness of CLIP

    Topic: CLIP: Text-image joint embedding

  • Multi View Stereo (MVS)

    This post provides an introduction to Multi View Stereo (MVS) and presents two deep learning-based algorithms for MVS reconstruction.

  • Deepfake Generation

    This article investigates the topic of DeepFake Generation, comparing two models, a GAN and a CNN, and analyzing their similarities and differences in the context of Faceswap. The study was conducted by implementing the DeepFaceLab model and comparing it with the Faceswap CNN model. The hypothesis is that the GAN will perform better, since it is a generative model that specializes in image generation, whereas CNNs are designed for image processing tasks such as object recognition and segmentation. The results of the study shed light on the effectiveness of these models for DeepFake Generation.

  • Object Detection

  • Deepfake Detection

    Detecting synthetic media has been an ongoing concern over the recent years due to the increasing amount of deepfakes on the internet. In this project, we will explore the different methods and algorithms that are used in deepfake detection.

  • Semantic Image Editing

    Image generation has the potential to greatly streamline creative processes for professional artists. While the current capabilities of image generators are impressive, these models aren’t able to produce good results without fail, and slightly tweaking text prompts can often result in wildly different images. Semantic image editing has the potential to alleviate this problem by letting users of image generation models slightly tweak results in order to achieve higher-quality end results. This project will explore and compare some technologies currently available for semantic image editing.

  • Medical Imaging

    Medical imaging analysis has long been a target for deep learning methods that help with diagnosing, evaluating, and quantifying medical diseases. In this study, we implement Logistic Regression and ResNet18 models for medical image classification, training them to find brain tumors in brain MRI images. Furthermore, we use our own implementation of LogisticRegression and fine-tune our ResNet18 model to better fit our needs.

  • Pose Estimation

    Introduction: This project explores the latest technology behind pose estimation. Pose estimation uses machine learning to estimate the pose of a person or animal by looking at key joint positions in the body. This project consists of 2 parts: this blog post, and a demonstration of pose estimation.