• Rethinking TCAV

    Concept-based interpretations of black-box models are often more intuitive for humans to understand. The most widely adopted approach for concept-based interpretation is the Concept Activation Vector (CAV). CAV relies on learning a linear relation between some latent representation of a given model and concepts. Linear separability is usually implicitly assumed but does not hold in general. In this project, we extend concept-based interpretation to non-linear concept functions with Concept Gradients (CG). We show that gradient-based interpretation can be adapted to the concept space and demonstrate empirically that CG outperforms CAV on both toy examples and real-world datasets.
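
    As a rough illustration (a sketch of the general idea, not the exact implementation in this project), CAV-style interpretation measures the directional derivative of a class logit along a learned linear concept vector, while a concept-gradient-style score replaces that fixed vector with the gradient of a possibly non-linear concept predictor:

    ```python
    import torch

    def cav_sensitivity(class_head, latent, cav):
        """TCAV-style score: directional derivative of the class logit
        (computed by `class_head` from a latent activation) along a fixed
        linear Concept Activation Vector `cav`."""
        latent = latent.clone().requires_grad_(True)
        grad = torch.autograd.grad(class_head(latent).sum(), latent)[0]
        return (grad * cav).sum(-1)

    def concept_gradient_score(class_head, concept_fn, latent):
        """Concept-Gradient-style score (sketch): chain-rule attribution of the
        class logit to a possibly non-linear concept predictor `concept_fn`."""
        z = latent.clone().requires_grad_(True)
        class_grad = torch.autograd.grad(class_head(z).sum(), z)[0]
        z2 = latent.clone().requires_grad_(True)
        concept_grad = torch.autograd.grad(concept_fn(z2).sum(), z2)[0]
        return (class_grad * concept_grad).sum(-1) / (concept_grad.norm(dim=-1) ** 2 + 1e-8)
    ```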

  • Weak Supervision with Heterogeneous Annotations

    Fully-supervised Convolutional Neural Networks (CNNs) have become the state of the art for semantic segmentation. However, obtaining pixel-wise annotations is prohibitively expensive, so weak supervision has become popular for reducing annotation costs. Although there has been extensive research in weak supervision for semantic segmentation, prior methods have focused solely on a single type of weak annotation (e.g. points, scribbles, bounding boxes, image tags), otherwise known as homogeneous annotations. This rigidity often forces researchers to either combine multiple algorithms or throw out valuable weak labels of a different type. Universal weak supervision methods attempt to remedy this by being compatible with several types of weak labels. Despite this, there has been little to no study of the effects of heterogeneous annotations when using universal weak supervision methods. In this work, we use the state-of-the-art universal weak supervision method, Universal Weakly Supervised Segmentation by Pixel-to-Segment Contrastive Learning (SPML), to study the effects of heterogeneous annotations. We show extensive results for several types of heterogeneous annotations and compare them with their homogeneous counterparts. In addition, we explore how information from the language domain can significantly improve weak annotation results at no further annotation cost.

  • Character Generation - StyleGAN for Pokémon

    Character design is a challenging process: artists need to create a new set of characters tailored to the specific requirements of a game or animated feature while still following basic anatomy and perspective rules. In this project, we use automation to ease the creation process. We add discriminator branches to StyleGAN and incorporate the idea of SemanticGAN to make character generation more controllable by humans.

  • Visual Counting

    Repetition is common in daily life: from a simple pendulum to the periodic day-and-night pattern of the earth, everything repeats. Counting repetitions over time from a video has interesting applications in the healthcare, sports, and fitness domains, such as tracking reps in exercises or shots in a badminton rally. Through this project, we explore the existing literature on class-agnostic counting from video and apply it to the specific scenario of counting paper bills (currency) from a video. Although machines exist for this particular task, being able to do the same with just a smartphone camera has the advantages of wide accessibility and low cost. We believe this task is also a good fit for this course, as it needs both human and AI collaboration. We also note that this seemingly simple task may be ambitious to achieve, so we plan to explore other directions, adapting existing counting mechanisms that process a video as a whole toward real-time counting systems that can be used in daily life.

  • MetaDrive: Compositional and Interactive Driving Scenarios with Human-in-the-Loop

    In this project, we incorporated human interactions into the driving scenarios of the MetaDrive platform, as a supplement to its compositionality, diversity, and flexibility. Experiments showed that human interactions can help improve RL agents’ responsiveness to traffic vehicles, obstacles, and accidents on the road.

  • Self-Learning cars under Simulator and Experiment

    This report uses evolutionary algorithms with a deep neural network to teach a car to drive itself in simulation. A 2D planar lidar in the simulation detects crashes on the track. The final goal is for the car to learn to drive itself.
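
    A minimal sketch of such a neuro-evolution loop, under the assumption of a hypothetical `simulate(params)` fitness function that runs one car and returns how far it drives before a lidar-detected crash:

    ```python
    import numpy as np

    def evolve(simulate, n_params, pop_size=50, n_elite=10, sigma=0.02, generations=100):
        """Minimal evolutionary loop: `simulate(params)` is assumed to run one car
        whose small neural-network controller (lidar readings in, steering/throttle
        out) is parameterized by `params`, returning distance driven as fitness."""
        population = [0.1 * np.random.randn(n_params) for _ in range(pop_size)]
        for _ in range(generations):
            fitness = np.array([simulate(p) for p in population])
            elites = [population[i] for i in np.argsort(fitness)[-n_elite:]]
            # next generation: keep the elites, fill the rest with mutated copies
            population = list(elites)
            while len(population) < pop_size:
                parent = elites[np.random.randint(n_elite)]
                population.append(parent + sigma * np.random.randn(n_params))
        return max(population, key=simulate)
    ```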

  • VQPy

    Video analytics has been gaining popularity in recent years, due to the easier production of videos as well as developments in computer vision algorithms and computing hardware. Traditional video analytics pipelines require developers to spend a lot of time on model optimizations and can be difficult to reuse. Efforts such as SQL-like languages have tried to approach this problem, but have only solved part of it due to restrictions such as limited expressiveness. We propose to implement a Python dialect called VQPy, as an attempt to make video queries easier even for people with no CV-related knowledge, with flexible and customizable interfaces and transparent query optimizations.

  • Leveraging CLIP for Visual Question Answering

    Recently, models (like CLIP) pre-trained on large amounts of paired multi-modal data have shown excellent zero-shot performance across vision-and-language (VL) tasks. Visual Question Answering (VQA) is one such challenging task, requiring coherent multi-modal understanding in the vision-language domain. In this project, we experiment with CLIP and a CLIP-based semantic segmentation model for the VQA task. We also analyse the performance of these models on the various types of questions in the VQA dataset, and experiment with the publicly available multilingual CLIP on multilingual VQA, which is extremely challenging given the sparse data for some languages. Through our experiments, we intend to show the zero-shot capabilities of these models and suggest ways in which they can be used creatively in a challenging task like VQA.

  • An Analytical Dive into What FID is Measuring

    Fréchet Inception Distance (FID) is a metric that measures the similarity between two sets of images as a distance. It is the gold standard today for quantitative measurement of the performance of generative models such as Generative Adversarial Networks (GANs). Qualitative inspection is often overlooked in GAN research, especially of the bad samples generated. In this work, we manually inspect approximately 40,000 GAN-generated images and pick 159 good-bad sample pairs, each of which we confirm to be close variants of the same image. We present an analysis of human-perceived image quality with respect to variations in FID scores using simple discard-and-replace schemes. We then analyze FID’s focus on images using Grad-CAM-based visualizations of the selected pairs. Our results urge against relying solely on FID for the evaluation of generators and highlight the need for additional assessment during evaluation.
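
    For reference, FID fits a Gaussian to the Inception-v3 features of each image set and reports the Fréchet distance between the two Gaussians. A minimal NumPy/SciPy sketch (feature extraction omitted):

    ```python
    import numpy as np
    from scipy import linalg

    def fid(feats_real, feats_gen):
        """FID between two sets of Inception features (shape: [N, 2048]), modeled
        as Gaussians: ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2 (S_r S_g)^{1/2})."""
        mu_r, sigma_r = feats_real.mean(0), np.cov(feats_real, rowvar=False)
        mu_g, sigma_g = feats_gen.mean(0), np.cov(feats_gen, rowvar=False)
        diff = mu_r - mu_g
        covmean = linalg.sqrtm(sigma_r @ sigma_g)   # matrix square root
        if np.iscomplexobj(covmean):                # drop tiny imaginary parts
            covmean = covmean.real
        return diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean)
    ```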

  • Measuring and Mitigating Bias in Vision-and-Language Models

    Models pre-trained on large amounts of image-caption data have demonstrated impressive performance across vision-and-language (VL) tasks. However, societal biases are serious issues in existing vision and language tasks, and careful calibration is required before deploying models in real-world settings, yet only a few recent works have paid attention to the social bias problem in these models. In this work, we first propose a retrieval-based metric to measure gender and racial biases in two representative VL models (CLIP and FIBER). Then, we propose two post-training methods for debiasing VL models: subspace-level transformation and neuron-level manipulation. We identify model output neurons or subspaces that correspond to specific bias attributes and manipulate the model outputs accordingly to mitigate these biases. Extensive experimental results on the FairFace and COCO datasets demonstrate that our methods can successfully reduce the societal bias in VL models without hurting model performance too much. We further perform analyses to show potential applications of our methods on downstream tasks, including reversing gender neurons to revise images and mitigating bias in text-driven image generation models.

  • SketcHTML - An interactive sketch to HTML converter

    The area of Web-UI design continues to evolve, often requiring a balance of effort between designers and developers to come up with a suitable user interface. One of the key challenges in standard Web-UI development is the handoff between designers and developers: converting designs to code. To this end, several works in recent years have attempted to automate this task, or to provide easy conversion from design to code. Works such as Microsoft Sketch2Code and pix2code provide automation by converting sketches and screenshots, respectively, to code. However, there remains room for further work in this domain, such as including more element types, encoding more information in the sketch, and allowing for more variability. This project seeks to improve upon the Sketch-to-Code framework by first constructing an enriched dataset of Web-UI samples, then allowing user manipulation of the generated web page through an interactive user interface, as well as allowing custom images generated using GANs.

  • Back-N-Forth (BNF): A Self-diagnosing Mechanism for Human-Collaborating ML Systems

    For interactive ML systems, we’d expect a reliable, controllable model to:

    1. Admit what it knows and what it does not know, in the spirit of the Confucian saying.
    2. Report its confusion and seek further help from human collaborators instead of silently producing low-confidence results.
    3. Use additional human-machine interactions to improve the overall experience from the user’s perspective (including, but not limited to, model performance).

    In this project, we want to study the design of a special module that can be injected into any interactive machine learning system to achieve these targets.
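
    A minimal sketch of such a module (the names here are ours, not the project’s): wrap any classifier and hand low-confidence inputs back to a human collaborator instead of silently returning them.

    ```python
    import torch
    import torch.nn.functional as F

    def predict_or_defer(model, batch, ask_human, threshold=0.7):
        """Return the model's prediction when it is confident enough; otherwise
        report the confusion and ask the human collaborator. `ask_human` is a
        hypothetical callback, e.g. a UI prompt showing the top candidates."""
        with torch.no_grad():
            probs = F.softmax(model(batch), dim=-1)
        conf, pred = probs.max(dim=-1)
        outputs = []
        for i in range(len(pred)):
            if conf[i] >= threshold:
                outputs.append(int(pred[i]))                 # confident: answer directly
            else:
                outputs.append(ask_human(batch[i], probs[i]))  # back-and-forth step
        return outputs
    ```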

  • Object Removal using combination of segmentation and image inpainting

    The project focuses on the problem of object removal using a combination of segmentation and image inpainting techniques. We plan to develop a system that automatically detects target objects using weakly supervised, language-driven semantic or instance segmentation models, removes them, and recovers the background with image inpainting techniques.
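
    A minimal sketch of the pipeline, assuming the binary mask comes from a language-driven segmentation model and using classical OpenCV inpainting as a stand-in for a learned inpainting network:

    ```python
    import cv2
    import numpy as np

    def remove_object(image_bgr, object_mask, dilate_px=7):
        """Erase the masked object and fill the hole from the surrounding background.
        `object_mask` is a binary HxW array (1 where the target object is)."""
        kernel = np.ones((dilate_px, dilate_px), np.uint8)
        mask = cv2.dilate(object_mask.astype(np.uint8) * 255, kernel)  # cover object boundaries
        return cv2.inpaint(image_bgr, mask, 5, cv2.INPAINT_TELEA)
    ```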

  • Virtual Try-on on Videos

    Image-based Virtual Try-On focuses on transferring a desired clothing item onto a person’s image seamlessly without using 3D information of any form. A key objective of virtual try-on models is to align the in-shop garment with the corresponding body parts in the person image; the problem becomes challenging due to the spatial misalignment between the garment and the person’s image. With recent advances in deep learning and Generative Adversarial Networks (GANs), intensive studies have tackled this task and achieved moderately successful results. The subsequent task in this direction is to apply virtual try-on to videos, which has many applications in fashion, e-commerce, and other sectors. We started from an existing state-of-the-art image virtual try-on model and added various state-of-the-art techniques to improve performance on videos. We included optical flow obtained from a FlowNet model to improve the overall smoothness of the video. In previous video virtual try-on work, depth has not been taken into consideration to improve video quality. In a novel approach, to improve the fit of the cloth, we used depth information and trained several models, including ResNet [7], DenseNet [8], and CSPNet [9]. The video quality improved after adding these training tasks. Finally, we augmented the dataset by adding different backgrounds to the videos and trained the above models to understand the effect of background on virtual try-on.

  • Independent Causal Mechanism for Robust Deep Neural Networks

    In order to generalize machine learning models and address the distribution shift problem, we propose a different solution based on independent mechanisms. We want to find effective independent mechanisms, or experts, that help us generalize across human faces. With the causal mechanisms we implemented, we can detect human faces more robustly.

  • Explore wider usage of CLIP

    We explore wider usage of CLIP, a large-scale self-supervised model. We believe CLIP has more uses than what is shown in the original paper, as it has great feature extraction ability.
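
    For concreteness, this is the kind of reusable feature extraction we have in mind, following the standard usage from the original CLIP repository (file names and prompts here are placeholders):

    ```python
    import torch
    import clip
    from PIL import Image

    model, preprocess = clip.load("ViT-B/32")
    image = preprocess(Image.open("example.jpg")).unsqueeze(0)
    texts = clip.tokenize(["a photo of a cat", "a photo of a dog"])

    with torch.no_grad():
        img_feat = model.encode_image(image)   # reusable image features
        txt_feat = model.encode_text(texts)    # reusable text features
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
        probs = (100.0 * img_feat @ txt_feat.T).softmax(dim=-1)  # zero-shot scores
    ```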

  • Zero-shot Object Localization With Image-Text Models and Saliency Methods

    Recent works in computer vision have proposed image-text foundation models that subsume both vision and language pre-training. Methods such as CLIP, ALIGN, and CoCa have demonstrated success in generating powerful representations for a variety of downstream tasks, from traditional vision tasks such as image classification to multimodal tasks such as visual question answering and image captioning. In particular, these methods showed impressive zero-shot capabilities: CLIP and CoCa achieve 76.2% and 86.3% top-1 classification accuracy on ImageNet, respectively, without explicitly training on any images from the dataset. Motivated by these observations, in this work we explore the effectiveness of image-text foundation model representations for zero-shot object localization. We propose a variant of Score-CAM that considers both vision and language inputs (VL-Score). VL-Score generates a saliency map for an image conditioned on a user-provided textual query, from intermediate features of pre-trained image-text foundation models. When the query asks for an object in the image, our method returns the corresponding localization result. We quantitatively evaluate our method on the ImageNet validation set and demonstrate ground-truth localization accuracy comparable to state-of-the-art weakly supervised object localization methods. We also provide a Streamlit interface that enables users to experiment with different image and text combinations. The code is released on GitHub.
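
    A rough sketch of the Score-CAM-style idea behind a text-conditioned saliency map (simplified; the activation maps are assumed to be collected from the visual backbone with a forward hook, and details differ from the actual VL-Score implementation):

    ```python
    import torch
    import torch.nn.functional as F

    def text_conditioned_score_cam(clip_model, image, text_tokens, act_maps):
        """Score-CAM-style saliency conditioned on a text query. `act_maps`
        (C, h, w) are intermediate activation maps from the visual backbone;
        each map masks the input image, and the CLIP image-text similarity of
        the masked image weights that map's contribution."""
        _, _, H, W = image.shape
        with torch.no_grad():
            txt = F.normalize(clip_model.encode_text(text_tokens), dim=-1)
            weights = []
            for m in act_maps:
                m = (m - m.min()) / (m.max() - m.min() + 1e-8)
                mask = F.interpolate(m[None, None], size=(H, W), mode="bilinear")
                img = F.normalize(clip_model.encode_image(image * mask), dim=-1)
                weights.append((img @ txt.T).squeeze())
            weights = torch.softmax(torch.stack(weights), dim=0)
        saliency = (weights[:, None, None] * act_maps).sum(0).clamp(min=0)
        return F.interpolate(saliency[None, None], size=(H, W), mode="bilinear")[0, 0]
    ```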

  • Exploring EdgeGAN, Object Generation From Sketch

    The latest generative models, such as DALL·E, are already capable of generating high-resolution photorealistic images and imaginative scenes in different styles. However, a fully autonomous model that generates an entire image is not always what is needed; in real life, people often want to edit only part of an image. We find that generating isolated objects from sketches is helpful, since the model generates only what we want. In this project, we adapt and explore the EdgeGAN model [1].

  • Table Structure Recognition

    In this work, we explore ways to improve LaTeX table structure recognition. The task involves generating LaTeX code for an input table image. This will be useful in making LaTeX more accessible to everyone. We also get to understand some of the challenges involved in applying deep learning techniques, normally developed for natural images, to a different domain, specifically table images.

  • Image Captioning with CLIP

    Image captioning is a fundamental task in vision-language understanding, which aims to provide a meaningful and valid natural-language caption for a given input image. Most existing image captioning models rely on a pre-trained visual encoder. CLIP is a neural network that has demonstrated strong zero-shot capability on many vision tasks. In our project, we further investigate the effectiveness of CLIP models for image captioning.
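
    One common way to couple CLIP with a caption decoder, shown here purely as a hedged sketch (a ClipCap-style prefix mapping, not necessarily the approach taken in this project): a small MLP maps the CLIP image embedding to a sequence of pseudo-token embeddings that are prepended to the caption tokens fed to GPT-2.

    ```python
    import torch
    from transformers import GPT2LMHeadModel

    class PrefixCaptioner(torch.nn.Module):
        """Sketch: map a frozen CLIP image embedding to `prefix_len` pseudo-token
        embeddings, prepend them to the caption tokens, and decode with GPT-2."""
        def __init__(self, clip_dim=512, prefix_len=10):
            super().__init__()
            self.gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")
            self.emb_dim = self.gpt2.config.n_embd
            self.prefix_len = prefix_len
            self.map = torch.nn.Sequential(
                torch.nn.Linear(clip_dim, self.emb_dim * prefix_len), torch.nn.Tanh()
            )

        def forward(self, clip_embed, caption_ids):
            prefix = self.map(clip_embed).view(-1, self.prefix_len, self.emb_dim)
            tokens = self.gpt2.transformer.wte(caption_ids)    # caption token embeddings
            inputs = torch.cat([prefix, tokens], dim=1)
            return self.gpt2(inputs_embeds=inputs).logits      # train with LM loss on the caption part
    ```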

  • An Overview of Deep Learning for Curious People (Sample post)

    Starting earlier this year, I grew a strong curiosity about deep learning and spent some time reading about the field. To document what I’ve learned and to provide some interesting pointers to people with similar interests, I wrote this overview of deep learning models and their applications.