Ever wonder how Facebook can tell you which friends to tag in your photos, or how Google automatically makes collages and animations for you? This lecture is all about that: we'll teach you the basics of computer vision using convolutional neural networks so you can build your own algorithms to automatically analyze your visual data!
Ever wonder how Facebook tells you which friends to tag in your photos, or how Siri can even understand your request? In this meeting we'll dive into convolutional neural networks and give you all the tools to build smart systems such as these. Join us in learning how we can grant our computers the gifts of hearing and sight!
Abstract: We present an application of back-propagation networks to hand-written digit recognition. Minimal preprocessing of the data was required, but the architecture of the network was highly constrained and specifically designed for the task. The input to the network consists of normalized images of isolated digits. The method achieves a 1% error rate and about a 9% reject rate on zip code digits provided by the US Postal Service.
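To make the idea concrete, here is a minimal sketch of a small convolutional network for isolated-digit classification in the spirit of the constrained architecture the abstract describes. The 16x16 input size, layer widths, and the use of PyTorch are illustrative assumptions, not the paper's exact specification.

```python
import torch
import torch.nn as nn

class DigitNet(nn.Module):
    """Tiny convolutional classifier for normalized digit images (sketch only)."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=5, padding=2),   # shared weights scanned over the image
            nn.Tanh(),
            nn.AvgPool2d(2),                              # subsampling: 16x16 -> 8x8
            nn.Conv2d(8, 16, kernel_size=5, padding=2),
            nn.Tanh(),
            nn.AvgPool2d(2),                              # 8x8 -> 4x4
        )
        self.classifier = nn.Linear(16 * 4 * 4, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: a batch of normalized 1x16x16 digit images
        h = self.features(x)
        return self.classifier(h.flatten(1))

model = DigitNet()
logits = model(torch.randn(32, 1, 16, 16))  # -> shape (32, 10), one score per digit class
```

The constraint the abstract refers to comes from the convolutional weight sharing and subsampling: the network is forced to learn local, position-invariant features rather than arbitrary fully-connected mappings.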
Some of the hardest aspects of Machine Learning are the details. Almost every algorithm we use is sensitive to "hyperparameters," which affect the initialization, the speed of optimization, and even whether the model can become accurate at all. We'll cover the general heuristics you can use to figure out which hyperparameters matter, how to find good values for them, what you can do to make models more resilient, and the like (see the sketch below for one common approach). This workshop will be pretty "down in the weeds," but it will give you a better intuition about Machine Learning and its shortcomings.
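As one example of these heuristics, here is a minimal sketch of random hyperparameter search with log-scale sampling for the learning rate. The `train_and_evaluate` function and the particular search ranges are hypothetical placeholders you would replace with your own model and data.

```python
import math
import random

def sample_config():
    return {
        # Learning rate varies over orders of magnitude, so sample its
        # exponent uniformly rather than the value itself.
        "learning_rate": 10 ** random.uniform(-5, -1),
        "batch_size": random.choice([16, 32, 64, 128]),
        "dropout": random.uniform(0.0, 0.5),
    }

def train_and_evaluate(config):
    # Placeholder: train a model with `config` and return validation accuracy.
    # A fake score is returned here so the sketch runs end to end.
    return random.random()

best_score, best_config = -math.inf, None
for _ in range(20):                      # budget of 20 random trials
    config = sample_config()
    score = train_and_evaluate(config)
    if score > best_score:
        best_score, best_config = score, config

print("best config:", best_config, "val accuracy:", best_score)
```

Random search like this is often preferred over exhaustive grid search because it explores more distinct values of the hyperparameters that actually matter for the same trial budget.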
Abstract: We present a model that generates natural language descriptions of images and their regions. Our approach leverages datasets of images and their sentence descriptions to learn about the inter-modal correspondences between language and visual data. Our alignment model is based on a novel combination of Convolutional Neural Networks over image regions, bidirectional Recurrent Neural Networks over sentences, and a structured objective that aligns the two modalities through a multimodal embedding. We then describe a Multimodal Recurrent Neural Network architecture that uses the inferred alignments to learn to generate novel descriptions of image regions. We demonstrate that our alignment model produces state-of-the-art results in retrieval experiments on the Flickr8K, Flickr30K, and MSCOCO datasets. We then show that the generated descriptions significantly outperform retrieval baselines on both full images and on a new dataset of region-level annotations.
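For intuition, here is a minimal sketch of the encoder-decoder pattern behind image captioning: a CNN summarizes the image into a feature vector that conditions an RNN language model over the caption. This is a simplified stand-in rather than the paper's alignment model; the vocabulary size, dimensions, and use of PyTorch are assumptions for illustration.

```python
import torch
import torch.nn as nn

class CaptionNet(nn.Module):
    """Toy CNN encoder + LSTM decoder for image captioning (sketch only)."""
    def __init__(self, vocab_size: int = 1000, embed_dim: int = 128, hidden_dim: int = 256):
        super().__init__()
        # Tiny CNN encoder: image -> single feature vector
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, hidden_dim),
        )
        # RNN decoder: previous words -> distribution over the next word
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.to_vocab = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images: torch.Tensor, captions: torch.Tensor) -> torch.Tensor:
        # Use the image feature as the decoder's initial hidden state.
        h0 = self.encoder(images).unsqueeze(0)         # (1, batch, hidden_dim)
        c0 = torch.zeros_like(h0)
        out, _ = self.rnn(self.embed(captions), (h0, c0))
        return self.to_vocab(out)                      # (batch, seq_len, vocab_size)

model = CaptionNet()
images = torch.randn(4, 3, 64, 64)
captions = torch.randint(0, 1000, (4, 12))             # token ids of partial captions
logits = model(images, captions)                       # next-word scores at each position
```

The paper's contribution goes beyond this sketch by aligning individual image regions with sentence fragments through a multimodal embedding, but the same image-conditioned language-model idea is at its core.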