visual language model

techniques have improved on not only text data but also computer vision recently. Here we focus on Visual Language Model (VLM) based on transformers. In the begining, some researchers try to extend BERT to process visual data and make a success. For example, visual-BERT and ViL-BERT achive strong performances on many visual tasks by training on two different objectives: 1) masked modeling task that aims to predict the missing part of a given input; and 2) a match task that aims to predict if the text and the image content are matched....