The video explores an interesting paper seeing how easily ( fine-tuning effort) pre-trained visual embeddings can be combined with text captions for visual question generation in the BERT model. This video explores the approach the authors take for this, applications of vision-language models, and question generation. Thanks for watching! Please Subscribe!
Paper Links:
BERT Can See Out of the Box:
BERT:
ImageBERT: https://arxiv.