OFA
Paper|Mar 5, 2023|Last edited: Jul 24, 2023
  • task-agnostic and modality-agnostic
  • no extra task-specific layers required
  • pretrained on only 20M publicly available image-text pairs
omni-model: three required properties
  1. Task-Agnostic (TA)
  • a unified formulation covering classification, generation, and self-supervised pretext tasks
  • the same formulation in pretraining and finetuning
  2. Modality-Agnostic (MA)
  • a single input/output scheme shared by all modalities
  3. Task Comprehensiveness (TC)
  • enough variety of tasks to accumulate generalization ability
Why omni-models have not been built so far
  • models add extra learnable components for finetuning, e.g. task-specific heads, adapters, soft prompts
    • ties the model architecture to specific tasks
    • creates a discrepancy between pretraining and finetuning
    • cannot handle unseen tasks in a zero-shot way
  • task-specific formulation
    • tasks differ in formulation and training objectives
    • this violates TA, and scaling up the number of tasks to achieve TC becomes cumbersome
  • entangling modality representations with downstream tasks
    • e.g. it is common practice for Vision-Language models to take detected objects as part of the image input features; although this improves downstream performance on some closed-domain datasets, it depends on an extra object detector, which usually fails on open-domain data
OFA: unifying architectures, tasks, and modalities
  1. supports Task Comprehensiveness
  2. pretrained on publicly available datasets of 20M image-text pairs
  3. achieves competitive performance in zero-shot learning
    • it can transfer to unseen tasks with new task instructions and adapt to out-of-domain information without finetuning

I/O & Architecture

I/O

  • image
    • a ResNet module convolves the image xv ∈ ℝ^(H×W×C) into patch features
      • in contrast to complex, resource- and time-consuming object feature extraction
    • follows the hybrid convolution-plus-transformer design of CoAtNet
    • only the first three ResNet blocks are used
      • the ResNet is trained jointly with the transformer
    • image quantization (discrete image codes for generation targets)
    • sparse coding
  • text
    • same as BERT and GPT
    • byte-pair encoding (BPE)
    • transform the text into a subword sequence, then embed it into features
    • the embedding layer shares parameters with the decoder softmax layer
  • object
    • extract the label and bounding box
    • continuous corner coordinates are discretized into location tokens ⟨x1, y1, x2, y2⟩
    • labels use BPE tokens
Text, images, and objects are all discretized and represented as tokens from a single unified vocabulary
  • one unified vocabulary for all tokens
    • subwords, image codes, location bins
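The unified-vocabulary idea above can be sketched as follows. This is a hypothetical illustration: the vocabulary sizes, number of location bins, and helper names are made up for clarity, not OFA's exact values.

```python
# Sketch of a unified discrete vocabulary: text subwords, image codebook
# entries, and location bins all map into one shared token id space.
# (Sizes below are illustrative, not the paper's actual configuration.)

TEXT_VOCAB = 50000   # BPE subword tokens
IMAGE_CODES = 8192   # image quantization codebook entries
NUM_BINS = 1000      # discretization bins for corner coordinates

def text_token(bpe_id: int) -> int:
    """Subword ids occupy the first slice of the unified vocabulary."""
    return bpe_id

def image_token(code_id: int) -> int:
    """Image codebook ids are offset past the text slice."""
    return TEXT_VOCAB + code_id

def location_token(coord: float, image_size: float) -> int:
    """A continuous corner coordinate becomes a discrete bin token."""
    bin_id = min(int(coord / image_size * NUM_BINS), NUM_BINS - 1)
    return TEXT_VOCAB + IMAGE_CODES + bin_id

# A bounding box <x1, y1, x2, y2> becomes four location tokens:
box = [10.0, 20.0, 200.0, 180.0]
box_tokens = [location_token(c, 256.0) for c in box]
```

With everything living in one token space, the decoder can emit text, image codes, and coordinates with a single softmax, which is what makes the seq2seq unification work.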

Architecture

Decouple token embeddings and patch embeddings from their positional information
  • 1D relative position bias for text
  • 2D relative position bias for images
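A minimal sketch of how a 2D relative position bias can be indexed, assuming the common formulation in which one learned scalar per attention head is shared by every patch pair with the same 2D offset (OFA's exact parameterization may differ):

```python
# For an h x w patch grid, map each pair of patches to an index into a
# learned bias table of size (2h-1)*(2w-1); bias[idx[i][j]] is then added
# to the attention logit between patches i and j.

def relative_position_index(h: int, w: int):
    coords = [(y, x) for y in range(h) for x in range(w)]
    n = len(coords)
    idx = [[0] * n for _ in range(n)]
    for i, (yi, xi) in enumerate(coords):
        for j, (yj, xj) in enumerate(coords):
            dy = yi - yj + (h - 1)   # shift offsets to be non-negative
            dx = xi - xj + (w - 1)
            idx[i][j] = dy * (2 * w - 1) + dx
    return idx
```

Because the bias depends only on the relative offset, the content embedding of a patch stays decoupled from where it sits in the grid, which is the point of the "decouple embeddings from position" design above.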

Tasks & Modalities

seq2seq paradigm
Both cross-modal and unimodal tasks are expressed as seq2seq generation plus instructions
Multimodal tasks
  • visual grounding
    • input: image and text description
    • instruction: "Which region does the text xt describe?"
    • output: bounding box (location tokens)
  • image-text matching
    • input: image-text pair
    • output: yes/no
  • image captioning
    • input: image
    • output: caption
  • visual question answering
    • input: image and question
    • output: answer
  • grounded captioning
    • input: image with a bounding box
    • output: a description of the region inside the bounding box
Language tasks
  • cloze (masked text infilling)
Vision tasks
  • mask the middle part of an image
    • instruction: "What is the image in the middle part?"
    • output: the image content (as image codes)
  • object detection
    • instruction: "What are the objects in the image?"
    • output: object locations and labels
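The task list above can be condensed into one formatting function: every task becomes an (instruction + inputs) → target-sequence pair. The instruction strings follow the paper; the helper itself and the `<bin_*>` spelling of location tokens are hypothetical.

```python
# Sketch: heterogeneous tasks collapse into one seq2seq interface.
# Each task yields (source instruction text, target token string).

def make_example(task: str, **kw):
    if task == "caption":
        return ("What does the image describe?", kw["caption"])
    if task == "vqa":
        return (kw["question"], kw["answer"])
    if task == "grounding":
        return (f'Which region does the text "{kw["text"]}" describe?',
                " ".join(f"<bin_{b}>" for b in kw["box_bins"]))
    if task == "detection":
        return ("What are the objects in the image?", kw["objects"])
    raise ValueError(f"unknown task: {task}")

src, tgt = make_example("grounding", text="a red car",
                        box_bins=[39, 78, 781, 703])
# tgt: '<bin_39> <bin_78> <bin_781> <bin_703>'
```

Since every task reduces to this shape, one encoder-decoder with one cross-entropy loss covers all of them, and a new instruction is all that is needed to specify an unseen task.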

Training & Inference

loss: cross-entropy
Decoding strategies
  • plain beam search
    • hard to optimize over the entire vocabulary
    • the model may generate invalid labels
    • poor performance on downstream classification tasks
  • trie-based search over a prefix tree of valid labels
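The trie-based search can be sketched as follows: the candidate labels are inserted into a prefix tree, and at each decoding step the model's distribution is masked so only children of the current trie node can be emitted. Token ids here are illustrative; a real decoder would apply this mask to the model's per-step logits.

```python
EOS = -1  # hypothetical end-of-sequence token id

class TrieNode:
    def __init__(self):
        self.children = {}

def build_trie(label_token_ids):
    """Insert every tokenized label (terminated by EOS) into a prefix tree."""
    root = TrieNode()
    for seq in label_token_ids:
        node = root
        for tok in seq + [EOS]:
            node = node.children.setdefault(tok, TrieNode())
    return root

def allowed_next(root, prefix):
    """Token ids the decoder is allowed to emit after generating `prefix`."""
    node = root
    for tok in prefix:
        node = node.children[tok]
    return set(node.children)

# Three candidate labels, already tokenized (ids are made up):
labels = [[5, 7], [5, 9, 2], [8]]
trie = build_trie(labels)
# Step 0: only 5 or 8 can start a valid label; after emitting 5, only 7 or 9.
```

Constraining the search this way guarantees every decoded sequence is a valid label, which fixes both failure modes of plain beam search listed above.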