OFA
- Task-agnostic and modality-agnostic
- No extra task-specific layers are needed
- Uses only 20M publicly available image-text pairs
omni-model:
- Task-Agnostic (TA)
- Covers classification, generation, and self-supervised pretext tasks
- Agnostic to either pretraining or finetuning
- Modality-Agnostic (MA)
- Unified input and output representation across all modalities
- Task Comprehensiveness (TC)
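- Enough task variety to accumulate generalization ability
Why an omni-model cannot be built at present
- Models are designed with extra learnable components for finetuning, e.g. task-specific heads, adapters, soft prompts
- This ties the model structure to specific tasks
- This creates a discrepancy between pretraining and finetuning
- Unseen tasks cannot be handled in a zero-shot manner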
- task-specific formulation
- Tasks differ in formulation and training objectives
- This violates TA, and scaling up the number of tasks to reach TC becomes burdensome
- Modality representations are entangled with downstream tasks
- For example, Vision-Language models commonly take detected objects as part of the image input features. This improves downstream performance on some closed-domain datasets, but it depends on an extra object detector, which usually fails on open-domain data
OFA: unifying architectures, tasks, and modalities
- supports Task Comprehensiveness
- pretrained on publicly available datasets of 20M image-text pairs
- achieves competitive performance in zero-shot learning
- it can transfer to unseen tasks with new task instructions and adapt to out-of-domain information without finetuning
I/O & Architecture
I/O
- image
- Convolve the image $x_v \in \mathbb{R}^{H \times W \times C}$ with ResNet modules
- Simpler than complex, resource- and time-consuming object feature extraction
- CoAtNets
- Use the first three blocks of ResNet
- The ResNet is trained jointly with the Transformer
- image quantization
- sparse coding
- text
- same as BERT and GPT
- byte-pair encoding (BPE)
- Transform the text into a subword sequence, then embed the subwords into features
- The embedding layer shares parameters with the decoder softmax layer
- object
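- Extract the labels and bounding boxes
- Continuous corner coordinates are discretized into location tokens ⟨x1, y1, x2, y2⟩
- Labels are represented as BPE tokens
Text, images, and objects are discretized and represented by tokens from one unified vocabulary
- A single unified vocabulary covers all tokens
- Subwords, image codes, and location tokens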
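The discretization above can be sketched roughly as follows. This is only an illustration under stated assumptions: the bin count `NUM_BINS`, the token spellings `<loc_i>` and `<code_i>`, and both helper functions are hypothetical names chosen for this note, not taken from the OFA code.

```python
NUM_BINS = 1000  # assumed number of location bins per axis (hypothetical value)


def box_to_location_tokens(box, width, height, num_bins=NUM_BINS):
    """Map continuous corner coordinates (x1, y1, x2, y2) to location tokens."""
    x1, y1, x2, y2 = box

    def quantize(value, size):
        # Normalize to [0, 1], then map to an integer bin in [0, num_bins - 1].
        ratio = min(max(value / size, 0.0), 1.0)
        return min(int(ratio * num_bins), num_bins - 1)

    bins = [quantize(x1, width), quantize(y1, height),
            quantize(x2, width), quantize(y2, height)]
    return [f"<loc_{b}>" for b in bins]


def build_unified_vocab(subwords, num_image_codes, num_bins=NUM_BINS):
    """Put subwords, image codes, and location tokens into one id space."""
    tokens = list(subwords)
    tokens += [f"<code_{i}>" for i in range(num_image_codes)]
    tokens += [f"<loc_{i}>" for i in range(num_bins)]
    return {token: index for index, token in enumerate(tokens)}


# A box covering the top-left quarter of a 640x480 image.
location_tokens = box_to_location_tokens((0, 0, 320, 240), width=640, height=480)
vocab = build_unified_vocab(["a", "b"], num_image_codes=3, num_bins=4)
```

Because all three token types share one id space, the decoder can emit text, image codes, or box coordinates through the same softmax layer.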
Architecture
Decouple the position correlation from the token embeddings and patch embeddings
- 1D relative position bias (for text)
- 2D relative position bias (for images)
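A minimal sketch of how 1D and 2D relative position indices could be turned into attention biases. The clipping scheme for text and the `(2H-1)*(2W-1)` table layout for images are assumptions made for illustration, and the random tables stand in for learned parameters; the exact bucketing used by OFA may differ.

```python
import numpy as np


def relative_index_1d(length, max_distance):
    """Index into a bias table of size 2*max_distance+1 for each (query, key) pair."""
    pos = np.arange(length)
    rel = pos[None, :] - pos[:, None]              # rel[i, j] = j - i
    rel = np.clip(rel, -max_distance, max_distance)
    return rel + max_distance                      # shift into [0, 2*max_distance]


def relative_index_2d(height, width):
    """Index into a bias table of size (2H-1)*(2W-1) for each pair of patches."""
    ys, xs = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    ys, xs = ys.reshape(-1), xs.reshape(-1)        # patches in row-major order
    dy = ys[None, :] - ys[:, None] + (height - 1)  # vertical offset in [0, 2H-2]
    dx = xs[None, :] - xs[:, None] + (width - 1)   # horizontal offset in [0, 2W-2]
    return dy * (2 * width - 1) + dx


rng = np.random.default_rng(0)
text_table = rng.normal(size=2 * 4 + 1)            # stand-in for learned scalars (one head)
text_bias = text_table[relative_index_1d(6, 4)]    # (6, 6), added to text attention logits
image_table = rng.normal(size=(2 * 3 - 1) * (2 * 3 - 1))
image_bias = image_table[relative_index_2d(3, 3)]  # (9, 9), added to patch attention logits
```

The bias depends only on the offset between two positions, so it is added to the attention logits rather than mixed into the token or patch embeddings themselves.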
Tasks & Modalities
seq2seq paradigm
Both cross-modal and uni-modal tasks are cast as seq2seq generation + instructions
Multimodal learning
- visual grounding
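- Input: image and description
- Which region does the text x^t describe?
- Output: bounding box
- image-text matching
- Input: image-text pair
- Output: yes/no
- image captioning
- Input: image
- Output: caption
- visual question answering
- Input: image and question
- Output: answer
- grounded captioning
- Input: image with a bounding box
- Output: description of the bounding box content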
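Language tasks
- Cloze-style text infilling
Vision tasks
- Mask the middle part of the image (image infilling)
- What is the image in the middle part?
- Output: image
- Object detection
- What are the objects in the image?
- Output: object locations and labels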
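To make the "seq2seq generation + instructions" idea concrete, here is a small sketch that turns each task into an (image, instruction, target) triple. The three question templates come from the notes above; treating the VQA question itself as the instruction is an assumption, and `Seq2SeqExample`, `TEMPLATES`, and `make_example` are hypothetical names for this illustration only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Seq2SeqExample:
    image: Optional[str]  # placeholder for an image path or patch features
    instruction: str      # text fed to the encoder together with the image
    target: str           # token sequence the decoder is trained to generate


TEMPLATES = {
    "visual_grounding": "Which region does the text {text} describe?",
    "image_infilling": "What is the image in the middle part?",
    "object_detection": "What are the objects in the image?",
    "vqa": "{text}",  # assumption: the question itself serves as the instruction
}


def make_example(task, image, target, text=""):
    """Fill the task's instruction template and pair it with the target sequence."""
    instruction = TEMPLATES[task].format(text=text)
    return Seq2SeqExample(image=image, instruction=instruction, target=target)


grounding = make_example(
    "visual_grounding",
    image="example.jpg",
    target="<loc_12> <loc_40> <loc_300> <loc_480>",
    text="a man in blue",
)
detection = make_example(
    "object_detection",
    image="example.jpg",
    target="<loc_12> <loc_40> <loc_300> <loc_480> person",
)
```

Every task shares the same data shape, so no task-specific head is needed: only the instruction and the target vocabulary region change.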
Training & Inference
loss: cross-entropy
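Written out, the token-level cross-entropy for a seq2seq model takes the form below. The symbols are assumed notation for this note: $x$ is the input (image and/or text), $s$ the instruction, $y$ the target sequence, and $\theta$ the model parameters.

$$
\mathcal{L} = -\sum_{i=1}^{|y|} \log P_{\theta}\left(y_i \mid y_{<i}, x, s\right)
$$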
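Decoding strategy
- beam search
- Hard to optimize over the entire vocabulary
- The model may generate invalid labels outside the closed label set
- Poor performance on downstream classification tasks
- trie-based search (prefix tree)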
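A rough sketch of trie-constrained decoding: candidate labels are stored in a prefix tree, and at each step only tokens that continue a valid label may be chosen. For brevity this uses greedy search rather than beam search, and `score_fn` is a hypothetical stand-in for the model's next-token scores; `Trie`, `EOS`, and `constrained_greedy_decode` are illustrative names, not the OFA implementation.

```python
EOS = "</s>"  # hypothetical end-of-sequence marker


class Trie:
    """Prefix tree over the token sequences of all valid labels."""

    def __init__(self):
        self.children = {}
        self.is_end = False

    def insert(self, tokens):
        node = self
        for token in tokens:
            node = node.children.setdefault(token, Trie())
        node.is_end = True

    def allowed_next(self, prefix):
        """Return the tokens that may follow `prefix` without leaving the label set."""
        node = self
        for token in prefix:
            node = node.children.get(token)
            if node is None:
                return []
        candidates = list(node.children)
        if node.is_end:
            candidates.append(EOS)
        return candidates


def constrained_greedy_decode(trie, score_fn, max_len=10):
    """Pick the best-scoring allowed token at each step until EOS."""
    prefix = []
    for _ in range(max_len):
        candidates = trie.allowed_next(prefix)
        if not candidates:
            break
        best = max(candidates, key=lambda token: score_fn(prefix, token))
        if best == EOS:
            break
        prefix.append(best)
    return prefix


label_trie = Trie()
for label in (["golden", "retriever"], ["golden", "gate"], ["cat"]):
    label_trie.insert(label)

# Toy scores standing in for model probabilities.
toy_scores = {"golden": 0.9, "cat": 0.1, "retriever": 0.7, "gate": 0.2, EOS: 0.5}
decoded = constrained_greedy_decode(label_trie, lambda prefix, token: toy_scores[token])
```

Since every step is restricted to the trie's children, the output is always one of the inserted labels, which addresses the invalid-label problem listed above.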