OFA | BigCileng’s CyberMansion

Task-agnostic and modality-agnostic

No extra task-specific layers required

Pretrained on only 20M publicly available image-text pairs

omni-model:

Task-Agnostic (TA)

Covers classification, generation, and self-supervised pretext tasks

Agnostic to both pretraining and finetuning

Modality-Agnostic (MA)

Handles the inputs and outputs of all modalities

Task Comprehensiveness (TC)

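Enough task variety to accumulate generalization ability

Why an omni-model cannot be built today

Models add extra learnable components designed for finetuning, e.g., task-specific heads, adapters, soft prompts

The model structure becomes task-specific
A discrepancy arises between pretraining and finetuning
Unseen tasks cannot be handled in a zero-shot manner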

task-specific formulation

differ in task formulation and training objectives
Violates TA, and scaling up the number of tasks to reach TC becomes burdensome

Entangling modality representation with downstream tasks

For example, it is common practice for Vision-Language models to take detected objects as part of the image input features. Though this gives better downstream performance on some closed-domain datasets, it depends on an extra object detector, which usually fails on open-domain data.

OFA: unifying architectures, tasks, and modalities

supports Task Comprehensiveness

pretrained on publicly available datasets of 20M image-text pairs

achieves competitive performance in zero-shot learning

it can transfer to unseen tasks with new task instructions and adapt to out-of-domain information without finetuning

I/O & Architecture

I/O

image

Convolve the image x_v ∈ ℝ^(H × W × C) with ResNet modules

Simpler than complex, resource- and time-consuming object feature extraction

CoAtNets
Uses the first three blocks of ResNet

The ResNet is trained jointly with the Transformer model

image quantization
sparse coding

text

same as GPT and BART
byte-pair encoding (BPE)
transform the text into a subword sequence, then embed the subwords into features
The embedding layer shares its parameters with the decoder softmax layer
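
The parameter sharing between the embedding layer and the decoder softmax layer can be illustrated with a toy sketch. This is plain Python under assumed shapes (vocabulary size 3, hidden size 2), not OFA's actual code; `embed` and `output_logits` are hypothetical helper names.

```python
# Toy weight tying: one matrix serves as both the input embedding table
# and the output projection. Sizes and values are illustrative assumptions.

def embed(token_ids, embedding):
    """Look up one embedding row per token id."""
    return [embedding[t] for t in token_ids]

def output_logits(hidden, embedding):
    """Score every vocabulary entry as dot(hidden, embedding[v]).

    Reusing `embedding` here is what "sharing parameters" means:
    no separate output matrix is learned.
    """
    return [sum(h * e for h, e in zip(hidden, row)) for row in embedding]

embedding = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # V = 3, d = 2
hidden = [2.0, -1.0]                               # a decoder hidden state
logits = output_logits(hidden, embedding)
```

A softmax over `logits` would then give the next-token distribution.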

object

Extract the labels and bounding boxes
The continuous corner coordinates are discretized into location tokens 〈x1, y1, x2, y2〉
Labels are represented with BPE tokens
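
A minimal sketch of turning continuous corner coordinates into location tokens. The bin count (1000), the rounding rule, and the `<loc_k>` token spelling are assumptions made for illustration; the notes only state that coordinates become location tokens.

```python
def box_to_location_tokens(box, img_w, img_h, num_bins=1000):
    """Map pixel corners (x1, y1, x2, y2) to four discrete location tokens.

    num_bins and the "<loc_k>" naming are hypothetical choices.
    """
    x1, y1, x2, y2 = box

    def quantize(value, size):
        # Normalize to [0, 1], scale to a bin index, round, then clamp.
        idx = int(value / size * (num_bins - 1) + 0.5)
        return max(0, min(num_bins - 1, idx))

    bins = [
        quantize(x1, img_w),
        quantize(y1, img_h),
        quantize(x2, img_w),
        quantize(y2, img_h),
    ]
    return [f"<loc_{b}>" for b in bins]

tokens = box_to_location_tokens((0.0, 120.0, 320.0, 480.0), img_w=640, img_h=480)
# tokens == ["<loc_0>", "<loc_250>", "<loc_500>", "<loc_999>"]
```

Normalizing by image size first keeps the token range independent of resolution.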

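Discretize text, images, and objects, and represent them with tokens from a unified vocabulary

Use a single unified vocabulary for all tokens

Subwords, image codes, location tokens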
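
One way to picture the unified vocabulary is as a single id space with three contiguous ranges. The sketch below is an assumption about layout only; the token spellings `<code_i>` and `<loc_i>` and the tiny sizes are hypothetical.

```python
def build_unified_vocab(subwords, num_image_codes, num_location_bins):
    """Assign ids from one shared space: subwords, then image codes, then location tokens."""
    vocab = {}
    for tok in subwords:                    # text subwords (e.g. from BPE)
        vocab[tok] = len(vocab)
    for i in range(num_image_codes):        # discrete image codes
        vocab[f"<code_{i}>"] = len(vocab)
    for i in range(num_location_bins):      # discretized coordinates
        vocab[f"<loc_{i}>"] = len(vocab)
    return vocab

vocab = build_unified_vocab(["a", "dog", "run", "ning"], num_image_codes=4, num_location_bins=3)
```

With one id space, a single output softmax can emit text, image codes, or locations.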

Architecture

img

Decouple the position correlation from token embeddings and patch embeddings

1D relative position bias

2D relative position bias
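
A minimal sketch of how relative position bias is commonly indexed: each (query, key) pair looks up a learned scalar by its offset, and that scalar is added to the attention logit. This is the plain, un-bucketed form under stated assumptions; OFA's own implementation may bucket distances or differ in other details.

```python
def relative_index_1d(length):
    """index[i][j] selects the bias for offset (j - i) from a table of size 2*length - 1."""
    return [[(j - i) + (length - 1) for j in range(length)] for i in range(length)]

def relative_index_2d(height, width):
    """Same idea on a height x width patch grid: one table entry per (dy, dx) offset.

    The table has (2*height - 1) * (2*width - 1) entries.
    """
    coords = [(y, x) for y in range(height) for x in range(width)]
    index = []
    for yi, xi in coords:
        row = []
        for yj, xj in coords:
            dy = (yj - yi) + (height - 1)   # shift into 0 .. 2*height - 2
            dx = (xj - xi) + (width - 1)    # shift into 0 .. 2*width - 2
            row.append(dy * (2 * width - 1) + dx)
        index.append(row)
    return index

idx1 = relative_index_1d(3)      # e.g. a 3-token sequence
idx2 = relative_index_2d(2, 2)   # e.g. a 2 x 2 patch grid
```

In use, `attention_logit[i][j] += bias_table[index[i][j]]`, where `bias_table` is a learned vector (hypothetical name).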

Tasks & Modalities

seq2seq paradigm

Both cross-modal and uni-modal tasks take the form of seq2seq generation plus instructions

Multimodal learning

visual grounding

Input: image and description
Which region does the text x_t describe?
Output: bounding box

image-text matching

Input: image-text pair
Output: yes/no

image captioning

Input: image
Output: caption

visual question answering

Input: image and question
Output: answer

grounded captioning

Input: image with a bounding box
Output: description of the content inside the bounding box

Language tasks

Cloze (text infilling)

Vision tasks

Mask out the middle part of the image

What is the image in the middle part?
Output: image

Object detection

What are the objects in the image?
Output: object locations and labels
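
The "seq2seq generation + instructions" idea can be sketched as a table of instruction templates that all feed the same model. Only the three instructions quoted in these notes are used; the task keys, the `build_example` helper, and the `<loc_k>` target spelling are hypothetical, and the image input is omitted for brevity.

```python
# Every task becomes (instruction text -> target sequence) for one shared model.
INSTRUCTIONS = {
    "visual_grounding": "Which region does the text {text} describe?",
    "image_infilling": "What is the image in the middle part?",
    "object_detection": "What are the objects in the image?",
}

def build_example(task, target, text=None):
    """Return a source/target pair; only templates with a {text} slot are formatted."""
    template = INSTRUCTIONS[task]
    instruction = template.format(text=text) if "{text}" in template else template
    return {"source": instruction, "target": target}

example = build_example(
    "visual_grounding",
    target="<loc_0> <loc_250> <loc_500> <loc_999>",
    text="a dog on the grass",
)
```

Adding a task then means adding a template, not a new head.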

Training & Inference

loss: cross-entropy
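
For reference, a cross-entropy loss for seq2seq generation is usually written as the negative log-likelihood of the target sequence. The symbols are assumptions for this sketch: x is the input, s the instruction, y the target sequence, and θ the model parameters.

```latex
\mathcal{L} = -\sum_{i=1}^{|y|} \log P_{\theta}\left(y_i \mid y_{<i},\, x,\, s\right)
```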

Decoding strategy

beam search

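Hard to optimize over the entire vocabulary
The model may generate invalid labels
Poor performance on downstream classification tasks

trie-based search (prefix tree)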
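
A minimal sketch of trie-constrained decoding: candidate labels are stored in a prefix tree, and at each step only tokens that continue some valid label may be chosen, so no invalid label can be generated. Greedy selection stands in for beam search to keep it short; `score_fn`, the `<eos>` marker, and the toy scores are hypothetical stand-ins for the model.

```python
def build_trie(sequences):
    """Store each label (a token list) in a nested-dict prefix tree, ending with <eos>."""
    root = {}
    for seq in sequences:
        node = root
        for tok in seq:
            node = node.setdefault(tok, {})
        node["<eos>"] = {}
    return root

def allowed_next(trie, prefix):
    """Tokens that may follow `prefix`; an empty list if the prefix leaves the trie."""
    node = trie
    for tok in prefix:
        node = node.get(tok)
        if node is None:
            return []
    return sorted(node)

def constrained_greedy_decode(trie, score_fn, max_len=10):
    """At each step, pick the best-scoring token among those the trie allows."""
    prefix = []
    for _ in range(max_len):
        candidates = allowed_next(trie, prefix)
        if not candidates:
            break
        best = max(candidates, key=lambda tok: score_fn(prefix, tok))
        if best == "<eos>":
            break
        prefix.append(best)
    return prefix

labels = [["golden", "retriever"], ["golden", "gate", "bridge"], ["cat"]]
trie = build_trie(labels)
# Toy scores: "dog" scores highest but is not a valid label, so it is never picked.
toy_scores = {"dog": 5.0, "retriever": 0.9, "golden": 0.8, "gate": 0.5, "cat": 0.1}
result = constrained_greedy_decode(trie, lambda prefix, tok: toy_scores.get(tok, 0.0))
```

Here `result` is `["golden", "retriever"]`, a member of the closed label set.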