Reading Notes on "Neural Motifs"

2019-07-04

Paper: Neural Motifs: Scene Graph Parsing with Global Context

Code: code

Introduction

Some of the concepts used in the paper were unclear to me, so I looked up related reading notes and summarized them below.

Scene Graph = Objects + Relationships

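A scene graph represents a scene as objects plus the relationships between them; scene graph parsing means predicting those objects and relationships.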
Motifs

The paper defines motifs as "regularly appearing substructures in scene graphs", i.e. substructures that recur across scene graphs.
IoU

IoU (Intersection over Union): the ratio of the intersection area of two boxes to the area of their union.

IoU = Area of Overlap / Area of Union
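
To make the formula concrete, here is a minimal sketch for two axis-aligned boxes. The `(x1, y1, x2, y2)` corner format and the function name `iou` are my own assumptions for illustration, not something taken from the paper's code.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Corners of the overlap rectangle.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Clamp at zero so disjoint boxes get no overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Two 2x2 boxes overlapping in a 1x1 square: 1 / (4 + 4 - 1).
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1/7, about 0.143
```

Identical boxes give 1.0 and disjoint boxes give 0.0.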
RoI

RoI stands for Region of Interest.
SGD with momentum

Stochastic gradient descent with momentum; see this blog post for details.
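
As a reminder of what the update looks like, here is a minimal sketch of the classic momentum rule, v <- momentum * v - lr * g followed by theta <- theta + v. The function name and the values of `lr` and `momentum` are illustrative assumptions, not the settings used in the paper.

```python
def sgd_momentum_step(params, grads, velocity, lr=0.1, momentum=0.9):
    """One update: v <- momentum * v - lr * g, then theta <- theta + v."""
    new_velocity = [momentum * v - lr * g for v, g in zip(velocity, grads)]
    new_params = [p + v for p, v in zip(params, new_velocity)]
    return new_params, new_velocity


# Minimise f(x) = x^2 (gradient 2x), starting from x = 5.
params, velocity = [5.0], [0.0]
for _ in range(200):
    grads = [2.0 * p for p in params]
    params, velocity = sgd_momentum_step(params, grads, velocity)
print(params[0])  # close to 0
```

The velocity accumulates past gradients, so consecutive steps in the same direction grow while oscillating ones partly cancel.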
RPN

RPN (Region Proposal Network): first introduced as part of Faster R-CNN, where it is used to generate candidate boxes (region proposals).
- Anchor
  
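  An anchor is a candidate box of fixed size and shape. The Faster R-CNN paper uses three scales, small (128), medium (256) and large (512), and three aspect ratios, so the 3 × 3 combinations give 9 kinds of anchor.
  
  These 9 anchors slide across the feature map, so every position on the feature map has 9 anchors, for a total of (H/16) × (W/16) × 9 anchors (the feature map has size (H/16) × (W/16)). With each image in the dataset resized to 800 × 600, this gives:
  
  $\lceil 800/16 \rceil \times \lceil 600/16 \rceil \times 9 = 50 \times 38 \times 9 = 17100$
  
  So a single image has about 17,000 anchors (on the order of 20,000), which is close to brute-force enumeration of every possible anchor.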
  
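  See this Zhihu column, which explains how Faster R-CNN works in detail. Its author points out:
  
  The RPN lays out a dense grid of candidate anchors at the scale of the original image, then uses a CNN to decide which anchors are positive (contain an object) and which are negative (contain no object). It is in fact a binary classification task.
Question:
- How are positive anchors obtained? (With a softmax classifier.)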
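
The anchor arithmetic from the Anchor item can be checked with a short sketch. The function names are hypothetical. The aspect ratios 1:2, 1:1 and 2:1 are the ones I recall from the Faster R-CNN paper and should be treated as an assumption here, since these notes only say "three ratios".

```python
import math


def count_anchors(width, height, stride=16, anchors_per_cell=9):
    """Anchors in total when every feature-map cell carries the same anchor set."""
    feat_w = math.ceil(width / stride)
    feat_h = math.ceil(height / stride)
    return feat_w * feat_h * anchors_per_cell


def anchor_shapes(scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """(width, height) per anchor: area is scale**2 and height / width is ratio."""
    shapes = []
    for s in scales:
        for r in ratios:
            shapes.append((s / math.sqrt(r), s * math.sqrt(r)))
    return shapes


print(count_anchors(800, 600))  # 17100
print(len(anchor_shapes()))     # 9
```

Changing the stride or the input size shows how quickly the anchor count grows, which is why the proposals are later filtered by score and by NMS.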
NMS

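NMS (Non-Maximum Suppression) is an effective way of keeping only local maxima.

In object detection, many candidate boxes are found in a single image, and we need to decide which to keep and which to discard.

For example, suppose 6 candidate boxes are found in a picture of a car. Sorted by the classifier's probability of being a car, from lowest to highest, they are A, B, C, D, E, F.
1. Start from F, the box with the highest probability, and check for each of A to E whether its IoU with F exceeds a chosen threshold.
2. Suppose the IoU of B and of D with F exceeds the threshold: discard B and D, and mark F as the first box to keep.
3. From the remaining boxes A, C and E, pick the one with the highest probability, E. Compute the IoU of E with A and with C, discard any box whose IoU exceeds the threshold, and mark E as the second box to keep.
4. Repeat until no candidate boxes are left.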
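
The four steps above can be sketched in plain Python as follows. This is a minimal greedy version written for these notes, not the implementation used by any detection library; the box format `(x1, y1, x2, y2)`, the threshold of 0.5 and the example boxes are assumptions.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Return the indices of the kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)          # highest-scoring remaining box
        keep.append(best)
        # Drop every remaining box that overlaps the kept one too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (21, 21, 31, 31)]
scores = [0.9, 0.8, 0.7, 0.95]
print(nms(boxes, scores))  # [3, 0]
```

Box 3 is kept first and suppresses box 2; box 0 is kept next and suppresses box 1.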

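Analysis of the VG Dataset

Before building their model, the authors analyze the Visual Genome (VG) dataset and find several useful regularities.

1. Prevalent Relations in Visual Genome

The authors group the objects and relations in VG into categories. The relations fall into three main types: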

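Geometric: 50.0% of the dataset
Possessive: 40.9% of the dataset
Semantic: 8.7% of the dataset

The authors also find that clothing and part objects are connected mainly by possessive relations, whereas furniture and building objects are connected almost entirely by geometric relations. Moreover, relations are not unique; for example, the following two relations describe the same scene: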

wheel on bike (wheel is the head object)
bike has wheel (bike is the head object)

A scene graph consists of objects and relationships. The authors measure how accurately the remaining elements of a relation can be predicted when the head object, the tail object, or the relationship is known. Given the labels of the head and tail objects, the relation can be predicted well; given only the relation, the head or tail object cannot.

"This suggests that when predicting the relation between objects, we should use the labels of the head and tail."
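
To see why head and tail labels carry so much information, here is a small sketch of a count-based predictor that picks the most frequent predicate for each (head, tail) label pair. The function names are hypothetical and the triples are made up for illustration; they are not real Visual Genome statistics.

```python
from collections import Counter, defaultdict


def fit_frequency_prior(triples):
    """Count how often each predicate occurs for every (head, tail) label pair."""
    counts = defaultdict(Counter)
    for head, predicate, tail in triples:
        counts[(head, tail)][predicate] += 1
    return counts


def predict_relation(counts, head, tail):
    """Most frequent predicate for the pair, or None if the pair was never seen."""
    if (head, tail) not in counts:
        return None
    return counts[(head, tail)].most_common(1)[0][0]


# Toy triples (head, predicate, tail), invented for this example.
triples = [
    ("man", "wearing", "shirt"),
    ("man", "wearing", "shirt"),
    ("man", "has", "shirt"),
    ("wheel", "on", "bike"),
    ("bike", "has", "wheel"),
]
prior = fit_frequency_prior(triples)
print(predict_relation(prior, "man", "shirt"))   # wearing
print(predict_relation(prior, "bike", "wheel"))  # has
```

Swapping head and tail changes the prediction (`on` versus `has`), which matches the wheel/bike example above.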

2. Larger Motifs

Scene graphs not only have local structure but have higher order structure as well.

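A blog post explains this sentence in more detail: besides the local structure (prior) described above, a scene graph also shows similar regularities at the global level. In addition, the authors find in their experiments that 50% of images contain a motif of length 2.

"This suggests that relation prediction should take global context into account, i.e. the motifs that appear across the whole graph, which are also related to one another."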

Model

The analysis of the VG dataset above shows that:

Predicted object labels may depend on one another, and predicted relation labels may depend on predicted object labels.

The authors factor the probability of a scene graph $G$ given an image $I$ into three terms, each handled by one module of the model:

$Pr(G|I) = Pr(B|I)Pr(O|B, I)Pr(R|B,O,I)$

$Pr(B|I)$: the object detection model, which predicts the probability of the bounding boxes.
$Pr(O|B, I)$: the object model, which predicts the probability of the object labels. It linearizes $B$ into a sequence that an LSTM then processes to create a contextualized representation of each box.
$Pr(R|B, O, I)$: the relation model, which predicts the probability of the relations. It linearizes the set of predicted labeled objects, $O$, and processes them with another LSTM to create a representation of each object in context.

1. Bounding Boxes

A Faster R-CNN pretrained on VG is used to predict the labels and bounding boxes of the objects in the input image.

Input: the image
Output: $[(b_1, f_1, l_1), \dots, (b_n, f_n, l_n)]$, where $B_i = (b_i, f_i, l_i)$
- $b_i$: the bounding box of region $i$
- $f_i$: the Faster R-CNN feature vector of that region
- $l_i$: the probability distribution over object labels for that region
Experimental details
1. Use Faster RCNN with a VGG backbone as our underlying object detector.
2. To control for detector performance in evaluating different scene graph models, we first pretrain the detector on Visual Genome objects.

2. Objects

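Based on the output of the previous step, build a contextualized representation for object prediction. First arrange $B$ into a linear sequence, then pass it through a bidirectional LSTM to obtain contextualized features $c_i$:

$C = \mathrm{biLSTM}([f_i; W_1 l_i]_{i = 1, \dots, n})$

$C = [c_1, \dots, c_n]$

$W_1$: a parameter matrix that maps the label distribution $l_i$ to a 100-dimensional vector
$c_i$: the hidden state of the final LSTM layer for region $i$
Input: $f_i$, $l_i$
Output: the object context features $C$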

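Next, the context is decoded by another LSTM:

$h_i = \mathrm{LSTM}_i([c_i; \hat{o}_{i-1}])$

$\hat{o}_i = \mathrm{argmax}(W_o h_i) \in \mathbb{R}^{|\mathcal{C}|}$ (one-hot)

$h_i$: the hidden state of the final layer of this LSTM
$\hat{o}_i$: the predicted object label, a one-hot vector over the $|\mathcal{C}|$ object classes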
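
The decoding loop can be sketched without a real LSTM. In the sketch below, `step_fn` is a hypothetical stand-in for the decoder LSTM followed by the projection $W_o$, and the all-zero vector used before the first box is my own assumption. Only the control flow (argmax, one-hot, feed back into the next step) follows the equations above.

```python
def one_hot(index, size):
    return [1.0 if k == index else 0.0 for k in range(size)]


def greedy_decode(contexts, step_fn, num_classes):
    """Decode one label per box, feeding each prediction into the next step.

    step_fn(c_i, prev_one_hot, state) -> (logits, new_state) is a stand-in
    for the decoder LSTM plus the projection W_o.
    """
    prev = [0.0] * num_classes  # assumption: no previous label before the first box
    state = None
    labels = []
    for c in contexts:
        logits, state = step_fn(c, prev, state)
        label = max(range(num_classes), key=lambda k: logits[k])  # argmax
        labels.append(label)
        prev = one_hot(label, num_classes)
    return labels


# Toy stand-in: the logits are the context plus a bonus for repeating the previous label.
def toy_step(c, prev, state):
    return [ci + 0.5 * pi for ci, pi in zip(c, prev)], state


contexts = [[1.0, 0.0, 0.0], [0.6, 0.9, 0.0], [0.0, 0.2, 0.9]]
print(greedy_decode(contexts, toy_step, num_classes=3))  # [0, 0, 2]
```

Without the feedback term the second box would be labeled 1 instead of 0, which is the kind of dependency between object labels the model is meant to capture.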

3. Relations

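Similarly, a bidirectional LSTM builds a contextualized representation of the bounding boxes and objects:

$D = \mathrm{biLSTM}([c_i; W_2 \hat{o}_i]_{i = 1, \dots, n})$

$D = [d_1, \dots, d_n]$

$W_2$: a parameter matrix that maps the predicted object label $\hat{o}_i$ to a 100-dimensional vector

Next, predict the possible relation between object i and object j:

$g_{i, j} = (W_h d_i) \circ (W_t d_j) \circ f_{i, j}$

$Pr(x_{i \rightarrow j}|B, O) = \mathrm{softmax}(W_r g_{i, j} + w_{o_i, o_j})$

$f_{i, j}$: the feature of the union of the regions $b_i, b_j$ of object i and object j
- extracted for object i and object j with the Faster R-CNN
$W_h, W_t$: parameter matrices that map the head and tail object representations to 4096-dimensional vectors
$w_{o_i, o_j}$: a bias vector specific to the head and tail labels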
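
The two relation equations can be traced with tiny hand-picked numbers. This is a sketch in plain Python with made-up 2- and 3-dimensional vectors instead of the real 4096-dimensional ones; the helper names and all weights are assumptions for illustration, and `bias` plays the role of $w_{o_i, o_j}$.

```python
import math


def matvec(W, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def relation_distribution(d_i, d_j, f_ij, W_h, W_t, W_r, bias):
    """softmax(W_r ((W_h d_i) * (W_t d_j) * f_ij) + bias), with * element-wise."""
    head = matvec(W_h, d_i)
    tail = matvec(W_t, d_j)
    g = [h * t * f for h, t, f in zip(head, tail, f_ij)]
    logits = [s + b for s, b in zip(matvec(W_r, g), bias)]
    return softmax(logits)


# Made-up sizes: contexts in R^2, projections in R^3, 2 relation classes.
W_h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_t = [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]]
W_r = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
probs = relation_distribution(
    d_i=[1.0, 2.0], d_j=[2.0, 0.0], f_ij=[1.0, 1.0, 1.0],
    W_h=W_h, W_t=W_t, W_r=W_r, bias=[0.0, 0.0],
)
print(probs)
```

Here the head projection is [1, 2, 3], the tail projection is [1, 2, 0], so g = [1, 4, 0] and the logits are [1, 4]; the second relation class gets most of the probability mass. Because head and tail use different matrices, swapping i and j generally changes the result.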

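The VG Dataset

Last week ljr, a senior labmate, took the time to walk me through the VG dataset. I have organized the file information below.

The VG dataset consists of the following files: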

neural_motif
├── dataset
│   └── VG_100K
└── data
    └── stanford_filtered
        ├── image_data.json
        ├── VG-SGG.h5
        └── VG-SGG-dicts.json

VG_100K holds all the images of the VG dataset, 108077 in total.

image_data.json

A list of length 108077 holding basic information about each image. Printing the entry for the first image gives:
{'width': 800, 'url': 'https://cs.stanford.edu/people/rak248/VG_100K_2/1.jpg', 'height': 600, 'image_id': 1, 'coco_id': None, 'flickr_id': None}
As shown, this JSON file stores the width, height, id, URL and other basic information of each image.

VG-SGG.h5

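The keys in the h5 file are:

active_object_mask
# shape(1145398, 1)
# Each image has several bounding boxes; the dataset contains 1145398 bounding boxes in total

boxes_1024
# shape(1145398, 4)
# With images rescaled to size 1024: the x, y coordinates of each bounding box's top-left corner and of its bottom-right corner

boxes_512
# shape(1145398, 4)
# With images rescaled to size 512: the x, y coordinates of each bounding box's top-left corner and of its bottom-right corner

img_to_first_box
# shape(108073,)
# Index of the first bounding box of each image

img_to_first_rel
# shape(108073,)
# Index of the first relation of each image

img_to_last_box
# shape(108073,)
# Index of the last bounding box of each image

img_to_last_rel
# shape(108073,)
# Index of the last relation of each image

labels
# shape(1145398, 1)
# Object label of each bounding box

predicates
# shape(622705, 1)
# Predicate label of each relationship

relationships
# shape(622705, 2)
# Indices of the two bounding boxes linked by each relationship

split
# shape(108073,)
# Train/test split of the images: 0 marks a training image, 2 marks a test image
# image_data.json lists 108077 images, but after removing 4 images in other formats such as gif, 108073 valid images remain
# The training set has 75651 images and the test set has 32422 images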
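
To show how the per-image index arrays fit together, here is a sketch that uses small Python lists as stand-ins for the h5 arrays; the real arrays would be read from VG-SGG.h5 (for example with h5py). The function names and toy values are made up, and I am assuming the "last box" index is inclusive, which is what its description suggests.

```python
def boxes_for_image(img_idx, img_to_first_box, img_to_last_box, boxes):
    """Rows of `boxes` that belong to one image (last index assumed inclusive)."""
    first = img_to_first_box[img_idx]
    last = img_to_last_box[img_idx]
    return boxes[first:last + 1]


def split_indices(split, train_flag=0, test_flag=2):
    """Image indices of the training and test sets."""
    train = [i for i, s in enumerate(split) if s == train_flag]
    test = [i for i, s in enumerate(split) if s == test_flag]
    return train, test


# Toy stand-ins: 3 images and 5 boxes, values invented for illustration.
boxes = [(0, 0, 10, 10), (5, 5, 20, 20), (1, 1, 4, 4), (2, 2, 8, 8), (0, 0, 3, 3)]
img_to_first_box = [0, 2, 3]
img_to_last_box = [1, 2, 4]
split = [0, 0, 2]

print(boxes_for_image(0, img_to_first_box, img_to_last_box, boxes))
print(split_indices(split))  # ([0, 1], [2])
```

The same pattern applies to `img_to_first_rel` and `img_to_last_rel` with the `relationships` and `predicates` arrays.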

VG-SGG-dicts.json

The keys in this file are:

VG-SGG-dicts.json
├── object_count
├── idx_to_label
├── predicate_to_idx
├── predicate_count
├── idx_to_predicate
└── label_to_idx

Printing data['object_count'] gives:

{'bowl': 5152, 'cap': 4130, 'boy': 8125, 'person': 41278, 'boat': 6164, 'sneaker': 2212, 'beach': 3693, 'paper': 4714, 'lady': 3335, 'paw': 2595, 'counter': 4632, 'snow': 9437, 'neck': 4545, 'motorcycle': 4537, 'bike': 4884, 'fork': 2470, 'stand': 2934, 'food': 6057, 'clock': 4877, 'guy': 2823, 'wave': 5938, 'vegetable': 2264, 'elephant': 4870, 'bus': 5118, 'zebra': 4449, 'wire': 4017, 'dog': 4651, 'basket': 2666, 'mouth': 3630, 'board': 4649, 'leaf': 13984, 'wheel': 9236, 'hand': 17497, 'flower': 8228, 'giraffe': 4953, 'light': 14182, 'number': 3294, 'box': 5467, 'tree': 49902, 'child': 3799, 'vehicle': 3943, 'laptop': 3077, 'train': 6411, 'ski': 3529, 'animal': 3611, 'sidewalk': 9478, 'racket': 2610, 'bird': 4339, 'lamp': 4054, 'orange': 2536, 'sign': 23499, 'cabinet': 4476, 'branch': 6161, 'shirt': 33920, 'room': 2552, 'bag': 7391, 'seat': 4360, 'hat': 8366, 'pot': 2726, 'window': 42466, 'letter': 6630, 'boot': 2714, 'head': 21376, 'airplane': 2633, 'tower': 2957, 'sheep': 4117, 'plate': 11632, 'jean': 5535, 'girl': 7543, 'face': 8419, 'player': 5567, 'cow': 4531, 'trunk': 5191, 'skateboard': 3898, 'house': 5006, 'tie': 2654, 'pillow': 5499, 'kite': 3706, 'jacket': 10463, 'logo': 4606, 'rock': 7922, 'surfboard': 4117, 'plant': 6478, 'windshield': 3177, 'sock': 3159, 'chair': 11936, 'table': 19064, 'shelf': 5659, 'people': 14415, 'tire': 6353, 'plane': 5058, 'men': 2037, 'bench': 5589, 'track': 7054, 'glass': 10951, 'truck': 4624, 'engine': 2673, 'fence': 12027, 'vase': 2853, 'sink': 3456, 'helmet': 6306, 'roof': 6453, 'kid': 2264, 'ear': 12069, 'man': 54659, 'cat': 3720, 'book': 4297, 'nose': 6035, 'pizza': 3678, 'woman': 26910, 'arm': 9951, 'car': 17352, 'post': 6163, 'handle': 7996, 'cup': 4305, 'glove': 4169, 'railing': 3049, 'drawer': 2612, 'bear': 3487, 'desk': 2969, 'banana': 4566, 'shoe': 12419, 'leg': 22335, 'curtain': 3241, 'street': 10996, 'pant': 13147, 'mountain': 4894, 'finger': 3059, 'building': 31805, 'flag': 3238, 'wing': 4455, 'phone': 2645, 
'door': 13354, 'tile': 7619, 'towel': 3231, 'coat': 4285, 'bed': 3296, 'tail': 8821, 'screen': 2498, 'eye': 5619, 'skier': 2567, 'toilet': 2702, 'umbrella': 6389, 'hill': 4457, 'fruit': 2366, 'short': 7807, 'bottle': 6246, 'hair': 17422, 'pole': 21205, 'horse': 5315}

Printing data['idx_to_label'] gives:

{'118': 'ski', '87': 'pant', '74': 'leg', '133': 'tower', '117': 'skateboard', '102': 'racket', '59': 'handle', '22': 'building', '64': 'horse', '80': 'motorcycle', '2': 'animal', '129': 'tile', '109': 'sheep', '67': 'jean', '46': 'finger', '93': 'pillow', '15': 'book', '51': 'fruit', '17': 'bottle', '98': 'player', '62': 'helmet', '57': 'hair', '35': 'curtain', '106': 'room', '123': 'stand', '55': 'glove', '142': 'vehicle', '144': 'wheel', '7': 'beach', '130': 'tire', '56': 'guy', '103': 'railing', '134': 'track', '149': 'woman', '5': 'banana', '23': 'bus', '108': 'seat', '78': 'man', '89': 'paw', '50': 'fork', '146': 'windshield', '37': 'dog', '116': 'sink', '13': 'board', '141': 'vegetable', '127': 'tail', '10': 'bench', '101': 'pot', '92': 'phone', '38': 'door', '76': 'light', '40': 'ear', '111': 'shirt', '114': 'sidewalk', '45': 'fence', '41': 'elephant', '48': 'flower', '4': 'bag', '86': 'orange', '53': 'girl', '137': 'truck', '32': 'counter', '95': 'plane', '69': 'kite', '26': 'car', '131': 'toilet', '126': 'table', '9': 'bed', '60': 'hat', '33': 'cow', '94': 'pizza', '1': 'airplane', '140': 'vase', '99': 'pole', '27': 'cat', '19': 'box', '75': 'letter', '138': 'trunk', '21': 'branch', '6': 'basket', '20': 'boy', '121': 'snow', '119': 'skier', '122': 'sock', '58': 'hand', '91': 'person', '85': 'number', '120': 'sneaker', '147': 'wing', '143': 'wave', '43': 'eye', '25': 'cap', '54': 'glass', '97': 'plate', '24': 'cabinet', '82': 'mouth', '96': 'plant', '113': 'short', '63': 'hill', '83': 'neck', '139': 'umbrella', '28': 'chair', '52': 'giraffe', '47': 'flag', '31': 'coat', '125': 'surfboard', '104': 'rock', '36': 'desk', '44': 'face', '34': 'cup', '81': 'mountain', '70': 'lady', '100': 'post', '84': 'nose', '150': 'zebra', '30': 'clock', '65': 'house', '128': 'tie', '8': 'bear', '12': 'bird', '11': 'bike', '29': 'child', '77': 'logo', '105': 'roof', '110': 'shelf', '132': 'towel', '72': 'laptop', '148': 'wire', '124': 'street', '39': 'drawer', '71': 'lamp', 
'115': 'sign', '112': 'shoe', '66': 'jacket', '42': 'engine', '14': 'boat', '107': 'screen', '79': 'men', '145': 'window', '88': 'paper', '135': 'train', '68': 'kid', '49': 'food', '3': 'arm', '18': 'bowl', '73': 'leaf', '136': 'tree', '90': 'people', '61': 'head', '16': 'boot'}

Printing data['predicate_to_idx'] gives:

{'holding': 21, 'under': 43, 'using': 44, 'says': 39, 'and': 5, 'attached to': 7, 'on back of': 32, 'lying on': 26, 'eating': 14, 'looking at': 25, 'in front of': 23, 'along': 4, 'flying in': 15, 'parked on': 35, 'walking in': 45, 'laying on': 24, 'sitting on': 40, 'has': 20, 'above': 1, 'carrying': 11, 'belonging to': 9, 'near': 29, 'part of': 36, 'standing on': 41, 'made of': 27, 'growing on': 18, 'watching': 47, 'at': 6, 'playing': 37, 'painted on': 34, 'with': 50, 'covered in': 12, 'behind': 8, 'to': 42, 'over': 33, 'of': 30, 'covering': 13, 'from': 17, 'wearing': 48, 'mounted on': 28, 'walking on': 46, 'across': 2, 'in': 22, 'against': 3, 'for': 16, 'riding': 38, 'between': 10, 'wears': 49, 'on': 31, 'hanging from': 19}

Printing data['predicate_count'] gives:

{'wearing': 136099, 'using': 1925, 'part of': 2065, 'hanging from': 9894, 'standing on': 14185, 'parked on': 2721, 'riding': 8856, 'walking on': 4613, 'at': 9903, 'along': 3624, 'of': 146339, 'near': 96589, 'in': 251756, 'on back of': 1914, 'looking at': 3083, 'has': 277936, 'sitting on': 18643, 'mounted on': 2253, 'behind': 41356, 'covered in': 2312, 'between': 3411, 'and': 3477, 'against': 3092, 'in front of': 13715, 'lying on': 1869, 'to': 2517, 'from': 2945, 'eating': 4688, 'over': 9317, 'walking in': 1740, 'holding': 42722, 'flying in': 1973, 'playing': 3810, 'for': 9145, 'wears': 15457, 'carrying': 5213, 'growing on': 1853, 'says': 2241, 'across': 1996, 'made of': 2380, 'laying on': 3739, 'on': 712409, 'with': 66425, 'painted on': 3095, 'covering': 3806, 'watching': 3490, 'under': 22596, 'above': 47341, 'belonging to': 3288, 'attached to': 10190}

Printing data['idx_to_predicate'] gives:

{'43': 'under', '29': 'near', '18': 'growing on', '37': 'playing', '50': 'with', '27': 'made of', '2': 'across', '19': 'hanging from', '16': 'for', '35': 'parked on', '41': 'standing on', '47': 'watching', '38': 'riding', '12': 'covered in', '14': 'eating', '33': 'over', '24': 'laying on', '20': 'has', '15': 'flying in', '22': 'in', '30': 'of', '9': 'belonging to', '21': 'holding', '25': 'looking at', '31': 'on', '40': 'sitting on', '49': 'wears', '23': 'in front of', '44': 'using', '11': 'carrying', '10': 'between', '4': 'along', '26': 'lying on', '48': 'wearing', '45': 'walking in', '36': 'part of', '39': 'says', '7': 'attached to', '42': 'to', '8': 'behind', '46': 'walking on', '6': 'at', '17': 'from', '5': 'and', '3': 'against', '13': 'covering', '28': 'mounted on', '1': 'above', '34': 'painted on', '32': 'on back of'}

Printing data['label_to_idx'] gives:

{'elephant': 41, 'counter': 32, 'bottle': 17, 'track': 134, 'tower': 133, 'mountain': 81, 'towel': 132, 'train': 135, 'kite': 69, 'tile': 129, 'sneaker': 120, 'arm': 3, 'table': 126, 'nose': 84, 'racket': 102, 'truck': 137, 'man': 78, 'logo': 77, 'beach': 7, 'shelf': 110, 'phone': 92, 'shirt': 111, 'wing': 147, 'rock': 104, 'motorcycle': 80, 'coat': 31, 'screen': 107, 'boot': 16, 'dog': 37, 'helmet': 62, 'street': 124, 'person': 91, 'fork': 50, 'child': 29, 'handle': 59, 'kid': 68, 'board': 13, 'mouth': 82, 'bench': 10, 'windshield': 146, 'skateboard': 117, 'people': 90, 'box': 19, 'face': 44, 'sidewalk': 114, 'fruit': 51, 'plane': 95, 'men': 79, 'bear': 8, 'book': 15, 'tie': 128, 'branch': 21, 'sock': 122, 'toilet': 131, 'cat': 27, 'food': 49, 'pot': 101, 'banana': 5, 'bike': 11, 'plate': 97, 'wire': 148, 'jacket': 66, 'number': 85, 'bowl': 18, 'stand': 123, 'finger': 46, 'glass': 54, 'lamp': 71, 'laptop': 72, 'ear': 40, 'glove': 55, 'cap': 25, 'vehicle': 142, 'tire': 130, 'hair': 57, 'shoe': 112, 'flag': 47, 'basket': 6, 'fence': 45, 'house': 65, 'plant': 96, 'neck': 83, 'lady': 70, 'sheep': 109, 'woman': 149, 'light': 76, 'railing': 103, 'trunk': 138, 'bag': 4, 'player': 98, 'pillow': 93, 'cabinet': 24, 'snow': 121, 'orange': 86, 'cup': 34, 'letter': 75, 'head': 61, 'engine': 42, 'tree': 136, 'cow': 33, 'seat': 108, 'airplane': 1, 'clock': 30, 'pizza': 94, 'wheel': 144, 'vase': 140, 'chair': 28, 'post': 100, 'bed': 9, 'guy': 56, 'drawer': 39, 'animal': 2, 'door': 38, 'girl': 53, 'giraffe': 52, 'sign': 115, 'skier': 119, 'leg': 74, 'roof': 105, 'hand': 58, 'umbrella': 139, 'tail': 127, 'boy': 20, 'pant': 87, 'boat': 14, 'curtain': 35, 'zebra': 150, 'sink': 116, 'surfboard': 125, 'building': 22, 'room': 106, 'desk': 36, 'bus': 23, 'flower': 48, 'paw': 89, 'window': 145, 'hat': 60, 'short': 113, 'leaf': 73, 'vegetable': 141, 'paper': 88, 'pole': 99, 'wave': 143, 'car': 26, 'eye': 43, 'bird': 12, 'ski': 118, 'jean': 67, 'horse': 64, 'hill': 63}
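
One detail that is easy to miss in the dumps above: `label_to_idx` maps to integers, while the keys of `idx_to_label` are strings (JSON object keys are always strings). A small sketch using three entries copied from the dumps; the helper name `invert_label_map` is hypothetical.

```python
def invert_label_map(label_to_idx):
    """Build the idx -> label map with string keys, as they appear in the JSON file."""
    return {str(idx): label for label, idx in label_to_idx.items()}


# Three entries copied from the dumps above.
label_to_idx = {"airplane": 1, "animal": 2, "arm": 3}
idx_to_label = {"1": "airplane", "2": "animal", "3": "arm"}

# The two maps are inverses of each other once the key type is accounted for.
assert invert_label_map(label_to_idx) == idx_to_label
print(idx_to_label[str(label_to_idx["arm"])])  # arm
```

When looking up a label from an integer id taken from the h5 `labels` array, the id has to be converted with `str(...)` first, or the map has to be rebuilt with integer keys.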