Analysis of mmaction2's demo_spatiotemporal_det.py

An analysis of mmaction2's human action detection script demo/demo_spatiotemporal_det.py.


Intro

This post walks through the human action detection demo script demo/demo_spatiotemporal_det.py.

extract all frames

frame_paths, original_frames = frame_extract(
    args.video, out_dir=tmp_dir.name)  # extract all frames
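frame_extract decodes the whole video up front: it keeps every raw frame in memory and also writes each one to a temporary directory as a jpg. A minimal sketch of that behavior, modeled on similar mmaction2 demo helpers (the actual implementation may differ):

import os
import os.path as osp

import cv2


def frame_extract_sketch(video_path, out_dir):
    """Decode a video, keeping raw frames in memory and jpgs on disk."""
    os.makedirs(out_dir, exist_ok=True)
    vid = cv2.VideoCapture(video_path)
    frame_paths, frames = [], []
    cnt = 0
    flag, frame = vid.read()
    while flag:
        frames.append(frame)  # raw uint8 array of shape (H, W, 3)
        frame_path = osp.join(out_dir, f'img_{cnt + 1:05d}.jpg')
        cv2.imwrite(frame_path, frame)  # the jpg on disk is compressed
        frame_paths.append(frame_path)
        cnt += 1
        flag, frame = vid.read()
    vid.release()
    return frame_paths, frames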

sample interval

  1. window_size = clip_len * frame_interval, where clip_len is the number of frames sampled in a window and frame_interval is the sampling interval between consecutive frames in the window (a worked example follows the snippet below).
window_size = clip_len * frame_interval
assert clip_len % 2 == 0, 'We would like to have an even clip_len'
# Note that it's 1 based here
timestamps = np.arange(window_size // 2, num_frame + 1 - window_size // 2,
                       args.predict_stepsize)
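A quick worked example with illustrative numbers (not the script's defaults):

import numpy as np

clip_len, frame_interval = 8, 8        # assumed example values
num_frame, predict_stepsize = 200, 8   # assumed example values

window_size = clip_len * frame_interval  # 64
# 1-based center frames: each must leave half a window on either side.
timestamps = np.arange(window_size // 2, num_frame + 1 - window_size // 2,
                       predict_stepsize)
print(timestamps)  # 32, 40, ..., 160, 168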

human detections

  1. inference:
human_detections, _ = detection_inference(args.det_config,
                                          args.det_checkpoint,
                                          center_frames,
                                          args.det_score_thr,
                                          args.det_cat_id, args.device)
  2. rescale the detected boxes to the resized frames used by the action model (a sketch of where w_ratio and h_ratio come from follows the snippet):
for i in range(len(human_detections)):
    det = human_detections[i]
    det[:, 0:4:2] *= w_ratio
    det[:, 1:4:2] *= h_ratio
    human_detections[i] = torch.from_numpy(det[:, :4]).to(args.device)
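For context, w_ratio and h_ratio come from rescaling the frames so that the short side is 256 before they are fed to the action model; the demo derives them roughly like this (treat the exact calls as an assumption):

import mmcv
import numpy as np

h, w, _ = original_frames[0].shape
# Rescale so the short side becomes 256 while keeping the aspect ratio.
new_w, new_h = mmcv.rescale_size((w, h), (256, np.inf))
frames = [mmcv.imresize(img, (new_w, new_h)) for img in original_frames]
w_ratio, h_ratio = new_w / w, new_h / h

Since the detector ran on the original-resolution center frames, multiplying the boxes by these ratios moves them into the resized frames' coordinate system.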

SpatioTemporal Action Detection

  1. get all frames in a window centered on a target frame (timestamp); a worked example follows the snippet:
start_frame = timestamp - (clip_len // 2 - 1) * frame_interval
frame_inds = start_frame + np.arange(0, window_size, frame_interval)
frame_inds = list(frame_inds - 1)  # timestamps are 1-based; frame indices are 0-based
imgs = [frames[ind].astype(np.float32) for ind in frame_inds]
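Continuing the illustrative numbers from above (clip_len = 8, frame_interval = 8) with timestamp = 32, the window is roughly centered on the timestamp:

import numpy as np

clip_len, frame_interval, timestamp = 8, 8, 32  # assumed example values
window_size = clip_len * frame_interval

start_frame = timestamp - (clip_len // 2 - 1) * frame_interval  # 32 - 3 * 8 = 8
frame_inds = start_frame + np.arange(0, window_size, frame_interval)
print(frame_inds)  # [ 8 16 24 32 40 48 56 64]  (1-based)
# Three sampled frames precede the timestamp and four follow it,
# which is why the script asserts an even clip_len.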
  2. get the result of SpatioTemporal Action Detection; the layout of prediction is illustrated after the snippet:
input_array = np.stack(imgs).transpose((3, 0, 1, 2))[np.newaxis]  # (T, H, W, C) -> (1, C, T, H, W)
input_tensor = torch.from_numpy(input_array).to(args.device)

datasample = ActionDataSample()
datasample.proposals = InstanceData(bboxes=proposal)
datasample.set_metainfo(dict(img_shape=(new_h, new_w)))
with torch.no_grad():
    result = model(input_tensor, [datasample], mode='predict')
    scores = result[0].pred_instances.scores
prediction = []
# N proposals
for i in range(proposal.shape[0]):
    prediction.append([])
# Perform action score thr
for i in range(scores.shape[1]):
    if i not in label_map:
        continue
    for j in range(proposal.shape[0]):
        if scores[j, i] > args.action_score_thr:
            prediction[j].append((label_map[i], scores[j, i].item()))
predictions.append(prediction)
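After the loop, prediction holds one list per human proposal, each containing the (label, score) pairs above the threshold. Its layout looks like this (hypothetical labels and scores):

# prediction[j] corresponds to the j-th human proposal in this window.
prediction = [
    [('stand', 0.91), ('watch (a person)', 0.64)],  # proposal 0
    [('sit', 0.88)],                                # proposal 1
]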

Use the result

  1. Show the result on denser frames:
def dense_timestamps(timestamps, n):
    """Make it nx frames."""
    old_frame_interval = (timestamps[1] - timestamps[0])
    start = timestamps[0] - old_frame_interval / n * (n - 1) / 2
    new_frame_inds = np.arange(
        len(timestamps) * n) * old_frame_interval / n + start
    return new_frame_inds.astype(np.int64)

Here is the meaning of start = timestamps[0] - old_frame_interval / n * (n - 1) / 2: the densified timestamps are spaced old_frame_interval / n apart, and for each original timestamp to sit at the center of its n new timestamps, the first new timestamp must begin (n - 1) / 2 of the new intervals before timestamps[0].
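A small worked example with illustrative numbers:

import numpy as np

timestamps = np.array([32, 40])  # assumed example values
n = 4

old_frame_interval = timestamps[1] - timestamps[0]            # 8
start = timestamps[0] - old_frame_interval / n * (n - 1) / 2  # 32 - 2 * 1.5 = 29
new_frame_inds = np.arange(len(timestamps) * n) * old_frame_interval / n + start
print(new_frame_inds.astype(np.int64))  # [29 31 33 35 37 39 41 43]
# [29, 31, 33, 35] is centered on 32 and [37, 39, 41, 43] on 40.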

Fixing problems in the code

Loading large videos

Problem: when processing high-resolution video, the script crashes at runtime.
The reasons are as follows:

  1. The video I needed to process was only 510 MB, but after the [[#extract all frames]] step saved every frame to disk, the frames occupied more than 40 GB. The video itself is encoded (essentially compressed), while the extracted frames are not. So if the script crashes, first check whether your disk has enough free space.
  2. Even worse, frames of shape (1440, 2560, 3) that take 34 GB on disk occupy 309 GB in memory, because:
    1. Memory vs. disk (sys.getsizeof() vs. os.path.getsize()): when a file is loaded into memory, sys.getsizeof() reports the size of the variable, which includes the file's data plus the overhead of the Python objects holding it. On top of that, the saved frames are compressed jpg files, while each decoded frame is a raw (1440, 2560, 3) uint8 array of roughly 11 MB. The in-memory footprint is therefore usually far larger than the files' size on disk.

Solution: refactor the code to process the video in segments; a sketch follows.
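A minimal sketch of the segment-wise idea, assuming two helpers that the demo does not provide: read_frames(lo, hi), which decodes only frames [lo, hi), and process(frames, timestamps), which runs the detection and action-recognition steps above on one segment:

import numpy as np


def run_in_segments(read_frames, process, num_frame, clip_len, frame_interval,
                    predict_stepsize, timestamps_per_segment=100):
    """Process a long video segment by segment to bound disk/memory usage."""
    window_size = clip_len * frame_interval
    half = window_size // 2
    # Same global 1-based timestamps as the original script.
    timestamps = np.arange(half, num_frame + 1 - half, predict_stepsize)
    predictions = []
    for i in range(0, len(timestamps), timestamps_per_segment):
        seg_ts = timestamps[i:i + timestamps_per_segment]
        # Decode only the frames this batch of timestamps can touch.
        lo = max(0, int(seg_ts[0]) - half)
        hi = min(num_frame, int(seg_ts[-1]) + half)
        frames = read_frames(lo, hi)                      # only this slice in memory
        predictions.extend(process(frames, seg_ts - lo))  # shift to local indices
        del frames                                        # free before the next segment
    return predictions

Each batch of timestamps only needs half a window of context on either side, so the peak frame count in memory is bounded by the segment size rather than the video length.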