frame_paths, original_frames = frame_extract(args.video, out_dir=tmp_dir.name) # extract all frames
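Here frame_extract is assumed to be the usual "decode the video and dump every frame" helper; a minimal sketch of such a helper using OpenCV (illustrative only, not the actual MMAction2 implementation):

import os
import cv2

def frame_extract_sketch(video_path, out_dir):
    """Decode a video, save each frame to out_dir, and return paths and frames."""
    cap = cv2.VideoCapture(video_path)
    frame_paths, frames = [], []
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        path = os.path.join(out_dir, f'img_{idx + 1:05d}.jpg')
        cv2.imwrite(path, frame)
        frame_paths.append(path)
        frames.append(frame)
        idx += 1
    cap.release()
    return frame_paths, frames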
Sample interval
window_size = clip_len * frame_interval, where clip_len is the number of frames sampled in a window and frame_interval is the sampling interval between frames within the window.
window_size = clip_len * frame_interval
assert clip_len % 2 == 0, 'We would like to have an even clip_len'
# Note that it's 1 based here
timestamps = np.arange(window_size // 2, num_frame + 1 - window_size // 2,
                       args.predict_stepsize)
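As a concrete illustration (the numbers below are made up for this example, not taken from the demo config):

import numpy as np

clip_len, frame_interval = 8, 8
num_frame, predict_stepsize = 100, 8

window_size = clip_len * frame_interval            # 64
timestamps = np.arange(window_size // 2,
                       num_frame + 1 - window_size // 2,
                       predict_stepsize)
print(timestamps)                                  # [32 40 48 56 64]

Each timestamp is the (1-based) center frame of one window on which detection will be run.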
# Rescale the human detection boxes from the original resolution to the
# resized frames and move them to the inference device.
for i in range(len(human_detections)):
    det = human_detections[i]
    det[:, 0:4:2] *= w_ratio
    det[:, 1:4:2] *= h_ratio
    human_detections[i] = torch.from_numpy(det[:, :4]).to(args.device)
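The scale factors come from resizing the original frames to the model's input resolution. A sketch of how they might be computed, assuming the short side is resized to 256 with mmcv (the exact target size depends on the config):

import mmcv
import numpy as np

# Resize frames so the short side is 256 and record the width/height ratios.
h, w, _ = original_frames[0].shape
new_w, new_h = mmcv.rescale_size((w, h), (256, np.inf))
frames = [mmcv.imresize(img, (new_w, new_h)) for img in original_frames]
w_ratio, h_ratio = new_w / w, new_h / h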
SpatioTemporal Action Detection
Get all frames in a window according to a target frame (timestamp):
start_frame = timestamp - (clip_len // 2 - 1) * frame_interval
frame_inds = start_frame + np.arange(0, window_size, frame_interval)
frame_inds = list(frame_inds - 1)  # timestamps are 1-based, frame indices are 0-based
imgs = [frames[ind].astype(np.float32) for ind in frame_inds]
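Continuing the illustrative numbers from above, the window around timestamp 32 is sampled like this:

import numpy as np

clip_len, frame_interval = 8, 8          # illustrative values only
window_size = clip_len * frame_interval  # 64
timestamp = 32                           # 1-based center frame

start_frame = timestamp - (clip_len // 2 - 1) * frame_interval   # 8
frame_inds = start_frame + np.arange(0, window_size, frame_interval)
print(frame_inds)      # [ 8 16 24 32 40 48 56 64]  (1-based frame numbers)
print(frame_inds - 1)  # [ 7 15 23 31 39 47 55 63]  (0-based indices into frames)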
Get the result of SpatioTemporal Action Detection:
datasample = ActionDataSample()
datasample.proposals = InstanceData(bboxes=proposal)
datasample.set_metainfo(dict(img_shape=(new_h, new_w)))
with torch.no_grad():
    result = model(input_tensor, [datasample], mode='predict')
    scores = result[0].pred_instances.scores
prediction = []
# N proposals
for i in range(proposal.shape[0]):
    prediction.append([])
# Perform action score thr
for i in range(scores.shape[1]):
    if i not in label_map:
        continue
    for j in range(proposal.shape[0]):
        if scores[j, i] > args.action_score_thr:
            prediction[j].append((label_map[i], scores[j, i].item()))
predictions.append(prediction)
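The resulting structure is one entry per timestamp, containing one list of (label, score) pairs per detected person. A hypothetical way to inspect it (the loop below is illustrative, not part of the demo):

# Each element of `predictions` corresponds to one timestamp; each inner list
# corresponds to one human proposal and holds the (action_label, score) pairs
# that passed args.action_score_thr.
for ts, prediction in zip(timestamps, predictions):
    for person_id, actions in enumerate(prediction):
        labels = ', '.join(f'{name}: {score:.2f}' for name, score in actions)
        print(f'frame {ts}, person {person_id}: {labels or "no action above threshold"}')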