The details of TriDet.
Intro
- TriDet: Temporal Action Detection with Relative Boundary Modeling.
Temporal Action Detection(TAD)
- Detect all action boundaries and categories from an untrimmed video.
- The pipeline of TAD:
- backbone:
- use a model pre-trained on the action recognition task.
- to get a feature for each frame.
- head:
- detect all action boundaries and categories from the extracted features.
The core focus of the authors:
- get more accurate boundaries.
- explore the Transformer for TAD.
Related Work
Intro
- Transformer-based TAD works can be divided into two classes according to how they handle boundaries:
Segment-level prediction
- Based on the extracted feature map, take a clip (segment), build a simple global representation of it (e.g., by pooling), and finally judge whether the clip is the target action.
- e.g., both of the works below are two-stage networks (like Faster R-CNN: one stage generates many proposals, the second stage regresses and classifies the proposals):
- BMN
- PGCN:
- use GCN to refine every proposal.
- These methods cannot be trained end-to-end.
- End-to-end:
- TadTR
- ReAct
Instant-level prediction
Anchor-free Detection
AFSD
- predict the distance from each instant to the start and end boundaries.
- then take the position pointed to by most other positions as the boundary.
ActionFormer
- Apply self-attention within a sliding (local) window.
- In 2022, this work improved TAD performance noticeably.
- So TriDet is built on top of this work.
Segmentation
MLAD
- Apply self-attention along both the temporal dimension and the class dimension.
- Add the two attention outputs to build the final feature map.
MS-TCT
- Add a CNN module after a traditional self-attention module.
- Add residual connections.
Summary
- Both of the above works focus on modifying self-attention.
- This suggests that vanilla self-attention cannot be applied to TAD directly.
Pros and Cons
Segment-level prediction
- Contains a global representation of each segment.
- Larger receptive field.
- More information.
- Detailed information at each instant is discarded.
- Highly depends on the accuracy of the segments (proposals).
Instant-level prediction
- Contains a detailed representation of each instant.
- Smaller receptive field.
- Requires highly distinguishable features (hence a strong backbone for feature extraction).
- The response strength varies greatly across different videos.
Motivation of Trident-head
- Consider both instant-level and segment-level features.
- Treat the predicted frame together with a fixed number of adjacent frames as the segment-level feature.
- In other words, the segment-level feature is composed of instant-level features within that local window.
Trident-head
- Three branches:
- The Start Boundary and End Boundary branches extract segment-level features.
- The Center Offset branch extracts instant-level features.
- E.g., the predicted start boundary is decided jointly by the Start Boundary branch and the Center Offset branch (a sketch follows after this list):
- The expectation is computed over a local window of B bins:
- If B is too small, boundaries farther away cannot be found.
- If B is too large, learning and convergence become harder, so the prediction is less accurate.
- Combined with FPN:
- The same fixed B is used at every pyramid level; since the temporal resolution differs across levels, this yields different effective Bs, so small and large ranges are covered simultaneously.
- At output time, the predictions from each level are multiplied by the corresponding scale ratio (stride) to recover the real positions.
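A minimal PyTorch sketch of the relative-boundary idea above (my own simplification, not the official TriDet code): for each instant t, the start-boundary responses of the B+1 instants to its left are added to that instant's center-offset response, a softmax turns them into a relative distribution over bins, and the predicted start offset is the expectation of the bin index. The tensor shapes, the padding, and the exact way the two responses are combined are assumptions; end boundaries would be handled symmetrically with bins to the right.
```python
# Sketch of expectation-based boundary prediction for the start boundary.
# Assumed (hypothetical) inputs:
#   start_resp : (T,)      start-boundary response for every instant
#   center_resp: (T, B+1)  center-offset response, one logit per bin
import torch
import torch.nn.functional as F

def predict_start_offsets(start_resp, center_resp, B):
    """Return the expected start offset (in frames, to the left) for every instant."""
    T = start_resp.shape[0]
    offsets = []
    for t in range(T):
        # start responses of instants t, t-1, ..., t-B (bin b = distance b to the left);
        # positions outside the sequence get a very negative logit
        idx = torch.arange(t, t - B - 1, -1)
        valid = idx >= 0
        local = torch.full((B + 1,), -1e4)
        local[valid] = start_resp[idx[valid]]
        # combine segment-level (local window) and instant-level (center-offset) cues
        prob = F.softmax(local + center_resp[t], dim=0)   # relative distribution over bins
        bins = torch.arange(B + 1, dtype=prob.dtype)
        offsets.append((prob * bins).sum())               # expectation = predicted offset
    return torch.stack(offsets)
```
With the feature pyramid, the offsets predicted at level l would then be multiplied by that level's stride to obtain positions at the original temporal resolution.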
The second question: Attention in the Temporal Dimension.
- Many methods require complex attention mechanisms to make the network work.
- The success of the previous transformer-based layers(in TAD) primarily relies on their macro-architecture, rather than the self-attention mechanism.
- In the ablation above, when a 1D convolution replaces self-attention, the average mAP drops by only 1.9; but when a CNN baseline replaces the Transformer baseline, the average mAP drops by much more. This indicates that the Transformer's effectiveness comes from its macro-architecture rather than from self-attention.
The Rank Loss Problem of Self Attention
In TAD, making the features similar is disastrous, because we need to distinguish whether each position is part of an action or not.
Pure LayerNorm normalizes a feature $x \in \mathbb{R}^n$ to a fixed modulus $\sqrt{n}$:
$x' = \mathrm{LayerNorm}(x), \quad x'_i = \frac{x_i - \mathrm{mean}(x)}{\sqrt{\frac{1}{n}\sum_{j=1}^{n}(x_j - \mathrm{mean}(x))^2}}$
$\left\| x' \right\|_2^2 = \sum_{i=1}^{n} x'^2_i = n$
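A quick numerical check of this (my own snippet; "pure" LayerNorm means no learned affine parameters):
```python
# Verify that LayerNorm maps an n-dimensional feature to squared L2 norm ~= n.
import torch

n = 256
x = torch.randn(n) * 3.0 + 5.0                        # arbitrary feature vector
ln = torch.nn.LayerNorm(n, elementwise_affine=False)  # pure LayerNorm
x_prime = ln(x)
print(x_prime.pow(2).sum().item())                    # ~= 256 (up to the small eps inside LayerNorm)
```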
The Evidence on HACS:
We consider the cosine similarity:
- Here SA is self-attention, SGP is the layer proposed by the authors, and BackB is the backbone network used to extract features.
- Each value is the average cosine similarity between every feature and the mean feature within the same layer(?).
- Consider the Self Attention:
$V’=WV$
$W=Softmax(\frac{QK^T}{\sqrt{d}})$
$W$ is non-negative and each of its rows sums to 1, thus each row of $V'$ is a [[conceptAI#convex combination|convex combination]] of the rows of the input $V$.
In contrast, the values in a convolution kernel can be negative and need not sum to 1, so convolution does not force the features toward each other.
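A small demo of this argument (a sketch with random features, not from the paper): the softmax weight matrix is non-negative with rows summing to 1, and the resulting convex combinations pull the features toward their common mean, which shows up as a higher average cosine similarity.
```python
# Convex-combination effect of softmax attention on feature similarity.
import torch
import torch.nn.functional as F

def mean_cos_to_centroid(feats):
    """Average cosine similarity between each feature and the mean feature."""
    centroid = feats.mean(dim=0, keepdim=True)
    return F.cosine_similarity(feats, centroid, dim=-1).mean().item()

torch.manual_seed(0)
T, d = 64, 32
V = torch.randn(T, d)
Q, K = torch.randn(T, d), torch.randn(T, d)

W = F.softmax(Q @ K.t() / d ** 0.5, dim=-1)   # non-negative, each row sums to 1
V_attn = W @ V                                # rows are convex combinations of rows of V

print(W.min().item() >= 0, torch.allclose(W.sum(dim=-1), torch.ones(T)))
print(mean_cos_to_centroid(V))                # low for random features
print(mean_cos_to_centroid(V_attn))           # noticeably higher after one attention step
```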
The author’s Solution: The SGP layer
- Instant-level branch: increase the discriminability of features.
- Window-level branch: capture temporal information with different scales of receptive fields.
$f_{SGP}=\Phi(x)\,FC(x)+\psi(x)\,(Conv_w(x)+Conv_{kw}(x))+x$
$\Phi(x)=\mathrm{ReLU}(FC(\mathrm{AvgPool}(x)))$
$\psi(x)=Conv_w(x)$
Window-level: makes the network adaptively extract features at different scales.
In detail (a code sketch follows below):
- The author uses depth-wise convolutions to reduce the computation of the network.
- An additional residual connection is added.
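A simplified PyTorch sketch of an SGP-style layer, written directly from the formula above; the kernel sizes, the use of a 1x1 convolution as the FC, and the normalization-free layout are my assumptions rather than the paper's exact configuration.
```python
# Simplified SGP-style layer: instant-level gating branch + multi-scale window-level branch.
import torch
import torch.nn as nn

class SGPLayer(nn.Module):
    def __init__(self, channels: int, w: int = 3, k: int = 3):
        super().__init__()
        # instant-level branch: Phi(x) * FC(x), with Phi(x) = ReLU(FC(AvgPool(x)))
        self.fc = nn.Conv1d(channels, channels, kernel_size=1)
        self.fc_gate = nn.Conv1d(channels, channels, kernel_size=1)
        # window-level branch: psi(x) * (Conv_w(x) + Conv_kw(x)), with psi(x) = Conv_w(x);
        # depth-wise convolutions (groups=channels) keep the computation cheap
        self.conv_w = nn.Conv1d(channels, channels, kernel_size=w,
                                padding=w // 2, groups=channels)
        self.conv_kw = nn.Conv1d(channels, channels, kernel_size=k * w,
                                 padding=(k * w) // 2, groups=channels)
        self.psi = nn.Conv1d(channels, channels, kernel_size=w,
                             padding=w // 2, groups=channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (batch, channels, T)
        phi = torch.relu(self.fc_gate(x.mean(dim=-1, keepdim=True)))      # video-level gate
        instant = phi * self.fc(x)                                        # increases discriminability
        window = self.psi(x) * (self.conv_w(x) + self.conv_kw(x))         # multi-scale temporal info
        return instant + window + x                                       # extra residual connection
```
In TriDet, layers of this kind take the place of the self-attention layers inside the feature pyramid.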
Experiment
The performance is strong: higher accuracy and faster speed than previous methods.