[Paper Review] HiFT

HiFT : Hierarchical Feature Transformer for Aerial Tracking
Ziang Cao† , Changhong Fu†,*, Junjie Ye† , Bowen Li† , and Yiming Li‡


1. Abstract

Most existing Siamese-based tracking methods perform classification and regression of the target object based on similarity maps.
However, they fall into one of two camps, each with a drawback:
- using a single map from the last convolutional layer, which degrades localization accuracy in complex scenarios
- using multiple maps separately for decision making, which introduces computation that is intractable for aerial mobile platforms
This work therefore proposes an efficient and effective hierarchical feature transformer (HiFT) for aerial tracking.
Hierarchical similarity maps generated by multi-level convolutional layers are fed into the feature transformer to achieve an interactive fusion of spatial cues (shallow layers) and semantic cues (deep layers).
As a result,
1. global contextual information is raised, which facilitates the target search, and
2. the end-to-end architecture with the transformer efficiently learns the interdependencies among multi-level features,
3. so that a tracking-tailored feature space with strong discriminability is discovered.

2. Introduction

1) What is VOT ?

Visual object tracking (VOT), which aims to estimate the location of an object frame by frame given its initial state, has attracted considerable attention thanks to its thriving applications, especially on unmanned aerial vehicles (UAVs).
- e.g., aerial cinematography [5], visual localization [48], and collision warning [19]
Despite impressive progress, efficient and effective aerial tracking remains a challenging task because of limited computational resources, fast motion, low resolution, frequent occlusion, and so on.

2) Progress and limitations

Deep learning (DL)-based trackers [44, 35, 9, 2, 31, 53, 18, 17, 6] in the visual tracking community
- stand out by using CNNs with robust representation capability.
However, lightweight CNNs such as AlexNet [30] can hardly extract the robust features that are vital for tracking performance in complex aerial scenarios.
- Using a larger kernel size or a deeper backbone [31] can alleviate this shortcoming, but efficiency and practicability are sacrificed.
Dilated convolution [49] was proposed to expand the receptive field and avoid the loss of resolution caused by pooling layers.
- Unfortunately, it still suffers from unstable performance when handling small objects.

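3) Introducing the Transformer

Recently, the transformer, with its encoder-decoder structure [1], has shown great potential in many domains.
Inspired by the transformer's strong performance in global relationship modeling, the paper exploits the transformer architecture in aerial tracking to fuse multi-level features effectively and achieve promising performance.
At the same time, the efficiency loss caused by computing over multiple layers and the transformer's weakness in handling small objects (pointed out in [52]) can be mitigated together.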

4) Key features of this paper

Figure 1. Qualitative comparison of the proposed HiFT with state-of-the-arts [23, 8, 31] on three challenging sequences (BMX4, RaceCar1 from DTB70 [34], and Car16 from UAV20L [39]). Owing to the effective tracking-tailored feature space produced by the hierarchical feature transformer, our HiFT tracker can achieve robust performance under various challenges with a satisfactory tracking speed while other trackers lose effectiveness.

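In particular, because the target object in visual tracking can be an arbitrary object, the learned object queries of the original transformer structure do not generalize well to visual tracking.
- Therefore, low-resolution features from the deeper layer are adopted to replace the learned object queries.
- Meanwhile, shallow layers are fed into the transformer, and end-to-end training implicitly models the relationship between high-resolution spatial information and low-resolution semantic cues, discovering a tracking-tailored feature space with strong discriminability.
In addition, to further handle the weakness on low-resolution objects [52], a novel feature modulation layer is designed in the transformer to fully explore the interdependencies among multi-level features.
As shown in Figure 1, the proposed HiFT tracker efficiently achieves robust performance in complex scenarios.
The main contributions of this work are as follows.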

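✔ A novel hierarchical feature transformer is proposed that learns the relationships among multi-level features to discover a tracking-tailored feature space with strong discriminability for aerial tracking.

✔ A neat feature modulation layer and classification label are designed to further exploit the hierarchical features of Siamese networks and improve tracking accuracy on small objects.

✔ Comprehensive evaluations on four authoritative aerial benchmarks validate the promising performance of HiFT against other state-of-the-art (SOTA) trackers, even those equipped with deeper backbones.

✔ Real-world tests are conducted on a typical aerial platform, demonstrating the efficiency and effectiveness of HiFT in real-world scenarios.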

3. Related Works

3.1. Visual Tracking Methods

1) DCF

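Since MOSSE [4], handcrafted DCF-based trackers have made various advances [21, 36, 29, 9].
- By computing in the Fourier domain, DCF-based trackers can achieve competitive performance with high efficiency [20].
- Nevertheless, because handcrafted features have poor representation ability, these trackers can hardly stay robust under varied tracking conditions.
- To improve tracking performance, several works have introduced deep learning into DCF-based methods [9, 50, 35].
- Despite great progress, they still show inferior robustness and efficiency for aerial tracking.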

2) Siamese

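Another prominent branch of the SOT community is Siamese-based methods [2, 24, 32, 53, 31], which benefit from large-scale offline training data and an end-to-end learning strategy.
- SiameseFC [2] revealed the advantage of the Siamese framework by formulating tracking as a similarity matching process between the template and search patches.
- With SiameseFC as the baseline, DSiam [24] was proposed to handle object appearance variation and background interference effectively.
- Inspired by RPN [40], SiamRPN [32] regards tracking as two subtasks and applies a classification branch and a regression branch respectively.
- DaSiamRPN [53] further improved robustness by introducing a novel distractor-aware module and an effective sampling strategy.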
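The "similarity matching" idea behind SiameseFC-style trackers can be sketched as a sliding cross-correlation of a template feature over a search feature. The snippet below is only a minimal single-channel illustration on toy 2D lists; `cross_correlate` and `argmax_2d` are hypothetical helper names, and real trackers correlate multi-channel CNN features rather than raw numbers.

```python
def cross_correlate(search, template):
    """Slide `template` over `search` (both 2D lists) and return the similarity map."""
    sh, sw = len(search), len(search[0])
    th, tw = len(template), len(template[0])
    response = []
    for i in range(sh - th + 1):
        row = []
        for j in range(sw - tw + 1):
            # Inner product between the template and the window at (i, j).
            score = sum(
                search[i + di][j + dj] * template[di][dj]
                for di in range(th)
                for dj in range(tw)
            )
            row.append(score)
        response.append(row)
    return response


def argmax_2d(score_map):
    """Return the (row, col) position of the highest score."""
    best = max(
        (value, i, j)
        for i, row in enumerate(score_map)
        for j, value in enumerate(row)
    )
    return best[1], best[2]


# Toy search region that contains the template pattern at offset (1, 1).
search = [
    [0, 0, 0, 0],
    [0, 1, 2, 0],
    [0, 3, 4, 0],
    [0, 0, 0, 0],
]
template = [[1, 2], [3, 4]]

response = cross_correlate(search, template)
peak = argmax_2d(response)
print(response, peak)
```

The peak of the similarity map marks where the search region looks most like the template, which is the signal the classification and regression heads build on.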
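More recently, the possibility of adopting very deep networks as the backbone has been widely explored [31], but efficiency is largely sacrificed.
- RPN-based trackers [32, 53, 31] clearly provide an effective tracking strategy.
- However, the anchor-related hyperparameters significantly reduce the generalization of the trackers.
- To eliminate this drawback, anchor-free methods have been proposed [23, 8].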

3) Differences from prior work and features of this paper

In Siamese-based trackers, robust features have a crucial influence on tracking performance.
- However, trackers [2, 32, 53, 18] with a lightweight backbone such as AlexNet [30] suffer from a lack of global context,
- while trackers [31, 8, 23] that use deep CNNs such as ResNet [25] are far from the real-time requirements of UAVs.
Several works [31, 16] have explored multi-level features in visual tracking, but they inevitably introduce cumbersome computation that mobile platforms cannot afford.
In contrast, this work proposes a brand-new lightweight hierarchical feature transformer (HiFT) for effective and efficient multi-level feature fusion, achieving robust aerial tracking efficiently.

3.2. Transformer in Computer Vision

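1) Extension of the Transformer

[1] first proposed the transformer for machine translation, based on the attention mechanism.
Thanks to its high representation ability, the transformer structure has been extended to computer vision areas such as video captioning [51], image enhancement [47], and pose estimation [27].
- After DETR [7] started transformer research in object detection,
- deformable DETR [52] proposed a deformable attention module for efficient convergence, offering inspiration for combining transformers and CNNs.
- Some studies have tried to introduce the transformer into multi-object tracking and achieved promising performance [38], but the use of transformers in SOT remains largely unexplored.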

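2) Limitations of the Transformer and the proposed approach

Although the transformer attention mechanism shows strong performance in a wide range of vision tasks, its superiority is hard to extend to SOT because predefined (or learned) object queries can hardly stay effective when facing arbitrary objects.
Moreover, the transformer can hardly handle low-resolution objects, which are frequently encountered in aerial tracking.
📕In this work,
- instead of redesigning the object query and related structures, a hierarchical feature transformer is proposed to construct a novel and robust tracking-tailored feature space.
- Thanks to the introduction of global context and the interdependencies among multi-level features, the discriminability of the feature space is greatly increased, which improves tracking performance.
- Meanwhile, HiFT has a lightweight encoder-decoder structure suited to mobile platforms.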

4. Proposed Method

Figure 2. Overview of the HiFT tracker. The modules from the left to right are feature extraction network, hierarchical feature transformer, and classification & regression network. Three arrows with different colors represent the workflow of features from different layers respectively. Note that only the input of the encoder is combined with position encoding. Best viewed in color. (Image frames are from UAV20L [39].)

The workflow of HiFT is shown in Figure 2.
1. Feature extraction network
2. Hierarchical feature transformer
3. Classification & Regression network
Note that the paper uses the features from the last three layers to build the hierarchical feature transformer.

4.1. Feature Extraction Network

Deep CNNs (e.g., ResNet [25], MobileNet [42], GoogLeNet [43]) have demonstrated remarkable capability as popular feature extraction backbones in Siamese frameworks [31].
- However, the heavy computation brought by deep structures can hardly be afforded by aerial platforms.
To address this, HiFT adopts a lightweight backbone such as AlexNet [30], which serves both the template and search branches.
- For clarity, the template and search images are denoted by Z and X, respectively.
- pi_k(X) is the output of the k-th layer of the search branch.

👩🏻‍💻 Note

Although the feature extraction capability of AlexNet is weaker than that of these deeper networks, the proposed feature transformer can largely make up for this drawback while saving computational resources for real-time aerial tracking.

4.2. Hierarchical Feature Transformer

The proposed hierarchical feature transformer can be divided into two main stages.
- high-resolution features encoding
  - aims to learn the interdependencies between different feature layers and spatial information, raising attention to objects of different scales (especially low-resolution objects).
- low-resolution features decoding
  - aggregates the semantic information from the low-resolution feature map.
Benefiting from the rich global context and the interdependencies among hierarchical features, the method discovers a tracking-tailored feature space.
- As a result, the discriminability and representative capability of the transformed features are greatly improved under varied aerial tracking conditions.
- Specifically, the features of the last three layers are used.
Before being fed into the feature transformer, the feature map from the k-th layer is convolved and reshaped into M_i.
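The reshape step can be pictured as flattening a C x H x W feature map into a sequence of H*W tokens, each with C channels, so that every spatial location becomes one element of the transformer's input sequence. The sketch below shows only this flattening, under the assumption that the sequence layout is (H*W) x C; the preceding convolution is omitted, and `flatten_feature_map` is a hypothetical helper name.

```python
def flatten_feature_map(feature_map):
    """Turn a C x H x W nested list into an (H*W) x C sequence of tokens."""
    channels = len(feature_map)
    height = len(feature_map[0])
    width = len(feature_map[0][0])
    # One token per spatial location, scanning row by row.
    return [
        [feature_map[c][y][x] for c in range(channels)]
        for y in range(height)
        for x in range(width)
    ]


# Toy feature map with C=2, H=2, W=2.
fmap = [
    [[1, 2], [3, 4]],      # channel 0
    [[10, 20], [30, 40]],  # channel 1
]
tokens = flatten_feature_map(fmap)
print(tokens)
```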

1) 💛Feature Encoding💛 -> to be revisited

Figure 3. Detailed workflow of HiFT. The left sub-window illustrates the feature encoder. The right one shows the structure of the decoder. Best viewed in color.

To fully explore the interdependencies between hierarchical features, the combination of M'_3 and M'_4, M^1_E = Norm(M'_3 + M'_4), is used as the input of the multi-head attention module [1].
- Here, Norm is the normalization layer.
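As a rough illustration of the "add, then normalize" pattern in M^1_E = Norm(M'_3 + M'_4), the sketch below sums two token sequences element-wise and normalizes each token. It assumes Norm behaves like a layer normalization without learnable scale and shift, as in the standard transformer; the text only says "normalization layer", so this is an assumption, and `layer_norm` / `add_and_norm` are hypothetical helper names.

```python
import math


def layer_norm(token, eps=1e-5):
    """Normalize one token (list of channel values) to zero mean, unit variance."""
    mean = sum(token) / len(token)
    var = sum((v - mean) ** 2 for v in token) / len(token)
    return [(v - mean) / math.sqrt(var + eps) for v in token]


def add_and_norm(seq_a, seq_b):
    """Element-wise sum of two equally shaped token sequences, then per-token norm."""
    return [
        layer_norm([a + b for a, b in zip(token_a, token_b)])
        for token_a, token_b in zip(seq_a, seq_b)
    ]


# Toy stand-ins for M'_3 and M'_4: 2 tokens with 3 channels each.
m3 = [[1.0, 2.0, 3.0], [0.0, 0.0, 6.0]]
m4 = [[3.0, 2.0, 1.0], [0.0, 3.0, 0.0]]
m1_e = add_and_norm(m3, m4)
print(m1_e)
```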
In general, the scaled dot-product attention Att can be expressed as (2).

where √c is the scaling factor that avoids vanishing gradients in the softmax function.
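For reference, scaled dot-product attention in its standard form from [1] is Att(Q, K, V) = Softmax(Q K^T / √c) V. The pure-Python sketch below implements that standard form on tiny matrices; it is an illustration, not the authors' code, and it assumes c is the channel dimension of the keys.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def attention(q, k, v):
    """Return (output, weights) of scaled dot-product attention."""
    c = len(k[0])  # channel dimension used for the sqrt(c) scaling
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(c) for s in row]) for row in scores]
    return matmul(weights, v), weights


q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[10.0, 0.0], [0.0, 10.0]]
out, weights = attention(q, k, v)
print(weights)
print(out)
```

Each query attends most strongly to the key it matches, so each output row is dominated by the corresponding value row.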

The multi-head attention module mAtt can then be expressed as (3).
- Q, K, and V are only mathematical symbols used to make the function clear; they carry no practical meaning.

where W_c ∈ R^{C×C}, W^j_1 ∈ R^{C×C_d}, W^j_2 ∈ R^{C×C_d}, and W^j_3 ∈ R^{C×C_d} (C_d = C/N, N is the number of parallel attention heads) can all be regarded as fully connected layer operations.
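The equation image for (3) did not survive in this post. As a reference, the standard multi-head attention of [1], written with the symbols from the "where" clause above, would read as follows. This is the textbook form, and the paper's exact typesetting (for example the name given to each head, H^j here) may differ.

```latex
\begin{aligned}
\mathrm{mAtt}(Q, K, V) &= \mathrm{Cat}\left(H^{1}, \dots, H^{N}\right) W_c, \\
H^{j} &= \mathrm{Att}\left(Q W^{j}_{1},\; K W^{j}_{2},\; V W^{j}_{3}\right), \quad j = 1, \dots, N
\end{aligned}
```

Each head works on C_d = C/N channels, so concatenating the N heads restores a C-dimensional feature before the final projection W_c.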

After that, the output of the first multi-head attention module (i.e., M^2_E) can be obtained by (4).

As a result, the interdependencies between M'_3 and M'_4 are effectively learned to enrich the high-resolution feature map M^2_E.
The global context of the two feature maps is also introduced into M^2_E.
Next, a modulation layer, whose structure is shown in Figure 3, is built to explore the potential of the interdependencies between M^3_E and M'_4.
Specifically, the input of the modulation layer, M^3_E, is obtained by normalizing the sum of M'_3 and M^2_E (i.e., M^3_E = Norm(M'_3 + M^2_E)).
After a feed-forward network (FFN) and a global average pooling (GAP) operation, the output of the modulation layer, M^4_E, can be expressed as (5).

With the modulation layer, the internal spatial information between M'_4 and M^3_E is efficiently exploited, so that the object can be effectively distinguished from complex backgrounds.
Finally, the encoded information is computed through an FFN and normalization.

👩🏻‍💻 Note

Thanks to the feature encoder, the global context and the interdependencies between M'_3 and M'_4 are fully explored.

In addition, to overcome the weakness in handling small objects, the modulation layer is proposed to further explore spatial information and enrich the encoded information.

Finally, building on this, the decoder can construct an effective feature transformation for robust tracking.

2) Feature Decoding

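Before decoding, the low-resolution feature map is first reshaped into M_5 via (1).
The feature decoder follows a structure similar to the standard transformer [1].
- The difference is that the paper builds an effective feature decoder without position encoding.
- Because the method treats the number of locations as the sequence length, position encoding is introduced to distinguish each location on the feature map.
- To avoid a direct influence on the transformed features, position information is introduced only implicitly, through the encoder.
- The positional encoding strategy is analyzed in the latter part of Section 4.3.3 of the paper.
The structure of the decoder is shown in Figure 3.

👩🏻‍💻 Note

With the hierarchical feature transformer, the spatial information of the high-resolution features and the semantic information of the low-resolution features are fully exploited to improve the discriminability of the final transformed features.

Meanwhile, the modulation layer aggregates the interdependencies between different feature layers, improving robustness when tracking objects at various scales.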

3) Definition of Classification Label

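The classification and regression network is implemented with several convolution layers.
To achieve accurate classification, the paper applies two classification branches.
- 1) The first aims to classify via the area involved in the ground truth box.
- 2) The second focuses on determining the positive samples measured by the distance between the center of the ground truth and the corresponding point.
In addition, to accelerate convergence, a pseudo-random number generator, denoted T, is used to limit the number of negative labels.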
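To make the distance-based labeling and the random limiting of negatives more concrete, here is a loose sketch. Everything specific in it is an assumption for illustration: the `radius` threshold, the `max_negatives` cap, the use of -1 as an "ignored" label, and the function name `circular_labels` do not come from the paper, whose exact definition is in the supplementary material.

```python
import random


def circular_labels(size, center, radius, max_negatives, seed=0):
    """Build a size x size label map: 1 = positive, 0 = kept negative, -1 = ignored."""
    rng = random.Random(seed)  # pseudo-random generator used to pick negatives
    cy, cx = center
    labels = [[-1] * size for _ in range(size)]
    negatives = []
    for y in range(size):
        for x in range(size):
            # Positive if the point lies within `radius` of the ground-truth center.
            if (y - cy) ** 2 + (x - cx) ** 2 <= radius ** 2:
                labels[y][x] = 1
            else:
                negatives.append((y, x))
    # Keep only a random subset of negatives; the rest stay ignored.
    for y, x in rng.sample(negatives, min(max_negatives, len(negatives))):
        labels[y][x] = 0
    return labels


labels = circular_labels(size=7, center=(3, 3), radius=1.5, max_negatives=10)
num_pos = sum(row.count(1) for row in labels)
num_neg = sum(row.count(0) for row in labels)
for row in labels:
    print(row)
```

A rectangle label would instead mark every point inside the ground-truth box as positive, regardless of how far it is from the center.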

👩🏻‍💻 Note

See the supplementary material for the detailed computation of classification and regression.

The overall loss function is then expressed as (6).

5. Experiments

5.1. Implementation Details

Training runs for 70 epochs; the last three layers of AlexNet are fine-tuned in the last 60 epochs, while the first two layers are frozen.
Learning rate : initialized to 5 x 10^-4 and decreased from 10^-2 to 10^-4 in log space
Sizes of Z and X : set to 3x127x127 and 3x287x287, respectively
Feature transformer : one encoder layer and two decoder layers
Training data : image pairs extracted from COCO, ImageNet VID, GOT-10K, and Youtube-BB
Optimizer : SGD
Batch size : 220
Momentum : 0.9
Weight decay : 10^-4
Training is done on a PC with an Intel i9-9920X CPU, 32GB RAM, and two NVIDIA TITAN RTX GPUs.
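A learning rate that "decreases from 10^-2 to 10^-4 in log space" can be sketched as values spaced evenly in log10 between the two endpoints. The snippet below shows only that decay. How the 5 x 10^-4 initial value connects to it (for example through a warm-up phase) is not stated in this review, so it is left out, and `num_epochs=5` is just an illustrative value.

```python
import math


def log_space_schedule(start, end, num_epochs):
    """Return `num_epochs` learning rates spaced evenly in log10 from start to end."""
    if num_epochs == 1:
        return [start]
    log_start, log_end = math.log10(start), math.log10(end)
    step = (log_end - log_start) / (num_epochs - 1)
    return [10 ** (log_start + i * step) for i in range(num_epochs)]


rates = log_space_schedule(1e-2, 1e-4, num_epochs=5)
for epoch, lr in enumerate(rates):
    print(epoch, f"{lr:.6f}")
```

In log space each epoch multiplies the rate by a constant factor, so the rate drops quickly at first and more gently in absolute terms later.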

5.2. Evaluation Metrics

To evaluate tracking performance, the one-pass evaluation (OPE) metrics, including precision and success rate, are applied.
- success rate : measured by the IoU between the ground truth and estimated bounding boxes
  - The percentage of frames whose IoU exceeds a predefined threshold is drawn as the success plot (SP).
- precision : evaluated with the center location error (CLE) between the estimated location and the ground truth
  - The percentage of frames whose CLE is within a given threshold is drawn as the precision plot (PP).
- The area under the curve (AUC) of the SP and the precision at a threshold of 20 pixels are adopted to rank the trackers.
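The two ranking numbers can be computed roughly as follows. This is a minimal sketch that assumes boxes in (x, y, w, h) format with a top-left origin and approximates the AUC by averaging the success rate over 21 evenly spaced IoU thresholds (a common OTB-style convention, assumed here); the 20-pixel precision threshold is the one stated above. Benchmark toolkits may differ in details such as strict versus non-strict comparisons.

```python
import math


def iou(box_a, box_b):
    """Intersection over union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def center_error(box_a, box_b):
    """Center location error (CLE) in pixels."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    return math.hypot((ax + aw / 2) - (bx + bw / 2), (ay + ah / 2) - (by + bh / 2))


def precision_at(preds, gts, threshold=20.0):
    """Fraction of frames whose CLE is within `threshold` pixels."""
    hits = sum(center_error(p, g) <= threshold for p, g in zip(preds, gts))
    return hits / len(gts)


def success_auc(preds, gts, num_thresholds=21):
    """Average success rate over evenly spaced IoU thresholds in [0, 1]."""
    overlaps = [iou(p, g) for p, g in zip(preds, gts)]
    thresholds = [i / (num_thresholds - 1) for i in range(num_thresholds)]
    rates = [sum(o > t for o in overlaps) / len(overlaps) for t in thresholds]
    return sum(rates) / len(rates)


# Three toy frames: a perfect hit, a half-overlapping box, and a total miss.
gts = [(0, 0, 10, 10), (0, 0, 10, 10), (0, 0, 10, 10)]
preds = [(0, 0, 10, 10), (5, 0, 10, 10), (100, 100, 10, 10)]
prec = precision_at(preds, gts)
auc = success_auc(preds, gts)
print(round(prec, 3), round(auc, 3))
```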

5.3. Evaluation on Aerial Benchmarks

1) Overall Performance

For the overall evaluation, HiFT is tested on four challenging and authoritative aerial tracking benchmarks.
It is comprehensively compared with 19 state-of-the-art (SOTA) trackers: SiamRPN++, DaSiamRPN, UDT, UDT+, TADT, CoKCF, ARCF, AutoTrack, ECO, C-COT, MCCT, DeepSTRCF, STRCF, BACF, SRDCF, fDSST, SiameseFC, DSiam, and KCF.
For fairness, all Siamese-based trackers adopt the same backbone (i.e., AlexNet [30] pre-trained on ImageNet [14]).

UAV123

UAV123 is a large-scale UAV benchmark containing 123 high-quality sequences with more than 112K frames, covering a variety of challenging aerial scenarios such as frequent occlusion, low resolution, and out-of-view.
UAV123 can therefore help evaluate tracking performance in aerial tracking thoroughly.

🔎 Results

As shown in Table 1, HiFT outperforms the other trackers in both precision and success.
In terms of precision, HiFT gains first place with a precision score of 0.787, surpassing the second- and third-best SiamRPN++ (0.769) and ECO (0.752) by 2.3% and 4.7% respectively.
As for the success rate, HiFT (0.589) also improves over SiamRPN++ (0.579) and ranks first place.
In short, HiFT shows strong performance across a wide range of aerial tracking scenarios.
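The 2.3% and 4.7% figures are consistent with relative gains, (ours - other) / other, rather than absolute differences in precision points. A quick check with the scores quoted above (`relative_gain` is just a hypothetical helper name):

```python
def relative_gain(ours, other):
    """Relative improvement of `ours` over `other`, in percent."""
    return (ours - other) / other * 100.0


gain_vs_siamrpnpp = relative_gain(0.787, 0.769)
gain_vs_eco = relative_gain(0.787, 0.752)
print(round(gain_vs_siamrpnpp, 1), round(gain_vs_eco, 1))
```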
Other benchmarks
- UAV20L
- DTB70
- UAV123@10fps

2) Attribute-based Comparison

To evaluate HiFT thoroughly under various challenges, an attribute-based comparison is performed, as shown in Table 4.
- On the low-resolution, scale variation, occlusion, and fast motion attributes, HiFT clearly outperforms the second-best tracker.
These results show that the paper's hierarchical feature transformer
- benefits from global contextual information when overcoming the problem of severe motion.
- can also learn more robust features to distinguish objects when the object is severely occluded.
- HiFT therefore yields a noticeable improvement in occlusion scenarios as well.
Because multi-scale feature maps are used to build the feature transformation,
- the tracker gains the ability to track objects at various scales, as confirmed by its performance on the low-resolution and scale variation attributes.

3) Ablation Study

To verify the effect of each module of the proposed method, a detailed study of HiFT with different modules enabled is conducted on UAV20L.

Symbol introduction

First, the symbols used in Table 5 are introduced.
- Baseline : the model with only feature extraction + regression network + classification network
- OT : original standard transformer (with object query)
- FT : original transformer with the feature map (instead of object query, without modulation layer)
- HFT : full version of the proposed hierarchical feature transformer
- PE : direct positional encoding on M5 (HiFT excludes positional encoding on M5, as described in Section 3.2.2 of the paper.)
- RL : rectangle label used in traditional trackers
- For fairness, every version of the tracker adopts the same training strategy except for the module under investigation.

Discussion on Transformer architecture

As shown in Table 5, adding the original transformer with object queries (Baseline + OT) lowers the performance of the baseline.
- This indicates that object queries do not perform well in SOT, where target objects are novel.
Replacing the object query with the feature map, Baseline + FT increases precision by 10.47%.
Further adopting the modulation layer, Baseline + HFT reaches the best performance, a 24.88% gain.
Taken together, these results verify the effectiveness of the carefully designed hierarchical transformer with the modulation layer in aerial tracking.

Discussion on position encoding & classification label

This part aims to validate two strategies: the positional encoding of Section 3.2.2 and the new classification label of Section 3.3 (section numbers refer to the paper).

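Positional encoding
- In Table 5, Baseline + HFT + PE severely harms the performance of HiFT. (precision gain over the baseline: 24.88% -> 12.77%)
- This indicates that direct position encoding is not suitable for the feature M5.
New classification label
- By taking the distance between the ground truth and the sample points into account, the circular strategy used in HiFT achieves a clear improvement over the traditional rectangle label. (precision gain over the baseline: 2.95% with the rectangle label vs. 24.88% with the circular label)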

4) Qualitative Evaluation

Figure 5. Visualization of the confidence map of three tracking methods on several sequences from UAV20L [39] and DTB70 [34]. The target objects are marked out by red boxes in the original frames. HiFT gets more robust performance for visual tracking in the air.

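As shown in Figure 5, the confidence map of the HiFT tracker consistently focuses on the object under the troublesome challenges of aerial tracking.
- e.g., fast motion in Motor2, low resolution in SpeedCar4, and occlusion in group3 and Yacht4.
Although Baseline and Baseline+OT are trained with the same strategy as HiFT, they still fail to concentrate on the target object in complex tracking scenarios, which demonstrates the robustness of the proposed hierarchical feature transformer.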

5) Comparison to Trackers with Deeper Backbone

Figure 6. Precision-speed trade-off analysis by quantitative comparison between HiFT and trackers with deeper backbone on UAV20L [39] (left) and DTB70 [34] (right). Our method realizes an excellent trade-off on both two benchmarks.

The proposed hierarchical feature transformer is devoted to modeling an effective feature mapping among multi-level features so that SOTA performance can be achieved without a heavy computational burden.
- To further evaluate its effectiveness, it is compared against trackers equipped with deeper backbones.
- The comparison includes the state-of-the-art trackers SiamRPN++(ResNet-50), SiamRPN++(MobileNet), SiamMask(ResNet-50), ATOM(ResNet-18), DiMP(ResNet-50), PrDiMP(ResNet-18), SiamCAR(ResNet-50), SiamGAT(GoogleNet), and SiamBAN(ResNet-50).
As shown in Figure 6, HiFT achieves a satisfactory balance between tracking robustness and speed.
- On UAV20L, HiFT (0.763) with an AlexNet backbone outperforms the second-best tracker SiamRPN++(ResNet-50) in precision and runs at 127fps, twice as fast.
- On DTB70, HiFT achieves performance similar to trackers based on deeper CNNs.

The average precision and tracking speed are reported in Table 6: HiFT yields the best average precision (0.783) at a promising speed of 129.87fps, demonstrating that HiFT strikes a good balance between tracking performance and efficiency.

Table 6. Average precision and tracking speed of HiFT and the trackers with deeper backbone. The proposed approach runs at a satisfactory speed of ~130 FPS, while achieving comparable tracking performance with those trackers equipped with a deeper backbone. The best three performances are respectively highlighted with red, green, and blue color.

6. Real-World Tests

Figure 7. Visualization of real-world tests on the embedded platform. The tracking results and ground truth are marked with red and green boxes. The CLE score below the blue dotted line is considered as the success tracking result in the real-world tests.

To demonstrate its feasibility in real applications, HiFT is further deployed on a typical UAV platform with an embedded onboard processor (i.e., NVIDIA AGX Xavier).
Figure 7 shows three tests in the wild, including day and night scenes.
- The main challenges in the tests are partial occlusion and viewpoint change (first row), low resolution and camera motion (second row), and small object and similar objects around (third row).
Results
- Thanks to the effective feature transformer, HiFT maintains satisfactory tracking robustness in a variety of challenging scenarios.
- Moreover, the HiFT tracker maintains an average speed of 31.2fps during the tests without using TensorRT.
- The real-world tests onboard the embedded system thus demonstrate the performance and efficiency of HiFT under various UAV-specific challenges.

7. Conclusion

This work proposes a novel hierarchical feature transformer for efficient aerial tracking, which streamlines the process of exploiting global contextual information and multi-level features.
- By using both low-resolution semantic information and high-resolution spatial details, the transformed features achieve promising performance in distinguishing the object from clutter with a lightweight structure.
- Meanwhile, thanks to the modulation layer and the new classification label, the feature transformer can reach its full potential.
Comprehensive experiments verify that HiFT achieves a good precision-speed trade-off and can be used in real-world aerial tracking scenarios.
Even compared with trackers that have deeper backbones, HiFT achieves comparable performance.
The authors are convinced that this work can advance the development of aerial tracking and promote real-world applications of visual tracking.
