The transformer revolution in video recognition. Are you ready for it?

Jayanth K Ajay

Sr. Associate- DSG

Dr. Monika Singh

Consultant - DSG

Imagine how many lives could be saved if caretakers or medical professionals could be alerted when an unmonitored patient showed the first signs of sickness. Imagine how much more secure our public spaces could be if police or security personnel could be alerted upon suspicious behavior. Imagine how many tournaments could be won if activity recognition could inform teams and coaches of flaws in athletes’ form and functioning.

With Human Activity Recognition (HAR), all these scenarios can be effectively tackled. HAR has been one of the most complex challenges in computer vision, with a wide variety of potential applications such as sports, post-injury rehabilitation, analytics, security surveillance, traffic monitoring, etc. The complexity of HAR arises from the fact that an action spans both spatial and temporal dimensions. In typical computer vision tasks, where a model is trained to classify or detect objects in an image, only the spatial dimension is involved. In HAR, however, learning from multiple frames together over time helps classify the action better. Hence, the model must be able to track both the spatial and the temporal components.

Architectures used for video activity recognition include 2D convolutions, 3D CNN volume filters that capture spatio-temporal information (Tran et al., 2015), 3D convolutions factorized into separate spatial and temporal convolutions (Tran et al., 2018), LSTMs for spatio-temporal modeling (Karpathy et al., 2014), as well as combinations and enhancements of these techniques.

TimeSformer – A revolution by Facebook!

Transformers have been making waves in Natural Language Processing (NLP) for the last few years. They employ self-attention in an encoder-decoder architecture to make accurate predictions by extracting contextual information. In computer vision, the first prominent implementation of transformers came through ViT (the Vision Transformer), developed by Google. In ViT, an image is divided into patches of size 16×16 (see Figure 1), which are flattened into 1D vectors, linearly embedded, and passed through an encoder. Self-attention for each patch is calculated with respect to all the other patches.


Figure 1. The Vision Transformer treats an input image as a sequence of patches, akin to a series of word embeddings generated by an NLP Transformer. (Source: Google AI Blog, "Transformers for Image Recognition at Scale")
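The patchify-and-embed step described above can be sketched in a few lines of NumPy. This is a toy illustration: random weights stand in for the learned linear projection, and the positional embeddings (initialized to zeros here) would be learned during training.

```python
import numpy as np

# A 224x224 RGB image, channels-first, as in the ViT setup described above.
img = np.random.rand(3, 224, 224).astype(np.float32)

patch = 16                # ViT uses 16x16 patches
n = 224 // patch          # 14 patches per side -> 196 patches total

# Split the image into non-overlapping 16x16 patches and flatten each
# patch into a 1-D vector of length 3*16*16 = 768.
patches = (img.reshape(3, n, patch, n, patch)
              .transpose(1, 3, 0, 2, 4)      # (14, 14, 3, 16, 16)
              .reshape(n * n, -1))           # (196, 768)

# Linear embedding to the model dimension, plus a learnable positional
# embedding per patch so the encoder can recover the image structure.
d_model = 768
W = (np.random.randn(patches.shape[1], d_model) * 0.02).astype(np.float32)
pos = np.zeros((n * n, d_model), dtype=np.float32)   # learned in training
tokens = patches @ W + pos                           # (196, 768)

print(tokens.shape)  # (196, 768)
```

The resulting sequence of 196 tokens is what the transformer encoder consumes, exactly as it would consume a sequence of word embeddings.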

Recently, Facebook developed TimeSformer, the first architecture in which transformers are used for HAR. As in other HAR methods, the input to TimeSformer is a block of contiguous frames from a video clip, for example, 16 contiguous frames of size 3×224×224. To calculate the self-attention for a patch in a frame, two sets of other patches are used:

 a) other patches of the same frame (spatial attention).

 b) patches of the adjacent frames (temporal attention).

These patches can be combined in several different ways. We have utilized only "divided space-time attention" (Figure 2), which uses all the patches of the current frame and the patches at the same position in adjacent frames. In divided attention, temporal attention and spatial attention are applied separately within each block, and this scheme leads to the best video classification accuracy (Bertasius et al., 2021).

Figure 2. Divided space-time attention in a block of frames. (Source: Facebook AI Blog, "TimeSformer: A new architecture for video understanding")
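A minimal sketch of the divided scheme, assuming a single attention head and omitting the learned query/key/value projections, LayerNorm, and residual connections of the real model: temporal attention runs over the same spatial position across frames, then spatial attention runs over all patches within each frame.

```python
import numpy as np

def self_attention(x):
    """Single-head self-attention over the first axis of x: (seq, dim)."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)                   # (seq, seq) similarities
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)           # softmax over keys
    return w @ x

# A toy block: T=8 frames, N=196 patches per frame, D=64 dims.
T, N, D = 8, 196, 64
x = np.random.randn(T, N, D)

# 1) Temporal attention: each patch attends to the patches at the SAME
#    spatial position in the other frames of the block.
xt = np.stack([self_attention(x[:, n]) for n in range(N)], axis=1)

# 2) Spatial attention: each patch then attends to all patches of its frame.
xs = np.stack([self_attention(xt[t]) for t in range(T)])

print(xs.shape)  # (8, 196, 64)
```

Splitting attention this way keeps the cost at roughly T² + N² comparisons per patch instead of the (T·N)² required by joint space-time attention.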

It must be noted that TimeSformer does not use any convolutions, which brings down the computational cost significantly. Convolution is a linear operator in which the kernel uses neighboring pixels in its computations. Vision transformers (Dosovitskiy et al., 2020), on the other hand, are permutation invariant and require sequences of data, so the spatially non-sequential input is converted into a sequence of patches. Learnable positional embeddings are added to each patch (analogous to an NLP task) to allow the model to learn the structure of the image.

TimeSformer is roughly three times faster to train than 3D CNNs and requires less than one-tenth the compute for inference. It has 121,266,442 trainable parameters, compared to 40,416,074 in the 2D CNN model and 78,042,250 in the 3D CNN model.

In addition, TimeSformer has the advantages of customizability and scalability over convolutional models: the user can choose the spatial size of the frames and the number of frames used as input to the model. The original study utilized frames as large as 556×556 and clips as deep as 96 frames, and the computational cost still did not grow exponentially.

But challenges abound…

Following are some of the challenges involved in tracking motion:

  1. The challenge of angles: A video can be shot from multiple angles, and the pattern of motion can appear different from different angles.
  2. The challenge of camera movement: Depending on whether the camera moves with the object, the object could appear to be static or moving. Shaky cameras add further complexity.
  3. The challenge of occlusion: During motion, the object could be temporarily hidden by another object in the foreground.
  4. The challenge of delineation: It is not always easy to tell where one action ends and another begins.
  5. The challenge of multiple actions: Different objects in a video could be performing different actions simultaneously, adding complexity to recognition.
  6. The challenge of change in relative size: As an object moves towards or away from the camera, its apparent size changes continuously, adding further complexity to recognition.


To evaluate video recognition capability, we used a modified UCF11 dataset. We removed the Swing class, as it contains many mislabeled videos. The goal is to recognize 10 activities: basketball, biking, diving, golf swing, horse riding, soccer juggling, tennis swing, trampoline jumping, volleyball spiking, and walking. Each class has 120-200 videos of varying lengths, ranging from 1 to 21 seconds. The dataset is available from the Center for Research in Computer Vision (CRCV) at the University of Central Florida.

How we trained the model

We carried out several experiments to find the best-performing model. Our input block contained 8 frames of size 3×224×224. The base learning rate was 0.005, reduced by a factor of 0.1 at epochs 11 and 14. For augmentation, color jitter (within 40), random horizontal flip, and random crop (from 256×320 down to 224×224) were used. We trained the model on an AWS Tesla M60 GPU with a batch size of 2 (due to memory limitations) for 15 epochs.
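The step schedule above can be sketched as follows. This is an illustrative helper, not the actual solver code from the TimeSformer repository, but it reproduces the decay pattern we used.

```python
def lr_at(epoch, base_lr=0.005, milestones=(11, 14), gamma=0.1):
    """Step decay: the learning rate is multiplied by gamma at each
    milestone epoch and stays constant in between."""
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= gamma
    return lr

# Epochs 0-10 train at 0.005, epochs 11-13 at 0.0005, epoch 14 at 0.00005.
schedule = [lr_at(e) for e in range(15)]
```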

Metrics are everything

In the original code, TimeSformer samples one input block per clip for training and validation. For test videos, it takes 3 different crops and averages over the predictions (we term this samplewise accuracy). As a result, several of the models we trained could achieve over 95% validation accuracy. However, in our opinion, this is not satisfactory, because it does not examine all the different spatio-temporal possibilities in the video. To address this, we take two other metrics into consideration.

  • Blockwise accuracy – a video clip is treated as a sequence of contiguous, non-overlapping building blocks. The model makes a prediction for each input block, and accuracy is measured over all blocks. This is more suitable for real-time scenarios.
  • Clipwise accuracy – the predictions for all the blocks of a video are collected, and their mode is assigned as the prediction for the clip; accuracy is then measured over clips. This helps gauge real-time accuracy over a larger timeframe.
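The two metrics above can be sketched as follows, assuming we have a list of per-block predictions for each clip (the helper names here are illustrative, not the actual evaluation code):

```python
import numpy as np

def blockwise_accuracy(block_preds, labels):
    """block_preds[i] is the list of per-block predictions for clip i;
    accuracy is measured over every block of every clip."""
    correct = total = 0
    for preds, y in zip(block_preds, labels):
        correct += sum(p == y for p in preds)
        total += len(preds)
    return correct / total

def clipwise_accuracy(block_preds, labels):
    """The clip-level prediction is the mode of its block predictions."""
    hits = 0
    for preds, y in zip(block_preds, labels):
        vals, counts = np.unique(preds, return_counts=True)
        hits += int(vals[counts.argmax()] == y)
    return hits / len(labels)

# Toy example: 3 clips with labels 0, 1, 2 and a variable block count.
preds = [[0, 0, 1], [1, 1], [2, 0, 2, 2]]
labels = [0, 1, 2]
print(blockwise_accuracy(preds, labels))  # 7/9, about 0.778
print(clipwise_accuracy(preds, labels))   # 1.0
```

Note how a clip can be classified correctly clipwise even when some of its individual blocks are misclassified, which is why clipwise accuracy tends to sit above blockwise accuracy.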

The final outcome

Our best model had the following performance metric values:

  • Samplewise accuracy – 97.3%
  • Blockwise accuracy – 86.8%
  • Clipwise accuracy – 92.2%

The confusion matrix for clipwise accuracy is given in Figure 3(a). For comparison, the confusion matrices for the 2D CNN and 3D CNN models are shown in Figures 3(b) and 3(c), respectively.

These metrics are quite impressive and far better than the results we obtained using 2D convolution (VGG16) and 3D convolution (C3D): 81.3% and 74.6%, respectively. This points to the potential of TimeSformers.

Figure 3(a). Clipwise accuracy confusion matrix for TimeSformer

Concluding Remarks

In this work, we have explored the effectiveness of TimeSformer for the Human Activity Recognition task on the modified UCF11 dataset. This convolution-free model outperformed the 2D CNN and 3D CNN models and performed extremely well on some hard-to-classify classes such as 'basketball' and 'walking'. Future work includes trying more augmentation techniques to fine-tune this model and applying vision transformers to other video-related tasks such as video captioning and action localization.

When done right, TimeSformer can truly change the game for Human Activity Recognition, bringing its use cases across healthcare, sports, safety, and security to life.


[1] Tran, D., Bourdev, L., Fergus, R., Torresani, L., & Paluri, M. (2015). Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE international conference on computer vision (pp. 4489-4497).

[2] Tran, D., Wang, H., Torresani, L., Ray, J., LeCun, Y., & Paluri, M. (2018). A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition (pp. 6450-6459).

[3] Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. (2014). Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition (pp. 1725-1732).

[4] Bertasius, G., Wang, H., & Torresani, L. (2021). Is Space-Time Attention All You Need for Video Understanding?. arXiv preprint arXiv:2102.05095.

[5] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., & Houlsby, N. (2020). An image is worth 16×16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.

About Author

A Ph.D. by education and an analytical problem solver at heart, Jayanth leverages deep learning to build resilient AI-led solutions. He developed the Human Activity Recognition solution for sports videos using 2D CNN, 3D CNN, multimodal learning and TimeSformer models.

Jayanth K Ajay

