For real-world video classification, it is essential to capture spatiotemporal features, because the interwoven motion patterns in the optical flow are expected to carry much of the signal. By contrast, most implementations learn representations of individual frames in isolation from the preceding frames in the video. To explore a more suitable method, this case study builds a video classifier that sends an alert whenever violence is detected. PyTorch* [1] is used as the deep learning framework, and training and inference run on an Intel® Xeon® Scalable processor.
Sudden outbursts of physical violence, such as road rage or a prison upheaval, are frequently observed. To address such scenarios, high-tech industries have deployed closed-circuit television (CCTV) cameras that provide extensive coverage of public places. After an untoward incident, it is common to analyze the surveillance footage and start an investigation. An intervention by security officials while the violence is taking place, however, could prevent loss of life and minimize destruction of public property. One obvious approach is to station personnel for continuous manual monitoring of the CCTV feeds. This is both burdensome and error-prone, given the tedious nature of the job and human limitations. A more effective method could be automatic detection of violence in CCTV footage that triggers alerts to security officials, reducing the risk of manual error. Although this solution is most appealing to the defense and security industries, it is also relevant to authorities responsible for public property.
In this experiment, we implemented the proposed solution using 3D convolutional neural networks (CNN) with ResNet-34 [2] as the base topology. The experiments were performed using transfer learning on pretrained 3D residual networks (ResNet) initialized with weights learned on the Kinetics* human action video dataset.
The dataset for training was taken from Google’s atomic visual action (AVA) dataset. The task is binary classification between a fighting class and a non-fighting class (explained further in the Dataset Preparation section). Each class contains an approximately equal number of instances. The videos for the non-fighting class comprise instances from the eat, sleep, and drive classes made available with the AVA dataset [3].
The configuration of the Intel Xeon Scalable processor is as follows:
Table 1. Intel® Xeon® Gold processor configuration
|CPU op-mode(s)||32-bit, 64-bit|
|Byte Order||Little Endian|
|On-line CPU(s) list||0–23|
|Thread(s) per core||2|
|Core(s) per socket||6|
|Non-uniform memory access (NUMA) node(s)||2|
|Vendor ID||Genuine Intel|
|Model name||Intel® Xeon® Gold 6128 processor 3.40 GHz|
|NUMA node0 CPU(s)||0-5,12-17|
|NUMA node1 CPU(s)||6-11,18-23|
Prerequisite dependencies to proceed with the development of this use case are outlined below:
Table 2. Software configuration
|Operating System||CentOS* 7.3.1|
In addition to being time consuming, a CNN requires millions of data points to be trained from scratch. In this context, with only 545 video clips in the fighting class and 450 in the non-fighting class, training a network from scratch could result in an over-fitted network. Therefore, we opted for transfer learning, which minimizes the training time and helps attain better inference. The experiment uses a pretrained 3D CNN ResNet network, initialized with the weights of the Kinetics video action dataset. Fine-tuning of the network is done by training the final layers with the acquired AVA training dataset customized to the fight classification. This fine-tuned model is later used for inference.
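The fine-tuning step described above can be sketched as follows. This is a minimal illustration, not the code used in the experiment: `TinyVideoNet` is a hypothetical stand-in for the pretrained 3D ResNet, the 400-class head mirrors the Kinetics label count, and we assume that only the final fully connected layer is retrained. The learning rate and momentum are placeholder values.

```python
import torch
import torch.nn as nn


class TinyVideoNet(nn.Module):
    """Hypothetical stand-in for a pretrained 3D ResNet (not the real model)."""

    def __init__(self, num_classes: int = 400):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 8, kernel_size=3, padding=1),
            nn.BatchNorm3d(8),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1),
        )
        self.fc = nn.Linear(8, num_classes)

    def forward(self, x):
        return self.fc(torch.flatten(self.features(x), 1))


def prepare_for_finetuning(model: nn.Module, num_classes: int = 2) -> nn.Module:
    # Freeze every pretrained parameter so that only the new head is trained.
    for param in model.parameters():
        param.requires_grad = False
    # Replace the classifier with a fresh head for fighting / non-fighting.
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model


model = prepare_for_finetuning(TinyVideoNet(num_classes=400))
trainable = [name for name, p in model.named_parameters() if p.requires_grad]

# Only the parameters that still require gradients are handed to the optimizer.
optimizer = torch.optim.SGD(
    (p for p in model.parameters() if p.requires_grad), lr=0.001, momentum=0.9
)

clip = torch.randn(2, 3, 8, 32, 32)  # batch x channels x frames x height x width
logits = model(clip)                 # one score per class for each clip
```

Freezing the backbone keeps the number of trainable parameters small, which is the main protection against over-fitting on a dataset of roughly a thousand clips.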
Image-based features extracted using 2D convolutions are not directly suitable for deep learning on video-based classification, where learning and preserving spatiotemporal features is vital. One alternative for capturing this information is the 3D ConvNet [4]. In 2D ConvNets, convolution and pooling operations are performed spatially, whereas in 3D ConvNets they are performed spatiotemporally. The figures below show how the two approaches treat multiple input frames:
Figure 1. 2D convolution on multiple frames [4]
Figure 2. 3D convolution [4]
As shown, 2D convolution applied to multiple images (treating them as different channels) results in a single image (figure 1). Even though the input is three dimensional (W, H, L, where L is the number of input channels), the output is a 2D matrix. The convolution slides across two directions only, and the filter depth matches the number of input channels. Consequently, the temporal information of the input signal is lost after every convolution.
Input shape = [W, H, L], filter = [k, k, L], output = 2D
On the other hand, 3D convolution preserves the temporal information of the input signal and results in an output volume (figure 2). Here, the convolution slides across three directions, so the output is a 3D volume. The same holds for 2D and 3D pooling operations.
Input shape = [W, H, L], filter = [k, k, d] with d < L, output = 3D
3D ConvNet models temporal information better because of its 3D convolution and 3D pooling operations.
In our case, a video clip has size c × l × h × w, where c is the number of channels, l is the length in frames, and h and w are the height and width of a frame, respectively. Likewise, a 3D convolution or pooling kernel has size d × k × k, where d is the temporal depth of the kernel and k is its spatial size.
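The difference between the two treatments can be checked directly on tensor shapes. The sketch below uses the c × l × h × w notation from above; the concrete sizes (16 frames of 112 × 112 pixels, 3 × 3 × 3 kernels, padding of 1) are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn

c, l, h, w = 3, 16, 112, 112   # channels, frames, height, width (assumed sizes)
k, d = 3, 3                    # spatial kernel size, temporal kernel depth

clip = torch.randn(1, c, l, h, w)  # batch x c x l x h x w

# 2D convolution: the l frames are stacked with the colour channels, so the
# filter depth equals the full input depth and time is collapsed in one step.
frames_as_channels = clip.reshape(1, c * l, h, w)
conv2d = nn.Conv2d(in_channels=c * l, out_channels=1, kernel_size=k, padding=1)
out2d = conv2d(frames_as_channels)

# 3D convolution: the d x k x k kernel also slides along time, so the
# output is still a volume with a temporal axis.
conv3d = nn.Conv3d(in_channels=c, out_channels=1, kernel_size=(d, k, k), padding=1)
out3d = conv3d(clip)

print(tuple(out2d.shape))  # (1, 1, 112, 112)
print(tuple(out3d.shape))  # (1, 1, 16, 112, 112)
```

The 2D output has no frame axis left, whereas the 3D output keeps all 16 temporal positions for later layers to use.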
The dataset for training is acquired from the Google AVA dataset [3]. The original AVA dataset contains 452 videos split into 242 for training, 66 for validation, and 144 for test. Each video has 15 minutes annotated in one-second intervals, resulting in 900 annotated segments. These annotations are specified by two CSV files, ava_train_v2.0.csv and ava_val_v2.0.csv. Each CSV file contains the following fields:
- video_id: YouTube* identifier.
- middle_frame_timestamp: in seconds from the start of the YouTube video.
- person_box: top-left (x1, y1) and bottom-right (x2, y2) normalized with respect to frame size, where (0.0, 0.0) corresponds to the top-left, and (1.0, 1.0) corresponds to the bottom-right.
- action_id: identifier of an action class.
Among the 80 action classes available, only the fighting class (450 samples) provides positive samples for the current use case. A total of 450 samples (150 per class) are taken from the eat, sleep, and drive classes to form the non-fighting class. The YouTube videos are downloaded using the command-line utility youtube-dl.
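Selecting samples from the annotation files could look roughly like the sketch below. The column order follows the field list above; the rows are made-up examples, and the action IDs are placeholders rather than the real AVA label map, which should be consulted for the actual identifiers.

```python
import csv
import io
from collections import defaultdict

# Column order as listed above. The rows are invented for illustration only.
COLUMNS = ["video_id", "middle_frame_timestamp", "x1", "y1", "x2", "y2", "action_id"]
SAMPLE_CSV = """\
vidA,0902,0.10,0.20,0.50,0.90,1
vidA,0903,0.12,0.22,0.52,0.88,2
vidB,0910,0.30,0.10,0.70,0.95,1
vidB,0911,0.05,0.05,0.40,0.80,3
vidC,0920,0.25,0.15,0.60,0.85,2
"""
FIGHT_ID = 1             # hypothetical ID for the fighting class
NON_FIGHT_IDS = {2, 3}   # hypothetical IDs standing in for eat, sleep, drive


def select_samples(csv_text, per_class_limit):
    """Group annotation rows by action_id, keeping at most per_class_limit each."""
    selected = defaultdict(list)
    reader = csv.DictReader(io.StringIO(csv_text), fieldnames=COLUMNS)
    for row in reader:
        action = int(row["action_id"])
        if action != FIGHT_ID and action not in NON_FIGHT_IDS:
            continue  # ignore all other action classes
        if len(selected[action]) < per_class_limit:
            selected[action].append(
                (row["video_id"], int(row["middle_frame_timestamp"]))
            )
    return dict(selected)


samples = select_samples(SAMPLE_CSV, per_class_limit=1)
fighting = samples[FIGHT_ID]
non_fighting = [s for a in sorted(NON_FIGHT_IDS) for s in samples.get(a, [])]
```

With the real files, `per_class_limit` would be set to 150 for each non-fighting class to reproduce the balance described above.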
Each clip is four seconds long and has approximately 25 frames per second. The frames for each clip are extracted into a separate folder with the folder name as the name of the video clip. These are extracted using the ava extraction script.
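As a rough sketch of the extraction step, the helper below builds an ffmpeg argument list that writes the frames of one clip into its own folder. It is not the AVA extraction script itself: the folder naming scheme, the file name pattern, and the function names are assumptions made for illustration, and the command is only constructed here, not executed.

```python
from pathlib import Path

CLIP_SECONDS = 4   # clip length stated above
FPS = 25           # approximate frame rate stated above


def expected_frame_count(seconds=CLIP_SECONDS, fps=FPS):
    """Number of frames one clip should yield at a fixed frame rate."""
    return seconds * fps


def build_ffmpeg_command(video_path, start_second, frames_root):
    """Return an ffmpeg argument list that dumps one clip into its own folder."""
    video_path = Path(video_path)
    # Hypothetical naming scheme: <video name>_<start second>
    out_dir = Path(frames_root) / f"{video_path.stem}_{start_second:04d}"
    return [
        "ffmpeg",
        "-ss", str(start_second),          # seek to the start of the clip
        "-t", str(CLIP_SECONDS),           # keep four seconds of video
        "-i", str(video_path),
        "-r", str(FPS),                    # write frames at a fixed rate
        str(out_dir / "image_%05d.jpg"),   # numbered JPEG frames
    ]


cmd = build_ffmpeg_command("videos/vidA.mp4", 900, "frames")
print(expected_frame_count())  # 100
print(" ".join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` once the output folder exists, since ffmpeg does not create missing directories.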
The ffmpeg library is used for converting the available AVA video clips to frames. The frames are then converted to type FloatTensor using the Tensor class provided with PyTorch. Memory use stays efficient because many tensor methods avoid copies: they either modify the existing tensor in place or return a new tensor that references the same storage.
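The storage behavior can be observed with a few tensor calls. In this sketch a random tensor stands in for a decoded frame (an assumption, since no real image is loaded); note that the dtype conversion itself does allocate a new tensor, while the layout change and the in-place scaling do not.

```python
import torch

# Stand-in for one decoded RGB frame: height x width x channels, 8-bit values.
frame = torch.randint(0, 256, (112, 112, 3), dtype=torch.uint8)

# Converting the dtype allocates a new float tensor.
as_float = frame.float()
tensor_type = as_float.type()            # 'torch.FloatTensor' on CPU

# Layout changes such as permute return a view on the same storage.
chw = as_float.permute(2, 0, 1)          # channels x height x width, no copy
same_storage = chw.data_ptr() == as_float.data_ptr()

# In-place methods (trailing underscore) modify the existing tensor.
ptr_before = as_float.data_ptr()
as_float.div_(255.0)                     # scale to [0, 1] in place
still_same = as_float.data_ptr() == ptr_before

# Because chw is a view, it sees the scaled values as well.
max_after_scaling = float(chw.max())
```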
3D CNN ResNet
The architecture followed for the current use case is ResNet based with 3D convolutions. A basic ResNet block consists of two convolutional layers, and each convolutional layer is followed by batch normalization and a rectified linear unit (ReLU). A shortcut pass [5] connects the top of the block to the layer just before the last ReLU in the block. For our experiments, we use the relatively shallow ResNet-34, which adopts the basic blocks.
Figure 3. Architecture of the 3D CNN ResNet-34
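A basic block of this kind can be sketched in PyTorch as shown below. This is an illustrative version for the case where input and output shapes match. The 3 × 3 × 3 kernel size, the channel count, and the class name `BasicBlock3D` are assumptions, not details taken from the experiment.

```python
import torch
import torch.nn as nn


class BasicBlock3D(nn.Module):
    """Two 3x3x3 convolutions with an identity shortcut (shapes unchanged)."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv3d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm3d(channels)
        self.conv2 = nn.Conv3d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm3d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = out + x            # shortcut joins just before the last ReLU
        return self.relu(out)


block = BasicBlock3D(channels=64)
x = torch.randn(1, 64, 8, 28, 28)   # batch x c x l x h x w
y = block(x)
```

Because the shortcut is a plain addition, the block adds no parameters beyond its two convolutions and their batch normalization layers.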
When the dimensions increase (dotted line shortcuts in the given figure), the following two options are considered:
- Type A: The shortcut performs identity mapping, with extra zero entries padded for the increased dimensions. This option introduces no extra parameters.
- Type B: A projection shortcut is used to match dimensions (done by 1×1 convolutions).
For both options, when the shortcuts go across feature maps of two sizes, they are performed with a stride of 2.
We have used Type A shortcuts with the ResNet-34 basic block to avoid increasing the number of parameters of the relatively shallow network.
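A parameter-free shortcut of this kind can be written in a few lines. The sketch below is one plausible formulation, assuming a stride of 2 and a doubling of channels from 64 to 128; the function name `type_a_shortcut` is ours.

```python
import torch
import torch.nn.functional as F


def type_a_shortcut(x: torch.Tensor, out_channels: int, stride: int = 2) -> torch.Tensor:
    """Parameter-free shortcut: subsample with a stride, then zero-pad channels."""
    # A 1x1x1 average pool with stride 2 simply keeps every second position.
    out = F.avg_pool3d(x, kernel_size=1, stride=stride)
    extra = out_channels - out.size(1)
    zeros = torch.zeros(
        out.size(0), extra, *out.shape[2:], dtype=out.dtype, device=out.device
    )
    # The new channels carry zeros, so nothing has to be learned.
    return torch.cat([out, zeros], dim=1)


x = torch.randn(1, 64, 8, 28, 28)               # batch x c x l x h x w
shortcut = type_a_shortcut(x, out_channels=128)
print(tuple(shortcut.shape))  # (1, 128, 4, 14, 14)
```

A projection shortcut would instead use a strided 1 × 1 × 1 convolution, which adds weights for every block where the dimensions change.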
The 3D convolutions are used to directly extract the spatiotemporal features from raw videos. A two-channeled approach of using a combination of RGB color space and optical flows as inputs to the 3D CNNs is used on the Kinetics dataset to derive the pretrained network. As pretraining on large-scale datasets is an effective way to achieve good performance levels on small datasets, we expect the 3D ResNet-34 pretrained on Kinetics to perform well for this use case.
The training is performed using the Intel Xeon Scalable processor. The pretrained weights used for this experiment can be downloaded from GitHub*. This model is trained on the Kinetics Video dataset.
A brief description of the pretrained model is provided below:
resnet-34-kinetics-cpu.pth: --model resnet --model_depth 34 --resnet_shortcut A
The solution is based on the 3D-Resnets-PyTorch implementation by Kensho Hara, Hirokatsu Kataoka, and Yutaka Satoh.
Inference with the trained model is run on YouTube videos downloaded from the AVA test set. Each video clip is broken down into frames, which are converted to Torch* tensors and passed to the classifier. The frames of each clip are divided into segments, and a classification score is obtained for each segment. The classification results are written onto the video frames, which are then stitched back into a video. Inference uses the code in this GitHub link.
Given that input videos are located in ./videos, the following command is used for inference:
python main.py --input ./input --video_root ./videos --output ./output.json --model ./resnet-34-kinetics.pth --mode score
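The segment-wise scoring described above can be illustrated without the model itself. In the sketch below, the segment length of 16 frames, the class order, and the logits are all assumptions. The logits are made-up numbers standing in for the classifier output of each segment.

```python
import math

SEGMENT_LEN = 16  # assumed number of frames per segment, not stated in the text


def split_into_segments(n_frames, segment_len=SEGMENT_LEN):
    """Return (start, end) frame ranges for consecutive full-length segments."""
    last_start = n_frames - segment_len
    return [(s, s + segment_len) for s in range(0, last_start + 1, segment_len)]


def softmax(logits):
    """Convert raw scores to probabilities in a numerically stable way."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def label_segments(segment_logits, class_names=("fight", "no_fight")):
    """Pick the highest-probability class for each segment."""
    labels = []
    for logits in segment_logits:
        probs = softmax(logits)
        best = max(range(len(probs)), key=probs.__getitem__)
        labels.append((class_names[best], probs[best]))
    return labels


segments = split_into_segments(n_frames=100)   # a four-second clip at 25 fps
# Made-up logits, one pair per segment.
fake_logits = [[2.0, 0.5], [0.1, 1.2], [3.0, -1.0], [0.0, 0.0], [1.5, 1.0], [0.2, 2.2]]
labels = label_segments(fake_logits)
```

In this formulation the trailing frames that do not fill a whole segment are dropped; padding or overlapping the last segment would be an equally reasonable choice.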
The following GIF is extracted from the video results obtained by passing a video clip to the trained PyTorch model.
Figure 4. Inferred GIF
Conclusion and Future Work
The results are obtained with a high level of accuracy. AVA contains samples from movies that are at least a decade old, so to test the efficacy of the trained model, inference was also run on external videos (CCTV footage, recently captured fight sequences, and so on). The results showed that neither the quality of a video nor the period in which it was captured influenced the accuracy. As future work, the classification problem could be extended with detection, and the same experiment could be carried out using recurrent neural network techniques.
About the Authors
Astha Sharma and Sandhiya S are Technical Solution Engineers working on behalf of the Intel® AI Developer Program.