Using Deep Siamese networks for trajectory analysis to extract motion patterns in videos

This paper investigates the use of Siamese networks for trajectory similarity analysis in surveillance tasks. Specifically, the proposed approach uses an auto-encoder as part of training a discriminative twin (Siamese) network to perform trajectory similarity analysis, thus presenting an end-to-end framework for online motion pattern extraction in the scene with the ability to incorporate new incoming trajectories incrementally. The effectiveness of the proposed method is evaluated on four challenging public real-world datasets containing both vehicle and person targets, and compared with five existing methods. The proposed method consistently shows performance better than or comparable to the existing methods on all datasets.

Introduction: Motion patterns represent spatiotemporal trends of moving targets in a scene and can be extracted based on the analysis of motion information [1][2][3][4]. Indeed, these extracted motion patterns of targets offer useful information that aids in performing activity analysis [5], behaviour prediction [6], tracking [7], and abnormality detection [8].
Existing motion pattern extraction methods are generally classified as interframe-motion-based and multiframe-motion-based methods. Interframe-motion-based methods [9][10][11] rely on using motion information of targets between two consecutive frames to extract motion patterns. These methods generally perform more robustly in crowded scenarios but are often considered unsuitable for extracting long-range motion patterns [12]. Multiframe-motion-based methods [1-3, 13] instead use motion information across multiple frames, e.g. short-duration tracks or complete trajectories as estimated with video tracking, and are regarded as more helpful in extracting long-range patterns. These methods generally involve first estimating tracked trajectories of targets [1-3, 12], followed by encoding trajectory information into feature space(s) [2, 13-15], and performing trajectory clustering [1, 2, 4, 13, 14] or classification [3, 15] to determine dominant patterns. Some of these approaches [2, 4, 13, 14] work in an offline manner as they assume the availability of all estimated trajectories a priori, with an inability to incorporate new trajectories incrementally at the clustering or classification stage. Online approaches [1, 3, 15] do exist that enable updating extracted clusters incrementally with incoming trajectories. Recently, there has been a growing focus on the use of deep learning techniques for various vision tasks, including those involving trajectory analysis [16, 17], but these are not specifically aimed at motion pattern extraction.
This paper presents a framework built upon an auto-encoder architecture used as part of training a Siamese network to perform trajectory similarity analysis for extracting motion patterns. Unlike existing methods [2, 4, 13, 14], this paper proposes an online approach that allows incorporating new incoming trajectories in an incremental manner. Moreover, unlike existing online approaches [1, 3, 15], the proposed method presents an end-to-end trained deep learning network without the need for separate explicit feature extraction and clustering/classification stages; hence there is no requirement for an enhanced discriminative ability of features as in [1, 3, 15], nor for an appropriate choice of clustering technique as in [1, 3]. We show the effectiveness of the proposed method by evaluating and comparing its performance with several state-of-the-art methods on four challenging public real-world datasets with significant variability.
Problem definition: Let X be the set of trajectories estimated by a tracker in a video sequence V: X = {X_j}_{j=1}^{J}, where J is the number of estimated trajectories and X_j is the estimated trajectory for target j; k_j^start and k_j^end are the first and final frame numbers of X_j, respectively. X_{k,j} is the estimated state of target j at frame k: k = 1, ..., K, with K the total number of frames in V. X_{k,j} = (x_{k,j}, y_{k,j}), where (x_{k,j}, y_{k,j}) denotes the position of target j on the image plane at frame k. The analysis of the trajectories X aids in identifying the motion patterns, i.e. the representative spatiotemporal trends of moving targets (people, vehicles etc.) in a scene.
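Under this definition, a trajectory set could be represented as in the following sketch; the coordinate values are purely illustrative and not taken from any of the datasets:

```python
import numpy as np

# Hypothetical example: a set X of J estimated trajectories; each trajectory
# X_j is a sequence of image-plane positions (x_{k,j}, y_{k,j}) over its frames.
X = [
    np.array([[10.0, 20.0], [12.0, 21.0], [14.0, 23.0]]),  # X_1: 3 frames
    np.array([[50.0, 60.0], [49.0, 58.0]]),                # X_2: 2 frames
]
J = len(X)  # number of estimated trajectories

# The estimated state of target j at one of its observed frames k:
x_k_j, y_k_j = X[0][2]  # position of target 1 at its third observed frame
```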
Neural network architecture: The overall architecture of the neural network is split into two parts: an encoding part trained as an auto-encoder, and a second part that extends it into a twin (or Siamese) neural network. The full architecture can be seen in Figure 1. An auto-encoder was chosen as the first stage as it has been shown to be very effective in creating a trajectory feature vector that discriminates between clusters [2], and is therefore expected to be a good discriminator between trajectories as part of a Siamese network.
In order to generate the feature vector f_j for a trajectory X_j, an auto-encoder, also known as an auto-associator or Diabolo network [18], is first trained to reproduce the set of trajectories X. Once the network has been trained, the output of its smallest layer is used as the feature vector f_j. A separate network has to be trained for each dataset due to the varying length of the input vectors across datasets: a network consisting only of fully-connected layers cannot handle input vectors of varying sizes without introducing further normalisation. For this reason, as well as the difference in scenes across datasets, we opted to use separate networks.
An auto-encoder is an arrangement of a neural network in which the output, once trained, is an estimate of the provided input vector. The output of any layer in a trained network can be used as a representation of the input, since the rest of the network is able to reproduce the input vector from it. In this case, the input is the vectorisation of the data of X_j and is denoted as V_{X_j}.

A neural network is a combination of small units called neurons that are built up into multiple layers. The type of neuron used in this paper is based on the McCulloch-Pitts neuron [19]; this multiplies a single-dimensional input vector with a weight vector, adds a bias, and outputs a single value:

y_{l,n} = sum_{i=1}^{I_l} w_i x_{l,i} + b,

where y_{l,n} denotes the pre-activation output y of neuron n in layer l, I_l is the length of the input vector x_l for a particular layer, w_i and x_{l,i} are values in the weight vector and input vector respectively, and b is the bias value. The weight vector is initialised to random values. A sigmoidal function is used as the activation function:

Y = 1 / (1 + e^{-y}),    (1)

where Y is the output of the neuron. The neurons are then placed alongside each other as a layer. The output of layer l can be described as

L_l = (Y_{l,1}, ..., Y_{l,N_l}),    (2)

where the layer L_l is a vector containing the outputs of all N_l neurons in the layer. Each layer is then provided with the output of the previous one as its input vector (known as a fully-connected layer), except for the first layer (l = 1), whose input vector is the vectorised trajectory V_{X_j} rather than the output of a previous layer.

The architecture of this type of network contains two stages: an encoding stage and a decoding stage. For encoding, the number of neurons in each layer decreases, so that the dimensionality of the original input vector is effectively reduced as it passes through the network. The decoding stage is the opposite: a set of layers increasing in size up to the original length of the input vector. Both stages consist of one or more layers. As mentioned above, this method uses pre-defined scales based on the length of the original feature vector to determine the number of neurons in each layer of the encoding stage. Two layers are used: the first is 10% of the length of the input vector, the second 5%. Only one layer is used in the decoding stage, of the same size as the original vector. These scales are static across all of the datasets used in this study. The resulting layer sizes are listed in Table 1.
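Under the layer-sizing rule above, the auto-encoder could be sketched in PyTorch as follows; the 10%/5% encoding scales, the full-size decoding layer and the sigmoid activations follow the text, while the input length of 200 and the helper name build_autoencoder are illustrative assumptions:

```python
import torch
import torch.nn as nn

def build_autoencoder(input_len: int) -> nn.Module:
    """Sketch of the described auto-encoder: two encoding layers sized at
    10% and 5% of the input vector length, and a single decoding layer
    restoring the original size. Sigmoid activations throughout."""
    h1 = max(1, round(0.10 * input_len))
    h2 = max(1, round(0.05 * input_len))
    return nn.Sequential(
        nn.Linear(input_len, h1), nn.Sigmoid(),  # encoding layer 1 (10%)
        nn.Linear(h1, h2), nn.Sigmoid(),         # encoding layer 2 (5%): feature f_j
        nn.Linear(h2, input_len), nn.Sigmoid(),  # decoding layer: original size
    )

ae = build_autoencoder(200)       # e.g. a 200-element vectorised trajectory
v = torch.rand(4, 200)            # batch of 4 stand-in input vectors
out = ae(v)                       # reconstruction, same shape as the input
```

The output of the second (smallest) encoding layer is what serves as the feature vector f_j once training is complete.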
The concept of a Siamese network [20] is to create a network with two branches. The network takes two inputs (V_{X_{j0}} and V_{X_{j1}}) through the previously trained feature network (excluding the final decoding layer) to produce two outputs, in this case L_{2,0} and L_{2,1}. For the purposes of training and testing, L_{2,1} is the average over all the trajectories in the class of L_{2,0}. A depiction of the full network can be seen in Figure 1. The absolute difference of the two vectors is calculated as the input to the Siamese layer:

S_input = |L_{2,0} - L_{2,1}|.

This input is then forwarded into a layer S_0, equal in size to L_2, and then into a single layer containing a single neuron, S_1. Both layers use a sigmoidal activation function as in Equation 1.
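A minimal sketch of this Siamese stage, assuming the two encodings have already been produced by the frozen feature network; the names SiameseHead, s0 and s1 are illustrative, not from the paper:

```python
import torch
import torch.nn as nn

class SiameseHead(nn.Module):
    """Sketch: the two encodings L_{2,0} and L_{2,1} are combined by absolute
    difference, passed through a layer of the same size (S_0) and then a
    single output neuron (S_1); both layers use sigmoid activations."""
    def __init__(self, feat_len: int):
        super().__init__()
        self.s0 = nn.Linear(feat_len, feat_len)  # layer S_0, same size as L_2
        self.s1 = nn.Linear(feat_len, 1)         # single-neuron layer S_1

    def forward(self, l20: torch.Tensor, l21: torch.Tensor) -> torch.Tensor:
        d = torch.abs(l20 - l21)          # absolute difference of the encodings
        h = torch.sigmoid(self.s0(d))
        return torch.sigmoid(self.s1(h))  # similarity score in (0, 1)

head = SiameseHead(10)
score = head(torch.rand(4, 10), torch.rand(4, 10))  # batch of 4 comparisons
```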
The datasets are split with the following ratios: training 0.4, validation 0.2, testing 0.4. There are two stages in training the overall network. Both stages use the normalised direction-preserving Adam optimiser (ND-Adam) [21], implemented in PyTorch [22].
In the first stage, initial pre-training is performed on the encoding part of the network before the Siamese network is trained as a whole. Specifically, the auto-encoder is trained separately from the rest of the network. As described above, the auto-encoder is trained to reproduce its input. It was trained for 20,000 epochs with a learning rate of 1e-3 and a weight decay of 5e-6; these values have been previously validated as optimal for this phase of training via a grid search over the parameters. The mean squared error (MSE) is used as the loss function, with the validation set monitored at each epoch.
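A minimal sketch of this pre-training stage, using PyTorch's standard Adam optimiser as a stand-in for ND-Adam (which is not part of PyTorch itself); the learning rate and weight decay follow the values above, while the toy dimensions and the short epoch count are purely illustrative:

```python
import torch
import torch.nn as nn

# Toy auto-encoder standing in for the real one (input length 40 here).
ae = nn.Sequential(
    nn.Linear(40, 4), nn.Sigmoid(),
    nn.Linear(4, 2), nn.Sigmoid(),
    nn.Linear(2, 40), nn.Sigmoid(),
)
# Adam as a stand-in for ND-Adam; lr and weight decay follow the text.
opt = torch.optim.Adam(ae.parameters(), lr=1e-3, weight_decay=5e-6)
loss_fn = nn.MSELoss()

data = torch.rand(32, 40)          # stand-in vectorised trajectories
for epoch in range(50):            # the paper trains for 20,000 epochs
    opt.zero_grad()
    loss = loss_fn(ae(data), data)  # reproduce the input
    loss.backward()
    opt.step()
```

In the full setup, validation-set loss would be computed alongside each epoch and the best-performing network state retained.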
The second stage involves training the network as a twin neural network. As described previously, the final layer of the auto-encoder is removed and a new layer is added for the similarity output. The layers of the auto-encoder are frozen so that no further optimisation is performed on them, and training occurs for 10,000 iterations on the single output neuron following the differencing of feature vectors. The training of the overall Siamese network follows a one-shot approach [23] over multiple iterations, also known as few-shot learning. In every iteration, a random trajectory is chosen for each class and compared both to another trajectory of the same class and to a random trajectory from another class. Rather than training on all the data in a traditional epoch, this few-shot methodology mitigates overfitting to any one class, as each class receives equal training.
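The per-iteration pair sampling described above can be sketched as follows; the function name and the by_class mapping (class label to list of trajectory indices) are illustrative assumptions:

```python
import random

def sample_iteration_pairs(by_class):
    """For every class, pick a random anchor trajectory and pair it with
    another trajectory of the same class (label 1) and with a random
    trajectory from a different class (label 0)."""
    pairs = []
    labels = list(by_class)
    for c in labels:
        anchor = random.choice(by_class[c])
        positive = random.choice([t for t in by_class[c] if t != anchor])
        other = random.choice([l for l in labels if l != c])
        negative = random.choice(by_class[other])
        pairs.append((anchor, positive, 1))  # same-class pair
        pairs.append((anchor, negative, 0))  # different-class pair
    return pairs

pairs = sample_iteration_pairs({"A": [0, 1, 2], "B": [3, 4]})
```

Each class thus contributes exactly one positive and one negative comparison per iteration, regardless of how many trajectories it contains.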
A Bayesian optimisation search is performed over the learning rate and weight decay for the second training stage; this was chosen over a grid search due to the much larger training time of the Siamese part of the network. The best network is chosen based on the largest validation-set F1-score across all models for each dataset. As mentioned above, both training phases retain the best state of the network based on the validation set, which further helps to reduce overfitting to the training set.

Experimental validation and analysis:
In this section, we present the experimental validation of the proposed method by first describing the datasets and evaluation criteria, followed by an analysis and comparison of the results with existing approaches. In order to show the effectiveness of the proposed approach, we used four challenging publicly available real-world datasets (Table 2, Figure 2), namely Traffic Junction [13], Parking Lot [13], Train Station [6], and Students003 [24]. Traffic Junction and Parking Lot are recorded from a mobile aerial platform and contain vehicle and person targets. Real trajectories are used, with the induced camera motion already compensated in both datasets, as provided by the original authors [13]. Train Station and Students003 are recorded from an approximately top-down fixed camera, offering scenes that are highly crowded with people moving in varying directions inside a train station and in an outdoor square. For both datasets we used the trajectories made available by the respective authors [6, 24]. For Train Station we use longer trajectories (length > 600), as this dataset also contains short-duration tracklets generated by repeated tracker initialisations, which are not within the scope of this work.
There are multiple ways to prepare a trajectory dataset [25]. Broadly, these fall under four categories: transformation, resampling, substitution, or adding additional features. In this paper, we used a resampling methodology to achieve a static overall size per dataset, based on the mean trajectory length of that dataset. The x and y coordinates are also normalised between 0 and 1 based on the minimum and maximum of the dataset's original coordinate space.
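The resampling and normalisation step might look as follows, assuming simple linear interpolation (the paper does not specify the interpolation scheme) and, for brevity, a single scalar coordinate range shared by x and y:

```python
import numpy as np

def resample_and_normalise(traj, target_len, xy_min, xy_max):
    """Linearly resample a trajectory (an (N, 2) array of x, y positions)
    to a fixed length, then normalise coordinates to [0, 1] against the
    dataset's original coordinate range."""
    src = np.linspace(0.0, 1.0, len(traj))       # original sample positions
    dst = np.linspace(0.0, 1.0, target_len)      # fixed-length positions
    x = np.interp(dst, src, traj[:, 0])
    y = np.interp(dst, src, traj[:, 1])
    out = np.stack([x, y], axis=1)
    return (out - xy_min) / (xy_max - xy_min)    # scale into [0, 1]

traj = np.array([[0.0, 0.0], [10.0, 10.0], [20.0, 0.0]])
v = resample_and_normalise(traj, 5, xy_min=0.0, xy_max=20.0)
```

The fixed target_len would be derived from the mean trajectory length of each dataset, so every trajectory in a dataset yields an input vector of the same size.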
For evaluating the performance of the extracted motion patterns, we use the precision (P), recall (R) and F-score (F1) measures. P assesses the correct (true positive) patterns relative to the incorrect (false positive) patterns, while R assesses the correct (true positive) patterns relative to the missed (false negative) patterns. An extracted pattern is deemed correct if it belongs to a ground-truth cluster. We used the ground truth as provided by the authors [13]. Figure 3 provides a visualisation of the patterns extracted by the proposed method in terms of the predicted clusters on each dataset.
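These three measures follow their standard definitions from true positive, false positive and false negative counts; as a sketch (the counts below are purely illustrative):

```python
def prf1(tp, fp, fn):
    """Precision, recall and F1 from counts of correct (true positive),
    incorrect (false positive) and missed (false negative) patterns."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# e.g. 9 correctly extracted patterns, 1 false positive, 1 missed pattern
p, r, f1 = prf1(9, 1, 1)
```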
We tested the generalisation ability of the proposed method by training the Siamese network separately on each dataset and then evaluating the resulting model across all other datasets (Table 3). Expectedly, the proposed method shows the best performance when trained and evaluated on the same dataset. To further demonstrate the usefulness of the proposed method, we also compared its performance with five state-of-the-art approaches, namely DFTfeat [1], MULTfeat [14], DWTfeat [13], DEEPfeat [2] and Movelets [26]. Table 4 summarises the evaluation results of all methods in terms of P, R and F1 scores on all datasets. On Traffic Junction, the proposed method and Movelets outperform the existing approaches on P, R and F1 scores. On Train Station, the proposed method again shows the best performance on P, R and F1, followed by Movelets. On Parking Lot, the proposed method is the best on P and F1, and shows a slightly lower R = 0.94 than DWTfeat (R = 1.00). It is important to note that while R is somewhat higher for DWTfeat than for the proposed method, the former has a significantly lower P = 0.65 (due to a much higher number of false positives) than the latter with P = 0.96. On Students003, Movelets shows the best performance (P = R = F1 = 0.96), with the proposed method showing comparable performance (P = R = F1 = 0.91). This slightly inferior performance is attributed to the proposed method mostly categorising trajectories belonging to three clusters (clusters 1-3 in Figure 3c) as a single cluster. Indeed, the motion patterns in these three clusters are largely similar in terms of their length and shape (all span horizontally across the image). The proposed method, relying on the Siamese network, thus appears better suited to detecting longer patterns than to distinguishing among alike patterns that differ only slightly in vertical placement.

Conclusion:
We presented an end-to-end deep learning framework based on trained Siamese networks, which enables trajectory-analysis-based extraction of dominant motion patterns in an online manner, allowing incremental incorporation of new incoming trajectories. We performed an experimental validation and comparison of the proposed method on four challenging publicly available real-world datasets. The results show that the proposed method outperforms existing methods on three datasets (Traffic Junction, Parking Lot, Train Station), while achieving comparable performance on the fourth (Students003). Future work will look into testing on more datasets with different scenarios, as well as extending the few-shot methodology to train across multiple datasets, which is expected to improve the generalisation of the network further.

Conflict of interest:
The authors declare no conflict of interest.

Fig. 1 Architecture of the full Siamese network with all layers fully connected

Fig. 3 Visualisation of the extracted patterns in terms of the predicted clusters. The predicted clusters are shown on planes along the z-axis in each plot. The colour of a trajectory corresponds to the relevant ground-truth class as indicated in the legend. The darker end of a trajectory indicates its start

Table 1. The hidden layer sizes of the auto-encoder for different datasets

Table 2. Summary of the datasets used in the study

Table 4. Evaluation results of all methods in terms of P, R and F1 scores; the higher the score, the better. The top two methods are shown in bold