Exploring Self-Supervised Vision Transformers For Gait Recognition in The Wild Part 3
Nov 24, 2023
We use a simple upsample to interpolate between neighboring joints (Figure 3). However, naively doing so on the skeleton results in spurious joints across the skeleton, regardless of the choice of skeleton formats (e.g., OpenPose or COCO), since the joint ordering does not preserve any semantic meaning.
This observation is in line with the work of Yang et al. [44], which proposes a tree structure skeleton image (TSSI) to preserve the spatial relationships between joints. It is based on a depth-first tree traversal ordering of joints, which preserves the skeleton’s structural information. Figure 3 (right) showcases the effects of different skeleton formats and upsampling methods. For this resizing method, we used the TSSI format and bicubic interpolation.
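The ordering idea can be sketched as follows. This is only an illustration under stated assumptions: the toy skeleton tree and the function names are hypothetical, the traversal revisits a parent joint when backtracking (so that neighboring entries are always connected joints), and plain linear interpolation along the joint axis stands in for the bicubic interpolation used in the text, to keep the example free of dependencies.

```python
def tssi_order(tree, root):
    """Depth-first joint ordering that revisits the parent on backtracking."""
    order = [root]
    for child in tree.get(root, []):
        order.extend(tssi_order(tree, child))
        order.append(root)  # step back to the parent joint
    return order


def upsample_linear(coords, n_out):
    """Resample a list of (x, y) joints to n_out points (n_out >= 2).

    Linear interpolation between consecutive entries; a stand-in for bicubic.
    """
    n_in = len(coords)
    out = []
    for i in range(n_out):
        pos = i * (n_in - 1) / (n_out - 1)  # fractional index in the input
        lo = int(pos)
        hi = min(lo + 1, n_in - 1)
        t = pos - lo
        out.append(tuple((1 - t) * a + t * b
                         for a, b in zip(coords[lo], coords[hi])))
    return out


# Hypothetical 4-joint skeleton: joint 0 is the root, 1 and 2 are its children,
# and joint 3 hangs off joint 1.
toy_tree = {0: [1, 2], 1: [3]}
toy_joints = {0: (0.0, 0.0), 1: (-1.0, 1.0), 2: (1.0, 1.0), 3: (-2.0, 2.0)}

order = tssi_order(toy_tree, 0)
ordered_coords = [toy_joints[j] for j in order]
upsampled = upsample_linear(ordered_coords, 13)
```

Because every pair of neighbors in the traversal is connected in the skeleton, interpolated points fall along existing limbs rather than between unrelated joints.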
Furthermore, we experimented with two upscaling methods, which were learnable during training. We used a simple linear layer applied to each flattened skeleton to increase the number of joints. This is the most straightforward manner to transform each skeleton, but it does not account for any spatial relationships between joints. To address this, we also employ a set of 2D deconvolutional layers on the skeleton sequence for resizing while also taking the structural information into account; for this method, we also employ the TSSI format.
Table 1 showcases the results for each resizing method for all architectures. The models were trained and evaluated on CASIA-B for 200 epochs, and we show results for normal walking. For the rest of our experiments, we chose to upsample the skeleton sequence with bicubic interpolation.


While there are several possible self-supervised pretraining procedures, we opted for a contrastive pretraining approach since it is the same procedure for the actual retrieval task of gait recognition. Contrastive approaches encourage representations belonging to the same class to be close in latent space, while simultaneously being distant from representations belonging to different classes.
In particular, we use the Supervised Contrastive (SupCon) loss [45] for pretraining. SupCon loss operates on a multi-viewed batch: each sample in the batch has multiple augmented versions of itself. It has been shown to be more robust to data corruption, to alleviate the need for careful selection of triplets (since the gradient encourages learning from hard examples), and to be less sensitive to hyperparameters.
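A minimal sketch of a SupCon-style loss on a multi-viewed batch is given below. It is written in plain Python for readability and is not the implementation used in the experiments; the function name, the temperature value, and the toy embeddings are assumptions. In the self-supervised setting, the label of each view is simply the index of the sequence it was augmented from.

```python
import math


def supcon_loss(embeddings, labels, temperature=0.1):
    """SupCon-style loss: for each anchor, average the negative log-probability
    of its positives (same label) among all other samples in the batch."""
    # L2-normalize so that dot products are cosine similarities.
    normed = []
    for z in embeddings:
        norm = math.sqrt(sum(v * v for v in z))
        normed.append([v / norm for v in z])

    n = len(normed)
    total, n_anchors = 0.0, 0
    for i in range(n):
        positives = [p for p in range(n) if p != i and labels[p] == labels[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing
        logits = {
            a: sum(x * y for x, y in zip(normed[i], normed[a])) / temperature
            for a in range(n) if a != i
        }
        # Numerically stable log-sum-exp over all non-anchor samples.
        m = max(logits.values())
        log_denom = m + math.log(sum(math.exp(v - m) for v in logits.values()))
        total += -sum(logits[p] - log_denom for p in positives) / len(positives)
        n_anchors += 1
    return total / n_anchors if n_anchors else 0.0


# Toy batch: two views each of two sequences (labels are sequence indices).
views = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
loss_matched = supcon_loss(views, [0, 0, 1, 1])     # views of a sequence agree
loss_mismatched = supcon_loss(views, [0, 1, 0, 1])  # positives are far apart
```

The loss is small when views of the same sequence lie close together and large when the positives of an anchor are farther away than its negatives.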

3.4. Data Augmentation
Training in a self-supervised contrastive manner with SupCon loss implies the use of data augmentation to provide multiple augmented “views” of the same skeleton sequence. Augmentations used for our walking skeleton sequences are in line with other works in this area [10,12,30]. The main augmentation used is random temporal cropping with a period length of T = 64 frames.
Given that skeletons are tracked for a variable duration of time, we use cropping to ensure that all sequences have the same length. Moreover, a walking person might change direction and perform other actions across the tracked duration; consequently, cropping induces more variability for the same sequence.
Furthermore, we modify the walking pace by slowing down or speeding up the walk. We used speed modifiers of {0.5, 0.75, 1, 1.25, 1.5, 1.75, 2.0}. This is adapted from the work of Wang et al. [48] for self-supervised learning of video representations. Moreover, pace modification has been used in gait analysis in the past [33].
We also use random flipping with a probability of 0.5, sequence reversal with a probability of 0.5, additive Gaussian noise for each joint with σ = 0.005, and random dropout of 5% of joints with a probability of 0.01 to simulate missing joints from the pose estimation model.
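The augmentation pipeline described above could be sketched as follows. Several details are assumptions made for illustration, since the text does not specify them: the order of the operations, nearest-frame resampling for pace changes, padding short sequences by repeating the last frame, flipping as negation of x coordinates that are centered at zero, and reading the dropout rule as "with probability 0.01, zero out 5% of the joints for the whole crop". All function names are hypothetical.

```python
import random

PACES = (0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 2.0)


def change_pace(seq, pace):
    """Speed up (pace > 1) or slow down (pace < 1) by nearest-frame resampling."""
    n_out = max(1, int(len(seq) / pace))
    return [seq[min(len(seq) - 1, int(i * pace))] for i in range(n_out)]


def random_crop(seq, length, rng):
    """Random temporal crop; short sequences are padded with the last frame."""
    if len(seq) <= length:
        return seq + [seq[-1]] * (length - len(seq))
    start = rng.randrange(len(seq) - length + 1)
    return seq[start:start + length]


def augment(seq, rng, crop_len=64):
    """seq is a list of frames; each frame is a list of (x, y) joints."""
    seq = change_pace(seq, rng.choice(PACES))
    seq = random_crop(seq, crop_len, rng)
    if rng.random() < 0.5:  # horizontal flip (assumes x is centered at 0)
        seq = [[(-x, y) for x, y in frame] for frame in seq]
    if rng.random() < 0.5:  # sequence reversal
        seq = seq[::-1]
    # Additive Gaussian noise on every joint coordinate, sigma = 0.005.
    seq = [[(x + rng.gauss(0, 0.005), y + rng.gauss(0, 0.005))
            for x, y in frame] for frame in seq]
    if rng.random() < 0.01:  # simulate joints missed by the pose estimator
        n_joints = len(seq[0])
        dropped = set(rng.sample(range(n_joints),
                                 max(1, round(0.05 * n_joints))))
        seq = [[(0.0, 0.0) if j in dropped else p
                for j, p in enumerate(frame)] for frame in seq]
    return seq


# Two augmented "views" of one hypothetical 200-frame, 18-joint sequence.
rng = random.Random(0)
walk = [[(0.01 * j, 0.001 * t) for j in range(18)] for t in range(200)]
view_a, view_b = augment(walk, rng), augment(walk, rng)
```

Calling `augment` twice on the same sequence yields the two views that form a positive pair in the contrastive objective.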

3.5. Initialization Methods
To gauge the impact of self-supervised pretraining on the performance of the proposed architectures, we explore three different initialization methods. Table 2 showcases the various datasets used in the literature. While CASIA-B [6] and FVG [29] are controlled datasets, mostly used for evaluation, we use DenseGait [12] and GREW [7] for self-supervised pretraining of the five architectures. DenseGait and GREW are two of the largest realistic gait databases, collected in outdoor settings, which presumably contain the majority of gait variations, behaviors, and complexities present in everyday life. We choose these datasets to obtain more general gait representations and to allow for gait authentication in general surveillance scenarios.

DenseGait Pretraining DenseGait is a large-scale “in the wild” gait dataset collected automatically from surveillance streams. It contains 217 K tracked skeleton sequences, extracted using AlphaPose [22], in various environments and camera angles from multiple geographical locations. Since DenseGait is collected automatically, its annotations in terms of the tracking identifier are noisy, and the dataset might contain multiple sequences of the same subject, although this is a rare occurrence. However, this is the case for the majority of large-scale unlabeled datasets, which contain samples belonging to the same semantic class that are considered unrelated during training. We pre-train each architecture on DenseGait and use the trained parameters for downstream performance evaluation on the controlled datasets.
GREW Pretraining GREW is another outdoor “in the wild” dataset but is carefully annotated such that it contains walking subjects across multiple days and different types of clothing. However, to conform to the requirements of the self-supervised regime, we discard the annotations and treat each walking sequence as a separate person. GREW is 2× smaller than DenseGait, containing 128 K skeleton sequences, while also having each pedestrian tracked for a smaller average duration [12]. We also train each architecture on GREW and use the trained parameters for downstream performance evaluation.
Random Initialization This initialization method corresponds to no pretraining (i.e., training from scratch). Each architecture is trained with random weight initialization on the downstream datasets. This method is a baseline to compare the performance gains by pretraining.
3.6. Evaluation
Downstream task performance is evaluated in two different manners for gait recognition. We directly test the retrieval capabilities of pre-trained architectures, without finetuning for a specific task. This method corresponds to zero-shot transfer. Further, we fine-tune each architecture using a 10× smaller learning rate than during pretraining, with a progressively smaller learning rate at the beginning of the network, corresponding to layer-wise learning rate decay (LLRD) [50] policy.
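Layer-wise learning rate decay can be illustrated with a small helper that assigns the full fine-tuning rate to the last layer and geometrically smaller rates toward the input. The decay factor, the layer count, and the pretraining rate below are hypothetical values chosen for the example; the text does not state them.

```python
def layerwise_lrs(base_lr, num_layers, decay=0.75):
    """Per-layer learning rates: layer 0 is closest to the input and gets the
    smallest rate; the last layer gets base_lr."""
    return [base_lr * decay ** (num_layers - 1 - i) for i in range(num_layers)]


pretrain_lr = 1e-3              # hypothetical pretraining learning rate
finetune_lr = pretrain_lr / 10  # 10x smaller for fine-tuning, as in the text
rates = layerwise_lrs(finetune_lr, num_layers=6)
```

In a typical training framework, each rate would then be attached to the parameter group of the corresponding layer.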
The evaluation of the downstream performance is performed on two popular gait recognition datasets: CASIA-B [6] and FVG [29]. Both datasets feature a small number of subjects under strict walking protocols, which are controlled over various confounding factors: camera angle, clothing, accessories, and walking speed.
CASIA-B is an indoor dataset containing 124 subjects captured from 11 synchronized cameras. Each individual walks in three different conditions: normal walking, clothing change, and bag carrying. Since its release, it has been a staple for benchmarking gait analysis models, being one of the most used datasets in this area. We use the first 62 subjects as the training set and the remaining 62 for performance evaluation. For gait recognition, we choose to evaluate performance on a per-angle basis in a “leave-one-out” setting, in which the gallery set contains all walking angles except the probe angle.
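The per-angle "leave-one-out" retrieval can be sketched as a nearest-neighbor search in embedding space. The sketch below is a simplified illustration: it assumes Euclidean distance and rank-1 accuracy, ignores the split by walking condition, and uses hypothetical function names and toy two-dimensional embeddings.

```python
import math


def rank1_accuracy(gallery, probes):
    """Fraction of probes whose nearest gallery embedding has the same identity.

    gallery and probes are lists of (embedding, identity) pairs.
    """
    correct = 0
    for emb, identity in probes:
        nearest = min(gallery, key=lambda g: math.dist(emb, g[0]))
        correct += nearest[1] == identity
    return correct / len(probes)


def per_angle_accuracy(samples):
    """samples: list of (embedding, identity, angle). For every probe angle,
    the gallery holds the samples from all other angles."""
    angles = sorted({angle for _, _, angle in samples})
    results = {}
    for probe_angle in angles:
        probes = [(e, i) for e, i, a in samples if a == probe_angle]
        gallery = [(e, i) for e, i, a in samples if a != probe_angle]
        results[probe_angle] = rank1_accuracy(gallery, probes)
    return results


# Toy example: two identities seen from two camera angles.
toy_samples = [
    ((0.0, 0.0), "A", 0), ((0.1, 0.0), "A", 90),
    ((5.0, 5.0), "B", 0), ((5.0, 5.1), "B", 90),
]
accuracy_by_angle = per_angle_accuracy(toy_samples)
```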
Front-View Gait is another popular dataset for gait recognition, having 226 subjects walking outdoors under various protocols. Different from CASIA-B, FVG features additional confounding factors: walking speed, cluttered background, and the passage of time (i.e., some subjects have registered walks that span a year). Furthermore, all subjects are captured with a front-facing camera angle, which is considered the most difficult angle for gait recognition since it contains the smallest amount of perceived joint movement variation. We used the first 136 subjects for training and the rest for performance evaluation. Performance evaluation for gait recognition adheres to the protocols outlined by the authors, in which we use the normal walking sequence in the gallery set and the other conditions in the probe set.

For all evaluation scenarios, we use deterministic cropping in the center of the walking sequence and do not use any test-time augmentations.
4. Experiments and Results
4.1. Evaluation on CASIA-B
We pre-train each architecture on DenseGait and GREW, respectively, and evaluate the performance on CASIA-B and FVG. In the first set of experiments, we are interested in evaluating the performance on CASIA-B in a fine-tuning scenario after pretraining, with progressively more training samples. We trained each network on the first 62 identities, with all available walking variations, and kept the rest for testing. Recognition evaluation was performed using the first 4 normal walking samples as the gallery set, and the rest as the probe set. Figure 4 showcases the accuracy for each of the three walking variations (normal, NM; clothing, CL; and carrying a bag, BG) for CASIA-B. For this scenario, we randomly sampled K = {1, 2, 3, 5, 7, 10} walks per subject, per angle, and trained the model. While the performance is relatively similar between architectures, it is clear that pretraining offers a significant boost in performance compared to random initialization, regardless of the pretraining dataset choice.
Moreover, SimpleViT, CrossFormer, and Twins-SVT seem to have similar high performance, while Token2Token is slightly lagging. This suggests that the progressive tokenization method used in Token2Token, which was specifically designed for image-like structures, does not effectively capture the characteristics of gait sequences. There is a noticeable difference between the pretraining datasets: DenseGait seems to offer a consistent performance increase in the two walking variations (CL and BG) when compared to GREW. This is indicative of the fact that DenseGait includes more challenging and realistic scenarios that better prepare the model for conditions where the walking pattern is affected by external factors.



The architectures are trained to map walking sequences to an embedding space, where the proximity between points reflects the similarity of the corresponding walking sequences. This means that the embeddings of unseen gait sequences from the same identity should be close to each other in the embedding space and form clusters, while the embeddings of different identities should be further away from each other and form distinct clusters. This is important as it allows the model to generalize to unseen gait sequences and accurately identify the individual by employing a nearest-neighbor approach. In Figure 5, we present the clustering for the embeddings for each identity in the test set of CASIA-B after dimensionality reduction with t-SNE [51]. We used the 256-dimensional embedding vector and projected it into two dimensions. SimpleViT and CrossFormer seem to have the best separation of identities, irrespective of camera viewpoint.

4.2. Evaluation on FVG
Similarly, we evaluated the performance of each architecture on FVG, which is qualitatively different from CASIA-B, as it contains only a single viewing angle. We fine-tuned each pre-trained network on a fraction f = {0.1, 0.2, 0.3, 0.5, 0.7, 1.0} of the 12 runs per person in the training set. Fine-tuning results are presented in Figure 6. Results follow a similar trend to those for CASIA-B: SimpleViT and CrossFormer have consistently high performance, and the use of a pretraining dataset is beneficial for downstream performance. Further, pretraining on DenseGait seems to bring a consistent accuracy improvement. As noted by Cosma and Radoi [12], DenseGait contains subjects tracked for a longer duration, and this provides more variation in the contrastive learning objective, similar to random cropping for self-supervised pretraining for natural images. Similar to the results on CASIA-B, the clothing variation severely lags behind the normal walking scenario.
In Table 4, we present more fine-grained results on the testing set of FVG between pre-trained models, similar to the CASIA-B scenario. Pretraining results are consistent: pretraining on DenseGait directly correlates to the improvement in the downstream accuracy. While pretraining on both datasets improves performance in all scenarios, the improvement is particularly significant in the CBG (Cluttered Background) scenario which usually consists of having more people in the video, similar to what would be expected in realistic settings. This improvement likely comes from the fact that DenseGait and GREW were gathered in natural, uncontrolled environments, making them more realistic and challenging, thus better preparing the model for similar conditions to the ones in the CBG scenario. The ranking between models is similar to CASIA-B: SimpleViT, CrossFormer, and Twins-SVT consistently outperform CaiT and Token2Token. For both CASIA-B and FVG, CaiT slightly lags behind other models.

4.3. Spatiotemporal Sensitivity Test
One particularity of vision transformers is the arbitrary choice of patch dimensions, which can prove to be crucial in the final performance. In the case of image processing, patch dimensions are not especially important, due to the translational invariance of the semantic content in an image.
For skeleton sequences, however, patch dimensions correspond to specific and interpretable features of the input: patch height represents the amount of spatial information contained in a patch (i.e., number of skeletal joints included), while the temporal dimension represents the amount of included temporal information (i.e., number of frames). The balance between the two should be carefully considered in the use of adapted vision transformers for gait analysis. In Figure 7, we showcase a heatmap in which every cell is the performance of a trained model (randomly initialized) on CASIA-B for normal walking. We train each model for 50 epochs for a fair comparison, and to gauge the convergence speed at a fixed number of steps.
We constructed two heatmaps, one for SimpleViT and one for CaiT, since they have a similar underlying backbone and it is straightforward to modify the patch sizes. The same process can be performed for the other tested architectures. We conclude that smaller patch sizes correspond to an increase in modeling performance for skeleton sequences, while the trade-off between spatial and temporal dimensions is not crucial, since performance is similar: the heatmap matrix is fairly symmetric about the secondary diagonal. Therefore, small, near-square patch sizes such as (2, 4) across spatial and temporal dimensions fare best for this task, while larger patch sizes such as (32, 32) contain too little discriminative information. However, smaller patch sizes imply a larger number of patches, which requires more computing power. For our setup of two NVIDIA RTX 3060 GPUs, we encountered out-of-memory errors for some combinations of smaller patch sizes.
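The memory pressure of small patches follows from simple arithmetic: the number of tokens grows as the patch area shrinks, and self-attention stores a score for every pair of tokens. The sketch below assumes a hypothetical input of 64 upsampled joints by T = 64 frames; only the frame count is taken from the text, and the function name is illustrative.

```python
def num_patches(height, width, patch_h, patch_w):
    """Number of non-overlapping patches for a (height x width) input."""
    if height % patch_h or width % patch_w:
        raise ValueError("patch size must divide the input size")
    return (height // patch_h) * (width // patch_w)


JOINTS, FRAMES = 64, 64  # JOINTS is an assumed upsampled size; FRAMES is T = 64

for patch in [(2, 4), (8, 8), (32, 32)]:
    tokens = num_patches(JOINTS, FRAMES, *patch)
    # Self-attention scores scale with the square of the token count.
    print(f"patch {patch}: {tokens} tokens, {tokens ** 2} attention entries per head")
```

Under these assumptions, a (2, 4) patch yields 512 tokens while a (32, 32) patch yields only 4, which is consistent with the large gap in both memory use and available detail between the two settings.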

The most likely reason for the improved performance with smaller patch sizes is that the architecture can better capture the complexity of the walking pattern by computing more intricate interactions between patches. Patches with the largest possible spatial size and smallest possible temporal size can be considered full representations of skeletons, whereas patches with the largest possible temporal size and smallest possible spatial size capture the entirety of an individual joint’s movements. As can be observed, the highest accuracy is achieved when the patch size incorporates a balance of both spatial and temporal information, which corresponds to small movements of closely connected joints.
5. Discussion and Conclusions
In this work, we provided a comprehensive evaluation of five popular variants of the vision transformer adapted for skeleton sequence processing. Our efforts are in line with the recent advancements in deep learning to essentially unify the different modalities under the transformer architecture. We proposed a spatial upsampling method for skeletons (bicubic upsampling with TSSI skeleton format) to artificially increase the number of joints, such that the sequence can be properly consumed by the transformer encoders. Furthermore, each architecture was trained under the self-supervision training paradigm on two general and large-scale gait datasets (i.e., DenseGait and GREW), and subsequently evaluated on two datasets for gait recognition in controlled environments (i.e., CASIA-B and FVG). We chose to adopt the self-supervised learning paradigm to obtain general gait features, not constrained to a particular walking variation or camera viewpoint.
Our results imply the need for high quantity, high quality, and diverse datasets for pretraining gait analysis models. We showed that pretraining on DenseGait offers consistent accuracy improvements over GREW, due to the increase in size, the number of variations, and the average walking duration [12]. The most significant benefit, however, is in situations with low amounts of training data available. Our results show that training from scratch leads to significantly worse results than fine-tuning even with modest amounts of data (i.e., 10 sequences per person). Currently, most gait approaches are performed indoors in strictly controlled environments, which cannot be generalized to the complexities of real-world interactions. Diverse training datasets are crucial for performing accurate in-the-wild behavioral analysis, especially since gait is a biometric feature easily influenced by external environmental factors, as well as internal and emotional components.

Our ablation study shows that smaller spatial-temporal patches are beneficial for better downstream results. This insight informs future developments of architectures for skeleton sequences, which have previously relied on processing an individual skeleton on a single patch [12].
Alongside concurrent efforts to bring gait analysis into realistic settings, our work further enables the transition of gait authentication and behavioral analysis from indoor, controlled environments to outdoor, real-world settings. In-the-wild gait recognition will become ubiquitous with the developments of smart sensors and efficient neural architectures that process motion-driven behavior in real-time.


References
1. Kyeong, S.; Kim, S.M.; Jung, S.; Kim, D.H. Gait pattern analysis and clinical subgroup identification: A retrospective observational study. Medicine 2020, 99, e19555. [CrossRef] [PubMed]
2. Michalak, J.; Troje, N.F.; Fischer, J.; Vollmar, P.; Heidenreich, T.; Schulte, D. Embodiment of Sadness and Depression—Gait Patterns Associated With Dysphoric Mood. Psychosom. Med. 2009, 71, 580–587. [CrossRef] [PubMed]
3. Willems, T.M.; Witvrouw, E.; De Cock, A.; De Clercq, D. Gait-related risk factors for exercise-related lower-leg pain during shod running. Med. Sci. Sports Exerc. 2007, 39, 330–339. [CrossRef] [PubMed]
4. Singh, J.P.; Jain, S.; Arora, S.; Singh, U.P. Vision-based gait recognition: A survey. IEEE Access 2018, 6, 70497–70527. [CrossRef]
5. Makihara, Y.; Nixon, M.S.; Yagi, Y. Gait recognition: Databases, representations, and applications. Comput. Vis. Ref. Guide 2020, 1–13.
6. Yu, S.; Tan, D.; Tan, T. A framework for evaluating the effect of view angle, clothing and carrying condition on gait recognition. In Proceedings of the 18th International Conference on Pattern Recognition (ICPR’06), Hong Kong, China, 20–24 August 2006; Volume 4, pp. 441–444.
7. Zhu, Z.; Guo, X.; Yang, T.; Huang, J.; Deng, J.; Huang, G.; Du, D.; Lu, J.; Zhou, J. Gait Recognition in the Wild: A Benchmark. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Montreal, BC, Canada, 11–17 October 2021.
8. Chao, H.; He, Y.; Zhang, J.; Feng, J. Gaitset: Regarding gait as a set for cross-view gait recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 8126–8133.
9. Fan, C.; Peng, Y.; Cao, C.; Liu, X.; Hou, S.; Chi, J.; Huang, Y.; Li, Q.; He, Z. Gaitpart: Temporal part-based model for gait recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 14225–14233.
10. Cosma, A.; Radoi, I.E. WildGait: Learning Gait Representations from Raw Surveillance Streams. Sensors 2021, 21, 8387. [CrossRef] [PubMed]






