Exploring Self-Supervised Vision Transformers For Gait Recognition in The Wild Part 1
Nov 24, 2023
Abstract:
The manner of walking (gait) is a powerful biometric that is used as a unique fingerprinting method, allowing unobtrusive behavioral analytics to be performed at a distance without subject cooperation.
As opposed to more traditional biometric authentication methods, gait analysis does not require explicit cooperation of the subject and can be performed in low-resolution settings, without requiring the subject’s face to be unobstructed/visible. Most current approaches are developed in a controlled setting, with clean, gold-standard annotated data, which powered the development of neural architectures for recognition and classification.
Only recently has gait analysis ventured into using more diverse, large-scale, and realistic datasets to pre-train networks in a self-supervised manner. A self-supervised training regime enables learning diverse and robust gait representations without expensive manual human annotations. Prompted by the ubiquitous use of the transformer model in all areas of deep learning, including computer vision, in this work, we explore the use of five different vision transformer architectures directly applied to self-supervised gait recognition.
We adapt and retrain the simple ViT, CaiT, CrossFormer, Token2Token, and TwinsSVT on two different large-scale gait datasets: GREW and DenseGait. We provide extensive results for zero-shot and fine-tuning on two benchmark gait recognition datasets, CASIA-B and FVG, and explore the relationship between the amount of spatial and temporal gait information used by the visual transformer.
Our results show that, when designing transformer models for processing motion, a hierarchical approach (i.e., CrossFormer models) on finer-grained movement fares comparatively better than previous whole-skeleton approaches.
Keywords:
gait recognition; biometric authentication; vision transformer; pose estimation; self-supervised learning; contrastive learning.
1. Introduction
How we move contains significant clues about ourselves. In particular, our gait (manner of walking) has been closely studied in medicine [1], psychology [2], and sports science [3]. Recently, gait analysis has received increased attention [4,5] from the computer science community coinciding with the exponential progress of deep learning and widespread availability of computing hardware.
AI-powered gait analysis systems have been able to successfully recognize subjects [6–10], estimate demographics such as gender and age [11], and estimate external attributes such as clothing [12], without using any external appearance cues. These results are not surprising, given the large amount of individual differences in gait, which are due to differences in musculoskeletal structure, genetic and environmental factors, as well as the walker’s emotional state and personality [13].
Current systems are, for the most part, trained and tested only in controlled indoor environments. Most methods use the CASIA-B dataset [6] as the standard benchmark for gait recognition models; it contains 124 subjects walking indoors in a strictly controlled manner, captured with multiple cameras. Complexity in the real world cannot be fully modeled by such restrained scenarios. Only recently has the focus shifted to modeling gait "in the wild", with datasets such as DenseGait [12], GREW [7], and Gait3D [14].

Gathering a large-scale dataset that is clean and fully annotated represents a tremendous effort in terms of both financial resources and allocated time. The GREW dataset [7] reportedly took 3 months of continuous work to be gathered and annotated. While such approaches have been useful in developing neural architectures for processing gait [8,9], they are not sufficiently diverse to be properly used in more relaxed, real-world environments.
The AI community has been slowly moving away from this approach in other areas, with methods for self-supervised learning for both vision [15] and language [16] gaining significant traction and often surpassing traditional supervised methods. Recent progress in self-supervised learning showed that self-supervised models are more robust and exhibit emerging behaviors, not explicitly defined during training.
For instance, DINO [17], a vision transformer trained in a self-supervised regime, learned class-specific features enabling unsupervised object segmentation without using any such labels during training. Cosma and Radoi [10] proposed the first contrastive method for self-supervised learning for gait analysis, by training an ST-GCN [18] on a smaller version of DenseGait [12]. Their method obtained reasonable results on downstream gait recognition tasks and showed that there is a strong correlation between the pretraining dataset size and zero-shot transfer performance.
While many approaches for gait analysis have been utilizing silhouettes extracted from background subtraction [6,8,9], extracting silhouettes in real surveillance scenarios implies the use of more advanced techniques, such as instance segmentation [19], which come at a high computational cost. Sequences of silhouettes occupy significant storage space and are not sufficiently flexible to be used in other adjacent tasks, such as activity recognition. Moreover, silhouettes encode subtle appearance cues, which makes it unclear to what extent movement is utilized in the identification [20].
On the other hand, 2D pose estimation models have become increasingly accurate and computationally efficient [21,22]. Skeletons are cheap to extract, and currently more reliable than 3D meshes and 3D poses, especially at a distance. Moreover, 2D skeletons are significantly more lightweight than silhouettes in terms of long-term storage.
Current architectures for processing sequences of skeletons utilize the natural spatial graph structure present in the human skeleton, introducing an inductive bias in the model design. Models such as the popular ST-GCN [18] and MS-G3D [23] have seen impressive results for skeleton-based action recognition.
Concurrently, there has been an explosion in the use of transformer models in almost all areas of deep learning since their initial application for natural language processing.
Transformers are considered a more general architecture, with few inductive biases. Initially, transformers struggled to match CNN models for image classification [24], but they are currently surpassing other models and showing promising results in self-supervised scenarios. More so than other types of architectures, transformers have shown impressive learning capacity and emergent behaviors under self-supervision [17].
Cosma and Radoi [12] were the first to propose GaitFormer, a direct adaptation of the vision transformer encoder model for gait recognition, utilizing individual skeletons as input “patches”, essentially only performing temporal attention, ignoring spatial attention relationships.
GaitFormer was trained in a self-supervised fashion and surpassed other gait recognition methods even without any fine-tuning. Such previous work is encouraging and paves the way for a more in-depth study of the potential application of transformer architectures for gait analysis. Can vision transformer models be adapted for self-supervised learning of skeleton gait representations?
The main architectural issue in vision transformers is defining the proper relationships between image patches, which define local and global information. When applied to gait, the choice of patch dimensions corresponds to the amount of encoded temporal and spatial information of the skeleton sequence.
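To make the role of patch dimensions concrete, here is a minimal NumPy sketch of how a skeleton sequence could be cut into patches before being fed to a vision transformer. It is an illustration only, not the implementation used in this work: the function name `patchify_skeleton_sequence`, the (frames, joints, coordinates) layout, and the choice of 48 frames and 18 joints are all assumptions made for the example.

```python
import numpy as np

def patchify_skeleton_sequence(seq, patch_t, patch_j):
    """Split a (T, J, C) skeleton sequence into flattened patches.

    T = frames, J = joints, C = coordinates per joint (hypothetical layout).
    Each patch covers patch_t consecutive frames and patch_j consecutive joints.
    Returns an array of shape (num_patches, patch_t * patch_j * C).
    """
    T, J, C = seq.shape
    if T % patch_t or J % patch_j:
        raise ValueError("patch sizes must divide the sequence dimensions")
    # (T/pt, pt, J/pj, pj, C) -> (T/pt, J/pj, pt, pj, C)
    grid = seq.reshape(T // patch_t, patch_t, J // patch_j, patch_j, C)
    grid = grid.transpose(0, 2, 1, 3, 4)
    # One row per patch, with the patch contents flattened into a token vector.
    return grid.reshape(-1, patch_t * patch_j * C)

# Hypothetical input: 48 frames, 18 joints, (x, y) coordinates.
seq = np.random.default_rng(0).normal(size=(48, 18, 2))

# One whole skeleton per token: only temporal relationships between tokens.
tokens_temporal = patchify_skeleton_sequence(seq, patch_t=1, patch_j=18)

# Smaller spatial extent, longer temporal extent: tokens mix both dimensions.
tokens_fine = patchify_skeleton_sequence(seq, patch_t=4, patch_j=6)
```

With `patch_t=1` and `patch_j=18`, every token is one full skeleton, so attention can only relate frames to each other. Smaller joint groups spread over several frames give each token a mix of spatial and temporal information instead.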
In this work, we present an extensive study of five different vision transformers, adapted for gait recognition. We explore the classical ViT model [24], CaiT [25], CrossFormer [26], TwinsSVT [27], and token-to-token ViT [28].

Each architecture is trained separately in a contrastive self-supervised manner on two large-scale "in the wild" datasets of 2D gait skeleton sequences: DenseGait, an automatically gathered dataset from raw surveillance streams; and GREW, a smaller dataset that contains clean human annotations.
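As a rough illustration of what a contrastive objective looks like, the sketch below implements a generic InfoNCE-style loss in NumPy. This section does not specify the exact loss used for pretraining, so the formulation, the function name `info_nce_loss`, and the temperature value are assumptions chosen for the example, not the objective used in this work.

```python
import numpy as np

def info_nce_loss(z_a, z_b, temperature=0.1):
    """Generic InfoNCE-style contrastive loss (illustrative assumption).

    z_a[i] and z_b[i] are embeddings of two augmented views of the same walk;
    every other row in the batch acts as a negative.
    """
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = (z_a @ z_b.T) / temperature             # (N, N) cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The positive pair for row i sits on the diagonal.
    return float(-np.mean(np.diag(log_prob)))

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))

# Two nearly identical views of each walk: positives are easy to pick out.
aligned = info_nce_loss(z, z + 0.01 * rng.normal(size=z.shape))

# Views paired with the wrong walk: positives look like negatives.
mismatched = info_nce_loss(z, np.roll(z, 1, axis=0))
```

The loss is low when the two views of each walk are closer to each other than to any other walk in the batch, and high otherwise, which is the property that lets the encoder train without identity labels.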
We explore transfer capabilities across two controlled datasets for gait recognition, CASIA-B [6] and FVG [29]. For each dataset, we analyze direct (zero-shot) transfer and data efficiency during fine-tuning by training with progressively larger subsets of the datasets. Moreover, we conduct an ablation study on the relationship between spatial and temporal dimensions for patch sizes for SimpleViT and CaiT, the standard backbones for most of the vision transformers to date.
The rest of the paper is organized as follows. We conduct a high-level overview of related works about gait recognition models and vision transformers. We observe that gait representation models highly benefit from self-supervised training to have more robust and general embeddings, and transformer models have shown great modeling capacity in self-supervised training regimes.
Further, we mathematically describe the five architectures that we benchmark and describe the data preprocessing and skeleton transformations that need to be performed so that vision transformers can operate seamlessly on skeleton sequences. We also describe data augmentations, training and benchmarking datasets, and experimental setups.
We showcase results on CASIA-B and FVG for each of the five architectures and the two "in the wild" pretraining datasets. Finally, we conduct an ablation study on the relationship between the spatial and temporal patch sizes and provide a brief discussion of our results. We make our source code publicly available on GitHub (https://github.com/cosmaadrian/gait-vit, accessed on 28 February 2023) for transparency and reproducibility.
2. Related Work
In this section, we give a brief overview of existing methods for gait recognition in controlled environments and "in the wild". Further, we describe the main developments of transformer models and, in particular, their application in the vision domain.
2.1. Gait Recognition
Similarly to face-based identification, gait recognition relies on metric learning. As opposed to traditional biometric authentication methods, which rely on a single image (e.g., face recognition) and require extensive cooperation (e.g., iris-based biometric authentication), gait features are processed as a sequence of motion snapshots. Such gesture dynamics require more complexity in determining the most informative sub-sequence but enable the use of unobtrusive authentication at a distance.
In this context, the task implies training an encoder network to map walking sequences to an embedding space where the embedding similarity corresponds to the similarity of the gait. Embeddings of walks that belong to the same person should be close in the embedding space, and those that come from different identities should be farther apart. In this embedding space, inference can be made by obtaining the embedding of the gait sequence and applying a nearest-neighbor search over a database of known walks.
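The nearest-neighbor inference step can be sketched in a few lines of NumPy. This is a minimal example under stated assumptions: embeddings are compared with cosine similarity (the text above does not fix a distance measure), and the function names, toy 2D embeddings, and identity labels are hypothetical.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products equal cosine similarity."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def identify(probe_emb, gallery_emb, gallery_ids):
    """Assign each probe the identity of its most similar gallery embedding."""
    sims = l2_normalize(probe_emb) @ l2_normalize(gallery_emb).T
    return gallery_ids[np.argmax(sims, axis=1)]

# Toy gallery of known walks (2D embeddings only for readability).
gallery_emb = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
gallery_ids = np.array([101, 102, 103])

# Embeddings of two unseen walking sequences.
probe_emb = np.array([[0.9, 0.1], [-0.8, 0.2]])
predicted = identify(probe_emb, gallery_emb, gallery_ids)
```

In practice the embeddings would come from the trained encoder, and the gallery would hold one or more embeddings per enrolled subject.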
Current approaches in gait-based recognition are divided into two categories: appearance-based [8,9] and model-based [10,12,30]. Appearance-based methods first obtain the silhouettes of the walking subjects with background subtraction or segmentation algorithms from each video frame.
Then the sequence of silhouettes is fed into CNN-based architectures which extract spatial and temporal features which are aggregated into a final embedding for recognition. Model-based approaches extract the skeletons from RGB videos with pose estimation models [21,22]. Sequences of skeletons are usually processed by models that rely on graph convolutions [10,30] for obtaining the embedding of the gait.
GaitSet, the work of Chao et al. [8], regards the gait as an unordered set of silhouettes. The authors argue that this representation is more flexible than a silhouette sequence because it is robust to different arrangements of frames or the combination of multiple walking directions and variations. They utilize convolution layers for each silhouette to obtain image-level features and combine them into a set-level feature with Set Pooling. They obtain the final output by employing their version of Horizontal Pyramid Matching [31].
Fan et al. [9] observed that specific parts of the human silhouette should have their own spatiotemporal expression, as each one has a unique pattern. Their architecture, GaitPart, utilizes focal convolution layers (FConvs), which are a specialized type of convolution with a more restricted receptive field. The authors argue that the FConvs aid their architecture in learning more fine-grained features for different parts of the moving body. They also introduce the micro-motion capture modules, which are employed to extract the features of small temporal sequences.
Teepe et al. [30] propose GaitGraph, which leverages an adapted graph convolutional network called ResGCN [32] for encoding the spatiotemporal features obtained from the sequence of skeletons. Li et al. [33] propose PTP, which is a structure that aggregates multiple temporal features from one gait cycle based on their analysis of the most important stages of walking.
They also utilize a graph convolutional network for spatial feature extraction, which works together with PTP. The authors introduce a novel data augmentation method that modifies the gait to have multiple paces in a more realistic cycle.
Unlike previous works, however, we aim to explore the performance of gait recognition architectures in self-supervised scenarios. Inspired by tremendous progress in the computer vision domain, we propose to adapt existing vision transformer architectures to operate on skeleton sequences instead of images and to test their modeling capacity in self-supervised scenarios. Most other works [8,9,30] focus their efforts on developing neural architectures that achieve impressive results on gait recognition on controlled datasets.
In contrast, we intend to remove the need for highly expensive manual annotations for gait datasets and explore ways in which self-supervised learning is appropriate for gait analysis.

Previous works in this domain [10,12] showed potential for learning good gait representations from weakly annotated datasets. Cosma and Radoi [12] proposed GaitFormer, the first transformer-based architecture for processing skeleton sequences, inspired by the ViT [24] model. Similar to [12], we attempt to explore the performance of other vision transformer models, with different spatial and temporal dynamics in the patch processing mechanism. Large-scale datasets for gait recognition have been proposed in the past [7,12], which allows for the development of general architectures for representation learning.






