Traffic Sign Recognition Based On The YOLOv3 Algorithm Part 2

2. Algorithm Fundamentals

2.1. The YOLOv3 Algorithm

YOLOv3 [14] is a single-stage target detection algorithm that Redmon developed as an improvement on YOLOv2. It improves on YOLOv2 in detection accuracy while maintaining real-time performance, and it offers a strong balance of speed and accuracy relative to other detection algorithms.

YOLOv3 is currently the most popular algorithm in the YOLO family and is widely used in real detection scenarios [15]; the YOLOv3 network structure is shown in Figure 1.

The fully convolutional structure used by YOLOv3 is not constrained by the size of the input image.

The pooling and fully connected layers are removed from the network, and a convolutional layer with a stride of 2 is used instead of a pooling layer for downsampling. This prevents the loss of target information during pooling and facilitates the detection of small targets [16].

In addition, YOLOv3 replaces the DarkNet-19 network structure of YOLOv2 with the DarkNet-53 feature extraction layer.

The DarkNet-53 network borrows the residual structure of ResNet [18], using the original output of an earlier layer as part of the input to a later layer. This alleviates the gradient problem of deep networks and the loss of original information across multiple convolutional layers, which allows better feature extraction and improves detection and classification [17].

As shown in Figure 2, the residual module in YOLOv3 consists of two convolutional layers and a shortcut layer.

Furthermore, YOLOv3 adopts the idea of the feature pyramid network (FPN) [19] to predict on feature maps at three scales, with detection scales of 13 × 13, 26 × 26, and 52 × 52.
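
The three grid sizes follow directly from the input resolution and the downsampling stride of each detection branch. The sketch below assumes a 416 × 416 input and strides of 32, 16, and 8, which are the usual YOLOv3 defaults but are not stated in the text above; the function name `detection_grid_sizes` is hypothetical.

```python
def detection_grid_sizes(input_size, strides=(32, 16, 8)):
    """Return the side length of the prediction grid at each stride.

    Assumption: a square input whose side is divisible by every stride.
    """
    sizes = []
    for stride in strides:
        if input_size % stride != 0:
            raise ValueError(
                f"input size {input_size} is not divisible by stride {stride}"
            )
        sizes.append(input_size // stride)
    return sizes


# Assumed 416 x 416 input, the common YOLOv3 default.
print(detection_grid_sizes(416))  # [13, 26, 52]
```

The coarsest grid (13 × 13) comes from the deepest layer and the finest grid (52 × 52) from the shallowest of the three branches.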

In the FPN, feature extraction by the convolutional neural network proceeds bottom-up, while the upsampling of the convolutional feature maps proceeds top-down, as shown in Figure 3.

2.2. Spatial Pyramidal Pooling Structure

The spatial pyramid pooling (SPP) structure [20] solves the problem of repeated extraction of image features by convolutional neural networks and greatly improves detection efficiency; the SPPNet network structure is shown in Figure 4.

In a neural network with a fully connected layer, the input image must be cropped or scaled to a fixed resolution so that the resulting feature dimension matches what the fully connected layer expects.

Scaling and cropping lose image feature information, which lowers detection accuracy and affects detection results. SPPNet removes the restriction of a fixed input image size and saves computational cost [21].
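
To illustrate why spatial pyramid pooling removes the fixed-size restriction, the following minimal sketch pools a single-channel feature map of any height and width into a vector of fixed length. It is an illustration only, not the SPPNet implementation: the pyramid levels (1, 2, 4) and the function name `spp_max_pool` are assumptions chosen for the example.

```python
import math


def spp_max_pool(feature_map, levels=(1, 2, 4)):
    """Pool a 2-D feature map (list of rows) into a fixed-length vector.

    For each pyramid level n, the map is divided into an n x n grid of bins
    and the maximum of each bin is kept. The output length is
    sum(n * n for n in levels), independent of the input size.
    """
    height = len(feature_map)
    width = len(feature_map[0])
    pooled = []
    for n in levels:
        for i in range(n):
            # floor/ceil boundaries guarantee that every bin is non-empty
            # and that the bins together cover the whole map.
            r0 = math.floor(i * height / n)
            r1 = math.ceil((i + 1) * height / n)
            for j in range(n):
                c0 = math.floor(j * width / n)
                c1 = math.ceil((j + 1) * width / n)
                pooled.append(
                    max(
                        feature_map[r][c]
                        for r in range(r0, r1)
                        for c in range(c0, c1)
                    )
                )
    return pooled


small = [[r * 5 + c for c in range(5)] for r in range(5)]   # 5 x 5 map
large = [[r * 9 + c for c in range(9)] for r in range(7)]   # 7 x 9 map
print(len(spp_max_pool(small)), len(spp_max_pool(large)))   # 21 21
```

Both inputs yield 1 + 4 + 16 = 21 values, so a fully connected layer placed after the pooling step would see the same dimension regardless of the image size.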

3. Improved YOLOv3

3.1. Improved YOLOv3 Network Structure

The basic feature extraction network typically downsamples five times, each time by a factor of 2, so the overall downsampling factor is 2^5 = 32.

At this rate, a 32 × 32 pixel target is already reduced to a single point on the feature map, so continuing to downsample would lose the target information. According to the COCO dataset definition, small targets are smaller than 32 × 32 pixels, medium targets are between 32 × 32 and 96 × 96 pixels, and large targets are larger than 96 × 96 pixels [22].
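
The size categories above can be written as a small helper. This is a sketch under the assumption that the thresholds are applied to the bounding box area (32 × 32 and 96 × 96 pixels), with the boundary values themselves counted as medium; the function name `size_category` is hypothetical.

```python
SMALL_MAX_AREA = 32 * 32    # below this area: small target
MEDIUM_MAX_AREA = 96 * 96   # above this area: large target


def size_category(width, height):
    """Classify a bounding box by pixel area using the thresholds above."""
    area = width * height
    if area < SMALL_MAX_AREA:
        return "small"
    if area <= MEDIUM_MAX_AREA:
        return "medium"
    return "large"


# Five stride-2 downsampling steps give an overall factor of 32.
total_downsampling = 2 ** 5
print(total_downsampling)        # 32
print(size_category(20, 20))     # small
print(size_category(64, 48))     # medium
print(size_category(120, 120))   # large
```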

As illustrated in Figure 5, the TT100K traffic sign dataset used in this work was mostly made up of small and medium targets, with large targets accounting for just 7.4% of the total dataset and small targets accounting for 42.5% [23].

The TT100K dataset has a high resolution: each image is 2048 × 2048 pixels, and even the largest traffic signs among the small targets occupy less than 0.1% of the image, which poses a significant challenge to the target detection algorithm.

Small targets offer few features and require high localization precision.

Although YOLOv3 introduces the FPN structure to fuse features from different layers for multi-scale prediction, which is critical for small target detection, the results on small targets were still unsatisfactory.

In the YOLOv3 network, the shallow layers contain less semantic information but more precise target locations, whereas the deep layers contain more semantic information but coarser target locations.

As a result, shallow convolutional layers are used to predict small targets and deep convolutional layers are used to predict large targets. A fourth feature prediction scale of size 152 × 152 was added to the three feature prediction scales of the YOLOv3 network structure to make full use of the shallow features in the network when predicting small targets.

With an input image size of 608 × 608, a feature map of size 152 × 152 was obtained after convolution and two-fold upsampling. This feature map was brought in through a routing layer and fused with the features of the 11th layer to form the fourth feature prediction scale.
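
A short calculation shows why the extra 152 × 152 scale helps with small targets. The sketch assumes strides of 32, 16, and 8 for the three original scales; the stride of 4 for the added scale follows from 608 / 152. The function name `scale_summary` is hypothetical.

```python
def scale_summary(input_size, strides, target_size):
    """For each stride, return (stride, grid side, cells spanned per axis).

    "Cells spanned" is the side length of the target divided by the stride,
    i.e. how many grid cells the target covers along one axis.
    """
    summary = []
    for stride in strides:
        grid = input_size // stride
        cells = target_size / stride
        summary.append((stride, grid, cells))
    return summary


for stride, grid, cells in scale_summary(608, (32, 16, 8, 4), 32):
    print(f"stride {stride:>2}: {grid} x {grid} grid, "
          f"32-pixel target spans {cells:g} cell(s) per axis")
```

At stride 32 a 32-pixel target falls within a single cell, while at stride 4 it spans 8 cells per axis, so the added scale keeps considerably more spatial detail for targets at or below the small-target threshold.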

In addition, borrowing the idea of SPPNet, an SPP module was added to YOLOv3 to fuse local and global features.

The SPP module was integrated between the fifth and sixth convolutional layers before the YOLO detection layer. The feature maps entering the SPP module and the pooled feature maps were concatenated and passed to the next layer of the detection network.

To accomplish the feature map level fusion of local and global features, the SPP module's maximum pooling kernel should be as close to the size of the feature map to be pooled as possible.

To limit the computational cost introduced by the SPP module while enriching the expressive capability of the feature maps and improving detection performance, the SPP module in this work was composed of two parallel branches, each of which was composed of a 19 × 19 max pooling layer and a skip connection. Figure 6 depicts the improved YOLOv3 network structure.
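
The sketch below shows one possible reading of such a module: a stride-1 max pooling branch that preserves the spatial size, concatenated channel-wise with an unchanged skip branch. The exact branch layout of the module in this work is not fully specified above, so this arrangement, the border handling, and the function names `max_pool_same` and `spp_block` are all assumptions made for illustration.

```python
def max_pool_same(feature_map, kernel):
    """Stride-1 max pooling that keeps the spatial size unchanged.

    Windows are clipped at the borders, which is equivalent to padding
    with negative infinity. The kernel size must be odd.
    """
    if kernel % 2 == 0:
        raise ValueError("kernel size must be odd")
    height, width = len(feature_map), len(feature_map[0])
    half = kernel // 2
    pooled = []
    for r in range(height):
        row = []
        for c in range(width):
            window = [
                feature_map[rr][cc]
                for rr in range(max(0, r - half), min(height, r + half + 1))
                for cc in range(max(0, c - half), min(width, c + half + 1))
            ]
            row.append(max(window))
        pooled.append(row)
    return pooled


def spp_block(channels, kernel):
    """Concatenate the input channels (skip branch) with pooled channels."""
    pooled = [max_pool_same(channel, kernel) for channel in channels]
    return channels + pooled


fmap = [[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]]
out = spp_block([fmap], kernel=3)
print(len(out))   # 2 channels: the original map and its pooled version
print(out[1])     # [[5, 6, 6], [8, 9, 9], [8, 9, 9]]
```

With a 19 × 19 kernel on a 19 × 19 feature map, cells near the center see the whole map, which is how a pooling kernel close to the feature map size brings global context into each location while the skip branch keeps the local features.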

3.2. Improved Loss Function

The loss function of YOLOv3 is composed of the center coordinate loss (loss_xy), width-height coordinate loss (loss_wh), confidence loss (loss_conf), and classification loss (loss_cls). The center coordinate loss is represented by:

where $\lambda_{coord}$ denotes the coordinate loss weight; $\lambda_{noobj}$ denotes the confidence loss weight without an object; $I_{ij}^{obj}$ denotes whether the $j$th anchor box of the $i$th cell is responsible for the object (1 or 0); $I_{ij}^{noobj}$ denotes that the $j$th anchor box of the $i$th cell is not responsible for the object; $(x_i^j, y_i^j, w_i^j, h_i^j, C_i^j, P_i^j)$ denotes the predicted target box coordinates, confidence, and category; and $(\hat{x}_i^j, \hat{y}_i^j, \hat{w}_i^j, \hat{h}_i^j, \hat{C}_i^j, \hat{P}_i^j)$ denotes the real target box coordinates, confidence, and category.
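
Using the symbols defined above, the four component losses are commonly written in the following form. This is a sketch of the standard YOLOv3-style formulation, not necessarily the exact equations of this work. The grid size $S$, the number of anchor boxes per cell $B$, and the class index $c$ are assumed symbols that do not appear in the text. Some formulations apply square roots to the widths and heights, which is omitted here. To stay consistent with the minus signs in Equation (5), the confidence and classification terms are written as log-likelihood sums without a leading negative sign, which is also an assumption.

```latex
\begin{align}
\mathrm{loss}_{xy} &= \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} I_{ij}^{obj}
  \left[ (x_i^j - \hat{x}_i^j)^2 + (y_i^j - \hat{y}_i^j)^2 \right] \\
\mathrm{loss}_{wh} &= \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} I_{ij}^{obj}
  \left[ (w_i^j - \hat{w}_i^j)^2 + (h_i^j - \hat{h}_i^j)^2 \right] \\
\mathrm{loss}_{conf} &= \sum_{i=0}^{S^2} \sum_{j=0}^{B} I_{ij}^{obj}
  \left[ \hat{C}_i^j \log C_i^j + (1 - \hat{C}_i^j) \log (1 - C_i^j) \right] \nonumber \\
  &\quad + \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} I_{ij}^{noobj}
  \left[ \hat{C}_i^j \log C_i^j + (1 - \hat{C}_i^j) \log (1 - C_i^j) \right] \\
\mathrm{loss}_{cls} &= \sum_{i=0}^{S^2} \sum_{j=0}^{B} I_{ij}^{obj} \sum_{c \in classes}
  \left[ \hat{P}_i^j(c) \log P_i^j(c) + (1 - \hat{P}_i^j(c)) \log (1 - P_i^j(c)) \right]
\end{align}
```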

The YOLOv3 loss function is represented by Equation (5), where the mean square error (MSE) loss function is used for the bounding box regression and cross-entropy is used as the loss function in loss_conf and loss_cls.

$$\mathrm{loss} = \mathrm{loss}_{xy} + \mathrm{loss}_{wh} - \mathrm{loss}_{conf} - \mathrm{loss}_{cls} \tag{5}$$

However, using MSE as the loss function for bounding box regression is unfavorable to small target detection: MSE is sensitive to object scale, so the loss is dominated by large-scale targets at the expense of small-scale ones.
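
The scale sensitivity can be seen with a small numerical example. The box sizes below are made up for illustration, and the helper `wh_squared_error` is hypothetical; both predictions are off by the same relative amount (10%) in width and height.

```python
def wh_squared_error(pred_wh, true_wh):
    """Sum of squared errors over width and height, as in an MSE-style loss."""
    return (pred_wh[0] - true_wh[0]) ** 2 + (pred_wh[1] - true_wh[1]) ** 2


# Small target: 20 x 20 pixels, predicted 10% too large.
small_loss = wh_squared_error((22.0, 22.0), (20.0, 20.0))
# Large target: 200 x 200 pixels, predicted 10% too large.
large_loss = wh_squared_error((220.0, 220.0), (200.0, 200.0))

print(small_loss)  # 8.0
print(large_loss)  # 800.0
```

The same relative error costs 100 times more on the large box, so gradient updates are driven mainly by large targets.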

To balance the losses of large and small targets and improve detection results by weakening the influence of bounding box size on the width and height loss, an IoU-type loss function was employed in this paper. IoU, which is used as the performance metric, is defined in Equation (6).

$$\mathrm{IoU} = \frac{|A \cap B|}{|A \cup B|} \tag{6}$$

When the bounding box and the target box do not overlap, IoU = 0, which does not reflect the distance between the two boxes; when the prediction box and the labeled box completely overlap, IoU = 1, the bounding box's center point cannot be determined, and the size gap with the target box cannot be further optimized.
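
The first limitation is easy to reproduce. The following sketch computes IoU for axis-aligned boxes given as (x1, y1, x2, y2) corner coordinates; this box format and the example coordinates are assumptions chosen for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes in (x1, y1, x2, y2) format."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


target = (0, 0, 10, 10)
near = (11, 0, 21, 10)     # just misses the target
far = (100, 0, 110, 10)    # far away from the target

print(iou(target, near))   # 0.0
print(iou(target, far))    # 0.0
```

Both predictions receive IoU = 0 even though one is much closer to the target, so a loss of the form 1 − IoU gives no indication of which direction to move a non-overlapping box.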

DIoU loss [24] is independent of scale: large boxes do not produce a disproportionately large loss, and small boxes do not produce a disproportionately small one. Because this addresses the problem described above, this work used the DIoU loss, whose calculation formula is presented in Equation (7).

$$\mathrm{DIoU}_{loss} = 1 - \mathrm{IoU} + \frac{\rho^{2}(b, b^{gt})}{c^{2}} \tag{7}$$

where $b$ and $b^{gt}$ denote the center points of the predicted box and the target box, $\rho$ is the Euclidean distance, and $c$ is the diagonal length of the smallest enclosing box covering the two boxes.

DIoU loss directly minimizes the distance between the two target frames and converges quickly. It is more in line with the target frame regression mechanism because it takes into account the distance between the target and the anchor, the overlap rate, and the scale, which makes the target frame regression more stable. It also still provides a gradient direction for the bounding box when it does not overlap with the target frame.
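
A minimal sketch of Equation (7) for axis-aligned boxes is given below. The (x1, y1, x2, y2) box format, the example coordinates, and the function name `diou_loss` are assumptions chosen for illustration; a training implementation would operate on tensors in a deep learning framework.

```python
def diou_loss(pred, target):
    """DIoU loss, 1 - IoU + rho^2 / c^2, for boxes in (x1, y1, x2, y2) format."""
    # Overlap term (IoU).
    inter_w = max(0.0, min(pred[2], target[2]) - max(pred[0], target[0]))
    inter_h = max(0.0, min(pred[3], target[3]) - max(pred[1], target[1]))
    inter = inter_w * inter_h
    area_p = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_t = (target[2] - target[0]) * (target[3] - target[1])
    union = area_p + area_t - inter
    iou = inter / union if union > 0 else 0.0

    # Squared Euclidean distance between the two box centers (rho^2).
    rho2 = (((pred[0] + pred[2]) - (target[0] + target[2])) / 2.0) ** 2 \
         + (((pred[1] + pred[3]) - (target[1] + target[3])) / 2.0) ** 2

    # Squared diagonal of the smallest box enclosing both boxes (c^2).
    enc_w = max(pred[2], target[2]) - min(pred[0], target[0])
    enc_h = max(pred[3], target[3]) - min(pred[1], target[1])
    c2 = enc_w ** 2 + enc_h ** 2

    penalty = rho2 / c2 if c2 > 0 else 0.0
    return 1.0 - iou + penalty


target = (0, 0, 10, 10)
near = (11, 0, 21, 10)     # no overlap, close to the target
far = (100, 0, 110, 10)    # no overlap, far from the target

print(diou_loss(target, target))                        # 0.0
print(diou_loss(near, target) < diou_loss(far, target)) # True
```

Although both non-overlapping boxes have IoU = 0, the closer box receives the smaller loss, so the center-distance term still indicates which way the prediction should move.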
