Abstract: To address the large number of parameters and poor robustness of object detection algorithms in complex orchard environments, an improved YOLO v7 network for apple maturity detection (immature, semi-mature, mature) was proposed. With YOLO v7 as the baseline network, a window-based multi-head self-attention mechanism (Swin Transformer, ST) was adopted in the feature extraction structure to greatly reduce the number of parameters and the computational complexity. To improve the model's ability to detect small targets in distant images, an adaptively spatial feature fusion (ASFF) module was adopted in the feature fusion structure to optimize the Head part, effectively utilizing both shallow and deep features and enhancing scale invariance. The wise intersection over union (WIoU) loss replaced the original complete intersection over union (CIoU) loss, accelerating convergence and improving detection accuracy. Experimental results showed that the improved YOLO v7-ST-ASFF model significantly improved detection speed and accuracy on the apple image test set. The average detection precision, recall, and mean average precision (mAP) across the maturity levels reached 92.5%, 84.2%, and 93.6%, respectively, all better than those of the Faster R-CNN, SSD, YOLO v3, YOLO v5, YOLO v7, and YOLO v8 object detection models. Detection performance was good for multiple and single targets, front-lit and back-lit targets, distant and close targets, as well as bagged and unbagged targets. The model size was 53.4 MB and the average detection time (ADT) was 45 ms, both better than those of the other models. The improved YOLO v7-ST-ASFF model can meet the requirements of apple target detection in complex orchard environments, providing an effective exploration toward automated fruit and vegetable picking by robots.
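The ASFF idea mentioned above can be illustrated with a minimal sketch: feature maps from several pyramid levels, already resized to a common scale, are blended per pixel with softmax weights so the network can adaptively favor shallow or deep features at each location. The function name `asff_fuse` and the NumPy formulation are illustrative assumptions, not the paper's implementation (which learns the weight logits with 1x1 convolutions inside the detection head).

```python
import numpy as np

def asff_fuse(feats, logits):
    """Illustrative ASFF-style fusion (not the paper's exact code).

    feats:  list of 3 feature maps, each shaped (C, H, W), assumed to be
            already resized to the same spatial scale (resizing omitted here).
    logits: (3, H, W) raw per-pixel weight maps, one per pyramid level;
            in the real model these come from learned 1x1 convolutions.

    A per-pixel softmax over the 3 levels yields weights that sum to 1,
    and the fused map is the weighted sum of the input features.
    """
    # Numerically stable softmax over the level axis.
    w = np.exp(logits - logits.max(axis=0, keepdims=True))
    w = w / w.sum(axis=0, keepdims=True)            # (3, H, W), sums to 1
    # Broadcast each (H, W) weight map across the C channels.
    fused = sum(w[i] * feats[i] for i in range(3))  # (C, H, W)
    return fused, w
```

With equal logits the softmax weights are all 1/3, so the fused map reduces to the plain average of the three inputs; unequal logits shift the blend toward one level, which is the scale-adaptive behavior the abstract refers to.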