Abstract: Current rice breeding question answering systems suffer from low-level data management and coarse knowledge granularity. In addition, publicly available labeled data for named entity recognition in rice breeding are lacking, and manual annotation is costly. To address these issues, a named entity recognition approach based on text data augmentation was proposed for rice breeding questions. A rice breeding knowledge graph was created to help subdivide large named entity categories in rice breeding, such as rice characteristics entities, into smaller subcategories, such as resistance to abiotic stress and eating quality. This subdivision sharpened entity boundaries and refined knowledge granularity. To address the suboptimal named entity recognition performance caused by the high cost of annotating rice breeding data, the DA-BERT-BILSTM-CRF model was presented, which introduces a data augmentation layer into the BERT-BILSTM-CRF model. Using manually labeled rice breeding questions as training data, the proposed model was compared with three baseline models. In the overall named entity recognition experiment under the small-class entity division, the proposed model achieved a precision of 93.86%, a recall of 92.82%, and an F1 score of 93.34%. Compared with BERT-BILSTM-CRF, the best-performing of the three baseline models, the proposed model improved these metrics by 4.98, 5.3, and 5.15 percentage points, respectively. It also performed better in single-entity recognition, achieving a precision of 94.26% and an F1 score of 93.32%. The experiments showed that the proposed approach performed better in both overall and single-class named entity recognition on rice breeding questions.
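The overall scores reported above can be sanity-checked against the standard definition of F1 as the harmonic mean of precision and recall. The sketch below is illustrative only: the function name `f1_score` and the variable names are hypothetical, and the baseline precision and recall are not quoted from the paper but derived by subtracting the reported gains from the proposed model's scores.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given on the same scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Overall scores of the proposed model as reported in the abstract, in percent.
proposed_p, proposed_r = 93.86, 92.82
# Reported gains over BERT-BILSTM-CRF, in percentage points.
gain_p, gain_r = 4.98, 5.3

proposed_f1 = round(f1_score(proposed_p, proposed_r), 2)

# Baseline precision and recall implied by the gains (derived, an assumption).
baseline_p = round(proposed_p - gain_p, 2)
baseline_r = round(proposed_r - gain_r, 2)
baseline_f1 = round(f1_score(baseline_p, baseline_r), 2)

print(proposed_f1)                           # 93.34
print(baseline_f1)                           # 88.19
print(round(proposed_f1 - baseline_f1, 2))   # 5.15
```

Under these assumptions, the recomputed F1 of 93.34% matches the reported value, and the implied baseline F1 of 88.19% is consistent with the reported gain of 5.15 percentage points.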