Machine learning algorithms, bull genetic information, and imbalanced datasets used in abortion incidence prediction models for Iranian Holstein dairy cattle

Prev Vet Med. 2020 Feb:175:104869. doi: 10.1016/j.prevetmed.2019.104869. Epub 2019 Dec 17.

Abstract

The ability to predict abortion incidence, especially in regions with high abortion rates (e.g., Iran), helps improve reproductive performance and, thereby, dairy farm profitability. The objective of this study was to predict pregnancy loss in Iranian dairy herds. For this purpose, the cow history records and bull genetic information available at 6 large commercial dairy farms with cows calved between 2005 and 2014 were extracted from on-farm record-keeping software. Using WEKA, 12 commonly used machine learning (ML) algorithms were applied to the dataset. The algorithms belonged to 5 classifier groups: Bayes, meta, functions, rules, and trees. The original dataset, which included herd-cow factors, was randomly divided into 2 subsets: a training dataset and a test dataset (at a ratio of 60:40). The original dataset was combined with the bull genetic information to create a full dataset. The average abortion rate was 15.4 %, which made the dataset imbalanced. Therefore, 2 resampling techniques, down-sampling and up-sampling, were additionally applied to the original dataset (more specifically, to its training subset) to create 2 balanced datasets. This ultimately yielded 4 datasets: original, full, down-sampling, and up-sampling. Different algorithms and models were evaluated based on F-measure and area under the curve (AUC). Based on the results obtained, ML algorithms exhibited a high performance in predicting abortion when applied to the balanced dataset. However, their performance varied from 32.3 % (poor) to 69.2 % (medium upward) when applied to the imbalanced original dataset. In addition to the imbalance in the original dataset, these poor results were attributed to the high proportion of unknown risk factors underlying abortion incidence. Including the bull genetic information did not lead to any significant improvement in the prediction model.
Across the datasets used, the Bayes algorithms outperformed the others in predicting pregnancy losses, while rules had the worst performance. Furthermore, while the Bayes algorithms were not affected by the type of dataset (balanced or imbalanced), substantial increases in F-measure and AUC were observed for rules, trees, and functions with balanced datasets. Overall, the balanced models outperformed the others, with the down-sampling method exhibiting the highest performance. Although the prediction models used in this study did not perform as expected, it was shown that, even with moderate accuracy, they can be beneficially used to predict and reduce pregnancy losses, especially in herds with high abortion rates and low reproductive performance.
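
The data preparation described in the abstract (a random 60:40 split, followed by down- and up-sampling of the training subset only) and the F-measure can be illustrated with a minimal Python sketch. It is not the authors' WEKA workflow: the function names, the fixed random seed, and the synthetic records (1000 rows with a 15.4 % positive rate, chosen only to mirror the reported abortion rate) are all assumptions made for illustration, and AUC is not computed here.

```python
import random

def split_60_40(rows, train_frac=0.6, seed=0):
    """Randomly split rows into a training and a test subset (60:40 by default)."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(round(train_frac * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

def _by_class_size(train):
    """Return (minority_rows, majority_rows) for binary labels 0/1."""
    pos = [r for r in train if r[1] == 1]
    neg = [r for r in train if r[1] == 0]
    minority, majority = sorted((pos, neg), key=len)
    return minority, majority

def down_sample(train, seed=0):
    """Balance classes by discarding randomly chosen majority-class rows."""
    rng = random.Random(seed)
    minority, majority = _by_class_size(train)
    return minority + rng.sample(majority, len(minority))

def up_sample(train, seed=0):
    """Balance classes by re-drawing minority-class rows with replacement."""
    rng = random.Random(seed)
    minority, majority = _by_class_size(train)
    extra = rng.choices(minority, k=len(majority) - len(minority))
    return majority + minority + extra

def f_measure(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Synthetic (record_id, label) rows, NOT data from the study:
# label 1 = abortion, 0 = no abortion, with 154 of 1000 rows positive.
rows = [(i, 1 if i < 154 else 0) for i in range(1000)]

train, test = split_60_40(rows)   # the test subset is left untouched
down = down_sample(train)         # balanced by shrinking the majority class
up = up_sample(train)             # balanced by growing the minority class

def class_counts(data):
    """Return (number of negatives, number of positives)."""
    return (sum(1 for r in data if r[1] == 0), sum(1 for r in data if r[1] == 1))

print("train:", class_counts(train))
print("down-sampled:", class_counts(down))
print("up-sampled:", class_counts(up))
```

Resampling only the training subset, as the abstract specifies, keeps the test subset at its natural class ratio, so F-measure and AUC on the test data reflect the imbalance that a deployed model would face.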

Keywords: Abortion; Dairy herds; Machine learning; Prediction.

MeSH terms

  • Abortion, Induced / statistics & numerical data
  • Abortion, Induced / veterinary*
  • Abortion, Veterinary / epidemiology*
  • Algorithms
  • Animals
  • Cattle / genetics*
  • Cattle Diseases / epidemiology*
  • Dairying
  • Datasets as Topic*
  • Incidence
  • Iran / epidemiology
  • Machine Learning*
  • Male
  • Models, Theoretical