Cocoa Ripeness Classification Using Vision Transformer
DOI: https://doi.org/10.37385/jaets.v6i2.6663

Keywords: Cocoa Ripeness Classification, Vision Transformer, Shifted Patch Tokenization, Locality Self Attention, Agricultural Computer Vision

Abstract
Manual assessment of cocoa pod ripeness is labor-intensive and subjective, with results varying from one inspector to another under the changing lighting and background conditions found in the field. This research implements an automated approach to cocoa ripeness classification using a Vision Transformer (ViT) with Shifted Patch Tokenization (SPT) and Locality Self-Attention (LSA) to improve classification accuracy. The proposed model achieved an accuracy of 82.65% and a macro F1-score of 82.71 on a test set of 1,559 images captured under varying illumination, backgrounds, and complex scenes. It also outperformed baseline CNN architectures such as VGG, MobileNet, and ResNet in distinguishing visually gradual stages of ripeness and demonstrated stronger generalization in cocoa ripeness classification. The findings indicate that manual inspection effort can be reduced without compromising quality-assurance standards in cocoa production. This work demonstrates a new way of applying transformer models to computer vision problems in agriculture, a step toward precision and smart farming.
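For readers unfamiliar with the two ViT modifications named in the abstract, the sketch below illustrates Shifted Patch Tokenization and Locality Self-Attention in the spirit of Lee, Lee, and Song (2021). It is not the authors' released code: the patch size, embedding dimension, head count, and the use of a cyclic shift (torch.roll) in place of the paper's crop-and-pad shifting are illustrative assumptions.

```python
# Minimal sketch of SPT and LSA (assumed hyperparameters; not the paper's implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F


class ShiftedPatchTokenization(nn.Module):
    """Concatenate the image with four half-patch diagonal shifts before patch embedding."""

    def __init__(self, in_ch=3, patch_size=16, dim=192):
        super().__init__()
        self.patch_size = patch_size
        self.norm = nn.LayerNorm(5 * in_ch * patch_size * patch_size)
        self.proj = nn.Linear(5 * in_ch * patch_size * patch_size, dim)

    def forward(self, x):                        # x: (B, C, H, W)
        s = self.patch_size // 2
        # Cyclic shifts used here for simplicity; the original SPT shifts and crops.
        shifted = [torch.roll(x, shifts=sh, dims=(2, 3))
                   for sh in [(-s, -s), (-s, s), (s, -s), (s, s)]]
        x = torch.cat([x] + shifted, dim=1)      # (B, 5C, H, W)
        # Split into non-overlapping patches and flatten each patch.
        patches = F.unfold(x, kernel_size=self.patch_size, stride=self.patch_size)
        patches = patches.transpose(1, 2)        # (B, num_patches, 5C*P*P)
        return self.proj(self.norm(patches))     # (B, num_patches, dim)


class LocalitySelfAttention(nn.Module):
    """Multi-head self-attention with a learnable temperature and diagonal masking."""

    def __init__(self, dim=192, heads=3):
        super().__init__()
        self.heads = heads
        self.head_dim = dim // heads
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        self.out = nn.Linear(dim, dim)
        # Learnable temperature replaces the fixed 1/sqrt(d) scaling.
        self.temperature = nn.Parameter(torch.tensor(self.head_dim ** -0.5))

    def forward(self, x):                        # x: (B, N, dim)
        B, N, _ = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)     # each: (B, heads, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.temperature
        # Diagonal masking: a token may not attend to itself, sharpening the distribution.
        mask = torch.eye(N, dtype=torch.bool, device=x.device)
        attn = attn.masked_fill(mask, float('-inf')).softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, -1)
        return self.out(out)


if __name__ == "__main__":
    tokens = ShiftedPatchTokenization()(torch.randn(2, 3, 224, 224))
    print(tokens.shape)                           # torch.Size([2, 196, 192])
    print(LocalitySelfAttention()(tokens).shape)  # torch.Size([2, 196, 192])
```

Both components target the small-dataset weakness of plain ViT that the abstract alludes to: SPT enlarges each token's receptive field with shifted copies of the image, while LSA's learnable temperature and self-masking encourage attention to concentrate on informative neighboring tokens.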