Cocoa Ripeness Classification Using Vision Transformer

Authors

  • Febryanti Sthevanie Telkom University
  • Untari Novia Wisesty Telkom University
  • Gia Septiana Wulandari Telkom University
  • Kurniawan Nur Ramadhani Telkom University

DOI:

https://doi.org/10.37385/jaets.v6i2.6663

Keywords:

Cocoa Ripeness Classification, Vision Transformer, Shifted Patch Tokenization, Locality Self Attention, Agricultural Computer Vision

Abstract

Manual assessment of cocoa pod ripeness is labor-intensive and subjective, with judgments varying from one inspector to another under the changing light and background conditions of the field. This research implements an automated approach to cocoa ripeness classification using a Vision Transformer (ViT) with Shifted Patch Tokenization (SPT) and Locality Self Attention (LSA) to improve classification accuracy. The proposed model achieved an accuracy of 82.65% and a macro F1 score of 82.71% on a test set of 1,559 images captured under varying illumination, diverse backgrounds, and complex scenes. The model also outperformed baseline CNN architectures such as VGG, MobileNet, and ResNet in identifying visually progressive ripeness stages and demonstrated stronger generalization in cocoa ripeness classification. These findings indicate that automated, careful inspection can reduce manual intervention without compromising quality assurance standards in cocoa production. This work demonstrates new ways of applying transformer models to computer vision problems in agriculture, a step toward precision and smart farming.
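The two ViT modifications named in the abstract come from Lee et al. (2021) in the reference list: SPT concatenates diagonally shifted copies of the image before patch embedding, and LSA applies a learnable softmax temperature together with diagonal masking so each token cannot attend to itself. A minimal NumPy sketch, assuming a single attention head with no learned projections (a simplification for illustration, not the paper's full implementation):

```python
import numpy as np

def shifted_patch_tokenize(img, patch=4):
    """SPT: concatenate the image with four diagonally shifted copies
    along the channel axis, then split into non-overlapping patches."""
    H, W, C = img.shape
    s = patch // 2                               # shift by half a patch
    pad = np.pad(img, ((s, s), (s, s), (0, 0)))  # zero-pad the borders
    views = [img,
             pad[:H, :W],                        # shifted down-right
             pad[:H, 2*s:2*s + W],               # shifted down-left
             pad[2*s:2*s + H, :W],               # shifted up-right
             pad[2*s:2*s + H, 2*s:2*s + W]]      # shifted up-left
    x = np.concatenate(views, axis=-1)           # (H, W, 5C)
    ph, pw = H // patch, W // patch
    x = x.reshape(ph, patch, pw, patch, 5 * C).transpose(0, 2, 1, 3, 4)
    return x.reshape(ph * pw, patch * patch * 5 * C)  # (num_tokens, dim)

def locality_self_attention(tokens, tau):
    """LSA: dot-product attention with a (learnable, here fixed) temperature
    tau and the diagonal masked so tokens cannot attend to themselves."""
    q = k = v = tokens                           # no projections, for brevity
    logits = q @ k.T / tau
    np.fill_diagonal(logits, -np.inf)            # diagonal masking
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)           # row-wise softmax
    return w @ v
```

In the full model, tau is a trainable parameter initialized near sqrt(head_dim); sharpening the attention distribution and removing self-similarity are what let ViTs train well on small datasets such as this one.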


References

Ala’a, R., & Ibrahim, R. W. (2024). Classification of tomato leaf images for detection of plant disease using conformable polynomials image features. MethodsX, 13, 102844. https://doi.org/10.1016/j.mex.2024.102844

Alimjan, G., Sun, T., Liang, Y., Jumahun, H., & Guan, Y. (2018). A new technique for remote sensing image classification based on combinatorial algorithm of SVM and KNN. International Journal of Pattern Recognition and Artificial Intelligence, 32(2), 1850004. https://doi.org/10.1142/S0218001418590127

Borhani, Y., Khoramdel, J., & Najafi, E. (2022). A deep learning based approach for automated plant disease classification using vision transformer. Scientific Reports, 12(1), 11554. https://doi.org/10.1038/s41598-022-15163-0

Brigato, L., & Iocchi, L. (2021). A close look at deep learning with small data. 2020 25th International Conference on Pattern Recognition (ICPR), 2490–2497.

Charco, J. L., Yanza-Montalvan, A., Zumba-Gamboa, J., Alonso-Anguizaca, J., & Basurto-Cruz, E. (2024). ViTSigat: Early Black Sigatoka Detection in Banana Plants Using Vision Transformer. Conference on Information and Communication Technologies of Ecuador, 117–130. https://doi.org/10.1007/978-3-031-75431-9_8

Chitta, S., Yandrapalli, V. K., & Sharma, S. (2024). Deep Learning for Precision Agriculture: Evaluating CNNs and Vision Transformers in Rice Disease Classification. 2024 OPJU International Technology Conference (OTCON) on Smart Computing for Innovation and Advancement in Industry 4.0, 1–6. https://doi.org/10.1109/OTCON60325.2024.10687983

De Silva, M., & Brown, D. (2023). Multispectral plant disease detection with vision transformer-convolutional neural network hybrid approaches. Sensors, 23(20), 8531. https://doi.org/10.3390/s23208531

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., & Houlsby, N. (2020). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv Preprint arXiv:2010.11929. https://doi.org/10.48550/arXiv.2010.11929

El Sakka, M., Mothe, J., & Ivanovici, M. (2024). Images and CNN applications in smart agriculture. European Journal of Remote Sensing, 57(1), 2352386. https://doi.org/10.1080/22797254.2024.2352386

Emmamuel, A., Asim, U., Yu, H., Kim, S., & others. (2022). 3D-CNN method over shifted patch tokenization for MRI-based diagnosis of Alzheimer’s disease using segmented hippocampus. Journal of Multimedia Information System, 9(4), 245–252. https://doi.org/10.33851/JMIS.2022.9.4.245

Ergün, E. (2025). High precision banana variety identification using vision transformer based feature extraction and support vector machine. Scientific Reports, 15(1), 10366. https://doi.org/10.1038/s41598-025-95466-0

Eric, O., Gyening, R.-M. O. M., Appiah, O., Takyi, K., & Appiahene, P. (2023). Cocoa beans classification using enhanced image feature extraction techniques and a regularized Artificial Neural Network model. Engineering Applications of Artificial Intelligence, 125, 106736. https://doi.org/10.1016/j.engappai.2023.106736

Essah, R., Anand, D., & Singh, S. (2022). An intelligent cocoa quality testing framework based on deep learning techniques. Measurement: Sensors, 24, 100466. https://doi.org/10.1016/j.measen.2022.100466

Food and Agriculture Organization (FAO). (2023). Indonesia: Upgrading bulk cocoa into fine cocoa. https://openknowledge.fao.org/server/api/core/bitstreams/684e2bd3-6b91-48f5-a7cd-4125c5c74cab/content

Gal, Y., & Ghahramani, Z. (2016). Dropout as a bayesian approximation: Representing model uncertainty in deep learning. International Conference on Machine Learning, 1050–1059. https://dl.acm.org/doi/10.5555/3045390.3045502

Guo, M.-H., Xu, T.-X., Liu, J.-J., Liu, Z.-N., Jiang, P.-T., Mu, T.-J., Zhang, S.-H., Martin, R. R., Cheng, M.-M., & Hu, S.-M. (2022). Attention mechanisms in computer vision: A survey. Computational Visual Media, 8(3), 331–368. https://link.springer.com/article/10.1007/s41095-022-0271-y

Guo, Q., Qiu, X., Xue, X., & Zhang, Z. (2019). Low-rank and locality constrained self-attention for sequence modeling. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27(12), 2213–2222. https://doi.org/10.1109/TASLP.2019.2944078

International Cocoa Organization (ICCO). (2022). Top 10 cocoa-producers and the issue of child labor in the industry. https://www.developmentaid.org/news-stream/post/176254/top-10-cocoa-producers

Joshi, B., Bansal, S., & Sharma, C. (2023). Classification of Tomato Leaf Disease using Feature Extraction with KNN Classifier. 2023 Seventh International Conference on Image Information Processing (ICIIP), 541–546. https://doi.org/10.1109/ICIIP61524.2023.10537671

Juncai, H., Yaohua, H., Lixia, H., Kangquan, G., & Satake, T. (2015). Classification of ripening stages of bananas based on support vector machine. International Journal of Agricultural and Biological Engineering, 8(6), 99–103. https://doi.org/10.3965/j.ijabe.20150806.1275

Khaki, S., & Wang, L. (2019). Crop yield prediction using deep neural networks. Frontiers in Plant Science, 10, 621. https://doi.org/10.3389/fpls.2019.00621

Khaki, S., Wang, L., & Archontoulis, S. V. (2020). A CNN-RNN framework for crop yield prediction. Frontiers in Plant Science, 11, 621. https://doi.org/10.3389/fpls.2019.01750

Khan, S., Naseer, M., Hayat, M., Zamir, S. W., Khan, F. S., & Shah, M. (2022). Transformers in vision: A survey. ACM Computing Surveys (CSUR), 54(10s), 1–41. https://doi.org/10.1145/3505244

Kharamat, W., Wongsaisuwan, M., & Wattanamongkhol, N. (2020). Durian ripeness classification from the knocking sounds using convolutional neural network. 2020 8th International Electrical Engineering Congress (iEECON), 1–4. https://doi.org/10.1109/iEECON48109.2020.229571

Kulkarni, A., Shivananda, A., & Sharma, N. R. (2022). Explainable AI for computer vision. In Computer Vision Projects with PyTorch: Design and Develop Production-Grade Models (pp. 325–340). Springer. https://doi.org/10.1007/978-1-4842-8273-1_10

Lee, S. H., Lee, S., & Song, B. C. (2021). Vision transformer for small-size datasets. arXiv Preprint arXiv:2112.13492. https://doi.org/10.48550/arXiv.2112.13492

Lin, F., Crawford, S., Guillot, K., Zhang, Y., Chen, Y., Yuan, X., Chen, L., Williams, S., Minvielle, R., Xiao, X., & others. (2023). Mmst-vit: Climate change-aware crop yield prediction via multi-modal spatial-temporal vision transformer. Proceedings of the IEEE/CVF International Conference on Computer Vision, 5774–5784. https://doi.org/10.1109/ICCV51070.2023.00531

Liu, J., & Wang, X. (2021). Plant diseases and pests detection based on deep learning: A review. Plant Methods, 17, 22. https://doi.org/10.1186/s13007-021-00722-9

Lopes, J. F., da Costa, V. G. T., Barbin, D. F., Cruz-Tirado, L. J. P., Baeten, V., & Barbon Junior, S. (2022). Deep computer vision system for cocoa classification. Multimedia Tools and Applications, 81(28), 41059–41077. https://doi.org/10.1007/s11042-022-13097-3

Mishra, A., & Malhotra, M. (2024). A Dual Approach with Grad-CAM and Layer-Wise Relevance Propagation for CNN Models Explainability. International Conference on Innovation and Emerging Trends in Computing and Information Technologies, 116–129. https://doi.org/10.1007/978-3-031-80842-5_10

Nahak, P., Pratihar, D. K., & Deb, A. K. (2025). Tomato maturity stage prediction based on vision transformer and deep convolution neural networks. International Journal of Hybrid Intelligent Systems, 21(1), 61–78. https://doi.org/10.3233/HIS-240021

Paneru, B., Paneru, B., & Shah, K. B. (2024). Analysis of Convolutional Neural Network-based Image Classifications: A Multi-Featured Application for Rice Leaf Disease Prediction and Recommendations for Farmers. arXiv Preprint arXiv:2410.01827. https://doi.org/10.48550/arXiv.2410.01827

Pothen, Z., & Nuske, S. (2016). Automated assessment and mapping of grape quality through image-based color analysis. IFAC-PapersOnLine, 49(16), 72–78. https://doi.org/10.1016/j.ifacol.2016.10.014

Rad, R. (2024). Vision transformer for multispectral satellite imagery: Advancing landcover classification. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 8176–8183. https://doi.org/10.1109/WACV57701.2024.00799

Reedha, R., Dericquebourg, E., Canals, R., & Hafiane, A. (2022). Transformer neural network for weed and crop classification of high resolution UAV images. Remote Sensing, 14(3), 592. https://doi.org/10.3390/rs14030592

Shimazu, R., Leow, C. S., Buayai, P., Makino, K., Mao, X., & Nishizaki, H. (2024). High Quality Color Estimation of Shine Muscat Grape Using Vision Transformer. 2024 International Conference on Cyberworlds (CW), 195–202. https://doi.org/10.1109/CW64301.2024.00028

Siregar, B., Pradaning, R., & Hizriadi, A. (2023). Cocoa Ripeness Level Sorting System Using Integrated Computer Vision Technology On Conveyor Belt. 2023 8th International Conference on Electrical, Electronics and Information Engineering (ICEEIE), 1–6. https://doi.org/10.1109/ICEEIE59078.2023.10334634

Suban, I. B., Paramartha, A., Fortwonatus, M., & Santoso, A. J. (2020). Identification the maturity level of carica papaya using the k-nearest neighbor. Journal of Physics: Conference Series, 1577(1), 012028. https://doi.org/10.1088/1742-6596/1577/1/012028

Ulukaya, S., & Deari, S. (2025). A robust vision transformer-based approach for classification of labeled rices in the wild. Computers and Electronics in Agriculture, 231, 109950. https://doi.org/10.1016/j.compag.2025.109950

Yasin, A., & Fatima, R. (2023). On the Image-Based Detection of Tomato and Corn leaves Diseases: An in-depth comparative experiments. arXiv Preprint arXiv:2312.08659. https://doi.org/10.48550/arXiv.2312.08659

Yu, C., Wang, J., Chen, Y., & Wu, Z. (2019). Accelerating deep unsupervised domain adaptation with transfer channel pruning. 2019 International Joint Conference on Neural Networks (IJCNN), 1–8. https://doi.org/10.1109/IJCNN.2019.8851810

Zhao, J., Berge, T. W., & Geipel, J. (2023). Transformer in UAV Image-Based Weed Mapping. Remote Sensing, 15(21), 5165. https://doi.org/10.3390/rs15215165

Zhou, J., Wang, P., Wang, F., Liu, Q., Li, H., & Jin, R. (2021). Elsa: Enhanced local self-attention for vision transformer. arXiv Preprint arXiv:2112.12786. https://doi.org/10.48550/arXiv.2112.12786

Published

2025-06-08

How to Cite

Sthevanie, F., Wisesty, U. N., Wulandari, G. S., & Ramadhani, K. N. (2025). Cocoa Ripeness Classification Using Vision Transformer. Journal of Applied Engineering and Technological Science (JAETS), 6(2), 984–996. https://doi.org/10.37385/jaets.v6i2.6663