LLMs Performance on Vietnamese High School Biology Examination




Xuan-Quy Dao 1, Ngoc-Bich Le 2,*

1. School of Engineering, Eastern International University, Binh Duong, Vietnam

2. School of Biomedical Engineering, International University, Vietnam National University HCM City, HCM City, Vietnam

* Corresponding author.

DOI: https://doi.org/10.5815/ijmecs.2023.06.02

Received: 17 May 2023 / Revised: 25 Jun. 2023 / Accepted: 28 Jul. 2023 / Published: 8 Dec. 2023

Index Terms

ChatGPT, Microsoft Bing Chat, Google Bard, large language models, biology education


Large Language Models (LLMs) have received significant attention for their potential to transform education and assessment by providing automated responses to a diverse range of inquiries. This research examines the performance of three LLMs (ChatGPT, BingChat, and Bard) on the Vietnamese High School Biology Examination dataset, which consists of biology questions that vary in difficulty and context. Through a thorough analysis, we reveal the merits and drawbacks of each LLM, providing insights for their successful incorporation into educational platforms. The study examines the proficiency of the LLMs at four levels of questioning: Knowledge, Comprehension, Application, and High Application. The findings reveal complex and subtle patterns in performance. ChatGPT is versatile, showing potential across multiple levels, but it has difficulty maintaining consistency and addressing complex application queries. BingChat and Bard perform strongly on tasks involving factual recall, comprehension, and interpretation, indicating their effectiveness in facilitating fundamental learning. The analysis also considers educational environments: BingChat and Bard have the potential to augment factual and comprehension learning experiences, while human expertise remains indispensable for tackling complex application inquiries. The research emphasizes the importance of a well-rounded approach to integrating LLMs, one that takes into account their capabilities while recognizing their limitations. Collaboration among educators, developers, and AI researchers can help refine LLM capabilities and resolve the challenges of advanced application scenarios.

Cite This Paper

Xuan-Quy Dao, Ngoc-Bich Le, "LLMs Performance on Vietnamese High School Biology Examination", International Journal of Modern Education and Computer Science (IJMECS), Vol. 15, No. 6, pp. 14-30, 2023. DOI: 10.5815/ijmecs.2023.06.02

