Dataset of vocabulary in Uzbek primary education : extraction and analysis in case of the school corpus

Madatov, Khabibulla; Sattarova, Sapura; Vičič, Jernej

Title:	Dataset of vocabulary in Uzbek primary education : extraction and analysis in case of the school corpus
Authors:	Madatov, Khabibulla (Author); Sattarova, Sapura (Author); Vičič, Jernej (Author)
Files:	RAZ_Madatov_Khabibulla_2025.pdf (342.87 KB); MD5: B099D0590099A4FB7D1438D190B9CE01; URL: https://www.sciencedirect.com/science/article/pii/S2352340925000812
Language:	English
Work type:	Article
Typology:	1.01 - Original Scientific Article
Organization:	FAMNIT - Faculty of Mathematics, Science and Information Technologies
Abstract:	The main goal of this research is to determine the number of new words that a primary school pupil should acquire during each academic year. To accomplish this, we created two datasets. The first was compiled from the "Explanatory Dictionary of the Uzbek Language" (EDUL). The second was created from 35 primary school textbooks for grades 1-4 approved by the Ministry of Preschool and School Education of the Republic of Uzbekistan, and the authors named it the "Uzbek Primary School Corpus" (UPSC). Using the "Comparative Lemma Extraction Method" (CLEM) proposed by the authors, a vocabulary for grades 1-4 was created, and the number of new words that primary school pupils should learn each academic year was determined. Word forms were disregarded, as Uzbek is a morphologically rich language.
Keywords:	Uzbek language, primary school, corpus construction, natural language processing (NLP), comparative lemma extraction method
Publication date:	03.02.2025
Year of publishing:	2025
Number of pages:	pp. 1-12
Numbering:	Vol. 59, article 111349
PID:	20.500.12556/RUP-21537
UDC:	004.65:811.5
ISSN on article:	2352-3409
DOI:	10.1016/j.dib.2025.111349
COBISS.SI-ID:	225129475
Publication date in RUP:	08.08.2025
Views:	1111
Downloads:	10
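
The abstract describes counting the new words introduced in each grade once word forms have been reduced to lemmas. The record does not specify how CLEM works, so the following is not the authors' method. It is a minimal sketch of the counting step only, assuming that lemmatization has already been done and that each grade is represented by a set of lemmas. The function name new_lemmas_per_grade and the toy word lists are hypothetical and chosen for illustration.

```python
from typing import Dict, Set


def new_lemmas_per_grade(grade_lemmas: Dict[int, Set[str]]) -> Dict[int, Set[str]]:
    """For each grade, return the lemmas that did not occur in any lower grade.

    Assumption: the input sets already contain lemmas, not inflected word forms.
    """
    seen: Set[str] = set()
    result: Dict[int, Set[str]] = {}
    for grade in sorted(grade_lemmas):
        fresh = grade_lemmas[grade] - seen  # lemmas first met in this grade
        result[grade] = fresh
        seen |= fresh                       # remember them for later grades
    return result


# Hypothetical toy data; these are not taken from the UPSC.
toy = {
    1: {"kitob", "maktab"},
    2: {"kitob", "daftar", "qalam"},
    3: {"maktab", "qalam", "daraxt"},
    4: {"daraxt"},
}

counts = {grade: len(words) for grade, words in new_lemmas_per_grade(toy).items()}
print(counts)
```

Processing the grades in ascending order means a lemma is counted only in the first grade where it appears, which matches the idea of "new words per academic year" in the abstract.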

Record is a part of a journal

Title:	Data in brief
Publisher:	Elsevier
ISSN:	2352-3409
COBISS.SI-ID:	32117977

Document is financed by a project

Funder:	EC - European Commission
Project number:	739574
Name:	Renewable materials and healthy environments research and innovation centre of excellence
Acronym:	InnoRenew CoE

Funder:	EC - European Commission
Project number:	610170-EPP-1-2019-1-ES-EPPKA2-CBHE-JP
Name:	Establishment of training and research centers and Courses development on Intelligent BigData Analysis in CA

Licences

License:	CC BY 4.0, Creative Commons Attribution 4.0 International

Link:	http://creativecommons.org/licenses/by/4.0/
Description:	This is the standard Creative Commons license that gives others maximum freedom to do what they want with the work as long as they credit the author.

Secondary language

Language:	Slovenian
Keywords:	uzbeški jezik, osnovna šola, konstrukcija korpusa, obdelava naravnega jezika (NLP), metoda primerjalne ekstrakcije lem
