Title:TF-IDF-based classification of Uzbek educational texts
Authors:Madatov, Khabibulla (Author)
Sattarova, Sapura (Author)
Vičič, Jernej (Author)
Files:RAZ_Madatov_Khabibulla_2025.pdf (286,87 KB)
MD5: 83753809D57B917BE2F9B9F317E52201
 
URL:https://www.mdpi.com/2076-3417/15/19/10808
 
Language:English
Work type:Article
Typology:1.01 - Original Scientific Article
Organization:FAMNIT - Faculty of Mathematics, Science and Information Technologies
Abstract:This paper presents a baseline study on automatic Uzbek text classification. Uzbek is a morphologically rich and low-resource language, which makes reliable preprocessing and evaluation challenging. The approach integrates Term Frequency–Inverse Document Frequency (TF–IDF) representation with three conventional methods: linear regression (LR), k-Nearest Neighbors (k-NN), and cosine similarity (CS, implemented as a 1-NN retrieval model). The objective is to categorize school learning materials by grade level (grades 5–11) to support improved alignment between curricular texts and students’ intellectual development. A balanced dataset of Uzbek school textbooks across different subjects was constructed, preprocessed with standard NLP tools, and converted into TF–IDF vectors. Experimental results on the internal test set of 70 files show that LR achieved 92.9% accuracy (precision = 0.94, recall = 0.93, F1 = 0.93), while CS performed comparably with 91.4% accuracy (precision = 0.92, recall = 0.91, F1 = 0.92). In contrast, k-NN obtained only 28.6% accuracy, confirming its weakness in high-dimensional sparse feature spaces. External evaluation on seven Uzbek literary works further demonstrated that LR and CS yielded consistent and interpretable grade-level mappings, whereas k-NN results were unstable. Overall, the findings establish reliable baselines for Uzbek educational text classification and highlight the potential of extending beyond lexical overlap toward semantically richer models in future work.
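The cosine-similarity baseline the abstract describes (TF-IDF vectors plus 1-NN retrieval: a test document receives the grade of its most similar training document) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy documents, grade labels, and whitespace tokenizer below are hypothetical stand-ins for the Uzbek textbook corpus and its NLP preprocessing.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one TF-IDF weight dict per document plus the shared IDF table."""
    tfs = [Counter(d.split()) for d in docs]          # raw term counts
    n = len(docs)
    df = Counter(t for tf in tfs for t in tf)         # document frequency
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}  # one common smoothing
    vecs = []
    for tf in tfs:
        total = sum(tf.values())
        vecs.append({t: (c / total) * idf[t] for t, c in tf.items()})
    return vecs, idf

def vectorize(doc, idf):
    """Weight an unseen document with the IDF table learned from training data."""
    tf = Counter(doc.split())
    total = sum(tf.values())
    return {t: (c / total) * idf[t] for t, c in tf.items() if t in idf}

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical "grade-level" training texts (the paper uses real textbook text).
train = [("the cat sat on the mat and played", 5),
         ("the integral of the function converges by the ratio test", 11)]
train_vecs, idf = tfidf([doc for doc, _ in train])

def predict(doc):
    """1-NN retrieval: return the grade of the most similar training document."""
    v = vectorize(doc, idf)
    sims = [cosine(v, tv) for tv in train_vecs]
    return train[sims.index(max(sims))][1]

print(predict("the function converges"))  # prints 11 for this toy input
```

With scikit-learn the same baseline would use TfidfVectorizer and a cosine-based nearest-neighbor search; the hand-rolled version is shown only to keep the sketch dependency-free.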
Keywords:Uzbek language, text classification, low-resource languages, TF-IDF, cosine similarity, linear regression, k-Nearest Neighbors
Publication version:Version of Record
Publication date:08.10.2025
Year of publishing:2025
Number of pages:pp. 1-13
Numbering:Vol. 15, iss. 19, [article no.] 10808
PID:20.500.12556/RUP-21966
UDC:004:811.5:81'322
ISSN on article:2076-3417
DOI:10.3390/app151910808
COBISS.SI-ID:253709315
Publication date in RUP:17.10.2025

Record is a part of a journal

Title:Applied sciences
Shortened title:Appl. sci.
Publisher:MDPI
ISSN:2076-3417
COBISS.SI-ID:522979353

Licences

License:CC BY 4.0, Creative Commons Attribution 4.0 International
Link:http://creativecommons.org/licenses/by/4.0/
Description:This is the standard Creative Commons license that gives others maximum freedom to do what they want with the work as long as they credit the author.

Secondary language

Language:Slovenian
Keywords:uzbeški jezik, klasifikacija besedil, jeziki z omejenimi viri, TF-IDF, kosinusna podobnost, linearna regresija, k-najbližji sosedje

