Title: TF-IDF-based classification of Uzbek educational texts
Authors: Madatov, Khabibulla (Author)
Sattarova, Sapura (Author)
Vičič, Jernej (Author)
Files: RAZ_Madatov_Khabibulla_2025.pdf (PDF, 286.87 KB)
MD5: 83753809D57B917BE2F9B9F317E52201
 
URL https://www.mdpi.com/2076-3417/15/19/10808
 
Language: English
Work type: Journal article
Typology: 1.01 - Original Scientific Article
Organization: FAMNIT - Faculty of Mathematics, Natural Sciences and Information Technologies
Abstract: This paper presents a baseline study on automatic Uzbek text classification. Uzbek is a morphologically rich and low-resource language, which makes reliable preprocessing and evaluation challenging. The approach integrates Term Frequency–Inverse Document Frequency (TF–IDF) representation with three conventional methods: linear regression (LR), k-Nearest Neighbors (k-NN), and cosine similarity (CS, implemented as a 1-NN retrieval model). The objective is to categorize school learning materials by grade level (grades 5–11) to support improved alignment between curricular texts and students’ intellectual development. A balanced dataset of Uzbek school textbooks across different subjects was constructed, preprocessed with standard NLP tools, and converted into TF–IDF vectors. Experimental results on the internal test set of 70 files show that LR achieved 92.9% accuracy (precision = 0.94, recall = 0.93, F1 = 0.93), while CS performed comparably with 91.4% accuracy (precision = 0.92, recall = 0.91, F1 = 0.92). In contrast, k-NN obtained only 28.6% accuracy, confirming its weakness in high-dimensional sparse feature spaces. External evaluation on seven Uzbek literary works further demonstrated that LR and CS yielded consistent and interpretable grade-level mappings, whereas k-NN results were unstable. Overall, the findings establish reliable baselines for Uzbek educational text classification and highlight the potential of extending beyond lexical overlap toward semantically richer models in future work.
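The cosine-similarity baseline described in the abstract (TF-IDF vectors plus 1-NN retrieval) can be illustrated with a minimal sketch. This is not the authors' implementation: the whitespace tokenizer, the smoothed IDF formula, the helper names, and the tiny English placeholder corpus are all assumptions made for illustration, standing in for the Uzbek textbook data and the NLP preprocessing the record mentions.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer (assumption); the paper uses standard NLP tools.
    return text.lower().split()


def fit_idf(docs_tokens):
    # Smoothed inverse document frequency (assumed variant):
    # idf(t) = ln((1 + N) / (1 + df(t))) + 1
    n_docs = len(docs_tokens)
    df = Counter(term for tokens in docs_tokens for term in set(tokens))
    return {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}


def tfidf_vector(tokens, idf):
    # Sparse TF-IDF vector stored as a dict; terms unseen in training are dropped.
    tf = Counter(t for t in tokens if t in idf)
    return {term: count * idf[term] for term, count in tf.items()}


def cosine(u, v):
    # Cosine similarity between two sparse vectors; 0.0 if either is empty.
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def predict_grade(text, train_vectors, train_grades, idf):
    # 1-NN retrieval: return the grade of the most similar training document.
    query = tfidf_vector(tokenize(text), idf)
    scores = [cosine(query, vec) for vec in train_vectors]
    best = max(range(len(scores)), key=scores.__getitem__)
    return train_grades[best]


# Hypothetical placeholder corpus: one short English text per grade label.
train_texts = [
    "add and subtract small numbers",
    "solve quadratic equations with the discriminant",
]
train_grades = [5, 9]

train_tokens = [tokenize(t) for t in train_texts]
idf = fit_idf(train_tokens)
train_vectors = [tfidf_vector(tokens, idf) for tokens in train_tokens]

predicted = predict_grade(
    "quadratic equations and the discriminant", train_vectors, train_grades, idf
)
print(predicted)
```

In this toy setup the query shares four terms with the second training text and only one with the first, so the nearest neighbour carries the grade-9 label. A real experiment would need one vector per textbook file and a held-out test set, as the abstract describes.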
Keywords: Uzbek language, text classification, low-resource languages, TF-IDF, cosine similarity, linear regression, k-Nearest Neighbors
Publication version: Published version
Publication date: 8 October 2025
Year of publication: 2025
Number of pages: pp. 1-13
Numbering: Vol. 15, iss. 19, [article no.] 10808
PID: 20.500.12556/RUP-21966
UDC: 004:811.5:81'322
ISSN on article: 2076-3417
DOI: 10.3390/app151910808
COBISS.SI-ID: 253709315
Date of publication in RUP: 17 October 2025
Views: 312
Downloads: 3

This document is part of a journal

Title: Applied sciences
Abbreviated title: Appl. sci.
Publisher: MDPI
ISSN: 2076-3417
COBISS.SI-ID: 522979353

Licences

Licence: CC BY 4.0, Creative Commons Attribution 4.0 International
Link: http://creativecommons.org/licenses/by/4.0/deed.sl
Description: This is the standard Creative Commons licence that gives users the most freedom for further use of the work, provided that they credit the author.

Secondary language

Language: Slovenian
Title: Is open source the future of AI
Keywords (in Slovenian): uzbeški jezik, klasifikacija besedil, jeziki z omejenimi viri, TF-IDF, kosinusna podobnost, linearna regresija, k-najbližji sosedje

