1.
TF-IDF-based classification of Uzbek educational texts
Khabibulla Madatov, Sapura Sattarova, Jernej Vičič, 2025, original scientific article

Description: This paper presents a baseline study on automatic Uzbek text classification. Uzbek is a morphologically rich and low-resource language, which makes reliable preprocessing and evaluation challenging. The approach integrates Term Frequency–Inverse Document Frequency (TF–IDF) representation with three conventional methods: linear regression (LR), k-Nearest Neighbors (k-NN), and cosine similarity (CS, implemented as a 1-NN retrieval model). The objective is to categorize school learning materials by grade level (grades 5–11) to support improved alignment between curricular texts and students’ intellectual development. A balanced dataset of Uzbek school textbooks across different subjects was constructed, preprocessed with standard NLP tools, and converted into TF–IDF vectors. Experimental results on the internal test set of 70 files show that LR achieved 92.9% accuracy (precision = 0.94, recall = 0.93, F1 = 0.93), while CS performed comparably with 91.4% accuracy (precision = 0.92, recall = 0.91, F1 = 0.92). In contrast, k-NN obtained only 28.6% accuracy, confirming its weakness in high-dimensional sparse feature spaces. External evaluation on seven Uzbek literary works further demonstrated that LR and CS yielded consistent and interpretable grade-level mappings, whereas k-NN results were unstable. Overall, the findings establish reliable baselines for Uzbek educational text classification and highlight the potential of extending beyond lexical overlap toward semantically richer models in future work.
Keywords: Uzbek language, text classification, low-resource languages, TF-IDF, cosine similarity, linear regression, k-Nearest Neighbors
Published in RUP: 17.10.2025; Views: 330; Downloads: 3
.pdf Full text (286.87 KB)
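
The cosine-similarity baseline described in the abstract above (TF-IDF vectors with 1-NN retrieval) can be illustrated with a minimal sketch. This is not the authors' implementation: the whitespace tokenisation, the smoothed IDF formula, the function names, and the three toy English "textbook" snippets with their grade labels are all assumptions made purely for illustration.

```python
import math
from collections import Counter

def weigh(tokens, idf):
    """Turn a token list into a sparse TF-IDF dict; terms unseen in training are dropped."""
    tf = Counter(tokens)
    return {t: (c / len(tokens)) * idf[t] for t, c in tf.items() if t in idf}

def fit_tfidf(docs):
    """Compute a smoothed IDF table and one TF-IDF vector per tokenised document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF (an assumed variant; the paper's exact weighting is not given here).
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [weigh(doc, idf) for doc in docs], idf

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def predict_grade(text, train_vecs, labels, idf):
    """1-NN retrieval: return the label of the most similar training document."""
    query = weigh(text.lower().split(), idf)
    best = max(range(len(train_vecs)), key=lambda i: cosine(query, train_vecs[i]))
    return labels[best]

# Hypothetical toy data standing in for grade-labelled textbook files.
train_texts = [
    "cat dog sun tree",
    "fraction equation angle triangle",
    "derivative integral limit function",
]
labels = [5, 8, 11]
train_vecs, idf = fit_tfidf([t.split() for t in train_texts])

print(predict_grade("the angle of a triangle", train_vecs, labels, idf))
```

With real data, each training item would be a preprocessed textbook file, and the morphological richness noted in the abstract suggests that stemming or lemmatisation would come before the weighting step.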

2.
Dataset of vocabulary in Uzbek primary education : extraction and analysis in case of the school corpus
Khabibulla Madatov, Sapura Sattarova, Jernej Vičič, 2025, original scientific article

Description: The main goal of this research is to determine the number of new words that a primary school pupil should acquire during each academic year. To accomplish this, we created two datasets. The first was compiled from the "Explanatory Vocabulary of the Uzbek Language" (EDUL). The second was built from 35 primary school textbooks for grades 1-4 approved by the Ministry of Preschool and School Education of the Republic of Uzbekistan, and was named the "Uzbek Primary School Corpus" (UPSC) by the authors. Using the "Comparative Lemma Extraction Method" (CLEM) proposed by the authors, a vocabulary for grades 1-4 was created, which made it possible to determine the number of new words (disregarding word forms, since Uzbek is a morphologically rich language) that primary school pupils should learn each academic year.
Keywords: Uzbek language, primary school, corpus construction, natural language processing (NLP), comparative Lemma extraction method
Published in RUP: 08.08.2025; Views: 539; Downloads: 7
.pdf Full text (342.87 KB)

3.
Lists of Uzbek Stopwords
Khabibulla Madatov, Shukurla Bekchanov, Jernej Vičič, 2021, complete scientific collection of research data

Keywords: stopwords, collection, Uzbek language
Published in RUP: 18.11.2021; Views: 3412; Downloads: 32
URL Link to full text
