A linguistic corpus of the Sindhi language is significant for computational linguistics, machine learning, the identification and analysis of language features, semantic and sentiment analysis, information retrieval, and so on. Little computational linguistics work has been done on Sindhi text, whereas English, Arabic, Urdu, and some other languages are well resourced computationally. The grammar and morphemes of these languages have been analyzed thoroughly using a variety of machine learning methods. Research and development in computational linguistics for Sindhi is currently in progress. This study aims to develop a Sindhi annotated corpus using the universal POS (part-of-speech) tag set and a Sindhi POS tag set, for the purpose of analyzing language features and variation. Features are extracted using the TF-IDF (term frequency-inverse document frequency) technique. A supervised machine learning model is developed to assess the annotated corpus and learn the grammatical annotation of the Sindhi language. The model is trained on 80% of the annotated corpus and tested on the remaining 20%. Ten-fold cross-validation is used to evaluate and validate the model.
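The evaluation setup described in this abstract (TF-IDF features, an 80/20 train/test split, and 10-fold cross-validation) can be illustrated with a minimal sketch in plain Python. Everything here is an assumption for illustration: the function names `tfidf`, `train_test_split`, and `k_fold_indices` are hypothetical, the toy corpus uses placeholder tokens rather than real Sindhi data, and the TF-IDF variant (relative term frequency times the natural log of inverse document frequency) is one common formulation, not necessarily the one used in the study. The classifier itself is left out because the abstract does not name it.

```python
import math
import random
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: tf = count / document length, idf = ln(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def train_test_split(items, test_fraction=0.2, seed=0):
    """Shuffle a copy of items and split it into (train, test) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * (1 - test_fraction)))
    return shuffled[:cut], shuffled[cut:]


def k_fold_indices(n_items, k=10):
    """Yield (train_idx, val_idx) pairs; each item is validated exactly once."""
    folds = [list(range(i, n_items, k)) for i in range(k)]
    for held_out, val_idx in enumerate(folds):
        train_idx = [
            j for f, fold in enumerate(folds) if f != held_out for j in fold
        ]
        yield train_idx, val_idx


# Hypothetical toy corpus of 20 tokenized "sentences" (placeholder tokens).
corpus = [[f"w{i % 5}", f"w{(i * 3) % 7}"] for i in range(20)]

train_docs, test_docs = train_test_split(corpus, test_fraction=0.2)
train_features = tfidf(train_docs)
folds = list(k_fold_indices(len(train_docs), k=10))

print(len(train_docs), len(test_docs), len(folds))
```

In a full pipeline, a classifier would be fitted on the training indices of each fold and scored on the held-out indices, with the 20% test set kept aside for a final check.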