USING DICTIONARY-BASED AND CORPUS-BASED THESAURI FOR TERM WEIGHTING IN THE CLUSTERING OF INDONESIAN-LANGUAGE NEWS DOCUMENTS

Amelia Sahira Rahma, Vit Zuraida, Dimas Fanny Hebrasianto Permadi

Abstract


The large number of digital news documents in Indonesian has created a need for automatic topic-based document clustering, so that readers can more easily find news articles on the same topic. A major problem in document clustering is the low relevance of clustering results, which leaves documents grouped under topics that do not fit them. This paper proposes a term weighting method that combines a corpus-based thesaurus with a dictionary-based thesaurus to account for conceptual similarity between terms. The method is evaluated with the K-Means algorithm on 253 news documents in Indonesian. Experimental results show that the proposed term weighting method achieves good performance.
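The abstract does not give the weighting formula, so the following is only a minimal sketch of the general idea of thesaurus-aware term weighting, not the authors' method. It assumes a smoothed TF-IDF baseline and a hypothetical toy thesaurus (`THESAURUS`) that maps a term to related terms with a similarity score; the function names, example documents, and the score 0.8 are all illustrative assumptions. Each term passes part of its weight to its related terms, so documents that use different but conceptually similar words end up with closer vectors.

```python
import math
from collections import Counter

# Hypothetical toy thesaurus (assumption): term -> {related term: similarity in [0, 1]}.
# In practice the entries could come from a dictionary-based and a corpus-based thesaurus.
THESAURUS = {
    "mobil": {"kendaraan": 0.8},      # "car" -> "vehicle"
    "kendaraan": {"mobil": 0.8},
}

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document (smoothed IDF)."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    vectors = []
    for d in docs:
        tf = Counter(d)
        vectors.append({t: tf[t] * (math.log((1 + n) / (1 + df[t])) + 1) for t in tf})
    return vectors

def expand(vec, thesaurus):
    """Add to each related term a share of the original term's weight."""
    out = dict(vec)
    for term, weight in vec.items():
        for related, sim in thesaurus.get(term, {}).items():
            out[related] = out.get(related, 0.0) + sim * weight
    return out

def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Illustrative, already tokenized documents (assumption, not the paper's data).
docs = [
    ["mobil", "baru", "dijual"],
    ["kendaraan", "baru", "dipamerkan"],
    ["harga", "beras", "naik"],
]

plain = tf_idf(docs)
expanded = [expand(v, THESAURUS) for v in plain]

print("plain   :", round(cosine(plain[0], plain[1]), 3))
print("expanded:", round(cosine(expanded[0], expanded[1]), 3))
```

The expanded vectors could then be passed to any K-Means implementation in place of plain TF-IDF vectors. How the two thesauri are actually combined is described in the full paper and is not reproduced here.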






DOI: http://dx.doi.org/10.36564/njca.v2i1.25

DOI (PDF (Bahasa Indonesia)): http://dx.doi.org/10.36564/njca.v2i1.25.g18



Copyright (c) 2017 Amelia Sahira Rahma, Vit Zuraida, Dimas Fanny Hebrasianto Permadi

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

NJCA (Nusantara Journal of Computers and Its Applications)
Published by Computer Society of Nahdlatul Ulama, Indonesia.
Office: PO Box 1, Paiton, Probolinggo, postal code 67291, Jawa Timur, Indonesia

DECREE OF THE MINISTER OF LAW AND HUMAN RIGHTS OF THE REPUBLIC OF INDONESIA
NUMBER AHU-0060541.AH.01.07.YEAR 2016