OUT-OF-VOCABULARY HANDLING IN UNSTRUCTURED DATA USING MODIFIED SOUNDEX PHONETIC RULE AND SIMILARITY ALGORITHMS

Afiyati, Afiyati and Azhari, Azhari and Sari, Anny Kartika (2023) OUT-OF-VOCABULARY HANDLING IN UNSTRUCTURED DATA USING MODIFIED SOUNDEX PHONETIC RULE AND SIMILARITY ALGORITHMS. ICIC Express Letters, 17 (9). pp. 979-987. ISSN 1881803X

Abstract

The biggest challenge in analyzing unstructured data, such as data from social media, is the presence of out-of-vocabulary (OOV) words, which are words that are not listed in the standard dictionary. They can decrease the accuracy of text classification tasks. In this study, a modification of the Soundex phonetic algorithm is proposed to handle the problem by adding the last character to the Soundex code to accelerate the discovery of the most similar word to replace each OOV word. Four similarity algorithms, namely sequence matcher, Levenshtein distance, Jaro similarity, and Jaro-Winkler similarity, are used to find the best replacement for the OOV words. The method is applied to Indonesian tweet data, of which 73% are OOV words; hence, the Indonesian phonetic rule is applied to the modified Soundex algorithm. It was found that the best correlation value of 1.00 was obtained between the sequence matcher and Jaro-Winkler algorithms. The similarity values of the modified Soundex with the Indonesian phonetic rule for Jaro and Jaro-Winkler were higher (0.92) than for the sequence matcher (0.75). The normalization time using the proposed method was faster than with the original Soundex algorithm. The results of OOV normalization were shown to increase the accuracy of the sarcasm detection task.
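
The idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the classic English Soundex letter groups (the paper's Indonesian phonetic rule is not given in the abstract and is not reproduced here), and it assumes that "the last character" means the final letter of the word. The function names `soundex`, `modified_soundex`, and `best_replacement` are hypothetical, and only one of the four similarity measures, the sequence matcher from Python's standard `difflib` module, is shown.

```python
from difflib import SequenceMatcher

# Classic English Soundex digit groups. Assumption: the paper replaces this
# table with an Indonesian phonetic rule that the abstract does not specify.
_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(word):
    """Return the classic four-character Soundex code of a word."""
    letters = [ch for ch in word.lower() if ch.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = _CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        if ch not in "hw":  # h and w do not separate letters with equal codes
            prev = digit
    return (code + "000")[:4]


def modified_soundex(word):
    """Soundex code with the word's final letter appended (assumed reading)."""
    letters = [ch for ch in word.lower() if ch.isalpha()]
    if not letters:
        return ""
    return soundex(word) + letters[-1]


def best_replacement(oov_word, vocabulary):
    """Pick the dictionary word with the same modified code and the highest
    sequence-matcher ratio, or None when no word shares the code."""
    key = modified_soundex(oov_word)
    candidates = [w for w in vocabulary if modified_soundex(w) == key]
    if not candidates:
        return None
    return max(candidates,
               key=lambda w: SequenceMatcher(None, oov_word, w).ratio())


if __name__ == "__main__":
    vocabulary = ["makan", "makam", "makin", "malam", "minum"]
    print(best_replacement("makann", vocabulary))
```

In this sketch, "makan" and "makam" share the classic code M250 but receive different modified codes, so the appended letter shrinks the candidate set before any similarity score is computed. That is one plausible reading of how the modification speeds up the search.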

Item Type: Article
Uncontrolled Keywords: Out-of-vocabulary, Phonetic rule, Similarity measures, Soundex algorithm, Unstructured data
Subjects: Q Science > QA Mathematics > QA75 Electronic computers. Computer science
Divisions: Faculty of Mathematics and Natural Sciences > Computer Science & Electronics Department
Depositing User: Wiyarsih Wiyarsih
Date Deposited: 15 Aug 2024 08:36
Last Modified: 15 Aug 2024 08:36
URI: https://ir.lib.ugm.ac.id/id/eprint/3037
