Normalization of Unabbreviated Non-Standard Words Using Edit Distance

I Gusti Bagus Baskara Nugraha; Rafi Dwi Rizqullah

I Gusti Bagus Baskara Nugraha, Institut Teknologi Bandung
Rafi Dwi Rizqullah, Institut Teknologi Bandung

Keywords: voice assistant, dictionary, non-standard words, normalization, Levenshtein distance, Jaro-Winkler distance

Abstract

Voice assistant technology is growing rapidly, and its use has begun to spread into everyday life. However, voice assistants are still limited to standard, formal language, whereas Indonesian speakers are accustomed to using informal language in daily conversation. This research proposes a solution to the problem voice assistants face with informal words, that is, words that are not found in a dictionary of formal words. We propose text normalization using Levenshtein distance. Test results show that normalization using Levenshtein distance outperforms normalization using Longest Common Subsequence (LCS) distance, with an accuracy difference of 8.34%.
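
The idea of dictionary-based normalization with Levenshtein distance can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the tiny word list, and the `max_distance` threshold are all assumptions made for illustration, and the abstract does not state how candidates are ranked or filtered in the actual system.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def normalize(word: str, dictionary: set[str], max_distance: int = 2) -> str:
    """Return the nearest dictionary word, or the input if nothing is close enough.

    max_distance is a hypothetical cut-off, not a value taken from the paper.
    """
    if word in dictionary:
        return word
    # Ties are broken alphabetically so the result is deterministic.
    best = min(dictionary, key=lambda entry: (levenshtein(word, entry), entry))
    return best if levenshtein(word, best) <= max_distance else word


# Hypothetical formal-word list; a real system would load a full dictionary.
formal_words = {"sudah", "tidak", "saja", "sampai"}

for informal in ["udah", "aja", "sampe"]:
    print(informal, "->", normalize(informal, formal_words))
```

In this sketch, a word that has no dictionary entry within the threshold is returned unchanged, which avoids forcing unrelated words onto a formal entry.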
