Imputation Method on Opak Watershed Data, Special Region of Yogyakarta
Abstract
Water resources data in Indonesia suffer from several complex data-quality problems. The main problems encountered when collecting data at several Indonesian agencies are the accuracy and completeness of the data. Several methods can be used to impute missing values, such as k-Nearest Neighbors Imputation (k-NNi) and Multivariate Imputation by Chained Equations (MICE). This study compares these methods to find the most appropriate one for the Opak watershed dataset in the Special Region of Yogyakarta. The Opak watershed is characterized by its fan shape, which gives a shorter time of concentration and produces a higher flow. The statistical validation showed that the k-NNi method with k = 28 gave the most consistent average RMSE and MAE values. In terms of R-squared, the k-NNi method with k = 28 obtained the best average value at 80%, followed by the k-NNi method with k = 7, the default k value, at 73%. Among the methods applied, MICE, used as the comparison method, obtained the lowest average value at 63%.
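The validation procedure summarized in the abstract (hide known values, impute them, then score the imputations with RMSE, MAE, and R-squared) can be illustrated with a minimal sketch. This is not the study's implementation: the data are synthetic, the function names `knn_impute`, `rmse`, `mae`, and `r_squared` are hypothetical, the distance measure and neighbor averaging are assumptions, and R-squared is taken here as the coefficient of determination. MICE is not covered by this sketch.

```python
import math
import random


def knn_impute(rows, k):
    """Fill None entries with the mean of that column over the k nearest rows.

    Distance is the root mean squared difference over the columns observed
    in both rows (an assumption; other distance measures are possible).
    """
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for m, other in enumerate(rows):
                # A donor row must be a different row with column j observed.
                if m == i or other[j] is None:
                    continue
                shared = [(a, b) for a, b in zip(row, other)
                          if a is not None and b is not None]
                if not shared:
                    continue
                dist = math.sqrt(
                    sum((a - b) ** 2 for a, b in shared) / len(shared))
                candidates.append((dist, other[j]))
            candidates.sort(key=lambda c: c[0])
            nearest = [v for _, v in candidates[:k]]
            if nearest:
                filled[i][j] = sum(nearest) / len(nearest)
    return filled


def rmse(truth, pred):
    """Root mean square error."""
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(truth, pred)) / len(truth))


def mae(truth, pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(truth, pred)) / len(truth)


def r_squared(truth, pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(truth) / len(truth)
    ss_res = sum((t - p) ** 2 for t, p in zip(truth, pred))
    ss_tot = sum((t - mean_t) ** 2 for t in truth)
    return 1.0 - ss_res / ss_tot


# Synthetic stand-in for three correlated station records (not Opak data).
random.seed(0)
complete = []
for _ in range(200):
    base = random.uniform(0.0, 50.0)
    complete.append([
        base + random.gauss(0, 2),
        0.8 * base + random.gauss(0, 2),
        1.2 * base + random.gauss(0, 2),
    ])

# Hide 10% of the first column so the true values are known for validation.
masked = [list(r) for r in complete]
hidden = random.sample(range(len(complete)), 20)
for i in hidden:
    masked[i][0] = None
truth = [complete[i][0] for i in hidden]

# Score the two k values mentioned in the abstract.
results = {}
for k in (7, 28):
    imputed = knn_impute(masked, k)
    pred = [imputed[i][0] for i in hidden]
    results[k] = (rmse(truth, pred), mae(truth, pred), r_squared(truth, pred))
    print(f"k={k}: RMSE={results[k][0]:.3f} "
          f"MAE={results[k][1]:.3f} R2={results[k][2]:.3f}")
```

The scores printed for this synthetic data say nothing about the Opak watershed results; the sketch only shows how the three validation measures are computed for a given k.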
© Jurnal Nasional Teknik Elektro dan Teknologi Informasi, under the terms of the Creative Commons Attribution-ShareAlike 4.0 International License.