{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Metode Klasifikasi Naive Bayes \n",
"\n",
"Naive Bayes adalah metode klasifikasi berbasis probabilitas yang digunakan untuk memprediksi kelas dari suatu data berdasarkan fitur-fiturnya. Naive Bayes didasarkan pada Teorema Bayes dengan asumsi bahwa setiap fitur bersifat independen satu sama lain. Metode ini dapat digunakan pada data kategorikal dan data numeriks. Jika data terdiri dari data numeriks maka memerlukan pendekatan Gaussian Naive Bayes. \n",
"\n",
"Rumus : $$P(Y|X) = \\frac{P(X|Y) \\cdot P(Y)}{P(X)}$$ \n",
"\n",
"Penjelasan : \n",
"* $P(Y|X)$: Probabilitas posterior (probabilitas kelas $Y$ diberikan fitur $X$).\n",
"* $P(X|Y)$: Likelihood (probabilitas fitur $X$ diberikan kelas $Y$).\n",
"* $P(Y)$: Probabilitas awal (prior) dari kelas $Y$.\n",
"* $P(X)$: Evidence (probabilitas fitur $X$ secara keseluruhan).\n",
"\n",
"## Langkah-langkah Klasifikasi Naive Bayes \n",
"1. Menyiapkan dataset \n",
"* Mengumpulkan data training yang berisi fitur-fitur X dan label y \n",
"* Pastikan data memiliki atribut (fitur) numerik atau kategorikal.\n",
"* Jika menggunakan data numerik, bisa menerapkan Gaussian Naïve Bayes. \n",
"\n",
"2. Menghitung Prior / probabilitas awal $p(Y)$ \n",
"* Hitung frekuensi dari setiap kelas berdasarkan dataset \n",
"* Rumus : \n",
"\n",
"$$P(Y) = \\frac{\\text{jumlah total data kelas (y) }}{\\text{jumlah total data}}$$ \n",
"\n",
"3. Menghitung mean dan variansi fitur setiap kelas \n",
"\n",
"$$Mean = \\frac{\\text{jumlah data (y) }}{\\text{jumlah total data}}$$ \n",
"\n",
"$$Variance = \\frac{(x_1 - \\mu)^2 + (x_2 - \\mu)^2 + \\dots + (x_n - \\mu)^2}{N}$$ \n",
"\n",
"4. Rumus Gaussian untuk likelihood\n",
"\n",
"$$P(X_i | Y) = \\frac{1}{\\sqrt{2\\pi\\sigma^2}} \\cdot e^{-\\frac{(x-\\mu)^2}{2\\sigma^2}}$$ \n",
"\n",
"* $P(Xi∣Y)$ : Probabilitas fitur $Xi$ pada kelas $y$ \n",
"* $\\mu$ : mean/rata-rata fitur \n",
"* $σ^2$ : variance \n",
"\n",
"5. Posterior probabilitas \n",
"\n",
"$$P(Y | X) = \\frac{P(X | Y) \\cdot P(Y)}{P(X)}$$ \n",
"\n",
"* $P(Y | X)$ = Probabilitas suatu kelas $( Y )$ diberikan fitur $( X )$(probabilitas posterior) \n",
"* $ P(X | Y)$ = Probabilitas fitur $( X )$ pada kelas $( Y )$(likelihood) \n",
"* $P(Y)$ = Probabilitas awal dari kelas $( Y )$ (prior) \n",
"* $P(X)$ = Probabilitas keseluruhan fitur $( X )$ (evidence), dapat diabaikan dalam klasifikasi Naïve Bayes karena hanya sebagai faktor normalisasi. \n",
"\n",
"6. Memprediksi hasil \n",
"\n",
"* membandingkan nilai $P(Y∣X)$ untuk semua kelas.\n",
"* memilih kelas dengan nilai $P(Y∣X)$ terbesar sebagai kelas prediksi atau hasil klasifikasi. \n",
"\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Klasifikasi Naive Bayes Tanpa Outlier"
]
},
{
"cell_type": "code",
"execution_count": 89,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" id | \n",
" class | \n",
" petal_length | \n",
" petal_width | \n",
" sepal_length | \n",
" sepal_width | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 2 | \n",
" Iris-setosa | \n",
" 1.4 | \n",
" 0.2 | \n",
" 4.9 | \n",
" 3.0 | \n",
"
\n",
" \n",
" 1 | \n",
" 3 | \n",
" Iris-setosa | \n",
" 1.3 | \n",
" 0.2 | \n",
" 4.7 | \n",
" 3.2 | \n",
"
\n",
" \n",
" 2 | \n",
" 4 | \n",
" Iris-setosa | \n",
" 1.5 | \n",
" 0.2 | \n",
" 4.6 | \n",
" 3.1 | \n",
"
\n",
" \n",
" 3 | \n",
" 5 | \n",
" Iris-setosa | \n",
" 1.4 | \n",
" 0.2 | \n",
" 5.0 | \n",
" 3.6 | \n",
"
\n",
" \n",
" 4 | \n",
" 6 | \n",
" Iris-setosa | \n",
" 1.7 | \n",
" 0.4 | \n",
" 5.4 | \n",
" 3.9 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" id class petal_length petal_width sepal_length sepal_width\n",
"0 2 Iris-setosa 1.4 0.2 4.9 3.0\n",
"1 3 Iris-setosa 1.3 0.2 4.7 3.2\n",
"2 4 Iris-setosa 1.5 0.2 4.6 3.1\n",
"3 5 Iris-setosa 1.4 0.2 5.0 3.6\n",
"4 6 Iris-setosa 1.7 0.4 5.4 3.9"
]
},
"execution_count": 89,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import numpy as np \n",
"import matplotlib.pyplot as plt \n",
"import pandas as pd \n",
"\n",
"dataset = pd.read_csv('hasil_no-outlier.csv')\n",
"x = dataset.iloc[:, -4:].values # Variabel independen / fitur\n",
"y = dataset['class'].values # Variabel dependen \n",
"# y = dataset['class'].value_counts() # menghitung jumlah data tiap class \n",
"dataset.head(5)\n",
"# print(y)"
]
},
{
"cell_type": "code",
"execution_count": 90,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Accuracy : 0.8333333333333334\n"
]
},
{
"data": {
"text/plain": [
"array([[11, 0, 0],\n",
" [ 0, 7, 3],\n",
" [ 0, 2, 7]])"
]
},
"execution_count": 90,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.model_selection import train_test_split\n",
"x_train, x_test, y_train, y_test = train_test_split(x,y, test_size = 0.2, random_state=42) \n",
"from sklearn.preprocessing import StandardScaler\n",
"sc = StandardScaler()\n",
"x_train = sc.fit_transform(x_train)\n",
"x_test = sc.transform(x_test)\n",
"\n",
"from sklearn.naive_bayes import GaussianNB\n",
"classifier = GaussianNB()\n",
"classifier.fit(x_train, y_train)\n",
"\n",
"y_pred = classifier.predict(x_test)\n",
"y_pred\n",
"\n",
"from sklearn.metrics import confusion_matrix\n",
"cm = confusion_matrix(y_test, y_pred)\n",
"\n",
"from sklearn.metrics import accuracy_score\n",
"print(\"Accuracy : \", accuracy_score(y_test, y_pred))\n",
"cm"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Klasifikasi Naive Bayes Data Outlier"
]
},
{
"cell_type": "code",
"execution_count": 92,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" id | \n",
" class | \n",
" petal_length | \n",
" petal_width | \n",
" sepal_length | \n",
" sepal_width | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" Iris-setosa | \n",
" 86.4 | \n",
" 70.0 | \n",
" 20.1 | \n",
" 30.5 | \n",
"
\n",
" \n",
" 1 | \n",
" 2 | \n",
" Iris-setosa | \n",
" 1.4 | \n",
" 0.2 | \n",
" 4.9 | \n",
" 3.0 | \n",
"
\n",
" \n",
" 2 | \n",
" 3 | \n",
" Iris-setosa | \n",
" 1.3 | \n",
" 0.2 | \n",
" 4.7 | \n",
" 3.2 | \n",
"
\n",
" \n",
" 3 | \n",
" 4 | \n",
" Iris-setosa | \n",
" 1.5 | \n",
" 0.2 | \n",
" 4.6 | \n",
" 3.1 | \n",
"
\n",
" \n",
" 4 | \n",
" 5 | \n",
" Iris-setosa | \n",
" 1.4 | \n",
" 0.2 | \n",
" 5.0 | \n",
" 3.6 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" id class petal_length petal_width sepal_length sepal_width\n",
"0 1 Iris-setosa 86.4 70.0 20.1 30.5\n",
"1 2 Iris-setosa 1.4 0.2 4.9 3.0\n",
"2 3 Iris-setosa 1.3 0.2 4.7 3.2\n",
"3 4 Iris-setosa 1.5 0.2 4.6 3.1\n",
"4 5 Iris-setosa 1.4 0.2 5.0 3.6"
]
},
"execution_count": 92,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import numpy as np \n",
"import matplotlib.pyplot as plt \n",
"import pandas as pd \n",
"\n",
"dataset = pd.read_csv('hasil_gabungan.csv')\n",
"x1 = dataset.iloc[:, -4:].values # Variabel independen\n",
"y1 = dataset['class'].values # Variabel dependen\n",
"# y = dataset['class'].value_counts() # menghitung jumlah data tiap class \n",
"\n",
"dataset.head(5)\n",
"# print(y)"
]
},
{
"cell_type": "code",
"execution_count": 93,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Accuracy : 0.6333333333333333\n"
]
},
{
"data": {
"text/plain": [
"array([[11, 0, 0],\n",
" [ 0, 7, 3],\n",
" [ 0, 2, 7]])"
]
},
"execution_count": 93,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.model_selection import train_test_split\n",
"x1_train, x1_test, y1_train, y1_test = train_test_split(x1,y1, test_size = 0.2, random_state=42) \n",
"\n",
"from sklearn.preprocessing import StandardScaler\n",
"sc1 = StandardScaler()\n",
"x1_train = sc.fit_transform(x1_train)\n",
"x1_test = sc.transform(x1_test)\n",
"\n",
"from sklearn.naive_bayes import GaussianNB\n",
"classifier1 = GaussianNB()\n",
"classifier1.fit(x1_train, y1_train)\n",
"\n",
"y1_pred = classifier1.predict(x1_test)\n",
"y1_pred\n",
"\n",
"from sklearn.metrics import confusion_matrix\n",
"cm1 = confusion_matrix(y1_test, y1_pred)\n",
"\n",
"from sklearn.metrics import accuracy_score\n",
"print(\"Accuracy : \", accuracy_score(y1_test, y1_pred))\n",
"cm"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Kesimpulan \n",
"\n",
"Keberadaan outlier dapat mengganggu performa klasifikasi Naïve Bayes karena dapat menurunkan akurasi secara signifikan. Terlihat pada data tanpa outlier akurasinya mencapai 0.833 lebih tinggi dari pada data dengan outlier yaitu 0,633, maka penting untuk melakukan analisis dan penanganan outlier sebelum menerapkan model untuk mendapatkan hasil yang lebih akurat."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}