"
],
"text/plain": [
" no-recurrence-events 30-39 premeno 30-34 0-2 no 3 left left_low \\\n",
"0 no-recurrence-events 40-49 premeno 20-24 0-2 no 2 right right_up \n",
"1 no-recurrence-events 40-49 premeno 20-24 0-2 no 2 left left_low \n",
"2 no-recurrence-events 60-69 ge40 15-19 0-2 no 2 right left_up \n",
"3 no-recurrence-events 40-49 premeno 0-4 0-2 no 2 right right_low \n",
"4 no-recurrence-events 60-69 ge40 15-19 0-2 no 2 left left_low \n",
"\n",
" no.1 \n",
"0 no \n",
"1 no \n",
"2 no \n",
"3 no \n",
"4 no "
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data = pd.read_csv('./breast-cancer.csv')\n",
"data.head()\n"
]
},
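The `data.head()` output above suggests that the first record of the file was consumed as the header row: the column labels are data values such as `30-39` and `premeno`. Below is a minimal sketch of reading a headerless CSV with explicit column names, using only the standard library. The two sample records are reconstructed from the output above, and the in-memory `io.StringIO` object is a stand-in for `breast-cancer.csv`. With pandas, passing `header=None` together with `names=...` to `read_csv` should have the same effect.

```python
import csv
import io

# Column names taken from the notebook's relabelling step.
COLUMNS = ['Class', 'age', 'menopause', 'tumor-size', 'inv-nodes',
           'node-caps', 'deg-malig', 'breast', 'breast-quad', 'irradiat']

# Stand-in for breast-cancer.csv: two records and no header line.
raw = io.StringIO(
    "no-recurrence-events,30-39,premeno,30-34,0-2,no,3,left,left_low,no\n"
    "no-recurrence-events,40-49,premeno,20-24,0-2,no,2,right,right_up,no\n"
)

# Supplying fieldnames tells DictReader that the first line is data, not a header.
rows = list(csv.DictReader(raw, fieldnames=COLUMNS))

print(len(rows))                              # both records are kept
print(rows[0]['age'], rows[0]['deg-malig'])   # values of the first record
```

Reading the file this way keeps the first record as data instead of losing it when the column labels are overwritten.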
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Breast Cancer Data Description\n",
"\n",
"This breast cancer domain was obtained from the University Medical Centre, Institute of Oncology, Ljubljana, Yugoslavia. Thanks go to M. Zwitter and M. Soklic for providing the data.\n",
"\n",
"Let's now add column labels to all columns in the data."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" Class age menopause tumor-size inv-nodes node-caps \\\n",
"0 no-recurrence-events 40-49 premeno 20-24 0-2 no \n",
"1 no-recurrence-events 40-49 premeno 20-24 0-2 no \n",
"2 no-recurrence-events 60-69 ge40 15-19 0-2 no \n",
"3 no-recurrence-events 40-49 premeno 0-4 0-2 no \n",
"4 no-recurrence-events 60-69 ge40 15-19 0-2 no \n",
"\n",
" deg-malig breast breast-quad irradiat \n",
"0 2 right right_up no \n",
"1 2 left left_low no \n",
"2 2 right left_up no \n",
"3 2 right right_low no \n",
"4 2 left left_low no "
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data_index = [ 'Class', 'age','menopause','tumor-size','inv-nodes','node-caps','deg-malig','breast','breast-quad','irradiat']\n",
"data.columns = data_index\n",
"data.head()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data Variables\n",
"\n",
"Each row in `breast-cancer.csv` describes an individual case of a woman with breast cancer. There are 285 cases in this data set. The data set is available from the UCI repository (http://archive.ics.uci.edu/ml/datasets/Breast+Cancer).\n",
"\n",
" \n",
"Each row (sample) consists of the following attributes:\n",
"* **1. Age:** age (in years at last birthday) of the patient at the time of diagnosis;\n",
"* **2. Menopause:** whether the patient is pre- or postmenopausal at the time of diagnosis;\n",
"* **3. Tumor size:** the greatest diameter (in mm) of the excised tumor;\n",
"* **4. Inv-nodes:** the number (range 0-39) of axillary lymph nodes that contain metastatic breast cancer visible on histological examination;\n",
"* **5. Node caps:** whether the tumor has penetrated the lymph node capsule. If the cancer metastasises to a lymph node, it may remain “contained” by the capsule of the lymph node even though it is outside the original tumor site. Over time, and with more aggressive disease, the tumor may replace the lymph node and then penetrate the capsule, allowing it to invade the surrounding tissues (yes = 1, no = 0);\n",
"* **6. Degree of malignancy:** the histological grade (range 1-3) of the tumor. Grade 1 tumors predominantly consist of cells that, while neoplastic, retain many of their usual characteristics. Grade 3 tumors predominantly consist of cells that are highly abnormal;\n",
"* **7. Breast:** the breast in which the cancer occurred (left = 1, right = 2);\n",
"* **8. Breast quadrant:** the affected quadrant; the breast may be divided into four quadrants, using the nipple as a central point (left_up = 1, left_low = 2, right_up = 3, right_low = 4, central = 5);\n",
"* **9. Irradiation:** whether the patient received radiation therapy, a treatment that uses high-energy x-rays to destroy cancer cells (yes = 1, no = 0).\n",
"\n"
]
},
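The numeric codes listed for the attributes can be applied with plain lookup tables. The following is a minimal sketch under stated assumptions: the helper name `encode_record` is hypothetical, only the attributes with explicit codes in the list are converted, and the notebook's own encoding step (not shown in this section) may differ.

```python
# Code tables taken from the attribute list above.
YES_NO = {'no': 0, 'yes': 1}
BREAST = {'left': 1, 'right': 2}
QUADRANT = {'left_up': 1, 'left_low': 2, 'right_up': 3,
            'right_low': 4, 'central': 5}


def encode_record(record):
    """Return a copy of `record` with the coded attributes replaced by integers."""
    encoded = dict(record)
    encoded['node-caps'] = YES_NO[record['node-caps']]
    encoded['irradiat'] = YES_NO[record['irradiat']]
    encoded['breast'] = BREAST[record['breast']]
    encoded['breast-quad'] = QUADRANT[record['breast-quad']]
    encoded['deg-malig'] = int(record['deg-malig'])
    return encoded


# Row 0 of the labelled table shown earlier in the notebook.
sample = {
    'Class': 'no-recurrence-events', 'age': '40-49', 'menopause': 'premeno',
    'tumor-size': '20-24', 'inv-nodes': '0-2', 'node-caps': 'no',
    'deg-malig': '2', 'breast': 'right', 'breast-quad': 'right_up',
    'irradiat': 'no',
}

encoded = encode_record(sample)
print(encoded['node-caps'], encoded['breast'],
      encoded['breast-quad'], encoded['irradiat'])
```

A dictionary lookup raises `KeyError` on any value outside the code table, which makes unexpected entries visible early instead of silently passing them through.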
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
"##### data\n",
"#data[data['breast-quad'].str.contains('left') & data['breast'].str.contains('right')]\n",
"#data.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Check the data for null (missing) values, if any exist."
]
},
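A null check only finds entries that were actually parsed as missing. In the UCI distribution of this data set, missing entries appear to be written as `?`, which pandas would read as an ordinary string unless told otherwise, so `data.isnull().sum()` alone may report zero. The sketch below counts both kinds of missing entries per column using the standard library; the marker set and the helper name `count_missing` are assumptions, and the second and third records are made up for illustration.

```python
from collections import Counter

# Assumption: how missing entries may be written in the raw file.
MISSING_MARKERS = {'', '?'}


def count_missing(records):
    """Count missing entries per column across a list of dict records."""
    counts = Counter()
    for record in records:
        for column, value in record.items():
            if value is None or str(value).strip() in MISSING_MARKERS:
                counts[column] += 1
    return dict(counts)


records = [
    {'node-caps': 'no', 'breast-quad': 'right_up'},  # values from the table above
    {'node-caps': '?', 'breast-quad': 'left_low'},   # hypothetical record
    {'node-caps': 'no', 'breast-quad': None},        # hypothetical record
]

print(count_missing(records))
```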
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"## Perform Test and Train split\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## REMINDER: Training Phase\n",
"\n",
"In the **training phase**, the learning algorithm uses the training data to adjust the model’s parameters to minimize errors. At the end of the training phase, you get the trained model.\n",
"\n",
"\n",
" \n",
"In the **testing phase**, the trained model is applied to test data. Test data is kept separate from the training data and has not been seen by the model before. The model is then evaluated on how well it performs on the test data. The goal in building a classifier is a model that performs well not only on the training data but also on the unseen test data.\n"
]
},
{
"cell_type": "code",
"execution_count": 106,
"metadata": {},
"outputs": [],
"source": [
"features_train, features_test, Output_train, Output_test = train_test_split(features, Output, test_size = 0.33, random_state = 324)\n"
]
},
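To make the split concrete, here is a minimal standard-library sketch of what a random train/test split does: shuffle the row indices with a fixed seed, then hold out a fraction of them for testing. The function `simple_train_test_split` is hypothetical and is not scikit-learn's implementation; rounding the test count up is a choice of this sketch, and the same seed will not reproduce scikit-learn's shuffle.

```python
import math
import random


def simple_train_test_split(rows, labels, test_size=0.33, seed=324):
    """Shuffle indices reproducibly and hold out a fraction for testing."""
    if len(rows) != len(labels):
        raise ValueError("rows and labels must have the same length")
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)

    # Number of held-out samples, rounded up (a choice of this sketch).
    n_test = math.ceil(test_size * len(rows))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return (pick(rows, train_idx), pick(rows, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))


# Toy data: ten rows with a label derived from the first feature.
rows = [[i, i * 2] for i in range(10)]
labels = [i % 2 for i in range(10)]

X_train, X_test, y_train, y_test = simple_train_test_split(rows, labels)
print(len(X_train), len(X_test))
```

Because rows and labels are selected with the same index lists, each label stays aligned with its row, and no row appears in both sets.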
{
"cell_type": "code",
"execution_count": 96,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number instances in features_train dataset: (198, 6)\n",
"Number instances in Output_train dataset: (198,)\n",
"Number instances in features_test dataset: (86, 6)\n",
"Number instances in Output_test dataset: (86,)\n"
]
}
],
"source": [
"print(\"Number instances in features_train dataset: \", features_train.shape)\n",
"print(\"Number instances in Output_train dataset: \", Output_train.shape)\n",
"print(\"Number instances in features_test dataset: \", features_test.shape)\n",
"print(\"Number instances in Output_test dataset: \", Output_test.shape)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## SMOTE Technique to address Data Imbalance"
]
},
{
"cell_type": "code",
"execution_count": 71,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Before OverSampling, counts of Recurrent Class '1': 53\n",
"Before OverSampling, counts of No-Recurrent Class '0': 137 \n",
"\n"
]
}
],
"source": [
"print(\"Before OverSampling, counts of Recurrent Class '1': {}\".format(sum(Output_train==1)))\n",
"print(\"Before OverSampling, counts of No-Recurrent Class '0': {} \\n\".format(sum(Output_train==0)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Resampling using SMOTE"
]
},
{
"cell_type": "code",
"execution_count": 97,
"metadata": {},
"outputs": [],
"source": [
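"# Oversample the minority class in the training set only; `fit_resample` replaces the older `fit_sample`.\n",
"sm = SMOTE(random_state=2)\n",
"features_train_res, Output_train_res = sm.fit_resample(features_train, Output_train)"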
]
},
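The core idea behind SMOTE can be sketched in a few lines: pick a minority-class sample, pick one of its nearest minority-class neighbours, and create a new point somewhere on the line segment between them. The function `smote_sketch` below is a hypothetical, standard-library illustration of that idea only; it is not the imbalanced-learn implementation, and its parameter names and defaults are assumptions.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=2):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the base point, excluding itself.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        # Random position on the segment from base towards the neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy minority class with two numeric features.
minority = [[1.0, 1.0], [2.0, 1.0], [1.5, 2.0]]
new_points = smote_sketch(minority, n_new=4, k=2)
print(len(new_points))
```

Since the synthetic points are interpolated, integer-coded features such as those in this data set will generally take fractional values after resampling.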
{
"cell_type": "code",
"execution_count": 98,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"After OverSampling, the shape of features_X: (286, 6)\n",
"After OverSampling, the shape of Output_y: (286,) \n",
"\n",
"After OverSampling, counts of Recurrent Class '1': 143\n",
"After OverSampling, counts of Non-Recurrent Class '0': 143\n"
]
}
],
"source": [
"print('After OverSampling, the shape of features_X: {}'.format(features_train_res.shape))\n",
"print('After OverSampling, the shape of Output_y: {} \\n'.format(Output_train_res.shape))\n",
"\n",
"print(\"After OverSampling, counts of Recurrent Class '1': {}\".format(sum(Output_train_res==1)))\n",
"print(\"After OverSampling, counts of Non-Recurrent Class '0': {}\".format(sum(Output_train_res==0)))"
]
},
{
"cell_type": "code",
"execution_count": 100,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"age float64\n",
"menopause int64\n",
"tumor-size int64\n",
"inv-nodes int64\n",
"node-caps int32\n",
"deg-malig int64\n",
"dtype: object"
]
},
"execution_count": 100,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Check features of the training and testing sets.\n",
"\n",
"#type(features_train)\n",
"features_train.dtypes\n",
"#type(features_test)\n",
"#type(Output_train)\n",
"#Output_train.dtype\n",
"#type(Output_test)\n",
"#features_train.describe()\n",
"#Output_train.describe()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"