{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "name": "Exercise_Solutions.ipynb",
      "provenance": [],
      "collapsed_sections": [],
      "include_colab_link": true
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "view-in-github",
        "colab_type": "text"
      },
      "source": [
        "<a href=\"https://colab.research.google.com/github/titsitits/Python_Data_Science/blob/master/Exercise_Solutions.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "DfQ-d2onvooo",
        "colab_type": "text"
      },
      "source": [
        "Mickaël Tits\n",
        "CETIC\n",
        "mickael.tits@cetic.be"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Nx49_9RZwiWR",
        "colab_type": "text"
      },
      "source": [
        "# Exercice Chapitre 1"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "n_5e4iD2OgAg",
        "colab_type": "text"
      },
      "source": [
        "Corrigez ce code rempli d'erreurs de syntaxes."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "dRswVERVOpqV",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "prix = 100000\n",
        "mine = False\n",
        "\n",
        "while True:\n",
        "  \n",
        "prix += 20000\n",
        "print(prix)\n",
        "  \n",
        "  if prix > 150000 and prix < 200000 and mine == False\n",
        "    print(\"j'achète\")\n",
        "  mine = True\n",
        "  \n",
        "  elif prix > 200000 and mine:\n",
        "    print(\"Je \n",
        "          revends\")\n",
        "    mine = False\n",
        "    \n",
        "  break"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "my2OvxwTwkMW",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "prix = 100000\n",
        "mine = False\n",
        "while True:\n",
        "  \n",
        "  prix += 20000\n",
        "  print(prix)\n",
        "  \n",
        "  if prix > 150000 and prix < 200000 and mine == False:\n",
        "    print(\"j'achète\")\n",
        "    mine = True    \n",
        "  \n",
        "  elif prix > 200000 and mine:\n",
        "    print(\"Je revends\")\n",
        "    mine = False\n",
        "    \n",
        "    break\n",
        "  "
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "iaqayQP0v8VO",
        "colab_type": "text"
      },
      "source": [
        "# Exercice Chapitre 2"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "paz4AzXjOYxx",
        "colab_type": "text"
      },
      "source": [
        "Filtrez la liste de prix ci-dessous, pour ne garder que les prix entre 200000 et 300000."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "N0IsrBxEM9k5",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "prices = [50000, 1000, 500000, 200000, 300000, 400000, 260000, 270000]"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "MgbUBQS9jwme",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "\n",
        "#Boucle:\n",
        "filtered = []\n",
        "for p in prices:\n",
        "  if p >= 200000 and p <= 300000:\n",
        "    filtered.append(p)\n",
        "    \n",
        "#List-comprehension:\n",
        "filtered = [p for p in prices if p >= 200000 and p <= 300000 ]\n",
        "\n",
        "print(filtered)\n"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "jsnDb2Go8XdJ",
        "colab_type": "text"
      },
      "source": [
        "# Exercice Chapitre 5"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "efxswSMMO284",
        "colab_type": "text"
      },
      "source": [
        "## Les maisons plus grandes sont-elles plus chères ?\n",
        "\n",
        "Divisez les maisons en deux groupes égaux (de part et d'autre de la surface médiane) et utilisez la méthode .describe() pour comparer les prix."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "Zo-bLCZO8Y3i",
        "colab_type": "code",
        "outputId": "6452a177-2df6-4358-8dcc-226d1323da1b",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 140
        }
      },
      "source": [
        "#df[\"large_houses\"] = df.surface > df.surface.median()\n",
        "#df.groupby(\"large_houses\")[\"price\"].describe()\n",
        "\n",
        "#ou: (il n'est pas nécessaire d'ajouter une colonne dans le dataframe, \n",
        "#il suffit de donner comme argument du groupby une series de la même longueur que le dataframe)\n",
        "\n",
        "is_large = df.surface > df.surface.median()\n",
        "df.groupby(is_large)[\"price\"].describe()\n",
        "\n",
        "#ou directement:\n",
        "df.groupby(df.surface > df.surface.median())[\"price\"].describe()"
      ],
      "execution_count": 0,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/html": [
              "<div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>count</th>\n",
              "      <th>mean</th>\n",
              "      <th>std</th>\n",
              "      <th>min</th>\n",
              "      <th>25%</th>\n",
              "      <th>50%</th>\n",
              "      <th>75%</th>\n",
              "      <th>max</th>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>surface</th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>False</th>\n",
              "      <td>7.0</td>\n",
              "      <td>318571.428571</td>\n",
              "      <td>92453.334432</td>\n",
              "      <td>150000.0</td>\n",
              "      <td>280000.0</td>\n",
              "      <td>320000.0</td>\n",
              "      <td>400000.0</td>\n",
              "      <td>400000.0</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>True</th>\n",
              "      <td>6.0</td>\n",
              "      <td>406666.666667</td>\n",
              "      <td>164276.190200</td>\n",
              "      <td>230000.0</td>\n",
              "      <td>335000.0</td>\n",
              "      <td>350000.0</td>\n",
              "      <td>447500.0</td>\n",
              "      <td>700000.0</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>"
            ],
            "text/plain": [
              "         count           mean            std  ...       50%       75%       max\n",
              "surface                                       ...                              \n",
              "False      7.0  318571.428571   92453.334432  ...  320000.0  400000.0  400000.0\n",
              "True       6.0  406666.666667  164276.190200  ...  350000.0  447500.0  700000.0\n",
              "\n",
              "[2 rows x 8 columns]"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 401
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "TF1htMvN9mnS",
        "colab_type": "code",
        "outputId": "ec295a27-f31d-4417-8ac8-f7defdbf8e09",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 34
        }
      },
      "source": [
        "#Bonus: Anova\n",
        "\n",
        "#Pour avoir un code générique, ou extrait tous les groupes obtenus sous forme d'une liste\n",
        "groups = df.groupby(is_large)[\"price\"]\n",
        "group_list = [g[1].values for g in groups]\n",
        "\n",
        "#La méthode f_oneway prend autant d'arguments que d'échantillon à comparer. Pour transformer une liste en arguments, on utilise le symbole *\n",
        "stats.f_oneway(*group_list)\n"
      ],
      "execution_count": 0,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "F_onewayResult(statistic=1.4810829736928035, pvalue=0.24907116578758384)"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 400
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "mSXHJbmDDy52",
        "colab_type": "code",
        "outputId": "d518db76-0e5a-4e0b-bda4-86fae3101618",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 110
        }
      },
      "source": [
        "#Bonus: corrélation\n",
        "df[[\"price\",\"surface\"]].corr()"
      ],
      "execution_count": 0,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/html": [
              "<div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>price</th>\n",
              "      <th>surface</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>price</th>\n",
              "      <td>1.000000</td>\n",
              "      <td>0.233623</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>surface</th>\n",
              "      <td>0.233623</td>\n",
              "      <td>1.000000</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>"
            ],
            "text/plain": [
              "            price   surface\n",
              "price    1.000000  0.233623\n",
              "surface  0.233623  1.000000"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 402
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "sqvqd_OCD42o",
        "colab_type": "code",
        "outputId": "c1f59eca-8577-4950-bc51-ad721593e9ad",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 300
        }
      },
      "source": [
        "#Bonus: scatter plot (plus de détails au chapitre suivant)\n",
        "df.plot.scatter(\"surface\",\"price\")"
      ],
      "execution_count": 0,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "<matplotlib.axes._subplots.AxesSubplot at 0x7ff1fe2c4400>"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 403
        },
        {
          "output_type": "display_data",
          "data": {
            "image/png": "iVBORw0KGgoAAAANSUhEUgAAAZsAAAEKCAYAAADEovgeAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4zLCBo\ndHRwOi8vbWF0cGxvdGxpYi5vcmcvnQurowAAHB1JREFUeJzt3X+cVfV95/HXe2AcpkEFgVCXwaCF\nJiUpok4VS9PNmkSRTcEtPvyxSWBTV7ox2Y3bzcMfTTekmu5DySPNxjQhtTErZG3USBLYrMYQtEkf\n+yjqoPzwZ5n8sAxVpPxQiTgZMp/943xHL/OYuXNn4HvPDPN+Ph73cc/9nHPu9zuXA+855345X0UE\nZmZmOTWU3QEzMzv+OWzMzCw7h42ZmWXnsDEzs+wcNmZmlp3DxszMsnPYmJlZdg4bMzPLzmFjZmbZ\njS27A8PF5MmTY8aMGWV3w8xsRNm8efO/RMSUgbZz2CQzZsygra2t7G6YmY0okp6vZTtfRjMzs+wc\nNmZmlp3DxszMsnPYmJlZdg4bMzPLLlvYSHq7pC0Vj1ckXSvpFEkbJO1IzxPT9pJ0m6R2SdsknV3x\nXsvS9jskLauonyNpe9rnNklK9T7bMBtp9h7sZOvOA+w92Fl2V8yOSrawiYjnImJuRMwFzgFeA74D\n3ABsjIhZwMb0GuBiYFZ6LAdWQREcwArgPOBcYEVFeKwCrq7Yb0Gq99eG2Yixbssu5t/6EB/62iPM\nv/Uh1m/ZVXaXzIasXpfR3gv8JCKeBxYDq1N9NXBJWl4MrInCJmCCpFOBi4ANEbEvIvYDG4AFad1J\nEbEpirmt1/R6r77aMBsR9h7s5Pq123i9q5tXOw/zelc3163d5jMcG7HqFTZXAN9My1Mj4oW0/CIw\nNS1PA3ZW7NORatXqHX3Uq7VxBEnLJbVJatuzZ8+gfyizXDr2H6Kx4ci/no0NDXTsP1RSj8yOTvaw\nkXQCsAj4Vu916YwkcrZfrY2IuD0iWiOidcqUAe+2YFY3LROb6eruPqLW1d1Ny8TmknpkdnTqcWZz\nMfB4ROxOr3enS2Ck55dSfRcwvWK/llSrVm/po16tDbMRYdL4JlYumcO4xgZObBrLuMYGVi6Zw6Tx\nTWV3zWxI6nFvtCt58xIawHpgGXBLel5XUf+4pLspBgO8HBEvSHoQ+B8VgwIuBG6MiH1phNs84BFg\nKfClAdowGzEWzZ3G/JmT6dh/iJaJzQ4aG9Gyho2ktwDvB/64onwLcK+kq4DngctS/X5gIdBOMXLt\nIwApVG4GHkvb3RQR+9LyNcCdQDPwQHpUa8NsRJk0vskhY8cFFV9pWGtra/iuz2ZmgyNpc0S0DrSd\n7yBgZmbZOWzMzCw7h42ZmWXnsDEzs+wcNmZmlp3DxszMsnPYmJlZdg4bMzPLzmFjZmbZOWzMzCw7\nh42ZmWXnsDEzs+wcNmZmlp3DxszMsnPYmJlZdg4bMzPLzmFjZmbZOWzMzCw7h42ZmWXnsDEzs+wc\nNmZmlp3DxszMsnPYmJlZdg4bMzPLzmFjZmbZOWzMzCw7h42ZmWWXNWwkTZB0n6RnJT0j6XxJp0ja\nIGlHep6YtpWk2yS1S9om6eyK91mWtt8haVlF/RxJ29M+t0lSqvfZhpmZlSP3mc0Xge9HxDuAM4Fn\ngBuAjRExC9iYXgNcDMxKj+XAKiiCA1gBnAecC6yoCI9VwNUV+y1I9f7aMDOzEmQLG0knA78P3AEQ\nEb+MiAPAYmB12mw1cElaXgysicImYIKkU4GLgA0RsS8i9gMbgAVp3UkRsSkiAljT6736asPMzEqQ\n88zmdGAP8L8kPSHpa5LeAkyNiBfSNi8CU9PyNGBnxf4dqVat3tFHnSptHEHSckltktr27NkzlJ/R\nzMxqkDNsxgJnA6si4izgF/S6nJXOSCJjH6q2ERG3R0RrRLROmTIlZzfMzEa1nGHTAXRExCPp9X0U\n4bM7XQIjPb+U1u8Cplfs35Jq1eotfdSp0oaZmZUgW9hExIvATklvT6X3Ak8D64GeEWXLgHVpeT2w\nNI1Kmwe8nC6FPQhcKGliGhhwIfBgWveKpHlpFNrSXu/VVxtmZlaCsZnf/z8Dd0k6Afgp8BGKgLtX\n0lXA88Bladv7gYVAO/Ba2paI2CfpZuCxtN1NEbEvLV8D3Ak0Aw+kB8At/bRhZmYlUPGVhrW2tkZb\nW1vZ3TAzG1EkbY6I1oG28x0EzMwsO4eNmZll57AxM7PsHDZmZpadw8bMzLJz2JiZWXYOGzMzy85h\nY2Zm2TlszMwsO4eNmZll57AxM7PsHDZmZpadw8bMzLJz2JiZWXYOGzMzy85hY2Zm2TlszMwsO4eN\nmZll57AxM7PsHDZmZpadw8bMzLJz2JiZWXYOGzMzy85hY2Zm2TlszMwsO4eNmZlllzVsJP1c0nZJ\nWyS1pdopkjZI2pGeJ6a6JN0mqV3SNklnV7zPsrT9DknLKurnpPdvT/uqWhtmZlaOepzZ/JuImBsR\nren1DcDGiJgFbEyvAS4GZqXHcmAVFMEBrADOA84FVlSExyrg6or9FgzQhpmZlaCMy2iLgdVpeTVw\nSUV9TRQ2ARMknQpcBGyIiH0RsR/YACxI606KiE0REcCaXu/VVxtmZlaC3GETwA8kbZa0PNWmRsQL\naflFYGpangbsrNi3I9Wq1Tv6qFdrw8zMSjA28/v/XkTskvRWYIOkZytXRkRIipwdqNZGCsDlAKed\ndlrObpiZjWpZz2wiYld6fgn4DsV3LrvTJTDS80tp813A9IrdW1KtWr2ljzpV2ujdv9sjojUiWqdM\nmTLUH9PMzAaQLWwkvUXSiT3LwIXAk8B6oGdE2TJgXVpeDyxNo9LmAS+nS2EPAhdKmpgGBlwIPJjW\nvSJpXhqFtrTXe/XVhpmZlSDnZbSpwHfSaOSxwN9GxPclPQbcK+kq4HngsrT9/cBCoB14DfgIQETs\nk3Qz8Fja7qaI2JeWrwHuBJqBB9ID4JZ+2jAzsxKoGMhlra2t0dbWVnY3jnt7D3bSsf8QLRObmTS+\nqezumNlRkrS54r+29Cv3AAGzN6zbsovr126jsaGBru5uVi6Zw6K50wbe0cxGPN+uxupi78FOrl+7\njde7unm18zCvd3Vz3dpt7D3YWXbXzKwOHDZWFx37D9HYcOTh1tjQQMf+QyX1yMzqyWFjddEysZmu\n7u4jal3d3bRMbC6pR2ZWTw4bq4tJ45tYuWQO4xobOLFpLOMaG1i5ZI4HCZiNEh4gYHWzaO405s+c\n7NFoZqOQw8bqatL4JoeM2Sjky2hmZpadw8bMzLJz2JiZWXYOGzMzy85hY2Zm2dUcNpLeJul9abm5\nZ/oAMzOzgdQUNpKuBu4D/jqVWoDv5uqUmZkdX2o9s/kYMB94BSAidgBvzdUpMzM7vtQaNp0R8cue\nF5LGAp4Ix8zMalJr2PxI0p8CzZLeD3wL+D/5umVmZseTWsPmBmAPsB34Y4opnP8sV6fMzOz4Uuu9\n0ZqBr0fE3wBIGpNqr+XqmJmZHT9qPbPZSBEuPZqBHx777piZ2fGo1rAZFxEHe16k5V/L0yUzMzve\n1Bo2v5B0ds8LSecAns/XzMxqUut3NtcC35L0z4CAXwcuz9YrMzM7rtQUNhHxmKR3AG9Ppecioitf\nt8zM7HhSNWwkXRARD0n6w16rflMSEfHtjH0zM7PjxEBnNv8aeAj4gz7WBeCwMTOzAVUNm4hYIakB\neCAi7q1Tn8zM7Dgz4Gi0iOgGrhtqA5LGSHpC0vfS69MlPSKpXdI9kk5I9ab0uj2tn1HxHjem+nOS\nLqqoL0i1dkk3VNT7bMPMzMpR69DnH0r6pKTpkk7pedS47yeAZype3wp8ISJmAvuBq1L9KmB/qn8h\nbYek2cAVwDuBBcBXUoCNAb4MXAzMBq5M21Zr45jbe7CTrTsPsPdgZ64msmjf/Sr3te2kfferZXfF\nzEaBWoc+X07xHc01vepnVNtJUgvwb4G/AP5EkoALgH+fNlkNfAZYBSxOy1DMnfNXafvFwN0R0Qn8\nTFI7cG7arj0ifprauhtYLOmZKm0cU+u27OL6tdtobGigq7ublUvmsGjutGPdzDH36e9uZ82mf3rj\n9dLzT+Omxb9dYo/M7HhX65nNbIqziK3AFuBLFGcaA/mfFJfgutPrScCBiDicXncAPf86TwN2AqT1\nL6ft36j32qe/erU2jpm9Bzu5fu02Xu/q5tXOw7ze1c11a7cN+zOc9t2vHhE0AGv+4Z98hmNmWdUa\nNquB3wJuowia2anWL0kfAF6KiM1H1cOMJC2X1Capbc+ePYPat2P/IRobjvz4Ghsa6Ng/vG+ssGXn\ngUHVzcyOhVovo70rImZXvH5Y0tMD7DMfWCRpITAOOAn4IjBB0th05tEC7Erb7wKmAx1pcraTgb0V\n9R6V+/RV31uljSNExO3A7QCtra2DmgyuZWIzXd3dR9S6urtpmdjczx7Dw9zpEwZVNzM7Fmo9s3lc\n0ryeF5LOA9qq7RARN0ZES0TMoPiC/6GI+CDwMHBp2mwZsC4tr0+vSesfiohI9SvSaLXTgVnAo8Bj\nwKw08uyE1Mb6tE9/bRwzk8Y3sXLJHMY1NnBi01jGNTawcskcJo1vOtZNHVMzp57I0vNPO6K29PzT\nmDn1xJJ6ZGajgYp/mwfYqPjS/e1Az8X+04DngMNARMScAfZ/D/DJiPiApDOAu4FTgCeAD0VEp6Rx\nwDeAs4B9wBUVX/5/Cvij1N61EfFAqi+k+F5oDMV8O3+R6n22Ua2Pra2t0dZWNT/7tPdgJx37D9Ey\nsXnYB02l9t2vsmXnAeZOn+CgMbMhk7Q5IloH3K7GsHlbtfUR8fwg+jYsDTVszMxGs1rDptYbcY74\nMDEzs/LU+p2NmZnZkDlszMwsO4eNmZll57AxM7PsHDZmZpadw8bMzLJz2JiZWXYOGzMzy85hY2Zm\n2TlsbMQoc1bUkTojq9lwUesUA2alKnNW1JE6I6vZcOIzGxv2ypwVdaTOyGo23DhsbNgrc1bUkToj\nq9lw47CxYa/MWVFH6oysZsONw8aGvTJnRR2pM7KaDTc1TZ42GnjytOGvzFlRR+qMrGa5HdPJ08yG\ng0njm0r7h77Mts1yqecvUQ4bM7NRqN5D+v2djZnZKFPGkH6HjZnZKFPGkH6HjQ2ab91iNrKVMaTf\n39nYoPjWLWYjX8+Q/ut6/V3OOUjAYWM1q7zO+zrFb0XXrd3G/JmTPVLLbIRZNHca82dO9mg0G356\nrvP2BA28eZ3XYWM28tRzSL+/s7Ga+dYtZjZUDhurmW/dYmZDle0ymqRxwI+BptTOfRGxQtLpwN3A\nJGAz8OGI+KWkJmANcA6wF7g8In6e3utG4CrgV8B/iYgHU30B8EVgDPC1iLgl1ftsI9fPOprU+zqv\nmR0fcp7ZdAIXRMSZwFxggaR5wK3AFyJiJrCfIkRIz/tT/QtpOyTNBq4A3gksAL4iaYykMcCXgYuB\n2cCVaVuqtGHHwKTxTZw5fYKDxsxqli1sonAwvWxMjwAuAO5L9dXAJWl5cXpNWv9eSUr1uyOiMyJ+\nBrQD56ZHe0T8NJ213A0sTvv014aZmZUg63c26QxkC/ASsAH4CXAgIg6nTTqAnv+kMQ3YCZDWv0xx\nGeyNeq99+qtPqtKGmZmVIGvYRMSvImIu0EJxJvKOnO0NlqTlktokte3Zs6fs7piZHbfqMhotIg4A\nDwPnAxMk9QxMaAF2peVdwHSAtP5kioECb9R77dNffW+VNnr36/aIaI2I1ilTphzVz2hmZv3LFjaS\npkiakJabgfcDz1CEzqVps2XAurS8Pr0mrX8oipnd1gNXSGpKo8xmAY8CjwGzJJ0u6QSKQQTr0z79\ntWFmZiXIeQeBU4HVadRYA3BvRHxP0tPA3ZI+CzwB3JG2vwP4hqR2YB9FeBART0m6F3gaOAx8LCJ+\nBSDp48CDFEOfvx4RT6X3ur6fNszMrASeFjoZ6rTQZU0X7GmKzWw48LTQdVDWHZB952UzG2l8u5oh\nKmOmuzLbNTM7Gg6bISpjprsy2zUzOxoOmyEq6w7IvvOymY1EDpshKusOyL7zspmNRB6Nlng0mpnZ\n4Hk0Wp3Uc6a74dCumdlQ+DKamZll57AxM7PsHDZmZpadw8bMzLJz2JiZWXYOGzMzy85hY2Zm2Tls\nzMwsO4eNmZll57AxM7PsHDZWV3sPdrJ15wHPv2M2yvjeaFY3nmHUbPTymY3VhWcYNRvdHDZWF55h\n1Gx0c9hYXXiGUbPRzWFjdeEZRs1GNw8QsLpZNHca82dO9gyjZqOQw8bqyjOMmo1OvoxmZmbZOWzM\nzCy7bGEjabqkhyU9LekpSZ9I9VMkbZC0Iz1PTHVJuk1Su6Rtks6ueK9lafsdkpZV1M+RtD3tc5sk\nVWvDzMzKkfPM5jDw3yJiNjAP+Jik2cANwMaImAVsTK8BLgZmpcdyYBUUwQGsAM4DzgVWVITHKuDq\niv0WpHp/bZiZWQmyhU1EvBARj6flV4FngGnAYmB12mw1cElaXgysicImYIKkU4GLgA0RsS8i9gMb\ngAVp3UkRsSkiAljT6736asPMzEpQl+9sJM0AzgIeAaZGxAtp1YvA1LQ8DdhZsVtHqlWrd/RRp0ob\nZmZWguxhI2k8sBa4NiJeqVyXzkgiZ/vV2pC0XFKbpLY9e/bk7IaZ2aiWNWwkNVIEzV0R8e1U3p0u\ngZGeX0r1XcD0it1bUq1avaWPerU2jhARt0dEa0S0TpkyZWg/pJmZDSjnaDQBdwDPRMRfVqxaD/SM\nKFsGrKuoL02j0uYBL6dLYQ8CF0qamAYGXAg8mNa9Imleamtpr/fqqw0zMytBzjsIzAc+DGyXtCXV\n/hS4BbhX0lXA88Blad39wEKgHXgN+AhAROyTdDPwWNrupojYl5avAe4EmoEH0oMqbZiZWQlUfKVh\nra2t0dbWVnY3zMxGFEmbI6J1oO18BwEzM8vOYWNmZtk5bMzMLDuHjZmZZeewMTOz7Bw2ZmaWncPG\nzMyyc9iYmVl2DhszM8vOYWNmZtk5bEapvQc72brzAHsPdpbdFTMbBXLeiNOGqXVbdnH92m00NjTQ\n1d3NyiVzWDR32sA7mpkNkc9sRpm9Bzu5fu02Xu/q5tXOw7ze1c11a7f5DMfMsnLYjDId+w/R2HDk\nH3tjQwMd+w+V1CMzGw0cNqNMy8Rmurq7j6h1dXfTMrG5pB6Z2WjgsBllJo1vYuWSOYxrbODEprGM\na2xg5ZI5TBrfVHbXzOw45gECo9CiudOYP3MyHfsP0TKx2UFjZtk5bEapSeObHDJmVje+jGZmZtk5\nbMzMLDuHjZmZZeewMTOz7Bw2ZmaWnSKi7D4MC5L2AM9nevvJwL9keu+j4X4Njvs1OO7X4IzUfr0t\nIqYM9CYOmzqQ1BYRrWX3ozf3a3Dcr8FxvwbneO+XL6OZmVl2DhszM8vOYVMft5fdgX64X4Pjfg2O\n+zU4x3W//J2NmZll5zMbMzPLzmFzlCR9XdJLkp6sqH1O0rOStkn6jqQJqT5D0iFJW9Ljq3Xu12ck\n7apof2HFuhsltUt6TtJFde7XPRV9+rmkLalez89ruqSHJT0t6SlJn0j1UyRtkLQjPU9MdUm6LX1m\n2ySdXed+lXqMVelXqcdYlX6VeoxJGifpUUlbU7/+PNVPl/RI+lzukXRCqjel1+1p/Yw69+uu9Of0\nZPo725jq75H0csXn9emaG4sIP47iAfw+cDbwZEXtQmBsWr4VuDUtz6jcroR+fQb4ZB/bzga2Ak3A\n6cBPgDH16lev9Z8HPl3C53UqcHZaPhH4x/S5rARuSPUbKv4sFwIPAALmAY/UuV+lHmNV+lXqMdZf\nv8o+xtJxMj4tNwKPpOPmXuCKVP8q8NG0fA3w1bR8BXBPnfu1MK0T8M2Kfr0H+N5Q2vKZzVGKiB8D\n+3rVfhARh9PLTUDLcOhXFYuBuyOiMyJ+BrQD59a7X5IEXEZxcNdVRLwQEY+n5VeBZ4BpFJ/N6rTZ\nauCStLwYWBOFTcAESafWq19lH2NVPq/+1OUYG6hfZR1j6Tg5mF42pkcAFwD3pXrv46vnuLsPeG/q\ne136FRH3p3UBPMoxOL4cNvn9EcVvwD1Ol/SEpB9JencJ/fl4uvTy9Z5LQhR/GXdWbNNB9X84cnk3\nsDsidlTU6v55pUsWZ1H8ljc1Il5Iq14Epqblun9mvfpVqdRjrI9+DYtjrJ/Pq7RjTNKYdPnuJWAD\nxdndgYpfGio/kzc+r7T+ZWBSPfoVEY9UrGsEPgx8v2KX89NltwckvbPWdhw2GUn6FHAYuCuVXgBO\ni4izgD8B/lbSSXXs0irgN4C5qS+fr2PbtbiSI3/jrPvnJWk8sBa4NiJeqVyXfssrZfhmf/0q+xjr\no1/D4hir8udY2jEWEb+KiLkUZwnnAu/I0c5g9e6XpHdVrP4K8OOI+Pv0+nGK29OcCXwJ+G6t7Ths\nMpH0H4APAB9M/0iRLiHsTcubKX6z+c169SkidqcDqxv4G968jLELmF6xaUuq1Y2kscAfAvf01Or9\neaXf4tYCd0XEt1N5d8/lsfT8UqrX7TPrp1+lH2N99Ws4HGNVPq/Sj7HUzgHgYeB8isuvPTMmV34m\nb3xeaf3JwN469WtBancFMIUihHu2eaXnsltE3A80Sppcy/s7bDKQtAC4DlgUEa9V1KdIGpOWzwBm\nAT+tY78qv1P4d0DPiLD1wBVpBMzpqV+P1qtfyfuAZyOio6dQz88rXQ+/A3gmIv6yYtV6YFlaXgas\nq6gvVWEe8HLF5bbs/Sr7GKvSr1KPsSp/jlDiMZba6Rkx2Ay8n+L7pIeBS9NmvY+vnuPuUuChnl8o\n6tCvZyX9R+Ai4Mr0i0PP9r/e892RpHMpMqS2EBzKqAI/jhjN8U2KU/EuimuuV1F8+bkT2JIePaNK\nlgBPpdrjwB/UuV/fALYD2ygO5lMrtv8UxW91zwEX17NfqX4n8J96bVvPz+v3KC6Rbav4c1tIcZ18\nI7AD+CFwStpewJfTZ7YdaK1zv0o9xqr0q9RjrL9+lX2MAXOAJ1K/nuTN0XBnUIRuO/AtoCnVx6XX\n7Wn9GXXu1+H0Z9XzGfbUP54+r60UA1N+t9a2fAcBMzPLzpfRzMwsO4eNmZll57AxM7PsHDZmZpad\nw8bMzLJz2JgNU5K+mW778l/L7ovZ0Ro78CZmVk/pf4xPBn4nImaW3R+zY8FnNmaZSHqLpP+bblr4\npKTLVcylMjmtb5X0d2n5M5K+Ien/UfzHyB8A01TMGfJuSVdLeiy911pJv5b2m6piPput6fG7qf4h\nFfOUbJH01z3/S96sLA4bs3wWAP8cEWdGxLs48s65fZkNvC8irgQWAT+JiLlR3ATx2xHxO1HcAPEZ\nijtCANwG/CjVzwaekvRbwOXA/ChusPgr4IPH/KczGwRfRjPLZzvweUm3Ukw49feqPiXJ+og41M+6\nd0n6LDABGA88mOoXAEuhuHsv8LKkDwPnAI+l9pp58waiZqVw2JhlEhH/qGK66IXAZyVtpLjnVM8V\nhXG9dvlFlbe7E7gkIramuz2/p8q2AlZHxI1D6bdZDr6MZpaJpH8FvBYR/xv4HMVlrp9TnHVAcRPI\nWp0IvJBun195SWwj8NHU3hhJJ6fapZLemuqnSHrb0fwsZkfLZzZm+fw28DlJ3RR3uf4oxSWtOyTd\nDPzdIN7rv1PMOLknPZ+Y6p8Abpd0FcV3Mx+NiH+Q9GfADyQ1pLY/Bjx/9D+S2dD4rs9mZpadL6OZ\nmVl2DhszM8vOYWNmZtk5bMzMLDuHjZmZZeewMTOz7Bw2ZmaWncPGzMyy+/9Mp0cfR4WYrwAAAABJ\nRU5ErkJggg==\n",
            "text/plain": [
              "<Figure size 432x288 with 1 Axes>"
            ]
          },
          "metadata": {
            "tags": []
          }
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "JmLLbgt2Hg9W",
        "colab_type": "text"
      },
      "source": [
        "# Exercices Chapitre 6\n",
        "\n",
        "## 1. Recherchez les maisons ayant un prix anormal\n",
        "\n",
        "Cherchez les maisons ayant un prix s'écartant fortement du IQR (inter-quantile range).\n",
        "\n",
        "Solution: vous devriez trouver une maison à 700000€.\n",
        "\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "8fFsG7TYJHoF",
        "colab_type": "text"
      },
      "source": [
        "### Solution Exercice 1"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "rWG8K6M8fG9U",
        "colab": {}
      },
      "source": [
        "#Inter-quantile range\n",
        "Q1 = df.price.quantile(0.25)\n",
        "Q3 = df.price.quantile(0.75)\n",
        "IQR = Q3 - Q1\n",
        "h = 2\n",
        "print(\"limits:\", (Q1 - h * IQR), \"to\", (Q3 + h * IQR))\n",
        "\n",
        "#Les données s'écartant fortement des quantiles sont potentiellement des anomalies (outlier en anglais)\n",
        "is_outlier = (df.price < (Q1 - h * IQR)) | (df.price > (Q3 + h * IQR))\n",
        "df.loc[is_outlier]\n"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "VHnDqnnzJr_7",
        "colab_type": "text"
      },
      "source": [
        "On retrouve la maison de la Rue de la Loi 50 à Bruxelles qui est effectivement beaucoup plus chère que toutes les autres."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "_-ROwVwUlJfl",
        "colab_type": "text"
      },
      "source": [
        "## 2. Définissez une fonction générique permettant de détecter les anomalies sur n'importe quelle colonne numérique\n",
        "\n",
        "Modularisez le code écrit à l'exercice précédent pour le rendre générique.\n",
        "\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "hc0lIIs9ZU29",
        "colab_type": "text"
      },
      "source": [
        "### Solution Exercice 2"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "7x79_EkqZY2z",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "def detect_outliers(df, column, h = 2, left = 0.25, right = 0.75):\n",
        "    \n",
        "  Q1 = df[column].quantile(left)\n",
        "  Q3 = df[column].quantile(right)\n",
        "  IQR = Q3 - Q1\n",
        "  #print(\"limits:\", (Q1 - h * IQR), \"to\", (Q3 + h * IQR))\n",
        "\n",
        "  #Les données s'écartant fortement des quantiles sont potentiellement des anomalies (outlier en anglais)\n",
        "  is_outlier = (df[column] < (Q1 - h * IQR)) | (df[column] > (Q3 + h * IQR))\n",
        "  outliers = df.loc[is_outlier]\n",
        "  \n",
        "  if len(outliers) > 0:    \n",
        "    print(\"Anomalies en terme de\", column, \":\")\n",
        "    display(outliers)\n",
        "  \n",
        "  return outliers\n",
        "\n",
        "out = detect_outliers(df, 'price')\n",
        "out = detect_outliers(df, 'surface')"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "WfUCoCq1XxTI",
        "colab_type": "text"
      },
      "source": [
        "#### Solution plus robuste: assertion du type"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "lDuOK5v7X2gd",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "def detect_outliers(df, column, h = 2, left = 0.25, right = 0.75):\n",
        "  \n",
        "  \"\"\"\n",
        "  df: un DataFrame pandas\n",
        "  col: nom de la colonne du dataframe à analyser (string)\n",
        "  \"\"\"\n",
        "  \n",
        "  #On peut éventuellement vérifier les types des arguments  \n",
        "  if type(column) != str:\n",
        "    raise Exception(\"L'argument 'col' doit être un string\")\n",
        "  \n",
        "  #l'instruction \"assert\" permet de directement générer une exception si une condition n'est pas respectée:\n",
        "  #assert <condition>, <exception message if condition not respected>\n",
        "  assert type(column) == str, \"L'argument 'col' doit être un string\"\n",
        "  assert type(df) == pd.DataFrame, \"L'argument 'df' doit être un DataFrame pandas\"\n",
        "    \n",
        "  Q1 = df[column].quantile(left)\n",
        "  Q3 = df[column].quantile(right)\n",
        "  IQR = Q3 - Q1\n",
        "  #print(\"limits:\", (Q1 - h * IQR), \"to\", (Q3 + h * IQR))\n",
        "\n",
        "  #Les données s'écartant fortement des quantiles sont potentiellement des anomalies (outlier en anglais)\n",
        "  is_outlier = (df[column] < (Q1 - h * IQR)) | (df[column] > (Q3 + h * IQR))\n",
        "  outliers = df.loc[is_outlier]\n",
        "  \n",
        "  if len(outliers) > 0:    \n",
        "    print(\"Anomalies en terme de\", column, \":\")\n",
        "    display(outliers)\n",
        "  \n",
        "  return outliers\n",
        "\n",
        "out = detect_outliers(df, 'price')\n",
        "out = detect_outliers(df, 'surface')\n",
        "\n",
        "#Erreur d'assertion\n",
        "try:\n",
        "  out = detect_outliers(42, 'surface')\n",
        "except Exception as e:\n",
        "  print(e)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "exGQnJJclNHE",
        "colab_type": "text"
      },
      "source": [
        "## 3. Définissez une fonction générique permettant de détecter les anomalies sur une colonne catégorielle\n",
        "\n",
        "Pour les variables catégorielles, vous pouvez simplement détecter une catégorie particulièrement rare (nombre d'occurrences plus petit qu'un seuil nmin). \n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "1qSBlQToZez0",
        "colab_type": "text"
      },
      "source": [
        "### Solution Exercice 3"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "w02bi1caZoQW",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "#variables catégorielles (le nombre de chambres peut être considéré comme variable catégorielle aussi)\n",
        "def detect_rare_cat(df, column, nmin = 2):\n",
        "  \n",
        "  \"\"\"\n",
        "  Cette fonction permet de détecter des outliers dans des variables catégorielles\n",
        "  \"\"\"\n",
        "  \n",
        "  counts = df.groupby(column)['price'].count()\n",
        "  rare_cats = counts[counts < nmin]\n",
        "  outliers = df[df[column].isin(rare_cats.index)]\n",
        "  \n",
        "  if len(outliers) > 0:    \n",
        "    print(\"Anomalies en terme de\", column, \":\")\n",
        "    display(outliers)\n",
        "  \n",
        "  return outliers\n",
        "\n",
        "out = detect_rare_cat(df, 'city')"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "bgxvdXf9lPzM",
        "colab_type": "text"
      },
      "source": [
        "## 4. Détectez automatiquement le type d'une variable\n",
        "\n",
        "Indice: vous pouvez aussi définir une variable comme catégorielle si elle compte un nombre limité de valeurs uniques. (Utilisez la méthode pd.Series.unique())\n",
        "Attention: le nombre de chambres 'rooms' peut à la fois être considéré comme catégoriel et numérique. Utilisez la méthode pd.DataFrame. \n",
        "Indice: pour vérifier si une colonne est numérique, vous pouvez essayer de la convertir en float: series.astype(float) et rediriger l'erreur.\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "F2pWFHgUd55O",
        "colab_type": "text"
      },
      "source": [
        "### Solution Exercice 4"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "smlFBP5Nd8mE",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "def is_cat(series, relative_threshold = 0.5, absolute_threshold = 20):\n",
        "  \n",
        "  nvalues = len(series)  \n",
        "  ncats = len(series.unique())\n",
        "  \n",
        "  #On considère la variable comme catégorielle si le nombre de valeurs uniques et plus petit qu'un seuil relatif ou absolu\n",
        "  return (ncats/nvalues <= relative_threshold) and (ncats <= absolute_threshold)\n",
        "\n",
        "def is_num(series):\n",
        "  try:\n",
        "    series.astype(float)\n",
        "    return True\n",
        "  except:\n",
        "    return False\n",
        "\n",
        "is_categorical = df.apply(is_cat)\n",
        "is_numeric = df.apply(is_num)\n",
        "is_numeric\n"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Q0TeASzvlSOt",
        "colab_type": "text"
      },
      "source": [
        "## 5. Recherchez les anomalies de manière systématique"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "UZitkG9-Jay6",
        "colab_type": "text"
      },
      "source": [
        "### Solution Exercice 5"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "a01M1eSyS_5n",
        "colab_type": "text"
      },
      "source": [
        "#### Solution simple\n",
        "Pour chaque colonne (boucle for), on vérifie le type de la variable (catégorielle et/ou numérique), et on détecte les outliers selon le type."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "fkgbNaOVTCiQ",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "cols = df.columns\n",
        "\n",
        "for col in cols:\n",
        "  \n",
        "  #Détection d'outliers pour les variables catégorielles\n",
        "  if is_cat(df[col]):\n",
        "    outliers = detect_rare_cat(df, col)\n",
        "    \n",
        "    if len(outliers) == 0:\n",
        "      print(\"No categorical outlier detected for columns\", col)\n",
        "  \n",
        "  #Détection d'outliers pour les variables numériques\n",
        "  if is_num(df[col]):\n",
        "    outliers = detect_outliers(df, col)\n",
        "    \n",
        "    if len(outliers) == 0:\n",
        "      print(\"No numeric outlier detected for columns\", col)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "izdRlCtkTGLc",
        "colab_type": "text"
      },
      "source": [
        "#### Solution plus complète\n",
        "On ressemble tous outliers (et leurs ids), en omettant les doublons. On peut ensuite les supprimer du dataframe.\n"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "T17xMWjTJ2w7",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "cat_cols = is_categorical[is_categorical == True].index\n",
        "num_cols = is_numeric[is_numeric == True].index\n",
        "\n",
        "\n",
        "#Pour chaque colonne catégorielle, on extrait lest outliers\n",
        "cat_outliers = [detect_rare_cat(df, cat) for cat in cat_cols]\n",
        "#rassemblons les outliers dans un seul dataframe (nous pouvons retirer les doublons)\n",
        "cat_outliers = pd.concat(cat_outliers).drop_duplicates()\n",
        "\n",
        "#Pour chaque colonne numérique, on extrait les outliers\n",
        "num_outliers = [detect_outliers(df, c) for c in num_cols]\n",
        "#Rassemblons les outliers trouvés dans un seul dataframe \n",
        "num_outliers = pd.concat(num_outliers).drop_duplicates()\n",
        "\n",
        "\n",
        "#Rassemblons tous les outliers\n",
        "outliers = pd.concat([num_outliers, cat_outliers]).drop_duplicates()\n",
        "print(\"Tous les outliers:\")\n",
        "display(outliers)\n",
        "\n",
        "#On supprime les outliers (à partir de leur index)\n",
        "df_filtered = df.drop(outliers.index)\n",
        "\n",
        "print(\"Dataframe filtré:\")\n",
        "display(df_filtered)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "hBkKlSPzpoJm",
        "colab_type": "text"
      },
      "source": [
        "#### Solution encore plus complète\n",
        "On peut définir des fonctions s'appliquant directement sur une Series, et renvoyant un Series de booléens pour chaque item. On peut alors utiliser la méthode \"apply\" pour extraire toutes les détections d'anomalies d'un coup. De plus, on peut ainsi vérifier le nombre de fois qu'une observation (une maison) a été reconnue comme anormale, et en extraire un indice plus global d'anormalité."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "mA4tjMBqppy2",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "#Définissons des fonctions de détection d'outliers sur des Series, pour pouvoir utiliser la méthode \"apply\" dessus:\n",
        "\n",
        "def is_cat(series, relative_threshold = 0.5, absolute_threshold = 20):\n",
        "  \n",
        "  nvalues = len(series)  \n",
        "  ncats = len(series.unique())\n",
        "  \n",
        "  #On considère la variable comme catégorielle si le nombre de valeurs uniques et plus petit qu'un seuil relatif ou absolu\n",
        "  return (ncats/nvalues <= relative_threshold) and (ncats <= absolute_threshold)\n",
        "\n",
        "def is_num(series):\n",
        "  try:\n",
        "    series.astype(float)\n",
        "    return True\n",
        "  except:\n",
        "    return False\n",
        "\n",
        "#Détection d'outliers numériques\n",
        "def detect_numeric_outliers_series(series, h = 2, left = 0.25, right = 0.75):\n",
        "\n",
        "  \n",
        "  Q1 = series.quantile(left)\n",
        "  Q3 = series.quantile(right)\n",
        "  IQR = Q3 - Q1\n",
        "  \n",
        "  #Les données s'écartant fortement des quantiles sont potentiellement des anomalies (outlier en anglais)\n",
        "  is_outlier = (series < (Q1 - h * IQR)) | (series > (Q3 + h * IQR))\n",
        "  if is_outlier.sum() > 0:\n",
        "    print(is_outlier.sum(), \"outliers numériques trouvés pour\", series.name)\n",
        "  return is_outlier\n",
        "\n",
        "#Détection d'outliers catégoriels: on identifie une catégorie rare si elle représente moins de 1% (seuil relatif) des données\n",
        "def detect_rare_cat_series(series, relative_threshold = 0.01, nmin = 2):\n",
        "  counts = series.value_counts()\n",
        "  \n",
        "  #on garde la contrainte la plus souple (< rel_th% des données, ou < nmin)\n",
        "  relative_nmin = relative_threshold*len(series)\n",
        "  tot_nmin = max(nmin, relative_nmin)\n",
        "  \n",
        "  \n",
        "  is_rare = (counts <  tot_nmin)\n",
        "  rare_cats = counts[is_rare]\n",
        "  is_outlier = series.isin(rare_cats.index)\n",
        "  if is_outlier.sum() > 0:\n",
        "    print(is_outlier.sum(), \"outliers catégoriels trouvés pour\", series.name)\n",
        "  return is_outlier\n",
        "\n",
        "\n",
        "#Détection générique des outliers dans une Series\n",
        "def detect_outlier_series(series):\n",
        "  \n",
        "  \"\"\"\n",
        "  Il vaut mieux privilégier la représentation catégorielle plutôt que numérique.\n",
        "  En effet, si il existe peu de valeurs différentes, la détection numérique peut ne pas être efficace:\n",
        "  E.G.: pour une variable binaire dont 20% des données sont 0, on détectera 20% d'outliers (IQR = 0)\n",
        "  \"\"\"\n",
        "  if is_cat(series):\n",
        "    return detect_rare_cat_series(series)  \n",
        "  elif is_num(series):\n",
        "    return detect_numeric_outliers_series(series)\n",
        "\n",
        "  else:\n",
        "    #return a Series containing only False\n",
        "    return pd.Series(False, series.index)\n",
        "\n",
        "\n",
        "  \n",
        "outliers = df.apply(detect_outlier_series)\n",
        "\n"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "yYR6yxgnNL1q",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "#De cette manière, on peut observer le nombre de fois qu'une maison est considérée comme anormale\n",
        "outlier_sum = outliers.sum(axis = 1) \n",
        "outlier_sum.plot.bar()\n",
        "plt.xticks(range(len(df.address)), labels = df.address, rotation = \"vertical\")\n",
        "\n",
        "#On peut vérifier pour quelles variables chaque maison est considérée comme anormale\n",
        "display(outliers)\n",
        "\n",
        "#Outliers\n",
        "n_cols = 2\n",
        "display(df[outlier_sum >= n_cols])"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Sqza1iffq4Li",
        "colab_type": "text"
      },
      "source": [
        "On remarque que la maison de la Rue de la Loi 50, Bruxelles, et celle de charleroi sont considérées deux fois comme des outliers."
      ]
    }
  ]
}