Add files via upload

liyijin-data-PM · web-flow · commit 162fd5de6952 · 2019-12-05T20:43:59.000-05:00
diff --git a/2. Moive Recommendation.ipynb b/2. Moive Recommendation.ipynb
@@ -0,0 +1,256 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Moive Recommendation"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## This project is to practice data structures, methods and functions of the Pandas and Numpy"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The goal of the project is to create movie recommendations for a person, based on the person’s and critics’ ratings of the movies. \n",
+    "\n",
+    "The following files will be required to run the program:\n",
+    "1. `IMDB.csv`: A table with movie information\n",
+    "2. `ratings.csv`: A table with ratings of all movies listed in the movies data \n",
+    "    by 100 critics. The column names in the critics data correspond to the name of each critic.\n",
+    "3. `pX.csv`: A table with one person’s ratings of a subset of the movies in the movies data set, \n",
+    "    where X is a number. The column name in the file indicates the name of the person.\n",
+    "    \n",
+    "    \n",
+    "All personal ratings are integer numbers in the 1..10 range."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "** How does this program function:** <br>\n",
+    "1. The user will be asked to specify the `subfolder` in the current working directory, where the files are stored, along with the `names of the critics`, `person` and `movies data files`.\n",
+    "2. Determine and output the names of three critics, whose ratings of the movies are closest to the person’s ratings based on the `Euclidean distance` metric.\n",
+    "3. Use the `ratings by the critics` identified in item 2 to determine which movies to recommend. Display information about recommended movies as described below.<br>\n",
+    "a. The movie recommendations must consist of the top-rated movies in each movie genre, based on the average ratings of movies by the three critics identified in step 2 above.<br>\n",
+    "b. Movie genre is determined by the Genre1 column of the movies data.<br>\n",
+    "c. Recommendations must be listed in alphabetical order by genre.<br>\n",
+    "d. Missing data (e.g. running time) should not be included."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os.path\n",
+    "import pandas as pd\n",
+    "import numpy as np\n",
+    "\n",
+    "def main():\n",
+    "    '''\n",
+    "    The main function that is called to start the program.  \n",
+    "    '''\n",
+    "    filesNames = input('Please enter the name of the folder with files, the name of movies file,\\\n",
+    "    \\nthe name of critics file, the name of personal ratings file, separated by spaces:\\n')\n",
+    "    print() #print a new line\n",
+    "    filesNamesLst = filesNames.split(' ') \n",
+    "    currentWorkDir = os.getcwd()\n",
+    "    subfolderName = filesNamesLst[0]\n",
+    "    #create a DataFrame for movies with selected columns\n",
+    "    movieFileName = filesNamesLst[1] \n",
+    "    movieFilePath = os.path.join(currentWorkDir, subfolderName, movieFileName)\n",
+    "    movieDataFrame = pd.read_csv(movieFilePath, \\\n",
+    "                        encoding = 'unicode_escape').loc[:, ['Title', 'Genre1', 'Year', 'Runtime']] \n",
+    "    #create a DataFrame for critics ratings\n",
+    "    criticsFileName = filesNamesLst[2] \n",
+    "    criticsFilePath = os.path.join(currentWorkDir, subfolderName, criticsFileName)\n",
+    "    criticsDataFrame = pd.read_csv(criticsFilePath) \n",
+    "    #create a DataFrame for personal ratings\n",
+    "    personalFileName = filesNamesLst[3] \n",
+    "    personalFilePath = os.path.join(currentWorkDir, subfolderName, personalFileName)\n",
+    "    personalDataFrame = pd.read_csv(personalFilePath) \n",
+    "    #call functions to run the program\n",
+    "    topThreeCriticsLst = findClosestCritics(criticsDataFrame, personalDataFrame) \n",
+    "    print(topThreeCriticsLst, '\\n') \n",
+    "    movieRecommendation = recommendMovies(criticsDataFrame, personalDataFrame, \\\n",
+    "                                          topThreeCriticsLst, movieDataFrame)\n",
+    "    personName = personalDataFrame.columns[1]\n",
+    "    printRecommendations(movieRecommendation, personName)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def findClosestCritics(criticsDataFrame, personalDataFrame):\n",
+    "    '''\n",
+    "    This function is to return a list of three critics, whose ratings of movies are most similar \n",
+    "    to those provided in the personal ratings data based on Euclidean distance. The lower the \n",
+    "    distance, the closer, thus more similar, the critic's ratings are to the person's. \n",
+    "     \n",
+    "    Parameters:\n",
+    "    criticsDataFrame - provides data about critics ratings\n",
+    "    personalDataFrame - provides data about personal ratings \n",
+    "    '''\n",
+    "    \n",
+    "    # merge critics file and personal file by the same movie title\n",
+    "    criticsPersonRating = pd.merge(criticsDataFrame, personalDataFrame) \n",
+    "    # a new DataFrame with only critics' ratings after merging without Title column\n",
+    "    criticRating = criticsPersonRating.iloc[:,1:-1] \n",
+    "    # indexed by the movie titles\n",
+    "    criticRating.index = criticsPersonRating['Title'] \n",
+    "    # person's rating value without the person's name\n",
+    "    personRatingValue = criticsPersonRating[personalDataFrame.columns[1]] \n",
+    "    # to keep the index the same as the critics' rating DataFrame    \n",
+    "    personRatingValue.index = criticsPersonRating['Title'] \n",
+    "    ratingDifference = criticRating.sub(personRatingValue, axis = 0)\n",
+    "    eucliDistance = np.sqrt((ratingDifference**2).apply(np.sum))\n",
+    "    eucliDistance.sort_values(inplace = True) # sort the result from smallest to largest\n",
+    "    # select only the top 3 critics with smaller Euclidean distance \n",
+    "    topThreeCritics = eucliDistance.iloc[:3] \n",
+    "    # generate a list of the critics' names\n",
+    "    topThreeCriticsLst = list(topThreeCritics.index.values) \n",
+    "    \n",
+    "    return topThreeCriticsLst"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def recommendMovies(criticsDataFrame, personalDataFrame, topThreeCriticsLst, movieDataFrame):  \n",
+    "    '''\n",
+    "    This function is to compute the top-rated unwatched movies in each genre category \n",
+    "    based on the average of the three critics' ratings\n",
+    "     \n",
+    "    Parameters:\n",
+    "    criticsDataFrame - provides data about critics' ratings\n",
+    "    personalDataFrame - provides data about personal ratings \n",
+    "    topThreeCriticsLst - a list of three critics, whose ratings of movies are most similar to \n",
+    "    those provided in the personal ratings data\n",
+    "    movieDataFrame - provides data about movies info\n",
+    "    '''\n",
+    "    # prepare the DataFrames for critics rating, person's rating and movie indexed by movie title.\n",
+    "    criticsDataFrame.index = criticsDataFrame['Title']\n",
+    "    criticsDataFrame = criticsDataFrame.iloc[:,1:]\n",
+    "    personalDataFrame.index = personalDataFrame['Title']\n",
+    "    personalDataFrame = personalDataFrame.iloc[:,1:]\n",
+    "    movieDataFrame.index = movieDataFrame['Title']\n",
+    "    movieDataFrame = movieDataFrame.iloc[:,1:]\n",
+    "    # prepare the unwatched movie DataFrame with average ratings \n",
+    "    # from the three critics whose ratings are similar to the person's\n",
+    "    unwatchedCriticRating = criticsDataFrame.loc\\\n",
+    "                            [criticsDataFrame.index.difference(personalDataFrame.index)]\n",
+    "    topThreeCriticsRating = unwatchedCriticRating[topThreeCriticsLst]\n",
+    "    averageCriticsRating = round(topThreeCriticsRating.mean(axis = 1), 2)\n",
+    "    movieDataFrame['Average Rating'] = averageCriticsRating \n",
+    "    movieDataFrame.sort_values('Genre1', inplace = True)\n",
+    "    movieRecommendation = movieDataFrame[movieDataFrame.groupby(by = 'Genre1')['Average Rating'].\\\n",
+    "                                         transform(max) == movieDataFrame['Average Rating']]\n",
+    "    \n",
+    "    return movieRecommendation"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def printRecommendations(movieRecommendation, personName):\n",
+    "    '''\n",
+    "    This function is to printout all the recommended movies in alphabetical order by the genre.\n",
+    "    \n",
+    "    Parameters:\n",
+    "    movieRecommendation - provides data about critics' ratings\n",
+    "    personName - the person's name for whom the recommendation is made for\n",
+    "    '''\n",
+    "    print('Recommendations for ', personName, ':', sep = '')\n",
+    "    # get the longest title for formatting later\n",
+    "    moiveTitle = list(movieRecommendation.index.values)\n",
+    "    longestTitle = len(max(moiveTitle, key = len))\n",
+    "    # get each factor (i.e. title, genre etc.) and then print with designed format \n",
+    "    for row in range(len(movieRecommendation)):\n",
+    "        title = movieRecommendation.index[row]\n",
+    "        gener1 = movieRecommendation.loc[title]['Genre1']\n",
+    "        year = movieRecommendation.loc[title]['Year']\n",
+    "        runTime = movieRecommendation.loc[title]['Runtime']\n",
+    "        rating = movieRecommendation.loc[title]['Average Rating']\n",
+    "        if pd.isnull(runTime) != True:\n",
+    "            print('\"', title, '\" ', (longestTitle - len(title))*' ', '(', gener1, '), ', \\\n",
+    "                  'rating: ', rating, ', ', year, ', runs ', runTime, sep = '')\n",
+    "        else:\n",
+    "            print('\"', title, '\" ', (longestTitle - len(title))*' ', \\\n",
+    "                  '(', gener1, '), ', 'rating: ', rating, ', ', year, sep = '')"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Please enter the name of the folder with files, the name of movies file,    \n",
+      "the name of critics file, the name of personal ratings file, separated by spaces:\n",
+      "data1 IMDB.csv ratings.csv p8.csv\n",
+      "\n",
+      "['Quartermaine', 'Arvon', 'Merrison'] \n",
+      "\n",
+      "Recommendations for Catulpa:\n",
+      "\"Star Wars: The Force Awakens\"    (Action), rating: 9.67, 2015, runs 136 min\n",
+      "\"The Grand Budapest Hotel\"        (Adventure), rating: 9.0, 2014, runs 99 min\n",
+      "\"The Martian\"                     (Adventure), rating: 9.0, 2015, runs 144 min\n",
+      "\"Kubo and the Two Strings\"        (Animation), rating: 9.67, 2016\n",
+      "\"How to Train Your Dragon\"        (Animation), rating: 9.67, 2010\n",
+      "\"Hacksaw Ridge\"                   (Biography), rating: 9.33, 2016, runs 139 min\n",
+      "\"What We Do in the Shadows\"       (Comedy), rating: 9.0, 2014\n",
+      "\"Prisoners\"                       (Crime), rating: 8.33, 2013, runs 153 min\n",
+      "\"Spotlight\"                       (Crime), rating: 8.33, 2015, runs 128 min\n",
+      "\"The Perks of Being a Wallflower\" (Drama), rating: 9.67, 2012, runs 102 min\n",
+      "\"Shutter Island\"                  (Mystery), rating: 8.33, 2010, runs 138 min\n"
+     ]
+    }
+   ],
+   "source": [
+    "main()"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.6.5"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}