|
| 1 | +{ |
| 2 | + "cells": [ |
| 3 | + { |
| 4 | + "cell_type": "markdown", |
| 5 | + "metadata": {}, |
| 6 | + "source": [ |
| 7 | + "# Movie Recommendation" |
| 8 | + ] |
| 9 | + }, |
| 10 | + { |
| 11 | + "cell_type": "markdown", |
| 12 | + "metadata": {}, |
| 13 | + "source": [ |
| 14 | + "## This project practices the data structures, methods, and functions of pandas and NumPy" |
| 15 | + ] |
| 16 | + }, |
| 17 | + { |
| 18 | + "cell_type": "markdown", |
| 19 | + "metadata": {}, |
| 20 | + "source": [ |
| 21 | + "The goal of the project is to create movie recommendations for a person, based on the person’s and critics’ ratings of the movies. \n", |
| 22 | + "\n", |
| 23 | + "The following files will be required to run the program:\n", |
| 24 | + "1. `IMDB.csv`: A table with movie information\n", |
| 25 | + "2. `ratings.csv`: A table with ratings of all movies listed in the movies data \n", |
| 26 | + " by 100 critics. The column names in the critics data correspond to the name of each critic.\n", |
| 27 | + "3. `pX.csv`: A table with one person’s ratings of a subset of the movies in the movies data set, \n", |
| 28 | + " where X is a number. The column name in the file indicates the name of the person.\n", |
| 29 | + " \n", |
| 30 | + " \n", |
| 31 | + "All personal ratings are integer numbers in the 1..10 range." |
| 32 | + ] |
| 33 | + }, |
| 34 | + { |
| 35 | + "cell_type": "markdown", |
| 36 | + "metadata": {}, |
| 37 | + "source": [ |
| 38 | + "**How the program works:** <br>\n", |
| 39 | + "1. The user is asked for the `subfolder` of the current working directory where the files are stored, followed by the names of the `movies`, `critics`, and `personal ratings` data files.\n", |
| 40 | + "2. Determine and output the names of three critics, whose ratings of the movies are closest to the person’s ratings based on the `Euclidean distance` metric.\n", |
| 41 | + "3. Use the `ratings by the critics` identified in item 2 to determine which movies to recommend. Display information about recommended movies as described below.<br>\n", |
| 42 | + "a. The movie recommendations must consist of the top-rated movies in each movie genre, based on the average ratings of movies by the three critics identified in step 2 above.<br>\n", |
| 43 | + "b. Movie genre is determined by the Genre1 column of the movies data.<br>\n", |
| 44 | + "c. Recommendations must be listed in alphabetical order by genre.<br>\n", |
| 45 | + "d. Missing data (e.g. running time) should not be included." |
| 46 | + ] |
| 47 | + }, |
| 48 | + { |
| 49 | + "cell_type": "code", |
| 50 | + "execution_count": 1, |
| 51 | + "metadata": {}, |
| 52 | + "outputs": [], |
| 53 | + "source": [ |
| 54 | + "import os\n", |
| 55 | + "import pandas as pd\n", |
| 56 | + "import numpy as np\n", |
| 57 | + "\n", |
| 58 | + "def main():\n", |
| 59 | + " '''\n", |
| 60 | + " The main function that is called to start the program. \n", |
| 61 | + " '''\n", |
| 62 | + " filesNames = input('Please enter the name of the folder with files, the name of movies file,\\\n", |
| 63 | + " \\nthe name of critics file, the name of personal ratings file, separated by spaces:\\n')\n", |
| 64 | + " print() #print a new line\n", |
| 65 | + " filesNamesLst = filesNames.split() # split on any run of whitespace\n", |
| 66 | + " currentWorkDir = os.getcwd()\n", |
| 67 | + " subfolderName = filesNamesLst[0]\n", |
| 68 | + " #create a DataFrame for movies with selected columns\n", |
| 69 | + " movieFileName = filesNamesLst[1] \n", |
| 70 | + " movieFilePath = os.path.join(currentWorkDir, subfolderName, movieFileName)\n", |
| 71 | + " movieDataFrame = pd.read_csv(movieFilePath, \\\n", |
| 72 | + " encoding = 'unicode_escape').loc[:, ['Title', 'Genre1', 'Year', 'Runtime']] \n", |
| 73 | + " #create a DataFrame for critics ratings\n", |
| 74 | + " criticsFileName = filesNamesLst[2] \n", |
| 75 | + " criticsFilePath = os.path.join(currentWorkDir, subfolderName, criticsFileName)\n", |
| 76 | + " criticsDataFrame = pd.read_csv(criticsFilePath) \n", |
| 77 | + " #create a DataFrame for personal ratings\n", |
| 78 | + " personalFileName = filesNamesLst[3] \n", |
| 79 | + " personalFilePath = os.path.join(currentWorkDir, subfolderName, personalFileName)\n", |
| 80 | + " personalDataFrame = pd.read_csv(personalFilePath) \n", |
| 81 | + " #call functions to run the program\n", |
| 82 | + " topThreeCriticsLst = findClosestCritics(criticsDataFrame, personalDataFrame) \n", |
| 83 | + " print(topThreeCriticsLst, '\\n') \n", |
| 84 | + " movieRecommendation = recommendMovies(criticsDataFrame, personalDataFrame, \\\n", |
| 85 | + " topThreeCriticsLst, movieDataFrame)\n", |
| 86 | + " personName = personalDataFrame.columns[1]\n", |
| 87 | + " printRecommendations(movieRecommendation, personName)" |
| 88 | + ] |
| 89 | + }, |
| 90 | + { |
| 91 | + "cell_type": "code", |
| 92 | + "execution_count": 2, |
| 93 | + "metadata": {}, |
| 94 | + "outputs": [], |
| 95 | + "source": [ |
| 96 | + "def findClosestCritics(criticsDataFrame, personalDataFrame):\n", |
| 97 | + " '''\n", |
| 98 | + " Return a list of the three critics whose movie ratings are most similar \n", |
| 99 | + " to the person's ratings, measured by Euclidean distance. The lower the \n", |
| 100 | + " distance, the more similar the critic's ratings are to the person's. \n", |
| 101 | + " \n", |
| 102 | + " Parameters:\n", |
| 103 | + " criticsDataFrame - provides data about critics ratings\n", |
| 104 | + " personalDataFrame - provides data about personal ratings \n", |
| 105 | + " '''\n", |
| 106 | + " \n", |
| 107 | + " # merge critics file and personal file by the same movie title\n", |
| 108 | + " criticsPersonRating = pd.merge(criticsDataFrame, personalDataFrame, on = 'Title') \n", |
| 109 | + " # a new DataFrame with only critics' ratings after merging without Title column\n", |
| 110 | + " criticRating = criticsPersonRating.iloc[:,1:-1] \n", |
| 111 | + " # indexed by the movie titles\n", |
| 112 | + " criticRating.index = criticsPersonRating['Title'] \n", |
| 113 | + " # person's rating value without the person's name\n", |
| 114 | + " personRatingValue = criticsPersonRating[personalDataFrame.columns[1]] \n", |
| 115 | + " # to keep the index the same as the critics' rating DataFrame \n", |
| 116 | + " personRatingValue.index = criticsPersonRating['Title'] \n", |
| 117 | + " ratingDifference = criticRating.sub(personRatingValue, axis = 0)\n", |
| 118 | + " eucliDistance = np.sqrt((ratingDifference**2).apply(np.sum))\n", |
| 119 | + " eucliDistance.sort_values(inplace = True) # sort the result from smallest to largest\n", |
| 120 | + " # select only the top 3 critics with smaller Euclidean distance \n", |
| 121 | + " topThreeCritics = eucliDistance.iloc[:3] \n", |
| 122 | + " # generate a list of the critics' names\n", |
| 123 | + " topThreeCriticsLst = list(topThreeCritics.index.values) \n", |
| 124 | + " \n", |
| 125 | + " return topThreeCriticsLst" |
| 126 | + ] |
| 127 | + }, |
| 128 | + { |
| 129 | + "cell_type": "code", |
| 130 | + "execution_count": 3, |
| 131 | + "metadata": {}, |
| 132 | + "outputs": [], |
| 133 | + "source": [ |
| 134 | + "def recommendMovies(criticsDataFrame, personalDataFrame, topThreeCriticsLst, movieDataFrame): \n", |
| 135 | + " '''\n", |
| 136 | + " Compute the top-rated unwatched movies in each genre, \n", |
| 137 | + " based on the average of the three critics' ratings.\n", |
| 138 | + " \n", |
| 139 | + " Parameters:\n", |
| 140 | + " criticsDataFrame - provides data about critics' ratings\n", |
| 141 | + " personalDataFrame - provides data about personal ratings \n", |
| 142 | + " topThreeCriticsLst - a list of three critics, whose ratings of movies are most similar to \n", |
| 143 | + " those provided in the personal ratings data\n", |
| 144 | + " movieDataFrame - provides data about movies info\n", |
| 145 | + " '''\n", |
| 146 | + " # prepare the DataFrames for critics rating, person's rating and movie indexed by movie title.\n", |
| 147 | + " criticsDataFrame.index = criticsDataFrame['Title']\n", |
| 148 | + " criticsDataFrame = criticsDataFrame.iloc[:,1:]\n", |
| 149 | + " personalDataFrame.index = personalDataFrame['Title']\n", |
| 150 | + " personalDataFrame = personalDataFrame.iloc[:,1:]\n", |
| 151 | + " movieDataFrame.index = movieDataFrame['Title']\n", |
| 152 | + " movieDataFrame = movieDataFrame.iloc[:,1:]\n", |
| 153 | + " # prepare the unwatched movie DataFrame with average ratings \n", |
| 154 | + " # from the three critics whose ratings are similar to the person's\n", |
| 155 | + " unwatchedCriticRating = criticsDataFrame.loc\\\n", |
| 156 | + " [criticsDataFrame.index.difference(personalDataFrame.index)]\n", |
| 157 | + " topThreeCriticsRating = unwatchedCriticRating[topThreeCriticsLst]\n", |
| 158 | + " averageCriticsRating = round(topThreeCriticsRating.mean(axis = 1), 2)\n", |
| 159 | + " movieDataFrame['Average Rating'] = averageCriticsRating \n", |
| 160 | + " movieDataFrame.sort_values('Genre1', inplace = True)\n", |
| 161 | + " movieRecommendation = movieDataFrame[movieDataFrame.groupby(by = 'Genre1')['Average Rating'].\\\n", |
| 162 | + " transform('max') == movieDataFrame['Average Rating']]\n", |
| 163 | + " \n", |
| 164 | + " return movieRecommendation" |
| 165 | + ] |
| 166 | + }, |
| 167 | + { |
| 168 | + "cell_type": "code", |
| 169 | + "execution_count": 4, |
| 170 | + "metadata": {}, |
| 171 | + "outputs": [], |
| 172 | + "source": [ |
| 173 | + "def printRecommendations(movieRecommendation, personName):\n", |
| 174 | + " '''\n", |
| 175 | + " Print all recommended movies in alphabetical order by genre.\n", |
| 176 | + " \n", |
| 177 | + " Parameters:\n", |
| 178 | + " movieRecommendation - the recommended movies with their genre, year, runtime and average rating\n", |
| 179 | + " personName - the name of the person the recommendations are for\n", |
| 180 | + " '''\n", |
| 181 | + " print('Recommendations for ', personName, ':', sep = '')\n", |
| 182 | + " # get the longest title for formatting later\n", |
| 183 | + " movieTitles = list(movieRecommendation.index.values)\n", |
| 184 | + " longestTitle = len(max(movieTitles, key = len))\n", |
| 185 | + " # get each factor (i.e. title, genre etc.) and then print with designed format \n", |
| 186 | + " for row in range(len(movieRecommendation)):\n", |
| 187 | + " title = movieRecommendation.index[row]\n", |
| 188 | + " genre1 = movieRecommendation.loc[title]['Genre1']\n", |
| 189 | + " year = movieRecommendation.loc[title]['Year']\n", |
| 190 | + " runTime = movieRecommendation.loc[title]['Runtime']\n", |
| 191 | + " rating = movieRecommendation.loc[title]['Average Rating']\n", |
| 192 | + " if not pd.isnull(runTime):\n", |
| 193 | + " print('\"', title, '\" ', (longestTitle - len(title))*' ', '(', genre1, '), ', \\\n", |
| 194 | + " 'rating: ', rating, ', ', year, ', runs ', runTime, sep = '')\n", |
| 195 | + " else:\n", |
| 196 | + " print('\"', title, '\" ', (longestTitle - len(title))*' ', \\\n", |
| 197 | + " '(', genre1, '), ', 'rating: ', rating, ', ', year, sep = '')" |
| 198 | + ] |
| 199 | + }, |
| 200 | + { |
| 201 | + "cell_type": "code", |
| 202 | + "execution_count": 5, |
| 203 | + "metadata": {}, |
| 204 | + "outputs": [ |
| 205 | + { |
| 206 | + "name": "stdout", |
| 207 | + "output_type": "stream", |
| 208 | + "text": [ |
| 209 | + "Please enter the name of the folder with files, the name of movies file, \n", |
| 210 | + "the name of critics file, the name of personal ratings file, separated by spaces:\n", |
| 211 | + "data1 IMDB.csv ratings.csv p8.csv\n", |
| 212 | + "\n", |
| 213 | + "['Quartermaine', 'Arvon', 'Merrison'] \n", |
| 214 | + "\n", |
| 215 | + "Recommendations for Catulpa:\n", |
| 216 | + "\"Star Wars: The Force Awakens\" (Action), rating: 9.67, 2015, runs 136 min\n", |
| 217 | + "\"The Grand Budapest Hotel\" (Adventure), rating: 9.0, 2014, runs 99 min\n", |
| 218 | + "\"The Martian\" (Adventure), rating: 9.0, 2015, runs 144 min\n", |
| 219 | + "\"Kubo and the Two Strings\" (Animation), rating: 9.67, 2016\n", |
| 220 | + "\"How to Train Your Dragon\" (Animation), rating: 9.67, 2010\n", |
| 221 | + "\"Hacksaw Ridge\" (Biography), rating: 9.33, 2016, runs 139 min\n", |
| 222 | + "\"What We Do in the Shadows\" (Comedy), rating: 9.0, 2014\n", |
| 223 | + "\"Prisoners\" (Crime), rating: 8.33, 2013, runs 153 min\n", |
| 224 | + "\"Spotlight\" (Crime), rating: 8.33, 2015, runs 128 min\n", |
| 225 | + "\"The Perks of Being a Wallflower\" (Drama), rating: 9.67, 2012, runs 102 min\n", |
| 226 | + "\"Shutter Island\" (Mystery), rating: 8.33, 2010, runs 138 min\n" |
| 227 | + ] |
| 228 | + } |
| 229 | + ], |
| 230 | + "source": [ |
| 231 | + "main()" |
| 232 | + ] |
| 233 | + } |
| 234 | + ], |
| 235 | + "metadata": { |
| 236 | + "kernelspec": { |
| 237 | + "display_name": "Python 3", |
| 238 | + "language": "python", |
| 239 | + "name": "python3" |
| 240 | + }, |
| 241 | + "language_info": { |
| 242 | + "codemirror_mode": { |
| 243 | + "name": "ipython", |
| 244 | + "version": 3 |
| 245 | + }, |
| 246 | + "file_extension": ".py", |
| 247 | + "mimetype": "text/x-python", |
| 248 | + "name": "python", |
| 249 | + "nbconvert_exporter": "python", |
| 250 | + "pygments_lexer": "ipython3", |
| 251 | + "version": "3.6.5" |
| 252 | + } |
| 253 | + }, |
| 254 | + "nbformat": 4, |
| 255 | + "nbformat_minor": 2 |
| 256 | +} |
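The critic-matching step performed by `findClosestCritics` can be tried without the CSV files. The sketch below is not the notebook's code: the function name `closest_critics`, the critic names, and the ratings are all hypothetical toy values. It only assumes, as the notebook does, that both tables share a `Title` column and that the person's ratings sit in the second column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; every name and rating here is made up for illustration.
critics = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Ann": [8, 6, 9, 5],
    "Bob": [2, 9, 4, 7],
    "Cy": [7, 7, 8, 6],
    "Dee": [10, 1, 3, 9],
})
person = pd.DataFrame({"Title": ["A", "C"], "Pat": [8, 9]})


def closest_critics(critics, person, n=3):
    """Return the n critic names with the smallest Euclidean distance to the person."""
    # Keep only the movies the person has rated, indexed by title.
    merged = critics.merge(person, on="Title").set_index("Title")
    person_name = person.columns[1]
    critic_ratings = merged.drop(columns=person_name)
    # Subtract the person's rating from every critic's rating, row by row.
    diff = critic_ratings.sub(merged[person_name], axis=0)
    # One distance per critic: square root of the summed squared differences.
    distance = np.sqrt((diff ** 2).sum(axis=0))
    return distance.nsmallest(n).index.tolist()


print(closest_critics(critics, person))
```

Here `nsmallest` replaces the sort-then-slice used in the notebook; both pick the critics with the lowest distances.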
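The recommendation step in `recommendMovies` (average the chosen critics' ratings over unwatched movies, then keep the best-rated movie or movies per `Genre1`) can be sketched the same way. This is a minimal illustration under assumptions, not the notebook's implementation: `top_per_genre`, the titles, and the ratings are hypothetical, and the tables are assumed to be indexed by title already.

```python
import pandas as pd

# Hypothetical toy tables, both indexed by movie title.
movies = pd.DataFrame({
    "Title": ["A", "B", "C", "D", "E"],
    "Genre1": ["Drama", "Action", "Drama", "Action", "Drama"],
}).set_index("Title")
critics = pd.DataFrame({
    "Title": ["A", "B", "C", "D", "E"],
    "Ann": [8, 6, 9, 5, 9],
    "Cy": [7, 7, 8, 6, 10],
    "Dee": [10, 1, 3, 9, 8],
}).set_index("Title")


def top_per_genre(movies, critics, chosen, watched):
    """Return the unwatched movies with the highest average rating in each genre."""
    # Drop movies the person has already rated.
    unwatched = critics.drop(index=watched)
    # Average the chosen critics' ratings for each remaining movie.
    average = unwatched[chosen].mean(axis=1).round(2).rename("Average Rating")
    rated = movies.join(average, how="inner")
    # Broadcast each genre's best average back onto its rows, then keep the matches.
    best = rated.groupby("Genre1")["Average Rating"].transform("max")
    return rated[rated["Average Rating"] == best].sort_values("Genre1")


result = top_per_genre(movies, critics, ["Ann", "Cy", "Dee"], watched=["A"])
print(result)
```

Comparing against `transform("max")` keeps every movie tied for the top rating in a genre, which is why the notebook's sample output lists two titles for some genres.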