Your Tasks
Last updated
Last updated
In this project you'll be working with a subset of the MovieLens Dataset. Unlike the data set you worked on long ago in Project 1, the table here won't be organized in tables of records but rather as collections of documents! Documents are similar to records in the sense that they are used to group together pieces of data, but unlike the records we covered for SQL databases, documents can have fields that are not primitive data types. For example, the following document has three fields, two of which aren't primitive data types:
The following section will introduce you to the dataset.
This section will be helpful to reference as you're completing the tasks. A nice way to get an example of a record for any of the four collections (for example, ratings
) is to call db.ratings.findOne()
in the mongo shell.
The movies_metadata
collection contains documents with metadata for every movie in the MovieLens Dataset. There are many possible fields in each document, but some you may find useful for completing this project are:
movieId (int)
: A unique identifier for the movie. Corresponds to the field of the same name in keywords
, ratings
, and credits
.
title (string)
: The title of the movie.
tagline (string)
: A short phrase usually used to advertise the movie.
release_date (int)
: The date of the film's release, as a UNIX Timestamp.
budget (inconsistent)
: The movie's production budget in USD. See question 2iii for more details on this field's type.
vote_average (number)
: The average rating given by viewers on a scale from 0 to 10.
vote_count (int)
: The total number of ratings given.
revenue (int)
: The movie's revenue in USD
runtime (int)
: The length of the movie in minutes
genres (array)
: 0 or more documents each containing the following:
name (string)
: the name of one of the movie's genres
id (int)
: a unique identifier for the genre
Example document:
The keywords
collection contains documents with keywords related to certain movies. Each document in the collection has the following attributes:
movieId (int)
: A unique identifier for the movie. Corresponds to the field of the same name in movies_metadata
, ratings
, and credits
keywords (array)
: 0 or more documents each containing the following:
id (int)
: A unique identifier for a keyword
name (string)
: A keyword associated with the movie, for example "based on novel", "pirate", and "murder".
Example document:
The ratings
collection contains roughly 10,000 ratings submitted by specific viewers. Each document consists of the following format:
userId (int)
: A unique identifier for the user who gave the rating
movieId (int)
: A unique identifier for the movie. Corresponds to the field of the same name in movies_metadata
, keywords
, and credits
rating (number)
: The rating the user gave the movie, from 0 to 5.
timestamp (int)
: UNIX timestamp of when this rating was given
The credits
collection contains details on writers, actors, directors, and technicians who worked on the production of the movies in the data set. Each document consists of the following:`
movieId (int)
: A unique identifier for the movie. Corresponds to the field of the same name in movies_metadata
, keywords
, and ratings
cast (array)
: 0 or more documents of the each containing the following fields:
id (int)
: A unique identifier for the actor who played the character
character (string)
: The name of the character in the movie
name (string)
: The name of the actor as listed in the movie's credits
crew (array)
: 0 or more documents each containing the following fields:
id (int)
: A unique identifier for the crew member
department (string)
: The department the crew member worked in
job (string)
: The job title of the crew member (e.g. "Audio Technician", "Director")
name (string)
: The name of the crew member as listed in the movie's credits
Example document:
This task will walk you through step by step how to construct a query in MongoDB, and introduce you to some helpful operations for use in the rest of the tasks. If you're already comfortable doing queries in MongoDB feel free to skip to the end of this section to complete the corresponding question in query/q0.js
.
Inside the file query/q0.js
you should see the following:
Try replacing the todo
on line 3 so that the line becomes db.ratings.aggregate([
instead. This tells mongo that we want to pull in documents from the ratings
collection. Now try running the the following in a terminal: python3 test.py q0 --view
. This should run the query and give something similar to the following output (note that the _id field will likely differ, which is fine):
This query attempts to read every single document in the ratings
collection, and would be analogous to something like the following:
The query from the earlier section attempts to read every document in the ratings
collections. This next query will selectively match only specific documents matching a specific property.
It may be useful to see how this would look as a SQL query:
After pasting in the new query into query/q0.js
and running test.py q0 --view
you should see output resembling the following:
Lets break down the query.
db.ratings.aggregate([ ... ])
: This tells mongo that we'll be running an "aggregate" operation. We can pass in a list [ ... ]
known as the "pipeline", which will execute a series of operations. Each operation in the pipeline is known as a "stage", and each stage operates on the output of the previous stage.
{$match: ...}
: This is the only element of the pipeline so far. This tells mongo that we want documents from the collection that match certain properties, much like a WHERE
clause.
{timestamp: { $gte: ..., $lt: ...}}
: This tells mongo that we only want documents where the timestamp field has a value greater than or equal to 838857600
and less than 849398400
.
838857600
and 849398400
: The value in the timestamp
field is the number of seconds that have elapsed since January 1, 1970. This is more commonly known as "Unix time", "POSIX time", "Epoch time" or "UNIX Timestamp" and is a common way to represent times in computer systems. The two timestamps we use in the query correspond to October 1st, 1996 and December 1st, 1996 respectively.
There's no shortage of online tools like this one to convert between human readable times and UNIX time.
If you've ever heard of rumors about the world ending in January, 2038, that has to do with the way these timestamps are stored!
Next we'll extend our pipeline to perform an aggregate similar to what we might attempt in SQL.
The equivalent SQL expression would look like the following:
After pasting in the new query into query/q0.js
and running test.py q0 --view
:
A group "stage" in the pipeline always takes the following form:
In our above example, we grouped by the movieId
field. To indicate that we were referring to a field and not the string literal "movieId"
we prefixed it $
. After that we had three expressions representing values we wanted to compute in the aggregate. The first two expressions computed the min
and max
values of the rating
field.
The last column looks a bit peculiar: count: {$sum: 1}
. This assigns the count field to the accumulated sum of the value 1
. This means we add together n
copies of the value 1
where n
is the number of documents in each group, giving the total count of documents in each group.
Next lets try to see the top 10 movies with the most ratings. Continuing to build off our previous query:
Our equivalent query in SQL now looks like:
After pasting in the new query into query/q0.js
and running test.py q0 --view
:
The sort stage always takes the following form:
{ $sort: { <field1>: <sort order>, <field2>: <sort order> ... } }
The fields refer to which field of the output to sort on. The sort order is always either 1
to indicate ascending order or -1
for descending order.
The limit stage just needs to specify a number to limit the number of possible documents.
Next, we'll do the equivalent of a join in a SQL database to figure out what movies we're the IDs correspond to. Building again off the previous query:
Normally this is where we would put the expected output, but if you tried it out yourself you should know it looks like a complete mess! It may help to look at a single formatted document here. Try running python3 test.py q0 --format
, which should give a cleaned up version of the first document returned by the query:
There are two things to note here:
The "movies" field corresponds to an array of matching documents, not a single document. If there were multiple documents in movies_metadata
with a matching id, they would all appear in that list. If there were no matching documents, movie
would map to an empty list.
The contents of movie
are entire documents. In SQL this would be like having an entire record stored in a single column! As the name NoSQL implies though, we're not bound to the rule of SQL that all values must be atomic, and so we're allowed to have this nested structure.
The result of the last query was a bit messy, so lets do some cleaning up. We'll create a new field, name
, to store the name of the movie, rename count
and get rid of the column field _id
.
After pasting in the new query into query/q0.js
and running test.py q0 --view
:
A few things to note here:
The 0's and 1's stand for "drop" and "keep" respectively, i.e. "_id" gets dropped because its corresponding value in the project was 0.
To get the title of the first movie in the "movies" array we used {$first: "$movies.title"}
. We used .title
to access the title
attribute of each document in movies
. Try running the query without the $first
operator ("title": "$movies.title"
) and see what happens.
You can rename columns using a project.
You should now see some output that looks like this:
Congrats, you've built your first query! You can run python3 test.py q0
to test that the provided solution works. This section may seem a bit intimidating, but its mostly there for you to reference back to as you work through the remaining tasks.
Note on grading: question 0 and the questions in Task 1 are worth 1 point each, while the questions in Tasks 2 and 3 are worth 2 points each.
i. After spending over a year social distancing, you find yourself reading a lot of marvel comics and watching a lot of Disney movies. Find the IDs of all movies labeled with the keyword "mickey mouse" or "marvel comic" by writing a query on the keywords
collection. Order your output in ascending order of movieId
. The output documents should have the following fields:
Hint: start by trying to find movieIds labeled with just "mickey mouse", then use the $or operator to match documents with either of the labels.
Hint: you may find the $elemMatch operator useful here to select the appropriate documents. For example, the following query would match any movie with English listed as one of its spoken languages:
ii. We're interested in the best comedy films to watch. Return the id, title, average vote, and vote count of the top 50 comedy movies ordered from highest to lowest by average vote, breaking ties by descending order of vote count, and any further ties in ascending order of movieId
. Only include movies with 50 or more votes. The output documents should have the following fields:
Useful operators: $elemMatch, $gte for matching
Hint: genre names are case sensitive!
iii. Do movies get more good reviews than bad reviews? Is it the other way around? We want to know! For each possible rating find how many times that rating was given. Include the rating and the number of the times the rating was given and output in descending order of the rating. The output documents should have the following fields:
Hint: the building your first query section gives an example of how to get a count
iv. You've discovered a critic who always seems to know exactly which movies you would love and which ones you would hate. Their true name is a mystery, but you know their user id: 186
. Find critic 186's five most recent movie reviews, and create create a document with the following fields:
Useful operators: Look into the $push operator
Hint: You may find it helpful to see what happens when you group by null.
i. The TAs are having a movie night but they're having trouble choosing a movie! Luckily, Joe has read about IMDb's Weighted Rating which assigns movies a score based on demographic filtering. The weighted ranking () is calculated as follows:
is the number of votes for the movie
is the minimum votes required to be listed in the chart
is the average rating of the movie (this is stored in the field vote_average
)
is the mean vote across the whole report. For the purposes of this question this value is approximately 7, which you can hardcode into your query.
We would like to set a minimum number of votes to make sure the score is accurate. For this question we will assume the minimum votes required to be listed is 1838. Return the 20 highest rated movies according to this formula. The output should contain three fields: title
with the title of the movie, vote_count
with the number of votes the movie received, and score
which contains the WR for the associated movie rounded to two decimal places. How many movies can you recognize on this list? Sort in descending order of score
, and break ties in descending order of vote_count
and ascending order of title
. Your output documents should have the following fields:
ii. The TAs consider the prospect of making their own feature film on the beauty and joy of databases, and want to think of a catchy tagline. Run the following to see some examples taglines:
db.movies_metadata.aggregate({$project: {"_id": 0, "tagline": 1}})
Notice how the second one is "Roll the dice and unleash the excitement!" We want to see the 20 most common words (length > 3) across all taglines in descending order. In order to do this, we would need to split our sample tagline into its constituent words ("Roll", "the", "dice", "and", "unleash", "the", "excitement!").
To make things interesting, we will limit the words to length >3 to remove filler words, prepositions, and some pronouns (in the previous example, remove "the" and "and"). We also want to trim off any surrounding punctuation (periods, commas, question marks, or exclamation points) in a word and set all words to lowercase (our final set of words that will be included in our table for our example tagline is "roll, "dice", "unleash", "excitement", without the exclamation mark). Order your output by descending order of count
. Your output documents should have the following fields:
Can you guess what the most popular words might be?
Hint: Look up the $unwind
stage to see what it does.
Useful operators:
$split can be used to convert a string to an array. For example splitting the string "a proper copper coffee pot" by " " (a space) will create the array ["a", "proper", "copper", "coffee", "pot"]
$toLower converts a string to lowercase
$trim can be used to trim off surrounding punctuation marks
$strLenCP can be used to get the length of a string. Make sure to check for length after removing punctuation marks!
iii. How much does it cost to make a movie? The TAs were hoping to write a query for this but realized something that will haunt them for the rest of their lives... Mongo's lack of schema requirements means that the budget field of documents metadata isn't always an integer! Even worse, sometimes the field doesn't even exist! It looks like whoever prepared the data set always did one of the following:
If they didn't know the budget of a given movie they did one of the following:
Set the budget
field to false
Set the budget
field to null
Set the budget
field to an empty string: ""
Excluded the budget
field from the document
If they did know the budget of a given movie they did the following:
Set the budget
field to a number value, for example 186
Set the budget
field to a string with prefix $
, for example "$186"
Set the budget
field to a string with the the postfix "USD", for example "186 USD"
Group the budgets by their value rounded to the nearest multiple of ten million, and return the count for each rounded value. Additionally include an extra group "unknown"
for the count of movies where the budget was not known. Order by ascending order of rounded budget. Your output documents should have the following fields:
Hint: You can check if a field is present in a document by checking whether the field is equal to undefined
i. Comic book writer Stan Lee was known to make cameos in film adaptations of his works. Find the release date, title, and the name of the character Lee played for every movie Lee has appeared in. Order the results in descending order of release date. Your output documents should have the following fields:
Useful operators: you may find $unwind (used in 2ii) handy here
Hint: Stan Lee's id in the credits collection is 7624
.
ii. Director Wes Anderson is known for his unique visual and narrative style, and frequently collaborates with certain actors. Find the 5 actors who have appeared the most often in movies where Anderson is listed on the crew with the title "Director". Your output should include the actor's name, id, and the number of times the actor has collaborated with Anderson. Order in descending order of the number of collaborations. Break ties in ascending order of the actor's id. Your output documents should have the following fields:
Hint: Wes Anderson's id in the credits collection is 5655
. To get started, try to match all documents in credits
where he is listed as the director.
Hint: Like the above question, $unwind can be useful here.
Hint: You may need to group by multiple fields for this question. For example, {$group: _id: {val1: "$field.val1", val2: "$field.val2"}}
will group documents by both val1 and val2, similar to how the expression GROUP BY val1, val2
would in SQL.
Congrats, you're finished with the last project! There are no hidden tests for this assignment, so whatever score you see on the autograder after this will be your score on this assignment. Follow the instructions in the next section to submit your work.
You can run your answers through mongo directly by running mongo movies
to open the database and then entering your query directly:
This can help you catch any syntax errors in your queries. Alternatively you can run a query directly through the testing script by pasting in your query into the appropriate file (we'll use query/q0.js
as an example) and running python3 test.py q0 --view
. This will print up to ten of the query's results.
You can request more results with the --batch_size
flag (i.e. python3 test.py q0 --view --batch_size 20
will give the first twenty results).
If you find yourself dealing with large, hard to read documents, you can view a formatted version of the first document by using --format
instead of --view
. For example, using the provided query from the Building your first query section of the spec:
To run a test, from within the fa24-proj6-yourname
directory:
Before we run a full test on your output, we check that the format of your output matches what we're expecting. Format in this context means that all the field names we expect are there, there are no extra field names, and that the types corresponding to the field names match. Here's an example of some mismatched format for diffs/q2i.diff
The above output tells you two things:
One of your documents had an extra field. In this case, looking at the "Your document" section, your output had an extra field called "foo"
One of your documents has a mismatched type for one of its fields. In this case, look at the "Your document" section, your output had a value of type null for the field "movieId", when it should have been a number.
If you pass the format check, we'll run a diff against your query and the expected output. Become familiar with the UNIX diff format, if you're not already, because our tests saves a simplified diff for any query executions that don't match in diffs/
. As an example, the following output for diffs/q2ii.diff:
:
This indicates that:
your output has an extra document {"_id": "about", "count": 535}
(the -
at the beginning means the expected output doesn't include this line but your output has it)
your output is missing the document {"_id": "only", "count": 521}
(the plus at the beginning means the expected output does include those lines but your output is missing it).
If there is neither a +
nor -
at the beginning then it means that the line is in both your output and the expected output (your output is correct for that line).
If you care to look at the query outputs directly, ours are located in the expected_output
directory. Your output should be located in your solution's your_output
directory once you run the tests.
When we generate the diffs we'll be doing some basic reformatting to reorder things that don't have an inherent order to make sure that the results are consistent with each other. These include:
field names will be rearranged to be in alphabetical order
arrays when we don't ask for an explicit order to them