Your Tasks
Last updated
Last updated
In light of recent advancements in Generative AI, CS186 staff has developed a variety of techniques to detect usage of ChatGPT, Bard, Copilot, and other Generative AI tools. Students are cautioned to use such tools in accordance with our Generative AI policy, as per the syllabus.
In this project you'll be implementing B+ tree indices. Since you'll be diving into the codebase for the first time, we've provided an introduction to the skeleton code.
Every modern database supports a variety of data types to use in records, and RookieDB is no exception. For consistency and convenience, most implementations choose to have their own internal representation of their data types built on top of the implementation language's defaults. In RookieDB, we represent them using data boxes.
A data box can contain data of the following types:
Boolean
(1 byte)
Int
(4 bytes)
Float
(4 bytes)
Long
(8 bytes)
String(N)
(N bytes)
You'll be working with the abstract DataBox
class, which implements Comparable<DataBox>
. You may find it useful to review how the Comparable interface works for this project.
A record in a table is uniquely identified by its page number (the number of the page on which it resides) and its entry number (the record's index on the page). These two numbers (pageNum
, entryNum
) comprise a RecordId
. For this project, we'll be using record IDs in our leaf nodes as pointers to records in the data pages.
The index
directory contains a partial implementation of an Alternative 2 B+ tree, an implementation that you will complete in this project. Some of the important files in this directory are:
BPlusTree.java
- This file contains the class that manages the structure of the B+ tree. Every B+ tree maps keys of a type DataBox
(a single value or "cell" in a table) to values of type RecordId
(identifiers for records on data pages). An example of inserting and a retrieving records using keys can be found in the comments at BPlusTree.java line 130
BPlusNode.java
- A B+ node represents a node in the B+ tree, and contains similar methods to BPlusTree
such as get
, put
and delete
. BPlusNode
is an abstract class and is implemented as either a LeafNode
or an InnerNode
LeafNode.java
- A leaf node is a node with no descendants that contains pairs of keys and Record IDs that point to the relevant records in the table, as well a pointer to its right sibling. More details can be found LeafNode.java line 15
InnerNode.java
- An inner node is a node that stores keys and pointers (page numbers) to child nodes (which themselves may either be an inner node or a leaf node). More details can be found InnerNode.java line 15
BPlusTreeMetadata.java
- This file contains a class that stores useful information such as the order and height of the tree. You can access instances of this class using the this.metadata
instance variables available in all of the classes listed above.
You should read through all of the code in the index
directory. Many comments contain critical information on how you must implement certain functions. For example, BPlusNode::put
specifies how to redistribute entries after a split. You are responsible for reading these comments. Here are a few of the most notable points:
Generally, B+ trees do support duplicate keys. However, our implementation of B+ trees does not support duplicate keys. You will throw an exception whenever a duplicate key is inserted. But you don't have to do so for deleting an absent key.
Our implementation of B+ trees assumes that inner nodes and leaf nodes can be serialized on a single page. You do not have to support nodes that span multiple pages.
Our implementation of delete
does not rebalance the tree. Thus, the invariant that all non-root leaf nodes in a B+ tree of order d
contain between d
and 2d
entries is broken. Note that actual B+ trees do rebalance after deletion, but we will not be implementing rebalancing trees in this project for the sake of simplicity.
There are a few parts in this project where a method will take in objects of the type LockContext
. You don't need to worry too much about these objects right now; they will become more relevant in Project 4.
If there are any methods you wish to call that require these objects, use the ones passed in to the method you are implementing, or defined in the class of the method you are implementing (this.lockContext
for BPlusTree
and this.treeContext
for InnerNode
and LeafNode
).
This part of the project makes extensive use of Optional<T>
objects. We recommend reading through the documentation here to get a feel for them. In particular, we use Optional
s for values that may not necessarily be present. For example, a call to get
may not yield any value for a key that doesn't correspond to a record, in which case an Optional.empty()
would be returned. If the key did correspond to a record, a populated Optional.of(RecordId(pageNum, entryNum))
would be returned instead.
Here's a diagram that shows the structure of the project with color-coded components. You may find it helpful to refer back to this after you start working on the tasks.
Green boxes: functions that you need to implement
White boxes: next to each function, contains a quick summary of the important points that you need to consider for that function. To find more detailed descriptions look at the comments of each method.
Orange boxes: hints for each function which may point you to helper functions.
LeafNode::fromBytes
You should first implement the fromBytes
in LeafNode
. This method reads a LeafNode
from a page. For information on how a leaf node is serialized, see LeafNode::toBytes
. For an example on how to read a node from disk, see InnerNode::fromBytes
. Your code should be similar to the inner node version but should account for the differences between how inner nodes and leaf nodes are serialized. You may find the documentation in ByteBuffer.java
helpful.
Once you have implemented fromBytes
you should be passing TestLeafNode::testToAndFromBytes
.
get
, getLeftmostLeaf
, put
, remove
After implementing fromBytes
, you will need to implement the following methods in LeafNode
, InnerNode
, and BPlusTree
:
get
getLeftmostLeaf
(LeafNode
and InnerNode
only)
put
remove
For more information on what these methods should do refer to the comments in BPlusTree
and BPlusNode
.
Each of these methods, although split into three different classes, can be viewed as one recursive action each - the BPlusTree
method starts the call, the InnerNode
method is the recursive case, and the LeafNode
method is the base case. It's suggested that you work on one method at a time (over all three classes).
We've provided a sync()
method in LeafNode
and InnerNode
. The purpose of sync()
is to ensure that representation of a node in our buffers is up-to-date with the representation of the node in program memory. Do not forget to call sync()
when implementing the two mutating methods (put
and remove
); it's easy to forget.
You will need to implement the following methods in BPlusTree
:
scanAll
scanGreaterEqual
In order to implement these, you will have to complete the BPlusTreeIterator
inner class in BPlusTree.java
to complete these two methods.
After completing this task, you should be passing TestBPlusTree::testRandomPuts
Your implementation does not have to account for the tree being modified during a scan. For the time being you can think of this as there being a lock that prevents scanning and mutation from overlapping, and that the behavior of iterators created before a modification is undefined (you can handle any problems with these iterators however you like, or not at all).
Much like the methods from the Task 2 you'll need to implement bulkLoad
within all three of LeafNode
, InnerNode
, and BPlusTree
. Since bulk loading is a mutating operation you will need to call sync()
. Be sure to read the instructions in BPluNode::bulkLoad
carefully to ensure you split your nodes properly. We've provided a visualization of bulk loading for an order 2 tree with fill factor 0.75 (powerpoint slides here):
After this, you should pass all the Project 2 tests we have provided to you (and any you add yourselves)! These are all the provided tests in database.index.*
.
To help you debug we have implemented the toDotPDFFile
method of BPlusTree
. You can add a call to this method in a test to generate a PDF file of your B+ tree.
For example,
If you get "Cannot run program "dot"
, you need to install GraphViz. GraphViz is a software package that generates visualizations of network style graphs.
Navigate to CommandLineInterface.java
and run the code to start our CLI. This should open a new panel in IntelliJ at the bottom. Click on this panel. We've provided 3 demo tables (Students
, Courses
, Enrollments
). Recall from Project 0 that we can run queries on this CLI. Let's try running the following query:
After implemting our B+ Tree index in Project 2, we can now create indices on columns of tables! Let's try running the command below:
This creates an index on the sid column of the Students
table. Unfortuantely, we do not have enough demo data to actually observe much speedup. Theoretically, however, we can create indices on certain columns to speed up lookup queries. Let's run exit
to terminate the CLI.
Move on to the next sections for details on testing and on submitting the assignment.