In this project you'll be implementing B+ tree indices. Since you'll be diving into the code base for the first time we've provided an introduction to the existing skeleton code.
Every modern database supports a variety of data types to use in records, and RookieDB is no exception. For consistency and convenience most implementations choose to have their own internal representation of their data types built on top of the implementation language's defaults. In RookieDB we represent them using data boxes.
A data box can contain data of the following types:
Long(8 bytes) and
String(N)(N bytes). For this project you'll be working with the abstract
DataBoxclass which implements
Comparable<DataBox>. You may find it useful to review how the Comparable interface works for this project.
A record in a table is uniquely identified by its page number (the number of the page on which it resides) and its entry number (the record's index on the page). These two numbers (pageNum, entryNum) comprise a
RecordId. For this project we'll be using record IDs in our leaf nodes as pointers to records in the data pages.
BPlusTree.java- This file contains the class that manages the structure of the B+ tree. Every B+ tree maps keys of a type
DataBox(a single value or "cell" in a table) to values of type
RecordId(identifiers for records on data pages). An example of inserting and a retrieving records using keys can be found in the comments at
BPlusNode.java- A B+ node represents a node in the B+ tree, and contains similar methods to
BPlusNodeis an abstract class and is implemented as either a
BPlusTreeMetadata.java- This file contains a class that stores useful information such as the order and height of the tree. You can access instances of this class using the
this.metadatainstance variables available in all of the classes listed above.
You should read through all of the code in the
indexdirectory. Many comments contain critical information on how you must implement certain functions. For example,
BPlusNode::putspecifies how to redistribute entries after a split. You are responsible for reading these comments. Here are a few of the most notable points:
- Generally, B+ trees do support duplicate keys. However, our implementation of B+ trees does not support duplicate keys. You will throw an exception whenever a duplicate key is inserted.
- Our implementation of B+ trees assumes that inner nodes and leaf nodes can be serialized on a single page. You do not have to support nodes that span multiple pages.
- Our implementation of delete does not rebalance the tree. Thus, the invariant that all non-root leaf nodes in a B+ tree of order
2dentries is broken. Note that actual B+ trees do rebalance after deletion, but we will not be implementing rebalancing trees in this project for the sake of simplicity.
There are a few parts in this project where a method will take in objects of the type
LockContext. You do not need to worry too much about these objects right now; they will become more relevant in Project 4.
If there are any methods you wish to call that require these objects, use the ones passed in to the method you are implementing, or defined in the class of the method you are implementing (
This part of the project makes extensive use of
Optional<T>objects. We recommend reading through the documentation here to get a feel for them. In particular, we use
Optionals for values that may not necessarily be present. For example, a call to
getmay not yield any value for a key that doesn't correspond to a record, in which case an
Optional.empty()would be returned. If the key did correspond to a record, a populated
Optional.of(RecordId(pageNum, entryNum))would be returned instead.
Here's a diagram that shows the structure of the project with color-coded components. You may find it helpful to refer back to this after you start working on the tasks.
(Click on the image to zoom in)
- Green Boxes: functions that you need to implement
- White boxes: next to each function, contains a quick summary of the important points that you need to consider for that function. To find more detailed descriptions look at the comments of each method.
- Orange boxes: hints for each function which may point you to helper functions.
You should first implement the
LeafNode. This method reads a
LeafNodefrom a page. For information on how a leaf node is serialized, see
LeafNode::toBytes. For an example on how to read a node from disk, see
InnerNode::fromBytes. Your code should be similar to the inner node version but should account for the differences between how inner nodes and leaf nodes are serialized. You may find the documentation in
Once you have implemented
fromBytesyou should be passing
fromBytes, you will need to implement the following methods in
For more information on what these methods should do refer to the comments in
Each of these methods, although split into three different classes, can be viewed as one recursive action each - the
BPlusTreemethod starts the call, the
InnerNodemethod is the recursive case, and the
LeafNodemethod is the base case. It's suggested that you work on one method at a time (over all three classes).
We've provided a
InnerNode. The purpose of
sync()is to ensure that representation of a node in our buffers is up-to-date with the representation of the node in program memory. Do not forget to call
sync()when implementing the two mutating methods (
remove); it's easy to forget.
You will need to implement the following methods in
After completing this Task you should be passing
Your implementation does not have to account for the tree being modified during a scan. For the time being you can think of this as there being a lock that prevents scanning and mutation from overlapping, and that the behavior of iterators created before a modification is undefined (you can handle any problems with these iterators however you like, or not at all).
Much like the methods from the Task 2 you'll need to implement
bulkLoadwithin all three of
BPlusTree. Since bulk loading is a mutating operation you will need to call
sync(). Be sure to read the instructions in
BPluNode::bulkLoadcarefully to ensure you split your nodes properly. We've provided a visualization of bulk loading for an order 2 tree with fill factor 0.75 (powerpoint slides here):
To help you debug we have implemented the
BPlusTree. You can add a call to this method in a test to generate a PDF file of your B+ tree.
BPlusTree tree = ...
CommandLineInterface.javaand run the code to start our CLI. This should open a new panel in IntelliJ at the bottom. Click on this panel. We've provided 3 demo tables (Students, Courses, Enrollments). Recall from project 0 that we can run queries on this CLI. Let's try running the following query:
SELECT * FROM Students AS s WHERE s.sid = 1;
After implemting our B+ Tree index in project 2, we can now create indices on columns of tables. Let's try running the command below
CREATE INDEX on Students(sid);
This creates an index on the sid column of the Students table. Unfortuantely, we do not have enough demo data to actually observe much speedup. But theoretically, we can create indices on certain columns to speed up lookup queries. Let's run
exitto terminate the CLI.
Move on to the next sections for details on testing and on submitting the assignment.