Teaching students to work on state-of-the-art NoSQL databases

In a recent post, I introduced an initiative, along with Dima Kassab, for teaching open source NoSQL databases. We collaborated to prepare course materials on three NoSQL databases for 22 students at the Informatics Department of SUNY Albany, and we made all of those materials available under a Creative Commons Attribution License.

We covered MongoDB in the first session and Neo4j in the second. The third session took place on November 13, 2012, and focused on the M database, whose importance in healthcare and financial applications I have discussed previously.

The sequential introduction of these databases benefited from following the same data example throughout: how to manage a database of movies, modeled to fit the framework of each of these powerful databases. It was quite useful to be able to contrast the databases with one another and with the relational approach to data modeling.

MongoDB: a document database
Neo4j: a graph database
M: a hierarchical database
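
To make the contrast concrete, here is a minimal sketch of how one movie record might look under each of the three models. It uses plain Python data structures rather than the databases themselves, and every name in it (the movie, the actors, the field names, the `ACTED_IN` relationship, the helper functions) is invented for illustration; it is not taken from the course materials.

```python
# Hypothetical movie data; all titles, names, and field names are illustrative.

# Document model (MongoDB-style): one self-contained nested record.
movie_document = {
    "title": "Movie A",
    "year": 2012,
    "actors": ["Alice", "Bob"],
}

# Graph model (Neo4j-style): nodes plus typed relationships between them.
nodes = {
    1: {"type": "Movie", "title": "Movie A", "year": 2012},
    2: {"type": "Person", "name": "Alice"},
    3: {"type": "Person", "name": "Bob"},
}
relationships = [
    (2, "ACTED_IN", 1),
    (3, "ACTED_IN", 1),
]

# Hierarchical model (M-style): a sparse tree addressed by subscript paths,
# loosely imitating a global such as ^Movies("Movie A","actors",1).
movies_global = {
    ("Movie A", "year"): 2012,
    ("Movie A", "actors", 1): "Alice",
    ("Movie A", "actors", 2): "Bob",
}


def actors_from_document(doc):
    """Read the actor list straight out of the nested document."""
    return list(doc["actors"])


def actors_from_graph(title):
    """Follow ACTED_IN relationships that point at the movie node."""
    movie_ids = {i for i, n in nodes.items() if n.get("title") == title}
    return [
        nodes[src]["name"]
        for src, rel, dst in relationships
        if rel == "ACTED_IN" and dst in movie_ids
    ]


def actors_from_global(title):
    """Walk the subscripts under (title, "actors") in sorted order."""
    keys = sorted(k for k in movies_global if k[:2] == (title, "actors"))
    return [movies_global[k] for k in keys]


print(actors_from_document(movie_document))
print(actors_from_graph("Movie A"))
print(actors_from_global("Movie A"))
```

The same question ("who acted in this movie?") is answered by reading a nested field, by following relationships, or by walking a subscript path, which is roughly the distinction among the three paradigms.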

The command line

During our first session on MongoDB, we planned hands-on exercises based on interactions at the command line. They covered topics such as inserting records, queries, and updates. Since many of these exercises involved typing commands in JSON notation, they took much more time than we had initially anticipated. This is not a criticism of the JSON format, which we found to be quite intuitive and easy for the students to assimilate. Rather, it is an observation that when one has to type a good number of curly brackets, double quotes, commas, and colons, all in the right places, a significant amount of time goes to composing the line on the screen. This effort distracts students from the actual interaction with the database, which is supposed to be the focus of the exercise.
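
As a rough illustration of the typing burden, the sketch below builds the text of one insert command in the style of the legacy `mongo` shell (`db.<collection>.insert(...)`) and counts how much of it is punctuation. The collection name `movies`, the sample document, and the helper `shell_insert_command` are assumptions made for this example, not the exercises used in class.

```python
import json


def shell_insert_command(collection, document):
    """Build the text of a legacy mongo-shell style insert command."""
    return "db.{}.insert({})".format(collection, json.dumps(document))


# Hypothetical movie record; the fields are illustrative only.
movie = {"title": "Movie A", "year": 2012, "genres": ["drama", "comedy"]}

command = shell_insert_command("movies", movie)
print(command)

# Count the characters that must land in exactly the right place.
punctuation = sum(command.count(ch) for ch in '{}[]":,')
print(punctuation, "of", len(command), "characters are JSON punctuation")
```

Even for a three-field record, a sizable share of the keystrokes is structural punctuation rather than data, which matches what we saw students struggle with.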

We also found that students are not very familiar with the use of the command line itself. Curiously, this finding was also a topic of discussion at the Open Source workshop organized by OpenHatch at RPI in April 2012. We have found that students in engineering do not necessarily get introduced to the command line, and unfortunately spend too much of their time interacting with GUIs, IDEs, and more recently, Web interfaces. These are all fine, but a "geek" should know the command line in order to be proficient and to get access to the power tools of the trade. Something must be done on this front in the academic community; there are some efforts from our ITK BarCamp project.

The distractions with the command line slowed the hands-on exercises more than we had anticipated, and consumed some of the time that we had initially allocated for TBL group discussions.

Since we did not have time to introduce specific training on the command line, in the second session, when covering Neo4j, we ironically focused on using the web interface to interact with the database. In the case of the graph database, we found the graphic representation that displays the current state of nodes and relationships in the graph extremely helpful. Data entry and simple queries were also done via the data browser provided in the web interface. For advanced queries, we used the Gremlin command line built into the web interface. Gremlin allowed us to compose queries for very interesting questions, such as: "List the movies that friends of your friends like."
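
The logic behind a question like that one can be sketched in a few lines. The example below is plain Python over in-memory dictionaries, not Gremlin and not the query we used in class; the people, the movies, and the choice to exclude yourself and your direct friends from the "friends of friends" set are all assumptions made for illustration.

```python
# Hypothetical social graph: who is friends with whom.
friends = {
    "you": {"alice", "bob"},
    "alice": {"you", "carol"},
    "bob": {"you", "dave"},
    "carol": {"alice"},
    "dave": {"bob"},
}

# Hypothetical LIKES relationships from people to movies.
likes = {
    "alice": {"Movie D"},
    "carol": {"Movie A", "Movie B"},
    "dave": {"Movie B", "Movie C"},
}


def movies_liked_by_friends_of_friends(person):
    """Two hops along friendship edges, then one hop along likes."""
    direct = friends.get(person, set())

    # Second hop: everyone who is a friend of a direct friend.
    second_degree = set()
    for friend in direct:
        second_degree |= friends.get(friend, set())

    # Assumption: the person and their direct friends do not count.
    second_degree -= direct | {person}

    movies = set()
    for friend_of_friend in second_degree:
        movies |= likes.get(friend_of_friend, set())
    return sorted(movies)


print(movies_liked_by_friends_of_friends("you"))
```

In a graph database the same traversal is expressed as a chain of steps along relationships, which is why this kind of question is so much more natural there than in a set of relational joins.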

The server

Making a server available where the students could log in to perform their practical exercises turned out to be surprisingly difficult. We approached this by setting up a Linux machine in the Amazon EC2 cloud, and faced the fact that basic security practices got in the way of the training.

Yes, we are aware that security is important in production, but this made us wish for a "Hackerspace" server where students could work in a sandbox environment without having to deal with passwords and ssh keys. In that way, we could have focused the learning on the schema-free aspects of the databases, rather than dedicating time to making sure each student gained access to the server, particularly given that the server was a disposable one, intended only to be used in the class.

This experience was consistent with the challenges we faced when doing practical work in our Open Source Software Practices class at RPI. There was a need for "access to servers" intended to be used for student work, where security and structure can be sacrificed in exchange for agility and ease of access.

Working in the common database

In all the course sessions (MongoDB, Neo4j, and M), the server was purposely set up in such a way that all the students were adding data to the same database. This was done with the goal of letting them see how the data grew as all of them worked together. This was successful from the point of view of showing that a real database installation will get continuous updates from many different users, and that issues such as data inconsistency and duplication will naturally arise. The collaboration within the group was also a glimpse of how a large number of participants can work together in an open source community.
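
A small sketch of how such duplication and inconsistency might be detected after the fact follows. The records, the `added_by` field, and the normalization rule (lowercase, collapsed whitespace) are all hypothetical; this is not data from the class.

```python
from collections import defaultdict

# Hypothetical entries typed by different students into one shared database.
records = [
    {"title": "Movie A", "year": 2010, "added_by": "student1"},
    {"title": "movie a ", "year": 2010, "added_by": "student2"},
    {"title": "Movie B", "year": 2011, "added_by": "student3"},
    {"title": "Movie A", "year": 2001, "added_by": "student4"},
]


def normalize(title):
    """Lowercase the title and collapse runs of whitespace."""
    return " ".join(title.lower().split())


def find_problems(rows):
    """Return (duplicated titles, titles whose entries disagree on year)."""
    groups = defaultdict(list)
    for row in rows:
        groups[normalize(row["title"])].append(row)

    duplicates, conflicts = [], []
    for title, group in groups.items():
        if len(group) > 1:
            duplicates.append(title)
            if len({row["year"] for row in group}) > 1:
                conflicts.append(title)
    return duplicates, conflicts


duplicates, conflicts = find_problems(records)
print("duplicated:", duplicates)
print("inconsistent year:", conflicts)
```

Because none of these databases enforces a schema, checks like this are left to the application or to the users, which is part of what the shared exercise made visible.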

M, a hierarchical database (our third session)

The introduction of M and the hierarchical database concepts went very smoothly in the third session. We found that once the students had been exposed to the document and graph paradigms, the hierarchical one was easy to absorb.

During the exercises we used minimal features of the M language, and purposely focused on differentiating the language from the hierarchical aspects of the M database itself.

One of the main changes we introduced in this third session was the use of two specific step-by-step lessons. This was done in response to the observation that attention was diluted when we expected students to do a combination of reading and hands-on exercises in an 80-minute class. This worked nicely, because with the lessons more time was dedicated to interacting directly with the database. In retrospect, most of the tutorial reading could have been incorporated into the homework that students regularly did to prepare for a TBL session. This again would be consistent with the approach that Salman Khan puts forward in Khan Academy, where the roles of homework and classwork are reinvented.

Why open source

The choice of using open source databases for this course was based on the following principles:

These are databases that are widely used in industry and government.
The licenses allowed us to make unfettered use of them for academic purposes.
Their installation and configuration were within the reach of the class instructors.
Students will have access to them any time with the simple effort of installation.

There is also, of course, our commitment and conviction that open source and education go hand-in-hand and promote each other.

Today, academic institutions are not taking as much advantage as they could of the powerful resources made available everywhere by open source projects. On the other hand, open source communities are also missing the opportunity to engage the minds of bright young college students who can make great contributions to open projects.

Wrapping up

A fourth and final session was held on Thursday, November 15, to wrap up the series. Students reviewed the concepts of the three NoSQL databases and evaluated their experience of the sessions themselves (more on this soon).

We plan to repeat some of the sessions in another database class at SUNY by the end of November, and also in a class in the MBA program in the first week of December.

For the Spring semester, we are planning on incorporating the concepts of databases with web front ends, since they are such a common approach today for deploying applications in production. Think of the LAMP stack, for example.

This resonates with some of the comments that Salman Khan makes in his book regarding the current artificial compartmentalization of education. Here we have a case where web design and databases are taught as two separate topics, despite the fact that in practice they are usually deployed together as part of one common application. One could argue, then, that a database class and a web development class should be more tightly integrated in order to reflect the reality of engineering and business practices.