6.893 Class Project Information Page

This is a list of possible class projects related to database systems. Projects are to be done in teams of 2-3 people. You are not required to choose a project from this list, but any other project must be approved by the instructor. In general, a good source of projects can be people in your research group who have challenging data management issues (simply designing a schema and installing an off-the-shelf database is not a challenging issue!). Other possibilities include implementing one of the systems we have studied (or will study) in class -- for example, you might add support for a new type of index to the Postgres open source database, or implement optimistic concurrency control (see the Red Book) and compare it to locking-based concurrency control.
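
To give a sense of the scale of the optimistic concurrency control idea, here is a minimal single-threaded sketch of backward validation. It is an illustration only, not a prescribed design: the `Store` and `Txn` classes and their methods are hypothetical names, and the validation and write phases are assumed to run inside one critical section.

```python
class Store:
    """Hypothetical in-memory key-value store using optimistic validation."""

    def __init__(self):
        self.data = {}
        self.commit_log = []  # write sets of committed transactions, in order

    def begin(self):
        return Txn(self, start=len(self.commit_log))


class Txn:
    def __init__(self, store, start):
        self.store = store
        self.start = start    # number of commits that existed when we began
        self.read_set = set()
        self.writes = {}      # buffered writes, applied only on commit

    def read(self, key):
        self.read_set.add(key)
        if key in self.writes:
            return self.writes[key]
        return self.store.data.get(key)

    def write(self, key, value):
        self.writes[key] = value

    def commit(self):
        # Validation phase: abort if a transaction that committed after we
        # began wrote any key that we read.
        for write_set in self.store.commit_log[self.start:]:
            if write_set & self.read_set:
                return False  # caller is expected to retry
        # Write phase: install the buffered writes and record the write set.
        self.store.data.update(self.writes)
        self.store.commit_log.append(set(self.writes))
        return True


store = Store()
t1, t2 = store.begin(), store.begin()
t1.write("x", (t1.read("x") or 0) + 1)
t2.write("x", (t2.read("x") or 0) + 1)
first = t1.commit()   # no conflicting commits yet
second = t2.commit()  # t1 wrote "x", which t2 read
```

Here the second transaction fails validation and would have to be rerun; a locking-based scheme would instead have blocked it at its first access to "x". Comparing those two behaviors under contention is the kind of experiment the project calls for.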

You must turn in a preliminary project proposal on 10/6. This should consist of about 1 page of text describing what you plan to study in your project, as well as a list of your team members. The proposal will not be graded, but not turning in a proposal will adversely affect your final grade on the project. I will meet with each group individually for a few minutes during the next few weeks to discuss the project proposal.

The deliverable for your project is a research paper (10-15 pages) similar to the papers we have read throughout the semester. Projects are very self-directed, so make sure your topic is something you feel comfortable building as a team over the next 10 weeks.

Possible projects (and where to look for more information, when available):

  1. TinyDB Projects: TinyDB is a database system that runs on networks of tiny, wireless sensing devices (e.g., Crossbow Motes) and allows users to collect information from them via declarative queries. There are many possible kinds of extensions to TinyDB. A simulator and hardware are available for students interested in one of these projects. Contact me (Professor Madden) for more information about TinyDB.

  2. Replicated State Machines: Explore the feasibility of replicated state machines (see your 6.033 class notes) for database replication. The challenge here is that databases don't guarantee deterministic execution of a sequence of concurrently executed queries, so some cleverness is required to keep all replicas in the same state. If you are interested in this project, contact me -- there is some basic infrastructure and thinking that has been done already about this topic.

  3. Stream Monitoring Applications: A recent trend in database systems has to do with "stream monitoring" -- we will read about the Aurora system (see the Red Book) later in the class. The basic idea is to apply database technology to continuous streams of tuples, rather than to relations stored on disk. There are several stream database systems that are available -- we have access to an early commercial system here at MIT. A good project would involve taking this system and using it to build an interesting application -- e.g., a network intrusion detector that scans data coming into a machine, or a virus tracker that catalogs viruses in incoming emails. Contact Professor Hari Balakrishnan and me for more information.

  4. CiteSeer: The PDOS research group at MIT has access to the data for the CiteSeer system, which is currently not stored in a relational database. An interesting project would be to port the system to a database engine, and extend the functionality with some new and interesting queries enabled by SQL, or to compare the performance before and after the porting. Contact Jeremy Stribling (strib AT MIT.EDU) for more information.

  5. Query-Driven Data Acquisition: Implement a software profiler or network monitoring tool that decides what to monitor (or how to instrument code or the network) based on user-specified "queries" in some language (possibly SQL) of your devising. Show that the overhead of this system (e.g., added software latency or bandwidth consumed by the monitor) is lower than with naive collection of all data. Contact me for more information and ideas.

  6. Database Design and Visualization for a (Road) Traffic Application: We are starting a new project to monitor road congestion using 802.11-equipped, battery-powered servers in individual cars. Data will be collected opportunistically, when 802.11 connections are available. We need a database system infrastructure to store this data, as well as a set of tools to display and query it. Queries will have a geo-spatial flavor (e.g., "find slow roads in a particular region"). Contact me.

  7. Personal Information Management: Build a utility for collecting and querying structured and unstructured data about a user's data, files, and general computing environment -- for example, this tool might manage lists of URLs, automatically extracting keywords about those URLs from the associated web sites and storing them in a database so that users can search for their favorite web pages without remembering URLs. Or build a system that stores URLs and keywords with downloaded files to facilitate file search. Or build a tool that keeps a database of a user's preference files (or registry entries) and allows users to roll back inadvertent and/or nasty changes.

  8. Database Performance in Haystack: Haystack is a "universal information client" useful for storing all sorts of personal data -- email, contacts, calendars, etc. It stores data in RDF (an XML-like semi-structured representation) and uses a custom in-memory database system. The Haystack team would like to be able to replace this system with an off-the-shelf database engine, but there are some significant performance issues in doing so. An interesting project would be to analyze the cause of these performance issues, devise a set of recommendations for how to improve database performance, and implement them in Haystack. Contact Professor David Karger for more information.

  9. Exploiting Structured Data in Haystack: One of the challenges of building a database system for unstructured data (i.e., data with no fixed schema) is that formulating queries over such data is very hard, because users don't know which fields to query. One interesting project would be to build a set of tools that encourage users to insert data whose schema is similar to that of records already in the system. A related challenge is to devise a programming environment that makes it easy for developers to query data that has no definite schema (but where something about the schema is known). Contact Professor David Karger.

  10. Integrity Constraints in Streaming Systems: Current stream data processing systems do not provide any way for users to specify constraints on the data that they can handle -- for example, a financial analysis application might want to verify that no purchases exceed some available balance. This project would involve defining what such constraints mean, and exploring ways to efficiently check and react to them. Contact me.
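
As a starting point for thinking about project 10, the balance example can be sketched as a check applied to each tuple as it arrives. This is only an illustration under assumed conventions: the `(kind, account, amount)` event format and the `check_balance` function are hypothetical and do not correspond to any particular stream system's API, and rejecting the offending tuple is just one possible reaction to a violation.

```python
def check_balance(events):
    """Yield (event, ok) pairs for a stream of account events.

    Each event is a hypothetical (kind, account, amount) tuple, where kind
    is "deposit" or "purchase". A purchase violates the constraint when its
    amount exceeds the running balance of the account.
    """
    balances = {}  # per-account state the constraint must maintain
    for event in events:
        kind, account, amount = event
        balance = balances.get(account, 0)
        if kind == "deposit":
            balances[account] = balance + amount
            yield event, True
        elif kind == "purchase":
            if amount > balance:
                yield event, False  # violation: reject, balance unchanged
            else:
                balances[account] = balance - amount
                yield event, True
        else:
            raise ValueError(f"unknown event kind: {kind!r}")


stream = [
    ("deposit", "a", 100),
    ("purchase", "a", 30),
    ("purchase", "a", 80),
    ("purchase", "a", 70),
]
violations = [event for event, ok in check_balance(stream) if not ok]
```

Even this small case shows the questions the project would need to answer: the check requires state that grows with the number of accounts, and the system must define what happens to a violating tuple and to tuples that arrive out of order.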

    Last modified: Wed Sep 29 22:30:56 EDT 2004