CSE-506 (Fall 2009) Homework Assignment #3 (100 points, 20% of your overall grade) Version 3 (11/3/2009) Due Thursday 12/10/2008 @ 11:59pm * SCHEDULE 1. The project begins on Wednesday 11/4/09, starting with a discussion of your project idea with the instructor and/or TA. The discussion is optional but strongly recommended: this way you can ensure that you understand the project goal correctly, you estimate the amount of work reasonably and the number of people in your group is appropriate. 2. You must submit a detailed 2-3 page design/plan document by Friday 11/13/09 or sooner, which outlines your plan for this project, motivation, design, etc. You must commit the report to CVS using the name "design.txt" and ALSO leave a printed copy of the report in the mailbox outside my office by 5:00pm (not midnight) on 11/13/09. This document will be reviewed by the instructor and returned back to you within 24-48 hours. This is very useful to help you fine-tune your project and ensure you're on the right track. In fact, the more you discuss your project with the instructor, the better your project will be. Note: you should start on your project asap: don't wait until after I returned the graded design docs to get started! 3. If you work in a team/group, then you MUST declare your project group by 11/13/09, and get a shared CVS repository. 4. The final project source code and final README report are due on Thursday 12/10/2009. 5. Demos will be held between Friday 12/11/09 and Sunday 12/13/09. Unless otherwise indicated, all deadlines are 11:59pm on the due date. Late submission penalties will apply with one exception: you are NOT allowed to miss your demo. Many demos will be scheduled in series; if you miss yours, or you are late, your demo slot will end precisely on the time it's supposed to, to allow the next group to start on time. * TEAMS You can work alone or in groups of 2-4. You must declare your team members if you work in a group. Most projects are designed for 2-3 people. But I like to give each team enough flexibility to design their own project's scope, modified from my brief descriptions below. It allows you to adjust the number of people required for the project by adding or omitting some particular features. If you work alone, I will take that into consideration when assigning your final course grade (as per the posted class grading policies). At the same time, if you work in a group of 3-4 people, I will expect a more substantial project done. In general the amount of effort per person should be 1-1.5x times that each of you spent for HW2. Note, however, that for this project you have just 4 weeks: this means you should start doing some of the work/coding before you get your design document reviewed/approved. * GRADING This project is worth 20% of your overall course grade. The grading is divided as follows, on a scale of 0-100 points: - initial design/plan document: 5 points - final detailed README report (design/implementation/etc.): 20 points - demo: 40 points - code inspection and independent testing by graders: 35 points Each document you supply/commit should have a project title, date, names and emails of students in that project group. Be sure to spell-check and proof-read your document. 1. Final README report should be detailed. I prefer if you follow the model of a conference-quality research paper, with sections such as Abstract, Introduction, Background, Design, Implementation, Evaluation, Related Work, Conclusions, Future Work, and a professional Bibliography. 2. Demo: this is your chance to show the graders your work. You will start compiling it from a pre-checked out source repository that the grader checked out. Then you will demo your work in VMware. Be sure to show off the basic functionality, as well a any special and/or advanced features you've implemented. (i.e., impress us! :-) The graders will be asking ALL of your team members questions to ascertain your level of knowledge. It is imperative that all team members be ready to answer questions. The demos will be 45 minutes each (and while that may seem too long, trust me that you often run out of time and wish you'd have more time.) It's important therefore that you plan your demo time well, leaving time for the TA to ask questions. 3. Source code and independent testing: here we will ensure the quality of your code. Don't forget that all code you submit must comply with Documentation/CodingStandards and pass the compliance script in scripts/checkpatch.pl (run checkpatch in strict mode please). Also be sure that your code passes LOCKDEP testing. Throughout the project, you are *STRONGLY* encouraged to discuss your documents and progress with the instructor and/or the TA. You are welcome to give me drafts of your documents before the deadline, or come discuss your progress with me: I will give you "free" advise and review your documents/code/progress for free. This way you can have enough time to improve your project before it gets officially graded. In the past, people who came and talked to us frequently, often did much better in the project, because you can get an early feedback from us about your work. Extra credit: there is none for this assignment, but given the open nature of this project, we will grade more openly. For example, teams with particularly clever solutions, exceptionally good code, and/or a "research-like" final report, may get some extra credit points. As an added motivation, I am looking for a small number of individuals to join my research lab, FSL, as project students. I am also looking for 1-2 RAs. Good candidates to join FSL are those who do very well in this and other classes, and especially in this project, demonstrating good kernel coding abilities, good technical writing skills, and an excellent understanding and motivation for conducting research. Interested students should contact me directly. * SUBMISSION You will submit all material via CVS to your "hw3/" private or group directory: sources, design docs, etc. Late submission penalties will apply as usual. If you work as a group, you MUST use a shared group repository: you can use the shared group repository from hw2, if the you stay in the same group, or ask me for a new CVS group repository (email me the names and OSLAB account names of all new members). * PROJECT SELECTION You may choose from any of the projects below, or design your own project. If you choose one of the projects below, you can modify it (add or remove stuff) based on your group's size and you particular interests. If you do design your own project, you MUST get the instructor's approval ahead of time (WELL BEFORE 11/10/09!). Be sure to design a project that is just the right amount of work for your team size (often, students pick projects that are too ambitious for the given amount of time available). With each project description, I also listed how many people this project is suitable for in [square brackets]. The exact group size will depend on discussions with me to define the scope. I also include URLs with each project which can give you some useful info: papers to read, software to download, etc. In addition, if you like to do more projects in the area of operating systems, speak to me so we can arrange for a project that can be carved into different components (one component will be done as class project, the rest continued as a spring/summer project, with an eye toward submitting a technical conference/workshop paper). If you want to get some ideas of other possible projects, feel free to discuss it with me, with the TA, or any other class student (online or offline). You're also welcome to discuss possible OS projects with other professors and your adviser: as long as the project has a decent OS "flavor", I'll consider it. Note that not every OS project requires writing lots of kernel code; OS projects could include small amounts of instrumentation, tracing, monitoring, analysis and benchmarking. You may also consult the FSL projects page for ideas: http://www.fsl.cs.sunysb.edu/all-projects.html Many of those projects have parts that are suitable CSE-506, although some of them have already been done. You could read some of the papers we have published, and especially look at the "Conclusions and Future Work" section at the end of the paper: this could tell you what _new_ work you might be able to do that hasn't been done before. If you're interested, come see me after you have read a bit about the existing projects. You may also consult any other projects done in the department, as long as they have a significant systems/OS component. * LIST OF PROJECTS (check for updates after the end of this list) The list of projects below does not have any particular order. For most of these projects, you're expected to work with these source-code management tools: git and guilt (both will be provided in the cse506 VM image). ------------------------------------------------------------------------------ SYSCALL-INHERIT: Linux doesn't allow modules to override system calls. BSD does allow it, by inheriting the syscall dispatch vector per process. A new module or policy can make a new process use an entirely new set of syscall vectors atomically. Study the BSD implementation and implement some or all of it for Linux; show that it works by providing a kernel module that loads up dummy syscall wrappers for some syscalls. You will probably have to change the task structure to accommodate a syscall vector, and change fork/exit/clone. [1-4] ------------------------------------------------------------------------------ DCACHE/ICACHE ACLs: As some of you have noted in HW2, once an object is cached in the dcache, permission checks on it are ignored. This is bad if you want to prevent some users from accessing a file, or even being able to see that a file/directory exists. Figure out how to add ACLs to cached dentries/inodes. When a cached object is about to be returned to a user process, you should check to see if the process has permissions to access this object, without changing the object's status in the cache. When a new object is about to be inserted into the dcache, you have to figure out what ACLs, if any, should be attached to the object (you'd probably have to change struct dentry). Also it'd be nice to have a way to dynamically add/remove ACLs to objects in the cache. You may have to add a new dentry operation called ->d_check_perm, perhaps other ops as you see fit. This op will have to be invoked for every cached object before it's used (i.e., VFS modifications). Demonstrate the usefulness of this project by modifying one or more file systems such as ext2 or ecryptfs to support said ACLs in these new dentry ops. Your design should be extra careful about performance overheads, because the dcache is in the critical path of the OS. [1-3] ------------------------------------------------------------------------------ Persistent obj->private fields: Any data you put into an object's private field is only saved in memory. Devise a method by which individual inodes, dentries, and/or super_block objects, can have their private data saved automatically and persistently in the file system. You can create a new field in each data structure, called ->persistent if you want. Your system should be flexible such that persistency will be available on per-object basis, iff the object->flags has a special PERSISTENT flag on. The file system should set this flag on/off as it sees fit, but it should be the VFS which does all the hard work to save the per-object data -- with the help of the file system (e.g., the file systems ->read and/or ->write methods may have to be called, or new methods you devise). For each object type (D, I, SB), figure out what are the best ways to save the info: append/prepend to a file; global info; separate files; databases; EAs; etc. Note also that you'll have to figure out how to re-load any previously saved persistent object data, when an object is re-instantiated in memory. [1-4] ------------------------------------------------------------------------------ EXT2SUM: Add native file data checksumming support into ext2 (or ext3, ext4, or any other disk-based f/s). There's a lot of flexibility in this project. You can consider checksumming for only data, only meta-data, or both. You can consider checksumming per file, page, or f/s block. The checksums can be stored in the inode, as extra data, as an extended attribute, or any other variation you choose. You can consider to checksum only regular files, or also directories and possibly other entities. You might also want to enhance the user-level utilities (fsck, mkfs, debugfs, etc.) to handle your checksumming support. Checksumming should be a mount-time option to enabled/disable it. You can pick one checksumming algorithm, or several out of what's crytoapi supports. Optional: think about security and administration: how to create checksums safely (only "approved" users should be allowed to that, possibly in single-user mode or the like. Your code must pass the full LTP regression test suite (every test which passes on a vanilla f/s must pass when you add the checksumming support). [1-4] Tripwire project (google for it) ------------------------------------------------------------------------------ DedupFS: Modify any file system you want, or write one based on wrapfs, to perform data de-duplication. This is a "compression" technique that shares any of data chunks that are identical within/across files, so duplicated data is saved only once. To identify chunks, a checksum is often used. To know when duplicated chunks are no longer needed, refcounts are used. Note: you may rely on ext2sum or use btrfs for checksums, and build your dedup support on top of it. [2-4] Basic Deduplication: There are three phases to deduplication: (1) segmenting, (2) labeling, (3) seg storing. For segmenting, simple deduplicating systems simply break files up into pages. For labeling, a basic checksum is used, like an md5sum. For seg storing, an on-disk index is used in conjunction with a block allocator. Since you can assume that blocks are all of the same size (4KiB) your allocator implementation can be very simplistic. To store a segment, you first allocate a block, checksum the dirty page, write back the dirty page to the allocated block, and then insert a key-value pair into your index. For your on-disk index, you might consider using a hash table. The key idea is that if you have already stored a block with a particular checksum, you won't store it twice. So instead of storing two blocks, you store one block and two check sums. This is where the compression comes into play. You will also have to store inode attributes, as well as directory files. The standard way to store this information is via the existing file system. Typically one would write the checksums in place of the blocks to the underlying file, and store the blocks in a separate seg store as described above. During a read, you will read the checksums from the file, and retrieve the blocks paired with the checksums from the seg-store into the page cache. You could use a stackable file system to achieve this purpose. Be sure to report the correct file size (i.e., size as opposed to (size/page_size)*checksum_size). Finally you do not have to deal with the case where the underlying file system already has data in it. Just assume these users will simply copy this data back into your file system if they want to access that data alongside deduplicated data. You _do_ have to support remounting the file system of course! Advanced Deduplication - Bloom Filtering: There are two major performance issues with deduplication. The first is that you must perform an index lookup for every single block to see if it is already in the seg store. When the index becomes larger than RAM you will have to perform random I/O for every single write, even appends. The standard way to avoid this is to use a Bloom filter. A Bloom filter is a mapping of key-value pairs. Bloom filters do not have bucket lists. If there is a collision during an insert into the Bloom filter, we do nothing. Every time you insert a new pair into the seg store, you will first perform a lookup of on the Bloom filter, and _only_ if the bit is '1' will you proceed to check if the checksum has been stored in the seg store already. If it is not '1' you will store it in the seg store, and insert a pair into the Bloom filter. This will allow you to avoid performing lookups of checksums you know can't have been inserted yet. Although a negative cache (like negative dentries) would achieve the same purpose, a Bloom filter is guaranteed to be only as large as the hash table you allocate, and even this table is merely an array of bits, so Bloom filters achieve better economies of RAM. This is important as it allows you to avoid more lookups. However there are some issues that you will have to figure out. Is it necessary to store the Bloom filter across umount and mount? How do you store the Bloom filter if it is? Most queries are on newly inserted items, but Bloom filters age (old checksums will collide with new checksums, forcing a lookup only to find the checksums are different). Can you incorporate the age of the checksum in your Bloom filter decision? Make sure your design is _correct_! Advanced Deduplication - Block Grouping: Another major performance issue for deduplicating file systems is that they will write blocks normally co-located in a file in random order to a seg-store. To resolve this issue you should implement a clustering mechanism for your block allocator. Your allocator can take an inode and block offset (in the file) as a hint, and during allocation of your block in the seg store you would pass the inode and file offset in. The allocator would try to allocate blocks belonging to the same inode from the same cluster. The blocks in each cluster would be sorted by their block offset. You will have to handle page overwrites in an intelligent manner. You will not want to simply append new page values to the end of a cluster, as this will cause fragmentation, but instead you will want to have a configurable option (during format) that allows for some percentage of each file to be used for page overwrites. Distribute holes uniformly through out the cluster, and when you must perform a page overwrite, utilize the nearest hole. For instance: We format the file system with a hole of size 2 for every 8 blocks in a cluster, and a cluster size of 20 blocks. Notice that we mandate the cluster size is a multiple of (2 + 8). x - used block | - cluster start/end _ - free block |____________________| 01234567 01234567 Now the user writes some blocks to this cluster for inode i in order of page offsets 0, 1, 2, 3, and 4: |xxxxx_______________| 01234567 01234567 Now the user writes some blocks to 7, 8, and 10: |xxxxx__x__x_x_______| 01234567 01234567 Now the user inserts the value of block 8 in some other file, and the index links that file's checksum at that offset to this particular block. So if we overwrite block 8 we can't simply delete it. That's OK, we'll use our overflow hole. |xxxxx__xx_x_x_______| 01234567 01234567 We put the new value for block 8 into the overflow hole. This scheme assumes that duplicate blocks will typically be used with other block of the same value in the same order. For instance you won't just use block 3 in another file, but blocks 0, 1, 2, 3, and 4. So fragmentation induced by this scheme should be small. You must design a reasonable method of deleting blocks no longer pointed to by any checksum in the index (in any file). Group Sizes: Depending on your group size, you will be responsible for more or less of this total system. 2 person group: You will be responsible for Basic Deduplication. 3 person group: You will be responsible for Basic Deduplication and Bloom Filtering. 4 person group: You will be responsible for Basic Deduplication, Bloom Filtering, and Block Grouping. Conclusion: Deduplication is a powerful way to compress infrequently used portions of a large file system. It is an exciting new area and some of the most challenging problems are intelligently handling the Bloom filtering policies, and the block grouping policies. There is currently no good deduplication file system for Linux. Having a stackable implementation would be a perfect fit letting people continue to use their underlying file system, and might make a lot of people excited! Good luck! Sponsor: FSL PhD student Richard "Rick" Spillane. You may contact Rick as well to get more details. ------------------------------------------------------------------------------ Any-path checkstack.pl: checkstack.pl only shows the stack footprint of individual functions. Enhance the script (and the kernel as needed), or write a new script/program, that calculates ALL possible paths through a compiled kernel (and modules), and reports the call paths which'd have the most stack use. In particular, it should report paths that exceed the current kernel's max stack size. It's important that your system will not report false positives (i.e., call paths that can never occur); you also have to take into account that there's a lot of function pointer indirection (e.g., operation vectors) that are not filled until run time. Also stackable file systems and any levels of indirection (e.g., LVM/RAID) increase stack usage. Find plausible paths that indeed consume a lot of stack space, and demonstrate them by configuring a kernel and modules, and using it, such that it triggers a stack overflow oops. Yes, in this project, your success will be measured by how many oopses you produce. (As an added incentive, if you manage to do this project by only modifying user-level code/scripts, there's a good chance it can be contributed and accepted by Linus into the mainline kernel.) [1-3] ------------------------------------------------------------------------------ EXT2INSDEL: Support native insertion/deletion of blocks into ext2 (or any other disk based f/s of your choosing). The idea here is to allow block-aligned insertions/deletions into a file, and all you'll need to do is move the physical indirect inode pointer, and not the whole data blocks: this should improve insert/delete performance by a factor of log(n). You'll have to provide a system call API for user programs to do that; ideally the user API should not be limited to block aligned units, and the kernel will take care of mapping byte-offsets to block offsets as efficiently as possible. [1-3] ------------------------------------------------------------------------------ EXT3INSDEL: Do the same as ext2insdel, but for a journalling file system. Here, you have to deal with log records in the journal which have to be coded for the right ins/del operation. This is a bit more challenging, but will provide a more reliable solution and possibly easier at times, due to ext3's guaranteed write-ordering semantics. [2-4] ------------------------------------------------------------------------------ EAfs: Extended attributes are useful, but not every file system supports them. Implement a way to add EAs to any file system by storing them in plain files. The EA files can be per file/directory, or globally (e.g., one "database" file holds all of the EAs). You can implement this as a thin stackable file system, possibly based on wrapfs; or as native VFS support. Prove that your system works for a file system that doesn't have EAs now (e.g., vfat or nfs). You need to show that EAfs works for at least 1-2 file systems, the more the better. Note that since EAs are often used for security, you may wish your EA files to be secure (e.g., hidden from view, inaccessible to anyone but the kernel, etc.). [1-2] ------------------------------------------------------------------------------ ACLFS: ACLs are very useful. Design a stackable file system (based on, say, wrapfs), which can "stack" on top of any other file system, and automatically add ACL support. Use Linux-supported APIs for ACLs, using extended attributes (EAs), or any other mechanism. If using EAs, they should be supported already in the lower file system. Implement as many ACLs as possible (by UID, GID, PID, SID, time-of-day, etc.). You may combine this project with EAfs. [1-3] ------------------------------------------------------------------------------ Gimple DB: Whole-Kernel Analysis This project involves aggregating an entire kernel's GCC intermediate representation into a database, making it available to analyses that span the entire kernel. This is suitable for 2-4 students with good experience with compilers and programming languages. Sound static analysis of an operating system kernel requires an accurate view of kernel functions as they will actually execute. While not insurmountable, this obstacle can turn an interesting research problem into an exercise in frustration for anyone who does not want to deal with the myriad subtleties and gotchas inherent in low-level C code, like the kind of code that appears throughout the Linux kernel. At the source level, a soup of macros and conditional compilation make it difficult to learn what code actually executes and what data it touches. That problem goes away at the binary level, but so does the rich type information that is necessary for useful analysis. The compiler, however, has knowledge of exactly how each function should execute (or else it wouldn't be a very good compiler), in the form of GIMPLE: GCC's three-address code intermediate representation. GIMPLE three-address statements are already broken up into basic blocks, and each statement is annotated with all the type information that the compiler has access to, as well as all the extra information (such as SSA use-def chains) that the compiler uses to find optimization opportunities. The DB Dump GCC plug-in persists all this information to an SQL database, so that it is possible to search the GIMPLE of even a very large program. As is, however, the database is not ready for serious analysis. The main goal of this project is to devise useful whole-kernel static analysis questions and then to modify the GIMPLE database schema so that it is possible to efficiently answer these questions. 1. Potential Static Analysis Questions: For full credit, the project should result in (a) an interesting (but simple) statistic about kernel structures and (b) a more sophisticated inter-procedural analysis. These examples would make a good project: - Simple statistical query: Which global variables have the most accesses? Which variables are accessed from the most kernel components? - Inter-procedural query: For each function, what locks might be held when the function is called? What locks are always held? 2. GIMPLE DB Schema Answering these kinds of questions requires knowledge of inter-file kernel structure. The current DB Dump implementation dumps files in isolation: it does not discover any sharing between files. For example, if multiple files use the struct inode datatype, each of those files will maintain its own copy of that type, making it difficult to make the query "find me all statements that access a struct inode variable." To enable the queries in part 1, this project should modify the DB Dump plug-in to discover this kind of sharing. The finished DB Dump will play the role of the linker, tying together shared variables and datatypes. Sponsor: FSL PhD student Justin Seyster. You may contact Justin for more details. ------------------------------------------------------------------------------ FAKEFS: Fake File System: The choice of a back-end file system affects end-to-end performance of a Network File System (NFS) significantly. Consequently, when comparing different NFS implementations, the lower file system should be the same for all implementations in comparison. But which specific file system should be used? One of the ways to perform fair NFS benchmarking is to "fake" the back-end file system --- a fake file system --- that pretends that requested operation is completed, but does not perform any I/O operations, does not store bulk of data in the memory, and uses as less CPU time as possible. I.e., such file system does its best to return control back to NFS server as soon as possible. In this project you will need to implement a fake file system (fakefs). While this file system should not store any data, it has to maintain a minimal amount of metadata in order not to break basic the functionality provided to upper levels - VFS/NFS/User application. E.g., file name and file length are definitely part of such crucial information. When data is requested from the fakefs, zero pages should be returned back, provided that the current offset in the file is less then the faked file size. The functionality provided by the fakefs is somewhat similar to sparse files but on a wider scale. You may base fakefs either on wrapfs or tmpfs. Fakefs must be optimized to use as little memory and CPU time as possible. At then end you should present a working fakefs implementation, which passes certain basic LTP tests. You will also need to run filebench on fakefs locally and through NFS and present the results. Resources: - Linux Test Project (LTP): http://ltp.sourceforge.net/ - FileBench: http://www.solarisinternals.com/wiki/index.php/FileBench Group size: 1-2 Sponsor: FSL PhD student Vasily Tarasov. You may contact Vasily as well to get more details. ------------------------------------------------------------------------------ Friendly Apps: What if user applications are friendly? An increasing number of research projects exploit the possibility to trade-off reliability for performance. The reasoning behind is that many applications have inherent error tolerance. Modern operating systems assume that any running application might be malicious (deliberately or not). As a result, a code of typical OS abounds in security and reliability checks. We claim that in certain situations, a lot of security burdens might be omitted in the name of performance. In fact, nowadays a significant number of applications are carefully tested or formally validated. Consequently, the probability of application's invalid behavior is low. If the application in question performs non-critical service, security/reliability checks might be omitted, increasing the program's performance (and reducing the code size). We call applications that do not exhibit malicious behavior as "friendly applications". In this project we want to classify security checks performed in the Linux kernel, account the chances of each security check class to be exploited by application and remove security checks class by class, measuring kernel performance at each stage. As a result you should provide: - a patch against the recent kernel version with a lot of: #ifdef CONFIG_SECCHECK_CLASS1 ... #endif #ifdef CONFIG_SECCHECK_CLASS2 ... #endif ... - GNU/Linux booting for each configuration above. Notice that it might require the modification of some user programs used during the boot, if they are not friendly. Let's call this operation forcing the friendship. :) - Test software and test results, proving (or denying) the concept. Examples of the checks that might be omitted: if (!(file->f_mode & FMODE_WRITE)) return -EBADF; if (unlikely(!access_ok(VERIFY_READ, buf, count))) return -EFAULT; More complex examples can arise inside schedulers and other kernel subsystems. In this project you will need to look through and understand a lot of the kernel code (not only file system related). Group size: 1-2 Sponsor: FSL PhD student Vasily Tarasov. You may contact Vasily as well to get more details. ------------------------------------------------------------------------------ Accurate Block Trace for Writes: Linux blktrace is a block layer I/O tracing mechanism which provides detailed information about request queue operations up to user space. We would like to use this mechanism to account disk power consumption per process. However, there is a serious technical limitation in blktrace that does not allow us to implement our idea right away. Below is a typical blktrace output when you run a /bin/dd process: % blktrace -d /dev/sda -o - | blkparse -i - ... 8,0 1 11 0.009507758 C W 223490 + 56 [pdflush] ... You may notice that although "dd" is the process responsible for the write, blktrace reports pdflush as the origin process for this write operation. The reason is that writes are typically asynchronous in Linux and pdflush is the process that flushes dirty pages to the disk. Even worse, the design of the Linux kernel currently does not maintain information about the process that made the page dirty. Often, the process that dirtied the page might have even been terminated already at the point when the page is flushed. In this project you will need to modify the memory management system in the way that we can keep track of processes responsible for writes. Some dirty pages might be shared among several processes, in this case both processes must be assigned to that dirty page. PID + should be used as writer's identifier. You will also need to modify blktrace in the way that it displays the information propagated from the upper level. Resources: - Blktrace git: http://git.kernel.dk/?p=blktrace.git Group size: 1-2 Sponsor: FSL PhD student Vasily Tarasov. You may contact Vasily as well to get more details. ------------------------------------------------------------------------------ VSNFS: very simple network file system: Last year several CSE506 students have developed vsrfs: very simple real (i.e., disk-based) file system. An example of such file system and motivation behind it can be found at . In this project we want to implement a very simple NETWORK-based file system. Such file system might be used as a template for a quick creation of network file systems tuned for specific usage scenarios. First, you will need to design a simple protocol describing client and server communication. The protocol should be rich enough only to support basic file system operations, such as lookup, file creation and removal, reads and writes. All other operations might be omitted for simplicity. Mount and umount operations should also be described within the protocol. You may assume that server has only one export point and can service only one client at a time. You can also assume that underlying protocol is perfect, as well as server and clients are: they never crash and power never goes down. After the vsnfs protocol is designed, you will need to implement both server and client sides. We recommend to use the Remote Procedure Calls (RPC) subsystem when implementing client-server model. Here is an example how vsnfs can work. On the server node user invokes # vsnfsd.start /export/dir command, that starts a vsnfd - server kernel thread. vsnsd registers a new RPC server in the RPC framework and waits for the remote procedure calls from the client. Let's assume that on the client user mounts vsnfs, reads a file, and umounts vsnfs: # mount -t vsnfs hostname:/export /mountpoint # ls /mountpoint f1 f2 f3 # cat f1 hello world! # umount /mountpoint The session between client and server may look like that: ---client---- | ----server---- | mount_req{"/export/point"} -----------------> | <---------- mount_reply{"ok", root_fhandle} | readdir_req{root_fhandle} ------------------> | <------------ readdir_reply{"ok", "f1", "f2", "f3"} | lookup_req{root_fhandle, "f1"} -----------> | <------------- lookup_reply{"ok", f1handle} | read_req{f1handle} ---------------------> | <------------- read_reply{"ok", "hello word!"} | umount_req{"/export/point"} ---------------> | For simplicity you can omit permission checks, limit maximum number of files in the root directory, and always transfer the files completely over the network. Server maintains a table that maps file handles to the local file system inodes. It is up to you if the implementation will work on top of TCP, UDP, or both. You can use caching for performance optimization, but be careful to synchronize caches in time. Interoperability between different hardware architectures can be omitted. NFS implementation in the Linux kernel can be used as a source of ideas or starting point. The resulting file system should pass as many tests from the LTP suite as possible. You also will need to run certain micro benchmarks using Filebench and present the results. Ideally it should work on top of the vsrfs, so that we have a vsrfs + vsnfs working couple. Resources: - RFC5531: "RPC: Remote Procedure Call Protocol Specification Version 2" - Linux Test Project (LTP): http://ltp.sourceforge.net/ - FileBench: http://www.solarisinternals.com/wiki/index.php/FileBench - RFC1094: "NFS: Network File System Protocol Specification" Group size: 3-4 Sponsor: FSL PhD student Vasily Tarasov. You may contact Vasily as well to get more details. ------------------------------------------------------------------------------ 1FFS: One-File File System "Caihong" is Chinese for rainbow and refers to the many different colors of cached pages on top of a single object store. Introduction: Programming in the kernel is very difficult. Commonly used systems should be optimized for performance and placed within the kernel. However, there is a significant risk of new instabilities and massive development costs. The immense number of files users must manage today has caused many developers to push the most complex components of their storage software to outside of the kernel as performance concerns become dwarfed by architectural complexity concerns. Users have already started to automatically generate data via digital photos and caching internet interactions. As the number of files users generate steadily increases, files will have to be automatically named and indexed by many indexes in the future, which will all run from a user-level context. The OS kernel will act as a "bottom-half" which provides service requests to a particular PC's disk(s). However, current OSes have become jacks of all trades, and the demand for file system performance has caused kernel designers to forgo the classical guarantees of fsync(2) and compromise on the potential benefits of a more comprehensive mmap(2). New architectural challenges stand in the way of fixing these issues for user-level storage software, and currently such software vendors either turn off the disk cache, or maintain a custom kernel. Furthermore, user-level file systems such as FUSE incur heavy message-passing overheads and have no way to implement a page writeback mechanism that the kernel can rely upon to reduce memory pressure. In this project you will solve the most fundamental of these problems, and develop novel solutions to an open research question. This specification will begin by providing you with the first target use-case of this work, and then move into a full specification. Finally, it will finish with several architectural suggestions that should better ensure correctness, and ease the design and implementation considerably. After the description, valid team configurations for the assignment will be discussed. Use-Case Scenario: There are several intended use-case scenarios for this work, including a user-level index on top of a kernel level object store, a user-level transactional library API, and a new kind of versatile file-system with much low message passing overheads than FUSE. I will describe the transactional library API use-case to you. In this use-case the developer wants to provide a transactional library to other developers. His library consists of two major on-disk data structures. The first is the block file. The second is the journal. Updates to the block file are always paired with a message to the journal. This journal message can either undo, or repeat the paired update to the block file. For instance: Update to block file: "Update block 32 to this new 4KiB value." Update to journal file: "Append the old 4KiB value at block 32 in the block file, and append the new 4KiB value for block 32." The update to the journal file can either restore the original value of block 32 (undo) or re-apply the intended update (redo). The key to this use-case scenario is that for each message pair, the journal message should always hit the disk platter before the block file message. This way if we lose power at any point, we will have either not modified the block file, or will have enough information to carry the whole operation through, or undo any changes it had made. Currently the transactional API uses a user-level hash-table to cache blocks. The time spent in user-level to perform basic page caching is immensely expensive. Benchmarks show that almost 50% of the overhead of the library is simply due to time spent in user-mode doing page caching. The implementer of this library would like to use mmap. He would simply mmap the block file, and mmap the journal, and keep the block file mapping 'pinned' or unable to writeback to disk, until he explicitly 'forces' or syncs the journal file mapping. Any page faulting and caching details would be handled natively by the kernel, significantly improving performance. However the kernel provides now way to pin pages except with mlock, which will kill processes that lock too much RAM, and has no two-way communication mechanism to ask the user process to relieve memory pressure due to security concerns. When the user-level transaction API _does_ write the journal to disk, it uses fsync(2) which will only schedule block writes and not actually flush the disk cache, compromising atomicity. Currently the transactional API warns that the disk cache should be turned off to avoid this problem, again heavily degrading performance. Finally, the developer utilizes the convenience of allocating files with arbitrary names, appending to them, truncating them, and removing them. These kind of 'object-store' functions are also utilized by many other user-level storage vendors. This is one of the main reasons file systems are commonly used as back end for these products, despite their many draw-backs. However the message passing overheads for many manipulations of files can be costly, and almost all file systems implement fsync(2) and mlock(2) in an efficient but incorrect manner. The Solution and Proposed Specification: This developer is just one of many that complain about the poor tools the kernel provides to user-level storage engines. You will develop a comprehensive solution to this kind of developer. Your solution will provide depending on your group size: - a single-file file system that includes: - a RAM leasing implementation - a correct implementation of fsync(2) that can run with the disk cache _on_ - a new mpin(2) and munpin(2) routine that acts on whole mappings - a user-level object allocation library - a user-level object store that can be used to - create files, - append to files, - truncate files, - remove files. Proposed Specification: Single-File File System: You will develop a single-file file system. This is a necessary step as we want to provide mmap operations on top of a block device, but specify the semantics of these mmap operations (for the purpose of proper msync and mpin). File systems are the typical servers of mmap on files in the kernel, so the most natural way to achieve this is to provide a single-file file system, that exports a partition as a large mmap-able file. There is no need to retain inode properties for this file across umount and mount, the inode can be maintained completely in RAM, and initialized with hard-coded values that set the owner to root, and permissions to 0 for ALL and GROUP (to prevent TOCTTOU attacks). Root can then change the permissions if desired after mounting. This removes the need for inode de/serialization and should simplify get_inode considerably. Mpin and munpin should be implemented with a separate on-disk swap. The reason is that if the kernel wants to relieve memory pressure, and we do not want to OOM kill the process, and not write back the dirty pages to disk, we must write them somewhere else to relieve pressure. We write them to swap. Data in swap need not be stored or retrieved across either crashes, or umount and mount (its swap and so its temporary). You should maintain a single super-block page that stores: (1) The magic number of the file system (2) The size of the swap partition for the file system The magic number is to prevent users from accidentally mounting devices formatted by other file systems and over-writing valuable data. The file system checks the magic number during mount, and if it does not find it, returns an error requesting the user to format the partition with mkfs. You should write these values to the super-block during mkfs, and retrieve them during mount. They can not be changed so there is no need to update them. Here is the usage of mkfs.1ffs: mkfs.1ffs -o Mounting 1ffs takes no options: mount -t 1ffs A RAM Leasing Implementation: One of the problems with user-level page cache implementations is that they consume kernel memory. Currently such systems use carefully tuned kernels in concurrence with malloc and the regular swap partition to handle memory tug-of-wars between user-level storage layers and the kernel. To avoid swapping to disk when the user-level storage could simply be asked to unpin some pages during a flush, you will implement a RAM leasing implementation. lease_t lease_acquire() int lease_relinquish(lease_t lease) Lease acquire will ask the kernel for a lease ID, and if granted, a unique positive id will be returned, and if not, ENOMEM will be returned. This lease ID will be used for all future requests related to that lease. int lease_truncate(void *addr, lease_t lease, size_t pages) Lease truncate will specify for a mmap-ing at address 'addr' that the user requests that 'pages' pages (4096*pages bytes) can be pinned in RAM within that mapping without incurring swapping. If the call is successful, 0 is returned, and the user can proceed to call lease_shrink, otherwise ENOMEM is returned. int lease_shrink(void *addr, lease_t lease, size_t *pages) The call to lease_shrink will block until the kernel wants to reduce this lease's number of pages, at which point processes blocking on this call will return, and the amount to shrink the lease by will be recorded in 'pages'. The process has some period of time to unpin pages before they are swapped out. The user process can repeatedly call lease_truncate to expand on a shrunken lease to get RAM back from the kernel. A Page Pinning Implementation: 1ffs consists of the block file which is exposed as a single large file as described above, but it also consists of a swap partition. This swap partition is where pinned pages are written back to disk if the kernel is suffering from memory pressure. The total sum of all outstanding leases can not exceed the size of the swap partition, to ensure that all pinned pages can be written back during memory pressure. int mpin(void *addr, size_t len_in_pages) int munpin(void *addr) The command users will use to pin pages is 'mpin' and should take advantage of existing mmap code as much as possible. For this assignment 'mpin' can effect a whole mapping (i.e., you can just mark the vma). During page writeback your 1ffs implementation should write back pages belonging to a pinned vma to the swap, and mark them clean. Munpin does the obvious thing and clears a vma as being 'pinned' and thus during writeback 1ffs will simply writeback the mapping to its original location (and not bother writing it to swap). You will have to maintain an in-RAM index that can be used to find the location of pinned pages in the swap disk. Using the radix tree implementation for this purpose may well make sense. There is no need to serialize this index to disk as the contents of swap do not need to be restored in case of crash, or across umount and mount. A Page Forcing Implementation: When msync-ing pages in an mmap belonging to 1ffs, if the MS_SYNC flag is passed, 1ffs must use an ordered write to cause a SCSI disk cache flush. This is done by calling 'set_buffer_ordered' on the last buffer head written to disk, before calling 'submit_bh' on that buffer head. This will ensure that the mapping is correctly and synchronously written to disk. If forcing a mapping where some of the pages are swapped out to the swap partition and are not resident in RAM, they will have to be read in from swap before they are written out. You will have to use your swap index to find these non-resident pages. Finally you should be able to msync (force) mpinned pages. This is a necessary feature as user-level storage vendors will want to keep pages pinned, but occasionally write them out to disk to either clean them or for durability's sake. An Object Allocation Library: 1ffs is a powerful tool in that it exposes a page-cache interface that user processes can access with few context switches. You will implement an object allocation library in user space which provides a slab allocator interface on top of 1ffs. The API will be similar to kmem_cache. You will need to allocate pages to store slab object sizes, and pointers to the first slab. Slabs may be linked lists with pointers to other slabs. You will need to store within each slab which objects are free or not. This information should be updated using mmap and msync. In addition you will need to implement a small journal. This journal will include updates to the metadata of the system (redo actions). Updates to the journal should be flushed to disk before updates to the metadata. You should mpin the metadata pages of the object store on disk. When you want to write your changes to the metadata pages of the object store, simply fsync the journal (which will have identical semantics to msync) and then msync your metadata pages. This write ordering will ensure crash-resistance of your object allocator. An Object Store Library: The object store library is built directly on top of the Object Allocation Library. You will provide 'objects'. Objects can be created, in which case the create function will return an object ID. They can be deleted. They can be appended to and expanded, and they can be truncated. They also have inodes, and their inode number is equivalent to their object ID. These objects, their metadata, their block maps, and so on, should be allocated as objects from your Object Allocation Library. Since updates to the object store metadata will also have to be consistent and crash-resistant, you will want to journal these updates as well. Having two journals is inefficient as appends to both will cause undue disk seeks, so share the Object Allocation Library journal. So you will interleave Object Allocation redo actions with Object Store redo actions. Architectural Tips and Tricks: get_block: get_block is the routine called by most of the generic vfs routines. Get_block is responsible for allocating new blocks, and fetching blocks at a given offset. Since 1ffs has a very simple lay-out, the implementation of get_block should be relatively straight forward. Use mmap: The existing mmap code is mature and powerful, including many useful helper routines for finding mappings in files, merging them, and so on. You should use these helpers and try to just slightly modify the mmap system to support mpin and munpin. set_buffer_ordered: ext3/4 use set_buffer_ordered to get correct and crash-resistant disk semantics with the disk cache on. So should you. Calling set_buffer_ordered on a buffer head before submitting with submit_bh will cause a SCSI disk cache flush, which is necessary for correctness. See how to use it by looking at: fs/jbd/commit.c:128 Hard-code stuff: If you look at the FAT file system implementation, '/' is a hard-coded dentry. This is fine, and your '/' should also be hard coded. Since 1ffs is a single file that can't be removed or changed (except for having pages overwritten) then you could also hard code its dentry, as well as its inode. The inode for 1ffs should really just have the owner set to the UID of the user mounting the file system, and have all permissions turned off. Then allow setattrs on this inode as normal to make it accessible to other users. Don't bother writing the inode to disk. New System Calls: If possible, avoid new system calls, however mpin is semantically different enough from mlock that it deserves its own call. The lease operations might be encodable as ioctls, but if this is too difficult or tricky, adding system calls for these is acceptable. Orthogonality: The project has several components of rough orthogonality: - The mpin code which consists of marking vma's in the mmap code - The page writeback code for 1ffs (writeback_inode) - The lease code, which requires study of memory management in the kernel, and can be FS specific in this case - The object allocation library - The object store library Each component can be developed relatively independently with some integration testing later on. The libraries in particular can be developed on a fake 'mpin' interface and be written and tested in advance. Team Configurations: Larger teams are responsible for more work. The most important aspect of the project is the kernel component, which is mandatory, and the user-level object store implementation built on top is for larger teams. 2 person team: All kernel components 3 person team: All kernel components and the Object Allocation Library 4 person team: All kernel components and the Object Allocation Library, and the Object Store Library. Conclusion: The project addresses an open research question, and brings to light some interesting opportunities to reduce message passing costs in Linux like OSes for user-level storage layers. Providing correct msync and mpin semantics for user-level storage software is a major step forward too. Good luck! Sponsor: FSL PhD student Richard "Rick" Spillane. You may contact Rick as well to get more details. ------------------------------------------------------------------------------ Union/Overlay Mounts: First, read this big picture doc: http://valerieaurora.org/union/design.txt Sponsor: Valerie "Val" Aurora (vaurora AT redhat DOT com). Note: this is an external sponsor of the project, and she's busy. So I ask that you contact me (Erez Zadok) first, before contacting her. I can probably answer most of your questions. Val has proposed three projects, described below. I marked them [A], [B], and [C]. The main benefit of picking one of these projects is that any good work you do has a good chance to get integrated into Val's kernel.org git tree; eventually her code will be integrated in the mainline Linux kernel, and your name will forever be famous. :-) Team sizes can be anywhere from 1-4 for this. [A] Whiteouts/fallthrus ----------------------- Whiteouts and fallthrus are special kinds of directory entries. A file system serving as the topmost writable layer of a union mount must implement both whiteouts and fallthrus. A whiteout says that this directory entry is logically deleted, even if a matching directory entry exists in some lower layer. On a lookup, the file system must return a dentry with the whiteout flag set. To create a whiteout, the VFS calls a file system provided operation: int (*whiteout) (struct inode *, struct dentry *, struct dentry *); (A directory may be flagged as opaque, which prevents any lookup from going to the lower layer - logically a whiteout for every entry not present in the top layer directory.) A fallthru is an explicit instruction to lookup this directory entry in a lower layer. A fallthru overrides an opaque flag on the directory. A fallthru is useful as a placeholder for readdir(). It may also be useful for implementing rename()s more cheaply. On a lookup, the file system layer with the fallthru returns a dentry with the fallthru flag set. The VFS then knows to call lookup on the lower layer file systems. To create a fallthru, the VFS calls a file system provided operation: int (*fallthru) (struct inode *, struct dentry *); We have three ideas for how to implement fallthrus and whiteouts in a typical file system. In general, it ends up being an backwards-incompatible change in on-disk format, but it is possible to use clever tricks to make it compatible, much like the ext3 htree directory entries. However, most writable overlays will be created anew, so backwards compatibility is not a requirement. 1. Normal directory entry with a special inode number. 2. Normal directory entry with a special file type. 3. New kind of symlink with a flag denoting whiteout or fallthru. (Note that tmpfs and other in-memory file systems simply use dcache entries directly, so don't need to implement anything beyond that.) We've implemented #2 for ext2 and jffs2. We implemented #1 for ext2/3 but found it to be less elegant. #3 is a new idea and may be more elegant. I'd like you to implement #3 for any of ext2, ext3, btrfs, xfs, jffs2. For background, read the patches implementing whiteouts/fallthrus for tmpfs and ext2. Look for appropriate commits from branch "overlay" of: http://git.kernel.org/?p=linux/kernel/git/val/linux-2.6.git;a=summary [B] Really really read-only mount for NFS ----------------------------------------- One of the key assumptions of union mounts is that the layers below the topmost writable layer are truly read-only and cannot become read-write while the top layer is mounted. Implementing this restriction - a "really really read-only" mount - is trivial for local file systems. However, for NFS file systems, we have to communicate this requirement to the server, and the server has to notify us if it can't make this guarantee any more (where notification can consist of something as brute force as making the whole mount go away). This will require (1) defining a new NFS mount option, (2) making the server up the really really read-only count on the exported file system, (3) defining some sort of failure mode for the case of the server rebooting and coming up with the exported file system writable or the like. See the section entitled "Mount" in: http://valerieaurora.org/union/design.txt And this mailing list discussion: http://markmail.org/message/3mkgnvo4pswxd7lp [C] Performance testing and debugging ------------------------------------- The Union Mounts code is very young. It has bugs. And it's hardly efficient. Run as many regression suites as you can, using various Union Mount configurations on top of various lower file systems. Profile the code to find bottlenecks (using profiling tools such as prof/gprof, Oprofile, OSprof, etc.). Fix those bottlenecks. Fix bugs you find. Note: it might be easier to include testing/optimizations as part of sub-projects [A] and [B]. ------------------------------------------------------------------------------ <<>> * Revision History v1: original. v2: detailed dedupfs [11/3/09] v3: added three union mounts related projects. [11/4/09] v4: clarification on securing checksum db in ext2sum [11/14/09]