CSE-506 (Fall 2009) Homework Assignment #3
(100 points, 20% of your overall grade)
Version 3 (11/3/2009)
Due Thursday 12/10/2008 @ 11:59pm
* SCHEDULE
1. The project begins on Wednesday 11/4/09, starting with a discussion of
your project idea with the instructor and/or TA. The discussion is
optional but strongly recommended: this way you can ensure that you
understand the project goal correctly, you estimate the amount of work
reasonably and the number of people in your group is appropriate.
2. You must submit a detailed 2-3 page design/plan document by Friday
11/13/09 or sooner, which outlines your plan for this project,
motivation, design, etc. You must commit the report to CVS using the
name "design.txt" and ALSO leave a printed copy of the report in the
mailbox outside my office by 5:00pm (not midnight) on 11/13/09. This
document will be reviewed by the instructor and returned back to you
within 24-48 hours. This is very useful to help you fine-tune your
project and ensure you're on the right track. In fact, the more you
discuss your project with the instructor, the better your project will
be. Note: you should start on your project asap: don't wait until after
I returned the graded design docs to get started!
3. If you work in a team/group, then you MUST declare your project group by
11/13/09, and get a shared CVS repository.
4. The final project source code and final README report are due on Thursday
12/10/2009.
5. Demos will be held between Friday 12/11/09 and Sunday 12/13/09.
Unless otherwise indicated, all deadlines are 11:59pm on the due date. Late
submission penalties will apply with one exception: you are NOT allowed to
miss your demo. Many demos will be scheduled in series; if you miss yours,
or you are late, your demo slot will end precisely on the time it's supposed
to, to allow the next group to start on time.
* TEAMS
You can work alone or in groups of 2-4. You must declare your team members
if you work in a group. Most projects are designed for 2-3 people. But I
like to give each team enough flexibility to design their own project's
scope, modified from my brief descriptions below. It allows you to adjust
the number of people required for the project by adding or omitting some
particular features. If you work alone, I will take that into consideration
when assigning your final course grade (as per the posted class grading
policies). At the same time, if you work in a group of 3-4 people, I will
expect a more substantial project done. In general the amount of effort per
person should be 1-1.5x times that each of you spent for HW2. Note,
however, that for this project you have just 4 weeks: this means you should
start doing some of the work/coding before you get your design document
reviewed/approved.
* GRADING
This project is worth 20% of your overall course grade. The grading is
divided as follows, on a scale of 0-100 points:
- initial design/plan document: 5 points
- final detailed README report (design/implementation/etc.): 20 points
- demo: 40 points
- code inspection and independent testing by graders: 35 points
Each document you supply/commit should have a project title, date, names and
emails of students in that project group. Be sure to spell-check and
proof-read your document.
1. Final README report should be detailed. I prefer if you follow the model
of a conference-quality research paper, with sections such as Abstract,
Introduction, Background, Design, Implementation, Evaluation, Related
Work, Conclusions, Future Work, and a professional Bibliography.
2. Demo: this is your chance to show the graders your work. You will start
compiling it from a pre-checked out source repository that the grader
checked out. Then you will demo your work in VMware. Be sure to show
off the basic functionality, as well a any special and/or advanced
features you've implemented. (i.e., impress us! :-) The graders will be
asking ALL of your team members questions to ascertain your level of
knowledge. It is imperative that all team members be ready to answer
questions. The demos will be 45 minutes each (and while that may seem
too long, trust me that you often run out of time and wish you'd have
more time.) It's important therefore that you plan your demo time well,
leaving time for the TA to ask questions.
3. Source code and independent testing: here we will ensure the quality of
your code. Don't forget that all code you submit must comply with
Documentation/CodingStandards and pass the compliance script in
scripts/checkpatch.pl (run checkpatch in strict mode please). Also be
sure that your code passes LOCKDEP testing.
Throughout the project, you are *STRONGLY* encouraged to discuss your
documents and progress with the instructor and/or the TA. You are welcome
to give me drafts of your documents before the deadline, or come discuss
your progress with me: I will give you "free" advise and review your
documents/code/progress for free. This way you can have enough time to
improve your project before it gets officially graded. In the past, people
who came and talked to us frequently, often did much better in the project,
because you can get an early feedback from us about your work.
Extra credit: there is none for this assignment, but given the open nature
of this project, we will grade more openly. For example, teams with
particularly clever solutions, exceptionally good code, and/or a
"research-like" final report, may get some extra credit points. As an added
motivation, I am looking for a small number of individuals to join my
research lab, FSL, as project students. I am also looking for 1-2 RAs.
Good candidates to join FSL are those who do very well in this and other
classes, and especially in this project, demonstrating good kernel coding
abilities, good technical writing skills, and an excellent understanding and
motivation for conducting research. Interested students should contact me
directly.
* SUBMISSION
You will submit all material via CVS to your "hw3/" private or group
directory: sources, design docs, etc. Late submission penalties will apply
as usual. If you work as a group, you MUST use a shared group repository:
you can use the shared group repository from hw2, if the you stay in the
same group, or ask me for a new CVS group repository (email me the names and
OSLAB account names of all new members).
* PROJECT SELECTION
You may choose from any of the projects below, or design your own project.
If you choose one of the projects below, you can modify it (add or remove
stuff) based on your group's size and you particular interests. If you do
design your own project, you MUST get the instructor's approval ahead of
time (WELL BEFORE 11/10/09!). Be sure to design a project that is just the
right amount of work for your team size (often, students pick projects that
are too ambitious for the given amount of time available).
With each project description, I also listed how many people this project is
suitable for in [square brackets]. The exact group size will depend on
discussions with me to define the scope. I also include URLs with each
project which can give you some useful info: papers to read, software to
download, etc.
In addition, if you like to do more projects in the area of operating
systems, speak to me so we can arrange for a project that can be carved into
different components (one component will be done as class project, the rest
continued as a spring/summer project, with an eye toward submitting a
technical conference/workshop paper).
If you want to get some ideas of other possible projects, feel free to
discuss it with me, with the TA, or any other class student (online or
offline). You're also welcome to discuss possible OS projects with other
professors and your adviser: as long as the project has a decent OS
"flavor", I'll consider it. Note that not every OS project requires writing
lots of kernel code; OS projects could include small amounts of
instrumentation, tracing, monitoring, analysis and benchmarking.
You may also consult the FSL projects page for ideas:
http://www.fsl.cs.sunysb.edu/all-projects.html
Many of those projects have parts that are suitable CSE-506, although some
of them have already been done. You could read some of the papers we have
published, and especially look at the "Conclusions and Future Work" section
at the end of the paper: this could tell you what _new_ work you might be
able to do that hasn't been done before. If you're interested, come see me
after you have read a bit about the existing projects.
You may also consult any other projects done in the department, as long as
they have a significant systems/OS component.
* LIST OF PROJECTS (check for updates after the end of this list)
The list of projects below does not have any particular order. For most of
these projects, you're expected to work with these source-code management
tools: git and guilt (both will be provided in the cse506 VM image).
------------------------------------------------------------------------------
SYSCALL-INHERIT:
Linux doesn't allow modules to override system calls. BSD does allow it, by
inheriting the syscall dispatch vector per process. A new module or policy
can make a new process use an entirely new set of syscall vectors
atomically. Study the BSD implementation and implement some or all of it
for Linux; show that it works by providing a kernel module that loads up
dummy syscall wrappers for some syscalls. You will probably have to change
the task structure to accommodate a syscall vector, and change
fork/exit/clone. [1-4]
------------------------------------------------------------------------------
DCACHE/ICACHE ACLs:
As some of you have noted in HW2, once an object is cached in the dcache,
permission checks on it are ignored. This is bad if you want to prevent
some users from accessing a file, or even being able to see that a
file/directory exists. Figure out how to add ACLs to cached
dentries/inodes. When a cached object is about to be returned to a user
process, you should check to see if the process has permissions to access
this object, without changing the object's status in the cache. When a new
object is about to be inserted into the dcache, you have to figure out what
ACLs, if any, should be attached to the object (you'd probably have to
change struct dentry). Also it'd be nice to have a way to dynamically
add/remove ACLs to objects in the cache. You may have to add a new dentry
operation called ->d_check_perm, perhaps other ops as you see fit. This op
will have to be invoked for every cached object before it's used (i.e., VFS
modifications). Demonstrate the usefulness of this project by modifying one
or more file systems such as ext2 or ecryptfs to support said ACLs in these
new dentry ops. Your design should be extra careful about performance
overheads, because the dcache is in the critical path of the OS. [1-3]
------------------------------------------------------------------------------
Persistent obj->private fields:
Any data you put into an object's private field is only saved in memory.
Devise a method by which individual inodes, dentries, and/or super_block
objects, can have their private data saved automatically and persistently in
the file system. You can create a new field in each data structure, called
->persistent if you want. Your system should be flexible such that
persistency will be available on per-object basis, iff the object->flags has
a special PERSISTENT flag on. The file system should set this flag on/off
as it sees fit, but it should be the VFS which does all the hard work to
save the per-object data -- with the help of the file system (e.g., the file
systems ->read and/or ->write methods may have to be called, or new methods
you devise). For each object type (D, I, SB), figure out what are the best
ways to save the info: append/prepend to a file; global info; separate
files; databases; EAs; etc. Note also that you'll have to figure out how to
re-load any previously saved persistent object data, when an object is
re-instantiated in memory. [1-4]
------------------------------------------------------------------------------
EXT2SUM:
Add native file data checksumming support into ext2 (or ext3, ext4, or any
other disk-based f/s). There's a lot of flexibility in this project. You
can consider checksumming for only data, only meta-data, or both. You can
consider checksumming per file, page, or f/s block. The checksums can be
stored in the inode, as extra data, as an extended attribute, or any other
variation you choose. You can consider to checksum only regular files, or
also directories and possibly other entities. You might also want to
enhance the user-level utilities (fsck, mkfs, debugfs, etc.) to handle your
checksumming support. Checksumming should be a mount-time option to
enabled/disable it. You can pick one checksumming algorithm, or several out
of what's crytoapi supports. Optional: think about security and
administration: how to create checksums safely (only "approved" users should
be allowed to that, possibly in single-user mode or the like. Your code
must pass the full LTP regression test suite (every test which passes on a
vanilla f/s must pass when you add the checksumming support). [1-4]
Tripwire project (google for it)
------------------------------------------------------------------------------
DedupFS:
Modify any file system you want, or write one based on wrapfs, to perform
data de-duplication. This is a "compression" technique that shares any of
data chunks that are identical within/across files, so duplicated data is
saved only once. To identify chunks, a checksum is often used. To know
when duplicated chunks are no longer needed, refcounts are used. Note: you
may rely on ext2sum or use btrfs for checksums, and build your dedup support
on top of it. [2-4]
Basic Deduplication:
There are three phases to deduplication: (1) segmenting, (2) labeling, (3)
seg storing. For segmenting, simple deduplicating systems simply break
files up into pages. For labeling, a basic checksum is used, like an
md5sum. For seg storing, an on-disk index is used in conjunction with a
block allocator. Since you can assume that blocks are all of the same size
(4KiB) your allocator implementation can be very simplistic. To store a
segment, you first allocate a block, checksum the dirty page, write back the
dirty page to the allocated block, and then insert a key-value pair into your index. For your on-disk index, you might
consider using a hash table.
The key idea is that if you have already stored a block with a particular
checksum, you won't store it twice. So instead of storing two blocks, you
store one block and two check sums. This is where the compression comes
into play.
You will also have to store inode attributes, as well as directory files.
The standard way to store this information is via the existing file system.
Typically one would write the checksums in place of the blocks to the
underlying file, and store the blocks in a separate seg store as described
above. During a read, you will read the checksums from the file, and
retrieve the blocks paired with the checksums from the seg-store into the
page cache. You could use a stackable file system to achieve this purpose.
Be sure to report the correct file size (i.e., size as opposed to
(size/page_size)*checksum_size).
Finally you do not have to deal with the case where the underlying file
system already has data in it. Just assume these users will simply copy
this data back into your file system if they want to access that data
alongside deduplicated data. You _do_ have to support remounting the file
system of course!
Advanced Deduplication - Bloom Filtering:
There are two major performance issues with deduplication. The first is
that you must perform an index lookup for every single block to see if it is
already in the seg store. When the index becomes larger than RAM you will
have to perform random I/O for every single write, even appends.
The standard way to avoid this is to use a Bloom filter. A Bloom filter is
a mapping of key-value pairs. Bloom filters do not have
bucket lists. If there is a collision during an insert into the Bloom
filter, we do nothing. Every time you insert a new
pair into the seg store, you will first perform a lookup of on
the Bloom filter, and _only_ if the bit is '1' will you proceed to check if
the checksum has been stored in the seg store already. If it is not '1' you
will store it in the seg store, and insert a pair into the
Bloom filter. This will allow you to avoid performing lookups of checksums
you know can't have been inserted yet. Although a negative cache (like
negative dentries) would achieve the same purpose, a Bloom filter is
guaranteed to be only as large as the hash table you allocate, and even this
table is merely an array of bits, so Bloom filters achieve better economies
of RAM. This is important as it allows you to avoid more lookups. However
there are some issues that you will have to figure out. Is it necessary to
store the Bloom filter across umount and mount? How do you store the Bloom
filter if it is? Most queries are on newly inserted items, but Bloom
filters age (old checksums will collide with new checksums, forcing a lookup
only to find the checksums are different). Can you incorporate the age of
the checksum in your Bloom filter decision? Make sure your design is
_correct_!
Advanced Deduplication - Block Grouping:
Another major performance issue for deduplicating file systems is that they
will write blocks normally co-located in a file in random order to a
seg-store. To resolve this issue you should implement a clustering
mechanism for your block allocator. Your allocator can take an inode and
block offset (in the file) as a hint, and during allocation of your block in
the seg store you would pass the inode and file offset in. The allocator
would try to allocate blocks belonging to the same inode from the same
cluster. The blocks in each cluster would be sorted by their block offset.
You will have to handle page overwrites in an intelligent manner. You will
not want to simply append new page values to the end of a cluster, as this
will cause fragmentation, but instead you will want to have a configurable
option (during format) that allows for some percentage of each file to be
used for page overwrites. Distribute holes uniformly through out the
cluster, and when you must perform a page overwrite, utilize the nearest
hole. For instance:
We format the file system with a hole of size 2 for every 8
blocks in a cluster, and a cluster size of 20 blocks. Notice
that we mandate the cluster size is a multiple of (2 + 8).
x - used block
| - cluster start/end
_ - free block
|____________________|
01234567 01234567
Now the user writes some blocks to this cluster for inode i in
order of page offsets 0, 1, 2, 3, and 4:
|xxxxx_______________|
01234567 01234567
Now the user writes some blocks to 7, 8, and 10:
|xxxxx__x__x_x_______|
01234567 01234567
Now the user inserts the value of block 8 in some other file,
and the index links that file's checksum at that offset to
this particular block. So if we overwrite block 8 we can't
simply delete it. That's OK, we'll use our overflow hole.
|xxxxx__xx_x_x_______|
01234567 01234567
We put the new value for block 8 into the overflow hole.
This scheme assumes that duplicate blocks will typically be used with other
block of the same value in the same order. For instance you won't just use
block 3 in another file, but blocks 0, 1, 2, 3, and 4. So fragmentation
induced by this scheme should be small. You must design a reasonable method
of deleting blocks no longer pointed to by any checksum in the index (in any
file).
Group Sizes:
Depending on your group size, you will be responsible for more or less of
this total system.
2 person group:
You will be responsible for Basic Deduplication.
3 person group:
You will be responsible for Basic Deduplication and Bloom
Filtering.
4 person group:
You will be responsible for Basic Deduplication, Bloom
Filtering, and Block Grouping.
Conclusion:
Deduplication is a powerful way to compress infrequently used portions
of a large file system. It is an exciting new area and some of the
most challenging problems are intelligently handling the Bloom
filtering policies, and the block grouping policies. There is
currently no good deduplication file system for Linux. Having a
stackable implementation would be a perfect fit letting people
continue to use their underlying file system, and might make a lot of
people excited! Good luck!
Sponsor: FSL PhD student Richard "Rick" Spillane. You may contact Rick as
well to get more details.
------------------------------------------------------------------------------
Any-path checkstack.pl:
checkstack.pl only shows the stack footprint of individual functions.
Enhance the script (and the kernel as needed), or write a new
script/program, that calculates ALL possible paths through a compiled kernel
(and modules), and reports the call paths which'd have the most stack use.
In particular, it should report paths that exceed the current kernel's max
stack size. It's important that your system will not report false positives
(i.e., call paths that can never occur); you also have to take into account
that there's a lot of function pointer indirection (e.g., operation vectors)
that are not filled until run time. Also stackable file systems and any
levels of indirection (e.g., LVM/RAID) increase stack usage. Find plausible
paths that indeed consume a lot of stack space, and demonstrate them by
configuring a kernel and modules, and using it, such that it triggers a
stack overflow oops. Yes, in this project, your success will be measured by
how many oopses you produce. (As an added incentive, if you manage to do
this project by only modifying user-level code/scripts, there's a good
chance it can be contributed and accepted by Linus into the mainline
kernel.) [1-3]
------------------------------------------------------------------------------
EXT2INSDEL:
Support native insertion/deletion of blocks into ext2 (or any other disk
based f/s of your choosing). The idea here is to allow block-aligned
insertions/deletions into a file, and all you'll need to do is move the
physical indirect inode pointer, and not the whole data blocks: this should
improve insert/delete performance by a factor of log(n). You'll have to
provide a system call API for user programs to do that; ideally the user API
should not be limited to block aligned units, and the kernel will take care
of mapping byte-offsets to block offsets as efficiently as possible. [1-3]
------------------------------------------------------------------------------
EXT3INSDEL:
Do the same as ext2insdel, but for a journalling file system. Here, you
have to deal with log records in the journal which have to be coded for the
right ins/del operation. This is a bit more challenging, but will provide a
more reliable solution and possibly easier at times, due to ext3's
guaranteed write-ordering semantics. [2-4]
------------------------------------------------------------------------------
EAfs:
Extended attributes are useful, but not every file system supports them.
Implement a way to add EAs to any file system by storing them in plain
files. The EA files can be per file/directory, or globally (e.g., one
"database" file holds all of the EAs).
You can implement this as a thin stackable file system, possibly based on
wrapfs; or as native VFS support. Prove that your system works for a file
system that doesn't have EAs now (e.g., vfat or nfs). You need to show that
EAfs works for at least 1-2 file systems, the more the better. Note that
since EAs are often used for security, you may wish your EA files to be
secure (e.g., hidden from view, inaccessible to anyone but the kernel,
etc.). [1-2]
------------------------------------------------------------------------------
ACLFS:
ACLs are very useful. Design a stackable file system (based on, say,
wrapfs), which can "stack" on top of any other file system, and
automatically add ACL support. Use Linux-supported APIs for ACLs, using
extended attributes (EAs), or any other mechanism. If using EAs, they
should be supported already in the lower file system. Implement as many
ACLs as possible (by UID, GID, PID, SID, time-of-day, etc.). You may
combine this project with EAfs. [1-3]
------------------------------------------------------------------------------
Gimple DB: Whole-Kernel Analysis
This project involves aggregating an entire kernel's GCC intermediate
representation into a database, making it available to analyses that span
the entire kernel. This is suitable for 2-4 students with good experience
with compilers and programming languages.
Sound static analysis of an operating system kernel requires an accurate
view of kernel functions as they will actually execute. While not
insurmountable, this obstacle can turn an interesting research problem into
an exercise in frustration for anyone who does not want to deal with the
myriad subtleties and gotchas inherent in low-level C code, like the kind of
code that appears throughout the Linux kernel.
At the source level, a soup of macros and conditional compilation make it
difficult to learn what code actually executes and what data it touches.
That problem goes away at the binary level, but so does the rich type
information that is necessary for useful analysis.
The compiler, however, has knowledge of exactly how each function should
execute (or else it wouldn't be a very good compiler), in the form of
GIMPLE: GCC's three-address code intermediate representation.
GIMPLE three-address statements are already broken up into basic blocks, and
each statement is annotated with all the type information that the compiler
has access to, as well as all the extra information (such as SSA use-def
chains) that the compiler uses to find optimization opportunities.
The DB Dump GCC plug-in persists all this information to an SQL database, so
that it is possible to search the GIMPLE of even a very large program. As
is, however, the database is not ready for serious analysis.
The main goal of this project is to devise useful whole-kernel static
analysis questions and then to modify the GIMPLE database schema so that it
is possible to efficiently answer these questions.
1. Potential Static Analysis Questions:
For full credit, the project should result in (a) an interesting (but
simple) statistic about kernel structures and (b) a more sophisticated
inter-procedural analysis. These examples would make a good project:
- Simple statistical query: Which global variables have the most accesses?
Which variables are accessed from the most kernel components?
- Inter-procedural query: For each function, what locks might be held when
the function is called? What locks are always held?
2. GIMPLE DB Schema
Answering these kinds of questions requires knowledge of inter-file kernel
structure. The current DB Dump implementation dumps files in isolation: it
does not discover any sharing between files. For example, if multiple files
use the struct inode datatype, each of those files will maintain its own
copy of that type, making it difficult to make the query "find me all
statements that access a struct inode variable."
To enable the queries in part 1, this project should modify the DB Dump
plug-in to discover this kind of sharing. The finished DB Dump will play
the role of the linker, tying together shared variables and datatypes.
Sponsor: FSL PhD student Justin Seyster. You may contact Justin for
more details.
------------------------------------------------------------------------------
FAKEFS: Fake File System:
The choice of a back-end file system affects end-to-end performance of a
Network File System (NFS) significantly. Consequently, when comparing
different NFS implementations, the lower file system should be the same for
all implementations in comparison. But which specific file system should be
used? One of the ways to perform fair NFS benchmarking is to "fake" the
back-end file system --- a fake file system --- that pretends that requested
operation is completed, but does not perform any I/O operations, does not
store bulk of data in the memory, and uses as less CPU time as possible.
I.e., such file system does its best to return control back to NFS server as
soon as possible.
In this project you will need to implement a fake file system (fakefs). While
this file system should not store any data, it has to maintain a minimal amount
of metadata in order not to break basic the functionality provided to upper
levels - VFS/NFS/User application. E.g., file name and file length are
definitely part of such crucial information. When data is requested from the
fakefs, zero pages should be returned back, provided that the current offset in
the file is less then the faked file size. The functionality provided by the
fakefs is somewhat similar to sparse files but on a wider scale. You may base
fakefs either on wrapfs or tmpfs. Fakefs must be optimized to use as little
memory and CPU time as possible.
At then end you should present a working fakefs implementation, which passes
certain basic LTP tests. You will also need to run filebench on fakefs
locally and through NFS and present the results.
Resources:
- Linux Test Project (LTP): http://ltp.sourceforge.net/
- FileBench: http://www.solarisinternals.com/wiki/index.php/FileBench
Group size: 1-2
Sponsor: FSL PhD student Vasily Tarasov. You may contact Vasily as well to
get more details.
------------------------------------------------------------------------------
Friendly Apps:
What if user applications are friendly?
An increasing number of research projects exploit the possibility to
trade-off reliability for performance. The reasoning behind is that many
applications have inherent error tolerance.
Modern operating systems assume that any running application might be
malicious (deliberately or not). As a result, a code of typical OS abounds
in security and reliability checks. We claim that in certain situations, a
lot of security burdens might be omitted in the name of performance. In
fact, nowadays a significant number of applications are carefully tested or
formally validated. Consequently, the probability of application's invalid
behavior is low. If the application in question performs non-critical
service, security/reliability checks might be omitted, increasing the
program's performance (and reducing the code size). We call applications
that do not exhibit malicious behavior as "friendly applications".
In this project we want to classify security checks performed in the Linux
kernel, account the chances of each security check class to be exploited by
application and remove security checks class by class, measuring kernel
performance at each stage.
As a result you should provide:
- a patch against the recent kernel version with a lot of:
#ifdef CONFIG_SECCHECK_CLASS1
...
#endif
#ifdef CONFIG_SECCHECK_CLASS2
...
#endif
...
- GNU/Linux booting for each configuration above. Notice that it might
require the modification of some user programs used during the boot, if
they are not friendly. Let's call this operation forcing the
friendship. :)
- Test software and test results, proving (or denying) the concept.
Examples of the checks that might be omitted:
if (!(file->f_mode & FMODE_WRITE))
return -EBADF;
if (unlikely(!access_ok(VERIFY_READ, buf, count)))
return -EFAULT;
More complex examples can arise inside schedulers and other kernel
subsystems. In this project you will need to look through and understand a
lot of the kernel code (not only file system related).
Group size: 1-2
Sponsor: FSL PhD student Vasily Tarasov. You may contact Vasily as well to
get more details.
------------------------------------------------------------------------------
Accurate Block Trace for Writes:
Linux blktrace is a block layer I/O tracing mechanism which provides
detailed information about request queue operations up to user space. We
would like to use this mechanism to account disk power consumption per
process. However, there is a serious technical limitation in blktrace that
does not allow us to implement our idea right away. Below is a typical
blktrace output when you run a /bin/dd process:
% blktrace -d /dev/sda -o - | blkparse -i -
...
8,0 1 11 0.009507758 C W 223490 + 56 [pdflush]
...
You may notice that although "dd" is the process responsible for the write,
blktrace reports pdflush as the origin process for this write operation.
The reason is that writes are typically asynchronous in Linux and pdflush is
the process that flushes dirty pages to the disk. Even worse, the design of
the Linux kernel currently does not maintain information about the process
that made the page dirty. Often, the process that dirtied the page might
have even been terminated already at the point when the page is flushed.
In this project you will need to modify the memory management system in the
way that we can keep track of processes responsible for writes. Some dirty
pages might be shared among several processes, in this case both processes
must be assigned to that dirty page. PID + should be used as
writer's identifier. You will also need to modify blktrace in the way that
it displays the information propagated from the upper level.
Resources:
- Blktrace git: http://git.kernel.dk/?p=blktrace.git
Group size: 1-2
Sponsor: FSL PhD student Vasily Tarasov. You may contact Vasily as well to
get more details.
------------------------------------------------------------------------------
VSNFS: very simple network file system:
Last year several CSE506 students have developed vsrfs: very simple real
(i.e., disk-based) file system. An example of such file system and
motivation behind it can be found at
. In this project we want to
implement a very simple NETWORK-based file system. Such file system might
be used as a template for a quick creation of network file systems tuned for
specific usage scenarios.
First, you will need to design a simple protocol describing client and
server communication. The protocol should be rich enough only to support
basic file system operations, such as lookup, file creation and removal,
reads and writes. All other operations might be omitted for simplicity.
Mount and umount operations should also be described within the
protocol. You may assume that server has only one export point and can
service only one client at a time. You can also assume that underlying
protocol is perfect, as well as server and clients are: they never crash and
power never goes down. After the vsnfs protocol is designed, you will need
to implement both server and client sides. We recommend to use the Remote
Procedure Calls (RPC) subsystem when implementing client-server model. Here
is an example how vsnfs can work. On the server node user invokes
# vsnfsd.start /export/dir
command, that starts a vsnfd - server kernel thread. vsnsd registers a new
RPC server in the RPC framework and waits for the remote procedure calls
from the client. Let's assume that on the client user mounts vsnfs, reads a
file, and umounts vsnfs:
# mount -t vsnfs hostname:/export /mountpoint
# ls /mountpoint
f1 f2 f3
# cat f1
hello world!
# umount /mountpoint
The session between client and server may look like that:
---client---- | ----server----
|
mount_req{"/export/point"} ----------------->
|
<---------- mount_reply{"ok", root_fhandle}
|
readdir_req{root_fhandle} ------------------>
|
<------------ readdir_reply{"ok", "f1", "f2", "f3"}
|
lookup_req{root_fhandle, "f1"} ----------->
|
<------------- lookup_reply{"ok", f1handle}
|
read_req{f1handle} --------------------->
|
<------------- read_reply{"ok", "hello word!"}
|
umount_req{"/export/point"} --------------->
|
For simplicity you can omit permission checks, limit maximum number of files
in the root directory, and always transfer the files completely over the
network. Server maintains a table that maps file handles to the local file
system inodes.
It is up to you if the implementation will work on top of TCP, UDP, or both.
You can use caching for performance optimization, but be careful to
synchronize caches in time. Interoperability between different hardware
architectures can be omitted. NFS implementation in the Linux kernel can be
used as a source of ideas or starting point.
The resulting file system should pass as many tests from the LTP suite as
possible. You also will need to run certain micro benchmarks using Filebench
and present the results. Ideally it should work on top of the vsrfs, so
that we have a vsrfs + vsnfs working couple.
Resources:
- RFC5531: "RPC: Remote Procedure Call Protocol Specification Version 2"
- Linux Test Project (LTP): http://ltp.sourceforge.net/
- FileBench: http://www.solarisinternals.com/wiki/index.php/FileBench
- RFC1094: "NFS: Network File System Protocol Specification"
Group size: 3-4
Sponsor: FSL PhD student Vasily Tarasov. You may contact Vasily as well to
get more details.
------------------------------------------------------------------------------
1FFS: One-File File System
"Caihong" is Chinese for rainbow and refers to the many different colors of
cached pages on top of a single object store.
Introduction:
Programming in the kernel is very difficult. Commonly used systems should
be optimized for performance and placed within the kernel. However, there
is a significant risk of new instabilities and massive development costs.
The immense number of files users must manage today has caused many
developers to push the most complex components of their storage software to
outside of the kernel as performance concerns become dwarfed by
architectural complexity concerns. Users have already started to
automatically generate data via digital photos and caching internet
interactions. As the number of files users generate steadily increases,
files will have to be automatically named and indexed by many indexes in the
future, which will all run from a user-level context. The OS kernel will
act as a "bottom-half" which provides service requests to a particular PC's
disk(s). However, current OSes have become jacks of all trades, and the
demand for file system performance has caused kernel designers to forgo the
classical guarantees of fsync(2) and compromise on the potential benefits of
a more comprehensive mmap(2). New architectural challenges stand in the way
of fixing these issues for user-level storage software, and currently such
software vendors either turn off the disk cache, or maintain a custom
kernel. Furthermore, user-level file systems such as FUSE incur heavy
message-passing overheads and have no way to implement a page writeback
mechanism that the kernel can rely upon to reduce memory pressure.
In this project you will solve the most fundamental of these problems, and
develop novel solutions to an open research question. This specification
will begin by providing you with the first target use-case of this work, and
then move into a full specification. Finally, it will finish with several
architectural suggestions that should better ensure correctness, and ease
the design and implementation considerably. After the description, valid
team configurations for the assignment will be discussed.
Use-Case Scenario:
There are several intended use-case scenarios for this work, including a
user-level index on top of a kernel level object store, a user-level
transactional library API, and a new kind of versatile file-system with much
low message passing overheads than FUSE. I will describe the transactional
library API use-case to you.
In this use-case the developer wants to provide a transactional library to
other developers. His library consists of two major on-disk data
structures. The first is the block file. The second is the journal.
Updates to the block file are always paired with a message to the journal.
This journal message can either undo, or repeat the paired update to the
block file. For instance:
Update to block file:
"Update block 32 to this new 4KiB value."
Update to journal file:
"Append the old 4KiB value at block 32 in the block file, and append
the new 4KiB value for block 32."
The update to the journal file can either restore the original value of
block 32 (undo) or re-apply the intended update (redo). The key to this
use-case scenario is that for each message pair, the journal message should
always hit the disk platter before the block file message. This way if we
lose power at any point, we will have either not modified the block file, or
will have enough information to carry the whole operation through, or undo
any changes it had made.
Currently the transactional API uses a user-level hash-table to cache
blocks. The time spent in user-level to perform basic page caching is
immensely expensive. Benchmarks show that almost 50% of the overhead of the
library is simply due to time spent in user-mode doing page caching. The
implementer of this library would like to use mmap. He would simply mmap
the block file, and mmap the journal, and keep the block file mapping
'pinned' or unable to writeback to disk, until he explicitly 'forces' or
syncs the journal file mapping. Any page faulting and caching details would
be handled natively by the kernel, significantly improving performance.
However the kernel provides now way to pin pages except with mlock, which
will kill processes that lock too much RAM, and has no two-way communication
mechanism to ask the user process to relieve memory pressure due to security
concerns.
When the user-level transaction API _does_ write the journal to disk, it
uses fsync(2) which will only schedule block writes and not actually flush
the disk cache, compromising atomicity. Currently the transactional API
warns that the disk cache should be turned off to avoid this problem, again
heavily degrading performance.
Finally, the developer utilizes the convenience of allocating files with
arbitrary names, appending to them, truncating them, and removing them.
These kind of 'object-store' functions are also utilized by many other
user-level storage vendors. This is one of the main reasons file systems
are commonly used as back end for these products, despite their many
draw-backs. However the message passing overheads for many manipulations of
files can be costly, and almost all file systems implement fsync(2) and
mlock(2) in an efficient but incorrect manner.
The Solution and Proposed Specification:
This developer is just one of many that complain about the poor tools the
kernel provides to user-level storage engines. You will develop a
comprehensive solution to this kind of developer. Your solution will
provide depending on your group size:
- a single-file file system that includes:
- a RAM leasing implementation
- a correct implementation of fsync(2) that can run with the
disk cache _on_
- a new mpin(2) and munpin(2) routine that acts on whole
mappings
- a user-level object allocation library
- a user-level object store that can be used to
- create files,
- append to files,
- truncate files,
- remove files.
Proposed Specification:
Single-File File System:
You will develop a single-file file system. This is a necessary step as we
want to provide mmap operations on top of a block device, but specify the
semantics of these mmap operations (for the purpose of proper msync and
mpin). File systems are the typical servers of mmap on files in the kernel,
so the most natural way to achieve this is to provide a single-file file
system, that exports a partition as a large mmap-able file.
There is no need to retain inode properties for this file across umount and
mount, the inode can be maintained completely in RAM, and initialized with
hard-coded values that set the owner to root, and permissions to 0 for ALL
and GROUP (to prevent TOCTTOU attacks). Root can then change the
permissions if desired after mounting. This removes the need for inode
de/serialization and should simplify get_inode considerably.
Mpin and munpin should be implemented with a separate on-disk swap. The
reason is that if the kernel wants to relieve memory pressure, and we do not
want to OOM kill the process, and not write back the dirty pages to disk, we
must write them somewhere else to relieve pressure. We write them to swap.
Data in swap need not be stored or retrieved across either crashes, or
umount and mount (its swap and so its temporary).
You should maintain a single super-block page that stores:
(1) The magic number of the file system
(2) The size of the swap partition for the file system
The magic number is to prevent users from accidentally mounting devices
formatted by other file systems and over-writing valuable data. The file
system checks the magic number during mount, and if it does not find it,
returns an error requesting the user to format the partition with mkfs.
You should write these values to the super-block during mkfs, and retrieve
them during mount. They can not be changed so there is no need to update
them. Here is the usage of mkfs.1ffs:
mkfs.1ffs -o
Mounting 1ffs takes no options:
mount -t 1ffs
A RAM Leasing Implementation:
One of the problems with user-level page cache implementations is that they
consume kernel memory. Currently such systems use carefully tuned kernels
in concurrence with malloc and the regular swap partition to handle memory
tug-of-wars between user-level storage layers and the kernel. To avoid
swapping to disk when the user-level storage could simply be asked to unpin
some pages during a flush, you will implement a RAM leasing implementation.
lease_t lease_acquire()
int lease_relinquish(lease_t lease)
Lease acquire will ask the kernel for a lease ID, and if granted, a unique
positive id will be returned, and if not, ENOMEM will be returned. This
lease ID will be used for all future requests related to that lease.
int lease_truncate(void *addr, lease_t lease, size_t pages)
Lease truncate will specify for a mmap-ing at address 'addr' that the user
requests that 'pages' pages (4096*pages bytes) can be pinned in RAM within
that mapping without incurring swapping. If the call is successful, 0 is
returned, and the user can proceed to call lease_shrink, otherwise ENOMEM is
returned.
int lease_shrink(void *addr, lease_t lease, size_t *pages)
The call to lease_shrink will block until the kernel wants to reduce this
lease's number of pages, at which point processes blocking on this call will
return, and the amount to shrink the lease by will be recorded in 'pages'.
The process has some period of time to unpin pages before they are swapped
out. The user process can repeatedly call lease_truncate to expand on a
shrunken lease to get RAM back from the kernel.
A Page Pinning Implementation:
1ffs consists of the block file which is exposed as a single large file as
described above, but it also consists of a swap partition. This swap
partition is where pinned pages are written back to disk if the kernel is
suffering from memory pressure. The total sum of all outstanding leases can
not exceed the size of the swap partition, to ensure that all pinned pages
can be written back during memory pressure.
int mpin(void *addr, size_t len_in_pages)
int munpin(void *addr)
The command users will use to pin pages is 'mpin' and should take advantage
of existing mmap code as much as possible. For this assignment 'mpin' can
effect a whole mapping (i.e., you can just mark the vma). During page
writeback your 1ffs implementation should write back pages belonging to a
pinned vma to the swap, and mark them clean. Munpin does the obvious thing
and clears a vma as being 'pinned' and thus during writeback 1ffs will
simply writeback the mapping to its original location (and not bother
writing it to swap).
You will have to maintain an in-RAM index that can be used to find the
location of pinned pages in the swap disk. Using the radix tree
implementation for this purpose may well make sense. There is no need to
serialize this index to disk as the contents of swap do not need to be
restored in case of crash, or across umount and mount.
A Page Forcing Implementation:
When msync-ing pages in an mmap belonging to 1ffs, if the MS_SYNC flag is
passed, 1ffs must use an ordered write to cause a SCSI disk cache flush.
This is done by calling 'set_buffer_ordered' on the last buffer head written
to disk, before calling 'submit_bh' on that buffer head. This will ensure
that the mapping is correctly and synchronously written to disk.
If forcing a mapping where some of the pages are swapped out to the swap
partition and are not resident in RAM, they will have to be read in from
swap before they are written out. You will have to use your swap index to
find these non-resident pages.
Finally you should be able to msync (force) mpinned pages. This is a
necessary feature as user-level storage vendors will want to keep pages
pinned, but occasionally write them out to disk to either clean them or for
durability's sake.
An Object Allocation Library:
1ffs is a powerful tool in that it exposes a page-cache interface that user
processes can access with few context switches. You will implement an
object allocation library in user space which provides a slab allocator
interface on top of 1ffs. The API will be similar to kmem_cache. You will
need to allocate pages to store slab object sizes, and pointers to the first
slab. Slabs may be linked lists with pointers to other slabs. You will
need to store within each slab which objects are free or not. This
information should be updated using mmap and msync.
In addition you will need to implement a small journal. This journal will
include updates to the metadata of the system (redo actions). Updates to
the journal should be flushed to disk before updates to the metadata. You
should mpin the metadata pages of the object store on disk. When you want
to write your changes to the metadata pages of the object store, simply
fsync the journal (which will have identical semantics to msync) and then
msync your metadata pages. This write ordering will ensure crash-resistance
of your object allocator.
An Object Store Library:
The object store library is built directly on top of the Object Allocation
Library. You will provide 'objects'. Objects can be created, in which case
the create function will return an object ID. They can be deleted. They
can be appended to and expanded, and they can be truncated. They also have
inodes, and their inode number is equivalent to their object ID. These
objects, their metadata, their block maps, and so on, should be allocated as
objects from your Object Allocation Library.
Since updates to the object store metadata will also have to be consistent
and crash-resistant, you will want to journal these updates as well. Having
two journals is inefficient as appends to both will cause undue disk seeks,
so share the Object Allocation Library journal. So you will interleave
Object Allocation redo actions with Object Store redo actions.
Architectural Tips and Tricks:
get_block:
get_block is the routine called by most of the generic vfs routines.
Get_block is responsible for allocating new blocks, and fetching blocks at a
given offset. Since 1ffs has a very simple lay-out, the implementation of
get_block should be relatively straight forward.
Use mmap:
The existing mmap code is mature and powerful, including many useful helper
routines for finding mappings in files, merging them, and so on. You should
use these helpers and try to just slightly modify the mmap system to support
mpin and munpin.
set_buffer_ordered:
ext3/4 use set_buffer_ordered to get correct and crash-resistant disk
semantics with the disk cache on. So should you. Calling
set_buffer_ordered on a buffer head before submitting with submit_bh will
cause a SCSI disk cache flush, which is necessary for correctness. See how
to use it by looking at:
fs/jbd/commit.c:128
Hard-code stuff:
If you look at the FAT file system implementation, '/' is a hard-coded
dentry. This is fine, and your '/' should also be hard coded. Since 1ffs
is a single file that can't be removed or changed (except for having pages
overwritten) then you could also hard code its dentry, as well as its inode.
The inode for 1ffs should really just have the owner set to the UID of the
user mounting the file system, and have all permissions turned off. Then
allow setattrs on this inode as normal to make it accessible to other users.
Don't bother writing the inode to disk.
New System Calls:
If possible, avoid new system calls, however mpin is semantically different
enough from mlock that it deserves its own call. The lease operations might
be encodable as ioctls, but if this is too difficult or tricky, adding
system calls for these is acceptable.
Orthogonality:
The project has several components of rough orthogonality:
- The mpin code which consists of marking vma's in the mmap code
- The page writeback code for 1ffs (writeback_inode)
- The lease code, which requires study of memory management in the
kernel, and can be FS specific in this case
- The object allocation library
- The object store library
Each component can be developed relatively independently with some
integration testing later on. The libraries in particular can be developed
on a fake 'mpin' interface and be written and tested in advance.
Team Configurations:
Larger teams are responsible for more work. The most important aspect of
the project is the kernel component, which is mandatory, and the user-level
object store implementation built on top is for larger teams.
2 person team:
All kernel components
3 person team:
All kernel components and the Object Allocation Library
4 person team:
All kernel components and the Object Allocation Library, and the
Object Store Library.
Conclusion:
The project addresses an open research question, and brings to light some
interesting opportunities to reduce message passing costs in Linux like OSes
for user-level storage layers. Providing correct msync and mpin semantics
for user-level storage software is a major step forward too. Good luck!
Sponsor: FSL PhD student Richard "Rick" Spillane. You may contact Rick as
well to get more details.
------------------------------------------------------------------------------
Union/Overlay Mounts:
First, read this big picture doc:
http://valerieaurora.org/union/design.txt
Sponsor: Valerie "Val" Aurora (vaurora AT redhat DOT com). Note: this is an
external sponsor of the project, and she's busy. So I ask that you contact
me (Erez Zadok) first, before contacting her. I can probably answer most of
your questions.
Val has proposed three projects, described below. I marked them [A], [B],
and [C]. The main benefit of picking one of these projects is that any good
work you do has a good chance to get integrated into Val's kernel.org git
tree; eventually her code will be integrated in the mainline Linux kernel,
and your name will forever be famous. :-)
Team sizes can be anywhere from 1-4 for this.
[A] Whiteouts/fallthrus
-----------------------
Whiteouts and fallthrus are special kinds of directory entries. A file
system serving as the topmost writable layer of a union mount must implement
both whiteouts and fallthrus.
A whiteout says that this directory entry is logically deleted, even if a
matching directory entry exists in some lower layer. On a lookup, the file
system must return a dentry with the whiteout flag set. To create a
whiteout, the VFS calls a file system provided operation:
int (*whiteout) (struct inode *, struct dentry *, struct dentry *);
(A directory may be flagged as opaque, which prevents any lookup from going
to the lower layer - logically a whiteout for every entry not present in the
top layer directory.)
A fallthru is an explicit instruction to lookup this directory entry in a
lower layer. A fallthru overrides an opaque flag on the directory. A
fallthru is useful as a placeholder for readdir(). It may also be useful
for implementing rename()s more cheaply. On a lookup, the file system layer
with the fallthru returns a dentry with the fallthru flag set. The VFS then
knows to call lookup on the lower layer file systems. To create a fallthru,
the VFS calls a file system provided operation:
int (*fallthru) (struct inode *, struct dentry *);
We have three ideas for how to implement fallthrus and whiteouts in a
typical file system. In general, it ends up being an backwards-incompatible
change in on-disk format, but it is possible to use clever tricks to make it
compatible, much like the ext3 htree directory entries. However, most
writable overlays will be created anew, so backwards compatibility is not a
requirement.
1. Normal directory entry with a special inode number.
2. Normal directory entry with a special file type.
3. New kind of symlink with a flag denoting whiteout or fallthru.
(Note that tmpfs and other in-memory file systems simply use dcache entries
directly, so don't need to implement anything beyond that.)
We've implemented #2 for ext2 and jffs2. We implemented #1 for ext2/3 but
found it to be less elegant. #3 is a new idea and may be more elegant. I'd
like you to implement #3 for any of ext2, ext3, btrfs, xfs, jffs2.
For background, read the patches implementing whiteouts/fallthrus for tmpfs
and ext2. Look for appropriate commits from branch "overlay" of:
http://git.kernel.org/?p=linux/kernel/git/val/linux-2.6.git;a=summary
[B] Really really read-only mount for NFS
-----------------------------------------
One of the key assumptions of union mounts is that the layers below the
topmost writable layer are truly read-only and cannot become read-write
while the top layer is mounted. Implementing this restriction - a "really
really read-only" mount - is trivial for local file systems. However, for
NFS file systems, we have to communicate this requirement to the server, and
the server has to notify us if it can't make this guarantee any more (where
notification can consist of something as brute force as making the whole
mount go away).
This will require (1) defining a new NFS mount option, (2) making the server
up the really really read-only count on the exported file system, (3)
defining some sort of failure mode for the case of the server rebooting and
coming up with the exported file system writable or the like.
See the section entitled "Mount" in:
http://valerieaurora.org/union/design.txt
And this mailing list discussion:
http://markmail.org/message/3mkgnvo4pswxd7lp
[C] Performance testing and debugging
-------------------------------------
The Union Mounts code is very young. It has bugs. And it's hardly
efficient. Run as many regression suites as you can, using various Union
Mount configurations on top of various lower file systems. Profile the code
to find bottlenecks (using profiling tools such as prof/gprof, Oprofile,
OSprof, etc.). Fix those bottlenecks. Fix bugs you find.
Note: it might be easier to include testing/optimizations as part of
sub-projects [A] and [B].
------------------------------------------------------------------------------
<<>>
* Revision History
v1: original.
v2: detailed dedupfs [11/3/09]
v3: added three union mounts related projects. [11/4/09]
v4: clarification on securing checksum db in ext2sum [11/14/09]