CSE307 Spring 2007 Stony Brook
Principles of Programming Languages
Annie Liu Assignment 2
Handout A2 Feb. 8, 2007 Due Feb. 20
Syntax Checking --- Matching Patterns
This assignment has three goals: (1) writing programs in Python and comparing Python with the languages you used in Assignment 1, (2) checking the syntax of the input data and assisting this task with separate declarations, and (3) exploiting pattern matching based on regular expressions for the checking.
You might have thought about this while doing Assignment 1: a lot of things can go wrong with the format of input data, and the desired format is often not clear from the data. In fact, many of you asked various questions about this, and I told you to assume that the data has no such problems and that we would deal with them in the next assignment. So now we will check for and fix syntax errors, and furthermore assist this task with separate, additional declarations.
Programming in Python
Before doing these, you are asked to first write a program for reading tables, as specified in Assignment 1, but using Python this time, and then compare Python with the other languages you used in Assignment 1 for this purpose. Include the comparison in your README file. If you already used Python in Assignment 1, you should state so clearly, and you should try to improve on that program, or at least copy it, for this assignment. There is one additional requirement though: do not define classes, i.e., avoid using object-oriented features.
You may use a different language than Python for this assignment if you are so inclined, but you must justify why it is better to use than Python, in concrete terms: "I need to do a lot of bla-bla-bla, and those can be done in such simple and clear bla-bla-bla ways in the language I choose".
Syntax checking
First, the input is supposed to be data separated by newlines and tabs, for rows and columns, respectively, and each row is supposed to have the same number of columns. The latter seems easy to get wrong, especially because tabs are typically invisible. What if rows have different numbers of columns? Should we pad the shorter rows, truncate the longer rows, or do both to some extent?
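One possible answer to these questions can be sketched as follows: pad short rows and truncate long ones to the width of the first row. The function names, the choice of the first row as the reference width, and the empty-string filler are all assumptions made for illustration; a different policy may well be preferable.

```python
def read_rows(text):
    """Split raw text into rows (newlines) and columns (tabs), skipping empty lines."""
    return [line.split("\t") for line in text.splitlines() if line != ""]

def normalize_rows(rows, width, fill=""):
    """Truncate long rows and pad short rows so every row has `width` columns."""
    # For a long row, width - len(row) is negative and [fill] * negative is [].
    return [row[:width] + [fill] * (width - len(row)) for row in rows]

# The first row is assumed to define the expected number of columns.
rows = read_rows("name\tage\nAnn\t30\textra\nBob\n")
fixed = normalize_rows(rows, width=len(rows[0]))
```

Here the row for Ann loses its third field and the row for Bob gains an empty second field. Silently truncating data may not be acceptable in practice, so reporting each adjusted row is worth considering.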
Second, we use data in the first row as column names, but they can be excessively long, contain special symbols, occur multiple times, be missing, etc., which makes it hard to use them as names. Should we truncate the long ones to some length, take out certain special symbols, or add something to differentiate columns with the same name?
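As a rough illustration of one such cleanup policy, the sketch below replaces runs of special symbols with an underscore, truncates to a maximum length, substitutes a positional name for a missing one, and appends a counter to repeated names. The function name clean_names, the limit of 16 characters, and the naming schemes are assumptions, not requirements of the assignment.

```python
import re

def clean_names(names, max_len=16):
    """Return usable, distinct column names under an assumed cleanup policy."""
    cleaned = []
    seen = {}                      # base name -> number of times seen so far
    for i, name in enumerate(names):
        # Replace each run of characters other than letters, digits, and "_".
        base = re.sub(r"\W+", "_", name.strip()).strip("_")[:max_len]
        if not base:
            base = "col%d" % (i + 1)          # missing name: use the position
        count = seen.get(base, 0) + 1
        seen[base] = count
        cleaned.append(base if count == 1 else "%s_%d" % (base, count))
    return cleaned

names = clean_names(["First Name", "age!", "", "age!"])
```

This simple scheme can still collide when the input already contains a name such as age_2, so a more careful version would check each generated name against the names produced so far.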
Third, the data can be numbers, strings of some sort, etc. What should we treat them as for later processing, for example, for comparison, as needed for sorting? Do they need to satisfy certain additional constraints, and if so, how should we check and enforce them?
To solve these problems cleanly, we will employ additional declarations, similar to schemas, types, and interfaces, that give specifications for the above. While these specifications could be added to the input file, we put them in a separate file, like header files in some languages. The advantages are that there is no need to change the data file, and that these separate declarations can be more easily used by processing steps that do not need to know the data.
The declaration file will contain lines of the form "column_name type", where column_name and type are separated by a tab, and type is one of int, float, string, or a Python regular expression, specifying the kind of data in that column.
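A minimal sketch of reading this format is shown below. The function name parse_declarations is hypothetical, and the sketch assumes every non-empty line contains a tab; it splits on the first tab only, so the type field is kept verbatim in case it is a regular expression.

```python
def parse_declarations(text):
    """Return a list of (column_name, type) pairs, in file order."""
    decls = []
    for line in text.splitlines():
        if not line.strip():
            continue                           # skip blank lines
        name, _, kind = line.partition("\t")   # split on the first tab only
        decls.append((name, kind))
    return decls

decls = parse_declarations("id\tint\nprice\tfloat\nzip\t[0-9]{5}\n")
```

A list of pairs, rather than a dictionary, preserves the column order and allows a column name to appear more than once.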
When run, your program should take 3 arguments from the command line: data file name, declaration file name, and output file name. It should read data from the data file; check it against the declarations in the declaration file; and write the cleaned-up data to the output file, with columns aligned as in Assignment 1, and with data in the first row replaced by the column names from the declaration file.
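The overall flow might look like the skeleton below. It is only a sketch: the names main and align are hypothetical, the alignment rule (left-justified columns separated by two spaces) is an assumption standing in for whatever Assignment 1 specified, and the actual checking step is left as a comment. The rows are assumed to have the same number of columns as the declaration file has lines.

```python
import sys

def align(rows):
    """Left-justify each column to its widest cell (assumed alignment rule)."""
    widths = [max(len(row[i]) for row in rows) for i in range(len(rows[0]))]
    lines = ["  ".join(cell.ljust(w) for cell, w in zip(row, widths)).rstrip()
             for row in rows]
    return "\n".join(lines) + "\n"

def main(argv):
    """Read data and declarations, then write the aligned table."""
    if len(argv) != 4:
        sys.stderr.write("usage: %s DATA DECLARATIONS OUTPUT\n" % argv[0])
        return 2
    data_path, decl_path, out_path = argv[1:]
    with open(data_path) as f:
        rows = [line.split("\t") for line in f.read().splitlines() if line]
    with open(decl_path) as f:
        names = [line.split("\t")[0] for line in f.read().splitlines() if line]
    # Checking and cleaning rows[1:] against the declared types would go here.
    rows[0] = names            # first row is replaced by the declared names
    with open(out_path, "w") as f:
        f.write(align(rows))
    return 0

# A script would typically end with: sys.exit(main(sys.argv))
```

Passing argv to main as a parameter, instead of reading sys.argv inside it, makes the function easy to call from tests.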
Pattern matching using regular expressions
Write the syntax checking using Python regular expressions. Do so not only for checking against explicit Python regular expressions, but also for int and float, in the declaration file. In general, you should use library operations as much as possible to make your program clearer (simpler and shorter) and often more efficient.
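One way to treat all declared types uniformly is to map int, float, and string to regular expressions and match every value with the same call. The sketch below assumes particular definitions of what counts as an int or a float (optional sign, optional exponent, no surrounding whitespace), and that a value must match the whole pattern; these are illustrative choices, and the names PATTERNS and matches are hypothetical. It uses re.fullmatch, which requires Python 3.4 or later.

```python
import re

# Assumed definitions of the built-in types as regular expressions.
PATTERNS = {
    "int": r"[+-]?[0-9]+",
    "float": r"[+-]?(?:[0-9]+\.[0-9]*|\.[0-9]+|[0-9]+)(?:[eE][+-]?[0-9]+)?",
    "string": r".*",
}

def matches(kind, value):
    """Check `value` against a declared type or an explicit regular expression."""
    # Any type that is not a built-in name is taken to be a regular expression.
    pattern = PATTERNS.get(kind, kind)
    return re.fullmatch(pattern, value) is not None

ok = matches("float", "1e-3")            # True under the definition above
bad = matches("int", "4.2")              # False: "." is not allowed in an int
custom = matches("[A-Z]{2}[0-9]+", "NY11794")
```

With older Python versions, a similar effect can be had by appending `$` to the pattern and using re.match, with some care for patterns containing top-level alternation.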
Extra credit suggestions
Here are some ideas. (1) Add an optional third column in the declaration file as the default value of that column, and use it when the corresponding data is missing from the data file. (2) When the declaration file contains repeated column names, read data in those columns as a list of values, in the order of their appearance in the input. (3) Think of other interesting and/or useful things to check and/or to do, and do them; feel free to discuss with me.
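For idea (1), a sketch along the following lines may help to get started. It assumes that a missing value shows up either as an empty field or as a row that is too short, and that every declaration line has at least two tab-separated fields; the function names are hypothetical.

```python
def parse_with_defaults(text):
    """Return (column_name, type, default) triples; default is None if absent."""
    decls = []
    for line in text.splitlines():
        if not line:
            continue
        fields = line.split("\t")
        default = fields[2] if len(fields) > 2 else None
        decls.append((fields[0], fields[1], default))
    return decls

def fill_defaults(row, decls):
    """Replace missing values in `row` with the declared defaults, if any."""
    filled = []
    for i, (name, kind, default) in enumerate(decls):
        value = row[i] if i < len(row) else ""     # short row: treat as missing
        if value == "" and default is not None:
            value = default
        filled.append(value)
    return filled

decls = parse_with_defaults("id\tint\ncity\tstring\tunknown\n")
row = fill_defaults(["7"], decls)
```

Columns without a declared default keep their empty value, so the type check can still flag them afterwards.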
Handins
Hand in everything electronically, using Blackboard, by 5pm on the due date. Your handin should include your README file, code, test data, and anything else you have to show your work.
Grading
This assignment will be graded based on 100 points, allocated in proportion to the estimated amount of work for each part. You may do this assignment in a team of two people; the two people will receive the same points. Exceptionally well thought-out and well written work will receive appropriate extra credit. Extra credit work will be graded based on the estimated amount of extra work.