|
CSE307 Spring 2007 Stony Brook |
Principles of Programming Languages
Annie Liu Assignment 3 |
Handout A3 Feb. 20, 2007 Due Mar. 1 |
Semantic Analysis --- Inferring Types
This assignment has two goals: (1) experimenting with automatically inferring information about data in the input table, (2) understanding a state machine model, i.e., a semantics, given to the data in the table and checking static semantics.
You might have thought about this while doing Assignment 2: some information about the input data could be inferred automatically, saving the user from having to specify everything in a separate file. You might also have wondered what data to put in the table and what the data should denote or mean; we did not said anything about this in Assignment 2. We will do both here.
Inferring types
We will generate a declaration file, in the same format as the declaration file in Assignment 2, using results from the following analysis of the data:
First, we determine the number of columns. To ensure that we don't loose any data, we find a row with the maximum number of columns, and use that number as the number of columns.
Second, we generate proper, complete, and unique column names based on the data in the first row and the number of columns. We first make column names proper ids (i.e., strings of letters, digits, and "_" not starting with a digit), by replacing every offending symbol with "_". Then for each column where a column name is missing, we generate "Column_i", where "i" is the index of that column in the table starting at 0 for the left. Finally for all columns that have a same name, we append the name with "_k", where "k" is the index into columns that have the same name, starting at 1 from the left.
Third, we infer whether all data in a column, except for the first row, is int, float, matching some regular expression, or otherwise a string of anything. Specifically, (1) if all data in a column are integers, we conclude the type for that column to be int; (2) otherwise, if all data are floating point numbers, the type is float; (3) otherwise, if all data are one of two possible values, the type is a simple kind of regular expression for the two possible values; and (4) otherwise, the type is string.
Semantic analysis
We now give a semantics, a state machine model, to the data in the table. We consider a state machine to be a set of states, a set of transitions from states to states labeled with input, and a start state. We interpret data in the first row, except for the first column, as the input labels. We interpret data in the first column, except for the first row, as the states. We interpret any other data item in the i-th row and j-th column as the target state of a transition whose source state is the i-th state and whose label is the j-th column. We interpret the data item in the first row and first column as the start state. For simplicity, we use integers as ids for states.
For example, the following table denotes a machine with two states, 1 and 2, and four possible input labels, a, b, c, and d; the start state is 1; in state 1, it can repeatedly take an input a and stay in state 1, or take an input b and go to state 2, where in state 2, it can repeatedly take an input c and stay in state 2, or take an input d and go to state 1.
1 a b c d 1 1 2 2 2 1
First, we make input labels in the first row proper as above, and we make them complete and unique in a different way here, and check whether the state machine is deterministic. Specifically, for all missing column names for input labels, we use "Other". For multiple columns with a same input label as the column name, we merge data in these columns into one column; if any two of the columns have different data in any row, then the state machine is not deterministic, and we flag an error.
Second, we check that states in the first column are proper and complete, make the states unique, and check that the state machine is deterministic. Specifically, we check that all rows in the first column have integer values and flag an error otherwise. For multiple rows with a same value as the state id, we merge data in these rows into one row; if any two of the rows have different data in any column, then the state machine is not deterministic, and we flag an error.
Finally, we check that the start state and the states in the other rows and columns are in the set of states in the first column except for the first row. For data that are missing from these other rows and columns, we assume that they mean an implicit trap state.
When run, your program should take 3 arguments plus an optional argument from command line: input data file name, output declaration file name, output data file name, and an optional tag "-s"; (1) read data from the input data file; (2) if the tag "-s" is not present, generate the declarations in the declaration file, and write the cleaned-up data to the output data file, with columns aligned as before, and with data in the first row replaced by the generated column names; and (3) if the tag "-s" is present, meaning to treat the data as a state machine, check static semantics and process the data as described above, and write the resulting data to the output data file, with columns aligned, with proper input labels used in the first row, and possibly with some columns and rows merged. When there are errors in the input data, try to give the best error messages possible and finish all the processing possible. Again, you should not define classes for this entire assignment.
Extra credit suggestions
Here are some ideas. (1) Think of other kinds of types that can be inferred and represented as regular expressions and infer them, e.g., one of up to k possible values for any given k>1. (2) Think of other kinds of static semantics that can be checked and check them, e.g., whether the trap state is never reachable from the start state following the transitions. (3) Think of other interesting and/or useful things to check and/or to do, and do them; feel free to discuss with me.
Handins
Hand in everything electronically, using Blackboard, by 5pm on the due date. Your handin should include your README file, code, test data, and anything else you have to show your work.
Grading
This assignment will be graded based on 100 points, allocated in proportion to the estimated amount of work for each part. You may do this assignment in a team of two people; the two people will receive the same points. Exceptionally well thought-out and well written work will receive appropriate extra credit. Extra credit work will be graded based on the estimated amount of extra work.