CSE 657: Design and Analysis Seminar, Spring 2004 Schedule


Schedule

  No. Date Speaker Topic Links
1. Feb 13 Patrick McDaniel Authentication in Interdomain Routing  
2. Feb 18 Anita Wasilewska Protein Secondary Structure Prediction (Part 1) Paper
3. Feb 25 Anita Wasilewska Protein Secondary Structure Prediction (Part 2) Paper
4. Mar 03 Beata Sarna-Starosta Policy Analysis for Security-Enhanced Linux Paper
5. Mar 10      
6. Mar 17 Larry Koved Beyond Code Generation: Using compiler technology for security and software defect identification Paper
7. Mar 24 Ping Yang Pi-calculus Verification  
8. Mar 31 IV Ramakrishnan    
*** Apr 07 Spring Recess    
9. Apr 14      
10. Apr 21      
11. Apr 28 IV Ramakrishnan On the Power of Semantic Partitioning of Web Documents Paper
12. May 05      
13. May 26 Guizhen Yang The Complexity of Mining Maximal Frequent Itemsets and Maximal Frequent Patterns Paper


Abstracts


AUTHENTICATION IN INTERDOMAIN ROUTING

Patrick McDaniel
AT&T Labs -- Research

Attacks against Internet routing are increasing in number and severity. Contributing greatly to these attacks is the absence of origin authentication: there is no way to validate if an entity using an address has the right to do so. This vulnerability is not only a conduit for malicious behavior, but indirectly allows seemingly inconsequential misconfigurations to disrupt large portions of the Internet. This talk discusses the semantics, design, and costs of origin authentication in interdomain routing. A formalization of address usage and delegation is presented and broad classes of cryptographic proof systems appropriate for origin authentication are considered.

The costs of origin authentication are largely determined by the form and stability of the served address space. However, prior to this work, little was known about the relevant characteristics of address use on the Internet. Developed from collected interdomain routing data and presented in this talk, our approximate delegation hierarchy shows that current IP address delegation is dense and relatively static. One notable result shows that as few as 16 entities are the source of 80% of the delegation on the Internet. We further show via simulation that these features can be exploited to efficiently implement Internet-scale origin authentication. The talk concludes with thoughts on major open problems in routing security and other related future work.
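
As a rough illustration of the underlying idea only (not of the formalization or the cryptographic proof systems presented in the talk), origin authentication can be viewed as checking that the AS originating a prefix holds it through a chain of ever-narrower address delegations descending from a root authority. The organizations, prefixes, and function in this sketch are entirely hypothetical.

# Toy sketch: origin authentication as validation of a delegation chain from a
# root authority (e.g., IANA) down to the AS that originates a prefix in BGP.
# In a real system each delegation would carry a cryptographic proof.
from ipaddress import ip_network

# (delegator, delegatee, prefix) -- illustrative data only
delegations = [
    ("IANA", "RIR-A", "12.0.0.0/8"),
    ("RIR-A", "ISP-X", "12.1.0.0/16"),
    ("ISP-X", "AS65001", "12.1.4.0/24"),
]

def delegation_authorizes(announced_prefix, origin_as, root="IANA"):
    """Check that a chain of ever-narrower delegations starting at `root`
    ends at `origin_as` and covers `announced_prefix`."""
    target = ip_network(announced_prefix)
    holder, covering = root, ip_network("0.0.0.0/0")
    while holder != origin_as or not target.subnet_of(covering):
        for delegator, delegatee, prefix in delegations:
            net = ip_network(prefix)
            # follow a delegation that narrows the covering prefix toward the target
            if delegator == holder and net.subnet_of(covering) and target.subnet_of(net):
                holder, covering = delegatee, net
                break
        else:
            return False  # the chain cannot be extended toward the announced prefix
    return True

print(delegation_authorizes("12.1.4.0/24", "AS65001"))  # True: a full chain exists
print(delegation_authorizes("12.2.0.0/16", "AS65001"))  # False: no delegation covers it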


PROTEIN SECONDARY STRUCTURE PREDICTION (PSSP) - A MODEL, DATA-SETS, METHODS, AND A META-CLASSIFIER

Anita Wasilewska (1), Victor Robles Forcada (2), Pedro Larranaga Mugica (3)
(1) Computer Science Department, SUNY at Stony Brook, NY, USA
(2) Department of Computer Science, Technical University of Madrid, Madrid, Spain
(3) Department of Computer Science, University of the Basque Country, San Sebastian, Spain

Techniques for the prediction of protein secondary structure provide information that is useful both in ab initio structure prediction and as an additional constraint for fold-recognition algorithms. Knowledge of secondary structure alone can help the design of site-directed or deletion mutants that will not destroy the native protein structure. However, for all these applications it is essential that the secondary structure prediction be accurate, or at least that the reliability of the prediction for each residue can be assessed. If a protein sequence shows clear similarity to a protein of known three-dimensional structure, then the most accurate way to predict its secondary structure is to align the two sequences with standard dynamic programming algorithms, since homology modelling is much more accurate than secondary structure prediction at high levels of sequence identity.

Secondary structure prediction methods are of most use when sequence similarity to a protein of known structure is undetectable. Accordingly, it is important that there is no detectable sequence similarity between sequences used to train and test secondary structure prediction methods.

In our talk we first define a formal mathematical model for the PSSP problem and use it to present a uniform view of some of the newest supervised classification methods and to discuss their results.

We then discuss the most recent PSSP data-sets and the first, second, and third generations of PSSP algorithms.

Finally, we evaluate and compare the predictive accuracy of nine PSSP algorithms with that of our recently built meta-classifier. The meta-classifier uses evolutionary information and combines the predictions of the nine classifiers. It gives a 2% improvement in predictive accuracy over the best single method and 15% over the worst one.
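
The abstract does not spell out how the meta-classifier combines the nine base predictions; the sketch below shows one common combination scheme, a per-residue majority vote over three-state labels, purely as an illustration. The vote rule, sequences, and number of base classifiers shown here are assumptions, and the real meta-classifier also exploits evolutionary information.

# Minimal sketch of a per-residue majority vote over three-state secondary
# structure labels: H (helix), E (strand), C (coil).
from collections import Counter

def combine_predictions(base_predictions):
    """base_predictions: one H/E/C string per base classifier, all the same length.
    Returns a consensus H/E/C string with one label per residue."""
    length = len(base_predictions[0])
    assert all(len(p) == length for p in base_predictions)
    consensus = []
    for i in range(length):
        votes = Counter(p[i] for p in base_predictions)
        consensus.append(votes.most_common(1)[0][0])  # most frequent label at residue i
    return "".join(consensus)

# Three hypothetical base classifiers predicting a 10-residue fragment:
preds = ["HHHHCCCEEE",
         "HHHHHCCEEE",
         "HHHCCCCEEE"]
print(combine_predictions(preds))  # -> "HHHHCCCEEE"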


POLICY ANALYSIS FOR SECURITY-ENHANCED LINUX

Beata Sarna-Starosta and Scott Stoller
Computer Science Department, SUNY at Stony Brook, NY, USA

Security-Enhanced Linux (SELinux) extends Linux with a flexible mandatory access control mechanism that enforces security policies expressed in SELinux's policy language. Determining whether a given policy meets a site's high-level security goals can be difficult, due to the low-level nature of the policy language and the size and complexity of SELinux policies. We propose a logic-programming-based approach to analysis of SELinux policies. The approach is implemented in a tool that helps users determine whether a policy meets its goals.
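
As a toy illustration of the kind of question such a logic-programming analysis answers (this is not the authors' tool, and it greatly simplifies SELinux's policy language), one typical query is information-flow reachability: given allow rules granting domains read or write permission on object types, can data written into one type eventually reach another? The rules and type names below are invented.

# (source domain, target object type, class, permission) -- illustrative rules only
allow_rules = [
    ("user_t",   "tmp_t",    "file", "write"),
    ("backup_t", "tmp_t",    "file", "read"),
    ("backup_t", "shadow_t", "file", "write"),
]

def can_flow(src_type, dst_type):
    """Information flows from object type A to object type B if some domain
    can read A and write B; chase such steps transitively."""
    reachable, frontier = {src_type}, [src_type]
    while frontier:
        a = frontier.pop()
        readers = {d for (d, obj, _cls, perm) in allow_rules if perm == "read" and obj == a}
        for (d, obj, _cls, perm) in allow_rules:
            if perm == "write" and d in readers and obj not in reachable:
                reachable.add(obj)
                frontier.append(obj)
    return dst_type in reachable

print(can_flow("tmp_t", "shadow_t"))  # True: backup_t reads tmp_t and writes shadow_t
print(can_flow("shadow_t", "tmp_t"))  # False with these rules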


BEYOND CODE GENERATION: USING COMPILER TECHNOLOGY FOR SECURITY AND SOFTWARE DEFECT IDENTIFICATION

Larry Koved
Java & Web Services Security, IBM T.J. Watson Research Center, USA

Compiler and optimization technology has a rich history. Traditional use of the technology has been for code generation through a variety of intra- and inter-procedural code analyses. We have been exploring the use of the same basic techniques to address hard problems in security and software defect identification. A number of factors make the analysis problems more challenging. A significant trend in software development is the reuse of components -- from libraries to entire applications. Often the source code for these components is not available. The size and complexity of the software to be analyzed can be quite large, making the analysis even more challenging. This talk will give a high-level overview of two of our projects that use whole-program control and data flow analyses:

We have been developing a variety of security analyses of Java applications (e.g., servlets, applets, applications). For example, one of the most difficult challenges for an application developer or system administrator is identifying the security authorization requirements of Java applications ("Permissions" to grant the application). The traditional approach is either to turn off security or to run test cases to observe authorization failures and add the required Permissions to the authorization database until the application no longer fails. This talk describes a better approach that automatically computes the authorization policy for an application.
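
A hedged sketch of the underlying idea, not the actual IBM analysis: if static analysis yields a call graph and a map from security-sensitive methods to the Java Permissions they demand, then the authorization policy for an entry point is the union of permissions over everything reachable from it. All method names and permission strings below are illustrative.

# Toy call graph: method -> methods it calls
calls = {
    "Servlet.doGet":  ["Config.load", "Logger.log"],
    "Config.load":    ["Files.read"],
    "Files.read":     [],
    "Logger.log":     ["Socket.connect"],
    "Socket.connect": [],
}

# Permissions required directly by a method (i.e., by the sensitive API it wraps)
direct_perms = {
    "Files.read":     {'FilePermission "/etc/app.conf", "read"'},
    "Socket.connect": {'SocketPermission "loghost:514", "connect"'},
}

def required_permissions(entry):
    """Union of direct permissions over all methods reachable from `entry`."""
    seen, stack, perms = set(), [entry], set()
    while stack:
        m = stack.pop()
        if m in seen:
            continue
        seen.add(m)
        perms |= direct_perms.get(m, set())
        stack.extend(calls.get(m, []))
    return perms

for p in sorted(required_permissions("Servlet.doGet")):
    print(p)
# -> FilePermission "/etc/app.conf", "read"
#    SocketPermission "loghost:514", "connect"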

Another project, called SABER, uses whole-program analysis to identify coding defects in application code. SABER identifies code that may result in performance problems or outright failure of an application, or that violates "best practices" (coding conventions). For example, a performance problem may result when an application uses a class or method that is known to perform poorly in a particular context (e.g., calling the garbage collector during garbage collection). This talk will give an overview of the types of defects that SABER can discover.


ON THE POWER OF SEMANTIC PARTITIONING OF WEB DOCUMENTS

IV Ramakrishnan
Stony Brook University

An enormous amount of semantic data is still being encoded in HTML documents. Identifying and annotating the semantic concepts implicit in such documents will make them directly amenable to Semantic Web processing. This talk will describe a highly automated technique for partitioning HTML documents into semantic structures. Semantic partitioning has several important and powerful implications in practice. For example, it eases the task of formulating queries to retrieve data from Web documents. Knowledge of the schema made explicit via semantic partitioning facilitates automated transformation of HTML documents into more semantics-oriented document formats such as RDF and XML. Yet another application is audio-browsable Web content: by putting a dialog interface on the content of a Web page that has been reorganized based on knowledge of its schema, users, especially visually disabled individuals, can browse the content using audio without suffering from information overload. Finally, semantic partitioning can enable the creation of self-repairable wrappers, the technology that provides a database-like interface to Web documents.

THE COMPLEXITY OF MINING MAXIMAL FREQUENT ITEMSETS AND MAXIMAL FREQUENT PATTERNS

Guizhen Yang
Department of Computer Science and Engineering
University at Buffalo, The State University of New York

Since the introduction of the Apriori algorithm about a decade ago, the field of data mining has flourished into a research area of significant technological and social importance, with applications ranging from business intelligence to security to bioinformatics. However, in spite of the multitude of data mining algorithms developed, not much effort has been made on the theoretical front to study the inherent complexity of data mining problems themselves. A thorough investigation of these fundamental problems is greatly needed, since it will not only provide invaluable insights into many data mining problems but will also shed new light on the characteristics of different data mining algorithms and benchmark datasets.

In this talk we seek to provide a theoretical account of the computational difficulty of a genre of data mining problems that deal with maximal frequent patterns, from the perspective of counting the number of solutions. We present the first formal proof that the problem of counting the number of distinct maximal frequent itemsets in a database of transactions, given an arbitrary support threshold, is #P-complete, thereby providing strong theoretical evidence that the problem of mining maximal frequent itemsets is NP-hard.
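
To make the counted objects concrete, here is a brute-force illustration on a tiny, made-up transaction database: an itemset is frequent if it is contained in at least a threshold number of transactions, and maximal frequent if no proper superset is also frequent. The number of maximal frequent itemsets is exactly the quantity whose computation the talk shows to be #P-complete; the exhaustive enumeration below is feasible only for toy inputs.

from itertools import combinations

transactions = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c"},
    {"b", "c"},
]
min_support = 2
items = sorted(set().union(*transactions))

def support(itemset):
    # number of transactions containing the itemset
    return sum(itemset <= t for t in transactions)

# all frequent itemsets: support at least min_support
frequent = [
    frozenset(s)
    for r in range(1, len(items) + 1)
    for s in combinations(items, r)
    if support(frozenset(s)) >= min_support
]
# maximal frequent itemsets: no proper superset is also frequent
maximal = [s for s in frequent if not any(s < t for t in frequent)]

print([sorted(s) for s in maximal])  # [['a', 'b'], ['a', 'c'], ['b', 'c']]
print(len(maximal))                  # 3 -- the count that is #P-complete to compute in general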

We will also extend our complexity analysis to other similar data mining problems dealing with complex data structures, such as sequences, trees, and graphs, which have attracted intense research interest in recent years. Normally, in these problems a partial order among frequent patterns can be defined in such a way as to preserve the downward closure property, with maximal frequent patterns being those without any successor with respect to this partial order. We investigate several variants of these mining problems in which the patterns of interest are subsequences, subtrees, or subgraphs, and show that the associated problems of counting the number of maximal frequent patterns are all either #P-complete or #P-hard.

This talk will be self-contained. No prior knowledge of data mining or complexity theory will be assumed of the audience.


Last updated on Mar 03, 2004 by Radu Grosu