Semantic Web Information Processing:
         from Semistructured Data to Structural Knowledge

                          Guizhen Yang
                http://www.cs.sunysb.edu/~guizyang/
                  Department of Computer Science
           State University of New York at Stony Brook


The vision of the Semantic Web is to define and share machine-processable
data on the Web, which will enable a variety of automated
tasks ranging from information search to data integration to content
management to Web services. This talk will present our approach to
realizing the Semantic Web vision by addressing two fundamental
issues: (1) creation of semantic content by transforming unstructured
Web documents into structured data; and (2) infrastructure for reasoning
with semantically enriched data.

In the first part of the talk, I will focus on the creation of semantic
content from Web documents. Specifically, I will describe novel
techniques that extract data from Web documents with a high degree of
precision and recall. The theory behind these techniques is
based on the concept of unambiguity in automatic learning of
extraction patterns and the notion of resilience to changes in Web
documents. I will present complexity results and efficient algorithms
for learning unambiguous and resilient extraction patterns, as well as
experimental results that demonstrate the effectiveness of these
techniques in practice.

In the second part of the talk, I will deal with infrastructure for
reasoning with semantically enriched data. I will present my work on
the design and implementation of Flora-2. Flora-2 unifies the
well-known F-logic, HiLog, and Transaction Logic into one coherent
rule-based, object-oriented knowledge representation system. I will
discuss the engineering issues of language and compiler design,
system architecture, and query optimization, as well as the theoretical
issues related to the new semantics and algorithms for nonmonotonic
multiple inheritance, covering both value and code inheritance.

Flora-2 (and its predecessor Flora-1) has been used in a variety of
application domains, ranging from Web agents to information
integration in bioinformatics to ontology management to building CASE
systems. Since its last alpha release less than a year ago, it has been
downloaded hundreds of times and has attracted a small community of
devoted users. A beta release is planned for the near future. The source
code of Flora-2 is
freely available at http://flora.sourceforge.net/.

At the end of the talk I will outline ongoing and future research on
the Flora-2 system, tree pattern query aggregation, mining semantic
structures of Web documents, and security policy management.