When we reflect on the ingredients of today’s success of the World Wide Web, one ingredient seems essential: easy access to information from diverse sources, each publishing without central control. The enabling technology for this ingredient has arguably been the use of a single universal representation format, HTML and its variants. This allows easy access requiring only some form of HTML browser. Everyone can collect and aggregate information, e.g., for indexing, as Google does, or for harvesting personal information, as spam bots do.

Surprisingly, this picture changes when we move to the Web 2.0, the Semantic Web, or whatever other vision of the next-generation Web is currently en vogue. There we happily build islands: an XML island, an RDF island, a JSON island, an OWL island, a Topic Maps island, etc. Of course, there is often a legitimate reason to use different representation formats for different kinds of data on the Web. Also, we have already committed significant resources to the deployment of various XML islands and, though to a lesser extent, RDF ones.

Unless we assume that all information fits nicely into one and only one of these islands, we have to consider that, increasingly, Web applications will process not only HTML, but also XML, JSON, RDF, OWL, etc. This is already true for most Web 2.0 applications. When we build such applications, however, we do not want to care about the actual data formats; we want to focus on the task of the application.

In I4, we have developed a solution to this challenge. We argue that query languages such as XSLT, XQuery, SPARQL, XPath, or Xcerpt, which have seen significant success for accessing each of the data islands, should make it easier for the user to access data in different formats. We call for such versatile query languages as tools to bridge the data islands and to allow the integration of data in XML, RDF, Topic Maps, or whatever other format. Though it chose a different solution, the W3C has recognized the importance of such scenarios in the recent GRDDL standard, an approach to bring XML data and microformats to the RDF island.


Contributions

But calling for versatile Web query languages can only be the beginning. Guided by this vision, I4 has developed use cases to illustrate the usefulness of versatile querying, languages to realize the vision, extensions to existing languages and studies of the effect of versatile query constructs on them, as well as evaluation methods to show that versatile querying does not have to come at a price:

  1. Xcerpt has been refined and redesigned by I4 to become the first truly versatile Web query language. We have succeeded in enabling access to multiple Web formats in Xcerpt and, through the CIQCAG algebra, shown that Xcerpt can nevertheless be implemented as efficiently as a specialized language such as XQuery that is restricted to tree-shaped XML data. A prototype for Xcerpt as well as a Web demonstrator are available. We also provide tutorials, Webcasts, and extensive use cases for Xcerpt.

  2. We have studied the versatile nature of Xcerpt along a large number of use cases, some illustrating general usage patterns (“feature” use cases), some implementing real-life applications that benefit from or rely on access to data in multiple formats and diverse representations (“application” use cases). A list of I4 use cases is available, including source code for most cases.

  3. Basic reasoning abilities are provided in Xcerpt (similar to those in datalog or logic-programming languages). Often, applications require specialized reasoners, e.g., for time or location data, for ontology reasoning, or for accessing legacy sources such as bioinformatics databases. We have proposed a versatile, blackbox integration of such external reasoners and predicates into rule languages in general, and answer-set programming in particular, in the form of the dlvhex language. A prototype for dlvhex as well as some use cases and a Web demonstrator are available.

  4. Xcerpt is able to access data in RDF, but exposes some of the technical issues involved in the access scheme to the user. To provide even better access to RDF, we are studying further (syntactical) extensions to Xcerpt along practical RDF use cases as well as the effect of blank nodes, one of the few novelties of RDF, on rule languages such as Xcerpt. We investigate that effect in a formal rule language called RDFLog, an extension of datalog with blank nodes in facts and rule heads. We show that blank nodes come, essentially, for free for non-recursive or weakly recursive (no recursion through blank nodes) programs, but that in the presence of recursion through blank nodes they are as expressive as arbitrary function symbols. A prototype for RDFLog as well as a Web demonstrator are available.

  5. Calling for versatile Web languages and proposing languages that consider the requirements of that vision must be complemented by a consideration of the price of such flexibility and expressiveness. For that, we have developed a formal foundation for any Web query language, called CIQLog. We show that the semantics of XQuery, XPath, Xcerpt, and SPARQL (as well as most other Web query languages) can be expressed conveniently in CIQLog. For CIQLog queries, we then define a query algebra, called CIQCAG, that specifies an evaluation method for CIQLog queries and thus for queries in any of the above languages. CIQCAG significantly advances the current frontier of highly scalable (linear time and space) tree queries by defining a new class of data graphs (a proper superclass of trees) on which queries can be evaluated as efficiently as on tree data. Essentially, we show that certain kinds of “graphness” come for free regarding evaluation complexity. Furthermore, CIQCAG scales to arbitrary shapes of queries and data, yet provides for each restricted class the same or better complexity than all previous approaches. With CIQCAG, we have an evaluation approach that allows versatile query languages to be as efficient as specialized query languages on each kind of data. For example, with CIQCAG, Xcerpt on tree queries and tree data is as efficient as XPath on tree data and thus space optimal. A full specification of CIQLog and CIQCAG is available, as well as a prototype for the translation of the above languages to CIQLog. The CIQCAG prototype is still under development.

  6. Finally, we investigate techniques that allow us to encapsulate rules or queries, shielding their effect from other rules or queries in the same program. This is already desirable for standard query languages, but becomes essential for versatile ones, where parts of a program are frequently concerned with access to data in different formats and where manually controlling interactions becomes an unacceptable burden for the user. A first step is a generic module system for rule languages, applied particularly to Xcerpt. A prototype of that module system (realized using I3’s Reuseware) is available together with a number of use cases.
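
Item 4 above states that blank nodes in rule heads come essentially for free as long as there is no recursion through them. As a rough illustration only, the following Python sketch represents a blank node in a rule head by a term built from the variable it depends on. This is not RDFLog syntax; the Skolem-style encoding, the example triples, and the function name apply_rule are all assumptions made for this sketch.

```python
def apply_rule(facts):
    """Hypothetical rule: for every triple (x, 'type', 'Person'),
    derive (x, 'hasName', _:b), where _:b is a blank node depending on x."""
    derived = set()
    for (s, p, o) in facts:
        if p == "type" and o == "Person":
            # Assumption: the blank node is encoded as a Skolem-style term over x.
            blank = ("_:name_of", s)
            derived.add((s, "hasName", blank))
    return derived


# Made-up example data, not taken from any I4 use case.
facts = {
    ("alice", "type", "Person"),
    ("bob", "type", "Person"),
    ("rex", "type", "Dog"),
}

derived = apply_rule(facts)

# The rule body never matches the triples it derives, so a second pass
# over the enlarged fact set adds nothing new.
second_pass = apply_rule(facts | derived)
```

In this sketch one pass already reaches the fixpoint, because the rule does not recurse through the blank node. A rule that fed its own blank nodes back into its body would keep building deeper nested terms, which hints at why recursion through blank nodes behaves like function symbols.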

In addition to the above contributions, I4 has been one of the most prolific working groups in REWERSE regarding its contributions to summer schools, tutorials, industry presentations, etc. A list of course material and tutorials is available on the publication page.


Bridging Data Islands

Vision of Versatility

Bry et al., “Querying the Web Reconsidered: Design Principles for Versatile Web Query Languages”. IJSWIS, 2005, and Semantic Web-Based Information Systems: State-of-the-Art Applications, chapter 8. CyberTech Publishing, 2007.

Bry et al., “Let’s Mix It: Versatile Access to Web Data in Xcerpt”. In IIWeb, 2006.


Core Principles

  1. bridging data islands on the Web by accessing RDF, XML, JSON, etc. in a single language

  2. localize effect of language constructs to enable reuse and format versatility: transparency, answer-closedness, rules, reasoning, modules

  3. format, schema, and representational versatility

  4. incompleteness: focus on what is essential
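
The first principle, accessing several formats through a single language, can be sketched in a few lines of Python. This is only an illustration under stated assumptions and not how Xcerpt works: the (label, children) term model, the helper names from_xml, from_json and select, and the two sample documents are all invented for this example.

```python
import json
import xml.etree.ElementTree as ET


def from_xml(text):
    """Convert an XML document into nested (label, children) terms."""
    def convert(elem):
        children = [convert(child) for child in elem]
        content = (elem.text or "").strip()
        if content:
            children.insert(0, content)  # text content becomes a leaf string
        return (elem.tag, children)
    return convert(ET.fromstring(text))


def from_json(text, root_label="root"):
    """Convert a JSON document into the same (label, children) terms."""
    def convert(label, value):
        if isinstance(value, dict):
            return (label, [convert(k, v) for k, v in value.items()])
        if isinstance(value, list):
            return (label, [convert("item", v) for v in value])
        return (label, [str(value)])
    return convert(root_label, json.loads(text))


def select(term, label):
    """Yield leaf strings directly under every subterm with the given label,
    at any depth. The query itself is independent of the source format."""
    if isinstance(term, str):
        return
    tag, children = term
    if tag == label:
        for child in children:
            if isinstance(child, str):
                yield child
    for child in children:
        yield from select(child, label)


# Made-up sample documents in two different formats.
XML_DOC = "<library><book><title>Data on the Web</title></book></library>"
JSON_DOC = '{"book": {"title": "Foundations of Databases"}}'

titles = [
    title
    for doc in (from_xml(XML_DOC), from_json(JSON_DOC))
    for title in select(doc, "title")
]
```

The same select call runs unchanged over both documents; only the loading step knows about the format. A versatile query language aims to hide even that step from the user.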