Tim Furche, François Bry:
Tim Furche (editor):
Scalable, Space-optimal Implementation of Web Queries - Part 2: CIQCAG Algebra and Evaluation.

Complete Text [
.pdf, 2.08MB]
In: (I4-D15b)

Tree queries and tree data have proved to be interesting limitations on the quest to scalable query evaluation. In this deliverable, we explore two directions on how to extent the results from tree query languages such as XPath beyond tree data: (1) Can we find an interesting and practically relevant class of (data) graphs that is a proper superset of trees, yet to which algorithms such as twig joins, so far limited to trees or DAGs, can be extended without affecting (time and space) complexity? (2) Second, we observe that data and, particularly, queries are often mostly trees with only limited non-tree parts. However, any nontree part makes most of the techniques discussed above for tree queries inapplicable. Can we integrate the above technologies in the processing of general graph queries in such a way that (often significant) hierarchical components can be evaluated using polynomial algorithms, limiting the degradation of query complexity to non-tree parts of the query? We answer both questions essentially positively by introducing a novel algebra, called CIQCAG, the compositional, interval-based query and composition algebra for graphs. CIQCAG is a fully algebraic approach to querying Web data (be it in XML, RDF, or other semi-structured shape) that is build around two central contributions: (1) A novel characterization of (data) graphs admissible to interval-based compaction. For this new class of data, called continuous-image graphs (or cigs for short), we can provide linear-space and almost linear-time algorithms for evaluating tree queries rivaling the best known algorithms for tree data. (2) An algebra for a two-phase evaluation of queries separating a tree core of the query from the remaining non-tree constraints. The operations of the algebra closely mirror relational algebra (and can, in fact, be implemented in standard SQL), but a novel data structure influenced by Xcerpt’s memoization matrix and the complete answer aggregates approach allows for exponentially more succinct storage of intermediate results for the tree core of a query. Together with a set of operators on this data structure, called sequence map, this enables CIQCAG to evaluate almost-tree queries with nearly polynomial time limiting the degradation in performance to non-tree parts of the query.



	url = {}