Distributed SPARQL querying with Avalanche

Avalanche is a system designed to let a data surfer query the Semantic Web transparently, without making any prior assumptions about data distribution, schema alignment, pertinent data statistics, data evolution, or data presence (or accessibility of servers). Specifically, Avalanche can perform up-to-date SPARQL queries over the indexed Web of Data. Given a query, it first gathers online statistical information about potential data sources, the data distribution, and bandwidth availability. It then plans and executes the query in a distributed manner, trying to provide first answers quickly.
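To make the flow described above more concrete, here is a minimal toy sketch in Python. It is not Avalanche's implementation: the in-memory `ENDPOINTS` dictionary, the host names, and all function names are hypothetical stand-ins for remote servers and their interfaces, and the "query" is a single triple pattern rather than a full SPARQL query with joins. It only illustrates the idea of finding candidate sources at query time, collecting per-source match counts, ordering the work by those counts, and yielding answers as soon as any source responds.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical stand-in for remote servers: host name -> set of (s, p, o) triples.
ENDPOINTS = {
    "host-a": {("alice", "worksFor", "uzh"), ("bob", "worksFor", "eth")},
    "host-b": {("alice", "teaches", "db101")},
    "host-c": {("carol", "teaches", "ml201"), ("carol", "worksFor", "uzh")},
}


def matches(pattern, triple):
    """Return True if the triple fits the pattern; None in the pattern is a variable."""
    return all(p is None or p == t for p, t in zip(pattern, triple))


def discover_sources(pattern):
    """Find hosts that hold at least one triple matching the pattern."""
    return [
        host
        for host, triples in ENDPOINTS.items()
        if any(matches(pattern, t) for t in triples)
    ]


def gather_statistics(hosts, pattern):
    """Collect, per candidate host, how many triples match the pattern."""
    return {h: sum(matches(pattern, t) for t in ENDPOINTS[h]) for h in hosts}


def execute_on_host(host, pattern):
    """Evaluate the pattern on one host (simulates a remote call)."""
    return [t for t in sorted(ENDPOINTS[host]) if matches(pattern, t)]


def run_query(pattern):
    """Plan by match count, run hosts concurrently, yield answers as they arrive."""
    hosts = discover_sources(pattern)
    stats = gather_statistics(hosts, pattern)
    # Simplistic plan (an assumption): submit the most promising hosts first.
    plan = sorted(hosts, key=lambda h: (-stats[h], h))
    with ThreadPoolExecutor(max_workers=max(1, len(plan))) as pool:
        futures = [pool.submit(execute_on_host, h, pattern) for h in plan]
        for future in as_completed(futures):
            # First answers are yielded without waiting for slower hosts.
            yield from future.result()


if __name__ == "__main__":
    for answer in run_query((None, "worksFor", None)):
        print(answer)
```

Because `run_query` is a generator, a caller can stop after the first few answers instead of waiting for every host, which mirrors the goal of providing first answers quickly. The order in which answers arrive depends on which simulated host finishes first.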

We empirically evaluated Avalanche using a dataset of 276 million triples (LUBM) distributed with varying degrees of “messiness” over 100 servers, as well as the FedBench dataset.

A simple view of the Avalanche execution model is illustrated in the following figure, which shows the three major phases: a) source discovery, b) statistics gathering, and c) distributed query execution: