A full and detailed description of the Perm project can be found at this link.
Data provenance is information that describes how a given data item was produced. The provenance of a data item includes source and intermediate data as well as the transformations involved in producing that data item. In the context of relational databases, the source and intermediate data items are relations, tuples and attribute values. The transformations are SQL queries and/or functions on the relational data items.
Existing approaches capture provenance information by extending the underlying data model. This has the intrinsic disadvantage that the provenance must be stored and accessed using a different model than the actual data. In the Perm project we try to overcome this disadvantage by developing a novel provenance management system called Perm (Provenance Extension of the Relational Model) that is capable of computing, storing and querying provenance for relational databases. Perm generates provenance by rewriting transformations (queries). For a given query q, Perm generates a single query that produces the same result as q, extended with additional attributes that store provenance data. An important advantage of the approach used in Perm is that the transformed query is also a regular relational algebra statement. Thus, we can use the full expressive power of SQL to, e.g., query the provenance of data items from the result of the original query, store the transformed query as a materialized view, and apply standard query optimization techniques to the execution of the transformed query. Perm can be used both to compute provenance on the fly (i.e., at query time) and to store provenance persistently for future access. Perm also supports external provenance and incremental provenance computation that reuses stored provenance information.
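To make the rewriting idea concrete, the following is a minimal sketch of what a transformed query could look like for a simple select-project-join query. It is written by hand and run on SQLite from Python, so it does not use Perm itself or reproduce its rewrite rules. The tables `shop` and `sale`, the `prov_` column naming, and the view name `cheap_items_prov` are assumptions made for illustration only.

```python
import sqlite3

# Hypothetical example schema and data, not taken from the Perm project.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE shop (name TEXT, city TEXT);
    CREATE TABLE sale (shop_name TEXT, item TEXT, price INTEGER);
    INSERT INTO shop VALUES ('Merdies', 'Zurich'), ('Joba', 'Bern');
    INSERT INTO sale VALUES ('Merdies', 'pen', 3),
                            ('Merdies', 'ink', 7),
                            ('Joba', 'pen', 4);
""")

# Original query q: items sold for less than 5, with the shop that sold them.
ORIGINAL = """
    SELECT s.name, a.item
    FROM shop s JOIN sale a ON a.shop_name = s.name
    WHERE a.price < 5
    ORDER BY s.name, a.item
"""

# Hand-written "transformed" query: same result columns as q, extended with
# extra attributes holding the source tuples each result tuple came from.
# The prov_<table>_<attribute> naming is an assumption for this sketch.
REWRITTEN = """
    SELECT s.name, a.item,
           s.name      AS prov_shop_name,
           s.city      AS prov_shop_city,
           a.shop_name AS prov_sale_shop_name,
           a.item      AS prov_sale_item,
           a.price     AS prov_sale_price
    FROM shop s JOIN sale a ON a.shop_name = s.name
    WHERE a.price < 5
    ORDER BY s.name, a.item
"""

original_rows = conn.execute(ORIGINAL).fetchall()
provenance_rows = conn.execute(REWRITTEN).fetchall()

# Because the transformed query is ordinary SQL, it can be stored as a view
# and queried like any other relation. SQLite has no materialized views, so
# a plain view stands in for one here.
conn.execute(f"CREATE VIEW cheap_items_prov AS {REWRITTEN}")
cities = conn.execute(
    "SELECT prov_shop_city FROM cheap_items_prov WHERE name = 'Merdies'"
).fetchall()

for row in provenance_rows:
    print(row)
```

Projecting the transformed result onto its first two columns gives back the result of the original query, while the remaining columns identify the contributing `shop` and `sale` tuples. The last query illustrates the point made above: provenance is queried with plain SQL, in the same model as the data.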
An important contribution of Perm is that it covers a far wider range of relational algebra than existing systems. We demonstrated in extensive experiments that Perm can efficiently compute the provenance of complex queries (for example, the queries from the TPC-H benchmark) while inducing only minimal overhead on normal operations.