WOSAK
Workload-Driven Schema Optimization for Analytical Knowledge Graphs
Financed by the CNRS INS2I Program

Graphs are ubiquitous, and their relevance is growing in several domains, e.g., financial analysis, healthcare, and e-governance [1,2]. The need for analyzing large Knowledge Graphs pushed the data management community to experiment with Big Data processing frameworks like Apache Spark [1,5,6,7]. Indeed, Big Data platforms excel at analytical processing. Moreover, they typically expose relational interfaces that ease the definition and execution of analytical workloads and Extract-Transform-Load jobs [2].
Several design decisions are necessary to perform graph processing on top of Big Data frameworks. In practice, it is necessary to choose the most appropriate partitioning technique, determine the best file format to store the data, and define a relational schema to organize the graph data. The latter, in particular, is critical for the efficiency of the analysis because it directly determines the complexity of the workload [4].
Research in this area has highlighted several alternative solutions for encoding graph data using the relational data model. We limit our scope to the Resource Description Framework (RDF) as the data model and SPARQL as the query language. RDF represents a graph as a set of triples of the form (subject, predicate, object). The graph structure, e.g., node types and edge labels, is defined within the graph itself or imported from external vocabularies.
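For concreteness, a toy RDF graph can be written down as a handful of such triples; the following Python snippet is purely illustrative, and all names in it are made-up placeholders (we reuse this toy graph in the sketches below).

    # A toy RDF graph as (subject, predicate, object) triples; every name is
    # an illustrative placeholder, not taken from any real vocabulary.
    triples = [
        (":Alice", ":type",    ":Person"),
        (":Alice", ":knows",   ":Bob"),
        (":Bob",   ":type",    ":Person"),
        (":Bob",   ":worksAt", ":CNRS"),
    ]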
Recent studies showed that results change quite radically when varying the schema, especially the storage formats and partitioning techniques [7]. For example, one relational schema may outperform the others on a particular family of queries while increasing the sparsity of the data and wasting space. In practice, each schema excels in a particular scenario, and there is no one-size-fits-all solution [4,7]. Intuitively, combining multiple schemas to obtain the best of each represents a valuable research direction. Nevertheless, hybrid schema solutions currently require substantial manual data engineering. To the best of our knowledge, no approach exists that guides the decision process by analyzing the data and the query workload. Therefore, we propose to investigate this research direction, organizing the project according to the following steps.
- Quantify how relational schemata impact workload complexity and efficiency.
Quantifying the schema-workload relation is critical for the subsequent project steps. The goal of this initial step is to define a dataset profile that combines traditional schema metadata with features related to graph-specific workloads.
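As a purely hypothetical sketch of what such a profile could contain (all feature names below are our assumptions, to be refined during this step), consider:

    from dataclasses import dataclass

    @dataclass
    class DatasetProfile:
        """Hypothetical dataset profile: schema metadata plus workload features."""
        num_triples: int            # dataset size
        num_predicates: int         # distinct edge labels (number of VT relations)
        wpt_sparsity: float         # fraction of NULL cells in the WPT encoding
        avg_out_degree: float       # average number of predicates per subject
        star_query_ratio: float     # share of star-shaped patterns in the workload
        avg_joins_per_query: float  # average join count in the SQL translations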
We will focus our investigation on the most popular relational schemas for RDF graphs [2], i.e., (i) the Single Statement Table (ST), (ii) the Vertical Table (VT), and (iii) the Wide Property Table (WPT) [5,6]. These schemas are designed according to different, often conflicting requirements. ST is triple-based, i.e., the schema is agnostic to the underlying graph structure. This approach eases data integration (i.e., merging two datasets does not require any schema change) but increases the number of self-joins required to answer complex queries. WPT is dataset-based, i.e., the schema attempts to encode the entire dataset into a single denormalized relation. The schema strictly depends on the graph structure and aims at reducing the cost of star-shaped queries; however, it is often very sparse. VT is graph-based, i.e., the schema requires one relation for each predicate (labeled edge) in the graph. The schema aims at reducing redundancy by normalizing the graph structure.
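The following PySpark sketch materializes the three layouts for the toy triples above. It is a minimal illustration under simplifying assumptions, not the project's implementation; for instance, multi-valued predicates are silently collapsed by first(), and the application name is arbitrary.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("wosak-schemas").getOrCreate()

    # ST: a single relation holding every triple, agnostic to the graph structure.
    st = spark.createDataFrame(triples, ["subject", "predicate", "object"])

    # VT: one (subject, object) relation per predicate.
    predicates = [r.predicate for r in st.select("predicate").distinct().collect()]
    vt = {p: st.where(F.col("predicate") == p).select("subject", "object")
          for p in predicates}

    # WPT: one denormalized relation with a column per predicate; subjects
    # lacking a property get NULL, which is exactly the sparsity issue above.
    wpt = st.groupBy("subject").pivot("predicate").agg(F.first("object"))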
Representing a given RDF dataset using ST, VT, or WPT has consequences for disk occupancy, relation sparsity, and duplication. Moreover, it directly impacts the workload complexity. Indeed, translating SPARQL queries into SQL is a schema-dependent task. In the context of Big Data systems, the overall complexity is aggravated by the variety of file formats, e.g., CSV, Avro, and Parquet, and by data partitioning (which for graph data implies data shuffling).
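To illustrate the schema dependence, consider the star-shaped SPARQL pattern SELECT ?s ?w WHERE { ?s :type :Person . ?s :worksAt ?w } translated against the ST and WPT encodings built above. The Spark SQL below is hand-written for illustration, not the output of a translator.

    st.createOrReplaceTempView("st")
    wpt.createOrReplaceTempView("wpt")

    # Over ST, each additional triple pattern costs one self-join on subject.
    spark.sql("""
        SELECT t1.subject AS s, t2.object AS w
        FROM st t1 JOIN st t2 ON t1.subject = t2.subject
        WHERE t1.predicate = ':type' AND t1.object = ':Person'
          AND t2.predicate = ':worksAt'
    """).show()

    # Over WPT, the same star pattern is a single scan with no joins at all.
    spark.sql("""
        SELECT subject AS s, `:worksAt` AS w
        FROM wpt
        WHERE `:type` = ':Person' AND `:worksAt` IS NOT NULL
    """).show()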
- Design an algorithm for automated schema migration.
A major problem in hybrid schema definition for graph data is the lack of an automated procedure for schema migration. Taking the ST, VT, and WPT schemas as starting points, we will develop a schema migration algorithm that considers the dataset characteristics and the impact on the workload. The algorithm takes as input a dataset profile (as defined in the first step of the project) and the workload, which consists of a set of SPARQL queries and their SQL translations. We envision an iterative approach that repeatedly modifies the input dataset schema and the workload until it reaches an optimal hybrid schema.
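A minimal sketch of the iterative shape we have in mind is given below. The cost model, the set of candidate schema moves, and the query rewriter are deliberately left as parameters, since defining them is precisely the object of this step; all names are hypothetical.

    def migrate_schema(schema, workload, profile,
                       candidate_moves, rewrite_queries, estimate_cost,
                       max_iters=20):
        """Greedy hill-climbing sketch: apply the schema move that reduces the
        estimated workload cost, and stop when no move helps."""
        best_cost = estimate_cost(schema, workload, profile)
        for _ in range(max_iters):
            improved = False
            for move in candidate_moves(schema, profile):  # e.g., split/merge relations
                new_schema = move(schema)
                new_workload = rewrite_queries(workload, new_schema)  # retarget SQL
                cost = estimate_cost(new_schema, new_workload, profile)
                if cost < best_cost:
                    schema, workload, best_cost = new_schema, new_workload, cost
                    improved = True
            if not improved:
                break  # local optimum: candidate hybrid schema
        return schema, workload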
As an extension of the algorithm, we plan to also incorporate the choice of partitioning technique and storage file format.
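In Spark, these extra dimensions map directly onto the DataFrame writer. For instance, the ST relation above could be materialized under different format/partitioning combinations; the paths are placeholders, and Avro output requires the external spark-avro package.

    # Partitioning by predicate physically groups triples much like VT would.
    st.write.mode("overwrite").partitionBy("predicate").parquet("/tmp/st_parquet")

    # The same relation in a different file format (needs the spark-avro package).
    st.write.mode("overwrite").format("avro").save("/tmp/st_avro")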
- Empirically validate the algorithm's performance using state-of-the-art benchmarks and Big Data frameworks.
This project aims to enable workload-driven schema optimization by designing a method for automated schema transformation. To validate the effectiveness of our approach, we plan to reuse state-of-the-art benchmarks like LUBM, SP2Bench, WatDiv, and the LDBC Social Network Benchmark. These benchmarks include data generators and sufficiently complex workloads to test the designed solution at different dataset sizes. Moreover, we also plan to test the approach on real-world datasets like Wikidata and YAGO [2], which present additional challenges in data preparation and consistency. Last but not least, we plan to use Apache Spark as the initial test framework for our experiments because it is the de-facto standard for Big Data processing and offers a mature SQL interface.
The project will be carried out in collaboration with French and international partners. In particular, Mohamed Ragab from the University of Tartu (Estonia), who contributed to the preliminary work of this project [4,7], and Angela Bonifati from Lyon 1 University (France), who shares the vision and the expertise on Big Graph processing [1].
1. Sakr, Sherif, et al. “The future is big graphs: a community view on graph processing systems.” Communications of the ACM 64.9 (2021): 62-71.
2. Hogan, Aidan, et al. “Knowledge graphs.” Synthesis Lectures on Data, Semantics, and Knowledge 12.2 (2021): 1-257.
3. Cossu, Matteo, Michael Färber, and Georg Lausen. “PRoST: distributed execution of SPARQL queries using mixed partitioning strategies.” arXiv preprint arXiv:1802.05898 (2018).
4. Ragab, Mohamed, Feras M. Awaysheh, and Riccardo Tommasini. “Bench-Ranking: A First Step Towards Prescriptive Performance Analyses For Big Data Frameworks.”
5. Schätzle, Alexander, et al. “S2RDF: RDF querying with SPARQL on Spark.” arXiv preprint arXiv:1512.07021 (2015).
6. Schätzle, Alexander, et al. “Sempala: interactive SPARQL query processing on Hadoop.” International Semantic Web Conference. Springer, Cham, 2014.
7. Ragab, Mohamed, et al. “An In-depth Investigation of Large-scale RDF Relational Schema Optimizations Using Spark SQL.” DOLAP 2021.