SPARKSQL RDF Processing Benchmarking

Project description

In this Project, we present a systematic comparison of the relevant RDF relational schemas, i.e., Single Statement Table, Property Tables or Vertically-Partitioned Tables queried using Apache Spark. We evaluate the performance Spark SQL querying engine for processing SPARQL queries using three different storage back-ends, namely, PostgresSQL, Hive, and HDFS. For the latter one, we compare four different data formats (CSV, ORC, Avro, and Parquet). We drove our experiment using a representative query workloads from the SP2Bench benchmark scenario. The results of our experiments show many interesting insights about the impact of the relational encoding scheme, storage backends and storage formats on the performance of the query execution process.

Project Phases

phases

Phase#1

In the frst phase of our work, we presented a systematic analysis of the performance of Spark-SQL query engine (mainly the execution time) for answering SPARQL queries over RDF repositories on a centralized single-machine. In particular, we have performed our experiments considering: (i) alternative relational schemas for RDF, i.e., Single Statement Tables, Vertical Tables, and Property Tables; (ii) various storage backends, i.e., PostgreSQL, Hive, and HDFS, and (iii) and different data formats (e.g. CSV, Avro, Parquet, ORC). We conducted experiments on RDF datasets with (100K, 1M, and 10M) triples.

Phase#2

In the second phase of our project, we conducted the same settings and configurations but differently in a distributed deployments with partitioning the data. In particular, we conducted our experiments in a Spark cluster of four machines. and we worked on a larger RDF dataset of 100M dataset. Notably, we don’t use PostgreSQL anymore in this phase experiments.

Phase#3

In this phase, we repeated the phase#2 experimetns, but this time we extended our compared relatonal experiments with new proposed relational schema representations (ExtVP, and WPT) from the State-of-the-art. Extended Vertically Partitioned Tables (ExtVP) and Wide Property Tables (WPT) are prominent optimizations that target specific workloads. Nevertheless, in a distributed context (with presenace of data pratitioning, and altenaritve storage backends) such improvements do not always outperform their baselines. Thus, we compare ExtVP with the baseline VT, and WPT with the baseline PT schema considering different partitoning techniques, and different file formats for storage.

Note Phase#4 results and figures are updated and can be found in the results section link

Phase#4 ( Bench-Ranking )

In this phase, we also conduct the phase#2 experimetns but with way larger datset scales (100M, 250M, and 500M) triples. moreover, differently from the previous phase, we apply different ranking and combined ranking criteria (Bench-Ranking) to quantitively and effectively help practioners to choose the best configuration combinations in such complex solution space of different dimensions (schema, partitioning, and storage).

Note Phase#3 Bench-Ranking results and figures are updated and can be found in the results section link

RDF Relational schemas

In order to process RDF graphs on top of a relational big dataframework, e.g Spark-SQL. RDF graphs have to represented in relational layouts (i.e relational schema). To this end, we employ the most popular relational schemas in the literiature for representing RDF graphs in atabular relational layouts. We present a systematic comparison of these three relevant RDF relational schemas, namely Single Statement Table (ST), Property Tables (PT), and Vertically-Partitioned Tables (VT). Moreover, in phase#4 of this project, we investigated the performance of other two relational schemata advancements that extend the core mentioned relational schemata, namely Wide Property Tables-WPT (extending the PT schema), and Extended Vertically-Partitioned-ExtVP (extending the VT schema).

Single Statement Table requires storing RDF datasets in a single triples table of three columns that represent the three components of the RDF triple, i.e., Subject, Predicate, and Object.

spark

Vertically-Partitioned Tables is an alterna-tive schema storage in which the RDF triples table is decomposed into a table of two columns (Subject, Object) for each unique property in the RDF dataset such that the first (subject) column contains all subject URIs of that unique property, and the second (object) contains all the object values (URIs and Literals) for those subjects

spark

Property Tables is proposed to cluster multiple RDF properties as n-ary table columns for the same subject to group entities that are similar instructure.

spark

For our new extension of the Phase#4 of this project, we include two other Relational schemata to our experiemtns (The Wide Property Table [“WPT”], and the Extended Vertical Tables [“ExtVP”]).

Wide Property Table extends the PT schema for optimizing star-shaped SPARQL queries. WPT aims at representing the whole RDF dataset into a single unified table. Such table uses all RDF properties in the dataset as columns.

spark

Extended Vertical Tables [“ExtVP”] aims at minimizing the size of input data during query evaluation [22]. In particular, ExtVP minimizes data skewness and eliminates dangling triples (i.e. triples that do not have a join partner) that do not contribute to any join in the query. This extension is inspired by the Semi-Join reductions.

spark

Apache Spark-SQL

For our experiments, we opted for the defacto big data (relational) engine, i.e. Apache Spark-SQL. It is one of the most popular high-level libraries of Apache Spark targeted for processing structured datasets using the DataFrames data abstraction, and optimized by means of the Catalyst query optimizer. In this project, we use Spark-SQL for processing and querying large RDF datasets. In this context, SPARQL queries are mapped to SQL queries and optimized over the “Catalyst” Optimizer.

Storage Backends

We evaluate the performance of SparkSQL querying engine for processing SPARQL queries using two different storage backends, namely, Hive, and HDFS. For the latter one, we compare four different data formats (CSV, ORC, Avro, and Parquet).

Partitioning techniques

In addition, we show the impact of using three different RDF-based partitioning techniques with our relational scenario which are Subject-based, Predicate-based, and Horizontal partitioning.

part_techs

Horizontal -Based Partitioning (HP): a technique that evenly partitions the data horizontally on the number of machines in the cluster. In particular, it partitions the relational tables we have according to (n) number of machines in the cluster.
Subject-Based Partitioning (SBP): a technique which distributes triples to the partitions according to the subject. As a result, all the triples that have the same subject are assumed to reside on the same partition. In our scenario, we applied spark partitioning using the subject- key with our different relational schema tables/Dataframes.
Predicate-Based Partitioning (PBP): a technique distributes triples to the partitions based on the RDF predicate. As a result, all the triples that have the same predicate are assumed to reside on the same partition. In our scenario, we applied Spark partitioning using the predicate- key with our different relational schema tables/Dataframes.

Installation & Pre-Processing

SP2Bench Data Generator generates RDF data in N3 format. Apache Jena is used to convert N3 into TDB files. Afterwards, we query TDB datasets using SPARQL quereis to generate our different CSV relational schemas (i.e. ST, PT, and VT). We further, used Spark-SQL framework to convert the CSV data into other HDFS file formats such as (Parquet, ORC, Avro). You can find the FileFormats Conversion Porject source code here.

We used the same approach to load the data into the tables of the Apache Hive data warehouse using a created database for our datasets. Data conversion to Hive files requires to enable the support for Hive in the Spark session configuration using the enableHiveSupport function. Notably, we have used another approach for creating Hive tables and loading data into them. We use HQL queries rough the hive CLI to create tables and load data in the form of the three differnt Relational schemas.

Note that: we are on the process of replicating our work on the “WatDiv” benchmark dataset.

Datasets

The SP2Bench Benchmark is scalable benchmark, whichj means it comprise a data generator that enables generatring arbitrarly large RDF datasets. For our First Phase of this project (Centralized Experiments), we generated datasets with the sizes [100K, 1M, and 10M] triples. While, for the second phase (Distributed experiments), we scale up to larger datasets with the sizes [100M, 250M, and 500M] triples.

	ST				VP				PT
	CSV	Avro	ORC	Parquet	CSV	Avro	ORC	Parquet	CSV	Avro	ORC	Parquet
100M	13GB	~2GB	1.3GB	1.7GB	~8.3GB	~1.7GB	~1.5GB	~1.6GB	~6.8GB	1.6GB	1.4GB	1.4GB
250M	31GB	~5GB	3.2GB	~4GB	21GB	~4.2GB	~3.8GB	4GB	36GB	6.2GB	5.5GB	5.8GB
500M	60GB	11GB	6.3GB	~8GB	41GB	8.3GB	7.3GB	7.8GB	33GB	7.5GB	6.5GB	7GB

For reproducability, We put here the 100K triples dataset alongside their relational schema conversions in different file formats. you can find these datasets here.
We couldn’t upload the 100M datsets. Indeed it’s quite larg dataset with for all the schema (ST,VT, PT, WPT, ExtVP,..) and for all the storage file formats.
Alternatively, you can use the SP2Bench generator (Download from here) for generating larger datasets. We keep spark project for converting the datasets into various file formats (i.e from CSV into Parquet, ORC, Avro,..) and storing them directly to HDFS.
We keep this project also with the partitioning scripts (please find in the source code) of the datasets with Spark into different partitioning techniques (Horizontally partitionied, Subject, and Predicate-based partitioning).

SP2Bench Queries

Sp2Bench SPARQL queries and their SQL translations for ST,VT, and PT relational schemas (that we will use in our experiments, compliant with the SparkSQL) can be found here

SQL translations of the SP2Bench for all the relational schemata (core [ST,VT,PT], and optimized [WPT, ExtVP]) can find here in our repo source code-Queries.
Query Complexity Analysis: the following table shows the Sp2Bench queries complexity analysis, i.e in terms of Number of Joins, Selections(Filters), and Projections.

spark

Experiments Architecture

This figure shows the summary of our experiments configurations. It guides the reader through the naming process, i.e., (Schema.PartitioniningTechnique.Storage_Backend). For instance, (a.ii.4) corresponds to Single ST schema, SBP partitioning, and Parquet backend.

spark

Source Code

You can find the source code of all expeiments in Source Code

Experiments Running

We used the Spark.time function by passing the spark.sql(query) query execution function as a parameter. The output of this function is the running time of evaluating the SQL query into the Spark environment using the Spark session interface. All queries are evaluated for all schemas and partitioned horizontally (HP), or over subject and predicate (SBP, PBP respectively), and on top of all the diﬀerent storage backends Hive, and the HDFS file formats.

For each storage backend, partitioning method, and a relational schema, we run the experiments for all queries five times (excluding the first cold start run time, to avoid the warm-up bias, and computed an average of the other four run times).

Scripts

In the directory of Scirpts, we share full scripts we used in our benchmaking experiments to make all computations and analysis as well as plotting all the figures over the logs of the experiments.

“Bench-Ranking” experiments’ scripts calculations (ranking tables generation, ranking scores computations, and Ranking goodness measures) can also be found in the above link.

Results

Centralized Expeiments results (single machine, smaller datsets)
- Centralized Experiments
Distributed Experiments
- Descriptive Analytics:
- Diagnostic Analytics:
  - Diagnostic Analyisis of the results (answering the “why?” question)
- Prescriptive analysis
  - (Single-Dimensional “Bench-Ranking”):
    - Relational Schema Ranking Scores
    - Partitioning Techs. Ranking Scores
    - Storage Backends Ranking Scores
    - We keep all the single-dimensional ranking for the above rankings for the results of the experiments that exlude “Hive” as a 5^th storage backend can be accessed from the above page links.
    - We keep all the intermediary ranking tables and logs calculations of all the above ranking plots of the dimensions in this link.
  - (Multi-Dimensional “Bench-Ranking”):
    - Find and download the multi-dim. Ranking criteria (Pareto front optimal) results from this link.
    - pareto optimal results (shown as tables for the mentioned dataset scales) can be shown (under pareto results section) here, also can be shwon as matblot 3d figures in the same webpage.
  - Bench-Ranking goodness results:
    - In this link, we keep the ranking goodness metrics/measures (coherence, and confidence) and results for all the ranking criteria and across datasets.
- Phase#3 results (Schema Advancments Benchmarking):
  - Relational Schemata Optimizattion VS BaseLine Schemata Results

Project Authors

Publications

Phase#1 of this project has been accepted at QuWeDa@ISWC conference in Auckland, New Zealand PDF:

Mohamed Ragab, Riccardo Tommasini and Sherif Sakr, Benchmarking SparkSQL under Alliterative RDF Relational Storage Backends, QuWeDa@ISWC 2019.

Phase#2 of this project has been accepted at Semantic Big Data (SBD 2020)@In conjunction with ACM SIGMOD 2020 in Portland, OR, USA PDF:

 Mohamed Ragab, Riccardo Tommasini, Sadiq Eyvazov and Sherif Sakr, Towards making sense of Spark-SQL performance for processing vast distributed RDF datasets, In Proceedings of The International Workshop on Semantic Big Data (SBD ’20).

Phase#3 of this project has been accepted at 23rd International Workshop on Design, Optimization, Languages and Analytical Processing of Big Data (DOLAP 2021)@EDBT/ICT conference joint in Nicosia,Cyprus March 23 – 26, 2021 PDF:

 Mohamed Ragab, Riccardo Tommasini, Feras Awaysheh and Juan Carlos Ramos. An In-depth Comparison of Large-scale RDF Relational Schema Optimizations Using Spark-SQL(DOLAP ’21).	 

Phase#4 of this project has been accepted at 2021 IEEE International Conference on Big Data PDF:

 Mohamed Ragab, Feras Awaysheh and Riccardo Tommasini. Bench-Ranking: A First Step Towards Prescriptive Performance Analyses For Big Data Frameworks.	 

Licence

This work is mainianed by the DataSystems Group, University of Tartu, licensed under the terms of the GNU General Public License, version 3.0 (GPLv3)

Spark-SQLRDF Benchmarking by DataSystemsGroup is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.