PAPyA

Prescriptive Analysis for processing vast RDF datasets made easy.


Data Preparator

Intro

This module generates an exemplar pipeline for testing PAPyA Bench-Ranking in the scenario of querying big RDF datasets; it takes as input an RDF graph encoded in the N-Triples serialization. Data Preparator allows defining an arbitrary number of dimensions with as many options as necessary. In this experiment we specify three dimensions (i.e., relational schemas, partitioning techniques, and storage formats), and Data Preparator automatically generates the relational schemas for the input RDF dataset according to the specified configurations. Data Preparator’s interface is generic, and the generated data is agnostic to the underlying relational system. The current implementation relies on SparkSQL, which allows RDF relational schema generation using SQL transformations. SparkSQL also supports different partitioning techniques and multiple storage formats, making it ideal for our experiments.
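As a minimal illustration of the input step, the sketch below loads an N-Triples file into an (s, p, o) table with SparkSQL. This is an assumed, simplified version, not PAPyA's actual code: the file name, the naive line parsing, and the view name triples are ours.

import org.apache.spark.sql.SparkSession

object LoadTriplesSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("load-nt").master("local[*]").getOrCreate()
    import spark.implicits._

    // Naive N-Triples parsing: split each line into subject, predicate, and the
    // remainder as object, dropping the trailing " ." terminator.
    val triples = spark.read.textFile("input.nt")
      .filter(line => line.nonEmpty && !line.startsWith("#"))
      .map { line =>
        val Array(s, p, rest) = line.split(" ", 3)
        (s, p, rest.stripSuffix(" ."))
      }
      .toDF("s", "p", "o")

    // Register the Single Statement (triples) table for later SQL transformations.
    triples.createOrReplaceTempView("triples")
  }
}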

This figure shows an example of schema generation in the Data Preparator module. First, Data Preparator transforms the input RDF graph into a Single Statement schema; the other schemas are then generated from it using parameterized SQL queries. For example, the Vertically-Partitioned schema and the Wide Property Table schema are generated by SQL queries against the Single Statement table, while generation of the Extended Vertically-Partitioned schema requires the Vertically-Partitioned schema to exist first.
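Continuing the sketch above, the following snippet (again an illustrative assumption, not PAPyA's actual code; table names and output paths are ours) generates one Vertically-Partitioned table per predicate with a parameterized SQL query over the Single Statement table:

// One VP table per distinct predicate: SELECT s, o FROM triples WHERE p = <predicate>.
val predicates = spark.sql("SELECT DISTINCT p FROM triples").collect().map(_.getString(0))

predicates.foreach { pred =>
  spark.sql(s"SELECT s, o FROM triples WHERE p = '$pred'")
    .write.mode("overwrite")
    .parquet(s"vp/${pred.replaceAll("[^A-Za-z0-9]", "_")}") // sanitize the predicate IRI for the path
}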

Relational Schemas

Currently, Data Preparator includes four relational schemas commonly used in RDF processing:

- Single Statement table (also called Triples Table; TT in the configuration below): one (subject, predicate, object) row per triple.
- Wide Property Table (WPT): one row per subject, with one column per property.
- Vertically-Partitioned tables (VP): one (subject, object) table per predicate.
- Extended Vertically-Partitioned tables (EXTVP): semi-join reductions precomputed from the VP tables (hence VP must be generated first).

Partitioning Techniques

Data Preparator supports three different partitioning techniques (see the sketch below):

- Horizontal partitioning (the h suffix in the configuration): rows are split into a fixed number of equally-sized chunks.
- Subject-based partitioning (s): rows with the same subject land in the same partition.
- Predicate-based partitioning (p): rows with the same predicate land in the same partition.
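Reusing the triples DataFrame from the earlier sketch, the three techniques map roughly to Spark repartitioning as follows (an assumed illustration, not PAPyA's exact implementation):

import org.apache.spark.sql.functions.col

val horizontal = triples.repartition(8)            // h: fixed number of equal-size chunks
val subjectBased = triples.repartition(col("s"))   // s: same subject => same partition
val predicateBased = triples.repartition(col("p")) // p: same predicate => same partition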

Storage Formats

Data Preparator allows storing data using various HDFS file formats. In particular, the system supports two types of storage format (see the sketch below):

- Row-oriented formats: CSV and Avro.
- Column-oriented formats: ORC and Parquet.
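Persisting a DataFrame in these formats with Spark looks roughly as follows (output paths are illustrative; note that Avro support requires the external spark-avro package on the classpath):

triples.write.mode("overwrite").csv("out/tt_csv")                  // row-oriented, plain text
triples.write.mode("overwrite").format("avro").save("out/tt_avro") // row-oriented, binary
triples.write.mode("overwrite").orc("out/tt_orc")                  // column-oriented
triples.write.mode("overwrite").parquet("out/tt_parquet")          // column-oriented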

Getting Started with PAPyA Data Preparator:

To compile and generate a jar with all dependencies for the DP module, run the following command inside the DP main directory:

mvn package assembly:single

To run the DP module directly, we provide the fat jar in the target directory under the name PapyaDP-1.0-SNAPSHOT-jar-with-dependencies.jar.

The user can specify the required schemas, partitioning techniques, and storage options for their experiments in the file loader-default.ini.

Example: generating the logical partitioning (i.e., the relational schemas), storage formats, and physical partitioning. Each key concatenates a schema abbreviation (TT = Triples Table, WPT, VP, EXTVP) with a storage format or a physical-partitioning suffix (p = predicate-based, s = subject-based, h = horizontal):

[logicalPartitioning]
TT = True
WPT = True
VP = True
EXTVP = True

[storage]
TTcsv = True
TTorc = True
TTavro = True
TTParquet = True

VPcsv = True
VPorc = True
VPavro = True
VPParquet = True

WPTcsv = True
WPTorc = True
WPTavro = True
WPTParquet = True

EXTVPcsv = True
EXTVPorc = True
EXTVPavro = True
EXTVPParquet = True

[physicalPartitioning]
TTp = True
TTs = True
TTh = True

VPs = True
VPh = True

EXTVPs = True
EXTVPh = True

WPTs = True
WPTh = True

The data is generated by submitting the jar as a Spark job:

spark-submit --class run.PapyaDPMain --master local[*] PapyaDP-1.0-SNAPSHOT-jar-with-dependencies.jar <OUTPUT_DIR> -db <dbName> -i <RDF_SOURCE_DIR>
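For example, with illustrative paths and database name (these values are ours, not defaults shipped with PAPyA):

spark-submit --class run.PapyaDPMain --master local[*] \
  PapyaDP-1.0-SNAPSHOT-jar-with-dependencies.jar \
  /data/papya/output -db watdiv100 -i /data/watdiv/dataset.nt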