PAPyA

Prescriptive Analysis for processing vast RDF datasets made easy.


Data Preparator

Intro

This module generates an exemplar pipeline for testing PAPyA Bench-Ranking in the scenario of querying big RDF datasets; it takes as input an RDF graph encoded in the N-Triples serialization. Data Preparator allows defining an arbitrary number of dimensions with as many options as necessary. In this experiment we specify three dimensions (i.e., relational schemas, partitioning techniques, and storage formats), and Data Preparator automatically generates the relational schemas for the input RDF dataset according to the specified configurations. Data Preparator’s interface is generic, and the generated data is agnostic to the underlying relational system. The current implementation relies on SparkSQL, which allows RDF relational schema generation using SQL transformations. SparkSQL also supports different partitioning techniques and multiple storage formats, making it ideal for our experiments.
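As a minimal illustration of the input step, the sketch below loads an N-Triples file into an (s, p, o) table with SparkSQL. This is an assumed, simplified version, not PAPyA's actual code: the file name, the naive line parsing, and the view name triples are ours.

import org.apache.spark.sql.SparkSession

object LoadTriplesSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("load-nt").master("local[*]").getOrCreate()
    import spark.implicits._

    // Naive N-Triples parsing: split each line into subject, predicate, and the
    // remainder as object, dropping the trailing " ." terminator.
    val triples = spark.read.textFile("input.nt")
      .filter(line => line.nonEmpty && !line.startsWith("#"))
      .map { line =>
        val Array(s, p, rest) = line.split(" ", 3)
        (s, p, rest.stripSuffix(" ."))
      }
      .toDF("s", "p", "o")

    // Register the Single Statement (triples) table for later SQL transformations.
    triples.createOrReplaceTempView("triples")
  }
}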

This figure shows an example of schema generation in the Data Preparator module. First, Data Preparator transforms the input RDF graph into a Single Statement schema; the other schemas are then generated from it using parameterized SQL queries. For example, the Vertically-Partitioned schema and the Wide Property Table schema are generated by SQL queries against the Single Statement table, while generation of the Extended Vertically-Partitioned schema requires the Vertically-Partitioned schema to exist first.
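Continuing the sketch above, the following snippet (again an illustrative assumption, not PAPyA's actual code; table names and output paths are ours) generates one Vertically-Partitioned table per predicate with a parameterized SQL query over the Single Statement table:

// One VP table per distinct predicate: SELECT s, o FROM triples WHERE p = <predicate>.
val predicates = spark.sql("SELECT DISTINCT p FROM triples").collect().map(_.getString(0))

predicates.foreach { pred =>
  spark.sql(s"SELECT s, o FROM triples WHERE p = '$pred'")
    .write.mode("overwrite")
    .parquet(s"vp/${pred.replaceAll("[^A-Za-z0-9]", "_")}") // sanitize the predicate IRI for the path
}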

Relational Schemas

Currently, Data Preparator includes four relational schemas commonly used in RDF processing:

- Single Statement table (also called Triples Table; TT in the configuration below): one (subject, predicate, object) row per triple.
- Wide Property Table (WPT): one row per subject, with one column per property.
- Vertically-Partitioned tables (VP): one (subject, object) table per predicate.
- Extended Vertically-Partitioned tables (EXTVP): semi-join reductions precomputed from the VP tables (hence VP must be generated first).

Partitioning Techniques

Data Preparator supports three different partitioning techniques (see the sketch below):

- Horizontal partitioning (the h suffix in the configuration): rows are split into a fixed number of equally-sized chunks.
- Subject-based partitioning (s): rows with the same subject land in the same partition.
- Predicate-based partitioning (p): rows with the same predicate land in the same partition.
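Reusing the triples DataFrame from the earlier sketch, the three techniques map roughly to Spark repartitioning as follows (an assumed illustration, not PAPyA's exact implementation):

import org.apache.spark.sql.functions.col

val horizontal = triples.repartition(8)            // h: fixed number of equal-size chunks
val subjectBased = triples.repartition(col("s"))   // s: same subject => same partition
val predicateBased = triples.repartition(col("p")) // p: same predicate => same partition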

Storage Formats

Data Preparator allows storing data using various HDFS file formats. In particular, the system supports two types of storage format (see the sketch below):

- Row-oriented formats: CSV and Avro.
- Column-oriented formats: ORC and Parquet.
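Persisting a DataFrame in these formats with Spark looks roughly as follows (output paths are illustrative; note that Avro support requires the external spark-avro package on the classpath):

triples.write.mode("overwrite").csv("out/tt_csv")                  // row-oriented, plain text
triples.write.mode("overwrite").format("avro").save("out/tt_avro") // row-oriented, binary
triples.write.mode("overwrite").orc("out/tt_orc")                  // column-oriented
triples.write.mode("overwrite").parquet("out/tt_parquet")          // column-oriented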

Getting Started with PAPyA Data Preparator:

To compile and generate a jar with all dependencies for the DP module, run the following command inside the DP main directory:

mvn package assembly:single

To run the DP module directly, we provide the fat jar in the target directory under the name PapyaDP-1.0-SNAPSHOT-jar-with-dependencies.jar.

The user can specify the required schemas, partitioning techniques, and storage options for their experiments in the file loader-default.ini.

Example: generating the logical partitioning (i.e., the relational schemas), storage formats, and physical partitioning. Each key concatenates a schema abbreviation (TT = Triples Table, WPT, VP, EXTVP) with a storage format or a physical-partitioning suffix (p = predicate-based, s = subject-based, h = horizontal):

[logicalPartitioning]
TT = True
WPT = True
VP = True
EXTVP = True

[storage]
TTcsv = True
TTorc = True
TTavro = True
TTParquet = True

VPcsv = True
VPorc = True
VPavro = True
VPParquet = True

WPTcsv = True
WPTorc = True
WPTavro = True
WPTParquet = True

EXTVPcsv = True
EXTVPorc = True
EXTVPavro = True
EXTVPParquet = True

[physicalPartitioning]
TTp = True
TTs = True
TTh = True

VPs = True
VPh = True

EXTVPs = True
EXTVPh = True

WPTs = True
WPTh = True

The data is generated by submitting the jar as a Spark job:

spark-submit --class run.PapyaDPMain --master local[*] PapyaDP-1.0-SNAPSHOT-jar-with-dependencies.jar <OUTPUT_DIR> -db <dbName> -i <RDF_SOURCE_DIR>
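For example, with illustrative paths and database name (these values are ours, not defaults shipped with PAPyA):

spark-submit --class run.PapyaDPMain --master local[*] \
  PapyaDP-1.0-SNAPSHOT-jar-with-dependencies.jar \
  /data/papya/output -db watdiv100 -i /data/watdiv/dataset.nt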