Prescriptive Analysis for processing vast RDF datasets made easy.
This module generates an example pipeline for the PAPyA Bench-Ranking scenario of querying big RDF datasets. It takes as input an RDF graph encoded in N-Triples serialization. Data Preparator allows defining an arbitrary number of dimensions with as many options as necessary. In this experiment, three dimensions are specified (i.e., relational schemas, partitioning techniques, and storage formats). Data Preparator then automatically generates the relational schemas for the input RDF dataset according to the specified configurations. Data Preparator's interface is generic, and the generated data is agnostic to the underlying relational system. The current implementation relies on SparkSQL, which allows generating the RDF relational schemas using SQL transformations. SparkSQL also supports different partitioning techniques and multiple storage formats, making it ideal for our experiments.
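For intuition, the three dimensions and their options can be pictured as a small search space. The sketch below is a plain Python illustration whose option names mirror the loader-default.ini keys shown later; it is not part of PAPyA's actual API:

# Hypothetical illustration of the experiment's dimension space; the names
# mirror loader-default.ini, but this is not PAPyA's internal representation.
from itertools import product

dimensions = {
    "schema": ["TT", "VP", "WPT", "EXTVP"],
    "partitioning": ["predicate", "subject", "horizontal"],
    "storage": ["csv", "orc", "avro", "parquet"],
}

# Each combination is one configuration Data Preparator can materialize.
for schema, partitioning, storage in product(*dimensions.values()):
    print(schema, partitioning, storage)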
This figure shows an example of schema generation in the Data Preparator module. First, Data Preparator transforms the input RDF graph into a Single Statement schema; the other schemas are then generated using parameterized SQL queries. For example, the Vertical-Partitioned schema and the Wide Property Table schema are generated with SQL queries against the Single Statement table, while generating the Extended Vertical-Partitioned schema requires the Vertical-Partitioned schema to exist first.
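As an illustration of these SQL transformations, the following PySpark sketch loads an N-Triples file into a Single Statement (triples) table and derives one Vertical-Partitioned table from it. The column names (s, p, o), the whitespace-based parsing, the predicate, and the file path are assumptions made for this example, not the exact queries Data Preparator runs:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dp-schema-sketch").getOrCreate()

# Load an N-Triples file into a Single Statement (triples) table with
# columns s, p, o. The naive whitespace split assumes simple triples;
# Data Preparator's own parser may handle literals differently.
triples = (
    spark.read.text("file:///data/input.nt")  # hypothetical input path
    .selectExpr(
        "split(value, ' ')[0] AS s",
        "split(value, ' ')[1] AS p",
        "split(value, ' ')[2] AS o",
    )
)
triples.createOrReplaceTempView("single_statement")

# A parameterized SQL query deriving one Vertical-Partitioned table
# (a two-column s/o table per predicate) from the Single Statement table.
predicate = "<http://example.org/knows>"  # hypothetical predicate
vp_knows = spark.sql(
    f"SELECT s, o FROM single_statement WHERE p = '{predicate}'"
)
vp_knows.createOrReplaceTempView("vp_knows")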
Currently, Data Preparator includes four relational schemas commonly used in RDF processing:

- Single Statement Table, also called the Triples Table (TT)
- Vertical-Partitioned tables (VP)
- Wide Property Table (WPT)
- Extended Vertical-Partitioned tables (ExtVP)
Data Preparator supports three different partitioning techniques (the p, s, and h suffixes in the configuration below):

- Predicate-based Partitioning (p)
- Subject-based Partitioning (s)
- Horizontal Partitioning (h)
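In SparkSQL terms, these techniques can be approximated by repartitioning a DataFrame before writing it out. The sketch below reuses the triples DataFrame from the previous example and is an assumption about the general idea, not Data Preparator's exact code:

# Subject-based partitioning (the "s" suffix): rows sharing a subject
# land in the same partition.
tt_s = triples.repartition("s")

# Predicate-based partitioning (the "p" suffix): rows sharing a predicate
# land in the same partition.
tt_p = triples.repartition("p")

# Horizontal partitioning (the "h" suffix): rows are spread evenly over a
# fixed number of partitions; the count of 8 is arbitrary for the example.
tt_h = triples.repartition(8)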
Data Preparator allows storing data using various HDFS file formats. In particular, the system supports two types of storage formats: row-oriented formats (CSV and Avro) and column-oriented formats (ORC and Parquet).
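Writing a prepared table in any of these formats is then a single SparkSQL call. In the sketch below the output path is hypothetical, and the Avro format additionally requires the spark-avro package:

# Persist the subject-partitioned table as Parquet; changing the format
# string to "csv", "orc", or "avro" selects the other storage options.
(
    tt_s.write.mode("overwrite")
    .format("parquet")
    .save("hdfs://namenode:8020/output/tt_s_parquet")  # hypothetical path
)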
To compile and generate a jar with all dependencies for the DP module, run the following command inside the DP main directory:
mvn package assembly:single
To run the DP module directly, we uploaded the fat jar into the target directory under the name PapyaDP-1.0-SNAPSHOT-jar-with-dependencies.jar.
The user can specify the required schemas, partitioning techniques, and storage options for their experiments in the loader-default.ini file.
Example: configuring the logical partitioning (i.e., the relational schemas), the storage formats, and the physical partitioning:
[logicalPartitioning]
TT = True
WPT = True
VP = True
EXTVP = True
[storage]
TTcsv = True
TTorc = True
TTavro = True
TTParquet = True
VPcsv = True
VPorc = True
VPavro = True
VPParquet = True
WPTcsv = True
WPTorc = True
WPTavro = True
WPTParquet = True
EXTVPcsv = True
EXTVPorc = True
EXTVPavro = True
EXTVPParquet = True
[physicalPartitioning]
TTp = True
TTs = True
TTh = True
VPs = True
VPh = True
EXTVPs = True
EXTVPh = True
WPTs = True
WPTh = True
The data is generated by submitting the jar as a Spark job:
spark-submit --class run.PapyaDPMain --master local[*] PapyaDP-1.0-SNAPSHOT-jar-with-dependencies.jar <OUTPUT_DIR> -db <dbName> -i <RDF_SOURCE_DIR>
The loader-default.ini file should be located beside the jar file. <OUTPUT_DIR> should be an HDFS directory (HDFS://...); otherwise, PAPyA DP loads the data locally (FILE://...).
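As a concrete illustration, a local run over a hypothetical WatDiv N-Triples dump could look as follows; the paths and database name are placeholders, not outputs of the project:

spark-submit --class run.PapyaDPMain --master local[*] PapyaDP-1.0-SNAPSHOT-jar-with-dependencies.jar FILE:///tmp/papya-output -db watdiv100 -i FILE:///tmp/watdiv.nt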