PAPyA

Prescriptive Analysis for processing vast RDF datasets made easy.

This project is maintained by DataSystemsGroupUT

Results

This section presents the PAPyA library in practice. We focus on the performance of generalized single-dimension and multi-dimension ranking for complex big data solutions using a prescriptive performance analysis (PPA) approach. Our experiment design consists of the following:

In our experiment, we evaluate the performance of SparkSQL as a relational engine for executing the query workload. In particular, our key performance indicator is query latency, though this could be extended to other evaluation metrics. Our analysis is based on the average of five different runs.
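For example, if each configuration's raw log contained five runtime measurements per query, the averages could be computed along the following lines (a hypothetical pre-processing sketch; the column layout is an assumption, not PAPyA's actual log format):

import pandas as pd

# Hypothetical raw measurements: several runs per query for one configuration.
# The column names and values are invented for illustration only.
raw_runs = pd.DataFrame({
    "query": ["Q1", "Q1", "Q1", "Q2", "Q2", "Q2"],
    "runtime_s": [12.4, 11.9, 12.1, 30.2, 29.8, 30.5],
})

# Average the runs per query; this mean latency is the KPI discussed above.
avg_latency = raw_runs.groupby("query")["runtime_s"].mean()
print(avg_latency)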

Bench-Ranking

The Bench-Ranking phase starts once we have results from the Data Preparator as log files in the log folder of our repository. To start the analysis, we need to specify all dimensions and their options, along with our key performance indicator, which in our case is the query runtimes.

# configuration file
dimensions:
    schemas: ["st", "vt", "pt", "extvt", "wpt"]
    partition: ["horizontal", "predicate", "subject"]
    storage: ["csv", "avro", "parquet", "orc"]
query: 11
# log file structures
log
└───100M
     │   st.horizontal.csv.txt
     │   st.horizontal.avro.txt
     │   ...
└───250M
     │   st.horizontal.csv.txt
     │   st.horizontal.avro.txt
     │   ...
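The snippets below assume two variables, config and logs, pointing to the configuration file and the log folder shown above. Their exact expected form is not documented here, so the following setup is only an assumption:

# Assumed setup for the examples below (paths and names are illustrative).
config = "config.yaml"   # the configuration file shown above
logs = "log"             # the folder containing the per-configuration log files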

Single Dimensional Ranking

We start the experiment by viewing our input data, which are the log files of query runtimes for the configurations specified in the FileReader class parameters.

from PAPyA.Rank import SDRank

SDRank(config, logs, '100M', 'schemas').file_reader()

In this experiment, we can get the single-dimension ranking scores by calling the calculateRank method of the SDRank class, which takes four parameters: the config file, the log folder, the dataset size, and the dimension we want to rank.

from PAPyA.Rank import SDRank

SDRank(config, logs, '100M', 'schemas').calculateRank()

We can pass additional parameters to the calculateRank method to obtain different results, for example slicing on specific options or excluding some of the queries defined in the configuration.

# String parameters slice the schema ranking, in this case on the predicate partitioning and csv storage options,
# while the list excludes queries 3, 4, and 5.
SDRank(config, logs, '100M', 'schemas').calculateRank('predicate', 'csv', [3,4,5])

Single Dimensional Visualization

To present the user's data in an easy-to-read and interactive manner, PAPyA provides functionality to visualize the results, helping rationalize performance outcomes and the final decisions on their experimental data.

Radar Plot

Ranking over a single dimension is insufficient when multiple dimensions matter. The presence of trade-offs reduces the accuracy of single-dimension ranking functions. This plot helps view and understand the problem in a simple and intuitive way.

from PAPyA.Rank import SDRank
SDRank(config, logs, '100M', 'schemas').plotRadar()

The plot shows that the top configuration when ranking by schema is optimized towards its own dimension only, ignoring the other two dimensions.

Bar Plot

PAPyA also provides a visualization that shows the performance of the options of a single dimension, chosen by the user, in terms of their rank scores.

SDRank(config, logs, '100M', 'schemas').plot('csv')

Box Plot

To show the distribution of our query runtime data, we use a box plot to compare runtimes across the queries in our experiment. A box plot provides information at a glance, giving a general overview of the data.

from PAPyA.Rank import SDRank

# Box plot example for queries 1, 2, and 3 of the schema ranking dimension
SDRank(config, logs, '100M', 'schemas').plotBox(["Q1", "Q2", "Q3"])

Replicability

This library comes with the functionality to check the replicability of the system's performance when introducing different experimental dimensions. Replicability is calculated for one option of a chosen dimension while iterating over all options of the other dimensions.

# mode 0: replicability per query; mode 1: replicability on the average
from PAPyA.Rank import SDRank
SDRank(config, logs, '100M', 'storage').replicability(options = 'csv', mode = 1)

In the example code above, we used the storage dimension as the pivot dimension to iterate over the other dimensions (schemas and partition), with csv as the chosen storage option.

Replicability has two modes: the first calculates from the query rankings, while the other calculates from the average of the single-dimensional scores.

The result is a set of replicability scores for csv when varying the options of the other dimensions.
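Assuming the same signature as above, the query-level mode can be invoked analogously:

# mode 0 computes replicability from the per-query rankings instead of the averages
SDRank(config, logs, '100M', 'storage').replicability(options = 'csv', mode = 0)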

Replicability Comparison

PAPyA has a replicability comparison functionality to compare the replicability scores of two options, which the user specifies in the configuration file.

# configuration file
dimensions:
    schemas: ["vp", "extvp"]
    partition: ["horizontal", "predicate", "subject"]
    storage: ["csv", "avro", "parquet", "orc"]
query: 11
# comparing the replicability of two options in the schemas dimension; mode 0 compares globally, mode 1 compares locally
from PAPyA.Rank import SDRank
SDRank(config, logs, '100M', 'schemas').replicability_comparison(option = 'vp', mode = 1)

The example above compares vp with extvp in the schemas dimension while pivoting over all options of the other dimensions (partition and storage).

This function has two comparison modes: the first compares the options globally over the dimensions, while the other compares them locally.
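Under the same assumptions as the example above, a global comparison would use mode 0:

# mode 0 compares the two schema options globally over the other dimensions
SDRank(config, logs, '100M', 'schemas').replicability_comparison(option = 'vp', mode = 0)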

Replicability Visualization

To make it easier to understand the impact of replicability across a dimension's options, the library provides replicability plotting. In this example, we check the impact of the storage dimension on the schemas dimension.

from PAPyA.Rank import SDRank
SDRank(config, logs, '100M', 'schemas').replicability_plot('storage', mode = 0)

RTA

Ranking by Triangle Area (RTA) is a new ranking function we added to PAPyA to test its abstractions for adding user-defined ranking criteria beyond the Single Dimensional and Multi Dimensional Rankings it already provides. RTA calculates the area of the triangle formed by the ranks of the three dimensions. The higher the score, the better the configuration.

from PAPyA.Rank import RTA
RTA(config, logs, '250M').rta()
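For intuition only, the sketch below computes a triangle area from three per-dimension scores placed on radar-style axes 120° apart; this is an illustrative assumption, not necessarily the exact formula PAPyA uses:

import math

def triangle_area(r1, r2, r3):
    # Place the three dimension scores on axes separated by 120 degrees and
    # sum the areas of the three triangles between adjacent axes.
    s = math.sin(math.radians(120))
    return 0.5 * s * (r1 * r2 + r2 * r3 + r3 * r1)

# A configuration with higher per-dimension ranks covers a larger area.
print(triangle_area(0.9, 0.7, 0.8))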

Multi Dimensional Ranking

To get the configuration solutions of multi-dimensional ranking, we apply NSGA-II (Non-dominated Sorting Genetic Algorithm II) through the paretoQ and paretoAgg methods, which invoke the two types of multi-dimensional ranking respectively. The MDRank class takes three arguments: the config file, the log folder, and the dataset size of our experiments.

from PAPyA.Ranker import MDRank

multiRanking = MDRank(config, logs, '100M')
multiRanking.paretoQ()
multiRanking.paretoAgg()
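To illustrate the underlying idea of Pareto-based ranking, the sketch below extracts the non-dominated configurations from per-dimension scores; it is a simplified illustration with invented scores, not PAPyA's NSGA-II implementation:

# Simplified Pareto-front extraction over (schema, partition, storage) rank scores,
# where higher is better in every dimension. Scores are invented for illustration.
def pareto_front(configs):
    front = []
    for name, scores in configs.items():
        dominated = any(
            all(o >= s for o, s in zip(other, scores)) and other != scores
            for other_name, other in configs.items() if other_name != name
        )
        if not dominated:
            front.append(name)
    return front

configs = {
    "st.predicate.orc":   (0.9, 0.8, 0.9),
    "vt.subject.parquet": (0.7, 0.9, 0.6),
    "pt.subject.csv":     (0.5, 0.4, 0.3),
}
print(pareto_front(configs))  # pt.subject.csv is dominated and dropped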

Multi Dimensional Visualization

We can have a 3D visualization of the multi-dimensional ranking solutions according to the paretoAgg method, shown as shades of green projected onto the canvas.

MDRank(config, logs, '100M').plot()

Ranking Criteria Validation

Lastly, the library provides two metrics to evaluate the goodness of the ranking criteria: conformance and coherence, which can be called from the validator classes. Both of these take a list of the ranking criteria the user wants to evaluate.

from PAPyA.Ranker import Conformance, Coherence

conformance_set = ['schemas', 'partition', 'storage', 'paretoQ', 'paretoAgg', 'RTA']
coherence_set = ['schemas', 'partition', 'storage', 'paretoQ', 'paretoAgg', 'RTA']

Conformance(config, logs, '250M', conformance_set, 5, 28).run()
Coherence(config, logs, coherence_set).run('100M', '250M')
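As an illustration of the kind of agreement coherence measures, a common way to compare two rankings of the same configurations is a rank correlation such as Kendall's tau; the sketch below is generic and not necessarily PAPyA's exact definition:

from scipy.stats import kendalltau

# Hypothetical ranks of the same configurations on the 100M and 250M datasets.
ranks_100M = [1, 2, 3, 4, 5]
ranks_250M = [1, 3, 2, 4, 5]

tau, p_value = kendalltau(ranks_100M, ranks_250M)
print(f"Kendall's tau: {tau:.2f}")  # values close to 1.0 mean the rankings agree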

Ranking Criteria Validation Visualization

PAPyA also provides functionality to visualize the ranking criteria validations. For conformance, we can check the performance of different ranking criteria using a bar plot, while for coherence we have a heatmap plot that shows the coherence between two particular ranking sets of the user's choice.
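For instance, a conformance bar plot could be produced along these lines (the scores and plotting code are invented for illustration; PAPyA's own plots are shown in the figures below):

import matplotlib.pyplot as plt

# Hypothetical conformance scores per ranking criterion (values invented for illustration).
criteria = ['schemas', 'partition', 'storage', 'paretoQ', 'paretoAgg', 'RTA']
scores = [0.55, 0.48, 0.52, 0.71, 0.78, 0.69]

plt.bar(criteria, scores)
plt.ylabel('conformance score')
plt.title('Conformance of ranking criteria (illustrative values)')
plt.show()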

Conformance

Coherence

The example above plots a heatmap between the 100M and 250M datasets, with RTA as the chosen ranking criterion to observe.

Performance Analysis

              100M                 250M
SD Storage    st.predicate.orc     st.predicate.orc
SD Partition  st.subject.parquet   vt.subject.orc
SD Schema     st.predicate.orc     st.predicate.orc
ParetoAgg     pt.subject.csv       vt.predicate.parquet
ParetoQ       wpt.subject.orc      vt.subject.parquet

This table shows the top-ranked configuration for each ranking criterion (i.e., Single Dimensional and Multi Dimensional Ranking) for the 100M and 250M datasets.