PAPyA

Prescriptive Analysis for processing vast RDF datasets made easy.

This project is maintained by DataSystemsGroupUT

Bench-Ranker

Intro

This module benchmarks a user’s big data solutions using a prescriptive performance analysis approach. Bench-Ranker reduces the time required to calculate rankings, obtain useful visualizations, and determine the user’s best-performing configurations. Furthermore, the key performance indicator of this module is extensible to anything that is measurable for a specific implementation (e.g. query runtimes). Bench-Ranker also provides an easy, interactive environment through Python’s Jupyter Notebook, which helps users gain insights into their data.

To make the module scalable over the configuration space, Bench-Ranker allows any number of dimensions to be plugged into the solution space, for example schemas, partitioning, and storage formats. In addition, Bench-Ranker implements both Single Dimension (SD) Ranking and Multi Dimension (MD) Ranking. For both, Bench-Ranker provides visualizations that users can customize when interacting with the Notebook. Lastly, the system uses conformance and coherence to evaluate the goodness of a ranking criterion. A ranking criterion is “good” if it does not suggest low-performing configurations. We look at all ranking criteria (single dimension and multi dimension) and compare their results across different scales (i.e. dataset sizes).

Single Dimension Ranking

Bench-Ranker applies the ranking criterion for each dimension using a ranking function R, which gives the rank score of the ranked dimension (i.e. schemas, partitioning, storage formats). A rank set R is a set of elements ordered by a score. Below is the generalized version of the ranking function, which calculates the rank scores for the configurations:

Here R is the rank score of the ranked dimension, d is the total number of parameters (configurations) under that dimension, O_dim(r) is the number of times the dimension is placed at rank r (1st, 2nd, …), and Q is the total number of queries. The experiment has 11 query executions, so Q = 11.
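As a minimal sketch of how such a score could be computed, the snippet below assumes the rank score takes the form R = Σ_{r=1..d} O_dim(r) · (d − r) / (Q · (d − 1)). This form is an assumption chosen to be consistent with the symbols d, O_dim(r), and Q defined above. The function name `rank_score` and the sample ranks are hypothetical and are not part of Bench-Ranker's API.

```python
from collections import Counter


def rank_score(ranks, d):
    """Rank score of one configuration under the assumed formula.

    ranks: the rank (1 = best) the configuration obtained in each query.
    d:     number of configurations under the dimension.
    """
    if d < 2:
        raise ValueError("need at least two configurations to rank")
    q = len(ranks)                 # Q: total number of queries
    occurrences = Counter(ranks)   # O_dim(r): how often rank r was obtained
    weighted = sum(occurrences[r] * (d - r) for r in range(1, d + 1))
    return weighted / (q * (d - 1))


# Hypothetical data: one schema ranked against two others over Q = 11 queries.
example_ranks = [1, 1, 2, 1, 3, 1, 2, 1, 1, 2, 1]
score = rank_score(example_ranks, d=3)
print(round(score, 4))
```

Under this assumed form, a configuration that is always ranked first scores 1 and one that is always ranked last scores 0.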

Replicability

Bench-Ranker can check the replicability of a system’s performance when different experimental dimensions are introduced. Replicability means checking the system’s performance on a single dimension while changing the parameters of the other dimensions. In the experiment, we compare two configurations (partitioning and storage) on the schema dimension. The table below shows the effect of introducing different partitioning techniques and file formats on some schemas (ExtVp and WPT) relative to their baseline configurations (VP and PT).

The results show clear trade-offs between schema configurations, as shown in the table above. The module also provides visualizations for a better view of the data.

Multi Dimension Ranking

Single dimension ranking optimizes the configurations toward one particular dimension and ignores the trade-offs with the other dimensions. This motivates Multi Dimension Ranking criteria, which aim to optimize all dimensions at the same time. Bench-Ranker uses the Non-dominated Sorting Genetic Algorithm II (NSGA-II) to find the best-performing configurations in a complex solution space.

Bench-Ranker provides two ways to apply the NSGA-II algorithm:
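The core idea behind non-dominated sorting is Pareto dominance: a configuration is kept if no other configuration is at least as good on every dimension and strictly better on at least one. The sketch below illustrates only that dominance filter, not the full NSGA-II algorithm and not Bench-Ranker's implementation. The function names and the score tuples are hypothetical, and higher scores are assumed to be better.

```python
def dominates(a, b):
    """True if score tuple a is at least as good as b on every dimension
    and strictly better on at least one (higher is better)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(scores):
    """Return the names of configurations that no other configuration dominates.

    scores: dict mapping a configuration name to a tuple of per-dimension
            rank scores, e.g. (schema, partitioning, storage).
    """
    return sorted(
        name
        for name, s in scores.items()
        if not any(dominates(other, s)
                   for other_name, other in scores.items()
                   if other_name != name)
    )


# Hypothetical rank scores for four configurations.
configs = {
    "a": (0.9, 0.2, 0.5),
    "b": (0.6, 0.7, 0.6),
    "c": (0.5, 0.6, 0.4),  # worse than "b" on every dimension
    "d": (0.3, 0.9, 0.8),
}
front = pareto_front(configs)
print(front)
```

Here "c" is dropped because "b" beats it on all three dimensions, while "a", "b", and "d" each win on some dimension and so remain on the front.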

Triangle Area (RTA) Ranking

Bench-Ranker allows users to plug in a new ranking criterion in addition to the existing ones (Single Dimension and Multi Dimension Ranking). RTA is an example of adding a new ranking criterion to Bench-Ranker. This criterion interprets the Single Dimension Ranking scores in terms of a triangle area. The figure below shows the Single Dimension Ranking scores represented on the sides of a triangle, with the aim of maximizing the triangle’s area. The closer the scores are to the outer triangle, the better the configuration combination is.

The RTA formula uses the basic triangle area formula, summing the triangle areas for the three dimensions (schemas, partitioning, storage).
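One way to read this is sketched below. It assumes the three rank scores are plotted on three axes separated by 120°, so that each pair of adjacent scores spans a sub-triangle of area ½ · a · b · sin(120°), and the three sub-triangles are summed. That geometry, the function name `triangle_area_score`, and the sample scores are assumptions for illustration and are not confirmed as Bench-Ranker's exact formula.

```python
import math


def triangle_area_score(r_schema, r_partition, r_storage):
    """Sum of the three sub-triangle areas spanned by adjacent rank scores,
    assuming the three axes are 120 degrees apart."""
    angle = 2 * math.pi / 3  # 120 degrees between adjacent axes (assumption)
    pair_products = (
        r_schema * r_partition
        + r_partition * r_storage
        + r_storage * r_schema
    )
    return 0.5 * math.sin(angle) * pair_products


# Hypothetical rank scores: a balanced configuration vs. a lopsided one.
balanced = triangle_area_score(0.7, 0.7, 0.7)
lopsided = triangle_area_score(0.9, 0.2, 0.5)
print(balanced > lopsided)
```

Under this reading, a configuration that does well on only one dimension gets a small area, so the criterion favors combinations that score well across all three.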

Ranking Validation

Bench-Ranker validates all ranking criteria (i.e. SD Ranking and MD Ranking) using conformance and coherence. We consider a ranking criterion “good” if it does not suggest any low-performing configurations in our experiment. We use these metrics to look at all ranking criteria and compare them across different scales (i.e. dataset sizes).

We calculate conformance by checking where each element of a criterion’s top-ranked set falls in the initial per-query rankings. For example, consider the Rs ranking whose top-3 ranked configurations are {c1,c2,c3}, and suppose this set overlaps with the bottom-3 ranked configurations {c4,c2,c5} of only one query: c2 is in the 59th position out of 60 ranks (the rank before last). Thus, A(R) = 1 − 1/(11 ∗ 3), with k = 3 and Q = 11.
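The worked example can be reproduced with a short sketch. It assumes conformance is 1 minus the fraction of (query, top-k configuration) pairs in which the configuration lands in the bottom h positions of that query's ranking, which matches the 1 − 1/(11 ∗ 3) value above. The function name `conformance`, the parameter `h`, and the six-configuration rankings are hypothetical stand-ins for the 60 ranks in the experiment.

```python
def conformance(top_k, query_rankings, h):
    """Conformance of a ranking criterion under the assumed definition.

    top_k:          configurations the criterion ranks highest.
    query_rankings: one list per query, configurations ordered best to worst.
    h:              how many trailing positions count as low-performing.
    """
    k = len(top_k)
    misses = 0
    for ranking in query_rankings:
        bottom = set(ranking[len(ranking) - h:])  # the h worst configurations
        misses += sum(1 for config in top_k if config in bottom)
    return 1 - misses / (len(query_rankings) * k)


# Hypothetical data: in one query out of 11, c2 falls into the bottom 3.
good_query = ["c1", "c2", "c3", "c4", "c5", "c6"]
bad_query = ["c1", "c3", "c6", "c4", "c2", "c5"]
rankings = [bad_query] + [good_query] * 10

value = conformance(["c1", "c2", "c3"], rankings, h=3)
print(round(value, 4))
```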

In this experiment, we assume that rank sets have the same number of elements. Coherence is measured with Kendall’s distance between two rank sets R1 and R2, computed over P, the set of unique pairs of distinct elements in the two sets. For instance, the K index between R1={c1,c2,c3} and R2={c1,c2,c4} for 100M and 250M is 0.33, i.e., one disagreement out of three pair comparisons.
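A minimal sketch of the normalized Kendall distance follows. It covers only the standard case in which both rankings contain exactly the same elements; how Bench-Ranker treats an element that appears in just one ranking (such as c3 and c4 in the example above) is not specified here, so that case is rejected rather than guessed. The function name `kendall_distance` and the sample rankings are hypothetical.

```python
from itertools import combinations


def kendall_distance(r1, r2):
    """Fraction of element pairs that the two rankings order differently.

    r1, r2: lists of the same elements, ordered best to worst.
    """
    if set(r1) != set(r2) or len(r1) != len(set(r1)) or len(r2) != len(set(r2)):
        raise ValueError("rankings must contain the same distinct elements")
    if len(r1) < 2:
        raise ValueError("need at least two elements to compare pairs")
    pos1 = {c: i for i, c in enumerate(r1)}
    pos2 = {c: i for i, c in enumerate(r2)}
    pairs = list(combinations(r1, 2))  # P: unique pairs of distinct elements
    disagreements = sum(
        1 for a, b in pairs
        if (pos1[a] - pos1[b]) * (pos2[a] - pos2[b]) < 0
    )
    return disagreements / len(pairs)


# Hypothetical rankings of the same three configurations at two dataset sizes.
k_index = kendall_distance(["c1", "c2", "c3"], ["c1", "c3", "c2"])
print(round(k_index, 2))
```

Only the pair (c2, c3) is ordered differently in this example, so one of three pair comparisons disagrees. A value of 0 means the two rankings agree on every pair, and 1 means one ranking is the reverse of the other.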

Visualization

To give better insight into the experiment’s data, Bench-Ranker provides visualizations for both the single dimension and multi dimension ranking solutions, shown in the figure below. It also provides a radar plot that shows the trade-offs of using the single dimension ranking criteria. A default visualization is provided for the rankings, but users can specify their own, since the appropriate visualization depends on the use case.

In recent updates, Bench-Ranker provides more visualizations along with new functionality to help users better understand their data. The updates include the functionality explained below: