A systematic Benchmarking on the performance of Spark-SQL for processing Vast RDF datasets
This project is maintained by DataSystemsGroupUT
To identify which configuration is the best, we need to optimize along all the dimensions simultaneously. In practice, this means designing a multi-dimensional ranking. To this end, we propose three alternative techniques that combine the ranking dimensions into a single unified ranking criterion.
The Average (AVG) criterion leverages an arithmetic interpretation of the rankings of our three experimental dimensions. In practice, it computes the arithmetic mean of the three rankings (R_s, R_p, and R_f), i.e., AVG = (R_s + R_p + R_f) / 3, so maximising AVG is equivalent to maximising their sum.
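As a minimal sketch, the AVG criterion can be written in a few lines of Python. The function name `avg_rank` and the sample rank scores are hypothetical and chosen only for illustration; they are not values from our experiments.

```python
def avg_rank(r_s: float, r_p: float, r_f: float) -> float:
    """Arithmetic mean of the three rank scores R_s, R_p, and R_f."""
    return (r_s + r_p + r_f) / 3.0


# Hypothetical rank scores for one configuration (illustrative only).
example_avg = avg_rank(0.6, 0.3, 0.9)
```

Because the divisor is constant, ordering configurations by this mean is the same as ordering them by the plain sum of their three rank scores.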
The Weighted Average (WAvg) criterion also leverages an arithmetic interpretation of the rankings of our three experimental dimensions. However, it assumes that each dimension contributes differently to the performance. Thus, it requires assigning a weight to each individual rank according to its impact in the experiments (e.g., we have 5 different storage backends, 3 partitioning techniques, and 3 relational schemas); see the following equation.
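The weighted variant can be sketched in the same way. This is only one possible reading: the sketch assumes that the weights are normalized by their sum and, purely for illustration, that each weight is the number of options in its dimension (3 schemas, 3 partitioning techniques, 5 storage backends). The function name `weighted_avg_rank`, the dictionary keys, and the sample rank scores are all hypothetical.

```python
from typing import Dict


def weighted_avg_rank(ranks: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted mean of per-dimension rank scores (weights normalized by their sum)."""
    total_weight = sum(weights[dim] for dim in ranks)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(weights[dim] * ranks[dim] for dim in ranks) / total_weight


# Hypothetical rank scores; the weights are an assumption, not the project's values.
ranks = {"schema": 0.6, "partitioning": 0.3, "storage": 0.9}
weights = {"schema": 3, "partitioning": 3, "storage": 5}
example_wavg = weighted_avg_rank(ranks, weights)
```

With equal weights, this reduces to the plain AVG criterion.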
The triangle-area criterion leverages a geometric interpretation of the rankings of our three experimental dimensions. It looks at the triangle spanned by the three rank scores (R_s, R_p, and R_f). The triangle sides represent the trade-offs between the ranking dimensions. The criterion aims at maximizing the area of this triangle (i.e., the blue triangle): the closer it is to the ideal (outer red triangle), the better it scores. In other words, the bigger the area this triangle covers, the better the performance across the three ranking dimensions altogether.
The following formula computes the actual triangle area. It sums the areas of the three sub-triangles A, B, and C, each computed from two of its sides, which are the rank scores of two of the dimensions R_s, R_p, and R_f (dashed triangle sides), and the angle between them (120° in this case). The resulting area is then normalized by dividing it by the area of the optimal **red** triangle D.
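The computation described above can be sketched in Python as follows. The sketch assumes that rank scores lie in [0, 1] with 1 being ideal, so that the optimal triangle D is the one obtained when all three scores equal `r_max = 1.0`; this bound, the function name `triangle_area_rank`, and the sample scores are assumptions made for illustration.

```python
import math


def triangle_area_rank(r_s: float, r_p: float, r_f: float,
                       r_max: float = 1.0, angle_deg: float = 120.0) -> float:
    """Area of the rank triangle, normalized by the area of the ideal triangle."""
    sin_angle = math.sin(math.radians(angle_deg))
    # Each sub-triangle (A, B, C) has area 1/2 * side_1 * side_2 * sin(angle).
    actual_area = 0.5 * sin_angle * (r_s * r_p + r_p * r_f + r_f * r_s)
    # Ideal triangle D (assumption): all three rank scores equal r_max.
    ideal_area = 0.5 * sin_angle * 3.0 * r_max * r_max
    return actual_area / ideal_area


# Hypothetical rank scores (illustrative only).
example_area = triangle_area_rank(0.6, 0.3, 0.9)
```

Under these assumptions the sine factor cancels, so the normalized score equals (R_s·R_p + R_p·R_f + R_f·R_s) / 3 when `r_max` is 1. Unlike the arithmetic mean, this score penalizes configurations that rank well in one dimension but poorly in the others.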
For example, the actual area of the blue triangle of the figure above is calculated as follows: