An effective and efficient MapReduce algorithm for computing BFS-based traversals of large-scale RDF graphs. (English) Zbl 1461.68026

Summary: Nowadays, a leading instance of big data is represented by Web data that lead to the definition of so-called big Web data. Indeed, extending beyond to a large number of critical applications (e.g., Web advertisement), these data expose several characteristics that clearly adhere to the well-known 3V properties (i.e., volume, velocity, variety). Resource Description Framework (RDF) is a significant formalism and language for the so-called Semantic Web, due to the fact that a very wide family of Web entities can be naturally modeled in a graph-shaped manner. In this context, RDF graphs play a first-class role, because they are widely used in the context of modern Web applications and systems, including the emerging context of social networks. When RDF graphs are defined on top of big (Web) data, they lead to the so-called large-scale RDF graphs, which reasonably populate the next-generation Semantic Web. In order to process such kind of big data, MapReduce, an open source computational framework specifically tailored to big data processing, has emerged during the last years as the reference implementation for this critical setting. In line with this trend, in this paper, we present an approach for efficiently implementing traversals of large-scale RDF graphs over MapReduce that is based on the Breadth First Search (BFS) strategy for visiting (RDF) graphs to be decomposed and processed according to the MapReduce framework. We demonstrate how such implementation speeds-up the analysis of RDF graphs with respect to competitor approaches. Experimental results clearly support our contributions.


68M11 Internet topics
05C82 Small world graphs, complex networks (graph-theoretic aspects)
68R10 Graph theory (including graph drawing) in computer science
68T09 Computational aspects of data analysis and big data
68T20 Problem solving in the context of artificial intelligence (heuristics, search strategies, etc.)
68W15 Distributed algorithms
Full Text: DOI


[1] Dean, J.; Ghemawat, S.; MapReduce: Simplified Data processing on Large Clusters; Commun. ACM: 2008; Volume 51 ,107-113.
[2] Ebay Data Warehouses; ; .
[3] Facebook Hadoop and Hive; ; .
[4] Facebook; ; . · Zbl 1357.91036
[5] MySpace; ; .
[6] NetFlix Documentation; ; .
[7] Leskovec, J.; Kleinberg, J.M.; Faloutsos, C.; Graphs over Time: Densification Laws, Shrinking Diameters and Possible Explanations; Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining: ; ,177-187.
[8] Bahmani, B.; Kumar, R.; Vassilvitskii, S.; Densest Subgraph in Streaming and MapReduce; Proc. VLDB Endow.: 2012; Volume 5 ,454-465.
[9] Zhong, N.; Yau, S.S.; Ma, J.; Shimojo, S.; Just, M.; Hu, B.; Wang, G.; Oiwa, K.; Anzai, Y.; Brain Informatics-Based Big Data and the Wisdom Web of Things; IEEE Intell. Syst.: 2015; Volume 30 ,2-7.
[10] Lane, J.; Kim, H.J.; Big Data: Web-Crawling and Analysing Financial News Using RapidMiner; Int. J. Bus. Inf. Syst.: 2015; Volume 19 ,41-57.
[11] RDF 1.1 Concepts and Abstract Syntax—W3C Recommendation 25 February 2014; ; .
[12] Cappellari, P.; Virgilio, R.D.; Roantree, M.; Path-Oriented Keyword Search over Graph-Modeled Web Data; World Wide Web: 2012; Volume 15 ,631-661.
[13] Bröcheler, M.; Pugliese, A.; Subrahmanian, V.S.; Dogma: A Disk-Oriented Graph Matching Algorithm for RDF Databases; The Semantic Web—ISWC: Berlin Heidelberg, Germany 2009; ,97-113.
[14] Fan, W.; Li, J.; Ma, S.; Tang, N.; Wu, Y.; Wu, Y.; Graph Pattern Matching: From Intractable to Polynomial Time; Proc. VLDB Endow.: 2010; Volume 3 ,264-275.
[15] Zhang, S.; Yang, J.; Jin, W.; Sapper: Subgraph Indexing and Approximate Matching in Large Graphs; Proc. VLDB Endow.: 2010; Volume 3 ,1185-1194.
[16] Yu, B.; Cuzzocrea, A.; Jeong, D.H.; Maydebura, S.; On Managing Very Large Sensor-Network Data Using Bigtable; Proceedings of the 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid): ; ,918-922.
[17] Yu, B.; Cuzzocrea, A.; Jeong, D.; Maybedura, S.; A Bigtable/MapReduce-based Cloud Infrastructure for Effectively and Efficiently Managing Large-Scale Sensor Networks; Data Management in Cloud, Grid and P2P Systems: Berlin Heidelberg, Germany 2012; ,25-36.
[18] Hadoop; ; .
[19] Cuzzocrea, A.; Furfaro, F.; Mazzeo, G.M.; Saccà, D.; A Grid Framework for Approximate Aggregate Query Answering on Summarized Sensor Network Readings; Proceedings of the OTM Confederated International Workshops and Posters, GADA, JTRES, MIOS, WORM, WOSE, PhDS, and INTEROP 2004: ; ,144-153.
[20] Cuzzocrea, A.; Furfaro, F.; Greco, S.; Masciari, E.; Mazzeo, G.M.; Saccà, D.; A Distributed System for Answering Range Queries on Sensor Network Data; Proceedings of the Third IEEE International Conference on Pervasive Computing and Communications Workshops, 2005. PerCom 2005 Workshops: ; ,369-373.
[21] Cuzzocrea, A.; Data Transformation Services over Grids With Real-Time Bound Constraints; On the Move to Meaningful Internet Systems: OTM 2008: Berlin Heidelberg, Germany 2008; ,852-869.
[22] Ghemawat, S.; Gobioff, H.; Leung, S.T.; The Google Fle System; Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles: ; ,29-43.
[23] Apache Nutch; ; .
[24] Amazon; ; .
[25] Elastic MapReduce Web Service; ; .
[26] Amazon Elastic Compute Cloud—EC2; ; .
[27] NetFlix; ; . · Zbl 1330.62090
[28] Hulu; ; .
[29] HBase—Apache Software Foundation Project Home Page; ; .
[30] Abadi, D.J.; Boncz, P.A.; Harizopoulos, S.; Column oriented Database Systems; Proc. VLDB Endow.: 2009; Volume 2 ,1664-1665.
[31] Cattell, R.; Scalable SQL and NoSQL Data Stores; SIGMOD Rec.: 2010; Volume 39 ,12-27.
[32] Shvachko, K.; Kuang, H.; Radia, S.; Chansler, R.; The Hadoop Distributed File System; Proceedings of the IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST): ; ,1-10.
[33] Chang, F.; Dean, J.; Ghemawat, S.; Hsieh, W.C.; Wallach, D.A.; Burrows, M.; Chandra, T.; Fikes, A.; Gruber, R.E.; Bigtable: A Distributed Storage System for Structured Data; ACM Trans. Comput. Syst.: 2008; Volume 26 .
[34] Lin, J.; Dyer, C.; Data-Intensive Text Processing with MapReduce; Synthesis Lectures on Human Language Technologies: San Rafael, CA, USA 2010; .
[35] Bloom, B.H.; Space/Time Trade-Offs in Hash Coding with Allowable Errors; Commun. ACM: 1970; Volume 13 ,422-426. · Zbl 0195.47003
[36] Snappy: A Fast Compressor/Decompressor; ; .
[37] Broekstra, J.; Kampman, A.; van Harmelen, F.; Sesame: A Generic Architecture for Storing and Querying RDF and RDF Schema; The Semantic Web—ISWC: Berlin Heidelberg, Germany 2002; ,54-68. · Zbl 1048.68693
[38] Decker, S.; Melnik, S.; van Harmelen, F.; Fensel, D.; Klein, M.C.A.; Broekstra, J.; Erdmann, M.; Horrocks, I.; The Semantic Web: The Roles of XML and RDF; IEEE Intern. Comput.: 2000; Volume 4 ,63-74.
[39] Beckett, D.J.; The Design and Implementation of the Redland RDF Application Framework; Comput. Netw.: 2002; Volume 39 ,577-588.
[40] Huang, J.; Abadi, D.J.; Ren, K.; Scalable SPARQL Querying of Large RDF Graphs; Proc. VLDB Endow.: 2011; Volume 4 ,1123-1134.
[41] Herman, I.; Introduction to Semantic Web Technologies; SemTech: 2010; .
[42] Wikipedia; ; . · Zbl 1270.68302
[43] DBpedia; ; .
[44] RDQL—A Query Language for RDF—W3C Member Submission 9 January 2004; ; .
[45] Gabrilovich, E.; Markovitch, S.; Computing Semantic Relatedness Using Wikipedia-based Explicit Semantic Analysis; Proceedings of the 20th International Joint Conference on Artifical Intelligence: ; ,1606-1611. · Zbl 1182.68319
[46] Chandramouli, N.; Goldstein, J.; Duan, S.; Temporal Analytics on Big Data for Web Advertising; Proceedings of the 2012 IEEE 28th International Conference on Data Engineering (ICDE): ; ,90-101.
[47] Chen, C.C.Y.; Das, S.K.; Breadth-First Traversal of Trees and Integer Sorting in Parallel; Inf. Process. Lett.: 1992; Volume 41 ,39-49. · Zbl 0743.68066
[48] Niewiadomski, R.; Amaral, J.N.; Holte, R.C.; A Parallel External-Memory Frontier Breadth-First Traversal Algorithm for Clusters of Workstations; Proceedings of the International Conference on Parallel Processing: ; ,531-538.
[49] Chen, C.C.Y.; Das, S.K.; Akl, S.G.; A Unified Approach to Parallel Depth-First Traversals of General Trees; Inf. Process. Lett.: 1991; Volume 38 ,49-55. · Zbl 0736.68032
[50] Dittrich, J.; Quiané-Ruiz, J.A.; Efficient Big Data Processing in Hadoop MapReduce; Proc. VLDB Endow.: 2012; Volume 5 ,2014-2015.
[51] Chen, Y.; Alspaugh, S.; Katz, R.H.; Interactive Analytical Processing in Big Data Systems: A Cross-Industry Study of MapReduce Workloads; Proc. VLDB Endow.: 2012; Volume 5 ,1802-1813.
[52] Papailiou, N.; Tsoumakos, D.; Konstantinou, I.; Karras, P.; Koziris, N.; H2RDF: Adaptive Query Processing on RDF Data in the Cloud; Proceedings of the 21st International Conference on World Wide Web: ; ,397-400.
[53] SPARQL 1.1 Overview—W3C Recommendation 21 March 2013; ; .
[54] Przyjaciel-Zablocki, M.; Schätzle, A.; Skaley, E.; Hornung, T.; Lausen, G.; Map-Side Merge Joins for Scalable SPARQL BGP Processing; Proceedings of the 2013 IEEE 5th International Conference on Cloud Computing Technology and Science (CloudCom): ; ,631-638.
[55] Jiang, H.; Chen, Y.; Qiao, Z.; Weng, T.H.; Li, K.C.; Scaling Up MapReduce-based Big Data Processing on Multi-GPU Systems; Clust. Comput.: 2015; Volume 18 ,369-383.
[56] Wang, Y.; Liu, Z.; Liao, H.; Li, C.; Improving the Performance of GIS Polygon Overlay Computation with MapReduce for Spatial Big Data Processing; Clust. Comput.: 2015; Volume 18 ,507-516.
[57] Kaoudi, Z.; Manolescu, I.; RDF in the Clouds: A Survey; VLDB J.: 2015; Volume 24 ,67-91.
[58] Rohloff, K.; Schantz, R.E.; High-Performance, Massively Scalable Distributed Systems Using the MapReduce Software Framework: The SHARD Triple-Store; Proceedings of the Programming Support Innovations for Emerging Distributed Applications: ; .
[59] Ladwig, G.; Harth, A.; CumulusRDF: Linked Data Management on Nested Key-Value Stores; Proceedings of the 7th International Workshop on Scalable Semantic Web Knowledge Base Systems (SSWS2011) at the 10th International Semantic Web Conference: ; ,30-42.
[60] Gergatsoulis, M.; Nomikos, C.; Kalogeros, E.; Damigos, M.; An Algorithm for Querying Linked Data Using Map-Reduce; Proceedings of the 6th International Conference, Globe 2013: ; ,51-62.
[61] Schätzle, A.; Przyjaciel-Zablocki, M.; Lausen, G.; PigSPARQL: Mapping SPARQL to Pig Latin; Proceedings of the International Workshop on Semantic Web Information Management: ; .
[62] Olston, C.; Reed, B.; Srivastava, U.; Kumar, R.; Tomkins, A.; Pig Latin: A Not-So-Foreign Language for Data Processing; Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data: ; ,1099-1110.
[63] Nie, Z.; Du, F.; Chen, Y.; Du, C.; Xu, L.; Efficient SPARQL Query Processing in MapReduce through Data Partitioning and Indexing; Web Technologies and Applications: Berlin Heidelberg, Germany 2012; ,628-635.
[64] Du, J.H.; Wang, H.; Ni, Y.; Yu, Y.; HadoopRDF: A Scalable Semantic Data Analytical Engine; Intelligent Computing Theories and Applications: Berlin Heidelberg, Germany 2012; Volume Volume 2 ,633-641.
[65] Punnoose, R.; Crainiceanu, A.; Rapp, D.; Rya: A Scalable RDF Triple Store for the Clouds; Proceedings of the 1st International Workshop on Cloud Intelligence: ; .
[66] Urbani, J.; Maassen, J.; Drost, N.; Seinstra, F.J.; Bal, H.E.; Scalable RDF Data Compression with MapReduce; Concurr. Comput. Pract. Exp.: 2013; Volume 25 ,24-39.
[67] Ravindra, P.; Anyanwu, K.; Nesting Strategies for Enabling Nimble MapReduce Dataflows for Large RDF Data; Int. J. Semant. Web Inf. Syst.: 2014; Volume 10 ,1-26.
[68] Ravindra, P.; Anyanwu, K.; Scaling Unbound-Property Queries on Big RDF Data Warehouses Using MapReduce; Proceedings of the 18th International Conference on Extending Database Technology (EDBT): ; ,169-180.
[69] Apache Pig; ; .
[70] Choi, P.; Jung, J.; Lee, K.H.; RDFChain: Chain Centric Storage for Scalable Join Processing of RDF Graphs Using MapReduce and HBase; Proceedings of the 12th International Semantic Web Conference and the 1st Australasian Semantic Web Conference: ; ,249-252.
[71] Kim, H.S.; Ravindra, P.; Anyanwu, K.; Scan-Sharing for Optimizing RDF Graph Pattern Matching on MapReduce; Proceedings of the 2012 IEEE 5th International Conference on Cloud Computing (CLOUD): ; ,139-146.
[72] Ravindra, P.; Kim, H.S.; Anyanwu, K.; An Intermediate Algebra for Optimizing RDF Graph Pattern Matching on MapReduce; The Semanic Web: Research and Applications: Berlin Heidelberg, Germany 2011; ,46-61.
[73] Zhang, X.; Chen, L.; Wang, M.; Towards Efficient Join Processing over Large RDF Graph Using MapReduce; Scientific and Statistical Database Management: Berlin Heidelberg, Germany 2012; ,250-259.
[74] Apache Jena Core RDF API; ; .
[75] Vitolo, C.; Elkhatib, Y.; Reusser, D.; Macleod, C.J.A.; Buytaert, W.; Web Technologies for Environmental Big Data; Environ. Model. Softw.: 2015; Volume 63 ,185-198.
[76] Jacob, F.; Johnson, A.; Javed, F.; Zhao, M.; McNair, M.; WebScalding: A Framework for Big Data Web Services; Proceedings of the IEEE First International Conference on Big Data Computing Service and Applications (BigDataService): ; ,493-498.
[77] Cooper, B.F.; Silberstein, A.; Tam, E.; Ramakrishnan, R.; Sears, R.; Benchmarking Cloud Serving Systems with YCSB; Proceedings of the 1st ACM Symposium on Cloud Computing, SoCC 2010: ; ,143-154.
[78] Silberstein, A.; Sears, R.; Zhou, W.; Cooper, B.F.; A Batch of PNUTS: Experiences Connecting Cloud Batch and Serving Systems; Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data: ; ,1101-1112.
[79] Apache Spark; ; . · Zbl 1360.68697
[80] Abedjan, Z.; Grütze, T.; Jentzsch, A.; Naumann, F.; Profiling and Mining RDF Data with ProLOD++; Proceedings of the IEEE 30th International Conference on Data Engineering: ; ,1198-1201.
[81] Kushwaha, N.; Vyas, O.P.; Leveragi0ng Bibliographic RDF Data for Keyword Prediction with Association Rule Mining (ARM); Data Sci. J.: 2014; Volume 13 ,119-126.
[82] Cuzzocrea, A.; A Framework for Modeling and Supporting Data Transformation Services over Data and Knowledge Grids with Real-Time Bound Constraints; Concurr. Comput. Pract. Exp.: 2011; Volume 23 ,436-457.
[83] Cuzzocrea, A.; Saccà, D.; Exploiting Compression and Approximation Paradigms for Effective And Efficient Online Analytical Processing over Sensor Network Readings in Data Grid Environments; Concurr. Comput. Pract. Exp.: 2013; Volume 25 ,2016-2035.
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.