Processing extremely large graphs has long been a challenge, but recent advances in Big Data technologies have made the task more practical. Tapad, an NYC-based startup focused on cross-device content delivery, has made graph processing the heart of its business model, using Big Data techniques to scale to terabytes of data.
Social networks like Facebook or Twitter contain data that naturally lends itself to a graph representation. But graphs can be used to represent less obvious data, as in the case of Tapad’s device graph. Dag Liodden, Tapad’s co-founder and CTO, describes why using a graph representation for devices makes sense:
Tapad takes a graph-oriented approach to modeling relationships between devices. Anonymous identifiers (such as cookie IDs) are represented as nodes in our Device Graph and we attach marketing information to these nodes. Edges between the nodes are scored / weighted using a combination of deterministic data and probabilistic statistical models / machine learning techniques. The concept of a “device” is defined as a starting device / node (let’s say the cookie ID of a browser) and the collection of nodes (let’s say the cookie IDs of a tablet and a connected TV) that are reachable from that starting point given a customizable set of edge constraints. Having an actual graph structure, as opposed to just aggregating information into a single node, gives us the flexibility to balance accuracy and scale dynamically, as well as to more easily augment the graph with new edge inference models.

Using the right tool for the right job matters for graph processing too: there is no need for Big Data technologies when a graph can be handled by more traditional workloads. As Dag says:
“Big Data” to me is the threshold where you no longer can use a small set of general purpose, off-the-shelf tools to store and analyze your data, but instead have to tailor different technologies to address specific use cases. These thresholds keep moving every year as software and hardware solutions evolve and mature, but so does the size of the data sets we deal with and the level of sophistication of the analysis we need to perform.

For Facebook, this threshold is in the single-digit petabytes, as detailed in their submission to the 2013 ACM SIGMOD conference in NYC. For Tapad, the amount of data in the graph is smaller but would still be impossible to process using traditional methods:
The US graph currently has about 1.1 billion nodes, representing mobile phones, tablets, laptops, gaming consoles and TVs. Some of these nodes are transient, for instance because they come from a browser with non-persistent cookies, and thus have little data and no edges. The non-transient nodes have about five edges on average and around 500 discrete pieces of information, such as behavioral segments, associated with them. The live graph data weighs in at multiple TB and we read / write from / to it several hundred thousand times per second across multiple data centers. Updates to the graph are geographically cross-replicated, and each data center currently serves off of servers backed by 20 TB of Flash SSD storage and 2 TB of RAM.

Recent years have seen a surge in the number of technologies used to process graphs at scale, 2013 especially, which saw several new additions to the ecosystem. There are two classes of systems to consider:
- Graph databases for OLTP workloads, offering quick, low-latency access to small portions of graph data.
- Graph processing engines for OLAP workloads, allowing batch processing over large portions of a graph.

The list of graph databases is already very long, but several projects have emerged and differentiated themselves recently. Neo4j is one of the oldest and most mature graph databases, but it still suffers from scalability issues since it does not yet support sharding. Another database that, albeit pretty young, gained a lot of popularity in 2013 is Titan. As a backend-agnostic graph database, it can leverage both HBase’s and Cassandra’s scalable architectures, and it uses an optimized internal vertex and edge representation that lets it scale to billions of edges, as reported in a 2013 blog post.
But one does not need a graph-specific database; more generic scalable NoSQL databases can also be an effective solution. Apache Accumulo, a technology based on Google’s BigTable and open-sourced in 2011, is an example of a generic database that is a good fit for storing graphs at scale: its flexible records can represent typed, weighted edges, and it is actually used for this purpose by the NSA, according to a technical report published in 2013. Cassandra and Aerospike are other examples of databases that, with a proper data model, can effectively represent a graph with vertices, edges and weights. Facebook also built its own solution on top of MySQL and Memcache, a system called Tao, which serves the social graph to its users. And according to Dag, Tapad followed the same philosophy in the design of their device graph:
The live graph lives in a key-value store to allow for fast traversals and updates. We regularly snapshot the graph into HDFS where it can be retrieved for more advanced graph processing and augmented with other data streams. The results are later fed back into the “live graph”. There are advantages to using a graph specific database, but our current setup with extremely fast, simple traversals of the graph in our key-value store and slow, but very flexible traversal and analysis on Hadoop is serving us well, at least for now.
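The fast, simple traversals Dag describes can be sketched in a few lines. In this illustration a plain Python dict stands in for the key-value store, each key holding a node's adjacency list, and a breadth-first walk collects the nodes reachable under an edge-score constraint, in the spirit of the "device" definition quoted earlier. All IDs, scores and thresholds are invented, not Tapad's actual data model:

```python
from collections import deque

# A dict stands in for the key-value store; each key is a node ID and
# each value is that node's adjacency list of (neighbor, edge score).
# IDs and scores are invented for illustration.
store = {
    "browser-cookie-1": [("tablet-cookie-7", 0.92), ("tv-cookie-3", 0.55)],
    "tablet-cookie-7": [("browser-cookie-1", 0.92), ("tv-cookie-3", 0.30)],
    "tv-cookie-3": [("browser-cookie-1", 0.55)],
}

def device(start, min_score):
    """Collect the nodes reachable from `start` over edges whose score
    clears the threshold: a starting node plus the nodes reachable
    under a customizable edge constraint."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor, score in store.get(node, []):
            if score >= min_score and neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen
```

Tightening the threshold shrinks the device to its highest-confidence nodes, which is one way to picture the accuracy/scale trade-off mentioned above: with these numbers, `device("browser-cookie-1", 0.9)` keeps only the browser and the tablet, while a threshold of 0.5 pulls in the TV as well.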
Even with a graph stored in a database, the operations that can be performed at scale will likely be limited to lookups and small traversals. More complex analysis over a larger portion of a graph calls for distributed batch-processing frameworks. The GraphLab framework uses the Message Passing Interface (MPI) model to scale and run complex algorithms on data in HDFS, with performance as its main selling point. More recent frameworks like Apache Giraph and Apache Hama are based on the Bulk Synchronous Parallel (BSP) paradigm popularized by Google’s Pregel project. The latest additions to the ecosystem are the GraphX project, unveiled in 2013 and running on top of Spark, and Faunus, which uses Hadoop MapReduce jobs to process graphs stored in a Titan database. Tapad is using these new technologies to process their offline graph data. According to Dag:
Currently, our main graph processing framework is Apache Giraph, but we are experimenting with Spark GraphX and GraphLab as well. All of these frameworks are still pretty young, the learning curve is pretty steep, and each comes with its own pros, cons and caveats. For instance, Giraph and GraphX are convenient because they fit nicely into our Hadoop infrastructure, but GraphLab is very appealing due to its sheer performance.
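The BSP model these frameworks share is simple to illustrate: computation proceeds in supersteps, and in each one a vertex reads its incoming messages, updates its state, and sends messages to neighbors; vertices without messages stay inactive, and the job ends when no messages remain. The toy single-process loop below runs label propagation to find connected components; it mimics the pattern Giraph distributes across workers but is not Giraph's actual API:

```python
def connected_components(adjacency):
    """Pregel-style connected components: every vertex adopts the
    smallest vertex ID it has seen, propagated along edges."""
    label = {v: v for v in adjacency}
    # Superstep 0: every vertex sends its label to all its neighbors.
    messages = {v: [] for v in adjacency}
    for v in adjacency:
        for n in adjacency[v]:
            messages[n].append(label[v])
    # Later supersteps: a vertex wakes only if it has messages, adopts
    # the smallest label received, and re-broadcasts only on a change,
    # so the loop terminates once labels stabilize.
    while any(messages.values()):
        nxt = {v: [] for v in adjacency}
        for v, inbox in messages.items():
            if inbox and min(inbox) < label[v]:
                label[v] = min(inbox)
                for n in adjacency[v]:
                    nxt[n].append(label[v])
        messages = nxt
    return label

# Vertices 1-2-3 form one component; vertex 4 is isolated.
components = connected_components({1: [2], 2: [1, 3], 3: [2], 4: []})
```

The same vertex program scales because each superstep only needs a vertex's own state plus its inbox, which is what lets Giraph or GraphX partition the vertices across a cluster.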
Some projects attempt to provide a unified framework answering both OLTP and OLAP queries. Dendrite, from Lab41, is one such project: it leverages GraphLab on top of Titan for storage and processing, and AngularJS for visualization. The project is still very young, having been unveiled in early 2014, so community reaction is limited, but its ambition to cover every use case should help drive adoption.
When searching for a lost aircraft, a sunken ship, or buried treasure, the Wisdom of Crowds can also be put to work when available data is extremely scarce.
In May 1968, the US submarine Scorpion disappeared on her way back to Newport News after completing a tour of duty in the North Atlantic. Although the Navy knew the Scorpion's last reported position, it had no idea what had happened to her, and only a vague sense of how far she had traveled after the last radio contact. To find the Scorpion, the Navy marked out a circular search area 32 kilometers in radius, in water thousands of feet deep. It was a nearly impossible task. The most plausible plan at the time seemed to be to hire three or four top experts on submarines and ocean currents to estimate the boat's location. But, as Sherry Sontag and Christopher Drew recount in their book Blind Man's Bluff: The Untold Story of American Submarine Espionage, a naval officer named John Craven proposed a different plan.
Readers interested in this topic may consult James Surowiecki's book The Wisdom of Crowds: Why the Many Are Smarter Than the Few and How Collective Wisdom Shapes Business, Economies, Societies and Nations. After the Malaysia Airlines incident, some proposed applying the wisdom of crowds to the search for the aircraft, but so far no one has put the idea into practice.
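The core of the wisdom-of-crowds idea is that pooling many independent, noisy estimates often beats any single expert's guess. A toy illustration, with entirely invented numbers standing in for individual guesses of a position in kilometers:

```python
# Individual guesses of a position (in km) scatter widely, but their
# average tends to land closer to the truth than most single guesses.
# All numbers here are invented for illustration.
guesses = [18.0, 25.5, 21.0, 30.0, 17.5, 23.0, 26.0, 19.0]

crowd_estimate = sum(guesses) / len(guesses)
```

Craven's actual procedure was more elaborate (he combined the group's bets into a composite probability map rather than a single average), but the principle, aggregating independent judgments, is the same.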
The information used in the Bayesian computation of the overall probability came from four phases of search operations. Phase one: a passive acoustic search for the underwater locator beacons. The flight data recorder and cockpit voice recorder carried by AF447 could help reconstruct the circumstances of the accident, and as the aircraft sank, the underwater locator beacons fitted to them emitted signals to aid in locating them. The beacons' batteries last at least 30 days, and 40 days on average; the search lasted 31 days and was called off on July 10, 2009. Two search vessels, the Fairmount Glacier and the Fairmount Expedition, both equipped with acoustic locating devices supplied by the US Navy, took part in the search. Phase two: side-scan sonar search. After the acoustic search ended, the BEA decided to continue searching with the IFREMER side-scan sonar carried aboard the Pourquoi Pas. In this phase, some areas that time constraints had kept out of the first phase were also covered. Phase three: side-scan sonar search. Phase four: the Bayesian search described in the previous section, which finally located the wreckage. Figure 4-2 illustrates the search process.
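The Bayesian machinery behind phase four can be shown in miniature. When a cell of the search grid is swept without success, the hypothesis that the wreck lies there is discounted by the sensor's detection probability, and the whole probability map is renormalized, shifting belief toward the unsearched cells. The grid and numbers below are invented for illustration, not BEA's actual values:

```python
# Bayesian search theory in miniature: after an unsuccessful sweep of a
# cell, that cell's prior is multiplied by the probability the wreck was
# there but missed, and the map is renormalized.
def update_after_miss(prior, searched_cell, detection_prob):
    posterior = dict(prior)
    posterior[searched_cell] *= (1 - detection_prob)
    total = sum(posterior.values())
    return {cell: p / total for cell, p in posterior.items()}

# Three cells with invented prior probabilities summing to 1.
prior = {"A": 0.5, "B": 0.3, "C": 0.2}

# Sweep cell A with a 90% chance of detecting the wreck if it is there.
posterior = update_after_miss(prior, "A", 0.9)
```

After the miss, cell A drops from 0.5 to about 0.09 while B and C rise; repeating this update after each fruitless sweep is what steered the AF447 search back toward areas it had initially discounted.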