Storing and searching the nation's phone and email records is a difficult technical challenge. While computers have grown faster and more powerful, the volume of data involved has grown even faster. No single computer can handle the task. Instead, to deal with the billions of pieces of data, you need to divide the task among thousands of computers. The people who helped solve that challenge for the National Security Agency (NSA) now work in Cambridge, seeking to commercialize the software they created.
The NSA was not the first organization to confront this data storage and analysis challenge. Google, with its worldwide services, had invented its own solution, as had both Yahoo and Facebook. But the NSA determined that those systems, some of which were readily available for use, lacked a critical requirement. The NSA needed cell-level security: every individual piece of data in these massive databases had to carry a tag specifying who was permitted to access it. Seeing no other solution, the NSA built its own software, dubbed Accumulo. And then it gave it away.
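To make "cell-level security" concrete, here is a minimal Python sketch of the idea. It is not Accumulo's code or API: the Cell class, the can_read and scan functions, and the medical-style labels are all hypothetical, and the visibility model is simplified to a list of alternative label sets, whereas Accumulo's real labels are boolean expressions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    """One stored value plus its own access tag (hypothetical model)."""
    row: str
    column: str
    value: str
    # Any ONE of these label sets grants access; every label
    # inside the matching set must be held by the reader.
    visibility: tuple[frozenset[str], ...]


def can_read(cell, authorizations):
    """True if the reader's labels satisfy at least one required set."""
    return any(required <= authorizations for required in cell.visibility)


def scan(cells, authorizations):
    """Return only the cells this reader is allowed to see."""
    auths = frozenset(authorizations)
    return [cell for cell in cells if can_read(cell, auths)]


# Illustrative data only: three cells in the same row, each tagged differently.
CELLS = [
    Cell("patient-17", "name", "A. Example",
         (frozenset({"clinician"}), frozenset({"billing"}))),
    Cell("patient-17", "diagnosis", "withheld",
         (frozenset({"clinician"}),)),
    Cell("patient-17", "audit-note", "reviewed",
         (frozenset({"clinician", "auditor"}),)),
]

if __name__ == "__main__":
    for cell in scan(CELLS, {"billing"}):
        print(cell.column)
```

The point of the sketch is that the filter is applied per cell, not per table or per row: two readers scanning the same row can get back different columns, and a reader with no matching labels gets nothing.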
It may seem paradoxical that an agency so secret that its acronym - "NSA" - was once joked to mean "No Such Agency" would give away a unique solution like Accumulo. But in the world of complex software, it's common. There are no true secrets in Accumulo, just the results of clever software engineering that could be duplicated by anyone with the resources to do so. With the software released to the open source community, others would start to use it, find and fix problems, and extend its capabilities, all of which could, in turn, be used by the NSA. For Hadoop, the Yahoo-created massive database system, contributors include apparent competitors like Facebook and Twitter, both of which seem prepared to cooperate in advancing the technical state of the art while competing for customers' time and attention through features. So the NSA gave Accumulo to the Apache Software Foundation, a well-respected, non-profit steward of open source software. After a vetting process, Apache accepted the gift and launched Apache Accumulo.
But the Accumulo project ran into congressional roadblocks. Open source software development by the government is viewed as unfair, taxpayer-funded competition for commercial software developers. The 2014 Defense Authorization Bill specifically mentions Accumulo and requires that, before it can be used by the rest of the government, it meet certain standards of uniqueness and commercial viability. According to industry observers, this sort of detailed legislative mention of a particular open source software project is unprecedented.
In late 2012, a team of engineers left the NSA and formed Sqrrl, a startup company whose goal is to commercialize Accumulo. Originally headquartered in the Washington, DC area, Sqrrl relocated to Cambridge when it received $2 million in venture capital funds from Kendall Square's Atlas Venture. Sqrrl joins a growing number of companies in the Kendall Square area focusing on "big data" problems, the emerging field of managing and analyzing datasets whose size would have been unimaginable just a decade ago.
Sqrrl isn't shy about its NSA roots. Its founders' biographies include their NSA work. The Sqrrl web site says:
Apache Accumulo was born at NSA within the U.S. Intelligence Community. While we can't say much about how NSA currently uses Accumulo due to classification reasons, Accumulo was designed to help the Intelligence Community take advantage of the massive amounts of structured, semi-structured, and unstructured information available to it, while adhering to the strictest of security and privacy requirements.
Its web site also features a series of videos. In one, Sqrrl's Chief Technology Officer talks about the lessons learned while managing data for the NSA. Another explains the technology behind the recent NY Times report that the NSA can search, in real time, not only by sender and recipient but also by the contents of a communication.
Lessons Learned at the NSA
Conversely, one can glean insights into the areas of interest to the NSA. A Sqrrl blog post asks "Is Accumulo the world's most scalable graph store?" "Graphs", in the information storage context, are a way of storing data about relationships. Your Facebook friends, for example, are stored in what Facebook calls its "social graph". Who calls whom and who emails whom are data best stored as graphs, as are the connections within a terrorist network. The Sqrrl blog post notes an NSA scientific presentation (PDF) that talks about Accumulo's ability to handle "web scale" graphs, consisting of trillions of entries.
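As a rough illustration of why relationship data fits a graph, the following Python sketch stores "who calls whom" as a set of connections and asks who is within a few calls of a given person. It is a toy with made-up names and hypothetical function names (build_graph, within_hops), held in memory on one machine; it says nothing about how Accumulo or the NSA actually lay out graphs with trillions of entries.

```python
from collections import defaultdict


def build_graph(calls):
    """Turn (caller, callee) pairs into an undirected adjacency map."""
    graph = defaultdict(set)
    for caller, callee in calls:
        graph[caller].add(callee)
        graph[callee].add(caller)
    return graph


def within_hops(graph, start, hops):
    """Everyone reachable from `start` in at most `hops` calls."""
    seen = {start}
    frontier = {start}
    for _ in range(hops):
        # Expand one step outward, skipping people already visited.
        frontier = {n for node in frontier for n in graph.get(node, ())} - seen
        seen |= frontier
    return seen - {start}


# Made-up call records for illustration.
CALLS = [
    ("alice", "bob"),
    ("bob", "carol"),
    ("carol", "dave"),
    ("erin", "frank"),
]

if __name__ == "__main__":
    graph = build_graph(CALLS)
    print(sorted(within_hops(graph, "alice", 2)))
```

Each extra hop can multiply the number of people reached, which is one reason graphs of communications grow to the sizes the presentation describes.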
To be sure, Sqrrl's technology, the marriage of big data tools with security, has significant commercial potential. As the ability to analyze these massive datasets becomes readily available, more and more industries are discovering that they can be transformed by the insights gleaned from these data. And some of them, such as medicine, have security and privacy requirements that are good matches for a security scheme that satisfied the NSA. But the focus on the NSA, brought about by Edward Snowden's leaks to The Guardian revealing the extent of government surveillance of communications, has made Sqrrl's business goals harder to achieve. Ely Kahn, Sqrrl's vice president of business development, told the Ars Technica web site: "Because of everything that's going on with NSA, our early customers are not particularly excited about talking."
Update: My presentation on Sqrrl, the NSA, and free and open source software from LibrePlanet: