Creating a Big Data Factory
Gary Nakamura, CEO, Concurrent, Inc.
May 5, 2014
http://insights.wired.com/profiles/blogs/creating-a-big-data-factory
It is time to retire the myth of the data science hero – the virtuoso who slays dragons and emerges with the treasure of an amazing app built on big data insights. If we examine leading companies, we find not only lots of smart people, but also entire processes and teams focused on doing great work over and over again. In successful organizations, big data applications are not the virtuoso effort of a lone data scientist. Rather, they are built by teams composed of analysts, data scientists, developers and operations staff working together to rapidly deliver applications with high business value, so the organization can systematically operationalize its data. The reason to move toward repeatable victories and away from the idea of virtuosity, as this article will explain, is that virtuosity is expensive, risky and doesn’t scale.
The Big Data Factory: Less Complexity, Reproducible Victory
In the early days at almost every one of the big data pioneers, application development ran more like a virtuoso process than a factory of teams. When most companies first start experimenting with big data, this pattern usually holds. But when they want to scale fast with reproducible results, well, they quickly find they need to run more like a factory.
Here’s what happens. Excitement about big data leads to experiments and sometimes even to transformative insights. Data scientists partner with developers or just hack on their own to create an innovative application—but frankly, a brittle one, with no process to recreate or maintain it. However sweet that victory was, companies quickly learn that it probably isn’t repeatable when pursuing 10 or 15 or 20 other apps at the same time. You want victory after victory, not one brittle application after another.
In turn, companies move away from this virtuoso process to a more methodical “Big Data Factory.” These factories already exist. For example, Twitter is not starting from scratch every time it recognizes a new opportunity to monetize Tweets; it is building on past success. And LinkedIn applications such as “People You May Know” and “Groups You May Like” started out as virtuoso products but, thanks to their success, became repeatable platforms that support other applications.
What’s Wrong with Virtuosos?
Businesses can’t afford the virtuoso approach to application development, relying on a single data scientist or developer for their victories. Many companies have learned lessons the hard way, finding themselves with a steep learning curve trying to maintain an application created by a virtuoso who flew the coop. Besides that, for the most important apps, no single data scientist (or developer) knows enough to create the whole thing on his or her own.
Businesses can’t afford complexity in application design, because complexity creates risk. You can’t afford to lose the one person who understands how the whole project fits together; otherwise you’ll find yourself unable to maintain or iterate on the application – and you must, because data is organic and changes with user behavior. Today, major companies like Twitter and LinkedIn depend entirely on adapting their applications to new data and to new patterns emerging in that data.
But with big data apps, whether created by a single person or a team, complexity is the norm: developers are still building applications with the equivalent of Hadoop assembly language (raw MapReduce) instead of more efficient tools and techniques, such as languages like Scala paired with development frameworks like Cascading. Big data companies like LinkedIn and Twitter were among the first to figure this out; they understood that while the Apache Hadoop projects are crucial for building an infrastructure, they are not optimal for creating and deploying numerous applications. The end goal, therefore, is to build enterprise applications powered by Hadoop without having to become an expert in its intricacies.
The difference between an inferior tool that sort of solves the problem and a tool that solves it completely should be obvious: better tools overcome complexity. Compare an application written in Cascading with one built the incumbent way. To stand up the same application, you hand operations a single file, rather than 17 or 18 files and 20 different scripts scattered across incongruous projects.
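To make that single-file point concrete, here is a minimal sketch of what such an application can look like: a word count written against the Cascading 2.x Java API. The input and output paths, field names and job name are placeholders of my own choosing, and exact constructors vary by version; the point is simply that the data source, the processing logic and the output all live in one small artifact.

    import java.util.Properties;

    import cascading.flow.Flow;
    import cascading.flow.FlowDef;
    import cascading.flow.hadoop.HadoopFlowConnector;
    import cascading.operation.aggregator.Count;
    import cascading.operation.regex.RegexSplitGenerator;
    import cascading.pipe.Each;
    import cascading.pipe.Every;
    import cascading.pipe.GroupBy;
    import cascading.pipe.Pipe;
    import cascading.scheme.hadoop.TextDelimited;
    import cascading.scheme.hadoop.TextLine;
    import cascading.tap.Tap;
    import cascading.tap.hadoop.Hfs;
    import cascading.tuple.Fields;

    public class WordCount {
      public static void main(String[] args) {
        String inPath = args[0];   // HDFS path to raw text (placeholder)
        String outPath = args[1];  // HDFS path for the word counts (placeholder)

        // Source and sink taps: where the data comes from and where results go.
        Tap docTap = new Hfs(new TextLine(new Fields("line")), inPath);
        Tap wcTap = new Hfs(new TextDelimited(true, "\t"), outPath);

        // Pipe assembly: split each line into words, group by word, count each group.
        Pipe wcPipe = new Pipe("wordcount");
        wcPipe = new Each(wcPipe, new Fields("line"),
            new RegexSplitGenerator(new Fields("word"), "\\s+"));
        wcPipe = new GroupBy(wcPipe, new Fields("word"));
        wcPipe = new Every(wcPipe, Fields.ALL, new Count(), Fields.ALL);

        // Wire the taps and the pipe assembly into a flow and run it on the cluster.
        FlowDef flowDef = FlowDef.flowDef()
            .setName("wordcount")
            .addSource(wcPipe, docTap)
            .addTailSink(wcPipe, wcTap);

        Flow flow = new HadoopFlowConnector(new Properties()).connect(flowDef);
        flow.complete();
      }
    }

Everything the job needs, the taps, the pipe assembly and the flow that connects them, is described in this one file that operations can deploy, which is what makes the hand-off so much simpler than a pile of raw MapReduce jobs held together by glue scripts.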
In order to remain sustainable, businesses need repeatable, transparent development processes that can generate maintainable products—like a factory.
What Does a Big Data Factory Look Like?
Let’s compare a Big Data Factory to an automotive manufacturer. They’re alike in that an entire team designs and produces the product. The data scientist is like an automotive design engineer; developers are like the mechanical and electrical engineers who build a car prototype; operations creates and runs the factory that makes the cars; and early users who provide feedback are like test drivers. From this team comes a high-quality product—be it a new-model Chevrolet or a business application. Some applications will be more successful than others, but all of them are drivable and maintainable—and, importantly, were created using a repeatable process.
In auto manufacturing, computer-aided design (CAD) was a tremendous advance over the drafting table, and I believe application framework tools are a comparable advance over Hadoop assembly language. Today, teams don’t need to know an assembly language like MapReduce; instead, they can focus on marrying business problems to the data. As on an automotive assembly line, teams can develop and iterate an application very quickly, and once they feel it’s production-ready, they can launch it.
I mentioned quick iteration; the key to it is collaboration, which a user-friendly application framework enables. No one person, not even the most brilliant data scientist, can decipher exactly what is going on with ever-changing organic data and then translate that into a full-blown solution. The team as a whole needs to interpret the results of its last test run and tweak the data application as needed.
Starting Your Big Data Factory
A company that has just entered the big data business, whether by desire or by market pressure, doesn’t have to go through the trenches that Twitter, eBay and LinkedIn have already dug. Most companies can’t afford to, nor do they have the in-house skills or resources to navigate and survive such complexity. And why should they? We have a host of big data giants today showing us how to build big data factories that turn out a quality product through repeatable processes. And just as in modern auto manufacturing, it all comes down to teamwork and using the right tools.
So how does a company go about creating its own big data factory? First, do your research to identify the right big data tools. As I recently told Software Magazine, I recommend selecting tools that are widely deployed, hardened and easily incorporated into standard development processes.
Next, think teamwork. Once you know which tools you want to use, assess the skills gap you face. You may have thought you needed someone with MapReduce skills, but after doing due diligence on the available options, you will find that you can leverage existing Java developers, data analysts and data warehouse ETL experts as well as data scientists. Make sure your team includes people with deep business knowledge and an understanding of the data and its business implications.
With the right tools and with a realistic assessment of the skills you have versus those you need, you will be ready to create your own big data factory. The benefit is being able to achieve the repeatable victories that deliver real business value, month over month and year over year.
I’ll take that over virtuosity any day.
Gary Nakamura is the CEO of Concurrent.