Ensuring reliable application performance in a production Hadoop environment is really hard. We feel your pain and after speaking with dozens of organizations with mature implementations, some common themes emerge about the nature of these environments:
- Multiple business teams are developing applications using the technology that best suites their use case and/or skill set
- As a result, most environments are heterogeneous, meaning multiple frameworks and technologies in play (MapReduce, Hive, Pig, Scalding, Spark, Oozie, Sqoop, Control M… the list goes on and on and on)
- Most business critical applications have service levels expectations
- Lots of Hive queries out there, both scheduled and ad-hoc
- MapReduce is still the workhorse compute fabric but others like Apache Tez, Spark, Flink, Kafka, etc. are being tested, where it makes sense, with teams eventually rewriting or porting applications to other fabrics
Ok… so now imagine you are a member of the data engineering or IT operations team who is tasked with managing this hot mess of technology coupled with applications of varying degrees of business criticality and performance expectations. You basically have all the responsibility of optimizing cluster performance but almost no visibility to the applications running on it. At scale, this lack of visibility into the data pipeline or application performance makes providing a high quality of service pretty tough. Bottom-line is lack of visibility into an applications performance is the key issue that these teams find they cannot solve with cluster management solutions. It simply requires a different level of visibility within the environment.
The symptoms of a lack of application performance visibility manifest themselves in the following common operational challenges:
- We can’t maintain reliable, predictable performance for scheduled jobs/queries. If there is a slowdown, it takes too long to troubleshoot the issue and we cannot easily find the root cause
- We don’t know what teams are consuming what resources and if they are being consumed efficiently, especially related to ad-hoc Hive queries
- We are not sure what service levels we can commit to for business critical applications
- We don’t know which jobs/queries/applications are business critical when there is a slowdown, who owns them or the downstream impacts
- We have policies and best practices in place but its difficult to enforce them across the organization
There are a number of excellent cluster management and monitoring solutions, but they all focus on performance metrics for the cluster i.e. CPU utilization, I/O, disk performance, disk health, etc. However, when it comes to your applications, these solutions simply do not capture the metrics to provide application level performance monitoring. This why most DevOps teams are forced to wade through the resource manager, log files and thousands of lines of code to troubleshoot a problem.
To mitigate the reliability and performance risks in a shared service environment, you need to think about application performance monitoring and not just cluster performance monitoring. This means understanding performance metrics at the application level, not the job or task level. The context is actually quite different because you first need to understand what constitutes an application, i.e. what jobs, queries, etc. make up an end-to-end data flow. Then you need to understand what teams own those applications and their respective business priority. Only then understand what applications and teams are consuming the most resources, what applications are not running optimally or not following good design practices and what applications should be prioritized over others. Ensuring the right visibility is both tooling and an organizational issue so both need to be considered as you scale up your production environment.
In the next post in this series, we will discuss the different organizational models we see and the pro’s and con’s of each when it comes to scaling up and managing large production environments. We will also start to explore step-by-step how other organizations like HomeAway, Expedia, D&B and The Commonwealth Bank of Australia have improved their application performance visibility using readily available solutions.
About the Author: Kim is Sr. Director of Product Marketing at Concurrent, Inc., providers of Big Data technology solutions, Cascading and Driven. Cascading is an open source data processing development platform with over 500,000 downloads per month. Driven is a application performance management solution to monitor, collaborate and control all your Big Data applications in one place.
Take Driven for a free test drive and sign-up for our 30-day trial.
SHARE: