Support Questions

tim_david1954 · ‎09-08-2016

Hello,

I know this is a very generic question, but do you have any best practices or tools to share for the application development cycle in Hadoop? What do you use to move an application from dev to test to production? Do you use continuous integration? Is there a framework for Hadoop? Are there special considerations for specific tools?

Thanks

Tim

cstanca · ‎09-09-2016

@Tim David

Excellent question! Let me cover SQL and notebooks like Zeppelin or IPython/Jupyter.

1) For SQL-like requests, I just finished a pilot project where we used JMeter for performance testing of Hive queries. This was NOT UI-driven, but a direct call of Hive queries via HiveServer2 with various parallelism levels. I plan to publish an HCC article soon (next week) on how to set up and execute a JMeter test on Hive. Additionally, Tez view was used for tracking execution tasks, and Resource Manager UI was used to track resources per task (containers, plus RAM and cores used per container). The same approach could also be used for SparkSQL or Phoenix SQL on HBase. All of these expose JMX metrics through a REST API, so one could extract the execution data, analyze it externally, and correlate it with the JMeter metrics. Unfortunately, system resource utilization per job/task is not covered by JMeter metrics and dashboards; JMeter does, however, provide some metrics on executions and throughput. A future release of HDP will add more metrics on resource utilization per application and task, which can be represented in a Grafana dashboard. Until then, something custom is needed. I wish I had a tool capable of connecting to YARN and extracting the resources used by each job/task directly. That would be a welcome plugin for JMeter, for example.
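As a rough idea of the "something custom" mentioned above, here is a minimal Python sketch that reads per-application resource usage from the YARN ResourceManager REST API. It assumes your ResourceManager serves /ws/v1/cluster/apps/{appid} and that the response carries the elapsedTime, memorySeconds and vcoreSeconds fields; check this against your Hadoop version. The host name and the function names are hypothetical, and joining the result with JMeter samples (for example by timestamp) is left out.

```python
import json
from urllib.request import urlopen


def app_url(rm_base, app_id):
    """Build the ResourceManager REST URL for a single YARN application."""
    return f"{rm_base.rstrip('/')}/ws/v1/cluster/apps/{app_id}"


def summarize_app(payload):
    """Pick the resource fields out of a decoded /apps/{appid} response.

    Assumption: the fields are named as in the YARN REST API documentation;
    .get() returns None when a cluster version does not report one of them.
    """
    app = payload["app"]
    return {
        "id": app["id"],
        "elapsed_ms": app.get("elapsedTime"),
        "memory_mb_seconds": app.get("memorySeconds"),
        "vcore_seconds": app.get("vcoreSeconds"),
    }


def fetch_app_summary(rm_base, app_id, timeout=10):
    """Call the ResourceManager and return the summary for one application."""
    with urlopen(app_url(rm_base, app_id), timeout=timeout) as resp:
        return summarize_app(json.load(resp))
```

Usage would look like fetch_app_summary("http://my-rm-host:8088", "application_..."), with the application id taken from the Tez view or the Resource Manager UI for the query that JMeter fired.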

For Hive queries, the traditional EXPLAIN provides an understanding of the execution: stages, number of records, size. All of this can help reduce the number of records and bytes churned through to produce the result, and improve parallelism to reduce the response time.

Functional review and testing of the SQL is done as it is for any other database running SQL.

2) For notebooks like Zeppelin/IPython/Jupyter, the approach is a bit more heterogeneous.

These tools use a mix of languages and widgets. For example, a Zeppelin notebook could have blocks of SQL, Spark code written with Python and Spark code written with Scala, also invoking User-Defined Functions (UDF) written with Java.

My approach is to test each individual piece (UNIT) in the traditional way for Java, Scala, or Python, as part of the core framework or specific extensions. That is the UNIT TEST. The pieces must be high quality before moving on to an INTEGRATION test, which covers the entire notebook. These notebooks can be tested in dev and test before being deployed to PRODUCTION. Taking into account that these notebooks are like a web application, tools capable of testing web applications can still be used. The approach is simple. Run the notebook and save the output as HTML. Compare that HTML with the expected HTML. This can be executed as part of your CI, assuming that you deal very well with data changes. In the design of my notebooks, to ensure TESTABILITY of dynamic blocks (changing data) as well, I make sure that each block of the notebook has a unique tag/identifier, so that the testing tool can identify the block and compare it with expected results. I even used Selenium for that purpose.
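The compare-by-block idea above could be sketched as follows, using only the Python standard library. This is an illustration and not the tooling I actually used: it assumes each notebook block is rendered inside an element with a unique id attribute and that the HTML is reasonably well-formed (every non-void tag is closed). The class and function names are made up for the example.

```python
from html.parser import HTMLParser


class BlockText(HTMLParser):
    """Collect the text found inside every element that carries an id."""

    VOID = {"br", "img", "hr", "input", "meta", "link"}  # tags with no end tag

    def __init__(self):
        super().__init__()
        self.blocks = {}   # block id -> list of text fragments
        self._stack = []   # one entry per open tag: its id, or None

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID:
            return
        block_id = dict(attrs).get("id")
        self._stack.append(block_id)
        if block_id is not None:
            self.blocks.setdefault(block_id, [])

    def handle_endtag(self, tag):
        if tag not in self.VOID and self._stack:
            self._stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        # Text counts towards every enclosing element that has an id.
        for block_id in self._stack:
            if block_id is not None:
                self.blocks[block_id].append(text)


def extract_blocks(html):
    """Return {block id: text} for one rendered notebook."""
    parser = BlockText()
    parser.feed(html)
    parser.close()
    return {key: " ".join(parts) for key, parts in parser.blocks.items()}


def changed_blocks(actual_html, expected_html, block_ids):
    """List the ids, among block_ids, whose text differs from the reference."""
    actual = extract_blocks(actual_html)
    expected = extract_blocks(expected_html)
    return [b for b in block_ids if actual.get(b) != expected.get(b)]
```

Passing only the ids of the stable blocks to changed_blocks lets the CI step ignore blocks whose data legitimately changes between runs.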

IPython/Jupyter:

A few tools:

My approach for IPython (simpler if you stick only with Python):

Make sure your notebooks run correctly when running “Run All”. For automatic testing to work, they should run all blocks in sequence.
Test locally:

jupyter nbconvert --to=html --ExecutePreprocessor.enabled=True my-notebook.ipynb

This executes your notebook and converts it to HTML. It works with Jupyter and IPython >= 4.
Next, you could run the same command in an isolated Docker container or in a CI step:

docker run my_container /bin/sh -c "/usr/local/bin/jupyter nbconvert --to=html --ExecutePreprocessor.enabled=True --ExecutePreprocessor.timeout=3600 samples/my-sample.ipynb"
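If the CI step is driven from a script, the same nbconvert call can be wrapped in a few lines of Python. This is only a sketch: it assumes jupyter is on the PATH of the CI worker and that nbconvert exits with a non-zero status when a cell fails, which is its default behaviour as far as I know when --allow-errors is not set. The function names are hypothetical.

```python
import subprocess


def nbconvert_command(notebook, timeout_s=3600):
    """Build the argument list for executing a notebook and exporting HTML."""
    return [
        "jupyter", "nbconvert",
        "--to=html",
        "--ExecutePreprocessor.enabled=True",
        f"--ExecutePreprocessor.timeout={timeout_s}",
        notebook,
    ]


def run_notebook(notebook, timeout_s=3600):
    """Run the notebook; True means nbconvert finished with exit code 0."""
    result = subprocess.run(nbconvert_command(notebook, timeout_s))
    return result.returncode == 0
```

A CI job could call run_notebook("samples/my-sample.ipynb") for each notebook and fail the build on the first False.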

Zeppelin:

If you stick with Python, you can reuse most of the approach for IPython/Jupyter.

If you add Scala to the mix, things are a bit more complicated, but the principle is similar: you can still save the output as HTML and compare it with reference HTML. Any web-capable tool, e.g. JMeter, could handle this functional test. Otherwise, test each individual block with tools specific to Scala or SQL. It is highly recommended to write reusable blocks of code which can be continuously tested. With data scientist hit-and-run work that is a bit more difficult, and I am a strong believer that if you want a scalable, productionized version of the model, software engineering skill is still needed: someone who understands performance tuning and best coding practices for performance and even security. We all want to build frameworks that are functionally rich, where each function is of high enough quality that others can use it in their notebooks.

The topic is very wide and I just timeboxed my response. Sorry. I hope it helped.

I put more focus on QA in my responses above because, at the end of the day, that is the most important part of the software development process: delivering software with the fewest bugs at a reasonable development cost, with processes and tools agile enough to make software changes without going through expensive regression testing.

**********

If any of the responses to your question addressed the problem don't forget to vote and accept the answer. If you fix the issue on your own, don't forget to post the answer to your own question. A moderator will review it and accept it.


cstanca · ‎09-08-2016

@Tim David

I have been in software development for a long time and am a huge champion of agile development. The agile development approach works for Hadoop as it does for any other application development. The same best practices apply, e.g. CI, QA, automation-automation-automation. That part is similar, and you can be as creative as you need to be to deliver faster and better.

Regarding tools, once upon a time, MapReduce developers needed a framework to test their MapReduce jobs, and MRUnit was considered the framework of choice. However, that is no longer the case. Less and less MapReduce code is written by hand, and more and more is generated by different tools in the ecosystem (e.g. Hive, Pig, etc.) or by third-party tools, e.g. Talend Studio.

My recommendation is to choose development tools around the Hadoop ecosystem tools you plan to use and their programming languages. For example, if you write Spark with Scala, stick with Scala-specific tools. If you are a Java shop, just use the tools specific to Java.

I know that this is a generic response, but this is the idea. If you have specifics in mind, please submit another question with those specifics and I am sure that the Community, including myself, will be happy to chip in.


tim_david1954 · ‎09-09-2016

Thanks @Constantin Stanca

What about development that is not Java or Scala? I am thinking about Hive requests, notebooks, etc.

njayakumar · ‎09-12-2016

@Tim David - Scala would be ideal for Hadoop development.

tim_david1954 · ‎09-10-2016

Thanks for this detailed answer

Cloudera Community

Development cycle with Hadoop