Member since: 03-16-2016
Posts: 707
Kudos Received: 1753
Solutions: 203
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 5182 | 09-21-2018 09:54 PM |
| | 6594 | 03-31-2018 03:59 AM |
| | 2001 | 03-31-2018 03:55 AM |
| | 2208 | 03-31-2018 03:31 AM |
| | 4908 | 03-27-2018 03:46 PM |
09-09-2016
11:27 PM
@Fish Berh Could you vote and accept my response? I suggested using the public URI of the server.
09-09-2016
08:53 PM
@Fish Berh You may also want to check a few options here: https://blog.rstudio.org/tag/sparkr/
09-09-2016
08:48 PM
3 Kudos
@Fish Berh I assume your RStudio is on your laptop. It seems that you are trying to access an internal IP from that laptop; you need to reference the public IP or the public URI of your server instead.
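A quick way to confirm the networking assumption is to check whether the server endpoint is reachable from the laptop at all. This is only a sketch: the host name and port below are placeholders, not values from this thread.

```python
import socket

# Hypothetical endpoint values -- replace with your server's public host/URI and the port
# your RStudio/SparkR connection actually uses.
PUBLIC_HOST = "your-server.example.com"
PORT = 8998

try:
    # A plain TCP connection attempt; an internal/private IP will typically time out
    # or be refused when tried from a laptop outside the cluster network.
    with socket.create_connection((PUBLIC_HOST, PORT), timeout=5):
        print(f"{PUBLIC_HOST}:{PORT} is reachable from this machine")
except OSError as err:
    print(f"Cannot reach {PUBLIC_HOST}:{PORT}: {err}")
```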
09-09-2016
05:03 PM
7 Kudos
@Tim David Excellent question! There are two cases: SQL, and notebooks like Zeppelin or IPython/Jupyter.

1) For SQL-like requests, I just finished a pilot project where we used JMeter for performance testing of Hive queries. This was NOT UI-driven, but direct calls to HiveServer2 with Hive queries at various parallelism levels. I plan to publish an HCC article soon (next week) on how to set up and execute JMeter tests on Hive. Additionally, the Tez view was used for tracking execution tasks, and the Resource Manager UI was used to track resources per task (containers, RAM and cores used per container). The same approach could also be used for SparkSQL or Phoenix SQL on HBase. All of these expose JMX metrics via REST APIs, so one could extract the execution data, analyze it externally, and correlate it with the JMeter metrics.

Unfortunately, system resource utilization per job/task is not covered by JMeter metrics and dashboards; JMeter only provides metrics on executions and throughput. A future release of HDP will add more metrics on resource utilization per application and task, which can be presented in a Grafana dashboard. Until then, something custom is needed. I wish I had a tool capable of connecting to YARN and extracting the resources used by each job/task directly; that would be a welcome JMeter plugin, for example.

For Hive queries, the traditional EXPLAIN will provide an understanding of the execution: stages, number of records, size. All of this helps reduce the number of records and bytes that must be churned through to produce the result, and helps achieve better parallelism to reduce response time. Functional review and testing of the SQL is done as with any other database running SQL.

2) For notebooks like Zeppelin/IPython/Jupyter, the approach is a bit more heterogeneous. These tools use a mix of languages and widgets. For example, a Zeppelin notebook could have blocks of SQL, Spark code written in Python, and Spark code written in Scala, also invoking user-defined functions (UDFs) written in Java. My approach is to test each individual piece (unit) in the traditional way for Java, Scala, or Python, as part of the core framework or specific extensions; that is the UNIT TEST. The pieces must be high quality before moving on to the INTEGRATION test, which is the entire notebook. These notebooks should be tested in dev and test before being deployed to PRODUCTION.

Taking into account that these notebooks behave like a web application, tools capable of testing web applications can still be used. The approach is simple: run the notebook and save the output as HTML, then compare that HTML with the expected HTML. This can be executed as part of your CI, assuming that you deal well with data changes. In the design of my notebooks, to ensure testability of dynamic blocks (changing data), I make sure each block of the notebook has a unique tag/identifier, so the testing tool can identify the block and compare it with the expected results. I have even used Selenium for that purpose.

IPython/Jupyter. A few tools:
https://github.com/bollwyvl/nosebook
https://github.com/taavi/ipython_nose
https://pypi.python.org/pypi/pytest-ipynb (my preference)

My approach for IPython (simpler if you stick only with Python): make sure your notebooks run correctly with "Run All"; for automatic testing to work, they should run all blocks in sequence. Test locally:

jupyter nbconvert --to=html --ExecutePreprocessor.enabled=True my-notebook.ipynb

This will convert your notebook to HTML.
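If you want to wire that local check into a test runner, a minimal pytest-style sketch could look like the following. The notebook path is hypothetical, pytest is simply my choice of runner here, and it assumes jupyter/nbconvert are installed on the test machine.

```python
import subprocess
from pathlib import Path

# Hypothetical notebook path -- adjust to your project layout.
NOTEBOOK = Path("notebooks/my-notebook.ipynb")

def test_notebook_runs_end_to_end(tmp_path):
    """Execute every cell via nbconvert and fail the test if any cell raises."""
    result = subprocess.run(
        [
            "jupyter", "nbconvert",
            "--to=html",
            "--ExecutePreprocessor.enabled=True",
            "--ExecutePreprocessor.timeout=3600",
            f"--output-dir={tmp_path}",
            str(NOTEBOOK),
        ],
        capture_output=True,
        text=True,
    )
    # nbconvert returns a non-zero exit code when execution of a cell fails.
    assert result.returncode == 0, result.stderr
```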
The nbconvert approach works with Jupyter and IPython >= 4. Next, you could run the same command in an isolated Docker container or in a CI step:

docker run my_container /bin/sh -c \
  "/usr/local/bin/jupyter nbconvert \
   --to=html \
   --ExecutePreprocessor.enabled=True \
   --ExecutePreprocessor.timeout=3600 \
   samples/my-sample.ipynb"
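Once the HTML has been produced (locally or in CI), the block-by-block comparison described above could be sketched roughly as below. BeautifulSoup and the block ids are my illustration choices, not tools named in this answer; the idea is simply to locate each uniquely tagged block and diff it against a stored reference.

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

# Hypothetical block identifiers embedded in the notebook output.
BLOCK_IDS = ["block-load-data", "block-sales-summary"]

def block_text(html, block_id):
    """Return the visible text of the element carrying the given id, or None."""
    element = BeautifulSoup(html, "html.parser").find(id=block_id)
    return element.get_text(strip=True) if element else None

def compare_outputs(current_html_path, reference_html_path):
    """Compare each tagged block in the freshly rendered HTML with the reference."""
    with open(current_html_path, encoding="utf-8") as f:
        current = f.read()
    with open(reference_html_path, encoding="utf-8") as f:
        reference = f.read()
    mismatches = [
        block_id
        for block_id in BLOCK_IDS
        if block_text(current, block_id) != block_text(reference, block_id)
    ]
    return mismatches  # an empty list means the notebook output matches the reference
```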
Zeppelin: if you stick with Python, you can reuse the IPython/Jupyter approach for most of it. If you add Scala to the mix, things get a bit more complicated, but the principle is similar: you can still save the output as HTML and compare it with a reference HTML. Any web-capable tool, e.g. JMeter, could handle this functional test. Otherwise, test each individual block with tools specific to Scala or SQL.

It is highly recommended to write reusable blocks of code which can be continuously tested. If it is a data scientist's hit-and-run work, that is a bit more difficult, and I am a strong believer that if you want a scalable, productionized version of the model, software engineering skill is still needed: someone who understands performance tuning and best coding practices for performance and even security. We all want to build frameworks that are functionally rich, where each function is high quality so it can be used by others in their notebooks.

The topic is very wide and I have timeboxed my response, sorry. I hope it helped. I put more focus on QA in my responses above because, at the end of the day, that is the most important part of the software development process: delivering software with the fewest bugs, at a reasonable development cost, and with processes and tools agile enough to make changes without going through expensive regression testing.

**********
If any of the responses to your question addressed the problem, don't forget to vote and accept the answer. If you fix the issue on your own, don't forget to post the answer to your own question. A moderator will review it and accept it.
09-09-2016
03:20 AM
@Smart Solutions As @Ravi Mutyala mentioned, those are only the client configurations. My understanding of your question is broader: the cluster as a whole, with server-side services and clients. I don't think you got a complete response. A simple script that searches for all *-site.xml files and tars them up would also help toward a complete answer.
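Roughly, such a script could look like the sketch below. The /etc search root is an assumption (HDP client configs typically live under /etc/<service>/conf); adjust it to your layout.

```python
import fnmatch
import os
import tarfile

SEARCH_ROOT = "/etc"                       # assumption: where *-site.xml files usually live
ARCHIVE = "cluster-client-configs.tar.gz"  # output bundle

with tarfile.open(ARCHIVE, "w:gz") as archive:
    for dirpath, _dirnames, filenames in os.walk(SEARCH_ROOT):
        for name in fnmatch.filter(filenames, "*-site.xml"):
            path = os.path.join(dirpath, name)
            # Store the full path so core-site.xml from different services don't collide.
            archive.add(path, arcname=path.lstrip("/"))

print(f"Wrote {ARCHIVE}")
```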
09-09-2016
03:14 AM
2 Kudos
@Suzanne Dimant It is your sandbox root account with the password you set initially when you accessed the sandbox.
09-09-2016
03:03 AM
1 Kudo
@Rajib Mandal I get it, but Sqoop jobs are usually kicked off by a scheduler. As I said, Sqoop already takes advantage of YARN containers and is MapReduce-dependent. The YARN distributed shell is not the appropriate way to handle this type of Sqoop job; it is merely an example of a non-MapReduce application built on top of YARN.

***
If any of the responses to your question helped, don't forget to vote and accept the answer. If you fix the issue on your own, don't forget to post the answer to your own question. A moderator will review it and accept it.
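To make the scheduler point concrete, here is a minimal, hypothetical wrapper that cron, Oozie's shell action, or any other scheduler could invoke. The connection string, credentials, table, and target directory are placeholders; Sqoop itself launches the MapReduce job on YARN, so the wrapper only kicks it off and propagates the exit code.

```python
import subprocess
import sys

# Hypothetical job parameters -- replace with your own source database and HDFS target.
SQOOP_CMD = [
    "sqoop", "import",
    "--connect", "jdbc:mysql://dbhost/sales",
    "--username", "etl_user",
    "--password-file", "/user/etl/.db_password",
    "--table", "orders",
    "--target-dir", "/data/raw/orders",
    "-m", "4",  # 4 parallel mappers, i.e. 4 YARN containers
]

def main():
    # Run the import and pass the exit code back so the scheduler can detect failures.
    result = subprocess.run(SQOOP_CMD)
    sys.exit(result.returncode)

if __name__ == "__main__":
    main()
```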
09-09-2016
12:38 AM
@Matt Andruff I wish I could see the SQL in question. If you are talking about a query where you want to select some fields from one table where a field value EXISTS in another table, I get it: that is two full table scans, most likely a small one (the lookup table) and a bigger one (the transactions table). That is useful when you have a lookup-like table whose values you don't know up front, the traditional dynamic-list filtering problem. IN is useful for a static list, which is why I don't see the point of comparing EXISTS with IN: use IN when you have a static list, and that is one table scan. My point was about the general use of EXISTS to determine whether a specific field value from the transactions table exists in the lookup table, which will not necessarily be a full table scan.
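Purely to illustrate the distinction, with hypothetical table and column names (not the SQL from the original question), here are the two shapes side by side; the queries are kept as plain strings so they could be handed to any SQL engine.

```python
# Hypothetical tables: transactions (t) and lookup (l).

# Static list of values known up front -> IN, a single scan of transactions.
static_list_query = """
SELECT t.id, t.amount
FROM   transactions t
WHERE  t.status IN ('NEW', 'PENDING')
"""

# Values only known at run time, held in a (usually small) lookup table -> EXISTS,
# which touches both tables: the lookup table plus the larger transactions table.
dynamic_list_query = """
SELECT t.id, t.amount
FROM   transactions t
WHERE  EXISTS (SELECT 1 FROM lookup l WHERE l.status = t.status)
"""
```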
09-08-2016
11:32 PM
2 Kudos
@Eric Periard HDFS does not need an ext4, ext3, or xfs file system to function; it can sit on top of raw JBOD disks. If that is the case, there is no further opportunity for compression. If in your case it sits on top of a file system, that is questionable as a best practice. What is your situation? Anyhow, there are other things you can do to maximize your storage even further, e.g. the ORC format. Keep in mind that heavier compression requires more cores for processing. Storage is usually cheaper, and aggressive compression can also bring performance problems, CPU bottlenecks, etc. All in moderation.
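As one example of the ORC suggestion, here is a sketch of writing a dataset as compressed ORC with PySpark; the input/output paths are hypothetical and it assumes a Spark environment is available.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("orc-compression-example").getOrCreate()

# Hypothetical input path -- adjust to your cluster.
raw = spark.read.csv("/data/raw/events", header=True, inferSchema=True)

# Columnar ORC with zlib compression trades CPU for a much smaller on-disk footprint.
(raw.write
    .format("orc")
    .option("compression", "zlib")   # "snappy" is lighter on CPU, "zlib" compresses harder
    .mode("overwrite")
    .save("/data/orc/events"))       # hypothetical output path
```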
09-08-2016
11:16 PM
2 Kudos
@Harry Yuen @Devin Pinkston is correct, for example: ${now():toNumber():format('yyyy-MM-dd')}. If you were converting a date string instead, you would also have had to supply the input date format.
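The same point in plain Python terms (a sketch with a hypothetical input string, not NiFi Expression Language itself): an epoch timestamp only needs an output format, while a date string must first be parsed with its declared input format before it can be reformatted.

```python
from datetime import datetime

# Epoch milliseconds (what now():toNumber() yields) only need an output format.
epoch_ms = 1473463020000
print(datetime.fromtimestamp(epoch_ms / 1000).strftime("%Y-%m-%d"))

# A date *string* must first be parsed with its input format before reformatting.
raw = "09/09/2016"                           # hypothetical incoming attribute value
parsed = datetime.strptime(raw, "%m/%d/%Y")  # input format declared here
print(parsed.strftime("%Y-%m-%d"))
```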