Member since: 10-04-2016
Posts: 243
Kudos Received: 281
Solutions: 43
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 1173 | 01-16-2018 03:38 PM
 | 6141 | 11-13-2017 05:45 PM
 | 3035 | 11-13-2017 12:30 AM
 | 1520 | 10-27-2017 03:58 AM
 | 28432 | 10-19-2017 03:17 AM
08-03-2017
04:00 PM
Cool! If you had mentioned the specifics in your question, you would have got help sooner 🙂
08-03-2017
06:06 AM
Basically, your Oozie workflow should pick up a job.properties file from a path on the cluster. The job properties file for each environment holds the details of that environment's Hive server. In your workflow you reference the database and Hive server values from the job properties file instead of hard-coding them. That way your workflow stays generic: depending on which environment you execute it in, it picks up the matching job properties and consequently the required Hive server.
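As a rough sketch of what that can look like (the property names, hosts and paths below are made up for illustration; your keys will differ):
# One properties file per environment; prod.properties carries the same keys with prod values
cat > dev.properties <<'EOF'
nameNode=hdfs://dev-namenode:8020
jobTracker=dev-resourcemanager:8050
hive_jdbc_url=jdbc:hive2://dev-hiveserver:10000/default
hive_database=dev_db
EOF
# workflow.xml references these as ${hive_jdbc_url} and ${hive_database};
# submit the same workflow against whichever environment you need:
oozie job -oozie http://oozie-host:11000/oozie -config dev.properties -run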
07-18-2017
08:10 PM
2 Kudos
@Sundar Lakshmanan When you have a derived column, you should give it an alias instead of treating it like a variable assignment, e.g. SELECT (your calculation for the derived column) AS derived_column.
SELECT CONSUMER_R, CNTRY_ISO_C,
(CASE WHEN DTBIRTH_Y = '0001-01-01' THEN 0 ELSE CAST(DATEDIFF(current_date, '0001-01-01') / 365 AS SMALLINT) END) AS NEW_AGE,
....
07-14-2017
06:56 PM
1 Kudo
@Viswa According to the official Apache documentation, the number of reducers is set to 1 by default. You can override this with the following properties:
For MR1: set mapred.reduce.tasks=N
For MR2: set mapreduce.job.reduces=N
The right number of reduces seems to be 0.95 or 1.75 multiplied by (<no. of nodes> * <no. of maximum containers per node>). With 0.95, all of the reduces can launch immediately and start transferring map outputs as the maps finish. With 1.75, the faster nodes will finish their first round of reduces and launch a second wave of reduces, doing a much better job of load balancing. Increasing the number of reduces increases the framework overhead, but improves load balancing and lowers the cost of failures. The scaling factors are slightly less than whole numbers to reserve a few reduce slots in the framework for speculative tasks and failed tasks.
Now, to understand the number of tasks spawned, I would point you to this blog. In MR1, the number of tasks launched per node was specified via the settings mapred.map.tasks.maximum and mapred.reduce.tasks.maximum. In MR2, you can determine how many concurrent tasks are launched per node by dividing the resources allocated to YARN by the resources allocated to each MapReduce task, and taking the minimum of the two types of resources (memory and CPU). Specifically, you take the minimum of yarn.nodemanager.resource.memory-mb divided by mapreduce.[map|reduce].memory.mb, and yarn.nodemanager.resource.cpu-vcores divided by mapreduce.[map|reduce].cpu.vcores. That gives you the number of tasks that will be spawned per node; see the worked example below.
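To make the arithmetic concrete, here is a tiny sketch using made-up values (your cluster's settings will differ):
# Hypothetical node configuration, for illustration only:
#   yarn.nodemanager.resource.memory-mb  = 16384
#   mapreduce.map.memory.mb              = 2048
#   yarn.nodemanager.resource.cpu-vcores = 8
#   mapreduce.map.cpu.vcores             = 1
echo $(( 16384 / 2048 ))   # memory allows 8 concurrent map tasks on this node
echo $(( 8 / 1 ))          # vcores allow 8 concurrent map tasks on this node
# The node runs min(8, 8) = 8 map tasks at a time; the same arithmetic applies to reduce tasks.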
07-13-2017
07:24 PM
1 Kudo
Cool, I remember facing a similar problem when I first tried to add a new node. I guess you are using a lower version of Ambari/sandbox, just like me. Higher versions automatically install the ambari-agent when adding a new node to the cluster.
07-13-2017
06:05 PM
2 Kudos
@Karan Alang Verify the following (a few helper commands are sketched below):
1. Whether an ambari-agent is installed on the newly provisioned datanode.
2. If an ambari-agent is installed, try restarting it.
3. That the ambari-agent and ambari-server versions match; they must be the SAME.
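Assuming you have shell access to the new node and an RPM-based install, something along these lines should cover the checks:
ambari-agent status      # is the agent installed and running?
ambari-agent restart     # restart it if it is installed
rpm -qa | grep ambari    # list ambari package versions (run on the new node and on the Ambari server host, then compare)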
07-12-2017
03:24 PM
4 Kudos
@Harjinder Brar
1. You will have access to Ambari only to monitor your cluster and, say, start or stop a service.
2. You will not have access to any of the Ambari Views like Pig View, Hive View, HDFS Files View, etc.
3. You will have to execute all your tasks using the terminal.
4. It is recommended that you practice on the same version of HDP that you are going to get in the exam. Small version changes can lead to a lot of problems. For example, in a lower version of Pig you have to explicitly cast values when using a FOREACH ... GENERATE statement, but in higher versions of Pig it is not needed. It is better to get used to the exam version and avoid wasting time and effort during the exam.
5. The practice exam is your best resource to give you a feel of the environment: https://hortonworks.com/wp-content/uploads/2015/02/HDPCD-PracticeExamGuide1.pdf
6. If you have finished all the related tutorials available on the Hortonworks tutorial site, then you should be good. Just practice by altering scenarios in those tutorials and creating your own questions on those data sets to play around with. Ensure you have covered all objectives for HDPCD: https://hortonworks.com/services/training/certification/exam-objectives/#hdpcd
Here are a few tips that will help you in the exam:
1. Type your commands in a system editor like gedit and then copy one line at a time into your terminal. This way you will stop where an error occurs, and you will also spot syntactical errors when you first type your commands in gedit instead of directly on the terminal.
2. When doing Pig or Hive tasks, if the question does not ask you to use a specific execution engine, prefer Tez so that your jobs complete faster than with MapReduce. In a Hive session, set the execution engine to Tez: hive> set hive.execution.engine=tez; To open a Pig session with Tez as the execution engine, type the following in your terminal: pig -x tez
3. Read the questions properly; don't rush to start writing the solution. The questions are very easy, but there will be fine details that are easy to miss, and you will end up not scoring on that task. For example, in a hurry you may skip a part of the question which indicates that you need multiple conditions in your WHERE clause or FILTER statements in Hive/Pig. So even though the task executed with no error, it will give an incorrect answer and you will lose a point.
Wish you all the best!
07-11-2017
03:46 PM
3 Kudos
@Zack Riesland
As far as I know there is no such config property. You may have to add a few extra lines to your Hive script to view the values or redirect them to a separate file. You can view a value using select ${hiveconf:my_variable}; and to see the values of all the variables you can use set -v; (see the sketch below).
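A rough sketch of what I mean, with a made-up variable name and output paths, run from the shell rather than the Hive prompt:
# The backslash stops the shell from expanding ${...} so Hive performs the substitution:
hive --hiveconf my_variable=2017-07-11 -e "select '\${hiveconf:my_variable}';" > /tmp/my_variable_value.txt
hive -e "set -v;" > /tmp/all_hive_variables.txt   # dump every variable/setting the session sees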
07-05-2017
08:24 PM
1 Kudo
@Hugo Felix Thank you for sharing the tutorial. I was able to replicate the issue, and it turned out to be caused by incompatible jars. I am using the following exact versions, which I pass to spark-shell: spark-streaming-twitter_2.10-1.6.1.2.4.2.10-1.jar, twitter4j-core-4.0.4.jar, twitter4j-stream-4.0.4.jar. For test purposes I put them all under /tmp, and here is how I started the Spark shell: Syntax:
spark-shell --jars "/path/jar1,/path/jar2,/path/jar3"
Example:
spark-shell --jars "/tmp/spark-streaming-twitter_2.10-1.6.1.2.4.2.10-1.jar,/tmp/twitter4j-core-4.0.4.jar,/tmp/twitter4j-stream-4.0.4.jar"
Once I got the Scala prompt in spark-shell, I typed out the code from the tutorial without specifying any value for "spark.cleaner.ttl". Hope this helps.
07-05-2017
06:53 PM
2 Kudos
It requires the spark-streaming-twitter jar to be added. Download the jar from here: https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-twitter_2.10/1.0.0 Then open the Spark shell using:
spark-shell --jars /path/to/jarFile