Member since: 10-06-2015
Posts: 273
Kudos Received: 202
Solutions: 81
My Accepted Solutions
Views | Posted
---|---
3388 | 10-11-2017 09:33 PM
2899 | 10-11-2017 07:46 PM
2161 | 08-04-2017 01:37 PM
1915 | 08-03-2017 03:36 PM
1811 | 08-03-2017 12:52 PM
08-13-2016
07:37 PM
2 Kudos
@RAMESH K Spark can run without HDFS. HDFS is only one of quite a few data stores/sources for Spark. Below are some links that answer your question in depth from different perspectives, with explanations and comparisons:
http://stackoverflow.com/questions/32669187/is-hdfs-necessary-for-spark-workloads/34789554#34789554
http://stackoverflow.com/questions/32022334/can-apache-spark-run-without-hadoop
http://stackoverflow.com/questions/28664834/which-cluster-type-should-i-choose-for-spark/34657719#34657719
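As a quick illustration, here is a minimal sketch of running Spark with no HDFS involved at all, using a local master and a file:// path; the HDP client paths and /tmp/sample.txt are assumptions for illustration only:

# Run the bundled SparkPi example locally, without YARN or HDFS.
/usr/hdp/current/spark-client/bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master "local[2]" \
  /usr/hdp/current/spark-client/lib/spark-examples*.jar 10

# Read from the local filesystem instead of HDFS by using a file:// URI.
echo 'println(sc.textFile("file:///tmp/sample.txt").count())' | \
  /usr/hdp/current/spark-client/bin/spark-shell --master "local[2]"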
08-08-2016
03:22 PM
2 Kudos
@Eric Brosch Your tooling selection really all depends on your particular use case.

For the "Speed" layer, you can use Storm or Spark Streaming. IMHO the main selection criterion between the two will be whether you're interested in ultra-low latency (Storm) or high throughput (Spark Streaming). There are other factors, but these are some of the main drivers.

For the "Serving" layer, your main choice is HBase. Depending on how you're going to query the "Serving" layer, you may want to consider putting Phoenix on top of HBase. Since HBase is a NoSQL store, it has its own API for making calls. Phoenix adds an abstraction layer on top of HBase and allows you to make queries in SQL format. Mind you, it's still in tech preview and may have some bugs here and there; it's also not meant for complex SQL queries.

For your ingest and simple event processing you can look into HDF/NiFi. If you move beyond the HDP/HDF stack for the serving layer, your options increase to include other NoSQL stores as well as regular SQL databases.

Below is a diagram of a sample Lambda architecture for a demo that receives sensor data from trucks and analyzes it, along with driver behaviour, to determine the possibility of a driver committing a traffic violation/infraction. It will give you a better idea of what a Lambda deployment may look like.
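To make the Phoenix point concrete, here is a minimal sketch of querying the serving layer through Phoenix's SQL interface instead of the native HBase API; the client path, the ZooKeeper quorum, and the truck_events table are assumptions chosen to match the truck demo above:

# Write the SQL to a file and run it through Phoenix's psql.py utility.
cat > serving_query.sql <<'EOF'
CREATE TABLE IF NOT EXISTS truck_events (
    event_id   BIGINT NOT NULL PRIMARY KEY,
    driver_id  INTEGER,
    event_type VARCHAR);
SELECT driver_id, COUNT(*) AS violations
FROM truck_events
WHERE event_type = 'violation'
GROUP BY driver_id;
EOF

# psql.py takes the HBase cluster's ZooKeeper quorum plus the SQL file to execute.
/usr/hdp/current/phoenix-client/bin/psql.py zk-host1:2181 serving_query.sql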
07-20-2016
08:50 PM
@mkataria With HDFS snapshots there is no actual data copying up front when a new snapshot is taken. A snapshot is simply a pointer to a point-in-time record, so when you first take a snapshot your HDFS storage usage will stay the same. Data is only copied/written when you subsequently modify it, following the copy-on-write (COW) concept. Please take a look at the JIRA below. It contains the discussion that led to the design and is quite informative. https://issues.apache.org/jira/browse/HDFS-2802
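As a quick illustration, here is a minimal sketch of taking a snapshot and confirming that usage does not change; the /data path and the snapshot name are just examples:

# Allow snapshots on the directory (run as the hdfs superuser).
sudo -u hdfs hdfs dfsadmin -allowSnapshot /data

# Record usage, take a snapshot, then check usage again -- it should be unchanged.
hdfs dfs -du -s -h /data
hdfs dfs -createSnapshot /data before-cleanup
hdfs dfs -du -s -h /data

# The snapshot is browsable under the hidden .snapshot directory.
hdfs dfs -ls /data/.snapshot/before-cleanup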
07-20-2016
08:39 PM
1 Kudo
@Heath Yates Take a look at the below posting. It lists all the dependencies as well as setup instructions (not all steps will apply to you though). http://www.rdatamining.com/big-data/r-hadoop-setup-guide
07-20-2016
08:30 PM
1 Kudo
@ANSARI FAHEEM AHMED 1) If you hover your mouse over the "HDFS Disk Usage" widget (upper left hand corner) in the Ambari Dashboard it will show you the following details:
DFS Used: Storage used for data
Non-DFS Used: Storage used for things such as logs, shuffle writes, etc...
Remaining: Remaining storage
2) From the command line you can also run "sudo -u hdfs hdfs dfsadmin -report", which will generate a full report of HDFS storage usage.
3) Finally, if you would like to check the disk usage of a particular folder (and its sub-folders), you can use commands like "hadoop fsck", "hadoop fs -dus" or "hadoop fs -count -q". For an explanation of the differences between these commands, as well as how to read the results, please take a look at this post:
http://www.michael-noll.com/blog/2011/10/20/understanding-hdfs-quotas-and-hadoop-fs-and-fsck-tools/
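For reference, here is a quick sketch of what those commands look like in practice; /data is just an example path (note that "hadoop fs -dus" is the older spelling of "hadoop fs -du -s"):

# Cluster-wide capacity, DFS used, non-DFS used, and a per-DataNode breakdown.
sudo -u hdfs hdfs dfsadmin -report

# Space consumed by a specific directory tree, in human-readable units.
hadoop fs -du -s -h /data

# Quotas, remaining quota, and directory/file counts for the same directory.
hadoop fs -count -q /data

# Filesystem health check for the subtree, including block information.
sudo -u hdfs hdfs fsck /data -blocks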
07-20-2016
01:06 PM
2 Kudos
@Mukesh Kumar The OSs supported with HDP 2.3.4+ are:
• 64-bit CentOS 6/7
• 64-bit RHEL 6/7
• 64-bit Oracle Linux 6/7
• 64-bit SLES 11 SP3/SP4
• 64-bit Debian 6/7
• 64-bit Ubuntu Precise 12.04/14.04
• Windows Server 2008/2012
From those, as well as the list you've shared, I would go with CentOS 6 if you're having challenges with Ubuntu. Install the following software on each of your hosts:
• yum (for RHEL or CentOS)
• zypper (for SLES)
• php_curl (for SLES)
• reposync (may not be installed by default on all SLES hosts)
• apt-get (for Ubuntu and Debian)
• rpm (for RHEL, CentOS, or SLES)
• scp
• curl
• wget
• unzip
• chkconfig (Ubuntu and Debian)
• tar
For complete instructions and requirements/prerequisites for a manual (non-Ambari) install of HDP, please refer to this guide: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.4/bk_installing_manually_book/bk_installing_manually_book-20151221.pdf
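On CentOS/RHEL, a minimal sketch of pulling in those prerequisites with yum (scp is provided by openssh-clients; run as root or via sudo):

# Install the utilities the HDP manual install expects on each host.
sudo yum install -y curl wget unzip tar chkconfig openssh-clients

# Verify they are on the PATH before starting the install.
which curl wget scp tar unzip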
07-19-2016
11:54 PM
As @Artem Ervits mentioned, the Oozie Spark action is not yet supported. Instead, you can follow the alternative from the tech note below: https://community.hortonworks.com/content/kbentry/51582/how-to-use-oozie-shell-action-to-run-a-spark-job-i-1.html
--------------------
Begin Tech Note
--------------------
Because the Spark action in Oozie is not supported in HDP 2.3.x and HDP 2.4.0, there is no out-of-the-box way to launch a Spark job from an Oozie workflow, especially in a Kerberos environment. We can use either a Java action or a shell action to launch the Spark job instead. In this article, we will discuss how to use an Oozie shell action to run a Spark job in a Kerberos environment.
Prerequisites:
1. The Spark client is installed on every host where a NodeManager is running. This is because we have no control over which node the shell action will be launched on.
2. Optionally, if the Spark job needs to interact with an HBase cluster, the HBase client needs to be installed on every host as well.
Steps:
1. Create a shell script with the spark-submit command. For example, in script.sh:
/usr/hdp/current/spark-client/bin/spark-submit --keytab keytab --principal ambari-qa-falconJ@FALCONJSECURE.COM --class org.apache.spark.examples.SparkPi --master yarn-client --driver-memory 500m --num-executors 1 --executor-memory 500m --executor-cores 1 spark-examples.jar 3
2. Prepare the Kerberos keytab which will be used by the Spark job. For example, we use the Ambari smoke test user; its keytab is already generated by Ambari in /etc/security/keytabs/smokeuser.headless.keytab.
3. Create the Oozie workflow with a shell action which will execute the script created above. For example, in workflow.xml:
<workflow-app name="WorkFlowForShellAction" xmlns="uri:oozie:workflow:0.4">
<start to="shellAction"/>
<action name="shellAction">
<shell xmlns="uri:oozie:shell-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<exec>script.sh</exec>
<file>/user/oozie/shell/script.sh#script.sh</file>
<file>/user/oozie/shell/smokeuser.headless.keytab#keytab</file>
<file>/user/oozie/shell/spark-examples.jar#spark-examples.jar</file>
<capture-output/>
</shell>
<ok to="end"/>
<error to="killAction"/>
</action>
<kill name="killAction">
<message>"Killed job due to error"</message>
</kill>
<end name="end"/>
</workflow-app>
4. Create the oozie job properties file. For example, in job.properties:
nameNode=hdfs://falconJ1.sec.support.com:8020
jobTracker=falconJ2.sec.support.com:8050
queueName=default
oozie.wf.application.path=${nameNode}/user/oozie/shell
oozie.use.system.libpath=true
5. Upload the following files created above to the Oozie workflow application path in HDFS (in this example: /user/oozie/shell); a sample upload command is sketched after this list:
- workflow.xml
- smokeuser.headless.keytab
- script.sh
- spark uber jar (In this example: /usr/hdp/current/spark-client/lib/spark-examples*.jar)
- Any other configuration files mentioned in the workflow (optional)
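For the upload in step 5, here is a minimal sketch using the HDFS CLI; the local filenames and destination path follow the example above, and it assumes you run it as a user that can read the keytab and write to /user/oozie/shell:

# Create the workflow application directory and push the artifacts into it.
hdfs dfs -mkdir -p /user/oozie/shell
hdfs dfs -put -f workflow.xml script.sh /user/oozie/shell/
hdfs dfs -put -f /etc/security/keytabs/smokeuser.headless.keytab /user/oozie/shell/
# Assumes the glob matches a single jar; it is stored as spark-examples.jar, the name the workflow references.
hdfs dfs -put -f /usr/hdp/current/spark-client/lib/spark-examples*.jar /user/oozie/shell/spark-examples.jar

# Confirm everything landed where the workflow expects it.
hdfs dfs -ls /user/oozie/shell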
6. Execute the Oozie command to run this workflow. For example:
oozie job -oozie http://<oozie-server>:11000/oozie -config job.properties -run
--------------------
End Tech Note
--------------------
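Once submitted, the workflow can be monitored from the same CLI; the job ID below is only a placeholder for the ID printed by the -run command:

# Check the status and logs of the submitted workflow.
oozie job -oozie http://<oozie-server>:11000/oozie -info 0000001-160719000000000-oozie-oozi-W
oozie job -oozie http://<oozie-server>:11000/oozie -log 0000001-160719000000000-oozie-oozi-W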
07-19-2016
09:01 PM
@Anil Mathew Joining the conversation late, but maybe I can shed some light. As far as TDE with Tez is concerned, it works and is supported with HDP 2.4. The caveat is that the intermediate "shuffle" data will not be encrypted if it spills to disk. That data is temporary and short-lived, though, so it may not be an issue depending on your requirements. TDE with Spark should also be supported in HDP 2.5, with the same caveat.
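For context, here is a minimal sketch of setting up a TDE encryption zone on the HDFS side; it assumes a Hadoop KMS (or Ranger KMS) is already configured, and the key and path names are examples only:

# Create an encryption key in the KMS, then turn an empty directory into an encryption zone.
hadoop key create warehouse_key
sudo -u hdfs hdfs dfs -mkdir -p /apps/hive/warehouse_encrypted
sudo -u hdfs hdfs crypto -createZone -keyName warehouse_key -path /apps/hive/warehouse_encrypted

# List existing encryption zones to confirm.
sudo -u hdfs hdfs crypto -listZones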
05-04-2016
07:08 PM
It seems this has been addressed in Atlas 0.7, which will be included in the upcoming release of HDP (2.5+).
04-25-2016
04:12 PM
Atlas 0.6 does not include any other components, so you would have to install them separately. In terms of functionality, all 3 options you indicated for Metadata import should be applicable. Again, I would suggest you download the HDP sandbox since it spares you having to go through the whole installation and configuration process.