Member since: 09-29-2015
Posts: 122
Kudos Received: 159
Solutions: 26

My Accepted Solutions
Title | Views | Posted
---|---|---
 | 6718 | 11-12-2016 12:32 AM
 | 1925 | 10-05-2016 08:08 PM
 | 2643 | 08-02-2016 11:29 PM
 | 23359 | 06-24-2016 11:46 PM
 | 2064 | 05-25-2016 11:12 PM
12-08-2015
05:48 AM
4 Kudos
TL;DR: SparkSQL today provides table-level access control; it does not provide Hive's column-level access control. Spark reads from both the Hive metastore and from ORC (or Parquet) files directly in HDFS. For ORC files, security at the HDFS level still applies, so read/write is controlled by HDFS ACLs or Ranger. Right now Spark does not propagate the end-user identity to the Hive metastore, and we are working in the community to enhance this.
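To make the ORC point concrete, here is a minimal sketch (Spark 1.5-era API, assuming the spark-shell's sc; the table name and warehouse path are hypothetical). Either way, the ORC files are read straight from HDFS, so HDFS permissions or Ranger policies decide whether the read succeeds:

```scala
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)  // sc: the spark-shell's SparkContext

// Schema is looked up in the Hive metastore, but the ORC files themselves
// are read directly from HDFS, so HDFS ACLs / Ranger policies gate the read.
val viaMetastore = hiveContext.table("sales")  // "sales" is a hypothetical table

// The same ORC data read purely by path (hypothetical), bypassing the metastore.
val viaPath = hiveContext.read.format("orc").load("/apps/hive/warehouse/sales")
```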
12-08-2015
05:44 AM
4 Kudos
Some of the common use cases for Spark:

- Interactive SQL: small datasets and relatively simple SQL, where the response is expected in under a second. In this scenario the table is usually cached in memory.
- ETL: use Spark for traditional ETL where MR was used; any use case where you used MR in the past is now a good fit for Spark.
- Streaming: Spark Streaming can ingest data from a variety of sources, but most commonly it is used in conjunction with Kafka. Since Kafka can provide message replay, putting it in front of Spark (or Storm) helps reliability. Spark is a good fit where streaming is one part of the overall data processing platform; if you need to build a specialized platform focused on streaming with millisecond latency, consider Storm, otherwise Spark is a good fit.
- Predictive analytics: Spark makes data science and machine learning easier. With its built-in libraries in MLlib and the ML Pipeline API to model workflows, predictive analytics is much easier.
- Any combination of the above in a single application.

To make it more concrete, here are some examples from actual customers:

- Predict at-risk shopping carts in an online session and offer coupons or other incentives to increase sales.
- Process insurance claims coming from a traditional data pipeline, including the textual claims information, using Spark Core; then use Spark for feature engineering with built-in feature extraction facilities like TF-IDF and Word2Vec to predict insurance payment accuracy and flag certain claims for closer inspection (a sketch follows below).
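As a minimal sketch of that last feature-engineering step, using the ML Pipeline API mentioned above (Spark 1.5-era API; the column names and the two toy claims are invented, and sqlContext is the one provided by spark-shell):

```scala
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}

// Toy claims data; in practice this would come from the claims pipeline.
val claims = sqlContext.createDataFrame(Seq(
  (0L, "water damage in finished basement"),
  (1L, "rear end collision on highway exit")
)).toDF("id", "claimText")

// TF-IDF over the free-text claim descriptions.
val tokenizer  = new Tokenizer().setInputCol("claimText").setOutputCol("words")
val hashingTF  = new HashingTF().setInputCol("words").setOutputCol("rawFeatures")
val idf        = new IDF().setInputCol("rawFeatures").setOutputCol("features")

val words      = tokenizer.transform(claims)
val featurized = hashingTF.transform(words)
val features   = idf.fit(featurized).transform(featurized)
// "features" can now feed a classifier that flags claims for closer inspection.
```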
12-08-2015
05:23 AM
1 Kudo
There are many differences between the two.

Spark:
- Spark provides an API, an execution engine, and packages (SQL, ML, Graph) on top of the core Spark API.
- Spark is application-developer facing.
- Spark's abstractions are RDD, DataFrame, and now Dataset (with Spark 1.6); see the sketch below.

Tez:
- Tez is the execution engine for Hive and Pig.

Bottom line: if you are asking about the difference between Spark and Tez, consider using Spark.
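A quick illustration of those three abstractions (Spark 1.6-era API, assuming the spark-shell's sc and sqlContext):

```scala
import sqlContext.implicits._

val rdd = sc.parallelize(Seq(("a", 1), ("b", 2)))  // RDD: low-level, untyped operations on JVM objects
val df  = rdd.toDF("key", "count")                 // DataFrame: schema plus the Catalyst optimizer
val ds  = df.as[(String, Int)]                     // Dataset (Spark 1.6): typed view over the same data
```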
12-08-2015
05:20 AM
2 Kudos
1. Authentication: Spark supports running in a Kerberized cluster; only Spark on YARN supports security (Kerberos). From the command line, run kinit before submitting Spark jobs. As for LDAP authentication: there is no authentication on the Spark UI out of the box, but it supports hooking in a servlet filter for LDAP (see the sketch below).
2. Authorization: Spark reads data from HDFS, ORC files, etc., and access control at the HDFS level still applies; for example, HDFS file permissions (and Ranger integration) apply to Spark jobs. Spark also submits jobs to a YARN queue, so YARN queue ACLs (and Ranger integration) apply to Spark jobs.
3. Audit: Spark jobs run on YARN and read from HDFS, HBase, etc., so the audit logs for YARN and HDFS access are still applicable, and you can use Ranger to view them.
4. Wire encryption: Spark has some coverage, but not all channels are covered.
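On the UI point above, a minimal sketch of hooking a filter into the Spark UI via the spark.ui.filters setting. The filter class com.example.LdapAuthFilter and its ldapUrl parameter are hypothetical; you would supply your own LDAP-backed javax.servlet.Filter implementation:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// The filter class and its parameter are hypothetical; your own LDAP-backed
// servlet filter must be on the driver's classpath.
val conf = new SparkConf()
  .setAppName("secured-app")
  // Comma-separated list of servlet filter classes applied to the Spark UI.
  .set("spark.ui.filters", "com.example.LdapAuthFilter")
  // Filter parameters are passed as spark.<filter class>.param.<param name>.
  .set("spark.com.example.LdapAuthFilter.param.ldapUrl", "ldap://ldap.example.com:389")

val sc = new SparkContext(conf)  // the UI served by this context now runs the filter
```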
12-07-2015
09:27 PM
1 Kudo
We have started a project, Magellan, to bring geospatial analytics to Spark.
- Details: http://hortonworks.com/blog/magellan-geospatial-analytics-in-spark/
- More from this meetup: http://hortonworks.com/events/magellan-geospatial-analytics-on-spark-and-the-mobile-majority-use-cases/
- Spark Summit EU talk: https://spark-summit.org/eu-2015/events/magellan-geospatial-analytics-on-spark/
- Magellan slides: http://www.slideshare.net/SparkSummit/magellen-geospatial-analytics-on-spark-by-ram-sriharsha

HTH, Vinay
12-03-2015
07:59 PM
Also see:
- Practical Data Science with Apache Spark & Apache Zeppelin: https://hadoopsummit.uservoice.com/forums/332055-data-science-applications-for-hadoop/suggestions/10847007-practical-data-science-with-apache-spark-apache
- Running Spark in Production (covers Spark performance tuning, security, and Spark on YARN): https://hadoopsummit.uservoice.com/forums/332061-hadoop-governance-security-deployment-and-operat/suggestions/10848240-running-spark-in-production

Please consider voting if you want to hear more on these topics.
11-23-2015
05:02 PM
There may have been an earlier instance of Spark where the 1.5.1 TP was installed. Check spark-defaults.conf: it should NOT have any YARN ATS service-related properties enabled, since the ATS integration code is not ported to this TP. Here is an example spark-defaults.conf that works:

    # Default system properties included when running spark-submit.
    # This is useful for setting default environmental settings.
    # Example:
    # spark.master                     spark://master:7077
    # spark.eventLog.enabled           true
    # spark.eventLog.dir               hdfs://namenode:8021/directory
    # spark.serializer                 org.apache.spark.serializer.KryoSerializer
    # spark.driver.memory              5g
    # spark.executor.extraJavaOptions  -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"
    #spark.yarn.services               org.apache.spark.deploy.yarn.history.YarnHistoryService
    #spark.history.provider            org.apache.spark.deploy.yarn.history.YarnHistoryProvider
    ## Make sure the host and port match the node where your YARN history server is running
    #spark.yarn.historyServer.address  localhost:18080
    spark.driver.extraJavaOptions      -Dhdp.version=2.3.2.1-12
    spark.yarn.am.extraJavaOptions     -Dhdp.version=2.3.2.1-12
11-19-2015
03:24 AM
You can just write the DataFrame out as ORC and the underlying directory will be created; a quick sketch follows. LMK if this doesn't work.
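For example (the output path is hypothetical):

```scala
// Writes ORC files under /tmp/orc_out, creating the directory if it is absent.
df.write.format("orc").save("/tmp/orc_out")
```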
11-19-2015
03:07 AM
10 Kudos
df.write.format("orc") will get you there. See:
http://hortonworks.com/blog/bringing-orc-support-into-apache-spark/ or http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.2/bk_spark-guide/content/ch_orc-spark.html
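Spelled out as a hedged round trip (path invented; note that in Spark 1.x ORC needs a HiveContext, which is what the spark-shell's sqlContext is on HDP):

```scala
// Write the DataFrame out as ORC, then read it back.
df.write.format("orc").save("/tmp/people_orc")
val people = sqlContext.read.format("orc").load("/tmp/people_orc")
people.printSchema()
```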
11-18-2015
10:53 PM
1 Kudo
For Spark, there is no explicit "move Spark History Server" feature in Ambari. Just use the Ambari REST API to delete the Spark History Server component from its current host and re-install it on a new node.