Member since: 09-29-2015
Posts: 122
Kudos Received: 159
Solutions: 26

My Accepted Solutions
Title | Views | Posted
---|---|---
 | 6718 | 11-12-2016 12:32 AM
 | 1925 | 10-05-2016 08:08 PM
 | 2643 | 08-02-2016 11:29 PM
 | 23359 | 06-24-2016 11:46 PM
 | 2064 | 05-25-2016 11:12 PM
12-08-2015
05:48 AM
4 Kudos
TL;DR: SparkSQL today provides table-level access control; it does not provide Hive's column-level access control. Spark reads from both the Hive metastore and from ORC (or Parquet) files directly in HDFS. For ORC files, security at the HDFS level still applies, so read/write is controlled by HDFS ACLs or Ranger. Right now Spark does not propagate the end-user identity to the Hive metastore, and we are working in the community to enhance this.
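To make the ORC point concrete, here is a minimal sketch (Spark 1.5-era API, assuming the spark-shell's sc; the table name and warehouse path are hypothetical). Either way, the ORC files are read straight from HDFS, so HDFS permissions or Ranger policies decide whether the read succeeds:

```scala
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)  // sc: the spark-shell's SparkContext

// Schema is looked up in the Hive metastore, but the ORC files themselves
// are read directly from HDFS, so HDFS ACLs / Ranger policies gate the read.
val viaMetastore = hiveContext.table("sales")  // "sales" is a hypothetical table

// The same ORC data read purely by path (hypothetical), bypassing the metastore.
val viaPath = hiveContext.read.format("orc").load("/apps/hive/warehouse/sales")
```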
12-08-2015
05:44 AM
4 Kudos
Some of the common use cases for Spark:

- Interactive SQL: small datasets and relatively simple SQL, where the response is expected in under a second. In this scenario the table is usually cached in memory.
- ETL: use Spark for traditional ETL where MR was used; any use case where you used MR in the past is now a good fit for Spark.
- Streaming: Spark Streaming can ingest data from a variety of sources, but most commonly it is used in conjunction with Kafka. Since Kafka can provide message replay, putting it in front of Spark (or Storm) helps reliability. Spark is a good fit where streaming is one part of the overall data processing platform; if you need to build a specialized platform focused on streaming with millisecond latency, consider Storm, otherwise Spark is a good fit.
- Predictive analytics: Spark makes data science and machine learning easier. With its built-in libraries in MLlib and the ML Pipeline API to model workflows, predictive analytics is much easier.
- Any combination of the above in a single application.

To make it more concrete, here are some examples from actual customers:

- Predict at-risk shopping carts in an online session and offer coupons or other incentives to increase sales.
- Process insurance claims coming from a traditional data pipeline, including the textual claims information, using Spark Core; then use Spark for feature engineering with built-in feature extraction facilities like TF-IDF and Word2Vec to predict insurance payment accuracy and flag certain claims for closer inspection (a sketch follows below).
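As a minimal sketch of that last feature-engineering step, using the ML Pipeline API mentioned above (Spark 1.5-era API; the column names and the two toy claims are invented, and sqlContext is the one provided by spark-shell):

```scala
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}

// Toy claims data; in practice this would come from the claims pipeline.
val claims = sqlContext.createDataFrame(Seq(
  (0L, "water damage in finished basement"),
  (1L, "rear end collision on highway exit")
)).toDF("id", "claimText")

// TF-IDF over the free-text claim descriptions.
val tokenizer  = new Tokenizer().setInputCol("claimText").setOutputCol("words")
val hashingTF  = new HashingTF().setInputCol("words").setOutputCol("rawFeatures")
val idf        = new IDF().setInputCol("rawFeatures").setOutputCol("features")

val words      = tokenizer.transform(claims)
val featurized = hashingTF.transform(words)
val features   = idf.fit(featurized).transform(featurized)
// "features" can now feed a classifier that flags claims for closer inspection.
```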
12-08-2015
05:23 AM
1 Kudo
There are many differences between the two.

Spark:
- Spark provides an API, an execution engine, and packages (SQL, ML, Graph) on top of the core Spark API.
- Spark is application-developer facing.
- Spark's abstractions are RDD, DataFrame, and now Dataset (with Spark 1.6); see the sketch below.

Tez:
- Tez is the execution engine for Hive and Pig.

Bottom line: if you are asking about the difference between Spark and Tez, consider using Spark.
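A quick illustration of those three abstractions (Spark 1.6-era API, assuming the spark-shell's sc and sqlContext):

```scala
import sqlContext.implicits._

val rdd = sc.parallelize(Seq(("a", 1), ("b", 2)))  // RDD: low-level, untyped operations on JVM objects
val df  = rdd.toDF("key", "count")                 // DataFrame: schema plus the Catalyst optimizer
val ds  = df.as[(String, Int)]                     // Dataset (Spark 1.6): typed view over the same data
```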
12-08-2015
05:20 AM
2 Kudos
1. Authentication: Spark supports running in a Kerberized cluster; only Spark on YARN supports security (Kerberos). From the command line, run kinit before submitting Spark jobs. As for LDAP authentication: there is no authentication on the Spark UI out of the box, but it supports hooking in a servlet filter for LDAP (see the sketch below).
2. Authorization: Spark reads data from HDFS, ORC files, etc., and access control at the HDFS level still applies; for example, HDFS file permissions (and Ranger integration) apply to Spark jobs. Spark also submits jobs to a YARN queue, so YARN queue ACLs (and Ranger integration) apply to Spark jobs.
3. Audit: Spark jobs run on YARN and read from HDFS, HBase, etc., so the audit logs for YARN and HDFS access are still applicable, and you can use Ranger to view them.
4. Wire encryption: Spark has some coverage, but not all channels are covered.
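On the UI point above, a minimal sketch of hooking a filter into the Spark UI via the spark.ui.filters setting. The filter class com.example.LdapAuthFilter and its ldapUrl parameter are hypothetical; you would supply your own LDAP-backed javax.servlet.Filter implementation:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// The filter class and its parameter are hypothetical; your own LDAP-backed
// servlet filter must be on the driver's classpath.
val conf = new SparkConf()
  .setAppName("secured-app")
  // Comma-separated list of servlet filter classes applied to the Spark UI.
  .set("spark.ui.filters", "com.example.LdapAuthFilter")
  // Filter parameters are passed as spark.<filter class>.param.<param name>.
  .set("spark.com.example.LdapAuthFilter.param.ldapUrl", "ldap://ldap.example.com:389")

val sc = new SparkContext(conf)  // the UI served by this context now runs the filter
```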
12-07-2015
09:27 PM
1 Kudo
We have started a project, Magellan, to bring geospatial analytics to Spark.
- Details: http://hortonworks.com/blog/magellan-geospatial-analytics-in-spark/
- More from this meetup: http://hortonworks.com/events/magellan-geospatial-analytics-on-spark-and-the-mobile-majority-use-cases/
- Spark Summit EU talk: https://spark-summit.org/eu-2015/events/magellan-geospatial-analytics-on-spark/
- Magellan slides: http://www.slideshare.net/SparkSummit/magellen-geospatial-analytics-on-spark-by-ram-sriharsha

HTH, Vinay
12-03-2015
07:59 PM
Also see:
- Practical Data Science with Apache Spark & Apache Zeppelin: https://hadoopsummit.uservoice.com/forums/332055-data-science-applications-for-hadoop/suggestions/10847007-practical-data-science-with-apache-spark-apache
- Running Spark in Production (covers Spark performance tuning, security, and Spark on YARN): https://hadoopsummit.uservoice.com/forums/332061-hadoop-governance-security-deployment-and-operat/suggestions/10848240-running-spark-in-production

Please consider voting if you want to hear more on these topics.
11-23-2015
05:02 PM
There may have been an earlier instance of Spark where the 1.5.1 TP was installed. Check spark-defaults.conf: it should NOT have any YARN ATS service-related properties enabled, since the ATS integration code is not ported to this TP. Here is an example spark-defaults.conf that works:

    # Default system properties included when running spark-submit.
    # This is useful for setting default environmental settings.
    # Example:
    # spark.master                     spark://master:7077
    # spark.eventLog.enabled           true
    # spark.eventLog.dir               hdfs://namenode:8021/directory
    # spark.serializer                 org.apache.spark.serializer.KryoSerializer
    # spark.driver.memory              5g
    # spark.executor.extraJavaOptions  -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"
    #spark.yarn.services               org.apache.spark.deploy.yarn.history.YarnHistoryService
    #spark.history.provider            org.apache.spark.deploy.yarn.history.YarnHistoryProvider
    ## Make sure the host and port match the node where your YARN history server is running
    #spark.yarn.historyServer.address  localhost:18080
    spark.driver.extraJavaOptions      -Dhdp.version=2.3.2.1-12
    spark.yarn.am.extraJavaOptions     -Dhdp.version=2.3.2.1-12
11-19-2015
03:24 AM
You can just write the DataFrame out as ORC and the underlying directory will be created; a quick sketch follows. LMK if this doesn't work.
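For example (the output path is hypothetical):

```scala
// Writes ORC files under /tmp/orc_out, creating the directory if it is absent.
df.write.format("orc").save("/tmp/orc_out")
```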
11-19-2015
03:07 AM
10 Kudos
df.write.format("orc") will get you there. See:
http://hortonworks.com/blog/bringing-orc-support-into-apache-spark/ or http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.2/bk_spark-guide/content/ch_orc-spark.html
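Spelled out as a hedged round trip (path invented; note that in Spark 1.x ORC needs a HiveContext, which is what the spark-shell's sqlContext is on HDP):

```scala
// Write the DataFrame out as ORC, then read it back.
df.write.format("orc").save("/tmp/people_orc")
val people = sqlContext.read.format("orc").load("/tmp/people_orc")
people.printSchema()
```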
11-18-2015
10:53 PM
1 Kudo
For Spark, there is no explicit "move Spark History Server" feature in Ambari. Just use the Ambari REST API to delete the Spark History Server component from its current host and re-install it on a new node.