Member since: 09-24-2015
Posts: 98
Kudos Received: 76
Solutions: 18
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 2857 | 08-29-2016 04:42 PM |
 | 5698 | 08-09-2016 08:43 PM |
 | 1743 | 07-19-2016 04:08 PM |
 | 2468 | 07-07-2016 04:05 PM |
 | 7415 | 06-29-2016 08:25 PM |
07-19-2016
04:08 PM
1 Kudo
Spark has a GraphX component library (soon to be upgraded to GraphFrames) that can be used to model graph-style relationships. These relationships are modeled by combining a vertex table (vertices) with an edge table (edges). Read here for more info: http://spark.apache.org/docs/latest/graphx-programming-guide.html#example-property-graph
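As a hedged sketch, here is a minimal PySpark example of the vertex/edge pattern using the GraphFrames API mentioned above (it assumes the graphframes package is available and a sqlContext as in a pyspark shell; the sample vertices and edges are made up for illustration):

from graphframes import GraphFrame

# Vertex DataFrame: GraphFrames expects an "id" column
vertices = sqlContext.createDataFrame(
    [("a", "Alice"), ("b", "Bob"), ("c", "Carol")], ["id", "name"])

# Edge DataFrame: GraphFrames expects "src" and "dst" columns
edges = sqlContext.createDataFrame(
    [("a", "b", "follows"), ("b", "c", "follows")], ["src", "dst", "relationship"])

g = GraphFrame(vertices, edges)
g.inDegrees.show()   # number of incoming edges per vertex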
07-12-2016
08:36 PM
Please note that there are also convenience functions provided in pyspark.sql.functions, such as dayofmonth:

pyspark.sql.functions.dayofmonth(col) – extracts the day of the month of a given date as an integer.

Example:
>>> from pyspark.sql.functions import dayofmonth
>>> df = sqlContext.createDataFrame([('2015-04-08',)], ['a'])
>>> df.select(dayofmonth('a').alias('day')).collect()
[Row(day=8)]
07-11-2016
06:35 PM
@xrcs blue Looks like you are using the Spark Python API. The pyspark documentation says for join():

on – a string for the join column name, a list of column names, a join expression (Column), or a list of Columns. If on is a string or a list of strings indicating the name of the join column(s), the column(s) must exist on both sides, and this performs an equi-join.

So, do the columns exist on both sides of the joined tables? Also, wondering if you can encode the condition separately, then pass it to the join() method, like this:

>>> cond = [df.name == df3.name, df.age == df3.age]
>>> df.join(df3, cond, 'outer')
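For completeness, here is a hedged, self-contained version of that pattern (the DataFrames below are hypothetical stand-ins for df and df3, created in a pyspark shell with sqlContext available):

df = sqlContext.createDataFrame([("Alice", 5), ("Bob", 10)], ["name", "age"])
df3 = sqlContext.createDataFrame([("Alice", 5), ("Carol", 7)], ["name", "age"])

cond = [df.name == df3.name, df.age == df3.age]   # join condition built from Column expressions
df.join(df3, cond, 'outer').show()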
07-07-2016
04:05 PM
@R Pul Yes, that is a common problem. The first thing I would try, at the Spark configuration level, is enabling Dynamic Resource Allocation. Here is a description (from the link below):

"Spark 1.2 introduces the ability to dynamically scale the set of cluster resources allocated to your application up and down based on the workload. This means that your application may give resources back to the cluster if they are no longer used and request them again later when there is demand. This feature is particularly useful if multiple applications share resources in your Spark cluster. If a subset of the resources allocated to an application becomes idle, it can be returned to the cluster’s pool of resources and acquired by other applications. In Spark, dynamic resource allocation is performed on the granularity of the executor and can be enabled through spark.dynamicAllocation.enabled."

And in particular, note the Remove Policy: the policy for removing executors is much simpler. A Spark application removes an executor when it has been idle for more than spark.dynamicAllocation.executorIdleTimeout seconds. Web page:
https://spark.apache.org/docs/1.2.0/job-scheduling.html
Also, check out the paragraph entitled "Graceful Decommission of Executors" for more information.
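For reference, here is a minimal, hedged sketch of enabling these settings programmatically in PySpark (the property names are the standard Spark ones quoted above; the app name, executor counts, and timeout value are placeholders, and the external shuffle service must also be set up on the cluster separately):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("dynamic-allocation-example")                    # placeholder app name
        .set("spark.dynamicAllocation.enabled", "true")
        .set("spark.shuffle.service.enabled", "true")                # required companion setting
        .set("spark.dynamicAllocation.executorIdleTimeout", "60s")   # idle executors released after 60s
        .set("spark.dynamicAllocation.minExecutors", "1")
        .set("spark.dynamicAllocation.maxExecutors", "10"))

sc = SparkContext(conf=conf)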
07-01-2016
02:27 PM
@Zach Kirsch The problem is more likely a lack of correlation between Spark's request for RAM (driver memory + executor memory) and YARN's container sizing configuration. YARN settings determine min/max container sizes and should be based on available physical memory, number of nodes, etc. As a rule of thumb, try making the minimum YARN container size 1.5 times the requested driver/executor memory (in this case, 1.5 GB).
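As a purely illustrative sketch of that rule of thumb (the helper function below is hypothetical; the relevant YARN setting to adjust is yarn.scheduler.minimum-allocation-mb, specified in megabytes):

def recommended_min_container_mb(requested_mb, factor=1.5):
    # Hypothetical helper: suggested YARN minimum container size (MB)
    # for a given Spark driver/executor memory request, per the 1.5x rule of thumb.
    return int(requested_mb * factor)

print(recommended_min_container_mb(1024))   # a 1 GB request -> 1536 MB minimum container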
06-29-2016
08:25 PM
1 Kudo
The methods you mention will not alter the sort order for a join operation, since data is always shuffled for a join. For ways to enforce sort order, you can read this post on HCC: https://community.hortonworks.com/questions/42464/spark-dataframes-how-can-i-change-the-order-of-col.html

To answer your questions about coalesce() and repartition(): both are used to change the number of partitions stored by the RDD. The repartition() method can increase or decrease the number of partitions and performs a full shuffle across nodes, meaning data stored on one node can be moved to another; this makes it expensive for large RDDs. The coalesce() method (with its default shuffle=false) can only decrease the number of partitions and avoids a full shuffle. This makes it more efficient than repartition(), but it may result in unevenly sized partitions since no data is moved across nodes.
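A short, hedged pyspark illustration of the difference (assumes a running pyspark shell with sc defined; the partition counts are arbitrary):

rdd = sc.parallelize(range(100), 8)       # start with 8 partitions
print(rdd.getNumPartitions())             # 8

more = rdd.repartition(16)                # full shuffle; can increase or decrease partitions
fewer = rdd.coalesce(2)                   # no full shuffle by default; can only decrease
print(more.getNumPartitions())            # 16
print(fewer.getNumPartitions())           # 2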
06-28-2016
09:46 PM
You can designate either way by setting the --master and --deploy-mode arguments correctly. With --master=yarn, the Spark executors run on the cluster; with --master=local[*], the executors run on the local machine. The Spark driver location is then determined by the deploy mode: --deploy-mode=cluster runs the driver on the cluster, while --deploy-mode=client runs the driver on the client (the VM where the job is launched). More info here: http://spark.apache.org/docs/latest/submitting-applications.html
06-28-2016
07:03 PM
You probably need to install the spark-client on your VM, which will include all the proper jar files and binaries to connect to YARN. There is also a chance that the version of Spark used by Titan DB was built specifically without YARN dependencies (to avoid duplicates). You can always rebuild your local Spark installation with YARN dependencies, using the instructions here:
http://spark.apache.org/docs/latest/building-spark.html
For instance, here is a sample build command using maven:
build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package
06-27-2016
07:37 PM
The referenced JIRA above is now resolved. I have successfully tested the new version of the Hive ODBC Driver on Mac OS X version 10.11 (El Capitan). However, please note that you must install the new Hive ODBC driver version 2.1.2, as shown through the iODBC Administration tool. Please also note that the location of the driver file has changed. Here is the new odbcinst.ini file (stored in ~/.odbcinst.ini), showing the old driver location commented out and the new driver location below it:

[ODBC Drivers]
Hortonworks Hive ODBC Driver=Installed
[Hortonworks Hive ODBC Driver]
Description=Hortonworks Hive ODBC Driver
; old driver location
; Driver=/usr/lib/hive/lib/native/universal/libhortonworkshiveodbc.dylib
; new driver location below
Driver=/opt/hortonworks/hiveodbc/lib/universal/libhortonworkshiveodbc.dylib
06-27-2016
05:07 PM
@Sri Bandaru Okay, so now I'm wondering if you should include the Spark assembly jar; that is where the referenced class lives. Can you try adding this reference to your command line (assuming your current directory is the spark-client directory, or $SPARK_HOME for your installation):

--jars lib/spark-assembly-1.6.0.2.4.0.0-169-hadoop2.7.1.2.4.0.0-169.jar

Note: if running on HDP, you can use the soft link to this file named "spark-hdp-assembly.jar"