Member since: 09-24-2015
Posts: 98
Kudos Received: 76
Solutions: 18
My Accepted Solutions
Title | Views | Posted |
---|---|---|
| 2815 | 08-29-2016 04:42 PM |
| 5631 | 08-09-2016 08:43 PM |
| 1715 | 07-19-2016 04:08 PM |
| 2423 | 07-07-2016 04:05 PM |
| 7313 | 06-29-2016 08:25 PM |
07-19-2016
04:08 PM
1 Kudo
Spark has a GraphX component library (soon to be upgraded to GraphFrames) which can be used to model graph type relationships. These relationships are modeled by combining a vertex table (vertices) with an edge table (edges). Read here for more info: http://spark.apache.org/docs/latest/graphx-programming-guide.html#example-property-graph
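A minimal sketch of the vertex/edge pattern, assuming the GraphFrames package has been installed (e.g. via the --packages option of spark-submit) and using made-up example data:
from graphframes import GraphFrame

# Vertex DataFrame: must contain an "id" column
vertices = sqlContext.createDataFrame(
    [("a", "Alice"), ("b", "Bob"), ("c", "Carol")],
    ["id", "name"])

# Edge DataFrame: must contain "src" and "dst" columns
edges = sqlContext.createDataFrame(
    [("a", "b", "follows"), ("b", "c", "follows")],
    ["src", "dst", "relationship"])

g = GraphFrame(vertices, edges)
g.inDegrees.show()    # number of incoming edges per vertex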
07-12-2016
08:36 PM
Please note that there are also convenience functions provided in pyspark.sql.functions, such as dayofmonth:
pyspark.sql.functions.dayofmonth(col)
Extracts the day of the month of a given date as an integer. Example:
>>> from pyspark.sql.functions import dayofmonth
>>> df = sqlContext.createDataFrame([('2015-04-08',)], ['a'])
>>> df.select(dayofmonth('a').alias('day')).collect()
[Row(day=8)]
07-11-2016
06:35 PM
@xrcs blue Looks like you are using the Spark Python API. The pyspark documentation for join() says:
on – a string for the join column name, a list of column names, a join expression (Column), or a list of Columns. If on is a string or a list of strings indicating the name of the join column(s), the column(s) must exist on both sides, and this performs an equi-join.
Therefore, do the join columns exist on both sides of the tables being joined? Also, wondering if you can encode the "condition" separately and then pass it to the join() method, like this:
>>> cond = [df.name == df3.name, df.age == df3.age]
>>> df.join(df3, cond, 'outer')
07-07-2016
04:05 PM
@R Pul Yes, that is a common problem. The first thing I would try, at the Spark configuration level, is enabling Dynamic Resource Allocation. Here is a description (from the link below):
"Spark 1.2 introduces the ability to dynamically scale the set of cluster resources allocated to your application up and down based on the workload. This means that your application may give resources back to the cluster if they are no longer used and request them again later when there is demand. This feature is particularly useful if multiple applications share resources in your Spark cluster. If a subset of the resources allocated to an application becomes idle, it can be returned to the cluster's pool of resources and acquired by other applications. In Spark, dynamic resource allocation is performed on the granularity of the executor and can be enabled through spark.dynamicAllocation.enabled."
And in particular, note the Remove Policy: "The policy for removing executors is much simpler. A Spark application removes an executor when it has been idle for more than spark.dynamicAllocation.executorIdleTimeout seconds."
Web page:
https://spark.apache.org/docs/1.2.0/job-scheduling.html
Also, check out the paragraph entitled "Graceful Decommission of Executors" for more information.
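As a sketch, the relevant properties can be set in spark-defaults.conf (or passed with --conf to spark-submit); the property names are from the Spark docs, but the values below are only illustrative:
spark.dynamicAllocation.enabled              true
spark.shuffle.service.enabled                true    # external shuffle service is required for dynamic allocation
spark.dynamicAllocation.executorIdleTimeout  60s
spark.dynamicAllocation.minExecutors         1
spark.dynamicAllocation.maxExecutors         10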
07-01-2016
02:27 PM
@Zach Kirsch The problem is more likely a mismatch between Spark's memory request (driver memory + executor memory) and YARN's container sizing configuration. The YARN settings determine the minimum/maximum container sizes and should be based on available physical memory, number of nodes, etc. As a rule of thumb, try making the minimum YARN container size about 1.5 times the requested driver/executor memory (in this case, 1.5 GB).
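For illustration only (a sketch, not tuned values), the relevant knobs are the YARN scheduler allocation sizes and the memory requested from spark-submit; the numbers below just reflect the 1.5x rule of thumb for a 1 GB request:
# yarn-site.xml (shown as name/value pairs)
yarn.scheduler.minimum-allocation-mb = 1536    # ~1.5x the 1 GB requested below
yarn.scheduler.maximum-allocation-mb = 8192

# spark-submit memory request
spark-submit --master yarn --driver-memory 1g --executor-memory 1g ...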
06-29-2016
08:25 PM
1 Kudo
The methods you mention will not alter the sort order for a join operation, since data is always shuffled for a join. For ways to enforce sort order, you can read this post on HCC: https://community.hortonworks.com/questions/42464/spark-dataframes-how-can-i-change-the-order-of-col.html
To answer your questions about coalesce() and repartition(): both are used to change the number of partitions of an RDD. The repartition() method can increase or decrease the number of partitions and performs a full shuffle across nodes, meaning data stored on one node can be moved to another; this makes it expensive for large RDDs. The coalesce() method can only decrease the number of partitions and avoids a full shuffle. That makes it cheaper than repartition(), but it may result in unevenly sized partitions, since data is not moved across nodes.
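A minimal pyspark sketch of the difference (the RDD contents here are arbitrary):
rdd = sc.parallelize(range(100), 8)       # start with 8 partitions
print(rdd.getNumPartitions())             # 8

more = rdd.repartition(16)                # full shuffle; can increase or decrease
fewer = rdd.coalesce(4)                   # avoids a full shuffle; can only decrease
print(more.getNumPartitions())            # 16
print(fewer.getNumPartitions())           # 4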
06-28-2016
09:46 PM
You can designate either way by setting the --master and --deploy-mode arguments correctly. With --master yarn, the Spark executors run on the cluster; with --master local[*], the executors run on the local machine. The location of the Spark driver is then determined by the deploy mode: --deploy-mode cluster runs the driver on the cluster, while --deploy-mode client runs the driver on the client (the VM where the job is launched). More info here: http://spark.apache.org/docs/latest/submitting-applications.html
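For example (the application file name my_app.py is just a placeholder), the two YARN variants look like this:
# Driver and executors both run on the YARN cluster
spark-submit --master yarn --deploy-mode cluster my_app.py

# Executors run on YARN, driver stays on the launching VM
spark-submit --master yarn --deploy-mode client my_app.py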
06-28-2016
07:03 PM
You probably need to install the spark-client on your VM, which will include all the proper jar files and binaries to connect to YARN. There is also a chance that the version of Spark used by Titan DB was built specifically without YARN dependencies (to avoid duplicates). You can always rebuild your local Spark installation with YARN dependencies, using the instructions here:
http://spark.apache.org/docs/latest/building-spark.html
For instance, here is a sample build command using Maven:
build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package
06-27-2016
07:37 PM
The referenced JIRA above is now resolved. I have successfully tested the new version of the Hive ODBC Driver on Mac OS X version 10.11 (El Capitan). However, please note that you must install the new Hive ODBC driver version 2.1.2, as shown through the iODBC Administration tool. Please also note that the location of the driver file has changed. Here is the new odbcinst.ini file (stored in ~/.odbcinst.ini), showing the old driver location commented out and the new driver location below it:
[ODBC Drivers]
Hortonworks Hive ODBC Driver=Installed
[Hortonworks Hive ODBC Driver]
Description=Hortonworks Hive ODBC Driver
; old driver location
; Driver=/usr/lib/hive/lib/native/universal/libhortonworkshiveodbc.dylib
; new driver location below
Driver=/opt/hortonworks/hiveodbc/lib/universal/libhortonworkshiveodbc.dylib
06-27-2016
05:07 PM
@Sri Bandaru Okay, so now I'm wondering if you should include the Spark assembly jar; that is where the referenced class lives. Can you try adding this reference to your command line (assuming your current directory is the spark-client directory, or $SPARK_HOME for your installation):
--jars lib/spark-assembly-1.6.0.2.4.0.0-169-hadoop2.7.1.2.4.0.0-169.jar
Note: If running on HDP, you can use the soft link to this file named "spark-hdp-assembly.jar"
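As a sketch of how that flag fits into a full invocation (the HDP spark-client path and the application script name are assumptions for the example):
cd /usr/hdp/current/spark-client          # assumed HDP spark-client location
./bin/spark-submit --master yarn --deploy-mode client \
    --jars lib/spark-hdp-assembly.jar \
    my_app.py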