Member since: 10-07-2015

107 Posts | 73 Kudos Received | 23 Solutions

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 3223 | 02-23-2017 04:57 PM |
| | 2564 | 12-08-2016 09:55 AM |
| | 10035 | 11-24-2016 07:24 PM |
| | 4851 | 11-24-2016 02:17 PM |
| | 10312 | 11-24-2016 09:50 AM |
11-21-2016 11:54 AM

Works for me:

```scala
print(s"Spark ${spark.version}")
val df = spark.createDataFrame(Seq((2, 9), (1, 5), (1, 1), (1, 2), (2, 8)))
              .toDF("y", "x")
df.createOrReplaceTempView("test")
spark.sql("select CASE WHEN y = 2 THEN 'A' ELSE 'B' END AS flag, x from test").show
```

Returns:

```
Spark 2.0.0
df: org.apache.spark.sql.DataFrame = [y: int, x: int]
+----+---+
|flag|  x|
+----+---+
|   A|  9|
|   B|  5|
|   B|  1|
|   B|  2|
|   A|  8|
+----+---+
```
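For completeness, the same CASE logic can also be expressed through the DataFrame API instead of SQL. A rough pyspark sketch of the idea (illustrative only, recreating the same toy data):

```python
from pyspark.sql import functions as F

# Same toy data as above, recreated in pyspark
df = spark.createDataFrame([(2, 9), (1, 5), (1, 1), (1, 2), (2, 8)], ["y", "x"])

# when/otherwise mirrors CASE WHEN ... THEN ... ELSE ... END
df.select(
    F.when(F.col("y") == 2, "A").otherwise("B").alias("flag"),
    F.col("x")
).show()
```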
						
					
    
	
		
		
11-07-2016 12:30 PM

							 "/opt/cloudera/parcels/CDH-5.7..." doesn't sound like HDP 2.5. Do you have a conflicting install of CDH and HDP? 
						
					
    
	
		
		
10-12-2016 07:47 AM (1 Kudo)

You might want to use "com.databricks:spark-csv_2.10:1.5.0", e.g.

```
pyspark --packages "com.databricks:spark-csv_2.10:1.5.0"
```

and then use (the csv file is in the folder /tmp/data2):

```python
from pyspark.sql.types import StructType, StructField, DoubleType, StringType

schema = StructType([
  StructField("IP",             StringType()),
  StructField("Time",           StringType()),
  StructField("Request_Type",   StringType()),
  StructField("Response_Code",  StringType()),
  StructField("City",           StringType()),
  StructField("Country",        StringType()),
  StructField("Isocode",        StringType()),
  StructField("Latitude",       DoubleType()),
  StructField("Longitude",      DoubleType())
])

logs_df = sqlContext.read\
                    .format("com.databricks.spark.csv")\
                    .schema(schema)\
                    .option("header", "false")\
                    .option("delimiter", "|")\
                    .load("/tmp/data2")
logs_df.show()
```

Result:

```
+------------+--------+------------+-------------+------+--------------+-------+--------+---------+
|          IP|    Time|Request_Type|Response_Code|  City|       Country|Isocode|Latitude|Longitude|
+------------+--------+------------+-------------+------+--------------+-------+--------+---------+
|192.168.1.19|13:23:56|         GET|          200|London|United Kingdom|     UK| 51.5074|   0.1278|
|192.168.5.23|09:45:13|        POST|          404|Munich|       Germany|     DE| 48.1351|   11.582|
+------------+--------+------------+-------------+------+--------------+-------+--------+---------+
```
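As a side note, on Spark 2.x the csv format is built in, so the external spark-csv package is no longer needed. A minimal sketch, assuming the same schema definition and /tmp/data2 folder as above:

```python
# Minimal sketch (Spark 2.x): the built-in csv reader replaces com.databricks.spark.csv
logs_df = spark.read \
    .schema(schema) \
    .option("header", "false") \
    .option("delimiter", "|") \
    .csv("/tmp/data2")
logs_df.show()
```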
						
					
    
	
		
		
08-24-2016 01:45 PM

Replace

```groovy
outputStream.write(inJson.toString().getBytes(StandardCharsets.UTF_8))
```

with

```groovy
// Serialize the modified JSON object back to a string before writing it out
def outJson = new JsonBuilder(inJson)
outputStream.write(outJson.toString().getBytes(StandardCharsets.UTF_8))
```

The referenced sample on StackOverflow already imports groovy.json.JsonBuilder.
						
					
    
	
		
		
08-22-2016 01:03 PM (1 Kudo)

I'm not sure ReplaceWithMapping is powerful enough... I would anyhow prefer a solution like this (look at the edited answer at the bottom):

http://stackoverflow.com/questions/37577453/apache-nifi-executescript-groovy-script-to-replace-json-values-via-a-mapping-fi

You have more control over which values of which keys you actually replace, and you can replace multiple values in one record.
						
					
    
	
		
		
08-22-2016 11:55 AM (1 Kudo)

Hi Pedro, a Python API for GraphX is still missing; however, there is a git project with a higher-level API on top of Spark GraphX called GraphFrames. The project claims: "GraphX is to RDDs as GraphFrames are to DataFrames." I haven't worked with it, but a quick test of their samples with Spark 1.6.2 worked. Use pyspark like this:

```
pyspark --packages graphframes:graphframes:0.2.0-spark1.6-s_2.10
```

or use Zeppelin and add the dependency to the interpreter configuration. Maybe this library has what you need.
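To give a rough idea of what the GraphFrames API looks like from pyspark, here is a minimal sketch adapted from the project's quick-start examples (untested here, Spark 1.6 style with sqlContext):

```python
from graphframes import GraphFrame

# Vertices need an "id" column, edges need "src" and "dst" columns
v = sqlContext.createDataFrame(
    [("a", "Alice"), ("b", "Bob"), ("c", "Charlie")], ["id", "name"])
e = sqlContext.createDataFrame(
    [("a", "b", "friend"), ("b", "c", "follow")], ["src", "dst", "relationship"])

g = GraphFrame(v, e)
g.inDegrees.show()                                # DataFrame of vertex in-degrees
g.edges.filter("relationship = 'follow'").show()  # query edges like any DataFrame
```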
						
					
    
	
		
		
08-18-2016 02:29 PM (2 Kudos)

For example, kmeans clustering in a SparkML pipeline with Python requires "numpy" to be installed on every node. Anaconda is a nice way to get the full Python scientific stack installed (including numpy) without caring about the details. However, using Anaconda instead of the operating system's Python means you need to set the PATHs correctly for Spark and Zeppelin.

Alternatively, I have just used "apt-get install python-numpy" on all of my Ubuntu 14.04 based HDP nodes, and then numpy is available and kmeans works (I guess there are other algorithms that also need numpy). A corresponding package should be available on Red Hat based systems too.

I have never installed netlib-java manually. Spark is based on Breeze, which uses netlib, and netlib is already in the Spark assembly jar.

So numpy is a must if you want to use SparkML with Python; netlib-java should already be there.
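To make the numpy dependence concrete, here is a minimal, illustrative pyspark sketch of the kind of kmeans job referred to above (Spark 1.6 style with sqlContext); it is exactly this kind of pipeline step that fails on workers without numpy installed:

```python
# Minimal sketch of a kmeans step; the Vector types used here are backed by
# numpy on the Python side, hence the install note above.
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler

df = sqlContext.createDataFrame(
    [(0.0, 0.0), (1.0, 1.0), (9.0, 8.0), (8.0, 9.0)], ["a", "b"])
features = VectorAssembler(inputCols=["a", "b"], outputCol="features").transform(df)

model = KMeans(k=2, seed=1).fit(features)
model.transform(features).show()   # adds a "prediction" column with the cluster id
```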
						
					
    
	
		
		
08-08-2016 07:04 AM

You could also use Pig to import text files, e.g. csv:

1. Create an ORC table in Hive with the right schema.
2. Use something like https://community.hortonworks.com/questions/49818/how-to-load-csv-file-directly-into-hive-orc-table.html#answer-49886
						
					
    
	
		
		
08-05-2016 07:44 AM

There is an issue with the space in front of "EF". Let's use the following (you don't need the "escape" option; it can be used to e.g. get quotes into the dataframe if needed):

```scala
val df = sqlContext.read.format("com.databricks.spark.csv")
          .option("header", "true")
          .option("delimiter", "|")
          .load("/tmp/test.csv")
df.show()
```

With the space in front of "EF":

```
+----+----+----+-----+
|Col1|Col2|Col3| Col4|
+----+----+----+-----+
|  AB|  CD|  DE| "EF"|
+----+----+----+-----+
```

Without the space in front of "EF":

```
+----+----+----+----+
|Col1|Col2|Col3|Col4|
+----+----+----+----+
|  AB|  CD|  DE|  EF|
+----+----+----+----+
```

Can you remove the space before loading the csv into Spark?
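If removing the space from the source file is not an option, one workaround is to clean the affected column after loading. A rough pyspark sketch (hypothetical, assuming a DataFrame df loaded the same way, where the quotes then arrive literally in Col4 as shown above):

```python
from pyspark.sql import functions as F

# Strip surrounding whitespace and the literal quote characters from Col4
cleaned = df.withColumn("Col4", F.regexp_replace(F.trim(F.col("Col4")), '"', ''))
cleaned.show()
```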
						
					