Member since: 10-07-2015
Posts: 107
Kudos Received: 73
Solutions: 23
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 2521 | 02-23-2017 04:57 PM
 | 1972 | 12-08-2016 09:55 AM
 | 8857 | 11-24-2016 07:24 PM
 | 3957 | 11-24-2016 02:17 PM
 | 9316 | 11-24-2016 09:50 AM
11-21-2016
11:54 AM
Works for me:

print(s"Spark ${spark.version}")
val df = spark.createDataFrame(Seq((2, 9), (1, 5), (1, 1), (1, 2), (2, 8)))
            .toDF("y", "x")
df.createOrReplaceTempView("test")
spark.sql("select CASE WHEN y = 2 THEN 'A' ELSE 'B' END AS flag, x from test").show

Returns:

Spark 2.0.0
df: org.apache.spark.sql.DataFrame = [y: int, x: int]
+----+---+
|flag| x|
+----+---+
| A| 9|
| B| 5|
| B| 1|
| B| 2|
| A| 8|
+----+---+
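The same flag can also be derived with the DataFrame API instead of SQL; a minimal sketch in PySpark, assuming an equivalent DataFrame with integer columns y and x:

from pyspark.sql import functions as F

# equivalent of the Scala DataFrame above
df = spark.createDataFrame([(2, 9), (1, 5), (1, 1), (1, 2), (2, 8)], ["y", "x"])

# when/otherwise mirrors the SQL CASE WHEN
df.select(F.when(F.col("y") == 2, "A").otherwise("B").alias("flag"), "x").show()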
11-07-2016
12:30 PM
"/opt/cloudera/parcels/CDH-5.7..." doesn't sound like HDP 2.5. Do you have a conflicting install of CDH and HDP?
10-12-2016
07:47 AM
1 Kudo
You might want to use "com.databricks:spark-csv_2.10:1.5.0", e.g.

pyspark --packages "com.databricks:spark-csv_2.10:1.5.0"

and then use the following (the csv file is in the folder /tmp/data2):

from pyspark.sql.types import StructType, StructField, DoubleType, StringType
schema = StructType([
StructField("IP", StringType()),
StructField("Time", StringType()),
StructField("Request_Type", StringType()),
StructField("Response_Code", StringType()),
StructField("City", StringType()),
StructField("Country", StringType()),
StructField("Isocode", StringType()),
StructField("Latitude", DoubleType()),
StructField("Longitude", DoubleType())
])
logs_df = sqlContext.read\
.format("com.databricks.spark.csv")\
.schema(schema)\
.option("header", "false")\
.option("delimiter", "|")\
.load("/tmp/data2")
logs_df.show()

Result:

+------------+--------+------------+-------------+------+--------------+-------+--------+---------+
| IP| Time|Request_Type|Response_Code| City| Country|Isocode|Latitude|Longitude|
+------------+--------+------------+-------------+------+--------------+-------+--------+---------+
|192.168.1.19|13:23:56| GET| 200|London|United Kingdom| UK| 51.5074| 0.1278|
|192.168.5.23|09:45:13| POST| 404|Munich| Germany| DE| 48.1351| 11.582|
+------------+--------+------------+-------------+------+--------------+-------+--------+---------+
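Once the explicit schema is applied, the typed columns can be used directly; a small follow-up sketch (same column names as above, values are just the two sample rows):

# Latitude/Longitude are real doubles thanks to the schema, so numeric filters need no cast
logs_df.filter(logs_df.Latitude > 50.0).select("IP", "City", "Latitude").show()

# or register the DataFrame and query it with SQL
logs_df.registerTempTable("logs")
sqlContext.sql("select Country, count(*) as hits from logs group by Country").show()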
08-24-2016
01:45 PM
Replace

outputStream.write(inJson.toString().getBytes(StandardCharsets.UTF_8))

with

def outJson = new JsonBuilder(inJson)
outputStream.write(outJson.toString().getBytes(StandardCharsets.UTF_8))

The referenced StackOverflow sample already imports JsonBuilder.
08-22-2016
01:03 PM
1 Kudo
I'm not sure ReplaceTextWithMapping is powerful enough ... I would anyhow prefer a solution like this (look at the edited answer at the bottom): http://stackoverflow.com/questions/37577453/apache-nifi-executescript-groovy-script-to-replace-json-values-via-a-mapping-fi
You have more control over which values of which keys you actually replace, and you can replace multiple values in one record.
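The script in the linked answer is Groovy, but just to illustrate the idea of a key-aware mapping replacement, here is a minimal plain-Python sketch (mapping, keys and record are made up):

import json

# hypothetical mapping file content: old value -> new value
mapping = {"M": "male", "F": "female"}
# only these keys may be rewritten
keys_to_replace = {"gender", "partner_gender"}

record = json.loads('{"id": 1, "gender": "M", "partner_gender": "F", "group": "M"}')

# replace values only for the selected keys; "group" keeps its "M"
for key in keys_to_replace:
    if key in record and record[key] in mapping:
        record[key] = mapping[record[key]]

print(json.dumps(record))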
08-22-2016
11:55 AM
1 Kudo
Hi Pedro, a Python API for Spark GraphX is still missing, however there is a git project called GraphFrames that provides a higher-level API on top of Spark GraphX. The project claims: "GraphX is to RDDs as GraphFrames are to DataFrames." I haven't worked with it, however a quick test of their samples with Spark 1.6.2 worked. Use pyspark like this:

pyspark --packages graphframes:graphframes:0.2.0-spark1.6-s_2.10

or use Zeppelin and add the dependency to the interpreter configuration. Maybe this library has what you need.
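I haven't gone beyond their samples, but a minimal PySpark sketch along the lines of the GraphFrames quick start (vertex and edge data made up here) looks roughly like this:

from graphframes import GraphFrame

# vertices need an "id" column, edges need "src" and "dst"
v = sqlContext.createDataFrame(
    [("a", "Alice"), ("b", "Bob"), ("c", "Charlie")], ["id", "name"])
e = sqlContext.createDataFrame(
    [("a", "b", "follows"), ("b", "c", "follows")], ["src", "dst", "relationship"])

g = GraphFrame(v, e)
g.inDegrees.show()
results = g.pageRank(resetProbability=0.15, maxIter=10)
results.vertices.select("id", "pagerank").show()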
08-18-2016
02:29 PM
2 Kudos
For example, k-means clustering in a SparkML pipeline with Python requires "numpy" to be installed on every node. Anaconda is a nice way to get the full Python scientific stack installed (including numpy) without caring about the details. However, using Anaconda instead of the operating system's Python means you need to set the PATHs correctly for Spark and Zeppelin. Alternatively, I have just used "apt-get install python-numpy" on all of my Ubuntu 14.04 based HDP nodes, and then numpy is available and k-means works (I guess there are other algorithms that also need numpy). A corresponding package should be available on Red Hat based systems too. I have never installed netlib-java manually. Spark is based on Breeze, which uses netlib, and netlib is already in the Spark assembly jar. So numpy is a must if you want to use SparkML with Python; netlib-java should already be there.
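To make the dependency concrete, a minimal PySpark sketch of such a pipeline (feature columns and k are made up here):

from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

# made-up two-dimensional data
df = sqlContext.createDataFrame(
    [(1.0, 1.1), (1.2, 0.9), (8.0, 8.1), (8.2, 7.9)], ["x1", "x2"])

# assemble the feature columns into a single vector column, then cluster
assembler = VectorAssembler(inputCols=["x1", "x2"], outputCol="features")
kmeans = KMeans(k=2, featuresCol="features", predictionCol="cluster")

model = Pipeline(stages=[assembler, kmeans]).fit(df)
model.transform(df).select("x1", "x2", "cluster").show()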
08-08-2016
07:04 AM
You could also use Pig to import text files, e.g. csv:
1. Create an ORC table in Hive with the right schema.
2. Use something like https://community.hortonworks.com/questions/49818/how-to-load-csv-file-directly-into-hive-orc-table.html#answer-49886
08-05-2016
07:44 AM
There is an issue with the space in front of "EF". Let's use the following (you don't need the "escape" option; it can be used e.g. to get quotes into the dataframe if needed):

val df = sqlContext.read.format("com.databricks.spark.csv")
            .option("header", "true")
            .option("delimiter", "|")
            .load("/tmp/test.csv")
df.show()

With space in front of "EF":

+----+----+----+-----+
|Col1|Col2|Col3| Col4|
+----+----+----+-----+
| AB| CD| DE| "EF"|
+----+----+----+-----+

Without space in front of "EF":

+----+----+----+----+
|Col1|Col2|Col3|Col4|
+----+----+----+----+
| AB| CD| DE| EF|
+----+----+----+----+

Can you remove the space before loading the csv into Spark?