Member since: 10-07-2015
Posts: 107
Kudos Received: 73
Solutions: 23
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 2521 | 02-23-2017 04:57 PM
 | 1972 | 12-08-2016 09:55 AM
 | 8857 | 11-24-2016 07:24 PM
 | 3957 | 11-24-2016 02:17 PM
 | 9316 | 11-24-2016 09:50 AM
11-21-2016
11:54 AM
Works for me:

print(s"Spark ${spark.version}")
val df = spark.createDataFrame(Seq((2, 9), (1, 5), (1, 1), (1, 2), (2, 8)))
            .toDF("y", "x")
df.createOrReplaceTempView("test")
spark.sql("select CASE WHEN y = 2 THEN 'A' ELSE 'B' END AS flag, x from test").show

Returns:

Spark 2.0.0
df: org.apache.spark.sql.DataFrame = [y: int, x: int]
+----+---+
|flag| x|
+----+---+
| A| 9|
| B| 5|
| B| 1|
| B| 2|
| A| 8|
+----+---+
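The same flag can also be derived with the DataFrame API instead of SQL; a minimal sketch in PySpark, assuming an equivalent DataFrame with integer columns y and x:

from pyspark.sql import functions as F

# equivalent of the Scala DataFrame above
df = spark.createDataFrame([(2, 9), (1, 5), (1, 1), (1, 2), (2, 8)], ["y", "x"])

# when/otherwise mirrors the SQL CASE WHEN
df.select(F.when(F.col("y") == 2, "A").otherwise("B").alias("flag"), "x").show()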
11-07-2016
12:30 PM
"/opt/cloudera/parcels/CDH-5.7..." doesn't sound like HDP 2.5. Do you have a conflicting install of CDH and HDP?
10-12-2016
07:47 AM
1 Kudo
You might want to use "com.databricks:spark-csv_2.10:1.5.0", e.g.

pyspark --packages "com.databricks:spark-csv_2.10:1.5.0"

and then use the following (the csv file is in the folder /tmp/data2):

from pyspark.sql.types import StructType, StructField, DoubleType, StringType
schema = StructType([
StructField("IP", StringType()),
StructField("Time", StringType()),
StructField("Request_Type", StringType()),
StructField("Response_Code", StringType()),
StructField("City", StringType()),
StructField("Country", StringType()),
StructField("Isocode", StringType()),
StructField("Latitude", DoubleType()),
StructField("Longitude", DoubleType())
])
logs_df = sqlContext.read\
.format("com.databricks.spark.csv")\
.schema(schema)\
.option("header", "false")\
.option("delimiter", "|")\
.load("/tmp/data2")
logs_df.show()

Result:

+------------+--------+------------+-------------+------+--------------+-------+--------+---------+
| IP| Time|Request_Type|Response_Code| City| Country|Isocode|Latitude|Longitude|
+------------+--------+------------+-------------+------+--------------+-------+--------+---------+
|192.168.1.19|13:23:56| GET| 200|London|United Kingdom| UK| 51.5074| 0.1278|
|192.168.5.23|09:45:13| POST| 404|Munich| Germany| DE| 48.1351| 11.582|
+------------+--------+------------+-------------+------+--------------+-------+--------+---------+
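Once the explicit schema is applied, the typed columns can be used directly; a small follow-up sketch (same column names as above, values are just the two sample rows):

# Latitude/Longitude are real doubles thanks to the schema, so numeric filters need no cast
logs_df.filter(logs_df.Latitude > 50.0).select("IP", "City", "Latitude").show()

# or register the DataFrame and query it with SQL
logs_df.registerTempTable("logs")
sqlContext.sql("select Country, count(*) as hits from logs group by Country").show()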
08-24-2016
01:45 PM
Replace

outputStream.write(inJson.toString().getBytes(StandardCharsets.UTF_8))

with

def outJson = new JsonBuilder(inJson)
outputStream.write(outJson.toString().getBytes(StandardCharsets.UTF_8))

The referenced StackOverflow sample already imports JsonBuilder.
08-22-2016
01:03 PM
1 Kudo
I'm not sure ReplaceTextWithMapping is powerful enough ... I would anyhow prefer a solution like this (look at the edited answer at the bottom): http://stackoverflow.com/questions/37577453/apache-nifi-executescript-groovy-script-to-replace-json-values-via-a-mapping-fi
You have more control over which values of which keys you actually replace, and you can replace multiple values in one record.
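The script in the linked answer is Groovy, but just to illustrate the idea of a key-aware mapping replacement, here is a minimal plain-Python sketch (mapping, keys and record are made up):

import json

# hypothetical mapping file content: old value -> new value
mapping = {"M": "male", "F": "female"}
# only these keys may be rewritten
keys_to_replace = {"gender", "partner_gender"}

record = json.loads('{"id": 1, "gender": "M", "partner_gender": "F", "group": "M"}')

# replace values only for the selected keys; "group" keeps its "M"
for key in keys_to_replace:
    if key in record and record[key] in mapping:
        record[key] = mapping[record[key]]

print(json.dumps(record))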
08-22-2016
11:55 AM
1 Kudo
Hi Pedro, a Python API for Spark GraphX is still missing, however there is a git project called GraphFrames that provides a higher-level API on top of Spark GraphX. The project claims: "GraphX is to RDDs as GraphFrames are to DataFrames." I haven't worked with it, however a quick test of their samples with Spark 1.6.2 worked. Use pyspark like this:

pyspark --packages graphframes:graphframes:0.2.0-spark1.6-s_2.10

or use Zeppelin and add the dependency to the interpreter configuration. Maybe this library has what you need.
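I haven't gone beyond their samples, but a minimal PySpark sketch along the lines of the GraphFrames quick start (vertex and edge data made up here) looks roughly like this:

from graphframes import GraphFrame

# vertices need an "id" column, edges need "src" and "dst"
v = sqlContext.createDataFrame(
    [("a", "Alice"), ("b", "Bob"), ("c", "Charlie")], ["id", "name"])
e = sqlContext.createDataFrame(
    [("a", "b", "follows"), ("b", "c", "follows")], ["src", "dst", "relationship"])

g = GraphFrame(v, e)
g.inDegrees.show()
results = g.pageRank(resetProbability=0.15, maxIter=10)
results.vertices.select("id", "pagerank").show()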
08-18-2016
02:29 PM
2 Kudos
For example, k-means clustering in a SparkML pipeline with Python requires "numpy" to be installed on every node. Anaconda is a nice way to get the full Python scientific stack installed (including numpy) without caring about the details. However, using Anaconda instead of the operating system's Python means you need to set the PATHs correctly for Spark and Zeppelin. Alternatively, I have just used "apt-get install python-numpy" on all of my Ubuntu 14.04 based HDP nodes, and then numpy is available and k-means works (I guess there are other algorithms that also need numpy). A corresponding package should be available on Red Hat based systems too. I have never installed netlib-java manually. Spark is based on Breeze, which uses netlib, and netlib is already in the Spark assembly jar. So numpy is a must if you want to use SparkML with Python; netlib-java should already be there.
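To make the dependency concrete, a minimal PySpark sketch of such a pipeline (feature columns and k are made up here):

from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

# made-up two-dimensional data
df = sqlContext.createDataFrame(
    [(1.0, 1.1), (1.2, 0.9), (8.0, 8.1), (8.2, 7.9)], ["x1", "x2"])

# assemble the feature columns into a single vector column, then cluster
assembler = VectorAssembler(inputCols=["x1", "x2"], outputCol="features")
kmeans = KMeans(k=2, featuresCol="features", predictionCol="cluster")

model = Pipeline(stages=[assembler, kmeans]).fit(df)
model.transform(df).select("x1", "x2", "cluster").show()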
08-08-2016
07:04 AM
You could also use Pig to import text files, e.g. csv:
1. Create an ORC table in Hive with the right schema.
2. Use something like https://community.hortonworks.com/questions/49818/how-to-load-csv-file-directly-into-hive-orc-table.html#answer-49886
08-05-2016
07:44 AM
There is an issue with the space in front of "EF". Let's use the following (you don't need the "escape" option; it can be used e.g. to get quotes into the dataframe if needed):

val df = sqlContext.read.format("com.databricks.spark.csv")
            .option("header", "true")
            .option("delimiter", "|")
            .load("/tmp/test.csv")
df.show()

With space in front of "EF":

+----+----+----+-----+
|Col1|Col2|Col3| Col4|
+----+----+----+-----+
| AB| CD| DE| "EF"|
+----+----+----+-----+

Without space in front of "EF":

+----+----+----+----+
|Col1|Col2|Col3|Col4|
+----+----+----+----+
| AB| CD| DE| EF|
+----+----+----+----+

Can you remove the space before loading the csv into Spark?