Member since: 10-07-2015
Posts: 107
Kudos Received: 73
Solutions: 23
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 2704 | 02-23-2017 04:57 PM
 | 2094 | 12-08-2016 09:55 AM
 | 9235 | 11-24-2016 07:24 PM
 | 4123 | 11-24-2016 02:17 PM
 | 9607 | 11-24-2016 09:50 AM
12-20-2019
01:15 PM
I have the following issue with this setup: I define a Livy service in a Knox topology with an authentication provider enabled. When I request a Livy session over the Knox URL, Knox requests the Livy session with doAs=myuser. So far so good: the Livy session is started with owner=knox and proxyUser=myuser. The problem appears when we post to the Livy statements API over the Knox URL. If we use the Knox URL for posting to the running Livy session, Knox again adds doAs=myuser, but now we get a Forbidden response. Because the Livy session is owned by knox, we cannot post a statement into the session over the Knox URL with doAs=myuser; in my setup, at least, only the knox user may post a statement to a Livy session owned by knox.
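To illustrate, here is a minimal sketch of the failing sequence using Python requests (the Knox gateway path for Livy and the credentials are hypothetical placeholders, not the actual values from my cluster):
import json, requests
base = "https://<<knox-server>>:8443/gateway/default/livy/v1"   # hypothetical Knox URL for Livy
auth = ("myuser", "password")
headers = {"Content-Type": "application/json", "X-Requested-By": "myuser"}
# 1) Create a session through Knox; Knox appends doAs=myuser,
#    so the session starts with owner=knox and proxyUser=myuser
r = requests.post(base + "/sessions", auth=auth, headers=headers,
                  data=json.dumps({"kind": "pyspark"}), verify=False)
session_id = r.json()["id"]
# 2) Post a statement through Knox; doAs=myuser is appended again,
#    but Livy answers 403 Forbidden because the session owner is knox
r = requests.post(base + "/sessions/%d/statements" % session_id, auth=auth, headers=headers,
                  data=json.dumps({"code": "1 + 1"}), verify=False)
print(r.status_code)   # 403 in this setup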
04-12-2017
08:09 AM
iris = spark.read.csv("/tmp/iris.csv", header=True, inferSchema=True)
iris.printSchema()
Result:
root
|-- sepalLength: double (nullable = true)
|-- sepalWidth: double (nullable = true)
|-- petalLength: double (nullable = true)
|-- petalWidth: double (nullable = true)
|-- species: string (nullable = true)
Write the parquet file:
iris.write.parquet("/tmp/iris.parquet")
... and create the Hive table:
spark.sql("""
create external table iris_p (
sepalLength double,
sepalWidth double,
petalLength double,
petalWidth double,
species string
)
STORED AS PARQUET
location "/tmp/iris.parquet"
""")
02-23-2017
11:03 AM
@Oriane Try the "Reward User" option.
10-25-2018
07:51 PM
Useful for getting SparkMagic to run with Jupyter. The images do not load for me either, but it is still a good how-to article for Jupyter.
11-28-2016
03:54 PM
Thank you! I was having difficulty with the replace function. I had not thought to first use the ExtractText processor.
08-22-2016
11:55 AM
1 Kudo
Hi Pedro, a Python API for Spark GraphX is still missing; however, there is a Git project with a higher-level API on top of GraphX called GraphFrames (GraphFrames). The project claims: "GraphX is to RDDs as GraphFrames are to DataFrames." I haven't worked with it much, but a quick test of their samples with Spark 1.6.2 worked. Start pyspark like this:
pyspark --packages graphframes:graphframes:0.2.0-spark1.6-s_2.10
or use Zeppelin and add the dependency to the interpreter configuration. Maybe this library has what you need.
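A minimal sketch of what using GraphFrames from pyspark could then look like (the vertex and edge data below are made up for illustration; sqlContext is the one predefined in the Spark 1.6 pyspark shell):
from graphframes import GraphFrame
vertices = sqlContext.createDataFrame(
    [("a", "Alice"), ("b", "Bob"), ("c", "Charlie")], ["id", "name"])
edges = sqlContext.createDataFrame(
    [("a", "b", "follows"), ("b", "c", "follows")], ["src", "dst", "relationship"])
g = GraphFrame(vertices, edges)
g.inDegrees.show()               # degree metrics come back as DataFrames
g.find("(x)-[e]->(y)").show()    # a simple motif query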
07-07-2016
06:17 PM
4 Kudos
In order to submit jobs to Spark, so-called "fat jars" (containing all dependencies) are quite useful. If you develop your code in Scala, "sbt" (http://www.scala-sbt.org) is a great choice to build your project. The following relies on the newest version, sbt 0.13. For fat jars you first need "sbt-assembly" (https://github.com/sbt/sbt-assembly). Assuming you have the standard sbt folder structure, the easiest way is to add a file "assembly.sbt" into the "project" folder containing one line
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")
The project structure now looks like (most probably without the "target" folder which will be created upon building the project)
MyProject
+-- build.sbt
+-- project
| +-- assembly.sbt
+-- src
| +-- main
| +-- scala
| +-- MyProject.scala
+-- target
For building Spark Kafka Streaming jobs on HDP 2.4.2, this is the build file "build.sbt":
name := "MyProject"
version := "0.1"
scalaVersion := "2.10.6"
resolvers += "Hortonworks Repository" at "http://repo.hortonworks.com/content/repositories/releases/"
resolvers += "Hortonworks Jetty Maven Repository" at "http://repo.hortonworks.com/content/repositories/jetty-hadoop/"
libraryDependencies ++= Seq(
"org.apache.spark" % "spark-streaming_2.10" % "1.6.1.2.4.2.0-258" % "provided",
"org.apache.spark" % "spark-streaming-kafka-assembly_2.10" % "1.6.1.2.4.2.0-258"
)
assemblyMergeStrategy in assembly := {
case PathList("com", "esotericsoftware", xs @ _*) => MergeStrategy.last
case PathList("com", "squareup", xs @ _*) => MergeStrategy.last
case PathList("com", "sun", xs @ _*) => MergeStrategy.last
case PathList("com", "thoughtworks", xs @ _*) => MergeStrategy.last
case PathList("commons-beanutils", xs @ _*) => MergeStrategy.last
case PathList("commons-cli", xs @ _*) => MergeStrategy.last
case PathList("commons-collections", xs @ _*) => MergeStrategy.last
case PathList("commons-io", xs @ _*) => MergeStrategy.last
case PathList("io", "netty", xs @ _*) => MergeStrategy.last
case PathList("javax", "activation", xs @ _*) => MergeStrategy.last
case PathList("javax", "xml", xs @ _*) => MergeStrategy.last
case PathList("org", "apache", xs @ _*) => MergeStrategy.last
case PathList("org", "codehaus", xs @ _*) => MergeStrategy.last
case PathList("org", "fusesource", xs @ _*) => MergeStrategy.last
case PathList("org", "mortbay", xs @ _*) => MergeStrategy.last
case PathList("org", "tukaani", xs @ _*) => MergeStrategy.last
case PathList("xerces", xs @ _*) => MergeStrategy.last
case PathList("xmlenc", xs @ _*) => MergeStrategy.last
case "about.html" => MergeStrategy.rename
case "META-INF/ECLIPSEF.RSA" => MergeStrategy.last
case "META-INF/mailcap" => MergeStrategy.last
case "META-INF/mimetypes.default" => MergeStrategy.last
case "plugin.properties" => MergeStrategy.last
case "log4j.properties" => MergeStrategy.last
case x =>
val oldStrategy = (assemblyMergeStrategy in assembly).value
oldStrategy(x)
}
1) The "resolvers" section adds the Hortonworks repositories.
2) In "libraryDependencies" you add the Spark-Streaming (which will also load Spark-Core) and Spark-Streaming-Kafka jars. To avoid problems with Kafka dependencies it is best to use the "spark-streaming-kafka-assembly" fat jar. Note that Spark-Streaming can be tagged as "provided" (it is omitted from the fat jar), since it is automatically available when you submit a job.
3) Unfortunately a lot of libraries are imported twice due to transitive dependencies, which leads to assembly errors. To overcome the issue, the "assemblyMergeStrategy" section tells sbt-assembly to always use the last one (which is from the Spark jars). This list is handcrafted and might change with a new version of HDP; the idea, however, should be clear.
4) Assemble the project (the first time it will "download the internet", like Maven):
sbt assembly
This will create "target/scala-2.10/myproject-assembly-0.1.jar".
5) You can now submit it to Spark:
spark-submit --master yarn --deploy-mode client \
--class my.package.MyProject target/scala-2.10/myproject-assembly-0.1.jar
05-06-2016
10:37 AM
Hi @Mike Vogt, thanks, and glad to hear it worked. Could you kindly accept the answer and thus help us manage answered questions? Thanks!
04-22-2016
08:26 AM
1 Kudo
There seems to be a length issue. If I compress your Avro schema to one line, it works in the Hive View and in beeline:
DROP TABLE metro;
CREATE TABLE metro
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ('avro.schema.literal'='{"namespace":"ly.stealth.xmlavro","protocol":"xml","type":"record","name":"MetroBillType","fields":[{"name":"BillDate","type":"string"},{"name":"BillTime","type":"string"},{"name":"Remit_CompanyName","type":"string"},{"name":"Remit_Addr","type":"string"},{"name":"Remit_CityStZip","type":"string"},{"name":"Remit_Phone","type":"string"},{"name":"Remit_Fax","type":"string"},{"name":"Remit_TaxID","type":"string"},{"name":"BillAcct_Break","type":{"type":"record","name":"BillAcct_BreakType","fields":[{"name":"BillAcct","type":"string"},{"name":"Invoice_Number","type":"int"},{"name":"Acct_Break","type":{"type":"record","name":"Acct_BreakType","fields":[{"name":"Acct","type":"string"},{"name":"Items","type":{"type":"record","name":"ItemsType","fields":[{"name":"Item","type":{"type":"array","items":{"type":"record","name":"ItemType","fields":[{"name":"Account","type":"string"},{"name":"Claim_Number","type":"string"},{"name":"Insured_Name","type":"string"},{"name":"Price","type":"float"},{"name":"Control_Number","type":"int"},{"name":"State","type":"string"},{"name":"Report_Type_Code","type":"string"},{"name":"Report_Type_Desc","type":"string"},{"name":"Policy_Number","type":"string"},{"name":"Date_of_Loss","type":"string"},{"name":"Date_Received","type":"string"},{"name":"Date_Closed","type":"string"},{"name":"Days_to_Fill","type":"int"},{"name":"Police_Dept","type":"string"},{"name":"Attention","type":"string"},{"name":"RequestID","type":"int"},{"name":"ForceDup","type":"string"},{"name":"BillAcct","type":"string"},{"name":"BillCode","type":"string"}]}}}]}},{"name":"Acct_Total","type":"float"},{"name":"Acct_Count","type":"int"}]}},{"name":"Bill_Total","type":"float"},{"name":"Bill_Count","type":"int"}]}},{"name":"Previous_Balance","type":"int"}]}');
SELECT * FROM metro;
+-----------------+-----------------+--------------------------+-------------------+------------------------+--------------------+------------------+--------------------+-----------------------+-------------------------+--+
| metro.billdate | metro.billtime | metro.remit_companyname | metro.remit_addr | metro.remit_citystzip | metro.remit_phone | metro.remit_fax | metro.remit_taxid | metro.billacct_break | metro.previous_balance |
+-----------------+-----------------+--------------------------+-------------------+------------------------+--------------------+------------------+--------------------+-----------------------+-------------------------+--+
+-----------------+-----------------+--------------------------+-------------------+------------------------+--------------------+------------------+--------------------+-----------------------+-------------------------+--+
With the nicely formatted Avro schema I receive "Unexpected end-of-input within/between ARRAY entries", which indicates that there is a length restriction for this parameter. Otherwise, try the avro.schema.url approach, i.e. store the formatted schema file in HDFS and reference it with TBLPROPERTIES ('avro.schema.url'='hdfs:///path/to/metro.avsc') instead of the inline literal.
04-18-2016
05:52 PM
5 Kudos
This is an extension of Starting Spark jobs directly via YARN REST API. Assume the cluster is kerberized and the only access is via Knox. Further assume that Knox uses Basic Authentication and we have the user and password of the user who should start the Spark job. The overall idea is to call curl with <<user>>:<<password>> as Basic Authentication
==> Knox (verifying user:password against LDAP or AD)
==> Resource Manager (YARN REST API) with a Kerberos principal
1) Create and distribute a keytab
The user that should run the Spark job also needs to have a Kerberos principal. For this principal, create a keytab on one machine:
[root@KDC-HOST ~]$ kadmin
kadmin> xst -k /etc/security/keytabs/<<primaryName>>.keytab <<primaryName>>@<<REALM>>
Use your REALM and an appropriate primaryName for your principal. Then distribute this keytab to all other machines in the cluster, copy it to /etc/security/keytabs and set permissions:
[root@CLUSTER-HOST ~]$ chown <<user>>:hadoop /etc/security/keytabs/<<primaryName>>.keytab
[root@CLUSTER-HOST ~]$ chmod 400 /etc/security/keytabs/<<primaryName>>.keytab
Test the keytab on each machine:
[root@CLUSTER-HOST ~]$ kinit <<primaryName>>@<<REALM>> \
-k -t /etc/security/keytabs/<<primaryName>>.keytab
# There must be no password prompt!
[root@KDC-HOST ~]$ klist -l
# Principal name Cache name
# -------------- ----------
# <<primaryName>>@<<REALM>> FILE:/tmp/krb5cc_<<####>>
2) Test connection from the workstation outside the cluster
a) HDFS:
[MacBook simple-project]$ curl -s -k -u '<<user>>:<<password>>' \
https://$KNOX_SERVER:8443/gateway/default/webhdfs/v1/?op=GETFILESTATUS
# {
# "FileStatus": {
# "accessTime": 0,
# "blockSize": 0,
# "childrenNum": 9,
# "fileId": 16385,
# "group": "hdfs",
# "length": 0,
# "modificationTime": 1458070072105,
# "owner": "hdfs",
# "pathSuffix": "",
# "permission": "755",
# "replication": 0,
# "storagePolicy": 0,
# "type": "DIRECTORY"
# }
# }
b) YARN:
[MacBook simple-project]$ curl -s -k -u '<<user>>:<<password>>' -d '' \
https://$KNOX_SERVER:8443/gateway/default/resourcemanager/v1/cluster/apps/new-application
# {
# "application-id": "application_1460654399208_0004",
# "maximum-resource-capability": {
# "memory": 8192,
# "vCores": 3
# }
# }
3) Changes to spark-yarn.properties
The following values need to be changed or added compared to Starting Spark jobs directly via YARN REST API:
spark.history.kerberos.keytab=/etc/security/keytabs/spark.headless.keytabs
spark.history.kerberos.principal=spark-Demo@<<REALM>>
spark.yarn.keytab=/etc/security/keytabs/<<primaryName>>.keytab
spark.yarn.principal=<<primaryName>>@<<REALM>>
4) Changes to spark-yarn.json
The following properties need to be added to the command attribute (before org.apache.spark.deploy.yarn.ApplicationMaster) compared to Starting Spark jobs directly via YARN REST API:
-Dspark.yarn.keytab=/etc/security/keytabs/<<primaryName>>.keytab \
-Dspark.yarn.principal=<<primaryName>>@<<REALM>> \
-Dspark.yarn.credentials.file=hdfs://<<name-node>>:8020/tmp/simple-project/credentials_4b023f93-fbde-48ff-b2c8-516251aeed52 \
-Dspark.history.kerberos.keytab=/etc/security/keytabs/spark.headless.keytabs \
-Dspark.history.kerberos.principal=spark-Demo@<<REALM>> \
-Dspark.history.kerberos.enabled=true
credentials_4b023f93-fbde-48ff-b2c8-516251aeed52 is just a unique filename; the file does not need to exist. Concatenate "credentials" with a UUID4. This is the trigger for Spark to start a delegation token refresh thread; for details see the attachment.
5) Submit a job
Same as in Starting Spark jobs directly via YARN REST API, however one needs to provide -u <<user>>:<<password>> to the curl command to authenticate with Knox. After being authenticated by Knox, the keytabs for the following steps will be taken by YARN and Spark from the properties and the job JSON file.
6) Known issue
After the job finishes successfully, the log aggregation status will remain "RUNNING" until it gets a "TIME_OUT".
7) More details
Again, more details and a Python script to ease the whole process can be found in the Spark-Yarn-REST-API repo. Any comment to make this process easier is highly appreciated ...
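For reference, a minimal Python sketch of steps 2b) and 5) using requests instead of curl; the Knox host is a placeholder, verify=False mirrors curl's -k, and the layout of spark-yarn.json is the one described in the linked article:
import json, requests
knox = "https://<<knox-server>>:8443/gateway/default"
auth = ("<<user>>", "<<password>>")
# 2b) Ask YARN (through Knox) for a new application id
r = requests.post(knox + "/resourcemanager/v1/cluster/apps/new-application", auth=auth, verify=False)
app_id = r.json()["application-id"]
# 5) Patch the id into the job description and submit it
with open("spark-yarn.json") as f:
    job = json.load(f)
job["application-id"] = app_id
r = requests.post(knox + "/resourcemanager/v1/cluster/apps", auth=auth, verify=False,
                  headers={"Content-Type": "application/json"}, data=json.dumps(job))
print(r.status_code)   # 202 Accepted means the job was submitted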