Member since: 09-05-2016
Posts: 24
Kudos Received: 2
Solutions: 3
My Accepted Solutions
Title | Views | Posted
---|---|---
| 345 | 10-01-2019 09:45 AM
| 903 | 11-01-2016 06:47 PM
| 410 | 10-19-2016 05:07 PM
12-16-2019
09:17 AM
There is no solution. The FTP server has a single location to grab files from. There is currently no easy way to do what I was asking.
10-01-2019
09:45 AM
Tried to delete this as no one seems to answer here. But if someone has a similar issue: just use UpdateAttribute to set the attributes you want on a new flow file, then pass it to the AttributesToJSON processor.
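As a rough illustration (the attribute names are from my question below; the values are made up), AttributesToJSON with its attribute list set to name,date,json_content and its destination set to the flow file content would write a flow file like:

{ "name": "sensor_a", "date": "1569919500000", "json_content": "{\"reading\": 42}" }

All of the values come out as strings, since flow file attributes are strings.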
10-01-2019
05:12 AM
I have a complex flow where I created 3 attributes, [name, date, json_content], extracted from other flow file data, that need to go into a database. How can I take these 3 attributes and convert them into a new flow file with those columns? The schema I will use has those names.
Schema:
{
  "type": "record",
  "name": "mytable",
  "fields": [
    {"name": "name", "type": ["string"]},
    {"name": "date", "type": ["null", {"type": "long", "logicalType": "timestamp-millis"}], "default": null},
    {"name": "json_content", "type": ["string"]}
  ]
}
Labels:
- Apache Hadoop
09-25-2019
10:10 AM
Can anyone at Cloudera/Hortonworks answer this??? Having the same issue.
09-18-2019
05:13 AM
I have an FTP location where I must grab specific files, *.tar, from within named subdirectories and only those subdirectories. The layout is like so:
path/conf1/conf1.tar
path/conf2/conf2.tar
path/conf3/conf3.tar
path/support/support.tar
I only want the tar files from path/conf*/. Is this possible using Path Filter Regex or some combination of properties? I do not want to look into support/ at all. In fact, there are some directories I do not have permission to look at, so I get permission exceptions on those. How can I limit the listing to only the conf*/ folders?
Thanks
Labels:
- Apache Hadoop
09-08-2017
02:36 AM
Are there instructions on how to install Spark 2 on an HDP 2.5.6 cluster? We are currently running Spark 1.6.3 and we need to use the Magellan spatial libraries, but they do not function under Spark 1.6.3. Can you point me to Spark 2 installation instructions?
Labels:
- Apache Spark
09-07-2017
08:05 PM
I am running Spark version 1.6.3 under HDP 2.5.6. What version of Magellan should I use with this version?
06-23-2017
01:54 AM
At the moment I have not worked on this issue since, but I will resurrect it and try some things out. First, you could start with the Databricks libraries. I did try one library off git, but it was too difficult to work with. The schema I am using is quite complex. What schema do you have for your data? Some ideas I learned about, but have not tried, include pre-converting the XML to CSV or Avro before consuming it into Spark, and using the Databricks CSV or another lib to process it in the streaming portion. Let me know how you are ingesting the XML. I still need to do this at some point.
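If it helps, here is a minimal sketch of reading XML with the Databricks spark-xml package (the Scala 2.10 / Spark 1.x build; the package version, rowTag value and path below are placeholders for whatever your data looks like):

// spark-shell --packages com.databricks:spark-xml_2.10:0.3.3
import org.apache.spark.sql.SQLContext

// assumes an existing SparkContext sc, as in spark-shell
val sqlContext = new SQLContext(sc)

// rowTag should be the XML element that wraps one logical record
val df = sqlContext.read.format("com.databricks.spark.xml").
  option("rowTag", "record").
  load("/user/test/input.xml")

df.printSchema()
df.show()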
05-03-2017
06:15 PM
I need to upgrade a cluster from 2.4.2 to 2.5. Shouldn't we be following this link: http://docs.hortonworks.com/HDPDocuments/Ambari-2.5.0.3/bk_ambari-upgrade/bk_ambari-upgrade.pdf? We are going TO 2.5.
04-13-2017
11:47 AM
1 Kudo
I run HDP Spark 1.4.1 and 1.6.1. I have to process rapidly arriving XML from a Kafka topic with Spark Streaming. I am able to use the .print function and see that my data is indeed coming into Spark. I have the batches set at 10 seconds. Now I need to know:
1) Is there a way to delimit each XML message?
2) How can I apply a JAXB-like schema function to each message?
I have a process already doing this in plain Java and it works fine using the standard Kafka APIs and JAXB. Sample output where I write the data with saveAsTextFiles() shows broken messages; they seem to be split on spaces, and large XML messages are spread across more than one file. Thanks, M
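For point 1, a minimal sketch of the direction I have in mind, assuming the direct Kafka stream API that ships with Spark 1.6 (the broker address and topic name are placeholders). Each Kafka record value arrives as one complete string, so handling messages per record keeps the XML intact instead of letting it get re-split on whitespace:

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("XmlFromKafka")
val ssc = new StreamingContext(conf, Seconds(10))

val kafkaParams = Map("metadata.broker.list" -> "broker1:6667")
val topics = Set("xml_topic")

// one DStream element per Kafka message, so each XML document stays whole
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics).map(_._2)

messages.foreachRDD { rdd =>
  rdd.foreach { xml =>
    // hand the complete XML string to a JAXB unmarshaller or other parser here
  }
}

ssc.start()
ssc.awaitTermination()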
Labels:
- Apache Spark
11-01-2016
06:47 PM
At the moment I made it past this... unfortunately I had added the extra options to the end of the command line, and if you notice, the options "--master ... memory 5g" were actually being fed into my jar. So I just moved "Main.jar ..." to the end and it works now. Corrected command line:
spark-submit --packages com.databricks:spark-avro_2.10:2.0.1 --class Main --master yarn --executor-memory 5g --driver-memory 5g Main.jar --avro-file file_1.avro
10-31-2016
02:30 PM
I am executing the following command:
spark-submit --packages com.databricks:spark-avro_2.10:2.0.1 --class Main Main.jar --avro-file file_1.avro --master yarn --executor-memory 5g --driver-memory 5g
The file_1.avro file is about 1.5 GB, but it fails with files at 300 MB as well. I have tried running this on HDP with both Spark 1.4.1 and Spark 1.6.1 and I get an OOM error. Running from spark-shell works fine. Part of the huge stack trace:
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3236)
at java.io.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:191)
at org.apache.avro.file.DeflateCodec.decompress(DeflateCodec.java:84)
at org.apache.avro.file.DataFileStream$DataBlock.decompressUsing(DataFileStream.java:352)
at org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:199)
at org.apache.avro.mapred.AvroRecordReader.next(AvroRecordReader.java:64)
at org.apache.avro.mapred.AvroRecordReader.next(AvroRecordReader.java:32)
at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:250)
...
I have compiled this with Scala 2.10.6 with the following lines:
sqlContext.read.format("com.databricks.spark.avro").
  option("header", "true").load(aPath + iAvroFile)
sqlContext.sql("CREATE TEMPORARY TABLE " + tempTable + " USING com.databricks.spark.avro OPTIONS " +
  "(path '" + aPath + iAvroFile + "')")
val counts_query = "SELECT id ID, count(id) " +
  "HitCount, '" + fileDate + "' DateHour FROM " + tempTable + " WHERE Format LIKE CONCAT('%','BEST','%') GROUP BY id"
val flight_counts = sqlContext.sql(counts_query)
flight_counts.show() // OOM
I have tried many options and cannot get past this. Ex: --master yarn-client --executor-memory 10g --driver-memory 10g --num-executors 4 --executor-cores 4
Any ideas would help to get past this...
Labels:
- Apache Spark
10-19-2016
05:07 PM
We got past this problem by eventually just matching up one of the versions of the lucene library.
10-19-2016
03:25 PM
We are using HDP with Flume 1.5.2.2.4 and attempting to get the Elasticsearch connector working. We installed elasticsearch-2.4.1.jar along with lucene-core-5.5.2.jar at first. When restarting Flume we get java.lang.NoSuchFieldError: LUCENE_4_10_2.
We get these NoSuchFieldErrors no matter which version of lucene we use: LUCENE_5_2_1, LUCENE_4_0_0... Can anyone shed some light on how to get the Elasticsearch libs to work with Flume? Thanks Mike
Labels:
- Apache Flume
09-20-2016
02:20 AM
Almost forgot about this... I access my avro files like so. First, as Tim said, include the proper avro lib, in my case the Databricks one:
spark-submit --packages com.databricks:spark-avro_2.10:2.0.1 --class MyMain MyMain.jar
val df = sqlContext.read.format("com.databricks.spark.avro").
  option("header", "true").load("/user/user1/writer_test.avro")
df.select("time").show()
...
Thanks all
09-18-2016
09:37 PM
Thanks, I will give these answers a try... I will report back...
09-18-2016
01:07 AM
1 Kudo
Running HDP-2.4.2, Spark 1.6.1, Scala 2.10.5. I am trying to read avro files on HDFS from the spark shell or code. First I am trying to pull in the schema file. If I use:
val schema = sc.textFile("/user/test/ciws.avsc")
this loads the file and I can do:
schema.take(100).foreach(println)
and see the contents. If I do (using the Avro Schema parser):
val schema = new Schema.Parser().parse(new File("/home/test/ciws.avsc"))
or
val schema = scala.io.Source.fromFile("/user/test/ciws.avsc").mkString
I get:
java.io.FileNotFoundException: /user/test/ciws.avsc (No such file or directory)
My core-site.xml specifies defaultFS as our namenode. I have tried adding "hdfs:/" to the file path and "hdfs://<defaultFS>/..." and still no dice. Any ideas how to reference a file in HDFS with the Schema.Parser class or the scala.io.Source class? Mike
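A minimal sketch of one way around this, assuming the Hadoop FileSystem API (already on the classpath in spark-shell): java.io.File and scala.io.Source only read the local filesystem, so the HDFS file has to be opened through a Hadoop input stream and handed to Schema.Parser, which also accepts an InputStream:

import org.apache.avro.Schema
import org.apache.hadoop.fs.{FileSystem, Path}

// reuse the Hadoop configuration the SparkContext already carries
val fs = FileSystem.get(sc.hadoopConfiguration)

// open the schema file from HDFS instead of the local filesystem
val in = fs.open(new Path("/user/test/ciws.avsc"))
val schema = try {
  new Schema.Parser().parse(in) // Schema.Parser.parse also takes an InputStream
} finally {
  in.close()
}

println(schema.toString(true))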
Labels:
- Apache Hive
- Apache Spark