Member since: 09-05-2016
Posts: 24
Kudos Received: 2
Solutions: 3
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 154 | 10-01-2019 09:45 AM |
 | 519 | 11-01-2016 06:47 PM |
 | 230 | 10-19-2016 05:07 PM |
12-16-2019
09:17 AM
There is no solution. The FTP server has a single location to grab files from. There is currently no easy way to do what I was asking.
10-01-2019
09:45 AM
I tried to delete this as no one seems to answer here. But if someone has a similar issue: just use UpdateAttribute on the attributes you want in the new FlowFile and then pass them to the AttributesToJSON processor.
10-01-2019
05:12 AM
I have a complex flow where I created 3 attributes, [name, date, json_content], extracted from other flow file data, that need to go into a database. How can I take these 3 attributes and convert them into a new flow file with those columns? The schema I will use has those names.
Schema:
{
  "type": "record",
  "name": "mytable",
  "fields": [
    { "name": "name", "type": [ "string" ] },
    { "name": "date", "type": [ "null", { "type": "long", "logicalType": "timestamp-millis" } ], "default": null },
    { "name": "json_content", "type": [ "string" ] }
  ]
}
Labels:
09-25-2019
10:10 AM
Can anyone at Cloudera/Hortonworks answer this? I am having the same issue.
09-18-2019
05:13 AM
I have an FTP location where I must grab specific files, *.tar, from within named subdirectories and only those subdirectories. The layout is like so:
path/conf1/conf1.tar
path/conf2/conf2.tar
path/conf3/conf3.tar
path/support/support.tar
I only want tar files from path/conf*/. Is this possible using Path Filter Regex or some combination of properties? I do not want to look into support/ at all. In fact, there are some directories I do not have permission to read, so I get permission exceptions for those. How can I limit the listing to only the conf*/ folders?
Thanks
Labels:
08-12-2019
08:28 PM
I am trying to simply take the content from a file I break up with SplitProcessor, send the line of text out as a text FlowFile, and also convert that line to JSON and send it out as another FlowFile, all in onTrigger(). I have tried cloning/creating a new session but I am missing something.

// Snippets
onTrigger() {
    ...
    // read the flow file's text line and send it over to the "TEXT" relationship (works fine)
    String content = new String(buffer, ...);
    ff = session.write(ff, new OutputStream...);
    session.transfer(ff, TEXT);
    session.remove(ff);

    // now convert that string to JSON and send it to "JSON"
    ff2 = session.create();
    String jsonString = toJson(content);
    // write to ff2
    ff2 = session.write(ff2, ...
    session.transfer(ff2, JSON);
    session.remove(ff2);

I do get errors depending on whether I call remove() or not (NullPointerException, "already marked for transfer", ...). Is there an example somewhere that shows how to do what I want to do? Thanks, MK
Labels:
02-28-2019
07:17 PM
Is there documentation on how to specify time durations, for example for Polling Interval, as a human-readable string? E.g. 60 sec, 1 hour, 10 hours. Does NiFi understand both "1sec" and "1 sec", and which unit words are available (sec, hour, hours, ...)? The "Allowed Values" column on the docs pages doesn't say.
- Tags:
- nifi-processor
Labels:
09-08-2017
02:36 AM
Are there instructions on how to install Spark 2 on an HDP 2.5.6 cluster? We are currently running Spark 1.6.3 and we need to use the Magellan spatial libs, but they do not function under Spark 1.6.3. Can you point me to Spark 2 installation instructions?
Labels:
09-07-2017
08:05 PM
I am running Spark version 1.6.3 under HDP 2.5.6. What version of Magellan should I use with this version?
07-19-2017
06:31 PM
The first run of the teragen/sort/validate test went OK at a 100 GB test size. After re-running the test, teragen fails and the log says the following. I see only 26 maps are processed out of the total of 92. Any ideas? Stack below.
2017-07-19 14:01:12,719 INFO [Thread-145] org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.mapreduce.v2.app.MRAppMaster failed in state STOPPED; cause: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /mr-history/tmp/user1/job_1499796473254_0009.summary_tmp could only be replicated to 0 nodes instead of minReplication (=1). There are 4 datanode(s) running and no node(s) are excluded in this operation.
at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1620)
at ... org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2211)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2207)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2205)
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /mr-history/tmp/user1/job_1499796473254_0009.summary_tmp could only be replicated to 0 nodes instead of minReplication (=1). There are 4 datanode(s) running and no node(s) are excluded in this operation.
at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1620)

Other info: hdfs dfsadmin -report
Configured Capacity: 1074290753536 (1000.51 GB)
Present Capacity: 601786240099 (560.46 GB)
DFS Remaining: 172010277124 (160.20 GB)
DFS Used: 429775962975 (400.26 GB)
DFS Used%: 71.42%
Under replicated blocks: 307
Blocks with corrupt replicas: 0
Missing blocks: 0
Missing blocks (with replication factor 1): 0
06-23-2017
01:54 AM
At the moment I have not worked on this issue since, but I will resurrect it and try some things out. First, you could start with the Databricks libraries. I did try one library off Git but it was too difficult to work with; the schema I am using is quite complex. What schema do you have for your data? Some ideas I learned (but have not tried) include pre-converting the XML to CSV or Avro before consuming it into Spark, and using the Databricks CSV or another lib to process it in the stream portion. Let me know how you are ingesting the XML. I still need to do this at some point.
05-03-2017
06:15 PM
I need to upgrade a cluster from 2.4.2 to 2.5. Shouldn't we be following this link: http://docs.hortonworks.com/HDPDocuments/Ambari-2.5.0.3/bk_ambari-upgrade/bk_ambari-upgrade.pdf? We are going TO 2.5.
04-13-2017
11:47 AM
1 Kudo
I run HDP Spark 1.4.1 and 1.6.1. I have to process rapidly arriving XML from a Kafka topic with Spark Streaming. I am able to call the .print function and see that my data is indeed coming into Spark; I have it batched now at 10 s. Now I need to know: 1) Is there a way to delimit each XML message? 2) How can I apply a JAXB-like schema function to each message? I have a process already doing this in plain Java and it works fine using the standard Kafka APIs and JAXB. Sample output where I write the data with saveAsTextFiles() shows broken messages; they seem to be split on spaces, and large XML messages are spread across more than one file. Thanks, M
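To make the goal concrete, here is a minimal sketch of the kind of app I have in mind (the broker list, topic name, and the JAXB-annotated MyMessage class are placeholders, and it assumes the spark-streaming-kafka_2.10 artifact matching the Spark version is on the classpath). Each direct-stream record value should already be one complete XML message, so the apparent splitting with saveAsTextFiles() would just be one line per output record and one part-file per partition per batch:

import java.io.StringReader
import javax.xml.bind.JAXBContext
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

// Placeholders: broker list, topic name, and MyMessage (a hypothetical JAXB-annotated class).
val ssc = new StreamingContext(new SparkConf().setAppName("xml-stream"), Seconds(10))
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, Map("metadata.broker.list" -> "broker1:9092"), Set("xml-topic"))

// Each record value is a whole XML document; build the JAXB unmarshaller once per partition.
val parsed = stream.map(_._2).mapPartitions { msgs =>
  val unmarshaller = JAXBContext.newInstance(classOf[MyMessage]).createUnmarshaller()
  msgs.map(xml => unmarshaller.unmarshal(new StringReader(xml)).asInstanceOf[MyMessage])
}
parsed.print()
ssc.start()
ssc.awaitTermination()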
Labels:
11-10-2016
03:20 PM
In this case, yes, I am in the shell testing before I create my app. That 127.0.0.1 is not real; it is actually the IP of a server that has ES running on it, so think of it as "esserver", and it is outside the cluster. Command line:
spark-shell --packages com.databricks:spark-avro_2.10:2.0.1 --jars elasticsearch-spark_2.10-2.4.0.jar --master yarn
The elasticsearch-spark_2.10-2.4.0.jar file is located locally in the directory where I am running spark-shell.
11-09-2016
04:38 AM
Trying to get some data written over to an Elasticsearch node from spark-shell before creating the Scala app. Running Horton with Spark 1.4.1. Whenever I run the following:

val conf = new SparkConf()
conf.set("es.index.auto.create", "true")
conf.set("es.nodes", "127.0.0.1")
conf.set("es.port", "9200")
sc.makeRDD(Seq(numbers, airports)).saveToEs("my_index")

I get:

<console>:27: error: value saveToEs is not a member of org.apache.spark.rdd.RDD[scala.collection.immutable.Map[String,Any]]
       sc.makeRDD(Seq(numbers, airports)).saveToEs("my_index")

Where is the saveToEs() method defined? I have tried this other way:

import org.elasticsearch.spark.rdd.EsSpark
EsSpark.saveToEs(sc.makeRDD(Seq(numbers, airports)), "my_index")

and get:

Cannot detect ES version - typically this happens if the network/Elasticsearch cluster is not accessible or when targeting a WAN/Cloud instance without the proper setting 'es.nodes.wan.only'

Any ideas from those who have tried this? MAK
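For context, the minimal usage I am after looks roughly like this (a sketch: the index/type name, the "esserver" host, and the sample maps are placeholders). My understanding is that saveToEs is added to RDDs by the implicits in org.elasticsearch.spark._, and that in the shell, where sc already exists, the ES settings can be passed per call instead of on SparkConf:

import org.elasticsearch.spark._

// Hypothetical sample documents, mirroring the ES-Hadoop docs.
val numbers = Map("one" -> 1, "two" -> 2)
val airports = Map("OTP" -> "Otopeni", "SFO" -> "San Fran")

// Connection settings passed per call; "esserver" and "my_index/docs" are placeholders.
val cfg = Map(
  "es.nodes" -> "esserver",
  "es.port" -> "9200",
  "es.index.auto.create" -> "true",
  "es.nodes.wan.only" -> "true")

sc.makeRDD(Seq(numbers, airports)).saveToEs("my_index/docs", cfg)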
Labels:
11-02-2016
03:36 PM
Running a spark-submit job with Spark 1.4.1 on HDP 2.3.2. No logs show when I click on the Executors tab for my Spark job; the Jobs and Stages tabs are also empty. Where can I find the stderr/stdout logs for my job? Also, the job still says "Incomplete" after it has finished. The same app works fine in Spark 1.6.1: I can see everything and can refresh the page with updates as it runs.
Labels:
11-01-2016
06:47 PM
At the moment I made it past this... Unfortunately I had added the extra options to the end of the command line, and if you notice, the options "--master...memory 5g" were actually being fed into my jar. So I just moved the "Main.jar..." part to the end and it works now. Corrected command line:
spark-submit --packages com.databricks:spark-avro_2.10:2.0.1 --class Main --master yarn --executor-memory 5g --driver-memory 5g Main.jar --avro-file file_1.avro
10-31-2016
02:30 PM
I am executing the following command:
spark-submit --packages com.databricks:spark-avro_2.10:2.0.1 --class Main Main.jar --avro-file file_1.avro --master yarn --executor-memory 5g --driver-memory 5g
The file_1.avro file is about 1.5 GB, but it fails with files of about 300 MB as well. I have tried running this on HDP with both Spark 1.4.1 and Spark 1.6.1 and I get an OOM error; running from spark-shell works fine. Part of the huge stack trace:
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3236)
at java.io.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:191)
at org.apache.avro.file.DeflateCodec.decompress(DeflateCodec.java:84)
at org.apache.avro.file.DataFileStream$DataBlock.decompressUsing(DataFileStream.java:352)
at org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:199)
at org.apache.avro.mapred.AvroRecordReader.next(AvroRecordReader.java:64)
at org.apache.avro.mapred.AvroRecordReader.next(AvroRecordReader.java:32)
at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:250)
...

I have compiled this with Scala 2.10.6, with the following lines:

sqlContext.read.format("com.databricks.spark.avro").
  option("header", "true").load(aPath + iAvroFile);

sqlContext.sql("CREATE TEMPORARY TABLE " + tempTable + " USING com.databricks.spark.avro OPTIONS " +
  "(path '" + aPath + iAvroFile + "')");

val counts_query = "SELECT id ID,count(id) " +
  "HitCount,'" + fileDate + "' DateHour FROM " + tempTable +
  " WHERE Format LIKE CONCAT('%','BEST','%') GROUP BY id";
val flight_counts = sqlContext.sql(counts_query);

flight_counts.show()  // OOM here

I have tried many options and cannot get past this, e.g.:
--master yarn-client --executor-memory 10g --driver-memory 10g --num-executors 4 --executor-cores 4
Any ideas would help to get past this...
Labels:
10-19-2016
05:07 PM
We got past this problem by eventually just matching up one of the versions of the Lucene library.
10-19-2016
03:25 PM
We are using HDP with Flume 1.5.2.2.4 and attempting to get the Elasticsearch connector working. We installed elasticsearch-2.4.1.jar along with lucene-core-5.5.2.jar at first. When restarting Flume we get java.lang.NoSuchFieldError: LUCENE_4_10_2.
We get these NoSuchFieldErrors no matter which version of Lucene we use: LUCENE_5_2_1, LUCENE_4_0_0... Can anyone shed some light on how to get the Elasticsearch libs to work with Flume? Thanks, Mike
Labels:
10-14-2016
04:30 PM
I have a few external jars, such as elasticsearch-spark_2.10-2.4.0.jar. Currently I use the --jars option to load it for spark-shell. Is there a way to get this or other jars to load with Spark for my cluster? I see through Ambari there is spark-defaults, but I was wondering if I could just copy X.jar to /usr/hdp/<Ver>/spark/lib and have it picked up. A somewhat related side question (it is in the same command line): I use "--packages com.databricks:spark-avro_2.10:2.0.1". I notice that the first time this is used, Spark goes out and grabs the jars like Maven would, but I could not find them afterwards and wonder if I need this argument every time, or whether I can get the Databricks libs installed permanently as with the Elasticsearch one. Thanks, Mike
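For illustration, a hedged sketch of the spark-defaults entries I am imagining (the property names are standard Spark configuration properties; the jar path is a placeholder, and spark.jars.packages may only be honored by newer Spark releases):

# Ship a local jar with every application (comma-separated list); placeholder path.
spark.jars            /usr/hdp/current/spark-client/lib/elasticsearch-spark_2.10-2.4.0.jar
# Resolve Maven coordinates once and cache them (same mechanism as --packages, typically under ~/.ivy2).
spark.jars.packages   com.databricks:spark-avro_2.10:2.0.1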
Labels:
09-20-2016
02:20 AM
Almost forgot about this... I access my Avro files like so. First, as Tim said, include the proper Avro lib, in my case the Databricks one:

spark-submit --packages com.databricks:spark-avro_2.10:2.0.1 --class MyMain MyMain.jar

val df = sqlContext.read.format("com.databricks.spark.avro").
  option("header", "true").load("/user/user1/writer_test.avro")
df.select("time").show()
...

Thanks all
09-18-2016
09:37 PM
Thanks, I will give these answers a try... I will report back...
09-18-2016
01:07 AM
1 Kudo
Running HDP-2.4.2, Spark 1.6.1, Scala 2.10.5. I am trying to read Avro files on HDFS from the Spark shell or from code, starting by pulling in the schema file. If I use:

val schema = sc.textFile("/user/test/ciws.avsc")

the file loads and I can do schema.take(100).foreach(println) and see the contents. If I do (using the Avro Schema parser):

val schema = new Schema.Parser().parse(new File("/home/test/ciws.avsc"))

or

val schema = scala.io.Source.fromFile("/user/test/ciws.avsc").mkString

I get:

java.io.FileNotFoundException: /user/test/ciws.avsc (No such file or directory)

My core-site.xml specifies defaultFS as our namenode. I have tried adding "hdfs:/" to the file path, and "hdfs://<defaultFS>/...", and still no dice. Any ideas how to reference a file in HDFS with the Schema.Parser class or the scala.io.Source class? Mike
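For reference, a minimal sketch of the direction I am thinking of (assuming the .avsc really does live on HDFS): Schema.Parser().parse(new File(...)) and scala.io.Source.fromFile only read the local filesystem, so the idea is to open the file through the Hadoop FileSystem API and hand the stream to the parser.

import org.apache.avro.Schema
import org.apache.hadoop.fs.{FileSystem, Path}

// Open the schema file via HDFS (uses fs.defaultFS from the cluster config) rather than the local filesystem.
val fs = FileSystem.get(sc.hadoopConfiguration)
val in = fs.open(new Path("/user/test/ciws.avsc"))
val schema = try { new Schema.Parser().parse(in) } finally { in.close() }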
Labels: