Member since: 04-11-2018
Posts: 47
Kudos Received: 0
Solutions: 1
My Accepted Solutions
| Title | Views | Posted |
| --- | --- | --- |
|  | 14055 | 06-12-2018 11:26 AM |
10-08-2018
11:54 AM
Hi, to convert a CSV file to a DataFrame we normally have to know the delimiter character at coding time, but in my case the delimiter is not known in advance. The source file will be delimited by some character, and the code should be able to infer the delimiter from the file and then convert the file into a DataFrame. For now I have written a Java snippet to detect the delimiter character first and then read the file. Is there any predefined function that covers this? Thanks, R
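A minimal Scala sketch of one way to do this, assuming Spark 2.x and a known set of candidate delimiters; the path and the `inferDelimiter` helper are illustrative, not a built-in API:

```scala
import org.apache.spark.sql.SparkSession

object DelimiterInference {
  // Candidate delimiters to test -- an assumption, adjust to your sources
  private val candidates = Seq(',', '\t', '|', ';')

  // Pick the candidate that occurs most often in the sampled line
  def inferDelimiter(sampleLine: String): Char =
    candidates.maxBy(c => sampleLine.count(_ == c))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("csv-delimiter-inference").getOrCreate()
    val path = "hdfs:///data/incoming/source_file.txt" // hypothetical input path

    // Sample only the first line of the file to guess the delimiter
    val firstLine = spark.sparkContext.textFile(path).first()
    val delim = inferDelimiter(firstLine)

    // Read the full file as a DataFrame using the inferred delimiter
    val df = spark.read
      .option("header", "true")
      .option("delimiter", delim.toString)
      .csv(path)

    df.show(5)
    spark.stop()
  }
}
```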
Labels:
- Apache Spark
07-18-2018
02:00 PM
@Felix Albani Thanks for your answer. In my scenario we have 3 edge nodes, and the same Spark job jar is deployed on all three. Say there is maintenance work going on node 1 and my Spark jar is not available there; the job should then be triggered by the scheduler from another available node, node 2 or node 3. Currently we deploy in cluster mode only, but my confusion is how the code should get triggered from the next available edge node. What is the best approach to schedule a Spark job across multiple edge nodes?
07-17-2018
09:56 AM
I have a Spark job deployed on a Hadoop cluster. In my case more than one edge node points to the same Hadoop cluster, and my requirement is that if edge node 1 has an issue, the Spark job should be triggered from another available edge node. What is the best way to do this?
Labels:
- Apache Hadoop
- Apache Spark
07-17-2018
09:50 AM
@Matt Clarke, thanks for your answer. I am not aware of the file sequence.
07-03-2018
12:21 PM
Hi, I am expecting two files, abc.txt and pqr.txt, and my next processor should be triggered only after both files have been received. Currently I am using ListFile and FetchFile to pick up changes in the source directory. Once I have received both mandatory files, the next step is to process them. I am not sure how to configure NiFi to hold the files in the queue until all mandatory files have arrived. How can this be achieved in NiFi?
Labels:
- Apache NiFi
06-26-2018
02:24 PM
@Raymond Honderdors, thanks for your answer.
06-26-2018
03:19 AM
Hi, I am validating XML coming through the NiFi flow against an XSD. Schema validation is done with the validateXML processor, and I have supplied the XSD file to validate against. Now, whenever there is a change to the XSD file, do we need to restart the NiFi cluster, or is it enough to stop and start the validateXML processor again?
Labels:
- Apache NiFi
06-25-2018
01:53 PM
We have a 3-node NiFi cluster and we want to make some changes to a flow that will require a NiFi cluster restart. What is the correct way to restart the complete NiFi cluster? It should not be done one node at a time. Thanks, R
Labels:
- Apache NiFi
06-25-2018
08:31 AM
@Matt Clarke, could you please help me with this?
06-22-2018
10:13 AM
Hello friends, @Pierre Villard (tagging you specifically because I like your explanations). Currently I am getting messages from a queue and I want to trigger a notification email only once per file type. I used the PutEmail processor, but my flow triggers an email notification every time a new flow file comes in, which is its expected behavior. I thought about using MergeContent here, but again I am not sure how to configure it. Descriptive requirement: I receive files for departments such as computer, mechanical, electronics, etc. I want to send a notification email only once per day per file type, as soon as the first file for a specific department is received. Thanks for your help,
Labels:
- Apache NiFi
06-12-2018
11:26 AM
Hi, I found the correct way to do it. There is no need for any workaround; we can append the data directly into a partitioned Parquet Hive table using saveAsTable("mytable"), available from Spark 2.0 (it was not there in Spark 1.6). Below is the code in case someone needs it:
df.write.partitionBy("mycol1","mycol2").mode(SaveMode.Append).format("parquet").saveAsTable("myhivetable")
If the table does not exist, it is created and the data is written into it. If the table already exists, the data is appended to the table and the specified partitions.
06-11-2018
04:08 PM
@sunile.manjee there might be multiple workarounds for this, but I am not looking for workarounds. I am expecting a concrete solution that does not have performance implications. We have the option to write the DataFrame straight into the Hive table, so why should we not go with that instead of writing the data to HDFS and then loading it into the Hive table? Moreover, my Hive table is partitioned on processing year and month.
06-11-2018
03:11 PM
@sunile.manjee thanks for your response. The Hive table has the input format, output format and SerDe set to ParquetHiveSerDe; however, my concern is why the files are not created with a .parquet extension, and whenever I cat those .c000 files I cannot find the Parquet schema that I can find when I cat normal .parquet files.
06-11-2018
02:19 PM
Hi, I am writing a Spark DataFrame into a Parquet Hive table like below:
df.write.format("parquet").mode("append").insertInto("my_table")
But when I go to HDFS and check the files created for the Hive table, I can see they are not created with a .parquet extension; they are created with a .c000 extension. I am also not sure whether my data was written into the table correctly (I can see the data with a Hive SELECT). How should we write the data into .parquet files in a Hive table? Appreciate your help on this! Thanks,
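For what it's worth, the .c000 suffix is just part-file naming and does not by itself mean the data is not Parquet; a quick hedged check is to read the table's HDFS directory back with the Parquet reader. The warehouse path below is hypothetical:

```scala
import org.apache.spark.sql.SparkSession

object VerifyParquetOutput {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("verify-parquet-output")
      .enableHiveSupport()
      .getOrCreate()

    // Hypothetical HDFS location of the Hive table's data files
    val tablePath = "hdfs:///apps/hive/warehouse/my_table"

    // If the part files are valid Parquet, this prints the schema and row count
    // regardless of the .c000 extension.
    val dfBack = spark.read.parquet(tablePath)
    dfBack.printSchema()
    println(s"rows: ${dfBack.count()}")

    spark.stop()
  }
}
```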
Labels:
- Apache Hive
- Apache Spark
06-08-2018
10:41 AM
Hi, I want to write a Spark DataFrame into a Hive table. The Hive table is partitioned on year and month and the file format is Parquet. Currently I am writing the DataFrame into the Hive table using insertInto() with mode("append"). I am able to write the data into the Hive table, but I am not sure that is the correct way to do it. Also, while writing I get this exception: "parquet.hadoop.codec.CompressionCodecNotSupportedException: codec not supported: org.apache.hadoop.io.compress.DefaultCodec". Could you please help me with this? Thanks for your time,
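One hedged guess at the cause: Hive's output-compression settings defaulting to DefaultCodec, which Parquet does not accept. A sketch that switches to a Parquet-supported codec before the write; the table and column names are illustrative, and the config keys should be checked against your Spark/Hive versions:

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object WritePartitionedParquet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("write-partitioned-parquet")
      .enableHiveSupport()
      .getOrCreate()

    // Use a codec Parquet supports instead of the Hive default (DefaultCodec)
    spark.conf.set("spark.sql.parquet.compression.codec", "snappy")
    spark.sql("SET hive.exec.compress.output=false")

    // Allow writing into the year/month partitions dynamically
    spark.sql("SET hive.exec.dynamic.partition=true")
    spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

    // insertInto matches columns by position, so the DataFrame must have the
    // same column order as the table, with the partition columns (year, month) last.
    val df = spark.table("staging_table") // hypothetical source of the data
    df.write.mode(SaveMode.Append).insertInto("my_partitioned_table")

    spark.stop()
  }
}
```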
Labels:
- Apache Hive
- Apache Spark
06-06-2018
01:40 PM
@Shu thanks for your answer. I have one doubt about it: the command arguments should not take the shell script as input; instead, we should put the shell script in the command path, which is tried and tested.
06-06-2018
09:01 AM
My requirement is that as soon as the source puts the files, my Spark job should be triggered to process them. Currently I am thinking of doing it like this:
1. The source will push the files to a local directory, /temp/abc.
2. NiFi ListFile and FetchFile will take care of ingesting those files into HDFS.
3. On the success relationship of PutHDFS, I am thinking of setting up ExecuteStreamCommand.
Could you please suggest whether there is a better approach? What should the configuration of ExecuteStreamCommand be? Thanks in advance, R
Labels:
- Apache NiFi
- Apache Spark
05-31-2018
05:45 PM
@gnovak, I am still wondering why it created the directory on my local machine. Kind of weird... Related to this I have another issue: I am also reading files from an HDFS directory using wholeTextFiles(). My HDFS input directory has text files and sub-directories in it. On my local development machine I was able to read the files and wholeTextFiles() did not consider the sub-directories; however, when I deployed the same code to the cluster, it started picking up the sub-directories as well. Do you have any idea about this? Appreciate your help on this.
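If the goal is to skip sub-directories on the cluster as well, one option (an assumption, not a guaranteed fix) is to pass a glob that matches only the files directly under the input directory, since wholeTextFiles() accepts path patterns in the same way as textFile(); the path and .txt suffix below are illustrative:

```scala
import org.apache.spark.sql.SparkSession

object ReadTopLevelFilesOnly {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("whole-text-files-glob").getOrCreate()
    val sc = spark.sparkContext

    // Glob only the .txt files directly under the directory, so sub-directories
    // are never walked, whether the job runs locally or on the cluster.
    val files = sc.wholeTextFiles("hdfs:///apps/input/*.txt")

    files.keys.collect().foreach(println)
    spark.stop()
  }
}
```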
05-31-2018
05:39 PM
@gnovak thanks for your time 🙂
05-31-2018
02:28 PM
@gnovak, in order to satisfy my need I am doing FileSystem.rename(src, tgt). If the target path does not exist, will it be created? My understanding is that it will create the target path. In my case I am able to move the file as expected on my local machine, but after deploying the same code on the cluster I am not able to move the file to the desired location. It does not give me any exception; it simply does not do the job.
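For reference, a hedged sketch of the pattern to try here: FileSystem.rename() generally does not create a missing destination directory and reports failure through its Boolean return value rather than an exception, which would match the silent behaviour described above. The paths are illustrative:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object MoveHdfsFile {
  def main(args: Array[String]): Unit = {
    val fs = FileSystem.get(new Configuration())

    val src = new Path("/apps/pqr/abc.txt") // illustrative source file
    val tgtDir = new Path("/apps/lmn")      // illustrative target directory

    // Create the target directory first; rename() tends to return false
    // (without throwing) when the destination parent does not exist.
    if (!fs.exists(tgtDir)) fs.mkdirs(tgtDir)

    // Move the file, keeping its original name under the target directory
    val moved = fs.rename(src, new Path(tgtDir, src.getName))
    if (!moved) println(s"Move of $src to $tgtDir failed")
  }
}
```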
05-25-2018
09:03 AM
@gnovak Thanks for understanding my question correctly. I have done the same in my Scala code; I just wanted to get others' opinions on this.
05-25-2018
09:01 AM
@Geoffrey Shelton Okot, thanks for your time, but I was not looking for the command-line option (everyone knows that one).
05-24-2018
06:29 AM
@Geoffrey Shelton Okot, I have a few files in an HDFS directory and simply want to move files from one HDFS directory to another. For example: the file abc.txt is in the pqr directory and I want to move it to the lmn directory, i.e. move /apps/pqr/abc.txt to /apps/lmn/abc.txt.
05-23-2018
07:48 AM
@Felix Albani, I am still not getting your point; it should not throw an exception in the case of IF NOT EXISTS. As per my understanding, when we say IF NOT EXISTS it should execute the statement silently, without throwing any exception, when the database already exists, and that is exactly why we use IF NOT EXISTS. My purpose here is to create the database if it does not exist, and otherwise not create it.
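Just to make the expectation concrete, a minimal sketch of the statement in question (the database name is illustrative, and SparkSession assumes Spark 2.x); with IF NOT EXISTS this is expected to be a no-op when the database already exists:

```scala
import org.apache.spark.sql.SparkSession

object CreateDatabaseIfMissing {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("create-database-if-missing")
      .enableHiveSupport()
      .getOrCreate()

    // Expected to succeed silently whether or not the database already exists
    spark.sql("CREATE DATABASE IF NOT EXISTS my_db")

    spark.stop()
  }
}
```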
05-23-2018
07:41 AM
I have files in one HDFS folder and, after checking a few things, I want to move a file to another directory on HDFS. Currently I am using a FileSystem object with rename(); it does the job, but it is effectively renaming the file with the complete path. Is there any other way to do it? Appreciate your help. Thanks,
Labels:
- Apache Hadoop
- Apache Spark
05-22-2018
11:49 AM
@Felix Albani, any updates on this?
05-18-2018
03:35 PM
Hi, I need to create an empty bucketed Hive table from Spark with the Parquet file format. Currently Spark throws an exception for the conventional CLUSTERED BY syntax. Thanks
Labels:
- Apache Hive
- Apache Spark
05-16-2018
08:16 AM
Thanks for your answer. As I mentioned, there are multiple ways to do it, but I am looking for the best approach from a performance standpoint.
05-15-2018
03:57 PM
Hi folks, currently I have a scenario where I have to get only the latest record per id from a Hive table, based on a timestamp. I am looking for the best approach to do it. My data is in a Hive internal table stored as Parquet files, similar to this, using Spark + Hive. Thanks,
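A sketch of one common approach: a window function that ranks the rows for each id by timestamp and keeps only the newest one. It assumes Spark 2.x with Hive support, and the table and column names (my_table, id, event_ts) are illustrative:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

object LatestRecordPerId {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("latest-record-per-id")
      .enableHiveSupport()
      .getOrCreate()

    val df = spark.table("my_table") // illustrative Hive table name

    // Rank the rows within each id by timestamp, newest first
    val byIdNewestFirst = Window.partitionBy("id").orderBy(col("event_ts").desc)

    // Keep only the top-ranked (latest) row per id
    val latest = df
      .withColumn("rn", row_number().over(byIdNewestFirst))
      .filter(col("rn") === 1)
      .drop("rn")

    latest.show()
    spark.stop()
  }
}
```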
Labels:
- Apache Hive
- Apache Spark