Member since: 06-28-2016
Posts: 34
Kudos Received: 1
Solutions: 0
09-16-2020
03:27 PM
I believe this will fail if you stop your job today and run it tomorrow: now() will resolve to a different day, and you will miss that day's data...
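To make that concrete, here is a minimal Python sketch of the failure mode, under the assumption that the job filters on a date derived from now(); the load_rows_for_date helper and the checkpoint idea are hypothetical, purely for illustration.

    from datetime import date, timedelta

    def run_with_now(load_rows_for_date):
        # Fragile: always processes "yesterday" relative to now().
        # If a daily run is skipped, the day it would have covered is never fetched.
        target = date.today() - timedelta(days=1)
        return load_rows_for_date(target)

    def run_with_checkpoint(load_rows_for_date, last_processed):
        # Safer: walk forward from the last processed date, so a skipped run
        # is caught up on the next execution instead of being lost.
        rows, day = [], last_processed + timedelta(days=1)
        while day < date.today():
            rows.extend(load_rows_for_date(day))
            day += timedelta(days=1)
        return rows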
02-02-2018
10:45 PM
@Biswajit Chakraborty
If you are using the GetFTP processor, then after pulling files the processor adds a getftp.remote.source attribute to the FlowFile. You can use this attribute to build the filename in an UpdateAttribute processor by adding a new property:
filename    ${filename}_${getftp.remote.source}    //add the remote source name to the filename
You can change the expression language to change the filename as follows:
${filename:append(${getftp.remote.source})}    //result 711866091328995HDF04-1
(or)
${filename}${getftp.remote.source}    //result 711866091328995HDF04-1
Example: if the filename value is 711866091328995 and the getftp.remote.source value is HDF04-1, then the output FlowFile from UpdateAttribute will have the filename 711866091328995_HDF04-1, because we are adding the remote source value to the filename with an underscore.
Alternatively, if you are having issues with identical filenames being overwritten, the FlowFile also has an attribute named uuid. Using the UUID (a unique identifier for this FlowFile) as the filename keeps every filename unique, so there are no overwriting issues. Configs:
filename    ${uuid}
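As a side note, if you ever need more control than UpdateAttribute gives you, the same renaming can be scripted in an ExecuteScript processor. This is only a minimal sketch (Jython engine), not part of the original answer, and it assumes the getftp.remote.source attribute is present on the FlowFile.

    # ExecuteScript (Jython): rename the FlowFile using the GetFTP source attribute
    flowFile = session.get()
    if flowFile is not None:
        name = flowFile.getAttribute('filename')
        source = flowFile.getAttribute('getftp.remote.source')
        if source:
            # e.g. 711866091328995 + HDF04-1 -> 711866091328995_HDF04-1
            flowFile = session.putAttribute(flowFile, 'filename', name + '_' + source)
        session.transfer(flowFile, REL_SUCCESS)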
01-15-2018
03:30 AM
@Bala, sorry for the very late response. My purpose is to read some data files (server logs), transform them into a proper format, and prepare a data warehouse (in my case, Hive) for analysis later. So in my project I have three main activities:
1) read and transform data from the txt/log files (for which I am using Spark -- frequency: daily job)
2) prepare a data warehouse with that daily data (for which I am inserting the Spark DataFrames into a Hive table -- frequency: daily job)
3) show the results (for this I am again using Spark SQL along with Hive, as that is faster than using a Hive query alone, and I will use Zeppelin or Tableau for data visualization -- frequency: weekly job or as required)
From my reading and understanding, I guess Spark SQL alone plus caching would be much faster than Spark + Hive, but I think I have no other option, as I have to run the analysis on repository data. Do you suggest any other approach for this use case?
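For what it's worth, here is a minimal PySpark sketch of steps 1 and 2 above, assuming a hypothetical log layout and a Hive table called logs_dw.daily_logs; the path, regex patterns, and table name are placeholders, not part of the original setup.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import regexp_extract, to_date

    # enableHiveSupport so the DataFrame can be written into a Hive table
    spark = (SparkSession.builder
             .appName("daily-log-load")
             .enableHiveSupport()
             .getOrCreate())

    # 1) Read and transform the raw server log (layout assumed for this sketch)
    raw = spark.read.text("/data/incoming/server.log")
    parsed = raw.select(
        regexp_extract("value", r"^(\S+)", 1).alias("host"),
        to_date(regexp_extract("value", r"\[(\d{4}-\d{2}-\d{2})", 1)).alias("event_date"),
        regexp_extract("value", r'"(\w+) ', 1).alias("method"),
    )

    # 2) Append the day's data into the Hive warehouse table
    #    (step 3 then queries this table with Spark SQL / Zeppelin / Tableau)
    parsed.write.mode("append").saveAsTable("logs_dw.daily_logs")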
09-23-2017
05:17 PM
Thanks a lot for your help, you saved my day... thanks again!
06-14-2017
03:38 PM
1 Kudo
Hi, the problem is that your query is syntactically wrong. The right query to achieve your goal is:
select memberid, max(insertdtm)
from finaldata
group by memberid
having datediff(current_date, max(insertdtm)) > 30;
Hope it helps.
08-11-2016
07:50 PM
Please consider publishing an article on this, others will find it useful as it's not an obvious find.
07-11-2016
06:25 PM
@Biswajit Chakraborty The official Hortonworks documentation for deploying HBase clusters is spread out across multiple guides, which makes it difficult to find. A refresh of docs.hortonworks.com, coming soon with a new release of HDP, should correct this problem. For now, you can find some information in these links: HBase Cluster Capacity and Region Sizing, Add HBase RegionServer, and Optimizing HBase I/O. I think you are set with installing HBase, but in case you are not, one way to access the installation steps is to use the links in Using Apache HBase and Apache Phoenix (this information will also be enhanced and moved soon). Let me know if this helps. Thanks for your patience.
07-05-2016
02:14 PM
Cool, but you accepted the wrong answer 🙂