Member since: 06-08-2017
Posts: 1049
Kudos Received: 518
Solutions: 312
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 11230 | 04-15-2020 05:01 PM |
| | 7135 | 10-15-2019 08:12 PM |
| | 3120 | 10-12-2019 08:29 PM |
| | 11505 | 09-21-2019 10:04 AM |
| | 4346 | 09-19-2019 07:11 AM |
10-28-2018
04:29 PM
@Lenu K We can export to a Hive ORC table as follows:
hive> CREATE TABLE <db_name>.<orc_table_name> STORED AS ORC AS SELECT * FROM <db_name>.<hbase_hive_table>;
The above CTAS is a generic statement; you can also create a partitioned table, or use DISTRIBUTE BY / SORT BY to control how the files are created in the directories.
10-28-2018
02:05 PM
@vivek jain Could you make the hbaseConf.set property changes directly in the hbase-site.xml file that Spark uses, instead of setting those property values inside the Spark job, and then run spark-submit with the newly changed hbase-site.xml?
10-28-2018
01:52 PM
@Lenu K One way to avoid full table scans is to use the RowKey in your Hive filter query. If you are filtering on other columns (not only the row key), it would be a lot more efficient to export all the HBase table data into a Hive ORC table and then run all your queries on the exported table. Refer to this and this links for tuning queries against an HBase-Hive table. - If the answer helped to resolve your issue, click on the Accept button below to accept the answer.
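As a minimal sketch of the first option, the query below filters on the column that is mapped to the HBase row key, which is what lets the storage handler avoid a full scan. All names (host, database, table, column, key value) are hypothetical, and PyHive is used here only as a convenient way to submit the HiveQL; beeline or any other Hive client works the same way.

```python
# Sketch only: assumes HiveServer2 is reachable and pyhive is installed.
from pyhive import hive

conn = hive.Connection(host="hiveserver2-host", port=10000, username="etl_user")
cursor = conn.cursor()

# 'row_key' stands for whatever column your Hive table maps to :key
# in hbase.columns.mapping; filtering on it is a keyed lookup, not a scan.
cursor.execute(
    "SELECT * FROM my_db.hbase_hive_table WHERE row_key = 'customer_0001'"
)
for row in cursor.fetchall():
    print(row)
```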
10-24-2018
12:57 PM
@Sandip Dobariya Try these configurations in the ReplaceText processor. With these configs we apply the NiFi Expression Language replace function only on the first line of the extracted content, not on all lines, and the expression removes the spaces in that first line.
Search Value: (?s)(^[^\n]*)(.*$)
Replacement Value: ${'$1':replace(" ","")}$2
Replacement Strategy: RegexReplace
Evaluation Mode: Entire text
Input:
Date,Location,Name,Manager,Division,Revenue,PVNMS OnLine,NMS LHSC,RMS OnLineRMS LHSC
Output:
Date,Location,Name,Manager,Division,Revenue,PVNMSOnLine,NMSLHSC,RMSOnLineRMSLHSC
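For reference, here is a minimal sketch in plain Python (not NiFi) of what the regex and expression above do: group 1 captures the first line, group 2 captures everything after it, and only group 1 has its spaces removed. The second input line is made-up sample data just to show that it is left untouched.

```python
import re

text = ("Date,Location,Name,Manager,Division,Revenue,PVNMS OnLine,NMS LHSC,RMS OnLineRMS LHSC\n"
        "1,NY,Bob,Ann,East,100,1,2,3")  # second line is hypothetical sample data

# Same pattern as the Search Value; the lambda mirrors ${'$1':replace(" ","")}$2.
result = re.sub(r"(?s)(^[^\n]*)(.*$)",
                lambda m: m.group(1).replace(" ", "") + m.group(2),
                text)

print(result)
# Date,Location,Name,Manager,Division,Revenue,PVNMSOnLine,NMSLHSC,RMSOnLineRMSLHSC
# 1,NY,Bob,Ann,East,100,1,2,3
```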
10-24-2018
12:44 PM
1 Kudo
@HENI MAHER
Use the UpdateRecord processor with the concat function on the Date and TIME fields, and then in the Record Writer Avro schema don't mention the original fields. Refer to this link for more details regarding the UpdateRecord processor.
(or)
1. Another way is to extract the attribute values from the content and keep them as flowfile attributes (using ExtractText, EvaluateJsonPath, etc.), then
2. Use the UpdateAttribute processor: add a new property with the value ${Date}${TIME}, and in the Delete Attributes Expression property add your original attribute names.
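As a minimal sketch in plain Python (hypothetical field names and values), this is the record shape both options above aim for: one combined field built from Date and TIME, with the two originals dropped.

```python
record = {"Date": "2018-10-24", "TIME": "12:44:00", "amount": 42}  # made-up sample record

record["DateTime"] = record["Date"] + " " + record["TIME"]  # same idea as concat(Date, TIME)
del record["Date"], record["TIME"]                          # originals are not carried forward

print(record)  # {'amount': 42, 'DateTime': '2018-10-24 12:44:00'}
```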
10-23-2018
11:22 PM
1 Kudo
@Sandip Dobariya If your CSV file size is not huge, you can use one of the ways mentioned in this link. (or) Use a record-oriented processor (ConvertRecord, etc.), which is a more efficient way of doing this task: configure the ConvertRecord processor with CSVReader and CSVRecordSetWriter controller services (Include Header Line set to 'false'), then use a ReplaceText processor to prepend the header line to the CSV file. Refer to this link for more reference.
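Here is a minimal sketch in plain Python (made-up data and header) of what the ConvertRecord + ReplaceText combination does: write the records without the original header line, then prepend the header you actually want.

```python
import csv
import io

original = "old_a,old_b\n1,x\n2,y\n"   # hypothetical input CSV
new_header = "id,code"                 # hypothetical header to prepend

rows = list(csv.reader(io.StringIO(original)))[1:]   # drop the original header
                                                     # (what Include Header Line = false does)
out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
writer.writerow(new_header.split(","))               # prepend the new header (the ReplaceText step)
writer.writerows(rows)

print(out.getvalue())
# id,code
# 1,x
# 2,y
```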
10-23-2018
09:47 PM
@Carlos Cardoso There is an AccessControlException in your shared logs:
org.apache.hadoop.hive.metastore.api.MetaException: org.apache.hadoop.security.AccessControlException: Permission denied: user=nifi, access=EXECUTE, inode="/warehouse/tablespace/managed/hive":hive:hadoop:drwx------
Make sure the nifi user has appropriate permissions on the "/warehouse/tablespace/managed/hive" directory, then try to ingest data into the table again. - If the answer helped to resolve your issue, click on the Accept button below to accept the answer; that would be a great help to community users looking for a solution to this kind of issue.
10-23-2018
09:41 PM
1 Kudo
@NARENDRA KALLI Try escaping with double backslashes (\\):
$ hadoop fs -rm -r /hdfspath/test/\\$\\{\\db\\}
10-20-2018
04:34 PM
@Carlton Patterson
This is not possible with the default save/csv/json functions, but using the Hadoop FileSystem API we can rename the file. Example:
>>> df = spark.sql("select int(1) id, string('ll') name")   # create a dataframe
>>> df.coalesce(1).write.mode("overwrite").csv("/user/shu/test/temp_dir")   # write the df to a temp dir
>>> from py4j.java_gateway import java_import
>>> java_import(spark._jvm, 'org.apache.hadoop.fs.Path')
>>> fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())
>>> file = fs.globStatus(spark._jvm.Path('/user/shu/test/temp_dir/part*'))[0].getPath().getName()   # get the part filename in temp_dir
>>> fs.rename(spark._jvm.Path('/user/shu/test/temp_dir/' + file), spark._jvm.Path('/user/shu/test/mydata.csv'))   # rename it to the desired directory path and filename
>>> fs.delete(spark._jvm.Path('/user/shu/test/temp_dir'), True)   # delete the temp directory
- If the answer helped to resolve your issue, click on the Accept button below to accept the answer; that would be a great help to community users looking for a solution to this kind of issue.
10-19-2018
01:56 AM
@Nisha Patel If you look into the configs of the PublishKafkaRecord processor, there are Record Reader/Writer controller services, so if your Record Writer is CSVRecordSetWriter then you have configured the Include Header Line property value as true, i.e. you are writing the header on each record. When you then use the MergeContent processor, you are going to have a header line included for each record.
To resolve this issue, change the Include Header Line value to false (so we are no longer writing a header on each record) and then in the MergeContent processor set the Header property value to your header. This way, after merging completes, the processor adds the header to the file.
Is there a specific reason why you are using the PublishKafkaRecord processor? You could use the PublishKafka processor instead (since you are splitting into individual records, there is no need for record-oriented processors in this case unless you have some valid reason), which doesn't require any Record Reader/Writer controller services; the message that we publish into the Kafka topic will be routed to the success relationship. Then use the MergeContent processor to merge all these flowfiles into one and add the header to the merged file.
Flow: replace the PublishKafkaRecord processor, so the flow becomes PublishKafka processor -> MergeContent processor.