Member since: 06-08-2017
Posts: 1049
Kudos Received: 518
Solutions: 312

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 11220 | 04-15-2020 05:01 PM |
| | 7124 | 10-15-2019 08:12 PM |
| | 3107 | 10-12-2019 08:29 PM |
| | 11475 | 09-21-2019 10:04 AM |
| | 4336 | 09-19-2019 07:11 AM |
04-28-2019 02:37 AM
1 Kudo
@Rohit Bhattacharya Try \[EUtranCellRelation\].*\.csv or \[EUtranCellFDD\].*\.csv as the File Filter Regex in the GetSFTP processor, or use a RouteOnAttribute processor and filter out the files by matching filenames.

Flow:
1. GetSFTP processor
2. RouteOnAttribute // match files with NiFi Expression Language

In RouteOnAttribute, add two properties using NiFi Expression Language, for example ${filename:startsWith('[EUtranCellRelation]')} or ${filename:contains('[EUtranCellRelation]')} (these functions compare literal strings, so the brackets need no escaping there). Matching files are then routed to the relationships named after those properties. For illustration, a minimal sketch of the regex matching follows. Refer to this and this for more details.
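Here is a minimal Scala sketch (the filenames are hypothetical) of which names that File Filter Regex accepts:

```scala
// Test the GetSFTP File Filter Regex against some made-up filenames.
object FilterRegexDemo extends App {
  val pattern = """\[EUtranCellRelation\].*\.csv""".r

  val filenames = Seq(
    "[EUtranCellRelation]_20190428.csv", // matches
    "[EUtranCellFDD]_20190428.csv",      // different prefix: no match
    "[EUtranCellRelation]_notes.txt"     // wrong extension: no match
  )

  for (name <- filenames) {
    val matched = pattern.pattern.matcher(name).matches()
    println(s"$name -> $matched")
  }
}
```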
04-26-2019 09:53 PM
@Bala S It seems the NiFi user does not have access to run the commands; make sure you have granted execute permissions to the NiFi user (e.g. chmod +x on the script, with the script readable by the user that runs NiFi). Refer to this and this for similar issues.
04-26-2019 01:34 AM
@Denis Sokol Here are my thoughts on the options from Hortonworks.

Using Hive transactional tables:
1. If you are getting a full dump every time, you can try the Hive MERGE functionality (Hortonworks only), which makes the data available in under a minute (depending on how much data is scanned, cluster resources, etc.). A hedged sketch of the MERGE pattern is at the end of this answer.

Using HBase:
2. If you only care about the latest version of each record, HBase can handle all the updates (but scanning on a non-row-key column will not perform well); use Phoenix on top of HBase to get SQL over the NoSQL table.

Both approaches will serve for updating existing data while keeping only the latest version of each record. Refer to this and this for more details about these approaches.

Using Druid: refer to this link.

It would be great if you could comment on which approach performed better, or which one you chose for this case 🙂
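As a hedged sketch: on HDP 2.6+ the Hive MERGE pattern looks roughly like the statement below. The table and column names are hypothetical, and the target must be a transactional (ACID) ORC table; the statement would be submitted via beeline/Hive, and is only assembled as a string here:

```scala
// Hypothetical Hive MERGE: upsert a staging dump into the target table.
val mergeSql =
  """MERGE INTO dim_customer AS t
    |USING staging_customer AS s
    |ON t.customer_id = s.customer_id
    |WHEN MATCHED THEN UPDATE SET name = s.name, email = s.email
    |WHEN NOT MATCHED THEN INSERT VALUES (s.customer_id, s.name, s.email)""".stripMargin
println(mergeSql)
```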
04-20-2019 09:04 PM
@James Fowler We need to use a ReplaceText processor after the GenerateTableFetch processor and replace select * with the column names, adding an alias for the field with the special character: select col1, UCR_COST_IN_$ as UCR_COST_IN from table
04-19-2019 01:58 PM
@James Fowler In the ExecuteSQL processor, set the Normalize Table/Column Names property to true, or in your select query add an alias without the special character to each field name that has one (e.g. select UCR_COST_IN_$ as UCR_COST_IN from table).
04-19-2019 01:56 AM
@Barath Natarajan Check how many executors and how much memory the spark-sql CLI was initialized with (it seems to be running in local mode with one executor). To debug the query, run an explain plan on it. Also check how many files each table has in its HDFS directory; if there are too many, consolidate them into a smaller number of files. Another approach would be:
-> Run spark-shell (or) pyspark in local/yarn-client mode with more executors and more memory
-> Load the tables into DataFrames and register them with registerTempTable (Spark 1.x) / createOrReplaceTempView (Spark 2)
-> Run your join using spark.sql("<join query>")
-> Check the performance of the query (a sketch of this flow follows).
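A minimal sketch of that approach (Spark 2.x syntax; the database, table, and column names are hypothetical). In spark-shell the `spark` session already exists, so the builder lines can be skipped there:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("join-debug")
  .enableHiveSupport()
  .getOrCreate()

// Load the Hive tables and register temp views
// (registerTempTable in Spark 1.x, createOrReplaceTempView in Spark 2).
spark.table("db.orders").createOrReplaceTempView("orders")
spark.table("db.customers").createOrReplaceTempView("customers")

val joined = spark.sql(
  """SELECT o.*, c.name
    |FROM orders o
    |JOIN customers c ON o.customer_id = c.customer_id""".stripMargin)

joined.explain(true) // inspect the plan before judging performance
joined.show(10)
```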
04-19-2019 01:28 AM
@Jeff Watson Could you try using the GetHDFSFileInfo processor? It accepts incoming connections and takes regexes to match only the required directories/files and to exclude files.
04-19-2019 12:53 AM
1 Kudo
@Mahendiran Palani Samy Try .option instead of hc.setConf. Example (Scala):
dataframe.write
  .format("parquet")
  .option("compression", "snappy")
  .saveAsTable("<table_name>")
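Note on the difference: hc.setConf sets a session-wide default (e.g. spark.sql.parquet.compression.codec), whereas .option("compression", "snappy") applies only to that particular write, which is usually what you want here.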
04-16-2019 12:42 AM
@Karthik Gullapalli You can use an ExtractText processor to extract the elements of the array, then create the final JSON using UpdateAttribute and ReplaceText processors.

Flow:
1. ExtractText // add a new property with a regex to extract a,1
2. ReplaceText // use Always Replace as the Replacement Strategy and NiFi Expression Language to prepare the JSON

Or use the approach in section 2.2 of this article to iterate through the array elements and build the output flowfile as JSON with a ReplaceText processor and NiFi Expression Language. A small sketch of the idea follows.
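To make the idea concrete, here is a plain Scala sketch (the sample input and regex are hypothetical) of what the ExtractText regex plus the ReplaceText JSON assembly accomplish:

```scala
// Pull "a,1"-style elements out of an array string and emit them as JSON.
val input = """["a,1","b,2","c,3"]"""

// A regex comparable to an ExtractText property: capture key and value.
val element = """"([a-z]+),(\d+)"""".r

val pairs = element.findAllMatchIn(input).map { m =>
  s""""${m.group(1)}": ${m.group(2)}"""
}.mkString(", ")

println(s"{ $pairs }") // => { "a": 1, "b": 2, "c": 3 }
```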
04-09-2019 03:16 AM
1 Kudo
@Kevin Lahey Not sure whether you are running a NiFi cluster or not, but could you try running the ListS3 processor on the Primary Node only? Per the documentation, this processor is intended to run only on the primary node.