Member since: 06-08-2017
Posts: 1049
Kudos Received: 517
Solutions: 312
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 9891 | 04-15-2020 05:01 PM |
 | 5929 | 10-15-2019 08:12 PM |
 | 2413 | 10-12-2019 08:29 PM |
 | 9568 | 09-21-2019 10:04 AM |
 | 3505 | 09-19-2019 07:11 AM |
07-04-2019
03:19 AM
@David Sargrad You can compute a date variable (sysdate, current_date, etc.) in your Spark job and then use that variable to read the corresponding directory from HDFS dynamically.
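A minimal sketch of that approach from spark-shell, assuming an illustrative daily directory layout like /data/events/<yyyy-MM-dd> with JSON files (adjust the path pattern and format to your layout):
scala> import java.time.LocalDate
scala> import java.time.format.DateTimeFormatter
scala> val today = LocalDate.now().format(DateTimeFormatter.ofPattern("yyyy-MM-dd"))   // build the date string for today
scala> val inputPath = s"/data/events/$today"                                          // illustrative directory layout
scala> val df = spark.read.json(inputPath)                                             // or csv/parquet, depending on the files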
07-04-2019
03:09 AM
@Matt Field From the NiFi ExtractText docs: "The first capture group, if any found, will be placed into that attribute name. But all capture groups, including the matching string sequence itself, will also be provided at that attribute name with an index value provided." This is expected behaviour from NiFi: because your regular expression contains a capture group, the ExtractText processor adds an index value to the attribute name. For consistency, reference the attribute value as ${myattribute}, without an index. - If the answer helps resolve the issue, log in and click the Accept button below to close this thread. This will help other community users find answers quickly 🙂
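For example (illustrative attribute name), with a dynamic property named myattribute whose regex contains one capture group, ExtractText typically writes myattribute (the first capture group), myattribute.0 (the entire match), and myattribute.1 (the capture group), so ${myattribute} and ${myattribute.1} resolve to the same value.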
07-03-2019
06:18 AM
1 Kudo
@David Sargrad
Option 1: NiFi. You can use the MergeContent processor to create bigger files (using the min/max group size properties, e.g. 1 GB) and then store those files in the HDFS directory.
Option 2: Hive. If you have structured JSON files, create a Hive table on top of them and run insert overwrite <same_hive_table> select * from <same_hive_table>; With this method Hive takes an exclusive lock on the directory until the overwrite completes.
Option 3: Hadoop streaming jar. Store all files in one daily directory and then run the merge as a daily job (at midnight, etc.) using hadoop-streaming.jar as described in this link.
Option 4: Hive using ORC files. If you plan to store the files as ORC, convert the JSON data to ORC format; you can then use ORC's concatenate feature to create a big file by merging the small ORC files (see the sketch after this list).
Option 5: Hive transactional tables. With Hive transactional tables you can insert data using the PutHiveStreaming processor (convert the JSON data to Avro and feed it to PutHiveStreaming); based on the buckets you have created in the transactional table, Hive will store all your data in those buckets (that many files in HDFS). -> If you read this data from Spark, make sure your Spark version is able to read Hive transactional tables.
- If you find any other efficient way to do this task, please share the method so we can all learn from your experience 🙂
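A minimal sketch of the conversion step in Option 4 from spark-shell, assuming the small JSON files sit under an illustrative input path /data/incoming/json and the ORC output goes to an illustrative warehouse path:
scala> val json = spark.read.json("/data/incoming/json")              // read the many small JSON files
scala> json.write.mode("append").orc("/data/warehouse/events_orc")    // write them back out in ORC format
With a Hive ORC table defined over that location, ALTER TABLE <orc_table> CONCATENATE (run in Hive; specify the partition for a partitioned table) then merges the small ORC files into larger ones.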
07-03-2019
05:23 AM
@Amod Acharya Create a shell script that truncates your Hive ORC table and then runs a Sqoop import into it. Sample shell scripts:
1. For managed (internal) tables:
bash$ cat sq.sh
#truncate the hive table first
hive -e "truncate table default.my_table_orc"
#sqoop import into the Hive table
sqoop import --connect jdbc:mysql://localhost:3306/<db_name> --username root --password "<password>" --table <table_name> --hcatalog-database <hive_database> --hcatalog-table <hive_table_name> --hcatalog-storage-stanza "stored as orcfile"
2. For external tables:
bash$ cat sq.sh
#drop the hive table first
hive -e "drop table default.my_table_orc"
#sqoop import into the Hive table
sqoop import --connect jdbc:mysql://localhost:3306/<db_name> --username root --password "<password>" --table order_items --hcatalog-database <hive_database> --hcatalog-table <hive_table_name> --create-hcatalog-table --hcatalog-storage-stanza "stored as orcfile"
07-01-2019
02:19 PM
@DADA206 If you want to count all duplicated rows, you can use one of these methods.
1. Using the dropDuplicates function:
scala> val df1=Seq((1,"q"),(2,"c"),(3,"d"),(1,"q"),(2,"c"),(3,"e")).toDF("id","n")
scala> println("duplicated counts:" + (df1.count - df1.dropDuplicates.count))
duplicated counts:2
There are 2 duplicated rows in the dataframe, which means 4 rows in total are involved in the duplication.
2. Using groupBy on all columns:
scala> import org.apache.spark.sql.functions._
scala> val cols=df1.columns
scala> df1.groupBy(cols.head,cols.tail:_*).agg(count("*").alias("cnt")).filter('cnt > 1).select(sum("cnt")).show()
+--------+
|sum(cnt)|
+--------+
|       4|
+--------+
3. Using window functions:
scala> import org.apache.spark.sql.expressions.Window
scala> val wdw=Window.partitionBy(cols.head,cols.tail:_*)
wdw: org.apache.spark.sql.expressions.WindowSpec = org.apache.spark.sql.expressions.WindowSpec@1f6df7ac
scala> df1.withColumn("cnt",count("*").over(wdw)).filter('cnt > 1).count()
res80: Long = 4
07-01-2019
03:50 AM
@hem lohani --> Permission denied: please check the permissions on the HDFS directory. --> Could you share the error logs for the error you are getting?
07-01-2019
03:38 AM
1 Kudo
@DADA206 --> You can use a case statement (when/otherwise) to assign 1 to every row where col2 matches col4, and 0 otherwise. --> Then aggregate on the new column and sum it. Example:
scala> import org.apache.spark.sql.functions._
scala> val df=Seq((1,3,999,4,999),(2,2,888,5,888),(3,1,777,6,777)).toDF("id","col1","col2","col3","col4")
scala> df.withColumn("cw",when('col2 === 'col4,1).otherwise(0)).agg(sum('cw) as "su").show()
+---+
| su|
+---+
|  3|
+---+
- All 3 rows have matching values, so the sum is 3 in this case.
- If the answer helps resolve the issue, log in and click the Accept button below to close this thread. This will help other community users find answers quickly 🙂
06-30-2019
06:03 PM
@hem lohani Create the column using the withColumn function with a literal value of 12. Use the month column as the partition column and write into the table with insertInto:
df.withColumn("month",lit(12)).write.mode("<append or overwrite>").partitionBy("month").insertInto("<hive_table_name>")
(or) Using a SQL query:
df.createOrReplaceTempView("temp_table")
spark.sql("insert into <partition_table> partition(`month`=12) select * from temp_table")
- If the answer helps resolve the issue, log in and click the Accept button below to close this thread. This will help other community users find answers quickly 🙂
06-30-2019
02:58 AM
@hem lohani Try the syntax below:
df.write.mode("<append or overwrite>").partitionBy("<partition_cols>").insertInto("<hive_table_name>")
06-28-2019
09:32 AM
@Khaja Mohd Use the GetFile processor, as it lets you pick up only files that match specific patterns. Refer to this link for configuring the GetFile processor with File Filter patterns.
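For example (illustrative pattern), setting the GetFile File Filter property to a regex such as sales_.*\.csv would make the processor pick up only files whose names start with sales_ and end in .csv.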