Member since
06-08-2017
1049
Posts
517
Kudos Received
312
Solutions
My Accepted Solutions
Title | Views | Posted |
---|---|---|
9904 | 04-15-2020 05:01 PM | |
5936 | 10-15-2019 08:12 PM | |
2414 | 10-12-2019 08:29 PM | |
9580 | 09-21-2019 10:04 AM | |
3506 | 09-19-2019 07:11 AM |
07-04-2019
03:19 AM
@David Sargrad You can have sysdate,current_date..etc variable in your spark job then use that variable to read the directory from HDFS dynamically.
... View more
07-04-2019
03:09 AM
@Matt Field From NiFi ExtractText docs: The first capture group, if any found, will be placed into that attribute name.But all capture groups, including the matching string sequence itself will also be provided at that attribute name with an index value provided. This is an expected behaviour from NiFi as you are having capture group in your regular expression, so extract text processor adds index value to attribute name. For consistency use ${myattribute} without index value as the reference for the attribute value. - If the answer is helpful to resolve the issue, Login and Click on Accept button below to close this thread.This will help other community users to find answers quickly 🙂
... View more
07-03-2019
06:18 AM
1 Kudo
@David Sargrad Option1:NiFi You can also think of using MergeContent processor to create bigger files(by using min,max group size properties 1GB..etc) and then store these files into HDFS directory. - Option2:Hive If you have structured json files then create hive table on top of this files and run insert overwrite <same_hive_table> select * from <same_hive_table>; By using this method hive will create exclusive lock on the directory until the overwrite will be completed. - Option3:Hadoop Streaming jar: Store all files into one daily directory and then run merge as a daily job at midnight..etc by using hadoop-streaming.jar as described in this link. - Option4:Hive Using ORC files: If you are thinking to store the files as orc files then convert json data to orc format then you can use concatenate feature of ORC to create big file by merging small orc files. - Option5:Hive Transactional tables: By using hive transactional tables we can insert data using PutHiveStreaming(convert json data to avro and feed it to puthivestreaming ) processor and based on buckets we have created in hive transactional table, Hive will store all your data into these buckets(those many files in HDFS). -> If you are reading this data from Spark then make sure your spark is able to read Hive Transactional tables. - If you found any other efficient way to do this task, please mention the method so that we will learn based on your experience.. 🙂
... View more
07-03-2019
05:23 AM
@Amod Acharya Create a shell script, that will truncate your hive orc table and then sqoop import to hive orc table. Sample shell script:- 1.For Managed(internal) Tables: bash$ cat sq.sh
#truncate the hive table first
hive -e "truncate table default.my_table_orc"
#sqoop import into Hive table
sqoop import --connect jdbc:mysql://localhost:3306/<db_name> --username root --password "<password>" --table <table_name> --hcatalog-database <hive_database> --hcatalog-table <hive_table_name> --hcatalog-storage-stanza "stored as orcfile" 2.For External Tables: bash$ cat sq.sh
#drop the hive table first
hive -e "drop table default.my_table_orc"
#sqoop import into Hive table
sqoop import --connect jdbc:mysql://localhost:3306/<db_name> --username root --password "<password>" --table order_items --hcatalog-database <hive_database> --hcatalog-table <hive_table_name> --create-hcatalog-table --hcatalog-storage-stanza "stored as orcfile"
... View more
07-01-2019
02:19 PM
@DADA206 If you are thinking to count all duplicated rows you can use one of these methods. 1.Using dropDuplicates function: scala> val df1=Seq((1,"q"),(2,"c"),(3,"d"),(1,"q"),(2,"c"),(3,"e")).toDF("id","n")
scala> println("duplicated counts:" + (df1.count - df1.dropDuplicates.count))
duplicated counts:2 There are 2 duplicated rows in the dataframe it means in total there are 4 rows duplicated. 2.Using groupBy on all columns: scala> import org.apache.spark.sql.functions._
scala> val cols=df1.columns
scala> df1.groupBy(cols.head,cols.tail:_*).agg(count("*").alias("cnt")).filter('cnt > 1).select(sum("cnt")).show()
+--------+
|sum(cnt)|
+--------+
| 4|
+--------+ 3.Using window functions: scala> import org.apache.spark.sql.expressions.Window
scala> val wdw=Window.partitionBy(cols.head,cols.tail:_*)
wdw: org.apache.spark.sql.expressions.WindowSpec = org.apache.spark.sql.expressions.WindowSpec@1f6df7ac
scala> df1.withColumn("cnt",count("*").over(wdw)).filter('cnt > 1).count()
res80: Long = 4
... View more
07-01-2019
03:50 AM
@hem lohani --> Permission Denied - Please check the permission on the HDFS directory! --> Could you share the error logs, of the error that you are getting?
... View more
07-01-2019
03:38 AM
1 Kudo
@DADA206 --> You can try using case statement and then assign 1 value for all the rows that matches col2,col4 otherwise assign 0. --> Then aggregate on the new column and sum the new column. Example: scala> val df=Seq((1,3,999,4,999),(2,2,888,5,888),(3,1,777,6,777)).toDF("id","col1","col2","col3","col4")
scala> df.withColumn("cw",when('col2 === 'col4,1).otherwise(0)).agg(sum('cw) as "su").show()
+---+
| su|
+---+
| 3|
+---+ - As you are having 3 rows same values and the count is 3 in this case. - If the answer is helpful to resolve the issue, Login and Click on Accept button below to close this thread.This will help other community users to find answers quickly :-).
... View more
06-30-2019
06:03 PM
@hem lohani Create column using withColumn function with literal value as 12. Use month column as partitionby column and use insertInto table. df.withColumn("month",lit(12)).write.mode("<append or overwrite>").partitionBy("month").insertInto("<hive_table_name>") (or) Using SQL query df.createOrReplaceTempView("temp_table")
spark.sql("insert into <partition_table> partition(`month`=12) select * from <temp_table>") - If the answer is helpful to resolve the issue, Login and Click on Accept button below to close this thread.This will help other community users to find answers quickly 🙂
... View more
06-30-2019
02:58 AM
@hem lohani Try with below syntax df.write.mode("<append or overwrite>").partitionBy("<partition_cols>").insertInto("<hive_table_name>")
... View more
06-28-2019
09:32 AM
@Khaja Mohd Use GetFile processor as this processor allows to keep specific patterns. Refer this link for configuring GetFile processor to use with File Filter patterns.
... View more