Member since: 06-08-2017
Posts: 1049
Kudos Received: 518
Solutions: 312
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 9981 | 04-15-2020 05:01 PM
 | 5994 | 10-15-2019 08:12 PM
 | 2444 | 10-12-2019 08:29 PM
 | 9666 | 09-21-2019 10:04 AM
 | 3540 | 09-19-2019 07:11 AM
06-28-2019
03:45 AM
@manohar ghanta

Option-1: You can call .coalesce(n) on your dataframe (no shuffle will happen) and then use .option("maxRecordsPerFile", n) to control the number of records written to each file.

Option-2: Set spark.sql.shuffle.partitions=n to control the number of shuffle partitions. Then df.sort("<col_name>").write... will create exactly the number of files specified by shuffle.partitions, because the sort forces a shuffle.

Option-3: Hive: Once the Spark job is done, trigger a Hive insert overwrite job that selects from the same table, use sort by / distribute by / cluster by, and set all the Hive configurations you mentioned in the question:

insert overwrite table <table_name> select * from <table_name> distribute by <col2> sort by <col1>;

Option-4: Hive: If you have an ORC table, schedule a concatenate job to run periodically:

alter table <table_name> concatenate;
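As a rough spark-shell sketch of Option-1 and Option-2 (the dataframe name df, the output path /tmp/output, and the values 10 and 100000 below are illustrative assumptions, not values from the question):

// Option-1: coalesce to n output partitions (no shuffle) and cap records per file
df.coalesce(10)
  .write
  .option("maxRecordsPerFile", 100000L)
  .mode("overwrite")
  .orc("/tmp/output")

// Option-2: control the shuffle partition count, then force a shuffle with sort
spark.conf.set("spark.sql.shuffle.partitions", "10")
df.sort("<col_name>")
  .write
  .mode("overwrite")
  .orc("/tmp/output")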
If none of these methods seems feasible, then .repartition(n) will be the way to go; it adds extra overhead, but we end up with ~evenly sized files in HDFS, which boosts performance when reading these files from Hive/Spark.

If the answer is helpful to resolve the issue, log in and click on the Accept button below to close this thread. This will help other community users to find answers quickly 🙂
06-28-2019
03:20 AM
@Piotr Grzegorski Try using ListHDFS + FetchHDFS processors. You can simulate the MoveHDFS processor with the below flow (see the RouteOnAttribute filter example after this post):

ListHDFS //list all the files in the HDFS directory
RouteOnAttribute //use NiFi expression language to filter out the required files
FetchHDFS //fetch the files from HDFS
PutHDFS //put the files into the target HDFS directory
DeleteHDFS //delete the files from the source HDFS directory that were pulled by FetchHDFS

If the answer is helpful to resolve the issue, log in and click on the Accept button below to close this thread. This will help other community users to find answers quickly 🙂
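For the RouteOnAttribute step in the flow above, a minimal sketch of the filter is a single dynamic property whose value is a NiFi expression; the property name (matched_files) and the .csv suffix are just illustrative assumptions:

matched_files    ${filename:endsWith('.csv')}

Connect the matched_files relationship to FetchHDFS and route or auto-terminate the unmatched relationship.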
06-26-2019
07:29 PM
1 Kudo
Try with the below JOLT spec:

[{
"operation": "shift",
"spec": {
"id": "ID",
"nummer": "Nummer",
"table": {
"*": {
"zn": "ArtikelPreise_Pos.[#2].ZeileNr",
"stfflbisart": "ArtikelPreise_Pos.[#2].StaffelBis"
}
}
}
}, {
"operation": "default",
"spec": {
"Default_Kopf": "${VAR_KD}",
"ArtikelPreise_Pos[]": {
"*": {
"Default_Kopf": "${DFT_POS}"
}
}
}
}
]

Output:

{
"ID" : "177",
"Nummer" : "22",
"ArtikelPreise_Pos" : [ {
"ZeileNr" : 1,
"StaffelBis" : 10,
"Default_Kopf" : "${DFT_POS}"
}, {
"ZeileNr" : 2,
"StaffelBis" : 50,
"Default_Kopf" : "${DFT_POS}"
} ],
"Default_Kopf" : "${VAR_KD}"
}

I hope this matches your expected output.

If the answer is helpful to resolve the issue, log in and click on the Accept button below to close this thread. This will help other community users to find answers quickly 🙂
06-26-2019
04:36 PM
@srini This is a Hive bug, reported in HIVE-2927. Possible workarounds would be:

Hive >= 1.3: use the replace function in Hive together with get_json_object.

Example:

hive> select get_json_object(replace(jsn,'@',''),"$.date") from (select string('{"name":"jai","@date":"2015-06-15"}')jsn)t;
2015-06-15

(or)

hive> select get_json_object(replace(jsn,'@date','date'),"$.date") from (select string('{"name":"jai","@date":"2015-06-15"}')jsn)t;
2015-06-15

Hive < 1.3: use the regexp_replace function:

hive> select get_json_object(regexp_replace(jsn,'@',''),"$.date") from (select string('{"name":"jai","@date":"2015-06-15"}')jsn)t;
2015-06-15

(or)

hive> select get_json_object(regexp_replace(jsn,'@date','date'),"$.date") from (select string('{"name":"jai","@date":"2015-06-15"}')jsn)t;
2015-06-15

The final query would be:

hive> select get_json_object(jsn,'$.name'),get_json_object(regexp_replace(jsn,'@',''),"$.date") from (select string('{"name":"jai","@date":"2015-06-15"}')jsn)t;
jai	2015-06-15

If the answer is helpful to resolve the issue, log in and click on the Accept button below to close this thread. This will help other community users to find answers quickly 🙂
06-25-2019
06:02 PM
@Jim Barnett I'm not able to recreate the same scenario on my end (using Hive 1.2.1). I'm wondering how count(distinct key) returned just one row of count as the result while in your case the query select key,count(*) from default.dummy_data group by key; is giving those results.

Check whether any data already exists in the HDFS directory the table is pointing to. If yes, clear the data from the HDFS directory and recreate the table.

Example:

hive> CREATE TABLE default.dummy_data
    > AS
    > SELECT row_nbr as key
    > FROM (
    >   SELECT row_number() OVER (partition by '1') as row_nbr
    >   FROM (
    >     select explode(split(repeat("x,", 1000000-1), ",")) -- 1,000,000 rows
    >   ) AS x
    > ) AS y;

hive> select count(distinct key) c from default.dummy_data;
1000000

hive> select key,count(*)cnt from default.dummy_data group by key order by cnt desc limit 10; -- ordering desc and limiting to 10

Result:

key	cnt
10 1
9 1
1000000 1
7 1
6 1
5 1
4 1
3 1
999999 1
1 1
06-25-2019
05:51 PM
1 Kudo
@Rupak Dum

1. Do I need to upload the .sh script in a folder within HDFS?
No need to upload the .sh script to HDFS. If you do upload the script to HDFS, then follow this link to execute a shell script from HDFS.

2. How do I set up the permissions for the script so that it runs successfully?
You are running the sqoop import as the root user; in this case you need to change the permissions in HDFS for the /user directory. Refer to this and this link for similar threads.

3. How do I execute the .sh file from within HDFS so that I do not get the permission denied error?
Change the permissions of the /user HDFS directory to 700 (or) 777, and then you won't get any permission issues.
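For reference, a minimal sketch of that permissions change from the command line (run as the HDFS superuser; whether you pick 700 or 777 depends on how locked down you want /user to be):

hdfs dfs -chmod 777 /user      # open up the /user directory
hdfs dfs -ls /                 # verify the new permissions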
06-25-2019
02:00 PM
1 Kudo
@Juhyeon Yun Configure Directory as /user/paxata/job and set Recurse Subdirectories to true; this will list all the files in the subdirectories. A similar question was reported in this HCC thread.
06-20-2019
08:37 PM
@Amrutha K This is a known issue in Spark, reported in SPARK-24260 and not yet resolved. One way of working around it is to execute one query at a time, i.e. after reading the .hql file we can access the array of statements by index (0), (1), ...:

val df1 = spark.sql(sc.textFile("/user/temp/hive.hql").collect().mkString.split(";")(0))
val df2 = spark.sql(sc.textFile("/user/temp/hive.hql").collect().mkString.split(";")(1))

(or) If you just want to execute the queries and see the results on the console, try this approach:

sc.textFile("/user/temp/hive.hql").collect().mkString.split(";").map(x => spark.sql(x).show())

Now we are executing all the queries in the .hql script and displaying the results on the console.
06-19-2019
01:52 AM
@Jayashree S Use a RouteOnAttribute processor after the ListS3 processor, filter only the required file, and pass it to FetchS3Object.

Flow:
ListS3
RouteOnAttribute
FetchS3Object

(or) If you want to pull the same file from S3 every time, then you can use a flow like:

GenerateFlowFile //schedule this processor as per your requirements
FetchS3Object //configure the full S3 object key
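As a rough configuration sketch for the second flow (the bucket name, object key, and schedule below are purely illustrative assumptions):

GenerateFlowFile
  Run Schedule: 1 hour //timer driven, adjust to your requirement
FetchS3Object
  Bucket: my-bucket
  Object Key: data/input/latest.csv
  Region and AWS credentials: as appropriate for your environment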
06-19-2019
01:21 AM
@Bill Miller Try a series of SplitRecord processors to create smaller chunks of files (a rough configuration sketch is shown below). Follow the similar approach mentioned in this thread and see whether you get any performance improvement with it.
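A minimal sketch of such a series (the record counts and the CSV reader/writer services are example assumptions; match them to your data format):

SplitRecord #1
  Record Reader: CSVReader
  Record Writer: CSVRecordSetWriter
  Records Per Split: 100000
SplitRecord #2
  Record Reader: CSVReader
  Record Writer: CSVRecordSetWriter
  Records Per Split: 10000

Each stage splits the incoming flowfile into progressively smaller record chunks before downstream processing.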