Member since: 06-08-2017
Posts: 1049
Kudos Received: 518
Solutions: 312
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 9981 | 04-15-2020 05:01 PM
 | 5994 | 10-15-2019 08:12 PM
 | 2444 | 10-12-2019 08:29 PM
 | 9666 | 09-21-2019 10:04 AM
 | 3540 | 09-19-2019 07:11 AM
06-28-2019
03:45 AM
@manohar ghanta

Option-1: You can call .coalesce(n) on your dataframe (no shuffle will happen) and then use .option("maxRecordsPerFile", n) to control the number of records written to each file.

Option-2: Set spark.sql.shuffle.partitions=n to control the number of shuffle partitions. Then df.sort("<col_name>").write... will create exactly the number of files specified by shuffle.partitions, because the sort forces a shuffle.

Option-3: Hive: Once the Spark job is done, trigger a Hive insert overwrite job that selects from the same table, use sort by / distribute by / cluster by, and set all the Hive configurations you mentioned in the question:

insert overwrite table <table_name> select * from <table_name> distribute by <col2> sort by <col1>;

Option-4: Hive: If you have an ORC table, schedule a concatenate job to run periodically:

alter table <table_name> concatenate;
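As a rough spark-shell sketch of Option-1 and Option-2 (the dataframe name df, the output path /tmp/output, and the values 10 and 100000 below are illustrative assumptions, not values from the question):

// Option-1: coalesce to n output partitions (no shuffle) and cap records per file
df.coalesce(10)
  .write
  .option("maxRecordsPerFile", 100000L)
  .mode("overwrite")
  .orc("/tmp/output")

// Option-2: control the shuffle partition count, then force a shuffle with sort
spark.conf.set("spark.sql.shuffle.partitions", "10")
df.sort("<col_name>")
  .write
  .mode("overwrite")
  .orc("/tmp/output")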
If none of these methods seems feasible, then .repartition(n) will be the way to go; it adds extra overhead, but we end up with ~evenly sized files in HDFS, which boosts performance when reading these files from Hive/Spark.

If the answer is helpful to resolve the issue, log in and click on the Accept button below to close this thread. This will help other community users to find answers quickly 🙂
06-28-2019
03:20 AM
@Piotr Grzegorski Try using ListHDFS + FetchHDFS processors. You can simulate the MoveHDFS processor with the below flow (see the RouteOnAttribute filter example after this post):

ListHDFS //list all the files in the HDFS directory
RouteOnAttribute //use NiFi expression language to filter out the required files
FetchHDFS //fetch the files from HDFS
PutHDFS //put the files into the target HDFS directory
DeleteHDFS //delete the files from the source HDFS directory that were pulled by FetchHDFS

If the answer is helpful to resolve the issue, log in and click on the Accept button below to close this thread. This will help other community users to find answers quickly 🙂
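For the RouteOnAttribute step in the flow above, a minimal sketch of the filter is a single dynamic property whose value is a NiFi expression; the property name (matched_files) and the .csv suffix are just illustrative assumptions:

matched_files    ${filename:endsWith('.csv')}

Connect the matched_files relationship to FetchHDFS and route or auto-terminate the unmatched relationship.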
06-26-2019
07:29 PM
1 Kudo
Try with the below JOLT spec:

[{
"operation": "shift",
"spec": {
"id": "ID",
"nummer": "Nummer",
"table": {
"*": {
"zn": "ArtikelPreise_Pos.[#2].ZeileNr",
"stfflbisart": "ArtikelPreise_Pos.[#2].StaffelBis"
}
}
}
}, {
"operation": "default",
"spec": {
"Default_Kopf": "${VAR_KD}",
"ArtikelPreise_Pos[]": {
"*": {
"Default_Kopf": "${DFT_POS}"
}
}
}
}
]

Output:

{
"ID" : "177",
"Nummer" : "22",
"ArtikelPreise_Pos" : [ {
"ZeileNr" : 1,
"StaffelBis" : 10,
"Default_Kopf" : "${DFT_POS}"
}, {
"ZeileNr" : 2,
"StaffelBis" : 50,
"Default_Kopf" : "${DFT_POS}"
} ],
"Default_Kopf" : "${VAR_KD}"
}

I hope this matches your expected output.

If the answer is helpful to resolve the issue, log in and click on the Accept button below to close this thread. This will help other community users to find answers quickly 🙂
06-26-2019
04:36 PM
@srini This is a Hive bug, reported in HIVE-2927. Possible workarounds would be:

Hive >= 1.3: use the replace function in Hive together with get_json_object.

Example:

hive> select get_json_object(replace(jsn,'@',''),"$.date") from (select string('{"name":"jai","@date":"2015-06-15"}')jsn)t;
2015-06-15

(or)

hive> select get_json_object(replace(jsn,'@date','date'),"$.date") from (select string('{"name":"jai","@date":"2015-06-15"}')jsn)t;
2015-06-15

Hive < 1.3: use the regexp_replace function:

hive> select get_json_object(regexp_replace(jsn,'@',''),"$.date") from (select string('{"name":"jai","@date":"2015-06-15"}')jsn)t;
2015-06-15

(or)

hive> select get_json_object(regexp_replace(jsn,'@date','date'),"$.date") from (select string('{"name":"jai","@date":"2015-06-15"}')jsn)t;
2015-06-15

The final query would be:

hive> select get_json_object(jsn,'$.name'),get_json_object(regexp_replace(jsn,'@',''),"$.date") from (select string('{"name":"jai","@date":"2015-06-15"}')jsn)t;
jai	2015-06-15

If the answer is helpful to resolve the issue, log in and click on the Accept button below to close this thread. This will help other community users to find answers quickly 🙂
06-25-2019
06:02 PM
@Jim Barnett I'm not able to recreate the same scenario on my end (using Hive 1.2.1). I'm wondering how count(distinct key) returned just one row of count as the result while in your case the query select key,count(*) from default.dummy_data group by key; is giving those results.

Check whether any data already exists in the HDFS directory the table is pointing to. If yes, clear the data from the HDFS directory and recreate the table.

Example:

hive> CREATE TABLE default.dummy_data
    > AS
    > SELECT row_nbr as key
    > FROM (
    >   SELECT row_number() OVER (partition by '1') as row_nbr
    >   FROM (
    >     select explode(split(repeat("x,", 1000000-1), ",")) -- 1,000,000 rows
    >   ) AS x
    > ) AS y;

hive> select count(distinct key) c from default.dummy_data;
1000000

hive> select key,count(*)cnt from default.dummy_data group by key order by cnt desc limit 10; -- ordering desc and limiting to 10

Result:

key	cnt
10 1
9 1
1000000 1
7 1
6 1
5 1
4 1
3 1
999999 1
1 1
06-25-2019
05:51 PM
1 Kudo
@Rupak Dum

1. Do I need to upload the .sh script in a folder within HDFS?
No need to upload the .sh script to HDFS. If you do upload the script to HDFS, then follow this link to execute a shell script from HDFS.

2. How do I set up the permissions for the script so that it runs successfully?
You are running the sqoop import as the root user; in this case you need to change the permissions in HDFS for the /user directory. Refer to this and this link for similar threads.

3. How do I execute the .sh file from within HDFS so that I do not get the permission denied error?
Change the permissions of the /user HDFS directory to 700 (or) 777, and then you won't get any permission issues.
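For reference, a minimal sketch of that permissions change from the command line (run as the HDFS superuser; whether you pick 700 or 777 depends on how locked down you want /user to be):

hdfs dfs -chmod 777 /user      # open up the /user directory
hdfs dfs -ls /                 # verify the new permissions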
06-25-2019
02:00 PM
1 Kudo
@Juhyeon Yun Configure Directory as /user/paxata/job and set Recurse Subdirectories to true; this will list all the files in the subdirectories. A similar question was reported in this HCC thread.
06-20-2019
08:37 PM
@Amrutha K This is a known issue in Spark, reported in SPARK-24260 and not yet resolved. One way of working around it is to execute one query at a time, i.e. after reading the .hql file we can access the array of statements by index (0), (1), ...:

val df1 = spark.sql(sc.textFile("/user/temp/hive.hql").collect().mkString.split(";")(0))
val df2 = spark.sql(sc.textFile("/user/temp/hive.hql").collect().mkString.split(";")(1))

(or) If you just want to execute the queries and see the results on the console, try this approach:

sc.textFile("/user/temp/hive.hql").collect().mkString.split(";").map(x => spark.sql(x).show())

Now we are executing all the queries in the .hql script and displaying the results on the console.
06-19-2019
01:52 AM
@Jayashree S Use a RouteOnAttribute processor after the ListS3 processor, filter only the required file, and pass it to FetchS3Object.

Flow:
ListS3
RouteOnAttribute
FetchS3Object

(or) If you want to pull the same file from S3 every time, then you can use a flow like:

GenerateFlowFile //schedule this processor as per your requirements
FetchS3Object //configure the full S3 object key
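As a rough configuration sketch for the second flow (the bucket name, object key, and schedule below are purely illustrative assumptions):

GenerateFlowFile
  Run Schedule: 1 hour //timer driven, adjust to your requirement
FetchS3Object
  Bucket: my-bucket
  Object Key: data/input/latest.csv
  Region and AWS credentials: as appropriate for your environment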
06-19-2019
01:21 AM
@Bill Miller Try a series of SplitRecord processors to create smaller chunks of files (a rough configuration sketch is shown below). Follow the similar approach mentioned in this thread and see whether you get any performance improvement with it.
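A minimal sketch of such a series (the record counts and the CSV reader/writer services are example assumptions; match them to your data format):

SplitRecord #1
  Record Reader: CSVReader
  Record Writer: CSVRecordSetWriter
  Records Per Split: 100000
SplitRecord #2
  Record Reader: CSVReader
  Record Writer: CSVRecordSetWriter
  Records Per Split: 10000

Each stage splits the incoming flowfile into progressively smaller record chunks before downstream processing.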