Member since: 06-08-2017
Posts: 1049
Kudos Received: 518
Solutions: 312
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 11200 | 04-15-2020 05:01 PM |
| | 7099 | 10-15-2019 08:12 PM |
| | 3091 | 10-12-2019 08:29 PM |
| | 11426 | 09-21-2019 10:04 AM |
| | 4321 | 09-19-2019 07:11 AM |
10-23-2018
01:24 PM
1 Kudo
Did you solve this problem? There is a very simple way: you can use a JSONPath expression like "$[-1].id" in EvaluateJsonPath. "-1" refers to the last element of the array.
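For example, with flowfile content like the made-up sample below and an EvaluateJsonPath property (hypothetically named last.id, with Destination set to flowfile-attribute) whose value is $[-1].id, the processor would write 3 to the last.id attribute:

[ {"id": 1}, {"id": 2}, {"id": 3} ]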
04-27-2018
08:40 AM
OK, I managed to get this fixed. In Oracle it works a little differently. I used the following and it worked: I passed the table name followed by a "." and that seemed to do the trick. @Shu

GIM.GIDB_GC_SKILL.*,'${now():toNumber():format('yyyy-MM-dd HH:mm:ss')}' AS LOAD_TMS
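For illustration, assuming this value is supplied as the column list of a QueryDatabaseTable/ExecuteSQL-style query against GIM.GIDB_GC_SKILL, the statement the processor ends up running would resemble:

SELECT GIM.GIDB_GC_SKILL.*, '2018-04-27 08:40:00' AS LOAD_TMS FROM GIM.GIDB_GC_SKILL
-- the literal timestamp is whatever ${now()} evaluates to when the flow runs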
11-02-2017
02:41 AM
3 Kudos
@Gayathri Devi First you need to create a Hive non-partitioned table on the raw data, then create a partitioned table in Hive and insert from the non-partitioned table into the partitioned table. For testing I tried the example below:- Right now my normal Hive table (i.e. not a partitioned table) has this list of records.

Normal table (without partition column):-

hive# create table text_table(id int, dt string, name string) stored as textfile location '/user/yashu/text_table';
hive# select * from text_table;
+----------------+----------------------+------------------+--+
| text_table.id | text_table.dt | text_table.name |
+----------------+----------------------+------------------+--+
| 1 | 2017-10-31 10:12:09 | foo |
| 1 | 2017-10-31 12:12:09 | bar |
| 1 | 2017-10-30 12:12:09 | foobar |
| 1 | 2017-10-30 10:12:09 | barbar |
+----------------+----------------------+------------------+--+

Since I want daily partitions, I need to create a new table that has dt as a partition column in it.

Partition table:- There are 2 kinds of partitions in Hive:
1. Static partitions // the partition is added statically and data is loaded into it; takes less time than dynamic partitioning because Hive does not need to look into the data while creating the partition.
2. Dynamic partitions // partitions are created dynamically based on the column value; takes more time than static partitioning if the data is huge because Hive needs to look into the data while creating the partitions.
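A quick aside before creating the partitioned table: on many clusters the dynamic-partition insert used below only succeeds after enabling non-strict dynamic partitioning for the session (standard Hive settings, shown here as a minimal sketch):

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;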
hive# create table partition_table(
id int,
name string)
partitioned by (dt string);

1. Dynamic partition:- Once you create the partition table, insert into it by selecting from the non-partitioned table:

hive# insert into partition_table partition(dt) select id,name, substring(dt,0,10) from text_table; // we need daily partitions, so we take the substring of dt from 0-10 (i.e. 2017-10-31), which creates the date partitions
INFO : Time taken to load dynamic partitions: 0.066 seconds
INFO : Loading partition {dt=2017-10-30} //creating 2017-10-30 partition
INFO : Loading partition {dt=2017-10-31} //creating 2017-10-31 partition
INFO : Time taken for adding to write entity : 0
INFO : Partition default.partition_table{dt=2017-10-30} stats: [numFiles=1, numRows=2, totalSize=18, rawDataSize=16]
INFO : Partition default.partition_table{dt=2017-10-31} stats: [numFiles=1, numRows=2, totalSize=12, rawDataSize=10]
No rows affected (10.055 seconds)

The statement above uses dynamic partitioning, i.e. the partitions are created based on the data. If you want to view the partitions:

hive# show partitions partition_table; // shows all partitions that have been created in the table
+----------------+--+
| partition |
+----------------+--+
| dt=2017-10-30 |
| dt=2017-10-31 |
+----------------+--+
2 rows selected (0.064 seconds)

Drop partitions:-

hive# alter table partition_table drop partition(dt>'0') purge; // drops all the partitions; (or) you can drop a specific partition by giving dt='2017-10-30' (drops only the 2017-10-30 partition)
INFO : Dropped the partition dt=2017-10-30
INFO : Dropped the partition dt=2017-10-31
No rows affected (0.132 seconds)

To view all partition directories:

hadoop fs -ls -R /apps/hive/warehouse/partition_table/
drwxrwxrwx - hdfs 0 2017-11-01 21:45 /apps/hive/warehouse/partition_table/dt=2017-10-30 //partition directory
-rwxrwxrwx 3 hdfs 18 2017-11-01 21:45 /apps/hive/warehouse/partition_table/dt=2017-10-30/000000_0 //file in the partition
drwxrwxrwx - hdfs 0 2017-11-01 21:45 /apps/hive/warehouse/partition_table/dt=2017-10-31
-rwxrwxrwx 3 hdfs 12 2017-11-01 21:45 /apps/hive/warehouse/partition_table/dt=2017-10-31/000000_0 //file in the partition

To view data from one partition:

hive# select * from partition_table where dt='2017-10-30';
+---------------------+-----------------------+---------------------+--+
| partition_table.id | partition_table.name | partition_table.dt |
+---------------------+-----------------------+---------------------+--+
| 1 | foobar | 2017-10-30 |
| 1 | barbar | 2017-10-30 |
+---------------------+-----------------------+---------------------+--+

As you can see, the dt column in the non-partitioned table has 2017-10-30 12:12:09, but the partitioned table has 2017-10-30, because we took a substring of the dt column while loading the partitioned table.

--> If you don't want to change the source data, i.e. keep the dt column as-is when going from the non-partitioned table to the partitioned table, then create the partition table with a separate partition column:

hive# create table partition_table(
id int,
name string,
dt string)
partitioned by (daily string); //new partition column
hive# insert into partition_table partition(daily) select id,name,dt, substring(dt,0,10) from text_table; // daily is the partition column; the select uses the dt column twice: once to load the actual dt data and once (as a substring) to build the partition column value
hive# show partitions partition_table;
+-------------------+--+
| partition |
+-------------------+--+
| daily=2017-10-30 |
| daily=2017-10-31 |
+-------------------+--+
2 rows selected (0.066 seconds)
0: jdbc:hive2://usor7dhc01w01.use.ucdp.net:21> select * from partition_table; // as you can see, the dt column data is unchanged because daily is now the partition column
+---------------------+-----------------------+----------------------+------------------------+--+
| partition_table.id | partition_table.name | partition_table.dt | partition_table.daily |
+---------------------+-----------------------+----------------------+------------------------+--+
| 1 | foobar | 2017-10-30 12:12:09 | 2017-10-30 |
| 1 | barbar | 2017-10-30 10:12:09 | 2017-10-30 |
| 1 | foo | 2017-10-31 10:12:09 | 2017-10-31 |
| 1 | bar | 2017-10-31 12:12:09 | 2017-10-31 |
+---------------------+-----------------------+----------------------+------------------------+--+

Keep in mind that the partition column needs to be the last column in your select statement; if it isn't, Hive creates partitions based on whatever the last column in the select statement happens to be.

2. Static partition:- Here we create the partition statically and load all the data into it:

hive# insert into partition_table partition(dt='2017-10-30') select id,name from text_table; // the partition is named explicitly as dt='2017-10-30', so all the data is loaded into the 2017-10-30 partition; with a static partition like this, all the dt data should actually be 2017-10-30, and note that the dt column is not included in the select statement

hive# show partitions partition_table;
+----------------+--+
| partition |
+----------------+--+
| dt=2017-10-30 |
+----------------+--+
hive# select * from partition_table; // all dt values will be 2017-10-30 because we loaded a static partition
+---------------------+-----------------------+---------------------+--+
| partition_table.id | partition_table.name | partition_table.dt |
+---------------------+-----------------------+---------------------+--+
| 1 | foo | 2017-10-30 |
| 1 | bar | 2017-10-30 |
| 1 | foobar | 2017-10-30 |
| 1 | barbar | 2017-10-30 |
+---------------------+-----------------------+---------------------+--+
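For convenience, here is the dynamic-partition flow from above condensed into one runnable sequence (same tables and columns as in the example; the session settings are the standard Hive properties mentioned earlier):

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
create table if not exists partition_table(id int, name string) partitioned by (dt string);
insert into partition_table partition(dt) select id, name, substring(dt,0,10) from text_table;
show partitions partition_table;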
You need to decide which kind of partitioning best fits your case. Hope this helps you understand partitions..!!
11-01-2017
04:01 PM
@Shu Appreciate your help. This is as clear an explanation as it can get. Thanks again.
10-30-2017
08:32 PM
1 Kudo
@dhieru singh Min Number of Entries must be set and defaults to 1. That is fine as long as you don't set Max Number of Entries. You are correct: it is (min number of entries AND min group size) OR max number of entries OR max group size. So either of the "max" settings will force a merge, just like max bin age will.
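Put another way, the bin-completion condition described above can be sketched roughly like this (a simplified illustration, not the actual MergeContent source; names and parameters are made up):

static boolean binReady(long entryCount, long binSize, long binAgeMillis,
                        long minEntries, long minGroupSize,
                        long maxEntries, long maxGroupSize, long maxBinAgeMillis) {
    // a bin merges when both minimums are met, or when any single maximum / age limit is hit
    return (entryCount >= minEntries && binSize >= minGroupSize)
            || entryCount >= maxEntries
            || binSize >= maxGroupSize
            || binAgeMillis >= maxBinAgeMillis;
}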
10-31-2017
07:18 PM
1 Kudo
@Hadoop User, the MergeContent Minimum Group Size depends on your input file size. In the MergeContent processor, change the Correlation Attribute Name property to filename // it bins all the chunks that have the same filename and merges them together.

Minimum Number of Entries // the minimum number of flowfiles to include in a bundle; size it against the number of chunks you get out of the SplitText processor.
Maximum Number of Entries // the maximum number of flowfiles to include in a bundle.
Minimum Group Size // the minimum size of the bundle; this should be at least your file size, otherwise some of your data will not be merged.
Max Bin Age // the maximum age of a bin that will trigger the bin to be complete, i.e. after that much time the processor flushes out whatever flowfiles are waiting in front of it.

In my example configuration the Correlation Attribute Name property is filename, which means all the chunks that have the same filename are grouped as one. The processor waits for a minimum of 2 files and a maximum of 1000 files to merge, and it also checks the min and max group size properties. If your flow satisfies these properties, the MergeContent processor won't have any files waiting in front of it. If your flow does not meet the configuration above, you need the Max Bin Age property to flush out the files waiting in front of the processor. In my configuration I gave 1 minute, so the processor waits 1 minute and, if no matching correlation attributes arrive, flushes out the bin; define the value as per your requirements.

For your reference:

Ex1:- Let's say your file size is 100 MB and SplitText produces 1000 chunks. Your MergeContent configuration would look like:
Minimum Number of Entries 1
Maximum Number of Entries 1000
Minimum Group Size 100 MB // at least equal to your file size
case1:- If one flowfile is 100 MB, the Maximum Number of Entries property is ignored: min entries is 1 and min group size is 100 MB, the minimum requirements are satisfied, and the processor merges that file.
case2:- If there are 1000 flowfiles totalling only 10 MB, the Minimum Group Size property is effectively ignored: max entries is 1000, the maximum requirement is satisfied, and the processor merges those 1000 chunks into 1 file.

Ex2:- Let's say your file size is 95 MB and SplitText produces 900 chunks. The challenge in this case is that the configuration above will not merge the 900 chunks: the bin never reaches the 100 MB minimum group size (we only have 95 MB) and never reaches 1000 entries, but we still need that file merged. In this case your MergeContent configuration would look like:
Minimum Number of Entries 1
Maximum Number of Entries 1000 // equal to the expected number of chunks
Minimum Group Size 100 MB // at least equal to your file size
Max Bin Age 1 minute // this property helps when files are left waiting in front of the processor: after 1 minute it flushes them out and merges them according to the filename correlation attribute.

By analyzing your GetFile, SplitText and ReplaceText processors (size, count), you need to configure the MergeContent processor accordingly.
03-28-2019
04:46 AM
@Shu how can I use this for multiple files based on one file name? Example:- the input path contains 3 files plus one .done.csv file: emp.csv, dept.csv, account.csv, date.done.csv. Only if the input path contains the .done.csv file should my files be routed in the NiFi flow; otherwise they should not be routed.
11-04-2017
12:18 AM
@Matt Burgess The JDBC spec does state that REAL maps to (single-precision) float and FLOAT maps to double precision, so the Avro mapping should be to double, not float; see table B-1, page 190: http://download.oracle.com/otn-pub/jcp/jdbc-4_3-mrel3-eval-spec/jdbc4.3-fr-spec.pdf So the following code fix resolved the issue: https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/util/JdbcCommon.java#L527

Original:

case FLOAT:
case REAL:
builder.name(columnName).type().unionOf().nullBuilder().endNull().and().floatType().endUnion().noDefault();
break;
case DOUBLE:
builder.name(columnName).type().unionOf().nullBuilder().endNull().and().doubleType().endUnion().noDefault();
break;

Modified:

case REAL:
builder.name(columnName).type().unionOf().nullBuilder().endNull().and().floatType().endUnion().noDefault();
break;
case FLOAT:
case DOUBLE:
builder.name(columnName).type().unionOf().nullBuilder().endNull().and().doubleType().endUnion().noDefault();
break;

Thanks to @Ron Mahoney for finding this issue.
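As a quick illustration of the effect (the column name here is hypothetical): with the modified mapping, a column reported by the driver as JDBC FLOAT, say PRICE, ends up in the generated Avro schema as a nullable double instead of a nullable float:

{ "name" : "PRICE", "type" : [ "null", "double" ] }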
10-30-2017
06:54 AM
Thanks a lot Shu. One more thing: is it possible to catch the output of the ExecuteStreamCommand and add that value to the PutEmail body? For example, my 4th script counts the files, and I want to send that file count by email.
10-28-2017
07:01 PM
One silly question please: where can I find the source files for the imports below? Which folder should I look in if I need to check their content or add more imports?

import org.apache.nifi.controller.ControllerService
import groovy.sql.Sql