Member since
06-08-2017
1049
Posts
518
Kudos Received
312
Solutions
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 11300 | 04-15-2020 05:01 PM |
| | 7195 | 10-15-2019 08:12 PM |
| | 3171 | 10-12-2019 08:29 PM |
| | 11672 | 09-21-2019 10:04 AM |
| | 4407 | 09-19-2019 07:11 AM |
03-15-2018
09:03 PM
@Sami Ahmad These Hive connection pool configs vary depending on the setup, so unfortunately there are no standard properties I can share with you. HiveConnectionPool configs:- I use the connection pool configs below. Database Connection URL
jdbc:hive2://sandbox.hortonworks:2181,sandbox.hortonworks:2181,sandbox.hortonworks:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2-binary;principal=hive/_HOST@HDP.LOCAL
Hive Configuration Resources
/etc/hdp/hive-site.xml,/etc/hdp/core-site.xml,/etc/hdp/hdfs-site.xml
Database User
nifi
Password
No value set
Max Wait Time
500 millis
Max Total Connections
8
Validation query select current_timestamp Please refer to the links below to work out your JDBC connection URL, and if your environment is kerberized then you also need to set the Kerberos principal and keytab in the connection pool service. https://community.hortonworks.com/questions/107945/hive-database-connection-pooling-service.html https://community.hortonworks.com/articles/4103/hiveserver2-jdbc-connection-url-examples.html https://community.hortonworks.com/questions/64609/cannot-create-hive-connection-pool.html
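As a rough illustration for the kerberized case (the principal and keytab path below are placeholders, and the exact property names can differ slightly between NiFi versions), the HiveConnectionPool service would additionally carry something like: Kerberos Principal
nifi@HDP.LOCAL
Kerberos Keytab
/etc/security/keytabs/nifi.headless.keytab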
03-15-2018
01:07 PM
2 Kudos
@Satya Nittala You can use the to_date function in your where clause to get only the max(last uploaded date) records. Example:- I have a table hive> select * from test_table;
+-----+-------+-----------+----------------------+--+
| id | name | dt | daily |
+-----+-------+-----------+----------------------+--+
| 1 | hcc | 12:00:00 | 2017-10-10 12:00:00 |
| 1 | foo | 12:00:00 | 2017-10-11 12:00:00 |
| 1 | foo | 12:00:00 | 2017-10-12 12:00:00 |
| 2 | bar | 09:00:00 | 2017-12-23 09:00:00 |
| 3 | foo | 09:00:00 | 2018-03-15 09:00:00 |
| 4 | bar | 09:00:00 | 2018-03-15 09:00:00 |
+-----+-------+-----------+----------------------+--+
Now I want to get only the records having the max date, i.e. 2018-03-15 is our max date. hive> select count(*) from test_table where to_Date(daily) in (select max(to_Date(daily)) from test_table); +------+--+
| _c0 |
+------+--+
| 2 |
+------+--+ I got only 2 records because there are 2 records in the table that have 2018-03-15 as the date in the daily column; since we used the to_Date function, it extracts only the date part from the timestamp. to_date function:- hive> select to_date(daily) from test_table;
+-------------+--+
| _c0 |
+-------------+--+
| 2017-10-10 |
| 2017-10-11 |
| 2017-10-12 |
| 2017-12-23 |
| 2018-03-15 |
| 2018-03-15 |
+-------------+--+
Let us know if you are having any issues..!! If the answer helped to resolve your issue, click on the Accept button below to accept the answer; that would be a great help to community users looking for a quick solution to this kind of issue.
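As a side note, a minimal sketch of an equivalent query using a window function (assuming a Hive version where rank() is available); it avoids the IN subquery and gives the same count of 2 on the sample data above: hive> select count(*)
from (
  select daily, rank() over (order by to_date(daily) desc) as rnk
  from test_table
) t
where rnk = 1;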
03-15-2018
12:37 AM
1 Kudo
@Sai Krishna Makineni For this use case there are a couple of ways we can achieve it.

Method 1:- Store the data in an HBase table using the primary key of the RDBMS table as the Row Key:- Once you pull all the data from the RDBMS with NiFi processors (ExecuteSQL, QueryDatabaseTable, etc.), the output from those processors is in Avro format. You can use the ConvertAvroToJson processor and then the SplitJson processor to split each record out of the array of json records. Store all the records in an HBase table with the Rowkey set to the primary key of the RDBMS table. When we get an incremental load based on the Last Modified Date field, we receive both updated records and newly added records from the RDBMS table. If we get an update for an existing rowkey then HBase overwrites the existing data for that record; newly added records are added as new rows in the table. Then, by using Hive-HBase integration, you can expose the HBase table data through Hive. https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration With this method the HBase table takes care of all the upsert operations, but we cannot expect the same performance from a Hive-HBase table as from a native Hive table, which will perform faster; HBase tables are not meant for SQL-style queries and are most efficient when you access data by Rowkey. If we are going to have millions of records then we need to do some tuning of the Hive queries. https://stackoverflow.com/questions/30074734/tuning-hive-queries-that-uses-underlying-hbase-table

Method 2:- Store the data in Hive and then run deduplication:- In this method you pull all the incremental data from the RDBMS and store it in an HDFS location in ORC format. Create a table on top of the HDFS location and, every time ingestion completes, run deduplication using a window function based on the Last Modified Date, keeping only the newest record for each key (a rough sketch of this step follows this answer). Write the final deduplicated data into the target table, so the final table holds only the newest set of records. If we are going to have millions of records then this deduplication will be a heavy process, because every time we deduplicate over the full dataset. Once the data is deduplicated and stored in the target table, you will get better performance from it than from a Hive-HBase table.

You can choose whichever way better suits your case..!!
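A minimal sketch of the deduplication step in Method 2, assuming a staging table named staging_table with columns id, name, last_modified_date and a target table named target_table (all of these names are placeholders, not from the original question): hive> insert overwrite table target_table
select id, name, last_modified_date
from (
  select id, name, last_modified_date,
         row_number() over (partition by id order by last_modified_date desc) as rn
  from staging_table
) t
where rn = 1;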
03-14-2018
07:48 AM
@Vaibhav Kumar We need to change the name tag in the .xml file, i.e. open the .xml file in any editor like Notepad++, etc., then <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<template encoding-version="1.2">
<description></description>
<groupId>3b738dba-0161-1000-c808-f7d38f21fcab</groupId>
<name>176561_jsontoCSV</name>
<snippet> Edit the name inside the .xml file (for example, change <name>176561_jsontoCSV</name> to something like <name>176561_jsontoCSV_v2</name>), then you can import the same template again. NiFi keeps the name of the template based on the name tag in the .xml file; if we only rename the .xml file itself, the name tag inside does not change, so NiFi still sees a duplicate template name and throws an 'already exists' error.
03-13-2018
10:42 PM
@Vaibhav Kumar, I think you are facing the below issue while uploading a template. If so, that means you have already uploaded the template and are now trying to upload a template with the same name again; at this point NiFi throws an 'already exists' error. To check which templates are available in your NiFi instance, go to the Global Menu in the top right corner and click on the Templates option; there you can view all the templates in the NiFi instance. For more details please refer to the link below https://docs.hortonworks.com/HDPDocuments/HDF3/HDF-3.0.0/bk_user-guide/content/templates.html What exactly happens in the background when we check/un-check relationship checkboxes? SplitJson:- Let's assume your input json record is [
{
"member_id": 1,
"loan_amnt": 5000
},{
"member_id": 2,
"loan_amnt": 1000
}
] In the SplitJson processor we have three relationships:
1. split --> as we have a json array with 2 records as shown above, this relation gets the split records, i.e. 2 flowfiles in our case (see the sketch after this answer).
2. original --> as the name states, whatever json the SplitJson processor receives as input is routed to this relation; in our case that is the json array with 2 records shown in the snippet above.
3. failure --> if the FlowFile is not valid JSON or the specified path does not exist, it is routed to this relationship, mostly used to get notifications if the processing failed.
So when we feed the split, original and failure relations to the EvaluateJsonPath processor:
1. From the split relation we get 2 split records, and we can get the member_id, loan_amnt attribute values 1, 5000, etc. from these flowfiles; the ReplaceText processor then puts all the values into the csv file.
2. From the original relation the input json array message goes to EvaluateJsonPath; for this flowfile we cannot get the attribute values for member_id, loan_amnt because they are inside an array, so this flowfile ends up with empty attribute values and ReplaceText produces empty values for those attributes.
3. As we have valid json, no flowfile is routed to the failure relation.
Let us know if you have any other questions ..!!
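To make the split relation concrete, the two FlowFiles coming out of it would each hold one record from the input array shown above, roughly:
{"member_id": 1, "loan_amnt": 5000}
{"member_id": 2, "loan_amnt": 1000}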
03-13-2018
10:08 PM
@Saikrishna Tarapareddy, if you are using NiFi 1.2+ then you can try the ConvertRecord processor instead of the ConvertJSONToAvro processor, as ConvertRecord takes an array of json and converts it to avro. In the ConvertRecord processor configs, set the Record Reader to a JSON reader (JsonTreeReader) and the Record Writer to an Avro record set writer (AvroRecordSetWriter).
03-13-2018
01:49 PM
@Bob T You can achieve that by using three successive ReplaceText processors with the below configs:- ReplaceText1:- Search Value
\]\"
Replacement Value
] Maximum Buffer Size
1 MB
Replacement Strategy
Regex Replace
Evaluation Mode Entire Text Replacetext2:- Search Value
\"\[
Replacement Value
[ Replacement Strategy
Regex Replace
Evaluation Mode
Entire text ReplaceText3:- In this processor we replace \ (backslash) with an empty string; since the replacement strategy is Regex Replace, the backslash has to be escaped in the search value. Search Value
\\
Replacement Value Empty string set Replacement Strategy
Regex Replace
Evaluation Mode
Entire text The output from the last replacetext processor would be {
"page_id": "page_bb51c576-a885-482f-bf19-648f0b654745",
"group_id": "MFGRPPA000054",
"provider": "{"addresses":["1405 Shady Avenue, Pittsburgh, PA"],"education":[{"name":"TEXAS A&M UNIVERSITY COLLEGE OF MEDICINE","graduation_date":""}],"last_name":"Oandasan","affiliations":"","middle_name":"","suffix":"","certifications":["ADULT PSYCHIATRY, CHILD & ADOLESCENT PSYCHIATRY – AMERICAN BOARD OF PSYCHIATRY AND NEUROLOGY"],"degrees":["MD"],"internships":[],"fellowships":[],"phone_numbers":["4124202400","41242024001405","8007320999"],"specialties":["CHILD & ADOLESCENT PSYCHIATRY"],"residencies":["GENERAL PSYCHIATRY, GEORGETOWN UNIVERSITY"],"name":"Aileen Oandasan","state":"PA","first_name":"Aileen"}",
"org_name": "CHILDREN'S INSTITUTE OF PITTSBURGH, THE",
"scrape_id": "scrape_42d77b5d-19c1-4760-94bc-1fdf4c3b1d79"
} Flow:- If the answer helped to resolve your issue, click on the Accept button below to accept the answer; that would be a great help to community users looking for a quick solution to this kind of issue.
03-13-2018
12:57 PM
@Vaibhav Kumar
The issue is with the Failure, Original and Unmatched relationships, as we need to auto-terminate them or connect them to a PutEmail processor to get a notification if something goes wrong. Steps to auto-terminate relationships:- SplitJson:-
1. Delete all the connections feeding EvaluateJsonPath from the SplitJson processor 2. Right click on the SplitJson processor 3. Go to the Settings tab 4. Click on the check boxes for the Failure and Original relationships 5. Then connect only the split relationship to the EvaluateJsonPath processor. EvaluateJsonPath:- 1. Delete all existing relationships feeding the ReplaceText processor 2. Right click on the EvaluateJsonPath processor 3. Go to the Settings tab 4. Click on the check boxes for the Failure and Unmatched relationships 5. Connect only the matched relationship to the ReplaceText processor ReplaceText processor:- 1. Delete all existing relationships going out from the ReplaceText processor 2. Right click on the ReplaceText processor 3. Go to the Settings tab 4. Click on the check box for the Failure relationship 5. Connect only the success relationship from the ReplaceText processor. I have attached the xml flow below; save the xml flow, upload it to your NiFi instance, and change the configs in the GetFile processor. If the answer helped to resolve your issue, click on the Accept button below to accept the answer; that would be a great help to community users looking for a quick solution to this kind of issue. 176561-jsontocsv.xml
03-13-2018
11:54 AM
@Vaibhav Kumar I have tried with your input Json data and processor configs; all your configs look good. I'm able to get the flowfile with "1","1000" from the ReplaceText processor. Can you make sure the connections feeding the processors are correct based on the below screenshot? If the issue still exists then share your flow screenshot or xml so that we can help you ASAP.
03-13-2018
12:07 AM
@Sami Ahmad Sad to hear that 😞 I tried to create the same flow and it works as expected. Here is what I tried: I created a table in hive that is bucketed, partitioned, stored as ORC and transactional. hive ddl:- hive> create table default.tmp(id string,name string)
partitioned by(ts timestamp)
clustered by (id)
into 2 buckets
stored as orc
TBLPROPERTIES ("transactional"="true"); hive> desc tmp;
+--------------------------+-----------------------+-----------------------+--+
| col_name | data_type | comment |
+--------------------------+-----------------------+-----------------------+--+
| id | string | |
| name | string | |
| ts | timestamp | |
| | NULL | NULL |
| # Partition Information | NULL | NULL |
| # col_name | data_type | comment |
| | NULL | NULL |
| ts | timestamp | |
+--------------------------+-----------------------+-----------------------+--+ I have created the tmp table with the id, name columns as string types, partitioned by ts as timestamp type. Now PutHiveStreaming Processor Configs (a rough property sketch is included at the end of this answer):- Select data from the tmp table now:- hive> select * from tmp;
+-----+-------+--------------------------+--+
| id | name | ts |
+-----+-------+--------------------------+--+
| 1 | hdf | 2015-01-02 20:23:20.0 |
| 1 | hdf | 2015-01-02 20:23:20.0 |
| 1 | hdf | 2018-03-12 19:11:51.217 |
| 1 | hdf | 2018-03-12 19:11:51.24 |
+-----+-------+--------------------------+--+ hive> show partitions tmp;
+---------------------------------+--+
| partition |
+---------------------------------+--+
| ts=2015-01-02 20%3A23%3A20.0 |
| ts=2018-03-12 19%3A11%3A51.217 |
| ts=2018-03-12 19%3A11%3A51.24 |
+---------------------------------+--+ For testing purpose use select hiveql processor with below configs:- HiveQL Select Query
select int(1) id,string('hdf')name,timestamp(current_timestamp)ts
Output Format
Avro Configs:- Test Flow:- Try this flow once and make sure you are able to load into hive using the PutHiveStreaming processor. Once you are able to load the data into hive, then try your own scenario. If the issue still exists then share your hive create table statement and 10 records of sample data so that we can recreate the issue and help you out ASAP. In addition, if you don't want to create partitions then the create table statement would be hive> create table default.tmp(id string,name string,ts string) --with ts as timestamp type also works
clustered by (id)
into 2 buckets
stored as orc
TBLPROPERTIES ("transactional"="true");
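Since the PutHiveStreaming processor configs above were shown only as a screenshot, here is a rough sketch of the key properties used in this test (the metastore URI and file paths are placeholders for a sandbox setup, so adjust them to your environment):
Hive Metastore URI
thrift://sandbox.hortonworks.com:9083
Database Name
default
Table Name
tmp
Hive Configuration Resources
/etc/hdp/hive-site.xml,/etc/hdp/core-site.xml,/etc/hdp/hdfs-site.xml
Auto-Create Partitions
true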