Member since
06-08-2017
1049
Posts
518
Kudos Received
312
Solutions
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 11300 | 04-15-2020 05:01 PM |
| | 7195 | 10-15-2019 08:12 PM |
| | 3171 | 10-12-2019 08:29 PM |
| | 11672 | 09-21-2019 10:04 AM |
| | 4407 | 09-19-2019 07:11 AM |
03-15-2018
09:03 PM
@Sami Ahmad These Hive connection pool configs vary depending on the setup, so unfortunately there are no standard properties I can share with you. HiveConnectionPool configs:- I use the connection pool configs below. Database Connection URL
jdbc:hive2://sandbox.hortonworks:2181,sandbox.hortonworks:2181,sandbox.hortonworks:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2-binary;principal=hive/_HOST@HDP.LOCAL
Hive Configuration Resources
/etc/hdp/hive-site.xml,/etc/hdp/core-site.xml,/etc/hdp/hdfs-site.xml
Database User
nifi
Password
No value set
Max Wait Time
500 millis
Max Total Connections
8
Validation query select current_timestamp Please refer to the links below to work out your JDBC connection URL, and if your environment is kerberized then you also need to set the Kerberos principal and keytab in the connection pool service. https://community.hortonworks.com/questions/107945/hive-database-connection-pooling-service.html https://community.hortonworks.com/articles/4103/hiveserver2-jdbc-connection-url-examples.html https://community.hortonworks.com/questions/64609/cannot-create-hive-connection-pool.html
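As a rough illustration for the kerberized case (the principal and keytab path below are placeholders, and the exact property names can differ slightly between NiFi versions), the HiveConnectionPool service would additionally carry something like: Kerberos Principal
nifi@HDP.LOCAL
Kerberos Keytab
/etc/security/keytabs/nifi.headless.keytab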
03-15-2018
01:07 PM
2 Kudos
@Satya Nittala You can use the to_date function in your where clause to get only the max(last uploaded date) records. Example:- I have a table hive> select * from test_table;
+-----+-------+-----------+----------------------+--+
| id | name | dt | daily |
+-----+-------+-----------+----------------------+--+
| 1 | hcc | 12:00:00 | 2017-10-10 12:00:00 |
| 1 | foo | 12:00:00 | 2017-10-11 12:00:00 |
| 1 | foo | 12:00:00 | 2017-10-12 12:00:00 |
| 2 | bar | 09:00:00 | 2017-12-23 09:00:00 |
| 3 | foo | 09:00:00 | 2018-03-15 09:00:00 |
| 4 | bar | 09:00:00 | 2018-03-15 09:00:00 |
+-----+-------+-----------+----------------------+--+
Now I want to get only the records having the max date, i.e. 2018-03-15 is our max date. hive> select count(*) from test_table where to_Date(daily) in (select max(to_Date(daily)) from test_table); +------+--+
| _c0 |
+------+--+
| 2 |
+------+--+ I got only 2 records because there are 2 records in the table that have 2018-03-15 as the date in the daily column; since we used the to_Date function, it extracts only the date part from the timestamp. to_date function:- hive> select to_date(daily) from test_table;
+-------------+--+
| _c0 |
+-------------+--+
| 2017-10-10 |
| 2017-10-11 |
| 2017-10-12 |
| 2017-12-23 |
| 2018-03-15 |
| 2018-03-15 |
+-------------+--+
Let us know if you are having any issues..!! If the answer helped to resolve your issue, click on the Accept button below to accept the answer; that would be a great help to community users looking for a quick solution to this kind of issue.
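As a side note, a minimal sketch of an equivalent query using a window function (assuming a Hive version where rank() is available); it avoids the IN subquery and gives the same count of 2 on the sample data above: hive> select count(*)
from (
  select daily, rank() over (order by to_date(daily) desc) as rnk
  from test_table
) t
where rnk = 1;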
03-15-2018
12:37 AM
1 Kudo
@Sai Krishna Makineni For this use case there are a couple of ways we can achieve it.

Method 1:- Store the data in an HBase table using the primary key of the RDBMS table as the Row Key:- Once you pull all the data from the RDBMS with NiFi processors (ExecuteSQL, QueryDatabaseTable, etc.), the output from those processors is in Avro format. You can use the ConvertAvroToJson processor and then the SplitJson processor to split each record out of the array of json records. Store all the records in an HBase table with the Rowkey set to the primary key of the RDBMS table. When we get an incremental load based on the Last Modified Date field, we receive both updated records and newly added records from the RDBMS table. If we get an update for an existing rowkey then HBase overwrites the existing data for that record; newly added records are added as new rows in the table. Then, by using Hive-HBase integration, you can expose the HBase table data through Hive. https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration With this method the HBase table takes care of all the upsert operations, but we cannot expect the same performance from a Hive-HBase table as from a native Hive table, which will perform faster; HBase tables are not meant for SQL-style queries and are most efficient when you access data by Rowkey. If we are going to have millions of records then we need to do some tuning of the Hive queries. https://stackoverflow.com/questions/30074734/tuning-hive-queries-that-uses-underlying-hbase-table

Method 2:- Store the data in Hive and then run deduplication:- In this method you pull all the incremental data from the RDBMS and store it in an HDFS location in ORC format. Create a table on top of the HDFS location and, every time ingestion completes, run deduplication using a window function based on the Last Modified Date, keeping only the newest record for each key (a rough sketch of this step follows this answer). Write the final deduplicated data into the target table, so the final table holds only the newest set of records. If we are going to have millions of records then this deduplication will be a heavy process, because every time we deduplicate over the full dataset. Once the data is deduplicated and stored in the target table, you will get better performance from it than from a Hive-HBase table.

You can choose whichever way better suits your case..!!
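A minimal sketch of the deduplication step in Method 2, assuming a staging table named staging_table with columns id, name, last_modified_date and a target table named target_table (all of these names are placeholders, not from the original question): hive> insert overwrite table target_table
select id, name, last_modified_date
from (
  select id, name, last_modified_date,
         row_number() over (partition by id order by last_modified_date desc) as rn
  from staging_table
) t
where rn = 1;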
03-14-2018
07:48 AM
@Vaibhav Kumar We need to change the name tag in the .xml file, i.e. open the .xml file in any editor like Notepad++, etc., then <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<template encoding-version="1.2">
<description></description>
<groupId>3b738dba-0161-1000-c808-f7d38f21fcab</groupId>
<name>176561_jsontoCSV</name>
<snippet> Edit the name inside the .xml file (for example, change <name>176561_jsontoCSV</name> to something like <name>176561_jsontoCSV_v2</name>), then you can import the same template again. NiFi keeps the name of the template based on the name tag in the .xml file; if we only rename the .xml file itself, the name tag inside does not change, so NiFi still sees a duplicate template name and throws an 'already exists' error.
03-13-2018
10:42 PM
@Vaibhav Kumar, I think you are facing the below issue while uploading a template. If so, that means you have already uploaded the template and are now trying to upload a template with the same name again; at this point NiFi throws an 'already exists' error. To check which templates are available in your NiFi instance, go to the Global Menu in the top right corner and click on the Templates option; there you can view all the templates in the NiFi instance. For more details please refer to the link below https://docs.hortonworks.com/HDPDocuments/HDF3/HDF-3.0.0/bk_user-guide/content/templates.html What exactly happens in the background when we check/un-check relationship checkboxes? SplitJson:- Let's assume your input json record is [
{
"member_id": 1,
"loan_amnt": 5000
},{
"member_id": 2,
"loan_amnt": 1000
}
] In the SplitJson processor we have three relationships:
1. split --> as we have a json array with 2 records as shown above, this relation gets the split records, i.e. 2 flowfiles in our case (see the sketch after this answer).
2. original --> as the name states, whatever json the SplitJson processor receives as input is routed to this relation; in our case that is the json array with 2 records shown in the snippet above.
3. failure --> if the FlowFile is not valid JSON or the specified path does not exist, it is routed to this relationship, mostly used to get notifications if the processing failed.
So when we feed the split, original and failure relations to the EvaluateJsonPath processor:
1. From the split relation we get 2 split records, and we can get the member_id, loan_amnt attribute values 1, 5000, etc. from these flowfiles; the ReplaceText processor then puts all the values into the csv file.
2. From the original relation the input json array message goes to EvaluateJsonPath; for this flowfile we cannot get the attribute values for member_id, loan_amnt because they are inside an array, so this flowfile ends up with empty attribute values and ReplaceText produces empty values for those attributes.
3. As we have valid json, no flowfile is routed to the failure relation.
Let us know if you have any other questions ..!!
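To make the split relation concrete, the two FlowFiles coming out of it would each hold one record from the input array shown above, roughly:
{"member_id": 1, "loan_amnt": 5000}
{"member_id": 2, "loan_amnt": 1000}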
03-13-2018
10:08 PM
@Saikrishna Tarapareddy, if you are using NiFi 1.2+ then you can try the ConvertRecord processor instead of the ConvertJSONToAvro processor, as ConvertRecord takes an array of json and converts it to avro. In the ConvertRecord processor configs, set the Record Reader to a JSON reader (JsonTreeReader) and the Record Writer to an Avro record set writer (AvroRecordSetWriter).
03-13-2018
01:49 PM
@Bob T You can achieve that by using three successive ReplaceText processors with the below configs:- ReplaceText1:- Search Value
\]\"
Replacement Value
] Maximum Buffer Size
1 MB
Replacement Strategy
Regex Replace
Evaluation Mode Entire Text Replacetext2:- Search Value
\"\[
Replacement Value
[ Replacement Strategy
Regex Replace
Evaluation Mode
Entire text ReplaceText3:- In this processor we replace \ (backslash) with an empty string; since the replacement strategy is Regex Replace, the backslash has to be escaped in the search value. Search Value
\\
Replacement Value Empty string set Replacement Strategy
Regex Replace
Evaluation Mode
Entire text The output from the last replacetext processor would be {
"page_id": "page_bb51c576-a885-482f-bf19-648f0b654745",
"group_id": "MFGRPPA000054",
"provider": "{"addresses":["1405 Shady Avenue, Pittsburgh, PA"],"education":[{"name":"TEXAS A&M UNIVERSITY COLLEGE OF MEDICINE","graduation_date":""}],"last_name":"Oandasan","affiliations":"","middle_name":"","suffix":"","certifications":["ADULT PSYCHIATRY, CHILD & ADOLESCENT PSYCHIATRY – AMERICAN BOARD OF PSYCHIATRY AND NEUROLOGY"],"degrees":["MD"],"internships":[],"fellowships":[],"phone_numbers":["4124202400","41242024001405","8007320999"],"specialties":["CHILD & ADOLESCENT PSYCHIATRY"],"residencies":["GENERAL PSYCHIATRY, GEORGETOWN UNIVERSITY"],"name":"Aileen Oandasan","state":"PA","first_name":"Aileen"}",
"org_name": "CHILDREN'S INSTITUTE OF PITTSBURGH, THE",
"scrape_id": "scrape_42d77b5d-19c1-4760-94bc-1fdf4c3b1d79"
} Flow:- If the answer helped to resolve your issue, click on the Accept button below to accept the answer; that would be a great help to community users looking for a quick solution to this kind of issue.
03-13-2018
12:57 PM
@Vaibhav Kumar
The issue is with the Failure, Original and Unmatched relationships, as we need to auto-terminate them or connect them to a PutEmail processor to get a notification if something goes wrong. Steps to auto-terminate relationships:- SplitJson:-
1. Delete all the connections feeding EvaluateJsonPath from the SplitJson processor 2. Right click on the SplitJson processor 3. Go to the Settings tab 4. Click on the check boxes for the Failure and Original relationships 5. Then connect only the split relationship to the EvaluateJsonPath processor. EvaluateJsonPath:- 1. Delete all existing relationships feeding the ReplaceText processor 2. Right click on the EvaluateJsonPath processor 3. Go to the Settings tab 4. Click on the check boxes for the Failure and Unmatched relationships 5. Connect only the matched relationship to the ReplaceText processor ReplaceText processor:- 1. Delete all existing relationships going out from the ReplaceText processor 2. Right click on the ReplaceText processor 3. Go to the Settings tab 4. Click on the check box for the Failure relationship 5. Connect only the success relationship from the ReplaceText processor. I have attached the xml flow below; save the xml flow, upload it to your NiFi instance, and change the configs in the GetFile processor. If the answer helped to resolve your issue, click on the Accept button below to accept the answer; that would be a great help to community users looking for a quick solution to this kind of issue. 176561-jsontocsv.xml
03-13-2018
11:54 AM
@Vaibhav Kumar I have tried with your input Json data and processor configs; all your configs look good. I'm able to get the flowfile with "1","1000" from the ReplaceText processor. Can you make sure the connections feeding the processors are correct based on the below screenshot? If the issue still exists then share your flow screenshot or xml so that we can help you ASAP.
03-13-2018
12:07 AM
@Sami Ahmad Sad to hear that 😞 I tried to create the same flow and it works as expected. Here is what I tried: I created a table in hive that is bucketed, partitioned, stored as ORC and transactional. hive ddl:- hive> create table default.tmp(id string,name string)
partitioned by(ts timestamp)
clustered by (id)
into 2 buckets
stored as orc
TBLPROPERTIES ("transactional"="true"); hive> desc tmp;
+--------------------------+-----------------------+-----------------------+--+
| col_name | data_type | comment |
+--------------------------+-----------------------+-----------------------+--+
| id | string | |
| name | string | |
| ts | timestamp | |
| | NULL | NULL |
| # Partition Information | NULL | NULL |
| # col_name | data_type | comment |
| | NULL | NULL |
| ts | timestamp | |
+--------------------------+-----------------------+-----------------------+--+ I have created the tmp table with the id, name columns as string types, partitioned by ts as timestamp type. Now PutHiveStreaming Processor Configs (a rough property sketch is included at the end of this answer):- Select data from the tmp table now:- hive> select * from tmp;
+-----+-------+--------------------------+--+
| id | name | ts |
+-----+-------+--------------------------+--+
| 1 | hdf | 2015-01-02 20:23:20.0 |
| 1 | hdf | 2015-01-02 20:23:20.0 |
| 1 | hdf | 2018-03-12 19:11:51.217 |
| 1 | hdf | 2018-03-12 19:11:51.24 |
+-----+-------+--------------------------+--+ hive> show partitions tmp;
+---------------------------------+--+
| partition |
+---------------------------------+--+
| ts=2015-01-02 20%3A23%3A20.0 |
| ts=2018-03-12 19%3A11%3A51.217 |
| ts=2018-03-12 19%3A11%3A51.24 |
+---------------------------------+--+ For testing purpose use select hiveql processor with below configs:- HiveQL Select Query
select int(1) id,string('hdf')name,timestamp(current_timestamp)ts
Output Format
Avro Configs:- Test Flow:- Try this flow once and make sure you are able to load into hive using the PutHiveStreaming processor. Once you are able to load the data into hive, then try your own scenario. If the issue still exists then share your hive create table statement and 10 records of sample data so that we can recreate the issue and help you out ASAP. In addition, if you don't want to create partitions then the create table statement would be hive> create table default.tmp(id string,name string,ts string) --with ts as timestamp type also works
clustered by (id)
into 2 buckets
stored as orc
TBLPROPERTIES ("transactional"="true");
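Since the PutHiveStreaming processor configs above were shown only as a screenshot, here is a rough sketch of the key properties used in this test (the metastore URI and file paths are placeholders for a sandbox setup, so adjust them to your environment):
Hive Metastore URI
thrift://sandbox.hortonworks.com:9083
Database Name
default
Table Name
tmp
Hive Configuration Resources
/etc/hdp/hive-site.xml,/etc/hdp/core-site.xml,/etc/hdp/hdfs-site.xml
Auto-Create Partitions
true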