Member since: 08-03-2019
Posts: 186
Kudos Received: 34
Solutions: 26
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 1978 | 04-25-2018 08:37 PM |
| | 5906 | 04-01-2018 09:37 PM |
| | 1615 | 03-29-2018 05:15 PM |
| | 6793 | 03-27-2018 07:22 PM |
| | 2032 | 03-27-2018 06:14 PM |
03-21-2018 01:38 AM
@Sudha Chandrika ORC is a columnar (column-major) storage format, so you do not need any ROW FORMAT DELIMITED FIELDS TERMINATED BY clause at all. Simply create your table as

CREATE EXTERNAL TABLE mytable
(
  col1 bigint,
  col2 bigint
)
STORED AS ORC
LOCATION '<ORC file location>';

From the question - "I am using the same delimiter for ORC storage and Hive table creation." How are you creating the ORC files?
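For reference, a minimal sketch (with hypothetical table and path names) of one common way those ORC files get produced in the first place: land the delimited text in a plain staging table and let Hive write the ORC files itself, so no delimiter ever applies to the ORC table.

-- Hypothetical staging table that matches the raw delimited files
CREATE EXTERNAL TABLE mytable_staging
(
  col1 bigint,
  col2 bigint
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/mytable_staging';

-- Hive serializes the rows as ORC under mytable's location; delimiters are irrelevant here
INSERT INTO TABLE mytable SELECT col1, col2 FROM mytable_staging;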
03-20-2018 02:15 PM
@Yong Boon Lim The problem is with your CREATE TABLE statement for test_9. Have a look at your syntax:

CREATE TABLE default.test_9 (col INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\u001C'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat';

Now when you describe this table, you will see something like

| # Storage Information | NULL | NULL |
| SerDe Library: | org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe | NULL |
| InputFormat: | org.apache.hadoop.hive.ql.io.orc.OrcInputFormat | NULL |
| OutputFormat: | org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat | NULL |

That is not right: the SerDe library is LazySimpleSerDe while the InputFormat and OutputFormat are ORC, so the table is not going to work. Now let's say you tweak your CREATE TABLE statement to look something like

CREATE TABLE default.test_9 (col INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\u001C'
STORED AS ORC;

A DESCRIBE FORMATTED statement will now show the storage information as

| # Storage Information | NULL | NULL |
| SerDe Library: | org.apache.hadoop.hive.ql.io.orc.OrcSerde | NULL |
| InputFormat: | org.apache.hadoop.hive.ql.io.orc.OrcInputFormat | NULL |
| OutputFormat: | org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat | NULL |

And now if you try to write data into test_9 from anywhere, you will be able to. Hope that helps!
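For completeness, the command that produces the storage information shown above is simply:

DESCRIBE FORMATTED default.test_9;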
03-19-2018 09:04 PM
Your flow looks fine to me. The only thing that you need to take care of is merging ALL the fragments while using MergeContent, so use the Defragment merge strategy. Defragment works off the fragment.identifier, fragment.index and fragment.count attributes that SplitJson writes, so it will pick up ALL the flow files created from a single document by SplitJson and make sure that all of them are merged before moving further. A sample MergeContent config follows. Let me know if that works for you or if you need any other help. PS - If you need any help with your existing flow, refer to this link.
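Since the config screenshot does not come through in this archive, a minimal sketch of the MergeContent properties that matter for this case (the values other than the merge strategy are assumptions; everything else can stay at its default):

Merge Strategy: Defragment
Merge Format: Binary Concatenation
Max Bin Age: 5 mins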
03-18-2018 02:09 PM
1 Kudo
@Pramod Kalvala You can schedule multiple jobs at a single given time, assuming you have the resources available to cater to all of them 🙂 Coming to how you can do that, there are multiple scheduling options available in NiFi. If you right-click on your "triggering processor", that is, the very first processor in your job, and click on "Configure", you will see a Scheduling tab. In its "Scheduling Strategy" drop-down you will find two major scheduling options: Timer Driven and Cron Driven. The Timer Driven strategy executes the processor according to the duration mentioned in the "Run Schedule". So with a run schedule of 1 second, the processor will run every second, triggering the next processors in the flow, provided they don't have a schedule of their own. Cron Driven is the strategy where you can mention a specific time of day, specific day(s), etc., i.e. the exact schedule on which that processor should execute. Let's say that you want to run your job at 1 PM every day - you would put the corresponding cron expression in the Run Schedule (see the sketch after this paragraph). You can have any number of jobs scheduled to run at 1 PM by using the same scheduling strategy for all of them, and all of them will run at the same time. All of those jobs are separate, will not interfere with each other unless you instruct them to do so, and will run without any issues, provided that you have sufficient resources for all of them.
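Since the screenshots are not reproduced here, a rough sketch of the two variants discussed (the cron value is just an example for 1 PM daily; NiFi cron expressions are Quartz-style, with a seconds field first):

Scheduling Strategy: Timer driven, Run Schedule: 1 sec        (runs every second)
Scheduling Strategy: CRON driven,  Run Schedule: 0 0 13 * * ? (runs at 13:00 every day)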
03-18-2018 01:54 PM
@Santanu Ghosh Your machine's configuration doesn't have any say in which instance you can use on AWS; you can choose any. Coming to your question regarding which AWS instance type you should select, m4.2xlarge is pretty close to what is recommended in version 2 of the practice exam guide. I understand that navigation may not be as fluid as on, say, a standalone installation or an actual cluster, but please note that this is roughly the kind of experience that you will get in the actual exam. So I recommend sticking to the current config and trying to attempt the given questions within the stipulated time of 2 hours. This actually gets you ready for the real exam experience, and you should factor the slow-VM constraint into your planning when deciding how to distribute your time 🙂 All the very best for your certification exam!
03-18-2018 06:28 AM
@Kok Ching Hoo First things first! Treat JSON as JSON and not as plain text - stop extracting text! 🙂 Now let's talk about the solution! You have a JSON whose structure looks like this:

{
  "XMLfile_2234": {
    "xsi:schemaLocation": "http://xml.mscibarra.com/random.xsd",
    "dataset_12232": {
      "entry": []
    }
  }
}

You want to pick the entry column, which is an array, out of it and then split the individual array elements into separate docs so that you can ultimately push them to Elasticsearch. So here is the step-by-step solution!

1. Assuming that the values after XMLfile_ and dataset_ may differ for different documents (even if they don't, this solution will still work, but since this happens in a lot of cases, let's take that case into consideration as well), first read the JSON document and cherry-pick only the entry column. How to do that? JoltTransformJSON is the best processor in NiFi for any JSON operation. Your complete Jolt specification:

[ {
  "operation": "shift",
  "spec": {
    "XMLFile_*.DAILY": {
      "*": {
        "entry": "entry"
      }
    }
  }
} ]
This will give you only the entry column from your data, i.e. a document of the form {"entry": [ ... ]}.

2. Now that you have just the entry column, simply use the SplitJson processor to split entry, the array, into individual documents (see the SplitJson sketch below).

3. The split relationship of SplitJson will carry the individualized array elements as separate flow files - each element of the original array is now its own flow file. Once you have these steps in your flow, you can re-route the data coming out of the split relationship of the SplitJson processor further, as per your use case. Hope that helps!
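Since the processor screenshots are not reproduced here, the one SplitJson property that matters for this step, assuming the Jolt output is exactly {"entry": [ ... ]} as above, would be roughly:

JsonPath Expression: $.entry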
03-18-2018 04:47 AM
Hello @Sai Krishna Makineni Updating existing records on HDFS/HBase has never been a trivial use case, and the implementation varies greatly on a per-use-case basis. Multiple options have been mentioned in the previous answers, each with its pros and cons. Here are my inputs.

1. Storing the data in HBase and exposing it using Hive is a very bad idea! Firing queries that don't use the RowKey column is a sin, and joins are a strict no-no. You need to understand that NoSQL databases are not meant for your usual RDBMS-like operations. In the RDBMS world, you first think of your design and the queries come into the picture later. In NoSQL, you first think of your queries and then design your schema accordingly. So having your data in HBase and relying on HBase to deduplicate it via "upserts" is the only good thing about that solution; your "random" SQL operations and join requirements will be marginalized. Is there still hope for HBase? Maybe, maybe not. This really depends on your data size, and even more on your requirements. For example, are your tables only a few gigs in size? Is your business OK with a certain level of stale data? If the answers to the above questions are YES, you can still go ahead with using HBase for deduplication - you don't need to do anything, HBase automatically takes care of it through its upsert logic - and then dump the contents of those HBase table(s) onto HDFS, creating a Hive table on top of that, let's say, once a day. Your data will be a day stale, but you will get far better performance than HBase+Hive or even HBase+Phoenix solutions.

2. Deduplication using Hive has its own consequences. Beware! Things can really go out of hand very easily here. Let's say that every time you dump a new file onto HDFS you run a window function such as a rank over "Last Modified Date", keep the "latest" record and delete the rest. The problem is that with every passing day your data only grows in size, so the time/resources taken by this job keep increasing, and at a certain point you may not want that operation to even trigger, since it will either eat a substantial amount of your cluster resources or take so long to process that it no longer makes business sense. And if you have a large table to begin with, I don't think this is even an option. Can you use Hive for deduplication at all? As with the first point about HBase, it totally depends on your use case. If you have a few columns that are good candidates for partitioning or bucketing columns, you should definitely use them. Let me give you an example! You identify the latest records using LastModifiedTS. Also, your table has a column called CreatedTS. You should use the CreatedTS column as the partitioning column if it fits the use case. You ask, what's the benefit? The next time you have a data dump on HDFS from NiFi, simply identify the unique CreatedTS values in that dump, then pick up only the partitions of the existing table that correspond to those CreatedTS values. You will realize that you are using only a fraction of the existing data for the "upsert" windowing operation, as compared to running a rank over the entire table. The operation stays sustainable over a longer period of time and takes far less time/resources to get the job done. This is the best option that you can use, if applicable, in your ingestion pattern (a sketch of the query follows).
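To make the windowing idea concrete, here is a minimal HiveQL sketch under hypothetical names (mytable, the business key id, the extra column col1, LastModifiedTS, and partition column CreatedTS are all assumptions, as are the partition values):

-- Keep only the latest version of each id, reading just the partitions
-- that the new dump actually touched instead of the whole table.
SELECT id, col1, LastModifiedTS, CreatedTS
FROM (
  SELECT id, col1, LastModifiedTS, CreatedTS,
         ROW_NUMBER() OVER (PARTITION BY id ORDER BY LastModifiedTS DESC) AS rn
  FROM mytable
  WHERE CreatedTS IN ('2018-03-17', '2018-03-18')  -- CreatedTS values seen in the new dump
) t
WHERE rn = 1;

The result set can then be written back over just those partitions (for example with a dynamic-partition INSERT OVERWRITE), which is what keeps the job size proportional to the new data rather than to the whole table.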
03-17-2018 07:55 AM
@Jandoubi Chaima 1. When you say EvaluateJSONPath doesn't work, what is the issue that you are facing? 2. Do you need your final data on HDFS in JSON format? 3. What is the query that you want in ReplaceText? DDL? DML?
03-17-2018 06:50 AM
1 Kudo
@Pavan M You can use the remove method on the session object. An example follows. Each GenerateFlowFile processor generates an empty flow file with an attribute called "a", valued 1, 2 and 3 respectively. Script in the ExecuteScript processor (Python/Jython):

flowFile = session.get()
if flowFile is not None:
    # Route based on the value of the 'a' attribute
    a = int(flowFile.getAttribute('a'))
    if a == 1:
        session.transfer(flowFile, REL_FAILURE)
    elif a == 2:
        session.transfer(flowFile, REL_SUCCESS)
    else:
        # Drop the flow file entirely - it goes to no relationship
        session.remove(flowFile)

So the flow files with "a" equal to 1 or 2 are redirected accordingly, while the others are removed. You can see that the SUCCESS and FAILURE relationships have 1 flow file each; the third one has been deleted. You can modify the above script according to your business logic and you should be good. Hope that helps! Thanks!