Insert into an HBase table with multiple column families using NiFi

Rising Star

I want to insert data into HBase from a flowfile using NiFi. Does PutHBaseCell support HBase tables with multiple column families? Say I have created an HBase table with two column families: cf1 (column1, column2, column3) and cf2 (column4, column5).

How do I specify the "Column Family" and "Column Qualifier" properties in the PutHBaseCell configuration?

Where do I specify the mapping between the flowfile (a text file with pipe-separated values) and the HBase table? The flowfile will have pipe-separated columns, and I want to store a subset of the columns in each column family.

Regards,

Indranil Roy

1 ACCEPTED SOLUTION

Master Guru

PutHBaseCell writes a single cell to HBase and uses the content of the FlowFile as the value of the cell. The column family and column qualifier come from properties in the processor.

If you want to write multiple values, use PutHBaseJSON, which takes a flat JSON document and uses the field names as column qualifiers and the value of each field as the value for that column qualifier. The column family is a property in the processor.

PutHBaseJSON doesn't support writing to multiple column families, so you would need to take your original data and split it into two JSON documents, one for column family 1 and one for column family 2. You could then have two PutHBaseJSON processors, one for each column family, or you could have a single processor where the column family is set to ${col.family} and set an attribute "col.family" on each flow file upstream to specify which column family goes with that flow file.
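
A minimal sketch of that split, written as plain Python rather than NiFi code. The column names and the cf1/cf2 grouping come from the question; the function name `split_by_family`, the `row_id` field, and the assumption that the first pipe-separated field is the row id are hypothetical and only there for illustration. Each returned pair stands for one flow file: the value for the "col.family" attribute and the flat JSON content.

```python
import json

# Column layout taken from the question: cf1 holds column1..column3,
# cf2 holds column4..column5.
FAMILIES = {
    "cf1": ["column1", "column2", "column3"],
    "cf2": ["column4", "column5"],
}


def split_by_family(line, row_id_field="row_id"):
    """Turn one pipe-delimited line into one flat JSON document per column family.

    Assumption (not from the thread): the first field of the line is the
    row id, followed by column1..column5 in order.
    """
    fields = line.rstrip("\n").split("|")
    row_id, values = fields[0], fields[1:]
    names = [name for cols in FAMILIES.values() for name in cols]
    if len(values) != len(names):
        raise ValueError("expected %d values, got %d" % (len(names), len(values)))
    by_name = dict(zip(names, values))

    documents = []
    for family, cols in FAMILIES.items():
        # Every document carries the same row id so both families land in one row.
        doc = {row_id_field: row_id}
        doc.update((col, by_name[col]) for col in cols)
        # (value for the col.family attribute, flat JSON content)
        documents.append((family, json.dumps(doc)))
    return documents


if __name__ == "__main__":
    for family, content in split_by_family("r1|a|b|c|d|e"):
        print(family, content)
```

Keeping the family-to-columns mapping in one dictionary means a third column family only needs a new entry, not a new code path.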

7 REPLIES

Rising Star

@Bryan Bende

As you mentioned, PutHBaseCell is used to write a single cell to HBase and it uses the content of the FlowFile as the value of the cell. Now, if my input flowfile has, say, 50 lines of pipe-separated values, will it insert those lines into 50 cells with 50 different row IDs, or will it put all of them into the same row?

Master Guru

If you use PutHBaseCell with a FlowFile that has 50 lines, all 50 lines will be written as the value of one cell (row id, col fam, col qual).

PutHBaseCell has no idea what the content of the FlowFile is; it takes the content as a byte[] and sticks it in one cell.
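
To make that concrete, here is a toy Python model of the behavior, not HBase or NiFi code. The `put_cell` function and the dictionary standing in for a table are hypothetical; the point is only that the whole content lands under a single (row id, column family, column qualifier) key.

```python
def put_cell(table, row_id, family, qualifier, content):
    """Toy stand-in for a single-cell write: the entire content becomes one value."""
    table[(row_id, family, qualifier)] = bytes(content)
    return table


# Build 50 pipe-separated lines as one blob of bytes, like a FlowFile's content.
content = "\n".join("%d|name%d|skill%d" % (i, i, i) for i in range(50)).encode("utf-8")

table = put_cell({}, "row1", "cf1", "column1", content)

cell_count = len(table)
lines_in_cell = table[("row1", "cf1", "column1")].count(b"\n") + 1
print(cell_count, lines_in_cell)
```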

Rising Star

@Bryan Bende

In such a scenario, since we want to store the rows with different row IDs, is there a workaround? If I understand correctly, using PutHBaseJSON might help. Is there a processor available to convert the pipe-delimited source file into JSON that PutHBaseJSON can consume to insert multiple values?

Master Guru

It depends on what you want to do; there are a lot of options.

If you want to store one line of piped values as the value of a cell, you could use SplitText with a line count of 1 to get each line into its own flow file, then send each of those to PutHBaseCell and set the Row Id property to something unique like ${uuid} or whatever you want.

If you want the piped values to represent multiple cells within one row, then you need to convert each line of piped text to JSON somehow. You probably still need to split each line as described above, then use something like ExtractText and ReplaceText to create JSON (https://github.com/hortonworks-gallery/nifi-templates/blob/master/templates/csv-to-json-flow.xml), or use the ExecuteScript processor with a Groovy or Python script that converts your piped line to JSON.
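
As a rough illustration of the conversion step only, here is a standalone Python sketch. It is not an ExecuteScript script: the processor's scripting API is left out, and the function name `lines_to_json`, the field names, and the sample data are assumptions chosen for the example.

```python
import json


def lines_to_json(text, field_names):
    """Return one flat JSON document per non-empty pipe-delimited line."""
    documents = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        values = line.split("|")
        if len(values) != len(field_names):
            raise ValueError(
                "line %r has %d fields, expected %d"
                % (line, len(values), len(field_names))
            )
        # Field names become JSON keys, i.e. the column qualifiers.
        documents.append(json.dumps(dict(zip(field_names, values))))
    return documents


if __name__ == "__main__":
    sample = "1|alice|etl\n2|bob|reporting\n"  # hypothetical sample data
    for doc in lines_to_json(sample, ["id", "name", "skill"]):
        print(doc)
```

Each returned string corresponds to what one flow file would carry after the split, so one of the fields (here `id`) can serve as the row id.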

Rising Star

@Bryan Bende

Thanks for the input, it really helped a lot in our case. Say I have 2 rows in my table:

1|Indranil|ETL

2|Joy|Reporting

I want to convert it to JSON so that I can insert each row into multiple cells in a single HBase row.

This is my converted JSON:

{ "Personal":
  [
    { "id":"1", "name":"Indranil", "Skill":"ETL" },
    { "id":"2", "name":"Joy", "Skill":"Reporting" }
  ]
}

Is this JSON in the correct format to be consumed by PutHBaseJSON? My end goal is to insert all the values in a row into different cells. "Personal" refers to the column family and "id" refers to the "Row Identifier Field Name".

Master Guru

@INDRANIL ROY

Here is a template that shows how to get the data formatted properly: delimitedtohbase.xml

The first two processors (GenerateFlowFile and ReplaceText) just create fake data every 30 seconds; you would replace them with wherever your data is coming from.
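
On the JSON posted above: PutHBaseJSON was described earlier in this thread as taking a flat JSON document, with the column family set as a processor property, so a wrapper object holding an array of rows would likely need to be broken into one flat document per row first. The Python sketch below shows that reshaping under those assumptions; it is not taken from the attached template, and the function name `flatten_rows` is hypothetical.

```python
import json


def flatten_rows(nested_json):
    """Return (family, flat_json) pairs, one per row in each top-level array."""
    nested = json.loads(nested_json)
    result = []
    for family, rows in nested.items():
        for row in rows:
            # Each row is already flat; only the wrapper and array are removed.
            result.append((family, json.dumps(row)))
    return result


nested = (
    '{"Personal": ['
    '{"id": "1", "name": "Indranil", "Skill": "ETL"}, '
    '{"id": "2", "name": "Joy", "Skill": "Reporting"}'
    "]}"
)

for family, doc in flatten_rows(nested):
    print(family, doc)
```

The family name ("Personal") would then go into the processor's column family setting or a flow file attribute, and "id" would be named as the row identifier field.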