Support Questions
Find answers, ask questions, and share your expertise

How to Import Bulk Data from HDFS to HBase

Explorer

Hi

I have a list of document files in HDFS; it contains .csv, Excel, image, and PDF files, etc.

I want to load these documents into an HBase table.

Please provide your suggestions.

3 REPLIES

Explorer

I have this kind of unstructured data stored in HDFS. Please suggest how to load this unstructured data into an HBase table so that it can be viewed.

Mentor

@Nethaji R

Here is an example of loading a CSV file, using publicly available generated data.

# Generated sample data

There is sample data readily available from http://www.generatedata.com/

# Sample content of name.txt

The format is basically a name and an email separated by a comma:

Maxwell,risus@Quisque.com
Alden,blandit.Nam.nulla@laciniamattisInteger.ca
Ignatius,non.bibendum@Cumsociisnatoque.com
Keaton,mollis.vitae.posuere@incursus.co.uk
Charles,tempor@idenimCurabitur.net
Jared,a@congueelit.net
Jonas,Suspendisse.ac@Nulla.ca 
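Since ImportTsv reports any malformed input under its Bad Lines counter, it can be worth sanity-checking the file before loading it. A minimal standalone sketch (pure Python, assuming the two-field name,email layout shown above):

```python
def count_bad_lines(lines, expected_fields=2, sep=","):
    """Count lines that do not split into the expected number of fields."""
    bad = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if len(line.split(sep)) != expected_fields:
            bad += 1
    return bad

sample = [
    "Maxwell,risus@Quisque.com",
    "Jonas,Suspendisse.ac@Nulla.ca",
]
print(count_bad_lines(sample))  # a clean file prints 0
```

A non-zero result here would show up as a non-zero "Bad Lines" counter in the ImportTsv job output below.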

# Precreate the namespace

Invoke the hbase shell as the hbase user:

$ hbase shell 

The table's column family should match the CSV layout:

hbase(main):004:0> create 'jina','cf'
0 row(s) in 2.3580 seconds
=> Hbase::Table - jina 

# Create a directory in the hbase user's home in HDFS

$ hdfs dfs -mkdir  /user/hbase/test 

# Copy name.txt to HDFS

$ hdfs dfs -put name.txt /user/hbase/test 

Invoke the HBase load utility.

Load the CSV into HBase using ImportTsv.

Track the job's progress in the YARN UI.

$ cd /usr/hdp/current/hbase-client/ 
$ bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator="," -Dimporttsv.columns=HBASE_ROW_KEY,cf jina /user/hbase/test/name.txt
.....
.....
2019-01-28 10:49:09,708 INFO  [main] zookeeper.RecoverableZooKeeper: Process identifier=hconnection-0x69e1dd28 connecting to ZooKeeper ensemble=nanyuki.dunnya.com:2181
2019-01-28 10:49:09,719 INFO  [main] zookeeper.ZooKeeper: Client environment:zookeeper.version=3.4.6-292--1, built on 05/11/2018 07:15 GMT
2019-01-28 10:49:09,719 INFO  [main] zookeeper.ZooKeeper: Client environment:host.name=nanyuki.dunnya.com
2019-01-28 10:49:09,719 INFO  [main] zookeeper.ZooKeeper: Client environment:java.version=1.8.0_112
2019-01-28 10:49:09,719 INFO  [main] zookeeper.ZooKeeper: Client environment:java.vendor=Oracle Corporation
2019-01-28 10:49:09,719 INFO  [main] zookeeper.ZooKeeper: Client environment:java.home=/usr/jdk64/jdk1.8.0_112/jre
.......
2019-01-28 12:06:14,837 INFO  [main] mapreduce.Job:  map 0% reduce 0%
2019-01-28 12:06:33,197 INFO  [main] mapreduce.Job:  map 100% reduce 0%
2019-01-28 12:06:40,926 INFO  [main] mapreduce.Job: Job job_1548672281316_0003 completed successfully
2019-01-28 12:06:41,640 INFO  [main] mapreduce.Job: Counters: 31
        File System Counters
                FILE: Number of bytes read=0
                FILE: Number of bytes written=186665
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=3727
                HDFS: Number of bytes written=0
                HDFS: Number of read operations=2
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=0
        Job Counters
                Launched map tasks=1
                Data-local map tasks=1
                Total time spent by all maps in occupied slots (ms)=31502
                Total time spent by all reduces in occupied slots (ms)=0
                Total time spent by all map tasks (ms)=15751
                Total vcore-milliseconds taken by all map tasks=15751
                Total megabyte-milliseconds taken by all map tasks=24193536
        Map-Reduce Framework
                Map input records=100
                Map output records=100
                Input split bytes=118
                Spilled Records=0
                Failed Shuffles=0
                Merged Map outputs=0
                GC time elapsed (ms)=125
                CPU time spent (ms)=2590
                Physical memory (bytes) snapshot=279126016
                Virtual memory (bytes) snapshot=3279044608
                Total committed heap usage (bytes)=176160768
        ImportTsv
                Bad Lines=0
        File Input Format Counters
                Bytes Read=3609
        File Output Format Counters
                Bytes Written=0 
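Conceptually, each input line is mapped according to -Dimporttsv.columns: the field matched to HBASE_ROW_KEY becomes the row key, and each remaining field lands in the named column. A rough pure-Python sketch of that per-line mapping (an illustration, not the actual ImportTsv code):

```python
def map_line(line, columns=("HBASE_ROW_KEY", "cf"), sep=","):
    """Mimic ImportTsv's per-line mapping: pair each field with its column spec."""
    fields = line.strip().split(sep)
    row_key = None
    cells = {}
    for spec, value in zip(columns, fields):
        if spec == "HBASE_ROW_KEY":
            row_key = value
        else:
            cells[spec] = value
    return row_key, cells

print(map_line("Maxwell,risus@Quisque.com"))
# → ('Maxwell', {'cf': 'risus@Quisque.com'})
```

This is why the scan below shows the name as the ROW and the email as the cell value under cf.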

# Now scan the hbase table

hbase(main):005:0>  scan 'jina'
ROW                           COLUMN+CELL
 Alden                        column=cf:, timestamp=1548673532506, value=imperdiet.non@euarcu.edu
 Alfonso                      column=cf:, timestamp=1548673532506, value=sed.leo.Cras@elit.net
 Amal                         column=cf:, timestamp=1548673532506, value=scelerisque.scelerisque@nisisem.net
 Aquila                       column=cf:, timestamp=1548673532506, value=orci@arcu.com
 Armando                      column=cf:, timestamp=1548673532506, value=egestas@vel.ca
 Avram                        column=cf:, timestamp=1548673532506, value=Morbi.quis@ornare.edu
 Basil                        column=cf:, timestamp=1548673532506, value=ligula.Aenean.euismod@arcuvel.org
 Brandon                      column=cf:, timestamp=1548673532506, value=Quisque@malesuada.co.uk
 Brendan                      column=cf:, timestamp=1548673532506, value=ut.dolor.dapibus@senectus.net
 Brock                        column=cf:, timestamp=1548673532506, value=libero.Donec@vehiculaet.com
 Burton                       column=cf:, timestamp=1548673532506, value=In.tincidunt.congue@turpis.org
 Cade                         column=cf:, timestamp=1548673532506, value=quis.lectus@Curae.com
 Cairo                        column=cf:, timestamp=1548673532506, value=est.ac.facilisis@ligula.net
 Calvin                       column=cf:, timestamp=1548673532506, value=ante.Maecenas.mi@magnaSuspendisue.org
 Castor                       column=cf:, timestamp=1548673532506, value=orci.Ut.semper@enim.net
 Cedric                       column=cf:, timestamp=1548673532506, value=Maecenas.iaculis@bibendum.edu
 Charles                      column=cf:, timestamp=1548673532506, value=in@nibh.co.uk
 Clark                        column=cf:, timestamp=1548673532506, value=amet.risus@maurisMorbi.co.uk
 Cyrus                        column=cf:, timestamp=1548673532506, value=odio@ipsumCurabitur.org
 Daquan                       column=cf:, timestamp=1548673532506, value=dolor.sit@nequenonquam.net
 Deacon                       column=cf:, timestamp=1548673532506, value=bibendum.sed@egetvenenatis.ca
 Dieter                       column=cf:, timestamp=1548673532506, value=ac@interdumfeugiatSed.com
 Eagan                        column=cf:, timestamp=1548673532506, value=molestie.Sed.id@pellentesddictum.com
 Elliott                      column=cf:, timestamp=1548673532506, value=gravida.sagittis.Duis@miDuisrisus.com
 Erich                        column=cf:, timestamp=1548673532506, value=mauris.Suspendisse@Sedid.co.uk
 Francis                      column=cf:, timestamp=1548673532506, value=eu.odio.Phasellus@eu.org
 Garrison                     column=cf:, timestamp=1548673532506, value=malesuada.vel@nuncullamcorpereu.org
 Geoffrey                     column=cf:, timestamp=1548673532506, value=amet@est.com
 Gray                         column=cf:, timestamp=1548673532506, value=condimentum@ligulaconsuerrhoncus.org
 Hamilton                     column=cf:, timestamp=1548673532506, value=tortor@lacusCrasinterdum.ca
 Henry                        column=cf:, timestamp=1548673532506, value=velit.in@augueeutempor.ca
 Hoyt                         column=cf:, timestamp=1548673532506, value=tristique.senectus@Inornasagittis.net
..........
 Sylvester                    column=cf:, timestamp=1548673532506, value=Morbi.quis@dis.co.uk
 Tate                         column=cf:, timestamp=1548673532506, value=purus.ac.tellus@Nullanissiaecenas.com
 Theodore                     column=cf:, timestamp=1548673532506, value=Mauris.nulla.Integer@vestibuluris.net
 Thomas                       column=cf:, timestamp=1548673532506, value=fringilla.est@adipiscing.org
Victor                        column=cf:, timestamp=1548673532506, value=eleifend.vitae.erat@velarcbitur.co.uk
 Wayne                        column=cf:, timestamp=1548673532506, value=sed.turpis.nec@vel.ca
 Zane                         column=cf:, timestamp=1548673532506, value=vel.pede@Integertinciduntaliquam.net
 Zeus                         column=cf:, timestamp=1548673532506, value=ac.risus.Morbi@Duisvolutpat.ca
89 row(s) in 0.5300 seconds

Voila, your CSV file is now in HBase!

Explorer

Hi,

Thank you for your reply.

I tried this method to insert CSV data into an HBase table, and it works fine.

My question is: I have a list of flat files (Word, Excel, image) in my HDFS directory, and I want to store all of these in one HBase table as objects. I still haven't found a solution to this problem; please provide any suggestions.

Thank you
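For what it's worth, HBase cell values are just byte arrays, so one common approach for files of modest size is to store each file's raw bytes as a cell value, keyed by the file name or path. A standalone sketch of preparing such a put; the column family/qualifier names and the happybase-style client call are assumptions for illustration, and the actual put needs a running cluster:

```python
import os

def file_to_put(path, family=b"cf", qualifier=b"data"):
    """Read a file's raw bytes and build (row_key, {column: value}) for an HBase put."""
    with open(path, "rb") as f:
        payload = f.read()
    row_key = os.path.basename(path).encode("utf-8")  # file name as row key
    return row_key, {family + b":" + qualifier: payload}

# row_key, cells = file_to_put("report.pdf")
# With an HBase client you would then issue the put, e.g. (happybase-style):
#   table.put(row_key, cells)
```

Note that very large binaries are a poor fit for HBase cells; a common pattern is to keep the file itself in HDFS and store only its path plus metadata in the table.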
