
How to Import Bulk Data from HDFS to HBase

Explorer

Hi,

I have a list of document files in HDFS containing .csv, Excel, image, PDF, and other formats.

I want to load these documents into an HBase table.

Please provide suggestions.

3 REPLIES

Explorer

I have this kind of unstructured data stored in HDFS. Please suggest how to load this unstructured data into an HBase table so the data can be viewed.

Master Mentor

@Nethaji R

Here is an example of loading a CSV file, using publicly available generated data.

# Generated sample data

Sample data is readily available from http://www.generatedata.com/

# Sample content of name.txt

Basically, the format is a name and an email address separated by a comma:

Maxwell,risus@Quisque.com
Alden,blandit.Nam.nulla@laciniamattisInteger.ca
Ignatius,non.bibendum@Cumsociisnatoque.com
Keaton,mollis.vitae.posuere@incursus.co.uk
Charles,tempor@idenimCurabitur.net
Jared,a@congueelit.net
Jonas,Suspendisse.ac@Nulla.ca 
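If you'd rather not download a file, a few lines of Python can generate a comparable name,email CSV. The names and domains below are made up for illustration; any two-column CSV will do:

```python
import csv
import random

# Hypothetical sample names and domains, just to produce test rows
first_names = ["Maxwell", "Alden", "Ignatius", "Keaton", "Charles"]
domains = ["example.com", "example.org", "example.net"]

# Write one "name,email" row per sample name
with open("name.txt", "w", newline="") as f:
    writer = csv.writer(f)
    for name in first_names:
        writer.writerow([name, f"{name.lower()}@{random.choice(domains)}"])
```

The first column becomes the HBase row key when the file is loaded with ImportTsv below.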

# Precreate the namespace

Invoke the HBase shell as the hbase user:

$ hbase shell 

Create a table whose column family matches the CSV file:

hbase(main):004:0> create 'jina','cf'
0 row(s) in 2.3580 seconds
=> Hbase::Table - jina 

# Create a directory in the hbase user's home in HDFS

$ hdfs dfs -mkdir  /user/hbase/test 

# Copy name.txt to HDFS

$ hdfs dfs -put name.txt /user/hbase/test 

Invoke the HBase load utility.

Load the CSV into HBase using ImportTsv.

You can track the job's progress in the YARN UI.

$ cd /usr/hdp/current/hbase-client/ 
$ bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator="," -Dimporttsv.columns=HBASE_ROW_KEY,cf jina /user/hbase/test/name.txt
.....
.....
2019-01-28 10:49:09,708 INFO  [main] zookeeper.RecoverableZooKeeper: Process identifier=hconnection-0x69e1dd28 connecting to ZooKeeper ensemble=nanyuki.dunnya.com:2181
2019-01-28 10:49:09,719 INFO  [main] zookeeper.ZooKeeper: Client environment:zookeeper.version=3.4.6-292--1, built on 05/11/2018 07:15 GMT
2019-01-28 10:49:09,719 INFO  [main] zookeeper.ZooKeeper: Client environment:host.name=nanyuki.dunnya.com
2019-01-28 10:49:09,719 INFO  [main] zookeeper.ZooKeeper: Client environment:java.version=1.8.0_112
2019-01-28 10:49:09,719 INFO  [main] zookeeper.ZooKeeper: Client environment:java.vendor=Oracle Corporation
2019-01-28 10:49:09,719 INFO  [main] zookeeper.ZooKeeper: Client environment:java.home=/usr/jdk64/jdk1.8.0_112/jre
.......
2019-01-28 12:06:14,837 INFO  [main] mapreduce.Job:  map 0% reduce 0%
2019-01-28 12:06:33,197 INFO  [main] mapreduce.Job:  map 100% reduce 0%
2019-01-28 12:06:40,926 INFO  [main] mapreduce.Job: Job job_1548672281316_0003 completed successfully
2019-01-28 12:06:41,640 INFO  [main] mapreduce.Job: Counters: 31
        File System Counters
                FILE: Number of bytes read=0
                FILE: Number of bytes written=186665
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=3727
                HDFS: Number of bytes written=0
                HDFS: Number of read operations=2
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=0
        Job Counters
                Launched map tasks=1
                Data-local map tasks=1
                Total time spent by all maps in occupied slots (ms)=31502
                Total time spent by all reduces in occupied slots (ms)=0
                Total time spent by all map tasks (ms)=15751
                Total vcore-milliseconds taken by all map tasks=15751
                Total megabyte-milliseconds taken by all map tasks=24193536
        Map-Reduce Framework
                Map input records=100
                Map output records=100
                Input split bytes=118
                Spilled Records=0
                Failed Shuffles=0
                Merged Map outputs=0
                GC time elapsed (ms)=125
                CPU time spent (ms)=2590
                Physical memory (bytes) snapshot=279126016
                Virtual memory (bytes) snapshot=3279044608
                Total committed heap usage (bytes)=176160768
        ImportTsv
                Bad Lines=0
        File Input Format Counters
                Bytes Read=3609
        File Output Format Counters
                Bytes Written=0 

# Now scan the HBase table

hbase(main):005:0>  scan 'jina'
ROW                           COLUMN+CELL
 Alden                        column=cf:, timestamp=1548673532506, value=imperdiet.non@euarcu.edu
 Alfonso                      column=cf:, timestamp=1548673532506, value=sed.leo.Cras@elit.net
 Amal                         column=cf:, timestamp=1548673532506, value=scelerisque.scelerisque@nisisem.net
 Aquila                       column=cf:, timestamp=1548673532506, value=orci@arcu.com
 Armando                      column=cf:, timestamp=1548673532506, value=egestas@vel.ca
 Avram                        column=cf:, timestamp=1548673532506, value=Morbi.quis@ornare.edu
 Basil                        column=cf:, timestamp=1548673532506, value=ligula.Aenean.euismod@arcuvel.org
 Brandon                      column=cf:, timestamp=1548673532506, value=Quisque@malesuada.co.uk
 Brendan                      column=cf:, timestamp=1548673532506, value=ut.dolor.dapibus@senectus.net
 Brock                        column=cf:, timestamp=1548673532506, value=libero.Donec@vehiculaet.com
 Burton                       column=cf:, timestamp=1548673532506, value=In.tincidunt.congue@turpis.org
 Cade                         column=cf:, timestamp=1548673532506, value=quis.lectus@Curae.com
 Cairo                        column=cf:, timestamp=1548673532506, value=est.ac.facilisis@ligula.net
 Calvin                       column=cf:, timestamp=1548673532506, value=ante.Maecenas.mi@magnaSuspendisue.org
 Castor                       column=cf:, timestamp=1548673532506, value=orci.Ut.semper@enim.net
 Cedric                       column=cf:, timestamp=1548673532506, value=Maecenas.iaculis@bibendum.edu
 Charles                      column=cf:, timestamp=1548673532506, value=in@nibh.co.uk
 Clark                        column=cf:, timestamp=1548673532506, value=amet.risus@maurisMorbi.co.uk
 Cyrus                        column=cf:, timestamp=1548673532506, value=odio@ipsumCurabitur.org
 Daquan                       column=cf:, timestamp=1548673532506, value=dolor.sit@nequenonquam.net
 Deacon                       column=cf:, timestamp=1548673532506, value=bibendum.sed@egetvenenatis.ca
 Dieter                       column=cf:, timestamp=1548673532506, value=ac@interdumfeugiatSed.com
 Eagan                        column=cf:, timestamp=1548673532506, value=molestie.Sed.id@pellentesddictum.com
 Elliott                      column=cf:, timestamp=1548673532506, value=gravida.sagittis.Duis@miDuisrisus.com
 Erich                        column=cf:, timestamp=1548673532506, value=mauris.Suspendisse@Sedid.co.uk
 Francis                      column=cf:, timestamp=1548673532506, value=eu.odio.Phasellus@eu.org
 Garrison                     column=cf:, timestamp=1548673532506, value=malesuada.vel@nuncullamcorpereu.org
 Geoffrey                     column=cf:, timestamp=1548673532506, value=amet@est.com
 Gray                         column=cf:, timestamp=1548673532506, value=condimentum@ligulaconsuerrhoncus.org
 Hamilton                     column=cf:, timestamp=1548673532506, value=tortor@lacusCrasinterdum.ca
 Henry                        column=cf:, timestamp=1548673532506, value=velit.in@augueeutempor.ca
 Hoyt                         column=cf:, timestamp=1548673532506, value=tristique.senectus@Inornasagittis.net
..........
 Sylvester                    column=cf:, timestamp=1548673532506, value=Morbi.quis@dis.co.uk
 Tate                         column=cf:, timestamp=1548673532506, value=purus.ac.tellus@Nullanissiaecenas.com
 Theodore                     column=cf:, timestamp=1548673532506, value=Mauris.nulla.Integer@vestibuluris.net
 Thomas                       column=cf:, timestamp=1548673532506, value=fringilla.est@adipiscing.org
 Victor                       column=cf:, timestamp=1548673532506, value=eleifend.vitae.erat@velarcbitur.co.uk
 Wayne                        column=cf:, timestamp=1548673532506, value=sed.turpis.nec@vel.ca
 Zane                         column=cf:, timestamp=1548673532506, value=vel.pede@Integertinciduntaliquam.net
 Zeus                         column=cf:, timestamp=1548673532506, value=ac.risus.Morbi@Duisvolutpat.ca
89 row(s) in 0.5300 seconds

Voila, your CSV file is now in HBase!

Explorer

Hi,

Thank you for your reply.

I tried this method to insert CSV data into an HBase table and it works fine.

My question is: I have a list of flat files (Word, Excel, images) in my HDFS directory, and I want to store all of them in one HBase table as objects. I still haven't found a solution to this problem; please provide any suggestions.

Thank you
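For the binary-file case, one common pattern is to store each document's raw bytes in a cell, with the MIME type alongside so a client knows how to render it. Below is a minimal sketch using the happybase Python client; the table name `docs`, the column family `cf`, and the local file paths are assumptions, not part of the thread above. For files already sitting in HDFS you would first read them down with `hdfs dfs -get` or an HDFS client library, and for documents larger than a few MB it is usually better to use HBase's MOB feature or store only an HDFS path reference in the cell:

```python
import os
import mimetypes

def build_doc_cell(path):
    """Read a file and build the row key and cell payload for HBase.

    Row key = file name; the raw bytes go into cf:data and the
    guessed MIME type into cf:mime so a viewer knows the format.
    """
    with open(path, "rb") as f:
        data = f.read()
    mime = mimetypes.guess_type(path)[0] or "application/octet-stream"
    row_key = os.path.basename(path).encode("utf-8")
    return row_key, {b"cf:data": data, b"cf:mime": mime.encode("utf-8")}

def load_file(table, path):
    """Write one document into an HBase table via a happybase Table."""
    row_key, cells = build_doc_cell(path)
    table.put(row_key, cells)

# Typical usage (requires a running HBase Thrift server and the
# happybase package; names here are illustrative):
#   import happybase
#   conn = happybase.Connection("localhost")   # Thrift, default port 9090
#   table = conn.table("docs")                 # precreated: create 'docs','cf'
#   for name in os.listdir("/data/docs"):
#       load_file(table, os.path.join("/data/docs", name))
```

With this layout a single `scan 'docs'` shows every document's bytes and type in one table, which matches the "one table as objects" goal described above.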