Created 01-25-2019 05:18 PM
Hi
I have a list of document files in HDFS; it contains .csv, Excel, image, PDF files, etc.
I want to load these documents into an HBase table.
Please provide suggestions.
Created 01-28-2019 10:47 AM
I have this kind of unstructured data stored in HDFS. Please suggest how to load this unstructured data into an HBase table so the data can be viewed.
Created 01-28-2019 03:16 PM
Here is an example of loading a CSV file, using generated public data.
# Generated sample data
There is data that is readily available from http://www.generatedata.com/
# Sample contents of name.txt
The format is basically name and email, separated by a comma:
Maxwell,risus@Quisque.com
Alden,blandit.Nam.nulla@laciniamattisInteger.ca
Ignatius,non.bibendum@Cumsociisnatoque.com
Keaton,mollis.vitae.posuere@incursus.co.uk
Charles,tempor@idenimCurabitur.net
Jared,a@congueelit.net
Jonas,Suspendisse.ac@Nulla.ca
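If you don't want to use generatedata.com, a similar name,email file can be produced locally. This is just a sketch; the names and domains below are made up, and it writes the same name.txt that the commands further down load:

```shell
# Hypothetical helper: generate a small name,email CSV similar to the
# generated sample above (names and domains are invented).
names="Maxwell Alden Ignatius Keaton Charles"
domains="example.com example.org example.net"
> name.txt
i=0
for n in $names; do
  i=$((i + 1))
  # cycle through the domain list so rows get different domains
  d=$(echo $domains | cut -d' ' -f$(( (i - 1) % 3 + 1 )))
  echo "$n,user$i@$d" >> name.txt
done
cat name.txt
```

Each row key (the name) must be unique, since HBase will otherwise overwrite earlier cells with the same key.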
# Pre-create the table
Invoke hbase shell as the hbase user
$ hbase shell
The table's column family should match the CSV file:
hbase(main):004:0> create 'jina','cf'
0 row(s) in 2.3580 seconds
=> Hbase::Table - jina
# Create a directory in the hbase user's home in HDFS
$ hdfs dfs -mkdir /user/hbase/test
# Copy name.txt to HDFS
$ hdfs dfs -put name.txt /user/hbase/test
# Invoke the HBase load utility
Load the CSV into HBase using ImportTsv; you can track the job's progress in the YARN UI.
$ cd /usr/hdp/current/hbase-client/
$ bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=',' -Dimporttsv.columns=HBASE_ROW_KEY,cf: jina /user/hbase/test/name.txt
.....
.....
2019-01-28 10:49:09,708 INFO [main] zookeeper.RecoverableZooKeeper: Process identifier=hconnection-0x69e1dd28 connecting to ZooKeeper ensemble=nanyuki.dunnya.com:2181
2019-01-28 10:49:09,719 INFO [main] zookeeper.ZooKeeper: Client environment:zookeeper.version=3.4.6-292--1, built on 05/11/2018 07:15 GMT
2019-01-28 10:49:09,719 INFO [main] zookeeper.ZooKeeper: Client environment:host.name=nanyuki.dunnya.com
2019-01-28 10:49:09,719 INFO [main] zookeeper.ZooKeeper: Client environment:java.version=1.8.0_112
2019-01-28 10:49:09,719 INFO [main] zookeeper.ZooKeeper: Client environment:java.vendor=Oracle Corporation
2019-01-28 10:49:09,719 INFO [main] zookeeper.ZooKeeper: Client environment:java.home=/usr/jdk64/jdk1.8.0_112/jre
.......
2019-01-28 12:06:14,837 INFO [main] mapreduce.Job: map 0% reduce 0%
2019-01-28 12:06:33,197 INFO [main] mapreduce.Job: map 100% reduce 0%
2019-01-28 12:06:40,926 INFO [main] mapreduce.Job: Job job_1548672281316_0003 completed successfully
2019-01-28 12:06:41,640 INFO [main] mapreduce.Job: Counters: 31
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=186665
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=3727
HDFS: Number of bytes written=0
HDFS: Number of read operations=2
HDFS: Number of large read operations=0
HDFS: Number of write operations=0
Job Counters
Launched map tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=31502
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=15751
Total vcore-milliseconds taken by all map tasks=15751
Total megabyte-milliseconds taken by all map tasks=24193536
Map-Reduce Framework
Map input records=100
Map output records=100
Input split bytes=118
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=125
CPU time spent (ms)=2590
Physical memory (bytes) snapshot=279126016
Virtual memory (bytes) snapshot=3279044608
Total committed heap usage (bytes)=176160768
ImportTsv
Bad Lines=0
File Input Format Counters
Bytes Read=3609
File Output Format Counters
Bytes Written=0
# Now scan the hbase table
hbase(main):005:0> scan 'jina'
ROW        COLUMN+CELL
 Alden      column=cf:, timestamp=1548673532506, value=imperdiet.non@euarcu.edu
 Alfonso    column=cf:, timestamp=1548673532506, value=sed.leo.Cras@elit.net
 Amal       column=cf:, timestamp=1548673532506, value=scelerisque.scelerisque@nisisem.net
 Aquila     column=cf:, timestamp=1548673532506, value=orci@arcu.com
 Armando    column=cf:, timestamp=1548673532506, value=egestas@vel.ca
 Avram      column=cf:, timestamp=1548673532506, value=Morbi.quis@ornare.edu
 Basil      column=cf:, timestamp=1548673532506, value=ligula.Aenean.euismod@arcuvel.org
 Brandon    column=cf:, timestamp=1548673532506, value=Quisque@malesuada.co.uk
 Brendan    column=cf:, timestamp=1548673532506, value=ut.dolor.dapibus@senectus.net
 Brock      column=cf:, timestamp=1548673532506, value=libero.Donec@vehiculaet.com
 Burton     column=cf:, timestamp=1548673532506, value=In.tincidunt.congue@turpis.org
 Cade       column=cf:, timestamp=1548673532506, value=quis.lectus@Curae.com
 Cairo      column=cf:, timestamp=1548673532506, value=est.ac.facilisis@ligula.net
 Calvin     column=cf:, timestamp=1548673532506, value=ante.Maecenas.mi@magnaSuspendisue.org
 Castor     column=cf:, timestamp=1548673532506, value=orci.Ut.semper@enim.net
 Cedric     column=cf:, timestamp=1548673532506, value=Maecenas.iaculis@bibendum.edu
 Charles    column=cf:, timestamp=1548673532506, value=in@nibh.co.uk
 Clark      column=cf:, timestamp=1548673532506, value=amet.risus@maurisMorbi.co.uk
 Cyrus      column=cf:, timestamp=1548673532506, value=odio@ipsumCurabitur.org
 Daquan     column=cf:, timestamp=1548673532506, value=dolor.sit@nequenonquam.net
 Deacon     column=cf:, timestamp=1548673532506, value=bibendum.sed@egetvenenatis.ca
 Dieter     column=cf:, timestamp=1548673532506, value=ac@interdumfeugiatSed.com
 Eagan      column=cf:, timestamp=1548673532506, value=molestie.Sed.id@pellentesddictum.com
 Elliott    column=cf:, timestamp=1548673532506, value=gravida.sagittis.Duis@miDuisrisus.com
 Erich      column=cf:, timestamp=1548673532506, value=mauris.Suspendisse@Sedid.co.uk
 Francis    column=cf:, timestamp=1548673532506, value=eu.odio.Phasellus@eu.org
 Garrison   column=cf:, timestamp=1548673532506, value=malesuada.vel@nuncullamcorpereu.org
 Geoffrey   column=cf:, timestamp=1548673532506, value=amet@est.com
 Gray       column=cf:, timestamp=1548673532506, value=condimentum@ligulaconsuerrhoncus.org
 Hamilton   column=cf:, timestamp=1548673532506, value=tortor@lacusCrasinterdum.ca
 Henry      column=cf:, timestamp=1548673532506, value=velit.in@augueeutempor.ca
 Hoyt       column=cf:, timestamp=1548673532506, value=tristique.senectus@Inornasagittis.net
 ..........
 Sylvester  column=cf:, timestamp=1548673532506, value=Morbi.quis@dis.co.uk
 Tate       column=cf:, timestamp=1548673532506, value=purus.ac.tellus@Nullanissiaecenas.com
 Theodore   column=cf:, timestamp=1548673532506, value=Mauris.nulla.Integer@vestibuluris.net
 Thomas     column=cf:, timestamp=1548673532506, value=fringilla.est@adipiscing.org
 Victor     column=cf:, timestamp=1548673532506, value=eleifend.vitae.erat@velarcbitur.co.uk
 Wayne      column=cf:, timestamp=1548673532506, value=sed.turpis.nec@vel.ca
 Zane       column=cf:, timestamp=1548673532506, value=vel.pede@Integertinciduntaliquam.net
 Zeus       column=cf:, timestamp=1548673532506, value=ac.risus.Morbi@Duisvolutpat.ca
89 row(s) in 0.5300 seconds
Voila, your CSV file is now in HBase!
Created 01-28-2019 09:40 PM
Hi
Thank you for your reply
I tried this method to insert CSV data into an HBase table, and it works fine.
My question is: I have a list of flat files, i.e. Word, Excel, and image files, in my HDFS directory, and I want to store all of them in one HBase table as objects. I still haven't found a solution for this problem; please provide any suggestions.
Thank you
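One possible approach for binary files (a sketch only, not tested against a live cluster): since ImportTsv expects text, you can base64-encode each file into a single field and reuse the same ImportTsv flow shown above, with the filename as the row key and the encoded bytes in a column such as cf:data. The table name docs_table, the docs/ directory, and the sample files below are hypothetical:

```shell
# Sketch: build a "rowkey<TAB>base64data" manifest from a directory of files,
# suitable for ImportTsv with -Dimporttsv.columns=HBASE_ROW_KEY,cf:data
mkdir -p docs
printf 'hello' > docs/a.txt            # stand-ins for your Word/Excel/image files
printf 'world' > docs/b.txt
> objects.tsv
for f in docs/*; do
  key=$(basename "$f")
  data=$(base64 < "$f" | tr -d '\n')   # strip line wrapping so the value is one field
  printf '%s\t%s\n' "$key" "$data" >> objects.tsv
done
cat objects.tsv
# Then, on the cluster (hypothetical table name):
#   hdfs dfs -put objects.tsv /user/hbase/test
#   bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
#     -Dimporttsv.columns=HBASE_ROW_KEY,cf:data docs_table /user/hbase/test/objects.tsv
```

A reader would have to base64-decode the cell value to get the original file back. Also note that HBase is not comfortable with very large cells (the usual guidance is to keep them well under ~10 MB), so for large documents a common alternative is to keep the file itself in HDFS and store only its path and metadata in HBase.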