Created 01-25-2019 05:18 PM
Hi,
I have a list of document files in HDFS; it contains .csv, Excel, image, PDF, etc. files.
I want to load these documents into an HBase table.
Please provide suggestions.
Created 01-28-2019 10:47 AM
I have this kind of unstructured data stored in HDFS. Please suggest how to load this unstructured data into an HBase table so it can be viewed.
Created 01-28-2019 03:16 PM
Here is an example of loading a CSV file, using generated public data.
# Generate sample data
Sample data is readily available from http://www.generatedata.com/
# Sample content of name.txt
The format is a name and an email address, separated by a comma:
Maxwell,risus@Quisque.com
Alden,blandit.Nam.nulla@laciniamattisInteger.ca
Ignatius,non.bibendum@Cumsociisnatoque.com
Keaton,mollis.vitae.posuere@incursus.co.uk
Charles,tempor@idenimCurabitur.net
Jared,a@congueelit.net
Jonas,Suspendisse.ac@Nulla.ca
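To make it clear what ImportTsv will do with these lines, here is a minimal Python sketch of the mapping it performs: the first comma-separated field becomes the HBase row key and the second becomes a cell value under column family `cf`. This is an illustration only (ImportTsv actually does this inside a MapReduce job), and the `rows` dict is just a stand-in for the table.

```python
# Sketch of the per-line mapping ImportTsv performs on name.txt.
# First field -> row key, second field -> cell under column family 'cf'.
lines = """Maxwell,risus@Quisque.com
Alden,blandit.Nam.nulla@laciniamattisInteger.ca
Jonas,Suspendisse.ac@Nulla.ca""".splitlines()

rows = {}
for line in lines:
    # corresponds to -Dimporttsv.separator=,
    rowkey, email = line.split(",", 1)
    # corresponds to -Dimporttsv.columns=HBASE_ROW_KEY,cf
    rows[rowkey] = {"cf:": email}

print(rows["Maxwell"])   # {'cf:': 'risus@Quisque.com'}
```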
# Pre-create the table
Invoke the hbase shell as the hbase user:
$ hbase shell
The column family should match the CSV file:
hbase(main):004:0> create 'jina','cf'
0 row(s) in 2.3580 seconds
=> Hbase::Table - jina
# Create a directory in the hbase user's home in HDFS
$ hdfs dfs -mkdir /user/hbase/test
# Copy name.txt to HDFS
$ hdfs dfs -put name.txt /user/hbase/test
# Load the CSV into HBase
Invoke the HBase ImportTsv utility to load the CSV, passing the separator and column mapping as -D options.
You can track the job's progress in the YARN UI.
$ cd /usr/hdp/current/hbase-client/
$ bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=, -Dimporttsv.columns=HBASE_ROW_KEY,cf jina /user/hbase/test/name.txt
.....
.....
2019-01-28 10:49:09,708 INFO [main] zookeeper.RecoverableZooKeeper: Process identifier=hconnection-0x69e1dd28 connecting to ZooKeeper ensemble=nanyuki.dunnya.com:2181
2019-01-28 10:49:09,719 INFO [main] zookeeper.ZooKeeper: Client environment:zookeeper.version=3.4.6-292--1, built on 05/11/2018 07:15 GMT
2019-01-28 10:49:09,719 INFO [main] zookeeper.ZooKeeper: Client environment:host.name=nanyuki.dunnya.com
2019-01-28 10:49:09,719 INFO [main] zookeeper.ZooKeeper: Client environment:java.version=1.8.0_112
2019-01-28 10:49:09,719 INFO [main] zookeeper.ZooKeeper: Client environment:java.vendor=Oracle Corporation
2019-01-28 10:49:09,719 INFO [main] zookeeper.ZooKeeper: Client environment:java.home=/usr/jdk64/jdk1.8.0_112/jre
.......
2019-01-28 12:06:14,837 INFO [main] mapreduce.Job: map 0% reduce 0%
2019-01-28 12:06:33,197 INFO [main] mapreduce.Job: map 100% reduce 0%
2019-01-28 12:06:40,926 INFO [main] mapreduce.Job: Job job_1548672281316_0003 completed successfully
2019-01-28 12:06:41,640 INFO [main] mapreduce.Job: Counters: 31
File System Counters
    FILE: Number of bytes read=0
    FILE: Number of bytes written=186665
    FILE: Number of read operations=0
    FILE: Number of large read operations=0
    FILE: Number of write operations=0
    HDFS: Number of bytes read=3727
    HDFS: Number of bytes written=0
    HDFS: Number of read operations=2
    HDFS: Number of large read operations=0
    HDFS: Number of write operations=0
Job Counters
    Launched map tasks=1
    Data-local map tasks=1
    Total time spent by all maps in occupied slots (ms)=31502
    Total time spent by all reduces in occupied slots (ms)=0
    Total time spent by all map tasks (ms)=15751
    Total vcore-milliseconds taken by all map tasks=15751
    Total megabyte-milliseconds taken by all map tasks=24193536
Map-Reduce Framework
    Map input records=100
    Map output records=100
    Input split bytes=118
    Spilled Records=0
    Failed Shuffles=0
    Merged Map outputs=0
    GC time elapsed (ms)=125
    CPU time spent (ms)=2590
    Physical memory (bytes) snapshot=279126016
    Virtual memory (bytes) snapshot=3279044608
    Total committed heap usage (bytes)=176160768
ImportTsv
    Bad Lines=0
File Input Format Counters
    Bytes Read=3609
File Output Format Counters
    Bytes Written=0
# Now scan the hbase table
hbase(main):005:0> scan 'jina'
ROW          COLUMN+CELL
 Alden       column=cf:, timestamp=1548673532506, value=imperdiet.non@euarcu.edu
 Alfonso     column=cf:, timestamp=1548673532506, value=sed.leo.Cras@elit.net
 Amal        column=cf:, timestamp=1548673532506, value=scelerisque.scelerisque@nisisem.net
 Aquila      column=cf:, timestamp=1548673532506, value=orci@arcu.com
 Armando     column=cf:, timestamp=1548673532506, value=egestas@vel.ca
 Avram       column=cf:, timestamp=1548673532506, value=Morbi.quis@ornare.edu
 Basil       column=cf:, timestamp=1548673532506, value=ligula.Aenean.euismod@arcuvel.org
 Brandon     column=cf:, timestamp=1548673532506, value=Quisque@malesuada.co.uk
 Brendan     column=cf:, timestamp=1548673532506, value=ut.dolor.dapibus@senectus.net
 Brock       column=cf:, timestamp=1548673532506, value=libero.Donec@vehiculaet.com
 Burton      column=cf:, timestamp=1548673532506, value=In.tincidunt.congue@turpis.org
 Cade        column=cf:, timestamp=1548673532506, value=quis.lectus@Curae.com
 Cairo       column=cf:, timestamp=1548673532506, value=est.ac.facilisis@ligula.net
 Calvin      column=cf:, timestamp=1548673532506, value=ante.Maecenas.mi@magnaSuspendisue.org
 Castor      column=cf:, timestamp=1548673532506, value=orci.Ut.semper@enim.net
 Cedric      column=cf:, timestamp=1548673532506, value=Maecenas.iaculis@bibendum.edu
 Charles     column=cf:, timestamp=1548673532506, value=in@nibh.co.uk
 Clark       column=cf:, timestamp=1548673532506, value=amet.risus@maurisMorbi.co.uk
 Cyrus       column=cf:, timestamp=1548673532506, value=odio@ipsumCurabitur.org
 Daquan      column=cf:, timestamp=1548673532506, value=dolor.sit@nequenonquam.net
 Deacon      column=cf:, timestamp=1548673532506, value=bibendum.sed@egetvenenatis.ca
 Dieter      column=cf:, timestamp=1548673532506, value=ac@interdumfeugiatSed.com
 Eagan       column=cf:, timestamp=1548673532506, value=molestie.Sed.id@pellentesddictum.com
 Elliott     column=cf:, timestamp=1548673532506, value=gravida.sagittis.Duis@miDuisrisus.com
 Erich       column=cf:, timestamp=1548673532506, value=mauris.Suspendisse@Sedid.co.uk
 Francis     column=cf:, timestamp=1548673532506, value=eu.odio.Phasellus@eu.org
 Garrison    column=cf:, timestamp=1548673532506, value=malesuada.vel@nuncullamcorpereu.org
 Geoffrey    column=cf:, timestamp=1548673532506, value=amet@est.com
 Gray        column=cf:, timestamp=1548673532506, value=condimentum@ligulaconsuerrhoncus.org
 Hamilton    column=cf:, timestamp=1548673532506, value=tortor@lacusCrasinterdum.ca
 Henry       column=cf:, timestamp=1548673532506, value=velit.in@augueeutempor.ca
 Hoyt        column=cf:, timestamp=1548673532506, value=tristique.senectus@Inornasagittis.net
 ..........
 Sylvester   column=cf:, timestamp=1548673532506, value=Morbi.quis@dis.co.uk
 Tate        column=cf:, timestamp=1548673532506, value=purus.ac.tellus@Nullanissiaecenas.com
 Theodore    column=cf:, timestamp=1548673532506, value=Mauris.nulla.Integer@vestibuluris.net
 Thomas      column=cf:, timestamp=1548673532506, value=fringilla.est@adipiscing.org
 Victor      column=cf:, timestamp=1548673532506, value=eleifend.vitae.erat@velarcbitur.co.uk
 Wayne       column=cf:, timestamp=1548673532506, value=sed.turpis.nec@vel.ca
 Zane        column=cf:, timestamp=1548673532506, value=vel.pede@Integertinciduntaliquam.net
 Zeus        column=cf:, timestamp=1548673532506, value=ac.risus.Morbi@Duisvolutpat.ca
89 row(s) in 0.5300 seconds
Voila, your CSV file is now in HBase!
Created 01-28-2019 09:40 PM
Hi,
Thank you for your reply.
I tried this method to insert CSV data into an HBase table and it works fine.
My question is: I have a list of flat files (Word, Excel, images) in my HDFS directory, and I want to store all of them in one HBase table as objects. I still haven't found a solution for this problem; please provide any suggestions.
Thank you
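For the binary-file case, a common pattern is to model each document as one HBase row: the file name as the row key, the raw bytes in one column, and the content type in another. The sketch below (Python, standard library only) shows that row layout; it is an illustration under assumptions, not a working loader — the actual write would go through an HBase client API or gateway, which is not shown, and the `cf:content`/`cf:mime` column names are hypothetical. Note also that HBase cells are best kept to a few MB; very large binaries are often left in HDFS with only their path stored in HBase.

```python
import mimetypes
import os
import tempfile

def file_to_hbase_row(path):
    """Model one document as an HBase row: rowkey = file name,
    cf:content = raw bytes, cf:mime = guessed content type.
    (cf:content / cf:mime are illustrative column names.)"""
    with open(path, "rb") as f:
        content = f.read()
    mime, _ = mimetypes.guess_type(path)
    return os.path.basename(path), {
        "cf:content": content,
        "cf:mime": mime or "application/octet-stream",
    }

# Demo with a temporary "document" standing in for a file pulled from HDFS
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "report.pdf")
    with open(path, "wb") as f:
        f.write(b"%PDF-1.4 fake content")
    rowkey, cells = file_to_hbase_row(path)
    print(rowkey, cells["cf:mime"])   # report.pdf application/pdf
```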