<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Hbase data ingestion in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hbase-data-ingestion/m-p/106065#M42350</link>
    <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/12351/mansour-ramy.html" nodeid="12351"&gt;@Ramy Mansour&lt;/A&gt; &lt;/P&gt;&lt;P&gt;If the Phoenix schema you are going to map to the HBase table has a composite primary key, data types other than strings, or secondary indexes, then use CsvBulkLoadTool; otherwise you can go ahead with ImportTsv, which performs better. The remaining optimizations help in both cases, so you can apply them either way.&lt;/P&gt;</description>
    <pubDate>Tue, 04 Oct 2016 12:51:16 GMT</pubDate>
    <dc:creator>rchintaguntla</dc:creator>
    <dc:date>2016-10-04T12:51:16Z</dc:date>
    <item>
      <title>Hbase data ingestion</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hbase-data-ingestion/m-p/106059#M42344</link>
      <description>&lt;P&gt;We have a 250 GB CSV file that contains 60 million records and roughly 600 columns. The file currently lives in HDFS, and we are trying to ingest it into HBase with a Phoenix table on top of it.&lt;/P&gt;&lt;P&gt;The approach we have tried so far was to create a Hive table backed by HBase and then execute an overwrite command in Hive, which ingests the data into HBase.&lt;/P&gt;&lt;P&gt;The biggest problem is that the job currently takes about 3-4 days to run! It is running on a 10-node cluster of medium-spec machines (30 GB of RAM and 2 TB of disk per node). Any advice on how to speed this up, or on different methods that could be more efficient?&lt;/P&gt;</description>
      <pubDate>Fri, 30 Sep 2016 02:25:18 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hbase-data-ingestion/m-p/106059#M42344</guid>
      <dc:creator>ramym</dc:creator>
      <dc:date>2016-09-30T02:25:18Z</dc:date>
    </item>
    <item>
      <title>Re: Hbase data ingestion</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hbase-data-ingestion/m-p/106060#M42345</link>
      <description>&lt;P&gt;@&lt;A href="https://community.hortonworks.com/users/12351/mansour-ramy.html"&gt;Ramy Mansour&lt;/A&gt;&lt;/P&gt;&lt;P&gt;I found this an interesting read; it could help!&lt;/P&gt;&lt;P&gt;&lt;A href="http://cdn.oreillystatic.com/en/assets/1/event/119/Bulk%20Loading%20Your%20Big%20Data%20into%20Apache%20HBase,%20a%20Full%20Walkthrough%20Presentation.pdf"&gt;Link&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 30 Sep 2016 02:55:19 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hbase-data-ingestion/m-p/106060#M42345</guid>
      <dc:creator>Shelton</dc:creator>
      <dc:date>2016-09-30T02:55:19Z</dc:date>
    </item>
    <item>
      <title>Re: Hbase data ingestion</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hbase-data-ingestion/m-p/106061#M42346</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/12351/mansour-ramy.html" nodeid="12351"&gt;@Ramy Mansour&lt;/A&gt;&lt;/P&gt;&lt;P&gt;You can create the table directly in Phoenix and load the data using CsvBulkLoadTool.&lt;/P&gt;&lt;P&gt;&lt;A href="http://phoenix.apache.org/bulk_dataload.html#Loading_via_MapReduce" target="_blank"&gt;http://phoenix.apache.org/bulk_dataload.html#Loading_via_MapReduce&lt;/A&gt;&lt;/P&gt;&lt;P&gt;With your data there will be thousands of mappers running. The number of reducers depends on the number of regions, so to increase parallelization you can pre-split the table by providing split points in the DDL statement. You can also enable compression on the table to reduce I/O and the amount of data shuffled during the bulk load.&lt;/P&gt;&lt;P&gt;&lt;A href="http://phoenix.apache.org/language/index.html#create_table" target="_blank"&gt;http://phoenix.apache.org/language/index.html#create_table&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Alternatively, you can use the ImportTsv and completebulkload tools to load the data into the HBase table directly.&lt;/P&gt;&lt;P&gt;&lt;A href="https://hbase.apache.org/book.html#importtsv" target="_blank"&gt;https://hbase.apache.org/book.html#importtsv&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;A href="https://hbase.apache.org/book.html#completebulkload" target="_blank"&gt;https://hbase.apache.org/book.html#completebulkload&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Here are some additional properties that can be added to mapred-site.xml to improve job performance:&lt;/P&gt;&lt;PRE&gt;&amp;lt;property&amp;gt;
  &amp;lt;name&amp;gt;mapreduce.map.output.compress&amp;lt;/name&amp;gt;
  &amp;lt;value&amp;gt;true&amp;lt;/value&amp;gt;
&amp;lt;/property&amp;gt;
&amp;lt;property&amp;gt;
  &amp;lt;name&amp;gt;mapreduce.map.output.compress.codec&amp;lt;/name&amp;gt;
  &amp;lt;value&amp;gt;org.apache.hadoop.io.compress.SnappyCodec&amp;lt;/value&amp;gt;
&amp;lt;/property&amp;gt;&lt;/PRE&gt;</description>
      <pubDate>Fri, 30 Sep 2016 13:15:55 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hbase-data-ingestion/m-p/106061#M42346</guid>
      <dc:creator>rchintaguntla</dc:creator>
      <dc:date>2016-09-30T13:15:55Z</dc:date>
    </item>
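    <!-- Editor's sketch of the two bulk-load paths described in the reply above. Table names,
         column names, paths, and split points are hypothetical; adjust them for your schema
         and cluster. These commands require a running Hadoop/HBase/Phoenix cluster.

    # Path 1: Phoenix CsvBulkLoadTool, needed for composite primary keys, typed
    # columns, or secondary indexes. Pre-split the table in the DDL, e.g.:
    #   CREATE TABLE EXAMPLE (id VARCHAR PRIMARY KEY, cf.c1 VARCHAR)
    #     SPLIT ON ('g', 'n', 'u');
    hadoop jar phoenix-client.jar \
        org.apache.phoenix.mapreduce.CsvBulkLoadTool \
        --table EXAMPLE \
        --input /data/example.csv

    # Path 2: plain HBase ImportTsv writing HFiles, then completebulkload to
    # move them into the regions. Faster when all columns are simple strings.
    hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
        -Dimporttsv.separator=',' \
        -Dimporttsv.columns=HBASE_ROW_KEY,cf:c1,cf:c2 \
        -Dimporttsv.bulk.output=/tmp/example_hfiles \
        exampletable /data/example.csv
    hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles \
        /tmp/example_hfiles exampletable
    -->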
    <item>
      <title>Re: Hbase data ingestion</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hbase-data-ingestion/m-p/106062#M42347</link>
      <description>&lt;P&gt;&lt;A href="https://community.hortonworks.com/users/12351/mansour-ramy.html"&gt;@Ramy Mansour&lt;/A&gt;&lt;/P&gt;&lt;P&gt;It seems that your job does not use any parallelism. In addition to the options suggested in this thread, splitting the CSV input file into multiple parts could also help; it would do manually what Phoenix would otherwise do for you. The number of chunks should be chosen based on the resources you want to use, but with your cluster you could probably split the file into at least 25 parts of 10 GB each.&lt;/P&gt;</description>
      <pubDate>Sat, 01 Oct 2016 00:45:22 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hbase-data-ingestion/m-p/106062#M42347</guid>
      <dc:creator>cstanca</dc:creator>
      <dc:date>2016-10-01T00:45:22Z</dc:date>
    </item>
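    <!-- Editor's sketch of the manual chunking suggested above, using the standard GNU
         `split` utility. File names and chunk counts are illustrative; for the 250 GB
         file you would split by size instead, e.g. `split -C 10G data.csv part_`.

    ```shell
    # Split a CSV into line-aligned chunks so no record is cut in half.
    printf 'a,1\nb,2\nc,3\nd,4\n' > data.csv
    split -n l/2 data.csv part_    # 2 chunks, split only on line boundaries
    wc -l part_*                   # each chunk holds whole lines
    ```

    Concatenating the parts reproduces the original file, so each chunk can be
    bulk-loaded independently and in parallel.
    -->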
    <item>
      <title>Re: Hbase data ingestion</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hbase-data-ingestion/m-p/106063#M42348</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/3486/cstanca.html" nodeid="3486"&gt;@Constantin Stanca&lt;/A&gt; &lt;/P&gt;&lt;P&gt;Thanks for the insight. Based on your comment, does Phoenix chunk the data automatically if we ingest it through Phoenix?&lt;/P&gt;</description>
      <pubDate>Tue, 04 Oct 2016 01:51:56 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hbase-data-ingestion/m-p/106063#M42348</guid>
      <dc:creator>ramym</dc:creator>
      <dc:date>2016-10-04T01:51:56Z</dc:date>
    </item>
    <item>
      <title>Re: Hbase data ingestion</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hbase-data-ingestion/m-p/106064#M42349</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/425/rchintaguntla.html" nodeid="425"&gt;@Rajeshbabu Chintaguntla&lt;/A&gt; &lt;/P&gt;&lt;P&gt;Thanks for that detailed post; there seem to be two really good approaches there.&lt;/P&gt;&lt;P&gt;Which approach would likely provide better performance? It seems like CsvBulkLoadTool might be better than ImportTsv, but I wanted to verify.&lt;/P&gt;</description>
      <pubDate>Tue, 04 Oct 2016 01:54:35 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hbase-data-ingestion/m-p/106064#M42349</guid>
      <dc:creator>ramym</dc:creator>
      <dc:date>2016-10-04T01:54:35Z</dc:date>
    </item>
    <item>
      <title>Re: Hbase data ingestion</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hbase-data-ingestion/m-p/106065#M42350</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/12351/mansour-ramy.html" nodeid="12351"&gt;@Ramy Mansour&lt;/A&gt; &lt;/P&gt;&lt;P&gt;If the Phoenix schema you are going to map to the HBase table has a composite primary key, data types other than strings, or secondary indexes, then use CsvBulkLoadTool; otherwise you can go ahead with ImportTsv, which performs better. The remaining optimizations help in both cases, so you can apply them either way.&lt;/P&gt;</description>
      <pubDate>Tue, 04 Oct 2016 12:51:16 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hbase-data-ingestion/m-p/106065#M42350</guid>
      <dc:creator>rchintaguntla</dc:creator>
      <dc:date>2016-10-04T12:51:16Z</dc:date>
    </item>
    <item>
      <title>Re: Hbase data ingestion</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hbase-data-ingestion/m-p/106066#M42351</link>
      <description>&lt;P&gt;Thanks &lt;A href="https://community.hortonworks.com/questions/59159/hbase-data-ingestion.html#"&gt;@Rajeshbabu Chintaguntla&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 05 Oct 2016 20:56:42 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hbase-data-ingestion/m-p/106066#M42351</guid>
      <dc:creator>ramym</dc:creator>
      <dc:date>2016-10-05T20:56:42Z</dc:date>
    </item>
    <item>
      <title>Re: Hbase data ingestion</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hbase-data-ingestion/m-p/106067#M42352</link>
      <description>&lt;P&gt;Yes. It does. Phoenix was designed to allow that level of parallelism and data locality.&lt;/P&gt;</description>
      <pubDate>Tue, 20 Dec 2016 23:58:42 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hbase-data-ingestion/m-p/106067#M42352</guid>
      <dc:creator>coneal77</dc:creator>
      <dc:date>2016-12-20T23:58:42Z</dc:date>
    </item>
  </channel>
</rss>

