
Import to HDFS or Hive

Super Collaborator
  1. Stack: HDP-2.3.2.0-2950, installed using Ambari 2.1
  2. The source is an MS SQL database of around 1.6 TB with around 25 tables
  3. The ultimate objective is to check whether the existing queries can run faster on HDP
  4. There isn't the luxury of time and availability to import the data several times, hence the import has to be done once and the Hive tables, queries, etc. experimented with afterwards; for example, first create a normal, partitioned ORC table, and if that doesn't suffice, try indexes and so on. Possibly we will also evaluate the Parquet format
  5. As a solution to 4., I decided to first import the tables onto HDFS in Avro format, for example:

sqoop import --connect 'jdbc:sqlserver://server;database=dbname' --username someuser --password somepassword --as-avrodatafile --num-mappers 8 --table tablename --warehouse-dir /dataload/tohdfs/ --verbose

Now I plan to create a Hive table, but I have some questions (mentioned here).

My question is: given all the points above, what is the safest approach (in terms of time and of not messing up HDFS, etc.) - to first bring the data onto HDFS, create Hive tables, and experiment, or to import directly into Hive? (I don't know whether, if I later delete these tables and want to start afresh, I would have to re-import the data.)

1 ACCEPTED SOLUTION

Master Guru

The good thing about Hadoop is that once you have the data in, and in a format that is correct (i.e. when you query it you can see all your data, the row count is correct, etc.), you can always transform it into any form you want.

To transform the data on HDFS into any Hive table:

a) Make an external table on the data folder, in your case stored as Avro:

CREATE EXTERNAL TABLE AVROTABLE ( columns ) STORED AS AVRO LOCATION '/tmp/myfolder';

https://cwiki.apache.org/confluence/display/Hive/AvroSerDe#AvroSerDe-Hive0.14andlater
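
For example, a minimal sketch, assuming the Avro files from the Sqoop import above landed under /dataload/tohdfs/tablename; only AVROTABLE appears in the original post, the column names and types here are purely illustrative:

-- Sketch only: an external table over the Sqoop-generated Avro files.
-- Column names and types are illustrative; on Hive 0.14+ you can also
-- omit the column list and supply the schema via the avro.schema.url
-- table property instead.
CREATE EXTERNAL TABLE avrotable (
  id          INT,
  customer    STRING,
  amount      DOUBLE,
  order_date  STRING
)
STORED AS AVRO
LOCATION '/dataload/tohdfs/tablename';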

b) Check that you can see all the data and that the data is correct.
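
For example, a couple of quick sanity checks against the hypothetical avrotable sketched above; the count should match the row count in the source SQL Server table:

SELECT COUNT(*) FROM avrotable;    -- compare against the source table's row count
SELECT * FROM avrotable LIMIT 10;  -- eyeball a few rows for column alignment, NULLs, encoding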

c) Create your performance tables ( ORC, partitioned, ... )

Just use a CTAS statement to create a new copy of the table in your new format:

CREATE TABLE X STORED AS ORC AS SELECT * FROM AVROTABLE;

Or create your new table first and then insert into it:

CREATE TABLE ORCTABLE ( columns) STORED AS ORC ... ;

INSERT INTO ORCTABLE SELECT * FROM AVROTABLE;
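
If you want partitioning and sorting right away (see the tips below), a sketch along these lines is one option; the table and column names are again illustrative, and the insert relies on dynamic partitioning:

SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- The partition column goes into PARTITIONED BY, not into the column list.
CREATE TABLE orctable_part (
  id        INT,
  customer  STRING,
  amount    DOUBLE
)
PARTITIONED BY (order_date STRING)
STORED AS ORC;

-- DISTRIBUTE BY groups each partition's rows together; SORT BY orders them
-- so ORC's built-in min/max indexes can skip data at query time.
INSERT INTO TABLE orctable_part PARTITION (order_date)
SELECT id, customer, amount, order_date
FROM avrotable
DISTRIBUTE BY order_date
SORT BY customer;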

If you load directly into Hive in the first step, just skip the CREATE EXTERNAL TABLE step.

Finally, some tips:

- In Hive, use ORC; in Impala, use Parquet

- Use partitioning based on your where conditions ( date perhaps )

- In HDP 2.2+ use the default ORC settings ( zlib compression, ... )

- Sort your data during insert by a column of your where condition ( customer, location, product ... )

- Forget about indexes; ORC and Parquet have built-in ones that work very similarly without the overhead

- Make sure you have a new version of HDP

- Use statistics ( the ANALYZE keyword; see the sketch after these tips )

- When you run a query, check in the Resource Manager or with hive.tez.exec.print.summary=true that you are utilizing the cluster nicely and don't have a couple of reducers blocking your query.
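
As a minimal sketch of the last two tips, reusing the hypothetical orctable_part from above (the exact ANALYZE variants available depend on your Hive version):

-- Gather basic and column statistics so the optimizer has something to work with.
ANALYZE TABLE orctable_part PARTITION (order_date) COMPUTE STATISTICS;
ANALYZE TABLE orctable_part PARTITION (order_date) COMPUTE STATISTICS FOR COLUMNS;

-- Print a Tez execution summary after each query to spot skewed or idle stages.
SET hive.tez.exec.print.summary = true;
SELECT customer, SUM(amount) AS total
FROM orctable_part
WHERE order_date = '2016-01-01'
GROUP BY customer;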

Have fun, Hive has gotten pretty awesome recently.

http://www.slideshare.net/BenjaminLeonhardi/hive-loading-data


4 REPLIES

Super Collaborator

That's a comprehensive reply, I'm grateful 🙂

So can I assume that the first-HDFS-then-whatever approach is correct, or ... ?

Master Guru

First HDFS or first Hive, it doesn't really matter. You can transform your data any way you want afterwards. CTAS tables are great. I personally would use HDFS first, because then you can easily use other tools like Pig on the raw data as well to do some data checking, cleaning, and transformations. Pig can be nice for checking formatting issues ( you can, for example, read delimited data as strings and apply regexes to it more easily than you could in Hive ).

Commonly in Hadoop I would do:

- Have a /data/raw/project1/timestamp folder which contains all your raw ingested data and keep it as long as you want.

- Run transformation jobs to create clean Hive/ORC tables

- Run any aggregation jobs on the ORC tables if possible ( faster ); you can, for example, export those to Tableau etc. ( see the sketch after this list )
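
As a sketch of that last step, reusing the hypothetical names from the earlier examples, an aggregation table might look like this:

-- Small pre-aggregated ORC table, handy for exporting to a BI tool.
CREATE TABLE daily_sales STORED AS ORC AS
SELECT order_date, customer, SUM(amount) AS total_amount
FROM orctable_part
GROUP BY order_date, customer;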

For Sqoop it doesn't matter so much, because the data that comes out of a database is normally already clean, so Hive directly works well; but for other data sources ( flat files ) you normally have some data quality issues that might require some Pig or whatever magic.

Contributor

Emil and Benjamin have covered the question thoroughly. I would add the following general point. When you import data into a Hive table, you must define a schema before loading. This is not likely to be a problem if the data originated in a DBMS. However, importing data first into HDFS allows you to load the data without defining a schema - you just load the data. You can apply the schema later on if you wish. In this way, loading into HDFS first gives you greater flexibility.
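
For illustration only (a sketch with made-up names and a made-up delimiter), "applying the schema later" to raw delimited files that already sit in HDFS can be as simple as projecting an external table onto them:

-- The files are untouched; the table is just metadata projected onto them.
CREATE EXTERNAL TABLE raw_orders (
  id        STRING,
  customer  STRING,
  amount    STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/raw/project1/orders';

-- Dropping this external table later removes only the metadata; the raw
-- files stay in place, so you can redefine the schema and start afresh.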

In your case, the schema is stored alongside the data in Avro anyway, so my point might be academic.