Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

basic hive clarification

avatar
Expert Contributor

1)I want to try the examples in the below link related to JSON.when i check in internet i found so many JSON SerDe's. Not sure which one i need to use and also what is the difference between each one? http://thornydev.blogspot.in/2013/07/querying-json-records-via-hive.html

2)I am new to hive and how to load data into hive table.When i check in Internet i found using Load and also using Insert statement. Not sure which one i need to use and also what is the difference between each one?

1 ACCEPTED SOLUTION

avatar

@vamsi valiveti

The Jar cited in the doc should already be on your cluster since it's the default SerDe. I just checked in my HDP sandbox and it's present here : /usr/hdp/current/hive-webhcat/share/hcatalog/hive-hcatalog-core.jar

I also tested it and everything is working from my side:

0: jdbc:hive2://sandbox.hortonworks.com:10000> CREATE TABLE my_table(a string, b bigint)
0: jdbc:hive2://sandbox.hortonworks.com:10000> ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe';
No rows affected (0.841 seconds)
0: jdbc:hive2://sandbox.hortonworks.com:10000> insert into table my_table values ('row1',1);
INFO  : Session is already open
INFO  : Dag name: insert into table my_table values ('row1',1)(Stage-1)
INFO  : Tez session was closed. Reopening...
INFO  : Session re-established.
INFO  :


INFO  : Status: Running (Executing on YARN cluster with App id application_1462551928453_0043)


INFO  : Map 1: -/-
INFO  : Map 1: 0/1
INFO  : Map 1: 0(+1)/1
INFO  : Map 1: 1/1
INFO  : Loading data to table default.my_table from hdfs://sandbox.hortonworks.com:8020/apps/hive/warehouse/my_table/.hive-staging_hive_2016-05-10_15-26-38_253_2589783258292691814-2/-ext-10000
INFO  : Table default.my_table stats: [numFiles=1, numRows=1, totalSize=19, rawDataSize=0]
No rows affected (9.278 seconds)
0: jdbc:hive2://sandbox.hortonworks.com:10000>

As you can see data get created in HDFS :

[root@sandbox tmp]# hdfs dfs -ls /apps/hive/warehouse/my_table
Found 1 items
-rwxrwxrwx   3 hive hdfs         19 2016-05-10 15:26 /apps/hive/warehouse/my_table/000000_0
[root@sandbox tmp]# hdfs dfs -cat /apps/hive/warehouse/my_table/000000_0
{"a":"row1","b":1}

And hive has the data:

0: jdbc:hive2://sandbox.hortonworks.com:10000> select * from my_table;
+-------------+-------------+--+
| my_table.a  | my_table.b  |
+-------------+-------------+--+
| row1        | 1           |
+-------------+-------------+--+
1 row selected (0.847 seconds)

For LOAD vs INSERT, as I said, use load if you have data already in a file (local or HDFS) that you want to add to your Hive Table.

If you want to use Insert Values, you need to read about ACID support in Hive and its limitation. Hive not a "standard" database. This is out of the scope of this question so I don't spend too match on it. Just look to the note from the doc: INSERT...VALUES is available starting in Hive 0.14. Inserting values from SQL statements can only be performed on tables that support ACID. See Hive Transactions for details.

Does this answer your question ?

View solution in original post

5 REPLIES 5

avatar
Master Guru

LOAD by itself doesn't do any data transformations, it essentially takes files and puts them in the hive table directory. ( So you need to be sure to have created your hive table with the correct storage options for the data. )

If you need to transform data you need to create an external table and then use the insert into statement to transform your data.

avatar

Hi @vamsi valiveti

1) You can start with the JSON SerDe that's cited in the Hive documentation: https://cwiki.apache.org/confluence/display/Hive/Json+SerDe

CREATE TABLE my_table(a string, b bigint, ...)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE;

2) Load and insert are used in different scenarios:

  • Load is used to copy (or move) data from a file into you Hive table. This is used when you already have data in a file in HDFS or your local file system and you want to create a table to process it with SQL. Here you start by creating a Hive table and you do a load to get data from files into your Hive table.
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]

If your data is in your local file system use 'LOCAL'. In this case data will be copied from you local file system into the HDFS directory of your Hive table (by default in the Hive warehouse folder). If your data is already in HDFS, do not use LOCAL but note that this moves (and does not copy) your HDFS files from their current path to the HDFS directory of your Hive table.

You have also other options such as OVERWRITE to decide if you want to delete existing data in you Hive table or if you want to append.

  • Insert is used if you want to copy data from a Hive table to another table. Here let say you have table 1 with data in it. You want to create table 2 to get a specific part of you table 1 data. You can combine an insert and select statement to get data that you want from Table 1 and store it in table 2
INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1 FROM from_statement;
INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 FROM from_statement;

Here again you have choice between appending new data or replacing everything (overwrite)

avatar
Expert Contributor

Thanks for input

1)I do not see any download option in https://cwiki.apache.org/confluence/display/Hive/Json+SerDe and where to download the jar file

2)when i check below link i can see below sql statement.My question is when we will use below insert statement and normal load statement?

insert into table testTableNew values (1,'row1'),(2,'row2'),(3,'row3');

http://sanjivblogs.blogspot.com/2014/12/transactions-are-available-in-hive-014.html?view=classic

avatar

@vamsi valiveti

The Jar cited in the doc should already be on your cluster since it's the default SerDe. I just checked in my HDP sandbox and it's present here : /usr/hdp/current/hive-webhcat/share/hcatalog/hive-hcatalog-core.jar

I also tested it and everything is working from my side:

0: jdbc:hive2://sandbox.hortonworks.com:10000> CREATE TABLE my_table(a string, b bigint)
0: jdbc:hive2://sandbox.hortonworks.com:10000> ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe';
No rows affected (0.841 seconds)
0: jdbc:hive2://sandbox.hortonworks.com:10000> insert into table my_table values ('row1',1);
INFO  : Session is already open
INFO  : Dag name: insert into table my_table values ('row1',1)(Stage-1)
INFO  : Tez session was closed. Reopening...
INFO  : Session re-established.
INFO  :


INFO  : Status: Running (Executing on YARN cluster with App id application_1462551928453_0043)


INFO  : Map 1: -/-
INFO  : Map 1: 0/1
INFO  : Map 1: 0(+1)/1
INFO  : Map 1: 1/1
INFO  : Loading data to table default.my_table from hdfs://sandbox.hortonworks.com:8020/apps/hive/warehouse/my_table/.hive-staging_hive_2016-05-10_15-26-38_253_2589783258292691814-2/-ext-10000
INFO  : Table default.my_table stats: [numFiles=1, numRows=1, totalSize=19, rawDataSize=0]
No rows affected (9.278 seconds)
0: jdbc:hive2://sandbox.hortonworks.com:10000>

As you can see data get created in HDFS :

[root@sandbox tmp]# hdfs dfs -ls /apps/hive/warehouse/my_table
Found 1 items
-rwxrwxrwx   3 hive hdfs         19 2016-05-10 15:26 /apps/hive/warehouse/my_table/000000_0
[root@sandbox tmp]# hdfs dfs -cat /apps/hive/warehouse/my_table/000000_0
{"a":"row1","b":1}

And hive has the data:

0: jdbc:hive2://sandbox.hortonworks.com:10000> select * from my_table;
+-------------+-------------+--+
| my_table.a  | my_table.b  |
+-------------+-------------+--+
| row1        | 1           |
+-------------+-------------+--+
1 row selected (0.847 seconds)

For LOAD vs INSERT, as I said, use load if you have data already in a file (local or HDFS) that you want to add to your Hive Table.

If you want to use Insert Values, you need to read about ACID support in Hive and its limitation. Hive not a "standard" database. This is out of the scope of this question so I don't spend too match on it. Just look to the note from the doc: INSERT...VALUES is available starting in Hive 0.14. Inserting values from SQL statements can only be performed on tables that support ACID. See Hive Transactions for details.

Does this answer your question ?

avatar

@vamsi valiveti have you tried this solution ?