Created on 05-10-2016 09:12 AM - edited 09-16-2022 03:18 AM
1)I want to try the examples in the below link related to JSON.when i check in internet i found so many JSON SerDe's. Not sure which one i need to use and also what is the difference between each one? http://thornydev.blogspot.in/2013/07/querying-json-records-via-hive.html
2)I am new to hive and how to load data into hive table.When i check in Internet i found using Load and also using Insert statement. Not sure which one i need to use and also what is the difference between each one?
Created 05-10-2016 03:33 PM
The Jar cited in the doc should already be on your cluster since it's the default SerDe. I just checked in my HDP sandbox and it's present here : /usr/hdp/current/hive-webhcat/share/hcatalog/hive-hcatalog-core.jar
I also tested it and everything is working from my side:
0: jdbc:hive2://sandbox.hortonworks.com:10000> CREATE TABLE my_table(a string, b bigint) 0: jdbc:hive2://sandbox.hortonworks.com:10000> ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'; No rows affected (0.841 seconds) 0: jdbc:hive2://sandbox.hortonworks.com:10000> insert into table my_table values ('row1',1); INFO : Session is already open INFO : Dag name: insert into table my_table values ('row1',1)(Stage-1) INFO : Tez session was closed. Reopening... INFO : Session re-established. INFO : INFO : Status: Running (Executing on YARN cluster with App id application_1462551928453_0043) INFO : Map 1: -/- INFO : Map 1: 0/1 INFO : Map 1: 0(+1)/1 INFO : Map 1: 1/1 INFO : Loading data to table default.my_table from hdfs://sandbox.hortonworks.com:8020/apps/hive/warehouse/my_table/.hive-staging_hive_2016-05-10_15-26-38_253_2589783258292691814-2/-ext-10000 INFO : Table default.my_table stats: [numFiles=1, numRows=1, totalSize=19, rawDataSize=0] No rows affected (9.278 seconds) 0: jdbc:hive2://sandbox.hortonworks.com:10000>
As you can see data get created in HDFS :
[root@sandbox tmp]# hdfs dfs -ls /apps/hive/warehouse/my_table Found 1 items -rwxrwxrwx 3 hive hdfs 19 2016-05-10 15:26 /apps/hive/warehouse/my_table/000000_0 [root@sandbox tmp]# hdfs dfs -cat /apps/hive/warehouse/my_table/000000_0 {"a":"row1","b":1}
And hive has the data:
0: jdbc:hive2://sandbox.hortonworks.com:10000> select * from my_table; +-------------+-------------+--+ | my_table.a | my_table.b | +-------------+-------------+--+ | row1 | 1 | +-------------+-------------+--+ 1 row selected (0.847 seconds)
For LOAD vs INSERT, as I said, use load if you have data already in a file (local or HDFS) that you want to add to your Hive Table.
If you want to use Insert Values, you need to read about ACID support in Hive and its limitation. Hive not a "standard" database. This is out of the scope of this question so I don't spend too match on it. Just look to the note from the doc: INSERT...VALUES is available starting in Hive 0.14. Inserting values from SQL statements can only be performed on tables that support ACID. See Hive Transactions for details.
Does this answer your question ?
Created 05-10-2016 11:54 AM
LOAD by itself doesn't do any data transformations, it essentially takes files and puts them in the hive table directory. ( So you need to be sure to have created your hive table with the correct storage options for the data. )
If you need to transform data you need to create an external table and then use the insert into statement to transform your data.
Created 05-10-2016 11:56 AM
1) You can start with the JSON SerDe that's cited in the Hive documentation: https://cwiki.apache.org/confluence/display/Hive/Json+SerDe
CREATE TABLE my_table(a string, b bigint, ...) ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' STORED AS TEXTFILE;
2) Load and insert are used in different scenarios:
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]
If your data is in your local file system use 'LOCAL'. In this case data will be copied from you local file system into the HDFS directory of your Hive table (by default in the Hive warehouse folder). If your data is already in HDFS, do not use LOCAL but note that this moves (and does not copy) your HDFS files from their current path to the HDFS directory of your Hive table.
You have also other options such as OVERWRITE to decide if you want to delete existing data in you Hive table or if you want to append.
INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1 FROM from_statement; INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 FROM from_statement;
Here again you have choice between appending new data or replacing everything (overwrite)
Created 05-10-2016 01:46 PM
Thanks for input
1)I do not see any download option in https://cwiki.apache.org/confluence/display/Hive/Json+SerDe and where to download the jar file
2)when i check below link i can see below sql statement.My question is when we will use below insert statement and normal load statement?
insert into table testTableNew values (1,'row1'),(2,'row2'),(3,'row3');
http://sanjivblogs.blogspot.com/2014/12/transactions-are-available-in-hive-014.html?view=classic
Created 05-10-2016 03:33 PM
The Jar cited in the doc should already be on your cluster since it's the default SerDe. I just checked in my HDP sandbox and it's present here : /usr/hdp/current/hive-webhcat/share/hcatalog/hive-hcatalog-core.jar
I also tested it and everything is working from my side:
0: jdbc:hive2://sandbox.hortonworks.com:10000> CREATE TABLE my_table(a string, b bigint) 0: jdbc:hive2://sandbox.hortonworks.com:10000> ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'; No rows affected (0.841 seconds) 0: jdbc:hive2://sandbox.hortonworks.com:10000> insert into table my_table values ('row1',1); INFO : Session is already open INFO : Dag name: insert into table my_table values ('row1',1)(Stage-1) INFO : Tez session was closed. Reopening... INFO : Session re-established. INFO : INFO : Status: Running (Executing on YARN cluster with App id application_1462551928453_0043) INFO : Map 1: -/- INFO : Map 1: 0/1 INFO : Map 1: 0(+1)/1 INFO : Map 1: 1/1 INFO : Loading data to table default.my_table from hdfs://sandbox.hortonworks.com:8020/apps/hive/warehouse/my_table/.hive-staging_hive_2016-05-10_15-26-38_253_2589783258292691814-2/-ext-10000 INFO : Table default.my_table stats: [numFiles=1, numRows=1, totalSize=19, rawDataSize=0] No rows affected (9.278 seconds) 0: jdbc:hive2://sandbox.hortonworks.com:10000>
As you can see data get created in HDFS :
[root@sandbox tmp]# hdfs dfs -ls /apps/hive/warehouse/my_table Found 1 items -rwxrwxrwx 3 hive hdfs 19 2016-05-10 15:26 /apps/hive/warehouse/my_table/000000_0 [root@sandbox tmp]# hdfs dfs -cat /apps/hive/warehouse/my_table/000000_0 {"a":"row1","b":1}
And hive has the data:
0: jdbc:hive2://sandbox.hortonworks.com:10000> select * from my_table; +-------------+-------------+--+ | my_table.a | my_table.b | +-------------+-------------+--+ | row1 | 1 | +-------------+-------------+--+ 1 row selected (0.847 seconds)
For LOAD vs INSERT, as I said, use load if you have data already in a file (local or HDFS) that you want to add to your Hive Table.
If you want to use Insert Values, you need to read about ACID support in Hive and its limitation. Hive not a "standard" database. This is out of the scope of this question so I don't spend too match on it. Just look to the note from the doc: INSERT...VALUES is available starting in Hive 0.14. Inserting values from SQL statements can only be performed on tables that support ACID. See Hive Transactions for details.
Does this answer your question ?
Created 05-12-2016 05:02 PM
@vamsi valiveti have you tried this solution ?