Created 09-12-2018 12:27 PM
I have this table with what I believe is a nested column.
I created this table with the statement:
create table testtbl stored as AVRO TBLPROPERTIES ('avro.schema.url'='hdfs://testhost:8020/tmp/avroschemas/testtbl.json');
testtbl.json looks like:
{
  "type" : "record",
  "name" : "testtbl",
  "namespace" : "orgn.data.domain",
  "fields" : [ {
    "name" : "id",
    "type" : {
      "type" : "record",
      "name" : "Key",
      "fields" : [ {
        "name" : "TId",
        "type" : "string"
      }, {
        "name" : "action",
        "type" : "string"
      }, {
        "name" : "createdTS",
        "type" : {
          "type" : "long",
          "logicalType" : "timestamp-millis"
        }
      } ]
    }
  }, {
    "name" : "CId",
    "type" : "string"
  }, {
    "name" : "ANumber",
    "type" : "string"
  } ]
}
Can somebody give me a valid insert statement to insert one row into the table?
Appreciate the help.
Created 09-12-2018 01:10 PM
Try the insert statement below:
0: jdbc:hive2://abcd:10000> with t as (select NAMED_STRUCT('tid','1','action','success', 'createdts',current_timestamp) as id, '1' as cid, '12345' as anumber)
0: jdbc:hive2://abcd:10000> insert into testtbl select * from t;
No rows affected (20.464 seconds)
0: jdbc:hive2://abcd:10000> select * from testtbl;
+-----------------------------------------------------------------------+--------------+------------------+--+
|                               testtbl.id                              | testtbl.cid  | testtbl.anumber  |
+-----------------------------------------------------------------------+--------------+------------------+--+
| {"tid":"1","action":"success","createdts":"2018-09-12 15:06:27.075"}  | 1            | 12345            |
+-----------------------------------------------------------------------+--------------+------------------+--+
Created 09-12-2018 01:12 PM
Since the first column of your table is a struct type, you need to use the named_struct function while inserting the data.
Table definition:
hive> desc testtbl;
+-----------+-------------------------------------------------------+----------+--+
| col_name  |                       data_type                       | comment  |
+-----------+-------------------------------------------------------+----------+--+
| id        | struct<tid:string,action:string,createdts:timestamp>  |          |
| cid       | string                                                |          |
| anumber   | string                                                |          |
+-----------+-------------------------------------------------------+----------+--+
Inserting data into testtbl:
hive> insert into testtbl select named_struct('tid',"1",'action',"post",'createdts',timestamp(150987427)),string("1241"),string("124") from(select '1')t;
Selecting data from the table:
hive> select * from testtbl;
+--------------------------------------------------------------------+--------------+------------------+--+
|                             testtbl.id                             | testtbl.cid  | testtbl.anumber  |
+--------------------------------------------------------------------+--------------+------------------+--+
| {"tid":"1","action":"post","createdts":"1970-01-02 12:56:27.427"}  | 1241         | 124              |
+--------------------------------------------------------------------+--------------+------------------+--+
-
If the answer helped resolve your issue, click the Accept button below to accept it. That helps community users find solutions to these kinds of issues quickly.
Created 09-12-2018 02:31 PM
naresh and shu, thanks so much - both the statements worked!
one more question: if i have data files (for similar avro tables) being sent to a directory in hdfs (through kafka/flume), what is the best way to load them into the table?
is there any way i can configure it so that data is picked up automatically from the directory path?
appreciate the feedback.
Created 09-12-2018 03:09 PM
I assume the raw data is in text and you want to convert and load it into Avro tables.
If so, you can create another identical table stored as text and specify the delimiters in the data.
i.e.,
create table staging(id struct<tid:string,action:string,createdts:timestamp>, cid string, anumber string) row format delimited fields terminated by ',' collection items terminated by '|' stored as textfile;
sample text data can be as below
1|success|150987428888,3,12345
insert into testtbl select * from staging;
If Kafka or Flume is generating Avro files directly, then those files can be written to the table path directly. It's better to create an external table if source files are written directly to the table path.
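For reference, the staging layout above (struct members joined by the collection delimiter `|`, top-level columns joined by the field delimiter `,`) can be produced with a small helper. This is a sketch only, not from the thread; the function name is hypothetical:

```python
def staging_line(tid, action, created_ts, cid, anumber):
    """Format one record for the text staging table: the struct's members
    are joined by the collection delimiter '|', and the three top-level
    columns (id struct, cid, anumber) by the field delimiter ','."""
    id_struct = "|".join([tid, action, created_ts])
    return ",".join([id_struct, cid, anumber])

# Reproduces the sample line from the post above.
print(staging_line("1", "success", "150987428888", "3", "12345"))
# -> 1|success|150987428888,3,12345
```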
Created 09-12-2018 02:38 PM
and is it not possible to insert values without another table (t) like :
hive> insert into testtbl values NAMED_STRUCT('tid','3','action','success', 'createdts',150987428888) as id ,'3' as cid,'12345' as anumber;
FAILED: ParseException line 1:27 extraneous input 'NAMED_STRUCT' expecting ( near ')' line 1:107 missing EOF at 'as' near ')'
Created 09-12-2018 03:01 PM
It's not possible to use functions in an insert into table values statement.
Created 09-12-2018 03:36 PM
>create table staging(id struct<tid:string,action:string,createdts:timestamp>, cid string, anumber string) row format delimited fields terminated by ',' collection items terminated by '|' stored as textfile;
>sample text data can be as below
>1|success|150987428888,3,12345
>insert into testtbl select * from staging;
how is the text data loaded into the staging table?
Also is it possible to use the 'load data' command in this context : load data inpath '/tmp/test.csv' into table testtbl;
Appreciate the clarification.
Created 09-12-2018 03:55 PM
Yes. File content will be
# hadoop fs -cat /tmp/data1.txt
1|success|2018-09-12 17:45:39.69,3,12345
Then you need to load the content into staging table using below command
load data inpath '/tmp/data1.txt' into table staging;
Then from staging, you need to load it into actual avro table using below command
insert into testtbl select * from staging;
If my answer helped you to resolve your issue, you can accept it. It will be helpful for others.
Created 09-12-2018 03:57 PM
ok, i used the "load data" command to load the data into the staging table. selecting from the table i can see below output :
hive> select * from staging;
OK
{"tid":"1","action":"success","createdts":null}	3	12345
Time taken: 0.398 seconds, Fetched: 1 row(s)
Is that good? I am kind of concerned about the curly braces and the field names showing up in the resulting data.
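Note the null `createdts` in that output: the staging file's third struct member was the raw epoch value `150987428888`, which Hive's text SerDe cannot parse as a timestamp — it expects the `yyyy-MM-dd HH:mm:ss[.SSS]` text layout. A small Python helper (hypothetical, not from the thread) to render epoch milliseconds in that layout before writing staging files:

```python
from datetime import datetime, timezone

def epoch_millis_to_hive_ts(ms):
    """Render epoch milliseconds in Hive's default timestamp text layout
    (yyyy-MM-dd HH:mm:ss.SSS), using UTC to keep the result deterministic."""
    dt = datetime.fromtimestamp(ms // 1000, tz=timezone.utc)
    return dt.strftime("%Y-%m-%d %H:%M:%S.") + f"{ms % 1000:03d}"

# The thread's sample epoch value, rendered so the struct member parses:
print(epoch_millis_to_hive_ts(150987428888))  # 1974-10-14 12:57:08.888
```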
Created 09-12-2018 03:58 PM
so is there no way to load the data automatically from the files coming into a particular directory in hdfs?
Created 09-12-2018 04:06 PM
You can create an external table with a LOCATION and write the text files directly to that path.
e.g., create external table staging1(id struct<tid:string,action:string,createdts:timestamp>, cid string, anumber string) row format delimited fields terminated by ',' collection items terminated by '|' stored as textfile LOCATION '/tmp/staging/';
All text files can be written directly to /tmp/staging/ by Kafka or Flume.
If Kafka or Flume is able to generate Avro files, then you can skip the staging table, create an external Avro table, and write the Avro files directly to the external table location.
Created 09-12-2018 05:02 PM
naresh, you are the man!!! thanks so much!!!
Created 09-12-2018 06:53 PM
any idea what is wrong with this :
CREATE EXTERNAL TABLE staging3
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ('avro.schema.url'='hdfs:///tmp/avroschemas/testtbl.json')
LOCATION '/tmp/staging';
I am getting :
FAILED: ParseException line 7:0 missing EOF at 'LOCATION' near ')'
Created 09-12-2018 06:58 PM
It's because TBLPROPERTIES should be the last clause. Use the below and it should help:
+++++++++++++++++
CREATE EXTERNAL TABLE staging3
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/tmp/staging'
TBLPROPERTIES ('avro.schema.url'='hdfs:///tmp/avroschemas/testtbl.json');
+++++++++++++++++
Created 09-12-2018 07:38 PM
so with a table created as above, how should the data be formatted to be able to load it in? in what format, that is? because i am not specifying any delimiters etc. appreciate the insights.
Created 09-13-2018 12:54 PM
@Mahesh Balakrishnan @Naresh P R @Shu
can i have feedback on how the data should be formatted to be loaded (with the load data inpath command) into a table created as:
CREATE EXTERNAL TABLE staging3
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/tmp/staging'
TBLPROPERTIES ('avro.schema.url'='hdfs:///tmp/avroschemas/testtbl.json');
The schema is the same as described earlier. But there is no delimiter specified.
Appreciate if you could provide a sample piece of data for this.
Created 09-13-2018 03:00 PM
There are multiple ways to populate Avro tables.
1) insert into avro_table values(<col1>,<col2>,...,<colN>) -- this way Hive will write the Avro files.
2) Generating Avro files and copying them directly to '/tmp/staging'. You can read the Avro documentation on writing Avro files directly to an HDFS path. The Avro Reader/Writer APIs take care of storing and retrieving records; we don't need to explicitly specify delimiters for Avro files.