Created 09-12-2018 03:58 PM
so is there no way to load the data automatically from the files coming into a particular directory in hdfs?
Created 09-12-2018 04:06 PM
You can create an external table with a LOCATION and write the text files directly to that path.
e.g.,
create external table staging1 (
  id struct<tid:string,action:string,createdts:timestamp>,
  cid string,
  anumber string)
row format delimited
fields terminated by ','
collection items terminated by '|'
stored as textfile
location '/tmp/staging/';
All the text files can be written directly to /tmp/staging/ by Kafka or Flume.
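Given those delimiters (fields terminated by ',', struct/collection items by '|'), a single input line for the staging1 table would look like the following — the values themselves are made up purely for illustration:

```
t1|click|2018-09-12 15:58:00,c100,5551234
```

The first field is the id struct (tid, action, createdts separated by '|'), followed by cid and anumber separated by ','.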
If Kafka or Flume can generate Avro files, then you can skip the text staging table: create an external Avro table and write the Avro files directly to its location.
Created 09-12-2018 05:02 PM
naresh, you are the man!!! thanks so much!!!
Created 09-12-2018 06:53 PM
Any idea what is wrong with this:
CREATE EXTERNAL TABLE staging3
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ('avro.schema.url'='hdfs:///tmp/avroschemas/testtbl.json')
LOCATION '/tmp/staging';
I am getting :
FAILED: ParseException line 7:0 missing EOF at 'LOCATION' near ')'
Created 09-12-2018 06:58 PM
It's because TBLPROPERTIES should be the last clause. Use the below and it should help:
+++++++++++++++++
CREATE EXTERNAL TABLE staging3
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/tmp/staging'
TBLPROPERTIES ('avro.schema.url'='hdfs:///tmp/avroschemas/testtbl.json');
+++++++++++++++++
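For reference, the testtbl.json file pointed to by avro.schema.url must contain a valid Avro schema. A sketch matching the staging-table columns described earlier might look like the following — the record and field names, and the timestamp-millis mapping for createdts, are assumptions derived from the Hive DDL (older Hive versions may not map Avro logical types to TIMESTAMP, in which case a plain long or string may be needed):

```json
{
  "type": "record",
  "name": "testtbl",
  "fields": [
    {"name": "id", "type": {
        "type": "record",
        "name": "id_struct",
        "fields": [
          {"name": "tid", "type": "string"},
          {"name": "action", "type": "string"},
          {"name": "createdts",
           "type": {"type": "long", "logicalType": "timestamp-millis"}}
        ]
    }},
    {"name": "cid", "type": "string"},
    {"name": "anumber", "type": "string"}
  ]
}
```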
Created 09-12-2018 07:38 PM
So with a table created as above, how should the data be formatted so that it can be loaded in? Since I am not specifying any delimiters, what format does it need to be in? Appreciate the insights.
Created 09-13-2018 12:54 PM
@Mahesh Balakrishnan @Naresh P R @Shu
Can I have feedback on how the data should be formatted to be loaded (LOAD DATA INPATH command) into a table created as:
CREATE EXTERNAL TABLE staging3
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/tmp/staging'
TBLPROPERTIES ('avro.schema.url'='hdfs:///tmp/avroschemas/testtbl.json');
The schema is the same as described earlier, but there is no delimiter specified. I would appreciate it if you could provide a sample piece of data for this.
Created 09-13-2018 03:00 PM
There are multiple ways to populate Avro tables.
1) INSERT INTO avro_table VALUES (<col1>, <col2>, ..., <colN>) -- this way Hive will write the Avro files for you.
2) Generate Avro files and copy them directly to '/tmp/staging'. You can read the Avro documentation for how to write Avro files directly to an HDFS path. The Avro reader/writer APIs take care of storing and retrieving records, so we don't need to explicitly specify delimiters for Avro files.