Created 07-12-2018 08:55 PM
I am trying to create an external Hive table that points to an Avro schema file (.avsc) that lives on the local file system. I know this is possible on Cloudera, but I am not so sure about Hortonworks. Most 'avro.schema.url' examples point to 'hdfs:///', but that is not what I am hoping to accomplish; I am attempting to use 'file:///'.
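Roughly, this is the shape of the DDL I am trying to get working (the table name, HDFS location, and local schema path below are just placeholders):

CREATE EXTERNAL TABLE my_avro_table
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS AVRO
LOCATION '/some/hdfs/data/dir'
TBLPROPERTIES ('avro.schema.url'='file:///path/to/schema.avsc');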
The functionality I am attempting to mimic can be found here.
Any help would be greatly appreciated!
Created 07-12-2018 09:30 PM
Hey @Shane B!
It should work with Avro schemas located on your local FS. Both platforms use the same SerDe to handle Avro types.
Here's an example:
[hive@node3 ~]$ cat user.avsc
{"namespace": "example.avro",
 "type": "record",
 "name": "User",
 "fields": [
     {"name": "name", "type": "string"},
     {"name": "favorite_number", "type": ["int", "null"]},
     {"name": "favorite_color", "type": ["string", "null"]}
 ]
}

0: jdbc:hive2://node3:10000/default> CREATE TABLE test
0: jdbc:hive2://node3:10000/default> ROW FORMAT SERDE
0: jdbc:hive2://node3:10000/default> 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
0: jdbc:hive2://node3:10000/default> STORED as AVRO
0: jdbc:hive2://node3:10000/default> TBLPROPERTIES (
0: jdbc:hive2://node3:10000/default> 'avro.schema.url'='file:///home/hive/user.avsc');
No rows affected (1.492 seconds)

0: jdbc:hive2://node3:10000/default> show create table test;
+------------------------------------------------------------------+--+
|                          createtab_stmt                           |
+------------------------------------------------------------------+--+
| CREATE TABLE `test`(                                               |
|   `name` string COMMENT '',                                        |
|   `favorite_number` int COMMENT '',                                |
|   `favorite_color` string COMMENT '')                              |
| ROW FORMAT SERDE                                                    |
|   'org.apache.hadoop.hive.serde2.avro.AvroSerDe'                    |
| STORED AS INPUTFORMAT                                               |
|   'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'      |
| OUTPUTFORMAT                                                        |
|   'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'     |
| LOCATION                                                            |
|   'hdfs://Admin-TrainingNS/apps/hive/warehouse/test'                |
| TBLPROPERTIES (                                                     |
|   'COLUMN_STATS_ACCURATE'='{\"BASIC_STATS\":\"true\"}',             |
|   'avro.schema.url'='file:///home/hive/user.avsc',                  |
|   'numFiles'='0',                                                   |
|   'numRows'='0',                                                    |
|   'rawDataSize'='0',                                                |
|   'totalSize'='0',                                                  |
|   'transient_lastDdlTime'='1531430559')                             |
+------------------------------------------------------------------+--+
20 rows selected (0.775 seconds)

Hope this helps!
Created 07-12-2018 09:38 PM
Thanks @Vinicius Higa Murakami for the prompt answer!
I did try to use TBLPROPERTIES ( 'avro.schema.url'='file:///<filepath>') and got the following error:
Error: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.RuntimeException: MetaException(message:org.apache.hadoop.hive.serde2.SerDeException Encountered AvroSerdeException determining schema. Returning signal schema to indicate problem: Unable to read schema from given path: file:///<filepath>.avsc) (state=08S01,code=1)
I also tried WITH SERDEPROPERTIES ( 'avro.schema.url'='file:///<filepath>.avsc') and got the same error as above.
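For completeness, the two variants I tried looked roughly like this (with the real path in place of <filepath>; the table name is a placeholder), and both produced the same error:

-- Variant 1: schema URL in TBLPROPERTIES
CREATE EXTERNAL TABLE my_table
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS AVRO
TBLPROPERTIES ('avro.schema.url'='file:///<filepath>.avsc');

-- Variant 2: schema URL in SERDEPROPERTIES
CREATE EXTERNAL TABLE my_table
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES ('avro.schema.url'='file:///<filepath>.avsc')
STORED AS AVRO;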
I am under the assumption it may be a configuration issue. Any ideas?
Created 07-12-2018 09:45 PM
Hello @Shane B
Hmm, I guess something inside your .avsc file can't be parsed by the Avro SerDe. Could you share it with us?
Btw, are you able to create the table using the HDFS path?
Created 07-12-2018 09:58 PM
And yes, it does work if I throw it in HDFS, just not when using 'file:///'.
{ "namespace": "com.linkedin.haivvreo", "name": "test_serializer", "type": "record", "fields": [ { "name":"string1", "type":"string" }, { "name":"int1", "type":"int" }, { "name":"tinyint1", "type":"int" }, { "name":"smallint1", "type":"int" }, { "name":"bigint1", "type":"long" }, { "name":"boolean1", "type":"boolean" }, { "name":"float1", "type":"float" }, { "name":"double1", "type":"double" }, { "name":"list1", "type":{"type":"array", "items":"string"} }, { "name":"map1", "type":{"type":"map", "values":"int"} }, { "name":"struct1", "type":{"type":"record", "name":"struct1_name", "fields": [ { "name":"sInt", "type":"int" }, { "name":"sBoolean", "type":"boolean" }, { "name":"sString", "type":"string" } ] } }, { "name":"union1", "type":["float", "boolean", "string"] }, { "name":"enum1", "type":{"type":"enum", "name":"enum1_values", "symbols":["BLUE","RED", "GREEN"]} }, { "name":"nullableint", "type":["int", "null"] }, { "name":"bytes1", "type":"bytes" }, { "name":"fixed1", "type":{"type":"fixed", "name":"threebytes", "size":3} } ] }<br>
This was an example taken from the Hive docs
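For reference, the HDFS-based flow that does work looks roughly like this (the table name and HDFS directory are illustrative, not my exact setup):

# copy the schema into HDFS
hdfs dfs -mkdir -p /user/hive/schemas
hdfs dfs -put test_serializer.avsc /user/hive/schemas/

-- then point the table at the HDFS copy instead of the local file
CREATE EXTERNAL TABLE test_serializer
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS AVRO
TBLPROPERTIES ('avro.schema.url'='hdfs:///user/hive/schemas/test_serializer.avsc');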
Created 07-13-2018 04:07 AM
It is very informative and helpful.
Created 07-13-2018 03:15 PM
Hi @Shane B!
I made the same test here using your example, and it's working 😐
[hive@node3 ~]$ cat -A test-avro.avsc
{$
  "namespace": "com.linkedin.haivvreo",$
  "name": "test_serializer",$
  "type": "record",$
  "fields": [$
    { "name":"string1", "type":"string" },$
    { "name":"int1", "type":"int" },$
    { "name":"tinyint1", "type":"int" },$
    { "name":"smallint1", "type":"int" },$
    { "name":"bigint1", "type":"long" },$
    { "name":"boolean1", "type":"boolean" },$
    { "name":"float1", "type":"float" },$
    { "name":"double1", "type":"double" },$
    { "name":"list1", "type":{"type":"array", "items":"string"} },$
    { "name":"map1", "type":{"type":"map", "values":"int"} },$
    { "name":"struct1", "type":{"type":"record", "name":"struct1_name", "fields": [$
      { "name":"sInt", "type":"int" }, { "name":"sBoolean", "type":"boolean" }, { "name":"sString", "type":"string" } ] } },$
    { "name":"union1", "type":["float", "boolean", "string"] },$
    { "name":"enum1", "type":{"type":"enum", "name":"enum1_values", "symbols":["BLUE","RED", "GREEN"]} },$
    { "name":"nullableint", "type":["int", "null"] },$
    { "name":"bytes1", "type":"bytes" },$
    { "name":"fixed1", "type":{"type":"fixed", "name":"threebytes", "size":3} }$
  ] }$

[hive@node3 ~]$ beeline -u 'jdbc:hive2://node3:10000/default' -n hive
Connecting to jdbc:hive2://node3:10000/default
Connected to: Apache Hive (version 1.2.1000.2.6.5.0-292)
Driver: Hive JDBC (version 1.2.1000.2.6.5.0-292)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 1.2.1000.2.6.5.0-292 by Apache Hive

0: jdbc:hive2://node3:10000/default> CREATE TABLE test_hcc
0: jdbc:hive2://node3:10000/default> ROW FORMAT SERDE
0: jdbc:hive2://node3:10000/default> 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
0: jdbc:hive2://node3:10000/default> STORED as AVRO
0: jdbc:hive2://node3:10000/default> TBLPROPERTIES (
0: jdbc:hive2://node3:10000/default> 'avro.schema.url'='file:///home/hive/test-avro.avsc');
No rows affected (0.513 seconds)

0: jdbc:hive2://node3:10000/default> desc formatted test_hcc;
+-------------------------------+--------------------------------------------------------------+-----------------------------------+--+
| col_name                      | data_type                                                    | comment                           |
+-------------------------------+--------------------------------------------------------------+-----------------------------------+--+
| # col_name                    | data_type                                                    | comment                           |
|                               | NULL                                                         | NULL                              |
| string1                       | string                                                       |                                   |
| int1                          | int                                                          |                                   |
| tinyint1                      | int                                                          |                                   |
| smallint1                     | int                                                          |                                   |
| bigint1                       | bigint                                                       |                                   |
| boolean1                      | boolean                                                      |                                   |
| float1                        | float                                                        |                                   |
| double1                       | double                                                       |                                   |
| list1                         | array<string>                                                |                                   |
| map1                          | map<string,int>                                              |                                   |
| struct1                       | struct<sint:int,sboolean:boolean,sstring:string>             |                                   |
| union1                        | uniontype<float,boolean,string>                              |                                   |
| enum1                         | string                                                       |                                   |
| nullableint                   | int                                                          |                                   |
| bytes1                        | binary                                                       |                                   |
| fixed1                        | binary                                                       |                                   |
|                               | NULL                                                         | NULL                              |
| # Detailed Table Information  | NULL                                                         | NULL                              |
| Database:                     | default                                                      | NULL                              |
| Owner:                        | hive                                                         | NULL                              |
| CreateTime:                   | Fri Jul 13 14:28:18 UTC 2018                                 | NULL                              |
| LastAccessTime:               | UNKNOWN                                                      | NULL                              |
| Protect Mode:                 | None                                                         | NULL                              |
| Retention:                    | 0                                                            | NULL                              |
| Location:                     | hdfs://Admin-TrainingNS/apps/hive/warehouse/test_hcc         | NULL                              |
| Table Type:                   | MANAGED_TABLE                                                | NULL                              |
| Table Parameters:             | NULL                                                         | NULL                              |
|                               | COLUMN_STATS_ACCURATE                                        | {\"BASIC_STATS\":\"true\"}        |
|                               | avro.schema.url                                              | file:///home/hive/test-avro.avsc  |
|                               | numFiles                                                     | 0                                 |
|                               | numRows                                                      | 0                                 |
|                               | rawDataSize                                                  | 0                                 |
|                               | totalSize                                                    | 0                                 |
|                               | transient_lastDdlTime                                        | 1531492098                        |
|                               | NULL                                                         | NULL                              |
| # Storage Information         | NULL                                                         | NULL                              |
| SerDe Library:                | org.apache.hadoop.hive.serde2.avro.AvroSerDe                 | NULL                              |
| InputFormat:                  | org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat   | NULL                              |
| OutputFormat:                 | org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat  | NULL                              |
| Compressed:                   | No                                                           | NULL                              |
| Num Buckets:                  | -1                                                           | NULL                              |
| Bucket Columns:               | []                                                           | NULL                              |
| Sort Columns:                 | []                                                           | NULL                              |
| Storage Desc Params:          | NULL                                                         | NULL                              |
|                               | serialization.format                                         | 1                                 |
+-------------------------------+--------------------------------------------------------------+-----------------------------------+--+
47 rows selected (0.48 seconds)
Sorry, I'm getting curious, so I'm going to ask a lot of questions 😄
Which Hive version are you running?
Are you using the Hive CLI or Beeline to execute these commands? Could you try running the following commands?
E.g. Beeline
0: jdbc:hive2://node3:10000/default> !sh ls test-avro.avsc
E.g. HiveCLI
hive> !ls /home/hive/test-avro.avsc
I ran a test here to reproduce your issue (by adding a 'd' to my filename so the path doesn't exist) and got the same error:
hive> CREATE TABLE test_hcc2
    > ROW FORMAT SERDE
    > 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
    > STORED as AVRO
    > TBLPROPERTIES (
    > 'avro.schema.url'='file:///home/hive/test-avrod.avsc');
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.RuntimeException: MetaException(message:org.apache.hadoop.hive.serde2.SerDeException Encountered AvroSerdeException determining schema. Returning signal schema to indicate problem: Unable to read schema from given path: file:///home/hive/test-avrod.avsc)
Also, I took a look at the Hive source on GitHub, and it seems that you're hitting this line:
BTW, I'm attaching my hive props so you can compare with yours 🙂
hive.txt
Hope this helps!
Created 07-15-2018 10:06 PM
Thank you for being so willing to help. I appreciate it!
I ended up doing the same thing you did and traced the error (which I assumed was the IO exception rather than the URISyntax one) through the Hive GitHub repo, and then followed it to the Apache Hadoop docs. At that point in my trace I realized that my Hive server was running on a different host [than my edge node]. Once I dropped the file on that host, it worked like a charm. Thanks again for your help!
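In case anyone else hits this, the fix boiled down to something like the following (the hostname, username, and paths are placeholders for my environment):

# from the edge node: copy the schema to the host that is actually running HiveServer2
scp /home/myuser/user.avsc hive@hiveserver2-host:/tmp/user.avsc

-- then, in Beeline, the file:/// URL has to point at the path on that host
CREATE EXTERNAL TABLE test
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS AVRO
TBLPROPERTIES ('avro.schema.url'='file:///tmp/user.avsc');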
Created 07-16-2018 02:59 AM
That's excellent news @Shane B! 😄
Created 07-17-2018 05:44 PM
You seem very knowledgeable, and I was wondering if you knew when Hive references the schema that I specified. For example, if I version out that schema and replace it with a new schema in the same location with the same name, when will my external Hive table pick up the new schema?
My main question is about Hive's behavior: I want to understand when the schema file is actually read by the Hive table.
As always, I appreciate your help.