Hive table pointing to Avro schema on local file system
Labels: Apache Hive
Created 07-12-2018 08:55 PM
I am trying to create an external Hive table that points to an Avro schema file (.avsc) that lives on the local file system. I know this is possible on Cloudera, but I am not sure about Hortonworks. Most 'avro.schema.url' examples point to 'hdfs:///', but that is not what I am hoping to accomplish; I am attempting to use 'file:///'.
The functionality I am attempting to mimic can be found here.
Any help would be greatly appreciated!
Created 07-12-2018 09:30 PM
Hey @Shane B!
It should work with Avro schemas located on your local FS; HDFS and local paths both use the same SerDe to handle the Avro types.
Here's an example:
[hive@node3 ~]$ cat user.avsc
{"namespace": "example.avro",
 "type": "record",
 "name": "User",
 "fields": [
     {"name": "name", "type": "string"},
     {"name": "favorite_number", "type": ["int", "null"]},
     {"name": "favorite_color", "type": ["string", "null"]}
 ]
}

0: jdbc:hive2://node3:10000/default> CREATE TABLE test
0: jdbc:hive2://node3:10000/default> ROW FORMAT SERDE
0: jdbc:hive2://node3:10000/default> 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
0: jdbc:hive2://node3:10000/default> STORED as AVRO
0: jdbc:hive2://node3:10000/default> TBLPROPERTIES (
0: jdbc:hive2://node3:10000/default> 'avro.schema.url'='file:///home/hive/user.avsc');
No rows affected (1.492 seconds)

0: jdbc:hive2://node3:10000/default> show create table test;
+------------------------------------------------------------------+--+
|                          createtab_stmt                          |
+------------------------------------------------------------------+--+
| CREATE TABLE `test`(                                             |
|   `name` string COMMENT '',                                      |
|   `favorite_number` int COMMENT '',                              |
|   `favorite_color` string COMMENT '')                            |
| ROW FORMAT SERDE                                                 |
|   'org.apache.hadoop.hive.serde2.avro.AvroSerDe'                 |
| STORED AS INPUTFORMAT                                            |
|   'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'   |
| OUTPUTFORMAT                                                     |
|   'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'  |
| LOCATION                                                         |
|   'hdfs://Admin-TrainingNS/apps/hive/warehouse/test'             |
| TBLPROPERTIES (                                                  |
|   'COLUMN_STATS_ACCURATE'='{\"BASIC_STATS\":\"true\"}',          |
|   'avro.schema.url'='file:///home/hive/user.avsc',               |
|   'numFiles'='0',                                                |
|   'numRows'='0',                                                 |
|   'rawDataSize'='0',                                             |
|   'totalSize'='0',                                               |
|   'transient_lastDdlTime'='1531430559')                          |
+------------------------------------------------------------------+--+
20 rows selected (0.775 seconds)

Hope this helps!
Created 07-12-2018 09:38 PM
Thanks @Vinicius Higa Murakami for the prompt answer!
I did try to use TBLPROPERTIES ( 'avro.schema.url'='file:///<filepath>') and got the following error:
Error: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.RuntimeException: MetaException(message:org.apache.hadoop.hive.serde2.SerDeException Encountered AvroSerdeException determining schema. Returning signal schema to indicate problem: Unable to read schema from given path: file:///<filepath>.avsc) (state=08S01,code=1)
I also tried WITH SERDEPROPERTIES ( 'avro.schema.url'='file:///<filepath>.avsc') and got the same error as above.
I suspect it may be a configuration issue. Any ideas?
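For completeness, the full DDL I'm running looks roughly like this (the table name is a placeholder and the real path is redacted):

CREATE EXTERNAL TABLE my_avro_table
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS AVRO
TBLPROPERTIES ('avro.schema.url'='file:///<filepath>.avsc');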
Created 07-12-2018 09:45 PM
Hello @Shane B
Hmm, I'm guessing something inside your .avsc file can't be serialized by the Avro SerDe. Could you share it with us?
Btw, are you able to create the table using the hdfs path?
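For example, something along these lines (paths here are just illustrative):

[hive@node3 ~]$ hdfs dfs -put user.avsc /tmp/
0: jdbc:hive2://node3:10000/default> CREATE TABLE test_hdfs
0: jdbc:hive2://node3:10000/default> ROW FORMAT SERDE
0: jdbc:hive2://node3:10000/default> 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
0: jdbc:hive2://node3:10000/default> STORED AS AVRO
0: jdbc:hive2://node3:10000/default> TBLPROPERTIES (
0: jdbc:hive2://node3:10000/default> 'avro.schema.url'='hdfs:///tmp/user.avsc');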
Created 07-12-2018 09:58 PM
And yes, it does work if I put the schema in HDFS, just not when using 'file:///'.
{ "namespace": "com.linkedin.haivvreo", "name": "test_serializer", "type": "record", "fields": [ { "name":"string1", "type":"string" }, { "name":"int1", "type":"int" }, { "name":"tinyint1", "type":"int" }, { "name":"smallint1", "type":"int" }, { "name":"bigint1", "type":"long" }, { "name":"boolean1", "type":"boolean" }, { "name":"float1", "type":"float" }, { "name":"double1", "type":"double" }, { "name":"list1", "type":{"type":"array", "items":"string"} }, { "name":"map1", "type":{"type":"map", "values":"int"} }, { "name":"struct1", "type":{"type":"record", "name":"struct1_name", "fields": [ { "name":"sInt", "type":"int" }, { "name":"sBoolean", "type":"boolean" }, { "name":"sString", "type":"string" } ] } }, { "name":"union1", "type":["float", "boolean", "string"] }, { "name":"enum1", "type":{"type":"enum", "name":"enum1_values", "symbols":["BLUE","RED", "GREEN"]} }, { "name":"nullableint", "type":["int", "null"] }, { "name":"bytes1", "type":"bytes" }, { "name":"fixed1", "type":{"type":"fixed", "name":"threebytes", "size":3} } ] }<br>
This is an example taken from the Hive docs.
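As a sanity check, one way to confirm the file itself is a valid Avro schema (assuming avro-tools is available; the jar version and file name here are placeholders) is to let Avro parse it directly. If the schema were malformed, this would fail with a parse error:

java -jar avro-tools-1.8.2.jar compile schema test_serializer.avsc /tmp/avro-codegen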
Created 07-13-2018 04:07 AM
It is very informative and helpful.
Created 07-13-2018 03:15 PM
Hi @Shane B!
I made the same test here using your example, and it's working 😐
[hive@node3 ~]$ cat -A test-avro.avsc
{$
  "namespace": "com.linkedin.haivvreo",$
  "name": "test_serializer",$
  "type": "record",$
  "fields": [$
    { "name":"string1", "type":"string" },$
    { "name":"int1", "type":"int" },$
    { "name":"tinyint1", "type":"int" },$
    { "name":"smallint1", "type":"int" },$
    { "name":"bigint1", "type":"long" },$
    { "name":"boolean1", "type":"boolean" },$
    { "name":"float1", "type":"float" },$
    { "name":"double1", "type":"double" },$
    { "name":"list1", "type":{"type":"array", "items":"string"} },$
    { "name":"map1", "type":{"type":"map", "values":"int"} },$
    { "name":"struct1", "type":{"type":"record", "name":"struct1_name", "fields": [$
      { "name":"sInt", "type":"int" }, { "name":"sBoolean", "type":"boolean" }, { "name":"sString", "type":"string" } ] } },$
    { "name":"union1", "type":["float", "boolean", "string"] },$
    { "name":"enum1", "type":{"type":"enum", "name":"enum1_values", "symbols":["BLUE","RED", "GREEN"]} },$
    { "name":"nullableint", "type":["int", "null"] },$
    { "name":"bytes1", "type":"bytes" },$
    { "name":"fixed1", "type":{"type":"fixed", "name":"threebytes", "size":3} }$
  ] }$

[hive@node3 ~]$ beeline -u 'jdbc:hive2://node3:10000/default' -n hive
Connecting to jdbc:hive2://node3:10000/default
Connected to: Apache Hive (version 1.2.1000.2.6.5.0-292)
Driver: Hive JDBC (version 1.2.1000.2.6.5.0-292)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 1.2.1000.2.6.5.0-292 by Apache Hive

0: jdbc:hive2://node3:10000/default> CREATE TABLE test_hcc
0: jdbc:hive2://node3:10000/default> ROW FORMAT SERDE
0: jdbc:hive2://node3:10000/default> 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
0: jdbc:hive2://node3:10000/default> STORED as AVRO
0: jdbc:hive2://node3:10000/default> TBLPROPERTIES (
0: jdbc:hive2://node3:10000/default> 'avro.schema.url'='file:///home/hive/test-avro.avsc');
No rows affected (0.513 seconds)

0: jdbc:hive2://node3:10000/default> desc formatted test_hcc;
+-------------------------------+--------------------------------------------------------------+-----------------------------------+--+
|           col_name            |                          data_type                           |              comment              |
+-------------------------------+--------------------------------------------------------------+-----------------------------------+--+
| # col_name                    | data_type                                                    | comment                           |
|                               | NULL                                                         | NULL                              |
| string1                       | string                                                       |                                   |
| int1                          | int                                                          |                                   |
| tinyint1                      | int                                                          |                                   |
| smallint1                     | int                                                          |                                   |
| bigint1                       | bigint                                                       |                                   |
| boolean1                      | boolean                                                      |                                   |
| float1                        | float                                                        |                                   |
| double1                       | double                                                       |                                   |
| list1                         | array<string>                                                |                                   |
| map1                          | map<string,int>                                              |                                   |
| struct1                       | struct<sint:int,sboolean:boolean,sstring:string>             |                                   |
| union1                        | uniontype<float,boolean,string>                              |                                   |
| enum1                         | string                                                       |                                   |
| nullableint                   | int                                                          |                                   |
| bytes1                        | binary                                                       |                                   |
| fixed1                        | binary                                                       |                                   |
|                               | NULL                                                         | NULL                              |
| # Detailed Table Information  | NULL                                                         | NULL                              |
| Database:                     | default                                                      | NULL                              |
| Owner:                        | hive                                                         | NULL                              |
| CreateTime:                   | Fri Jul 13 14:28:18 UTC 2018                                 | NULL                              |
| LastAccessTime:               | UNKNOWN                                                      | NULL                              |
| Protect Mode:                 | None                                                         | NULL                              |
| Retention:                    | 0                                                            | NULL                              |
| Location:                     | hdfs://Admin-TrainingNS/apps/hive/warehouse/test_hcc         | NULL                              |
| Table Type:                   | MANAGED_TABLE                                                | NULL                              |
| Table Parameters:             | NULL                                                         | NULL                              |
|                               | COLUMN_STATS_ACCURATE                                        | {\"BASIC_STATS\":\"true\"}        |
|                               | avro.schema.url                                              | file:///home/hive/test-avro.avsc  |
|                               | numFiles                                                     | 0                                 |
|                               | numRows                                                      | 0                                 |
|                               | rawDataSize                                                  | 0                                 |
|                               | totalSize                                                    | 0                                 |
|                               | transient_lastDdlTime                                        | 1531492098                        |
|                               | NULL                                                         | NULL                              |
| # Storage Information         | NULL                                                         | NULL                              |
| SerDe Library:                | org.apache.hadoop.hive.serde2.avro.AvroSerDe                 | NULL                              |
| InputFormat:                  | org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat   | NULL                              |
| OutputFormat:                 | org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat  | NULL                              |
| Compressed:                   | No                                                           | NULL                              |
| Num Buckets:                  | -1                                                           | NULL                              |
| Bucket Columns:               | []                                                           | NULL                              |
| Sort Columns:                 | []                                                           | NULL                              |
| Storage Desc Params:          | NULL                                                         | NULL                              |
|                               | serialization.format                                         | 1                                 |
+-------------------------------+--------------------------------------------------------------+-----------------------------------+--+
47 rows selected (0.48 seconds)
Sorry, I'm getting curious, so I'm going to ask a lot of questions 😄
Which Hive version are you running?
Are you using the Hive CLI or Beeline to execute these commands? Could you try running the following commands?
E.g. Beeline
0: jdbc:hive2://node3:10000/default> !sh ls test-avro.avsc
E.g. HiveCLI
hive> !ls /home/hive/test-avro.avsc
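One caveat worth noting: `!sh` in Beeline runs on the machine where the Beeline client itself is running, while HiveServer2 resolves the file:/// URL on its own host. If the client and HiveServer2 are on different machines, a check like this (hostname is a placeholder) is more telling:

ssh hive@<hiveserver2-host> 'ls -l /home/hive/test-avro.avsc'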
I also made a test here to reproduce your issue (by adding a stray 'd' to my filename) and got the same error:
hive> CREATE TABLE test_hcc2
    > ROW FORMAT SERDE
    > 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
    > STORED as AVRO
    > TBLPROPERTIES (
    > 'avro.schema.url'='file:///home/hive/test-avrod.avsc');
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.RuntimeException: MetaException(message:org.apache.hadoop.hive.serde2.SerDeException Encountered AvroSerdeException determining schema. Returning signal schema to indicate problem: Unable to read schema from given path: file:///home/hive/test-avrod.avsc)
Also, I took a look at the Hive source on GitHub, and it seems you're hitting this line:
BTW, I'm attaching my hive props so you can compare with yours 🙂
hive.txt
Hope this helps!
Created 07-15-2018 10:06 PM
Thank you for being so willing to help. I appreciate it!
I ended up doing the same thing you did: I traced the error (which I assumed was the IO rather than the URISyntax exception) through the Hive GitHub repo and followed it to the Apache Hadoop docs. At that point in my trace I realized that my Hive server was running on a different server than my edge node. Once I dropped the file on that server, it worked like a charm. Thanks again for your help!
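In other words, the fix was simply to copy the schema file onto the host running HiveServer2 (hostname and paths here are placeholders):

scp /path/to/test_serializer.avsc hive@<hiveserver2-host>:/home/hive/

After that, the same file:/// DDL worked unchanged.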
Created 07-16-2018 02:59 AM
That's excellent news @Shane B! 😄
Created 07-17-2018 05:44 PM
You seem very knowledgeable, so I was wondering if you knew when Hive references the schema that I specified. For example, if I version out that schema and replace it with a new schema in the same location, with the same name, when will my external Hive table pick up the new schema?
My main question is about Hive's behavior: I want to understand when the schema is actually read by the Hive table.
As always, I appreciate your help.
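To make the scenario concrete, here is roughly the experiment I have in mind (file and table names are placeholders):

# Swap a new version of the schema into the same path, under the same name.
mv /home/hive/user_v2.avsc /home/hive/user.avsc

-- Then, in Beeline, check whether the table already reflects the new schema.
DESCRIBE my_avro_table;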
