
Hive table pointing to Avro schema on local file system


I am trying to create an external Hive table that points to an Avro schema file (.avsc) that lives on the local file system. I know this is possible on Cloudera, but I am not sure about Hortonworks. Most 'avro.schema.url' examples point to 'hdfs:///', but that is not what I am hoping to accomplish; I am attempting to use 'file:///'.

The functionality I am attempting to mimic can be found here.

Any help would be greatly appreciated!

1 ACCEPTED SOLUTION


Hey @Shane B!
It should work with Avro schemas located on your local FS. Both distributions use the same SerDe to handle Avro types.
Here's an example:

[hive@node3 ~]$ cat  user.avsc 
{"namespace": "example.avro",
 "type": "record",
 "name": "User",
 "fields": [
     {"name": "name", "type": "string"},
     {"name": "favorite_number",  "type": ["int", "null"]},
     {"name": "favorite_color", "type": ["string", "null"]}
 ]
}
0: jdbc:hive2://node3:10000/default>  CREATE TABLE test
0: jdbc:hive2://node3:10000/default>   ROW FORMAT SERDE
0: jdbc:hive2://node3:10000/default>   'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
0: jdbc:hive2://node3:10000/default>   STORED as AVRO
0: jdbc:hive2://node3:10000/default>   TBLPROPERTIES (
0: jdbc:hive2://node3:10000/default>     'avro.schema.url'='file:///home/hive/user.avsc');
No rows affected (1.492 seconds)
0: jdbc:hive2://node3:10000/default> show create table test;
+------------------------------------------------------------------+--+
|                          createtab_stmt                          |
+------------------------------------------------------------------+--+
| CREATE TABLE `test`(                                             |
|   `name` string COMMENT '',                                      |
|   `favorite_number` int COMMENT '',                              |
|   `favorite_color` string COMMENT '')                            |
| ROW FORMAT SERDE                                                 |
|   'org.apache.hadoop.hive.serde2.avro.AvroSerDe'                 |
| STORED AS INPUTFORMAT                                            |
|   'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'   |
| OUTPUTFORMAT                                                     |
|   'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'  |
| LOCATION                                                         |
|   'hdfs://Admin-TrainingNS/apps/hive/warehouse/test'             |
| TBLPROPERTIES (                                                  |
|   'COLUMN_STATS_ACCURATE'='{\"BASIC_STATS\":\"true\"}',          |
|   'avro.schema.url'='file:///home/hive/user.avsc',               |
|   'numFiles'='0',                                                |
|   'numRows'='0',                                                 |
|   'rawDataSize'='0',                                             |
|   'totalSize'='0',                                               |
|   'transient_lastDdlTime'='1531430559')                          |
+------------------------------------------------------------------+--+
20 rows selected (0.775 seconds)
Hope this helps!


10 REPLIES



Thanks @Vinicius Higa Murakami for the prompt answer!

I did try to use TBLPROPERTIES ( 'avro.schema.url'='file:///<filepath>') and got the following error:

Error: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.RuntimeException: MetaException(message:org.apache.hadoop.hive.serde2.SerDeException Encountered AvroSerdeException determining schema. Returning signal schema to indicate problem: Unable to read schema from given path: file:///<filepath>.avsc) (state=08S01,code=1)

I also tried WITH SERDEPROPERTIES ( 'avro.schema.url'='file:///<filepath>.avsc') and got the same error as above.

I am under the assumption it may be a configuration issue. Any ideas?
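One other thing that may be worth ruling out before blaming the configuration: the same SerDe error can appear when the file is readable but its contents are malformed. A minimal local sanity check, sketched in Python (the function and the path in the comment are illustrative, not part of Hive):

```python
import json

# Hypothetical pre-flight check: before pointing 'avro.schema.url' at a
# local file, confirm the file parses as valid JSON and declares the
# top-level attributes an Avro record schema requires.
def check_avsc(path):
    with open(path) as f:
        schema = json.load(f)  # raises json.JSONDecodeError on malformed JSON
    for key in ("type", "name", "fields"):
        if key not in schema:
            raise ValueError("missing required schema attribute: " + key)
    return schema
```

For example, `check_avsc("/home/hive/user.avsc")` would either return the parsed schema or raise with the specific reason, which is more informative than the generic "Unable to read schema" message.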


Hello @Shane B
Hmm, I guess something inside your .avsc file can't be parsed by the Avro SerDe. Could you share it with us?
By the way, are you able to create the table using the HDFS path?


Hi @Vinicius Higa Murakami

Yes, it does work if I put the schema in HDFS; it just fails when using 'file:///'. Here is the schema:

{
  "namespace": "com.linkedin.haivvreo",
  "name": "test_serializer",
  "type": "record",
  "fields": [
    { "name":"string1", "type":"string" },
    { "name":"int1", "type":"int" },
    { "name":"tinyint1", "type":"int" },
    { "name":"smallint1", "type":"int" },
    { "name":"bigint1", "type":"long" },
    { "name":"boolean1", "type":"boolean" },
    { "name":"float1", "type":"float" },
    { "name":"double1", "type":"double" },
    { "name":"list1", "type":{"type":"array", "items":"string"} },
    { "name":"map1", "type":{"type":"map", "values":"int"} },
    { "name":"struct1", "type":{"type":"record", "name":"struct1_name", "fields": [
          { "name":"sInt", "type":"int" }, { "name":"sBoolean", "type":"boolean" }, { "name":"sString", "type":"string" } ] } },
    { "name":"union1", "type":["float", "boolean", "string"] },
    { "name":"enum1", "type":{"type":"enum", "name":"enum1_values", "symbols":["BLUE","RED", "GREEN"]} },
    { "name":"nullableint", "type":["int", "null"] },
    { "name":"bytes1", "type":"bytes" },
    { "name":"fixed1", "type":{"type":"fixed", "name":"threebytes", "size":3} }
  ] }

This is an example taken from the Hive docs.


It is very informative and helpful.


Hi @Shane B!
I made the same test here using your example, and it's working 😐

[hive@node3 ~]$ cat -A test-avro.avsc
{$
  "namespace": "com.linkedin.haivvreo",$
  "name": "test_serializer",$
  "type": "record",$
  "fields": [$
    { "name":"string1", "type":"string" },$
    { "name":"int1", "type":"int" },$
    { "name":"tinyint1", "type":"int" },$
    { "name":"smallint1", "type":"int" },$
    { "name":"bigint1", "type":"long" },$
    { "name":"boolean1", "type":"boolean" },$
    { "name":"float1", "type":"float" },$
    { "name":"double1", "type":"double" },$
    { "name":"list1", "type":{"type":"array", "items":"string"} },$
    { "name":"map1", "type":{"type":"map", "values":"int"} },$
    { "name":"struct1", "type":{"type":"record", "name":"struct1_name", "fields": [$
          { "name":"sInt", "type":"int" }, { "name":"sBoolean", "type":"boolean" }, { "name":"sString", "type":"string" } ] } },$
    { "name":"union1", "type":["float", "boolean", "string"] },$
    { "name":"enum1", "type":{"type":"enum", "name":"enum1_values", "symbols":["BLUE","RED", "GREEN"]} },$
    { "name":"nullableint", "type":["int", "null"] },$
    { "name":"bytes1", "type":"bytes" },$
    { "name":"fixed1", "type":{"type":"fixed", "name":"threebytes", "size":3} }$
  ] }$
[hive@node3 ~]$ beeline -u 'jdbc:hive2://node3:10000/default' -n hive
Connecting to jdbc:hive2://node3:10000/default
Connected to: Apache Hive (version 1.2.1000.2.6.5.0-292)
Driver: Hive JDBC (version 1.2.1000.2.6.5.0-292)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 1.2.1000.2.6.5.0-292 by Apache Hive
0: jdbc:hive2://node3:10000/default>  CREATE TABLE test_hcc
0: jdbc:hive2://node3:10000/default>   ROW FORMAT SERDE
0: jdbc:hive2://node3:10000/default>   'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
0: jdbc:hive2://node3:10000/default>   STORED as AVRO
0: jdbc:hive2://node3:10000/default>   TBLPROPERTIES (
0: jdbc:hive2://node3:10000/default>     'avro.schema.url'='file:///home/hive/test-avro.avsc');
No rows affected (0.513 seconds)
0: jdbc:hive2://node3:10000/default> desc formatted test_hcc;
+-------------------------------+--------------------------------------------------------------+-----------------------------------+--+
|           col_name            |                          data_type                           |              comment              |
+-------------------------------+--------------------------------------------------------------+-----------------------------------+--+
| # col_name                    | data_type                                                    | comment                           |
|                               | NULL                                                         | NULL                              |
| string1                       | string                                                       |                                   |
| int1                          | int                                                          |                                   |
| tinyint1                      | int                                                          |                                   |
| smallint1                     | int                                                          |                                   |
| bigint1                       | bigint                                                       |                                   |
| boolean1                      | boolean                                                      |                                   |
| float1                        | float                                                        |                                   |
| double1                       | double                                                       |                                   |
| list1                         | array<string>                                                |                                   |
| map1                          | map<string,int>                                              |                                   |
| struct1                       | struct<sint:int,sboolean:boolean,sstring:string>             |                                   |
| union1                        | uniontype<float,boolean,string>                              |                                   |
| enum1                         | string                                                       |                                   |
| nullableint                   | int                                                          |                                   |
| bytes1                        | binary                                                       |                                   |
| fixed1                        | binary                                                       |                                   |
|                               | NULL                                                         | NULL                              |
| # Detailed Table Information  | NULL                                                         | NULL                              |
| Database:                     | default                                                      | NULL                              |
| Owner:                        | hive                                                         | NULL                              |
| CreateTime:                   | Fri Jul 13 14:28:18 UTC 2018                                 | NULL                              |
| LastAccessTime:               | UNKNOWN                                                      | NULL                              |
| Protect Mode:                 | None                                                         | NULL                              |
| Retention:                    | 0                                                            | NULL                              |
| Location:                     | hdfs://Admin-TrainingNS/apps/hive/warehouse/test_hcc         | NULL                              |
| Table Type:                   | MANAGED_TABLE                                                | NULL                              |
| Table Parameters:             | NULL                                                         | NULL                              |
|                               | COLUMN_STATS_ACCURATE                                        | {\"BASIC_STATS\":\"true\"}        |
|                               | avro.schema.url                                              | file:///home/hive/test-avro.avsc  |
|                               | numFiles                                                     | 0                                 |
|                               | numRows                                                      | 0                                 |
|                               | rawDataSize                                                  | 0                                 |
|                               | totalSize                                                    | 0                                 |
|                               | transient_lastDdlTime                                        | 1531492098                        |
|                               | NULL                                                         | NULL                              |
| # Storage Information         | NULL                                                         | NULL                              |
| SerDe Library:                | org.apache.hadoop.hive.serde2.avro.AvroSerDe                 | NULL                              |
| InputFormat:                  | org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat   | NULL                              |
| OutputFormat:                 | org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat  | NULL                              |
| Compressed:                   | No                                                           | NULL                              |
| Num Buckets:                  | -1                                                           | NULL                              |
| Bucket Columns:               | []                                                           | NULL                              |
| Sort Columns:                 | []                                                           | NULL                              |
| Storage Desc Params:          | NULL                                                         | NULL                              |
|                               | serialization.format                                         | 1                                 |
+-------------------------------+--------------------------------------------------------------+-----------------------------------+--+
47 rows selected (0.48 seconds)

Sorry, I'm getting curious and am going to ask a lot of questions 😄

Which Hive version are you running?
Are you using the Hive CLI or Beeline to execute these commands? Could you try to run the following commands?
E.g. Beeline

0: jdbc:hive2://node3:10000/default> !sh ls test-avro.avsc

E.g. HiveCLI

hive> !ls /home/hive/test-avro.avsc

I ran a test to reproduce your issue (by appending a 'd' to my filename) and got the same error:

hive>  CREATE TABLE test_hcc2
    >   ROW FORMAT SERDE
    >   'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
    >   STORED as AVRO
    >   TBLPROPERTIES (
    >     'avro.schema.url'='file:///home/hive/test-avrod.avsc');
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.RuntimeException: MetaException(message:org.apache.hadoop.hive.serde2.SerDeException Encountered AvroSerdeException determining schema. Returning signal schema to indicate problem: Unable to read schema from given path: file:///home/hive/test-avrod.avsc)

Also, I took a look at the Hive GitHub repository, and it seems that you're hitting this line:

https://github.com/apache/hive/blob/cacb1c09574c89ac07fcffc0b8c3fad18e283aec/serde/src/java/org/apac...

BTW, I'm attaching my hive props so you can compare with yours 🙂
hive.txt

Hope this helps!


Hi @Vinicius Higa Murakami

Thank you for being so willing to help. I appreciate it!

I ended up doing the same thing you did: tracing the error (which I assumed was IO rather than URISyntax) through the Hive GitHub repo, and from there to the Apache Hadoop docs. At that point in my trace I realized that my Hive server was running on a different server than my edge node. Once I dropped the file on that server, it worked like a charm. Thanks again for your help!
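The underlying behavior here: a 'file:///' URL names a path on whichever host dereferences it, and it is HiveServer2, not the Beeline client, that opens 'avro.schema.url'. A small sketch of that resolution in Python (the function name and URLs are hypothetical, for illustration only):

```python
import os
from urllib.parse import urlparse

# Illustrative only: a file:// URL is resolved against the local file system
# of the process that opens it. HiveServer2 dereferences 'avro.schema.url'
# on its own host, so the .avsc must be readable on that machine, not just
# on the edge node where the DDL was submitted.
def file_url_readable(url):
    parsed = urlparse(url)
    if parsed.scheme != "file":
        raise ValueError("expected a file:// URL")
    return os.access(parsed.path, os.R_OK)
```

Running this check on the edge node and on the HiveServer2 host would give different answers for the same URL, which is exactly the situation described above.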


That's excellent news @Shane B! 😄


Hey @Vinicius Higa Murakami

You seem very knowledgeable, and I was wondering if you know when Hive references the schema I specified. For example, if I version out that schema and replace it with a new schema in the same location with the same name, when will my external Hive table pick up the new schema?

My main question is about Hive's behavior: when exactly is that schema read by the Hive table?

As always, I appreciate your help.