Support Questions

Find answers, ask questions, and share your expertise

Avro - java.io.IOException: java.io.IOException: Not a data file.

avatar
Rising Star

Hi,I have very simple AVSC file, And I generated the Avro using GitHub Code, xml to Avro converter,

https://github.com/elodina/xml-avro/tree/master/src/ly/stealth/xmlavro/Converter.java

But when I query the table I get below error.

Avro - java.io.IOException: java.io.IOException: Not a data file.

I can see the avro file inside the table folder. Kindly let me know what I am missing??

Below is my AVSC-

CREATE TABLE embedded COMMENT "just drop the schema right into the HQL" ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' LOCATION 'hdfs://csaa-aap-qa/apps/hive/warehouse/reservemodel.db/embedded' TBLPROPERTIES ( 'avro.schema.literal'='{ "fields": [ { "name": "BillDate", "source": "element BillDate", "type": "string" }, { "name": "BillTime", "source": "element BillTime", "type": "string" }, { "name": "Remit_CompanyName", "source": "element Remit_CompanyName", "type": "string" }, { "name": "Remit_Addr", "source": "element Remit_Addr", "type": "string" }, { "name": "Remit_CityStZip", "source": "element Remit_CityStZip", "type": "string" }, { "name": "Remit_Phone", "source": "element Remit_Phone", "type": "string" }, { "name": "Remit_Fax", "source": "element Remit_Fax", "type": "string" }, { "name": "Remit_TaxID", "source": "element Remit_TaxID", "type": "string" }, { "name": "Previous_Balance", "source": "element Previous_Balance", "type": "string" }, { "name": "others", "type": { "type": "map", "values": "string" } } ], "name": "MetroBillType", "namespace": "ly.stealth.xmlavro", "protocol": "xml", "type": "record" } ') ;

And the XML is.

<?xml version="1.0" encoding="UTF-8" ?> <MetroBill xmlns:xs="http://www.w3.org/2001/XMLSchema-instance" xs:noNamespaceSchemaLocation="Metrobill.xsd" > <BillDate>02/29/2016</BillDate> <BillTime>18:49:05</BillTime> <Remit_CompanyName>METROPOLITAN REPORTING BUREAU</Remit_CompanyName> <Remit_Addr>P.O. BOX 926, WILLIAM PENN ANNEX</Remit_Addr> <Remit_CityStZip>PHILADELPHIA, PA 19105-0926</Remit_CityStZip> <Remit_Phone>(800) 245-6686</Remit_Phone> <Remit_Fax>(800) 343-9047</Remit_Fax> <Remit_TaxID>23-1879730</Remit_TaxID> <Previous_Balance>1663</Previous_Balance> </MetroBill>

The Avro genrated is with special chars-

02/29/201618:49:05:METROPOLITAN REPORTING BUREAU@P.O. BOX 926, WILLIAM PENN ANNEX6PHILADELPHIA, PA 19105-0926(800) 245-6686(800) 343-904723-18797301663

Regards

Mamta Chawla

1 ACCEPTED SOLUTION

avatar
Super Collaborator

I am novice to data loading and Avro so I can't verify the avro schema etc. provided in the post but I did face the same issue (java.io.IOException: java.io.IOException: Not a data file). I am listing below the steps I carried out to load a sql server table to avro and then to a Hive external table, maybe, it will provide some pointers :

--create dir. for sqoop import

[hdfs@l1031lab root]$  hadoop fs -mkdir -p /dataload/tohdfs/reio/odpdw/may2016/
[hdfs@l1031lab root]$

--grant permissions to sqoop

[hdfs@l1031lab root]$ hadoop fs -chown -R sqoop /dataload
[hdfs@l1031lab root]$
[hdfs@l1031lab root]$ hadoop fs -ls /dataload/tohdfs/reio/odpdw/may2016/
[hdfs@l1031lab root]$

--sqoop import

sqoop import --connect 'jdbc:sqlserver://dbserver;database=dbname' --username someusername --password somepassword --as-avrodatafile --num-mappers 8 --table DimSampleDesc --warehouse-dir /dataload/tohdfs/reio/odpdw/may2016 --verbose

An exception was thrown but the avro files were created :

Writing Avro schema file: /tmp/sqoop-sqoop/compile/e64596608ce0247bf2233353991b20fd/DimSampleDesc.avsc
16/05/09 13:09:00 DEBUG mapreduce.DataDrivenImportJob: Could not move Avro schema file to code output directory.
java.io.FileNotFoundException: Destination directory '.' does not exist [createDestDir=true]
        at org.apache.commons.io.FileUtils.moveFileToDirectory(FileUtils.java:2865)
        at org.apache.sqoop.mapreduce.DataDrivenImportJob.writeAvroSchema(DataDrivenImportJob.java:146)
        at org.apache.sqoop.mapreduce.DataDrivenImportJob.configureMapper(DataDrivenImportJob.java:92)
        at org.apache.sqoop.mapreduce.ImportJobBase.runImport(ImportJobBase.java:260)
        at org.apache.sqoop.manager.SqlManager.importTable(SqlManager.java:673)
        at org.apache.sqoop.manager.SQLServerManager.importTable(SQLServerManager.java:163)
        at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:497)
        at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:605)
        at org.apache.sqoop.Sqoop.run(Sqoop.java:148)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:184)
        at org.apache.sqoop.Sqoop.runTool(Sqoop.java:226)
        at org.apache.sqoop.Sqoop.runTool(Sqoop.java:235)
        at org.apache.sqoop.Sqoop.main(Sqoop.java:244)

error in moving the schema file, hence, do manually. If this step is skipped, the table will still get created but when you do a 'select' on the table, you get the 'Not a data file' error message.

[sqoop@l1038lab root]$ hadoop fs -copyFromLocal /tmp/sqoop-sqoop/compile/e64596608ce0247bf2233353991b20fd/* /dataload/tohdfs/reio/odpdw/may2016/DimSampleDesc/

to allow hive to write, give write permissions to all

[sqoop@l1038lab root]$ hadoop fs -chmod -R a+w /dataload/
[sqoop@l1038lab root]$

view the warehouse-dir

[sqoop@l1038lab root]$ hadoop fs -ls /dataload/tohdfs/reio/odpdw/may2016/DimSampleDesc
Found 15 items
-rw-rw-rw-   3 sqoop hdfs       3659 2016-05-09 16:18 /dataload/tohdfs/reio/odpdw/may2016/DimSampleDesc/DimSampleDesc.avsc
-rw-rw-rw-   3 sqoop hdfs      28540 2016-05-09 16:18 /dataload/tohdfs/reio/odpdw/may2016/DimSampleDesc/DimSampleDesc.class
-rw-rw-rw-   3 sqoop hdfs       9568 2016-05-09 16:18 /dataload/tohdfs/reio/odpdw/may2016/DimSampleDesc/DimSampleDesc.jar
-rw-rw-rw-   3 sqoop hdfs      61005 2016-05-09 16:18 /dataload/tohdfs/reio/odpdw/may2016/DimSampleDesc/DimSampleDesc.java
-rw-rw-rw-   3 sqoop hdfs          0 2016-05-09 16:10 /dataload/tohdfs/reio/odpdw/may2016/DimSampleDesc/_SUCCESS
-rw-rw-rw-   3 sqoop hdfs       2660 2016-05-09 16:10 /dataload/tohdfs/reio/odpdw/may2016/DimSampleDesc/part-m-00000.avro
-rw-rw-rw-   3 sqoop hdfs    1465072 2016-05-09 16:10 /dataload/tohdfs/reio/odpdw/may2016/DimSampleDesc/part-m-00001.avro
-rw-rw-rw-   3 sqoop hdfs    3577459 2016-05-09 16:10 /dataload/tohdfs/reio/odpdw/may2016/DimSampleDesc/part-m-00002.avro
-rw-rw-rw-   3 sqoop hdfs    1203520 2016-05-09 16:10 /dataload/tohdfs/reio/odpdw/may2016/DimSampleDesc/part-m-00003.avro
-rw-rw-rw-   3 sqoop hdfs     236282 2016-05-09 16:10 /dataload/tohdfs/reio/odpdw/may2016/DimSampleDesc/part-m-00004.avro
-rw-rw-rw-   3 sqoop hdfs     638532 2016-05-09 16:10 /dataload/tohdfs/reio/odpdw/may2016/DimSampleDesc/part-m-00005.avro
-rw-rw-rw-   3 sqoop hdfs     850454 2016-05-09 16:10 /dataload/tohdfs/reio/odpdw/may2016/DimSampleDesc/part-m-00006.avro
-rw-rw-rw-   3 sqoop hdfs     257025 2016-05-09 16:10 /dataload/tohdfs/reio/odpdw/may2016/DimSampleDesc/part-m-00007.avro
-rw-rw-rw-   3 sqoop hdfs     341204 2016-05-09 16:10 /dataload/tohdfs/reio/odpdw/may2016/DimSampleDesc/part-m-00008.avro
-rw-rw-rw-   3 sqoop hdfs       4792 2016-05-09 16:10 /dataload/tohdfs/reio/odpdw/may2016/DimSampleDesc/part-m-00009.avro
[sqoop@l1038lab root]$

view the schema file

[sqoop@l1038lab root]$ hadoop fs -cat /dataload/tohdfs/reio/odpdw/may2016/DimSampleDesc/DimSampleDesc.avsc
{
  "type" : "record",
  "name" : "DimSampleDesc",
  "doc" : "Sqoop import of DimSampleDesc",
  "fields" : [ {
    "name" : "SmapiName_ver",
    "type" : [ "null", "string" ],
    "default" : null,
    "columnName" : "SmapiName_ver",
    "sqlType" : "12"
  }, {
    "name" : "SmapiColName",
    "type" : [ "null", "string" ],
    "default" : null,
    "columnName" : "SmapiColName",
    "sqlType" : "12"
  }, {
    "name" : "ChartType",
    "type" : [ "null", "int" ],
    "default" : null,
    "columnName" : "ChartType",
    "sqlType" : "4"
  }, {
    "name" : "X_Indexet",
    "type" : [ "null", "int" ],
    "default" : null,
    "columnName" : "X_Indexet",
    "sqlType" : "4"
  }, {
    "name" : "Y_Indexet",
    "type" : [ "null", "int" ],
    "default" : null,
    "columnName" : "Y_Indexet",
    "sqlType" : "4"
  }, {
    "name" : "X_Tick",
    "type" : [ "null", "string" ],
    "default" : null,
    "columnName" : "X_Tick",
    "sqlType" : "-9"
  }, {
    "name" : "Y_Tick",
    "type" : [ "null", "string" ],
    "default" : null,
    "columnName" : "Y_Tick",
    "sqlType" : "-9"
  }, {
    "name" : "X_TickRange",
    "type" : [ "null", "string" ],
    "default" : null,
    "columnName" : "X_TickRange",
    "sqlType" : "-9"
  }, {
    "name" : "X_TickRangeFrom",
    "type" : [ "null", "string" ],
    "default" : null,
    "columnName" : "X_TickRangeFrom",
    "sqlType" : "-9"
  }, {
    "name" : "X_TickRangeTom",
    "type" : [ "null", "string" ],
    "default" : null,
    "columnName" : "X_TickRangeTom",
    "sqlType" : "-9"
  }, {
    "name" : "Y_TickRange",
    "type" : [ "null", "string" ],
    "default" : null,
    "columnName" : "Y_TickRange",
    "sqlType" : "-9"
  }, {
    "name" : "Y_TickRangeFrom",
    "type" : [ "null", "string" ],
    "default" : null,
    "columnName" : "Y_TickRangeFrom",
    "sqlType" : "-9"
  }, {
    "name" : "Y_TickRangeTom",
    "type" : [ "null", "string" ],
    "default" : null,
    "columnName" : "Y_TickRangeTom",
    "sqlType" : "-9"
  }, {
    "name" : "IndexCount",
    "type" : [ "null", "int" ],
    "default" : null,
    "columnName" : "IndexCount",
    "sqlType" : "4"
  }, {
    "name" : "X_IndexCount",
    "type" : [ "null", "int" ],
    "default" : null,
    "columnName" : "X_IndexCount",
    "sqlType" : "4"
  }, {
    "name" : "Y_IndexCount",
    "type" : [ "null", "int" ],
    "default" : null,
    "columnName" : "Y_IndexCount",
    "sqlType" : "4"
  }, {
    "name" : "X_Symbol",
    "type" : [ "null", "string" ],
    "default" : null,
    "columnName" : "X_Symbol",
    "sqlType" : "-9"
  }, {
    "name" : "X_SymbolName",
    "type" : [ "null", "string" ],
    "default" : null,
    "columnName" : "X_SymbolName",
    "sqlType" : "-9"
  }, {
    "name" : "X_SymbolDescr",
    "type" : [ "null", "string" ],
    "default" : null,
    "columnName" : "X_SymbolDescr",
    "sqlType" : "-9"
  }, {
    "name" : "Y_Symbol",
    "type" : [ "null", "string" ],
    "default" : null,
    "columnName" : "Y_Symbol",
    "sqlType" : "-9"
  }, {
    "name" : "Y_SymbolName",
    "type" : [ "null", "string" ],
    "default" : null,
    "columnName" : "Y_SymbolName",
    "sqlType" : "-9"
  }, {
    "name" : "Y_SymbolDescr",
    "type" : [ "null", "string" ],
    "default" : null,
    "columnName" : "Y_SymbolDescr",
    "sqlType" : "-9"
  }, {
    "name" : "SmapiName",
    "type" : [ "null", "string" ],
    "default" : null,
    "columnName" : "SmapiName",
    "sqlType" : "12"
  }, {
    "name" : "Incorrect_Ver_FL",
    "type" : [ "null", "boolean" ],
    "default" : null,
    "columnName" : "Incorrect_Ver_FL",
    "sqlType" : "-7"
  } ],
  "tableName" : "DimSampleDesc"
}[sqoop@l1038lab root]$

create the hive table in the schema i.e first do a 'use database' on the hive prompt and then create the table :

hive>
    > CREATE EXTERNAL TABLE DimSampleDesc  ROW FORMAT SERDE  'org.apache.hadoop.hive.serde2.avro.AvroSerDe'  STORED as INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'  OUTPUTFORMAT  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'  TBLPROPERTIES (    'avro.schema.url'='hdfs://l1031lab.sss.se.com:8020/dataload/tohdfs/reio/odpdw/may2016/DimSampleDesc/DimSampleDesc.avsc');
OK
Time taken: 0.37 seconds
hive>

--describe the table

hive>
    >
    > describe formatted DimSampleDesc;
OK
# col_name              data_type               comment
smapiname_ver           string
smapicolname            string
charttype               int
x_indexet               int
y_indexet               int
x_tick                  string
y_tick                  string
x_tickrange             string
x_tickrangefrom         string
x_tickrangetom          string
y_tickrange             string
y_tickrangefrom         string
y_tickrangetom          string
indexcount              int
x_indexcount            int
y_indexcount            int
x_symbol                string
x_symbolname            string
x_symboldescr           string
y_symbol                string
y_symbolname            string
y_symboldescr           string
smapiname               string
incorrect_ver_fl        boolean
# Detailed Table Information
Database:               odp_dw_may2016
Owner:                  hive
CreateTime:             Mon May 09 16:37:08 CEST 2016
LastAccessTime:         UNKNOWN
Protect Mode:           None
Retention:              0
Location:               hdfs://l1031lab.sss.se.com:8020/apps/hive/warehouse/odp_dw_may2016.db/dimsampledesc
Table Type:             EXTERNAL_TABLE
Table Parameters:
        COLUMN_STATS_ACCURATE   false
        EXTERNAL                TRUE
        avro.schema.url         hdfs://l1031lab.sss.se.com:8020/dataload/tohdfs/reio/odpdw/may2016/DimSampleDesc/DimSampleDesc.avsc
        numFiles                6
        numRows                 -1
        rawDataSize             -1
        totalSize               8566342
        transient_lastDdlTime   1462804628
# Storage Information
SerDe Library:          org.apache.hadoop.hive.serde2.avro.AvroSerDe
InputFormat:            org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat
Compressed:             No
Num Buckets:            -1
Bucket Columns:         []
Sort Columns:           []
Storage Desc Params:
        serialization.format    1
Time taken: 0.431 seconds, Fetched: 56 row(s)
hive>

--execute some select queries

hive>
    > select * from DimSampleDesc limit 5;
OK
Abnormal_ShutDown_ct_1  Abnormal_ShutDown_ct_1_000      7       0       0                                                                       1       1       1       times   counts  -   Abnormal_ShutDown_ct     NULL
Abnormal_ShutDown_ct_2  Abnormal_ShutDown_ct_2_000      7       0       0                                                                       1       1       1       %       percent -   Abnormal_ShutDown_ct     NULL
ABS_ActivCOO_ct_1       ABS_ActivCOO_ct_1_000   7       0       0                                                                       1       1       1       times   counts  -           ABS_ActivCOO_ct  NULL
ABS_ActivCOO_ct_2       ABS_ActivCOO_ct_2_000   1       0       0       0               0       0       0                               3       3       0       times   counts  -           ABS_ActivCOO_ct  NULL
ABS_ActivCOO_ct_2       ABS_ActivCOO_ct_2_001   1       1       0       1               1       1       1                               3       3       0       times   counts  -           ABS_ActivCOO_ct  NULL
Time taken: 0.095 seconds, Fetched: 5 row(s)
hive>

The core is to have the avsc and avro files in place and ensure proper permissions.

I hope that helps.

View solution in original post

6 REPLIES 6

avatar
Rising Star

I'm also having a lot of problems with Avro (see here) and have seen exactly the same problem you are having.

In my case I was sending Avro objects to Kafka and then having Flume transfer them from Kafka to HDFS. If I generated the objects (from Python) using DatumWrite and sent bytes I could not decode them in HDFS or Hive, even if I specified a schema. If I used DataFileWrite (such that the schema was included in the object) and uploaded to HDFS manually, all was fine.

I suspect the problem is with the Avro SerDe. It looks like it is ignoring the supplied schema and always looking in the object for the schema definition.

avatar
Rising Star

@mark doutre

Hi Mark,

Can you please plrovide me a sample how to add Avro schema with avro data??

Thanks

Mamta

avatar
Super Collaborator

I am novice to data loading and Avro so I can't verify the avro schema etc. provided in the post but I did face the same issue (java.io.IOException: java.io.IOException: Not a data file). I am listing below the steps I carried out to load a sql server table to avro and then to a Hive external table, maybe, it will provide some pointers :

--create dir. for sqoop import

[hdfs@l1031lab root]$  hadoop fs -mkdir -p /dataload/tohdfs/reio/odpdw/may2016/
[hdfs@l1031lab root]$

--grant permissions to sqoop

[hdfs@l1031lab root]$ hadoop fs -chown -R sqoop /dataload
[hdfs@l1031lab root]$
[hdfs@l1031lab root]$ hadoop fs -ls /dataload/tohdfs/reio/odpdw/may2016/
[hdfs@l1031lab root]$

--sqoop import

sqoop import --connect 'jdbc:sqlserver://dbserver;database=dbname' --username someusername --password somepassword --as-avrodatafile --num-mappers 8 --table DimSampleDesc --warehouse-dir /dataload/tohdfs/reio/odpdw/may2016 --verbose

An exception was thrown but the avro files were created :

Writing Avro schema file: /tmp/sqoop-sqoop/compile/e64596608ce0247bf2233353991b20fd/DimSampleDesc.avsc
16/05/09 13:09:00 DEBUG mapreduce.DataDrivenImportJob: Could not move Avro schema file to code output directory.
java.io.FileNotFoundException: Destination directory '.' does not exist [createDestDir=true]
        at org.apache.commons.io.FileUtils.moveFileToDirectory(FileUtils.java:2865)
        at org.apache.sqoop.mapreduce.DataDrivenImportJob.writeAvroSchema(DataDrivenImportJob.java:146)
        at org.apache.sqoop.mapreduce.DataDrivenImportJob.configureMapper(DataDrivenImportJob.java:92)
        at org.apache.sqoop.mapreduce.ImportJobBase.runImport(ImportJobBase.java:260)
        at org.apache.sqoop.manager.SqlManager.importTable(SqlManager.java:673)
        at org.apache.sqoop.manager.SQLServerManager.importTable(SQLServerManager.java:163)
        at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:497)
        at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:605)
        at org.apache.sqoop.Sqoop.run(Sqoop.java:148)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:184)
        at org.apache.sqoop.Sqoop.runTool(Sqoop.java:226)
        at org.apache.sqoop.Sqoop.runTool(Sqoop.java:235)
        at org.apache.sqoop.Sqoop.main(Sqoop.java:244)

error in moving the schema file, hence, do manually. If this step is skipped, the table will still get created but when you do a 'select' on the table, you get the 'Not a data file' error message.

[sqoop@l1038lab root]$ hadoop fs -copyFromLocal /tmp/sqoop-sqoop/compile/e64596608ce0247bf2233353991b20fd/* /dataload/tohdfs/reio/odpdw/may2016/DimSampleDesc/

to allow hive to write, give write permissions to all

[sqoop@l1038lab root]$ hadoop fs -chmod -R a+w /dataload/
[sqoop@l1038lab root]$

view the warehouse-dir

[sqoop@l1038lab root]$ hadoop fs -ls /dataload/tohdfs/reio/odpdw/may2016/DimSampleDesc
Found 15 items
-rw-rw-rw-   3 sqoop hdfs       3659 2016-05-09 16:18 /dataload/tohdfs/reio/odpdw/may2016/DimSampleDesc/DimSampleDesc.avsc
-rw-rw-rw-   3 sqoop hdfs      28540 2016-05-09 16:18 /dataload/tohdfs/reio/odpdw/may2016/DimSampleDesc/DimSampleDesc.class
-rw-rw-rw-   3 sqoop hdfs       9568 2016-05-09 16:18 /dataload/tohdfs/reio/odpdw/may2016/DimSampleDesc/DimSampleDesc.jar
-rw-rw-rw-   3 sqoop hdfs      61005 2016-05-09 16:18 /dataload/tohdfs/reio/odpdw/may2016/DimSampleDesc/DimSampleDesc.java
-rw-rw-rw-   3 sqoop hdfs          0 2016-05-09 16:10 /dataload/tohdfs/reio/odpdw/may2016/DimSampleDesc/_SUCCESS
-rw-rw-rw-   3 sqoop hdfs       2660 2016-05-09 16:10 /dataload/tohdfs/reio/odpdw/may2016/DimSampleDesc/part-m-00000.avro
-rw-rw-rw-   3 sqoop hdfs    1465072 2016-05-09 16:10 /dataload/tohdfs/reio/odpdw/may2016/DimSampleDesc/part-m-00001.avro
-rw-rw-rw-   3 sqoop hdfs    3577459 2016-05-09 16:10 /dataload/tohdfs/reio/odpdw/may2016/DimSampleDesc/part-m-00002.avro
-rw-rw-rw-   3 sqoop hdfs    1203520 2016-05-09 16:10 /dataload/tohdfs/reio/odpdw/may2016/DimSampleDesc/part-m-00003.avro
-rw-rw-rw-   3 sqoop hdfs     236282 2016-05-09 16:10 /dataload/tohdfs/reio/odpdw/may2016/DimSampleDesc/part-m-00004.avro
-rw-rw-rw-   3 sqoop hdfs     638532 2016-05-09 16:10 /dataload/tohdfs/reio/odpdw/may2016/DimSampleDesc/part-m-00005.avro
-rw-rw-rw-   3 sqoop hdfs     850454 2016-05-09 16:10 /dataload/tohdfs/reio/odpdw/may2016/DimSampleDesc/part-m-00006.avro
-rw-rw-rw-   3 sqoop hdfs     257025 2016-05-09 16:10 /dataload/tohdfs/reio/odpdw/may2016/DimSampleDesc/part-m-00007.avro
-rw-rw-rw-   3 sqoop hdfs     341204 2016-05-09 16:10 /dataload/tohdfs/reio/odpdw/may2016/DimSampleDesc/part-m-00008.avro
-rw-rw-rw-   3 sqoop hdfs       4792 2016-05-09 16:10 /dataload/tohdfs/reio/odpdw/may2016/DimSampleDesc/part-m-00009.avro
[sqoop@l1038lab root]$

view the schema file

[sqoop@l1038lab root]$ hadoop fs -cat /dataload/tohdfs/reio/odpdw/may2016/DimSampleDesc/DimSampleDesc.avsc
{
  "type" : "record",
  "name" : "DimSampleDesc",
  "doc" : "Sqoop import of DimSampleDesc",
  "fields" : [ {
    "name" : "SmapiName_ver",
    "type" : [ "null", "string" ],
    "default" : null,
    "columnName" : "SmapiName_ver",
    "sqlType" : "12"
  }, {
    "name" : "SmapiColName",
    "type" : [ "null", "string" ],
    "default" : null,
    "columnName" : "SmapiColName",
    "sqlType" : "12"
  }, {
    "name" : "ChartType",
    "type" : [ "null", "int" ],
    "default" : null,
    "columnName" : "ChartType",
    "sqlType" : "4"
  }, {
    "name" : "X_Indexet",
    "type" : [ "null", "int" ],
    "default" : null,
    "columnName" : "X_Indexet",
    "sqlType" : "4"
  }, {
    "name" : "Y_Indexet",
    "type" : [ "null", "int" ],
    "default" : null,
    "columnName" : "Y_Indexet",
    "sqlType" : "4"
  }, {
    "name" : "X_Tick",
    "type" : [ "null", "string" ],
    "default" : null,
    "columnName" : "X_Tick",
    "sqlType" : "-9"
  }, {
    "name" : "Y_Tick",
    "type" : [ "null", "string" ],
    "default" : null,
    "columnName" : "Y_Tick",
    "sqlType" : "-9"
  }, {
    "name" : "X_TickRange",
    "type" : [ "null", "string" ],
    "default" : null,
    "columnName" : "X_TickRange",
    "sqlType" : "-9"
  }, {
    "name" : "X_TickRangeFrom",
    "type" : [ "null", "string" ],
    "default" : null,
    "columnName" : "X_TickRangeFrom",
    "sqlType" : "-9"
  }, {
    "name" : "X_TickRangeTom",
    "type" : [ "null", "string" ],
    "default" : null,
    "columnName" : "X_TickRangeTom",
    "sqlType" : "-9"
  }, {
    "name" : "Y_TickRange",
    "type" : [ "null", "string" ],
    "default" : null,
    "columnName" : "Y_TickRange",
    "sqlType" : "-9"
  }, {
    "name" : "Y_TickRangeFrom",
    "type" : [ "null", "string" ],
    "default" : null,
    "columnName" : "Y_TickRangeFrom",
    "sqlType" : "-9"
  }, {
    "name" : "Y_TickRangeTom",
    "type" : [ "null", "string" ],
    "default" : null,
    "columnName" : "Y_TickRangeTom",
    "sqlType" : "-9"
  }, {
    "name" : "IndexCount",
    "type" : [ "null", "int" ],
    "default" : null,
    "columnName" : "IndexCount",
    "sqlType" : "4"
  }, {
    "name" : "X_IndexCount",
    "type" : [ "null", "int" ],
    "default" : null,
    "columnName" : "X_IndexCount",
    "sqlType" : "4"
  }, {
    "name" : "Y_IndexCount",
    "type" : [ "null", "int" ],
    "default" : null,
    "columnName" : "Y_IndexCount",
    "sqlType" : "4"
  }, {
    "name" : "X_Symbol",
    "type" : [ "null", "string" ],
    "default" : null,
    "columnName" : "X_Symbol",
    "sqlType" : "-9"
  }, {
    "name" : "X_SymbolName",
    "type" : [ "null", "string" ],
    "default" : null,
    "columnName" : "X_SymbolName",
    "sqlType" : "-9"
  }, {
    "name" : "X_SymbolDescr",
    "type" : [ "null", "string" ],
    "default" : null,
    "columnName" : "X_SymbolDescr",
    "sqlType" : "-9"
  }, {
    "name" : "Y_Symbol",
    "type" : [ "null", "string" ],
    "default" : null,
    "columnName" : "Y_Symbol",
    "sqlType" : "-9"
  }, {
    "name" : "Y_SymbolName",
    "type" : [ "null", "string" ],
    "default" : null,
    "columnName" : "Y_SymbolName",
    "sqlType" : "-9"
  }, {
    "name" : "Y_SymbolDescr",
    "type" : [ "null", "string" ],
    "default" : null,
    "columnName" : "Y_SymbolDescr",
    "sqlType" : "-9"
  }, {
    "name" : "SmapiName",
    "type" : [ "null", "string" ],
    "default" : null,
    "columnName" : "SmapiName",
    "sqlType" : "12"
  }, {
    "name" : "Incorrect_Ver_FL",
    "type" : [ "null", "boolean" ],
    "default" : null,
    "columnName" : "Incorrect_Ver_FL",
    "sqlType" : "-7"
  } ],
  "tableName" : "DimSampleDesc"
}[sqoop@l1038lab root]$

create the hive table in the schema i.e first do a 'use database' on the hive prompt and then create the table :

hive>
    > CREATE EXTERNAL TABLE DimSampleDesc  ROW FORMAT SERDE  'org.apache.hadoop.hive.serde2.avro.AvroSerDe'  STORED as INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'  OUTPUTFORMAT  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'  TBLPROPERTIES (    'avro.schema.url'='hdfs://l1031lab.sss.se.com:8020/dataload/tohdfs/reio/odpdw/may2016/DimSampleDesc/DimSampleDesc.avsc');
OK
Time taken: 0.37 seconds
hive>

--describe the table

hive>
    >
    > describe formatted DimSampleDesc;
OK
# col_name              data_type               comment
smapiname_ver           string
smapicolname            string
charttype               int
x_indexet               int
y_indexet               int
x_tick                  string
y_tick                  string
x_tickrange             string
x_tickrangefrom         string
x_tickrangetom          string
y_tickrange             string
y_tickrangefrom         string
y_tickrangetom          string
indexcount              int
x_indexcount            int
y_indexcount            int
x_symbol                string
x_symbolname            string
x_symboldescr           string
y_symbol                string
y_symbolname            string
y_symboldescr           string
smapiname               string
incorrect_ver_fl        boolean
# Detailed Table Information
Database:               odp_dw_may2016
Owner:                  hive
CreateTime:             Mon May 09 16:37:08 CEST 2016
LastAccessTime:         UNKNOWN
Protect Mode:           None
Retention:              0
Location:               hdfs://l1031lab.sss.se.com:8020/apps/hive/warehouse/odp_dw_may2016.db/dimsampledesc
Table Type:             EXTERNAL_TABLE
Table Parameters:
        COLUMN_STATS_ACCURATE   false
        EXTERNAL                TRUE
        avro.schema.url         hdfs://l1031lab.sss.se.com:8020/dataload/tohdfs/reio/odpdw/may2016/DimSampleDesc/DimSampleDesc.avsc
        numFiles                6
        numRows                 -1
        rawDataSize             -1
        totalSize               8566342
        transient_lastDdlTime   1462804628
# Storage Information
SerDe Library:          org.apache.hadoop.hive.serde2.avro.AvroSerDe
InputFormat:            org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat
Compressed:             No
Num Buckets:            -1
Bucket Columns:         []
Sort Columns:           []
Storage Desc Params:
        serialization.format    1
Time taken: 0.431 seconds, Fetched: 56 row(s)
hive>

--execute some select queries

hive>
    > select * from DimSampleDesc limit 5;
OK
Abnormal_ShutDown_ct_1  Abnormal_ShutDown_ct_1_000      7       0       0                                                                       1       1       1       times   counts  -   Abnormal_ShutDown_ct     NULL
Abnormal_ShutDown_ct_2  Abnormal_ShutDown_ct_2_000      7       0       0                                                                       1       1       1       %       percent -   Abnormal_ShutDown_ct     NULL
ABS_ActivCOO_ct_1       ABS_ActivCOO_ct_1_000   7       0       0                                                                       1       1       1       times   counts  -           ABS_ActivCOO_ct  NULL
ABS_ActivCOO_ct_2       ABS_ActivCOO_ct_2_000   1       0       0       0               0       0       0                               3       3       0       times   counts  -           ABS_ActivCOO_ct  NULL
ABS_ActivCOO_ct_2       ABS_ActivCOO_ct_2_001   1       1       0       1               1       1       1                               3       3       0       times   counts  -           ABS_ActivCOO_ct  NULL
Time taken: 0.095 seconds, Fetched: 5 row(s)
hive>

The core is to have the avsc and avro files in place and ensure proper permissions.

I hope that helps.

avatar
Rising Star

Thanks @Kaliyug Antagonist, for the solution it worked.

avatar
New Contributor

can you let me know how did it work for you? Did you give permissions to avsc and avro files and created hive tables to solve above issue?

avatar
New Contributor

For me, the issue was due to some data already present at external table hdfs location. Since that data was not in avro format table was not able to read it properly.

I just removed it and the table worked fine.