Basics unclear about external table and hdfs file relation

Super Collaborator

Stack : Installed HDP-2.3.2.0-2950 using Ambari 2.1

I imported a table from MS SQL Server to HDFS using Sqoop. I ran into permission issues there, but was able to proceed with the community's help. The corresponding files on HDFS are shown in the image:

[screenshot: sqoop-table-import-from-sql-server-to-hdfs.png]

My objectives are:

  1. Create an external table in Hive using the above files
  2. Dropping this table later and recreating it, say using Parquet, shouldn't cause any data loss or errors
  3. Preferably, store the Hive table at a custom location, but this is NOT mandatory

Accordingly, I tried to create the table :

CREATE EXTERNAL TABLE DimSampleDesc (
  SmapiName_ver varchar(30), SmapiColName varchar(35), ChartType int,
  X_Indexet int, Y_Indexet int, X_Tick varchar(20), Y_Tick varchar(20),
  X_TickRange varchar(40), X_TickRangeFrom varchar(20), X_TickRangeTom varchar(20),
  Y_TickRange varchar(40), Y_TickRangeFrom varchar(20), Y_TickRangeTom varchar(20),
  IndexCount int, X_IndexCount int, Y_IndexCount int,
  X_Symbol varchar(10), X_SymbolName varchar(40), X_SymbolDescr varchar(40),
  Y_Symbol varchar(10), Y_SymbolName varchar(40), Y_SymbolDescr varchar(40),
  SmapiName varchar(30), Incorrect_Ver_FL boolean
)
COMMENT 'EXTERNAL, ORC table'
STORED AS ORC
LOCATION '/dataload/tohdfs/reio/odpdw/may2016/DimSampleDesc';

The error made me question my understanding of 'location' - I assumed it is the 'read/source' location, but now Hive appears to be trying to write there. If that is the case, how do I specify the source location for the external table?

FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:java.security.AccessControlException: Permission denied: user=hive, access=WRITE, inode="/dataload/tohdfs/reio/odpdw/may2016/DimSampleDesc":sqoop:hdfs:drwxr-xr-x
        at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319)
        at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:219)
        at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1771)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1755)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPathAccess(FSDirectory.java:1729)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkAccess(FSNamesystem.java:8285)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.checkAccess(NameNodeRpcServer.java:1914)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.checkAccess(ClientNamenodeProtocolServerSideTranslatorPB.java:1443)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2137)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2133)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2131)
1 ACCEPTED SOLUTION

Master Guru

Keeping the imported table as-is and transforming it (to ORC, Parquet, etc.) is the preferred way. You keep your data, there is no need to import again if something unplanned happens, and once you have decided how to handle it you can drop the imported table. (You can also import directly into Hive as ORC; there is a guide on how to do that on HCC.) Regarding the second question, my example was with a managed table (stored at /apps/hive/warehouse); if you wish to store tables elsewhere, you can create another external table, provide the location, and write into it.
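
For the "import directly into Hive as ORC" route, Sqoop's HCatalog integration can do it in one step. A rough sketch, assuming a Sqoop build with HCatalog support on the cluster; the connection string, credentials, and target database/table names below are placeholders, not values from this thread:

sqoop import \
  --connect 'jdbc:sqlserver://<host>:1433;databaseName=<db>' \
  --username <user> --password <pass> \
  --table DimSampleDesc \
  --hcatalog-database default \
  --hcatalog-table dimsampledesc_orc \
  --create-hcatalog-table \
  --hcatalog-storage-stanza 'stored as orcfile'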


11 REPLIES

Master Guru

Your location is okay, but the format of the files is not. It looks like you imported your table as Avro files, so in your declaration you have to say "STORED AS AVRO", not "STORED AS ORC". Once that succeeds you can first test your table by selecting some rows (SELECT COUNT(*), etc.), and then create your ORC table in Hive, for example: "CREATE TABLE DimSampleDesc_orc STORED AS ORC AS SELECT * FROM DimSampleDesc". You also have a permission issue: the owner of the files is sqoop, but user hive needs write permission to create the external table. There are several ways to resolve this; the fastest is to give write permissions to everyone:

hdfs dfs -chmod -R a+w /dataload/tohdfs/reio/odpdw/may2016/DimSampleDesc

However, you should think about how to handle permissions in your system. For example, you can run Sqoop etc. as an end user (like user1), add user1 to the hadoop group, and give write permissions to the group (g+w). That will work because user hive also belongs to the hadoop group. Or you can use Ranger to manage permissions.
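
A sketch of that group-based setup, assuming user1 is the end user who runs Sqoop and reusing the import directory from this thread; the chown step is an assumption about how the already-imported files would be handed over:

# add the end user to the hadoop group (run as root on the node where they log in)
usermod -a -G hadoop user1
# assumption: re-own the already-imported directory to that user/group (needs the HDFS superuser)
sudo -u hdfs hdfs dfs -chown -R user1:hadoop /dataload/tohdfs/reio/odpdw/may2016/DimSampleDesc
# give the group (which includes user hive) write access, as described above
sudo -u hdfs hdfs dfs -chmod -R g+w /dataload/tohdfs/reio/odpdw/may2016/DimSampleDesc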

Super Collaborator

I only partially understood the solution you are suggesting:

  1. Does it mean that I need to create two tables - first, say, DimSampleDesc_avro (an external table), then, say, DimSampleDesc_orc (a MANAGED table) - and then drop DimSampleDesc_avro? Doesn't that defeat the original purpose of creating an external table? Moreover, are two tables mandatory?
  2. Is it possible to give Hive another location to write to - I mean, will it meddle with '/dataload/tohdfs/reio/odpdw/may2016/DimSampleDesc', or not? As shown in the picture below, I have created another location where I wish Hive to store the tables.

[screenshot: hive-intended-location-for-storing-table.png]

Super Collaborator
Another update:
The empty table gets created, but the data load fails:
hive>
    >
    > CREATE EXTERNAL TABLE DimSampleDesc ( SmapiName_ver varchar(30) ,SmapiColName varchar(35),ChartType int,X_Indexet int,Y_Indexet int,X_Tick varchar(20),Y_Tick varchar(20),X_TickRange varchar(40),X_TickRangeFrom varchar(20),X_TickRangeTom varchar(20),Y_TickRange varchar(40),Y_TickRangeFrom varchar(20),Y_TickRangeTom varchar(20),IndexCount int,X_IndexCount int,Y_IndexCount int,X_Symbol varchar(10),X_SymbolName varchar(40),X_SymbolDescr varchar(40),Y_Symbol varchar(10),Y_SymbolName varchar(40),Y_SymbolDescr varchar(40),SmapiName varchar(30),Incorrect_Ver_FL boolean ) COMMENT 'EXTERNAL, ORC table' ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' STORED AS ORC LOCATION '/datastore/hdfstohive/reio/odpdw/may2016';
OK
Time taken: 0.187 seconds
hive>
    > select * from DimSampleDesc;
OK
Time taken: 0.367 seconds
hive>
hive>
    > LOAD DATA INPATH '/dataload/tohdfs/reio/odpdw/may2016/DimSampleDesc/*.avro' INTO TABLE DimSampleDesc;
Loading data to table odp_dw_may2016.dimsampledesc
Failed with exception Unable to move source hdfs://l1031lab.sss.se..com:8020/dataload/tohdfs/reio/odpdw/may2016/DimSampleDesc/part-m-00000.avro to destination hdfs://l1031lab.sss.se.com:8020/datastore/hdfstohive/reio/odpdw/may2016/part-m-00000.avro
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask
hive>
    >


Master Guru

Try this

-- Imported table
CREATE EXTERNAL TABLE DimSampleDesc_avro(SmapiName_ver varchar(30),...) STORED AS AVRO LOCATION '/dataload/tohdfs/reio/odpdw/may2016/DimSampleDesc';
-- External ORC table
CREATE EXTERNAL TABLE DimSampleDesc(SmapiName_ver varchar(30), ...) STORED AS ORC LOCATION '/datastore/hdfstohive/reio/odpdw/may2016';
INSERT OVERWRITE TABLE DimSampleDesc SELECT * FROM DimSampleDesc_avro;
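
Once the INSERT finishes, a quick sanity check along the lines suggested earlier in this thread (the row counts of the ORC copy and the Avro original should match):

SELECT COUNT(*) FROM DimSampleDesc_avro;
SELECT COUNT(*) FROM DimSampleDesc;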

Super Collaborator

@Predag Minovic

Couldn't comment on the post despite several attempts, hence this 'answer'.

Changed the permissions so that Hive has write access:

[sqoop@l1038lab root]$ hadoop fs -chmod -R a+w /dataload/
[sqoop@l1038lab root]$

The table gets created, but all fields are NULL:

hive> CREATE EXTERNAL TABLE DimSampleDesc ( SmapiName_ver varchar(30), SmapiColName varchar(35), ChartType int, X_Indexet int, Y_Indexet int, X_Tick varchar(20), Y_Tick varchar(20), X_TickRange varchar(40), X_TickRangeFrom varchar(20), X_TickRangeTom varchar(20), Y_TickRange varchar(40), Y_TickRangeFrom varchar(20), Y_TickRangeTom varchar(20), IndexCount int, X_IndexCount int, Y_IndexCount int, X_Symbol varchar(10), X_SymbolName varchar(40), X_SymbolDescr varchar(40), Y_Symbol varchar(10), Y_SymbolName varchar(40), Y_SymbolDescr varchar(40), SmapiName varchar(30), Incorrect_Ver_FL boolean ) COMMENT 'EXTERNAL, ORC table' STORED AS Avro LOCATION '/dataload/tohdfs/reio/odpdw/may2016/DimSampleDesc/';
OK
Time taken: 0.206 seconds
hive> select * from DimSampleDesc limit 10;
OK
NULL  NULL  NULL  NULL  NULL  NULL  NULL  NULL  NULL  NULL  NULL  NULL  NULL  NULL  NULL  NULL  NULL  NULL  NULL  NULL  NULL  NULL  NULL  NULL
(the remaining 9 rows are identical - every one of the 24 columns is NULL)
Time taken: 0.099 seconds, Fetched: 10 row(s)

Even providing all the SerDe and input/output format options explicitly doesn't help:

hive> drop table DimSampleDesc;
OK
Time taken: 0.422 seconds
hive> CREATE EXTERNAL TABLE DimSampleDesc ( SmapiName_ver varchar(30), SmapiColName varchar(35), ChartType int, X_Indexet int, Y_Indexet int, X_Tick varchar(20), Y_Tick varchar(20), X_TickRange varchar(40), X_TickRangeFrom varchar(20), X_TickRangeTom varchar(20), Y_TickRange varchar(40), Y_TickRangeFrom varchar(20), Y_TickRangeTom varchar(20), IndexCount int, X_IndexCount int, Y_IndexCount int, X_Symbol varchar(10), X_SymbolName varchar(40), X_SymbolDescr varchar(40), Y_Symbol varchar(10), Y_SymbolName varchar(40), Y_SymbolDescr varchar(40), SmapiName varchar(30), Incorrect_Ver_FL boolean ) COMMENT 'EXTERNAL, AVRO table' ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' LOCATION '/dataload/tohdfs/reio/odpdw/may2016/DimSampleDesc/';
OK
Time taken: 0.169 seconds
hive> select * from DimSampleDesc limit 10;
OK
NULL  NULL  NULL  NULL  NULL  NULL  NULL  NULL  NULL  NULL  NULL  NULL  NULL  NULL  NULL  NULL  NULL  NULL  NULL  NULL  NULL  NULL  NULL  NULL
(again, all 10 rows come back with every column NULL)
Time taken: 0.107 seconds, Fetched: 10 row(s)

These NULLs are bothering me - I didn't get any errors from Sqoop while importing the files onto HDFS. Is the issue with the data import in the first place, or with this Hive table creation?

Master Guru

Yeah, that doesn't look good. Can you try to "get" one Avro file to your local file system and check whether it is correct? You can use Avro Tools. Or, if Avro is not a must, import the table as a text file and explore the Avro path later.
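
A sketch of that check with Avro Tools, assuming an avro-tools jar (any 1.7.x build) is available locally; the part file name is taken from the earlier error message in this thread:

hdfs dfs -get /dataload/tohdfs/reio/odpdw/may2016/DimSampleDesc/part-m-00000.avro /tmp/
# print the schema embedded in the file
java -jar avro-tools-1.7.7.jar getschema /tmp/part-m-00000.avro
# dump a few records as JSON to eyeball the values
java -jar avro-tools-1.7.7.jar tojson /tmp/part-m-00000.avro | head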

Super Collaborator

Yeah, I share your opinion too.

I am uncomfortable installing any new tools on the cluster machines. I did find jars like /usr/hdp/2.3.2.0-2950/sqoop/lib/avro-mapred-1.7.5-hadoop2.jar and /usr/hdp/2.3.2.0-2950/pig/lib/avro-tools-1.7.5-nodeps.jar, but I am unable to find the classes/commands to use just to view the Avro files already present on HDFS.

Master Guru

In your Avro table declaration in Hive, can you try to replace VARCHAR(n) with STRING? I just imported one of my tables from MySQL as Avro, with 3 columns: int, varchar(16) and timestamp. When I declared the varchar(16) column as string in Hive it worked; when I declared it as varchar(16), "SELECT *" failed with an error. Also, on import Sqoop creates an .avsc file in your current directory; you can inspect that file to find out which types Sqoop used when creating the Avro files.
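
A minimal sketch of that change, showing only the first two columns of the table from this thread for brevity; the remaining columns would be switched from varchar(n) to STRING in the same way, and the names/types can be cross-checked against the .avsc file Sqoop generated:

CREATE EXTERNAL TABLE DimSampleDesc_avro (
  SmapiName_ver STRING,   -- was varchar(30)
  SmapiColName  STRING    -- was varchar(35); declare the rest as STRING likewise
)
STORED AS AVRO
LOCATION '/dataload/tohdfs/reio/odpdw/may2016/DimSampleDesc';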