Support Questions

Find answers, ask questions, and share your expertise
Announcements
Check out our newest addition to the community, the Cloudera Data Analytics (CDA) group hub.

BUG: CTAS in impalad version 3.2.0-cdh6.2.0 on ADLS Gen2 broken by HADOOP-15860

Explorer

Overview

 

With a destination table using the ABFS(S) (Azure Data Lake Storage Gen2) as its location, a CTAS (create table as select) operation fails with the following error (e.g.) on CDH6.2.0 (Impala 3.2.0):

 

 

ERROR: Failed to open HDFS file for writing: abfss://glgpv2store02cdh01@glgpv2store02.dfs.core.windows.net/mydir/dest_tab/_impala_insert_staging/244c5ee8ece6f759_8b1a1e3b00000000/.244c5ee8ece6f759-8b1a1e3b00000000_1926230402_dir/244c5ee8ece6f759-8b1a1e3b00000000_45513034_data.0.
Error(22): Invalid argument
Root cause: IllegalArgumentException: ABFS does not allow files or directories to end with a dot.

 

This operation works on CDH6.1.1 (Impala 3.1.0). The error is seemingly caused by the inclusion of https://issues.apache.org/jira/browse/HADOOP-15860 in CDH6.2.0.

 

See test case below to reproduce.


Q1: Are there any workarounds available for this? E.g. is there a way to control the format of the filenames in the staging area created by Impala?

 

Q2: Are you aware of this issue and will you be fixing this in a future release? If so, can you give me the version number of CDH you will include the fix please?

 

 

Test Case

 

CDH6.2.0

Server version: impalad version 3.2.0-cdh6.2.0 RELEASE (build edc19942b4debdbfd485fbd26098eef435003f5d)

 

0. Setup

 

 

CREATE DATABASE IF NOT EXISTS test_db LOCATION 'abfss://glgpv2store02cdh01@glgpv2store02.dfs.core.windows.net/mydir';

use test_db;

CREATE TABLE source_tab (
cust_id BIGINT,
cust_first_name STRING,
cust_last_name STRING
)
PARTITIONED BY (
cust_gender STRING
)
STORED AS PARQUET;

show create table source_tab;
+-------------------------------------------------------------------------------------------+
| result |
+-------------------------------------------------------------------------------------------+
| CREATE TABLE test_db.source_tab ( |
| cust_id BIGINT, |
| cust_first_name STRING, |
| cust_last_name STRING |
| ) |
| PARTITIONED BY ( |
| cust_gender STRING |
| ) |
| STORED AS PARQUET |
| LOCATION 'abfss://glgpv2store02cdh01@glgpv2store02.dfs.core.windows.net/mydir/source_tab' |
| |
+-------------------------------------------------------------------------------------------+

INSERT INTO source_tab PARTITION (cust_gender='M') VALUES (1, 'Joe', 'Bloggs');
INSERT INTO source_tab PARTITION (cust_gender='M') VALUES (2, 'Fred', 'Bloggs');
INSERT INTO source_tab PARTITION (cust_gender='F') VALUES (3, 'Anne', 'Bloggs');
INSERT INTO source_tab PARTITION (cust_gender='F') VALUES (4, 'Izzy', 'Bloggs');

SELECT * FROM source_tab;

+---------+-----------------+----------------+-------------+
| cust_id | cust_first_name | cust_last_name | cust_gender |
+---------+-----------------+----------------+-------------+
| 1 | Joe | Bloggs | M |
| 4 | Izzy | Bloggs | F |
| 2 | Fred | Bloggs | M |
| 3 | Anne | Bloggs | F |
+---------+-----------------+----------------+-------------+

 

 

1. CTAS throws the error

 

 

CREATE TABLE dest_tab AS
SELECT cust_id,
cust_first_name,
cust_last_name
FROM source_tab;

Query: CREATE TABLE dest_tab AS
SELECT cust_id,
cust_first_name,
cust_last_name
FROM source_tab
ERROR: Failed to open HDFS file for writing: abfss://glgpv2store02cdh01@glgpv2store02.dfs.core.windows.net/mydir/dest_tab/_impala_insert_staging/244c5ee8ece6f759_8b1a1e3b00000000/.244c5ee8ece6f759-8b1a1e3b00000000_1926230402_dir/244c5ee8ece6f759-8b1a1e3b00000000_45513034_data.0.
Error(22): Invalid argument
Root cause: IllegalArgumentException: ABFS does not allow files or directories to end with a dot.

 

 

2. INSERT-SELECT works

 

 

DROP TABLE dest_tab PURGE;

CREATE TABLE dest_tab (
cust_id BIGINT,
cust_first_name STRING,
cust_last_name STRING
)
STORED AS PARQUET;

show create table dest_tab;
+-----------------------------------------------------------------------------------------+
| result |
+-----------------------------------------------------------------------------------------+
| CREATE TABLE test_db.dest_tab ( |
| cust_id BIGINT, |
| cust_first_name STRING, |
| cust_last_name STRING |
| ) |
| STORED AS PARQUET |
| LOCATION 'abfss://glgpv2store02cdh01@glgpv2store02.dfs.core.windows.net/mydir/dest_tab' |
| |
+-----------------------------------------------------------------------------------------+

INSERT INTO dest_tab
SELECT cust_id,
cust_first_name,
cust_last_name
FROM source_tab;
SELECT * FROM dest_tab; +---------+-----------------+----------------+ | cust_id | cust_first_name | cust_last_name | +---------+-----------------+----------------+ | 1 | Joe | Bloggs | | 2 | Fred | Bloggs | | 3 | Anne | Bloggs | | 4 | Izzy | Bloggs | +---------+-----------------+----------------+

 

 

CDH6.1.1

Server version: impalad version 3.1.0-cdh6.1.1 RELEASE (build 97215ce79febfa42364dbff8e4c4d3c5bfc583ba)

 

0. Setup

As above.

 

1. CTAS works

 

CREATE TABLE dest_tab AS
SELECT cust_id,
cust_first_name,
cust_last_name
FROM source_tab;

show create table dest_tab
+-----------------------------------------------------------------------------------------+
| result |
+-----------------------------------------------------------------------------------------+
| CREATE TABLE test_db.dest_tab ( |
| cust_id BIGINT, |
| cust_first_name STRING, |
| cust_last_name STRING |
| ) |
| STORED AS TEXTFILE |
| LOCATION 'abfss://glgpv2store02cdh01@glgpv2store02.dfs.core.windows.net/mydir/dest_tab' |
| |
+-----------------------------------------------------------------------------------------+

SELECT * FROM dest_tab;

+---------+-----------------+----------------+
| cust_id | cust_first_name | cust_last_name |
+---------+-----------------+----------------+
| 3 | Anne | Bloggs |
| 2 | Fred | Bloggs |
| 4 | Izzy | Bloggs |
| 1 | Joe | Bloggs |
+---------+-----------------+----------------+

 

2. INSERT-SELECT works

 

DROP TABLE dest_tab PURGE;

CREATE TABLE dest_tab (
cust_id BIGINT,
cust_first_name STRING,
cust_last_name STRING
)
STORED AS PARQUET;

show create table dest_tab;
+-----------------------------------------------------------------------------------------+
| result |
+-----------------------------------------------------------------------------------------+
| CREATE TABLE test_db.dest_tab ( |
| cust_id BIGINT, |
| cust_first_name STRING, |
| cust_last_name STRING |
| ) |
| STORED AS PARQUET |
| LOCATION 'abfss://glgpv2store02cdh01@glgpv2store02.dfs.core.windows.net/mydir/dest_tab' |
| |
+-----------------------------------------------------------------------------------------+

INSERT INTO dest_tab
SELECT cust_id,
cust_first_name,
cust_last_name
FROM source_tab;

SELECT * FROM dest_tab;

+---------+-----------------+----------------+
| cust_id | cust_first_name | cust_last_name |
+---------+-----------------+----------------+
| 1 | Joe | Bloggs |
| 2 | Fred | Bloggs |
| 4 | Izzy | Bloggs |
| 3 | Anne | Bloggs |
+---------+-----------------+----------------+
7 REPLIES 7

Guru
Hi,

Thanks for reporting this.

Based on the line below in Impala code:
https://github.com/cloudera/Impala/blob/cdh6.2.0/be/src/exec/hdfs-table-sink.cc#L329

I can see that impala tries to build file name using format of: $0.$1.$2, where $0 is final_hdfs_file_name_prefix, $1 is output_partition->num_files and $2 is output_partition->writer->file_extension().

Follow the code path to here:
https://github.com/cloudera/Impala/blob/cdh6.2.0/be/src/exec/hdfs-text-table-writer.cc#L65

I can see that $2 will be empty string if the file format is text, so the file name will end with ".".

Since impala internally builds such file name and I don't see a way to override it, I don't see there is an easy workaround if you have to use text file format.

Will you be OK to switch to other format?

I have searched internally there is no such bug reported. I will create one and share the ID.

Guru
I have reported this upstream:
https://issues.apache.org/jira/browse/IMPALA-8557

For now, please use other file formats like Parquest to see if it helps.

Explorer

Thanks for the reply and for raising the bug upstream.

 

We will track the bug to see when it is fixed and then check the release notes for future CDH versions to see if it is included.

 

I will retry my test case using parquet for the destination table.

Guru
Ahh, Thanks for pointing out about the Parquet, I missed it in your CREATE TABLE statement.

Will look a bit further to see why parquet also fails.

Explorer

No you are correct, I am wrong. The destination table is created as TEXT and not parquet. I've confirmed that this works in 6.2.0:

 

Query: CREATE TABLE dest_tab
STORED AS PARQUET
AS
SELECT  cust_id,
        cust_first_name,
        cust_last_name
FROM    source_tab
+-------------------+
| summary           |
+-------------------+
| Inserted 4 row(s) |
+-------------------+
Fetched 1 row(s) in 1.01s

Thanks.

Guru
Cool, thanks for testing out again to confirm that Parquet works.

This is useful.

Cheers
Eric

Cloudera Employee

IMPALA-8557 has been fixed. The fix should be coming in a CDH release soon.

Take a Tour of the Community
Don't have an account?
Your experience may be limited. Sign in to explore more.