Member since: 01-07-2016
Posts: 26
Kudos Received: 7
Solutions: 0
02-22-2019
12:27 AM
Hi, I have not followed the development of Impala lately. If this is still a limitation, you might try the following approach: design the schema with an additional column that indicates which struct column holds the data for a particular row, and then use this additional column in the WHERE clause. Something like:

name      complex1  complex2  complex3
complex1  content   NULL      NULL
complex3  NULL      NULL      content

and then:

SELECT complex1.*
FROM myTable
WHERE name = 'complex1'

Br, Petter
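A minimal sketch of what such a table could look like (the struct definitions are placeholders; only the table and column names come from the example above):

CREATE TABLE myTable (
  -- discriminator column: names the struct that carries data for this row
  name STRING,
  complex1 STRUCT <f1: STRING, f2: BIGINT>,
  complex2 STRUCT <f1: STRING, f2: BIGINT>,
  complex3 STRUCT <f1: STRING, f2: BIGINT>
)
STORED AS PARQUET;

Each row fills only the struct named in its name column and leaves the others NULL.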
10-09-2018
02:58 AM
Hi all, we have our cluster deployed on AWS EC2 instances where some of the worker nodes are on spot instances. Usually there is no problem when spot instances disappear; we have time to decommission them from CM. Recently we have started to experience a ResourceManager crash when we lose spot instances. See the log below. After the ResourceManager crashes it does not restart automatically, and after a while all of our remaining NodeManager processes are shut down as well, leaving no YARN capacity at all even though we have plenty of healthy machines. We are using CDH 5.14.2.

1. Is the problem in the stack trace below ("Timer already cancelled") known?
2. Can we change the configuration to have the ResourceManager automatically recover from this? I only see an automatic restart option for the JobHistory Server in CM, but perhaps this is the same process?

Br, Petter

2018-10-08 16:14:45,617 ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received RMFatalEvent of type CRITICAL_THREAD_CRASH, caused by a critical thread, FSPreemptionThread, that exited unexpectedly: java.lang.IllegalStateException: Timer already cancelled.
at java.util.Timer.sched(Timer.java:397)
at java.util.Timer.schedule(Timer.java:193)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSPreemptionThread.preemptContainers(FSPreemptionThread.java:212)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSPreemptionThread.run(FSPreemptionThread.java:77)
2018-10-08 16:14:45,623 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Shutting down the resource manager.
2018-10-08 16:14:45,624 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
2018-10-08 16:14:45,629 INFO org.mortbay.log: Stopped HttpServer2$SelectChannelConnectorWithSafeStartup@ip-10-255-4-86.eu-west-1.compute.internal:8088
2018-10-08 16:14:45,731 INFO org.apache.hadoop.ipc.Server: Stopping server on 8032
2018-10-08 16:14:45,732 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server listener on 8032
2018-10-08 16:14:45,732 INFO org.apache.hadoop.ipc.Server: Stopping server on 8033
2018-10-08 16:14:45,732 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server Responder
2018-10-08 16:14:45,732 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server listener on 8033
2018-10-08 16:14:45,733 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server Responder
2018-10-08 16:14:48,250 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at ip-10-255-4-86.eu-west-1.compute.internal/10.255.4.86:8033
2018-10-08 16:14:49,643 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: ip-10-255-4-86.eu-west-1.compute.internal/10.255.4.86:8033. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-10-08 16:14:50,644 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: ip-10-255-4-86.eu-west-1.compute.internal/10.255.4.86:8033. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-10-08 16:14:51,647 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: ip-10-255-4-
12-12-2017
07:10 AM
Hi, great! It solved my problem! For other users in the future: we upgraded a 5.10.1 cluster (without Kudu) to a 5.12.1 cluster (with Kudu). The missing part was the configuration option 'Kudu Service', which was set to none in the Impala Service-Wide configuration. Setting this to Kudu inserts the impalad startup option -kudu_master_hosts, and after that I can create tables without the TBLPROPERTIES clause and Sentry works as expected. Thank you very much, Hao!
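For other readers, the simplified statement from further down in this thread then reduces to something like this (table, columns, and range are placeholders):

CREATE TABLE my_db.my_table (
  key BIGINT,
  value STRING,
  PRIMARY KEY(key)
)
PARTITION BY RANGE (key)
(
  PARTITION 1 <= VALUES < 1000
)
-- no TBLPROPERTIES ('kudu.master_addresses'=...) needed once the Kudu Service option is set
STORED AS KUDU;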
12-11-2017
12:27 AM
Hi,

>Would you mind sharing the query how you create a new table? Did you happen to set kudu master addresses in TBLPROPERTIES clause?

I did use the TBLPROPERTIES clause. I read somewhere that it should not be needed when running in a CM environment, but in our case we have to specify it. I see now that CM has not added the --tserver_master_addrs flag to the gflagfile. See below for a simplified CREATE TABLE statement.

CREATE TABLE my_db.my_table
(
key BIGINT,
value STRING,
PRIMARY KEY(key)
)
PARTITION BY RANGE (key)
(
PARTITION 1 <= VALUES < 1000
)
STORED AS KUDU
TBLPROPERTIES ('kudu.master_addresses'='my-master-address');

Are you saying that it will work (with Sentry) if we add the --tserver_master_addrs flag to the tservers and remove the TBLPROPERTIES clause?

Br, Petter
12-08-2017
05:50 AM
Hi, thank you for your reply!

>Sorry, I missed that you are using external Kudu tables in the previous reply.

They are in fact internal tables; I do not use the EXTERNAL keyword when creating them. The only way I can let one user group (ROLE in Sentry) create their own Kudu tables (via Impala) is to grant ALL privileges at the server level. This has the side effect that the user group gets access to all data on the cluster, which is not desired. Granting ALL at the (Impala) db level does not help. Have I missed something? Will finer-grained access arrive in the future?

Br, Petter
12-06-2017
04:53 AM
Hi, we have a Sentry role that has action=ALL on db=my_db. When trying to issue a CREATE TABLE statement in Impala to create a Kudu table in my_db, we get the following error:

I1205 12:32:21.124711 47537 jni-util.cc:176] org.apache.impala.catalog.AuthorizationException: User 'my_user' does not have privileges to access: name_of_sentry_server

A workaround is to grant action=ALL on the server level to the Sentry role, but we don't want to give such a wide permission to the role. Do we need to set action=ALL on the server level in order to delegate the right to create Kudu tables to our users, or how should we set up Sentry in this case? We use CDH 5.12.1 (Kudu 1.4.0).

Br, Petter
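For context, the two grants we compared look something like this when issued from impala-shell (the role name is a placeholder):

-- db-level grant: not sufficient for CREATE TABLE ... STORED AS KUDU in our setup
GRANT ALL ON DATABASE my_db TO ROLE my_role;

-- server-level grant: works, but gives the role access to everything on the cluster
GRANT ALL ON SERVER TO ROLE my_role;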
08-23-2017
05:28 AM
Hi, we are experiencing the same issue. We are on CDH 5.10.1. Our corresponding figures read:

Planner Timeline
Analysis finished: 63015903
Equivalence classes computed: 63148873
Single node plan created: 72171446
Runtime filters computed: 72242303
Distributed plan created: 72530789
Lineage info computed: 72627976
Planning finished: 74054390
Query Timeline
Query submitted: 0
Planning finished: 212302910792
Submit for admission: 212305910788
Completed admission: 212305910788
Ready to start 13 fragment instances: 212306910788
All 13 fragment instances started: 212314910786
Rows available: 216195909152
Cancelled: 223800905948
Unregister query: 223816905942

Br, Petter
08-10-2017
02:04 AM
1 Kudo
Hi, we are experiencing the same or a similar problem. We get a lot of the following in cloudera-scm-agent.log:

[10/Aug/2017 08:00:33 +0000] 11211 ImpalaDaemonQueryMonitoring throttling_logger ERROR (31 skipped) Error fetching executing query profile at 'http://our_host_name:25000/query_profile_encoded'
Traceback (most recent call last):
File "/usr/lib64/cmf/agent/build/env/lib/python2.7/site-packages/cmf-5.10.1-py2.7.egg/cmf/monitor/impalad/query_monitor.py", line 526, in get_executing_query_profile
password=password)
File "/usr/lib64/cmf/agent/build/env/lib/python2.7/site-packages/cmf-5.10.1-py2.7.egg/cmf/url_util.py", line 67, in urlopen_with_timeout
return opener.open(url, data, timeout)
File "/usr/lib64/python2.7/urllib2.py", line 431, in open
response = self._open(req, data)
File "/usr/lib64/python2.7/urllib2.py", line 449, in _open
'_open', req)
File "/usr/lib64/python2.7/urllib2.py", line 409, in _call_chain
result = func(*args)
File "/usr/lib64/python2.7/urllib2.py", line 1244, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "/usr/lib64/python2.7/urllib2.py", line 1217, in do_open
r = h.getresponse(buffering=True)
File "/usr/lib64/python2.7/httplib.py", line 1089, in getresponse
response.begin()
File "/usr/lib64/python2.7/httplib.py", line 444, in begin
version, status, reason = self._read_status()
File "/usr/lib64/python2.7/httplib.py", line 400, in _read_status
line = self.fp.readline(_MAXLINE + 1)
File "/usr/lib64/python2.7/socket.py", line 476, in readline
data = self._sock.recv(self._rbufsize)
timeout: timed out

This results in an IMPALAD_QUERY_MONITORING_STATUS alert. We are running CDH 5.10.1 on Ubuntu 14. I guess the load is fairly high on the nodes, but not through the roof.
06-07-2017
07:18 AM
We are feeling the same pain here. In Cloudera Manager there is usually a "safety valve" for the relevant configuration files, giving you the opportunity to tweak the configuration for each role. In the Spark2 section of Cloudera Manager there is no safety valve for hive-site.xml.

Br, Petter
01-10-2017
11:34 AM
Hi Tim, thank you for taking the time to look at this issue! Br, Petter
01-10-2017
05:37 AM
1 Kudo
Hi all, I reported IMPALA-4725 last week but it seems like it has not been triaged yet. I wanted to bring some more attention to this issue (and gather possible suggestions for workarounds) since it has a heavy impact on us. To summarize, it seems like Impala mixes up values in arrays of structs, which to me looks like a fundamental problem in the Parquet reader. Alternatively, the values get mixed up when presented as a result. Either way, I would very much appreciate an informed person's view on this issue. We are running the Impala version bundled with CDH 5.8.3.

Br, Petter
- Tags:
- CDH 5.8.3
11-29-2016
07:25 AM
Hi all, Best Practices for Using Impala with S3 states "Set the safety valve fs.s3a.connection.maximum to 1500 for impalad." Can anyone clarify which safety valve field should be used, and with what syntax? I read somewhere that this setting belongs in core-site.xml, but the Impala configuration in Cloudera Manager does not seem to have a safety valve for core-site.xml. The instructions mention the safety valve for impalad, but that safety valve seems to be for command line arguments to impalad. The problem we are trying to address is the following error, which we keep getting when using Impala to query data stored in S3:

hdfsSeek(desiredPos=503890631): FSDataInputStream#seek error: com.cloudera.com.amazonaws.AmazonClientException: Unable to execute HTTP request: Timeout waiting for connection from pool

We are using CDH 5.8.3.

Thanks, Petter
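If the setting does indeed belong in core-site.xml, I would expect the entry to look something like this (the value is taken from the guide), pasted into whichever core-site.xml safety valve turns out to apply:

<property>
  <name>fs.s3a.connection.maximum</name>
  <value>1500</value>
</property>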
08-18-2016
11:51 PM
Hi, thank you very much for your reply! Just a follow-up question. Assume a scenario where we target, say, 10 GB of data stored as gzipped Parquet in each partition. We have three nodes currently, but that will increase soon. From an Impala performance perspective, which of the approaches below is best?

- Store the data in 40 Parquet files with file size = row group size = HDFS block size = 256 MB
- Store the data in 10 Parquet files with file size = row group size = HDFS block size = 1 GB
- Store the data in 10 Parquet files with file size 1 GB, row group size = HDFS block size = 256 MB

Thanks, Petter
08-12-2016
02:45 AM
I have described an issue with time-consuming Parquet file generation in the Hive forum; see that post for a description of the environment. The question is half Impala-related, so I would appreciate it if any Impala experts here could read that post as well. https://community.cloudera.com/t5/Batch-SQL-Apache-Hive/How-to-improve-performance-when-creating-Parquet-files-with-Hive/m-p/43804#U43804

I have some additional questions that are Impala-specific. The environment currently has three Impala nodes with 5-10 GB worth of data in each partition. The question is how I should generate the Parquet files to get the most performance out of Impala. Currently I target a Parquet file size of 1 GB each. The HDFS block size is set to 256 MB for these files, and I have instructed Hive to create row groups of the same size. Surprisingly, I get many more row groups; I just picked a random file and it contained 91 row groups. Given our environment, what should we aim for in terms of file size, number of row groups per file, and HDFS block size for these files? Also, if it would be more beneficial to have fewer row groups per file, how can we instruct Hive to generate fewer row groups, since Hive does not seem to respect the parquet.block.size option? We use the Impala version bundled with CDH 5.7.1.

Thanks in advance, Petter
08-11-2016
01:33 AM
1 Kudo
We are generating Parquet files (to be used by Impala) daily with Hive. The reason is that the source file format is proprietary and not supported by Impala. The process works fine, but it takes a long time for each conversion job to finish. It seems like the process of writing Parquet files is very time-consuming: the job uses very few map tasks, and each map task can take several hours to complete. We are interested in getting a Parquet file layout (i.e. file size and page size) that will be performant when used with Impala.

Each conversion job generates three tables. The most time-consuming table to generate has approximately 200 columns, where 30 columns have scalar types and 170 have complex (nested) data types. The data in the 170 complex columns can be very skewed: some column values can be a few bytes in size and others up to 1 MB, and many column values can also be NULL. So it is fair to say that the table is wide and sparse. The total daily size of the Parquet files generated for this table varies around 5-10 GB (using GZIP compression).

The Hive MR job we use to generate the files comprises two map-only stages. The last stage is only used to even out the resulting file sizes (hive.merge.mapfiles=true) so they average 1 GB. I am not sure this stage is needed; I guess it depends on how well Impala handles smaller files. The last stage doubles the total job time, and I think the reason is that the job has to write Parquet files twice (once per stage). I have not found a way of controlling the intermediate file format when using hive.merge.mapfiles. I suspect that another intermediate file format would speed things up a lot, but it seems like it is not configurable.

Is there anybody out there with Parquet generation knowledge who can help us look at the parameters we use (input size, buffers, heap size etc.) or who has an opinion on whether the second stage can be skipped? Also, it seems like the dfs.blocksize and parquet.block.size parameters are not respected: we set the Parquet block size to 256 MB, but the resulting files are generated with more, smaller blocks. Perhaps this is a result of the skewness of the data. We are using CDH 5.7.1.

Destination table:

CREATE EXTERNAL TABLE IF NOT EXISTS Destination (
col1 STRING,
col2 STRING,
.
.
-- Example of one of the simpler structs
col200 struct<field1:string,field2:string,field3:string,field4:int>)
PARTITIONED BY (import_date STRING)
STORED AS PARQUET
LOCATION '/path/to/destination';

Conversion job:

SET parquet.compression=GZIP;
SET hive.merge.mapfiles=true;
SET hive.merge.smallfiles.avgsize=1073741824;
SET mapred.max.split.size=1073741824;
SET dfs.blocksize=268435456;
SET parquet.block.size=268435456;
SET mapreduce.map.memory.mb=4096;
SET mapreduce.reduce.memory.mb=2048;
SET mapreduce.map.java.opts.max.heap=3277;
SET mapreduce.reduce.java.opts.max.heap=1638;
SET mapreduce.task.io.sort.mb=1000;
SET mapreduce.task.io.sort.factor=100;
SET mapred.compress.map.output=true;
SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
INSERT OVERWRITE TABLE Destination PARTITION (import_date='2016-08-10')
SELECT col1, col2, ..., col200
FROM Source
WHERE import_date='2016-08-10';

Thanks in advance, Petter
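On the question of whether the second stage can be skipped: the straightforward experiment for us would be to rerun the same job with only the merge setting flipped and compare file sizes and query performance, i.e. something like:

-- rerun the conversion job with only this setting changed, to test skipping the merge stage
SET hive.merge.mapfiles=false;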
03-01-2016
07:31 AM
We are using CDH 5.5.1 with Kerberos enabled. Every 5th to 10th time we restart the Hive group (including the metastore) in Cloudera Manager, the metastore fails to access the metastore DB. It seems like the password is forgotten. We've seen this both when the DB is Postgres on Amazon RDS and when using a local Postgres (as provided by the cloudera-manager-server-db-2.x86_64 package). This started to happen after upgrading to CDH 5.5.1 from CDH 5.3.x. One change in the environment that seems related is the introduction of the Hadoop credential provider to store the password. Has anybody else experienced this issue?

Config:

<property>
<name>hadoop.security.credential.provider.path</name>
<value>localjceks://file//run/cloudera-scm-agent/process/352-hive-HIVEMETASTORE/creds.localjceks</value>
</property>

Error:

2016-02-29 10:45:08,893 ERROR DataNucleus.Datastore.Schema: [main]: Failed initialising database.
Unable to open a test connection to the given database. JDBC url = jdbc:postgresql://x.y.z.com:5432/metastore, username = hive. Terminating connection pool (set lazyInit to true if you expect
to start your database after your app). Original Exception: ------
org.postgresql.util.PSQLException: FATAL: password authentication failed for user "hive"
at org.postgresql.core.v3.ConnectionFactoryImpl.doAuthentication(ConnectionFactoryImpl.java:291)
at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:108)
at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:66)
at org.postgresql.jdbc2.AbstractJdbc2Connection.<init>(AbstractJdbc2Connection.java:125)
at org.postgresql.jdbc3.AbstractJdbc3Connection.<init>(AbstractJdbc3Connection.java:30)
at org.postgresql.jdbc3g.AbstractJdbc3gConnection.<init>(AbstractJdbc3gConnection.java:22)
at org.postgresql.jdbc4.AbstractJdbc4Connection.<init>(AbstractJdbc4Connection.java:30)
at org.postgresql.jdbc4.Jdbc4Connection.<init>(Jdbc4Connection.java:24)
at org.postgresql.Driver.makeConnection(Driver.java:393)
at org.postgresql.Driver.connect(Driver.java:267)
at java.sql.DriverManager.getConnection(DriverManager.java:571)
at java.sql.DriverManager.getConnection(DriverManager.java:187)
at com.jolbox.bonecp.BoneCP.obtainRawInternalConnection(BoneCP.java:361)
...
02-29-2016
11:46 AM
I have a wide table with a lot of complex types from which I would like to create simpler views for end-user convenience. Ideally these views would expose one or more structs unaltered from the more complex table backing the view, while operations and renames could occur on the other scalar fields. See the snippet below.

CREATE TABLE complex_table (
scalar1 BIGINT,
scalar2 BIGINT,
struct1 STRUCT <f1: STRING, f2: BIGINT>,
struct2 STRUCT <f1: STRING, f2: BIGINT>
)
STORED AS PARQUET;

CREATE VIEW simple_view AS
SELECT scalar1*2 AS my_field1, scalar2*4 AS my_field2, struct1, struct2
FROM complex_table;

This is currently not possible, since a view cannot include a struct column but has to expand the fields of the structs into scalars in order to use them. Could anyone comment on whether this will hold true for the foreseeable future, or whether intermediate results from inner selects will be allowed to contain structs in the future? Any alternative approaches are also welcome. The environment is Impala 2.3.0 in CDH 5.5.1.

Br, Pettax
- Tags:
- impala
02-24-2016
12:25 PM
Thank you for your prompt reply! I will hold off on my attempts to use this operation.
02-23-2016
05:16 AM
1 Kudo
It does not seem like the IS NULL / IS NOT NULL operators are supported for struct data types. We are using Impala 2.3.0 / CDH 5.5.1. This seems like a basic and vital operator to have, especially when using wide tables. Is there anybody out there who has a patch or workaround, or who has actually succeeded in using this operator on structs? I have reported IMPALA-3060 on the topic.
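For reference, the kind of predicate that fails for us looks like this (table and column names are made up):

SELECT id
FROM wide_table
-- the predicate below is rejected, since some_struct is a STRUCT column
WHERE some_struct IS NOT NULL;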
02-01-2016
03:22 AM
Thank you for your reply! I also noticed that if I check the option "Enable Kerberos Authentication for HTTP Web-Consoles" in the YARN configuration, I can make the kill button work. However, this enables Kerberos for web pages such as those of the History Server and Resource Manager, and we do not want Kerberos authentication on those pages. So, with the fix in CDH 5.5.3, the kill button will work without enabling the above option, I assume?
01-26-2016
01:31 AM
I'm using CDH 5.5.0 with Kerberos and Sentry enabled. Trying to kill a job from the Job Browser fails with the message "There was a problem communicating with the server: The default static user cannot carry out this operation. (error 403)". I can kill the same job using the yarn application -kill command. I guess this is a configuration issue. Could someone assist me in getting this right so that I can kill jobs from the Job Browser?

Stack trace:

[26/Jan/2016 10:15:30 +0100] access WARNING 10.128.42.143 di23060584 - "POST /jobbrowser/jobs/application_1453476679853_0011/kill HTTP/1.1"
[26/Jan/2016 10:15:30 +0100] connectionpool INFO Resetting dropped connection: ip-10-255-2-7.eu-west-1.compute.internal
[26/Jan/2016 10:15:30 +0100] kerberos_ ERROR handle_mutual_auth(): Mutual authentication unavailable on 403 response
[26/Jan/2016 10:15:30 +0100] views ERROR Killing job
Traceback (most recent call last):
File "/opt/cloudera/parcels/CDH-5.5.0-1.cdh5.5.0.p0.8/lib/hue/apps/jobbrowser/src/jobbrowser/views.py", line 246, in kill_job
job.kill()
File "/opt/cloudera/parcels/CDH-5.5.0-1.cdh5.5.0.p0.8/lib/hue/apps/jobbrowser/src/jobbrowser/yarn_models.py", line 185, in kill
return self.api.kill(self.id)
File "/opt/cloudera/parcels/CDH-5.5.0-1.cdh5.5.0.p0.8/lib/hue/desktop/libs/hadoop/src/hadoop/yarn/mapreduce_api.py", line 117, in kill
get_resource_manager(self._user).kill(app_id) # We need to call the RM
File "/opt/cloudera/parcels/CDH-5.5.0-1.cdh5.5.0.p0.8/lib/hue/desktop/libs/hadoop/src/hadoop/yarn/resource_manager_api.py", line 124, in kill
return self._execute(self._root.put, 'cluster/apps/%(app_id)s/state' % {'app_id': app_id}, params=params, data=json.dumps(data), contenttype=_JSON_CONTENT_TYPE)
File "/opt/cloudera/parcels/CDH-5.5.0-1.cdh5.5.0.p0.8/lib/hue/desktop/libs/hadoop/src/hadoop/yarn/resource_manager_api.py", line 141, in _execute
response = function(*args, **kwargs)
File "/opt/cloudera/parcels/CDH-5.5.0-1.cdh5.5.0.p0.8/lib/hue/desktop/core/src/desktop/lib/rest/resource.py", line 136, in put
return self.invoke("PUT", relpath, params, data, self._make_headers(contenttype))
File "/opt/cloudera/parcels/CDH-5.5.0-1.cdh5.5.0.p0.8/lib/hue/desktop/core/src/desktop/lib/rest/resource.py", line 78, in invoke
urlencode=self._urlencode)
File "/opt/cloudera/parcels/CDH-5.5.0-1.cdh5.5.0.p0.8/lib/hue/desktop/core/src/desktop/lib/rest/http_client.py", line 161, in execute
raise self._exc_class(ex)
RestException: The default static user cannot carry out this operation. (error 403)
[26/Jan/2016 10:15:30 +0100] middleware INFO Processing exception: The default static user cannot carry out this operation. (error 403): Traceback (most recent call last):
File "/opt/cloudera/parcels/CDH-5.5.0-1.cdh5.5.0.p0.8/lib/hue/build/env/lib/python2.6/site-packages/Django-1.6.10-py2.6.egg/django/core/handlers/base.py", line 112, in get_response
response = wrapped_callback(request, *callback_args, **callback_kwargs)
File "/opt/cloudera/parcels/CDH-5.5.0-1.cdh5.5.0.p0.8/lib/hue/build/env/lib/python2.6/site-packages/Django-1.6.10-py2.6.egg/django/db/transaction.py", line 371, in inner
return func(*args, **kwargs)
File "/opt/cloudera/parcels/CDH-5.5.0-1.cdh5.5.0.p0.8/lib/hue/apps/jobbrowser/src/jobbrowser/views.py", line 83, in decorate
return view_func(request, *args, **kwargs)
File "/opt/cloudera/parcels/CDH-5.5.0-1.cdh5.5.0.p0.8/lib/hue/apps/jobbrowser/src/jobbrowser/views.py", line 249, in kill_job
raise PopupException(e)
PopupException: The default static user cannot carry out this operation. (error 403)
01-07-2016
01:38 PM
Ah, more information for the team to work with then! Let's hope for a solution.
01-07-2016
10:34 AM
1 Kudo
Thank you Alex for your quick reply and confirmation! I've created IMPALA-2820 to track this issue.
01-07-2016
07:53 AM
I have been testing CDH 5.5.0 and have noted that Impala does not like reserved words as field names in complex types. This seems strange, as reserved words can be used as column names for ordinary columns, and Hive does not impose the same restriction; reserved words can be back-ticked where needed. Does anybody know if this is by design or if this is an issue in Impala 2.3.0? We are using Hive to create Parquet files with complex types. A sample that reproduces the issue and the error message are below. In this case the word 'replace' is reserved.

In Hive:

CREATE EXTERNAL TABLE MyTable (
device_id STRING,
added struct<name:string,version_name:string,version_code:int,`replace`:boolean>
)
STORED AS PARQUET
LOCATION '/tmp/impala/mytable';

In Hive:

INSERT OVERWRITE TABLE MyTable
SELECT
device_id,
payload AS added
FROM Added where import_id = 106000;

In Impala:

SELECT * FROM MyTable limit 10;

Output:

AnalysisException: Failed to load metadata for table: 'mytable' CAUSED BY: TableLoadingException: Unsupported type 'struct<name:string,version_name:string,version_code:int,replace:boolean>' in column 'added' of table 'mytable'
I0107 15:56:01.251721 21006 Frontend.java:818] analyze query SELECT * FROM MyTable limit 10
E0107 15:56:01.252320 21006 Analyzer.java:2212] Failed to load metadata for table: mytable
Unsupported type 'struct<name:string,version_name:string,version_code:int,replace:boolean>' in column 'added' of table 'mytable'
I0107 15:56:01.252908 21006 jni-util.cc:177] com.cloudera.impala.common.AnalysisException: Failed to load metadata for table: 'MyTable'
at com.cloudera.impala.analysis.TableRef.analyze(TableRef.java:180)
at com.cloudera.impala.analysis.Analyzer.resolveTableRef(Analyzer.java:512)
at com.cloudera.impala.analysis.SelectStmt.analyze(SelectStmt.java:155)
at com.cloudera.impala.analysis.AnalysisContext.analyze(AnalysisContext.java:342)
at com.cloudera.impala.analysis.AnalysisContext.analyze(AnalysisContext.java:317)
at com.cloudera.impala.service.Frontend.analyzeStmt(Frontend.java:827)
at com.cloudera.impala.service.Frontend.createExecRequest(Frontend.java:856)
at com.cloudera.impala.service.JniFrontend.createExecRequest(JniFrontend.java:147)
Caused by: com.cloudera.impala.catalog.TableLoadingException: Unsupported type 'struct<name:string,version_name:string,version_code:int,replace:boolean>' in column 'added' of table 'mytable'
at com.cloudera.impala.catalog.IncompleteTable.loadFromThrift(IncompleteTable.java:111)
at com.cloudera.impala.catalog.Table.fromThrift(Table.java:240)
at com.cloudera.impala.catalog.ImpaladCatalog.addTable(ImpaladCatalog.java:357)
at com.cloudera.impala.catalog.ImpaladCatalog.addCatalogObject(ImpaladCatalog.java:246)
at com.cloudera.impala.catalog.ImpaladCatalog.updateCatalog(ImpaladCatalog.java:132)
at com.cloudera.impala.service.Frontend.updateCatalogCache(Frontend.java:223)
at com.cloudera.impala.service.JniFrontend.updateCatalogCache(JniFrontend.java:164)
at ========.<Remote stack trace on catalogd>: com.cloudera.impala.catalog.TableLoadingException: Unsupported type 'struct<name:string,version_name:string,version_code:int,replace:boolean>' in column 'added' of table 'mytable'
at com.cloudera.impala.catalog.Table.parseColumnType(Table.java:331)
at com.cloudera.impala.catalog.HdfsTable.addColumnsFromFieldSchemas(HdfsTable.java:571)
at com.cloudera.impala.catalog.HdfsTable.load(HdfsTable.java:1073)
at com.cloudera.impala.catalog.TableLoader.load(TableLoader.java:84)
at com.cloudera.impala.catalog.TableLoadingMgr$2.call(TableLoadingMgr.java:232)
at com.cloudera.impala.catalog.TableLoadingMgr$2.call(TableLoadingMgr.java:229)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
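For comparison, the same reserved word seems to be accepted as an ordinary back-ticked top-level column, as mentioned above (a quick sketch, the table name is made up):

CREATE TABLE reserved_word_test (
  device_id STRING,
  `replace` BOOLEAN  -- reserved word as a plain column name, back-ticked
)
STORED AS PARQUET;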