Member since: 09-11-2013
Posts: 20
Kudos Received: 1
Solutions: 3
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 1400 | 04-23-2015 10:15 AM |
 | 3028 | 12-02-2014 10:25 PM |
 | 10859 | 09-13-2014 01:26 PM |
08-31-2015
01:45 PM
I believe the 32,767-character limit only ever applied in certain situations, for example string literals or string arguments passed to certain functions. Reading strings longer than 32K out of data files may have always worked, but what happened to those values afterwards was not guaranteed. I'll check which kinds of limitations were lifted. John
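As a hypothetical illustration of the distinction (the table and column names here are made up):

-- Reading long values back out of a file-backed table tended to work:
select length(description) from product_docs
where length(description) > 32767;

-- It was contexts like very long string literals, or passing such values to
-- certain functions, where the 32,767 limit could apply.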
04-23-2015
10:15 AM
1 Kudo
Here are 2 ways to constrain the output to only be a single file:

1. Set the query option NUM_NODES=1, so that all work is done on the coordinator node.
2. Put a large LIMIT on the query, bigger than the number of rows you are actually inserting, so that all the intermediate results are combined on the coordinator node.

(I have not looked into the mechanics enough to say which way is more efficient.)

Here's an example where, by default, a CREATE TABLE AS SELECT operation would produce 4 output files, because I'm on a 4-node cluster. (The source table BILLION_NUMBERS has 113 files and 2.79 GB, enough data so that it won't go into a single output file by accident.) Setting NUM_NODES=1 produces a single output file. Setting NUM_NODES back to 0 and then doing CTAS+LIMIT also produces a single output file.

[localhost:21000] > show table stats billion_numbers;
+-------+--------+--------+--------------+-------------------+--------+-------------------+
| #Rows | #Files | Size   | Bytes Cached | Cache Replication | Format | Incremental stats |
+-------+--------+--------+--------------+-------------------+--------+-------------------+
| -1    | 113    | 2.79GB | NOT CACHED   | NOT CACHED        | TEXT   | false             |
+-------+--------+--------+--------------+-------------------+--------+-------------------+
[localhost:21000] > set;
Query options (defaults shown in []):
...
NUM_NODES: [0]
...
[localhost:21000] > create table num_nodes_0 as select * from billion_numbers;
+----------------------------+
| summary                    |
+----------------------------+
| Inserted 1000000000 row(s) |
+----------------------------+
[localhost:21000] > show table stats num_nodes_0;
+-------+--------+--------+--------------+-------------------+--------+-------------------+
| #Rows | #Files | Size   | Bytes Cached | Cache Replication | Format | Incremental stats |
+-------+--------+--------+--------------+-------------------+--------+-------------------+
| -1    | 4      | 2.79GB | NOT CACHED   | NOT CACHED        | TEXT   | false             |
+-------+--------+--------+--------------+-------------------+--------+-------------------+
[localhost:21000] > set num_nodes=1;
[localhost:21000] > create table num_nodes_1 as select * from oreilly.billion_numbers;
+----------------------------+
| summary                    |
+----------------------------+
| Inserted 1000000000 row(s) |
+----------------------------+
[localhost:21000] > show table stats num_nodes_1;
+-------+--------+--------+--------------+-------------------+--------+-------------------+
| #Rows | #Files | Size   | Bytes Cached | Cache Replication | Format | Incremental stats |
+-------+--------+--------+--------------+-------------------+--------+-------------------+
| -1    | 1      | 2.79GB | NOT CACHED   | NOT CACHED        | TEXT   | false             |
+-------+--------+--------+--------------+-------------------+--------+-------------------+
[localhost:21000] > set num_nodes=0;
[localhost:21000] > create table ctas_with_limit as select * from billion_numbers limit 100000000000000;
+----------------------------+
| summary                    |
+----------------------------+
| Inserted 1000000000 row(s) |
+----------------------------+
[localhost:21000] > show table stats ctas_with_limit;
+-------+--------+--------+--------------+-------------------+--------+-------------------+
| #Rows | #Files | Size   | Bytes Cached | Cache Replication | Format | Incremental stats |
+-------+--------+--------+--------------+-------------------+--------+-------------------+
| -1    | 1      | 2.79GB | NOT CACHED   | NOT CACHED        | TEXT   | false             |
+-------+--------+--------+--------------+-------------------+--------+-------------------+
12-02-2014
10:25 PM
From the latest 2.x documentation: http://www.cloudera.com/content/cloudera/en/documentation/cloudera-impala/latest/topics/impala_udf.html#udfs_hive_unique_2 "Hive UDAFs and UDTFs are not supported." John
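For the record, scalar Hive UDFs do work through CREATE FUNCTION; it is only the aggregate (UDAF) and table-generating (UDTF) kinds that are excluded. A minimal sketch of registering a Java scalar UDF (the jar path and class name are made-up placeholders, not a real library):

-- Register a Hive scalar UDF packaged in a jar (hypothetical path/class):
create function my_lower(string) returns string
  location '/user/impala/udfs/my-hive-udfs.jar'
  symbol='com.example.MyLowerUDF';

-- Then use it like any built-in function:
select my_lower(name) from customers limit 5;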
09-13-2014
01:26 PM
The ROW FORMAT clause only applies to tables using the text format; STORED AS PARQUET means the table expects the data to already be in Parquet format. You will need to make 2 tables: one table with no STORED AS clause but with ROW FORMAT DELIMITED etc. (you will be able to query this table after you move the delimited data files into the table directory and REFRESH the table), and then another (empty) table with the same columns and a STORED AS PARQUET clause. Then to convert the data to Parquet, you do: insert into parquet_table select * from text_table; John
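Putting that together, a minimal end-to-end sketch (the column names and delimiter are hypothetical):

-- Step 1: a text-format table matching the layout of the delimited files:
create table text_table (id int, name string)
  row format delimited fields terminated by ',';

-- Move the delimited data files into text_table's directory, then:
refresh text_table;

-- Step 2: an empty Parquet table with the same columns:
create table parquet_table like text_table stored as parquet;

-- Step 3: convert by copying the data across:
insert into parquet_table select * from text_table;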
02-24-2014
04:03 PM
There's not currently an impala-shell option to do such substitution. Would you mind opening a JIRA with that request at issues.cloudera.org? Thanks, John
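A common workaround in the meantime (just a sketch; the variable, table, and column names are made up) is to do the substitution in the invoking shell and pass the finished statement to impala-shell with -q:

DAY='2014-02-24'
impala-shell -q "select count(*) from web_logs where log_date = '${DAY}'"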
09-27-2013
02:36 PM
That is a good point; we'll document the interaction with partitioned tables better. It seems that a new partition is a change to the table metadata, and in 1.1.1 REFRESH only fetches the block location data for the partitions that are already known. This is definitely an area where improvements are being worked on, so anything I say about the internals is likely to be obsolete soon...
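Concretely, a sketch of what that means in practice (hypothetical table and partition names; this describes 1.1.x-era behavior and may change):

-- Suppose a new partition was added outside Impala, e.g. via Hive:
--   alter table sales add partition (day='2013-09-27');

refresh sales;        -- in 1.1.1 this only re-fetches block locations for
                      -- partitions Impala already knows about
invalidate metadata;  -- reloads table metadata, so the new partition shows up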
09-27-2013
02:06 PM
This certainly seems worth filing as a bug. Could you submit a report with a test case at issues.cloudera.org?