Writing contents of table to file
Created ‎04-22-2015 09:41 AM
I have a question about writing the contents of a table to a file using a delimiter such as |:
CREATE EXTERNAL TABLE additional_Variables ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n' STORED AS TEXTFILE LOCATION '/user/Data/additional_Variables' AS SELECT * FROM MYTABLE
This is the SQL I use to create the file. When I visit the directory to view the generated output, instead of just one file there are often two: one larger and one smaller.
Is there any way to write the contents of a table to a single file?
Created ‎04-23-2015 10:15 AM
Here are two ways to constrain the output to a single file. You can set the query option NUM_NODES=1, so that all work is done on the coordinator node. Or you can put a large LIMIT on the query, bigger than the number of rows you are actually inserting, so that all the intermediate results are combined on the coordinator node. (I have not looked into the mechanics enough to say which way is more efficient.)
Here's an example where, by default, a CREATE TABLE AS SELECT operation would produce 4 output files, because I'm on a 4-node cluster. (The source table BILLION_NUMBERS has 113 files and 2.79 GB, enough data so that it won't go into a single output file by accident.) Setting NUM_NODES=1 produces a single output file. Setting NUM_NODES back to 0 and then doing CTAS+LIMIT produces a single output file.
[localhost:21000] > show table stats billion_numbers;
+-------+--------+--------+--------------+-------------------+--------+-------------------+
| #Rows | #Files | Size   | Bytes Cached | Cache Replication | Format | Incremental stats |
+-------+--------+--------+--------------+-------------------+--------+-------------------+
| -1    | 113    | 2.79GB | NOT CACHED   | NOT CACHED        | TEXT   | false             |
+-------+--------+--------+--------------+-------------------+--------+-------------------+
[localhost:21000] > set;
Query options (defaults shown in []):
...
NUM_NODES: [0]
...
[localhost:21000] > create table num_nodes_0 as select * from billion_numbers;
+----------------------------+
| summary                    |
+----------------------------+
| Inserted 1000000000 row(s) |
+----------------------------+
[localhost:21000] > show table stats num_nodes_0;
+-------+--------+--------+--------------+-------------------+--------+-------------------+
| #Rows | #Files | Size   | Bytes Cached | Cache Replication | Format | Incremental stats |
+-------+--------+--------+--------------+-------------------+--------+-------------------+
| -1    | 4      | 2.79GB | NOT CACHED   | NOT CACHED        | TEXT   | false             |
+-------+--------+--------+--------------+-------------------+--------+-------------------+
[localhost:21000] > set num_nodes=1;
[localhost:21000] > create table num_nodes_1 as select * from oreilly.billion_numbers;
+----------------------------+
| summary                    |
+----------------------------+
| Inserted 1000000000 row(s) |
+----------------------------+
[localhost:21000] > show table stats num_nodes_1;
+-------+--------+--------+--------------+-------------------+--------+-------------------+
| #Rows | #Files | Size   | Bytes Cached | Cache Replication | Format | Incremental stats |
+-------+--------+--------+--------------+-------------------+--------+-------------------+
| -1    | 1      | 2.79GB | NOT CACHED   | NOT CACHED        | TEXT   | false             |
+-------+--------+--------+--------------+-------------------+--------+-------------------+
[localhost:21000] > set num_nodes=0;
[localhost:21000] > create table ctas_with_limit as select * from billion_numbers limit 100000000000000;
+----------------------------+
| summary                    |
+----------------------------+
| Inserted 1000000000 row(s) |
+----------------------------+
[localhost:21000] > show table stats ctas_with_limit;
+-------+--------+--------+--------------+-------------------+--------+-------------------+
| #Rows | #Files | Size   | Bytes Cached | Cache Replication | Format | Incremental stats |
+-------+--------+--------+--------------+-------------------+--------+-------------------+
| -1    | 1      | 2.79GB | NOT CACHED   | NOT CACHED        | TEXT   | false             |
+-------+--------+--------+--------------+-------------------+--------+-------------------+
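Applied to the statement from the original question, the NUM_NODES approach might look like the following. This is only a sketch: the table name, delimiter, and HDFS path are copied from the question, and I have not verified the exact statement against that Impala version.

```sql
-- Run the whole insert on the coordinator so only one output file is written.
SET NUM_NODES=1;

CREATE EXTERNAL TABLE additional_Variables
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n'
  STORED AS TEXTFILE
  LOCATION '/user/Data/additional_Variables'
AS SELECT * FROM MYTABLE;

-- Restore the default so later queries use the whole cluster again.
SET NUM_NODES=0;
```

The LIMIT variant would instead leave NUM_NODES at 0 and append a LIMIT larger than the table's row count to the SELECT.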
