Member since
03-28-2018
338
Posts
3
Kudos Received
1
Solution
My Accepted Solutions
Title | Views | Posted |
---|---|---|
1827 | 10-20-2023 11:53 AM |
11-03-2023
11:10 AM
1 Kudo
Hello Seaport, This sounds like the Sequence data file created by pyspark is not being processed by the impala table correctly. the table from pyspark should be values (1, “a1”), (2, “a2"), (3, “a3”). But selecting table in impala shows (“a1", “null”), (“a2", “null”), (“a3", “null”). Can you test a few things to see where the discrepancy is coming from: Can you create impala table in different directory and then do load data in path? https://impala.apache.org/docs/build/html/topics/impala_load_data.html. This is recommended rather than creating a new table on the same file path which has files created outside of impala. This will take data files from your original path and load them into the new Impala table. If the behavior is the same, can you manually insert into your impala table like: insert into table seq_test2 values (1, "a1"), (2, "a2"), (3, "a3"); Then compare the file created by this insert command to your file created by pyspark? Is there any noticeable difference? if it makes a difference, you could also try inserting the data in pyspark using string instead of int values like: rdd = sc.parallelize([("1", "a1"), ("2", "a2"), ("3", "a3")]) Also, run same steps from hive too see if its the same behavior?
... View more
10-24-2023
07:29 AM
Great! Yes, your new explanation matches what is expected in documentation and my personal reproduction. I am glad it is all sorted out now.
... View more
10-20-2023
11:53 AM
1 Kudo
I understand you created an EXTERNAL table without defining LOCATION, so it is an external table stored in default 'warehouse/tablespace/external/hive' location. The problem is that, when you drop a partition for the external table, you are seeing the partition also leave HDFS location. SOLUTION This should not be happening. As per documentation, if you have an external table, then partitions and data in HDFS should not be deleted if you drop the partition or drop the table. The partition and table will be dropped from Impala, but the underlying HDFS files will remain. For a MANAGED table, both Impala and HDFS data will be deleted. This is expected behavior. Below documentation should answer all your questions about differences between EXTERNAL and MANAGED tables: - Refer to documentation [1]: https://docs.cloudera.com/cdp-private-cloud-base/7.1.7/impala-sql-reference/topics/impala-create-table.html - Also refer to this documentation: https://docs.cloudera.com/cdp-private-cloud-base/7.1.7/impala-sql-reference/topics/impala-tables.html You should double check the "show create table" output of your table to ensure it is an EXTERNAL table. Also keep in mind these points: 1) create table by default creates MANAGED table. 2) adding EXTERNAL to a create table creates an EXTERNAL table. 3) also adding transaction=false automatically creates an EXTERNAL table. REPRODUCTION Because of the discrepancy in your description, I reproduced this situation internally. I don't see your behavior, but instead see the exact same behavior as described in documentation. 1) I create an external table: a) create external table table1_external ( fld1 STRING, fld2 STRING ) PARTITIONED BY ( loaddate INT ) STORED AS PARQUET tblproperties('parquet.compress'='SNAPPY','transactional'='false'); 2) I create partition in this external table: a) alter table table1_external add if not exists partition (loaddate=20231018); 3) I load data into this new partition: a) load data inpath '/tmp/delta_1_1/' into table table1_external partition (loaddate=20231018); 4) I check that I have partition and a new file loaded in it, within Impala and HDFS: a) show partitions table1_external; loaddate #Rows #Files Size Bytes Cached Cache Replication Format Incremental stats Location 20231018 -1 1 649B NOT CACHED NOT CACHED PARQUET false .../warehouse/tablespace/external/hive/table1_external/loaddate=20231018 Total -1 1 649B 0B b) # hdfs dfs -ls /warehouse/tablespace/external/hive/table1_external/loaddate=20231018 Found 1 items -rwxrwxrwx 3 hdfs supergroup 649 2023-10-20 17:33 /warehouse/tablespace/external/hive/table1_external/loaddate=20231018/8e469210ba3cea3c-162bfe3c00000000_1156574353_data.0.parq 5) I drop the new partition: a) alter table table1_external drop partition (loaddate=20231018); 6) I check that I have partition and a new file loaded in it, within Impala and HDFS. It DOES NOT exist in Impala, but it DOES still exist in HDFS: a) show partitions table1_external; loaddate #Rows #Files Size Bytes Cached Cache Replication Format Incremental stats Location Total -1 0 0B 0B b) # hdfs dfs -ls /warehouse/tablespace/external/hive/table1_external/loaddate=20231018 Found 1 items -rwxrwxrwx 3 hdfs supergroup 649 2023-10-20 17:33 /warehouse/tablespace/external/hive/table1_external/loaddate=20231018/8e469210ba3cea3c-162bfe3c00000000_1156574353_data.0.parq 7) I even drop the entire table in impala, and hdfs files still exist: a) drop table table1_external; b) show partitions table1_external; Table does not exist: default.table1_external c) # hdfs dfs -ls /warehouse/tablespace/external/hive/table1_external/loaddate=20231018 Found 1 items -rwxrwxrwx 3 hdfs supergroup 649 2023-10-20 17:33 /warehouse/tablespace/external/hive/table1_external/loaddate=20231018/8e469210ba3cea3c-162bfe3c00000000_1156574353_data.0.parq 😎 I can even bring the external table data back by creating table and adding partition again: a) create external table table1_external ( fld1 STRING, fld2 STRING ) PARTITIONED BY ( loaddate INT ) STORED AS PARQUET tblproperties('parquet.compress'='SNAPPY','transactional'='false'); b) select * from table1_external; Done. 0 results. c) alter table table1_external add if not exists partition (loaddate=20231018); d) select * from table1_external; fld1 fld2 loaddate testfld1 testfld2 20231018 9) Finally, testing this on an INTERNAL/MANAGED table instead shows when I drop the table, the HDFS data is also gone. I am unable to drop a single partition because my INTERNAL table was transactional=true. REFERENCES [1] https://docs.cloudera.com/cdp-private-cloud-base/7.1.7/impala-sql-reference/topics/impala-create-table.html _____________________ Internal and external tables (EXTERNAL and LOCATION clauses): By default, Impala creates an internal table, where Impala manages the underlying data files for the table, and physically deletes the data files when you drop the table. If you specify the EXTERNAL clause, Impala treats the table as an external table, where the data files are typically produced outside Impala and queried from their original locations in HDFS, and Impala leaves the data files in place when you drop the table. Typically, for an external table you include a LOCATION clause to specify the path to the HDFS directory where Impala reads and writes files for the table. For example, if your data pipeline produces Parquet files in the HDFS directory /user/etl/destination, you might create an external table as follows: CREATE EXTERNAL TABLE external_parquet (c1 INT, c2 STRING, c3 TIMESTAMP) STORED AS PARQUET LOCATION '/user/etl/destination'; Although the EXTERNAL and LOCATION clauses are often specified together, LOCATION is optional for external tables, and you can also specify LOCATION for internal tables. The difference is all about whether Impala takes control of the underlying data files and moves them when you rename the table, or deletes them when you drop the table. For more about internal and external tables and how they interact with the LOCATION attribute, see Overview of Impala tables. _____________________
... View more