Member since 04-03-2019
89 Posts
5 Kudos Received
5 Solutions
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 3335 | 01-21-2022 04:31 PM |
 | 5764 | 02-25-2020 10:02 AM |
 | 3428 | 02-19-2020 01:29 PM |
 | 2524 | 09-17-2019 06:33 AM |
 | 5515 | 08-26-2019 01:35 PM |
10-23-2023
03:43 PM
Ezerihun, thanks for your reply. I repeated my test, and it showed that you are correct. I am not sure what went wrong in my earlier test. When I drop an external table, the warehouse path for that table, "warehouse/tablespace/external/hive/testdb1.db/table1", remains. In fact, I can re-create the external table without any error, and files loaded to "warehouse/tablespace/external/hive/testdb1.db/table1" can be read through the re-created table. In other words, although Impala created the path "warehouse/tablespace/external/hive/testdb1.db/table1", Impala does not manage it at all. Thank you.
10-18-2023
04:50 PM
I ran into an interesting situation using an Impala external table. In short, I used a "create external table" statement but ended up with a table that behaves like a managed one. Here are the details.
Step 1: Create an external table.
create external table testdb1.table1 ( fld1 STRING, fld2 STRING ) PARTITIONED BY ( loaddate INT ) STORED AS PARQUET tblproperties('parquet.compress'='SNAPPY','transactional'='false');
Step 2: Add a partition and load data files.
alter table testdb1.table1 add if not exists partition (loaddate=20231018);
load data inpath '/mytestdata/dir1' into table testdb1.table1 partition (loaddate=20231018);
Step 2 shows that table1 behaves exactly like a managed table. The files at /mytestdata/dir1 are moved to the HDFS warehouse path warehouse/tablespace/external/hive/testdb1.db/table1/loaddate=20231018. If I drop partition 20231018, the directory warehouse/tablespace/external/hive/testdb1.db/table1/loaddate=20231018 is removed.
So what exactly is the difference between a managed and an external partitioned table, apart from the storage location (/warehouse/tablespace/managed vs. /warehouse/tablespace/external)? From what I read, the key difference is that a managed table's storage is managed by Hive/Impala, while an external table's storage is not. In my case, even though table1 was created as an external table, its storage is still managed by Impala/Hive.
As I understand it, if I add a partition to an external table and then add files using "load data inpath", the storage is managed by Hive. If I add a partition with the location specified, like
alter table testdb1.table1 add if not exists partition (loaddate=20231018) location '/mytestdata/dir1'
then the storage is NOT managed by Hive. Is this correct?
Labels: Apache Hive, Apache Impala
10-13-2023
11:39 PM
I created SequenceFiles using the PySpark code below.
path='/data/seq_test2'
rdd = sc.parallelize([(1, "a1"), (2, "a2"), (3, "a3")])
rdd.saveAsSequenceFile(path)
Then I created an Impala table.
CREATE EXTERNAL TABLE seq_test2 (key_column STRING, value_column STRING ) STORED AS SEQUENCEFILE LOCATION '/data/seq_test2'
The query "select * from seq_test2" shows a1, a2, a3 in key_column and null in value_column, but I expect to see 1, 2, 3 in key_column and a1, a2, a3 in value_column. How do I fix it? Thank you.
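The observed output (a1, a2, a3 landing in the first column) is consistent with the table reading only the value part of each SequenceFile record and ignoring the key. Under that assumption, one possible workaround is to fold the key into the value as a delimited string before saving. The sketch below is only an illustration: the helper name `to_delimited_pair` is hypothetical, and the Ctrl-A separator assumes the table uses Hive's default field delimiter.

```python
# Assumption: the table splits each record's *value* on the field delimiter
# and never looks at the SequenceFile key.
FIELD_SEP = "\x01"  # Ctrl-A, Hive's default field delimiter


def to_delimited_pair(pair, sep=FIELD_SEP):
    """Return (key, "key<sep>value") so the value alone carries both columns."""
    key, value = pair
    return (str(key), f"{key}{sep}{value}")


# Intended use with the RDD from the post (requires a SparkContext `sc`):
#   rdd = sc.parallelize([(1, "a1"), (2, "a2"), (3, "a3")])
#   rdd.map(to_delimited_pair).saveAsSequenceFile(path)
```

If the table is declared with a different delimiter (for example FIELDS TERMINATED BY ','), pass that character as `sep` instead.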
Labels: Apache Impala
04-21-2022
05:07 PM
1 Kudo
André, Thanks for the elegant solution. Regards,
04-20-2022
05:46 PM
I worked around it by injecting a myfilepath element into the JSON string.
rdd=reader.map(lambda x: str(x[1])[0]+'"myfilepath":"'+x[0]+'",'+str(x[1])[1:])
It does not look like a very clean solution. Is there a better one? Thanks. Regards
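One alternative to splicing strings is to parse each value, add the key as a field, and serialize it again. This is a minimal sketch, not necessarily the solution that was later accepted in the thread. It assumes every value is a single JSON object, and the helper name `attach_path` is hypothetical.

```python
import json


def attach_path(pair, field="myfilepath"):
    """Add the SequenceFile key (the original file name) to the JSON record."""
    path, raw_json = pair
    record = json.loads(raw_json)  # assumes the value is one JSON object
    record[field] = path
    return json.dumps(record)


# Intended use with the reader from the post:
#   rdd = reader.map(attach_path)
#   mydf = spark.read.schema(myschema).json(rdd)
```

For the new field to appear in the DataFrame, `myschema` would also need a string field named `myfilepath`. Parsing and re-serializing costs more than string concatenation, but it handles escaping in the path and does not depend on the record starting with `{`.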
04-20-2022
04:05 PM
I saved thousands of small JSON files in SequenceFile format to resolve the "small file issue". I use the following PySpark code to parse the JSON data from the saved sequence files.
reader = sc.sequenceFile("/mysequencefile_dir", "org.apache.hadoop.io.Text", "org.apache.hadoop.io.Text")
rdd = reader.map(lambda x: x[1])
mydf = spark.read.schema(myschema).json(rdd)
mydf.show(truncate=False)
The code works. However, I do not know how to put the key from the sequence file, which is the original JSON file name, into the mydf DataFrame. Please advise. Thank you. Regards,
04-14-2022
10:42 AM
1 Kudo
@mszurap Thanks for the response. I actually took the 2nd option you mentioned - ingesting it into a table which has only a single (string) column. But I am not sure whether it is the right approach. I appreciate the confirmation. Regards,
04-12-2022
03:46 PM
Here is the code.
create external table testtable1
(code string, codesystem string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "(.{27})(.{50})"
)
LOCATION '/data/raw/testtable1';
The error message is:
ParseException: Syntax error in line 3:undefined: ROW FORMAT SERDE 'org.apache.hadoop.hiv... ^ Encountered: IDENTIFIER Expected: DELIMITED CAUSED BY: Exception: Syntax error
It looks like an Impala table only accepts "ROW FORMAT DELIMITED". How, then, can I create a Hive table with a fixed-width layout? Should I create it outside Impala, through Hive, and then run other data operations on the table via Impala? Thanks.
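To check the intended fixed-width pattern `(.{27})(.{50})` against sample lines before creating any table, a small Python sketch can help. It uses Python's `re` module rather than the Java regex engine that the SerDe runs on, so it is only an approximation. The helper name `split_fixed_width` and the trailing-space stripping are assumptions added for illustration.

```python
import re

# Two capture groups: 27 characters for code, 50 characters for codesystem.
FIXED_WIDTH = re.compile(r"(.{27})(.{50})")


def split_fixed_width(line):
    """Split a 77-character line into (code, codesystem), or return None."""
    match = FIXED_WIDTH.fullmatch(line)
    if match is None:
        return None  # the line is not exactly 27 + 50 characters long
    code, codesystem = match.groups()
    # Stripping the padding is an illustrative choice, not SerDe behavior.
    return code.rstrip(), codesystem.rstrip()
```

The same slicing idea applies if the data is first loaded into a single string column and the fields are cut out afterwards with substring functions.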
Labels: Apache Hive, Apache Impala
01-31-2022
07:23 PM
1 Kudo
@jeremymolina That is an excellent explanation. It makes total sense. Thank you very much. Regards,
01-24-2022
04:00 PM
I see this double-curly-brace notation everywhere in the HDP (Ambari) and CDP (CMS) UIs. Below is a configuration value under zeppelin.shiro.knox.main.block in the Zeppelin configuration. (This is a random sample I picked; the question is not about Zeppelin.)
++ krbRealm.signatureSecretFile={{CONF_DIR}}/http_secret ++
I understand that I can simply overwrite {{CONF_DIR}} with the actual path. However, I wonder whether {{CONF_DIR}} is an Ansible variable. If so, how do I define the variable CONF_DIR in CDP Cloudera Manager?
https://docs.ansible.com/ansible/latest/user_guide/playbooks_variables.html#defining-simple-variables
Regards,
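Double curly braces are a common placeholder syntax shared by several template systems, so the notation by itself does not show that Ansible is involved. The sketch below illustrates generic placeholder substitution only. It is not Cloudera Manager's actual implementation, and the `render` function and the example directory are hypothetical.

```python
import re

# Matches {{NAME}}, allowing optional spaces inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template, values):
    """Replace each {{NAME}} with values[NAME]; leave unknown names untouched."""
    def substitute(match):
        name = match.group(1)
        return values.get(name, match.group(0))
    return PLACEHOLDER.sub(substitute, template)


# Hypothetical value for CONF_DIR, chosen only for the example.
line = "krbRealm.signatureSecretFile={{CONF_DIR}}/http_secret"
rendered = render(line, {"CONF_DIR": "/tmp/example-conf"})
```

In a scheme like this, the tool that renders the configuration supplies the value at deploy or start time, so the user does not define the variable by hand.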
Labels: Cloudera Manager