Member since
04-03-2019
92
Posts
6
Kudos Received
5
Solutions
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 3444 | 01-21-2022 04:31 PM
 | 5889 | 02-25-2020 10:02 AM
 | 3553 | 02-19-2020 01:29 PM
 | 2569 | 09-17-2019 06:33 AM
 | 5607 | 08-26-2019 01:35 PM
12-19-2024
11:22 AM
Additional connection tests show that port 9191 still accepts unencrypted connections, even though "TLS/SSL for HBase Thrift Server over HTTP" is enabled. Neither the log nor the Cloudera Manager UI gives any warnings or errors.
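For reference, a rough Python sketch like the one below (the host name is only a placeholder for my node) can be used to check whether the port answers a TLS handshake or only a plain-text request:

import socket
import ssl

HOST = "mynode.mycompany.com"  # placeholder host name
PORT = 9191                    # my non-default Thrift-over-HTTP port

# 1) Try a TLS handshake against the port.
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
try:
    with socket.create_connection((HOST, PORT), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=HOST) as tls:
            print("TLS handshake OK, protocol:", tls.version())
except (ssl.SSLError, OSError) as err:
    print("TLS handshake failed:", err)

# 2) Try a plain-text HTTP request against the same port.
try:
    with socket.create_connection((HOST, PORT), timeout=5) as sock:
        sock.sendall(b"GET / HTTP/1.0\r\nHost: " + HOST.encode() + b"\r\n\r\n")
        print("Plain-text response begins with:", sock.recv(80))
except OSError as err:
    print("Plain-text request failed:", err)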
12-18-2024
10:36 AM
It appeared that the Thrift Server did not start completely, although it has a green light in Cloudera Manager. Inside the log hbase-cmf-hbase-HBASETHRIFTSERVER-mynode.log.out, there is no entry acknowledging the start, such as:
++ org.eclipse.jetty.server.AbstractConnector: Started ServerConnector@180e6ac4{SSL, (ssl, http/1.1)}{0.0.0.0:9191} ++
But I have no idea why the start ended up incomplete. There was no warning or error from either the log or the Cloudera Manager UI. Thank you.
12-17-2024
10:28 PM
1 Kudo
This issue occurred right after I enabled TLS on my CDP Private Cloud Base 7.1.7. The client call to the HBase Thrift API fails at the TLS handshake. Below is the connection test output with the handshake failure.
++
$ openssl s_client -connect mycompany.com:9191
CONNECTED(00000003)
write:errno=0
---
no peer certificate available
---
No client certificate CA names sent
---
SSL handshake has read 0 bytes and written 287 bytes
Verification: OK
---
New, (NONE), Cipher is (NONE)
Secure Renegotiation IS NOT supported
Compression: NONE
Expansion: NONE
No ALPN negotiated
Early data was not sent
Verify return code: 0 (ok)
---
++
My Thrift API port is 9191 (not the default 9090). This port worked well before TLS was enabled. There should be no certificate/CA issue, because the Thrift UI over TLS (on the same node) works just fine. Below is the connection test output showing a successful handshake.
++
$ openssl s_client -connect mycompany.com:9095
CONNECTED(00000003)
depth=2 CN = MYROOTCA
...
---
Certificate chain
...
---
Server certificate
-----BEGIN CERTIFICATE-----
...
++
All my HBase instances have green lights inside Cloudera Manager. I do not know where to look. It looks like something internal in SDX went wrong. Any suggestions? Thank you. Best regards,
Labels:
- Apache HBase
- Security
10-23-2023
03:43 PM
Ezerihun, Thanks for your reply. I repeated my test, and it showed that you are correct; I am not sure what happened to my earlier test case. When I drop an external table, the warehouse path for that table, "warehouse/tablespace/external/hive/testdb1.db/table1", remains. In fact, I can even re-create that external table without any error, and the files loaded to "warehouse/tablespace/external/hive/testdb1.db/table1" can be read through the re-created table. In other words, although Impala created the path "warehouse/tablespace/external/hive/testdb1.db/table1", Impala does not manage it at all. Thank you.
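For what it is worth, here is a rough impyla-based sketch of the test I described (host and port are placeholders; I re-register the partitions after re-creating the table since mine is partitioned):

from impala.dbapi import connect  # impyla package; host/port below are placeholders

conn = connect(host="impalad-host", port=21050)
cur = conn.cursor()

# Dropping the external table leaves its warehouse path and files in place.
cur.execute("DROP TABLE IF EXISTS testdb1.table1")

# Re-creating the same external table picks the existing path back up.
cur.execute("""
    CREATE EXTERNAL TABLE testdb1.table1 (fld1 STRING, fld2 STRING)
    PARTITIONED BY (loaddate INT)
    STORED AS PARQUET
""")

# Re-register the partition directories that are still sitting under the path.
cur.execute("ALTER TABLE testdb1.table1 RECOVER PARTITIONS")

# The previously loaded files are readable through the re-created table.
cur.execute("SELECT COUNT(*) FROM testdb1.table1")
print(cur.fetchall())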
10-18-2023
04:50 PM
I ran into an interesting situation using an Impala external table. In short, I used a "create external table" statement but ended up with a table that behaves like a managed one. Here are the details.

Step 1: creating an external table

create external table testdb1.table1 (
fld1 STRING,
fld2 STRING
)
PARTITIONED BY ( loaddate INT )
STORED AS PARQUET
tblproperties('parquet.compress'='SNAPPY','transactional'='false');

Step 2: adding partitions and loading data files

alter table testdb1.table1 add if not exists partition (loaddate=20231018);
load data inpath '/mytestdata/dir1' into table testdb1.table1 partition (loaddate=20231018);

Step 2 shows that this table1 behaves exactly like a managed table. Files at /mytestdata/dir1 are moved to the HDFS warehouse path warehouse/tablespace/external/hive/testdb1.db/table1/loaddate=20231018. If I drop the partition 20231018, the directory at warehouse/tablespace/external/hive/testdb1.db/table1/loaddate=20231018 is removed.

So what exactly is the difference between a managed and an external partitioned table, apart from the different storage location (/warehouse/tablespace/managed vs /warehouse/tablespace/external)? From what I read, the key difference is that a managed table's storage is managed by Hive/Impala, but an external table's is not. In my case, even though this table1 is created as an external table, its storage is still managed by Impala/Hive.

As I understand it, if I add a partition (to an external table) and then add files using "load data inpath", then the storage is managed by Hive. If I add a partition with the location specified, like

alter table testdb1.table1 add if not exists partition (loaddate=20231018) location '/mytestdata/dir1';

then the storage is NOT managed by Hive. Is this correct?
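To make the two variants concrete, here is a rough sketch with the impyla client (connection details and the second directory are placeholders); the first variant is what I did above, the second is the one I believe leaves the storage unmanaged:

from impala.dbapi import connect  # impyla package; host/port are placeholders

cur = connect(host="impalad-host", port=21050).cursor()

# Variant A: add the partition, then LOAD DATA INPATH.
# Impala moves the files under .../external/hive/testdb1.db/table1/loaddate=20231018,
# and dropping the partition removes that directory (what I observed above).
cur.execute("ALTER TABLE testdb1.table1 ADD IF NOT EXISTS PARTITION (loaddate=20231018)")
cur.execute("LOAD DATA INPATH '/mytestdata/dir1' "
            "INTO TABLE testdb1.table1 PARTITION (loaddate=20231018)")

# Variant B: add the partition with an explicit LOCATION (placeholder directory).
# As I understand it, the files then stay where they are and Hive/Impala does not manage them.
cur.execute("ALTER TABLE testdb1.table1 ADD IF NOT EXISTS PARTITION (loaddate=20231019) "
            "LOCATION '/mytestdata/dir2'")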
Labels:
- Apache Hive
- Apache Impala
10-13-2023
11:39 PM
I created SequenceFiles using the PySpark code below.

path = '/data/seq_test2'
rdd = sc.parallelize([(1, "a1"), (2, "a2"), (3, "a3")])
rdd.saveAsSequenceFile(path)

Then I created an Impala table.

CREATE EXTERNAL TABLE seq_test2 (
key_column STRING,
value_column STRING
)
STORED AS SEQUENCEFILE
LOCATION '/data/seq_test2';

The query "select * from seq_test2" shows a1, a2, a3 in key_column and NULL in value_column, but I expect to see 1, 2, 3 in key_column and a1, a2, a3 in value_column. How do I fix it? Thank you.
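In case it helps frame the question: my understanding is that Impala parses only the value of each SequenceFile record (splitting it on the table's field delimiter, Ctrl-A by default) and ignores the key, which would explain the output above. An untested workaround sketch, run in the same pyspark session, would be to pack the key into the value when writing:

# Untested sketch: pack the key into the value, separated by Ctrl-A (\x01),
# the default field delimiter, so that both fields map to the table columns.
path = '/data/seq_test3'  # hypothetical new location for the rewritten files
rdd = sc.parallelize([(1, "a1"), (2, "a2"), (3, "a3")])
packed = rdd.map(lambda kv: (str(kv[0]), u"{0}\x01{1}".format(kv[0], kv[1])))
packed.saveAsSequenceFile(path)

The external table would then point its LOCATION at /data/seq_test3, with key_column and value_column filled from the packed value.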
Labels:
- Apache Impala
04-21-2022
05:07 PM
1 Kudo
André, Thanks for the elegant solution. Regards,
04-20-2022
05:46 PM
I did a workaround by injecting the myfilepath element into the json string.

rdd = reader.map(lambda x: str(x[1])[0] + '"myfilepath":"' + x[0] + '",' + str(x[1])[1:])

It does not look like a very clean solution. Is there a better one? Thanks. Regards,
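One alternative that might be cleaner (sketch only, reusing the reader and myschema from my earlier code): keep the pair as two DataFrame columns and parse the json column with from_json instead of editing the json string.

from pyspark.sql.functions import col, from_json

# Keep the file path (key) and the raw json (value) as two columns,
# then parse the json column against the existing schema.
pairs_df = reader.toDF(["myfilepath", "json_str"])
mydf = (pairs_df
        .select("myfilepath", from_json(col("json_str"), myschema).alias("parsed"))
        .select("myfilepath", "parsed.*"))
mydf.show(truncate=False)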
04-20-2022
04:05 PM
I saved thousands of small json files in SequenceFile format to resolve the "small file issue". I use the following PySpark code to parse the json data from the saved sequence files.

reader = sc.sequenceFile("/mysequencefile_dir", "org.apache.hadoop.io.Text", "org.apache.hadoop.io.Text")
rdd = reader.map(lambda x: x[1])
mydf = spark.read.schema(myschema).json(rdd)
mydf.show(truncate=False)

The code works. However, I do not know how to put the key value from the sequence file, which is actually the original json file name, into the mydf dataframe. Please advise. Thank you. Regards,
04-14-2022
10:42 AM
1 Kudo
@mszurap Thanks for the response. I actually took the 2nd option you mentioned - ingesting it into a table which has only a single (string) column. But I am not sure whether it is the right approach. I appreciate the confirmation. Regards,