Member since
04-03-2019
97
Posts
7
Kudos Received
6
Solutions
My Accepted Solutions
Views | Posted |
---|---|
425 | 01-13-2025 11:17 AM |
4094 | 01-21-2022 04:31 PM |
6377 | 02-25-2020 10:02 AM |
4078 | 02-19-2020 01:29 PM |
2927 | 09-17-2019 06:33 AM |
10-13-2023
11:39 PM
I created SequenceFiles using the PySpark code below.
path = '/data/seq_test2'
rdd = sc.parallelize([(1, "a1"), (2, "a2"), (3, "a3")])
rdd.saveAsSequenceFile(path)
Then I created an Impala table.
CREATE EXTERNAL TABLE seq_test2 (key_column STRING, value_column STRING)
STORED AS SEQUENCEFILE
LOCATION '/data/seq_test2'
Then the query "select * from seq_test2" shows a1, a2, a3 in key_column and NULL in value_column. But I expect to see 1, 2, 3 in key_column and a1, a2, a3 in value_column. How do I fix it? Thank you.
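One possible explanation, offered as an assumption rather than a confirmed answer: Hive-style SequenceFile readers, which Impala appears to follow, ignore the record key and parse only the record value as a delimited row. That would put "a1" in the first column and leave the second column NULL. A minimal sketch of a workaround under that assumption is to pack both fields into the value, separated by the table's field delimiter (assumed here to be the default Ctrl-A, `\x01`). The helper name `to_row_value` is hypothetical, and the Spark calls are shown only as comments because they need a live `sc`.

```python
# Assumed: default field delimiter for a table created without ROW FORMAT.
FIELD_DELIM = "\x01"

def to_row_value(key, value, delim=FIELD_DELIM):
    """Pack key and value into one delimited string (hypothetical helper)."""
    return f"{key}{delim}{value}"

pairs = [(1, "a1"), (2, "a2"), (3, "a3")]
rows = [to_row_value(k, v) for k, v in pairs]

# With a live SparkContext `sc` (not created here), the write might look like:
#   rdd = sc.parallelize(pairs).map(lambda kv: ("", to_row_value(kv[0], kv[1])))
#   rdd.saveAsSequenceFile('/data/seq_test2')
```

With the rows written this way, the key is left empty and each value carries both columns.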
Labels: Apache Impala
04-21-2022
05:07 PM
1 Kudo
André, Thanks for the elegant solution. Regards,
04-20-2022
05:46 PM
I found a workaround: injecting the myfilepath element into the JSON string.
rdd = reader.map(lambda x: str(x[1])[0] + '"myfilepath":"' + x[0] + '",' + str(x[1])[1:])
It does not look like a very clean solution. Is there a better one? Thanks. Regards
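A minimal sketch of a less fragile variant of this workaround, assuming each value is a single JSON object: parse it, add the key under "myfilepath", and serialize it again instead of splicing strings. The helper name `add_file_path` and the sample record are hypothetical. The Spark calls in the comments assume the `reader`, `spark`, and `myschema` from the question, with a `myfilepath` string field added to the schema.

```python
import json

def add_file_path(record):
    """Merge the SequenceFile key (file path) into the JSON value (hypothetical helper)."""
    path, payload = record
    doc = json.loads(payload)  # assumes the value is a single JSON object
    doc["myfilepath"] = path
    return json.dumps(doc)

# Made-up sample record in the (key, value) shape that sc.sequenceFile yields.
sample = ("/raw/file1.json", '{"id": 1, "name": "x"}')
merged = add_file_path(sample)

# With the reader from the question, this would be used as:
#   rdd = reader.map(add_file_path)
#   mydf = spark.read.schema(myschema).json(rdd)
```

Going through `json.loads` and `json.dumps` also handles escaping of quotes or backslashes in the path, which plain concatenation does not.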
04-20-2022
04:05 PM
I saved thousands of small JSON files in SequenceFile format to resolve the "small file issue". I use the following PySpark code to parse the JSON data from the saved sequence files.
reader = sc.sequenceFile("/mysequencefile_dir", "org.apache.hadoop.io.Text", "org.apache.hadoop.io.Text")
rdd = reader.map(lambda x: x[1])
mydf = spark.read.schema(myschema).json(rdd)
mydf.show(truncate=False)
The code worked. However, I do not know how to put the key from the sequence file, which is actually the original JSON file name, into the mydf dataframe. Please advise. Thank you. Regards,
04-14-2022
10:42 AM
1 Kudo
@mszurap Thanks for the response. I actually took the 2nd option you mentioned - ingesting it into a table which has only a single (string) column. But I am not sure whether it is the right approach. I appreciate the confirmation. Regards,
04-12-2022
03:46 PM
Here is the code.
create external table testtable1
(code string, codesystem string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "(.{27})(.{50})"
)
LOCATION '/data/raw/testtable1';
The error message is:
ParseException: Syntax error in line 3:undefined:
ROW FORMAT SERDE 'org.apache.hadoop.hiv...
^
Encountered: IDENTIFIER
Expected: DELIMITED
CAUSED BY: Exception: Syntax error
It looks like an Impala table only accepts "ROW FORMAT DELIMITED". How, then, can I create a Hive table with a fixed-width layout? Should I create it outside Impala, through Hive, and then run other data operations on this table via Impala? Thanks.
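For reference, a small sketch of what the fixed-width pattern is meant to do, checked with Python's `re` module rather than Hive's RegexSerDe (Java regex syntax can differ in details, but this pattern uses only constructs common to both). It assumes the intended layout is a 27-character `code` followed by a 50-character `codesystem`, with both quantifier braces closed. The sample line is made up.

```python
import re

# Assumed intended pattern: 27 characters, then 50 characters.
FIXED_WIDTH = re.compile(r"(.{27})(.{50})")

# Made-up sample line, padded with spaces to the fixed widths.
line = "ABC123".ljust(27) + "http://example.org/codes".ljust(50)

match = FIXED_WIDTH.match(line)
# Strip the right-hand padding that fixed-width files carry.
code, codesystem = (field.rstrip() for field in match.groups())
```

The same two slices could also be taken by position from a single string column, which does not depend on a SerDe at all.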
Labels: Apache Hive, Apache Impala
01-31-2022
07:23 PM
1 Kudo
@jeremymolina That is an excellent explanation. It makes total sense. Thank you very much. Regards,
01-24-2022
04:00 PM
I see this kind of notation, with double curly braces, everywhere in the HDP (Ambari) and CDP (CMS) UI. Below is a configuration value under zeppelin.shiro.knox.main.block in the Zeppelin configuration. (This is a random sample I picked; the question is not about Zeppelin.)
++ krbRealm.signatureSecretFile={{CONF_DIR}}/http_secret ++
I understand that I can simply overwrite {{CONF_DIR}} with the actual path. However, I wonder whether {{CONF_DIR}} is an Ansible variable. If yes, how do I define the variable CONF_DIR in CDP Cloudera Manager?
https://docs.ansible.com/ansible/latest/user_guide/playbooks_variables.html#defining-simple-variables
Regards,
Labels: Cloudera Manager
01-21-2022
04:43 PM
@Scharan By the way, under Zeppelin Shiro Urls Block, the original value is
++ /api/interpreter/** = authc, roles[{{zeppelin_admin_group}}] ++
Could you tell me what the notation {{zeppelin_admin_group}} is for? I see this kind of notation (double curly braces) frequently. Is it a token to be replaced? If yes, what kind of replacement is it waiting for? Thanks.
01-21-2022
04:31 PM
@Scharan I figured it out. The CDP Cloudera Manager UI does expose shiro.ini like Ambari, but through a different layout, which I should have realized earlier. Under "zeppelin.shiro.user.block", I added admin=admin, admin and it worked. Thanks.