Member since: 04-03-2019
Posts: 86
Kudos Received: 5
Solutions: 5
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 1077 | 01-21-2022 04:31 PM |
| | 3729 | 02-25-2020 10:02 AM |
| | 1572 | 02-19-2020 01:29 PM |
| | 1510 | 09-17-2019 06:33 AM |
| | 3744 | 08-26-2019 01:35 PM |
04-21-2022
05:07 PM
1 Kudo
André, Thanks for the elegant solution. Regards,
04-20-2022
05:46 PM
I did a workaround by injecting the myfilepath element into the JSON string:

rdd = reader.map(lambda x: str(x[1])[0] + '"myfilepath":"' + x[0] + '",' + str(x[1])[1:])

It does not look like a very clean solution. Is there a better one? Thanks. Regards
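For reference, here is a possibly cleaner sketch (my own, reusing the reader and myschema from the question below; pairs and raw_df are names I made up). It keeps the sequence-file key and value as two columns and parses the JSON column with from_json instead of splicing the raw string:

from pyspark.sql.functions import from_json, col

# Keep (original file name, JSON text) as two columns, then parse the JSON
# column against the known schema instead of editing the raw string.
pairs = reader.map(lambda x: (x[0], str(x[1])))
raw_df = pairs.toDF(["myfilepath", "json_text"])
mydf = (raw_df
        .withColumn("data", from_json(col("json_text"), myschema))
        .select("myfilepath", "data.*"))
mydf.show(truncate=False)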
04-20-2022
04:05 PM
I saved thousands of small JSON files in SequenceFile format to resolve the "small file issue". I use the following PySpark code to parse the JSON data from the saved sequence files:

reader = sc.sequenceFile("/mysequencefile_dir", "org.apache.hadoop.io.Text", "org.apache.hadoop.io.Text")
rdd = reader.map(lambda x: x[1])
mydf = spark.read.schema(myschema).json(rdd)
mydf.show(truncate=False)

The code works. However, I do not know how to get the key from the sequence file, which is actually the original JSON file name, into the mydf dataframe. Please advise. Thank you. Regards,
- Tags:
- CDP
- SequenceFile
- Spark
04-14-2022
10:42 AM
1 Kudo
@mszurap Thanks for the response. I actually took the second option you mentioned, ingesting the data into a table that has only a single (string) column. But I am not sure whether it is the right approach, so I would appreciate the confirmation. Regards,
04-12-2022
03:46 PM
Here is the code:

create external table testtable1
(code string, codesystem string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "(.{27})(.{50})"
)
LOCATION '/data/raw/testtable1';

The error message is:

ParseException: Syntax error in line 3:undefined: ROW FORMAT SERDE 'org.apache.hadoop.hiv... ^ Encountered: IDENTIFIER Expected: DELIMITED CAUSED BY: Exception: Syntax error

It looks like Impala only accepts ROW FORMAT DELIMITED. How, then, can I create a Hive table with a fixed-width layout? Should I just create it outside Impala, through Hive, and then do other data operations on the table via Impala? Thanks.
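One possible workaround, a sketch of my own rather than anything from this thread (testtable1_parsed is a hypothetical table name): parse the fixed-width layout in PySpark with substring() and save the result as a regular table that Impala can then query.

from pyspark.sql.functions import substring, col

# Read each fixed-width line as a single string column named "value",
# then slice it by position (substring is 1-based in Spark).
raw = spark.read.text("/data/raw/testtable1")
parsed = raw.select(
    substring(col("value"), 1, 27).alias("code"),         # columns 1-27
    substring(col("value"), 28, 50).alias("codesystem"),  # columns 28-77
)
parsed.write.saveAsTable("testtable1_parsed")  # hypothetical target table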
Labels:
- Apache Hive
- Apache Impala
01-31-2022
07:23 PM
1 Kudo
@jeremymolina That is an excellent explanation. It makes total sense. Thank you very much. Regards,
01-24-2022
04:00 PM
I have seen this kind of notation/style using double curly braces everywhere in the HDP (Ambari) and CDP (CMS) UIs. Below is a configuration value under zeppelin.shiro.knox.main.block in the Zeppelin configuration. (This is a random sample I picked; this question is not about Zeppelin.) ++ krbRealm.signatureSecretFile={{CONF_DIR}}/http_secret ++ I understand that I can simply overwrite {{CONF_DIR}} with the actual path. However, I wonder: is {{CONF_DIR}} an Ansible variable? If yes, how do I define the variable CONF_DIR in CDP Cloudera Manager? https://docs.ansible.com/ansible/latest/user_guide/playbooks_variables.html#defining-simple-variables Regards,
Labels:
- Cloudera Manager
01-21-2022
04:43 PM
@Scharan By the way, under Zeppelin Shiro Urls Block, the original value is ++ /api/interpreter/** = authc, roles[{{zeppelin_admin_group}}] ++ Could you tell me what this notation {{zeppelin_admin_group}} is for? I have seen this kind of notation, double curly braces, frequently. Is it a token to be replaced? If yes, what kind of replacement is it waiting for? Thanks.
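For what it is worth, double curly braces are a common placeholder syntax that deployment tooling fills in before writing the final config file. A minimal Python sketch of the idea (an illustration only, not Cloudera Manager's actual substitution engine; zeppelin_admins is a made-up group name):

# Token replacement in miniature: swap each {{...}} placeholder for a
# configured value before the text reaches the config file on disk.
template = "/api/interpreter/** = authc, roles[{{zeppelin_admin_group}}]"
rendered = template.replace("{{zeppelin_admin_group}}", "zeppelin_admins")  # hypothetical value
print(rendered)  # /api/interpreter/** = authc, roles[zeppelin_admins]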
01-21-2022
04:31 PM
@Scharan I figured it out. The CDP Cloudera Manager UI does expose shiro.ini like Ambari, just via a different layout, which I should have realized earlier. Under "zeppelin.shiro.user.block", I added admin = admin, admin and it worked. Thanks.
01-21-2022
03:01 PM
On the Zeppelin node, under the directory /etc/zeppelin/conf, I found the following files. ++ configuration.xsl interpreter-list log4j.properties log4j_yarn_cluster.properties shiro.ini.template zeppelin-env.cmd.template zeppelin-env.sh.template zeppelin-site.xml.template ++ Should I create a shiro.ini file here?
01-21-2022
02:32 PM
@Scharan Thanks for the reply. I followed your recommendation and got the same permission error. I think the disconnect is this: I added a user called admin successfully, but the configuration /api/interpreter/** = authc, roles[admin] refers to a role called admin. The link between a user and a role lives inside shiro.ini, which I have no idea how to access. I used Zeppelin in HDP, and HDP Zeppelin exposes its shiro.ini via the Zeppelin configuration inside Ambari. In CDP I cannot find a similar configuration inside Cloudera Manager.
01-20-2022
07:02 PM
I am using CDP 7.1.7 and the cluster has not enabled Kerberos yet. Ranger is not enabled either. I followed the steps in this post https://community.cloudera.com/t5/Support-Questions/CDP-7-1-3-Zepplin-not-able-to-login-with-default-username/td-p/303717 to be able to log in as admin. But this "admin" account has no permission to access the configuration or interpreter pages. According to the CDP documentation, https://docs.cloudera.com/cdp-private-cloud-base/7.1.6/configuring-zeppelin/topics/enabling_access_control_for_interpreter__configuration__and_credential_settings.html, I have to go through the Zeppelin web UI to configure shiro.ini for Zeppelin security. What should I do? Regards,
Labels:
- Apache Zeppelin
11-18-2021
01:30 PM
rbiswas1, I tried your code, but pssh returned a timeout error. It was waiting for the password, but I never got the prompt to enter it. Could you elaborate on your method? Thanks.
09-15-2021
10:32 PM
@RangaReddy The link is exactly what I need. Thanks for your help.
09-09-2021
01:18 AM
I am trying to parse a nested JSON document using an RDD rather than a DataFrame. The reason I cannot use a DataFrame (the typical code being spark.read.json) is that the document structure is very complicated: the schema detected by the reader is useless because child nodes at the same level have different schemas. So I tried the script below:

import json

s = '{"key1":{"myid": "123","myname":"test"}}'
rdd = sc.parallelize([s]).map(json.loads)  # note: [s], or the string is split into characters

My next step will be using a map transformation to parse the JSON, but I do not know where to start. I tried the script below, but it failed:

rdd2 = rdd.map(lambda j: (j[x]) for x in j)

I would appreciate any resource on using RDD transformations to parse JSON.
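A note on why the last line fails: Python parses rdd.map(lambda j: (j[x]) for x in j) as a generator expression over an undefined name j, not as a lambda applied to each record. A working sketch of the map step (my own, assuming each record is a dict after json.loads):

# Each record is already a Python dict, so the function can walk it like any
# dict. Here the top-level keys are flattened into (key, value) pairs; deeper
# nesting can be handled the same way inside the function.
rdd2 = rdd.flatMap(lambda j: [(k, v) for k, v in j.items()])
print(rdd2.collect())
# [('key1', {'myid': '123', 'myname': 'test'})]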
- Tags:
- Spark
Labels:
- Apache Spark
09-03-2021
05:08 PM
Vidya, Thanks for your reply. Could you help me clarify the issue further? Does Spark (or another MapReduce-style tool) create the container using the local host as its template (to some degree)?
08-26-2021
02:58 PM
I will use Spark2 in CDP and need to install Python 3. Do I need to install Python 3 on every node in the CDP cluster, or only on one particular node? A Spark2 job executes in JVM containers that can be created on any worker node. I wonder whether the container is created from a template? If so, how is the template created, and where is it? Thanks.
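A small sketch of why the install has to be cluster-wide (my own; the interpreter path is an example): the executor processes run your Python code on whichever workers host the containers, so every one of them must find the interpreter locally.

import os
import sys
from pyspark.sql import SparkSession

# Point the executors at the Python 3 interpreter. This only works if the
# interpreter exists at this path on every worker node.
os.environ["PYSPARK_PYTHON"] = "/usr/bin/python3"  # example path

spark = SparkSession.builder.appName("python3-check").getOrCreate()

# Quick check: run a task on the executors and report their interpreter.
print(spark.sparkContext.parallelize([0]).map(lambda _: sys.version).collect())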
Labels:
- Apache Spark
11-04-2020
10:59 PM
I resolved the error by following advice from this post. https://community.cloudera.com/t5/Support-Questions/Sharing-how-to-solve-HUE-and-HBase-connect-problem-on-CDH-6/td-p/82030
11-04-2020
03:10 PM
I got the same error with HappyBase. My code had been working fine for a few weeks; somehow the Thrift API stopped. I restarted the API, and then I got this error.
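In case it helps with diagnosis, a minimal HappyBase connectivity check (a sketch; thrift-host is a placeholder for your HBase Thrift server): if listing tables works, the Thrift API itself is reachable and the problem lies elsewhere.

import happybase

# Open a connection explicitly and list tables as a smoke test.
connection = happybase.Connection("thrift-host", port=9090, autoconnect=False)
connection.open()
print(connection.tables())
connection.close()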
07-30-2020
03:38 PM
The unpack command will not work without that extra dash. https://stackoverflow.com/questions/34573279/how-to-unzip-gz-files-in-a-new-directory-in-hadoop/43704452 I had another try with a file name as the destination:

hdfs dfs -cat /user/testuser/stage1.tar.gz | gzip -d | hdfs dfs -put - /user/testuser/test3/stage1

The file stage1 appeared in the test3 directory. There is something interesting. The stage1.tar.gz contains three empty txt files. "hdfs dfs -cat /user/testuser/test3/-" output nothing, and the file size is 0.1k. "hdfs dfs -cat /user/testuser/test3/stage1" output some text, including the original file names. Also, that file size is 10k.
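For context on that output: gzip -d only strips the gzip layer, so what lands in HDFS is still a tar stream, which is why the member file names (and tar's header overhead in the 10k size) show up inside. A sketch of an alternative that puts the individual files into HDFS (paths from the thread; stage1_unpacked is a directory name I made up):

import subprocess
import tarfile

# Unpack the archive locally, then upload the extracted members so the
# individual files, not a tar stream, land in HDFS.
with tarfile.open("stage1.tar.gz", "r:gz") as tar:
    tar.extractall("stage1_unpacked")

subprocess.run(
    ["hdfs", "dfs", "-put", "stage1_unpacked", "/user/testuser/test3"],
    check=True,
)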
07-30-2020
03:01 PM
@Shelton Thanks for the quick response. Here is my command to create the gz file:

tar cvzf ~/stage1.tar.gz ./*

I tried the following commands to upload and unzip it into an HDFS directory /user/testuser/test3:

hdfs dfs -copyFromLocal stage1.tar.gz /user/testuser
hdfs dfs -cat /user/testuser/stage1.tar.gz | gzip -d | hdfs dfs -put - /user/testuser/test3

However, what I got in /user/testuser/test3 is a file with the name "-", not the multiple files in stage1.tar.gz. Does your solution mean to concatenate all the files together? Please advise. Thanks.
07-30-2020
11:31 AM
I am copying a large number of small files (HL7 message files) from Linux local storage to HDFS. I wonder whether there is a performance difference between copying the files one by one (through a script) and using a single statement like "hadoop fs -put ./* /hadoop_path". Additional background: some files have spaces in their file names, and if I use the command "hadoop fs -put ./* /hadoop_path", I get the error "put: unexpected URISyntaxException" for those files. If there is no performance difference, I will just copy one file at a time and have my script replace each space with "%20". Otherwise, I have to rename all the files, replacing spaces with underscores, and then use the batch copy.
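A sketch of the rename-then-batch-copy route mentioned above (the source directory is hypothetical): replace spaces with underscores locally so a single "hadoop fs -put ./* /hadoop_path" can move everything at once.

import os

# Rename files in place, swapping spaces for underscores, so the batch
# put no longer hits "put: unexpected URISyntaxException".
src_dir = "/data/hl7_messages"  # hypothetical local directory
for name in os.listdir(src_dir):
    if " " in name:
        os.rename(os.path.join(src_dir, name),
                  os.path.join(src_dir, name.replace(" ", "_")))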
- Tags:
- Copy
- HDFS
- performance
Labels:
- HDFS
02-25-2020
10:02 AM
I got the following responses from Cloudera Certification. Regarding Question #1, the FAQ page has the most up-to-date information, so right now I'd better hold off on purchasing the exam until DE575 is relaunched. Regarding Question #2, the "Spark and Hadoop Developer" training course is the one I should take to prepare for DE575. Regarding Question #3, the environment for the exam is fixed and only available on CDH; candidates do not have the option to take the exam in an HDP environment. The skills tested are applicable to HDP development as well: the exam is in the developer track, so it should have nothing to do with the environment it runs in, and it is primarily interested in transforming data that sits on the cluster.
02-19-2020
01:29 PM
1 Kudo
Finally, I figured out what is going on. The root cause is that I only set up testuser on the edge nodes, not the NameNode. I looked into this page, https://hadoop.apache.org/docs/r3.1.1/hadoop-project-dist/hadoop-common/GroupsMapping.html, which explains that "For HDFS, the mapping of users to groups is performed on the NameNode. Thus, the host system configuration of the NameNode determines the group mappings for the users." After I created the user on the NameNode and ran the command

hdfs dfsadmin -refreshUserToGroupsMappings

the copy succeeded with no permission-denied error.
02-10-2020
11:51 AM
@GangWar Here it is.

$ id -Gn testuser
hadoop wheel hdfs
02-10-2020
09:05 AM
I have run the following test case several times and got the same result. Context: 1. My HDP cluster uses simple mode to determine user identity; Kerberos is not enabled. 2. Below are the permissions on the HDFS folder /data/test:
drwxrwxr-x - hdfs hadoop 0 2020-02-07 13:33 /data/test
So hdfs (the superuser) is the owner and hadoop is the owner group. Both the owner and the owner group have write permission on the /data/test folder.
Steps:
On an edge node, I used the id command to confirm that the logged-in user "testuser" is in the hadoop group.
$ id
uid=1018(testuser) gid=1003(hadoop) groups=1003(hadoop),10(wheel), 1002(hdfs) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
However, testuser still ran into "Permission Denied".
$ hadoop fs -put ./emptyfile1.txt /data/test
put: Permission denied: user=testuser, access=WRITE, inode="/data/test":hdfs:hadoop:drwxrwxr-x
Then I used the hdfs account to change the folder owner to testuser.
$ hadoop fs -chown testuser /data/test
From the same edge node, testuser then ran the put command successfully.
Here is my question: why couldn't testuser write to the HDFS folder via the owner group's permissions?
- Tags:
- HDFS
- hdfs-permissions
Labels:
- HDFS
- Hortonworks Data Platform (HDP)
01-31-2020
09:07 AM
@cjervis Thanks. I reviewed the FAQ page, but it does not answer my questions. I guess I'd better wait until tomorrow, because the page mentions the date February 1, 2020 several times for new launches and other changes.
01-31-2020
08:22 AM
I plan to get a Cloudera certification and need help with the following questions: Question #1: I reviewed the page https://www.cloudera.com/about/training/certification.html; it looks like CCP Data Engineer is the only certification that has not been suspended or retired. Am I right about this? Question #2: To prepare for DE575, the only recommended Cloudera course is the "Spark and Hadoop Developer" training course, according to this page: https://www.cloudera.com/about/training/certification/ccp-data-engineer.html. Should I consider other courses? Question #3: My workplace uses HDP. Do I need to get familiar with products like CDH before taking the exam?
Labels:
- Certification
01-15-2020
10:02 AM
@Shelton @EricL Thank you both. The correct ACL spec is group::r-x. Now the following command works:

sudo -u zeppelin hadoop fs -ls /warehouse/tablespace/managed/hive/test1

From what I just ran into, I feel that, by design, Hive takes extra effort to prevent users from accessing managed table files directly. I will follow that design and access Hive managed tables only through Hive.
01-14-2020
05:09 PM
I tried the following command:

# sudo -u hdfs hadoop fs -setfacl -m g::rx /warehouse/tablespace/managed/hive/test1

But I got the error:

-setfacl: Invalid type of acl in <aclSpec> :g::rx

The ACL spec is meant to change the owning group's permission to rx. Any suggestions?