Member since: 04-03-2019
Posts: 92
Kudos Received: 6
Solutions: 5
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 3443 | 01-21-2022 04:31 PM
 | 5885 | 02-25-2020 10:02 AM
 | 3553 | 02-19-2020 01:29 PM
 | 2568 | 09-17-2019 06:33 AM
 | 5606 | 08-26-2019 01:35 PM
09-09-2021
01:18 AM
I am trying to parse a nested JSON document using an RDD rather than a DataFrame. The reason I cannot use a DataFrame (the typical approach, e.g. spark.read.json) is that the document structure is very complicated: child nodes at the same level have different schemas, so the schema inferred by the reader is useless. So I tried the script below:

import json
s = '{"key1": {"myid": "123", "myname": "test"}}'
rdd = sc.parallelize([s]).map(json.loads)

(Note: sc.parallelize needs a list here; passing the bare string would split it into individual characters.)

My next step will be to use a map transformation to parse the JSON string, but I do not know where to start. I tried the script below, but it failed:

rdd2 = rdd.map(lambda j: (j[x]) for x in j)

I would appreciate any resource on using RDD transformations to parse JSON.
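For reference, a minimal sketch of the direction I have in mind (my own guess at the next step, using flatMap; nothing here is confirmed to handle the real, more complicated documents):

import json

s = '{"key1": {"myid": "123", "myname": "test"}}'
# one-element RDD holding the raw JSON string, parsed into a dict
rdd = sc.parallelize([s]).map(json.loads)
# flatten each parsed dict into (key, value) pairs
rdd2 = rdd.flatMap(lambda d: d.items())
print(rdd2.collect())
# expected: [('key1', {'myid': '123', 'myname': 'test'})]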
Labels: Apache Spark
09-03-2021
05:08 PM
Vidya, thanks for your reply. Could you help me clarify the issue further? Does Spark (or another MapReduce-style tool) create the container using the local host as its template (to some degree)?
08-26-2021
02:58 PM
I will use Spark2 in CDP and need to install Python3. Do I need to install Python3 on every node in the CDP cluster, or only on one particular node? Spark2 jobs are executed in JVM containers that can be created on any worker node. I wonder whether the container is created from a template? If so, how is the template created and where is it? Thanks.
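For context, a sketch of the kind of per-job configuration I expect is involved, assuming Python3 ends up at /usr/bin/python3 on each node (the path is illustrative; the variable and property names are standard Spark 2):

# environment variables read by PySpark (e.g. in spark-env.sh)
export PYSPARK_PYTHON=/usr/bin/python3
export PYSPARK_DRIVER_PYTHON=/usr/bin/python3

# or equivalently in spark-defaults.conf
spark.pyspark.python /usr/bin/python3
spark.pyspark.driver.python /usr/bin/python3

My understanding is that each executor resolves this path locally on whichever node it runs, which would be why the interpreter has to exist at the same path on every worker node.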
Labels: Apache Spark
11-04-2020
10:59 PM
I resolved the error by following advice from this post. https://community.cloudera.com/t5/Support-Questions/Sharing-how-to-solve-HUE-and-HBase-connect-problem-on-CDH-6/td-p/82030
... View more
11-04-2020
03:10 PM
I got the same error with HappyBase. My code had been working fine for a few weeks when the Thrift API somehow stopped. I restarted the API, and then I got this error.
07-30-2020
03:38 PM
The unpack command will not work without that extra dash: https://stackoverflow.com/questions/34573279/how-to-unzip-gz-files-in-a-new-directory-in-hadoop/43704452

I had another try with a file name as the destination:

hdfs dfs -cat /user/testuser/stage1.tar.gz | gzip -d | hdfs dfs -put - /user/testuser/test3/stage1

The file stage1 appeared in the test3 directory. There is something interesting: stage1.tar.gz contains three empty txt files, yet "hdfs dfs -cat /user/testuser/test3/-" output nothing and that file's size is 0.1 KB, while "hdfs dfs -cat /user/testuser/test3/stage1" output some text, including the original file names, and its size is 10 KB.
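If I understand the pieces correctly, gzip -d only strips the gzip layer, so what lands in HDFS is still a tar archive; the tar headers would explain both the embedded file names and the ~10 KB size for three empty files. A sketch of one way the members could actually be extracted before uploading (same paths as above; the local staging directory is illustrative):

# extract the tar members locally, then upload them as individual files
mkdir -p /tmp/stage1_extracted
hdfs dfs -cat /user/testuser/stage1.tar.gz | tar -xzf - -C /tmp/stage1_extracted
hdfs dfs -put /tmp/stage1_extracted/* /user/testuser/test3/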
07-30-2020
03:01 PM
@Shelton Thanks for the quick response. Here is my code to create the gz file:

tar cvzf ~/stage1.tar.gz ./*

I tried the following commands to upload and unzip it into the HDFS directory /user/testuser/test3:

hdfs dfs -copyFromLocal stage1.tar.gz /user/testuser
hdfs dfs -cat /user/testuser/stage1.tar.gz | gzip -d | hdfs dfs -put - /user/testuser/test3

However, what I got in /user/testuser/test3 is a single file named "-" (presumably the "-" stdin placeholder used as the file name, since the destination is a directory), not the multiple files inside stage1.tar.gz. Does your solution mean to concatenate all the files together? Please advise. Thanks.
07-30-2020
11:31 AM
I am copying a large number of small files (HL7 message files) from local Linux storage to HDFS. I wonder whether there is a performance difference between copying the files one by one (through a script) and using a single statement like "hadoop fs -put ./* /hadoop_path". Additional background: some files have spaces in their file names, and if I use the command "hadoop fs -put ./* /hadoop_path", I get the error "put: unexpected URISyntaxException" for those files. If there is no performance difference, I would just copy one file at a time, with my script replacing each space with "%20" (see the sketch below). Otherwise, I would have to rename all the files, replacing spaces with underscores, and then use the batch copy.
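For illustration, a sketch of the one-by-one loop I have in mind (the %20 substitution as described above; untested):

# copy each file individually, percent-encoding spaces in the destination name
for f in ./*; do
  encoded=$(basename "$f" | sed 's/ /%20/g')
  hadoop fs -put "$f" "/hadoop_path/$encoded"
done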
... View more
Labels:
- Labels:
-
HDFS
02-25-2020
10:02 AM
I got the following responses from Cloudera Certification. Regarding Question #1, the FAQ page has the most up-to-date information, so right now I'd better hold off on purchasing the exam until DE575 is relaunched. Regarding Question #2, the "Spark and Hadoop Developer" training course is the one I should take to prepare for DE575. Regarding Question #3, the exam environment is fixed and only available on CDH; candidates do not have the option to take the exam in an HDP environment. The skills tested are applicable to HDP development as well; since the exam is in the developer track, it should have nothing to do with the environment it runs in. It is primarily interested in transforming data that sits on the cluster.
02-19-2020
01:29 PM
1 Kudo
Finally, I figured out what is going on. The root cause is that I had only set up testuser on the edge nodes, not on the NameNode. I looked into this page, https://hadoop.apache.org/docs/r3.1.1/hadoop-project-dist/hadoop-common/GroupsMapping.html, which says: "For HDFS, the mapping of users to groups is performed on the NameNode. Thus, the host system configuration of the NameNode determines the group mappings for the users." After I created the user on the NameNode and ran the command

hdfs dfsadmin -refreshUserToGroupsMappings

the copy succeeded with no permission-denied error.
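For anyone hitting the same error, the sequence on the NameNode host was roughly the following (username illustrative; the hdfs groups call is just my way of verifying the mapping):

# create the missing OS user on the NameNode host
sudo useradd testuser
# have HDFS re-read its user-to-group mappings
hdfs dfsadmin -refreshUserToGroupsMappings
# verify which groups HDFS now resolves for the user
hdfs groups testuser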