Member since: 06-17-2016
Posts: 56
Kudos Received: 5
Solutions: 0
07-27-2019
05:16 AM
It worked. Thanks!
05-29-2018
09:09 PM
Hi guys, thanks so much for the fast support, and thanks to the Matts team, @Matt Burgess and @Matt Clarke. I finally understood how the processor works: it emits a flow file with no payload, and the file details such as path and filename are carried in the flow file attributes. Those attributes are then used by FetchHDFS to fetch the corresponding files. Kind regards, Paul
05-28-2018
01:03 PM
Hi everyone, I already solved it after a deep analysis of the code. As you can see in the code I posted above, I am repartitioning the data. As background, the regular process transforms small files, and I want to collect the partial results and create a single file, which is then written into HDFS. That is a desired feature, since HDFS works better with bigger files. To explain it better (because "small" and "big" can be very fuzzy): our HDFS has a standard configuration of 128 MB blocks, so 2 or 3 MB files make no sense and also hurt performance. That is the regular situation, but now a backlog of around 1 TB needs to be processed, and the repartition is causing a shuffle operation. As far as I understand, repartitioning to a single partition requires collecting all the parts on one worker. Since the original RDD is bigger than the memory available on the workers, this collapses everything and throws the errors I reported above:

aswdirCsvDf.repartition(1).write

I just removed the ".repartition(1)" from the code and now everything is working. The program writes several files, that is, one file per worker, and in this context that is quite OK. Kind regards, Paul
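For reference, a minimal sketch of the change, assuming a Spark 2.x batch job; the input and output paths and the surrounding boilerplate are placeholders, and only the DataFrame name aswdirCsvDf and the removed repartition(1) call come from the post:

import org.apache.spark.sql.SparkSession

object WriteCsvExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("WriteCsvExample").getOrCreate()
    // Placeholder input; stands in for the DataFrame built by the regular process.
    val aswdirCsvDf = spark.read.option("header", "true").csv("hdfs:///data/in")

    // Before: repartition(1) forces a full shuffle of the whole dataset into a
    // single partition on one executor, which exhausts memory on a ~1 TB backlog.
    // aswdirCsvDf.repartition(1).write.csv("hdfs:///data/out")

    // After: each task writes its own part file (one file per partition/worker).
    aswdirCsvDf.write.csv("hdfs:///data/out")

    // Alternative (assumption, not from the post): coalesce reduces the number
    // of output files without a full shuffle, if fewer, larger files are wanted.
    // aswdirCsvDf.coalesce(10).write.csv("hdfs:///data/out")

    spark.stop()
  }
}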
05-31-2018
09:34 AM
@Felix Albani Hi Felix, you installed 3.6.4, but according to the documentation Spark2 can only support Python up to 3.4.x. Can you kindly explain how this works?
06-20-2018
04:49 PM
@Paul Hernandez Hey Paul - did you find a solution to this? It looks like it's only Parquet that's affected; CSV doesn't have this problem. I too have data in subdirectories and Spark SQL returns null.
03-15-2018
09:02 AM
Hi @Patrick Young, you need to follow several steps to make this work.

About Python: I installed Anaconda3, and the critical step is to not let Anaconda3 be configured in the environment variables. The HDP platform needs Python 2 for some scripts, so the python path must resolve to a Python 2 installation.

Since I want to have both the spark and spark2 interpreters, I commented out the SPARK_HOME line in the zeppelin-env.sh file. Another configuration I changed in this file: according to the documentation, the variable ZEPPELIN_JAVA_OPTS changed to ZEPPELIN_INTP_JAVA_OPTS in spark2. Since both versions are active, both variables are defined:

export ZEPPELIN_JAVA_OPTS="-Dhdp.version=None -Dspark.executor.memory=512m -Dspark.executor.instances=2 -Dspark.yarn.queue=default"
export ZEPPELIN_INTP_JAVA_OPTS="-Dhdp.version=None -Dspark.executor.memory=512m -Dspark.executor.instances=2 -Dspark.yarn.queue=default"

You also need to configure the spark2 interpreter accordingly, and I created a Python interpreter as well. Finally, I created a symbolic link so that conda can be found. Create a symlink to /bin/conda:

ln -s /opt/anaconda3/bin/conda /bin/conda

Of course you have to adjust the paths above to your own paths. Hope that helps. Kind regards, Paul
03-06-2018
08:49 PM
1 Kudo
The performance has been very good, especially considering it can run on any size of node or cluster. Give it a try and upgrade to NiFi 1.5. It is certainly easier than recompiling Spark programs. If it doesn't meet your needs, go back to Spark.
02-26-2018
08:07 AM
Hi @Aditya Sirna, thanks for your answer. Independent of the Hive-Phoenix platform support, I would like to know how to centrally add Hive aux libs. I am able to access the Phoenix tables from the Hive CLI in two ways:
- running "set hive.aux.jars.path=<jar location>" in the CLI
- adding an auxlib folder to Hive
The problem with these two approaches is that the library is only available on the node or box where the CLI is running. It does not work for Hive Ambari views or other clients/boxes. Customizing the Jinja template for hive-env seems extremely complicated to me, so I will find a system engineer to do that. I don't really know if it is worth it, or whether we should just access the Phoenix tables directly without Hive. Kind regards, Paul
11-25-2017
07:19 AM
Hi @enzo EL
1) If you just need pandas with pyspark, test it with the example I provided for the spark interpreter.
2) It seems the Python interpreter is only available starting with Zeppelin 0.7.2. Is an upgrade possible for you?
3) You can add interpreters that are not installed by default by following the official documentation: https://zeppelin.apache.org/docs/0.7.0/manual/interpreterinstallation.html
I have never done it before, but it should work.
02-01-2018
11:13 AM
Hi @Krishnaswami Rajagopalan, I don't know the details of this sandbox in the Azure cloud exactly. Are you connecting to the sandbox or to the Docker container inside it? The Docker container is where Zeppelin and the other services are located. To connect to the Docker container, use port 2222 in your SSH command, for example: ssh root@127.0.0.1 -p 2222. I guess it doesn't matter whether your cluster or sandbox is running in the cloud. You should be able to find Zeppelin under /usr/hdp/current/zeppelin-server. Hope this helps. BR, Paul
11-15-2017
09:41 AM
Hi Jay, thanks for the quick answer. I installed without Ambari. The reason: I want to use Spark2, which does not work with Zeppelin 0.6. I need Zeppelin 0.7.x, which does not work with Ambari 2.4, and I cannot (or at least do not want to at this moment) upgrade our HDP 2.5.3. In this chain of dependencies, I found the best solution was to install a Zeppelin that is not managed by Ambari. When I start Zeppelin this way:

su zeppelin
/opt/zeppelin/bin/zeppelin-daemon.sh start

I get no errors, but when I navigate to http://myhost:9995 I get an HTTP 403 Forbidden. What I saw is that the directory /opt/zeppelin/webapps is empty. If I do the same as root, that directory is populated (/opt/zeppelin/webapps/webapp/...). Therefore I guess it is a permission problem, but I already set the permissions as follows:

chown -R zeppelin:zeppelin /opt/zeppelin
chmod -R 775 /opt/zeppelin
10-05-2018
09:17 AM
@slachterman I am facing some issues with PySpark code, and in some places I see there are compatibility issues, so I wanted to check if that is probably the cause. Even otherwise, it is better to check these compatibility problems upfront, I guess. So I wanted to know a few things. I am on Spark 2.3.1 and Python 3.6.5; do we know if there is a compatibility issue with these? Should I upgrade to 3.7.0 (which I am planning) or downgrade to <3.6? What, in your opinion, is more sensible?

Info on versions: Spark is spark-2.3.1-bin-hadoop2.7, all installed according to the instructions in a Python Spark course.

venkatesh@venkatesh-VirtualBox:~$ java -version
openjdk version "10.0.1" 2018-04-17
OpenJDK Runtime Environment (build 10.0.1+10-Ubuntu-3ubuntu1)
OpenJDK 64-Bit Server VM (build 10.0.1+10-Ubuntu-3ubuntu1, mixed mode)

I work on MacOS and Linux.
12-26-2017
11:09 PM
Hi @vromeo I raised this question on Stack Overflow and received an acceptable answer: https://stackoverflow.com/questions/47198678/zeppelin-python-conda-and-python-sql-interpreters-do-not-work-without-adding-a Kind regards, Paul
11-06-2017
08:20 PM
1 Kudo
Hi everyone, I already found a solution. I was sending just strings to the Angular controller (not a JSON object). In the controller, angular.forEach is used to iterate over the incoming value "newValue"; this function iterates over both objects and arrays. What I sent was interpreted as an element of a string array and not as a JSON object, i.e. {"value":"-2.5119123","loc":{"lat":53.3,"lon":7.125}}. In the first iteration the value evaluated was "value":"-2.5119123", and in the second "loc":{"lat":53.3,"lon":7.125}. I just modified the Scala code in order to send the whole Array[String]:

%spark2
import scala.util.parsing.json.JSONObject
import org.apache.spark.sql._
import org.json4s._
import org.json4s.JsonDSL._
import org.json4s.jackson.JsonMethods._
// Hive context, with recursive input directories enabled so nested folders are read.
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
val hadoopConf = sc.hadoopConfiguration
hadoopConf.set("mapreduce.input.fileinputformat.input.dir.recursive", "true")
// Sample of wind values with their coordinates.
val COSMODE_Wind = sqlContext.sql("SELECT lat0, long0, value FROM dev_sdsp.cosmode_single_level_elements_v_10m limit 100")
case class Loc(lat: Double, lon: Double)
case class Wind(value: String, loc: Loc)
// Map each row to a Wind object, then serialize the rows to JSON strings.
val dataPoints = COSMODE_Wind.map{s => Wind(s.getDouble(2).toString, Loc(s.getDouble(0), s.getDouble(1)))}
val dataPointsJson = dataPoints.toJSON.take(100)
// Bind the Array[String] of JSON rows to the Angular front end.
z.angularBind("locations", dataPointsJson)
Then angular.forEach iterates over the whole "row", {"value":"-2.5119123","loc":{"lat":53.3,"lon":7.125}}, which can be converted to a JSON object using JavaScript. Finally, the values are accessible using dot notation. Here is the Angular snippet:

var el = angular.element($('#map').parent('.ng-scope'));
angular.element(el).ready(function() {
window.locationWatcher = el.scope().compiledScope.$watch('locations', function(newValue, oldValue) {
// geoMarkers.clearLayers(); -- if you want to only show new data clear the layer first
console.log("new value: " + newValue);
angular.forEach(newValue, function(wind) {
// Parse each JSON string once; skip the element if it is not valid JSON.
var windJSON;
try { windJSON = JSON.parse(wind); } catch(error) { alert(error); return; }
var marker = L.marker([windJSON.loc.lat, windJSON.loc.lon]).bindPopup(windJSON.value).addTo(geoMarkers);
});
})
});
Hope this helps someone.
02-20-2017
01:35 PM
Hi everyone,
I don't know exactly what I modified in Ranger, but now I am able to open the Hive View. However, I'm still getting an error:
Error while compiling statement: FAILED: SemanticException MetaException(message:java.security.AccessControlException: Permission denied: user=hive, access=READ, inode="/apps/hive/warehouse/myfile":anuser:agroup:drwxrwx--- at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319) ...
I set the property hive.server2.enable.doAs to both true and false, with the same result. What I cannot understand is why the user is always hive. According to this article: http://hortonworks.com/blog/best-practices-for-hive-authorization-using-apache-ranger-in-hdp-2-2/ if the property is set to true, HiveServer2 will run MR jobs in HDFS as the original user. Why is the original user also hive? Could it be related to these properties: hadoop.proxyuser.hive.hosts and hadoop.proxyuser.hive.groups? Any comment will be appreciated.
04-01-2017
05:41 AM
@Vipin Rathor Great work!
03-06-2017
07:29 PM
Hi everyone, @jwhitmore thanks for your response. You are right: when exposing port 50010, Talend for Big Data works (with the tHDFSConnect component and co.). But even when exposing port 50010, the same error still occurs when using Talend ESB with the Camel framework, see below:

[WARN ]: org.apache.hadoop.hdfs.DFSClient - DataStreamer Exception
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /tmp/Ztest.csv.opened could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and 1 node(s) are excluded in this operation.
at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1641)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getNewBlockTargets(FSNamesystem.java:3198)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3122)

I've designed a Scala program and I'm facing the same issue:

15:59:22.386 [main] ERROR org.apache.hadoop.hdfs.DFSClient - Failed to close inode 500495
org.apache.hadoop.ipc.RemoteException: File /user/hdfs/testscala2.txt could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and 1 node(s) are excluded in this operation.
at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1641)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getNewBlockTargets(FSNamesystem.java:3198)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3122)

Any idea? Thanks in advance. Best regards, Mickaël.
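For context, a minimal sketch (under assumptions, not the original program) of the kind of HDFS write a Scala client would perform here, using the standard Hadoop FileSystem API. The NameNode address is a placeholder, and the dfs.client.use.datanode.hostname setting is an assumption about a common remedy when the sandbox's datanode port 50010 is exposed through Docker:

import java.nio.charset.StandardCharsets
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HdfsWriteExample {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    // Placeholder NameNode address; adjust to your cluster or sandbox.
    conf.set("fs.defaultFS", "hdfs://sandbox-hdp.hortonworks.com:8020")
    // Assumption: have the client connect to datanodes by hostname instead of
    // their internal container IPs, so the exposed 50010 port is reachable.
    conf.setBoolean("dfs.client.use.datanode.hostname", true)

    val fs = FileSystem.get(conf)
    val out = fs.create(new Path("/user/hdfs/testscala2.txt"))
    try {
      // The "could only be replicated to 0 nodes" error shows up while writing
      // or closing, when the client cannot stream the block to any reachable datanode.
      out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8))
    } finally {
      out.close()
    }
    fs.close()
  }
}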