Member since: 06-17-2016
Posts: 56
Kudos Received: 5
Solutions: 0
07-27-2019 05:16 AM
It worked. Thanks!
05-29-2018 09:09 PM
Hi guys, thanks so much for the fast support, and thanks to the Matts team, @Matt Burgess and @Matt Clarke. I finally understood how the processor works: it emits a flow file with no payload, and the flow file's attributes carry the file details, such as path and filename. FetchHDFS then uses those attributes to fetch the corresponding files. Kind regards, Paul
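For anyone who lands here later: the wiring between the two processors happens entirely through those attributes. A minimal sketch of the relevant FetchHDFS property (to my knowledge this is its default value, shown here for clarity):

    HDFS Filename: ${path}/${filename}

ListHDFS writes the path and filename attributes on each empty flow file, and FetchHDFS evaluates this expression per flow file to pull the actual content from HDFS.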
05-28-2018 01:03 PM
Hi everyone, I already solved it after a deep analysis of the code. As you can see in the code I posted above, I am repartitioning the data. As background: the regular process transforms small files, and I want to collect the partial results and create a single file, which is then written into HDFS. That is a desired feature, since HDFS works better with bigger files. To be more precise, because "small" and "big" can be very fuzzy: our HDFS has a standard configuration of 128 MB blocks, so 2 or 3 MB files make no sense and also hurt performance.

That is the regular situation, but now a backlog of around 1 TB needs to be processed, and the repartition is causing a shuffle operation. As far as I understand, repartitioning to a single partition requires collecting all the parts on one worker. Since the original RDD is bigger than the memory available on the workers, this collapses everything and throws the errors I reported above.

    aswdirCsvDf.repartition(1).write

I just removed the ".repartition(1)" from the code and now everything is working. The program writes several files, that is, one file per worker, and in this context that is quite OK. Kind regards, Paul
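A minimal sketch of the change, assuming aswdirCsvDf is the DataFrame from the snippet above; the output path and CSV format are illustrative:

    // Before: forces every record through a single partition on one worker,
    // which triggers a full shuffle and can OOM an executor on a 1 TB backlog.
    aswdirCsvDf.repartition(1).write.csv("hdfs:///data/out")

    // After: each partition is written in parallel, one file per partition.
    aswdirCsvDf.write.csv("hdfs:///data/out")

If you want fewer (but not one) output files, coalesce(n) reduces the partition count without triggering a full shuffle, which can be a middle ground.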
05-31-2018 09:34 AM
@Felix Albani Hi Felix, you installed 3.6.4, but according to the documentation Spark2 can only support up to Python 3.4.x. Can you kindly explain how this works?
06-20-2018 04:49 PM
@Paul Hernandez Hey Paul, did you find a solution to this? It looks like it's only Parquet that's affected; CSV doesn't have this problem. I too have data in subdirectories, and Spark SQL returns null.
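Not an official fix, but one workaround often suggested for nested layouts is to make the subdirectory level explicit with a glob in the path (the base path below is illustrative):

    // Read Parquet files that sit one directory level below the base path.
    val df = spark.read.parquet("hdfs:///data/base/*")

If the subdirectories follow the key=value partition convention, reading with .option("basePath", "hdfs:///data/base") also lets Spark pick up the partition columns.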
02-01-2018 11:13 AM
Hi @Krishnaswami Rajagopalan I don't know the exact details of this sandbox in the Azure cloud. Are you connecting to the sandbox or to the Docker container inside it? The Docker container is where Zeppelin and other services are located. To connect to the Docker container, use port 2222 in your SSH command. Example: ssh root@127.0.0.1 -p 2222 I guess it doesn't matter whether your cluster or sandbox is running in the cloud. You should be able to find Zeppelin under /usr/hdp/current/zeppelin-server Hope this helps. BR. Paul
10-05-2018 09:17 AM
@slachterman I am facing some issues with PySpark code, and in some places I see there are compatibility issues, so I wanted to check if that is probably the cause. Even otherwise, it is better to check these compatibility problems upfront, I guess. So I wanted to know some things. I am on Spark 2.3.1 and Python 3.6.5; do we know if there is a compatibility issue with these? Do I upgrade to 3.7.0 (which I am planning) or downgrade to <3.6? What in your opinion is more sensible?

Info: versions. Spark --> spark-2.3.1-bin-hadoop2.7, all installed according to the instructions in the Python Spark course.

    venkatesh@venkatesh-VirtualBox:~$ java -version
    openjdk version "10.0.1" 2018-04-17
    OpenJDK Runtime Environment (build 10.0.1+10-Ubuntu-3ubuntu1)
    OpenJDK 64-Bit Server VM (build 10.0.1+10-Ubuntu-3ubuntu1, mixed mode)

I work on macOS and Linux.
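In case it helps others with the same doubt: whichever interpreter you settle on, you can pin it explicitly so Spark does not pick up the system default. A minimal sketch, assuming a Python 3.6 binary at the path shown (adjust to your machine):

    # conf/spark-env.sh inside your Spark installation
    export PYSPARK_PYTHON=/usr/bin/python3.6         # interpreter used by workers
    export PYSPARK_DRIVER_PYTHON=/usr/bin/python3.6  # interpreter used by the driver

PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are standard Spark environment variables; keeping both on the same version avoids the "Python in worker has different version" error.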
03-06-2017 07:29 PM
Hi everyone, @jwhitmore Thanks for your response, you are right: when exposing 50010, Talend for Big Data works (with the tHDFSConnect component and co.). But even with port 50010 exposed, we always get the same error when using Talend ESB with the Camel framework, see below:

    [WARN ]: org.apache.hadoop.hdfs.DFSClient - DataStreamer Exception
    org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /tmp/Ztest.csv.opened could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and 1 node(s) are excluded in this operation.
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1641)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getNewBlockTargets(FSNamesystem.java:3198)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3122)

I've designed a Scala program, and I'm facing the same issue:

    15:59:22.386 [main] ERROR org.apache.hadoop.hdfs.DFSClient - Failed to close inode 500495
    org.apache.hadoop.ipc.RemoteException: File /user/hdfs/testscala2.txt could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and 1 node(s) are excluded in this operation.
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1641)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getNewBlockTargets(FSNamesystem.java:3198)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3122)

Any idea? Thanks in advance. Best regards, Mickaël.
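One thing worth trying: this pattern ("could only be replicated to 0 nodes", datanode excluded) often shows up when the client can reach the namenode but not the datanode's advertised internal IP, which is typical with sandbox and Docker setups. Telling the HDFS client to connect to datanodes by hostname can help. A minimal Scala sketch; the namenode URI and file path are illustrative:

    import java.net.URI
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    val conf = new Configuration()
    // The namenode hands back the datanode's internal address; inside a
    // Docker/sandbox setup that address is often unreachable, so the datanode
    // gets "excluded". This makes the client resolve datanodes by hostname.
    conf.set("dfs.client.use.datanode.hostname", "true")

    val fs  = FileSystem.get(new URI("hdfs://sandbox.example.com:8020"), conf)
    val out = fs.create(new Path("/user/hdfs/testscala2.txt"))
    out.write("test".getBytes("UTF-8"))
    out.close()
    fs.close()

You also need the datanode hostname to resolve from the client machine (e.g. an /etc/hosts entry) and port 50010 exposed, as already discussed above.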