Member since: 11-17-2016
Posts: 63
Kudos Received: 7
Solutions: 5

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 3206 | 11-23-2017 10:50 AM |
| | 6347 | 05-12-2017 02:13 PM |
| | 19360 | 01-11-2017 04:20 PM |
| | 13717 | 01-06-2017 04:03 PM |
| | 7628 | 01-06-2017 03:49 PM |
04-21-2017
04:42 PM
Hi, I have a 3-node cluster running CentOS 6.7. For about a week I have been seeing a warning on all 3 nodes that the block count is above the threshold. My NameNode is also used as a DataNode. It is more or less the same on all 3 nodes:

Concerning: The DataNode has 1,823,093 blocks. Warning threshold: 500,000 block(s).

I know this points to a growing small-files problem. I have unstructured website data on HDFS (jpg, mpeg, css, js, xml, html files), and I don't know how to deal with this problem. Please help.

The output of the following command on the NameNode is:

[hdfs@XXXXNode01 ~]$ hadoop fs -ls -R / | wc -l
3925529

Thanks, Shilpa
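As added context for the small-files warning above: one common way to bring the block count down is to consolidate the small files, for example with Hadoop Archives (HAR). A minimal sketch, assuming a hypothetical /website/data directory holding the site assets and an /archives target directory (adjust both to your layout):

```bash
# Count files/directories and used space before archiving, for comparison.
hdfs dfs -count /website/data

# Pack everything under /website/data into a single HAR file; the many
# small files become a few large part files plus index files, so the
# NameNode tracks far fewer blocks.
hadoop archive -archiveName website-2017.har -p /website data /archives

# The contents stay readable through the har:// scheme.
hdfs dfs -ls har:///archives/website-2017.har/data

# Only after verifying the archive should the originals be removed.
# hdfs dfs -rm -r -skipTrash /website/data
```

HAR files are read-only and are accessed through the har:// scheme, so whether this fits depends on how the website data is consumed; merging files into larger containers (e.g. SequenceFiles) is another common approach.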
Labels:
- Cloudera Manager
- HDFS
01-11-2017
04:20 PM
1 Kudo
Thanks for the reply @srowen. The best way to install R and then install SparkR on top of it is described here: http://blog.clairvoyantsoft.com/2016/11/installing-sparkr-on-a-hadoop-cluster/ I was able to install both by following this link. It is really useful and up to date. Thanks, Shilpa
01-06-2017
04:03 PM
1 Kudo
@pdvorak thanks! Yes, I wrote Java code to pull the RSS feed, used an Exec source and Avro sink on 2 nodes, and an Avro source as the collector with an HDFS sink on the 3rd node.
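For readers following along, a rough sketch of how Flume agents with this layout are typically started from the shell; the config file path and the agent name `agent` are assumptions that match the flume.conf snippets quoted in the 01-05-2017 post below:

```bash
# Same command on the two edge nodes (Exec source -> Avro sink) and on
# the collector (Avro source -> HDFS sink); only flume.conf differs.
flume-ng agent \
  --conf /etc/flume-ng/conf \
  --conf-file /etc/flume-ng/conf/flume.conf \
  --name agent \
  -Dflume.root.logger=INFO,console
```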
01-06-2017
03:49 PM
Hi @pdvorak, thanks for your comments. The answer to all three questions is yes: my iptables is turned off, I can ping my NN, and I can traverse HDFS. The problem was that the IP I gave for the HDFS sink was the private IP; once I changed it to the public one, it started streaming the data. So the issue is resolved. 🙂
01-05-2017
07:09 PM
Hi All, @pdvorak, I am using Cloudera 5.9 on a 3-node cluster, and I have to stream the RSS feed of a news channel to HDFS. I have Java code to pull the RSS feed and 3 Flume agents: 2 of them have an Exec source listening on the file generated by the Java code and an Avro sink, and the last one has an Avro source and an HDFS sink. But when I start Flume on all nodes, the one with the Avro source and HDFS sink gives this error:

hdfs.HDFSEventSink: HDFS IO error java.io.IOException: Callable timed out after 180000 ms on file: hdfs://10.0.10.4:8020/flume/events/FlumeData.1483670786529.tmp

I have googled the error and increased testAgent.sinks.testSink.hdfs.callTimeout = 180000 (the default is 10000), as suggested by https://issues.apache.org/jira/browse/FLUME-2429. I even increased the two HDFS properties dfs.socket.timeout and dfs.datanode.socket.write.timeout from the default value 3000 to 30000. But the error is still there and nothing is being written to HDFS.

My flume.conf on this node is:

agent.sources = avro-collection-source
agent.channels = memoryChannel
agent.sinks = hdfs-sink

# For each one of the sources, the type is defined
agent.sources.avro-collection-source.type = avro
agent.sources.avro-collection-source.bind = 10.0.0.6
agent.sources.avro-collection-source.port = 60000

# The channel can be defined as follows.
agent.sources.avro-collection-source.channels = memoryChannel

# Each sink's type must be defined
agent.sinks.hdfs-sink.type = hdfs
agent.sinks.hdfs-sink.hdfs.path = hdfs://10.0.10.4:8020/flume/events
agent.sinks.hdfs-sink.hdfs.callTimeout = 180000

# Specify the channel the sink should use
agent.sinks.hdfs-sink.channel = memoryChannel

# Each channel's type is defined.
agent.channels.memoryChannel.type = memory

# Other config values specific to each type of channel (sink or source)
# can be defined as well
# In this case, it specifies the capacity of the memory channel
agent.channels.memoryChannel.capacity = 10000

flume.conf on the other 2 nodes is:

agent.sources = reader
agent.channels = memoryChannel
agent.sinks = avro-forward-sink

# For each one of the sources, the type is defined
agent.sources.reader.type = exec
agent.sources.reader.command = tail -f /var/log/flume-ng/source.txt
agent.sources.reader.logStdErr = true
agent.sources.reader.restart = true

# The channel can be defined as follows.
agent.sources.reader.channels = memoryChannel

# Each sink's type must be defined
agent.sinks.avro-forward-sink.type = avro
agent.sinks.avro-forward-sink.hostname = 10.0.0.6
agent.sinks.avro-forward-sink.port = 60000

# Specify the channel the sink should use
agent.sinks.avro-forward-sink.channel = memoryChannel

# Each channel's type is defined.
agent.channels.memoryChannel.type = memory
agent.channels.memoryChannel.capacity = 10000
agent.channels.memoryChannel.transactionCapacity = 1004

Error log:

17/01/05 20:46:11 INFO node.Application: Starting Sink hdfs-sink
17/01/05 20:46:11 INFO node.Application: Starting Source avro-collection-source
17/01/05 20:46:11 INFO source.AvroSource: Starting Avro source avro-collection-source: { bindAddress: 10.0.0.6, port: 60000 }...
17/01/05 20:46:11 INFO instrumentation.MonitoredCounterGroup: Monitored counter group for type: SINK, name: hdfs-sink: Successfully registered new MBean.
17/01/05 20:46:11 INFO instrumentation.MonitoredCounterGroup: Component type: SINK, name: hdfs-sink started
17/01/05 20:46:11 INFO instrumentation.MonitoredCounterGroup: Monitored counter group for type: SOURCE, name: avro-collection-source: Successfully registered new MBean.
17/01/05 20:46:11 INFO instrumentation.MonitoredCounterGroup: Component type: SOURCE, name: avro-collection-source started
17/01/05 20:46:11 INFO source.AvroSource: Avro source avro-collection-source started.
17/01/05 20:46:20 INFO ipc.NettyServer: [id: 0x8ed94161, /10.0.0.5:51797 => /10.0.0.6:60000] OPEN
17/01/05 20:46:20 INFO ipc.NettyServer: [id: 0x8ed94161, /10.0.0.5:51797 => /10.0.0.6:60000] BOUND: /10.0.0.6:60000
17/01/05 20:46:20 INFO ipc.NettyServer: [id: 0x8ed94161, /10.0.0.5:51797 => /10.0.0.6:60000] CONNECTED: /10.0.0.5:51797
17/01/05 20:46:26 INFO hdfs.HDFSSequenceFile: writeFormat = Writable, UseRawLocalFileSystem = false
17/01/05 20:46:27 INFO hdfs.BucketWriter: Creating hdfs://10.0.10.4:8020/flume/events/FlumeData.1483670786526.tmp
17/01/05 20:46:49 INFO ipc.Client: Retrying connect to server: 10.0.10.4/10.0.10.4:8020. Already tried 0 time(s); maxRetries=45
17/01/05 20:47:09 INFO ipc.Client: Retrying connect to server: 10.0.10.4/10.0.10.4:8020. Already tried 1 time(s); maxRetries=45
17/01/05 20:47:29 INFO ipc.Client: Retrying connect to server: 10.0.10.4/10.0.10.4:8020. Already tried 2 time(s); maxRetries=45
17/01/05 20:47:49 INFO ipc.Client: Retrying connect to server: 10.0.10.4/10.0.10.4:8020. Already tried 3 time(s); maxRetries=45
17/01/05 20:48:09 INFO ipc.Client: Retrying connect to server: 10.0.10.4/10.0.10.4:8020. Already tried 4 time(s); maxRetries=45
17/01/05 20:48:29 INFO ipc.Client: Retrying connect to server: 10.0.10.4/10.0.10.4:8020. Already tried 5 time(s); maxRetries=45
17/01/05 20:48:49 INFO ipc.Client: Retrying connect to server: 10.0.10.4/10.0.10.4:8020. Already tried 6 time(s); maxRetries=45
17/01/05 20:49:09 INFO ipc.Client: Retrying connect to server: 10.0.10.4/10.0.10.4:8020. Already tried 7 time(s); maxRetries=45
17/01/05 20:49:27 WARN hdfs.HDFSEventSink: HDFS IO error
java.io.IOException: Callable timed out after 180000 ms on file: hdfs://10.0.10.4:8020/flume/events/FlumeData.1483670786526.tmp
    at org.apache.flume.sink.hdfs.BucketWriter.callWithTimeout(BucketWriter.java:693)
    at org.apache.flume.sink.hdfs.BucketWriter.open(BucketWriter.java:235)
    at org.apache.flume.sink.hdfs.BucketWriter.append(BucketWriter.java:514)
    at org.apache.flume.sink.hdfs.HDFSEventSink.process(HDFSEventSink.java:418)
    at org.apache.flume.sink.DefaultSinkProcessor.process(DefaultSinkProcessor.java:68)
    at org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:147)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.concurrent.TimeoutException
    at java.util.concurrent.FutureTask.get(FutureTask.java:201)
    at org.apache.flume.sink.hdfs.BucketWriter.callWithTimeout(BucketWriter.java:686)
    ... 6 more

Can anyone help me resolve this? I have no idea why this is happening.

Thanks, Shilpa
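For anyone debugging a similar timeout: the repeated "Retrying connect to server: 10.0.10.4:8020" lines suggest the collector simply cannot reach the NameNode RPC port, so it is worth ruling out basic connectivity before tuning timeouts. A rough sketch of checks run from the collector node; the host and port come from the hdfs.path above, and `nc` is assumed to be installed:

```bash
# Can the collector resolve and reach the NameNode RPC port at all?
ping -c 3 10.0.10.4
nc -vz 10.0.10.4 8020

# Does an HDFS client on this host see the target directory?
hadoop fs -ls hdfs://10.0.10.4:8020/flume/events

# If these fail but succeed with another address of the NameNode
# (e.g. a different IP or its hostname), point
# agent.sinks.hdfs-sink.hdfs.path at that address instead.
```

In this thread the root cause turned out to be exactly that: the sink was pointed at an address the collector could not use, per the 01-06-2017 03:49 PM reply above.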
Labels:
- Apache Flume
- HDFS
01-04-2017
04:57 PM
1 Kudo
Hi @srowen, the issue related to installing R using the EPEL rpm is resolved. I guess I had previously installed the wrong EPEL release package on this machine. To resolve it, I did:

[root@LnxMasterNode01 spark]# yum clean all
[root@LnxMasterNode01 spark]# yum install epel-release
[root@LnxMasterNode01 spark]# yum install R

Now I am able to run 'R'; however, I cannot see it in my Spark home directory, nor does spark/bin have sparkR.

[root@LnxMasterNode01 spark]# ll
total 36276
drwxr-xr-x 3 root root     4096 Oct 21 05:00 assembly
drwxr-xr-x 2 root root     4096 Oct 21 05:00 bin
drwxr-xr-x 2 root root     4096 Oct 21 05:00 cloudera
lrwxrwxrwx 1 root root       15 Nov 25 16:01 conf -> /etc/spark/conf
-rw-r--r-- 1 root root    12232 Jan  4 16:20 epel-release-5-4.noarch.rpm
drwxr-xr-x 3 root root     4096 Oct 21 05:00 examples
drwxr-xr-x 2 root root     4096 Oct 21 05:08 lib
-rw-r--r-- 1 root root    17352 Oct 21 05:00 LICENSE
drwxr-xr-x 2 root root     4096 Jan  2 18:09 logs
-rw-r--r-- 1 root root    23529 Oct 21 05:00 NOTICE
drwxr-xr-x 6 root root     4096 Oct 21 05:00 python
-rw-r--r-- 1 root root 37053596 Jan  4 17:16 R-2.13.0-2.el6.rf.i686.rpm
-rw-r--r-- 1 root root        0 Oct 21 05:00 RELEASE
drwxr-xr-x 2 root root     4096 Oct 21 05:00 sbin
lrwxrwxrwx 1 root root       19 Nov 25 16:01 work -> /var/run/spark/work
[root@LnxMasterNode01 spark]#

Is it the same as SparkR? Please guide.

[root@LnxMasterNode01 ~]# R

R version 3.3.2 (2016-10-31) -- "Sincere Pumpkin Patch"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-redhat-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> q()

My other question, related to sparklyr, is still the same as earlier. Please guide.
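Not an authoritative answer, but for context: the R installed from EPEL is base R, not SparkR. SparkR is the R frontend bundled with upstream Spark (1.4+), launched via a `sparkR` script under `$SPARK_HOME/bin`. A quick check sketch, assuming the CDH parcel path quoted elsewhere in this thread; note that some CDH builds do not ship the SparkR launcher, in which case building/installing SparkR separately (as in the blog linked in the accepted reply) is the route to take:

```bash
# Assumed parcel location; adjust to your install.
SPARK_HOME=/opt/cloudera/parcels/CDH-5.9.0-1.cdh5.9.0.p0.23/lib/spark

# Does this Spark build ship the SparkR launcher and R package?
ls "$SPARK_HOME/bin" | grep -i sparkr
ls "$SPARK_HOME/R" 2>/dev/null

# If present, SparkR can be started against YARN without stopping
# any existing cluster services:
# "$SPARK_HOME/bin/sparkR" --master yarn-client
```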
01-04-2017
09:31 AM
2 Kudos
Hi @srowen, thanks for your reply.

Regarding sparklyr: I already went to the link you mentioned; it gives an example of how to connect to your local Spark, which I have been able to do. However, if I try to connect to my remote Spark cluster running on Cloudera, it gives an error.

library(sparklyr)
sc <- spark_connect(master = "spark://lnxmasternode01.centralus.cloudapp.azure.com:7077", spark_home = "hdfs://40.122.210.251:8020/opt/cloudera/parcels/CDH-5.9.0-1.cdh5.9.0.p0.23/lib/spark", version = "1.6.0")

ERROR:

Created default hadoop bin directory under: C:\Users\diegot\Desktop\hdfs:\40.122.210.251:8020\opt\cloudera\parcels\CDH-5.9.0-1.cdh5.9.0.p0.23\lib\spark\tmp\hadoop
Error in start_shell(master = master, spark_home = spark_home, spark_version = version, :
  SPARK_HOME directory 'hdfs://40.122.210.251:8020/opt/cloudera/parcels/CDH-5.9.0-1.cdh5.9.0.p0.23/lib/spark' not found
In addition: Warning messages:
1: In dir.create(hivePath, recursive = TRUE) :
  cannot create dir 'C:\Users\diegot\Desktop\hdfs:\40.122.210.251:8020', reason 'Invalid argument'
2: In dir.create(hadoopBinPath, recursive = TRUE) :
  cannot create dir 'C:\Users\diegot\Desktop\hdfs:\40.122.210.251:8020', reason 'Invalid argument'
3: In file.create(to[okay]) :
  cannot create file 'C:\Users\diegot\Desktop\hdfs:\40.122.210.251:8020\opt\cloudera\parcels\CDH-5.9.0-1.cdh5.9.0.p0.23\lib\spark\tmp\hadoop\bin\winutils.exe', reason 'Invalid argument'
4: running command '"C:\Users\diegot\Desktop\hdfs:\40.122.210.251:8020\opt\cloudera\parcels\CDH-5.9.0-1.cdh5.9.0.p0.23\lib\spark\tmp\hadoop\bin\winutils.exe" chmod 777 "C:\Users\diegot\Desktop\hdfs:\40.122.210.251:8020\opt\cloudera\parcels\CDH-5.9.0-1.cdh5.9.0.p0.23\lib\spark\tmp\hive"' had status 127

Now, regarding SparkR: my Spark version is 1.6.0. As I said, I have downloaded the SparkR package from https://amplab-extras.github.io/SparkR-pkg/ Do you think it is an old package and I should search for a new one? Once I have the package, do I just untar it on the NameNode, go to the bin directory and execute it? Is that it?

To install R in the Spark home, I got the EPEL RPM and then tried to install R using yum, but it gives an error. I tried some other RPMs, but they give errors too. Using the --skip-broken option is also not working. Please help.

[root@LnxMasterNode01 spark]# rpm -Uvh http://download.fedoraproject.org/pub/epel/6/i386/epel-release-6-8.noarch.rpm
Retrieving http://download.fedoraproject.org/pub/epel/6/i386/epel-release-6-8.noarch.rpm
warning: /var/tmp/rpm-tmp.XuRVi8: Header V3 RSA/SHA256 Signature, key ID 0608b895: NOKEY
Preparing...                ########################################### [100%]
   1:epel-release           ########################################### [100%]
[root@LnxMasterNode01 spark]# yum install R
Loaded plugins: fastestmirror, security
Setting up Install Process
Loading mirror speeds from cached hostfile
 * epel: ftp.osuosl.org
Resolving Dependencies
--> Running transaction check
---> Package R.i686 0:2.13.0-2.el6.rf will be updated
---> Package R.x86_64 0:3.3.2-2.el5 will be an update
--> Processing Dependency: libRmath-devel = 3.3.2-2.el5 for package: R-3.3.2-2.el5.x86_64
--> Processing Dependency: R-devel = 3.3.2-2.el5 for package: R-3.3.2-2.el5.x86_64
--> Running transaction check
---> Package R-devel.x86_64 0:3.3.2-2.el5 will be installed
--> Processing Dependency: R-core-devel = 3.3.2-2.el5 for package: R-devel-3.3.2-2.el5.x86_64
---> Package libRmath-devel.x86_64 0:3.3.2-2.el5 will be installed
--> Processing Dependency: libRmath = 3.3.2-2.el5 for package: libRmath-devel-3.3.2-2.el5.x86_64
--> Running transaction check
---> Package R-core-devel.x86_64 0:3.3.2-2.el5 will be installed
--> Processing Dependency: R-core = 3.3.2-2.el5 for package: R-core-devel-3.3.2-2.el5.x86_64
.
.
.
--> Processing Dependency: libgssapi.so.2()(64bit) for package: libRmath-3.3.2-2.el5.x86_64
---> Package ppl.x86_64 0:0.10.2-11.el6 will be installed
---> Package texlive-texmf-errata-dvips.noarch 0:2007-7.1.el6 will be installed
---> Package texlive-texmf-errata-fonts.noarch 0:2007-7.1.el6 will be installed
--> Finished Dependency Resolution
Error: Package: R-core-3.3.2-2.el5.x86_64 (epel)
           Requires: libtk8.4.so()(64bit)
Error: Package: R-core-3.3.2-2.el5.x86_64 (epel)
           Requires: libtcl8.4.so()(64bit)
Error: Package: R-core-3.3.2-2.el5.x86_64 (epel)
           Requires: libgssapi.so.2(libgssapi_CITI_2)(64bit)
Error: Package: R-core-3.3.2-2.el5.x86_64 (epel)
           Requires: libRblas.so()(64bit)
Error: Package: libRmath-3.3.2-2.el5.x86_64 (epel)
           Requires: libgssapi.so.2(libgssapi_CITI_2)(64bit)
Error: Package: libRmath-3.3.2-2.el5.x86_64 (epel)
           Requires: libgssapi.so.2()(64bit)
Error: Package: R-core-3.3.2-2.el5.x86_64 (epel)
           Requires: libgssapi.so.2()(64bit)
 You could try using --skip-broken to work around the problem
 You could try running: rpm -Va --nofiles --nodigest
[root@LnxMasterNode01 spark]#

I followed this link http://www.jason-french.com/blog/2013/03/11/installing-r-in-linux/ and even it suggests the same. Looking forward to your reply. Am I doing something wrong here?

Thanks, Shilpa
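A side note for readers hitting the same yum errors: the packages being resolved are .el5 builds even though the box is CentOS 6.7, which usually means an EPEL 5 repository is configured. A rough sketch of how one might confirm and correct the mismatch (this mirrors the fix described in the 01-04-2017 04:57 PM follow-up above); exact package names and versions may differ on your system:

```bash
# Which OS release and which EPEL release package are installed?
cat /etc/redhat-release
rpm -qa | grep -i epel

# If an EPEL 5 release package is present on a CentOS 6 host,
# remove it, clean yum metadata, and install the matching one.
yum remove epel-release
yum clean all
yum install epel-release   # or the epel-release-6-8.noarch.rpm for EL6
yum install R
```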
01-03-2017
10:59 AM
Thanks a ton! Marking it as the solution. 🙂 I have another question, if you could help. This is why I need Spark: https://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/Run-SparkR-or-R-package-on-my-Cloudera-5-9-Spark/m-p/49019#U49019
01-03-2017
10:41 AM
1 Kudo
Hi, I have a 3-node cluster with Cloudera 5.9 running on CentOS 6.7. I need to connect my R packages (running on my laptop) to the Spark running in cluster mode on Hadoop. However, if I try to connect the local R through sparklyr to the Hadoop Spark, it gives an error, since it searches for the Spark home on the laptop itself. I googled and found we can install SparkR and use R with Spark. However, I have a few questions. I have downloaded the tar file from https://amplab-extras.github.io/SparkR-pkg/

- Do I just copy it directly to my Linux server and install it?
- Do I have to stop/delete my existing Spark, which is NOT standalone and uses YARN, i.e. it runs in cluster mode? Or can SparkR just run on top of it if I install it on the server?
- Or do I have to run Spark in standalone mode (get the Spark gateways running and start master/slave using the script) and install the package from the Linux command line on top of that?
- If it gets installed, will I be able to access it through the CM UI?

Please help, I am new to this and really need guidance. Thanks, Shilpa
Labels:
- Apache HBase
- Apache Spark
01-03-2017
10:20 AM
Yes, previously (without starting the master from the script) I was able to work in spark-shell but not open the master UI. However, after I started the master through the script, I was able to open it. But as you said, we don't need to do that, since it is only needed in standalone Spark mode, so I will shut it down. What do you say? Thanks, Shilpa