
Oryx 1 failed to train on big data

Explorer

Sean,

 

We hit the following issue when running on a large dataset (about 50 GB).

 

We run Oryx 1 on a node outside the Hadoop cluster.

The computation node is on the same virtual LAN as the Hadoop cluster, so there is no firewall issue.

I have /etc/hadoop/conf on the Oryx computation node.

When running the big dataset, it went through several iterations and then the logs reported two errors:

The first is a file access/override warning.

The second is a "too many open files" error. I raised the ulimit to 65536 but still hit the same issue.

Should I increase the ulimit further? The way I raised it was to increase the limit for the account on the computation node that runs Oryx 1.
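
For reference, this is roughly how I raised the limit for that account on the computation node (the exact entries are from memory, so treat them as approximate):

    # added to /etc/security/limits.conf as root, then logged "oryx" out and back in
    oryx  soft  nofile  65536
    oryx  hard  nofile  65536

    # verified as the "oryx" user
    ulimit -n    # prints 65536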

Also, I suspect the second problem may be related to the first, since the handle on "/etc/hadoop/conf/core-site.xml" may never be closed.

 

Any thoughts?

Thanks

 

Jason

 

(1) Many, many warnings logged like the following (these three lines occupy almost the entire log):

Tue Oct 13 20:55:13 PDT 2015 WARNING file:/etc/hadoop/conf/core-site.xml:an attempt to override final parameter: hadoop.ssl.keystores.factory.class;  Ignoring.
Tue Oct 13 20:55:13 PDT 2015 WARNING file:/etc/hadoop/conf/core-site.xml:an attempt to override final parameter: hadoop.ssl.server.conf;  Ignoring.
Tue Oct 13 20:55:13 PDT 2015 WARNING file:/etc/hadoop/conf/core-site.xml:an attempt to override final parameter: hadoop.ssl.client.conf;  Ignoring.

 

(2)

Tue Oct 13 20:56:57 PDT 2015 INFO Reading estimates from /user/xxxx/00001/tmp/iterations/9/Yconvergence/
Tue Oct 13 20:56:59 PDT 2015 WARNING I/O error constructing remote block reader.
java.net.SocketException: Too many open files....

java.net.SocketException: Too many open files
    at sun.nio.ch.Net.socket0(Native Method)
    at sun.nio.ch.Net.socket(Net.java:415)
    at sun.nio.ch.Net.socket(Net.java:408)
    at sun.nio.ch.SocketChannelImpl.<init>(SocketChannelImpl.java:105)
    at sun.nio.ch.SelectorProviderImpl.openSocketChannel(SelectorProviderImpl.java:60)
    at java.nio.channels.SocketChannel.open(SocketChannel.java:145)
    at org.apache.hadoop.net.StandardSocketFactory.createSocket(StandardSocketFactory.java:62)
    at org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:2883)
    at org.apache.hadoop.hdfs.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:747)
    at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:662)
    at org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:326)
    at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:570)
    at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:793)
    at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:840)
    at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:648)
    at java.io.FilterInputStream.read(FilterInputStream.java:83)
    at java.util.zip.CheckedInputStream.read(CheckedInputStream.java:59)
    at java.util.zip.GZIPInputStream.readUByte(GZIPInputStream.java:266)
    at java.util.zip.GZIPInputStream.readUShort(GZIPInputStream.java:258)
    at java.util.zip.GZIPInputStream.readHeader(GZIPInputStream.java:164)
    at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:79)
    at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:91)
    at com.cloudera.oryx.common.servcomp.Store.streamFrom(Store.java:152)
    at com.cloudera.oryx.common.servcomp.Store.readFrom(Store.java:174)
    at com.cloudera.oryx.als.computation.ALSDistributedGenerationRunner.readUserItemEstimates(ALSDistributedGenerationRunner.java:360)
    at com.cloudera.oryx.als.computation.ALSDistributedGenerationRunner.areIterationsDone(ALSDistributedGenerationRunner.java:304)
    at com.cloudera.oryx.computation.common.DistributedGenerationRunner.runSteps(DistributedGenerationRunner.java:104)
    at com.cloudera.oryx.computation.common.GenerationRunner.runGeneration(GenerationRunner.java:236)
    at com.cloudera.oryx.computation.common.GenerationRunner.call(GenerationRunner.java:109)
    at com.cloudera.oryx.computation.PeriodicRunner.run(PeriodicRunner.java:214)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)


Re: Oryx 1 failed to train on big data

Master Collaborator

The first issue is more of a Hadoop configuration issue. Your local Hadoop configuration specifies some settings that your cluster admin has marked as final (non-overridable), so attempts to override them are ignored. These settings are unlikely to be relevant, however, so I think you can ignore the warnings. Certainly, this app does not set or care about them.

 

The second problem does indeed indicate that the machine somehow has too many files or connections open. I might expect this on a very busy machine, and would suggest a higher ulimit. But you have definitely upped this limit and still see it? Is the machine exceptionally busy with data nodes, etc.?


What I can do is look at the app's use of things like HDFS connections and add some additional defensive checks to close unused streams early. Although I don't think this is the problem here, it can't hurt. Are you in a position to try SNAPSHOT builds if I push updates?


Re: Oryx 1 failed to train on big data

Explorer

Sean,

Thanks for the reply.

For the 2nd issue

(1) Yes, it's running in a busy cluster. However, the Oryx job is submitted from a dedicated account (let's call it "oryx") on a dedicated VM.

And I set the ulimit of the account "oryx" to 65536. I mean, although the cluster is busy, the open-file limit is account-specific, right?

So, 65536 open files are available to the account "oryx" alone, not shared with other Hadoop users.
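
For example, this is how I check it on the computation node (just an illustration of what I mean by account-specific):

    # run as root on the Oryx computation node
    su - oryx -c 'ulimit -Sn; ulimit -Hn'    # soft and hard open-file limits for the "oryx" account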

(2) Yes, I can take your SNAPSHOT. Even better, it would be great if you could point me to your changes (in Git), so I can just copy your defensive-check code in.

 

 


Re: Oryx 1 failed to train on big data

Master Collaborator

I believe ulimit is per-process, even. You're setting this as root on the VM that's running the driver process, which is the one that sees this error, right? Just making sure the setting is actually taking hold.

 

I'm honestly not sure what would run out of connections anyway. It doesn't do a whole lot except talk to HDFS a little from the driver. I already committed two changes today that take a guess at leaking connections that stick around and don't get cleaned up, but I'm still skeptical that's the problem.

 

It is a little hard to know what has too many connections open. You can use "lsof -p [pid]" on the driver to see what files/sockets it has open. Maybe that gives a clue, though the output is inevitably noisy.
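
For example, something along these lines gives a rough picture (treat it as a sketch and substitute the driver's real PID):

    lsof -p <driver-pid> | wc -l                                           # total open descriptors
    lsof -p <driver-pid> | awk '{print $5}' | sort | uniq -c | sort -rn    # counts by type (REG, IPv4, ...)

If the socket count keeps climbing across iterations, that would point at leaked HDFS connections rather than ordinary files.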


Re: Oryx 1 failed to train on big data

Explorer

Sean,

 

Thanks for the suggestion.

I will try the lsof approach first and move to your two commits if necessary.

 

Meanwhile, I'm trying to understand what you said: "...You're setting this as root on the VM that's running the driver process..."

Let me clarify what I did:

(1) I am running an Oryx 1 computation node outside the Hadoop cluster.

(2) The computation node is on the same LAN as the Hadoop cluster and uses the /etc/hadoop/conf info to talk to the cluster.

(3) The computation node has a user account named "oryx" that runs oryx-computation.jar. The ulimit of the user "oryx" is set to 65536 on the *Oryx computation* node, not on any Hadoop cluster node.

(4) There is an HDFS account called "oryx" in the Hadoop cluster, so it can write to HDFS from the Oryx computation node.

 

Based on what you said, it looks like I need to set the ulimit on the VM that runs the driver process... Right? Can you clarify that?

Which VM runs the driver process? I think it's a node inside the Hadoop cluster, not the Oryx computation node. Correct?

 

Thanks.

 


Re: Oryx 1 failed to train on big data

Master Collaborator

Raising the hard limit on open file descriptors is something only the superuser can do. "ulimit -n" shows the max you've configured for your process and its children; you can pass it a number to raise the limit, but make sure it actually took effect by running the command again. If it isn't actually being set, you may have to lift the hard maximum allowed by your OS. See things like http://askubuntu.com/questions/162229/how-do-i-increase-the-open-files-limit-for-a-non-root-user

 

This may not be a problem; just check that ulimit -n prints 65536 before you run. Yes, it is your driver process that is running out of file handles so it needs the increased limit. Yes, all of this configuration is entirely contained within your VM; it has nothing to do with the host machine or other nodes.
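
For example, roughly (assuming the driver is launched from oryx-computation.jar as you described; adjust names to your setup):

    # in the shell that launches the driver, as the "oryx" user
    ulimit -n                                                        # should print 65536

    # and for a driver that is already running
    cat /proc/$(pgrep -f oryx-computation)/limits | grep 'Max open files'

If that second command does not show 65536, the limit isn't actually reaching the process.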

 

I don't think the particular accounts or the network topology matter.
