
H2O context issues on CDH-5.7.1 with Spark 1.6.0

Explorer

Background: We have recently been trying to integrate H2O with our existing ML models on the Spark processing engine. The integration runs successfully on a single-node cluster. However, when we tried to get the H2O context going on CDH-5.7.1, we ran into a strange issue: the conversion from a Spark DataFrame to an H2O frame fails.

hc.as_h2o_frame(df)

 

H2O says that its version 3.8.2.6 expects Spark 1.6.1, and CDH-5.7.1 ships only with 1.6.0, so is this a version issue? We have tried getting Spark 1.6.1 onto CDH-5.7.1, but with no luck.

 

Has anyone else faced this issue?

6 REPLIES

Re: H2O context issues on CDH-5.7.1 with Spark 1.6.0

Master Collaborator

Really, CDH contains "1.6.x", meaning some set of maintenance patches that would be found in 1.6.1, 1.6.2 from upstream. I think the issue is that H2O may need to relax the version check a bit, as it may work just fine with other 1.6.x releases. I don't know what specific change means that 1.6.1 is desired.
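A relaxed check of the kind suggested here could look roughly like the sketch below. It is only an illustration, not H2O's actual code: the function names are hypothetical, and the vendor-suffixed string `1.6.0-cdh5.7.1` is an assumed example format rather than something confirmed in this thread.

```python
import re


def parse_version(version):
    """Return (major, minor, patch) from strings like '1.6.0' or '1.6.0-cdh5.7.1'."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError("unrecognised version string: %r" % version)
    return tuple(int(part) for part in match.groups())


def same_maintenance_branch(required, actual):
    """True when both versions share major.minor.

    The patch level and any vendor suffix are deliberately ignored.
    """
    return parse_version(required)[:2] == parse_version(actual)[:2]


# Hypothetical usage: the required version is 1.6.1, the cluster reports 1.6.0.
print(same_maintenance_branch("1.6.1", "1.6.0"))
print(same_maintenance_branch("1.6.1", "1.6.0-cdh5.7.1"))
print(same_maintenance_branch("1.6.1", "2.0.0"))
```

A check like this only compares version numbers, so it cannot tell whether a distribution has backported changes into its maintenance branch.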

Re: H2O context issues on CDH-5.7.1 with Spark 1.6.0

Explorer

@srowen: The problem is clearly with Cloudera's version of Spark. H2O works fine with the official release of Spark 1.6.x; however, it cannot work with Cloudera's version of Spark. Can you point me to a CDH distro that has the same official Spark version, 1.6.0 or higher? I appreciate your help.

Re: H2O context issues on CDH-5.7.1 with Spark 1.6.0

Master Collaborator

No, from what you describe, H2O is looking for a version string "1.6.1". This has nothing to do with Spark at all, right?

 

I'm suggesting that I'd expect H2O works fine with anybody's 1.6.x release, because changes in a maintenance branch are fixes only. It's possible it requires some particular fix, but that is possibly already in CDH, and if that were blocking users we'd be able to fix that.

 

CDH contains 1.6.0 + patches (see release notes for exactly what), but that's obvious.

 

The place to start is understanding why this requirement exists and if it's necessary, but that is an H2O question.

Re: H2O context issues on CDH-5.7.1 with Spark 1.6.0

Explorer

@srowen: So, please allow me to clear up a few things. H2O version 3.8.2.6 works well with the official Spark 1.6.1, and we have tested whether that version of H2O works with 1.6.0, and it does. The H2O code runs fine on a standalone cluster. However, when we port it to a cluster running CDH-5.7.1, it fails. I have checked with H2O on Gitter, and it seems to be an issue with how H2O looks at 'official' Spark versions. Since Cloudera tinkered with 1.6.0 by applying some patches, H2O seems to fail with that version (it is not a simple version string check).

 

You might want to take a look at this link: https://groups.google.com/forum/#!msg/h2ostream/j09WstM_waQ/7-1I4id6AgAJ

 

In a nutshell, it is an issue with H2O integrating with Cloudera's version of Spark.

Re: H2O context issues on CDH-5.7.1 with Spark 1.6.0

Master Collaborator

OK, that helps. Actually, I've seen this before, and Zeppelin had a similar issue. The issue is that they're accessing a private Spark data structure which changed in upstream Spark 2.0, and that change was also applied to CDH Spark 1.6.x. The reasons why it was done are obscure to me, but it's also not an API. I'm afraid it's an H2O problem though, because these internals should not be accessed by a Spark app.

Re: H2O context issues on CDH-5.7.1 with Spark 1.6.0

Master Collaborator

(To finish the thought, I believe you'll find that therefore H2O requires upstream Spark < 2.0.0, not >= 1.6.1. It would be relatively rare to require a maintenance release.)
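To illustrate why a version number alone does not predict this kind of failure, here is a minimal sketch in which two builds could report the same version while exposing different internals. The classes and method names are hypothetical stand-ins, not Spark's or H2O's real structures. The sketch probes for the member that is actually present instead of trusting the version string.

```python
class UpstreamInternals(object):
    """Hypothetical stand-in for the internal layout an integration was written against."""

    def legacy_field(self):
        return "value"


class PatchedInternals(object):
    """Hypothetical stand-in for the same class after a later change was backported."""

    def renamed_field(self):
        return "value"


def read_internal(obj):
    """Try each known accessor in turn and call the first one that exists."""
    for name in ("legacy_field", "renamed_field"):
        accessor = getattr(obj, name, None)
        if callable(accessor):
            return accessor()
    raise AttributeError("no known accessor on %s" % type(obj).__name__)


# Both layouts are handled, regardless of what version string the build reports.
print(read_internal(UpstreamInternals()))
print(read_internal(PatchedInternals()))
```

Probing like this is still a workaround: as noted above, private internals are not an API, so relying on them at all remains fragile.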
