Background: We have recently been trying to integrate H2O with existing ML models on the Spark processing engine. We have successfully run the integration on a single-node cluster. However, when we tried to get the H2O context going on CDH-5.7.1, we ran into a strange issue where the conversion from a Spark DataFrame to an H2O frame fails.
Since H2O says it expects Spark version 1.6.1 with their version 220.127.116.11, and since CDH-5.7.1 ships only with 1.6.0, is this a version issue? We have tried to get Spark 1.6.1 running on CDH-5.7.1, but no luck.
Has anyone out there faced this issue?
Really, CDH contains "1.6.x", meaning some set of maintenance patches that would be found in 1.6.1, 1.6.2 from upstream. I think the issue is that H2O may need to relax the version check a bit, as it may work just fine with other 1.6.x releases. I don't know what specific change means that 1.6.1 is desired.
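To illustrate what a relaxed check could look like, here is a minimal sketch in Python. It is not H2O's actual code; the function names and the vendor-suffixed version string format ("1.6.0-cdh5.7.1") are assumptions made for illustration. The idea is to compare only the major and minor components, so any 1.6.x build passes.

```python
import re


def parse_major_minor(version):
    """Return (major, minor) from a string such as '1.6.1' or '1.6.0-cdh5.7.1'.

    Hypothetical helper: anything after the first two numeric components
    (patch level, vendor suffix) is ignored.
    """
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError("unrecognised version string: %r" % version)
    return int(match.group(1)), int(match.group(2))


def is_compatible(spark_version, required="1.6"):
    """Accept any maintenance release on the required major.minor line."""
    return parse_major_minor(spark_version) == parse_major_minor(required)


if __name__ == "__main__":
    for candidate in ("1.6.0-cdh5.7.1", "1.6.1", "2.0.0"):
        print(candidate, is_compatible(candidate))
```

A check like this only answers whether the version label is acceptable. It says nothing about whether a vendor build changed something underneath that the application depends on.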
@srowen: The problem is clearly with Cloudera's version of Spark. H2O works fine with the official release of Spark 1.6.x; however, it does not work with Cloudera's version of Spark. Can you point me to a CDH distro that ships the same official Spark, version 1.6.0 or higher? I appreciate your help.
No, from what you describe, H2O is looking for a version string "1.6.1". This has nothing to do with Spark at all, right?
I'm suggesting that I'd expect H2O to work fine with anybody's 1.6.x release, because changes in a maintenance branch are fixes only. It's possible it requires some particular fix, but that is possibly already in CDH, and if that were blocking users we'd be able to fix that.
CDH contains 1.6.0 + patches (see release notes for exactly what), but that's obvious.
The place to start is understanding why this requirement exists and if it's necessary, but that is an H2O question.
@srowen: Please allow me to clear up a few things. H2O version (18.104.22.168) works well with the official Spark version 1.6.1, and we have tested whether that version of H2O works with 1.6.0, and it does. The H2O code runs fine on a standalone cluster. However, when we port it to a cluster running CDH-5.7.1, it fails. I have checked with H2O on Gitter, and it seems to be an issue with how H2O looks at 'official' Spark versions. Since Cloudera tinkered with 1.6.0 by applying some patches, H2O seems to fail with that version. (It is not a simple version string check.)
You might want to take a look at this link: https://groups.google.com/forum/#!msg/h2ostream/j09WstM_waQ/7-1I4id6AgAJ
In a nutshell, it is an issue with H2O integrating with Cloudera's version of Spark.
OK, that helps. Actually, I've seen this before, and Zeppelin had a similar issue. The issue is that they're accessing a private Spark data structure which changed in upstream Spark 2.0, and that change was also applied to CDH Spark 1.6.x. The reasons why it was done are obscure to me, but it's also not an API. I'm afraid it's an H2O problem though, because these internals should not be accessed by a Spark app.
(To finish the thought, I believe you'll find that H2O therefore requires upstream Spark < 2.0.0, not >= 1.6.1. It would be relatively rare to require a maintenance release.)