11-01-2016 06:00 AM
I am interested to know how people are using R with Impala.
- I am currently using the generic RODBC package with the Cloudera ODBC driver. This has the advantage that we can use our LDAP credentials instead of Kerberos. For now it works OK, though it is a bit slow. It also works with RODBCDBI, although I am not able to list all the tables. Unfortunately, using it directly with dplyr seems more difficult.
- I am also aware of RImpala (see https://github.com/Mu-Sigma/RImpala), which is JDBC based. However, the package seems to have received few updates, and the install procedure is a little more complicated.
- An R package based on hs2client (https://github.com/cloudera/hs2client) would have been nice, but it looks like not much further development is happening there.
Are there any others?
Finally, there is the question of pushing data to Hadoop/Impala. The ODBC connection is definitely not the best approach for that. Does anyone have experience with pushing data?
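For reference, the RODBC route described above can be sketched roughly as follows. This is a minimal sketch, not the exact setup: the DSN name "Impala" and the environment variables IMPALA_USER and IMPALA_PASSWORD are assumptions made for illustration, and the DSN is assumed to be configured for the Cloudera ODBC driver with LDAP authentication.

```r
# Minimal sketch: run one query against Impala over ODBC with RODBC.
# Assumptions (not from the original post): a DSN named "Impala" exists,
# and LDAP credentials are stored in the hypothetical environment
# variables IMPALA_USER and IMPALA_PASSWORD.
query_impala <- function(sql, dsn = "Impala",
                         uid = Sys.getenv("IMPALA_USER"),
                         pwd = Sys.getenv("IMPALA_PASSWORD")) {
  channel <- RODBC::odbcConnect(dsn, uid = uid, pwd = pwd)
  # odbcConnect returns a non-RODBC value (with a warning) on failure
  if (!inherits(channel, "RODBC")) {
    stop("could not connect to DSN ", dsn)
  }
  # Close the connection when the function exits, even on error
  on.exit(RODBC::odbcClose(channel), add = TRUE)
  # Returns a data frame on success, or error messages as a character vector
  RODBC::sqlQuery(channel, sql, stringsAsFactors = FALSE)
}
```

Calling query_impala("SHOW TABLES") would then return the tables in the default database, assuming the DSN and credentials are valid.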
03-28-2017 04:51 PM
There is a very new package on CRAN called implyr that I have had good luck with. I've had much less luck with the Spark R connections, but SparkR and sparklyr are worth a shot.
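To give an idea of what implyr looks like with dplyr, here is a rough sketch based on the pattern in the implyr README. The DSN name "Impala", the table name "flights", and the columns carrier and dep_delay are all hypothetical; it also assumes the implyr, dplyr, DBI, and odbc packages are installed and R is version 4.1 or later (for the native pipe).

```r
# Sketch: use dplyr verbs against an Impala table through implyr.
# Assumptions: DSN "Impala" and table "flights" with columns
# carrier and dep_delay are hypothetical names for illustration.
delay_by_carrier <- function(dsn = "Impala", table = "flights") {
  # implyr wraps a DBI-compatible driver; here the odbc package
  impala <- implyr::src_impala(drv = odbc::odbc(), dsn = dsn)
  # Disconnect when the function exits, even on error
  on.exit(DBI::dbDisconnect(impala), add = TRUE)
  dplyr::tbl(impala, table) |>
    dplyr::group_by(carrier) |>
    dplyr::summarise(avg_delay = mean(dep_delay, na.rm = TRUE)) |>
    dplyr::collect()  # run the query in Impala and pull the result into R
}
```

The aggregation is translated to SQL and executed in Impala; only the summarised rows come back to R when collect() is called.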