As described in the title, while I was uploading a big file from my laptop to 'Files View' in Ambari, my laptop ran out of power and shut down last night. So I need to upload this file again, but first I need to make room for it; below is the pic from my command-line output. I would like some help cleaning up HDFS disk usage, which has gone up to 78%, yet there is no file where I uploaded it (I thought I would find a partial/corrupted file that I could expunge). On a side note, is there a way to upload a 10 GB file faster from my local machine into the HDP sandbox? Thanks
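For the cleanup side of the question, a few HDFS commands can usually locate where the space went. A minimal sketch, to be run on the sandbox itself; the `/user/admin` trash path is an assumption (use whichever user performed the Files View upload):

```shell
# Sketch of locating and reclaiming HDFS space after an aborted upload;
# skips gracefully when the hdfs client is not installed.
if command -v hdfs >/dev/null 2>&1; then
  hdfs dfs -du -h /                    # which top-level dirs hold the space
  hdfs dfs -ls -R /user/admin/.Trash   # deleted/aborted uploads often sit in trash
  hdfs fsck / -openforwrite            # files left half-written by an aborted upload
  hdfs dfs -expunge                    # empty trash now instead of waiting for the interval
else
  echo "hdfs client not found; run this on the sandbox"
fi
```

On the upload-speed side note: copying the file into the VM first (e.g. with `scp`) and then running `hdfs dfs -put` is typically much faster than a browser upload through Files View.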
I don't have a definitive post to replicate what I am doing, and after my first attempt last week left me confused, I would like to know the following. I have the Hortonworks Sandbox running on one laptop (I have accessed its HiveServer2 using ODBC/JDBC drivers). I want to access the HDFS files with Spark (SparkR in my case) from a different laptop. From all the posts that I have read, I will be submitting the Spark jobs to YARN and need to copy some files from /etc/hadoop/conf on the sandbox (particularly core-site.xml, yarn-site.xml, hdfs-site.xml) to the /conf dir in the laptop's SPARK_HOME. I think I read somewhere that I should map the IP address in the hosts file on Windows, which makes sense, as I am thinking http://sandbox.hortonworks.com would map to http://<ip-address>. Then I can specify "yarn" for the master without giving a URL, since the mapping would take care of the URL part. Thanks
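The setup described above can be sketched as follows. The IP address, the config directory, and `my_job.R` are placeholders, not values from your environment:

```shell
# 1) On the Windows laptop, add the sandbox to the hosts file
#    (C:\Windows\System32\drivers\etc\hosts), e.g.:
#      192.168.1.50  sandbox.hortonworks.com
#
# 2) Point Spark at the copied Hadoop config files; spark-submit reads the
#    YARN ResourceManager address from yarn-site.xml there, so
#    "--master yarn" needs no explicit URL.
export HADOOP_CONF_DIR="$HOME/hadoop-conf"   # where core-site.xml etc. were copied
if command -v spark-submit >/dev/null 2>&1; then
  spark-submit --master yarn --deploy-mode client my_job.R
else
  echo "spark-submit not found; run this where Spark is installed"
fi
```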
I am trying to troubleshoot my way into using HDFS on the HDP sandbox from my Windows R session (SparkR), and it has been a frustrating journey. I have a quick question here, though: in the post linked below, what is meant by "Note: HDFS-HA is the nameservice as defined by dfs.nameservices"? Is it something like "sandbox.hortonworks.com:8020"? Again, I am trying to access the HDP Sandbox HDFS and read those files in my Windows R session, to do analysis outside the HDP sandbox on the data stored in HDFS. Thanks a lot. https://community.hortonworks.com/content/supportkb/49491/how-to-access-hdfs-files-using-spark-in-r-for-name.html
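For context: `dfs.nameservices` only exists in an HA (high-availability) setup, where it names a logical pair of NameNodes; the single-NameNode sandbox does not define one, so the URI there really is the plain host:port form. A sketch of the two cases (the property values below are illustrative, not taken from your sandbox):

```
<!-- Non-HA (the sandbox): fs.defaultFS in core-site.xml points at one NameNode,
     and clients use hdfs://sandbox.hortonworks.com:8020/path -->
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://sandbox.hortonworks.com:8020</value>
</property>

<!-- HA cluster: dfs.nameservices in hdfs-site.xml defines a logical name,
     and clients use hdfs://HDFS-HA/path instead of a host:port -->
<property>
  <name>dfs.nameservices</name>
  <value>HDFS-HA</value>
</property>
```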
I am using the HDP sandbox to learn to use Spark for processing data stored with Apache Hadoop. For the sake of keeping my sandbox running, I only start the following services: HDFS, YARN, MapReduce2, Hive, ZooKeeper, and Spark2 (I even leave Spark turned off). Will I be able to succeed with my objective of using the HDP sandbox to get familiar with using Hadoop with R/Spotfire? I have the services turned on with HiveServer2, Spark SQL Thrift Server, Livy server, and the Spark cluster, with accessing/collecting data as my primary objective, even if I have to stick to table/dataframe data. I thought ZooKeeper is somehow important, which is why it's on as well, and I also suspect Ranger is another service I might need turned on. Please clarify and advise. (I have the sandbox running on one laptop with 6 GB RAM and 4 cores, which uses about 4 GB of RAM when the sandbox/VM starts up and stabilizes. I do nothing else on that laptop, as long as it can give me access to the sandbox and remain stable; I then access the sandbox from another laptop with port forwarding, where I run R or Spotfire and collect/access data from the HDP sandbox.)
I was running the Pi example from the 'A Lap Around Apache Spark' tutorial, and when the job finished it had taken 70 seconds, which I thought was a tad slow. My hunch is (since when I go over to the YARN configs there are some warnings in yellow) that maybe I should first do the "Configuring YARN Capacity Scheduler with Apache Ambari" tutorial (plus possibly go through the Apache Spark Component Guide as well).
In addition, this warning is also confusing when I run the Pi example with Spark:
17/11/06 19:19:25 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041
I hope my question is not too general or broad, but there is much to learn here, and I would appreciate anyone sharing their knowledge in my quest to integrate Hadoop data via Hive tables and Spark clusters into my analytics pipeline. Thanks and happy winter.
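On the SparkUI warning: it is harmless — port 4040 was already taken (typically by the Spark Thrift Server or another running Spark application), so the new job simply moved to 4041. If the message bothers you, a fixed UI port can be set per job; a sketch, where 4050 is an arbitrary free port and the examples jar path is the usual HDP location (verify it on your sandbox):

```shell
# Pin the Spark UI to a known free port instead of letting it probe 4040, 4041, ...
if command -v spark-submit >/dev/null 2>&1; then
  spark-submit --master yarn --conf spark.ui.port=4050 \
    --class org.apache.spark.examples.SparkPi \
    /usr/hdp/current/spark2-client/examples/jars/spark-examples*.jar 10
else
  echo "spark-submit not found; run this on the sandbox"
fi
```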
I have many questions, as I have been fiddling with the Sandbox as a Hadoop newbie, so let me start with the more basic ones first:
I have seen that from the CLI/shell one can view `/usr/hdp/current/spark2-thriftserver/conf/hive-site.xml` or `/usr/hdp/current/spark2-client/conf/hive-site.xml` and, under the port property, find the listed port (10016) for the Thrift Server. Is this the efficient/preferred way to do this?
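Grepping the config is indeed a common way to find the port. A sketch against a sample file — the here-doc below stands in for the real hive-site.xml; on the sandbox, point the pipeline at `/usr/hdp/current/spark2-thriftserver/conf/hive-site.xml` instead:

```shell
# Create a stand-in hive-site.xml fragment for demonstration
cat > /tmp/hive-site-sample.xml <<'EOF'
<property>
  <name>hive.server2.thrift.port</name>
  <value>10016</value>
</property>
EOF

# Extract the Thrift Server port: find the property name, then pull the
# digits out of the <value> element on the following line
grep -A1 'hive.server2.thrift.port' /tmp/hive-site-sample.xml \
  | sed -n 's/.*<value>\([0-9]*\)<\/value>.*/\1/p'
# → 10016
```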
Further, I am doing this in order to set up an ODBC Spark SQL connection to a visualization tool, Spotfire.
I have successfully connected to the Hive tables in HiveServer2 from Spotfire on my laptop at port 10000 by downloading the Apache Hive connector; now I am hoping to do the same with the Spark ODBC driver. Any hints or advice?
I am a newbie to HDP and just trying to learn to work with data in the Hadoop file system, but frankly I don't know what the reason is to prefer one connector over the other, other than that I'd like to be able to connect with the different methods. (I am an R user and succeeded in getting the Hive tables into R as well with ODBC connectors; anything I can do in R running on my laptop I could use with Spotfire, which is what I am currently using for analytics.) A discussion/answer on this point will be much appreciated.
Then there are some more challenging things I'd like to do. (You see, I understand that I can install R on the HDP sandbox and carry out computations there; I have seen the SparkR predicting-airline-delays tutorial. But if I can connect to the data in HDP HDFS from outside the sandbox, I can start leveraging R's power with the Spotfire client's built-in R engine on data from the Hadoop file system. Apparently Spotfire Server has a lot more data access/connectivity options, but I don't have access to Spotfire Server. So with that in mind, some of the things I am trying to get to are:)
With SparkR from an R session running on a Windows laptop, how can I use (csv) files in HDFS on the HDP sandbox to construct a SparkDataFrame, either using a Hive table in HiveServer2 or some other way? I can only think of extracting the data that I am interested in from HDP and then making it into a SparkDataFrame to carry out analysis with the SparkR library in the Windows R session. But is there such an option as connecting to a remote Spark cluster in the HDP sandbox?
And 'Livy for Spark2 Server': is this something I should get familiar with first, for my purpose of accessing data from outside the sandbox? Here is a reference from the sparklyr package that alludes to this possibility: https://spark.rstudio.com/deployment.html
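Livy is indeed one way to reach a remote Spark cluster without installing Hadoop client config locally: it exposes Spark sessions over a REST API that tools like sparklyr can talk to. A sketch of the raw API, assuming port forwarding to localhost and the upstream default Livy port 8998 (the sandbox may configure a different port; check Ambari):

```shell
# Sketch of Livy's REST API; skips gracefully if Livy is not reachable.
LIVY=http://localhost:8998
if curl -s --connect-timeout 2 "$LIVY/sessions" >/dev/null 2>&1; then
  # create an interactive Spark session ...
  curl -s -X POST -H 'Content-Type: application/json' \
       -d '{"kind": "spark"}' "$LIVY/sessions"
  # ... then poll /sessions and post code to /sessions/{id}/statements
  curl -s "$LIVY/sessions"
else
  echo "Livy not reachable; run this with the sandbox up and port-forwarded"
fi
```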
Thanks. I don't know how naive my questions are, but bear with me; any clarification or attempt at one will be really appreciated.