I'm new to Hadoop and I've deployed the sandbox into a VM with 32GB of RAM.
However, Hive queries and everything else run very, very slowly.
Could it be the VM?
Also, I don't have multiple nodes, only a single node... can this considerably degrade performance?
Many thanks in advance.
It could be many things.
1. What volume of data is under consideration in the Hive queries?
2. What file format is the data stored in?
3. How was the data prepared and loaded (sorting, partitioning, etc.)?
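Point 2 and 3 matter a lot in practice: querying raw CSV text is much slower than querying a columnar format. As a hedged sketch (table and column names are hypothetical, not from the tutorial), converting staged CSV data into a partitioned ORC table typically speeds up Hive queries considerably:

```sql
-- Hypothetical staging table over raw CSV text
CREATE TABLE trips_csv (
  driver_id STRING,
  city      STRING,
  miles     DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Rewrite into a partitioned, columnar ORC table
CREATE TABLE trips_orc (
  driver_id STRING,
  miles     DOUBLE
)
PARTITIONED BY (city STRING)
STORED AS ORC;

-- Allow dynamic partitioning for the one-off copy
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE trips_orc PARTITION (city)
SELECT driver_id, miles, city FROM trips_csv;
```

After the copy, point queries against trips_orc can prune partitions and skip columns instead of scanning the whole CSV.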
There isn't enough information in your question for anyone to give a single answer that will help you. You may have to explore a bit and provide more details.
Yes, a single node has limitations. It isn't that it intentionally degrades performance; the system is designed to scale through parallelism, and with only a single node you are limiting the software's ability to scale (if that is what is needed).
Sandbox is meant for tutorials and exploration of simple capabilities on small data. If you want to try the actual HDP software on real data, you can install a small multi-node cluster using the HDP installation processes documented at docs.hortonworks.com.
I am new to HDP Sandbox and also find it quite slow.
Using the example CSV file from the getting started tutorial (https://hortonworks.com/tutorial/hadoop-tutorial-getting-started-with-hdp/), the following cell takes 9 seconds to execute.
val geoLocationDataFrame = spark.read.format("csv").option("header", "true").load("hdfs:///tmp/data/geolocation.csv")
Took 9 sec.
I would expect only a few ms to load a small CSV file.
My VM setup is based on VMware and has 10 GB of RAM and 4 cores at 3.2 GHz.
Are there benchmarks with reference numbers for expected execution times, or some sort of profiling tool to easily find bottlenecks?
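I'm not aware of an official reference benchmark for the sandbox, but two things may help. First, much of a first cell's latency is typically JVM/session start-up and YARN container allocation rather than the CSV read itself, so later runs of the same cell are usually faster. Second, for ad-hoc measurements Spark offers spark.time, and the Spark UI (usually on port 4040) breaks each job into stages and tasks so you can see where time goes. A minimal sketch, assuming a running spark session as in the tutorial:

```scala
// Time the full read + an action; note that with header=true the
// read itself already triggers a small job to fetch the header line,
// so wrapping read + count captures the real work.
spark.time {
  val df = spark.read
    .format("csv")
    .option("header", "true")
    .load("hdfs:///tmp/data/geolocation.csv")
  df.count()
}
```

Running it two or three times in a row separates one-off start-up cost from the steady-state read time.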