Member since 11-15-2016 · 50 Posts · 2 Kudos Received · 0 Solutions
02-23-2018 06:15 PM
Hi all, I installed an HDP cluster with Cloudbreak and am trying to run a simple Spark job. I open the pyspark shell and run the following:

ip = "adl://alenzadls1.azuredatalakestore.net/path/to/my/input/directory"
input_data = sc.textFile(ip)
for x in input_data.collect(): print x

The print statement returns an error:

Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/hdp/current/spark-client/python/pyspark/rdd.py", line 771, in collect
port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
File "/usr/hdp/current/spark-client/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
File "/usr/hdp/current/spark-client/python/pyspark/sql/utils.py", line 45, in deco
return f(*a, **kw)
File "/usr/hdp/current/spark-client/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.adl.AdlFileSystem not found

Can someone point out where this is going wrong? I did not find anything related to this online.
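For what it's worth, a ClassNotFoundException on AdlFileSystem usually means the jar providing the ADLS connector is not on Spark's classpath, not that the path is wrong. A toy sketch (plain Python, not Hadoop code; the config dict is hypothetical) of how Hadoop picks the FileSystem class from the URI scheme:

```python
from urllib.parse import urlparse

# Toy illustration: Hadoop resolves a URI scheme ("adl") to the config key
# fs.<scheme>.impl and loads the class named there. If the jar that contains
# that class is missing from the classpath, loading fails with
# ClassNotFoundException, as in the traceback above.
conf = {
    # Hypothetical config entry; the class name matches the one in the error.
    "fs.adl.impl": "org.apache.hadoop.fs.adl.AdlFileSystem",
}

def filesystem_class(uri, conf):
    scheme = urlparse(uri).scheme
    return conf.get("fs.%s.impl" % scheme)

print(filesystem_class("adl://alenzadls1.azuredatalakestore.net/some/path", conf))
```

If that is the cause, the usual remedy (in my experience; verify the jar names against your HDP install) is to make the ADLS connector jars, e.g. hadoop-azure-datalake and its Azure Data Lake Store SDK dependency, available to the Spark driver and executors (for instance via pyspark --jars), and to confirm fs.adl.impl is set in core-site.xml.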
11-30-2017 08:20 PM
@aengineer Hi, thanks for your reply. It seems someone did perform a massive delete operation on the cluster. The issue is resolved.
11-30-2017 03:16 PM
One of the disks on one of my data nodes was failing, so I replaced it using these steps:
1. Stop all services on the datanode.
2. Shut down the machine.
3. Replace the disk.
4. Power on the machine.
5. Mount the disk at its mount point.
6. Start all HDFS services.
Now I get an alert "Pending Deletion Blocks:[276861]" in Ambari. Did I do something wrong? Can I revert it?
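My understanding (hedged; the numbers below are made up) is that this alert is often harmless after a disk swap: while the disk was out, HDFS re-replicated its blocks elsewhere, so when the node rejoins, some blocks are over-replicated and the NameNode queues the excess replicas for deletion. A toy sketch of that bookkeeping, not actual NameNode code:

```python
# Toy illustration: blocks whose replica count exceeds the replication factor
# after a node rejoins contribute their excess replicas to the
# "Pending Deletion Blocks" count until the NameNode invalidates them.
replication_factor = 3
block_replica_counts = {"blk_1": 4, "blk_2": 3, "blk_3": 5}  # hypothetical counts

pending_deletion = sum(
    max(0, count - replication_factor) for count in block_replica_counts.values()
)
print(pending_deletion)  # excess replicas queued for deletion
```

If that is what happened here, the count should drain to zero on its own as the NameNode invalidates the extras.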
Labels: Apache Hadoop
10-30-2017 05:22 PM
This looks like a bug in Hive LLAP. I get the same error for a simple select count(*) on a table. The query runs fine in plain Hive and in Spark. This error is what it prints on screen, but the actual error in the logs is:

killed/failed due to:INIT_FAILURE, Fail to create InputInitializerManager, org.apache.tez.dag.api.TezReflectionException: Unable to instantiate class with 1 arguments: org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator
10-05-2017 04:37 PM
I am running a SQL query in Spark:

spark.sql("select person_key, count(*) as count1 from <table_name> group by person_key order by count1 desc").show()

This throws a warning (repeated many times):

17/10/05 12:09:03 WARN ReaderImpl: Cannot find field for: person_key in _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9,

But it does give correct results. I want to understand what this means; I did not find anything online. I want this resolved because, even though the results are correct, the query takes very long to execute. (The same query on Hive LLAP takes 3 seconds, and Spark numbers are usually comparable to Hive LLAP numbers.) I checked that person_key does exist in the table (I created the table, so I know it exists). I am not sure why the warning appears.
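A plausible reading of the warning (hedged; this is a toy sketch, not Spark or ORC internals, and the column lists are hypothetical): ORC files written by Hive can store internal positional column names (_col0, _col1, ...) in their footer instead of the metastore schema names, so a reader that looks up "person_key" by name cannot find it and falls back to positional mapping:

```python
# Toy illustration of the suspected name mismatch between an ORC file footer
# and the Hive metastore schema. The name-based lookup fails (hence the WARN),
# and the reader resolves the column by its position in the schema instead.
orc_file_columns = ["_col0", "_col1", "_col2"]           # hypothetical footer names
hive_schema = ["person_key", "first_name", "last_name"]  # hypothetical schema names

def resolve_column(name, file_columns, schema):
    if name in file_columns:           # name-based lookup: fails -> the WARN
        return file_columns.index(name)
    return schema.index(name)          # positional fallback via schema order

print(resolve_column("person_key", orc_file_columns, hive_schema))
```

If that mismatch is present, the results stay correct because the positional fallback lines up with the schema order, which would match what you are seeing.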
Labels: Apache Spark
09-29-2017 03:55 PM
Hi all, this question is a continuation of: https://community.hortonworks.com/questions/138257/how-to-sync-a-new-secondary-namenode-to-the-cluste.html

Scenario: I have a 12-node cluster (machines 01-12) with 8 data nodes. 06 and 07 are the NN and SNN respectively, and 01 and 12 run the Hive-related services. This cluster was upgraded to 12 nodes from 4 nodes, where 01 was the namenode and 02-05 were the data nodes. So I used the "Move Namenode" and "Move Snamenode" wizards in Ambari to move the NN and SNN from 01 to 06 and 07 respectively. I verified that all the services running on the NN are running on the SNN as well.

Now I want to check whether my SNN is working properly. So I shut down all the services on the NN and tried to connect to Hive from one of the hosts (02), and it failed with the error:

Call From <machine_02/ip_address_of_02> to <machine_06:8020> failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused

I don't know why it is trying to connect to the NN (06) and not the SNN (07). Can someone point out what I am missing here?
Labels: Apache Hadoop
09-29-2017 03:46 PM
Thanks @Sridhar Reddy. My name node is healthy. I only wanted to move the secondary name node, which I did using the "Move Snamenode" wizard in Ambari. I want to sync up the secondary name node with the cluster and am not sure how to do that.
09-23-2017 12:32 AM
Thanks Sonu, that helped 🙂
09-23-2017 12:31 AM
I have a 5-node cluster with 1 name node and 4 data nodes. Right now both my secondary name node and primary name node are on the same machine. I want to add a new secondary name node and sync it up with the rest of the hosts. What is the best way to do this? I tried Ambari's "move secondary name node" wizard, and it asked me to copy data into the /hadoop/hdfs/namesecondary directory (which is on the boot disk, with low disk space) on the new host. But I want to move it onto the external hard disks I mounted at /mnt/data1-4. There are about 72 files in /hadoop/hdfs/namesecondary/current, and I am not sure how to distribute them across the /mnt disks. Can someone suggest the right way to do this?
Labels: Apache Hadoop
09-22-2017 04:29 PM
I have a 5-node cluster (machines 01-05) with 4 data nodes (02, 03, 04, 05) that is running well. Now I want to upgrade the cluster from 5 nodes to 12 nodes by adding machines 06-12, and I want to make 06 and 07 the name node and secondary name node instead of 01 (which is currently the name node). Can I do this without losing any data on the cluster? What is the best way to do this?
Labels: Apache Hadoop