Member since: 11-15-2016
Posts: 50
Kudos Received: 2
Solutions: 0
02-23-2018
06:15 PM
Hi all, I installed an HDP cluster with Cloudbreak and am trying to run a simple Spark job. I open the "pyspark" shell and run the following:
ip = "adl://alenzadls1.azuredatalakestore.net/path/to/my/input/directory"
input_data = sc.textFile(ip)
for x in input_data.collect(): print x
The print statement returns an error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/hdp/current/spark-client/python/pyspark/rdd.py", line 771, in collect
port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
File "/usr/hdp/current/spark-client/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
File "/usr/hdp/current/spark-client/python/pyspark/sql/utils.py", line 45, in deco
return f(*a, **kw)
File "/usr/hdp/current/spark-client/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.adl.AdlFileSystem not found
Can someone point out where this is going wrong? I did not find anything related to this online.
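A hedged sketch of the usual fix: this ClassNotFoundException typically means the ADLS connector jars (hadoop-azure-datalake plus the azure-data-lake-store-sdk it depends on) are not on the Spark classpath. The jar paths below are placeholders, not taken from this post, so adjust them to wherever the jars live on your nodes:
# Hedged PySpark sketch -- jar locations are assumptions.
from pyspark import SparkConf, SparkContext
conf = (SparkConf()
        .setAppName("adls-read-sketch")
        .set("spark.jars",
             "/usr/hdp/current/hadoop-client/hadoop-azure-datalake.jar,"
             "/usr/hdp/current/hadoop-client/lib/azure-data-lake-store-sdk.jar"))
sc = SparkContext(conf=conf)
# Map the adl:// scheme to the class named in the stack trace.
sc._jsc.hadoopConfiguration().set("fs.adl.impl", "org.apache.hadoop.fs.adl.AdlFileSystem")
ip = "adl://alenzadls1.azuredatalakestore.net/path/to/my/input/directory"
for x in sc.textFile(ip).collect():
    print(x)
The same jars can instead be passed on the command line when starting the shell (pyspark --jars ...).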
... View more
Labels:
02-23-2018
06:09 PM
I did. I just had to change the value of Reserved Space for HDFS in HDFS configs. Thanks for commenting.
... View more
02-21-2018
11:16 PM
I just deployed my cluster on Azure using Cloudbreak. Right now my namenode and datanode directories are /hadoopfs/fs1/hdfs/namenode and /hadoop/hdfs/data,/hadoopfs/fs1/hadoop/hdfs/data,/mnt/resource/hadoop/hdfs/data respectively. I want to change these to a location on my Data Lake Store, but when I paste a path starting with "adl:/" it throws an error. I have set up the access and everything; I just want to change the directories to save space on my local FS. Also, I am getting alerts like: Capacity Used:[100%, 28672], Capacity Remaining:[0]. I have 110 GB of RAM on my master node. Why is it already used up?
... View more
Labels:
11-30-2017
08:20 PM
@aengineer Hi, thanks for your reply. It seems someone did perform a massive delete operation on the cluster. The issue is resolved.
... View more
11-30-2017
07:46 PM
@Jay Kumar SenSharma That did not help. Please check the logs below:
[root@newyorknn ~]# su postgres -c 'psql -c "create database ambarirca" '
ERROR: database "ambarirca" already exists
[root@newyorknn ~]# su postgres -c 'psql -c "create user mapred with password 'mapred'" '
ERROR: syntax error at or near "mapred"
LINE 1: create user mapred with password mapred
^
[root@newyorknn ~]# su postgres -c 'psql -c "GRANT ALL PRIVILEGES ON DATABASE ambarirca TO mapred"'
GRANT
[root@newyorknn ~]# ambari-server upgrade
Using python /usr/bin/python
Upgrading ambari-server
Updating properties in ambari.properties ...
INFO: Can not find ambari-env.sh.rpmsave file from previous version, skipping restore of environment settings. ambari-env.sh may not include any user customization.
Fixing database objects owner
Ambari Server configured for Embedded Postgres. Confirm you have made a backup of the Ambari Server database [y/n] (y)? y
Upgrading database schema
Error output from schema upgrade command:
Exception in thread "main" org.apache.ambari.server.AmbariException: ERROR: schema "ambarirca" does not exist
at org.apache.ambari.server.upgrade.SchemaUpgradeHelper.executeUpgrade(SchemaUpgradeHelper.java:207)
at org.apache.ambari.server.upgrade.SchemaUpgradeHelper.main(SchemaUpgradeHelper.java:347)
Caused by: org.postgresql.util.PSQLException: ERROR: schema "ambarirca" does not exist
at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2161)
at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1890)
at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:255)
at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:559)
at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:403)
at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:395)
at org.apache.ambari.server.orm.DBAccessorImpl.executeQuery(DBAccessorImpl.java:827)
at org.apache.ambari.server.orm.DBAccessorImpl.executeQuery(DBAccessorImpl.java:819)
at org.apache.ambari.server.upgrade.AbstractUpgradeCatalog.changePostgresSearchPath(AbstractUpgradeCatalog.java:361)
at org.apache.ambari.server.upgrade.AbstractUpgradeCatalog.upgradeSchema(AbstractUpgradeCatalog.java:886)
at org.apache.ambari.server.upgrade.SchemaUpgradeHelper.executeUpgrade(SchemaUpgradeHelper.java:204)
... 1 more
ERROR: Error executing schema upgrade, please check the server logs.
ERROR: Ambari server upgrade failed. Please look at /var/log/ambari-server/ambari-server.log, for more details.
ERROR: Exiting with exit code 11.
REASON: Schema upgrade failed.
Then, since a database and a schema are different things in Postgres (as far as I understand), I did the following:
[root@newyorknn ~]# su - postgres
Last login: Thu Nov 30 14:41:51 EST 2017
-bash-4.2$ psql -c "create database ambarirca"
ERROR: database "ambarirca" already exists
-bash-4.2$ psql -c "create schema ambarirca"
CREATE SCHEMA
-bash-4.2$ psql -c "create user mapred with password 'mapred'"
ERROR: role "mapred" already exists
-bash-4.2$ psql -c "GRANT ALL PRIVILEGES ON DATABASE ambarirca TO mapred"
GRANT
-bash-4.2$ psql -c "GRANT ALL PRIVILEGES ON schema ambarirca TO mapred"
GRANT
-bash-4.2$ exit
logout
[root@newyorknn ~]# ambari-server upgrade
Using python /usr/bin/python
Upgrading ambari-server
Updating properties in ambari.properties ...
WARNING: Can not find ambari.properties.rpmsave file from previous version, skipping import of settings
INFO: Can not find ambari-env.sh.rpmsave file from previous version, skipping restore of environment settings. ambari-env.sh may not include any user customization.
Fixing database objects owner
Ambari Server configured for Embedded Postgres. Confirm you have made a backup of the Ambari Server database [y/n] (y)? y
Upgrading database schema
Error output from schema upgrade command:
Exception in thread "main" org.apache.ambari.server.AmbariException: ERROR: schema "ambarirca" does not exist
at org.apache.ambari.server.upgrade.SchemaUpgradeHelper.executeUpgrade(SchemaUpgradeHelper.java:207)
at org.apache.ambari.server.upgrade.SchemaUpgradeHelper.main(SchemaUpgradeHelper.java:347)
Caused by: org.postgresql.util.PSQLException: ERROR: schema "ambarirca" does not exist
at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2161)
at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1890)
at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:255)
at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:559)
at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:403)
at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:395)
at org.apache.ambari.server.orm.DBAccessorImpl.executeQuery(DBAccessorImpl.java:827)
at org.apache.ambari.server.orm.DBAccessorImpl.executeQuery(DBAccessorImpl.java:819)
at org.apache.ambari.server.upgrade.AbstractUpgradeCatalog.changePostgresSearchPath(AbstractUpgradeCatalog.java:361)
at org.apache.ambari.server.upgrade.AbstractUpgradeCatalog.upgradeSchema(AbstractUpgradeCatalog.java:886)
at org.apache.ambari.server.upgrade.SchemaUpgradeHelper.executeUpgrade(SchemaUpgradeHelper.java:204)
... 1 more
ERROR: Error executing schema upgrade, please check the server logs.
ERROR: Ambari server upgrade failed. Please look at /var/log/ambari-server/ambari-server.log, for more details.
ERROR: Exiting with exit code 11.
REASON: Schema upgrade failed.
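A hedged aside on the schema attempts above: psql -c without a -d flag connects to the default database (normally "postgres" for the postgres user), so the CREATE SCHEMA most likely landed in that default database rather than inside the ambarirca database the upgrade is checking. A minimal Python sketch to see which database actually contains the schema (psycopg2 availability and the connection settings are assumptions, not taken from the post):
# Hedged diagnostic sketch -- credentials/host are placeholders for the embedded Ambari Postgres.
import psycopg2
for dbname in ("postgres", "ambari", "ambarirca"):
    conn = psycopg2.connect(dbname=dbname, user="postgres", host="localhost")
    cur = conn.cursor()
    cur.execute("SELECT schema_name FROM information_schema.schemata")
    print(dbname, sorted(row[0] for row in cur.fetchall()))
    conn.close()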
... View more
11-30-2017
07:12 PM
I currently have Ambari 2.2.2.0 and want to upgrade to the latest version. My HDP version is 2.4.2. So I am trying to upgrade to 2.4 first (since I can't jump directly to 2.6 from 2.2) and am following the steps given at https://docs.hortonworks.com/HDPDocuments/Ambari-2.4.0.1/bk_ambari-upgrade/content/upgrade_ambari.html. Everything is fine until I reach the "ambari-server upgrade" step, which fails with the error:
[root@newyorknn ~]# ambari-server upgrade
Using python /usr/bin/python
Upgrading ambari-server
Updating properties in ambari.properties ...
WARNING: Original file ambari-env.sh kept
Fixing database objects owner
Ambari Server configured for Embedded Postgres. Confirm you have made a backup of the Ambari Server database [y/n] (y)? y
Upgrading database schema
Error output from schema upgrade command:
Exception in thread "main" org.apache.ambari.server.AmbariException: ERROR: schema "ambarirca" does not exist
at org.apache.ambari.server.upgrade.SchemaUpgradeHelper.executeUpgrade(SchemaUpgradeHelper.java:207)
at org.apache.ambari.server.upgrade.SchemaUpgradeHelper.main(SchemaUpgradeHelper.java:347)
Caused by: org.postgresql.util.PSQLException: ERROR: schema "ambarirca" does not exist
at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2161)
at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1890)
at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:255)
at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:559)
at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:403)
at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:395)
at org.apache.ambari.server.orm.DBAccessorImpl.executeQuery(DBAccessorImpl.java:827)
at org.apache.ambari.server.orm.DBAccessorImpl.executeQuery(DBAccessorImpl.java:819)
at org.apache.ambari.server.upgrade.AbstractUpgradeCatalog.changePostgresSearchPath(AbstractUpgradeCatalog.java:361)
at org.apache.ambari.server.upgrade.AbstractUpgradeCatalog.upgradeSchema(AbstractUpgradeCatalog.java:886)
at org.apache.ambari.server.upgrade.SchemaUpgradeHelper.executeUpgrade(SchemaUpgradeHelper.java:204)
... 1 more
ERROR: Error executing schema upgrade, please check the server logs.
ERROR: Ambari server upgrade failed. Please look at /var/log/ambari-server/ambari-server.log, for more details.
ERROR: Exiting with exit code 11.
REASON: Schema upgrade failed.
Can someone suggest what I should do? I looked at the ambari-server database dumps that I backed up according to the instructions at https://community.hortonworks.com/articles/37765/backing-up-the-ambari-database-with-postgres.html (ambari.sql and ambarirca.sql), but neither of them contains any "CREATE SCHEMA AMBARIRCA" statement. I am not sure how to proceed.
... View more
Labels:
Apache Ambari
11-30-2017
03:16 PM
One of the disks on one of my data nodes was failing, so I replaced it following these steps:
1. Stop all services on the datanode.
2. Shut down the machine.
3. Replace the disk.
4. Power on the machine.
5. Mount the disk onto its mount point.
6. Start all HDFS services.
Now I get an alert "Pending Deletion Blocks:[276861]" in Ambari. Did I do something wrong? How can I revert it?
... View more
Labels:
Apache Hadoop
10-30-2017
05:22 PM
This looks like a bug in Hive LLAP. I get the same error for a simple select count(*) on a table. The query runs fine in Hive and Spark. This is the error printed on screen, but the actual error in the logs is:
killed/failed due to:INIT_FAILURE, Fail to create
InputInitializerManager, org.apache.tez.dag.api.TezReflectionException: Unable
to instantiate class with 1 arguments:
org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator
... View more
10-05-2017
04:37 PM
I am running a SQL query in Spark:
spark.sql("select person_key, count(*) as count1 from <table_name> group by person_key order by count1 desc").show()
This throws a warning:
17/10/05 12:09:03 WARN ReaderImpl: Cannot find field for: person_key in _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9,
17/10/05 12:09:03 WARN ReaderImpl: Cannot find field for: person_key in _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9,
17/10/05 12:09:03 WARN ReaderImpl: Cannot find field for: person_key in _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9,
17/10/05 12:09:03 WARN ReaderImpl: Cannot find field for: person_key in _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9,
17/10/05 12:09:03 WARN ReaderImpl: Cannot find field for: person_key in _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9,
17/10/05 12:09:03 WARN ReaderImpl: Cannot find field for: person_key in _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9,
17/10/05 12:09:03 WARN ReaderImpl: Cannot find field for: person_key in _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9,
The query does give correct results, but I want to understand what this warning means; I did not find anything about it online. I also want it resolved because, although the results are correct, the query takes very long to execute. (The same query on Hive LLAP takes 3 seconds, and Spark numbers are usually comparable to Hive LLAP numbers.) I checked that person_key does exist in the table (I created the table, so I know it exists). I am not sure why the warning appears.
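One way to see where the warning comes from is to read the table's ORC files directly and inspect the schema stored inside them; ORC files written by older Hive versions often record only positional field names (_col0, _col1, ...) instead of the Hive column names, which is exactly what ReaderImpl is reporting. A hedged diagnostic sketch, where the warehouse path is a placeholder (take the real location from DESCRIBE FORMATTED <table_name>):
# Hedged sketch: inspect the schema embedded in the ORC files themselves.
orc_df = spark.read.orc("/apps/hive/warehouse/my_db.db/my_table")
orc_df.printSchema()  # _col0, _col1, ... here would mean the files carry only positional names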
... View more
Labels:
Apache Spark
10-05-2017
04:23 PM
Thanks, restarting Hive Interactive Server did help. Regarding Spark, I think it has a problem with the naming convention of the delta files. When I ran the same query in Spark after performing a "major compaction", it worked. But how realistic is performing a major compaction on the table for every update in a production cluster? Is there a workaround?
... View more
09-29-2017
05:35 PM
I have a transactional (ACID) table in Hive and can query it fine using the Hive shell. However, I want to query it using LLAP or the Spark shell, which I am unable to do. When I set all the parameters required for querying ACID tables (transaction manager, compactor threads, etc.) in the beeline session I open to connect to Hive Interactive Server, it seems to work, but when I run the query everything just halts. After a very long time I get the following error message:
ERROR : Status: Failed
ERROR : Dag received [DAG_TERMINATE, SERVICE_PLUGIN_ERROR] in RUNNING state.
ERROR : Error reported by TaskScheduler [[2:LLAP]][SERVICE_UNAVAILABLE] No LLAP Daemons are running
ERROR : Vertex killed, vertexName=Reducer 2, vertexId=vertex_1506697113479_0016_1_03, diagnostics=[Vertex received Kill while in RUNNING state., Vertex did not succeed due to DAG_TERMINATED, failedTasks:0 killedTasks:1009, Vertex vertex_1506697113479_0016_1_03 [Reducer 2] killed/failed due to:DAG_TERMINATED]
ERROR : Vertex killed, vertexName=Map 3, vertexId=vertex_1506697113479_0016_1_00, diagnostics=[Vertex received Kill while in RUNNING state., Vertex did not succeed due to DAG_TERMINATED, failedTasks:0 killedTasks:1, Vertex vertex_1506697113479_0016_1_00 [Map 3] killed/failed due to:DAG_TERMINATED]
ERROR : Vertex killed, vertexName=Reducer 4, vertexId=vertex_1506697113479_0016_1_01, diagnostics=[Vertex received Kill while in RUNNING state., Vertex did not succeed due to DAG_TERMINATED, failedTasks:0 killedTasks:1, Vertex vertex_1506697113479_0016_1_01 [Reducer 4] killed/failed due to:DAG_TERMINATED]
ERROR : Vertex killed, vertexName=Map 1, vertexId=vertex_1506697113479_0016_1_02, diagnostics=[Vertex received Kill while in RUNNING state., Vertex did not succeed due to DAG_TERMINATED, failedTasks:0 killedTasks:354, Vertex vertex_1506697113479_0016_1_02 [Map 1] killed/failed due to:DAG_TERMINATED]
ERROR : DAG did not succeed due to SERVICE_PLUGIN_ERROR. failedVertices:0 killedVertices:4
ERROR : FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Dag received [DAG_TERMINATE, SERVICE_PLUGIN_ERROR] in RUNNING state.Error reported by TaskScheduler [[2:LLAP]][SERVICE_UNAVAILABLE] No LLAP Daemons are runningVertex killed, vertexName=Reducer 2, vertexId=vertex_1506697113479_0016_1_03, diagnostics=[Vertex received Kill while in RUNNING state., Vertex did not succeed due to DAG_TERMINATED, failedTasks:0 killedTasks:1009, Vertex vertex_1506697113479_0016_1_03 [Reducer 2] killed/failed due to:DAG_TERMINATED]Vertex killed, vertexName=Map 3, vertexId=vertex_1506697113479_0016_1_00, diagnostics=[Vertex received Kill while in RUNNING state., Vertex did not succeed due to DAG_TERMINATED, failedTasks:0 killedTasks:1, Vertex vertex_1506697113479_0016_1_00 [Map 3] killed/failed due to:DAG_TERMINATED]Vertex killed, vertexName=Reducer 4, vertexId=vertex_1506697113479_0016_1_01, diagnostics=[Vertex received Kill while in RUNNING state., Vertex did not succeed due to DAG_TERMINATED, failedTasks:0 killedTasks:1, Vertex vertex_1506697113479_0016_1_01 [Reducer 4] killed/failed due to:DAG_TERMINATED]Vertex killed, vertexName=Map 1, vertexId=vertex_1506697113479_0016_1_02, diagnostics=[Vertex received Kill while in RUNNING state., Vertex did not succeed due to DAG_TERMINATED, failedTasks:0 killedTasks:354, Vertex vertex_1506697113479_0016_1_02 [Map 1] killed/failed due to:DAG_TERMINATED]DAG did not succeed due to SERVICE_PLUGIN_ERROR. failedVertices:0 killedVertices:4
Error: Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Dag received [DAG_TERMINATE, SERVICE_PLUGIN_ERROR] in RUNNING state.Error reported by TaskScheduler [[2:LLAP]][SERVICE_UNAVAILABLE] No LLAP Daemons are runningVertex killed, vertexName=Reducer 2, vertexId=vertex_1506697113479_0016_1_03, diagnostics=[Vertex received Kill while in RUNNING state., Vertex did not succeed due to DAG_TERMINATED, failedTasks:0 killedTasks:1009, Vertex vertex_1506697113479_0016_1_03 [Reducer 2] killed/failed due to:DAG_TERMINATED]Vertex killed, vertexName=Map 3, vertexId=vertex_1506697113479_0016_1_00, diagnostics=[Vertex received Kill while in RUNNING state., Vertex did not succeed due to DAG_TERMINATED, failedTasks:0 killedTasks:1, Vertex vertex_1506697113479_0016_1_00 [Map 3] killed/failed due to:DAG_TERMINATED]Vertex killed, vertexName=Reducer 4, vertexId=vertex_1506697113479_0016_1_01, diagnostics=[Vertex received Kill while in RUNNING state., Vertex did not succeed due to DAG_TERMINATED, failedTasks:0 killedTasks:1, Vertex vertex_1506697113479_0016_1_01 [Reducer 4] killed/failed due to:DAG_TERMINATED]Vertex killed, vertexName=Map 1, vertexId=vertex_1506697113479_0016_1_02, diagnostics=[Vertex received Kill while in RUNNING state., Vertex did not succeed due to DAG_TERMINATED, failedTasks:0 killedTasks:354, Vertex vertex_1506697113479_0016_1_02 [Map 1] killed/failed due to:DAG_TERMINATED]DAG did not succeed due to SERVICE_PLUGIN_ERROR. failedVertices:0 killedVertices:4 (state=08S01,code=2) Similarly in Spark, I open the spark shell, make the following imports: import org.apache.spark.sql.SparkSession
import spark.implicits._
val spark = SparkSession.builder().appName("Spark SQL basic example").config("spark.sql.parquet.compression.codec", "lzo").getOrCreate()
spark.sqlContext.setConf("spark.sql.parquet.compression.codec", "lzo")
spark.sqlContext.setConf("spark.sql.crossJoin.enabled", "true") and run the query as: val q1 = spark.sql("select * fron <hive_transaction_table")
q1.show() It gives me an error: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
Exchange SinglePartition
+- *HashAggregate(keys=[], functions=[partial_count(1)], output=[count#159L])
+- HiveTableScan MetastoreRelation avlino_bm, subscriber_dim_test_parquet
at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
at org.apache.spark.sql.execution.exchange.ShuffleExchange.doExecute(ShuffleExchange.scala:112)
at org.apache.spark.sql.execution.SparkPlan$anonfun$execute$1.apply(SparkPlan.scala:114)
at org.apache.spark.sql.execution.SparkPlan$anonfun$execute$1.apply(SparkPlan.scala:114)
at org.apache.spark.sql.execution.SparkPlan$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
at org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:235)
at org.apache.spark.sql.execution.aggregate.HashAggregateExec.inputRDDs(HashAggregateExec.scala:141)
at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:368)
at org.apache.spark.sql.execution.SparkPlan$anonfun$execute$1.apply(SparkPlan.scala:114)
at org.apache.spark.sql.execution.SparkPlan$anonfun$execute$1.apply(SparkPlan.scala:114)
at org.apache.spark.sql.execution.SparkPlan$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:225)
at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:308)
at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
at org.apache.spark.sql.Dataset$anonfun$org$apache$spark$sql$Dataset$execute$1$1.apply(Dataset.scala:2378)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2780)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$execute$1(Dataset.scala:2377)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$collect(Dataset.scala:2384)
at org.apache.spark.sql.Dataset$anonfun$head$1.apply(Dataset.scala:2120)
at org.apache.spark.sql.Dataset$anonfun$head$1.apply(Dataset.scala:2119)
at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2810)
at org.apache.spark.sql.Dataset.head(Dataset.scala:2119)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2334)
at org.apache.spark.sql.Dataset.showString(Dataset.scala:248)
at org.apache.spark.sql.Dataset.show(Dataset.scala:638)
at org.apache.spark.sql.Dataset.show(Dataset.scala:597)
at org.apache.spark.sql.Dataset.show(Dataset.scala:606)
... 52 elided
Caused by: java.lang.RuntimeException: serious problem
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1021)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
at org.apache.spark.rdd.RDD$anonfun$partitions$2.apply(RDD.scala:252)
at org.apache.spark.rdd.RDD$anonfun$partitions$2.apply(RDD.scala:250)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$anonfun$partitions$2.apply(RDD.scala:252)
at org.apache.spark.rdd.RDD$anonfun$partitions$2.apply(RDD.scala:250)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$anonfun$partitions$2.apply(RDD.scala:252)
at org.apache.spark.rdd.RDD$anonfun$partitions$2.apply(RDD.scala:250)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$anonfun$partitions$2.apply(RDD.scala:252)
at org.apache.spark.rdd.RDD$anonfun$partitions$2.apply(RDD.scala:250)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$anonfun$partitions$2.apply(RDD.scala:252)
at org.apache.spark.rdd.RDD$anonfun$partitions$2.apply(RDD.scala:250)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$anonfun$partitions$2.apply(RDD.scala:252)
at org.apache.spark.rdd.RDD$anonfun$partitions$2.apply(RDD.scala:250)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
at org.apache.spark.ShuffleDependency.<init>(Dependency.scala:91)
at org.apache.spark.sql.execution.exchange.ShuffleExchange$.prepareShuffleDependency(ShuffleExchange.scala:261)
at org.apache.spark.sql.execution.exchange.ShuffleExchange.prepareShuffleDependency(ShuffleExchange.scala:84)
at org.apache.spark.sql.execution.exchange.ShuffleExchange$anonfun$doExecute$1.apply(ShuffleExchange.scala:121)
at org.apache.spark.sql.execution.exchange.ShuffleExchange$anonfun$doExecute$1.apply(ShuffleExchange.scala:112)
at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
... 85 more
Caused by: java.util.concurrent.ExecutionException: java.lang.NumberFormatException: For input string: "0000035_0000"
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:192)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:998)
... 122 more
Caused by: java.lang.NumberFormatException: For input string: "0000035_0000"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Long.parseLong(Long.java:589)
at java.lang.Long.parseLong(Long.java:631)
at org.apache.hadoop.hive.ql.io.AcidUtils.parseDelta(AcidUtils.java:310)
at org.apache.hadoop.hive.ql.io.AcidUtils.getAcidState(AcidUtils.java:379)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$FileGenerator.call(OrcInputFormat.java:634)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$FileGenerator.call(OrcInputFormat.java:620)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Is anyone else facing the same issue? How can I access transactional tables from Hive LLAP or Spark?
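For the Spark failure, the NumberFormatException comes from AcidUtils.parseDelta failing on a delta directory name it does not expect ("0000035_0000"). A hedged PySpark sketch to list those directory names through the Hadoop FileSystem API (the table location below is a placeholder):
# Hedged sketch -- substitute the real HDFS location of the ACID table.
hadoop = sc._jvm.org.apache.hadoop
fs = hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
table_path = hadoop.fs.Path("/apps/hive/warehouse/my_db.db/my_acid_table")
for status in fs.listStatus(table_path):
    print(status.getPath().getName())  # base_* and delta_* names are what AcidUtils.parseDelta reads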
... View more
Labels:
Apache Spark
09-29-2017
03:55 PM
Hi all, this question is a continuation of https://community.hortonworks.com/questions/138257/how-to-sync-a-new-secondary-namenode-to-the-cluste.html. Scenario: I have a 12-node cluster (machines 01-12) with 8 data nodes. 06 and 07 are the NN and SNN respectively, and 01 and 12 run the Hive-related services. This cluster was expanded from 4 nodes to 12 nodes; previously 01 was the namenode and 02-05 were the data nodes, so I used the "Move Namenode" and "Move Snamenode" wizards in Ambari to move the NN and SNN from 01 to 06 and 07 respectively. I verified that all the services running on the NN are running on the SNN as well. Now I want to check whether my SNN is working properly, so I shut down all the services on the NN and tried to connect to Hive from one of the hosts (02), and it failed with the error:
Call From <machine_02/ip_address_of_02> to <machine_06:8020> failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
I don't know why it is trying to connect to the NN (06) and not the SNN (07). Can someone point out what I am missing here?
... View more
Labels:
Apache Hadoop
09-29-2017
03:46 PM
Thanks @Sridhar Reddy. My name node is doing fine. I only wanted to move the secondary name node, which I did using the "Move Snamenode" wizard in Ambari. Now I want to sync it up with the cluster and am not sure how to do that.
... View more
09-29-2017
03:43 PM
Thanks @Jay SenSharma. For some reason, even though I have a sudoer account, Ambari is not able to create all the user accounts. Anyway, I put in a request to the client to create those user accounts.
... View more
09-29-2017
03:43 PM
Thanks @chen.yang. Yes, Ambari was not allowing us to install with the same user. Anyway, we submitted a request to the client to create those users for us.
... View more
09-28-2017
05:26 PM
I am installing HDP on client machines with restricted sudo rights. I do not want to install all the services under their respective user accounts; instead I want to run them all under one user account. (In fact, I can't do it the usual way, as I do not have permissions to create all these user accounts. I would have to put in a request to the client to first create these accounts and then go ahead with the installation.) I can see the yarn user has to be created, since that is declared "final", but can I change all the others to also run under the "yarn" user? What are the advantages/disadvantages of doing this?
... View more
Labels:
09-23-2017
12:32 AM
Thanks Sonu, that helped 🙂
... View more
09-23-2017
12:31 AM
I have a 5-node cluster with 1 name node and 4 data nodes. Right now both my secondary name node and primary name node are on the same machine. I want to add a new secondary name node and sync it up with the rest of the hosts. What is the best possible way to do this? I tried using Ambari's "move secondary name node" wizard and it asked me to copy data into the /hadoop/hdfs/namesecondary directory (which is on the boot disk with low disk space) on the new host. But I want to move it onto the external hard disks I mounted at /mnt/data1-4. There are about 72 files in /hadoop/hdfs/namesecondary/current, and I am not sure how to distribute them across the /mnt disks. Can someone suggest the right way to do this?
... View more
Labels:
Apache Hadoop
09-22-2017
04:29 PM
I have a 5-node cluster (machines 01-05) with 4 data nodes (02, 03, 04, 05) which is running well. Now I want to expand the cluster from 5 nodes to 12 nodes by adding machines 06-12, and I want to make 06 and 07 the name node and secondary name node instead of 01 (which is currently the name node). Can I do this without losing any data on the cluster? What is the best possible way to do it?
... View more
Labels:
Apache Hadoop
09-22-2017
03:25 PM
Thanks a ton @Jay SenSharma. That really helped.
... View more
09-21-2017
09:30 PM
I have a 5-node cluster with 4 data nodes that is in good condition. But in order to add a secondary name node, I generated the SSH key again instead of copying the already existing key onto the new machine. The new host was not registering, with the following error:
INFO 2017-09-21 17:24:06,811 NetUtil.py:67 - Connecting to https://asterix01.cem.spirent.com:8440/ca
ERROR 2017-09-21 17:24:06,934 NetUtil.py:93 - [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:579)
ERROR 2017-09-21 17:24:06,934 NetUtil.py:94 - SSLError: Failed to connect. Please check openssl library versions.
Refer to: https://bugzilla.redhat.com/show_bug.cgi?id=1022468 for more details.
WARNING 2017-09-21 17:24:06,937 NetUtil.py:121 - Server at https://asterix01.cem.spirent.com:8440 is not reachable, sleeping for 10 seconds...
So I stopped trying to add the new host, checked whether the existing hosts were still healthy, and restarted the ambari-agent on one of them. Its logs show a failure with the same error as above, which confirms there is an issue with the keys. What can I do to resolve this? I tried to set up passwordless SSH as the root user afresh, as given in the documentation, but it still fails. Can someone help me fix this please? I can definitely SSH without passwords from the command line.
... View more
Labels:
Apache Ambari
Apache Hadoop
09-21-2017
07:20 PM
Were you able to resolve it? We have the same issue.
... View more
09-15-2017
02:16 PM
@Vinod Thanks for your reply. Yes, it now shows 4 live data nodes. I am surprised how this happened: I worked on it the whole day yesterday and nothing changed, and this morning I just restarted the datanode process on the failed data node and now all data nodes show as live. Can you explain this, please?
... View more
09-14-2017
04:15 PM
I have a 6-node CentOS 7 cluster with 4 datanodes. All the datanodes are up and running, but the dashboard shows only 3/4 datanodes live. I looked at the logs at /var/log/hadoop/hdfs/hadoop-hdfs-datanode-<data_node>.log and it says:
2017-09-14 11:57:04,794 INFO web.DatanodeHttpServer (SimpleHttpProxyHandler.java:exceptionCaught(147)) - Proxy for / failed. cause:
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
at io.netty.buffer.UnpooledUnsafeDirectByteBuf.setBytes(UnpooledUnsafeDirectByteBuf.java:447)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:242)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
at java.lang.Thread.run(Thread.java:745)
I am not sure what this means. I tried restarting the ambari-agent, rebooting the machine itself, and restarting the ambari-server on the namenode. Can someone suggest where else I should look? EDIT: Also, I tried connecting to the name node from this particular datanode on the port it is listening on (8020, the standard port for Hadoop) and it connects; I can see the connection from both the datanode and the namenode. I don't understand why the communication is not happening.
... View more
Tags: datanode, Hadoop Core
Labels:
Apache Hadoop
08-07-2017
08:57 PM
I have a Hive table which has to be updated quite often, hence I created it as a transactional table in ORC format. I am trying to create a Parquet file from it using the following commands in Spark:
val q = spark.sql("select * from hive_table")
q.write.parquet("hive_table.parquet")
The above commands worked for non-transactional tables but do not work for this particular table. I need this data in a Parquet file because we are using Spark to query that table (along with many others). Can someone please suggest how to do this? Are there any parameters to be set in Spark? I am not even able to run a simple "select count(*) from db_name.hive_table" query on this table from Spark, whereas I can do it in Hive after setting some parameters.
... View more
Labels:
Apache Spark
07-13-2017
04:15 PM
Thank you @ssathish for your answer. My colleague had set wrong capacity values for the queues and they added up to more than 100%. We were able to resolve it.
... View more
07-13-2017
04:13 PM
Hi, I am trying to add two extra queues named etl and agg to my already existing default and llap queues in YARN. I have attached both configurations. The older scheduler configuration works fine and I am able to use LLAP, but with the 2 extra queues I get the error:
2017-07-13 12:08:02,548 [main] ERROR main.ServiceLauncher - Exception: Failed to submit application_1499959469167_0009 to YARN : Application application_1499959469167_0009 submitted by user hive to unknown queue: hive
That means the job is not even being submitted. And no, I never created a queue called hive. Can someone please help? Thanks in advance!
... View more
Labels:
Apache Hive
Apache YARN
07-10-2017
05:08 PM
I added a couple of queues (q1, q2) to the already existing 2 queues (llap and default) in my capacity scheduler and set the parameters accordingly. I divided the cluster resources as q1: 20, q2: 20, default: 0, llap: 60. Now the Resource Manager does not start: when I try to restart it, it starts successfully and then stops within seconds. I read in other answers to disable the property yarn.scheduler.capacity.root.accessible-node-labels.default.capacity, but I cannot find this property in the XML file or in Ambari. Can someone please suggest where I should start debugging this?
... View more
Labels:
05-13-2017
08:38 PM
Hi @Bala Vignesh N V, I want to partition the table by the integer part of date_key/10000000000. date_key is one of the columns in the table; partitioning by date_key alone would give me millions of part files, which I don't want. date_key is of the format YYYYMMDDHHMMTZ (TZ is the time zone and is 4 digits), so dividing date_key by 10000000000 will partition by month. I am trying to migrate data from Vertica to Hadoop. In Vertica the syntax is PARTITION BY (date_key // 10000000000), and the equivalent of // in Hive is DIV: if date_key is not perfectly divisible by 10000000000, it returns just the integer part and discards the decimal part. But that is throwing an error for me.
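A quick arithmetic check of that mapping, with a made-up date_key value: a 16-digit YYYYMMDDHHMMTZ number divided by 10000000000 with integer division keeps only the leading YYYYMM digits, i.e. one partition per month.
# Hedged example: a made-up date_key for May 13 2017, 12:30, time zone 0500.
date_key = 2017051312300500
print(date_key // 10000000000)  # prints 201705 -> month-level partition value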
... View more
05-13-2017
07:06 PM
I have a CSV file with a huge amount of data that holds the contents of my Hive table. I want to create the Hive table as a partitioned ORC table, where the partition column is a computed column. So I first created an external table (table1) pointing to the file location, then created another table (table2) with the ORC table properties I want, and inserted the data from table1 into table2. The partition column should be: date_key DIV 10000000000. The script I wrote is:
## to create the orc table:
CREATE TABLE table_name ( key1 type1, key2 type2..... keyn typen) PARTITIONED BY (date_key DIV 10000000000 int) STORED AS ORC tblproperties(........);
INSERT OVERWRITE TABLE table_name PARTITION (date_key DIV 10000000000) SELECT * FROM table_name2;
I am getting an error:
FAILED: ParseException line 12:25 cannot recognize input near 'DIV' '10000000000' 'int' in column type
Can someone please suggest what is wrong here? Am I even doing this right? Is there a different/better way of doing this? Thanks!
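From what I can tell, the ParseException is because PARTITIONED BY only accepts plain column definitions (name and type), not expressions, so the month value has to be declared as an ordinary partition column and computed in the SELECT of a dynamic-partition insert. A hedged sketch, written here as spark.sql() calls to keep it self-contained (column names and types other than date_key are placeholders; the same statements also work in the Hive shell, where the expression can be written with DIV):
# Hedged sketch: the computed month becomes a regular partition column (date_month)
# filled by a dynamic-partition insert. key1 and key2 are placeholder columns.
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
spark.sql("""
    CREATE TABLE table2 (key1 STRING, key2 INT, date_key BIGINT)
    PARTITIONED BY (date_month BIGINT)
    STORED AS ORC
""")
spark.sql("""
    INSERT OVERWRITE TABLE table2 PARTITION (date_month)
    SELECT key1, key2, date_key, floor(date_key / 10000000000) AS date_month
    FROM table1
""")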
... View more
Labels:
Apache Hive