Member since: 12-09-2015
Posts: 16
Kudos Received: 2
Solutions: 0
07-10-2017
10:24 AM
We are trying to read a Teradata table from Spark 2.0 over JDBC using the following code:

import sys
import os
spark_home = os.environ.get('SPARK_HOME', None)
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.10.1-src.zip'))
sys.path.insert(0, os.path.join(spark_home, 'python/lib/pyspark.zip'))
filename = os.path.join(spark_home, 'python/pyspark/shell.py')
print(os.environ.get('SPARK_HOME', None))
exec(compile(open(filename, "rb").read(), filename, 'exec'))
spark_release_file = spark_home + "/RELEASE"
if os.path.exists(spark_release_file) and "Spark 2" in open(spark_release_file).read():
print("Spark is there.")
argsstr= "--master yarn-client --deploy-mode cluster pyspark-shell --driver-class-path /path/to/teradata/terajdbc4.jar,/path/to/teradata/tdgssconfig.jar --driver-library-path /path/to/teradata/terajdbc4.jar,/path/to/teradata/tdgssconfig.jar --jars /path/to/teradata/terajdbc4.jar,/path/to/teradata/tdgssconfig.jar"
pyspark_submit_args = os.environ.get("PYSPARK_SUBMIT_ARGS", argsstr)
if not "pyspark-shell" in pyspark_submit_args:
    pyspark_submit_args += " pyspark-shell"
print(pyspark_submit_args)
os.environ["PYSPARK_SUBMIT_ARGS"] = pyspark_submit_args
os.environ["SPARK_SUBMIT_ARGS"] = pyspark_submit_args
from pyspark.sql import SQLContext
from pyspark import SparkConf, SparkContext
url = 'jdbc:teradata://teradata.server.com'
user='username'
password=''
driver = 'com.teradata.jdbc.TeraDriver'
dbtable_read = 'mi_temp.bd_test_spark_read'
sqlContext = SQLContext(sc)
df = sqlContext.read.format("jdbc").options(url=url, user=user, password=password, driver=driver, dbtable=dbtable_read).load()

We get the following error:

Py4JJavaError: An error occurred while calling o48.load.
: java.lang.ClassNotFoundException: com.teradata.jdbc.TeraDriver
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.spark.sql.execution.datasources.jdbc.DriverRegistry$.register(DriverRegistry.scala:38)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$anonfun$createConnectionFactory$1.apply(JdbcUtils.scala:49)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$anonfun$createConnectionFactory$1.apply(JdbcUtils.scala:49)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.createConnectionFactory(JdbcUtils.scala:49)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:123)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:117)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:53)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:315)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:122)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:211)
at java.lang.Thread.run(Thread.java:745)

However, if we run the same code via the command line, it works. Can you please give us some pointers?
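For reference, here is a minimal, stripped-down sketch of what we are aiming for. The jar locations and server name are placeholders, and the key assumption is that PYSPARK_SUBMIT_ARGS is set before the first SparkContext/SparkSession is created, so the extra jars actually make it onto the JVM classpath:

import os

# Placeholder paths; point these at the real Teradata driver jars.
jars = "/path/to/teradata/terajdbc4.jar,/path/to/teradata/tdgssconfig.jar"

# Must be set before any SparkContext/SparkSession exists, otherwise the JVM
# has already started without the jars on its classpath.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--jars {jars} --driver-class-path {cp} pyspark-shell".format(
        jars=jars, cp=jars.replace(",", ":")))

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("teradata-jdbc-read").getOrCreate()

df = (spark.read.format("jdbc")
      .option("url", "jdbc:teradata://teradata.server.com")
      .option("driver", "com.teradata.jdbc.TeraDriver")
      .option("dbtable", "mi_temp.bd_test_spark_read")
      .option("user", "username")
      .option("password", "")
      .load())
df.printSchema()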
Labels:
- Apache Spark
06-09-2016
08:48 AM
It was an edit to the /etc/hosts files on all the nodes. The hosts file was not set up correctly.
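For anyone hitting the same thing: each node's /etc/hosts needed a proper entry for every host in the cluster, along the lines of the made-up example below (addresses and host names are placeholders), with the FQDN listed before the short name:

192.168.1.11   master1.example.com   master1
192.168.1.12   worker1.example.com   worker1
192.168.1.13   worker2.example.com   worker2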
02-03-2016
12:32 PM
@Neeraj Sabharwal I tried this option, but no success there yet.
02-03-2016
12:32 PM
@Artem Ervits @Neeraj Sabharwal I have noticed a few conflicting settings in yarn-site.xml: yarn.nodemanager.container-executor.class = org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor, yet we don't have the same Linux users across the cluster. Hence we are waiting for the users to be created. Once that is done I will test and post the result.
01-25-2016
03:47 PM
@Neeraj Sabharwal Deleting the directory makes the job work once, but afterwards it fails again.
01-25-2016
01:25 PM
I have tried that. The issue is that when a new folder is created, the permissions don't apply, so the job starts failing. Some clean-up is not happening correctly, but I am unable to locate the issue 😞
01-25-2016
09:40 AM
@Artem Ervits The service checks run fine. We have also restarted the services many times, and the issue still persists. The umask value on all nodes is set to 0022. What are the mount options we should check?
01-22-2016
04:39 PM
I am trying to run a benchmark job with the following command:
yarn jar /path/to/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO -read -nrFiles 10 -fileSize 1000 -resFile /tmp/TESTDFSio.txt
but the job fails with the following error messages:

16/01/22 15:08:47 INFO mapreduce.Job: Task Id : attempt_1453395961197_0017_m_000008_2, Status : FAILED
Application application_1453395961197_0017 initialization failed (exitCode=255) with output:
main : command provided 0
main : user is foo
main : requested yarn user is foo
Path /mnt/sdb1/yarn/local/usercache/foo/appcache/application_1453395961197_0017 has permission 700 but needs permission 750.
Path /var/hadoop/yarn/local/usercache/foo/appcache/application_1453395961197_0017 has permission 700 but needs permission 750.
Did not create any app directories

Even when I change these directories' permissions to 750, I get errors.
Also, these caches don't get cleaned up after a job finishes, and they create collisions when running the next job.
Any insights?
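For what it is worth, here is a small diagnostic sketch we can run on each node to list appcache directories whose permissions differ from 750. The local-dir paths are taken from the error above and may differ per node:

import os
import stat

# yarn.nodemanager.local-dirs as they appear in the error message above;
# adjust to the actual values configured on each NodeManager.
LOCAL_DIRS = ["/mnt/sdb1/yarn/local", "/var/hadoop/yarn/local"]
EXPECTED_MODE = 0o750

for base in LOCAL_DIRS:
    usercache = os.path.join(base, "usercache")
    if not os.path.isdir(usercache):
        print("missing: %s" % usercache)
        continue
    for user in os.listdir(usercache):
        appcache = os.path.join(usercache, user, "appcache")
        if not os.path.isdir(appcache):
            continue
        for app in os.listdir(appcache):
            path = os.path.join(appcache, app)
            mode = stat.S_IMODE(os.stat(path).st_mode)
            if mode != EXPECTED_MODE:
                print("%s has mode %o, expected %o" % (path, mode, EXPECTED_MODE))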
Labels:
- Apache Hadoop
- Apache YARN
01-18-2016
09:18 AM
We have a recently built 5-node HDP cluster; it is not in HA mode. At some point the Unix team will need to apply patches and do server maintenance, which will require rebooting the server machines. What is the best way to do this? I plan to do the following:
1) Shut down all services using Ambari.
2) Shut down the ambari-agents on all nodes.
3) Shut down the ambari-server.
4) Reboot the nodes as required.
5) Restart the ambari-server, the agents, and the services, in that order.
Is this the correct sequence, or am I missing anything? A rough sketch of how I would script step 1 is below.
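As a sketch of step 1, something along these lines should ask Ambari to stop all services through its REST API before the agents and server are shut down. The Ambari host, credentials, and cluster name below are placeholders, not our actual values:

import requests

# Placeholders: substitute the real Ambari host, admin credentials, and cluster name.
AMBARI_URL = "http://ambari-host:8080/api/v1/clusters/MYCLUSTER/services"
AUTH = ("admin", "admin")
HEADERS = {"X-Requested-By": "ambari"}

# Setting the desired state of every service to INSTALLED asks Ambari to stop them all.
payload = {
    "RequestInfo": {"context": "Stop all services before maintenance"},
    "Body": {"ServiceInfo": {"state": "INSTALLED"}},
}

resp = requests.put(AMBARI_URL, json=payload, headers=HEADERS, auth=AUTH)
print(resp.status_code, resp.text)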
Labels:
- Apache Ambari
01-18-2016
09:10 AM
1 Kudo
Hi All, we were able to solve the issue; it was a problem with the host names and IP addresses not being set up correctly. Thanks for your replies @Neeraj Sabharwal @Artem Ervits @pankaj singh