Member since: 09-14-2017
Posts: 120
Kudos Received: 11
Solutions: 5
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 3431 | 06-17-2021 06:55 AM |
| | 2006 | 01-13-2021 01:56 PM |
| | 17441 | 11-02-2017 06:35 AM |
| | 19465 | 10-04-2017 02:43 PM |
| | 34909 | 09-14-2017 06:40 PM |
12-28-2017
05:11 AM
Yes, it worked after disabling Sentry in the Kafka configuration in Cloudera Manager. I still need to understand how Sentry can work with Kafka without Kerberos. Thanks.
12-21-2017
10:08 AM
1 Kudo
Hi Kafka experts, I have enabled the KAFKA 2.2.x parcel (Kafka version 0.10.2) in CDH 5.12. When I run a basic producer or consumer command such as the following, I get metadata errors:

```
[root@~]# /opt/cloudera/parcels/KAFKA-2.2.0-1.2.2.0.p0.68/lib/kafka/bin/kafka-console-producer.sh --broker-list xyz1.com:9092 xyz2.com:9092 --topic topic1
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/KAFKA-2.2.0-1.2.2.0.p0.68/lib/kafka/libs/slf4j-log4j12-1.7.21.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/KAFKA-2.2.0-1.2.2.0.p0.68/lib/kafka/libs/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
17/12/21 12:54:21 INFO producer.ProducerConfig: ProducerConfig values:
        acks = 1
        batch.size = 16384
        block.on.buffer.full = false
        ....
        ssl.truststore.location = null
        ssl.truststore.password = null
        ssl.truststore.type = JKS
        timeout.ms = 30000
        value.serializer = class org.apache.kafka.common.serialization.ByteArraySerializer
17/12/21 12:54:21 INFO utils.AppInfoParser: Kafka version : 0.10.2-kafka-2.2.0
17/12/21 12:54:21 INFO utils.AppInfoParser: Kafka commitId : unknown
hello hello
17/12/21 12:56:26 WARN clients.NetworkClient: Error while fetching metadata with correlation id 1 : {topic1=UNKNOWN_TOPIC_OR_PARTITION}
17/12/21 12:56:27 WARN clients.NetworkClient: Error while fetching metadata with correlation id 2 : {topic1=UNKNOWN_TOPIC_OR_PARTITION}
17/12/21 12:56:27 WARN clients.NetworkClient: Error while fetching metadata with correlation id 3 : {topic1=UNKNOWN_TOPIC_OR_PARTITION}
```

This CDH cluster has Sentry enabled but no Kerberos and no SSL.
I think there is a permission issue for the user, as I get the following in /var/log/kafka/kafka-broker-xyz.log:

```
2017-12-21 13:00:18,813 WARN org.apache.sentry.provider.common.HadoopGroupMappingService: Unable to obtain groups for ANONYMOUS
java.io.IOException: No groups found for user ANONYMOUS
        at org.apache.hadoop.security.Groups.noGroupsForUser(Groups.java:199)
        at org.apache.hadoop.security.Groups.getGroups(Groups.java:222)
        at org.apache.sentry.provider.common.HadoopGroupMappingService.getGroups(HadoopGroupMappingService.java:60)
        at org.apache.sentry.provider.common.ResourceAuthorizationProvider.getGroups(ResourceAuthorizationProvider.java:167)
        at org.apache.sentry.provider.common.ResourceAuthorizationProvider.doHasAccess(ResourceAuthorizationProvider.java:97)
        at org.apache.sentry.provider.common.ResourceAuthorizationProvider.hasAccess(ResourceAuthorizationProvider.java:91)
        at org.apache.sentry.kafka.binding.KafkaAuthBinding.authorize(KafkaAuthBinding.java:212)
        at org.apache.sentry.kafka.authorizer.SentryKafkaAuthorizer.authorize(SentryKafkaAuthorizer.java:63)
        at kafka.server.KafkaApis$$anonfun$kafka$server$KafkaApis$$authorize$1.apply(KafkaApis.scala:343)
        at kafka.server.KafkaApis$$anonfun$kafka$server$KafkaApis$$authorize$1.apply(KafkaApis.scala:343)
        at scala.Option.forall(Option.scala:247)
        at kafka.server.KafkaApis.kafka$server$KafkaApis$$authorize(KafkaApis.scala:343)
        at kafka.server.KafkaApis$$anonfun$39.apply(KafkaApis.scala:838)
        at kafka.server.KafkaApis$$anonfun$39.apply(KafkaApis.scala:838)
        at scala.collection.TraversableLike$$anonfun$partition$1.apply(TraversableLike.scala:314)
        at scala.collection.TraversableLike$$anonfun$partition$1.apply(TraversableLike.scala:314)
        at scala.collection.immutable.Set$Set1.foreach(Set.scala:94)
        at scala.collection.TraversableLike$class.partition(TraversableLike.scala:314)
        at scala.collection.AbstractTraversable.partition(Traversable.scala:104)
        at kafka.server.KafkaApis.handleTopicMetadataRequest(KafkaApis.scala:838)
        at kafka.server.KafkaApis.handle(KafkaApis.scala:83)
        at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:62)
        at java.lang.Thread.run(Thread.java:745)
2017-12-21 13:00:19,067 WARN org.apache.sentry.provider.common.HadoopGroupMappingService: Unable to obtain groups for ANONYMOUS
```

What is the correct way to set up Sentry authorization to give the user permission on Kafka? Any blog or instructions will be greatly appreciated. Thanks!
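For what it's worth, granting Kafka privileges through Sentry is usually done with the `kafka-sentry` CLI. The sketch below is an assumption from memory, not a verified recipe: the role name `kafka_role`, the group `users_group`, and the exact flags should all be checked against the `kafka-sentry` help for your Sentry version. Note also that without Kerberos, clients connect as the `ANONYMOUS` principal, which is why the group lookup in the log above fails.

```
# ASSUMED kafka-sentry invocations -- verify flags against your version.
# Create a Sentry role, bind it to the group the producing user belongs to,
# and grant write/describe privileges on topic1.
kafka-sentry -cr -r kafka_role
kafka-sentry -arg -r kafka_role -g users_group
kafka-sentry -gpr -r kafka_role -p "Host=*->Topic=topic1->action=write"
kafka-sentry -gpr -r kafka_role -p "Host=*->Topic=topic1->action=describe"
```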
Labels:
- Apache Kafka
- Apache Sentry
11-17-2017
11:28 AM
Thanks for the workaround, which was a great help as I hit this error just today with CDH 5.12.x. After adding the Anaconda parcel URL (https://repo.continuum.io/pkgs/misc/parcels/), the parcel didn't show up in the parcel list when checking for new parcels, and there was a null pointer exception in cloudera-scm-server.log. I had to remove both the http and https Sqoop parcel URLs, after which the Anaconda parcel showed up:

http://archive.cloudera.com/sqoop-connectors/parcels/latest/
https://archive.cloudera.com/sqoop-connectors/parcels/latest/
11-02-2017
06:35 AM
1 Kudo
The solution was to put the full path of the Python script in the Libs field under Hue -> Query -> Editor -> Spark, for example:

Libs: /user/userxyz/myscript.py

and then run the query. Clicking the job_xxxxx link shows whether the script ran successfully.
10-29-2017
04:07 PM
The good news is that even though the shell script didn't work, I was able to run the same Python script with a Spark HiveContext using the Spark action in Hue -> Workflow instead of the Shell action. The shell script is shexample7.sh:

```
#!/usr/bin/env bash
export PYTHONPATH=/usr/bin/python
export PYSPARK_PYTHON=/usr/bin/python
echo "starting..."
/usr/bin/spark-submit --master yarn-cluster pyexample.py
```

The Python script is pyexample.py:

```
#!/usr/bin/env python
from pyspark import SparkContext
from pyspark.sql import HiveContext

# Note: hard-coding "local" here conflicts with --master yarn-cluster above;
# spark-submit's master setting is what should take effect.
sc = SparkContext("local", "pySpark Hive App")

# Create a Hive Context
hive_context = HiveContext(sc)

print "Reading Hive table..."
mytbl = hive_context.sql("SELECT * FROM xyzdb.testdata1")
print "Registering DataFrame as a table..."
mytbl.show()  # Show first rows of dataframe
mytbl.printSchema()
```

The Python job successfully displays the data, but somehow the final status comes back as KILLED even though the script ran and got the data back from Hive in stdout.
10-29-2017
01:39 PM
Was able to resolve this by providing the hive-site.xml location as follows. In Hue -> Query -> Scheduler -> Workflow, drag the Spark action to the step below and add the following parameters:

Jar/py name: example4.py
FILES: /user/someuser/example4.py
Options list: --files /etc/hive/conf.cloudera.hive/hive-site.xml

Click the gears icon and set these properties:

Spark master: yarn
Mode: cluster
App name: MySpark

After that, when running the Spark job in Hue/Oozie, it still says KILLED status, but if you look at the actual job URL, the job ran successfully and the data is displayed. This seems like a bug in Hue: it cannot find hive-site.xml on its own and reports job status=KILLED even though the job succeeds.
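For reference, the same settings expressed directly in the workflow XML would look roughly like the Spark action below. This is a sketch assuming the `spark-action:0.1` schema; the action name `spark-node` is made up here, while the file names and options are the ones from this post:

```xml
<action name="spark-node">
    <spark xmlns="uri:oozie:spark-action:0.1">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <master>yarn</master>
        <mode>cluster</mode>
        <name>MySpark</name>
        <jar>example4.py</jar>
        <spark-opts>--files /etc/hive/conf.cloudera.hive/hive-site.xml</spark-opts>
        <file>/user/someuser/example4.py#example4.py</file>
    </spark>
    <ok to="End"/>
    <error to="Kill"/>
</action>
```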
10-29-2017
12:37 PM
On CDH 5.12.x running Spark version 1.6.0, I get an error while running a Python script in Hue/Oozie/Spark with master=yarn and deploy=cluster. The same script runs successfully with spark-submit from a terminal.

```
2017-10-29 14:41:12,416 [Thread-8] INFO org.apache.hadoop.hive.metastore.MetaStoreDirectSql - Using direct SQL, underlying DB is DERBY
2017-10-29 14:41:12,417 [Thread-8] INFO org.apache.hadoop.hive.metastore.ObjectStore - Initialized ObjectStore
2017-10-29 14:41:12,515 [Thread-8] WARN org.apache.hadoop.hive.metastore.ObjectStore - Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.1.0-cdh5.12.1
2017-10-29 14:41:12,641 [Thread-8] WARN org.apache.hadoop.hive.metastore.ObjectStore - Failed to get database default, returning NoSuchObjectException
2017-10-29 14:41:12,764 [Thread-8] INFO org.apache.hadoop.hive.metastore.HiveMetaStore - Added admin role in metastore
2017-10-29 14:41:12,766 [Thread-8] INFO org.apache.hadoop.hive.metastore.HiveMetaStore - Added public role in metastore
2017-10-29 14:41:12,873 [Thread-8] INFO org.apache.hadoop.hive.metastore.HiveMetaStore - No user is added in admin role, since config is empty
2017-10-29 14:41:12,964 [Thread-8] INFO org.apache.hadoop.hive.ql.log.PerfLogger - <PERFLOG method=get_all_functions from=org.apache.hadoop.hive.metastore.RetryingHMSHandler>
2017-10-29 14:41:12,966 [Thread-8] INFO org.apache.hadoop.hive.metastore.HiveMetaStore - 0: get_all_functions
2017-10-29 14:41:12,966 [Thread-8] INFO org.apache.hadoop.hive.metastore.HiveMetaStore.audit - ugi=hive ip=unknown-ip-addr cmd=get_all_functions
2017-10-29 14:41:12,967 [Thread-8] INFO DataNucleus.Datastore - The class "org.apache.hadoop.hive.metastore.model.MResourceUri" is tagged as "embedded-only" so does not have its own datastore table.
2017-10-29 14:41:13,136 [Thread-8] INFO org.apache.hadoop.hive.ql.log.PerfLogger - </PERFLOG method=get_all_functions start=1509302472964 end=1509302473136 duration=172 from=org.apache.hadoop.hive.metastore.RetryingHMSHandler threadId=0 retryCount=0 error=false>
2017-10-29 14:41:13,819 [Thread-8] INFO org.apache.hadoop.hive.ql.log.PerfLogger - <PERFLOG method=get_table from=org.apache.hadoop.hive.metastore.RetryingHMSHandler>
2017-10-29 14:41:13,820 [Thread-8] INFO org.apache.hadoop.hive.metastore.HiveMetaStore - 0: get_table : db=xyzdb tbl=testdata1
2017-10-29 14:41:13,820 [Thread-8] INFO org.apache.hadoop.hive.metastore.HiveMetaStore.audit - ugi=hive ip=unknown-ip-addr cmd=get_table : db=xyzdb tbl=testdata1
2017-10-29 14:41:13,841 [Thread-8] INFO org.apache.hadoop.hive.ql.log.PerfLogger - </PERFLOG method=get_table start=1509302473819 end=1509302473841 duration=22 from=org.apache.hadoop.hive.metastore.RetryingHMSHandler threadId=0 retryCount=-1 error=true>
Traceback (most recent call last):
  File "example4.py", line 11, in <module>
    gctbl = hive_context.sql("SELECT * FROM xyzdb.testdata1")
  File "/yarn/nm/usercache/hive/appcache/application_1509052489118_0076/container_1509052489118_0076_02_000001/pyspark.zip/pyspark/sql/context.py", line 580, in sql
  File "/yarn/nm/usercache/hive/appcache/application_1509052489118_0076/container_1509052489118_0076_02_000001/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
  File "/yarn/nm/usercache/hive/appcache/application_1509052489118_0076/container_1509052489118_0076_02_000001/pyspark.zip/pyspark/sql/utils.py", line 51, in deco
pyspark.sql.utils.AnalysisException: u'Table not found: `xyzdb`.`testdata1`; line 1 pos 22'
2017-10-29 14:41:13,948 [Driver] ERROR org.apache.spark.deploy.yarn.ApplicationMaster - User application exited with status 1
2017-10-29 14:41:13,949 [Driver] INFO org.apache.spark.deploy.yarn.ApplicationMaster - Final app status: FAILED, exitCode: 1, (reason: User application exited with status 1)
2017-10-29 14:41:13,951 [main] ERROR org.apache.spark.deploy.yarn.ApplicationMaster - Uncaught exception: org.apache.spark.SparkUserAppException: User application exited with 1
        at org.apache.spark.deploy.PythonRunner$.main(PythonRunner.scala:88)
        at org.apache.spark.deploy.PythonRunner.main(PythonRunner.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:552)
2017-10-29 14:41:13,954 [Thread-4] INFO org.apache.spark.SparkContext - Invoking stop() from shutdown hook
```
Labels:
- Apache Spark
- Cloudera Hue
10-27-2017
07:59 AM
Hi, I am getting an error while running a Python script using the Shell action in Hue/Oozie. My workflow XML is given below. Any ideas? Thanks.

```
from pyspark import SparkContext
ImportError: No module named pyspark
Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.ShellMain], exit code [1]
```

```xml
<workflow-app name="My Workflow" xmlns="uri:oozie:workflow:0.5">
    <start to="shell-8cca"/>
    <kill name="Kill">
        <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <action name="shell-8cca">
        <shell xmlns="uri:oozie:shell-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>oozie.launcher.mapred.child.env</name>
                    <value>PYTHONPATH=/usr/bin/python</value>
                </property>
                <property>
                    <name>oozie.launcher.mapred.child.env</name>
                    <value>PYSPARK_PYTHON=/usr/bin/pyspark</value>
                </property>
            </configuration>
            <exec>shexample7.sh</exec>
            <env-var>PYTHONPATH=/usr/bin/python</env-var>
            <env-var>PYSPARK_PYTHON=/usr/bin/pyspark</env-var>
            <file>/user/admin/shexample7.sh#shexample7.sh</file>
            <file>/user/admin/pyexample.py#pyexample.py</file>
            <capture-output/>
        </shell>
        <ok to="End"/>
        <error to="Kill"/>
    </action>
    <end name="End"/>
</workflow-app>
```
Labels:
- Apache Oozie
- Apache Spark
- Cloudera Hue
10-25-2017
06:50 AM
Hi, can someone give me a few basic steps on how to run a simple Python script in Hue using Oozie, such as reading a Hive table and writing the data to a CSV file? Some pointers like where to enter the Python code and how to run it through the Hue interface would help me get started. An example Python script would also be very helpful. Thanks.
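A minimal sketch of the kind of script being asked for. The CSV-writing half is plain Python; the Hive read (shown only in comments) would use PySpark's HiveContext on the cluster. The table name `xyzdb.testdata1`, the field names, and the output path are placeholders for illustration, not values from a real cluster.

```python
#!/usr/bin/env python
# Sketch: dump rows (e.g. fetched from a Hive table) to a CSV file.
import csv


def rows_to_csv(rows, fieldnames, path):
    """Write a list of dicts (one dict per row) to a CSV file with a header."""
    with open(path, "w") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


if __name__ == "__main__":
    # On the cluster, the rows would come from Hive via PySpark, e.g.:
    #   from pyspark import SparkContext
    #   from pyspark.sql import HiveContext
    #   sc = SparkContext(appName="hive-to-csv")
    #   rows = [r.asDict() for r in
    #           HiveContext(sc).sql("SELECT * FROM xyzdb.testdata1").collect()]
    rows = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]  # stand-in data
    rows_to_csv(rows, ["id", "name"], "out.csv")
```

The script would be uploaded to HDFS and referenced from the workflow, as discussed in the replies below.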
Labels:
- Apache Oozie
- Cloudera Hue
10-19-2017
07:32 AM
You are absolutely right! The fifth file, 000004_0, in the Parquet Hive table directory had one string that matched the row value in the Beeline SQL output. I am sure the other strings will also be in the data. So it finally confirms that everything is working fine without any errors in Hive. Thanks for all your help, as this really confused me, but I did learn a couple of new things, so thanks again!

```
$ parquet-tools cat 000004_0 | grep '(7256823 row(s) affected)'
triptype = (7256823 row(s) affected)
```