Member since: 06-15-2018
Posts: 9
Kudos Received: 0
Solutions: 0
11-15-2018
01:02 AM
I switched from Python to Scala, which is better supported since Spark itself is written in Scala. At a remote friend's suggestion I kept trying, and found that the following works fine: so even though Structured Streaming is not officially supported, it does work.

scala> import org.apache.spark.sql.Encoders
scala> case class Amazon(EventId:String, DOCOMOEntitlementId:String, AmazonSubscriptionId:String, AmazonPlanId:String, DOCOMOUserId:String, MerchantAccountKey:String, ResellerKey:String, Status:String, CreatedDate:String, EndDate:String, ActivatedDate:String, FailedDate:String, ExpiryDate:String, LastUpdated:String, dateTimeStart:String, dateTimeEnd:String, referrerSource:String, reasonCode:String)
scala> val schema = Encoders.product[Amazon].schema
scala> val data = spark.readStream.schema(schema).csv("/user/ale/csv.csv").as[Amazon]
data: org.apache.spark.sql.Dataset[Amazon] = [EventId: string, DOCOMOEntitlementId: string ...
scala> data.isStreaming
res0: Boolean = true
scala> data.writeStream.outputMode("append").format("console")
res1: org.apache.spark.sql.streaming.DataStreamWriter[Amazon] = org.apache.spark.sql.streaming.DataStreamWriter@7dc56c0a
scala> res1.start()

On the other hand, using spark.readStream.format("kafka") .... leads to this issue:

java.lang.ClassNotFoundException: Failed to find data source: kafka. Please find packages at http://spark.apache.org/third-party-projects.html
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:635)
at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:159)
... 49 elided
Caused by: java.lang.ClassNotFoundException: kafka.DefaultSource
at scala.reflect.internal.util.AbstractFileClassLoader.findClass(AbstractFileClassLoader.scala:62)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$23$$anonfun$apply$15.apply(DataSource.scala:618)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$23$$anonfun$apply$15.apply(DataSource.scala:618)
at scala.util.Try$.apply(Try.scala:192)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$23.apply(DataSource.scala:618)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$23.apply(DataSource.scala:618)
at scala.util.Try.orElse(Try.scala:84)
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:618)
... 50 more

Are there any jars from Cloudera that can solve this issue? If there are no jars from Cloudera, which one is the best match?

Well, after a while I found that the library I need is https://repository.cloudera.com/cloudera/cloudera-repos/org/apache/spark/spark-sql-kafka-0-10_2.11/2.3.0.cloudera2/spark-sql-kafka-0-10_2.11-2.3.0.cloudera2.jar

With that, almost everything goes fine, but...

org.apache.kafka.common.errors.GroupAuthorizationException: Not authorized to access group: spark-kafka-source-707ab780-c71c-408b-80dd-be1960a03dd6-360506181-driver-0
18/11/15 10:51:27 WARN kafka010.KafkaOffsetReader: Error in attempt 2 getting Kafka offsets:
org.apache.kafka.common.errors.GroupAuthorizationException: Not authorized to access group: spark-kafka-source-707ab780-c71c-408b-80dd-be1960a03dd6-360506181-driver-1
18/11/15 10:51:28 WARN kafka010.KafkaOffsetReader: Error in attempt 3 getting Kafka offsets:
org.apache.kafka.common.errors.GroupAuthorizationException: Not authorized to access group: spark-kafka-source-707ab780-c71c-408b-80dd-be1960a03dd6-360506181-driver-2

This depends on how the Kafka source code works, and there is no option to avoid it. It is, of course, a problem when Kafka is under Sentry. Is there any option to get this working without disabling Sentry?
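For reference, the read side I'm testing looks roughly like the sketch below once that jar is on the driver classpath (for example via spark-shell --jars); the broker list, topic name and the security option are placeholders, not values from my cluster.

// Minimal sketch (assumption): Structured Streaming from Kafka with the
// spark-sql-kafka-0-10 jar on the classpath, e.g.
//   spark-shell --jars spark-sql-kafka-0-10_2.11-2.3.0.cloudera2.jar
// Broker list, topic name and the security option are placeholders.
import org.apache.spark.sql.functions.col

val kafkaStream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092") // placeholder brokers
  .option("subscribe", "events-topic")                            // placeholder topic
  .option("startingOffsets", "latest")
  .option("kafka.security.protocol", "SASL_PLAINTEXT")            // assumption: Kerberized brokers
  .load()

// Kafka exposes key/value as binary, so cast the value to string first.
val values = kafkaStream.select(col("value").cast("string").as("json"))

val query = values.writeStream
  .outputMode("append")
  .format("console")
  .start()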
11-04-2018
11:37 PM
Hello Manuel, I'm trying to use Structured Streaming with Python, but all the attempts I made were unsuccessful. Please read this post: https://community.hortonworks.com/articles/197922/spark-23-structured-streaming-integration-with-apa.html This error was the first one I got, so I attempted to change the jar to spark-sql-kafka-0-10_2.11-2.1.0.cloudera1.jar, and the error now is the one shown in the attached picture. There is no way to move a step forward from this point. So we started to migrate to Direct Stream; with that option we found other issues in Python as well, so now we are using Scala/Java code, but that is not a problem. If Structured Streaming can be made available, I think it would be a great option for all Cloudera users.
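For context, a minimal Direct Stream sketch in Scala based on the spark-streaming-kafka-0-10 API looks roughly like this (brokers, topic and group id are placeholders):

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

// Placeholder consumer configuration; real brokers, group id and security
// settings depend on the cluster.
val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker1:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "example-group",
  "auto.offset.reset" -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

// sc is the SparkContext already available in spark-shell.
val ssc = new StreamingContext(sc, Seconds(10))
val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  Subscribe[String, String](Array("events-topic"), kafkaParams)  // placeholder topic
)

// Print the key/value pairs of each micro-batch just to verify the stream works.
stream.map(record => (record.key, record.value)).print()
ssc.start()
ssc.awaitTermination()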
10-30-2018
02:33 AM
Hello all, does Cloudera support Kafka Direct Stream with Python? I'm facing some issues with missing methods.
Labels:
- Apache Kafka
- Apache Spark
08-23-2018
03:16 AM
There are HUGE tables with a lot of partitions; this case is not unique. Partitioning helps to address the slice of data that matters, and each partition contains a lot of data. Oh well... it's Big Data after all.
08-23-2018
02:36 AM
It's not a solution we can adopt in a production environment, where the same problem can occur again. Any other ideas?
08-21-2018
09:01 AM
I'm using Hadoop 2.6.0-cdh5.14.2 and Hive 1.1.0-cdh5.14.2. In this system there is a huge external table with 183K+ partitions, and the command:

0: jdbc:hive2://hiveserver2.hd.docomodigital.> drop table unifieddata_work.old__raw_ww_eventsjson

does not work: the metastore does not reply within 600 seconds and the task ends with an error. I attempted to delete the partitions using a range:

0: jdbc:hive2://hiveserver2.hd.docomodigital.> alter table unifieddata_work.old__raw_ww_eventsjson drop PARTITION (country='ae', year='2017', month='01', day>'29', hour > '00' );
INFO : Compiling command(queryId=hive_20180821140909_ba6c4bb0-d0de-4fd3-a5ec-47e217289c6b): alter table unifieddata_work.old__raw_ww_eventsjson drop PARTITION (country='ae', year='2017', month='01', day>'29', hour > '00' )
INFO : Semantic Analysis Completed
INFO : Returning Hive schema: Schema(fieldSchemas:null, properties:null)
INFO : Completed compiling command(queryId=hive_20180821140909_ba6c4bb0-d0de-4fd3-a5ec-47e217289c6b); Time taken: 0.612 seconds
INFO : Executing command(queryId=hive_20180821140909_ba6c4bb0-d0de-4fd3-a5ec-47e217289c6b): alter table unifieddata_work.old__raw_ww_eventsjson drop PARTITION (country='ae', year='2017', month='01', day>'29', hour > '00' )
INFO : Starting task [Stage-0:DDL] in serial mode
INFO : Dropped the partition country=ae/year=2017/month=01/day=30/hour=01
INFO : Dropped the partition country=ae/year=2017/month=01/day=30/hour=02
INFO : Dropped the partition country=ae/year=2017/month=01/day=30/hour=03
INFO : Dropped the partition country=ae/year=2017/month=01/day=30/hour=04
INFO : Dropped the partition country=ae/year=2017/month=01/day=30/hour=05
INFO : Dropped the partition country=ae/year=2017/month=01/day=30/hour=06
INFO : Dropped the partition country=ae/year=2017/month=01/day=30/hour=07
INFO : Dropped the partition country=ae/year=2017/month=01/day=30/hour=08
INFO : Dropped the partition country=ae/year=2017/month=01/day=30/hour=09
... TRUNCATED HERE ...

It works, but something bad happens to the metastore: the canary stops working. Any idea how to solve this issue? Is there an alternative way to delete such a big table?
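For completeness, the kind of batched clean-up I have in mind is sketched below: a small Scala snippet that only generates per-month ALTER TABLE ... DROP PARTITION statements, which can then be fed to beeline in chunks so each metastore call stays small. The country/year/month values and the output file name are illustrative, not taken from the real table layout.

// Sketch (assumption): generate one DROP PARTITION statement per month so the
// partitions can be removed in small batches before the final DROP TABLE.
val table     = "unifieddata_work.old__raw_ww_eventsjson"
val countries = Seq("ae")                       // placeholder country list
val years     = Seq("2017")                     // placeholder year list
val months    = (1 to 12).map(m => f"$m%02d")

val statements =
  for (country <- countries; year <- years; month <- months)
    yield s"ALTER TABLE $table DROP IF EXISTS PARTITION " +
          s"(country='$country', year='$year', month='$month');"

// Write one statement per line; the file can be split and run through
// beeline (beeline -f drop_partitions.sql) in as many chunks as needed.
import java.nio.file.{Files, Paths}
Files.write(Paths.get("drop_partitions.sql"),
  statements.mkString("\n").getBytes("UTF-8"))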
Labels:
- Apache Hive