Member since: 02-09-2015
Posts: 95
Kudos Received: 8
Solutions: 9
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 5497 | 08-23-2021 04:07 PM
 | 1472 | 06-30-2021 07:34 AM
 | 1769 | 06-30-2021 07:26 AM
 | 14138 | 05-17-2019 10:27 PM
 | 3108 | 04-08-2019 01:00 PM
12-31-2015
12:51 AM
I guess the problem is the following configuration:

spoolDir.sources.src-1.batchSize = 100000
spoolDir.channels.channel-1.transactionCapacity = 50000000
spoolDir.channels.channel-1.capacity = 60000000
spoolDir.channels.channel-1.byteCapacityBufferPercentage = 20
spoolDir.channels.channel-1.byteCapacity = 6442450944
spoolDir.sinks.sink_to_hdfs1.hdfs.batchSize = 100000
spoolDir.sinks.sink_to_hdfs1.hdfs.rollInterval = 0
spoolDir.sinks.sink_to_hdfs1.hdfs.rollSize = 0
spoolDir.sinks.sink_to_hdfs1.hdfs.rollCount = 0

This happened to me before; the problem is that the channel capacity got fully loaded. I suggest the following edit, and also pay attention to the description of each attribute:

spoolDir.sources.src-1.batchSize = 100000   # Number of messages to consume in one batch
spoolDir.channels.channel-1.transactionCapacity = 60000000   ## EDIT
spoolDir.channels.channel-1.capacity = 60000000   ## EDIT
spoolDir.sinks.sink_to_hdfs1.hdfs.batchSize = 100000   # The max number of lines to read and send to the channel at a time
spoolDir.sinks.sink_to_hdfs1.hdfs.rollInterval = 0   # Number of seconds to wait before rolling the current file (0 = never roll based on time interval)
spoolDir.sinks.sink_to_hdfs1.hdfs.rollSize = 0   # File size to trigger a roll, in bytes (0 = never roll based on file size)
spoolDir.sinks.sink_to_hdfs1.hdfs.rollCount = 0   # Number of events written to a file before it is rolled (0 = never roll based on number of events)

Hope it works fine, good luck.
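One sizing rule worth keeping in mind (this is my general understanding of Flume channel sizing, so treat it as an assumption rather than something verified on this exact setup):

# a source/sink batch must fit inside one channel transaction,
# and a transaction must fit inside the channel:
#   batchSize  <=  transactionCapacity  <=  capacity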
05-20-2015
12:20 AM
I am able to do a fuzzy search on a single word in Solr. For example, the paragraph indexed in Solr is: "my name is tarek abouzeid and i am testting the big data technology". I can apply a fuzzy search on the word "testing" like this: testing~0.6 ==> I get the paragraph back normally. But what if I want to apply a fuzzy search on the phrase "tarek abouzeid" with fuzzy value 0.6? Thanks in advance.
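One workaround I would try (a sketch only; the field name text is a placeholder, and this drops the requirement that the two words are adjacent, since standard Lucene syntax does not allow ~ inside a quoted phrase) is to apply the fuzziness per term and combine them:

text:(tarek~0.6 AND abouzeid~0.6)

Later Solr releases also ship a ComplexPhraseQueryParser that supports fuzzy terms inside a phrase, but I have not checked whether it is available on the Solr 4.x version used here.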
Labels:
- Apache Solr
03-24-2015
06:54 AM
I am using Hue 3.7 and CDH 5.3.1. I created a schema on Solr using Hue and want to edit this schema. I searched for the schema.xml file but couldn't find it at all; I also tried looking in the ZooKeeper configuration in Cloudera Manager but didn't find anything there either. So how can I edit an existing schema in Solr?
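In case it helps, on CDH the collection configuration normally lives in ZooKeeper and is managed with solrctl. A sketch of how editing it could look (the collection name myCollection and the working directory /tmp/myCollection are placeholders, and I have not verified this against a Hue-created collection):

solrctl instancedir --get myCollection /tmp/myCollection      # download the config set from ZooKeeper
vi /tmp/myCollection/conf/schema.xml                          # edit the schema locally
solrctl instancedir --update myCollection /tmp/myCollection   # upload the edited config set
solrctl collection --reload myCollection                      # reload the collection so it picks up the change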
03-23-2015
04:22 AM
I found a command line which takes the files in a directory and recursively indexes them:

java -classpath /opt/cloudera/parcels/CDH/lib/solr/solr-core-4.4.0-cdh5.3.1.jar -Dauto=yes -Dc=testing -Ddata=files -Drecursive=yes org.apache.solr.util.SimplePostTool mydata/

But I got an error:

SimplePostTool version 1.5
Posting files to base url http://localhost:8983/solr/update..
Entering auto mode. File endings considered are xml,json,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
Entering recursive mode, max depth=999, delay=0s
Indexing directory mydata (9 files, depth=0)
POSTing file Word_Count_input - Copy (4).txt (text/plain)
SimplePostTool: WARNING: Solr returned an error #404 Not Found

It doesn't commit the changes either, so nothing is written to Solr.
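One thing I would check (an assumption on my side, not verified on this setup): the tool is posting to the base URL http://localhost:8983/solr/update, which contains no collection name, and SimplePostTool accepts a url system property to override that target. Something like the following, where testing is the collection created earlier:

java -classpath /opt/cloudera/parcels/CDH/lib/solr/solr-core-4.4.0-cdh5.3.1.jar \
  -Dauto=yes -Ddata=files -Drecursive=yes \
  -Durl=http://localhost:8983/solr/testing/update \
  org.apache.solr.util.SimplePostTool mydata/

If the 404 is caused by the missing collection name in the URL, this should also let the commit go through.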
03-23-2015
02:07 AM
I am using Solr version 4.4 on CDH 5.3.1, and was wondering if it is possible to insert an "unstructured" log file into Solr and search for specific words in this text. Is it possible, given that I don't have a schema for the file and it is just a text file? And if yes, how can that be done using Cloudera Manager to configure Solr accordingly?
03-18-2015
03:44 AM
I am using Apache Spark to collect tweets with twitter4j and want to save the data into a MySQL database. I created a table in the MySQL DB with the columns ID, CreatedAt, Source, Text and GeoLocation. Here is the code I used (I modified a twitter4j example):

import org.apache.spark._
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.JdbcRDD
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming._
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.twitter._
import org.apache.spark.util.IntParam
import java.sql.{Connection, DriverManager, ResultSet}

object TwitterPopularTags {
  def main(args: Array[String]) {
    val filters = args.takeRight(args.length)
    System.setProperty("twitter4j.oauth.consumerKey", "H2XXXXXX")
    System.setProperty("twitter4j.oauth.consumerSecret", "WjOXXXXXXXXX")
    System.setProperty("twitter4j.oauth.accessToken", "22XXX")
    System.setProperty("twitter4j.oauth.accessTokenSecret", "vRXXXXXX")

    val url = "jdbc:mysql://192.168.4.45:3306/twitter"
    val username = "root"
    val password = "123456"
    Class.forName("com.mysql.jdbc.Driver").newInstance

    val sparkConf = new SparkConf().setAppName("TwitterPopularTags")
    val ssc = new StreamingContext(sparkConf, Seconds(2))
    val stream = TwitterUtils.createStream(ssc, None, filters)
    println("CURRENT CONFIGURATION:" + System.getProperties().get("twitter4j.oauth.accessTokenSecret"))

    val tweets_toprint = stream.map(tuple => "%s,%s,%s,%s,%s".format(
      tuple.getId, tuple.getCreatedAt, tuple.getSource,
      tuple.getText.toLowerCase.replaceAll(",", " "), tuple.getGeoLocation)).print

    val tweets = stream.foreachRDD { rdd =>
      rdd.foreachPartition { it =>
        val conn = DriverManager.getConnection(url, username, password)
        val del = conn.prepareStatement("INSERT INTO tweets (ID,CreatedAt,Source,Text,GeoLocation) VALUES (?,?,?,?,?)")
        for (tuple <- it) {
          del.setLong(1, tuple.getId)
          del.setString(2, tuple.getCreatedAt.toString)
          del.setString(3, tuple.getSource)
          del.setString(4, tuple.getText)
          del.setString(5, tuple.getGeoLocation.toString)
          del.executeUpdate
        }
        conn.close()
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}

I can submit the job on Spark normally and it prints out some tweets according to my filter, but I can't write the data into the DB and I get a NullPointerException when a tweet is received. Here it is:

15/03/18 13:18:17 WARN TaskSetManager: Lost task 0.0 in stage 7.0 (TID 82, node2.com): java.lang.NullPointerException
    at TwitterPopularTags$$anonfun$2$$anonfun$apply$1$$anonfun$apply$2.apply(PopularHashTags.scala:55)
    at TwitterPopularTags$$anonfun$2$$anonfun$apply$1$$anonfun$apply$2.apply(PopularHashTags.scala:50)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at org.apache.spark.util.NextIterator.foreach(NextIterator.scala:21)
    at TwitterPopularTags$$anonfun$2$$anonfun$apply$1.apply(PopularHashTags.scala:50)
    at TwitterPopularTags$$anonfun$2$$anonfun$apply$1.apply(PopularHashTags.scala:47)
    at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:773)
    at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:773)
    at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314)
    at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.scheduler.Task.run(Task.scala:56)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)

And I get this error when no tweets are received:

15/03/18 13:18:18 ERROR JobScheduler: Error running job streaming job 1426673896000 ms.1
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 7.0 failed 4 times, most recent failure: Lost task 0.3 in stage 7.0 (TID 85, node2.com): java.lang.NullPointerException
    at TwitterPopularTags$$anonfun$2$$anonfun$apply$1$$anonfun$apply$2.apply(PopularHashTags.scala:55)
    at TwitterPopularTags$$anonfun$2$$anonfun$apply$1$$anonfun$apply$2.apply(PopularHashTags.scala:50)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at org.apache.spark.util.NextIterator.foreach(NextIterator.scala:21)
    at TwitterPopularTags$$anonfun$2$$anonfun$apply$1.apply(PopularHashTags.scala:50)
    at TwitterPopularTags$$anonfun$2$$anonfun$apply$1.apply(PopularHashTags.scala:47)
    at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:773)
    at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:773)
    at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314)
    at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.scheduler.Task.run(Task.scala:56)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)
Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1202)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:696)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1420)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
    at akka.actor.ActorCell.invoke(ActorCell.scala:456)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
    at akka.dispatch.Mailbox.run(Mailbox.scala:219)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

I guess the second error is normal, since the job is aborted if no tweets are received, but I don't know what I am doing wrong to get the first error when a tweet is received. I managed to insert an RDD into the MySQL DB normally in the spark-shell. Thanks in advance.
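For anyone hitting the same trace: PopularHashTags.scala:55 points into the insert loop, and my guess (an assumption, not something the logs confirm on their own) is that one of the tweet fields is null, most likely getGeoLocation for tweets without a location, so calling .toString on it throws the NPE. A minimal sketch of a null-safe version of that line:

// Assumption: getGeoLocation can be null for some tweets,
// so wrap the nullable value in Option before converting it to a string.
val geo = Option(tuple.getGeoLocation).map(_.toString).getOrElse("")
del.setString(5, geo)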
03-18-2015
03:25 AM
I managed to insert an RDD into the MySQL database! Thanks so much. Here's a sample code if anyone needs it (url, username and password are defined as before; one connection is opened per partition):

import java.sql.DriverManager

val r = sc.makeRDD(1 to 4)
r.foreachPartition { it =>
  val conn = DriverManager.getConnection(url, username, password)
  val del = conn.prepareStatement("INSERT INTO tweets (ID,Text) VALUES (?,?)")
  for (bookTitle <- it) {
    del.setString(1, bookTitle.toString)
    del.setString(2, "my input")
    del.executeUpdate
  }
  conn.close()
}
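A possible refinement for the bulk insertion asked about below (only a sketch; it assumes the same url, username, password and tweets table as above) is to use JDBC batching, so each partition is sent in one round trip instead of one executeUpdate per row:

r.foreachPartition { it =>
  val conn = DriverManager.getConnection(url, username, password)
  val stmt = conn.prepareStatement("INSERT INTO tweets (ID,Text) VALUES (?,?)")
  for (x <- it) {
    stmt.setString(1, x.toString)
    stmt.setString(2, "my input")
    stmt.addBatch()      // queue the row instead of executing it immediately
  }
  stmt.executeBatch()    // send all queued rows to MySQL in one batch
  conn.close()
}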
03-16-2015
04:53 AM
Okay, great! Thanks so much. But could you provide an example for it, please?
03-16-2015
04:46 AM
I am working on the Spark Streaming word count example. Is it possible to store the output RDD into a MySQL database using bulk insertion over JDBC? And if it is possible, are there any examples for it? Thanks in advance.
Labels:
- Apache Spark
03-15-2015
09:37 AM
I fixed that problem by making Spark listen on node02 and having Flume send events to node02; that actually fixed it. Thanks so much for your help.