04:12 AM
Thank you, gkeys. You are the best.
12:37 PM
I would like to know how we can consume Kafka topic messages using Pig. Which JAR files does it require? Any suggestions are welcome. Mohan.V
08:43 AM
I have been facing this issue for a long time. I tried to solve it but couldn't, so I need some expert advice. I am trying to load a sample tweets JSON file.
sample.json:
{"filter_level":"low","retweeted":false,"in_reply_to_screen_name":"FilmFan","truncated":false,"lang":"en","in_reply_to_status_id_str":null,"id":689085590822891521,"in_reply_to_user_id_str":"6048122","timestamp_ms":"1453125782100","in_reply_to_status_id":null,"created_at":"Mon Jan 18 14:03:02 +0000 2016","favorite_count":0,"place":null,"coordinates":null,"text":"@filmfan hey its time for you guys follow @acadgild To #AchieveMore and participate in contest Win Rs.500 worth vouchers","contributors":null,"geo":null,"entities":{"symbols":[],"urls":[],"hashtags":[{"text":"AchieveMore","indices":[56,68]}],"user_mentions":[{"id":6048122,"name":"Tanya","indices":[0,8],"screen_name":"FilmFan","id_str":"6048122"},{"id":2649945906,"name":"ACADGILD","indices":[42,51],"screen_name":"acadgild","id_str":"2649945906"}]},"is_quote_status":false,"source":"<a href=\"\" rel=\"nofollow\">TweetDeck<\/a>","favorited":false,"in_reply_to_user_id":6048122,"retweet_count":0,"id_str":"689085590822891521","user":{"location":"India ","default_profile":false,"profile_background_tile":false,"statuses_count":86548,"lang":"en","profile_link_color":"94D487","profile_banner_url":"","id":197865769,"following":null,"protected":false,"favourites_count":1002,"profile_text_color":"000000","verified":false,"description":"Proud Indian, Digital Marketing Consultant,Traveler, Foodie, Adventurer, Data Architect, Movie Lover, Namo Fan","contributors_enabled":false,"profile_sidebar_border_color":"000000","name":"Bahubali","profile_background_color":"000000","created_at":"Sat Oct 02 17:41:02 +0000 2010","default_profile_image":false,"followers_count":4467,"profile_image_url_https":"","geo_enabled":true,"profile_background_image_url":"","profile_background_image_url_https":"","follow_request_sent":null,"url":null,"utc_offset":19800,"time_zone":"Chennai","notifications":null,"profile_use_background_image":false,"friends_count":810,"profile_sidebar_fill_color":"000000","screen_name":"Ashok_Uppuluri","id_str":"197865769","profile_image_url":"","listed_count":50,"is_translator":false}}
I tried to load this JSON file using Elephant Bird. Script:
REGISTER json-simple-1.1.1.jar
REGISTER elephant-bird-2.2.3.jar
REGISTER guava-11.0.2.jar
REGISTER avro-1.7.7.jar
REGISTER piggybank-0.12.0.jar
twitter = LOAD 'sample.json' USING com.twitter.elephantbird.pig.load.JsonLoader();
B = foreach twitter generate (chararray)$0#'created_at' as created_at,(chararray)$0#'id' as id,(chararray)$0#'id_str' as id_str,(chararray)$0#'text' as text,(chararray)$0#'source' as source,com.twitter.elephantbird.pig.piggybank.JsonStringToMap($0#'entities') as entities,(boolean)$0#'favorited' as favorited;
describe B;
Output:
B: {created_at: chararray,id: chararray,id_str: chararray,text: chararray,source: chararray,entities: map[chararray],favorited: boolean}
But when I tried to DUMP B, the following error occurred:
ERROR - ERROR 1066: Unable to open iterator for alias B
I am providing the complete logs here.
2016-09-11 14:07:57,184 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
2016-09-11 14:07:57,184 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
2016-09-11 14:07:57,194 [main] INFO org.apache.hadoop.metrics.jvm.JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
2016-09-11 14:07:57,194 [main] INFO - Pig script settings are added to the job
2016-09-11 14:07:57,194 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2016-09-11 14:07:57,199 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
2016-09-11 14:07:57,199 [main] INFO - Key [pig.schematuple] is false, will not generate code.
2016-09-11 14:07:57,199 [main] INFO - Starting process to move generated code to distributed cacche
2016-09-11 14:07:57,199 [main] INFO - Distributed cache not supported or needed in local mode. Setting key [pig.schematuple.local.dir] with code temp directory: /tmp/1473583077199-0
2016-09-11 14:07:57,206 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
2016-09-11 14:07:57,207 [JobControl] INFO org.apache.hadoop.metrics.jvm.JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
2016-09-11 14:07:57,208 [JobControl] WARN org.apache.hadoop.mapreduce.JobResourceUploader - No job jar file set. User classes may not be found. See Job or Job#setJar(String).
2016-09-11 14:07:57,211 [JobControl] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2016-09-11 14:07:57,211 [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1
2016-09-11 14:07:57,212 [JobControl] INFO org.apache.hadoop.mapreduce.JobSubmitter - number of splits:1
2016-09-11 14:07:57,216 [JobControl] INFO org.apache.hadoop.mapreduce.JobSubmitter - Submitting tokens for job: job_local360376249_0009
2016-09-11 14:07:57,267 [JobControl] INFO org.apache.hadoop.mapreduce.Job - The url to track the job: http://localhost:8080/
2016-09-11 14:07:57,267 [Thread-214] INFO org.apache.hadoop.mapred.LocalJobRunner - OutputCommitter set in config null
2016-09-11 14:07:57,270 [Thread-214] INFO org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter - File Output Committer Algorithm version is 1
2016-09-11 14:07:57,270 [Thread-214] INFO org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter - FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
2016-09-11 14:07:57,270 [Thread-214] INFO org.apache.hadoop.mapred.LocalJobRunner - OutputCommitter is org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputCommitter
2016-09-11 14:07:57,271 [Thread-214] INFO org.apache.hadoop.mapred.LocalJobRunner - Waiting for map tasks
2016-09-11 14:07:57,272 [LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.mapred.LocalJobRunner - Starting task: attempt_local360376249_0009_m_000000_0
2016-09-11 14:07:57,277 [LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter - File Output Committer Algorithm version is 1
2016-09-11 14:07:57,277 [LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter - FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
2016-09-11 14:07:57,277 [LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.mapred.Task - Using ResourceCalculatorProcessTree : [ ]
2016-09-11 14:07:57,278 [LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.mapred.MapTask - Processing split: Number of splits :1
Total Length = 2416
Input split[0]:
Length = 2416
ClassName: org.apache.hadoop.mapreduce.lib.input.FileSplit
2016-09-11 14:07:57,282 [LocalJobRunner Map Task Executor #0] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader - Current split being processed file:/root/PIG/PIG/sample.json:0+2416
2016-09-11 14:07:57,282 [LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter - File Output Committer Algorithm version is 1
2016-09-11 14:07:57,282 [LocalJobRunner Map Task Executor #0] INFO org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter - FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
2016-09-11 14:07:57,288 [LocalJobRunner Map Task Executor #0] INFO - Key [pig.schematuple] was not set... will not generate code.
2016-09-11 14:07:57,290 [LocalJobRunner Map Task Executor #0] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapOnly$Map - Aliases being processed per job phase (AliasName[line,offset]): M: twitter[20,10],B[21,4] C: R:
2016-09-11 14:07:57,291 [Thread-214] INFO org.apache.hadoop.mapred.LocalJobRunner - map task executor complete.
2016-09-11 14:07:57,296 [Thread-214] WARN org.apache.hadoop.mapred.LocalJobRunner - job_local360376249_0009
java.lang.Exception: java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.Counter, but class was expected
at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(
at org.apache.hadoop.mapred.LocalJobRunner$
Caused by: java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.Counter, but class was expected
at com.twitter.elephantbird.pig.util.PigCounterHelper.incrCounter(
at com.twitter.elephantbird.pig.load.LzoBaseLoadFunc.incrCounter(
at com.twitter.elephantbird.pig.load.JsonLoader.getNext(
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(
at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(
at org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(
at org.apache.hadoop.mapred.MapTask.runNewMapper(
at org.apache.hadoop.mapred.LocalJobRunner$Job$
at java.util.concurrent.Executors$
at java.util.concurrent.ThreadPoolExecutor.runWorker(
at java.util.concurrent.ThreadPoolExecutor$
2016-09-11 14:07:57,467 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_local360376249_0009
2016-09-11 14:07:57,467 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Processing aliases B,twitter
2016-09-11 14:07:57,467 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - detailed locations: M: twitter[20,10],B[21,4] C: R:
2016-09-11 14:07:57,468 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2016-09-11 14:07:57,468 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Ooops! Some job has failed! Specify -stop_on_failure if you want Pig to stop immediately on failure.
2016-09-11 14:07:57,468 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - job job_local360376249_0009 has failed! Stop running all dependent jobs
2016-09-11 14:07:57,468 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2016-09-11 14:07:57,469 [main] INFO org.apache.hadoop.metrics.jvm.JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
2016-09-11 14:07:57,469 [main] INFO org.apache.hadoop.metrics.jvm.JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
2016-09-11 14:07:57,469 [main] ERROR - 1 map reduce job(s) failed!
2016-09-11 14:07:57,470 [main] INFO - Script Statistics:
HadoopVersion PigVersion UserId StartedAt FinishedAt Features
14:07:57 2016-09-11 14:07:57 UNKNOWN
Failed Jobs:
job_local360376249_0009 B,twitter MAP_ONLY Message: Job failed! file:/tmp/temp252944192/tmp-470484503,
Failed to read data from "file:///root/PIG/PIG/sample.json"
Failed to produce result in "file:/tmp/temp252944192/tmp-470484503"
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
job_local360376249_0009
Please also clarify how to use the JAR files and which versions to use. There is so much confusion for me: some say to use Elephant Bird, and some say to use Avro. Please help. Mohan.V
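The IncompatibleClassChangeError on org.apache.hadoop.mapreduce.Counter in the trace above typically appears when a library compiled against Hadoop 1 (where Counter is a class) runs on Hadoop 2 (where it is an interface). elephant-bird-2.2.3 is an old build, so one thing to try is the Elephant Bird 4.x jars, which ship a separate hadoop-compat module for this purpose. A minimal sketch, assuming version 4.3 of all three Elephant Bird jars is available in the working directory (the jar file names below are an assumption, not taken from the failing script):

```pig
-- Assumption: these jar names and versions are illustrative; use whatever
-- matching 4.x set is actually installed, and keep all three on one version.
REGISTER json-simple-1.1.1.jar;
REGISTER elephant-bird-hadoop-compat-4.3.jar;
REGISTER elephant-bird-core-4.3.jar;
REGISTER elephant-bird-pig-4.3.jar;

-- Each input line becomes one tuple holding a single map of the JSON keys.
twitter = LOAD 'sample.json' USING com.twitter.elephantbird.pig.load.JsonLoader();

-- Project a few top-level string fields out of the map.
B = FOREACH twitter GENERATE
        (chararray)$0#'created_at' AS created_at,
        (chararray)$0#'id_str'     AS id_str,
        (chararray)$0#'text'       AS text;

DUMP B;
```

The avro and guava jars are left out here because this particular script does not call into them; that is a simplification for the sketch, not a statement that they are never needed.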
07:04 AM
I think I got it on my own. As gkeys said, I had made it too complex. I finally realized that I don't need the third step, the grouping, and the data is now stored into HBase successfully. Here is the script:
data = load 'sample.txt' using JsonLoader('pattern:chararray, tweets: bag {(tweet::created_at: chararray,tweet::id: chararray,tweet::user_id: chararray,tweet::text: chararray)}');
A = FILTER data BY pattern =='google_*';
STORE A into 'hbase://tablename' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('tweets:tweets');
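One thing worth double-checking in the script above: in Pig Latin, == compares against the literal string, so 'google_*' would only match a pattern whose value is exactly google_*. If the intent is to keep every pattern that starts with google_ (as in the sample input), MATCHES with a Java regular expression is the usual tool. A minimal sketch, assuming the same sample.txt and schema as above:

```pig
data = LOAD 'sample.txt' USING JsonLoader('pattern:chararray, tweets: bag {(tweet::created_at: chararray,tweet::id: chararray,tweet::user_id: chararray,tweet::text: chararray)}');

-- MATCHES must match the whole string, so '.*' is needed after the prefix.
google = FILTER data BY pattern MATCHES 'google_.*';

DUMP google;
```

Running DUMP on the filtered relation before the STORE is a cheap way to confirm that the filter keeps the expected rows.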
02:24 PM
Thank you for your valuable suggestions, gkeys. I didn't expect that it would become such a complex script. As I said, I am just a beginner in Pig, so please suggest a solution for the same.
08:12 AM
I am new to Pig. I am trying to filter a text file and store it in HBase. Here is the sample input file, sample.txt:
{"pattern":"google_1473491793_265244074740","tweets":[{"tweet::created_at":"18:47:31 ","tweet::id":"252479809098223616","tweet::user_id":"450990391","tweet::text":"rt @joey7barton: ..give a google about whether the americans wins a ryder cup. i mean surely he has slightly more important matters. #fami ..."}]}
{"pattern":"facebook_1473491793_265244074740","tweets":[{"tweet::created_at":"11:33:16 ","tweet::id":"252370526411051008","tweet::user_id":"845912316","tweet::text":"@maarionymcmb facebook mere ta dit tu va resté chez toi dnc tu restes !"}]}
Script:
data = load 'sample.txt' using JsonLoader('pattern:chararray, tweets: bag {t1:tuple(tweet::created_at: chararray,tweet::id: chararray,tweet::user_id: chararray,tweet::text: chararray)}');
A = FILTER data BY pattern == 'google_*';
grouped = foreach (group A by pattern){tweets1 = foreach data generate tweets.(created_at),tweets.(id),tweets.(user_id),tweets.(text); generate group as pattern1,tweets1;}
But I got this error when I ran grouped:
2016-09-10 13:38:52,995 [main] ERROR - ERROR 1200: Pig script failed to parse:
<line 41, column 57> expression is not a project expression: (Name: ScalarExpression) Type: null Uid: null)
Please correct me on what I am doing wrong. Thank you.
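The parse error most likely comes from "foreach data generate" inside the nested block. Within foreach (group A by pattern) { ... }, only group and the grouped bag A are in scope, so data is read as a reference to another relation used as a scalar, which is what "ScalarExpression ... is not a project expression" is complaining about. If the goal is just one row per tweet, grouping may not be needed at all. A minimal sketch, assuming the same sample.txt and schema as above and using FLATTEN instead of the nested foreach:

```pig
data = LOAD 'sample.txt' USING JsonLoader('pattern:chararray, tweets: bag {t1:tuple(tweet::created_at: chararray,tweet::id: chararray,tweet::user_id: chararray,tweet::text: chararray)}');

-- Regex match on the prefix; a plain == would compare against the literal 'google_*'.
A = FILTER data BY pattern MATCHES 'google_.*';

-- FLATTEN un-nests the bag: one output row per tweet, with pattern repeated.
flat = FOREACH A GENERATE pattern, FLATTEN(tweets);

-- Check the resulting field names before referring to them further down.
DESCRIBE flat;
```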
12:19 PM
I think I got it on my own. Actually, I had forgotten the credentials and entered the wrong password. It finally worked after I entered the right credentials.
11:54 AM
Thanks for your reply, jk. As you suggested, I tried to disable Kerberos, but I got stuck at the very first step: an admin session expiration error that asks for the admin principal and admin password. When I entered the credentials, it gave an error. Please suggest how to solve this. I am attaching a screenshot; please look into it.
10:22 AM
Hello everyone. I would like to disable Kerberos on my cluster completely, and I need expert guidance for that. What steps do I need to follow to avoid running into issues? Any suggestions are welcome.
03:15 AM
Thanks for your reply, Predrag Minovic. I tried using the Elephant Bird JsonLoader. Script:
REGISTER piggybank.jar
REGISTER json-simple-1.1.1.jar
REGISTER elephant-bird-pig-4.3.jar
REGISTER elephant-bird-core-4.1.jar
REGISTER elephant-bird-hadoop-compat-4.3.jar
json = LOAD 'sample.json' USING com.twitter.elephantbird.pig.load.JsonLoader('created_at:chararray, id:chararray, id_str:chararray, text:chararray, source:chararray, in_reply_to_status_id:chararray, in_reply_to_status_id_str:chararray, in_reply_to_user_id:chararray, in_reply_to_user_id_str:chararray, in_reply_to_screen_name:chararray, geo:chararray, coordinates:chararray, place:chararray, contributors:chararray, is_quote_status:bytearray, retweet_count:long, favorite_count:chararray, entities:map[], favorited:bytearray, retweeted:bytearray, possibly_sensitive:bytearray, lang:chararray');
describe json
Schema for json unknown.
Please advise.
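A possible explanation, offered as an assumption rather than a confirmed diagnosis: unlike Pig's built-in JsonLoader, the Elephant Bird JsonLoader does not take a schema string as its argument. It accepts option flags such as -nestedLoad and returns each record as a single map, so the schema has to be declared with an AS clause and the fields projected from the map afterwards. The script above also mixes elephant-bird-core-4.1 with 4.3 jars, and it may be safer to register one version throughout. A minimal sketch under those assumptions (the elephant-bird-core-4.3.jar file name is assumed, and tweet is a hypothetical field name):

```pig
REGISTER json-simple-1.1.1.jar;
REGISTER elephant-bird-hadoop-compat-4.3.jar;
REGISTER elephant-bird-core-4.3.jar;   -- assumption: same version as the other two jars
REGISTER elephant-bird-pig-4.3.jar;

-- '-nestedLoad' asks the loader to keep nested objects as maps;
-- the AS clause gives DESCRIBE a schema to report.
json = LOAD 'sample.json'
       USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad')
       AS (tweet:map[]);
DESCRIBE json;

-- Pull individual string fields out of the map and name them.
tweets = FOREACH json GENERATE
             (chararray)tweet#'created_at' AS created_at,
             (chararray)tweet#'id_str'     AS id_str,
             (chararray)tweet#'text'       AS text,
             (chararray)tweet#'lang'       AS lang;
DUMP tweets;
```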
