Member since: 05-15-2017
Posts: 86
Kudos Received: 12
Solutions: 4
My Accepted Solutions
Title | Views | Posted
---|---|---
| 13466 | 06-13-2017 12:53 AM
| 4614 | 06-03-2017 03:47 PM
| 2440 | 05-16-2017 08:00 PM
| 2048 | 02-04-2016 02:50 AM
05-23-2017
10:52 PM
Hi Friends, I have a question regarding Pig's fragment-replicate join. Could you please let me know whether my understanding below is correct? I have two files, File A (400 MB) and File B (50 MB). When I join them using the keyword "replicated", the small file, i.e. File B, is loaded into memory (because it is the small file). During the join with File A (which is 400 MB and distributed across the Hadoop cluster in 4 blocks: 3 × 128 MB plus 16 MB), Pig loads only part of File A (one block at a time) into memory to join with File B, and once that block is done it loads the next one, and so on. Thanks, Satish.
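For reference, a minimal sketch of the join I am describing (the paths and schemas are placeholders):

-- File A: the large relation (~400 MB), streamed by the map tasks
A = LOAD '/data/fileA' USING PigStorage(',') AS (k:chararray, v1:chararray);
-- File B: the small relation (~50 MB), replicated into memory on each map task
B = LOAD '/data/fileB' USING PigStorage(',') AS (k:chararray, v2:chararray);
-- in a replicated join the smaller relation must be listed last
J = JOIN A BY k, B BY k USING 'replicated';
DUMP J;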
Labels:
- Apache Pig
05-23-2017
07:58 PM
Thanks, Greg. Do I need to just reference the path when registering it in the script, or should I move the jar onto HDFS?
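For example, would something like this be enough? (The local path below is an assumption; I am not sure where the jar lives on HDP 2.6.)

-- hypothetical local path; the actual location depends on the HDP install
REGISTER /usr/hdp/current/pig-client/piggybank.jar;
DEFINE CSVExcelStorage org.apache.pig.piggybank.storage.CSVExcelStorage();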
05-23-2017
06:04 PM
Thanks @Lester Martin for the explanation. But we can achieve the same result without using a nested foreach, right? (Please see the sample code below, which avoids it.) When exactly should we use the nested foreach? (My guess at a case that genuinely needs the nested form is sketched after the code.)

daily = LOAD '/user/satu/data.csv' USING PigStorage(',') AS (exchange, symbol); -- skip other fields
--dump daily;
dist = DISTINCT daily;
--dump dist;
grpd = GROUP dist BY exchange;
uniqcnt = FOREACH grpd GENERATE group, COUNT(dist.symbol);
dump uniqcnt;
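A hypothetical case where I think the nested form is actually required, e.g. keeping only the top two symbols per exchange (invented example, not tested):

grpd2 = GROUP daily BY exchange;
top2 = FOREACH grpd2 {
    -- ORDER and LIMIT on the per-group bag are only legal inside a nested foreach
    srt = ORDER daily BY symbol DESC;
    lim = LIMIT srt 2;
    GENERATE group, lim;
};
DUMP top2;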
05-22-2017
05:45 PM
Hi, I was going through the Pig book and found this explanation with this example:

--distinct_symbols.pig
daily = load 'NYSE_daily' as (exchange, symbol); -- not interested in other fields
grpd = group daily by exchange;
uniqcnt = foreach grpd {
    sym = daily.symbol;
    uniq_sym = distinct sym;
    generate group, COUNT(uniq_sym);
};

Explanation: "In this nested code, each record passed to foreach is handled one at a time. In the first line we see a syntax that we have not seen outside of foreach. In fact, sym = daily.symbol would not be legal outside of foreach. It is roughly equivalent to the top-level statement sym = foreach grpd generate daily.symbol, but it is not stated that way inside the foreach because it is not really another foreach. There is no relation for it to be associated with (that is, grpd is not defined here)."

My question: the explanation says each record passed to foreach is handled one at a time, so in that case how does it compute the count? To count, we need more than one record, but this explanation says one record is passed to the foreach at a time. Is my understanding correct?
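To make my question concrete, here is what I believe a single record of grpd looks like (invented sample data), so perhaps the COUNT runs over the bag carried inside that one record:

-- DESCRIBE grpd;
-- grpd: {group: bytearray, daily: {(exchange: bytearray, symbol: bytearray)}}
-- one record of grpd:
-- (NYSE, {(NYSE,IBM),(NYSE,GE),(NYSE,IBM)})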
Labels:
- Apache Pig
05-22-2017
01:23 AM
1 Kudo
Hi, I am using HDP 2.6 and am not able to find the path for Piggybank. Could you please let me know whether this HDP release includes piggybank, and if yes, where I can find it. Thanks, Satish.
Labels:
- Apache Pig
05-16-2017
08:00 PM
I was able to resolve this issue. I just started ZooKeeper and Oozie, and it's working now.
05-16-2017
07:42 PM
Hi All, I am trying to run a simple CREATE TABLE SQL query using the Hive editor in Hue, and the query keeps running. I aborted it after 10 minutes. Please let me know why it is taking so long. Note: when I created the database, it took a fraction of a second. Please let me know whether I need to do any other initial setup.

Query History:
20 minutes ago: CREATE TABLE test.EMP(e_id INT, name STRING, age INT, gender STRING, sal INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE
21 minutes ago: show DATABASES
21 minutes ago: Create DATABASE test
21 minutes ago: show DATABASES
22 minutes ago: CREATE TABLE EMP(e_id INT, name STRING, age INT, gender STRING, sal INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE

Query:
CREATE TABLE test.EMP(e_id INT, name STRING, age INT, gender STRING, sal INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;

Logs:
INFO : Compiling command(queryId=hive_20170516191919_aa55bf85-aa68-4297-8e0f-a13c43cecf38): CREATE TABLE test.EMP(e_id INT, name STRING, age INT,gender STRING, sal INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE
INFO : Semantic Analysis Completed
INFO : Returning Hive schema: Schema(fieldSchemas:null, properties:null)
INFO : Completed compiling command(queryId=hive_20170516191919_aa55bf85-aa68-4297-8e0f-a13c43cecf38); Time taken: 0.01 seconds
Bad status for request TFetchResultsReq(fetchType=1, operationHandle=TOperationHandle(hasResultSet=False, modifiedRowCount=None, operationType=0, operationId=THandleIdentifier(secret='ew&8\x00PN\xe3\x983\x9d\xf3\xb5\xd98\xca', guid='q\xc3\xd4\xfd\xab\xedN\x92\xa8\x98#4S%\xb2T')), orientation=0, maxRows=-1):
TFetchResultsResp(status=TStatus(errorCode=0, errorMessage='The operation has been closed and its log file /var/log/hive/operation_logs/4643f24d-54fe-466a-9bd6-e0491a39f1c7/71c3d4fd-abed-4e92-a898-23345325b254 has been removed.', sqlState=None, infoMessages=['*org.apache.hive.service.cli.HiveSQLException:The operation has been closed and its log file /var/log/hive/operation_logs/4643f24d-54fe-466a-9bd6-e0491a39f1c7/71c3d4fd-abed-4e92-a898-23345325b254 has been removed.:24:23', 'org.apache.hive.service.cli.operation.OperationManager:getOperationLogRowSet:OperationManager.java:311', 'org.apache.hive.service.cli.session.HiveSessionImpl:fetchResults:HiveSessionImpl.java:770', 'sun.reflect.GeneratedMethodAccessor11:invoke::-1', 'sun.reflect.DelegatingMethodAccessorImpl:invoke:DelegatingMethodAccessorImpl.java:43', 'java.lang.reflect.Method:invoke:Method.java:606', 'org.apache.hive.service.cli.session.HiveSessionProxy:invoke:HiveSessionProxy.java:78', 'org.apache.hive.service.cli.session.HiveSessionProxy:access$000:HiveSessionProxy.java:36', 'org.apache.hive.service.cli.session.HiveSessionProxy$1:run:HiveSessionProxy.java:63', 'java.security.AccessController:doPrivileged:AccessController.java:-2', 'javax.security.auth.Subject:doAs:Subject.java:415', 'org.apache.hadoop.security.UserGroupInformation:doAs:UserGroupInformation.java:1796', 'org.apache.hive.service.cli.session.HiveSessionProxy:invoke:HiveSessionProxy.java:59', 'com.sun.proxy.$Proxy29:fetchResults::-1', 'org.apache.hive.service.cli.CLIService:fetchResults:CLIService.java:462', 'org.apache.hive.service.cli.thrift.ThriftCLIService:FetchResults:ThriftCLIService.java:691', 'org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults:getResult:TCLIService.java:1553', 'org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults:getResult:TCLIService.java:1538', 'org.apache.thrift.ProcessFunction:process:ProcessFunction.java:39', 'org.apache.thrift.TBaseProcessor:process:TBaseProcessor.java:39', 'org.apache.hive.service.auth.TSetIpAddressProcessor:process:TSetIpAddressProcessor.java:56', 'org.apache.thrift.server.TThreadPoolServer$WorkerProcess:run:TThreadPoolServer.java:286', 'java.util.concurrent.ThreadPoolExecutor:runWorker:ThreadPoolExecutor.java:1145', 'java.util.concurrent.ThreadPoolExecutor$Worker:run:ThreadPoolExecutor.java:615', 'java.lang.Thread:run:Thread.java:745', '*java.io.IOException:Stream closed:30:6', 'java.io.BufferedReader:ensureOpen:BufferedReader.java:115', 'java.io.BufferedReader:readLine:BufferedReader.java:310', 'java.io.BufferedReader:readLine:BufferedReader.java:382', 'org.apache.hadoop.hive.ql.session.OperationLog$LogFile:readResults:OperationLog.java:199', 'org.apache.hadoop.hive.ql.session.OperationLog$LogFile:read:OperationLog.java:152', 'org.apache.hadoop.hive.ql.session.OperationLog:readOperationLog:OperationLog.java:114', 'org.apache.hive.service.cli.operation.OperationManager:getOperationLogRowSet:OperationManager.java:309'], statusCode=3), results=None, hasMoreRows=None)
Labels:
- Apache Hive
05-14-2017
12:23 PM
Hi, I wrote a small Pig script to load a simple data set into a relation and dump the data to the console. When I dump it, the first column name (the header field e_id) is not shown. Is there anything wrong with my script?

Data Set:
e_id,fname,mname,lname,age,gender,address,city,state,zip
1,John,m,Smith,35,M,,Princeton,NJ,08536
2,James,S,Clark,M,,Princeton,NJ,08536

Script:
A = LOAD '/user/cloudera/sat/data/sampledata1.csv' USING PigStorage(',')
    AS (e_id:int, fname:chararray, mname:chararray, lname:chararray, age:chararray, gender:chararray, address:chararray, city:chararray, state:chararray, zip:int);
dump A;

Output:
(,fname,mname,lname,age,gender,address,city,state,)
(1,John,m,Smith,35,M,,Princeton,NJ,8536)
(2,James,S,Clark,M,,Princeton,NJ,08536,)
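For what it's worth, the workaround I am considering is to filter out the header row after the load, since the header string 'e_id' cannot be cast to int and therefore loads as null (untested sketch):

A = LOAD '/user/cloudera/sat/data/sampledata1.csv' USING PigStorage(',')
    AS (e_id:int, fname:chararray, mname:chararray, lname:chararray, age:chararray, gender:chararray, address:chararray, city:chararray, state:chararray, zip:int);
-- drop the header row: its e_id field becomes null after the failed int cast
B = FILTER A BY e_id IS NOT NULL;
dump B;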
Labels:
- Apache Pig
05-13-2017
11:00 PM
Hi All, I have a requirement where I need to skip the file header on each load using Pig. Is there any way to skip the header row while processing, apart from using RANK? (One option I am considering is sketched below.) Thanks, Satish.
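One option I have read about but not yet verified on my cluster is piggybank's CSVExcelStorage with its SKIP_INPUT_HEADER option; a sketch, where the jar path, input path, and schema are assumptions:

REGISTER /usr/hdp/current/pig-client/piggybank.jar; -- path is an assumption
A = LOAD '/user/satish/data.csv'
    USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'UNIX', 'SKIP_INPUT_HEADER')
    AS (e_id:int, fname:chararray);
dump A;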
Labels:
- Apache Pig