Created 10-16-2016 03:00 PM
I have a CSV file that looks like this:
Report Name: XYZ
Report Time: 11/11/1111
Time Zone: (GMT+05:30) i
Last Completed
Last Completed Available Hour:
Report Aggregation: Daily
Report Filter:
Potential Incomplete Data: true
Rows: 1
GregorianDate,AccountId,AccountName,Clicks,Impressions,Ctr,AverageCpc,Spend
10/15/2016,1234556,ABC,,,,,
©2016 Microsoft Corporation. All rights reserved.
I need the header and footer removed so that only the actual data, with the column names, stays in this file. How do I do that in Pig? This file will be mapped to a Hive table, so I cannot keep it in its current form.
Created 10-16-2016 03:14 PM
@Simran Kaur - If the headers and trailers are static, you can eliminate them using Pig's STREAM operator.
For example, once you load the file into a relation, you can stream it through `tail` to drop the nine metadata lines (output starts at line 10, the column-header row):
filterHeader = STREAM fullFile THROUGH `tail -n +10`;
Hope this helps!!
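Putting the two steps together, a minimal sketch (the input path is hypothetical; adjust it to your file's location):

```pig
-- hypothetical HDFS path; replace with your file's actual location
fullFile = LOAD '/data/report.csv' USING PigStorage(',');
-- start output at line 10, dropping the 9 metadata lines above the column header
filterHeader = STREAM fullFile THROUGH `tail -n +10`;
DUMP filterHeader;
```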
Created 10-16-2016 03:17 PM
That helps. What about the footer? Yes, headers and footers are static. @grajagopal
Created 10-16-2016 03:18 PM
Also, I am going to be loading the file through a CSV loader, so fullFile here is fullFile = LOAD 'Path_to_File' USING PigStorage(',') ?
Created 10-16-2016 03:22 PM
That's correct. fullFile is the relation you load initially.
Try the following:
filterHeader = STREAM fullFile THROUGH `tail -n +10 | head -n -1`;
and
DUMP filterHeader; to verify the result.
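For reference: `tail -n +K` prints a file starting at line K (dropping the first K-1 lines), and GNU `head -n -N` prints everything except the last N lines. A quick illustration on a small made-up sample file:

```shell
# build a 12-line sample: 9 metadata lines, a column-header row,
# one data row, and a footer line
printf 'meta%s\n' 1 2 3 4 5 6 7 8 9 > sample.txt
printf 'GregorianDate,Clicks\n10/15/2016,7\nfooter\n' >> sample.txt

# start at line 10 (skip the 9 metadata lines), then drop the last line
tail -n +10 sample.txt | head -n -1
```

This prints only the column-header row and the data row. Note that `head -n -1` (drop the last line) is a GNU coreutils extension; on BSD/macOS you could use `sed '$d'` instead.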
Created 10-16-2016 03:23 PM
Thanks. Also, where can I find out exactly how tail and head work here? It looks a little confusing to me. Any good resources?
Created 10-16-2016 03:26 PM
Can I iterate over filterHeader using FOREACH, as we usually do for a file loaded with PigStorage? There should be no difference, right?
Created 10-17-2016 05:58 AM
Awesome! It worked perfectly. Thanks 🙂
Created 10-16-2016 03:30 PM
Just DUMP the relation filterHeader to verify the result.
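Yes - the streamed relation behaves like any other relation, so FOREACH works on it as usual. A minimal sketch (field positions and names here are illustrative; note that by default STREAM re-serializes tuples tab-delimited, so DESCRIBE or DUMP the relation first to confirm its actual layout):

```pig
-- project fields positionally; the positions and aliases are illustrative
projected = FOREACH filterHeader GENERATE $0 AS gregorianDate, $3 AS clicks;
DUMP projected;
```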
Created 12-08-2017 06:44 PM
Hi, after filtering the file I am not able to load it into Hive. Please help.
Pig Stack Trace
---------------
ERROR 1002: Unable to store alias C
org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1002: Unable to store alias C
at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1647)
at org.apache.pig.PigServer.registerQuery(PigServer.java:587)
at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:1093)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:501)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:198)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:173)
at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:69)
at org.apache.pig.Main.run(Main.java:547)
at org.apache.pig.Main.main(Main.java:158)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: org.apache.pig.impl.plan.VisitorException: ERROR 0:
<line 51, column 0> Output Location Validation Failed for: 'haasbat0200_10215.dslam_dlm_table_nokia_test More info to follow:
Pig 'bytearray' type in column 0(0-based) cannot map to HCat 'STRING'type. Target filed must be of HCat type {BINARY}
at org.apache.pig.newplan.logical.rules.InputOutputFileValidator$InputOutputFileVisitor.visit(InputOutputFileValidator.java:75)
at org.apache.pig.newplan.logical.relational.LOStore.accept(LOStore.java:66)
at org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:64)
at org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:66)
at org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:66)
at org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:66)
at org.apache.pig.newplan.DepthFirstWalker.walk(DepthFirstWalker.java:53)
at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:52)
at org.apache.pig.newplan.logical.rules.InputOutputFileValidator.validate(InputOutputFileValidator.java:45)
at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.compile(HExecutionEngine.java:311)
at org.apache.pig.PigServer.compilePp(PigServer.java:1392)
at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1317)
at org.apache.pig.PigServer.execute(PigServer.java:1309)
at org.apache.pig.PigServer.access$400(PigServer.java:122)
at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1642)
... 14 more
Caused by: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 0: Pig 'bytearray' type in column 0(0-based) cannot map to HCat 'STRING'type. Target filed must be of HCat type {BINARY}
at org.apache.hive.hcatalog.pig.HCatBaseStorer.throwTypeMismatchException(HCatBaseStorer.java:602)
at org.apache.hive.hcatalog.pig.HCatBaseStorer.validateSchema(HCatBaseStorer.java:558)
at org.apache.hive.hcatalog.pig.HCatBaseStorer.doSchemaValidations(HCatBaseStorer.java:495)
at org.apache.hive.hcatalog.pig.HCatStorer.setStoreLocation(HCatStorer.java:201)
at org.apache.pig.newplan.logical.rules.InputOutputFileValidator$InputOutputFileVisitor.visit(InputOutputFileValidator.java:68)
... 28 more
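The trace says the store fails because column 0 of the Pig relation is an untyped bytearray, which HCatStorer will not map to a Hive STRING column. A common fix (a sketch; the relation name, field positions, and table name below are illustrative) is to cast each field to the matching Pig type before storing:

```pig
-- cast untyped bytearray fields to chararray so they map to Hive STRING columns
typed = FOREACH C GENERATE (chararray)$0, (chararray)$1, (chararray)$2;
STORE typed INTO 'your_db.your_table' USING org.apache.hive.hcatalog.pig.HCatStorer();
```

Repeat the cast for every column, matching each Hive column's type (for example `(int)` for INT, `(double)` for DOUBLE).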