How to remove header and footer from a CSV file in PIG

Expert Contributor

I have a CSV file that looks like this:

Report Name: XYZ
Report Time: 11/11/1111
Time Zone: (GMT+05:30) i
Last Completed
Last Completed Available Hour:
Report Aggregation: Daily
Report Filter:
Potential Incomplete Data: true
Rows: 1
GregorianDate AccountId AccountName Clicks Impressions Ctr AverageCpc Spend
10/15/2016 1234556 ABC
©2016 Microsoft Corporation. All rights reserved.

I need all of the header and footer lines removed so that only the column names and the actual data remain in the file. How do I do this in Pig? The file needs to be mapped to a Hive table, so it cannot stay in its current form.

1 ACCEPTED SOLUTION

Expert Contributor

@Simran Kaur - If the headers and trailers are static, you can eliminate them using PIG STREAM.

For example, once you load the file into a relation, you can stream it through tail to drop the first 9 lines, so that the output starts at line 10 (the column-name row in the sample above):

filterHeader = STREAM fullFile THROUGH `tail -n +10`;
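
The `+10` form is easy to misread, so here is a minimal shell sketch, independent of Pig, of what `tail -n +K` does. The `seq` input is only a stand-in for a 12-line file and the variable name `kept` is made up for illustration.

```shell
#!/bin/sh
# tail -n +K prints from line K onward, so the first K-1 lines are dropped.
# seq 1 12 stands in for a 12-line file whose lines are numbered 1..12.
kept=$(seq 1 12 | tail -n +10)

# Lines 10, 11 and 12 survive; lines 1 through 9 are gone.
printf '%s\n' "$kept"
```

So `tail -n +10` keeps line 10 itself, which is what you want when line 10 holds the column names.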

Hope this helps!!

REPLIES

Expert Contributor

That helps. What about the footer? Yes, headers and footers are static. @grajagopal

Expert Contributor

Also, I will be loading the file with a CSV loader, so is fullFile here defined as fullFile = LOAD 'Path_to_File' USING PigStorage(','); ?

Expert Contributor

That's correct. fullFile is the relation you load initially.

To drop the footer as well, try the following:

filterHeader = STREAM fullFile THROUGH `tail -n +10 | head -n -1`;

Then run

DUMP filterHeader;

to verify the result.
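
To check the pipeline outside Pig first, here is a rough shell sketch that rebuilds the report from the question in a temporary file and runs the same two commands over it. The names `sample` and `cleaned` are hypothetical, and the negative count in `head -n -1` is a GNU coreutils feature, so this assumes a Linux-style environment.

```shell
#!/bin/sh
# Hypothetical stand-in for the report file shown in the question:
# 9 header lines, the column-name row, one data row, one footer line.
sample=$(mktemp)
cat > "$sample" <<'EOF'
Report Name: XYZ
Report Time: 11/11/1111
Time Zone: (GMT+05:30) i
Last Completed
Last Completed Available Hour:
Report Aggregation: Daily
Report Filter:
Potential Incomplete Data: true
Rows: 1
GregorianDate AccountId AccountName Clicks Impressions Ctr AverageCpc Spend
10/15/2016 1234556 ABC
©2016 Microsoft Corporation. All rights reserved.
EOF

# tail -n +10 starts output at line 10 (the column names);
# head -n -1 (GNU coreutils) then drops the last line (the footer).
cleaned=$(tail -n +10 "$sample" | head -n -1)
printf '%s\n' "$cleaned"

rm -f "$sample"
```

Only the column-name row and the data row should remain. One caveat, as far as I understand Pig streaming: the command runs inside each task over that task's portion of the input, so the line counts only match when the whole file is handled by a single task, which is normally the case for a small report like this one.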

Expert Contributor

Thanks. Also, where can I find out exactly how tail and head work here? It looks a little confusing to me. Any good resources?

Expert Contributor

I can iterate over filterHeader with FOREACH, just as we usually do for a file loaded using PigStorage, right? There should be no difference?

Expert Contributor

Awesome! It worked perfectly. Thanks 🙂

Expert Contributor

Just DUMP the relation filterHeader to verify the result.

New Contributor

Hi, after filtering the file I am not able to load it into Hive. Please help.

Pig Stack Trace
---------------
ERROR 1002: Unable to store alias C

org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1002: Unable to store alias C
at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1647)
at org.apache.pig.PigServer.registerQuery(PigServer.java:587)
at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:1093)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:501)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:198)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:173)
at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:69)
at org.apache.pig.Main.run(Main.java:547)
at org.apache.pig.Main.main(Main.java:158)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: org.apache.pig.impl.plan.VisitorException: ERROR 0:
<line 51, column 0> Output Location Validation Failed for: 'haasbat0200_10215.dslam_dlm_table_nokia_test More info to follow:
Pig 'bytearray' type in column 0(0-based) cannot map to HCat 'STRING'type. Target filed must be of HCat type {BINARY}
at org.apache.pig.newplan.logical.rules.InputOutputFileValidator$InputOutputFileVisitor.visit(InputOutputFileValidator.java:75)
at org.apache.pig.newplan.logical.relational.LOStore.accept(LOStore.java:66)
at org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:64)
at org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:66)
at org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:66)
at org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:66)
at org.apache.pig.newplan.DepthFirstWalker.walk(DepthFirstWalker.java:53)
at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:52)
at org.apache.pig.newplan.logical.rules.InputOutputFileValidator.validate(InputOutputFileValidator.java:45)
at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.compile(HExecutionEngine.java:311)
at org.apache.pig.PigServer.compilePp(PigServer.java:1392)
at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1317)
at org.apache.pig.PigServer.execute(PigServer.java:1309)
at org.apache.pig.PigServer.access$400(PigServer.java:122)
at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1642)
... 14 more
Caused by: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 0: Pig 'bytearray' type in column 0(0-based) cannot map to HCat 'STRING'type. Target filed must be of HCat type {BINARY}
at org.apache.hive.hcatalog.pig.HCatBaseStorer.throwTypeMismatchException(HCatBaseStorer.java:602)
at org.apache.hive.hcatalog.pig.HCatBaseStorer.validateSchema(HCatBaseStorer.java:558)
at org.apache.hive.hcatalog.pig.HCatBaseStorer.doSchemaValidations(HCatBaseStorer.java:495)
at org.apache.hive.hcatalog.pig.HCatStorer.setStoreLocation(HCatStorer.java:201)
at org.apache.pig.newplan.logical.rules.InputOutputFileValidator$InputOutputFileVisitor.visit(InputOutputFileValidator.java:68)
... 28 more