Archives of Support Questions (Read Only)

This is an archived board for historical reference. Information and links may no longer be available or relevant

How to remove header and footer from a CSV file in PIG

Expert Contributor

I have a CSV file that looks like this:

Report Name: XYZ
Report Time: 11/11/1111
Time Zone: (GMT+05:30) i
Last Completed
Last Completed Available Hour:
Report Aggregation: Daily
Report Filter:
Potential Incomplete Data: true
Rows: 1
GregorianDate AccountId AccountName Clicks Impressions Ctr AverageCpc Spend
10/15/2016 1234556 ABC
©2016 Microsoft Corporation. All rights reserved.

I need all the header and footer lines removed so that only the actual data, with the column names, stays in this file. How do I do it in Pig? The file needs to be mapped to a Hive table, so I cannot leave it in this form.

1 ACCEPTED SOLUTION

Expert Contributor

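@Simran Kaur - If the headers and trailers are static, you can eliminate them using Pig's STREAM operator.

For example, once you load the file into a relation, you can stream it through tail. Since tail -n +10 starts its output at line 10, the following drops the first 9 lines (the report header in your sample) and keeps the column names: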

filterHeader = STREAM fullFile THROUGH `tail -n +10`;
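
To see what tail -n +N does before wiring it into Pig, you can try it in a plain shell first. This is only a minimal sketch: the sample rows are made up, and strip_header and make_sample are hypothetical helper names used for illustration, not part of Pig.

```shell
# Hypothetical helper: print stdin starting at line N (drops the first N-1 lines).
strip_header() {
  tail -n +"$1"
}

# Made-up sample: 9 report header lines, then the column row and one data row.
make_sample() {
  for i in 1 2 3 4 5 6 7 8 9; do
    echo "Report header line $i"
  done
  echo "GregorianDate,AccountId,AccountName"
  echo "10/15/2016,1234556,ABC"
}

# Output starts at line 10, so only the column row and the data row remain.
make_sample | strip_header 10
```

If your real header has a different number of lines, the value to pass is the line number of the first line you want to keep.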

Hope this helps!!


9 REPLIES 9

Expert Contributor

That helps. What about the footer? Yes, headers and footers are static. @grajagopal

Expert Contributor

Also, I am going to be loading the file through a CSV loader, so fullFile here would be fullFile = LOAD 'Path_to_File' USING PigStorage(','); is that right?

Expert Contributor

That's correct. Your full file is what you load initially.

To drop the footer line as well, try the following:

filterHeader = STREAM fullFile THROUGH `tail -n +10 | head -n -1`;

Then run

DUMP filterHeader;

to verify the result.
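
The same pipeline can be checked in a plain shell first. This is a rough sketch under a few assumptions: the sample rows are made up, the helper names are hypothetical, and head -n -1 (print everything except the last line) is a GNU coreutils feature, so a sed '$d' variant is shown as a more portable alternative.

```shell
# Hypothetical helpers for illustration; they are not part of Pig.

# GNU coreutils only: head -n -1 prints everything except the last line.
strip_header_footer_gnu() {
  tail -n +"$1" | head -n -1
}

# Portable alternative: sed '$d' deletes the last line of its input.
strip_header_footer_portable() {
  tail -n +"$1" | sed '$d'
}

# Made-up sample: 9 header lines, the column row, one data row, one footer line.
make_report() {
  for i in 1 2 3 4 5 6 7 8 9; do
    echo "Report header line $i"
  done
  echo "GregorianDate,AccountId,AccountName"
  echo "10/15/2016,1234556,ABC"
  echo "(c) 2016 footer line"
}

# Keeps only the column row and the data row.
make_report | strip_header_footer_gnu 10
```

One caveat, as far as I know: Pig launches the streamed command once per task, so this approach is safest when the whole file is handled by a single task. With an input split across several tasks, each one would apply tail and head to its own portion.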

Expert Contributor

Thanks. Also, where can I find out exactly how tail and head work here? It looks a little confusing to me. Any good resources?

Expert Contributor

I can iterate over filterHeader using FOREACH, just as we usually do for a file loaded with PigStorage, right? There should be no difference?

Expert Contributor

Awesome! It worked perfectly. Thanks 🙂

Expert Contributor

Just DUMP the relation filterHeader to verify the result.

avatar

Hi, after filtering the file I am not able to load it into Hive. Please help.

Pig Stack Trace
---------------
ERROR 1002: Unable to store alias C

org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1002: Unable to store alias C
at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1647)
at org.apache.pig.PigServer.registerQuery(PigServer.java:587)
at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:1093)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:501)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:198)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:173)
at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:69)
at org.apache.pig.Main.run(Main.java:547)
at org.apache.pig.Main.main(Main.java:158)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: org.apache.pig.impl.plan.VisitorException: ERROR 0:
<line 51, column 0> Output Location Validation Failed for: 'haasbat0200_10215.dslam_dlm_table_nokia_test More info to follow:
Pig 'bytearray' type in column 0(0-based) cannot map to HCat 'STRING'type. Target filed must be of HCat type {BINARY}
at org.apache.pig.newplan.logical.rules.InputOutputFileValidator$InputOutputFileVisitor.visit(InputOutputFileValidator.java:75)
at org.apache.pig.newplan.logical.relational.LOStore.accept(LOStore.java:66)
at org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:64)
at org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:66)
at org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:66)
at org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:66)
at org.apache.pig.newplan.DepthFirstWalker.walk(DepthFirstWalker.java:53)
at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:52)
at org.apache.pig.newplan.logical.rules.InputOutputFileValidator.validate(InputOutputFileValidator.java:45)
at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.compile(HExecutionEngine.java:311)
at org.apache.pig.PigServer.compilePp(PigServer.java:1392)
at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1317)
at org.apache.pig.PigServer.execute(PigServer.java:1309)
at org.apache.pig.PigServer.access$400(PigServer.java:122)
at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1642)
... 14 more
Caused by: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 0: Pig 'bytearray' type in column 0(0-based) cannot map to HCat 'STRING'type. Target filed must be of HCat type {BINARY}
at org.apache.hive.hcatalog.pig.HCatBaseStorer.throwTypeMismatchException(HCatBaseStorer.java:602)
at org.apache.hive.hcatalog.pig.HCatBaseStorer.validateSchema(HCatBaseStorer.java:558)
at org.apache.hive.hcatalog.pig.HCatBaseStorer.doSchemaValidations(HCatBaseStorer.java:495)
at org.apache.hive.hcatalog.pig.HCatStorer.setStoreLocation(HCatStorer.java:201)
at org.apache.pig.newplan.logical.rules.InputOutputFileValidator$InputOutputFileVisitor.visit(InputOutputFileValidator.java:68)
... 28 more