Created 12-27-2016 02:41 PM
Hi All,
While trying to process a CSV dataset (from here: Link) in Pig, I'm getting the error below. There seems to be a delimiter problem in the file. If I create the same file manually, the data loads properly.
Pig Script:
A = LOAD 's3a://byr-heor-test/dev1/BJsales.csv' using PigStorage(',') as (Num:Int,time:int,BJsales:int)
Output:
.. ..
(149,149,262)
(150,150,262)
2016-12-27 09:31:35,632 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Encountered " <PATH> "2 "" at line 3, column 8. Was expecting one of:
Created 02-01-2017 03:21 AM
The recommendations from my colleagues are valid: the header row of your CSV document contains strings. You can certainly filter the header out by some known value, but there is a more capable CSV loader for Pig called CSVExcelStorage. It is part of the Piggybank library that comes bundled with HDP, hence the register command. You can pass different control parameters to it. The Mortar blog is an excellent source of information on working with Pig: http://help.mortardata.com/technologies/pig/csv.
grunt> register /usr/hdp/current/pig-client/piggybank.jar;
grunt> a = load 'BJsales.csv' using org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'NOCHANGE', 'SKIP_INPUT_HEADER') as (Num:Int,time:int,BJsales:float);
grunt> describe a;
a: {Num: int,time: int,BJsales: float}
grunt> b = limit a 5;
grunt> dump b;
output
(1,1,200.1)
(2,2,199.5)
(3,3,199.4)
(4,4,198.9)
(5,5,199.0)
Notice that I am not filtering any relation. I'm telling the loader to skip the header outright, which saves a few keystrokes and doesn't waste any cycles processing anything extra.
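For comparison, the filtering alternative mentioned above might look roughly like the sketch below. It is only an illustration under assumptions, not something from the original file: it assumes the header row has a non-numeric value in the second column, that the first column is wrapped in double quotes, and the relation names (raw, rows, typed) are made up for the example.

```pig
-- Load every field as chararray so the header row and quoted values load without cast errors
raw = LOAD 'BJsales.csv' USING PigStorage(',') AS (num:chararray, time:chararray, bjsales:chararray);

-- Assumption: the header has text in the second column, so casting it to int yields null
rows = FILTER raw BY (int)time IS NOT NULL;

-- Assumption: the first column is quoted; strip the double quotes before casting
typed = FOREACH rows GENERATE (int)REPLACE(num, '"', '') AS num, (int)time AS time, (float)bjsales AS bjsales;

DESCRIBE typed;
```

This needs an extra FILTER and FOREACH pass over the data, which is the overhead that SKIP_INPUT_HEADER avoids.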
Created 12-27-2016 03:06 PM
Looking at the BJsales.csv file, it seems the first column is a string type. Make sure to use the proper datatypes. Also remove any empty rows at the end of the file.
Created 12-27-2016 03:10 PM
Every field is an integer or a float here, so I gave int to all of them.
Created on 12-27-2016 06:29 PM - edited 08-18-2019 03:30 AM
To add to @milind pandit's point, I tried opening the AirPassengers file. The first column is enclosed in quotes. This is the same for BJsales.csv as well.