Created 04-09-2017 05:45 PM
Is it necessary to remove the header record when analyzing a CSV file?
When I checked the solution for the HDPCD practice exam, I noticed that the header record of the CSV file was not removed before the data was analyzed.
Should we remove the header record or not? Leaving it in may affect the final output and the record count.
How will this kind of solution be graded in the real exam?
Created 04-09-2017 06:33 PM
Hi @Anand Pawar
Of course you should not treat the header as data!
You should exclude the header when analyzing the data; only then will you get correct output. As you mentioned, leaving it in leads to misinterpretation and sometimes errors. This is easy to handle in whichever Hadoop tool you choose. If you are storing the data in a Hive table, use tblproperties("skip.header.line.count"="1"); to skip the header.
If you are using Pig, you can skip the first line while processing. In short, do not include the header when you analyze the data, although you can still store the file with its header. Hope this answers your question.
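To make the effect on the record count concrete, here is a minimal Python sketch. It is only an illustration outside Hadoop: the sample data, the column names, and the helper `load_records` are hypothetical, not part of the exam material or of any Hive or Pig API. It shows that counting the header inflates the record count by one and that the header cannot be parsed as a numeric value.

```python
import csv
import io

# Hypothetical sample data; the first line is the header record.
SAMPLE_CSV = "id,name,amount\n1,alice,10.5\n2,bob,20.0\n3,carol,7.25\n"


def load_records(text, skip_header=True):
    """Return the rows of a CSV string, optionally dropping the first line."""
    reader = csv.reader(io.StringIO(text))
    if skip_header:
        next(reader, None)  # discard the header; no error on empty input
    return list(reader)


with_header = load_records(SAMPLE_CSV, skip_header=False)
without_header = load_records(SAMPLE_CSV, skip_header=True)

print(len(with_header))     # header is counted as a record
print(len(without_header))  # data records only

# Aggregating a numeric column only works once the header is excluded;
# float("amount") would raise ValueError.
total = sum(float(row[2]) for row in without_header)
print(total)
```

The source string itself is left unchanged, which mirrors the advice above: keep the header in the stored file and skip it only at processing time.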
Created 04-10-2017 07:03 AM
Thank you for your answer @Bala Vignesh N V
So the final output should have a header before we store it to HDFS.
Please correct me if I am wrong.
Created 04-10-2017 07:17 AM
@Anand Pawar It's kind of tricky here. You can keep the header when storing the file in HDFS. While processing the data for analysis, remember that the file contains a header and that it must be skipped, or else it will cause errors. As mentioned above, if you set the skip-header table property, Hive will skip the header automatically when the table is queried. However, the base data underlying the Hive table will still contain the header, which can be used for any further processing. Simply put: when storing the data you can keep the header, but when processing the data you should exclude it. If you feel this answers your question, please accept the answer.
Created 04-10-2017 07:03 AM
@rich Rich and Team