Support Questions


For the HDPCD exam, is it necessary to remove the header record when analyzing a CSV file?


Is it necessary to remove the header record when analyzing a CSV file?

When I checked the solution for the HDPCD practice exam, I noticed that the CSV header record had not been removed before the data was analyzed.

Should we remove the header record or not? Leaving it in may affect the final output and the record count.

How will this kind of solution be graded in the real exam?

1 ACCEPTED SOLUTION


Hi @Anand Pawar

Of course you should not consider the header!

When analyzing the data, you should remove the header; only then will you get the correct output. As you mentioned, leaving it in leads to misinterpretation and sometimes errors. This is easy to handle in whichever Hadoop tool you choose. If you are storing the data in a Hive table, use tblproperties("skip.header.line.count"="1"); to skip the header.
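For illustration, here is a minimal sketch of where that table property goes in a Hive table definition. The table name, columns, and HDFS location are hypothetical and not taken from the practice exam; only the skip.header.line.count property comes from the answer above.

```sql
-- Hypothetical table, columns, and HDFS path, for illustration only.
CREATE EXTERNAL TABLE flights_csv (
  flight_date STRING,
  carrier     STRING,
  dep_delay   INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/hypothetical/flights'
-- Tell Hive to skip the first line of each file when reading.
TBLPROPERTIES ("skip.header.line.count"="1");

-- With the property set, the header line should not be counted as a record.
SELECT COUNT(*) FROM flights_csv;
```

The CSV file in HDFS is not modified by this; the header is only skipped when Hive reads the table.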

If it is in Pig, you can skip the first line while processing it. You should definitely not include the header when you analyze the data, but you can still store the file with the header. Hope this answers your question.


4 REPLIES

Thank you for your answer @Bala Vignesh N V

So this means the final output should include the header before we store it in HDFS.

Please correct me if I am wrong.


@Anand Pawar It's a bit tricky. You can keep the header when storing the file in HDFS. When processing the data for analysis, remember that the file contains a header and skip it, or else it will cause errors. As mentioned above, if you set the skip-header table property, Hive skips the header automatically when reading the table. However, the underlying data file behind the Hive table still contains the header, which can be used in any further processing. In short: the stored file can have a header, but the data you process should not. If this answers your question, please accept the answer.


@rich Rich and Team