
Impala - Pig Files - Parquet file?

Contributor

Hi experts,

I've created a script using Apache Pig to do some jobs on my data (which comes from a text file). After my script runs, I get a big list of files ("part-m-001", "part-m-002", ...). What I'm asking is: using Impala, is it possible to concatenate all the data into one table? The data follows a structured schema, so would Parquet files be a good option? Thanks!

1 ACCEPTED SOLUTION

Guru

Pig runs MapReduce under the covers, and this list of files is the output of a MapReduce job. You should also notice a 0-byte (empty) file named _SUCCESS at the top of the list; that is just a flag saying the job was a success.
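For instance, listing the output directory (the path below is a hypothetical placeholder for your own) will show the part-m-* files alongside the empty _SUCCESS marker:

# List the Pig/MapReduce output directory (placeholder path); you should see
# the part-m-* files plus the 0-byte _SUCCESS marker.
hdfs dfs -ls /user/yourname/pig_output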

The bottom line is that when you point your job or table at the parent directory holding these files, it simply sees the union of all the files together. So you can think of the parent directory, logically, as the "file" holding the data.

Thus, there is never a need to concatenate the files on Hadoop -- just point to the parent directory and treat it as the file.

So if you make a Hive or Impala table, just point it to the parent directory. If you load the data in a Pig script, just point to the parent directory. And so on. For example:
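A minimal sketch from the shell, assuming the Pig output is tab-delimited text (the directory, table names, and columns below are hypothetical placeholders -- adjust them to your own data):

# Create an external table over the whole output directory; Impala reads
# every part-m-* file under it as one table (placeholder names and path).
impala-shell -q "
CREATE EXTERNAL TABLE pig_output_text (
  id     INT,
  name   STRING,
  amount DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/user/yourname/pig_output';
"

# Since the data follows a fixed schema, Parquet is a reasonable choice; one
# way to convert is a CREATE TABLE ... AS SELECT into a Parquet-backed table.
impala-shell -q "
CREATE TABLE pig_output_parquet STORED AS PARQUET
AS SELECT * FROM pig_output_text;
"

Once the external table exists, a plain SELECT against it already sees all of the part-m-* files as one data set, so there is nothing to concatenate by hand.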

If you want to pull the data to an edge node, use the command hdfs dfs -getmerge <hdfsParentDir> <localPathAndName>, and it will combine all of the part-m-001, part-m-002, ... files into a single local file.
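For example (the paths below are hypothetical placeholders):

# Merge every part-m-* file under the output directory into one local file
# on the edge node; the empty _SUCCESS marker adds nothing to the contents.
hdfs dfs -getmerge /user/yourname/pig_output /tmp/pig_output_merged.txt

# Quick sanity check on the merged file.
head /tmp/pig_output_merged.txt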

If you want to pull it down to your local machine, use Ambari File Views: open the parent directory, click "+ Select All", and then click "Concatenate". That will concatenate all of the files into one and download it through your browser.

[Screenshot: Ambari File View showing the "+ Select All" and "Concatenate" options]

If this is what you are looking for, let me know by accepting the answer; else, let me know of any gaps.
