Support Questions
Find answers, ask questions, and share your expertise
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

add date time to column from file name

add date time to column from file name

New Contributor

I started an EC2 cluster on amazon to install cloudera...I got it installed and configured and loaded some of the Wiki Page Views public snapshot into HDFS. The structure of the files are as such:

projectcode, pagename, pageviews, bytes

the files are named as such:

pagecounts-20090430-230000.gz
             date    time

when loading the data from HDFS to Impala, I do it as such:

CREATE EXTERNAL TABLE wikiPgvws
(
   project_code varchar(100),
   page_name varchar(1000),
   page_views int,
   page_bytes int
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
LOCATION '/user/hdfs';

one thing I missed is the date and time of each of the file. The dir:

/user/hdfs

contains multiple pagecount files associated with different dates and times. How can one pull that information and store it in a column when loading to impala?

2 REPLIES 2

Re: add date time to column from file name

New Contributor
anyone???

Re: add date time to column from file name

Rising Star

The only way I can recommend is to unzip the file , add the column, zip it and upload.

 

pagecounts-20090430-230000.gz -> pagecounts-20090430-230000

cat pagecounts-20090430-230000 | awk '{ print $0",pagecounts-20090430-230000.gz" }' > output

zip the output

upload the output

 

And of course, you have to script it to do for every file..