Member since: 05-02-2017
Posts: 360
Kudos Received: 65
Solutions: 22
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 13459 | 02-20-2018 12:33 PM
 | 1526 | 02-19-2018 05:12 AM
 | 1877 | 12-28-2017 06:13 AM
 | 7177 | 09-28-2017 09:25 AM
 | 12215 | 09-25-2017 11:19 AM
06-14-2017
03:23 PM
@Guillaume Roger I'm not sure whether my understanding of your reply is correct. If you have compound keys, there is a workaround to make this possible: load the data into a staging table with the compound key fields concatenated into one column, alongside the separate fields. On the staging table you can then define that concatenated column as the primary key and partition on the other fields that make up the compound key. A sketch is below.
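A minimal sketch of that staging pattern, assuming Hive 2.1+ (for informational primary key constraints) and hypothetical table/column names (raw_orders, orders_stage, cust_id, order_no, order_date):
-- Staging table: the concatenated compound key becomes a single
-- (informational) primary key column; one component field is the partition.
CREATE TABLE orders_stage (
  pk_concat STRING,
  cust_id   STRING,
  order_no  STRING,
  PRIMARY KEY (pk_concat) DISABLE NOVALIDATE
)
PARTITIONED BY (order_date STRING)
STORED AS ORC;
-- Dynamic partitioning is needed for the INSERT below.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT INTO TABLE orders_stage PARTITION (order_date)
SELECT concat(cust_id, '_', order_no), cust_id, order_no, order_date
FROM raw_orders;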
06-14-2017
11:45 AM
Hi @Guillaume Roger I don't think we can partition on the primary key column. To add to that: if you create partitions based on the primary key, each partition will hold exactly one record, so you end up with 'N' partitions for 'N' records. If you have 10K records, that many partitions on the primary key will be chaos (see the illustration below). Hope it helps!
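A hypothetical before/after to illustrate, with made-up table/column names (customers, id, signup_date):
-- Anti-pattern: partitioning on a unique key yields one HDFS
-- directory per row (10K rows -> 10K tiny partitions).
CREATE TABLE customers_by_id (name STRING)
PARTITIONED BY (id BIGINT);
-- Better: keep the key as a regular column and partition on a
-- low-cardinality field instead.
CREATE TABLE customers (id BIGINT, name STRING)
PARTITIONED BY (signup_date STRING);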
06-09-2017
02:49 PM
Nikkie Thomas "If I partition the data by a yyyy-mm-dd field and I receive only one file per day, I assume I will always have one file per partition irrespective of this setting?" --> It's not that simple, because it depends on the size of your input file, the block size, the number of mappers/reducers, and other variables. If your input file is smaller than the block size, then it should create only one file. But if you partition the table on a daily basis with that little data, over time it will cause performance issues, and there is not much partitioning can do about it. In that situation I would partition the table on a yearly basis, with buckets on a frequently used filter column; in your case that column could be on a daily/weekly/yearly basis. Still, each file in a bucketed folder will be small if the data size is small. A sketch is below.
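A minimal sketch of that layout, using hypothetical table/column names (server_logs, log_date, yr) and an illustrative bucket count:
-- Yearly partitions keep the partition count low; bucketing on a
-- frequently filtered column organizes data within each partition.
CREATE TABLE server_logs (
  log_ts   STRING,
  log_date STRING,  -- yyyy-mm-dd, a commonly filtered column
  message  STRING
)
PARTITIONED BY (yr INT)
CLUSTERED BY (log_date) INTO 16 BUCKETS
STORED AS ORC;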
06-09-2017
11:31 AM
1 Kudo
Hi Nikkie Thomas To control the number of files written to Hive tables, you can either set the number of mappers/reducers to 1 (when the job allows it), so the final output is always a single file, or enable the settings below to merge reducer output whose size is less than a block size.
- hive.merge.mapfiles: Merge small files at the end of a map-only job.
- hive.merge.mapredfiles: Merge small files at the end of a map-reduce job.
- hive.merge.size.per.task: Size of merged files at the end of the job.
- hive.merge.smallfiles.avgsize: When the average output file size of a job is less than this number, Hive will start an additional map-reduce job to merge the output files into bigger files. This is only done for map-only jobs if hive.merge.mapfiles is true, and for map-reduce jobs if hive.merge.mapredfiles is true.
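For example, these can be enabled per session like so (the size values, in bytes, are illustrative):
-- Merge small output files for map-only and map-reduce jobs.
SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;
-- Target size for the merged files (~256 MB here).
SET hive.merge.size.per.task=256000000;
-- Kick off a merge job when the average output file is below ~16 MB.
SET hive.merge.smallfiles.avgsize=16000000;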
06-08-2017
02:15 PM
Félicien Catherin Could you please share a screenshot of the error after executing this code? CREATE TABLE FIREWALL_LOGS(
time STRING,
ip STRING,
country STRING,
status INT
)
CLUSTERED BY (time) INTO 25 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS ORC
TBLPROPERTIES("transactional"="true");
06-08-2017
01:54 PM
CREATE TABLE FIREWALL_LOGS(
time STRING,
ip STRING,
country STRING,
status INT
)
CLUSTERED BY (time) INTO 25 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS ORC
LOCATION '/tmp/server-logs'
TBLPROPERTIES("transactional"="true"); I missed the LOCATION clause in the previous answer.
06-08-2017
01:52 PM
Félicien Catherin Please use the below DDL. CREATE TABLE FIREWALL_LOGS(
time STRING,
ip STRING,
country STRING,
status INT
)
CLUSTERED BY (time) INTO 25 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS ORC
TBLPROPERTIES("transactional"="true");
06-08-2017
12:04 PM
"2) HDFS client acts as a staging/intermediate layer for DN and NM." --> Does that mean whenever I copy a file from local to HDFS, the edge node acts as a staging layer through the HDFS client installed on it, and the worker nodes play no role at that stage? Is my understanding right?
06-08-2017
11:52 AM
Hi Félicien Catherin You have missed ROW FORMAT DELIMITED. Please add the following to your DDL: ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' It should work. I hope it helps.
06-08-2017
07:48 AM
1 Kudo
I'm copying a file from a Unix server to HDFS. I believe the edge node acts as a gateway for ingesting data into HDFS. Say I have a 5 GB file that I'm copying into HDFS: where will the data be stored? I understand that it will be stored on the data nodes, but before the entire file lands on a data node, is it placed in a staging/intermediate layer? Does the edge node hold that staging layer?
Labels:
- Apache Hadoop