Member since: 05-02-2017
360 Posts
65 Kudos Received
22 Solutions

My Accepted Solutions

Title | Views | Posted
---|---|---
 | 8286 | 02-20-2018 12:33 PM
 | 657 | 02-19-2018 05:12 AM
 | 950 | 12-28-2017 06:13 AM
 | 4678 | 09-28-2017 09:25 AM
 | 8288 | 09-25-2017 11:19 AM
12-12-2018
04:29 PM
It's a good approach, but the one disadvantage I can see is the number of hops needed to reach the desired result. Instead of performing joins, we can apply a windowing function to achieve the same thing in a single hop, assuming you have a unique key column and a last-modified date in your scenario; a sketch is shown below.
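As an illustration only (the table name `events` and the columns `id` and `last_modified` are assumptions, not taken from the original thread), a single-pass dedup with a window function in Spark could look like this:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

object LatestPerKey {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("latest-per-key").enableHiveSupport().getOrCreate()

    // Hypothetical source table: one row per change, keyed by `id`,
    // with a `last_modified` timestamp column.
    val events = spark.table("events")

    // Rank the rows within each key by recency and keep only the newest one,
    // which avoids the extra join/aggregation hop.
    val byKeyNewestFirst = Window.partitionBy(col("id")).orderBy(col("last_modified").desc)
    val latest = events
      .withColumn("rn", row_number().over(byKeyNewestFirst))
      .filter(col("rn") === 1)
      .drop("rn")

    latest.show()
    spark.stop()
  }
}
```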
12-11-2018
05:18 PM
Hi Joe, one option is to increase the heap size and verify. But you have already mentioned that the heap size provided is more than enough, so try clearing anything unnecessary from the NameNode, as that is one possible cause of this issue. Hope it helps!!
12-11-2018
07:45 AM
Hi @SP Lots of small files in the cluster hurts cluster health at the block size you have. Before looking at changing the block size, check whether there is a possibility of combining the files. If there are similar sets of files that can be combined, do that first so the file sizes become reasonable (see the sketch below). Also, does the cluster hold only these small files of less than 1 MB? If so, it is worth thinking about changing the block size. But if you also have big files that span multiple splits, then instead of changing the block size you should think about combining the small files, as mentioned earlier. Alternatively, if you have separate clusters for hot/warm/cold data and these files belong to cold data, you could very well reduce the block size, but that defeats the aim of HDFS, which works best as a distributed system. Also, if the block size is reduced, you may need to revisit other configuration parameters such as mapper size, reducer size, input split size, etc.
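A minimal compaction sketch, assuming the small files are plain text and live under a hypothetical /data/raw/small directory (the paths and the target file count are illustrative, not from the original question):

```scala
import org.apache.spark.sql.SparkSession

object CompactSmallFiles {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("compact-small-files").getOrCreate()

    // Read the many small files as a single dataset...
    val raw = spark.read.text("/data/raw/small/*")

    // ...and rewrite them as a handful of larger files, so each output file
    // ends up closer to the HDFS block size.
    raw.coalesce(4)
      .write
      .mode("overwrite")
      .text("/data/compacted")

    spark.stop()
  }
}
```

The same idea works for ORC/Parquet data by swapping the reader and writer formats.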
12-11-2018
07:26 AM
Hi @harsha vardhan Could you explain a bit more? Yes, you can override the queue whenever you want, but it also depends on the user/group access. If the user is assigned to specific groups and those groups have not been given privileges on any other queue, it will not be possible until the proper access is granted to the user's groups. But if you do have access to multiple queues, you can pass the queue name as a parameter to the Sqoop job, and if the queue name has to change you can handle that with a combination of shell + Sqoop, as sketched below.
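A rough sketch of parameterising the queue name, driven here from Scala via scala.sys.process in the same way a shell wrapper would call it (the JDBC URL, table and queue names are placeholders, not values from this thread):

```scala
import scala.sys.process._

object SqoopWithQueue {
  def main(args: Array[String]): Unit = {
    // Queue name passed in at run time, so the same job can target
    // different YARN queues without editing the Sqoop command itself.
    val queue = if (args.nonEmpty) args(0) else "default"

    val cmd = Seq(
      "sqoop", "import",
      s"-Dmapreduce.job.queuename=$queue",      // generic -D options go right after the tool name
      "--connect", "jdbc:mysql://dbhost/sales", // placeholder JDBC URL
      "--table", "orders",                      // placeholder table
      "--target-dir", "/data/sqoop/orders",
      "--num-mappers", "4"
    )

    // Run Sqoop and fail loudly on a non-zero exit code.
    val exitCode = cmd.!
    require(exitCode == 0, s"sqoop import failed with exit code $exitCode")
  }
}
```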
04-24-2018
12:52 PM
Hi @vivekananda chagam Once the file has been loaded from the directory, use: cnt = FOREACH (GROUP data ALL) GENERATE COUNT(data);
04-23-2018
06:35 AM
Hi @Swaapnika Guntaka When you delete data from HDFS, it is moved to the Trash. However, the Trash is flushed at a regular interval; once it has been flushed, there is no way to recover the data unless you have DR in place, which is usually only the case in a production environment. Hope it helps!!
04-12-2018
04:13 PM
@johny gate Yes, col1 etc. are the column names; in your case name, date & amount. Yes, it should work for that scenario.
04-12-2018
07:18 AM
Hi @David Sandoval This isn't enough to work out the problem. Please paste your query and the complete logs so the error can be understood; only then will people be able to answer your question.
04-12-2018
07:15 AM
Hi @johny gate The query below works, but it's kind of dirty. Hope it helps!
select a.*, tblb.col3
from a
left join (select *, lag(col3) over (partition by col1 order by col2) as lag_val from a) tblb
  on tblb.col1 = a.col1 and a.col2 = tblb.lag_val
04-12-2018
07:11 AM
Hi @johny gate The query below works, but it's kind of dirty. Hope it helps!!
select *
from a
left join (select *, lag(col3) over (partition by col1 order by col2) as lag_val from a) tblb
  on tblb.col1 = a.col1 and a.col2 = tblb.lag_val
04-06-2018
12:56 PM
Hi @Simran Kaur I don't think that is possible in the email action beyond triggering an email through Oozie. I would suggest going with a shell script, where you can perform whatever you need, and triggering it from Oozie.
04-06-2018
12:48 PM
Hi @Subramaniam Ramasubramanian You will have to start by looking into the executor failures. You said this job was working fine earlier and that you only recently started facing this issue; in that case I believe the maximum number of executor failures was set to 10 and the job used to stay under it, but now the number of executor failures has started exceeding 10. Executor failures can also be due to resource unavailability, so consider the cluster resource and memory availability at the time your job runs as well. A sketch of raising the threshold is shown below. Hope it helps!
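Purely as an illustration (the value 20 is made up, and on YARN cluster mode this property is normally passed with spark-submit --conf before the application starts rather than set in code), raising the threshold might look like:

```scala
import org.apache.spark.sql.SparkSession

object TolerantJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("tolerant-job")
      // Allow more executor failures before YARN gives up on the application.
      // Prefer fixing the underlying resource/memory problem first.
      .config("spark.yarn.max.executor.failures", "20")
      .getOrCreate()

    // ... the actual job logic would go here ...

    spark.stop()
  }
}
```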
04-06-2018
12:43 PM
1 Kudo
Hi @Geir Fredheim Does the process write a huge number of files into the HDFS directory? What is the HDFS block size, and what is the size of the files being created in the target directory? If the files are well below the block size and there is a huge number of them, you need to look into that: when a huge number of files is created, it ends up being a bottleneck for the process. I'm not sure how DataStage is handling the inserts, but do check how many MapReduce jobs are created and tune them based on the size of the files. A quick way to inspect the target directory is sketched below. Hope it helps!!
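A small sketch for checking the file sizes against the HDFS block size (the /data/target path is a placeholder):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object InspectTargetDir {
  def main(args: Array[String]): Unit = {
    val fs = FileSystem.get(new Configuration())
    val dir = new Path("/data/target") // placeholder target directory

    // List every file, compare its size with its HDFS block size,
    // and count how many fall well below the block size.
    val statuses = fs.listStatus(dir).filter(_.isFile)
    val tiny = statuses.count(s => s.getLen < s.getBlockSize / 10)

    statuses.foreach { s =>
      println(f"${s.getPath.getName}%-40s ${s.getLen}%12d bytes (block size ${s.getBlockSize})")
    }
    println(s"${statuses.length} files total, $tiny of them smaller than a tenth of the block size")
  }
}
```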
03-16-2018
06:16 AM
1 Kudo
@Timothy Spann If open source is the priority, then I would go with Hive using MERGE. Though I haven't tried MERGE with huge volumes, I believe it would perform decently.
03-15-2018
05:36 AM
@Timothy Spann I would go with either Attunity or some utility/framework that can be modified depending on the use case. These kinds of frameworks reduce time and effort, and multiple tables can be processed in parallel with little extra work.
02-28-2018
05:11 AM
1 Kudo
@Elena Lauren Happy Hadooping!!
02-23-2018
12:12 PM
These links will help you gain knowledge at a high level; however, you will have to dive deeper if you want to know more. https://datajobs.com/what-is-hadoop-and-nosql https://it.toolbox.com/blogs/maryannrichardson/hadoop-or-nosql-what-is-the-difference-113016
02-23-2018
12:10 PM
1 Kudo
@Elena Lauren Let me put it in the shortest and simplest way. Hadoop is storage where you can keep structured, semi-structured and unstructured data. Its usage ranges from batch to streaming, handling huge amounts of data, and it has different services for specific use cases. For example, Hive is somewhat similar to an RDBMS but sits on top of Hadoop: we can create structured tables, and even flat files, CSVs and a few other semi-structured formats can be handled. So what do you do if you have to store documents that should still be easily accessible without involving the data people? And what happens when your data keeps changing frequently? In such cases you will not be able to handle it in Hive; instead you would choose HBase if you are staying within Hadoop, or another NoSQL platform such as MongoDB or CouchDB, which is what uniquely defines them as NoSQL. Hope it helps!!
02-22-2018
06:36 AM
Hi @yassine sihi One way I can think of is to import the database from one cluster to the other using Sqoop, which is entirely possible. Converting to CSV and then performing the workaround again is a time-consuming process. Hope it helps!!
02-21-2018
05:20 AM
Hi @hippagun It won't work. Even though it's ORC, Hive can only differentiate the columns based on the delimiter you specified during table creation, so no matter how you re-create it, it won't work. There are two options now: 1) Create another external table with the additional columns, write a simple query to load the records from the old table into the new one, supplying NULL for the newly added columns (a sketch is shown below), and once that is done drop the old table and use the new table going forward. This works well for ORC. 2) Alternatively, if the schema of the table changes frequently, it is better to go with an Avro table, as schema changes can be handled easily there. You have to follow the step above just the first time; whenever the schema changes again in the future, you only need to alter the schema file and nothing else. You can refer to this Link to understand how schema changes are handled with Avro files. Hope it helps!!
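As an illustration of option 1 only (the table and column names old_table, new_table, col1, col2 and new_col are placeholders, and the statements are run here through Spark SQL rather than the Hive CLI):

```scala
import org.apache.spark.sql.SparkSession

object MigrateOrcTable {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("migrate-orc-table")
      .enableHiveSupport()
      .getOrCreate()

    // New ORC table with the extra column added at the end.
    spark.sql("""
      CREATE TABLE new_table (
        col1 STRING,
        col2 INT,
        new_col STRING
      )
      STORED AS ORC
    """)

    // Copy the existing rows across, filling the new column with NULL.
    spark.sql("""
      INSERT INTO new_table
      SELECT col1, col2, CAST(NULL AS STRING) AS new_col
      FROM old_table
    """)

    // Once the data is verified, the old table can be dropped:
    // spark.sql("DROP TABLE old_table")

    spark.stop()
  }
}
```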
02-20-2018
12:33 PM
Hi @Ravikiran Dasari If this is for knowledge purposes, then what I'm going to say adds no more information than the previous answers; but if you are looking for something work-related, this answer might help a bit. Have a file watcher that looks for a file with the particular pattern that has to be FTP'ed to the desired location. Once the file arrives, you can move it to the HDFS server. This can be accomplished with a simple shell script that requires only basic shell knowledge (a rough sketch of the idea is below). It can also be done as either push or pull: if you have other downstream jobs that have to execute once the file arrives in HDFS, I would recommend the pull approach, so that you can kick off any other Hadoop/Hive/Pig/Spark jobs on the HDFS server. Hope it helps!!
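The original suggestion is a plain shell script; purely as an illustration of the same polling idea, here is a sketch using the Hadoop FileSystem API (the /landing directory, the file-name pattern and the HDFS target path are all placeholders):

```scala
import java.io.File
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object FileWatcher {
  def main(args: Array[String]): Unit = {
    val landingDir = new File("/landing")       // placeholder local directory the file is FTP'ed to
    val pattern = "sales_.*\\.csv".r            // placeholder file-name pattern
    val hdfsTarget = new Path("/data/incoming") // placeholder HDFS directory
    val fs = FileSystem.get(new Configuration())

    // Poll the landing directory; when a matching file shows up,
    // push it to HDFS and remove the local copy.
    while (true) {
      val matches = Option(landingDir.listFiles())
        .getOrElse(Array.empty[File])
        .filter(f => f.isFile && pattern.pattern.matcher(f.getName).matches())

      matches.foreach { f =>
        fs.copyFromLocalFile(new Path(f.getAbsolutePath), new Path(hdfsTarget, f.getName))
        f.delete()
        println(s"moved ${f.getName} to $hdfsTarget")
      }

      Thread.sleep(60000) // check once a minute
    }
  }
}
```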
02-19-2018
05:12 AM
Hi @Lanic When you submit a job, it is YARN that provides the information about resources. The driver gets the HDFS data-location information needed to execute the job from the NameNode; then the nearest available resources, those closest to the data, are taken into consideration when deciding where the tasks will execute. It is the NameNode that gives YARN the information about the HDFS data locations. Once all the jobs are completed, the status of all the jobs is updated and the corresponding metastore is brought back in sync. Hope it helps!!
01-30-2018
09:52 AM
Apart from specifying the number of partitions when creating a DataFrame, or using coalesce/repartition, is there any configuration or parameter that can be changed so the default number of partitions (200) is reduced? @Dinesh Chitlangia could you help me with this? (The options I mean are sketched below.)
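For reference, a sketch of the options mentioned above together with the configuration properties that control the defaults (the values are illustrative; spark.sql.shuffle.partitions affects DataFrame/SQL shuffles, while spark.default.parallelism affects RDD operations):

```scala
import org.apache.spark.sql.SparkSession

object PartitionKnobs {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("partition-knobs")
      // Default number of partitions after a DataFrame/SQL shuffle (200 out of the box).
      .config("spark.sql.shuffle.partitions", "50")
      // Default parallelism for RDD operations when no partition count is given.
      .config("spark.default.parallelism", "50")
      .getOrCreate()
    val sc = spark.sparkContext

    // Explicit partition count at creation time.
    val rdd = sc.parallelize(1 to 1000, numSlices = 10)

    // Shrinking or growing an existing dataset.
    val fewer = rdd.coalesce(5)
    val more  = rdd.repartition(20)

    println(s"${rdd.getNumPartitions} -> ${fewer.getNumPartitions} / ${more.getNumPartitions}")
    spark.stop()
  }
}
```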
Labels:
- Apache Hadoop
- Apache Spark
01-30-2018
05:13 AM
Hi, I have a set of questions about Spark that I'm trying to understand:
1. What is the best compression codec to use in Spark? In Hadoop we should not use gzip compression unless it is cold data, where input splits are of very little use. But if we were to choose another compression (LZO/bzip2/Snappy, etc.), based on what parameters do we need to choose?
2. Does Spark make use of input splits if the files are compressed?
3. How does Spark handle compression compared with MapReduce?
4. Does compression increase the amount of data being shuffled?
Thanks in advance!!
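Not an answer to the questions above, just a sketch of where the codec choice is made when writing output from Spark (the paths and the snappy codec are placeholders):

```scala
import org.apache.spark.sql.SparkSession

object CodecDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("codec-demo").getOrCreate()

    val df = spark.read.text("/data/input") // placeholder input path

    // Splittability matters at read time; at write time the codec is simply
    // an option on the writer. "snappy" here is illustrative.
    df.write
      .option("compression", "snappy")
      .parquet("/data/output_parquet")

    // Shuffle and spill compression is controlled separately via
    // spark.io.compression.codec (lz4 by default in recent versions).
    spark.stop()
  }
}
```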
Labels:
- Apache Hadoop
- Apache Spark
01-24-2018
05:10 AM
Hi @buihuuhieu buihuuhieu To access HDFS, type 'hadoop fs -ls'. To access Hive, type 'hive'; you will be logged into the Hive shell, where you can query the sample databases and the files that already exist in HDP. Hope it helps!!
01-10-2018
05:58 AM
In Spark we have RDDs, and there are options to persist an RDD if we use it in multiple steps of the code. In general an RDD holds a lineage graph and, with lazy evaluation, is computed only when it is needed. Now, if I want to persist an RDD, I can choose the persist option to store the RDD's data. I believe the persisted RDD data is stored on the node where it is computed; if that is the case, then all the RDD data resides on one single node. If I then make use of the persisted RDD in other lines of the code, does it really use distributed computing (assuming all the data is stored on a single node)? Is my understanding right? If it is wrong, could someone help me understand?
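For reference only, this is the persist call being asked about (the path and storage level are illustrative); the open question is about where the cached partitions end up living:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object PersistDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("persist-demo").getOrCreate()
    val sc = spark.sparkContext

    val rdd = sc.textFile("/data/input")     // placeholder input path
      .flatMap(_.split("\\s+"))
      .map(w => (w, 1))
      .persist(StorageLevel.MEMORY_AND_DISK) // cache the partitions after the first computation

    // Both actions reuse the cached partitions instead of re-reading the input.
    println(rdd.count())
    println(rdd.reduceByKey(_ + _).count())

    spark.stop()
  }
}
```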
Labels:
- Apache Spark
01-05-2018
05:26 AM
This would also work:
import java.io.File
def getListOfFiles(dir: File): List[File] = dir.listFiles.filter(_.isFile).toList
val files = getListOfFiles(new File("/tmp"))
01-05-2018
05:23 AM
@Chaitanya D It is possible with a Unix and Spark combination.
hadoop fs -ls /filedirectory/*txt_processed
The command above will return the file you need; then pass the result to Spark and process the file as required. Alternatively, from within Spark you can select the desired file using the command below (after import scala.sys.process._):
val lsResult = Seq("hadoop", "fs", "-ls", "hdfs://filedirectory/*txt_processed").!!
Hope it helps!
01-04-2018
01:28 PM
@rahul gulati I assume you mean Hive jobs when you mention hive.cli. When jobs are stuck, it does not necessarily mean it is because of resource availability; in many cases it is related to the data being handled by the Hive/Spark jobs. Are you facing this issue only when running the same set of queries in Hive and Spark SQL? If that is the case, then it is definitely related to the data. When the Hive jobs run, do you see a few reducers running for a very long time? In that case a few reducers are being loaded with a huge amount of data; check the reason for that accumulation and distribute the data more evenly. Hope it helps!!
01-03-2018
12:05 PM
@Alexandros Biratsis I believe you are not using INSERT OVERWRITE when inserting the incremental records into the target; assuming that, it is odd that the data is being overwritten. Regarding the union part: if you want to avoid the union, you may have to perform a left join between the incremental data and the target to apply the transformations (assuming you are performing SCD type 1); a sketch is shown below. If you just want to append the data, you can insert the incremental data into the target through multiple queries, one by one, but if you insert multiple times the number of jobs will be higher, which is more or less equivalent to performing the union. Sorry for the late reply.
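A rough sketch of the left-join idea under SCD type 1 assumptions (the table names target and incremental and the columns id, name, amount are placeholders, not taken from the original thread):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{coalesce, col}

object Scd1LeftJoin {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("scd1-left-join")
      .enableHiveSupport()
      .getOrCreate()

    // Placeholder tables: target(id, name, amount) and incremental(id, name, amount).
    val target = spark.table("target").alias("t")
    val inc    = spark.table("incremental").alias("i")

    // Take the incremental value where one exists, otherwise keep the current one.
    // This assumes the feed only carries updates to keys that already exist in the
    // target; brand-new keys would need a full outer join instead.
    val merged = target
      .join(inc, col("t.id") === col("i.id"), "left")
      .select(
        col("t.id").as("id"),
        coalesce(col("i.name"), col("t.name")).as("name"),
        coalesce(col("i.amount"), col("t.amount")).as("amount")
      )

    // Write to a staging table rather than overwriting `target` in the same job.
    merged.write.mode("overwrite").saveAsTable("target_staged")

    spark.stop()
  }
}
```

If the incremental feed can also contain brand-new keys, a full outer join (or Hive MERGE) would be needed instead of the plain left join.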