Member since: 06-07-2016
Posts: 923
Kudos Received: 322
Solutions: 115
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 3993 | 10-18-2017 10:19 PM
 | 4255 | 10-18-2017 09:51 PM
 | 14631 | 09-21-2017 01:35 PM
 | 1773 | 08-04-2017 02:00 PM
 | 2357 | 07-31-2017 03:02 PM
04-10-2017
12:19 PM
@rama Is the following value set to true: keep.failed.task.files (MRv1) or mapreduce.task.files.preserve.failedtasks (MRv2)? If yes, that could be the reason the staging files are not being deleted. Set it to false and delete the files manually, but do not delete files for a currently running job. In rare instances a job failure can also leave staging files behind, and you may find their remnants here. These are just temporary MapReduce files; if no job is currently running, you can safely delete them and reclaim the space. When you delete these files, make sure they don't end up in the trash folder (use the -skipTrash option, or delete them from the trash folder afterwards).
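As a minimal sketch, the cleanup could look like the following; the staging path shown is the usual MRv2 default and may differ on your cluster, so verify it first.

```bash
# Make sure the preserve flag is off in mapred-site.xml (MRv2):
#   <property>
#     <name>mapreduce.task.files.preserve.failedtasks</name>
#     <value>false</value>
#   </property>
# With no job running, remove the leftover staging directories and bypass the trash.
# The path below is an assumption; check yarn.app.mapreduce.am.staging-dir on your cluster.
hdfs dfs -rm -r -skipTrash /tmp/hadoop-yarn/staging/<username>/.staging/job_*
```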
04-03-2017
05:38 PM
1 Kudo
@Revathy Mourouguessane In your Hive table properties you can specify skip.footer.line.count to remove the footer from your data. If you have just a one-line footer, set this value to 1. You specify it in the table properties of your CREATE TABLE statement: tblproperties("skip.header.line.count"="1", "skip.footer.line.count"="1");
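For illustration, here is a hedged sketch of a full CREATE TABLE; the table name, columns, location, and JDBC URL are made up.

```bash
# Sketch only: skips a one-line header and a one-line footer in the source files.
beeline -u jdbc:hive2://localhost:10000 -e "
CREATE EXTERNAL TABLE sales_raw (id INT, amount DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/sales_raw'
TBLPROPERTIES ('skip.header.line.count'='1', 'skip.footer.line.count'='1');"
```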
04-03-2017
02:22 PM
@Bala Vignesh N V Then it is likely a permission issue. Check the permissions on the .Trash folder, and also any Ranger policies for the user who is running DROP TABLE.
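A quick way to check, as a sketch (the username is illustrative):

```bash
# Verify ownership and permissions of the user's trash directory.
hdfs dfs -ls -d /user/<username>/.Trash
# Look at what is already under it.
hdfs dfs -ls /user/<username>/.Trash
```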
04-03-2017
02:05 PM
@Bala Vignesh N V If your table is not a Hive managed table (data under the Hive warehouse directory), in other words when you create an external table, then dropping the table does not delete the data. Data is deleted on DROP TABLE only for Hive managed tables.
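To see the difference, a small sketch (table names, location, and JDBC URL are made up):

```bash
# External table: DROP removes only the metadata; the files stay in place.
beeline -u jdbc:hive2://localhost:10000 -e "
CREATE EXTERNAL TABLE events_ext (id INT) LOCATION '/data/events_ext';
DROP TABLE events_ext;"
hdfs dfs -ls /data/events_ext        # files are still there

# Managed table: DROP removes the metadata and the warehouse files.
beeline -u jdbc:hive2://localhost:10000 -e "
CREATE TABLE events_managed (id INT);
DROP TABLE events_managed;"
```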
04-01-2017
08:28 PM
@sherri cheng Do you mean you have created tables called "drivers", "driver1", etc. and now want to get rid of the tables and their associated data? Are the folders created under the "/usr/hive/warehouse" directory? Have you used the following? DROP TABLE [IF EXISTS] table_name [PURGE]; --> DROP TABLE IF EXISTS drivers PURGE;
Then run it again for the other tables, as in the sketch below.
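A sketch, assuming the tables are named drivers and driver1 and the JDBC URL is illustrative:

```bash
# Drop each table and purge its data (PURGE bypasses the trash).
beeline -u jdbc:hive2://localhost:10000 -e "
DROP TABLE IF EXISTS drivers PURGE;
DROP TABLE IF EXISTS driver1 PURGE;"
# Confirm the corresponding warehouse folders are gone.
hdfs dfs -ls /usr/hive/warehouse
```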
03-31-2017
09:34 PM
@yvora Actually, if you want to run this as user "a" and not as the principal, the command changes: you still do the kinit as you said, but then you also provide --proxy-user.
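Roughly like this, as a sketch; the keytab path, principal, class, and jar are assumptions:

```bash
# Authenticate as the service principal first...
kinit -kt /etc/security/keytabs/myservice.keytab myservice@EXAMPLE.COM
# ...then submit the job on behalf of user "a".
spark-submit --proxy-user a --class com.example.MyApp myapp.jar
```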
03-31-2017
09:08 PM
1 Kudo
@Kevin Ng Can you run a kinit before running the spark command?
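Something like this, as a sketch (the principal, class, and jar are illustrative):

```bash
kinit <user>@EXAMPLE.COM       # obtain a Kerberos ticket
klist                          # confirm the ticket is valid
spark-submit --class com.example.MyApp myapp.jar
```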
03-31-2017
05:27 PM
@Lucy zhang
Please try the following: --map-column-java isactive=Integer or --map-column-java isactive=String. Also try --map-column-hive isactive=STRING or --map-column-hive isactive=INT.
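For context, a hedged sketch of a full Sqoop import using these options; the connection details, credentials, and table name are assumptions:

```bash
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username etl \
  --password-file /user/etl/.dbpassword \
  --table customers \
  --map-column-java isactive=Integer \
  --hive-import \
  --hive-table customers \
  --map-column-hive isactive=INT
```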
03-30-2017
09:32 PM
1 Kudo
@kkanchu You are reading the defaults for MRv1. With YARN/MRv2, mapreduce.cluster.local.dir has been replaced by yarn.nodemanager.local-dirs. This property uses your local disks for storing temporary files. I have not tried mapreduce.cluster.temp.dir, but it seems to me the difference is that it is a location in HDFS rather than on the local file system. You can try running a small sample job and see the difference.
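To see what your cluster actually uses, a sketch (the config path assumes an HDP-style layout and the example value is illustrative):

```bash
# Show the local directories the NodeManager uses for intermediate/temporary files.
grep -A2 'yarn.nodemanager.local-dirs' /etc/hadoop/conf/yarn-site.xml
# A typical value is a comma-separated list of local disks, e.g.
#   /grid/0/hadoop/yarn/local,/grid/1/hadoop/yarn/local
```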
03-29-2017
05:17 PM
1 Kudo
@sushil nagur I agree with both @Graham Martin and @ccasano. Instead of talking about the tools, which you already know from the answers above, I'll talk about why CIOs prefer Hortonworks for offloading their existing ETL jobs. As Graham mentions, we have partners like Informatica, Talend, Pentaho, and Syncsort that you can use to write your ETL jobs in Hadoop. What this gives you is faster time to market, which is the same story as with previous ETL tools: they save you from writing code and building your ETLs manually, and they prevent bugs you might introduce if you wrote your own code. Under the hood, they use similar technologies like Spark and MapReduce, and even the same fast connectors that Sqoop uses.

So why use Hortonworks? Consider where the storage engine is, where all the processing actually happens. Without Hortonworks, on legacy/existing systems, CIOs pay a significantly higher cost per TB for ETL. Some companies even do ELT, which means they first load data into their data warehouse and then use the processing power of that system to perform the transformations. This takes very expensive resources away from the reporting and ad hoc queries that the EDW was purchased for in the first place. When you offload those jobs onto Hadoop, you free up all that capacity and processing power for reporting and business use. Your per-TB cost of doing ETL in Hadoop is a fraction of what it is in traditional ETL systems, and that is the main motivation for offloading ETL to Hadoop. You perform the ETL in Hadoop and then push your final result into your EDW.