Member since: 06-07-2016
Posts: 923
Kudos Received: 322
Solutions: 115

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 3993 | 10-18-2017 10:19 PM |
| | 4255 | 10-18-2017 09:51 PM |
| | 14631 | 09-21-2017 01:35 PM |
| | 1773 | 08-04-2017 02:00 PM |
| | 2357 | 07-31-2017 03:02 PM |
03-05-2017 06:44 PM

@Yan Liu In the query you are running, set hive.mapred.mode=nonstrict; it should run much faster, but your customer might not be happy with this hack. Instead of ORDER BY (order column), use DISTRIBUTE BY (order column) together with SORT BY (sort column). This creates multiple reducers, partitioned on the DISTRIBUTE BY column, and each reducer's output is already sorted by the SORT BY columns. This, I think, is the right way to do it, and your customer will be happy. https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SortBy
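A minimal sketch of the two variants (the table and column names are made up):

```sql
-- ORDER BY forces a total order through a single reducer, which is why
-- strict mode (hive.mapred.mode=strict) rejects it without a LIMIT.
SELECT * FROM sales ORDER BY sale_date;

-- DISTRIBUTE BY hashes rows to multiple reducers on the given column;
-- SORT BY then sorts the rows within each reducer's output.
SELECT * FROM sales DISTRIBUTE BY sale_date SORT BY sale_date;
```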
03-03-2017 09:19 PM

@Ancil McBarnett Looking at the documentation, the way I understand it is that the Phoenix JDBC driver uses the HBase RPC mechanism, and as @Josh Elser noted, that is already covered by the secure client-side configuration. See this link and notice how the JDBC client actually connects to ZooKeeper: https://streever.atlassian.net/wiki/display/HADOOP/Phoenix+JDBC+Client+Setup
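To make the ZooKeeper part concrete, here is a sketch of pointing the thick client at a secure cluster (hostnames, realm, and keytab path are made up):

```bash
# The Phoenix JDBC URL names the ZooKeeper quorum, not a RegionServer:
#   jdbc:phoenix:<zk-quorum>:<zk-port>:<zk-root-znode>
# On a Kerberized cluster a principal and keytab can be appended:
#   jdbc:phoenix:zk1,zk2,zk3:2181:/hbase-secure:user@EXAMPLE.COM:/etc/security/keytabs/user.keytab
/usr/hdp/current/phoenix-client/bin/sqlline.py zk1,zk2,zk3:2181:/hbase-secure
```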
03-03-2017 04:46 PM

@Subramaniyam KMV Try this: https://raw.githubusercontent.com/hortonworks/tutorials/hdp/assets/realtime-event-processing-with-hdf/IoT_Lab_Series_DataFlow.xml
03-03-2017 06:00 AM

@Adedayo Adekeye Do an ls on "~/.ssh/config". Did it work? I am wondering whether the ".ssh" folder even exists (it should, but it could be an issue with the VM you are using). In fact, does the user you are logged in as even have a home folder? Check all your permissions, including the permissions on the .ssh folder.
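A quick sequence of checks along those lines (a sketch; run as the user you log in with):

```bash
ls -ld ~              # does the home directory exist, and who owns it?
ls -la ~/.ssh         # does .ssh exist? sshd expects it to be mode 700
ls -l ~/.ssh/config   # the file in question; it should be owned by you
# If the folder is missing or the modes are too open:
mkdir -p ~/.ssh && chmod 700 ~/.ssh
chmod 600 ~/.ssh/config
```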
03-03-2017 05:44 AM

@Subramaniyam KMV Here is the link you are looking for: https://github.com/hortonworks/tutorials/blob/hdp/assets/realtime-event-processing-with-hdf/IoT_Lab_Series_DataFlow.xml
03-03-2017 05:21 AM

@Dylan Wu Maybe your MapReduce jobs are using a different queue than Tez. How are your queues configured? https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.3/bk_installing_manually_book/content/ref-475bae8f-9da0-4bcb-955c-e9722c4c536a.1.html
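One thing worth checking: MapReduce and Tez read their YARN queue from different properties, so the two engines can silently land in different queues. A sketch (the queue name "etl" is made up):

```bash
# MapReduce jobs pick their queue from mapreduce.job.queuename:
hive --hiveconf hive.execution.engine=mr  --hiveconf mapreduce.job.queuename=etl -e "SELECT 1;"
# Tez jobs pick theirs from tez.queue.name:
hive --hiveconf hive.execution.engine=tez --hiveconf tez.queue.name=etl -e "SELECT 1;"

# List the queues with their configured capacity and state:
mapred queue -list
```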
03-02-2017 06:56 PM

@nedox nedox Then just use GetHDFS, or ListHDFS -> FetchHDFS. In these processors you will have to specify the client config files from your HDP cluster; that is how NiFi knows where to connect, which keytab and principal to use if Kerberos is enabled, and which directories to fetch files from.
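For illustration, the relevant processor properties typically look something like this (all values are made up; the paths depend on your cluster):

```
Hadoop Configuration Resources : /etc/hadoop/conf/core-site.xml,/etc/hadoop/conf/hdfs-site.xml
Kerberos Principal             : nifi@EXAMPLE.COM
Kerberos Keytab                : /etc/security/keytabs/nifi.keytab
Directory                      : /data/incoming
```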
03-02-2017 03:24 PM

@nedox nedox HDF is not an ETL tool. How much data do you want to fetch from HDP? If it's a big chunk (millions of records or more), then why not use Sqoop? Can you please describe what you intend to do with the data you fetch?
02-27-2017 06:15 PM

3 Kudos

@Oriane I am not aware of a platform similar to Kaggle for Hadoop. Problems on Kaggle are very specific: there is a known input (the data), and companies know exactly what they want. They don't have to share their data, and in the end they walk away with the best data model in terms of accuracy. That's it. There are no ad hoc queries, and no one is talking about 10 different data sources, seven of which hold regulatory data that you cannot copy for compliance reasons but must somehow work with across teams to build reports.

Machine learning problems, like the ones shared on Kaggle, can be solved with the help of the Hadoop ecosystem, specifically Spark. But there is a whole set of other problems solved with Hadoop, and there is no Kaggle-like platform for those because they don't have a definitive outcome. For example, if someone wants to run ad hoc queries using Hive LLAP with a 5-second SLA, you have to do that in their environment; you can't solve it on a platform like Kaggle and hand the customer a solution. Or imagine someone building an app with HBase as the backend: the problems they will encounter include key design, sizing of the cluster, and ensuring SLAs for x number of concurrent queries (which comes under sizing and design). You can help guide them, but they will mostly not share their business requirements on a public platform. For whatever help they need, they can come to sites like this and get their questions answered; in most cases they will work directly with Hortonworks to come up with the best solution.

In a nutshell, machine learning problems tend to be definitive (math is definitive) without requiring access to real data. Kaggle is a perfect platform to bring your problems, pay people to solve them, and walk away with the best solution (not just any solution) from some really good minds out there. It also gives companies an opportunity to hire the best talent. The other set of problems, like the ones you are asking about, is not definitive and requires disclosing information that companies consider competitive, so you don't see a Kaggle-like platform for them. Hope this helps.
02-27-2017 05:52 PM

@Muhammad Touseef You can use the instructions below to stop Hadoop services before the upgrade. One thing you absolutely want to make sure of is that no jobs are running; let any running job complete. The right way to do it is to stop the queues, so that running jobs complete but no new jobs can be submitted. Once all jobs have completed and nothing is running, you can use the "Stop All" command from Ambari, or simply follow the instructions below:

https://hadoop.apache.org/docs/r2.7.2/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html --> how to stop queues
https://docs.hortonworks.com/HDPDocuments/Ambari-2.1.2.0/bk_upgrading_Ambari/content/_stop_cluster_and_checkpoint_HDFS_mamiu.html --> stop all
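For reference, draining a queue with the Capacity Scheduler looks roughly like this (the queue name "default" is just an example):

```bash
# In capacity-scheduler.xml (or via Ambari), set the queue's state:
#   <property>
#     <name>yarn.scheduler.capacity.root.default.state</name>
#     <value>STOPPED</value>
#   </property>
# Then have the ResourceManager reload the scheduler configuration:
yarn rmadmin -refreshQueues

# A STOPPED queue accepts no new applications; running ones finish.
mapred queue -list   # verify the queue state while jobs drain
```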