Member since: 06-07-2016
Posts: 923
Kudos Received: 322
Solutions: 115

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 3993 | 10-18-2017 10:19 PM |
| | 4255 | 10-18-2017 09:51 PM |
| | 14631 | 09-21-2017 01:35 PM |
| | 1773 | 08-04-2017 02:00 PM |
| | 2357 | 07-31-2017 03:02 PM |
03-05-2017 06:44 PM

@Yan Liu In the query you are running, set hive.mapred.mode=nonstrict; it should run much faster, but your customer might not be happy with this hack. Instead of ORDER BY (order column), use DISTRIBUTE BY (order column) together with SORT BY (sort column). This creates multiple reducers, partitioned on the DISTRIBUTE BY column, and each reducer's output is already sorted by the SORT BY columns. This, I think, is the right way to do it, and your customer will be happy. https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SortBy
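A minimal sketch of the two variants (the table and column names are made up):

```sql
-- ORDER BY forces a total order through a single reducer, which is why
-- strict mode (hive.mapred.mode=strict) rejects it without a LIMIT.
SELECT * FROM sales ORDER BY sale_date;

-- DISTRIBUTE BY hashes rows to multiple reducers on the given column;
-- SORT BY then sorts the rows within each reducer's output.
SELECT * FROM sales DISTRIBUTE BY sale_date SORT BY sale_date;
```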
03-03-2017 09:19 PM

@Ancil McBarnett Looking at the documentation, the way I understand it is that the Phoenix JDBC driver uses the HBase RPC mechanism, and as @Josh Elser noted, that is already covered by the secure client-side configuration. See this link and notice how the JDBC client actually connects to ZooKeeper: https://streever.atlassian.net/wiki/display/HADOOP/Phoenix+JDBC+Client+Setup
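To make the ZooKeeper part concrete, here is a sketch of pointing the thick client at a secure cluster (hostnames, realm, and keytab path are made up):

```bash
# The Phoenix JDBC URL names the ZooKeeper quorum, not a RegionServer:
#   jdbc:phoenix:<zk-quorum>:<zk-port>:<zk-root-znode>
# On a Kerberized cluster a principal and keytab can be appended:
#   jdbc:phoenix:zk1,zk2,zk3:2181:/hbase-secure:user@EXAMPLE.COM:/etc/security/keytabs/user.keytab
/usr/hdp/current/phoenix-client/bin/sqlline.py zk1,zk2,zk3:2181:/hbase-secure
```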
03-03-2017 04:46 PM

@Subramaniyam KMV Try this: https://raw.githubusercontent.com/hortonworks/tutorials/hdp/assets/realtime-event-processing-with-hdf/IoT_Lab_Series_DataFlow.xml
03-03-2017 06:00 AM

@Adedayo Adekeye Do an ls on "~/.ssh/config". Did it work? I am wondering whether the ".ssh" folder even exists (it should, but it could be an issue with the VM you are using). In fact, does the user you are logged in as even have a home folder? Check all your permissions, including the permissions on the .ssh folder.
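A quick sequence of checks along those lines (a sketch; run as the user you log in with):

```bash
ls -ld ~              # does the home directory exist, and who owns it?
ls -la ~/.ssh         # does .ssh exist? sshd expects it to be mode 700
ls -l ~/.ssh/config   # the file in question; it should be owned by you
# If the folder is missing or the modes are too open:
mkdir -p ~/.ssh && chmod 700 ~/.ssh
chmod 600 ~/.ssh/config
```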
03-03-2017 05:44 AM

@Subramaniyam KMV Here is the link you are looking for: https://github.com/hortonworks/tutorials/blob/hdp/assets/realtime-event-processing-with-hdf/IoT_Lab_Series_DataFlow.xml
03-03-2017 05:21 AM

@Dylan Wu Maybe your MapReduce jobs are using a different queue than Tez. How are your queues configured? https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.3/bk_installing_manually_book/content/ref-475bae8f-9da0-4bcb-955c-e9722c4c536a.1.html
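One thing worth checking: MapReduce and Tez read their YARN queue from different properties, so the two engines can silently land in different queues. A sketch (the queue name "etl" is made up):

```bash
# MapReduce jobs pick their queue from mapreduce.job.queuename:
hive --hiveconf hive.execution.engine=mr  --hiveconf mapreduce.job.queuename=etl -e "SELECT 1;"
# Tez jobs pick theirs from tez.queue.name:
hive --hiveconf hive.execution.engine=tez --hiveconf tez.queue.name=etl -e "SELECT 1;"

# List the queues with their configured capacity and state:
mapred queue -list
```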
03-02-2017 06:56 PM

@nedox nedox Then just use GetHDFS, or ListHDFS -> FetchHDFS. In these processors you will have to specify the client config files from your HDP cluster; that is how NiFi knows where to connect, which keytab and principal to use if Kerberos is enabled, and which directories to fetch files from.
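For illustration, the relevant processor properties typically look something like this (all values are made up; the paths depend on your cluster):

```
Hadoop Configuration Resources : /etc/hadoop/conf/core-site.xml,/etc/hadoop/conf/hdfs-site.xml
Kerberos Principal             : nifi@EXAMPLE.COM
Kerberos Keytab                : /etc/security/keytabs/nifi.keytab
Directory                      : /data/incoming
```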
03-02-2017 03:24 PM

@nedox nedox HDF is not an ETL tool. How much data do you want to fetch from HDP? If it's a big chunk (millions of records or more), then why not use Sqoop? Can you please describe what you intend to do with the data you fetch?
02-27-2017 06:15 PM

3 Kudos

@Oriane I am not aware of a platform similar to Kaggle for Hadoop. Problems on Kaggle are very specific: there is a known input (the data), and companies know exactly what they want. They don't have to share their data, and in the end they walk away with the best data model in terms of accuracy. That's it. There are no ad hoc queries, and no one is talking about 10 different data sources, seven of which hold regulatory data that you cannot copy for compliance reasons but must somehow work with across teams to build reports.

Machine learning problems, like the ones shared on Kaggle, can be solved with the help of the Hadoop ecosystem, specifically Spark. But there is a whole set of other problems solved with Hadoop, and there is no Kaggle-like platform for those because they don't have a definitive outcome. For example, if someone wants to run ad hoc queries using Hive LLAP with a 5-second SLA, you have to do that in their environment; you can't solve it on a platform like Kaggle and hand the customer a solution. Or imagine someone building an app with HBase as the backend: the problems they will encounter include key design, sizing of the cluster, and ensuring SLAs for x number of concurrent queries (which comes under sizing and design). You can help guide them, but they will mostly not share their business requirements on a public platform. For whatever help they need, they can come to sites like this and get their questions answered; in most cases they will work directly with Hortonworks to come up with the best solution.

In a nutshell, machine learning problems tend to be definitive (math is definitive) without requiring access to real data. Kaggle is a perfect platform to bring your problems, pay people to solve them, and walk away with the best solution (not just any solution) from some really good minds out there. It also gives companies an opportunity to hire the best talent. The other set of problems, like the ones you are asking about, is not definitive and requires disclosing information that companies consider competitive, so you don't see a Kaggle-like platform for them. Hope this helps.
02-27-2017 05:52 PM

@Muhammad Touseef You can use the instructions below to stop Hadoop services before the upgrade. One thing you absolutely want to make sure of is that no jobs are running; let any running job complete. The right way to do it is to stop the queues, so that running jobs complete but no new jobs can be submitted. Once all jobs have completed and nothing is running, you can use the "Stop All" command from Ambari, or simply follow the instructions below:

https://hadoop.apache.org/docs/r2.7.2/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html --> how to stop queues
https://docs.hortonworks.com/HDPDocuments/Ambari-2.1.2.0/bk_upgrading_Ambari/content/_stop_cluster_and_checkpoint_HDFS_mamiu.html --> stop all
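For reference, draining a queue with the Capacity Scheduler looks roughly like this (the queue name "default" is just an example):

```bash
# In capacity-scheduler.xml (or via Ambari), set the queue's state:
#   <property>
#     <name>yarn.scheduler.capacity.root.default.state</name>
#     <value>STOPPED</value>
#   </property>
# Then have the ResourceManager reload the scheduler configuration:
yarn rmadmin -refreshQueues

# A STOPPED queue accepts no new applications; running ones finish.
mapred queue -list   # verify the queue state while jobs drain
```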