Member since: 12-09-2015
Posts: 37
Kudos Received: 28
Solutions: 4
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 1203 | 12-15-2015 07:47 AM |
| | 687 | 12-11-2015 07:51 PM |
| | 785 | 12-10-2015 10:06 PM |
| | 467 | 12-09-2015 09:17 PM |
04-21-2016
08:16 PM
Hi @Rahul Pathak, just a few checks: did you set hadoop.security.authorization to true in core-site.xml using Ambari? Can you post the value of the property security.job.client.protocol.acl? Did you configure the property security.client.protocol.acl? After setting the properties and restarting all the services through Ambari, did you check core-site.xml and hadoop-policy.xml manually to verify the values? I'll wait for your answer.
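To make the manual check easier, here is a minimal sketch of what I mean, assuming the default HDP config directory /etc/hadoop/conf (adjust the paths if yours differ):

```bash
# Show the <name>/<value> pairs for the properties mentioned above.
# -A1 prints the line following each match, which should be the <value> line.
grep -A1 "hadoop.security.authorization" /etc/hadoop/conf/core-site.xml
grep -A1 "security.client.protocol.acl" /etc/hadoop/conf/hadoop-policy.xml
grep -A1 "security.job.client.protocol.acl" /etc/hadoop/conf/hadoop-policy.xml
```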
... View more
04-21-2016
08:02 PM
We've just run a lot of tests: we do not have null values and we already use Hive 1.2.1. We also tried dropping partitions from the tables involved in the queries, but we always got the same error.
... View more
04-21-2016
07:59 PM
Hi, it could be a compliance issue to post the DDL and the query in public. If you can send me an email, I can send you the DDL in private. Thanks, Andrea
... View more
04-21-2016
09:47 AM
2 Kudos
@rajdip chaudhuri Hi, if you want to use PostgreSQL as a repository for both Ambari and Ranger and gain full control over the deployment, I suggest deploying a standalone instance of PostgreSQL (refer to this link for the supported versions: http://docs.hortonworks.com/HDPDocuments/Ambari-2.2.1.0/bk_Installing_HDP_AMB/content/_database_requirements.html ) and then using that instance for both. If you have already deployed Ambari with its own PostgreSQL installation, you can easily migrate the Ambari DB to the new instance. After you have moved the DB, run ambari-server setup to reconfigure Ambari. One last thing: Ranger can generate a huge amount of audit logs, so do not use the DB as the landing area for audits unless you know exactly how many audit records you'll have.
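Just as a hedged sketch of the migration (host name, DB name and user below are assumptions based on the Ambari defaults, adapt them to your setup):

```bash
# Stop Ambari before touching its database
ambari-server stop

# Dump the current Ambari DB and load it into the standalone PostgreSQL instance
pg_dump -U ambari ambari > /tmp/ambari.sql
psql -h new-pg-host -U ambari -d ambari -f /tmp/ambari.sql

# Re-run setup and choose "Existing PostgreSQL", pointing Ambari to the new instance
ambari-server setup
ambari-server start
```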
... View more
04-21-2016
09:16 AM
1 Kudo
@Kuldeep Kulkarni Thanks! Do you know when Tez 1.2.1 will be released? Is it planned for HDP 2.4.2?
... View more
04-20-2016
08:28 PM
1 Kudo
We're working on a data preparation phase with Hive and Tez. We're experiencing the following error: Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Session stats: submittedDAGs=0, successfulDAGs=0, failedDAGs=0, killedDAGs=0 (state=08S01,code=2) We're using HDP 2.3.2 with Hive 1.2.1, ORC format and the Tez engine. Has anyone already encountered this error?
... View more
- Tags:
- Data Processing
- Hive
04-20-2016
08:19 PM
If you have some time to spend, you can use Falcon to orchestrate data replication from one cluster to the other: you have to create the Hive table on the second cluster, but then you can easily compare the two tables directly from Beeline.
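For example (the JDBC URLs and table name below are placeholders), a quick row-count comparison from Beeline on both clusters could look like this:

```bash
# Count rows of the source table on cluster 1 and of the replicated copy on cluster 2
beeline -u "jdbc:hive2://hs2-cluster1:10000/default" \
        -e "SELECT COUNT(*) FROM mydb.mytable;"
beeline -u "jdbc:hive2://hs2-cluster2:10000/default" \
        -e "SELECT COUNT(*) FROM mydb.mytable;"
```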
... View more
04-20-2016
08:15 PM
2 Kudos
Hi @Rendiyono Wahyu Saputro, I'll add only one last thing to what @Davide Isoardi wrote: we're converting our demo to run inside the Hortonworks Sandbox and we'll push everything to GitHub. I'll ping you when we finish the job: it could be a good starting point to build your own app. Meanwhile we're also planning a text analysis algorithm to analyze tweets and understand the right relevance for what we're searching. Stay tuned.
... View more
04-16-2016
09:13 PM
@Davide Vergari just released a custom Ambari service to install Apache Drill through Ambari on HDP 2.4. Feel free to take a look at it.
... View more
04-16-2016
09:06 PM
So far I've been able to enable SSL in Ranger 0.6.0 downloaded from the Apache Foundation, but not in Ranger 0.5.0 included in HDP 2.4.0. I hope Hortonworks will upgrade Ranger to 0.6.0 in the next release.
... View more
04-16-2016
09:01 PM
As @Neeraj Sabharwal said, use Ranger and LDAP / AD (with Kerberos) to implement a multitenant environment. Also close all SSH access and use Knox as the single entry point for your apps (ODBC/JDBC, HDFS, etc.). Once you have a secured environment, use YARN queues, but remember: you need a single entry point to your cluster, with no direct SSH for browsing data or anything else. Also, for an enterprise-level cluster you'll need a data lineage approach: use Atlas and Falcon for that. Finally, do not forget to back up configs, metadata and, if you need it, data: you can orchestrate the entire backup process with Oozie or you can build a DR site using Falcon and distcp.
... View more
04-16-2016
08:50 PM
Yes, it's possible. I've already done such a config in a lab environment with the HDP cluster on CentOS 7 and the Hue node on CentOS 6. It can be a pain if you have a Kerberized cluster.
... View more
12-15-2015
07:47 AM
2 Kudos
If not configured properly, networking can be a real pain in Hadoop. All nodes in a Hadoop cluster need to see each other and they require DNS (with reverse resolution) and NTP. You can choose to deploy your cluster inside a "cluster" network and use a multi-homed edge node as a bridge between the "cluster" and "public" networks. But you need to understand how you will access your cluster data (e.g. JDBC through Hive, WebHDFS for HDFS files and so on). An edge node doesn't grant you access to the Ambari Web UI, Ambari API, etc., so if you deploy such a config you need to open specific TCP/IP ports in order to grant users on the "public" network access to those services (e.g. Ambari Views). On the edge node you can deploy all clients (HDFS, YARN, Oozie, Hive, etc.) and let users access the edge node using SSH. It really depends on how you want to manage access to Hadoop services and which services you need to expose to your end users. You can use the NFS gateway or Knox on the edge node, but what are your needs? Please take a look at these links:
Hadoop TCP/IP ports: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-...
Hadoop IDC and firewalls: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-...
NFS Gateway: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-...
Knox: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-...
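As a quick sanity check before (and after) the deployment, something like this on every node helps a lot (host name and IP below are only examples):

```bash
hostname -f                       # must return the FQDN Ambari will use
nslookup worker01.cluster.local   # forward DNS lookup
nslookup 10.0.0.21                # reverse lookup must return the same FQDN
ntpq -p                           # NTP peers must be reachable and in sync
```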
... View more
12-15-2015
07:28 AM
2 Kudos
@Pardeep Yes, we had to ask Microsoft Azure Support to increase limits for both Cores and Storage Account. Look at this link: https://azure.microsoft.com/en-us/blog/azure-limits-quotas-increase-requests/
... View more
12-11-2015
08:25 PM
@jramakrishnan Do you plan to support Sqoop2 in the near future?
... View more
12-11-2015
08:22 PM
We have some mainframes with DB2 and we need to bring every new record written into DB2 to Hadoop (a Hive ORC table) as fast as possible. Is NiFi capable of doing such a thing?
... View more
12-11-2015
08:16 PM
@Ancil McBarnett Thanks! We need to keep the indexes on HDFS, but we also need to index files on HDFS (about 500,000 PDF, EML and P7F files). Following your suggestion, could we deploy Solr on all DataNodes and also on two master nodes? @azeltov So is it correct to say that any Solr node can serve requests on HTTP port 8983 (both Solr and Banana)? Do you have any suggestions about the load balancer? Thanks a lot!
... View more
12-11-2015
08:07 PM
You could try stopping them from Ambari, then stopping them manually from the console and checking whether the service is still running (if it is, kill the daemon manually), and then try to start both from Ambari... it should work.
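For example, for a stuck HiveServer2 (the process name and PID below are just placeholders, adapt them to the component that won't stop):

```bash
# Check whether the daemon is still alive after the Ambari stop
ps -ef | grep -i hiveserver2 | grep -v grep

# Try a clean shutdown first, then force it only if needed
kill <pid>
kill -9 <pid>

# Finally, start the component again from Ambari
```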
... View more
12-11-2015
07:51 PM
1 Kudo
We had some performance issues with a low-profile config (4 vCores, 8 GB RAM), especially with Oozie. Right now we recommend at least 4 vCores and 24 GB of RAM. If you're planning an IaaS deployment on Azure using SQL Azure as the metadata repository, start directly with an S2/S3 instance: if you use Oozie it's the minimum requirement. Also pay attention to Ranger: the admin and users DBs are fine, but for audits you need to look carefully at DB size: it can grow very fast. Use a script to truncate the audit table (see the sketch below) or use a separate instance.
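As a sketch of the truncation script I mean (this assumes DB-based Ranger audits on PostgreSQL and the default xa_access_audit table; host, user and retention period are placeholders, so test it on a copy first):

```bash
# Purge Ranger audit records older than 30 days; schedule it from cron
psql -h ranger-db-host -U rangerlogger -d ranger_audit \
     -c "DELETE FROM xa_access_audit WHERE event_time < now() - interval '30 days';"
```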
... View more
12-10-2015
10:06 PM
3 Kudos
I think it's also a good starting point to use Availability Sets for master nodes and worker nodes. Another good point is using one storage account for each node in the cluster, in order to bypass the IOPS limits for multiple VMs on the same Storage Account. You can also try Azure Data Lake Store (with adl://) in order to check the performance of the new Azure service. You also need to keep in mind the maintenance windows of every Azure region according to your customers: some regions could be a good choice for new service availability (e.g. US East 2) but not from a maintenance point of view (especially for European customers). We also verified great differences between IaaS and PaaS (HDInsight) performance due to the low read/write performance of Blob Storage: with the former (configured correctly) you can achieve the best performance.
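A simple smoke test against Azure Data Lake Store could look like this, assuming the ADL connector and credentials are already configured in core-site.xml (the store URI is a placeholder):

```bash
# List the root of the Data Lake Store and copy a small file to check write latency
hdfs dfs -ls adl://mydatalake.azuredatalakestore.net/
hdfs dfs -put /tmp/sample.txt adl://mydatalake.azuredatalakestore.net/tmp/
```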
... View more
12-10-2015
09:53 PM
We just deployed Waterlinedata (http://www.waterlinedata.com/) on a production cluster (HDP 2.2.8): it's a great tool for metadata enrichment, data dictionary, data lineage and autodiscovery for HDFS and Hive data that runs on top of Hadoop (YARN). It's ready to "speak" with Atlas through its API and you get a great Web UI. One of the coolest features is the ability to create, through the Web UI, an external Hive table from an HDFS file in two clicks.
... View more
12-10-2015
09:46 PM
1 Kudo
While storage space is absolutely critical, as @Neeraj Sabharwal and @Ali Bajwa wrote in their posts, we just "discovered" that CPU is also a key point. When HWX released AMS we began to deploy Ambari and AMS on the same machine, but we soon understood that for a production environment it is good practice to use one VM for Ambari and another VM for AMS, so that the really high impact of AMS on compute resources doesn't affect Ambari (sometimes, during the aggregation phase, we saw 16 CPUs at 90% for 10-15 minutes).
... View more
12-10-2015
09:38 PM
1 Kudo
We need to deploy Solr 5.2.1 on HDP 2.3.2 on a production environment (3 master nodes with HA on HDFS, YARN and Hive, 13 worker nodes, 2 edge, 2 support and 2 security). Is there a "best practice" for production? This is a multi-purpose cluster in which Hive, Pig, HOYA and Spark jobs are currently running.
... View more
12-10-2015
09:25 PM
1 Kudo
If you're experiencing performance issues on Tez, you should start by checking hive.tez.container.size: we have worked a lot on Hive / Tez performance optimization and very often you need to check your jobs. Sometimes we lowered hive.tez.container.size to 1024 (less memory means more containers), other times we needed to set it to 8192. It really depends on your workload. Hive / Tez optimization can be a really long job, but you can achieve good performance using hive.tez.container.size, ORC (and its compression algorithm) and by "pre-warming" Tez containers.
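You can also test different values per session from Beeline before changing the cluster-wide default; for example (connection string, values and query below are just placeholders):

```bash
beeline -u "jdbc:hive2://hs2-host:10000/default" -e "
  SET hive.tez.container.size=4096;      -- container size in MB
  SET hive.tez.java.opts=-Xmx3276m;      -- roughly 80% of the container size
  SELECT COUNT(*) FROM mydb.mytable;     -- replace with your real workload
"
```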
... View more
12-10-2015
09:14 PM
Right now only 22... but they're planning an upgrade next year. I'm waiting for the OK to run the benchmarks.
... View more
12-10-2015
05:51 PM
1 Kudo
WebHDFS is really fast (also through Knox), but you need to clarify how many files, their size, and the timeframe involved to get a more specific answer.
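For reference, plain WebHDFS calls look like this (the NameNode host and paths are placeholders; add user.name, Kerberos or Knox authentication depending on your setup):

```bash
# List a directory and download a file through WebHDFS
curl -i "http://namenode-host:50070/webhdfs/v1/tmp/?op=LISTSTATUS"
curl -i -L "http://namenode-host:50070/webhdfs/v1/tmp/file.txt?op=OPEN"
```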
... View more
12-10-2015
05:47 PM
The Secondary NameNode is mandatory in every Hadoop installation. An HA NameNode is highly suggested, but it is something different from the Secondary NameNode. Since the Hadoop v2 release you can easily deploy an HA config of the NameNode with an Active and a Standby NameNode (it requires ZooKeeper). If you have a really small lab cluster you can avoid the HA NameNode config.
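Once HA is enabled, you can quickly verify which NameNode is active (nn1/nn2 below are the logical names defined in hdfs-site.xml, adapt them to your nameservice):

```bash
hdfs haadmin -getServiceState nn1   # returns "active" or "standby"
hdfs haadmin -getServiceState nn2
```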
... View more
12-10-2015
05:42 PM
2 Kudos
@Neeraj Sabharwal I'll have to ask our customer, but I think there'll be no problem.
... View more