Member since
09-25-2015
230
Posts
276
Kudos Received
39
Solutions
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
|  | 25012 | 07-05-2016 01:19 PM |
|  | 8404 | 04-01-2016 02:16 PM |
|  | 2101 | 02-17-2016 11:54 AM |
|  | 5639 | 02-17-2016 11:50 AM |
|  | 12626 | 02-16-2016 02:08 AM |
01-23-2016
01:08 AM
1 Kudo
@Ali Bajwa Doesn't Active Directory provide this in a fully integrated and automated way?
01-20-2016
12:07 AM
1 Kudo
@Balachandran Karnati collect_list uses an ArrayList, so values are kept in the order they were added. To control that order, use a SORT BY clause in a subquery; don't use ORDER BY, as it forces the query to run non-distributed (through a single reducer). Find a simple example below: drop table if exists collect;
create external table collect(
key string,
order_field int,
value double
)
row format delimited fields terminated by ','
stored as textfile
;
[root@sandbox ~]# cat collect1.txt
a,1,1.0
a,3,3.0
a,2,2.0
b,1,1.0
b,2,2.0
b,3,3.0
[root@sandbox ~]# cat collect2.txt
a,1,1.1
a,3,3.1
a,2,2.1
b,1,1.1
b,2,2.1
b,3,3.1
[root@sandbox ~]# hadoop fs -put collect* /apps/hive/warehouse/collect
drop table IF EXISTS collect_sorted;
create table collect_sorted as
select key, collect_list(value) as value_list
from
(select * from collect sort by key, order_field, value desc) x
group by key
;
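As a quick sanity check (just a sketch, assuming the value_list alias above and that the SORT BY order is in fact preserved into collect_list):
-- For key 'a', the list should come back ordered by order_field ascending,
-- with the higher value first inside each order_field, e.g. [1.1,1.0,2.1,2.0,3.1,3.0]
select key, value_list
from collect_sorted;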
01-07-2016
05:13 PM
1 Kudo
Hortonworks does not support Sqoop2 at the moment. The Sqoop version supported in the latest HDP (2.3.4) is:
Apache Sqoop 1.4.6
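You can double-check which version is installed on a node with the plain Sqoop CLI (nothing HDP-specific assumed here):
# Prints the installed Sqoop client version
sqoop version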
01-07-2016
04:35 PM
1 Kudo
See this screenshot of Sqoop in Ambari: it's a client-only component. There is no Sqoop metastore service; by default Sqoop uses an embedded Derby database, but if you want you can use an external MySQL or PostgreSQL database for the Sqoop metastore, and then you can configure that database in HA mode.
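For reference, saved Sqoop jobs can be pointed at a shared metastore via --meta-connect; the host, port and database name below are just placeholders, not values from your cluster:
# Hypothetical example: list saved jobs stored in a shared Sqoop metastore
sqoop job --list --meta-connect jdbc:hsqldb:hsql://metastore-host:16000/sqoop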
01-07-2016
11:35 AM
1 Kudo
Sqoop is a client only, so you can have Sqoop installed on multiple nodes behind an IP load balancer. I don't know about the Lilly indexer (part of the HDP Search Connector). Documentation is here: https://doc.lucidworks.com/lucidworks-hdpsearch/2.3/Guide-Jobs.html#_hbase-indexer, but I'm not sure if it has HA out-of-the-box with SolrCloud.
01-07-2016
10:52 AM
2 Kudos
@Mehdi TAZI There is a High Availability section in: http://docs.hortonworks.com (choose your version). For the latest HDP version (2.3.4), see this: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-...
01-06-2016
05:33 PM
Awesome!!!
12-29-2015
03:44 PM
2 Kudos
Is it possible to run a Flume agent outside the Hadoop network using the Knox gateway + WebHDFS? I found this JIRA (https://issues.apache.org/jira/browse/FLUME-2701), but it's not resolved yet. A workaround I found would be to mount HDFS via NFS on the remote Flume-agent node and point a File Roll Sink at that NFS directory, but it doesn't seem to be a good approach.
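To make that workaround concrete, here is a rough sketch (assuming the HDFS NFS Gateway is running; nfs-gateway-host and /hdfs are placeholders):
# Mount HDFS over NFSv3 on the remote Flume node
mount -t nfs -o vers=3,proto=tcp,nolock nfs-gateway-host:/ /hdfs
# A Flume File Roll Sink on that node could then write under /hdfs/...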
Labels:
- Apache Flume
- Apache Hadoop
- Apache Knox
12-28-2015
08:16 PM
1 Kudo
@Peter Lasne The tutorial will be fixed ASAP. @zblanco
12-28-2015
07:54 PM
1 Kudo
@Sooraj Antony I believe the problem is that most of the records have the same value for EMP_TYPE (i.e. skewed data). That causes all records with the same EMP_TYPE value to be sent to the same reducer, so the last reducer takes a long time to finish. Assuming you have a small number of distinct EMP_TYPE values (and the aggregated result fits in memory), try the solution below: set hive.ignore.mapjoin.hint=false;
SELECT /*+ MAPJOIN(A) */ STG.EMP_TYPE,DEPT,COUNT(DISTINCT EMP_ID) AS COUNT, A.TOTAL_COUNT
FROM STAGE_SOURCE STG
LEFT OUTER JOIN
(SELECT EMP_TYPE,COUNT(DISTINCT EMP_ID) AS TOTAL_COUNT FROM STAGE_SOURCE GROUP BY EMP_TYPE) A
ON STG.EMP_TYPE = A.EMP_TYPE
GROUP BY STG.EMP_TYPE,DEPT,A.TOTAL_COUNT;
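To confirm the skew before applying the rewrite, a quick distribution check against the same table can help (just a sketch):
-- A heavily dominant EMP_TYPE value confirms the skew
SELECT EMP_TYPE, COUNT(*) AS ROW_COUNT
FROM STAGE_SOURCE
GROUP BY EMP_TYPE
ORDER BY ROW_COUNT DESC;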