Member since: 10-02-2017
Posts: 112
Kudos Received: 71
Solutions: 11
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 3098 | 08-09-2018 07:19 PM
 | 3900 | 03-16-2018 09:21 AM
 | 4038 | 03-07-2018 10:43 AM
 | 1156 | 02-19-2018 11:42 AM
 | 4030 | 02-02-2018 03:58 PM
02-19-2018
11:42 AM
1 Kudo
Yes, it is possible, but if and only if there are no repetitions in the column. In that case you end up with just the file and the metadata of the columnar file format.
02-19-2018
11:38 AM
Just to be more specific:
1. The driver talks to the NameNode to find the locations of the HDFS blocks (the same lookup is sketched below).
2. That information is made available to the ApplicationMaster (AM).
3. The driver requests an AM, and the AM requests the required resources based on the block information.
4. YARN has no business talking to the NameNode directly.
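For illustration, here is a minimal sketch of the NameNode lookup from step 1, using the standard Hadoop FileSystem API that the driver relies on (the input path is hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationLookup {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/data/input.txt"); // hypothetical path
        FileStatus status = fs.getFileStatus(file);
        // This is the NameNode round trip: which hosts hold each block of the file
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                + " length=" + block.getLength()
                + " hosts=" + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}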
02-09-2018
07:02 PM
-- word count in Pig
A = load 'Desktop/wordcount.txt' as (col1:chararray);
B = foreach A generate flatten(TOKENIZE(col1)) as (word:chararray); -- split each line into words
C = group B by word; -- one group per distinct word
cnt = foreach C generate flatten(group), COUNT(B.word); -- emit each word with its count
dump cnt;
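If you save the script above as wordcount.pig (a file name I am assuming) and the input really is at Desktop/wordcount.txt, you can test it in local mode with pig -x local wordcount.pig before running it on the cluster.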
02-02-2018
07:27 PM
1 Kudo
The most common input formats are listed below (the sketch that follows shows how one is selected on a job):
1. FileInputFormat (base class for all file-based formats)
2. TextInputFormat
3. KeyValueTextInputFormat
4. SequenceFileInputFormat
5. BinaryInputFormat
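As a minimal, hedged sketch (the class name and input path are made up), this is how one of these formats is wired into a MapReduce job:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

public class InputFormatDemo {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "input-format-demo");
        // each input line is split at the first tab into a Text key and a Text value
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path("/data/input")); // hypothetical path
        // set mapper/reducer classes and call job.waitForCompletion(true) as usual
    }
}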
02-02-2018
03:58 PM
Let's approach your problems from the basics:
1. Spark depends on the InputFormat classes from Hadoop, so every input format that is valid in Hadoop is valid in Spark too.
2. Spark is a compute engine, so the rest of the picture around compression and shuffle remains the same as in Hadoop.
3. Spark mostly works with the Parquet or ORC file formats, which are block-level compressed (generally gz-compressed in blocks), making the files splittable.
4. If a file is compressed, Spark spawns tasks depending on the codec (whether it supports splitting or not). The logic is the same as in Hadoop; the sketch below shows the difference in partition counts.
5. Spark handles compression in the same way MapReduce does.
6. Compressed data cannot be processed directly, so data is always decompressed for processing; for shuffling, data is compressed again to optimize network bandwidth usage.
Spark and MR are both compute engines. Compression is about packing data bytes closely so that data can be stored and transferred efficiently.
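Here is a minimal sketch of point 4, runnable locally under stated assumptions (both input paths are hypothetical): a gzip file comes back as a single partition because the codec is not splittable, while plain text splits per HDFS block:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SplitCheck {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("SplitCheck").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);
        // gzip is a non-splittable codec, so the whole file becomes one partition/task
        JavaRDD<String> gz = sc.textFile("/data/logs.gz"); // hypothetical path
        System.out.println("gz partitions: " + gz.getNumPartitions());
        // plain text splits at block boundaries, one task per split
        JavaRDD<String> txt = sc.textFile("/data/logs.txt"); // hypothetical path
        System.out.println("txt partitions: " + txt.getNumPartitions());
        sc.stop();
    }
}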
02-02-2018
03:46 PM
1. Block replication is for redundancy of data; it ensures data is not lost when a disk goes bad or a node goes down.
2. Replication factor 1 is set when data can be recreated at any point and its loss is not crucial. For example, in a job chain the output of one job is consumed by the next, and eventually all intermediate data is deleted; that intermediate data can be marked with a replication factor of 1 (though it is still good to have 2). The sketch below shows how to lower it for a specific path.
3. A replication factor of 1 does NOT make the cluster fault tolerant. In your case you have 3 worker nodes; with RF 1, if a worker goes bad you lose data and it cannot be processed. I suggest RF=2 if you are concerned about space utilization.
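For example (the path is made up), the replication factor of recreatable intermediate data can be lowered either with hdfs dfs -setrep -w 2 <path> or programmatically:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // lower replication for recreatable intermediate output (hypothetical path)
        boolean changed = fs.setReplication(new Path("/tmp/job-chain/intermediate"), (short) 2);
        System.out.println("replication updated: " + changed);
        fs.close();
    }
}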
02-02-2018
03:24 PM
1 Kudo
Yes, you can definitely spawn a Spark application from a web service. The Spark Thrift Server (SparkServer2) is meant for this purpose: it provides a JDBC connection through which you can submit a Spark job. Do remember that Spark jobs are not meant for key-value lookups; think in terms of HBase or Solr for such use cases.
01-01-2018
08:35 PM
1 Kudo
package com.big.data.sparkserver2;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.io.IOException;
import java.sql.*;
public class SparkServer2Client {
private static final Logger LOGGER = LoggerFactory.getLogger(SparkServer2Client.class);
private static final String HIVESERVER2DRIVER = "org.apache.hive.jdbc.HiveDriver";
public static void main(String[] args) throws SQLException, IOException {
// set the configuration
Configuration conf = new Configuration();
conf.set("hadoop.security.authentication", "Kerberos");
UserGroupInformation.setConfiguration(conf);
// /etc/security/keytab/hive.service.keytab is on the local machine; this is the user executing the command
UserGroupInformation.loginUserFromKeytab("hive/hostname.com@FIELD.HORTONWORKS.COM", "/etc/security/keytab/hive.service.keytab");
// load the driver
try {
Class.forName(HIVESERVER2DRIVER);
} catch (ClassNotFoundException e) {
LOGGER.error("Driver not found");
}
Connection con = DriverManager.getConnection("jdbc:hive2://hostname.com:10016/default;httpPath=/;principal=hive/hostname.com@FIELD.HORTONWORKS.COM");
Statement stmt = con.createStatement();
// Table Name
String tableName = "testHiveDriverTable";
stmt.execute("drop table " + tableName);
LOGGER.info("Table {} is dropped", tableName);
stmt.execute("create table " + tableName + " (key int, value string)");
// show tables
String sql = "show tables '" + tableName + "'";
LOGGER.info("Running {} ", sql);
ResultSet res = stmt.executeQuery(sql);
if (res.next()) {
LOGGER.info(" return from HiveServer {}", res.getString(1));
}
// describe table
sql = "describe " + tableName;
LOGGER.info("DESCRIBE newly created table sql command : {}", sql);
res = stmt.executeQuery(sql);
while (res.next()) {
LOGGER.info("Return from HiveServer {}", res.getString(1) + "\t" + res.getString(2));
}
// close the connection
con.close();
}
}
Dependency:
<dependencies>
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-jdbc</artifactId>
<version>1.2.1000.2.6.0.3-8</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
</dependency>
</dependencies>
Repo: http://repo.hortonworks.com/content/groups/public
01-01-2018
06:25 PM
package com.big.data.hiveserver2;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.io.IOException;
import java.sql.*;
public class HiveServer2Client {
private static final Logger LOGGER = LoggerFactory.getLogger(HiveServer2Client.class);
private static final String HIVESERVER2DRIVER = "org.apache.hive.jdbc.HiveDriver";
public static void main(String[] args) throws SQLException, IOException {
// set the configuration
Configuration conf = new Configuration();
conf.set("hadoop.security.authentication", "Kerberos");
UserGroupInformation.setConfiguration(conf);
// /etc/security/keytab/hive.service.keytab is on the local machine; this is the user executing the command
UserGroupInformation.loginUserFromKeytab("hive/cluster20182.field.fieldorg.com@REALM_NAME", "/etc/security/keytab/hive.service.keytab");
// load the driver
try {
Class.forName(HIVESERVER2DRIVER);
} catch (ClassNotFoundException e) {
LOGGER.error("Driver not found");
}
Connection con = DriverManager.getConnection("jdbc:hive2://cluster20181.field.fieldorg.com:2181,cluster20180.field.fieldorg.com:2181,cluster20182.field.fieldorg.com:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2;principal=hive/cluster20182.field.fieldorg.com@REALM_NAME");
Statement stmt = con.createStatement();
// Table Name
String tableName = "testHiveDriverTable";
stmt.execute("drop table " + tableName);
LOGGER.info("Table {} is dropped", tableName);
stmt.execute("create table " + tableName + " (key int, value string)");
// show tables
String sql = "show tables '" + tableName + "'";
LOGGER.info("Running {} ", sql);
ResultSet res = stmt.executeQuery(sql);
if (res.next()) {
LOGGER.info(" return from HiveServer {}", res.getString(1));
}
// describe table
sql = "describe " + tableName;
LOGGER.info("DESCRIBE newly created table sql command : {}", sql);
res = stmt.executeQuery(sql);
while (res.next()) {
LOGGER.info("Return from HiveServer {}", res.getString(1) + "\t" + res.getString(2));
}
// close the connection
con.close();
}
}
Dependency:
<dependencies>
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-jdbc</artifactId>
<version>1.2.1000.2.6.0.3-8</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
</dependency>
</dependencies>
Repo : http://repo.hortonworks.com/content/groups/public
11-09-2017
12:39 PM
Please run fuser /hadoop/yarn/local/registeredExecutors.ldb/LOCK to find out which PID is holding the lock. Then ps -eaf | grep <pid> will show you the process that owns it.