Member since: 10-02-2017
Posts: 112
Kudos Received: 71
Solutions: 11
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 3098 | 08-09-2018 07:19 PM
 | 3900 | 03-16-2018 09:21 AM
 | 4038 | 03-07-2018 10:43 AM
 | 1156 | 02-19-2018 11:42 AM
 | 4030 | 02-02-2018 03:58 PM
02-19-2018
11:42 AM
1 Kudo
Yes, it is possible, but if and only if there are no repetitions in the column. In that case you end up with just the file and the metadata of the columnar file format.
02-19-2018
11:38 AM
Just to be more specific:
1. The driver talks to the NameNode to find the locations of the HDFS blocks (the same lookup is sketched below).
2. That information is made available to the ApplicationMaster (AM).
3. The driver requests an AM, and the AM requests the required resources based on the block information.
4. YARN has no business talking to the NameNode directly.
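For illustration, here is a minimal sketch of the NameNode lookup from step 1, using the standard Hadoop FileSystem API that the driver relies on (the input path is hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationLookup {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/data/input.txt"); // hypothetical path
        FileStatus status = fs.getFileStatus(file);
        // This is the NameNode round trip: which hosts hold each block of the file
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                + " length=" + block.getLength()
                + " hosts=" + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}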
02-09-2018
07:02 PM
-- word count in Pig
A = load 'Desktop/wordcount.txt' as (col1:chararray);
B = foreach A generate flatten(TOKENIZE(col1)) as (word:chararray); -- split each line into words
C = group B by word; -- one group per distinct word
cnt = foreach C generate flatten(group), COUNT(B.word); -- emit each word with its count
dump cnt;
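If you save the script above as wordcount.pig (a file name I am assuming) and the input really is at Desktop/wordcount.txt, you can test it in local mode with pig -x local wordcount.pig before running it on the cluster.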
02-02-2018
07:27 PM
1 Kudo
The most common input formats are listed below (the sketch that follows shows how one is selected on a job):
1. FileInputFormat (base class for all file-based formats)
2. TextInputFormat
3. KeyValueTextInputFormat
4. SequenceFileInputFormat
5. BinaryInputFormat
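As a minimal, hedged sketch (the class name and input path are made up), this is how one of these formats is wired into a MapReduce job:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

public class InputFormatDemo {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "input-format-demo");
        // each input line is split at the first tab into a Text key and a Text value
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path("/data/input")); // hypothetical path
        // set mapper/reducer classes and call job.waitForCompletion(true) as usual
    }
}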
02-02-2018
03:58 PM
Let's approach your problems from the basics:
1. Spark depends on the InputFormat classes from Hadoop, so every input format that is valid in Hadoop is valid in Spark too.
2. Spark is a compute engine, so the rest of the picture around compression and shuffle remains the same as in Hadoop.
3. Spark mostly works with the Parquet or ORC file formats, which are block-level compressed (generally gz-compressed in blocks), making the files splittable.
4. If a file is compressed, Spark spawns tasks depending on the codec (whether it supports splitting or not). The logic is the same as in Hadoop; the sketch below shows the difference in partition counts.
5. Spark handles compression in the same way MapReduce does.
6. Compressed data cannot be processed directly, so data is always decompressed for processing; for shuffling, data is compressed again to optimize network bandwidth usage.
Spark and MR are both compute engines. Compression is about packing data bytes closely so that data can be stored and transferred efficiently.
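Here is a minimal sketch of point 4, runnable locally under stated assumptions (both input paths are hypothetical): a gzip file comes back as a single partition because the codec is not splittable, while plain text splits per HDFS block:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SplitCheck {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("SplitCheck").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);
        // gzip is a non-splittable codec, so the whole file becomes one partition/task
        JavaRDD<String> gz = sc.textFile("/data/logs.gz"); // hypothetical path
        System.out.println("gz partitions: " + gz.getNumPartitions());
        // plain text splits at block boundaries, one task per split
        JavaRDD<String> txt = sc.textFile("/data/logs.txt"); // hypothetical path
        System.out.println("txt partitions: " + txt.getNumPartitions());
        sc.stop();
    }
}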
02-02-2018
03:46 PM
1. Block replication is for redundancy of data; it ensures data is not lost when a disk goes bad or a node goes down.
2. Replication factor 1 is set when data can be recreated at any point and its loss is not crucial. For example, in a job chain the output of one job is consumed by the next, and eventually all intermediate data is deleted; that intermediate data can be marked with a replication factor of 1 (though it is still good to have 2). The sketch below shows how to lower it for a specific path.
3. A replication factor of 1 does NOT make the cluster fault tolerant. In your case you have 3 worker nodes; with RF 1, if a worker goes bad you lose data and it cannot be processed. I suggest RF=2 if you are concerned about space utilization.
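For example (the path is made up), the replication factor of recreatable intermediate data can be lowered either with hdfs dfs -setrep -w 2 <path> or programmatically:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // lower replication for recreatable intermediate output (hypothetical path)
        boolean changed = fs.setReplication(new Path("/tmp/job-chain/intermediate"), (short) 2);
        System.out.println("replication updated: " + changed);
        fs.close();
    }
}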
02-02-2018
03:24 PM
1 Kudo
Yes, you can definitely spawn a Spark application from a web service. The Spark Thrift Server (SparkServer2) is meant for this purpose: it provides a JDBC connection through which you can submit a Spark job. Do remember that Spark jobs are not meant for key-value lookups; think in terms of HBase or Solr for such use cases.
01-01-2018
08:35 PM
1 Kudo
package com.big.data.sparkserver2;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.io.IOException;
import java.sql.*;
public class SparkServer2Client {
private static final Logger LOGGER = LoggerFactory.getLogger(SparkServer2Client.class);
private static final String HIVESERVER2DRIVER = "org.apache.hive.jdbc.HiveDriver";
public static void main(String[] args) throws SQLException, IOException {
// set the configuration
Configuration conf = new Configuration();
conf.set("hadoop.security.authentication", "Kerberos");
UserGroupInformation.setConfiguration(conf);
// /etc/security/keytab/hive.service.keytab is on the local machine; this is the user executing the command
UserGroupInformation.loginUserFromKeytab("hive/hostname.com@FIELD.HORTONWORKS.COM", "/etc/security/keytab/hive.service.keytab");
// load the driver
try {
Class.forName(HIVESERVER2DRIVER);
} catch (ClassNotFoundException e) {
LOGGER.error("Driver not found");
}
Connection con = DriverManager.getConnection("jdbc:hive2://hostname.com:10016/default;httpPath=/;principal=hive/hostname.com@FIELD.HORTONWORKS.COM");
Statement stmt = con.createStatement();
// Table Name
String tableName = "testHiveDriverTable";
stmt.execute("drop table " + tableName);
LOGGER.info("Table {} is dropped", tableName);
stmt.execute("create table " + tableName + " (key int, value string)");
// show tables
String sql = "show tables '" + tableName + "'";
LOGGER.info("Running {} ", sql);
ResultSet res = stmt.executeQuery(sql);
if (res.next()) {
LOGGER.info(" return from HiveServer {}", res.getString(1));
}
// describe table
sql = "describe " + tableName;
LOGGER.info("DESCRIBE newly created table sql command : {}", sql);
res = stmt.executeQuery(sql);
while (res.next()) {
LOGGER.info("Return from HiveServer {}", res.getString(1) + "\t" + res.getString(2));
}
// close the connection
con.close();
}
}
Dependency:
<dependencies>
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-jdbc</artifactId>
<version>1.2.1000.2.6.0.3-8</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
</dependency>
</dependencies>
Repo: http://repo.hortonworks.com/content/groups/public
01-01-2018
06:25 PM
package com.big.data.hiveserver2;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.io.IOException;
import java.sql.*;
public class HiveServer2Client {
private static final Logger LOGGER = LoggerFactory.getLogger(HiveServer2Client.class);
private static final String HIVESERVER2DRIVER = "org.apache.hive.jdbc.HiveDriver";
public static void main(String[] args) throws SQLException, IOException {
// set the configuration
Configuration conf = new Configuration();
conf.set("hadoop.security.authentication", "Kerberos");
UserGroupInformation.setConfiguration(conf);
// /etc/security/keytab/hive.service.keytab is on the local machine; this is the user executing the command
UserGroupInformation.loginUserFromKeytab("hive/cluster20182.field.fieldorg.com@REALM_NAME", "/etc/security/keytab/hive.service.keytab");
// load the driver
try {
Class.forName(HIVESERVER2DRIVER);
} catch (ClassNotFoundException e) {
LOGGER.error("Driver not found");
}
Connection con = DriverManager.getConnection("jdbc:hive2://cluster20181.field.fieldorg.com:2181,cluster20180.field.fieldorg.com:2181,cluster20182.field.fieldorg.com:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2;principal=hive/cluster20182.field.fieldorg.com@REALM_NAME");
Statement stmt = con.createStatement();
// Table Name
String tableName = "testHiveDriverTable";
stmt.execute("drop table " + tableName);
LOGGER.info("Table {} is dropped", tableName);
stmt.execute("create table " + tableName + " (key int, value string)");
// show tables
String sql = "show tables '" + tableName + "'";
LOGGER.info("Running {} ", sql);
ResultSet res = stmt.executeQuery(sql);
if (res.next()) {
LOGGER.info(" return from HiveServer {}", res.getString(1));
}
// describe table
sql = "describe " + tableName;
LOGGER.info("DESCRIBE newly created table sql command : {}", sql);
res = stmt.executeQuery(sql);
while (res.next()) {
LOGGER.info("Return from HiveServer {}", res.getString(1) + "\t" + res.getString(2));
}
// close the connection
con.close();
}
}
Dependency:
<dependencies>
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-jdbc</artifactId>
<version>1.2.1000.2.6.0.3-8</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
</dependency>
</dependencies>
Repo : http://repo.hortonworks.com/content/groups/public
11-09-2017
12:39 PM
Please run fuser /hadoop/yarn/local/registeredExecutors.ldb/LOCK to find out which PID is holding the lock. Then ps -eaf | grep <pid> will show you the process that owns it.