Member since: 06-07-2016
Posts: 923
Kudos Received: 319
Solutions: 115
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 1215 | 10-18-2017 10:19 PM
 | 1193 | 10-18-2017 09:51 PM
 | 4984 | 09-21-2017 01:35 PM
 | 306 | 08-04-2017 02:00 PM
 | 360 | 07-31-2017 03:02 PM
07-23-2016
04:43 AM
@Aman Poonia I think what you are asking for is N+2 redundancy for the namenode. This feature will be available in Hadoop 3.0, which allows 3-5 namenodes. Please see the following Jira. https://issues.apache.org/jira/browse/HDFS-6440
07-22-2016
10:48 PM
@Ravi Mutyala Is there more than one rule that hdfs-xyz@EXDOMAIN.COM might evaluate to? Could you share the hadoop.security.auth_to_local setting from your core-site.xml?
07-22-2016
09:27 PM
1 Kudo
@Ravi Mutyala The HDFS balancer must run as a user with the same capabilities as the hdfs superuser. Does the user you are running it as have those capabilities? What command are you using? In a kerberized cluster, you need to kinit first and then run the balancer, like this:
kinit -kt <keytab> <principal>
hdfs balancer -threshold <threshold>
Hope this helps.
07-22-2016
04:36 AM
@vinay kumar I am not a hundred percent sure, but I think you need to reduce your reducer size. The error is thrown in the following file, which expects the number of rows handled by a given reducer to be less than Integer.MAX_VALUE (see line 99). I think you have more than 2147483647 rows being processed by this one reducer. If you reduce the size of the reducers so that no reducer processes more than 2147483647 records, you should not run into this issue. https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/PTFPartition.java (check line 99) I hope this helps.
07-19-2016
06:36 PM
1 Kudo
@alain TSAFACK I think what you are looking for is the actual machine learning code rather than an example that uses it. Here is one place to look (Spark MLlib): https://github.com/apache/spark/tree/master/mllib/src/main/scala/org/apache/spark You can also look at the H2O GitHub repo: https://github.com/h2oai/h2o-3
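In case a small, concrete example of calling MLlib helps as well, here is a rough Java sketch (not taken from the repos above; it assumes Spark 1.x with spark-mllib on the classpath, and the toy data points and cluster count are made up purely for illustration):
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.clustering.KMeans;
import org.apache.spark.mllib.clustering.KMeansModel;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;

public class KMeansSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("KMeansSketch").setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);
        // Toy 2-D points; in practice you would parse these from a file on HDFS.
        JavaRDD<Vector> points = sc.parallelize(Arrays.asList(
                Vectors.dense(0.0, 0.0),
                Vectors.dense(0.1, 0.1),
                Vectors.dense(9.0, 9.0),
                Vectors.dense(9.1, 9.1)));
        // Train a model with 2 clusters and 20 iterations.
        KMeansModel model = KMeans.train(points.rdd(), 2, 20);
        for (Vector center : model.clusterCenters()) {
            System.out.println("Cluster center: " + center);
        }
        sc.stop();
    }
}
The algorithm implementations themselves live in the Spark and H2O repositories linked above; this only shows how application code calls them.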
07-19-2016
06:14 PM
Check configuring Flume and then starting Flume. https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.2/bk_installing_manually_book/content/configuring_flume.html https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.2/bk_installing_manually_book/content/starting_flume.html Once you have your conf file, you basically have to run the following command:
/usr/hdp/current/flume-server/bin/flume-ng agent -c /etc/flume/conf -f /etc/flume/conf/flume.conf -n agent
Or, if Flume is configured as a service (run "chkconfig" to see if it is), simply use:
service flume-agent start
07-19-2016
01:22 AM
Hi @sujitha sanku The administration tool is Ambari. You can pull as much detail as you want to share from the Ambari docs. Thanks
07-18-2016
01:40 PM
1 Kudo
Can you check /usr/hdp/current/flume-server? Check inside the bin folder.
07-13-2016
06:22 PM
Those DBs are likely for the Hive metastore as well as for Ambari. These services are often run on master or edge nodes.
07-13-2016
05:57 PM
@Kumar Veerapan It is not true that the namenode performs all admin functions. You need Ambari to manage the cluster; the namenode only stores the metadata for Hadoop files. As for gateways, you need them because in a large cluster you don't want clients connecting directly to the cluster nodes and opening the cluster up. You would rather have gateway nodes that clients use to access the cluster.
07-06-2016
10:02 PM
Check the link I just added to my answer.
07-06-2016
09:58 PM
Hi @Qi Wang Which user is running the sqoop command? Can you verify that the file /etc/hive/2.5.0.0-817/0/xasecure-audit.xml exists? Does the user running the sqoop import have read access to this file? Also, check the following link; it might be your issue. https://community.hortonworks.com/questions/369/installed-ranger-in-a-cluster-and-running-into-the.html
07-06-2016
05:12 PM
@Sunile Manjee Yes. Here is what I did. Let me know if you have any questions.
// Imports needed (roughly):
// import java.security.PrivilegedExceptionAction;
// import org.apache.hadoop.security.UserGroupInformation;
// import org.apache.spark.SparkConf;
// import org.apache.spark.api.java.JavaSparkContext;
// import org.apache.spark.sql.DataFrame;
// import org.apache.spark.sql.SQLContext;
try {
    // Log in from the keytab and run all Spark work as that user.
    UserGroupInformation ugi = UserGroupInformation.loginUserFromKeytabAndReturnUGI(kerberos_principal, kerberos_keytab);
    objectOfMyType = ugi.doAs(new PrivilegedExceptionAction<MyType>() {
        @Override
        public MyType run() throws Exception {
            System.setProperty("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
            System.setProperty("spark.kryo.registrator", "fire.util.spark.Registrator");
            System.setProperty("spark.akka.timeout", "900");
            System.setProperty("spark.worker.timeout", "900");
            System.setProperty("spark.storage.blockManagerSlaveTimeoutMs", "3200000");
            // Create the Spark context inside doAs so it is created as the keytab user.
            SparkConf sparkConf = new SparkConf().setAppName("MyApp");
            sparkConf.setMaster("local");
            sparkConf.set("spark.broadcast.compress", "false");
            sparkConf.set("spark.shuffle.compress", "false");
            JavaSparkContext ctx = new JavaSparkContext(sparkConf);
            SQLContext sqlContext = new SQLContext(ctx);
            DataFrame tdf = sqlContext.read().format("com.databricks.spark.csv")
                    .option("header", String.valueOf(header)) // use first line of all files as header
                    .option("inferSchema", "true")            // automatically infer data types
                    .option("delimiter", delimiter)
                    .load(path);
            // some more application specific code here
            return objectOfMyType;
        }
    });
}
catch (Exception exception) {
    exception.printStackTrace();
}
07-06-2016
03:28 PM
1 Kudo
I figured this out. I changed the master to local and simply loaded remote HDFS data. It was still throwing an exception because it is a kerberized cluster. While I was using UserGroupInformation and creating a proxy user with a valid keytab to access my cluster, it was failing because I was creating the JavaSparkContext outside of the "doAs" method. Once I created the JavaSparkContext as the right proxy user inside doAs, everything worked.
07-01-2016
07:36 PM
The Hive JDBC jar should be at the following location. You can copy it from here.
/usr/hdp/current/hive-client/lib/hive-jdbc.jar
07-01-2016
08:26 AM
Hi, I am trying to run an application from Eclipse so I can set breakpoints and monitor the changing values of my variables. I create a JavaSparkContext, which uses a SparkConf object. This object needs access to my yarn-site.xml and core-site.xml so it knows how to connect to the cluster. I have these files under /etc/hadoop/conf, and the two environment variables HADOOP_CONF_DIR and YARN_CONF_DIR are set on my Mac (where I run Eclipse) using ~/Library/LaunchAgents/environment.plist. I have verified these variables are available when I boot the Mac, I can read them in my app in Eclipse using System.getenv("HADOOP_CONF_DIR"), and they point to the right location. I have also tried adding the environment variables in my build configuration in Eclipse. After doing all this, my code consistently fails because it is unable to read yarn-site.xml or core-site.xml, and I run into the following:
INFO RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
16/07/01 00:57:16 INFO Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
As you can see, it is not trying to connect to the correct location of the resource manager. Here is what the code looks like in create(). Please let me know what you think, as this is blocking me.
public static JavaSparkContext create() {
    System.setProperty("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
    System.setProperty("spark.kryo.registrator", "fire.util.spark.Registrator");
    System.setProperty("spark.akka.timeout", "900");
    System.setProperty("spark.worker.timeout", "900");
    System.setProperty("spark.storage.blockManagerSlaveTimeoutMs", "3200000");
    // create spark context
    SparkConf sparkConf = new SparkConf().setAppName("MyApp");
    // if (clusterMode == false)
    {
        sparkConf.setMaster("yarn-client");
        sparkConf.set("spark.broadcast.compress", "false");
        sparkConf.set("spark.shuffle.compress", "false");
    }
    JavaSparkContext ctx = new JavaSparkContext(sparkConf); // <- fails here
    return ctx;
}
06-30-2016
08:41 PM
@hoda moradi You will have to do some research, but you might be missing a jar file. Are you sure you have the JDBC jar files on the classpath? See the following two links. https://community.hortonworks.com/questions/19396/oozie-hive-action-errors-out-with-exit-code-12.html https://community.hortonworks.com/articles/9148/troubleshooting-an-oozie-flow.html
06-30-2016
08:22 PM
Hi @hoda moradi Here is the issue you are running into:
User: hive is not allowed to impersonate anonymous at org.apache.hive.service.cli.session.SessionManager.openSession(SessionManager.java:266)
I am assuming this is simple development work and you are not too concerned about policies. If you are, then only your organization's security team can tell you which users Hive may impersonate. Basically, you need to enable Hive impersonation. Can you check whether the following is set to true in your hive-site.xml?
<property>
  <name>hive.server2.enable.impersonation</name>
  <description>Enable user impersonation for HiveServer2</description>
  <value>true</value>
</property>
Also check the following link to set up the proxyuser settings for the hive user in core-site.xml: http://docs.hortonworks.com/HDPDocuments/Ambari-2.1.1.0/bk_ambari_views_guide/content/_setup_HDFS_proxy_user.html You need to set the following. Remember, these definitely should not be * if this is for work; that is where your security team comes in, since they will tell you whom the hive user can impersonate.
hadoop.proxyuser.hive.groups=*
hadoop.proxyuser.hive.hosts=*
06-30-2016
08:04 PM
1 Kudo
Hi @bigdata.neophyte I think Sunile has explained this well enough, but in case you are still confused, I'll try to rephrase it.

First, let's talk about the Hive metastore, which, from your comment on Sunile's answer, I believe you already understand. When you create tables in Hive, you have to record somewhere the location of the data files, the file format of the data, the table name, the columns and so on. You need a place to store this information, and that place is the Hive metastore. It is a database, usually MySQL (or Postgres or Oracle). Why do you need HA for this metastore database? For the same reason you need HA for anything else: if the MySQL instance holding the Hive metastore goes down, you want to be able to fail over to your standby so your users are not impacted. You also need HA for the metastore service, because even if the DB is healthy the metastore service itself can fail, and again you want to fail over to a standby without impacting your users.

Now let's talk about HCatalog. When Hive was created, you could run HiveQL, which is pretty much SQL, on top of your tabular data in Hadoop. That is great, but it is not where all of Hadoop's power lies. One of the most significant differences between Hadoop and traditional platforms is its ability to run different engines against the same data. For your tabular/structured data in Hadoop, you can not only create Hive tables and run SQL queries, but also read the same data in your MapReduce jobs or Pig scripts. How would you do that without HCatalog? You would have to write custom code in your MapReduce jobs and Pig scripts to read the table structure from the Hive metastore, which is what most people did before HCatalog. With HCatalog, those jobs have access to the same information that is in the Hive metastore, so they can quickly and easily read those Hive tables instead of relying on custom code. Check slide 4 of the following link and see how Hive can go directly to the Hive metastore while other services need some other way to talk to it; that way is HCatalog. http://www.slideshare.net/Hadoop_Summit/future-of-hcatalog
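To make the HCatalog part concrete, here is a rough Java sketch of a MapReduce job that reads a Hive table through HCatInputFormat rather than hard-coding the file location and format itself. The database/table names and output path are hypothetical, and it assumes the HCatalog client jars are on the job classpath:
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hive.hcatalog.data.HCatRecord;
import org.apache.hive.hcatalog.mapreduce.HCatInputFormat;

// Reads rows of an existing Hive table through HCatalog instead of
// hard-coding the HDFS location and SerDe in the MapReduce job.
public class HCatReadSketch {

    public static class RowMapper
            extends Mapper<WritableComparable, HCatRecord, Text, Text> {
        @Override
        protected void map(WritableComparable key, HCatRecord value, Context context)
                throws IOException, InterruptedException {
            // Column 0 of the table, whatever its storage format happens to be.
            context.write(new Text(String.valueOf(value.get(0))), new Text(""));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "hcat-read-sketch");
        job.setJarByClass(HCatReadSketch.class);

        // HCatalog looks up the schema, location and SerDe from the Hive metastore.
        HCatInputFormat.setInput(job, "default", "my_hive_table");
        job.setInputFormatClass(HCatInputFormat.class);

        job.setMapperClass(RowMapper.class);
        job.setNumReduceTasks(0);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path("/tmp/hcat_read_sketch_out"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
The point is simply that HCatalog hands the job the same schema and location information Hive itself reads from the metastore.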
06-28-2016
05:26 PM
Do you have Ambari running? You should be able to check the status of your JobHistory Server from Ambari. Otherwise, this should bring up the UI, assuming you haven't modified the default ports: http://<host>:19888
06-28-2016
04:22 PM
@hoda moradi Can you please share your log? Is your job history server running? Thanks
06-26-2016
05:18 AM
This is quite a custom requirement: you are converting some rows to columns and other rows to both rows and columns. You will have to write a lot of your own code, but you can take advantage of the pivot functionality in Spark. Check the following link: https://databricks.com/blog/2016/02/09/reshaping-data-with-pivot-in-apache-spark.html For a plain transpose you can also do something like:
sc.parallelize(rdd.collect.toSeq.transpose)
See the link for more details.
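If your data fits the DataFrame model, here is a rough Java sketch of the pivot approach described in that blog post. It assumes Spark 1.6 or later with the spark-csv package on the classpath, and the input path and the id/metric/value column names are just placeholders for your own schema:
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class PivotSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("PivotSketch").setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(sc);

        // Hypothetical input with columns: id, metric, value.
        DataFrame df = sqlContext.read()
                .format("com.databricks.spark.csv")
                .option("header", "true")
                .option("inferSchema", "true")
                .load("/tmp/input.csv");

        // Turn each distinct value of "metric" into its own column,
        // aggregating "value" for every id.
        DataFrame pivoted = df.groupBy("id").pivot("metric").sum("value");
        pivoted.show();

        sc.stop();
    }
}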
06-26-2016
01:54 AM
@Akash Mehta So even the following won't work for you? If not, I think there is currently no other way, given that we have looked at all the other possible options.
// a DataFrame can be created for a JSON dataset represented by
// an RDD[String] storing one JSON object per string.
val anotherPeopleRDD = sc.parallelize(
"""{"name":"Yin","address":{"city":"Columbus","state":"Ohio"}}""" :: Nil)
val anotherPeople = sqlContext.read.json(anotherPeopleRDD)
06-25-2016
08:44 PM
@Sri Bandaru Since you are not running in a sandbox, what does --master yarn resolve to?
06-23-2016
11:51 PM
load will infer the schema and convert the data to rows. The question is whether it will accept an HTTP URL. Can you try?
06-23-2016
10:58 PM
@Akash Mehta Can you do something like this?
dataframe = sqlContext.read.format("json").load(your json here)
06-23-2016
09:59 PM
I am assuming you have 141 partitions by default (your number of blocks), but you have only 4 or maybe 8 executors. See if you can increase this to 16 executors with 1 GB each. I would also use coalesce to reduce the number of partitions so it is not too high compared to the number of executors, and assign more cores using --executor-cores. I hate to give up, but in the end, doing a count on a 20 GB file with your hardware might just take 20 minutes.
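As a rough Java sketch of what I mean (the input path and numbers are only placeholders; executor count, memory and cores are normally passed on the spark-submit command line):
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class CountSketch {
    public static void main(String[] args) {
        // Submitted with something like:
        // spark-submit --num-executors 16 --executor-memory 1g --executor-cores 2 ...
        SparkConf conf = new SparkConf().setAppName("CountSketch");
        JavaSparkContext sc = new JavaSparkContext(conf);
        // Roughly one partition per HDFS block by default.
        JavaRDD<String> lines = sc.textFile("/data/big_file.txt"); // placeholder path
        // Coalesce so the partition count is not far larger than the total executor cores.
        long count = lines.coalesce(32).count();
        System.out.println("Line count: " + count);
        sc.stop();
    }
}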
06-23-2016
08:51 PM
In this case, I would suggest that rather than doing a direct import into a Hive table, you first stage the data, then do the cleansing, and then do the final import into Hive. You can also import data in one of the supported file formats, such as "--as-sequencefile" or "--as-avrodatafile". I recommend you read the following link to tailor your import strategy. http://getindata.com/blog/post/surprising-sqoop-to-hive-gotchas/
06-23-2016
06:17 PM
1 Kudo
Assuming you have Hive 0.14 or later:
ALTER TABLE MAGNETO.SALES_FLAT_ORDER SET SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde';
06-23-2016
05:47 PM
@Simran Kaur Check your data by doing a "cat" to see what it looks like and how the fields are separated, whether by a space or something else. You can also instead create a table, specify in the CREATE TABLE statement what you want your fields to be terminated by, and then do the import using Sqoop.