Member since 09-23-2015
800 Posts
898 Kudos Received
185 Solutions
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 5410 | 08-12-2016 01:02 PM |
| | 2203 | 08-08-2016 10:00 AM |
| | 2612 | 08-03-2016 04:44 PM |
| | 5501 | 08-03-2016 02:53 PM |
| | 1424 | 08-01-2016 02:38 PM |
07-21-2016
11:19 AM
@Josh Elser Sorry about that :-). Protobuf?
07-21-2016
11:13 AM
1 Kudo
Honestly, if I knew I would have mentioned them :-). I set up a small cluster before with a simple shell script, ssh-all.sh: for i in server1 server2 server3; do ssh $i "$1"; done and created the users manually (we only had ~10 users, so it didn't seem worth setting up LDAP). I never bothered about UIDs and never ran into problems, but we only used the standard stuff (Oozie, Hive ...). Other people have told me that some components don't take this well, and I am honestly not sure which ones those could be. I am sure that a NameNode HA setup with NFS does not work with mismatched UIDs, because NFS depends on the same UID, but I have trouble thinking of another component in a Hadoop environment that would need the same UIDs. HDFS does not care about UIDs. It cares about usernames.
07-21-2016
11:01 AM
3 Kudos
The YARN timeline store should clean up old values. The parameters are:
Cleanup cycle (how often it deletes): yarn.timeline-service.leveldb-timeline-store.ttl-interval-ms
Time to live (what to delete): yarn.timeline-service.ttl-ms
Oh, and finally enable the age-off: yarn.timeline-service.ttl-enable

https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.2/bk_yarn_resource_mgt/content/ref-e54bc3f2-f1bc-4bc6-b6cb-e6337589feb6.1.html

I would check those. If you want to clean things up, you can set the time to live to a low value and restart. Alternatively, you should be able to simply delete the database if you want to; it's just log information after all. Finally, if the parameters are correct and the cleanup still does not work, you might want to open a support ticket.
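Both time-based settings above are given in milliseconds, which is easy to get wrong by a factor of 1000. Here is a minimal Scala sketch that derives the values from human-readable durations and prints them as name=value pairs. The retention of 1 day and the cleanup cycle of 5 minutes are hypothetical choices for illustration, not recommendations, and the object name is made up.

```scala
import java.util.concurrent.TimeUnit

object TimelineTtlSettings {
  // Hypothetical choices, for illustration only.
  val retentionDays: Long = 1
  val cleanupEveryMinutes: Long = 5

  // Property names are the ones listed in the post; values are derived here.
  val settings: Seq[(String, String)] = Seq(
    // Switch the age-off on.
    "yarn.timeline-service.ttl-enable" -> "true",
    // What to delete: entries older than this many milliseconds.
    "yarn.timeline-service.ttl-ms" ->
      TimeUnit.DAYS.toMillis(retentionDays).toString,
    // How often the cleanup runs, in milliseconds.
    "yarn.timeline-service.leveldb-timeline-store.ttl-interval-ms" ->
      TimeUnit.MINUTES.toMillis(cleanupEveryMinutes).toString
  )

  def main(args: Array[String]): Unit =
    settings.foreach { case (name, value) => println(s"$name=$value") }
}
```

The printed values would still have to be entered in the YARN configuration (for example through Ambari) and the timeline server restarted, as described above.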
07-21-2016
09:36 AM
4 Kudos
HDFS:
- You need (per default) 30GB on the DataNodes (3x replication).
- On the NameNode the space is negligible: you have 40 blocks x 3 = 120 blocks, which is about 12 KB of RAM (you need around 100 bytes of NameNode memory for every block). You also need a bit of space on disk for the fsimage, but that is even less, since files, but not blocks, are stored on disk. NameNodes do need a bit more because they also store edits and multiple versions of the fsimage, but it is still very small.

HBase: This is a more complicated question. In HBase it depends on the way you store the data. Every field in your HBase table is stored in HFiles together with the key, the field name, the timestamp ... So if you store the data in a single field per row, your storage is much less than if you have hundreds of 2-byte columns. On the other hand, you can also enable compression in HBase, which reduces the space.
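The arithmetic behind these estimates can be written out as a small Scala sketch. The inputs are assumptions inferred from the numbers in the post: 10 GB of data (30 GB / 3) and a 256 MB block size (which gives the 40 blocks mentioned). The HBase part uses the uncompressed KeyValue layout (4-byte key length, 4-byte value length, 2-byte row length, 1-byte family length, 8-byte timestamp, 1-byte type, plus the row key, family, qualifier and value themselves) and ignores tags, block encoding and compression, so treat it as a rough upper bound. The 16-byte row key, 1-byte family and 4-byte qualifier are hypothetical.

```scala
object StorageEstimate {
  val MiB: Long = 1024L * 1024
  val GiB: Long = 1024L * MiB

  // --- HDFS (assumed inputs, inferred from the post) ---
  val dataBytes: Long = 10 * GiB
  val blockSizeBytes: Long = 256 * MiB
  val replication: Int = 3
  val heapBytesPerBlock: Long = 100 // rule of thumb from the post

  // Raw DataNode space: every byte is stored `replication` times.
  val dataNodeBytes: Long = dataBytes * replication
  // Number of blocks, rounded up.
  val blocks: Long = (dataBytes + blockSizeBytes - 1) / blockSizeBytes
  // The post counts every replica as a block for the memory estimate.
  val blockReplicas: Long = blocks * replication
  val nameNodeHeapBytes: Long = blockReplicas * heapBytesPerBlock

  // --- HBase: approximate uncompressed size of one cell ---
  def cellBytes(rowKey: Int, family: Int, qualifier: Int, value: Int): Long = {
    // key length + value length + row length + family length + timestamp + type
    val fixed = 4 + 4 + 2 + 1 + 8 + 1
    fixed.toLong + rowKey + family + qualifier + value
  }

  def main(args: Array[String]): Unit = {
    println(s"DataNode space:  ${dataNodeBytes / GiB} GiB")
    println(s"Blocks:          $blocks ($blockReplicas replicas)")
    println(s"NameNode memory: $nameNodeHeapBytes bytes")

    // 200 bytes of payload per row, stored two different ways.
    val manySmallColumns = 100 * cellBytes(16, 1, 4, 2)
    val oneWideColumn = cellBytes(16, 1, 4, 200)
    println(s"100 x 2-byte columns: $manySmallColumns bytes per row")
    println(s"1 x 200-byte column:  $oneWideColumn bytes per row")
  }
}
```

Under these assumptions the same 200 bytes of payload take roughly 4300 bytes as a hundred tiny columns and roughly 241 bytes as a single field, which is the effect described in the HBase paragraph.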
07-21-2016
09:20 AM
3 Kudos
On big clusters people normally set up an LDAP server. IPA, for example, is free and simple. Look on GitHub for the security workshops of Ali Bajwa. Or, as said below, use an ssh script, Ansible or pshell to run commands on all nodes. Note that some more esoteric components of the stack require that usernames have the same UID on all nodes of the cluster. https://github.com/abajwa-hw
07-20-2016
04:29 PM
1 Kudo
I heard there is some group caching in HDFS, but it should be refreshed after 5 minutes (hadoop.security.groups.cache.secs). Any chance you could restart HDFS/YARN to make sure that's not the problem?
07-20-2016
02:49 PM
I think you should look at the Spark RDD programming introduction. What you get is an RDD of integers. You can then use Spark functions like map/foreach etc. to do stuff with it. So the question is what you actually want to do: why do you want a List? You can call rdd.collect to get it all in one big Array on your driver, but that is most likely not what you actually want to do.

http://spark.apache.org/docs/latest/programming-guide.html

I.e. clusterPoints.collect() will give you an array of points in your local driver. However, it downloads all results to your local driver and doesn't work in parallel anymore. If that works with your data volumes, great. But normally you should use Spark functions like map to make computations in parallel. Below is a scoring example that runs the scoring point by point, so you could do other things in this function as well; whatever you want to do with the information, essentially.

http://blog.sequenceiq.com/blog/2014/07/31/spark-mllib/

<code>import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}

// data is an RDD[Vector]; K, maxIteration and runs are defined elsewhere
val clusters: KMeansModel = KMeans.train(data, K, maxIteration, runs)
// Score every point in parallel on the executors instead of on the driver
val vectorsAndClusterIdx = data.map { point =>
  val prediction = clusters.predict(point)
  (point.toString, prediction)
}</code>
07-20-2016
02:35 PM
If you want to know which cluster each of your input points belongs to, you need to use the predict method.
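Conceptually, predict on a k-means model returns the index of the cluster center closest to the given point. The following plain Scala sketch shows that idea without Spark; it is an illustration only, not MLlib's implementation, and the object name, centers and points are made up.

```scala
object NearestCenter {
  type Point = Array[Double]

  // Squared Euclidean distance between two points of the same dimension.
  def squaredDistance(a: Point, b: Point): Double = {
    require(a.length == b.length, "points must have the same dimension")
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum
  }

  // Index of the center closest to `point` (the "cluster id").
  def predict(centers: Seq[Point], point: Point): Int = {
    require(centers.nonEmpty, "at least one center is required")
    centers.indices.minBy(i => squaredDistance(centers(i), point))
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical centers and input points.
    val centers = Seq(Array(0.0, 0.0), Array(10.0, 10.0))
    val points = Seq(Array(1.0, 2.0), Array(9.0, 8.5))
    points.foreach { p =>
      val label = p.mkString("[", ", ", "]")
      println(s"$label -> cluster ${predict(centers, p)}")
    }
  }
}
```

With a real KMeansModel you would call its own predict inside a map over the RDD rather than reimplementing the distance computation.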
07-20-2016
02:34 PM
The class provides the method clusterCenters: public Vector[] clusterCenters(). Each Vector is a point, i.e. one cluster center. Or, as said, export it to PMML.
07-20-2016
01:00 PM
2 Kudos
The data is a Java class that contains the cluster information (clusters, centers, statistics ...). If you want to work with that, you either need to use the Spark MLlib library to do extraction/scoring etc., OR you can export a lot of these models as PMML, an XML-based standard for clustering models that is understood by a lot of data mining tools: kmeansModel.toPMML("/path/to/kmeans.xml") https://databricks.com/blog/2015/07/02/pmml-support-in-apache-spark-mllib.html Not all MLlib models support PMML though.