Member since: 09-23-2015
Posts: 800
Kudos Received: 898
Solutions: 185

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 5397 | 08-12-2016 01:02 PM |
| | 2200 | 08-08-2016 10:00 AM |
| | 2607 | 08-03-2016 04:44 PM |
| | 5496 | 08-03-2016 02:53 PM |
| | 1421 | 08-01-2016 02:38 PM |
07-21-2016
11:19 AM
@Josh Elser Sorry about that :-). Protobuf?
07-21-2016
11:13 AM
1 Kudo
Honestly, if I knew I would have mentioned them :-). I set up a cluster with a simple shell script ssh-all.sh: for i in server1 server2 server3; do ssh $i "$1"; done, and created users manually on a small cluster before (we only had ~10 users, so it didn't seem worth it to set up LDAP). I never bothered about UIDs and never ran into problems, but we only used standard components like Oozie and Hive. Other people have told me that some components don't take this well; honestly I am not sure which. I am sure that a NameNode HA setup with NFS shared storage does not work with mismatched IDs, because NFS depends on the UIDs being the same, but I have trouble thinking of another component in a Hadoop environment that would need identical UIDs. HDFS does not care about UIDs, it cares about usernames.
07-21-2016
11:01 AM
3 Kudos
The YARN timeline store should clean up old values by itself. The relevant parameters are: the cleanup cycle (how often it deletes), yarn.timeline-service.leveldb-timeline-store.ttl-interval-ms; the time to live (what gets deleted), yarn.timeline-service.ttl-ms; and finally the switch that enables the age-off, yarn.timeline-service.ttl-enable. https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.2/bk_yarn_resource_mgt/content/ref-e54bc3f2-f1bc-4bc6-b6cb-e6337589feb6.1.html I would check those. If you want to clean things up now, you can set the TTL to a low value and restart. Alternatively, you should be able to simply delete the leveldb database if you want to; it is just log information after all. Finally, if the parameters are correct and still do not work, you might want to open a support ticket.
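A minimal yarn-site.xml sketch with those three properties; the values below are only illustrative assumptions (keep one week, purge once a day), not recommendations from this thread:

```xml
<!-- Illustrative values only: enable age-off, keep 7 days, run cleanup once a day -->
<property>
  <name>yarn.timeline-service.ttl-enable</name>
  <value>true</value>
</property>
<property>
  <name>yarn.timeline-service.ttl-ms</name>
  <value>604800000</value> <!-- time to live: 7 days in milliseconds -->
</property>
<property>
  <name>yarn.timeline-service.leveldb-timeline-store.ttl-interval-ms</name>
  <value>86400000</value> <!-- cleanup cycle: once per day -->
</property>
```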
07-21-2016
09:36 AM
4 Kudos
HDFS: You need (with the default 3x replication) 30 GB on the datanodes. On the NameNode the space is negligible: you have 40 blocks x 3 replicas = 120 block objects, which is roughly 12 KB of NameNode RAM (you need around 100 bytes of RAM for every block in NameNode memory). You also need a bit of space on disk, but that is even less, since the fsimage stores files but not blocks; the NameNode does need a bit more than that because it also keeps edit logs and multiple versions of the fsimage, but it is still very small.

HBase: A more complicated question. In HBase it depends on how you store the data: every cell of an HBase table is stored in the HFiles together with the row key, the column name, the timestamp and so on. So if you store everything in a single column per row, your storage is much smaller than if you had hundreds of 2-byte columns. On the other hand you can also enable compression in HBase, which reduces the space again.
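A back-of-the-envelope sketch of the NameNode estimate above, assuming roughly 10 GB of raw data and a 256 MB block size (those two inputs are my assumption; they are just the values that reproduce the ~40 blocks mentioned):

```scala
// Assumed inputs: ~10 GB of data and a 256 MB dfs.blocksize; not from the thread,
// only chosen so the numbers match the ~40 blocks / ~30 GB used above.
val dataBytes      = 10L * 1024 * 1024 * 1024
val blockSizeBytes = 256L * 1024 * 1024
val replication    = 3

val blocks        = math.ceil(dataBytes.toDouble / blockSizeBytes).toLong // ~40 blocks
val blockReplicas = blocks * replication                                  // ~120 replicas
val nnRamBytes    = blockReplicas * 100L                                  // ~100 bytes per block object

println(s"datanode disk ~ ${dataBytes * replication / (1024L * 1024 * 1024)} GB") // ~30 GB
println(s"NameNode RAM  ~ $nnRamBytes bytes")                                     // ~12 KB
```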
07-21-2016
09:20 AM
3 Kudos
On big clusters people normally set up an LDAP server. IPA, for example, is free and simple; look on GitHub for the security workshops of Ali Bajwa. Or, as said below, use an ssh script, Ansible, or a parallel shell tool to run the commands on all nodes. Note that some of the more esoteric components of the stack require that usernames have the same UID on all nodes of the cluster. https://github.com/abajwa-hw
07-20-2016
04:29 PM
1 Kudo
I heard there is some group caching in HDFS, but it should be refreshed after 5 minutes (hadoop.security.groups.cache.secs). Any chance you can restart HDFS/YARN to make sure that is not the problem?
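For reference, a core-site.xml sketch of that cache setting; 300 seconds (5 minutes) is the default, shown here only to illustrate where you would lower it:

```xml
<!-- Group-mapping cache used by HDFS/YARN; 300 seconds is the default. -->
<property>
  <name>hadoop.security.groups.cache.secs</name>
  <value>300</value>
</property>
```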
07-20-2016
02:49 PM
I think you should look at the Spark RDD programming introduction. What you get is an RDD of integers, and you can then use Spark functions like map/foreach to work with it. So the question is what you actually want to do; why do you want a List? You can call rdd.collect to get everything in one big Array on your driver, but that is most likely not what you actually want. http://spark.apache.org/docs/latest/programming-guide.html For example, clusterPoints.collect() will give you an array of points in your local driver, but it downloads all results to the driver and no longer works in parallel. If that works with your data volumes, great; normally, though, you should use Spark functions like map to run the computation in parallel. Below is a scoring example that scores point by point, so you could do whatever else you want with the information inside that function as well. http://blog.sequenceiq.com/blog/2014/07/31/spark-mllib/

```scala
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}

// Train a model, then score every point in parallel on the cluster.
val clusters: KMeansModel = KMeans.train(data, K, maxIteration, runs)
val vectorsAndClusterIdx = data.map { point =>
  val prediction = clusters.predict(point)   // index of the closest cluster center
  (point.toString, prediction)
}
```
07-20-2016
02:35 PM
If you want to know, for your input points, which point belongs to which cluster, you need to use the predict method.
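A minimal sketch of that, assuming model is an already trained KMeansModel and points is the RDD[Vector] you trained on (both names are placeholders):

```scala
import org.apache.spark.mllib.clustering.KMeansModel
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// model and points are placeholders for a trained KMeansModel and your input RDD[Vector].
def clusterAssignments(model: KMeansModel, points: RDD[Vector]): RDD[(Vector, Int)] =
  points.map(p => (p, model.predict(p)))   // predict returns the index of the closest center
```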
07-20-2016
02:34 PM
The class provides the method clusterCenters: public Vector[] clusterCenters(). Each Vector is a point, i.e. a cluster center. Or, as said, export the model to PMML.
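A short sketch of reading the centers off a trained model (model is again a placeholder for your KMeansModel):

```scala
import org.apache.spark.mllib.clustering.KMeansModel
import org.apache.spark.mllib.linalg.Vector

// model is a placeholder for an already trained KMeansModel.
def printCenters(model: KMeansModel): Unit = {
  val centers: Array[Vector] = model.clusterCenters   // one Vector per cluster
  centers.zipWithIndex.foreach { case (c, i) => println(s"cluster $i center: $c") }
}
```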
07-20-2016
01:00 PM
2 Kudos
The data is a Java class that contains the cluster information: cluster centers, statistics and so on. If you want to work with it, you either need to use the Spark MLlib library to do the extraction/scoring, or you can export many of these models as PMML, an XML-based standard for clustering (and other) models that is understood by a lot of data mining tools. Many of the models can be exported like this: kmeansModel.toPMML("/path/to/kmeans.xml") https://databricks.com/blog/2015/07/02/pmml-support-in-apache-spark-mllib.html Not all MLlib models support PMML though.
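A minimal end-to-end sketch of the PMML export, assuming a SparkContext named sc and a text file of whitespace-separated numeric features; the input path, the number of clusters and the iteration count are placeholders, not values from this thread:

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Placeholder input: one line per point, whitespace-separated numeric features.
val data = sc.textFile("/path/to/features.txt")
  .map(line => Vectors.dense(line.split("\\s+").map(_.toDouble)))
  .cache()

// Placeholder parameters: 3 clusters, 20 iterations.
val kmeansModel = KMeans.train(data, 3, 20)

// Writes the model as PMML XML to a local path; toPMML also has overloads
// for a SparkContext plus HDFS path, an OutputStream, or returning a String.
kmeansModel.toPMML("/path/to/kmeans.xml")
```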