About avijeetd

kkawamura · ‎02-13-2017

Hi @Avijeet Dash, what @Jobin George suggested helps you share common static configurations across different parts of a NiFi flow. In addition, if you'd like to know how to Put/Get from the distributed cache and how to enrich FlowFiles with cached values, this example might be helpful. The template file is available here: https://gist.github.com/ijokarumawak/8ba9a2a1b224603f877e960a942a6f2b Thanks, Koji
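For readers who want the idea before opening the template, here is a rough Python sketch of the Put/Get-and-enrich pattern. It is only an illustration: the `MapCache` class, the `enrich` helper, and the attribute names are hypothetical stand-ins, not the NiFi API or the contents of the template.

```python
# Hypothetical in-memory stand-in for a distributed map cache.
# This is NOT the NiFi API; it only illustrates the pattern.
class MapCache:
    def __init__(self):
        self._store = {}

    def put(self, key, value):
        """Store a value under a key (the 'Put' side)."""
        self._store[key] = value

    def get(self, key, default=None):
        """Look up a value by key (the 'Get' side)."""
        return self._store.get(key, default)


def enrich(flowfile_attrs, cache, key_attr, target_attr):
    """Return a copy of the attributes with the cached value added, if one exists."""
    enriched = dict(flowfile_attrs)
    value = cache.get(flowfile_attrs.get(key_attr))
    if value is not None:
        enriched[target_attr] = value
    return enriched


cache = MapCache()
cache.put("user-42", "premium")      # made-up key and value
attrs = {"user.id": "user-42"}       # made-up attribute name
result = enrich(attrs, cache, "user.id", "user.tier")
```

A FlowFile whose key is not in the cache passes through unchanged, which is usually the behaviour you want to decide on explicitly in a real flow.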

avijeetd · ‎02-03-2017

That's great @Tibor Kiss - I am trying to run a Spark Streaming job. How do I tell it to run in standalone cluster mode?

mqureshi · ‎02-03-2017

@Avijeet Dash I agree with you. It is much more reliable if, after your streaming job, your data lands in Kafka and is then written to HBase/HDFS. This decouples your streaming job from the writes. I wouldn't recommend using Flume. Go with the combination of NiFi and Kafka.

jknulst · ‎01-25-2017

In my opinion it is best to still regard Hive as an analytical DB. With the ACID (updates) and streaming features, the community is stretching the tool to do things it wasn't designed for. These features are not meant to be used at very large scale or under very large loads. ACID and streaming will put tremendous strain on the Hive metastore. In the end, the native storage model of Hive is still based on streaming through whole HDFS files, even with ORC. Without true indexes, Hive will never be a really good match for highly transactional workloads. Doing large analytical sweeps/scans through data is still at odds with high-speed random read/write/update/delete. But that is not bad; there are other components in HDP to do the other jobs right.

mqureshi · ‎01-23-2017

The only thing you can do is limit which IPs can access your cluster, basically by specifying security rules for inbound (and, if needed, outbound) traffic. http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-network-security.html#ec2-classic-security-groups
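To make the idea of an inbound allowlist concrete, here is a small Python sketch of the check such a rule performs. It uses only the standard `ipaddress` module and does not call any AWS API; the CIDR ranges and the `is_allowed` helper are made up for illustration.

```python
import ipaddress

# Made-up CIDR ranges for illustration only (documentation address blocks).
ALLOWED_CIDRS = ["203.0.113.0/24", "198.51.100.17/32"]


def is_allowed(source_ip, allowed_cidrs=ALLOWED_CIDRS):
    """Return True if source_ip falls inside any of the allowed CIDR ranges."""
    addr = ipaddress.ip_address(source_ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in allowed_cidrs)
```

A `/32` range admits exactly one address, which is the usual way to allow a single office or bastion IP.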

mqureshi · ‎01-11-2017

@Avijeet Dash Here is a link for HBase sizing that you can use: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.4/bk_Sys_Admin_Guides/content/ch_clust_capacity.html

If you are using both HBase and SOLR, I am going to assume you are going to index HBase columns in SOLR. There are two concepts in SOLR when it comes to sizing: what you will be indexing and what you will be storing. Sizing depends on knowing what you'll be storing (all of the HBase columns? Probably not, but I am no one to say) and what you'll be indexing (definitely not everything, but whatever you index will be in addition to what you store).

As for SOLR being better without HDFS, that is more of an opinion. I have seen clusters where SOLR cloud is running just fine alongside HBase and HDFS. Here is what you should remember. Zookeeper should have its own dedicated disk (please do not share Zookeeper disks - I cannot overemphasize this). Size appropriately, meaning have the right amount of CPU and memory resources. If you are going to give 4GB of heap space to SOLR, then there will likely be problems (do not go to the other extreme either, as it will result in Java garbage collection pauses - the ideal heap to start with is 8-12 GB).

Another thing to remember is what kind of queries your end users will be running. If they start scanning the entire SOLR index, there shouldn't be any doubt that you will run into issues.
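The heap guidance above can be written down as a quick sanity check. This is only a sketch in Python: the `assess_solr_heap` helper is hypothetical, and the 8 and 12 GB bounds are simply the starting range mentioned in this post, not hard limits.

```python
def assess_solr_heap(heap_gb, low_gb=8, high_gb=12):
    """Classify a proposed SOLR heap size against the 8-12 GB starting range.

    The bounds are taken from the advice in this thread and are meant as a
    starting point to tune from, not as fixed rules.
    """
    if heap_gb < low_gb:
        return "too small: likely to cause problems"
    if heap_gb > high_gb:
        return "large: watch for long garbage collection pauses"
    return "reasonable starting point"
```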

avijeetd · ‎01-27-2017

Thanks @mqureshi - that answers my question. However, a number of components, such as Atlas and Falcon, have started using HBase as a metadata store. How should these use cases be viewed?

dvillarreal · ‎01-04-2017

I was unable to find a way around this. The NameNode just gives admin rights to the system user that started its process, by default the hdfs user. You can also give others superuser permissions with dfs.permissions.superusergroup and dfs.cluster.administrators. It seems Ranger doesn't block superusers except in the case of KMS encrypted zones. In terms of KMS, I can see there is a blacklist mechanism to disallow the superuser. I don't think there is a similar feature for Ranger itself.

avijeetd · ‎01-02-2017

@Divakar Annapureddy I checked the document. The requirement "Eliminates the root account and replaces it with a compliance administrator account that executes commands with sudo" doesn't seem to be supported by Ranger - the hdfs user can access folders protected by Ranger.

elserj · ‎12-13-2016

"I read that accumulo supports cell level security, and hbase doesn't. Is this true?"

Both systems support cell-level security; however, I would say that Accumulo's is a more "battle-hardened" implementation. I'm not aware of any case studies comparing the two implementations.

"and secondly accumulo supports multiple data sources ingestion better and hbase one source such as one web site. is it true? in what ways?"

No, I don't know in what way this would be possible. Both systems can ingest data from a variety of sources. This sounds like something was taken out of context.

"can someone share any accumulo case studies?"

http://accumulo.apache.org/papers/ has some content; http://www.pdl.cmu.edu/SDI/2013/slides/big_graph_nsa_rd_2013_56002v1.pdf, https://arxiv.org/abs/1406.4923, and http://accumulo.apache.org/papers/accumulo-benchmarking-2.1.pdf are each interesting. This talk from PHEMI by Russ Weeks is also particularly nice: http://accumulosummit.com/program/talks/preventing-bugs-how-phemi-put-accumulo-to-work-in-the-field/

"Can accumulo be used with full support and rest of hadoop ecosystem?"

In short, "yes", but this is subjective, depending on what you consider the "rest of hadoop ecosystem" and what degree of integration you're expecting. The same goes for HBase. As for HDP, yes, both HBase and Accumulo are fully supported, as Tim pointed out already. I would suggest you ask more pointed questions if you have specific concerns.

Last Visited	‎01-18-2021 12:06 AM

Member Since	‎06-09-2016 06:30 AM
Posts	185
Kudos received	22

Cloudera Community

Re: storm supervisor error

Re: Storm-hdfs - java.lang.RuntimeException: Error...

Re: zeppelin architecture

Re: Falcon server - cannot create cluster

Re: Hive Streaming

Re: how to put data in PutDistributedMapCache

Re: stream processing runtimes

Re: streaming ingest to hdfs

Re: HIVE positioning

Re: ports required to be open

Re: HBASE and SOLR capacity planning

Re: HIVE and HBASE clusters

Re: Can Ranger support SEC 17a-4

Re: hadoop as a data-archival solution

Re: hbase vs accumulo