Member since: 06-07-2016
Posts: 923
Kudos Received: 322
Solutions: 115
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
|  | 4114 | 10-18-2017 10:19 PM |
|  | 4349 | 10-18-2017 09:51 PM |
|  | 14872 | 09-21-2017 01:35 PM |
|  | 1845 | 08-04-2017 02:00 PM |
|  | 2427 | 07-31-2017 03:02 PM |
01-11-2017
07:30 AM
1 Kudo
@Avijeet Dash Please see my comments inline:

> if I keep it in HDFS - I would need 200TB+ storage considering compression factor 2.3 and replication factor 3 (100/2.3 * 4)

How do you know your compression ratio will be 2.3? What type of data is it? Is it going to be in a structured format (like columnar data)? In any case, assuming your replication factor is 3, you need to multiply by 3, not 4, which brings your storage requirement to around 100 x 3 / 2.3 = 130 TB.

> if I keep in HBASE - I would need around 1000TB storage considering HBASE needs 5 times storage of HDFS

Where did you get this "five times" number? HBase uses HDFS for storage, which means a replication factor of 3. Compression can also be enabled in HBase, so your storage requirement should be very similar to what you have in HDFS (not exactly the same, because ORC stores data differently than HBase, which means your compression will result in different storage requirements).

> if I keep in SOLR - I would need around 100TB (close to the raw data size) - no hdfs

If you are not storing data in HDFS, then where is it going? A regular file system? If yes, are you going to have a RAID array or SAN environment where your data is stored and replicated for resiliency? Both SAN and RAID make additional copies of data for resiliency. Using something like erasure coding (Reed-Solomon) you can save space, but it will not be an exact 1:1 ratio. Imagine you store only one copy and your disk fails: are you okay with losing data? (If yes, you can set the replication factor in HDFS to 1.) Also, SOLR will require more space because it needs to build large indexes, which are stored in addition to your data and can be very large.

One question that's bugging me here is that it seems you are choosing exactly one of these systems for your use case. If so, you cannot make that decision based on storage requirements alone. Bringing raw data into HDFS is very different than importing it into HBase, which in turn is very different than using SOLR. Which approach you take depends on your ultimate use case. If you need to index data for fast searches, you will likely use SOLR, which, by the way, will require more storage. If your use case is ETL offload, you should probably just use HDFS combined with Hive and Spark.
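To make the arithmetic above concrete, here is a minimal back-of-the-envelope sketch in Java. The 100 TB raw size, the 2.3 compression ratio, and the replication factor of 3 are assumptions carried over from the question, not measured values:

```java
// Rough HDFS storage estimate for the numbers discussed above.
// All inputs are assumptions from the question; verify compression on a sample of your data.
public class StorageEstimate {
    public static void main(String[] args) {
        double rawTb = 100.0;          // raw data size in TB (assumed)
        double compressionRatio = 2.3; // e.g. ORC with ZLIB; depends heavily on the data
        int replicationFactor = 3;     // HDFS default

        double onDiskTb = rawTb / compressionRatio * replicationFactor;
        System.out.printf("Estimated HDFS footprint: %.1f TB%n", onDiskTb);
        // Prints roughly 130.4 TB, matching the 100 x 3 / 2.3 figure above.
    }
}
```

The same formula applies to HBase on HDFS with a different effective compression ratio, plus extra space on top for SOLR indexes if you also index the data.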
01-09-2017
06:10 AM
@Hoang Le No, the Ambari UI setting only applies to files you create in the future. It will not run the setrep command for you; you will have to run that from the shell as described above.
01-09-2017
05:53 AM
@Hoang Le
> 1. I know the default replication factor is 3. But when I configure dfs.replication=1, does it affect cluster performance?

Since you are not replicating, your writes will be faster at the expense of a significant risk of data loss as well as worse read performance. Your reads can be slow because your data might happen to sit on a node experiencing issues with no other copy of the block available, and a single node failure can fail your jobs.

> 2. I have a lot of data written with dfs.replication=1, and now I change the configuration to dfs.replication=3. Will my data be replicated automatically, or do I have to rewrite my data to get it replicated? I need to be sure because my data is very important.

Use setrep to change the replication factor of existing files. It will replicate existing data (you have to provide the path):

hadoop fs -setrep [-R] [-w] <numReplicas> <path>
hadoop fs -setrep -w 3 /user/hadoop/dir1

The -R flag is accepted for backwards compatibility and has no effect. The -w flag requests that the command wait for the replication to complete, which can potentially take a very long time. The command returns 0 on success and -1 on error.

> P/S: any best practice for dfs.replication configuration?

Always use the default replication factor of 3. It provides data resiliency as well as redundancy in case of node failures, and it also helps read performance. In rare cases, you can increase the replication factor further to spread data more widely and make reads even faster.
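If you prefer to do this programmatically instead of via the shell, here is a minimal sketch using the Hadoop FileSystem Java API. The directory path is just the example path from above, and like setrep this only changes the replication factor of files that already exist:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

// Sketch: raise the replication factor of every existing file under a directory.
// Roughly what "hadoop fs -setrep 3 /user/hadoop/dir1" does, minus the optional -w wait.
public class SetReplication {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // picks up core-site.xml / hdfs-site.xml
        try (FileSystem fs = FileSystem.get(conf)) {
            RemoteIterator<LocatedFileStatus> files =
                    fs.listFiles(new Path("/user/hadoop/dir1"), true); // recursive listing
            while (files.hasNext()) {
                Path file = files.next().getPath();
                fs.setReplication(file, (short) 3); // NameNode schedules re-replication in the background
            }
        }
    }
}
```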
12-29-2016
01:30 AM
2 Kudos
@milind pandit Storm core has abstractions for bolts to save and retrieve the state of their operations. There is a default in-memory state implementation and also a Redis-backed implementation that provides state persistence. Currently the only kind of State implementation supported is KeyValueState, which provides a key-value mapping. Bolts that require their state to be managed and persisted by the framework should implement the IStatefulBolt interface or extend BaseStatefulBolt and implement the void initState(T state) method. Please see the following link for details: http://storm.apache.org/releases/2.0.0-SNAPSHOT/State-checkpointing.html
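For illustration, here is a minimal word-count style stateful bolt, roughly along the lines of the example in the linked documentation. Class and field names are my own, and the prepare signature shown matches Storm 2.x:

```java
import java.util.Map;
import org.apache.storm.state.KeyValueState;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseStatefulBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// A stateful bolt whose word counts are checkpointed by the framework.
public class WordCountStatefulBolt extends BaseStatefulBolt<KeyValueState<String, Long>> {
    private KeyValueState<String, Long> wordCounts;
    private OutputCollector collector;

    @Override
    public void prepare(Map<String, Object> topoConf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void initState(KeyValueState<String, Long> state) {
        // Called by the framework with the last saved (or empty) state before tuples arrive.
        this.wordCounts = state;
    }

    @Override
    public void execute(Tuple tuple) {
        String word = tuple.getString(0);
        long count = wordCounts.get(word, 0L) + 1;
        wordCounts.put(word, count);              // state change is persisted at checkpoints
        collector.emit(tuple, new Values(word, count));
        collector.ack(tuple);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word", "count"));
    }
}
```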
12-28-2016
03:48 PM
2 Kudos
@Ye Jun Ambari-Infra is SOLR, but it is used internally by HDP services. Think about where Ranger, Atlas, and other services store and index the data that you can search in sub-seconds: this is all done in Ambari-Infra, which is SOLR. Ambari-Infra is NOT where customers can store and index their own data. If you have a SOLR use case, you cannot use Ambari-Infra; it is not supported. If you want to use SOLR to index your own data, install a separate SOLR service using Ambari (HDP 2.5 and future versions). Link here.
12-27-2016
05:10 PM
1 Kudo
@Atif Mohammad You can use the RJDBC package. Have you tried it?
https://cran.r-project.org/web/packages/RJDBC/index.html
https://cran.r-project.org/web/packages/RJDBC/RJDBC.pdf (documentation)
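RJDBC is a thin wrapper around a standard JDBC driver, so from R you essentially supply a driver class name and a JDBC URL. Purely for illustration, here is the equivalent plain-Java JDBC sketch; the Hive driver class, host, and port are assumptions, and any JDBC driver works the same way:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Plain-JDBC sketch of the connection that RJDBC drives from R.
// Driver class, URL, host, and credentials here are illustrative assumptions.
public class JdbcSketch {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver"); // same class name you would pass to RJDBC's JDBC()
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://your-hiveserver2-host:10000/default", "user", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT 1")) {
            while (rs.next()) {
                System.out.println(rs.getInt(1)); // prints 1 if the connection works
            }
        }
    }
}
```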
12-27-2016
04:11 AM
@Ye Jun
Here is the link you are looking for: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.2/bk_installing_manually_book/content/ch01s13.html
For version 2.5, the link is: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.0/bk_command-line-installation/content/download_hdp_maven_artifacts.html
12-19-2016
12:00 AM
1 Kudo
@slachterman Not sure if I have enough information, but a couple of things come to mind. In a Kerberized environment, you need to do a kinit and then use a proxy user. How are you doing that? I guess you already know this, but you cannot, for example, use a keytab and a proxy user together. See the details here: https://issues.cloudera.org/browse/LIVY-98 Also, I have not used Livy with Zeppelin, but according to the docs you should use "livy.spark" (maybe this is just another way of doing it, but I thought I would point it out): https://zeppelin.apache.org/docs/0.6.0/interpreter/livy.html
12-16-2016
04:18 PM
@Sami Ahmad The user "sami@abc.com" does not have permission to create tables in HBase. You need to grant these permissions to that user in Ranger.
12-16-2016
03:30 PM
@Michael Kalika Rather than going down this path, since it is still just one cluster, why not leverage an "Isolated Processor"? Run ListFile on the primary node only and then load balance. See "Isolated Processor" under the following link: https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html#clustering