Member since: 03-16-2016
Posts: 707
Kudos Received: 1753
Solutions: 203
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 5127 | 09-21-2018 09:54 PM
 | 6495 | 03-31-2018 03:59 AM
 | 1968 | 03-31-2018 03:55 AM
 | 2179 | 03-31-2018 03:31 AM
 | 4832 | 03-27-2018 03:46 PM
08-25-2016
11:30 PM
5 Kudos
@Bishop Susan Courageous decision! I assume you made the change at the OS level. You only changed the timezone, not the time, and if your jobs are not impacted by the timezone change (you know better), be aware that HDP uses UTC by default. In your case that is what you wanted anyway: UTC is essentially the same as GMT (neither observes DST), so if you are in the UK this makes little practical difference. You could also change everything to GMT, but that requires making changes service by service and restarting each service. For UI display alone, you may not have to restart anything. There is no single place that does it all; this is an ecosystem of independent tools working together.

Regarding how services handle timezones, take the Oozie server timezone as an example. Valid values are UTC and GMT(+/-)####. All dates parsed and generated by the Oozie Coordinator/Bundle are handled in the specified timezone. The default value of UTC should not be changed under normal circumstances. If it is changed for any reason, be aware that GMT(+/-)#### timezones do not observe DST changes. The timezone in the Oozie database is usually set to GMT, because most databases do not handle Daylight Saving Time (DST) shifts correctly.

On a different note, I thought I should raise your awareness that even though you changed the timezone at the OS level on all your nodes, your ecosystem also uses databases such as MySQL, PostgreSQL, or Derby for the Hive metastore or Ambari configuration. You may want to check those too; they should also be set to GMT, like your servers' OS. All other services have similar configuration. If you want to make a timezone change globally, it is a tedious effort, taking it service by service, or you can set it in each job's prerequisite settings.

If you were using different timezones because you had MapReduce jobs for which the timezone mattered, e.g. processing a calendar day in Japan vs. a calendar day in the US, then you would need to change how you start each service to include a parameter set to your timezone of choice (tool-global), or set it per job session: SET mapred.child.java.opts=-Duser.timezone=GMT

If you wish to show the new timezone in various UI screens, that is a matter of display configuration and you should do it in the context of HDP configuration via Ambari. For example, to change the timezone shown in the Ambari UI for measured metrics, follow these instructions: https://docs.hortonworks.com/HDPDocuments/Ambari-2.2.0.0/bk_Ambari_Users_Guide/content/_setting_display_timezone.html

Again, this is a lot for a single response. I just tried to give you a glimpse of the analysis effort needed to plan such a major change while avoiding risk. This kind of change is usually tried first in a development or test environment, where all consequences are addressed. If this helped, don't forget to vote/accept the answer.
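To illustrate the DST point above, here is a minimal sketch (not part of the original answer; it assumes Python 3.9+ for the zoneinfo module) comparing a fixed GMT offset, like Oozie's GMT(+/-)#### values, with the DST-aware Europe/London zone:

```python
from datetime import datetime, timezone, timedelta
from zoneinfo import ZoneInfo  # Python 3.9+

# One run timestamp in winter and one in summer, both expressed in UTC.
winter = datetime(2016, 1, 15, 12, 0, tzinfo=timezone.utc)
summer = datetime(2016, 7, 15, 12, 0, tzinfo=timezone.utc)

fixed_gmt = timezone(timedelta(hours=0))  # like GMT+0000: a fixed offset, never shifts
uk_local = ZoneInfo("Europe/London")      # DST-aware: GMT in winter, BST (+01:00) in summer

for ts in (winter, summer):
    print(ts.astimezone(fixed_gmt).isoformat(), "|", ts.astimezone(uk_local).isoformat())
# The fixed offset prints +00:00 for both dates; Europe/London prints +01:00 in July.
# This is why GMT(+/-)#### settings and UK wall-clock time drift apart under DST.
```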
08-25-2016
08:05 PM
Another option would be to pre-convert XML to JSON.
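As a rough illustration of that pre-conversion step (a sketch only, not from the original post; it assumes a simple XML layout and ignores attributes), something like this could run before loading the data:

```python
import json
import xml.etree.ElementTree as ET

def element_to_dict(elem):
    """Recursively convert an Element into plain dicts/strings (attributes ignored for brevity)."""
    children = list(elem)
    if not children:
        return (elem.text or "").strip()
    out = {}
    for child in children:
        value = element_to_dict(child)
        # Collect repeated tags into lists so JSON arrays come out naturally.
        if child.tag in out:
            if not isinstance(out[child.tag], list):
                out[child.tag] = [out[child.tag]]
            out[child.tag].append(value)
        else:
            out[child.tag] = value
    return out

xml_doc = "<order><id>42</id><item>books</item><item>pens</item></order>"
root = ET.fromstring(xml_doc)
print(json.dumps({root.tag: element_to_dict(root)}))
# {"order": {"id": "42", "item": ["books", "pens"]}}
```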
08-25-2016
06:58 PM
3 Kudos
@milind pandit @Joseph Niemiec mentioned the use of this XML SerDe: http://search.maven.org/remotecontent?filepath=com/ibm/spss/hive/serde2/xml/hivexmlserde/1.0.5.3/hivexmlserde-1.0.5.3.jar I understand that you are looking for an XML SerDe, but you may be open to an alternative. NiFi provides the ConvertCharacterSet processor to convert the character set used to encode the content from one character set to another. Maybe that helps.
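For reference, the re-encoding that ConvertCharacterSet performs on flowfile content is conceptually the same as this small sketch (an illustration only, with hypothetical file names; it is not how the NiFi processor itself is configured):

```python
# Re-encode a file from ISO-8859-1 to UTF-8, the kind of conversion
# NiFi's ConvertCharacterSet processor applies to flowfile content.
src_path = "input_latin1.xml"   # hypothetical input file
dst_path = "output_utf8.xml"    # hypothetical output file

with open(src_path, "r", encoding="iso-8859-1") as src, \
     open(dst_path, "w", encoding="utf-8") as dst:
    for line in src:
        dst.write(line)
```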
08-25-2016
02:04 AM
4 Kudos
@Brandon Wilson Sadly, no. It's all or nothing. As long as you can reach the ResourceManager UI, you see it all. There is no segregation; the ResourceManager is really not user-aware. You would need another layer on top of it capable of showing only some parts of what the ResourceManager UI provides. This could be something that Ranger might implement, but it would require a major redo of the ResourceManager UI. I don't think it is that configurable. Just check the code for it.
08-25-2016
01:30 AM
5 Kudos
@Michel Sumbul My understanding of your question is that aside from HFile encryption (very well covered by @mqureshi's response), you are also asking about non-TDE column-level encryption. HBase does not have a column-level encryption feature out of the box. You could use Dataguise (http://hortonworks.com/partner/dataguise/), or develop your own UDFs for encryption and decryption using some algorithm. The encryption key can be stored in Ranger KMS. The UDF could leverage https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/io/crypto/Encryption.html If any of the responses addressed your question, please don't forget to vote/accept an answer.
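To make the UDF idea concrete: the UDF wrapper itself would be written in Java for Hive/HBase, but the encrypt/decrypt logic it wraps is roughly the following (a sketch only, using the third-party cryptography package and a locally generated key in place of one fetched from Ranger KMS):

```python
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM  # pip install cryptography

# In a real deployment the key would come from a KMS (e.g. Ranger KMS), not be generated here.
key = AESGCM.generate_key(bit_length=256)
aead = AESGCM(key)

def encrypt_cell(plaintext: bytes) -> bytes:
    """Encrypt one column value; prepend the nonce so decrypt_cell can recover it."""
    nonce = os.urandom(12)
    return nonce + aead.encrypt(nonce, plaintext, None)

def decrypt_cell(blob: bytes) -> bytes:
    """Split off the nonce and decrypt the remaining ciphertext."""
    nonce, ciphertext = blob[:12], blob[12:]
    return aead.decrypt(nonce, ciphertext, None)

stored = encrypt_cell(b"4111-1111-1111-1111")
print(decrypt_cell(stored))  # b'4111-1111-1111-1111'
```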
08-25-2016
12:24 AM
5 Kudos
@suresh krish The answer from Santhosh B Gowda could be helpful, but that is brute force with a 50-50 chance of luck. You need to understand the query execution plan, how much data is processed, and how many tasks execute the job. Each task has a container allocated. You could increase the RAM allocated for the container, but if you have a single task performing the map and the data is larger than the container's allocated memory, you will still see "Out of memory". What you have to do is understand how much data is processed and how to chunk it for parallelism. Increasing the size of the container is not always needed; it is almost like saying that instead of tuning a bad SQL statement, let's throw more hardware at it. It is better to have reasonably sized containers and enough of them to process your query data.

For example, take a cross-join of two tables that are small, 1,000,000 records each. The cartesian product will be 1,000,000 x 1,000,000 = 1,000,000,000,000 rows. That is a big input for a mapper, and you need to translate it into GB to understand how much memory is needed. Assuming the memory requirement is 10 GB and tez.grouping.max-size is set to the default 1 GB, 10 mappers will be needed, using 10 containers. Now assume each container is set to 6 GB: you will be reserving 60 GB for a 10 GB need, so in that specific case it would actually be better to have 1 GB containers. Conversely, if your data is 10 GB and you have only one 6 GB container, that will generate "Out of memory". If the execution plan of the query has one mapper, that means one container is allocated, and if that container is not big enough, you will get your out-of-memory error. However, if you reduce tez.grouping.max-size to a lower value, forcing the execution plan to use multiple mappers, you will have one container for each, and those tasks will work in parallel, reducing the time and meeting the data requirements. You can override the global tez.grouping.max-size for your specific query. https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.2/bk_installing_manually_book/content/ref-ffec9e6b-41f4-47de-b5cd-1403b4c4a7c8.1.html describes the Tez parameters and some of them could help; for your case, give tez.grouping.max-size a shot. A back-of-envelope check of this arithmetic is sketched below.

Summary:
- Understand the data volume that needs to be processed.
- Run EXPLAIN on the SQL statement to understand the execution plan: tasks and containers.
- Use the ResourceManager UI to see how many containers and how much of the cluster's resources are used by this query; the Tez View can also give you a good understanding of the Mapper and Reducer tasks involved. The more of them, the more resources are used, but the better the response time. Balance that to use reasonable resources for a reasonable response time.
- Set tez.grouping.max-size to a value that makes sense for your query; by default it is set to 1 GB, and that is a global value.
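As referenced above, here is a back-of-envelope check of that arithmetic (a sketch, not an official Tez sizing formula; the byte figures mirror the example in the answer):

```python
import math

def tez_mapper_estimate(input_bytes, grouping_max_size_bytes, container_size_bytes):
    """Rough estimate: mappers spawned by Tez split grouping, and the memory they reserve."""
    mappers = math.ceil(input_bytes / grouping_max_size_bytes)
    reserved = mappers * container_size_bytes
    return mappers, reserved

GB = 1024 ** 3

# 10 GB of mapper input, default tez.grouping.max-size of 1 GB, 6 GB containers.
mappers, reserved = tez_mapper_estimate(10 * GB, 1 * GB, 6 * GB)
print(mappers, reserved / GB)  # 10 mappers, 60 GB reserved for a 10 GB need

# Same input with 1 GB containers: same parallelism, far less memory held.
mappers, reserved = tez_mapper_estimate(10 * GB, 1 * GB, 1 * GB)
print(mappers, reserved / GB)  # 10 mappers, 10 GB reserved
```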
08-24-2016
04:06 AM
4 Kudos
@milind pandit I will not repeat the content of the responses from @zkfs and @Michael Young. The above responses are great, and they are not exclusive but complementary; my 2c: Falcon will help with HDFS, but it won't help with HBase. I would use Falcon for active-active clusters or disaster recovery. Your question implies that data is migrated from an old cluster to a new cluster, so you could go with the options from @zkfs. Falcon is also an option for the HDFS part, but as I said, the effort to set it up and administer it is only worth it for continuous replication, not a one-time deal. For that continuous case, HBase replication should also be considered; it was not mentioned in the above responses.
08-23-2016
10:28 PM
3 Kudos
@Kumar Veerappana
Assuming that you are only interested in who has access to Hadoop services, extract all OS users from all nodes by checking the /etc/passwd file content. Some of them are legitimate users needed by Hadoop tools, e.g. hive, hdfs, etc. For HDFS, they will have a /user/username folder in HDFS; you can see that with hadoop fs -ls /user executed as a user that is a member of the hadoop group. If they have access to the hive client, they are also able to perform DDL and DML actions in Hive. The above will allow you to understand the current state; however, this is your opportunity to improve security even without the bells and whistles of Kerberos/LDAP/Ranger. You can force users to access Hadoop ecosystem client services via a few client/edge nodes where only client services are running, e.g. the Hive client. Users, other than power users, should not have accounts on the name node, admin node, or data nodes. Any user that can access the nodes where client services are running can access those services, e.g. HDFS or Hive.
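A small sketch of that audit step, run on one node (assumptions: regular human accounts have UID >= 1000 and a login shell, and the hadoop CLI is on the PATH):

```python
import subprocess

def local_users(passwd_path="/etc/passwd", min_uid=1000):
    """Collect login-capable accounts; service accounts usually sit below min_uid."""
    users = []
    with open(passwd_path) as fh:
        for line in fh:
            name, _, uid, _, _, _, shell = line.strip().split(":")
            if int(uid) >= min_uid and not shell.endswith(("nologin", "false")):
                users.append(name)
    return users

def hdfs_home_dirs():
    """Names of /user/<name> directories, i.e. users provisioned for HDFS."""
    out = subprocess.run(["hadoop", "fs", "-ls", "/user"],
                         capture_output=True, text=True, check=True).stdout
    return {line.rsplit("/", 1)[-1] for line in out.splitlines() if "/user/" in line}

if __name__ == "__main__":
    provisioned = hdfs_home_dirs()
    for user in local_users():
        print(user, "has HDFS home" if user in provisioned else "OS account only")
```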
08-22-2016
07:46 PM
4 Kudos
@Anitha R Windows is still supported: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.4-Win/bk_HDP_Install_Win/content/ref-9bdea823-d29d-47f2-9434-86d5460b9aa9.1.html There have been some opinions about discontinuing Windows support for business reasons, but nothing has been officially announced. Windows Server is still a widely used operating system. It is true that it is rarely used for Hadoop, and that could be the driver for such a business decision, which is pure economics.
08-20-2016
12:50 PM
2 Kudos
@vpemawat Yes. HipChat me and I'll explain. The question is loaded, and I'd like to be able to give you good help with your design exercise. I have a few starter questions; if it is too much to answer here, especially since you were satisfied with an answer, we can discuss them in the same HipChat. I'd like to learn how it met your requirements and whether I can help you with anything.

1. What is "huge data" for MySQL? What is the current size, and what is the daily growth?
2. How long did it take after the MySQL solution was put in place to realize that it would not scale? What has the rate of growth been since then? Something must have driven the choice of MySQL in the first place, and something in the conditions probably changed. What is the change in conditions? Why was MySQL chosen to store blobs in the first place? What kind of blobs?
3. About "to scale": is it that you have to query more data while preserving the concurrency and the response time, or do you want everything to be better: more data, higher concurrency, lower response time? How did the SLA change for your customer to want all of this? What is the new use case that was not accounted for by the original design that used MySQL? Usually I would think the challenge is data growth, but it seems the expectation is that by replacing MySQL with something else, the response time also needs to improve.
4. How long does a query take now? To measure the success of a better solution, a reference baseline is good.
5. The three-week data is often queried; how is it stored, and what has been done to address the challenges so far? For the rest of the queries (10%) going beyond three weeks, is the expected response time similar? What is the concurrency needed for those 90% and, respectively, 10%?
6. Could you share a bit about the infrastructure currently used? I need to understand how it is set up to still satisfy the requirements until it is replaced. I assume the business is still running; how does it do it? What was the mitigation in MySQL to keep it running?
7. Could you share a bit about the data access security requirements, in transit and at rest?
8. Could you explain how the blob columns are currently used by the queries? Are they just retrieved as a whole, or do you do more with them in the query?
9. What is an example of the WHERE clause on those 90% of queries?

I asked these sample questions with a goal: to understand the thinking process behind the initial choice, the change in conditions and the drivers for the new requirements, and the match to one technology or another from the list of technologies that are very popular these days in big data. Some of the responses would help to recommend, for example, HBase, Hive, Solr, HDFS, etc. I went into so much detail because you mentioned "design" and not "please help me find, at a 10,000 ft view, a big data technology". That's how I read your question, but based on the accepted answer you were actually looking for that 10,000 ft view.