Member since: 12-14-2015
Posts: 70
Kudos Received: 94
Solutions: 16
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 3844 | 03-14-2017 03:56 PM
 | 484 | 03-07-2017 07:20 PM
 | 2165 | 01-23-2017 05:57 AM
 | 1977 | 01-23-2017 05:40 AM
 | 923 | 10-18-2016 03:36 PM
04-17-2017
12:24 AM
Thank you!
04-13-2017
06:52 PM
1 Kudo
When I think of managing my stack from Ambari, hdp-search makes that much more sense, but what am I losing out on? Are there any limitations to using hdp-search instead of standalone Solr?
Labels:
- Apache Solr
03-28-2017
02:19 AM
1 Kudo
Can someone advise in which HDP release Storm 1.1.0 will be fully GA? I am especially interested in the HDFSBolt partitioning functionality that was added in Storm 1.1.0.
Labels:
- Apache Storm
03-14-2017
03:56 PM
3 Kudos
NiFi does not YET have a true CDC processor, as in a processor that would look into database logs to determine the rows that changed in a given time span. However, there is a processor, "QueryDatabaseTable", which essentially returns the rows that have been updated since the last retrieval. The problem with this processor is that it scans the whole table to find the changed values, which could pose a performance bottleneck if your source table is really big.

Here is the documentation for QueryDatabaseTable (pay special attention to the 'Maximum value columns' property): https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.standard.QueryDatabaseTable/

Here is a blog that walks you through setting up CDC using QueryDatabaseTable: https://hortonworks.com/blog/change-data-capture-using-nifi/

Lastly, specific to your question, should you go down this route, these are the NiFi processors you will probably need:
- QueryDatabaseTable
- ConvertAvroToJson
- PublishKafka
- PutHiveQL / PutHiveStreaming

As an alternative, you may also look into Attunity, which has CDC capability. Hopefully this helps; if it does, please remember to upvote and 'accept' the answer. Thank you!
03-07-2017
07:20 PM
4 Kudos
Working further with our support team and the customer, it was determined that this issue was coming mostly from the Postgres side. The reason is that ticket caching was not enabled on the PG side, and the customer is currently working on enabling it. This document talks about enabling caching for Postgres: http://jpmens.net/2012/06/23/postgresql-and-kerberos/ As far as the question above on multiple requests within the same session goes: yes, the Hive metastore does caching by default, and multiple commands executed within the same HS2 session are translated to a single auth request due to caching at the HMS level.
02-28-2017
05:32 AM
2 Kudos
I am working with a customer who complains of a recurring production issue (about once a month) caused by overloading auth requests to their Kerberos infrastructure (tens of thousands of auth attempts within a very short time frame), and any help with the questions below would be much appreciated. Apparently, these requests come from their Hive Metastore Service (aka HMS) account "hcatalog" and their Postgres database host. The customer would like to better understand how HMS and the Postgres metastore handle authentication requests. It makes sense to have some form of ticket caching to keep these auth attempts fairly low, no? If yes, that should be the expectation. Is this driven by some kind of configuration on the HMS or Postgres side (that the customer has perhaps either mis-configured or is missing)? Thanks, and let me know your thoughts.
Labels:
- Apache Hive
02-11-2017
09:38 PM
@kishore sanchina How did you download NiFi? Did you download it from the Apache website? The Ranger integration of NiFi is available as part of HDF; you can download HDF from http://hortonworks.com/downloads/#dataflow
02-11-2017
09:30 PM
3 Kudos
The sandbox shares the underlying hard disk on which your VM runs. To increase the default storage footprint, go to the VM where you have mounted the HDP sandbox, open Settings, choose Storage, and add new storage. If you find this answer helpful, please upvote and accept the answer. Below is the screenshot for Oracle VM:
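If you prefer the command line to the Settings dialog, a rough VBoxManage equivalent is sketched below; the VM name "HDP Sandbox", the "SATA" controller name, the port number, and the 50 GB size are all assumptions, so check your own setup with VBoxManage showvminfo first.

# Create a new ~50 GB virtual disk (size is in MB) and attach it to the sandbox VM.
VBoxManage createmedium disk --filename extra-storage.vdi --size 51200
VBoxManage storageattach "HDP Sandbox" --storagectl "SATA" --port 1 --device 0 --type hdd --medium extra-storage.vdi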
02-01-2017
04:12 PM
Hive is very similar to a relational database design, so as a first step you can create a Hive table using syntax like this (in its simplest form):

create table table_name (
  id int,
  name string
)
partitioned by (date string)

(Note that the partition column is declared only in the partitioned by clause; it must not be repeated in the regular column list.)

There are many variants you can add to this table creation, such as where it is stored, how it is delimited, etc., but in my opinion keep it simple first and then expand your mastery. This link (the one I always refer to) talks in detail about the syntax for DDL operations, the different options, etc.: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL Once you have this taken care of, you can start inserting data into Hive. The different options available for this are explained in the DML documentation: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML These two links are a good starting point for getting closer to Hive in general.

Then, specifically for your question on loading XML data: you can either load the whole XML file as a single column and read it using the xpath UDF at read time, or break each XML tag out into a separate column at write time. I will go through both options here in a little detail.

Writing XML data as a single column: you can simply create a table like

CREATE TABLE xmlfiles (id int, xmlfile string)

and then put the entire XML data into the string column. At read time, you can use the XPath UDFs (user-defined functions that come along with Hive) to read the data; details here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+XPathUDF (a minimal read sketch follows at the end of this answer). This approach makes writing data easy, but may have some performance overhead at read time (as well as limitations on doing aggregates on the result set).

Writing XML data as columnar values in Hive: this approach is a little more drawn out at write time, but easier and more flexible for read operations. Here you first convert your XML data into either Avro or JSON and then use one of the SerDes (serializer/deserializer) to write the data to Hive. This will give you some context: https://community.hortonworks.com/repos/30883/hive-json-serde.html

Hope this makes sense. If you find this answer helpful, please 'Accept' my initial answer above.
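To make the single-column approach concrete, here is a minimal read sketch; the table and column names follow the xmlfiles example above, and the XPath expression '/record/name' is just a stand-in for your own XML structure.

# Hypothetical example: pull one tag out of the stored XML string with Hive's xpath_string UDF.
hive -e "SELECT id, xpath_string(xmlfile, '/record/name') AS name FROM xmlfiles LIMIT 10;"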
02-01-2017
03:53 PM
I think this question is similar to this one https://community.hortonworks.com/questions/79103/what-is-the-best-way-to-store-small-files-in-hadoo.html and I have posted my answer there.
01-23-2017
05:57 AM
3 Kudos
@ripunjay godhani I also answered your other post on changing the block size and why you should refrain from doing so, so here I will simply address the other ways you can overcome this small-file problem. The primary questions to ask when picking a data archive strategy are:
- How am I going to access this data?
- How often am I going to access this archived data?
- Am I bound by stringent SLAs?

The answers will help you figure out whether you need low-density spinning disks or SSDs from a hardware perspective, and whether to put this data into HBase (memory intensive) or just plain files. You put data into HBase when you have very stringent SLAs, like sub-second response, and have the luxury of clustering a lot of nodes with high memory (RAM); that doesn't seem to be the case from your explanation above. So here are my two suggestions (in order of preference):
- Put the data into Hive. There are ways to put XML data into Hive: at a very dirty level you have the xpath UDF to work on XML data in Hive, or you can package it more luxuriously by converting the XML to Avro and then using a SerDe to map the fields to column names. (Let me know if you want to go over this in more detail and I can help you there.)
- Combine a bunch of files, zip them up, and upload the archive to HDFS (a small sketch follows at the end of this answer). This option is good if your access is very cold (once in a while) and you are going to access the files physically (like hadoop fs -get).

Let me know if you have further questions. Lastly, if you find this answer to be helpful, please upvote and accept my answer. Thank you!!
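For the combine-and-archive suggestion, a minimal sketch might look like the following; the source directory, archive name, and /archive/2016 target path are all assumptions, so adjust them to your own layout.

# Bundle a month's worth of small XML files into one compressed archive, then push it to HDFS.
tar -czf xml-2016-12.tar.gz /data/incoming/2016/12/*.xml
hadoop fs -mkdir -p /archive/2016
hadoop fs -put xml-2016-12.tar.gz /archive/2016/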
01-23-2017
05:40 AM
4 Kudos
@ripunjay godhani Here is the general answer: reducing the default block size will result in the creation of too many blocks, which puts overhead on the NameNode. By architecture, each node in the Hadoop cluster (in newer architectures it is each storage type per node, but that conversation is for a different time) reports a storage report and a block report back to the NameNode, which are then used when retrieving/accessing the data at a later time. So, as you would imagine, this increases the chattiness between the NameNode and the DataNodes, as well as the metadata held on the NameNode itself. Also, when you start hitting the range of hundreds of millions of files, the NameNode will start filling up its memory and may end up going through a major garbage collection, which is a stop-the-world operation and may leave your whole cluster down for a few minutes. There are ways around this, like increasing the NameNode memory or changing the GC, but none of them are economical or easy. These are the downsides of reducing the block size, or of a small-file problem in general.

Now coming to your specific use case: why do you think you have so many small files? Is there a way you can merge multiple of them into a larger file? One of my customers had a similar issue while storing tick symbols; they mitigated it by combining the tick data on an hourly basis. Another customer had small source files arriving over FTP, and they mitigated it by gzipping a bunch of those files into one really large file. Archiving the data to Hive is another option. The bottom line is that the small-file issue on Hadoop must be viewed as a combination of a technical and a business problem, and you will be best off looking for ways to eliminate the situation from the business standpoint as well. Simply playing around with the block size is not going to give you much mileage. Lastly, if you felt this answer to be helpful, please upvote and accept the answer. Thank you!
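As a quick way to gauge how big the small-file problem actually is before touching the block size, you can count files and bytes per path from the shell; /data below is just a placeholder for your own landing directory.

# Output columns are: directory count, file count, total bytes, path.
hadoop fs -count /data
# Total size in human-readable form; average file size = total bytes / file count.
hadoop fs -du -s -h /data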
10-18-2016
04:01 PM
Thanks my friend!
10-18-2016
03:36 PM
3 Kudos
@vpemawat If you are not using log4j: if you are looking to delete the files for good, there are not many options available other than rm -rf; however, there are a few tweaks you can make to speed it up. You can run multiple rm scripts in parallel (multiple threads). To do this, you should be able to logically separate the log files either by folder or by name format. Once you have done that, you can run multiple rm commands in the background, something like:

nohup rm -fr app1-2016* > /tmp/nohup.out 2>&1 &
nohup rm -fr app1-2015* > /tmp/nohup.out 2>&1 &

If you are using log4j: you should probably be using 'DailyRollingFileAppender' with 'maxBackupIndex'; this essentially caps how many rolled-over log files are kept and purges the older ones. More details here: http://www.codeproject.com/Articles/81462/DailyRollingFileAppender-with-maxBackupIndex

Outside of this, you should consider the below two things for future use cases:
- Organize the logs by folder (normally broken down like /logs/appname/yyyy/mm/dd/hh/<log files>)
- Have a mechanism that either deletes old log files or archives them to a different log archive server

Hopefully this helps. If it does, please 'accept' and 'upvote' the answer. Thank you!!
10-09-2016
10:16 PM
1 Kudo
You are welcome. Glad it worked.
10-09-2016
10:11 PM
2 Kudos
Go to Ambari >> Kafka >> Configs and look for the port the Kafka broker listens on. If it is HDP sandbox 2.4, it will most probably be 6667, and therefore you should run the command below instead:

./kafka-console-producer.sh --broker-list sandbox.hortonworks.com:6667 --topic test1

Let me know if this works. If not, post the exact exception and we can look deeper. If this answer helps you, please don't forget to upvote / accept it.
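If you prefer to check from the shell instead of Ambari, something like the line below should show the configured listener port; the /usr/hdp/current/kafka-broker path is the usual HDP layout, but treat it as an assumption for your sandbox.

# Look for the broker port / listeners entries in the Kafka broker config.
grep -E '^(port|listeners)' /usr/hdp/current/kafka-broker/config/server.properties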
09-26-2016
05:48 AM
3 Kudos
@Bala Vignesh N V Unfortunately, you cannot run multiple insert commands on the same destination table at the same time (technically you can, but the jobs will get executed one after the other). However, if you are using external files, you can achieve parallelism by writing multiple files into your destination folder and creating a Hive external table on top of that folder. It will look something like this:

CREATE EXTERNAL TABLE page_view(viewTime INT, userid BIGINT,
  page_url STRING, referrer_url STRING,
  ip STRING COMMENT 'IP Address of the User',
  country STRING COMMENT 'country of origination')
LOCATION '/logs/mywebapp/'

where '/logs/mywebapp/' is your HDFS directory, into which you write multiple files (one for each of your parallel jobs); a small sketch of the parallel writes follows below. ** If this answers your question, please don't forget to upvote and Accept the answer **
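To make the parallel-write idea concrete, here is a minimal sketch; the part-file names are made up, and each of your jobs would produce its own file.

# Each parallel job drops its own output file into the external table's location;
# Hive reads all files under /logs/mywebapp/ at query time.
hadoop fs -put pageviews_job1.tsv /logs/mywebapp/
hadoop fs -put pageviews_job2.tsv /logs/mywebapp/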
09-22-2016
02:54 AM
@Girish Chaudhari What happened right after you executed the ALTER TABLE command? Did you get any errors? I am assuming you tried describe extended <table_name> to determine the location it is referring to?
09-21-2016
03:17 AM
Thanks @Randy Gelhausen
09-21-2016
03:17 AM
Thanks @ajaysingh
09-21-2016
02:51 AM
2 Kudos
I know Syncsort is a possible solution here, but I wanted to check whether HDF can do the job, and whether we have any other recommendation besides Syncsort.
09-01-2016
02:00 AM
1 Kudo
Just an update: the 'SelectHiveQL' processor has been added as part of NiFi 0.7.
08-30-2016
12:19 PM
3 Kudos
Before I answer the question specifically, let me address (based on my research) the fault tolerance of each of the components within Storm:

1) Nimbus - A stateless daemon which sits on the master node, deploys the job (topology), and keeps track of it. There are two scenarios. First, if Nimbus goes down after you submit the topology, it will not have any adverse effect on the current topology, as the topology runs on the worker nodes (and not on the master node). If you have kept this process under supervision it will restart, and since Nimbus is fail-fast, when it comes back it will retrieve the meta information for all active topologies on the cluster from Zookeeper and start tracking them again. Second, if Nimbus goes down before you submit the topology, you will simply have to restart it.

2) Supervisor - This daemon is responsible for keeping track of the worker processes (JVM processes) on the node it sits on and coordinating state with Nimbus through Zookeeper. If this daemon goes down, your worker processes will not be affected and will keep running unless they crash; once it comes back (thanks to supervisord or monit) it will collect the state from Zookeeper and resume tracking the worker processes. If a timeout occurs, Nimbus will reschedule the topology on a different worker node.

3) Worker processes (JVM processes) - These container processes actually execute your topology's components (spouts + bolts). If one goes down, the Supervisor will simply restart it on a different port (on the same worker node), and if it runs out of ports it will notify Nimbus, which will then reschedule the process on a different worker node.

4) Worker node (Supervisor + worker processes) - In this scenario Nimbus stops receiving heartbeats (due to timeout) from the worker node and simply reassigns the work to different worker node(s) in the cluster.

5) Zookeeper (ZK) - From all the above you might have inferred that all the state gets stored in ZK. So what if it goes down; can it go down? ZK is not a single-node process either: it has its own cluster and the state stored in ZK is constantly replicated, so even if a single ZK node goes down, a new leader will be elected and will keep communicating with Apache Storm.

Now, going back to the specific question: when a supervisor with 4 slots (ports) goes down, the very first thing Nimbus will try to do is restart the processes on the SAME worker node on the available ports. The processes for which it does not have a port will be reassigned to different worker nodes, so yes, it will increase the executor threads on those worker nodes. And from a design perspective, you should not necessarily have to account for redundant ports, as Nimbus is designed to take care of this by either restarting the processes on that port or by redistributing them among other worker nodes.
08-30-2016
12:14 PM
Thanks @Rajkumar Singh
08-30-2016
12:07 AM
2 Kudos
Say there were 3 extra slots and a supervisor with 4 slots (supervisor.slots.ports) goes down; what happens? Does Storm automatically increase the number of executor threads in the worker processes on other supervisors?
Labels:
- Apache Storm
08-20-2016
05:55 PM
3 Kudos
@gkeys The MergeContent processor has 2 properties that I normally use to control the output file size:
- Minimum Number of Entries
- Minimum Group Size

For your question on how to increase the file size to reach a desired size (say 1 GB): set the Minimum Group Size to the size you would like (i.e., 1 GB) AND set the Minimum Number of Entries to 1. This will merge the content up to 1 GB before it writes out to the next processor.

Can you clarify a little more your other question about doubling the size of the existing setting? If you mean doubling the size of the incoming file, that is straightforward: just set the Minimum Number of Entries to 2 and the Minimum Group Size to 0 B.
08-06-2016
09:56 PM
1 Kudo
@Iyappan Gopalakrishnan Download the nifi-0.7.0-bin.zip file from the downloads page: https://nifi.apache.org/download.html When you unzip the file, you will see the standard NiFi folder structure (bin, conf, lib, etc.). Then, based on your OS, you can use either 'bin/run-nifi.bat' for Windows or 'bin/nifi.sh start' for Mac/Linux. More details on how to start NiFi are here: https://nifi.apache.org/docs/nifi-docs/html/getting-started.html#starting-nifi You can tail the logs from logs/nifi-app.log (to see if it starts properly). OPTIONAL: By default, NiFi starts on port 8080; if you see a port conflict or want to start it on a different port, you can change that by editing the file 'conf/nifi.properties', searching for 8080, and updating the port number. If you like the answer, please make sure to upvote or accept it.
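Putting the steps above together, here is a minimal sketch for Mac/Linux; it assumes the zip was downloaded into the current directory and unpacks into a nifi-0.7.0 folder.

# Unpack, start NiFi, and watch the application log until the UI comes up (default port 8080).
unzip nifi-0.7.0-bin.zip
cd nifi-0.7.0
./bin/nifi.sh start
tail -f logs/nifi-app.log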