Member since: 12-14-2015
Posts: 70
Kudos Received: 94
Solutions: 16
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 3844 | 03-14-2017 03:56 PM
 | 484 | 03-07-2017 07:20 PM
 | 2165 | 01-23-2017 05:57 AM
 | 1977 | 01-23-2017 05:40 AM
 | 923 | 10-18-2016 03:36 PM
04-17-2017
12:24 AM
Thank you!
04-13-2017
06:52 PM
1 Kudo
When I think of managing my stack from Ambari, hdp-search makes that much more sense, but what am I losing out on? Are there any limitations to using hdp-search instead of standalone Solr?
Labels:
- Apache Solr
03-28-2017
02:19 AM
1 Kudo
Can someone advise in which HDP release Storm 1.1.0 will be fully GA? I am especially interested in the HDFSBolt partitioning functionality that was added in Storm 1.1.0.
Labels:
- Apache Storm
03-14-2017
03:56 PM
3 Kudos
NiFi does not YET have a true CDC processor, as in a processor that would look into database logs to determine the rows that changed in a given time span. However, there is a processor, "QueryDatabaseTable", which essentially returns the rows that have been updated since the last retrieval. The problem with this processor is that it scans the whole table to find the changed values, which could pose a performance bottleneck if your source table is really big.

Here is the documentation for QueryDatabaseTable (pay special attention to the 'Maximum value columns' property): https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.standard.QueryDatabaseTable/

Here is a blog that walks you through setting up CDC using QueryDatabaseTable: https://hortonworks.com/blog/change-data-capture-using-nifi/

Lastly, specific to your question, should you go down this route, these are the NiFi processors you will probably need:
- QueryDatabaseTable
- ConvertAvroToJson
- PublishKafka
- PutHiveQL / PutHiveStreaming

As an alternative, you may also look into Attunity, which has CDC capability. Hopefully this helps; if it does, please remember to upvote and 'accept' the answer. Thank you!
03-07-2017
07:20 PM
4 Kudos
Working further with our support team and the customer, it was determined that this issue was coming mostly from the Postgres side. The reason is that ticket caching was not enabled on the PG side, and the customer is currently working on enabling it. This document talks about enabling caching for Postgres: http://jpmens.net/2012/06/23/postgresql-and-kerberos/ As far as the question above on multiple requests within the same session goes: yes, the Hive metastore does caching by default, and multiple commands executed within the same HS2 session are translated to a single auth request due to caching at the HMS level.
02-28-2017
05:32 AM
2 Kudos
I am working with a customer who complains of a recurring production issue (about once a month) caused by overloading auth requests to their Kerberos infrastructure (tens of thousands of auth attempts within a very short time frame), and any help with the questions below would be much appreciated. Apparently, these requests come from their Hive Metastore Service (aka HMS) account "hcatalog" and their Postgres database host. The customer would like to better understand how HMS and the Postgres metastore handle authentication requests. It makes sense to have some form of ticket caching to keep these auth attempts fairly low, no? If yes, that should be the expectation. Is this driven by some kind of configuration on the HMS or Postgres side (that the customer has perhaps either mis-configured or is missing)? Thanks, and let me know your thoughts.
Labels:
- Apache Hive
02-11-2017
09:38 PM
@kishore sanchina How did you download NiFi? Did you download it from the Apache website? The Ranger integration of NiFi is available as part of HDF; you can download HDF from http://hortonworks.com/downloads/#dataflow
02-11-2017
09:30 PM
3 Kudos
The sandbox shares the underlying hard disk on which your VM runs. To increase the default storage footprint, go to the VM where you have mounted the HDP sandbox, open Settings, choose Storage, and add new storage. If you find this answer helpful, please upvote and accept the answer. Below is the screenshot for Oracle VM:
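If you prefer the command line to the Settings dialog, a rough VBoxManage equivalent is sketched below; the VM name "HDP Sandbox", the "SATA" controller name, the port number, and the 50 GB size are all assumptions, so check your own setup with VBoxManage showvminfo first.

# Create a new ~50 GB virtual disk (size is in MB) and attach it to the sandbox VM.
VBoxManage createmedium disk --filename extra-storage.vdi --size 51200
VBoxManage storageattach "HDP Sandbox" --storagectl "SATA" --port 1 --device 0 --type hdd --medium extra-storage.vdi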
02-01-2017
04:12 PM
Hive is very similar to a relational database design, so as a first step you can create a Hive table using syntax like this (in its simplest form):

create table table_name (
  id int,
  name string
)
partitioned by (date string)

(Note that the partition column is declared only in the partitioned by clause; it must not be repeated in the regular column list.)

There are many variants you can add to this table creation, such as where it is stored, how it is delimited, etc., but in my opinion keep it simple first and then expand your mastery. This link (the one I always refer to) talks in detail about the syntax for DDL operations, the different options, etc.: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL Once you have this taken care of, you can start inserting data into Hive. The different options available for this are explained in the DML documentation: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML These two links are a good starting point for getting closer to Hive in general.

Then, specifically for your question on loading XML data: you can either load the whole XML file as a single column and read it using the xpath UDF at read time, or break each XML tag out into a separate column at write time. I will go through both options here in a little detail.

Writing XML data as a single column: you can simply create a table like

CREATE TABLE xmlfiles (id int, xmlfile string)

and then put the entire XML data into the string column. At read time, you can use the XPath UDFs (user-defined functions that come along with Hive) to read the data; details here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+XPathUDF (a minimal read sketch follows at the end of this answer). This approach makes writing data easy, but may have some performance overhead at read time (as well as limitations on doing aggregates on the result set).

Writing XML data as columnar values in Hive: this approach is a little more drawn out at write time, but easier and more flexible for read operations. Here you first convert your XML data into either Avro or JSON and then use one of the SerDes (serializer/deserializer) to write the data to Hive. This will give you some context: https://community.hortonworks.com/repos/30883/hive-json-serde.html

Hope this makes sense. If you find this answer helpful, please 'Accept' my initial answer above.
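To make the single-column approach concrete, here is a minimal read sketch; the table and column names follow the xmlfiles example above, and the XPath expression '/record/name' is just a stand-in for your own XML structure.

# Hypothetical example: pull one tag out of the stored XML string with Hive's xpath_string UDF.
hive -e "SELECT id, xpath_string(xmlfile, '/record/name') AS name FROM xmlfiles LIMIT 10;"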
02-01-2017
03:53 PM
I think this question is similar to this one https://community.hortonworks.com/questions/79103/what-is-the-best-way-to-store-small-files-in-hadoo.html and I have posted my answer there.
01-23-2017
05:57 AM
3 Kudos
@ripunjay godhani I also answered your other post on changing the block size and why you should refrain from doing so, so here I will simply address the other ways you can overcome this small-file problem. The primary questions to ask when picking a data archive strategy are:
- How am I going to access this data?
- How often am I going to access this archived data?
- Am I bound by stringent SLAs?

The answers will help you figure out whether you need low-density spinning disks or SSDs from a hardware perspective, and whether to put this data into HBase (memory intensive) or just plain files. You put data into HBase when you have very stringent SLAs, like sub-second response, and have the luxury of clustering a lot of nodes with high memory (RAM); that doesn't seem to be the case from your explanation above. So here are my two suggestions (in order of preference):
- Put the data into Hive. There are ways to put XML data into Hive: at a very dirty level you have the xpath UDF to work on XML data in Hive, or you can package it more luxuriously by converting the XML to Avro and then using a SerDe to map the fields to column names. (Let me know if you want to go over this in more detail and I can help you there.)
- Combine a bunch of files, zip them up, and upload the archive to HDFS (a small sketch follows at the end of this answer). This option is good if your access is very cold (once in a while) and you are going to access the files physically (like hadoop fs -get).

Let me know if you have further questions. Lastly, if you find this answer to be helpful, please upvote and accept my answer. Thank you!!
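For the combine-and-archive suggestion, a minimal sketch might look like the following; the source directory, archive name, and /archive/2016 target path are all assumptions, so adjust them to your own layout.

# Bundle a month's worth of small XML files into one compressed archive, then push it to HDFS.
tar -czf xml-2016-12.tar.gz /data/incoming/2016/12/*.xml
hadoop fs -mkdir -p /archive/2016
hadoop fs -put xml-2016-12.tar.gz /archive/2016/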
01-23-2017
05:40 AM
4 Kudos
@ripunjay godhani Here is the general answer: reducing the default block size will result in the creation of too many blocks, which puts overhead on the NameNode. By architecture, each node in the Hadoop cluster (in newer architectures it is each storage type per node, but that conversation is for a different time) reports a storage report and a block report back to the NameNode, which are then used when retrieving/accessing the data at a later time. So, as you would imagine, this increases the chattiness between the NameNode and the DataNodes, as well as the metadata held on the NameNode itself. Also, when you start hitting the range of hundreds of millions of files, the NameNode will start filling up its memory and may end up going through a major garbage collection, which is a stop-the-world operation and may leave your whole cluster down for a few minutes. There are ways around this, like increasing the NameNode memory or changing the GC, but none of them are economical or easy. These are the downsides of reducing the block size, or of a small-file problem in general.

Now coming to your specific use case: why do you think you have so many small files? Is there a way you can merge multiple of them into a larger file? One of my customers had a similar issue while storing tick symbols; they mitigated it by combining the tick data on an hourly basis. Another customer had small source files arriving over FTP, and they mitigated it by gzipping a bunch of those files into one really large file. Archiving the data to Hive is another option. The bottom line is that the small-file issue on Hadoop must be viewed as a combination of a technical and a business problem, and you will be best off looking for ways to eliminate the situation from the business standpoint as well. Simply playing around with the block size is not going to give you much mileage. Lastly, if you felt this answer to be helpful, please upvote and accept the answer. Thank you!
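As a quick way to gauge how big the small-file problem actually is before touching the block size, you can count files and bytes per path from the shell; /data below is just a placeholder for your own landing directory.

# Output columns are: directory count, file count, total bytes, path.
hadoop fs -count /data
# Total size in human-readable form; average file size = total bytes / file count.
hadoop fs -du -s -h /data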
10-18-2016
04:01 PM
Thanks my friend!
10-18-2016
03:36 PM
3 Kudos
@vpemawat If you are not using log4j: if you are looking to delete the files for good, there are not many options available other than rm -rf; however, there are a few tweaks you can make to speed it up. You can run multiple rm scripts in parallel (multiple threads). To do this, you should be able to logically separate the log files either by folder or by name format. Once you have done that, you can run multiple rm commands in the background, something like:

nohup rm -fr app1-2016* > /tmp/nohup.out 2>&1 &
nohup rm -fr app1-2015* > /tmp/nohup.out 2>&1 &

If you are using log4j: you should probably be using 'DailyRollingFileAppender' with 'maxBackupIndex'; this essentially caps how many rolled-over log files are kept and purges the older ones. More details here: http://www.codeproject.com/Articles/81462/DailyRollingFileAppender-with-maxBackupIndex

Outside of this, you should consider the below two things for future use cases:
- Organize the logs by folder (normally broken down like /logs/appname/yyyy/mm/dd/hh/<log files>)
- Have a mechanism that either deletes old log files or archives them to a different log archive server

Hopefully this helps. If it does, please 'accept' and 'upvote' the answer. Thank you!!
10-09-2016
10:16 PM
1 Kudo
You are welcome. Glad it worked.
10-09-2016
10:11 PM
2 Kudos
Go to Ambari >> Kafka >> Configs and look for the port the Kafka broker listens on. If it is HDP sandbox 2.4, it will most probably be 6667, and therefore you should run the command below instead:

./kafka-console-producer.sh --broker-list sandbox.hortonworks.com:6667 --topic test1

Let me know if this works. If not, post the exact exception and we can look deeper. If this answer helps you, please don't forget to upvote / accept it.
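If you prefer to check from the shell instead of Ambari, something like the line below should show the configured listener port; the /usr/hdp/current/kafka-broker path is the usual HDP layout, but treat it as an assumption for your sandbox.

# Look for the broker port / listeners entries in the Kafka broker config.
grep -E '^(port|listeners)' /usr/hdp/current/kafka-broker/config/server.properties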
09-26-2016
05:48 AM
3 Kudos
@Bala Vignesh N V Unfortunately, you cannot run multiple insert commands on the same destination table at the same time (technically you can, but the jobs will get executed one after the other). However, if you are using external files, you can achieve parallelism by writing multiple files into your destination folder and creating a Hive external table on top of that folder. It will look something like this:

CREATE EXTERNAL TABLE page_view(viewTime INT, userid BIGINT,
  page_url STRING, referrer_url STRING,
  ip STRING COMMENT 'IP Address of the User',
  country STRING COMMENT 'country of origination')
LOCATION '/logs/mywebapp/'

where '/logs/mywebapp/' is your HDFS directory, into which you write multiple files (one for each of your parallel jobs); a small sketch of the parallel writes follows below. ** If this answers your question, please don't forget to upvote and Accept the answer **
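To make the parallel-write idea concrete, here is a minimal sketch; the part-file names are made up, and each of your jobs would produce its own file.

# Each parallel job drops its own output file into the external table's location;
# Hive reads all files under /logs/mywebapp/ at query time.
hadoop fs -put pageviews_job1.tsv /logs/mywebapp/
hadoop fs -put pageviews_job2.tsv /logs/mywebapp/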
09-22-2016
02:54 AM
@Girish Chaudhari What happened right after you executed the ALTER TABLE command? Did you get any errors? I am assuming you tried describe extended <table_name> to determine the location it is referring to?
09-21-2016
03:17 AM
Thanks @Randy Gelhausen
09-21-2016
03:17 AM
Thanks @ajaysingh
09-21-2016
02:51 AM
2 Kudos
I know Syncsort is a possible solution here, but I wanted to check whether HDF can do the job, and whether we have any other recommendation besides Syncsort.
09-01-2016
02:00 AM
1 Kudo
Just an update: the 'SelectHiveQL' processor has been added as part of NiFi 0.7.
08-30-2016
12:19 PM
3 Kudos
Before I answer the question specifically, let me address (based on my research) the fault tolerance of each of the components within Storm:

1) Nimbus - A stateless daemon which sits on the master node, deploys the job (topology), and keeps track of it. There are two scenarios. First, if Nimbus goes down after you submit the topology, it will not have any adverse effect on the current topology, as the topology runs on the worker nodes (and not on the master node). If you have kept this process under supervision it will restart, and since Nimbus is fail-fast, when it comes back it will retrieve the meta information for all active topologies on the cluster from Zookeeper and start tracking them again. Second, if Nimbus goes down before you submit the topology, you will simply have to restart it.

2) Supervisor - This daemon is responsible for keeping track of the worker processes (JVM processes) on the node it sits on and coordinating state with Nimbus through Zookeeper. If this daemon goes down, your worker processes will not be affected and will keep running unless they crash; once it comes back (thanks to supervisord or monit) it will collect the state from Zookeeper and resume tracking the worker processes. If a timeout occurs, Nimbus will reschedule the topology on a different worker node.

3) Worker processes (JVM processes) - These container processes actually execute your topology's components (spouts + bolts). If one goes down, the Supervisor will simply restart it on a different port (on the same worker node), and if it runs out of ports it will notify Nimbus, which will then reschedule the process on a different worker node.

4) Worker node (Supervisor + worker processes) - In this scenario Nimbus stops receiving heartbeats (due to timeout) from the worker node and simply reassigns the work to different worker node(s) in the cluster.

5) Zookeeper (ZK) - From all the above you might have inferred that all the state gets stored in ZK. So what if it goes down; can it go down? ZK is not a single-node process either: it has its own cluster and the state stored in ZK is constantly replicated, so even if a single ZK node goes down, a new leader will be elected and will keep communicating with Apache Storm.

Now, going back to the specific question: when a supervisor with 4 slots (ports) goes down, the very first thing Nimbus will try to do is restart the processes on the SAME worker node on the available ports. The processes for which it does not have a port will be reassigned to different worker nodes, so yes, it will increase the executor threads on those worker nodes. And from a design perspective, you should not necessarily have to account for redundant ports, as Nimbus is designed to take care of this by either restarting the processes on that port or by redistributing them among other worker nodes.
08-30-2016
12:14 PM
Thanks @Rajkumar Singh
08-30-2016
12:07 AM
2 Kudos
Say there were 3 extra slots and a supervisor with 4 slots (supervisor.slots.ports) goes down; what happens? Does Storm automatically increase the number of executor threads in the worker processes on other supervisors?
Labels:
- Apache Storm
08-20-2016
05:55 PM
3 Kudos
@gkeys The MergeContent processor has 2 properties that I normally use to control the output file size:
- Minimum Number of Entries
- Minimum Group Size

For your question on how to increase the file size to reach a desired size (say 1 GB): set the Minimum Group Size to the size you would like (i.e., 1 GB) AND set the Minimum Number of Entries to 1. This will merge the content up to 1 GB before it writes out to the next processor.

Can you clarify a little more your other question about doubling the size of the existing setting? If you mean doubling the size of the incoming file, that is straightforward: just set the Minimum Number of Entries to 2 and the Minimum Group Size to 0 B.
08-06-2016
09:56 PM
1 Kudo
@Iyappan Gopalakrishnan Download the nifi-0.7.0-bin.zip file from the downloads page: https://nifi.apache.org/download.html When you unzip the file, you will see the standard NiFi folder structure (bin, conf, lib, etc.). Then, based on your OS, you can use either 'bin/run-nifi.bat' for Windows or 'bin/nifi.sh start' for Mac/Linux. More details on how to start NiFi are here: https://nifi.apache.org/docs/nifi-docs/html/getting-started.html#starting-nifi You can tail the logs from logs/nifi-app.log (to see if it starts properly). OPTIONAL: By default, NiFi starts on port 8080; if you see a port conflict or want to start it on a different port, you can change that by editing the file 'conf/nifi.properties', searching for 8080, and updating the port number. If you like the answer, please make sure to upvote or accept it.
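Putting the steps above together, here is a minimal sketch for Mac/Linux; it assumes the zip was downloaded into the current directory and unpacks into a nifi-0.7.0 folder.

# Unpack, start NiFi, and watch the application log until the UI comes up (default port 8080).
unzip nifi-0.7.0-bin.zip
cd nifi-0.7.0
./bin/nifi.sh start
tail -f logs/nifi-app.log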