Member since
09-23-2015
800
Posts
898
Kudos Received
185
Solutions
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 5383 | 08-12-2016 01:02 PM |
| | 2195 | 08-08-2016 10:00 AM |
| | 2591 | 08-03-2016 04:44 PM |
| | 5470 | 08-03-2016 02:53 PM |
| | 1414 | 08-01-2016 02:38 PM |
07-27-2016
11:16 AM
The cluster is fairly small as it's mostly experimental, but 3 of the 4 nodes in the cluster each have 4 vCores and 1GB of memory, with a global YARN minimum container memory size of 256MB. So when you say slots, I'm assuming that would translate into potentially 12 slots/containers, i.e. a container representing 1 vCore + 256MB? I had assumed that the resources (CPU/RAM) available in my cluster would be more than enough for the query I'm running on the dataset sizes I'm working with, i.e. 30-40k records.
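The slot estimate can be checked with a quick back-of-the-envelope calculation. This is only a sketch: it assumes the full 1GB per node is handed to YARN (in practice the NodeManager usually gets less than the physical memory) and that every container is exactly 1 vCore + 256MB. The function name is made up for illustration.

```python
def max_containers(nodes, vcores_per_node, mem_mb_per_node, min_container_mb):
    """Upper bound on simultaneously running minimum-size containers."""
    # How many minimum-size containers fit into one node's memory.
    by_memory = mem_mb_per_node // min_container_mb
    # One vCore per container, so the vCore count is the other limit.
    by_vcores = vcores_per_node
    # Whichever resource runs out first limits the node.
    per_node = min(by_memory, by_vcores)
    return nodes * per_node


# 3 worker nodes, 4 vCores and 1024MB each, 256MB minimum container size.
print(max_containers(3, 4, 1024, 256))  # 12
```

Note that every YARN application also needs one container for its ApplicationMaster, so fewer than 12 would be left for the actual query tasks.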
07-11-2016
04:28 PM
1 Kudo
I think the majority of people do not use ssh fencing at all. The reason is that NameNode HA works fine without it. The only issue is that during a network partition, old connections to the old standby might still exist and return stale data for read-only operations.
- They cannot do any write transactions, since the JournalNode majority prohibits that.
- Normally, if zkfc works correctly, an active NameNode will not go into zombie mode; it is either dead or not.
So the chances of a split brain are low and the impact is pretty limited.
If you use ssh fencing, the important part is that your script cannot block, otherwise the failover will be stopped. All scripts need to return in a sensible amount of time, even if access is not possible. Fencing is by definition always an attempt, since most of the time the node is simply down. And the scripts need to return success in the end. So you need a fork with a timeout and then return true.
https://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html#Verifying_automatic_failover
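The "fork with a timeout and then return true" idea could look roughly like the sketch below. It is not an official Hadoop script: the function name and the default timeout are assumptions, and the actual fencing command (for example an ssh call that kills the old NameNode process) is left to the caller.

```python
import subprocess


def attempt_fence(cmd, timeout_s=10.0):
    """Run a fencing command without ever blocking longer than timeout_s.

    Always returns 0 ("success"), because fencing is only an attempt:
    most of the time the target node is simply down or unreachable.
    """
    if not cmd:
        return 0  # nothing to run
    try:
        # subprocess.run kills the child process if the timeout expires.
        subprocess.run(
            cmd,
            timeout=timeout_s,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,  # a non-zero exit code is not treated as an error
        )
    except subprocess.TimeoutExpired:
        pass  # node did not answer in time; give up on this attempt
    except OSError:
        pass  # command could not be started at all
    return 0
```

A small wrapper script that calls this function and exits with its return value could then be referenced from a shell(...) entry in dfs.ha.fencing.methods; the wrapper and its path would be your own.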
07-11-2016
10:31 AM
Regarding the how, refer to Sunile's answer. Pig is nice and flexible, Hive is good if you know SQL and your RFID data is already basically in a flat table format, and Spark also works well. But the question is whether you really want to process 100GB of data on the sandbox. The memory settings are tiny, there is a single drive, and the data is not replicated. If you do it like this, you might as well just use Python on a local machine. If you want a decent environment, you might want to set up 3-4 nodes on a VMware server, perhaps with 32GB of RAM each? That would give you a nice little environment and you could actually do some fast processing.
07-06-2016
12:03 PM
You mean to exclude two columns? This one would definitely work: (id1|id2)?+.+ Your version says id1 once or not at all, followed by id2 once or not at all, followed by anything else. So it should work too, I think.
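To see which column names such a pattern selects, here is a small Python sketch. Hive uses Java regular expressions, where ?+ is a possessive quantifier; Python's re module only supports possessive quantifiers from version 3.11 on, so the sketch swaps in a negative lookahead that accepts the same names when the whole column name has to match. The column list is made up for illustration.

```python
import re

# Stand-in for the Hive/Java pattern (id1|id2)?+.+ :
# match any non-empty name unless the name is exactly id1 or id2.
EXCLUDE_IDS = re.compile(r"(?!(?:id1|id2)$).+")


def select_columns(columns):
    """Return the column names the pattern matches in full."""
    return [c for c in columns if EXCLUDE_IDS.fullmatch(c)]


columns = ["id1", "id2", "name", "id1_extra", "value"]  # hypothetical columns
print(select_columns(columns))  # ['name', 'id1_extra', 'value']
```

A column such as id1_extra is still selected, because only the exact names id1 and id2 are excluded.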
07-07-2016
02:28 PM
Thank you for your answers, that really helps. I'm a bit further now. Right now:
A cron-scheduled Python script on the NameNode writes the Kafka stream to HDFS every 5 minutes (external table, JSON).
Every hour, another script executes an "insert overwrite" that moves the data from the external table to an ORC table that is partitioned and clustered.
This table should be the BI table for real-time analysis.
My next plan would be to change the first script to update/insert the Hive table directly, so that I can eliminate the second script.
Thanks for any suggestions.
06-19-2016
03:02 PM
Normally mappers don't fail with OOM, and 8192M is pretty good. I suspect you have some big records while reading from the CSV. Are you doing some memory-intensive operation inside the mapper? Could you please share the task log for this attempt: attempt_1466342436828_0001_m_000008_2
06-16-2016
05:16 PM
That is amazing!
06-14-2016
09:20 AM
Good that you fixed it. I would just read a good Hadoop book and understand the Map-Combiner-Shuffle-Reduce process in detail. After that, the majority of markers should be pretty self-evident. https://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/1449311520/184-6666119-1311365?ie=UTF8&*Version*=1&*entries*=0
06-09-2016
06:45 PM
Rajkumar, have you tried connecting directly with the Hive JDBC driver? I suspect it's a jar conflict somewhere. Here's my Hive driver config in IntelliJ. I obviously took the shotgun approach and added all the client jars, but the main ones required are hive-common and hive-jdbc.
06-06-2016
08:45 AM
1 Kudo
Thanks Benjamin. Yes, it is a bulk upload.