Member since: 09-23-2015
Posts: 800
Kudos Received: 898
Solutions: 185
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 5165 | 08-12-2016 01:02 PM
 | 2145 | 08-08-2016 10:00 AM
 | 2517 | 08-03-2016 04:44 PM
 | 5336 | 08-03-2016 02:53 PM
 | 1367 | 08-01-2016 02:38 PM
07-27-2016
11:16 AM
The cluster is fairly small as it's mostly experimental, but I have 3 out of the 4 nodes in the cluster that each have 4 vCores and 1GB of memory, with a global YARN minimum memory container size of 256MB. So when you say slots, I'm assuming that would translate into 12 slots/containers potentially? i.e. a container representing 1 vCore + 256MB. I had assumed that the resources (CPU/RAM) available in my cluster would be more than enough for the query I'm running on the dataset sizes I'm working with, i.e. 30-40k records?
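For a rough check of that assumption, here is a minimal back-of-the-envelope sketch. The node counts and sizes come from the post above; the assumption that all of each node's 1GB is available to YARN and that memory (not vCores) is the limiting resource is mine:

```python
# Rough container-count estimate, assuming memory is the limiting resource
# and all of each node's RAM is available to YARN.
nodes = 3
vcores_per_node = 4
memory_per_node_mb = 1024          # 1 GB per node
min_container_mb = 256             # yarn.scheduler.minimum-allocation-mb

containers_by_memory = nodes * (memory_per_node_mb // min_container_mb)
containers_by_vcores = nodes * vcores_per_node

# The scheduler can only hand out the smaller of the two limits.
print(min(containers_by_memory, containers_by_vcores))  # -> 12
```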
07-11-2016
04:28 PM
1 Kudo
I think the majority of people do not use SSH fencing at all. The reason for this is that NameNode HA works fine without it. The only issue can be that during a network partition, old connections to the old standby might still exist and return stale data during read-only operations.
- They cannot do any write transactions, since the JournalNode majority prohibits that.
- Normally, if ZKFC works correctly, an active NameNode will not go into zombie mode; it is either dead or not.
So the chances of a split brain are low and the impact is pretty limited. If you use SSH fencing, the important part is that your script cannot block, otherwise the failover will be stopped; you need all scripts to return in a sensible amount of time even if the node is not reachable. Fencing is by definition always an attempt, since most of the time the node is simply down, and the scripts need to return success in the end. So you need a fork with a timeout and then return true (see the sketch below). https://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html#Verifying_automatic_failover
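A minimal sketch of such a wrapper, assuming a hypothetical fencing command and host (neither is from the original post); the only point it illustrates is that the script is bounded by a timeout and always exits 0 so the failover can proceed:

```python
#!/usr/bin/env python
# Hypothetical fencing wrapper: runs the real fencing command in a child
# process with a hard timeout, and reports success no matter what, so the
# ZKFC failover is never blocked by an unreachable node.
import subprocess
import sys

FENCE_COMMAND = ["ssh", "old-active-host", "stop-namenode.sh"]  # placeholder
TIMEOUT_SECONDS = 30

try:
    subprocess.run(FENCE_COMMAND, timeout=TIMEOUT_SECONDS, check=True)
except (subprocess.TimeoutExpired, subprocess.CalledProcessError, OSError):
    # The target is probably already dead or unreachable; treat the
    # attempt as done rather than blocking the failover.
    pass

sys.exit(0)  # always report success, as described above
```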
07-11-2016
10:31 AM
Regarding how: refer to Sunile. Pig is nice and flexible, Hive is good if you know SQL and your RFID data is already basically in a flat table format, and Spark also works well... But the question is whether you really want to process 100GB of data on the sandbox. The memory settings are tiny, there is a single drive, and data is not replicated... If you do it like this, you can just use Python on a local machine. If you want a decent environment, you might want to set up 3-4 nodes on a VMware server, perhaps with 32GB of RAM each? That would give you a nice little environment and you could actually do some fast processing.
07-06-2016
12:03 PM
You mean to exclude two columns? That one would definitely work: (id1|id2)?+.+ Your version would say id1 once or not at all, followed by id2 once or not at all, followed by anything else. So it should work too, I think.
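A quick way to sanity-check why that pattern skips exactly id1 and id2 is to try it against a few column names. This is only a sketch using Python's re module, which supports possessive quantifiers like ?+ from Python 3.11 onward; Hive itself evaluates the pattern with Java's regex engine:

```python
# (id1|id2)?+.+ matches every column name except the exact names id1 and
# id2: the possessive group consumes the whole name and .+ has nothing
# left, and possessive groups never backtrack. Requires Python 3.11+.
import re

pattern = re.compile(r"(id1|id2)?+.+")

for column in ["id1", "id2", "id10", "name", "id1_old"]:
    print(column, bool(pattern.fullmatch(column)))
# id1 and id2 print False; all other names print True.
```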
07-07-2016
02:28 PM
Thank you for your answers, that really helps. I'm a bit further now. Right now:
A cronned Python script on the NameNode writes the Kafka stream to HDFS every 5 minutes (external table, JSON); a rough sketch of what such a script could look like follows below.
Every hour another script executes an "insert overwrite" that moves the data from the external table to an ORC partitioned and clustered table.
This table should be the BI table for real-time analysis.
My next plan would be to change the first script to directly update/insert into the Hive table, so that I can eliminate the second script.
Thanks for any suggestions.
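For illustration only, a minimal sketch of what the 5-minute script could look like, assuming the kafka-python and hdfs packages; the topic name, hosts, and paths are made up and not from the post:

```python
# Hypothetical version of the 5-minute cron job: drain whatever is
# currently in the Kafka topic and write it as JSON lines under the
# external table's HDFS location. Topic, hosts, and paths are placeholders.
from datetime import datetime

from hdfs import InsecureClient          # pip install hdfs
from kafka import KafkaConsumer          # pip install kafka-python

consumer = KafkaConsumer(
    "events",                            # hypothetical topic
    bootstrap_servers="kafka-host:9092",
    consumer_timeout_ms=10000,           # stop iterating once the topic is drained
    auto_offset_reset="earliest",
    enable_auto_commit=True,
    group_id="hdfs-loader",
)

client = InsecureClient("http://namenode:50070", user="hdfs")
target = "/warehouse/external/events/batch-{}.json".format(
    datetime.now().strftime("%Y%m%d%H%M")
)

# One JSON document per line, matching what a JSON SerDe on the
# external table would expect.
lines = "\n".join(msg.value.decode("utf-8") for msg in consumer)
if lines:
    client.write(target, data=lines, encoding="utf-8")
```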
06-19-2016
03:02 PM
Normally mappers don't fail with OOM, and 8192M is pretty good. I suspect you have some big records while reading from the CSV, or are you doing some memory-intensive operation inside the mapper? Could you please share the task log for this attempt: attempt_1466342436828_0001_m_000008_2
06-16-2016
05:16 PM
That is amazing!
06-14-2016
09:20 AM
Good that you fixed it. I would just read a good Hadoop book and understand the Map/Combiner/Shuffle/Reduce process in detail. After that, the majority of markers should be pretty self-evident. https://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/1449311520/184-6666119-1311365?ie=UTF8&*Version*=1&*entries*=0
06-09-2016
06:45 PM
Rajkumar, have you tried connecting directly with the Hive JDBC driver? I'm suspecting it's a jar conflict somewhere. Here's my Hive driver config in IntelliJ; I obviously took the shotgun approach and added all the client jars, but the main ones required are hive-common and hive-jdbc.
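One way to test the JDBC driver outside the original tool is a tiny standalone connection check. This is only a sketch using the third-party jaydebeapi package; the host, port, credentials, and jar paths are placeholders, not values from the post:

```python
# Hypothetical standalone check of the Hive JDBC driver, to rule out a
# jar conflict in the original tool. Host, port, and jar locations are
# placeholders; hive-jdbc and hive-common are the key jars per the post.
import jaydebeapi  # pip install jaydebeapi (requires a local JVM)

jars = [
    "/opt/hive/lib/hive-jdbc-1.2.1-standalone.jar",  # placeholder path
    "/opt/hive/lib/hive-common-1.2.1.jar",           # placeholder path
]

conn = jaydebeapi.connect(
    "org.apache.hive.jdbc.HiveDriver",
    "jdbc:hive2://hive-server:10000/default",
    ["hive", ""],            # user / password placeholders
    jars,
)
cursor = conn.cursor()
cursor.execute("SELECT 1")
print(cursor.fetchall())
cursor.close()
conn.close()
```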
06-06-2016
08:45 AM
1 Kudo
Thanks Benjamin. Yes, it is a bulk upload.