Member since: 09-23-2015
Posts: 800
Kudos Received: 898
Solutions: 185
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 5424 | 08-12-2016 01:02 PM |
| | 2204 | 08-08-2016 10:00 AM |
| | 2613 | 08-03-2016 04:44 PM |
| | 5506 | 08-03-2016 02:53 PM |
| | 1426 | 08-01-2016 02:38 PM |
07-05-2016
02:23 PM
4 Kudos
The Oozie installation makes changes to core-site.xml. Specifically, the oozie user needs to be able to impersonate other users so it can kick off a job as the user who owns the Oozie workflow. HDFS lets you configure these powerful users through the proxyuser settings, namely hadoop.proxyuser.oozie.groups and hadoop.proxyuser.oozie.hosts. To enable that you need to restart HDFS and YARN after changing them.
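A minimal core-site.xml sketch of those two properties might look like the following; the `*` values are permissive placeholders, and in practice you would restrict hosts and groups to what your environment actually needs.

```xml
<!-- core-site.xml: allow the oozie user to impersonate other users -->
<property>
  <name>hadoop.proxyuser.oozie.hosts</name>
  <!-- hosts the Oozie server runs on; "*" is only a placeholder -->
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.oozie.groups</name>
  <!-- groups whose members oozie may impersonate; "*" is only a placeholder -->
  <value>*</value>
</property>
```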
07-05-2016
12:59 PM
1) Normally MapReduce reads and creates very large amounts of data. The framework is also parallel, and failed tasks can be rerun, so until all tasks have finished you are not sure what the output is. You can obviously write a program that returns data to the caller directly, but this is not the norm. Hive, for example, writes files to a tmp dir and then HiveServer uses the HDFS client to read the results. In Pig you have the option to store (save in HDFS) or dump (show on screen) data, but I am not sure whether Pig also uses a tmp file here. In MapReduce you can do whatever you want.

2) MapReduce is used when you want to run computations in parallel on the cluster, so Pig/Hive utilize it. But you can also just read the data directly using the client; in that case you have a single-threaded read.
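As an illustration of the "read the results back with the HDFS client" pattern, here is a hedged sketch; the output directory path is made up, and the code simply streams the part-* files of a finished job back to the caller.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadJobOutput {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();          // picks up cluster config
        FileSystem fs = FileSystem.get(conf);
        Path outputDir = new Path("/tmp/job-output");      // hypothetical job output dir

        for (FileStatus status : fs.listStatus(outputDir)) {
            // Each mapper/reducer wrote its own part-* file.
            if (!status.getPath().getName().startsWith("part-")) continue;
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(fs.open(status.getPath())))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line);               // hand results to the caller
                }
            }
        }
        fs.close();
    }
}
```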
07-05-2016
11:16 AM
1 Kudo
Not sure what you mean by chunk. Essentially a stream of data is piped into the HDFS write API. Every 128 MB a new block is created internally. Inside each block the client buffers data and sends it whenever a network packet is full (64 KB or so). So essentially a 1 GB file is written into the HDFS API like this:

- Block1 is created on (ideally local) node1, with copies on node2 and node3.
- Data is streamed into it in 64 KB packets from the client to node1; whenever the datanode receives a 64 KB packet it writes it to disk into the block, tells the client the write was successful, and at the same time forwards a copy to node2.
- node2 writes the packet to its replica of the block and forwards the data to node3.
- node3 writes the packet to its block on disk.
- The next 64 KB packet is sent from the client to node1, and so on.
- Once 128 MB is full, the next block is created.

The write is successful once the client has received notification from node1 that it successfully wrote the last block. If node1 dies during the write, the client will rewrite the blocks on a different node.
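To make the numbers concrete, a small back-of-the-envelope sketch, assuming the default 128 MB block size and the roughly 64 KB packet size mentioned above:

```java
public class HdfsWriteMath {
    public static void main(String[] args) {
        long fileSize   = 1L * 1024 * 1024 * 1024; // 1 GB file written by the client
        long blockSize  = 128L * 1024 * 1024;      // a new block every 128 MB
        long packetSize = 64L * 1024;              // data streamed in ~64 KB packets

        long blocks = fileSize / blockSize;            // 8 blocks for the 1 GB file
        long packetsPerBlock = blockSize / packetSize; // 2048 packets per block

        System.out.println("blocks: " + blocks);
        System.out.println("packets per block: " + packetsPerBlock);
        // Each packet is written by node1 and forwarded down the
        // node1 -> node2 -> node3 chain as described above.
    }
}
```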
07-05-2016
10:45 AM
7 Kudos
1. Where does this splitting of the huge file take place?

A client is a (mostly) Java program using the HDFS FileSystem API to write a file to HDFS. This can be the hadoop command line client or a program running in the cluster (like MapReduce, ...). In that case each mapper/reducer that writes to HDFS writes one file (you may have seen MapReduce output folders that contain part-0000, part-0001 files; these are the files written by each mapper/reducer, and MapReduce treats such a folder as if it were one big file). If you write a file with a client (let's say one TB), the file is written into the API and transparently chunked into 128 MB blocks by the API.

2. Does the client form 3 pipelines for each block to replicate, which run in parallel?

No, it is a chain. The block is committed once it is persisted on the first node, but it is written to the other two nodes in parallel, down a chain: Client -> Node1 -> Node2 (different rack) -> Node3 (same rack as Node2).

3. Will DN1, which received B1, start sending the data to DN2 before 128 MB of its block is full?

Yes. The HDFS API writes in buffered packets, I think 64 KB or so, so every buffered packet is written through at the same time.

"Doesn't that contradict the replication principle where we will get the complete block of data and then start replicating?"

Never heard of that replication principle, and it is definitely not true in HDFS. A file doesn't even need three copies to be written successfully; a put operation is successful once the data is persisted on ONE node. The namenode makes sure that the correct replication level is reached eventually.

"Can you also provide the possible reasons why the flow is not the other way around?"

Because it is faster. If you had to wait for three copies sequentially, a put would take much longer to succeed. A lot in HDFS is about efficiency.
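A minimal sketch of "a client writing a file into the HDFS FileSystem API"; the target path and payload are placeholders, and the chunking into 128 MB blocks plus the replication pipeline happen transparently behind the output stream.

```java
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteClient {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path target = new Path("/tmp/demo.txt");    // hypothetical target path

        try (FSDataOutputStream out = fs.create(target, true)) {
            // The client just writes a stream of bytes; the HDFS client library
            // splits it into blocks and streams 64 KB packets down the pipeline.
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }
        fs.close();
    }
}
```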
07-04-2016
11:42 AM
1 Kudo
OK, for an OLTP requirement and user lookups you are looking at HBase/Phoenix in the HDP distribution (other possibilities would be Cassandra, Gemfire, ...). HBase is a NoSQL datastore that offers a simple get/put/scan API on flat tables with a unique primary key and an arbitrary number of fields. Phoenix is a SQL layer on top of HBase, so it is very fast for key lookups/inserts of single rows and can also do aggregations over thousands to millions of rows very efficiently. Both scale up very well since you can add RegionServers dynamically. Phoenix can maintain secondary indexes as well, which might be helpful in your scenario. (You can obviously maintain that directly in HBase by adding a translation table and maintaining it yourself.)
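A hedged sketch of the key-lookup pattern through Phoenix's JDBC driver; the ZooKeeper quorum, table, and column names are made up for illustration, and the Phoenix client jar is assumed to be on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class PhoenixLookup {
    public static void main(String[] args) throws Exception {
        // The Phoenix JDBC URL points at the HBase ZooKeeper quorum.
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk1,zk2,zk3");
             PreparedStatement ps = conn.prepareStatement(
                     "SELECT name, email FROM users WHERE user_id = ?")) {
            ps.setString(1, "u-12345");   // primary-key lookup -> single-row get in HBase
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getString("name") + " " + rs.getString("email"));
                }
            }
        }
    }
}
```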
07-01-2016
01:06 PM
38 digits is the maximum total precision for Decimal, and that precision includes the digits after the decimal point. So you cannot get 38 digits in front of the decimal point plus 2 behind it; you could do Decimal(36,2) or Decimal(10,10) or whatever, as long as the total stays within 38.
07-01-2016
11:17 AM
5 Kudos
The reason is that in a distributed transactional system with Paxos or a similar algorithm you need a quorum (majority). Essentially a transaction is committed once more than 50% of the nodes say it is committed. You could run 4 journalnodes / ZooKeeper nodes as well, but you would get no benefit over 3 nodes while adding extra overhead: 4 nodes can still only survive 1 failed node, because 3 journalnodes are a majority but 2 are not. Therefore you want an odd number. 3 nodes can survive 1 failure, 5 nodes can survive 2 failures, 7 nodes can survive 3 failures, and so on. https://en.wikipedia.org/wiki/Paxos_(computer_science)
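The failure-tolerance arithmetic in a tiny sketch: with N nodes the quorum is floor(N/2)+1, so the ensemble survives floor((N-1)/2) failures, which is why an even N buys nothing over N-1.

```java
public class QuorumMath {
    public static void main(String[] args) {
        for (int nodes : new int[]{3, 4, 5, 7}) {
            int quorum = nodes / 2 + 1;       // majority needed to commit
            int tolerated = (nodes - 1) / 2;  // failures the ensemble can survive
            System.out.println(nodes + " nodes -> quorum " + quorum
                    + ", tolerates " + tolerated + " failure(s)");
        }
    }
}
```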
06-30-2016
05:17 PM
2 Kudos
You should be able to simply cache your access objects as a class variable and only create them if they have not been created already. You just need to be a bit careful, since the functions are sometimes called with empty conf objects first. That's what the JDBC storage handler does.
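A minimal sketch of that caching pattern, assuming a hypothetical access object and configuration key; the names stand in for whatever your storage handler actually creates.

```java
import org.apache.hadoop.conf.Configuration;

public abstract class CachingHandler {
    // Cached across calls so the (expensive) access object is built only once.
    private static volatile Object accessObject;

    protected Object getAccessObject(Configuration conf) {
        // The framework sometimes calls in with an empty conf first,
        // so only build the object once the needed settings are present.
        if (conf == null || conf.get("my.connection.url") == null) {  // hypothetical key
            return null;
        }
        if (accessObject == null) {
            synchronized (CachingHandler.class) {
                if (accessObject == null) {
                    accessObject = createAccessObject(conf);
                }
            }
        }
        return accessObject;
    }

    // Build the real client/connection from the configuration.
    protected abstract Object createAccessObject(Configuration conf);
}
```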
06-30-2016
02:42 PM
2 Kudos
This is something you should discuss with a sales person, especially licence costs: http://hortonworks.com/contact-us/ The big advantage of HDP is that it is 100% open source and always ships the full Apache version of all components. The components are also in open Hadoop, so there is no lock-in like you have with the proprietary components of MapR (the MapR file system and their other proprietary components). The second big advantage of Hortonworks is that we have the widest range of committers across the different components, so you will get the best support and the most influence in adding features. Doing a simple feature-by-feature comparison doesn't really capture these fundamental differences.
06-30-2016
10:45 AM
2 Kudos
As said, it is a question of priorities. Connectors normally go from a community project into the actual product. Once they are in, they have to be tested and supported and often need changes like Kerberos support as well, so including something is not zero effort. It is just lower on the priority list than the Kafka connection, which is the standard approach for ingesting realtime data into a big data environment. If you need the integration, you could post an idea on the community board; our product management teams read them. Regarding the machine learning inquiry, please open a new question. That would allow people who have tried this before to see it more easily. Sorry for not being of more help 🙂