Member since: 09-23-2015
Posts: 800
Kudos Received: 898
Solutions: 185
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
|  | 7348 | 08-12-2016 01:02 PM |
|  | 2707 | 08-08-2016 10:00 AM |
|  | 3662 | 08-03-2016 04:44 PM |
|  | 7204 | 08-03-2016 02:53 PM |
|  | 1863 | 08-01-2016 02:38 PM |
05-06-2016
07:24 AM
2 Kudos
The good thing about Hadoop is that once you have the data in, and in a format that is correct (i.e. when you query it you can see all your data, the row count is correct, etc.), you can always transform it into any form you want. To transform HDFS data into any Hive table:

a) Create an external table on the data folder, in your case stored as Avro:

CREATE EXTERNAL TABLE AVROTABLE ( columns ) STORED AS AVRO LOCATION '/tmp/myfolder';

https://cwiki.apache.org/confluence/display/Hive/AvroSerDe#AvroSerDe-Hive0.14andlater

b) Check that you can see all the data and that the data is correct.

c) Create your performance tables (ORC, partitioned, ...). Just use a CTAS statement to create a new copy of the table in your new format:

CREATE TABLE X STORED AS ORC AS SELECT * FROM AVROTABLE;

Or create your new table first and then insert into it:

CREATE TABLE ORCTABLE ( columns ) STORED AS ORC ...;
INSERT INTO ORCTABLE SELECT * FROM AVROTABLE;

If you load directly into Hive in the first step, just skip the external table step.

Finally, some tips:
- In Hive use ORC, in Impala Parquet.
- Use partitioning based on your WHERE conditions (date, perhaps).
- In HDP 2.2+ use the default ORC settings (ZLIB compression, ...).
- Sort your data during insert by a column from your WHERE conditions (customer, location, product, ...).
- Forget about indexes; ORC and Parquet have built-in ones that work very similarly without the overhead.
- Make sure you have a recent version of HDP.
- Use statistics (ANALYZE keyword).
- When you run a query, check in the Resource Manager or with hive.tez.exec.print.summary=true that you utilize the cluster nicely and don't have a couple of reducers blocking your query.

Have fun, Hive has gotten pretty awesome recently.

http://www.slideshare.net/BenjaminLeonhardi/hive-loading-data
05-04-2016
07:09 PM
"Should I go with HDP 2.3 or 2.4 ?" I would tend to 2.4 here. Although it depends a bit what you need. An older release of 2.3 may provide some extra stability but a lot of security features ( kerberos for Kafka->Spark Streaming etc. ) and a new spark release are in 2.4 ( and other goodies ). Also upgrading to a point version is normally easier than jumping releases. But again depends on your needs. "Any issue with this configuration. Also do we need the client on all the machines "
For most cases not ( Sqoop action with hive in oozie needs the hive clients on all nodes but that is an exception ) Regarding your node distribution: How many datanodes are you planning? I see 5 different node types so I assume you want 3 master nodes one edge node and 3 datanodes? That may make sense if you plan to grow the cluster later but if you want to get the maximum amount of work done I would rather go for 2 master nodes and perhaps even reuse one of them as edge nodes. And have a decent amount of datanodes. Obviously depends on your server size as well but big modern servers with 12+ cores and 256GB of RAM can host an awful lot of master components at the same time without creating a bottleneck. Others may disagree with me here but I setup a 7 datanode plus 1 master+edge node cluster once ( didn't design it ) and it worked fine as long as you do not expect constant uptime for your cluster ( colocating this many services increases the chance of something going wrong and bringing down the whole cluster because of a server reboot so its nothing you would do for a mission critical system that cannot go down. If you have much smaller servers then you might need more master nodes as well though.
05-03-2016
04:39 PM
Are you sure? I was pretty sure that even big files have their first copy written locally.
05-03-2016
04:39 PM
Normally they are not. Hadoop tries to write the first copy of every block locally if possible (this has a lot of nice characteristics when the same nodes that write the files also read them, for example HBase region servers). So the general rule when one client writes a big file with 10 blocks is:

- HDFS tries to write the first copy of each block to the local datanode, if the uploading client is colocated with a datanode in rack1 (otherwise any node is used).
- While the block is being written, that datanode initiates a copy to another node in a different rack (rack2).
- While that copy is being written, the second datanode initiates a copy to another node in the same rack as itself (rack2).

So you essentially have a copy chain; however, not all three writes need to finish successfully. If at least the first copy is written successfully, the block is considered written and the client will start writing the second block, then the third, and so on. But the first copy of each block will by default end up on the datanode colocated with the client. As said, this has a lot of good results in practice. However, all of that goes out the window when that datanode runs full.
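If you want to verify where the replicas of a file actually landed on your own cluster, the HDFS Java API can tell you. A minimal sketch (the file path is just a placeholder; the program assumes core-site.xml/hdfs-site.xml are on the classpath):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();            // picks up cluster config from the classpath
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/tmp/myfolder/bigfile.avro");   // hypothetical file, replace with your own
        FileStatus status = fs.getFileStatus(file);
        // One BlockLocation per block, listing every datanode that holds a replica of it
        for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    loc.getOffset(), loc.getLength(), String.join(",", loc.getHosts()));
        }
        fs.close();
    }
}
```

If the writing client was colocated with a datanode, that host should show up in the replica list of every block.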
05-02-2016
01:51 PM
I don't think there is any maximum number, most likely some theoretical int limit that you wouldn't reach in any cluster. Or do you mean at the same time? In that case it would be roughly (YARN memory per node * number of nodes) / yarn.scheduler.minimum-allocation-mb. So if you have 100 nodes, your nodes have 96 GB of YARN memory and your minimum allocation size (and your map size) is 2 GB, it would be 4800.
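As a back-of-the-envelope check of that formula (the numbers are taken from the example above, the variable names are just illustrative):

```java
public class ContainerCeiling {
    public static void main(String[] args) {
        int nodes = 100;                  // datanodes in the cluster
        int yarnMemPerNodeMb = 96 * 1024; // yarn.nodemanager.resource.memory-mb per node
        int minAllocationMb = 2 * 1024;   // yarn.scheduler.minimum-allocation-mb (= map size here)
        long maxConcurrentContainers = (long) nodes * yarnMemPerNodeMb / minAllocationMb;
        System.out.println(maxConcurrentContainers); // prints 4800
    }
}
```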
05-02-2016
08:45 AM
1) You can enable some parameters like huge file support in Linux to make it faster, but nothing near 128 MB; you just don't get much benefit there. If you want to know how to tune the filesystem you can refer to our reference architecture, Appendix B: https://hortonworks.com/wp-content/uploads/2013/10/4AA5-9017ENW.pdf
- Mount filesystems with nodiratime, noatime
- Change block readahead to 8k
Some filesystems can also configure huge file support.

2) Windows: basically yes. NTFS has something they call a cluster, but it seems to be very similar to a block: https://support.microsoft.com/en-gb/kb/140365
04-29-2016
09:33 PM
As you say, this could be anything. In general a shell action in Oozie runs like any other shell script; however, you need to upload all the files you need using <file> tags, since the action can be executed on any datanode, in a YARN tmp folder. If you want to interact with a kerberized cluster you need to run a kinit in the shell command (you also need to upload the keytab to the temp folder of the shell action using a <file> tag). But I think first you need to find out what the problem actually is: look at the logs of the map task executing the shell action; you will not find this in the Oozie logs. Also, it doesn't sound like you have tested your shell script locally, so that would be the very first step, i.e. make sure that your script runs successfully in a local environment, including curl and all. Good luck.
04-29-2016
06:39 PM
5 Kudos
I think that question was well answered in this thread: https://community.hortonworks.com/questions/27567/write-performance-in-hdfs.html#answer-27633

If I can quote myself, here is the explanation of why the Hadoop block size is much larger than a local filesystem block size.

"2. As HDFS is a virtual filesystem, the data it stores will ultimately be stored on the underlying operating system (in most cases, Linux). Assuming Linux has an ext3 filesystem (whose block size is 4 KB), how does having a 64 MB/128 MB block size in HDFS help? Will the 64 MB block in HDFS be split into x * 4 KB blocks of the underlying operating system?"

Again, a small additional comment to what Kuldeep said. A block is just a Linux file, so yes, all the underlying details of ext3 or whatever still apply. The reason blocks are so big is not the storage on the local node but the central filesystem metadata:

- To have a distributed filesystem you need a central FS image that keeps track of ALL the blocks on ALL the servers in the system. In HDFS this is the NameNode. It keeps track of all the files in HDFS and all the blocks on all the datanodes. This needs to be in memory to support the high number of concurrent tasks and operations happening in a Hadoop cluster, so the traditional ways of setting this up (B-trees) don't really work. It also needs to cover all nodes and all discs, so it has to keep track of thousands of nodes with tens of thousands of drives. Doing that with 4 KB blocks wouldn't scale (see the rough numbers below), so the block size is around 128 MB on most systems. (You can count roughly 1 GB of NameNode memory per 100 TB of data.)
- If you want to process a huge file in HDFS you need to run a parallel task on it (MapReduce, Tez, Spark, ...). In this case each task gets one block of data and reads it. It might be local or not. Reading a big 128 MB block, or sending it over the network, is efficient; doing the same with roughly 30,000 4 KB files would be very inefficient. That is the reason for the block size.
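To put rough numbers on the scaling argument, a back-of-the-envelope sketch (1 PB of raw data; this only counts block entries, not exact NameNode accounting):

```java
public class BlockCountSketch {
    public static void main(String[] args) {
        long dataBytes = 1024L * 1024 * 1024 * 1024 * 1024;  // 1 PB of raw data
        long hdfsBlock = 128L * 1024 * 1024;                  // 128 MB HDFS block
        long ext3Block = 4L * 1024;                           // 4 KB local filesystem block
        System.out.println("128 MB blocks to track: " + dataBytes / hdfsBlock); // ~8.4 million
        System.out.println("4 KB blocks to track:   " + dataBytes / ext3Block); // ~275 billion
    }
}
```

Keeping ~8 million block entries in NameNode memory is easy; keeping hundreds of billions is not, which is the whole point of the large block size.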
04-29-2016
06:20 PM
1 Kudo
There is actually a specific parameter for offset retention, offsets.retention.minutes, which by default is 1440, i.e. one day. So it should actually be deleted sooner. http://kafka.apache.org/documentation.html
04-29-2016
05:22 PM
It's a good question. Assuming the source of entropy is good, the chances of a duplicate are essentially zero (randomUUID has 122 random bits, i.e. about 5 * 10^36 possible values). There are other ways too; I assume there are some ready-made solutions out there, but how about some old-fashioned MapReduce? Just one way: assuming you create all the keys in one go and you have the data stored in a delimited format, you could create a unique key based on the long offset provided for each line by TextInputFormat. TextInputFormat provides lines of text together with a long offset (bytes from the start, using the split offsets), so you could just add this to a starting number (for example a batch id that is steadily increased) and create a unique number that way. There are definitely other ways to do it too, for example going through a MapReduce job id + task id + row-in-split id. A minimal sketch of the offset idea is below.
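Here is a hypothetical mapper sketching the offset idea (the class name and the unique.key.batch.base property are made up for illustration; note that offsets are per input file, so with several files per batch you would also need a per-file or per-task component, as mentioned above):

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Prefixes every input line with a numeric key built from a per-batch base
// plus the byte offset of the line within its file (supplied by TextInputFormat).
public class UniqueKeyMapper extends Mapper<LongWritable, Text, Text, Text> {

    private long batchBase;                 // steadily increasing base, set per load
    private final Text outKey = new Text();

    @Override
    protected void setup(Context context) {
        // "unique.key.batch.base" is an assumed job property you would set per batch
        batchBase = context.getConfiguration().getLong("unique.key.batch.base", 0L);
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // offset.get() is the byte offset of this line in the input file
        outKey.set(Long.toString(batchBase + offset.get()));
        context.write(outKey, line);
    }
}
```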