Member since: 09-23-2015
Posts: 800
Kudos Received: 898
Solutions: 185
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 5424 | 08-12-2016 01:02 PM |
| | 2204 | 08-08-2016 10:00 AM |
| | 2613 | 08-03-2016 04:44 PM |
| | 5506 | 08-03-2016 02:53 PM |
| | 1426 | 08-01-2016 02:38 PM |
06-30-2016
10:36 AM
There is a relationship, yes. Normally MapReduce will create one map task for every block (unless small-split merging is switched on), and one map task runs in one container. So halving the block size means roughly twice the number of containers running. (Again, not always true, since Pig/Tez merge small blocks together using something called the CombineFileInputFormat.)
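As a rough, hypothetical illustration of that relationship (the file size, block sizes and path below are made up, and plain FileInputFormat splitting with no split merging is assumed):

```bash
# Hypothetical numbers: a 1 GB file with 128 MB blocks -> 8 blocks -> ~8 map tasks -> ~8 containers;
# the same file with 64 MB blocks -> 16 blocks -> ~16 containers (absent split merging).
# You can check how many blocks a file actually occupies with fsck:
hdfs fsck /data/myfile.csv -files -blocks | grep -i "Total blocks"
```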
06-30-2016
10:32 AM
4 Kudos
There is a Spark Streaming connector available, but if it's not in the installation then it's not supported (yet): https://blogs.apache.org/nifi/entry/stream_processing_nifi_and_spark In the end it's a question of priorities. Most of the time you would go NiFi -> Kafka -> Storm/Spark anyway, to have a proper, scalable big data buffer in Kafka.
06-29-2016
01:41 PM
3 Kudos
With "text" you mean delimited files right? You can convert them in hive using a CTAS statement for example. ( or in pig reading with PigStorage and writing with any of the other Storage classes ) etc. like CREATE TABLE X ... ROW FORMAT DELIMITED FIELDS TERMINATED BY ...; CREATE TABLE ORCX STORED AS ORC AS SELECT * FROM X; Regarding which file formats are best: Delimited files: Good for import/export, you can often leave the input data unchanged which is desirable in a system of records, often no conversion needed. Sequence File: Binary format, not readable, but faster to read write. Native format of Hadoop. With the arrival of ORC files a bit out of vogure Optimized column storage: (use ORC in HDP, Parquet in Cloudera but they are very similar): Optimized column storage format. 10-100x faster for queries. Definitely the way to go to store your Hive data for optimal performance. Including compression ( 10x for zip ), predicate pushdown ( skipping blocks based on where conditions ), column storage ( only the needed columns are read ) ... Avro : The way to go if you have XML/Json files and changing schemas. You can add columns to the formats above but its hard. Avro supports schema evolution and the integration into hive allows you to change the hive table schema based on new changed underlying data. If your input is XML/Json data this can be a very good data format. Because unlike Json/XML its binary and fast while still keeping the schema.
06-29-2016
12:58 PM
3 Kudos
So the data is already in a Kafka topic? Then you have a whole flower arrangement of possibilities to stream the data into HDFS/Hive. One question is whether you want your Hive tables to be ORC or whether they can be delimited.

a) Directly streaming into Hive tables using Hive ACID: http://henning.kropponline.de/2015/01/24/hive-streaming-with-storm/ I don't like this approach too much since Hive ACID is still very new. However, it has been out for a while and may be worth a shot. It would create ORC files directly.

b) Stream data into HDFS using Storm (HDFSBolt), then use a rotator to move data into a Hive table partition: http://hortonworks.com/hadoop-tutorial/processing-streaming-data-near-real-time-apache-storm/ You can also schedule an Oozie job every 15 min/1 h to create ORC files. Normally that cadence is good enough for batch queries, and you can run any realtime queries in Storm directly.

c) Spark Streaming: similar to Storm, you can run realtime queries directly in Spark Streaming (you can even use Spark SQL if you like SQL). You can then write into a Hive table; you just need to make sure you don't create files that are too small. So if your writes to the Hive table are supposed to happen within a very short timeframe you will run into issues, but as said, normally this is not needed since you can run your realtime queries directly in Spark. If you need a SQL interface that allows you to insert and query data in seconds and is pretty stable, you could look at Phoenix: https://community.hortonworks.com/articles/25726/spark-streaming-explained-kafka-to-phoenix.html

d) Tons of frameworks that move Kafka data into HDFS directly, like Camus or Gobblin: https://github.com/linkedin/camus

If you ask me? I would most likely go on the safe side and use Storm or Spark Streaming to write to HDFS folders, and then have a background task (Oozie/Falcon) that creates ORC partitions in the background; a sketch of that step follows below. This way your main (ORC) Hive table is nice, fast and optimized, you can create an external table on the delimited intermediate results (and UNION them with the ORC table), and you can run any truly realtime queries in Storm/Spark Streaming. If you want to query data with SQL in realtime and you don't aggregate more than a couple million rows at a time, then Phoenix would be better than Hive. Once Hive ACID is more stable, that will be the way to go, though.
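A minimal sketch of that background conversion step (table names, the partition column and the paths are hypothetical; it assumes the streaming job drops delimited files into hourly directories, that staging_events is an external delimited table partitioned by dt, and that events_orc is an ORC table with matching columns):

```bash
# Run periodically (e.g. from Oozie): register the latest delimited drop as a staging
# partition, then fold it into the optimized ORC table.
hive -e "
ALTER TABLE staging_events ADD IF NOT EXISTS PARTITION (dt='2016-06-29-12')
LOCATION '/data/streaming/events/2016-06-29-12';

INSERT INTO TABLE events_orc PARTITION (dt='2016-06-29-12')
SELECT id, payload, event_time FROM staging_events WHERE dt='2016-06-29-12';
"
```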
06-29-2016
12:36 PM
2 Kudos
You could create a hashed column for that key and choose the hash algorithm in a way that makes a hash collision very unlikely. However, I don't completely get the use case. Couldn't you just create a new view that already joins the two tables together and only give access to the resulting columns?

Table A (customers web site a): Name, Address, CreditCard
Table B (customers web site b): Name, Address, CreditCard

CREATE VIEW C AS
SELECT a.Name AS NameA, a.Address AS AddressA, b.Name AS NameB, b.Address AS AddressB
FROM A a JOIN B b ON a.CreditCard = b.CreditCard;

And only give access to that view. I know this doesn't give the same flexibility, but you do not need to do the whole hash thing. If more flexibility is desired, then your proposed approach of adding a masked column to both tables would be the way to go; see the sketch below. I would think something like sha2() or the aes_encrypt() function should provide a way to be very secure in avoiding hash collisions.
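A hedged sketch of that masked-column approach (table and column names are hypothetical, and it assumes a Hive version that ships the sha2() UDF):

```bash
# Add a hashed join key to each table so the raw credit card number never needs to be exposed.
hive -e "
CREATE TABLE a_masked AS
SELECT Name, Address, sha2(CreditCard, 256) AS cc_hash FROM A;

CREATE TABLE b_masked AS
SELECT Name, Address, sha2(CreditCard, 256) AS cc_hash FROM B;

-- Joining on the hash gives the same matches without revealing the original key.
SELECT a.Name, b.Name FROM a_masked a JOIN b_masked b ON a.cc_hash = b.cc_hash;
"
```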
06-29-2016
12:22 PM
You can just delete anything ending in a timestamp that is old enough for you, if you want. Other things I have seen are people using "find -mtime" to delete all logs older than x days (a hedged example is below). Or you can configure the log4j settings of your Hadoop components (Ambari -> HDFS -> Advanced hdfs-log4j). Unfortunately the very useful DailyRollingFileAppender currently does not support deleting older files. (It does in a newer version; some Hadoop components may support that parameter.) However, you could change the log appender to the RollingFileAppender, which provides a maxBackupIndex attribute that keeps up to x log files. (Don't use it for Oozie though, since the Oozie admin features depend on the DailyRollingFileAppender.) So as usual, a plethora of options 🙂 http://www.tutorialspoint.com/log4j/log4j_logging_files.htm

Edit: the DailyRollingFileAppender in HDFS seems to be newer and has the following setting commented out in HDP 2.4. You can try just commenting it in and setting it to a number you are comfortable with. The one below would keep 30 days of log files around:

#log4j.appender.DRFA.MaxBackupIndex=30
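For the find-based cleanup, a hedged example (the log directory, file pattern and 30-day retention are assumptions; print first, delete once the list looks right):

```bash
# List rolled-over HDFS logs older than 30 days, then delete them.
find /var/log/hadoop/hdfs -name "hadoop-hdfs-*.log.*" -mtime +30 -print
find /var/log/hadoop/hdfs -name "hadoop-hdfs-*.log.*" -mtime +30 -delete
```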
06-29-2016
12:13 PM
2 Kudos
Depends on your processing. Can you send the query? Phoenix DOES push down local aggregations to the server side, but does the full aggregation in the client (like a Combiner (region server) -> Reducer (client) pattern). One way to fix this can be salting of the keys, to make sure specific keys are only in specific region servers. For example, if you have 100 regions and 100 GROUP BY keys, it will have to copy 100x100 rows to the client and merge them there. However, if you make sure that each key is only present in one region through salting, it would only have to pull up 100x1 keys. But your query would help; a hypothetical salted-table DDL is sketched below.
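For reference, a hypothetical Phoenix DDL showing where salting is specified (the table, columns, bucket count and ZooKeeper quorum are assumptions, not your schema):

```bash
# Phoenix tables are salted at creation time via the SALT_BUCKETS table option.
cat > /tmp/create_metrics.sql <<'EOF'
CREATE TABLE metrics (
    host VARCHAR NOT NULL,
    ts   TIMESTAMP NOT NULL,
    val  DOUBLE,
    CONSTRAINT pk PRIMARY KEY (host, ts)
) SALT_BUCKETS = 16;
EOF
/usr/hdp/current/phoenix-client/bin/sqlline.py zk-host:2181 /tmp/create_metrics.sql
```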
06-29-2016
11:18 AM
3 Kudos
1) hadoop-env settings are Linux environment variables for the processes. Some things need to be set this way because they are used by the shell scripts that start the applications (RAM settings, ...). The XML files can, by definition, only take effect after the JVM is started.

2) That is true, although the defaults don't have everything either; some defaults are hard-coded in the applications.

3) /usr/hdp/2.4.0.0-169 is the actual folder containing the distribution. If you upgrade the cluster, HDP will create a new folder, /usr/hdp/2.4.2.xxx for example, to enable rollback operations. /usr/hdp/current is a folder of symbolic links to the current distribution, i.e. pointing to the real underlying folder of the version you have selected (they also change the structure a bit). Under the covers HDP uses a utility called hdp-select that sets these symbolic links to the version you selected; see the example below.
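You can see this on any HDP node; the listing below is illustrative (exact component names and versions will differ):

```bash
# /usr/hdp/current is just a layer of symlinks into the versioned distribution folder.
ls -ld /usr/hdp/2.4.0.0-169 /usr/hdp/current
ls -l /usr/hdp/current/hadoop-client
# hdp-select shows (and sets) which version those symlinks point to.
hdp-select versions
hdp-select status hadoop-client
```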
06-29-2016
11:11 AM
2 Kudos
In general, .log files are the Java log files that tell you about any operational issues; .out files are the log files of the Java process starter, so if you get any system faults, like a JVM that cannot start or segmentation faults, you will find them there. All logs roll over, i.e. the file without a timestamp at the end is the newest one, and log4j keeps a number of older, rolled-over logs with a timestamp in the name. Apart from that the naming is pretty straightforward:

hadoop-hdfs-datanode: the log of the datanode on the cluster
hadoop-hdfs-namenode: the log of the namenode
hadoop-hdfs-secondarynamenode: the log of the secondary namenode
hdfs-audit: the audit log of HDFS; it logs all activities happening in the cluster (users doing things)
gc files: garbage collection logs enabled for the namenode/datanode processes

So if you have any problems you will normally find them in the hadoop-hdfs .log files; if the problem is JVM-configuration related, in .out, but normally in .log. An illustrative listing is below.
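As an illustration, a datanode host's log directory typically looks something like this (the path and host name are assumptions):

```bash
ls /var/log/hadoop/hdfs
# hadoop-hdfs-datanode-host1.log             <- current Java log (operations issues)
# hadoop-hdfs-datanode-host1.log.2016-06-28  <- older, rolled-over log
# hadoop-hdfs-datanode-host1.out             <- process starter log (JVM/system faults)
# hdfs-audit.log                             <- audit trail of user actions
```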
06-27-2016
10:33 PM
Yeah, if you want to see it in action, look into the HDFS folder before and after the insert (you should see a couple of new files like 00000_1 ... in there). These are the newly added rows in the new output files from your insert job. You can look at the bloom filter indexes with hive --orcfiledump -rowindex ... <filename>; a concrete example is below. http://www.slideshare.net/BenjaminLeonhardi/hive-loading-data
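Concretely, something like this (the warehouse path, file name and column ids are placeholders):

```bash
# List the table directory before and after the insert to see the new output files.
hdfs dfs -ls /apps/hive/warehouse/mytable
# Dump the ORC metadata, including row indexes and bloom filters, for one of the new files.
hive --orcfiledump -rowindex 3 /apps/hive/warehouse/mytable/000000_0
```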