Member since: 09-23-2015
Posts: 800
Kudos Received: 898
Solutions: 185
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 5430 | 08-12-2016 01:02 PM
 | 2204 | 08-08-2016 10:00 AM
 | 2613 | 08-03-2016 04:44 PM
 | 5518 | 08-03-2016 02:53 PM
 | 1430 | 08-01-2016 02:38 PM
06-14-2016
09:20 AM
Good that you fixed it. I would just read a good Hadoop book and understand the Map/Combine/Shuffle/Reduce process in detail. After that, the majority of the markers should be pretty self-evident. https://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/1449311520/184-6666119-1311365?ie=UTF8&*Version*=1&*entries*=0
06-14-2016
08:41 AM
Not sure what you mean by consuming factor. You can see that the reducer took 15 minutes and had 23 million records as input, and that the shuffle was about 500 MB. That should not take 15 minutes in the reducer just to count by group. So I am wondering whether you simply do not have enough memory for the reducers and they cannot keep the groups ( 8 million of them ) in memory. You should definitely increase the number of reducers. Since you have 8 million groups and both tasks took a long time ( so it is most likely not a single huge group ), you can essentially create as many reducers as you have task slots in the cluster. I would also look at the Hive memory configuration to see whether to increase the task memory, and watch what happens on the machines running a reducer, since aggregating 23 million rows should not take 15 minutes.

Quick way to test with more reducers: set mapred.reduce.tasks = x; ( where x is the number of task slots in your cluster )

Quick way to test with more RAM: set hive.tez.java.opts="-Xmx3400m"; set hive.tez.container.size = 4096; where the -Xmx value is 75-90% of the container size, depending on your level of conservatism.
06-13-2016
06:18 PM
@Daniel Perry I don't suppose the stats approach works? That should be instantaneous. The only other option I can think of is updating the record when you write your data, i.e. in your ingestion job, select the biggest record and write it into a file/Hive table. You then have it available immediately whenever you need it ( it's like manual stats ). A rough sketch of that idea is below.
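A minimal sketch of the "manual stats" idea, assuming ( purely for illustration ) that the ingestion job runs through Spark with Hive support; the table and column names are made up:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.functions.max

val sc = new SparkContext(new SparkConf().setAppName("manual-stats"))
val hiveContext = new HiveContext(sc)

// Data that was just ingested ( hypothetical staging table )
val df = hiveContext.table("events_staging")

// Normal ingestion write
df.write.mode("append").saveAsTable("events")

// "Manual stats": store the biggest record's key in a tiny one-row table,
// so looking it up later is instantaneous instead of a full scan
df.agg(max("event_ts").as("max_event_ts"))
  .write.mode("overwrite").saveAsTable("events_max_ts")
```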
06-13-2016
01:29 PM
1 Kudo
Get date from filename: there are some ways to get at the filename in MapReduce, but it's difficult, since MapReduce by definition abstracts filenames away. You have a few options:

1) Use a little Python/Java/shell preprocessing script OUTSIDE Hadoop that adds a field with the date, taken from the filename, to each row of each file. Easy but not that scalable ( a sketch is below ).

2) Write your own RecordReader.

3) Pig provides a value called tagsource that can do the same: http://stackoverflow.com/questions/9751480/how-can-i-incorporate-the-current-input-filename-into-my-pig-latin-script

4) Hive has a hidden (virtual) column for the filename, so you could use that to compute a date column: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+VirtualColumns
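A minimal sketch of option 1, written in Scala here just for illustration; the filename pattern ( data_YYYY-MM-DD.csv ) and the input/output directories are assumptions:

```scala
import java.io.{File, PrintWriter}
import scala.io.Source

// Assumed filename pattern: data_2016-06-13.csv -> date field "2016-06-13"
val datePattern = """data_(\d{4}-\d{2}-\d{2})\.csv""".r

new File("staged").mkdirs()
for (file <- new File("incoming").listFiles if file.getName.endsWith(".csv")) {
  val date = file.getName match {
    case datePattern(d) => d
    case _              => "unknown"
  }
  val out = new PrintWriter(new File("staged", file.getName))
  // Append the date taken from the filename as an extra column to every row
  for (line <- Source.fromFile(file).getLines()) out.println(s"$line,$date")
  out.close()
}
```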
06-13-2016
11:11 AM
1 Kudo
Ah, I just read that it's an export. But again, I don't think there is any automated way to do that. So I suppose your only choice is a custom shell/SSH action that runs a script ( shell, Python, ... ) which lists the files in the directory and then executes a Sqoop job for each of them. That would be my approach; a rough sketch follows.
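A rough sketch of that script, in Scala only to keep the examples here in one language ( a shell or Python script would do the same job ); the export directory, connection string, and table name are placeholders:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.sys.process._

// List the files in the HDFS export directory ( placeholder path )
val fs = FileSystem.get(new Configuration())
val files = fs.listStatus(new Path("/data/export")).filter(_.isFile).map(_.getPath.toString)

// Run one sqoop export per file ( connection string and table name are placeholders )
for (file <- files) {
  val exitCode = Seq(
    "sqoop", "export",
    "--connect", "jdbc:mysql://dbhost/mydb",
    "--username", "etl",
    "--table", "target_table",
    "--export-dir", file
  ).!
  println(s"sqoop export of $file finished with exit code $exitCode")
}
```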
06-10-2016
02:59 PM
Thanks a lot 🙂
06-10-2016
10:53 AM
2 Kudos
And a shameless plug: https://community.hortonworks.com/content/kbentry/25726/spark-streaming-explained-kafka-to-phoenix.html You can have a look at the parser class I wrote there. You would need to write something similar that parses your JSON object and returns a Java/Scala object that you can then use in your analytics. A rough sketch of that kind of parser is below.
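This is not the parser class from the article, just a minimal sketch of the idea, assuming json4s ( which ships with Spark ) and a made-up event structure:

```scala
import org.json4s._
import org.json4s.jackson.JsonMethods.parse

// Hypothetical event structure; adjust the fields to your actual JSON
case class SensorEvent(deviceId: String, timestamp: Long, value: Double)

object SensorEventParser {
  implicit val formats: Formats = DefaultFormats

  // Turn one JSON string from the Kafka stream into a typed Scala object;
  // return None for malformed records so the stream does not die on bad input
  def parseEvent(json: String): Option[SensorEvent] =
    try Some(parse(json).extract[SensorEvent])
    catch { case _: Exception => None }
}

// Usage inside a DStream, e.g.: stream.flatMap(record => SensorEventParser.parseEvent(record._2))
```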
06-10-2016
10:06 AM
Yeah, it should be enabled by default though. You would get the log files through the yarn logs command line ( yarn logs -applicationId <app_id> ), or you can use Pig as well: https://community.hortonworks.com/articles/33703/mining-tez-app-log-file-with-pig-script.html
06-09-2016
06:11 PM
1 Kudo
There is no clustering algorithm in Hive. I think Spark is greatly overselling its story as "unstructured" data analytics: to run a clustering algorithm you always need a schema, and you need to create one in Spark as well to run a clustering model. You can use Spark to read directly from a Hive/ORC table, for example. Frameworks with data mining algorithms in the Hadoop ecosystem:

- SparkML ( the cool kid on the block, and a lot of the algorithms are parallelized )
- SparkR ( a lot of data prep functions get pushed down to Spark, and you have the full power of R and can work with RStudio )
- R MapReduce frameworks ( RMR, ... ), if you don't like Spark
- Mahout ( a bit out of vogue, I wouldn't use it )
- And many more ( like running Python MapReduce streaming ... )

If you ask for an opinion, I would put the tables in Hive ( ORC ) and use SparkML for the clustering. It has a lot of push behind it, and you can use Python or Scala ( use Scala ). If you know R better, something like SparkR might be the way to go. A rough sketch of the Hive + SparkML route is below.
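A minimal sketch of that recommendation with the Spark 1.6-era API; the Hive table name, feature columns, and number of clusters are placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.clustering.KMeans

val sc = new SparkContext(new SparkConf().setAppName("hive-kmeans"))
val hiveContext = new HiveContext(sc)

// Read directly from a Hive/ORC table ( placeholder table and columns )
val df = hiveContext.table("customer_features")

// SparkML expects a single vector column as input
val assembler = new VectorAssembler()
  .setInputCols(Array("recency", "frequency", "monetary"))
  .setOutputCol("features")
val features = assembler.transform(df)

// Cluster into 5 segments ( placeholder k )
val model = new KMeans().setK(5).setFeaturesCol("features").fit(features)
val clustered = model.transform(features)   // adds a "prediction" column with the cluster id
clustered.show()
```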
06-09-2016
05:39 PM
1 Kudo
What do you mean by clusters? A data mining clustering or segmentation algorithm? In that case you have different options, but Spark ML is definitely a strong contender. http://spark.apache.org/docs/latest/ml-clustering.html