Member since
09-23-2015
800
Posts
898
Kudos Received
185
Solutions
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 5431 | 08-12-2016 01:02 PM |
| | 2204 | 08-08-2016 10:00 AM |
| | 2613 | 08-03-2016 04:44 PM |
| | 5519 | 08-03-2016 02:53 PM |
| | 1430 | 08-01-2016 02:38 PM |
06-14-2016
09:20 AM
Good that you fixed it. I would just read a good Hadoop book and understand the Map, Combine, Shuffle, Reduce process in detail. After that, the majority of the markers should be pretty self-evident. https://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/1449311520/184-6666119-1311365?ie=UTF8&*Version*=1&*entries*=0
06-14-2016
08:41 AM
Not sure what you mean by consuming factor. You can see that the reducer took 15 minutes and had 23m records as input. You can also see that the shuffle was 500 MB, which should not take 15 minutes in the reducer for a count by group. So I am wondering whether the reducers do not have enough memory and cannot keep the groups ( 8m ) in memory, or something similar.

You should definitely increase the number of reducers. Since you have 8m groups and both tasks took a long time ( so most likely not a single huge group ), you can essentially create as many reducers as you have task slots in the cluster. But I would also look at the Hive memory configuration to see whether to increase the task memory, and have a look at what happens on the machines running a reducer, since aggregating 23m rows should not take 15 minutes.

Quick way to test with more reducers:
set mapred.reduce.tasks = x; ( where x is the number of task slots in your cluster )

Quick way to test with more RAM:
set hive.tez.java.opts="-Xmx3400m";
set hive.tez.container.size = 4096;
where the Xmx value is 75-90% of the container size, depending on your level of conservatism.
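The 75-90% rule can be turned into a small helper. This is only a sketch: the function name `tez_heap_settings` and the default fraction of 0.8 are my own choices for illustration, not anything Hive provides.

```python
def tez_heap_settings(container_mb, heap_fraction=0.8):
    """Return the two Hive `set` statements for a given Tez container size.

    heap_fraction is the share of the container given to the JVM heap;
    the 0.75-0.90 range follows the rule of thumb above.
    """
    if not 0.75 <= heap_fraction <= 0.90:
        raise ValueError("heap_fraction should be between 0.75 and 0.90")
    # Round down so the heap never exceeds the chosen share of the container.
    heap_mb = int(container_mb * heap_fraction)
    return [
        f"set hive.tez.container.size = {container_mb};",
        f'set hive.tez.java.opts="-Xmx{heap_mb}m";',
    ]


for statement in tez_heap_settings(4096):
    print(statement)
```

Paste the printed statements into your Hive session before running the query. A higher fraction gives the reducer more heap but leaves less headroom for off-heap memory in the container.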
06-13-2016
06:18 PM
@Daniel Perry I don't suppose the stats approach works? That should be instantaneous. The only other option I can think of is updating the record when you write your data, i.e. in your ingestion job, select the biggest record and write it into a file/Hive table. You then have it available immediately when you need it. ( It's like manual stats. )
06-13-2016
01:29 PM
1 Kudo
Get date from Filename

There are some ways to get at the filename in MapReduce, but it's difficult. MapReduce by definition abstracts filenames away. You have a few options there:

1) Use a little python/java/shell whatever preprocessing script OUTSIDE hadoop that adds a field with the date, taken from the filename, to each row of each file. Easy but not that scalable.
2) Write your own RecordReader.
3) Pig seems to provide an option called tagsource that can do the same: http://stackoverflow.com/questions/9751480/how-can-i-incorporate-the-current-input-filename-into-my-pig-latin-script
4) Hive has a hidden column for the filename, so you could use that to compute a date column: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+VirtualColumns
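For option 1, a preprocessing script could look roughly like the sketch below. It assumes delimited text files and a filename that contains the date as YYYYMMDD or YYYY-MM-DD. Both the naming pattern and the function names are assumptions, so adjust the regex to your actual filenames.

```python
import re
from datetime import datetime
from pathlib import Path

# Assumed naming pattern: the date appears as 20160613 or 2016-06-13.
DATE_IN_NAME = re.compile(r"(\d{4})-?(\d{2})-?(\d{2})")


def date_from_filename(filename):
    """Extract the date from a filename and return it as YYYY-MM-DD."""
    match = DATE_IN_NAME.search(Path(filename).name)
    if match is None:
        raise ValueError(f"no date found in filename: {filename}")
    year, month, day = (int(part) for part in match.groups())
    # datetime() rejects impossible dates such as month 13.
    return datetime(year, month, day).strftime("%Y-%m-%d")


def add_date_column(in_path, out_path, delimiter=","):
    """Copy in_path to out_path, appending the file's date to every row."""
    file_date = date_from_filename(in_path)
    with open(in_path, encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            row = line.rstrip("\r\n")
            dst.write(f"{row}{delimiter}{file_date}\n")
```

You would run `add_date_column` over each file before uploading it to HDFS. Because it runs on a single machine, this carries the same scalability caveat as option 1 itself.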
06-13-2016
11:11 AM
1 Kudo
Ah, just read that it's an export. But again, I don't think there is any automated way to do that. So I suppose your only choice is to make a custom shell/ssh action that runs a script ( shell, python ... ) which lists the files in the directory and then executes sqoop jobs for them. That would be my approach.
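A rough Python sketch of such a script is below. It assumes the standard `hdfs dfs -ls` output layout and paths without spaces. The JDBC URL, table name and directory are placeholders, and pointing `--export-dir` at each listed path is my assumption about how you would want to split the work.

```python
import subprocess


def parse_ls_output(ls_output):
    """Return the file paths found in the output of `hdfs dfs -ls <dir>`."""
    paths = []
    for line in ls_output.splitlines():
        fields = line.split()
        # File entries have 8 fields and permissions starting with '-';
        # this skips the "Found N items" header and subdirectories ('d...').
        if len(fields) >= 8 and fields[0].startswith("-"):
            paths.append(fields[-1])
    return paths


def build_sqoop_export(path, jdbc_url, table):
    """Build the argument list for one sqoop export job."""
    return ["sqoop", "export",
            "--connect", jdbc_url,
            "--table", table,
            "--export-dir", path]


def run_exports(hdfs_dir, jdbc_url, table):
    """List hdfs_dir and run one sqoop export per file, stopping on failure."""
    listing = subprocess.run(["hdfs", "dfs", "-ls", hdfs_dir],
                             capture_output=True, text=True, check=True)
    for path in parse_ls_output(listing.stdout):
        subprocess.run(build_sqoop_export(path, jdbc_url, table), check=True)
```

The workflow's shell action would then call something like `run_exports("/data/export", "jdbc:mysql://dbhost/mydb", "my_table")`, with all three values replaced by your own.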
06-10-2016
02:59 PM
Thanks a lot 🙂
06-10-2016
10:53 AM
2 Kudos
And shameless plug: https://community.hortonworks.com/content/kbentry/25726/spark-streaming-explained-kafka-to-phoenix.html You can have a look at the parser class I wrote. You would need to write something similar that parses your JSON object and returns a Java/Scala object that you can then use in your analytics.
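To illustrate the shape of such a parser, here is a minimal sketch in Python for brevity; the same structure applies to a Java or Scala class. The `Event` type and its fields ( `user_id`, `value`, `timestamp` ) are hypothetical. They are not taken from the linked article and should be replaced with the fields of your own JSON.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class Event:
    # Hypothetical fields; replace with the ones in your JSON messages.
    user_id: str
    value: float
    timestamp: int


def parse_event(raw: str) -> Optional[Event]:
    """Parse one JSON message into an Event, or return None if it is malformed."""
    try:
        obj = json.loads(raw)
        return Event(user_id=str(obj["user_id"]),
                     value=float(obj["value"]),
                     timestamp=int(obj["timestamp"]))
    except (ValueError, KeyError, TypeError):
        # Invalid JSON, a missing field, or a value of the wrong type.
        return None
```

Returning None for bad records lets the streaming job filter them out instead of failing the whole batch. Whether you drop or log such records is a design choice for your pipeline.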
06-10-2016
10:06 AM
Yeah, it should be enabled by default though. You can get the log files through the yarn logs command line, or you can use Pig as well. https://community.hortonworks.com/articles/33703/mining-tez-app-log-file-with-pig-script.html
06-09-2016
06:11 PM
1 Kudo
There is no clustering algorithm in Hive. I think Spark is greatly overselling its story as "unstructured" data analytics. To run a clustering algorithm you always need a schema, and you need to create one in Spark as well to run a clustering model. You can use Spark to read directly from a Hive/ORC table, for example.

Frameworks with data mining algorithms in the Hadoop ecosystem:

- SparkML ( cool kid on the block and a lot of the algorithms are parallelized )
- SparkR: a lot of data prep functions get pushed down to Spark and you have the full power of R and can work with RStudio
- R MapReduce frameworks ( RMR ... ): if you don't like Spark ...
- Mahout ( a bit out of vogue, wouldn't use it )
- And many more ( like running Python MapReduce streaming ... )

If you ask for an opinion, I would put the tables in Hive ( ORC ) and use SparkML for the clustering. It just has a lot of push and you can use Python or Scala ( use Scala ). If you know R better, something like SparkR might be the way to go.
06-09-2016
05:39 PM
1 Kudo
What do you mean by clusters? A data mining clustering or segmentation algorithm? In that case you have different options, but Spark ML is definitely a strong contender. http://spark.apache.org/docs/latest/ml-clustering.html