Member since: 09-24-2015
Posts: 816
Kudos Received: 488
Solutions: 189
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 3173 | 12-25-2018 10:42 PM |
| | 14193 | 10-09-2018 03:52 AM |
| | 4764 | 02-23-2018 11:46 PM |
| | 2481 | 09-02-2017 01:49 AM |
| | 2914 | 06-21-2017 12:06 AM |
05-18-2016
03:25 PM
Long ago, when it was freely available on the Internet as a manuscript, I found this book helpful for learning about MapReduce: Data-Intensive Text Processing with MapReduce.
05-18-2016
03:13 PM
You are emitting (key, value) pairs, but "value" doesn't have to be a count. If you emit (hashtag, tweet-id), reducers will receive as input (hashtag, [tweetid-1, tweetid-2, ..., tweetid-n]). You can still count the number of tweets by counting the entries in the input value list, but now you also have a list of tweets mentioning each particular hashtag. That's one way of preprocessing tweets to get ready for search: when someone searches for a certain hashtag, you can quickly show all related tweets. In the next iteration you can emit (hashtag, (tweet-id, retweet-count)) from the mappers, and in the reducer sort the tweets for each hashtag by their retweet count.
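If it helps, here is a minimal sketch of such a reducer using the Hadoop Java API; the class name and output format are illustrative, not from any particular codebase:

```java
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Counts tweets per hashtag while also collecting their ids.
// Names and the "count<tab>id,id,..." output format are illustrative.
public class HashtagReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text hashtag, Iterable<Text> tweetIds, Context context)
            throws IOException, InterruptedException {
        StringBuilder ids = new StringBuilder();
        int count = 0;
        for (Text id : tweetIds) {
            if (count > 0) ids.append(',');
            ids.append(id.toString());
            count++;
        }
        // Emit both the count and the id list for each hashtag.
        context.write(hashtag, new Text(count + "\t" + ids));
    }
}
```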
05-18-2016
03:01 PM
You have already told Sqoop about your intention with the "--hive-database" option. By your choice of --warehouse-dir location you are trying to "assist" Sqoop, but you are actually obstructing it: when importing into Hive, Sqoop first imports the table files into --warehouse-dir and only then moves them into the Hive warehouse. So in your case the files are already there, but Sqoop is unaware of that and tries to move them into the same location, which by default causes a failure in HDFS. As a solution, as I mentioned above, just drop your --warehouse-dir option and retry.
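For example, a Hive import along these lines, with no --warehouse-dir at all (connection string, credentials, and table names here are placeholders):

```bash
sqoop import \
  --connect jdbc:mysql://dbhost/source_db \
  --username sqoop_user -P \
  --table customers \
  --hive-import \
  --hive-database analytics
```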
05-18-2016
01:16 PM
@John Garrigan, with emit(hashtag, 1) you will be counting hashtags. With emit(hashtag, tweet-id) you will still be counting hashtags, but you will also have a list of tweets for each of them. And if you then run a second job that emits (count, hashtag) and restrict its reducer count to 1, you will have hashtags sorted by popularity (count), since MapReduce sorts by key before the reduce phase.
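A minimal sketch of that second-pass mapper, assuming the first job wrote lines like "hashtag&lt;tab&gt;count" (the input format and names are illustrative):

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Swaps (hashtag, count) to (count, hashtag) so the shuffle sorts
// hashtags by count (ascending, most popular last); run the job
// with job.setNumReduceTasks(1) to get a single sorted output file.
public class CountSwapMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] parts = line.toString().split("\t");
        context.write(new IntWritable(Integer.parseInt(parts[1])), new Text(parts[0]));
    }
}
```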
05-18-2016
12:07 AM
HDP-2.4 release notes say that Cascading-3.0.1 is included. There is also ver. 3.0.4, which works on Tez (Cascading platform: hadoop2-tez); you can find it here. You can download the Cascading SDK from here; it includes ver. 2.7, but I guess you can connect it to another version.
05-17-2016
11:53 PM
1 Kudo
That should work, I think. Just connect all your Hue instances to your central DB. It won't work with the embedded SQLite database, but you have already switched to Postgres.
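Each instance would then point at the same database in its hue.ini; a sketch of the relevant section, with the host, credentials, and database name as placeholders:

```ini
[desktop]
  [[database]]
    engine=postgresql_psycopg2
    host=db.example.com
    port=5432
    user=hue
    password=secret
    name=hue
```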
05-17-2016
10:05 PM
Remove --warehouse-dir from your Sqoop command or point it at some neutral location, for example under your home directory in HDFS. Note that /apps/hive/warehouse is the place where Hive keeps its managed (internal) tables; nobody else is supposed to write anything there. On the other hand, --warehouse-dir in Sqoop is a temporary location used during a Hive import; it's also the default location for a plain Sqoop HDFS import when --target-dir is not provided.
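For example, staging under your HDFS home directory instead (the path, connection string, and table names are placeholders):

```bash
sqoop import \
  --connect jdbc:mysql://dbhost/source_db \
  --table customers \
  --hive-import \
  --hive-database analytics \
  --warehouse-dir /user/yourname/sqoop-staging
```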
05-16-2016
11:00 PM
1 Kudo
You can drop non-hashtag strings in your Mapper by emitting only hashtag terms (beginning with "#"). And yes, you can use the tweet identifier as docid, and the tweet text as doc. However, if you don't emit the docid (tweet-id) you will lose the connection between tweets and hashtags. You can emit (hashtag, tweet-id), and in the Reducer phase count the tweets for each hashtag. In this way, you can detect the most popular hashtags, and the tweets using them.
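A minimal sketch of such a mapper, assuming each input line is a tweet in the form "tweet-id&lt;tab&gt;text" (the input format and names are assumptions for illustration):

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits (hashtag, tweet-id) pairs, dropping all non-hashtag terms.
public class HashtagMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] parts = line.toString().split("\t", 2);
        if (parts.length < 2) return;          // skip malformed lines
        Text tweetId = new Text(parts[0]);
        for (String term : parts[1].split("\\s+")) {
            if (term.startsWith("#")) {
                context.write(new Text(term), tweetId);
            }
        }
    }
}
```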
05-16-2016
12:39 PM
Hi @Mats Johansson, if you use Ranger for lineage, like tracking who did what and when to certain files, tables, or databases, then you have to be careful about what you are deleting. You can go for more restrictive deletions from the Audit DB if you want to keep using it to browse lineage, or delete from the Ranger DB after 1 or 2 months but keep all the data in HDFS. To reduce the amount of audit data, you can also revisit your security policies and disable auditing on policies which are not critical. Also, a new version of Atlas is coming, which will work together with Ranger to provide more cool features for lineage.
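A sketch of such a time-bounded cleanup, assuming you use Ranger's DB audit store and its xa_access_audit table (the table and column names are assumptions; verify against your Ranger schema before running anything):

```sql
-- Delete audit events older than ~2 months; adjust the interval to
-- your retention policy. xa_access_audit is assumed to be the table
-- Ranger's DB audit store writes to (MySQL syntax shown).
DELETE FROM xa_access_audit
WHERE event_time < NOW() - INTERVAL 60 DAY;
```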
05-15-2016
07:17 AM
You can create a Hive external table mapped onto your HBase table using HBaseStorageHandler (see the example at the end of the Usage section), and then, as you did with your Sequence file, "select *" from this table into a csv table (stored as textfile, fields terminated by ',').
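A sketch of the two steps, with the HBase table name, column family, and columns as placeholders:

```sql
-- Map the existing HBase table into Hive (names are placeholders).
CREATE EXTERNAL TABLE hbase_mapped (rowkey STRING, col1 STRING, col2 STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,cf:col1,cf:col2')
TBLPROPERTIES ('hbase.table.name' = 'my_hbase_table');

-- Dump it into a comma-delimited text table.
CREATE TABLE csv_copy
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
AS SELECT * FROM hbase_mapped;
```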