Member since: 09-24-2015
Posts: 816
Kudos Received: 488
Solutions: 189
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 3173 | 12-25-2018 10:42 PM |
| | 14193 | 10-09-2018 03:52 AM |
| | 4764 | 02-23-2018 11:46 PM |
| | 2481 | 09-02-2017 01:49 AM |
| | 2914 | 06-21-2017 12:06 AM |
05-18-2016
03:25 PM
Long ago, when it was freely available on the Internet as a manuscript, I found this book helpful for learning about MapReduce: Data-Intensive Text Processing with MapReduce.
05-18-2016
03:13 PM
You are emitting (key, value) pairs, but "value" doesn't have to be a count. If you emit (hashtag, tweet-id), reducers will receive as input (hashtag, [tweetid-1, tweetid-2, ..., tweetid-n]). You can still count the number of tweets by counting the entries in the input value list, but now you also have a list of tweets mentioning each particular hashtag. That's one way of preprocessing tweets to get ready for search: when someone searches for a certain hashtag, you can quickly show all related tweets. In the next iteration you can emit (hashtag, (tweet-id, retweet-count)) from the mappers, and in the reducer sort the tweets for each hashtag by their retweet count.
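If it helps, here is a minimal sketch of such a reducer using the Hadoop Java API; the class name and output format are illustrative, not from any particular codebase:

```java
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Counts tweets per hashtag while also collecting their ids.
// Names and the "count<tab>id,id,..." output format are illustrative.
public class HashtagReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text hashtag, Iterable<Text> tweetIds, Context context)
            throws IOException, InterruptedException {
        StringBuilder ids = new StringBuilder();
        int count = 0;
        for (Text id : tweetIds) {
            if (count > 0) ids.append(',');
            ids.append(id.toString());
            count++;
        }
        // Emit both the count and the id list for each hashtag.
        context.write(hashtag, new Text(count + "\t" + ids));
    }
}
```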
05-18-2016
03:01 PM
You have already told Sqoop about your intention with the "--hive-database" option. By your choice of --warehouse-dir location you are trying to "assist" Sqoop, but you are actually obstructing it: when importing into Hive, Sqoop first imports the table files into --warehouse-dir and only then moves them into the Hive warehouse. So in your case the files are already there, but Sqoop is unaware of that and tries to move them into the same location, which by default causes a failure in HDFS. As a solution, as I mentioned above, just drop your --warehouse-dir option and retry.
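For example, a Hive import along these lines, with no --warehouse-dir at all (connection string, credentials, and table names here are placeholders):

```bash
sqoop import \
  --connect jdbc:mysql://dbhost/source_db \
  --username sqoop_user -P \
  --table customers \
  --hive-import \
  --hive-database analytics
```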
05-18-2016
01:16 PM
@John Garrigan, with emit(hashtag, 1) you will be counting hashtags. With emit(hashtag, tweet-id) you will still be counting hashtags, but you will also have a list of tweets for each of them. And if you then run a second job that emits (count, hashtag) and restrict its reducer count to 1, you will have hashtags sorted by popularity (count), since MapReduce sorts by key before the reduce phase.
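A minimal sketch of that second-pass mapper, assuming the first job wrote lines like "hashtag&lt;tab&gt;count" (the input format and names are illustrative):

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Swaps (hashtag, count) to (count, hashtag) so the shuffle sorts
// hashtags by count (ascending, most popular last); run the job
// with job.setNumReduceTasks(1) to get a single sorted output file.
public class CountSwapMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] parts = line.toString().split("\t");
        context.write(new IntWritable(Integer.parseInt(parts[1])), new Text(parts[0]));
    }
}
```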
05-18-2016
12:07 AM
HDP-2.4 release notes say that Cascading-3.0.1 is included. There is also ver. 3.0.4, which works on Tez (Cascading platform: hadoop2-tez); you can find it here. You can download the Cascading SDK from here; it includes ver. 2.7, but I guess you can connect it to another version.
05-17-2016
11:53 PM
1 Kudo
That should work, I think. Just connect all your Hue instances to your central DB. It won't work with the embedded SQLite database, but you have already switched to Postgres.
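Each instance would then point at the same database in its hue.ini; a sketch of the relevant section, with the host, credentials, and database name as placeholders:

```ini
[desktop]
  [[database]]
    engine=postgresql_psycopg2
    host=db.example.com
    port=5432
    user=hue
    password=secret
    name=hue
```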
05-17-2016
10:05 PM
Remove --warehouse-dir from your Sqoop command or point it at some neutral location, for example under your home directory in HDFS. Note that /apps/hive/warehouse is the place where Hive keeps its managed (internal) tables; nobody else is supposed to write anything there. On the other hand, --warehouse-dir in Sqoop is a temporary location used during a Hive import; it's also the default location for a plain Sqoop HDFS import when --target-dir is not provided.
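For example, staging under your HDFS home directory instead (the path, connection string, and table names are placeholders):

```bash
sqoop import \
  --connect jdbc:mysql://dbhost/source_db \
  --table customers \
  --hive-import \
  --hive-database analytics \
  --warehouse-dir /user/yourname/sqoop-staging
```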
05-16-2016
11:00 PM
1 Kudo
You can drop non-hashtag strings in your Mapper by emitting only hashtag terms (beginning with "#"). And yes, you can use the tweet identifier as docid, and the tweet text as doc. However, if you don't emit the docid (tweet-id) you will lose the connection between tweets and hashtags. You can emit (hashtag, tweet-id), and in the Reducer phase count the tweets for each hashtag. In this way, you can detect the most popular hashtags, and the tweets using them.
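A minimal sketch of such a mapper, assuming each input line is a tweet in the form "tweet-id&lt;tab&gt;text" (the input format and names are assumptions for illustration):

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits (hashtag, tweet-id) pairs, dropping all non-hashtag terms.
public class HashtagMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] parts = line.toString().split("\t", 2);
        if (parts.length < 2) return;          // skip malformed lines
        Text tweetId = new Text(parts[0]);
        for (String term : parts[1].split("\\s+")) {
            if (term.startsWith("#")) {
                context.write(new Text(term), tweetId);
            }
        }
    }
}
```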
05-16-2016
12:39 PM
Hi @Mats Johansson, if you use Ranger for lineage, like tracking who did what and when to certain files, tables, or databases, then you have to be careful about what you are deleting. You can go for more restrictive deletions from the Audit DB if you want to keep using it to browse lineage, or delete from the Ranger DB after 1 or 2 months but keep all the data in HDFS. To reduce the amount of audit data, you can also revisit your security policies and disable auditing on policies which are not critical. Also, a new version of Atlas is coming, which will work together with Ranger to provide more cool features for lineage.
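A sketch of such a time-bounded cleanup, assuming you use Ranger's DB audit store and its xa_access_audit table (the table and column names are assumptions; verify against your Ranger schema before running anything):

```sql
-- Delete audit events older than ~2 months; adjust the interval to
-- your retention policy. xa_access_audit is assumed to be the table
-- Ranger's DB audit store writes to (MySQL syntax shown).
DELETE FROM xa_access_audit
WHERE event_time < NOW() - INTERVAL 60 DAY;
```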
05-15-2016
07:17 AM
You can create a Hive external table mapped onto your HBase table using HBaseStorageHandler (see the example at the end of the Usage section), and then, as you did with your Sequence file, "select *" from this table into a csv table (stored as textfile, fields terminated by ',').
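A sketch of the two steps, with the HBase table name, column family, and columns as placeholders:

```sql
-- Map the existing HBase table into Hive (names are placeholders).
CREATE EXTERNAL TABLE hbase_mapped (rowkey STRING, col1 STRING, col2 STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,cf:col1,cf:col2')
TBLPROPERTIES ('hbase.table.name' = 'my_hbase_table');

-- Dump it into a comma-delimited text table.
CREATE TABLE csv_copy
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
AS SELECT * FROM hbase_mapped;
```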