Member since: 09-21-2015
Posts: 9
Kudos Received: 5
Solutions: 0
03-29-2018 09:06 PM
This is awesome.
02-24-2016 07:38 PM
3 Kudos
If you can work your way through SQL Server and Sqoop, I agree that's probably the cleanest option. If you're looking for something you can automate entirely on the cluster, or you don't have the luxury of pushing the data through SQL Server, here's another option. There's a very simple open-source toolset called mdbtools that makes it easy to extract metadata and data from MS Access databases. In about ten lines of shell script, you can get a list of tables in the .mdb, dump the data out to text files, import those into HDFS, and wrap a generic Hive schema around the files. Since you're going through intermediate text files, you might not be able to support some character sets, and you could run into an issue or two with file formats that need to be cleaned up with a secondary sed or perl pass. If you don't want to route the data through SQL Server, though, this can be a good solution.
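For illustration, here's roughly what that script looks like. This is a minimal sketch: the .mdb filename, the HDFS path, and the all-STRING schema are assumptions, and quoting/encoding quirks in the CSV dumps are exactly what the secondary sed/perl pass would clean up.

#!/bin/bash
# Sketch: extract every table from an Access file with mdbtools,
# stage the dumps in HDFS, and wrap a generic Hive schema around them.
# Assumes mdbtools, the hdfs client, and the hive CLI are on the PATH;
# MDB and HDFS_BASE are placeholder names. Table names containing
# spaces or odd characters would need extra handling.
MDB=customers.mdb
HDFS_BASE=/data/staging/mdb

for table in $(mdb-tables -1 "$MDB"); do
    # Dump the table as headerless CSV (-H suppresses the header row).
    mdb-export -H "$MDB" "$table" > "${table}.csv"

    hdfs dfs -mkdir -p "${HDFS_BASE}/${table}"
    hdfs dfs -put -f "${table}.csv" "${HDFS_BASE}/${table}/"

    # Build a generic all-STRING column list from the header row.
    cols=$(mdb-export "$MDB" "$table" | head -1 |
           sed 's/[^A-Za-z0-9,_]//g; s/,/ STRING, /g; s/$/ STRING/')

    hive -e "CREATE EXTERNAL TABLE IF NOT EXISTS ${table} (${cols})
             ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
             LOCATION '${HDFS_BASE}/${table}';"
done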
11-05-2015 11:27 PM
1 Kudo
@Simon Elliston Ball is right: there's a huge variety of options for NLP because there are many niches within natural language processing. Keep in mind that NLP libraries rarely solve business problems directly. Rather, they give you the tools to build a solution. Often this means segmenting free text into chunks suitable for analysis (e.g. sentence disambiguation), annotating free text (e.g. part-of-speech tagging), or converting free text into a more structured form (e.g. vectorization). All of these are useful tools for processing text, but they are insufficient by themselves. They help you convert free, unstructured text into a form suitable as input to a normal machine learning or analysis pipeline (i.e. classification, etc.). The one exception I can think of is sentiment analysis; that is a properly valuable analytic in and of itself. Also keep in mind that the licenses for some of these libraries are not as permissive as Apache's (e.g. CoreNLP is GPL, with the option to purchase a license for commercial use).
04-27-2018 03:08 PM
The HDP 2.5/2.6 repos include the tools required to implement HDFS-FUSE. It should be noted that, as of today, this runs only in user space, and as such does not perform as well as something like the NFS Gateway option. There is also currently no Ambari plugin to manage the service. I recently wrote an article that walks through this on HDP (current release): Using HDFS-FUSE for POSIX directory mounts
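As a quick illustration, mounting looks something like this. A minimal sketch only: the exact package name varies by HDP release, and the NameNode host/port and mount point are placeholders for your environment.

# Install the FUSE bindings for HDFS (the exact package name varies
# by HDP release; 'yum search fuse' will show the right one).
yum install -y hadoop-hdfs-fuse

# Mount HDFS at a local path; hadoop-fuse-dfs takes a dfs:// URI
# pointing at the NameNode, plus a mount point.
mkdir -p /mnt/hdfs
hadoop-fuse-dfs dfs://namenode.example.com:8020 /mnt/hdfs

# HDFS now looks like an ordinary (user-space) POSIX directory tree.
ls /mnt/hdfs/user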
03-28-2016 06:16 AM
As the answer suggests, just use the built-in JMS spout. The Storm project page has details on how to set it up.
12-14-2015 04:50 PM
@nasghar: Why do you want to move data from Phoenix to Hive/ORC? If your intention is to run Hive queries on an HBase/Phoenix table, then you can easily create a Hive external table on top of your existing HBase/Phoenix table. Or do you intentionally want to duplicate the data in a Hive internal table? That way you will create two sets of data that you have to maintain (one in Hive and the other in HBase).
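For the external-table route, the DDL looks something like this. A minimal sketch: the HBase table name WEB_STAT, its 'usage' column family, and the Hive column names are placeholders, and Phoenix's binary encodings for numeric types may need extra handling.

# Sketch: map a Hive external table onto an existing HBase table so
# Hive queries read the HBase data in place, with no second copy of
# the data to maintain. Table and column names are placeholders.
hive -e "
CREATE EXTERNAL TABLE web_stat_hive (
  rowkey STRING,
  core   STRING
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,usage:core')
TBLPROPERTIES ('hbase.table.name' = 'WEB_STAT');
"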
11-04-2015 03:06 PM
1 Kudo
@nasghar@hortonworks.com This is a bug that is addressed in the next release. The corresponding ticket is https://issues.apache.org/jira/browse/NIFI-1010.