Member since: 09-24-2015
Posts: 816
Kudos Received: 488
Solutions: 189
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
|  | 2630 | 12-25-2018 10:42 PM |
|  | 12083 | 10-09-2018 03:52 AM |
|  | 4171 | 02-23-2018 11:46 PM |
|  | 1853 | 09-02-2017 01:49 AM |
|  | 2172 | 06-21-2017 12:06 AM |
02-23-2016
04:22 AM
@Sagar Shimpi, nice article! Are there any prebuilt jars (or rpm/tar archives) ready to be installed? There are many places where I cannot install git and Gradle, but Ambari Shell could show up as a kind of "coming soon" add-on for Ambari. Thanks.
02-23-2016
03:39 AM
1 Kudo
Hi @rajdip chaudhuri, mysql-connector-java.jar in /usr/hdp/sqoop-client/lib is a symlink to /usr/share/java/mysql-connector-java.jar, which itself is a symlink to the real jar file with a version in its name. Make sure it points to the right file: run ls -l /usr/share/java/mysql-connector-java.jar and inspect the destination file. Also note that the mysql-connector that can be installed via yum works for MySQL 5.1, but for MySQL 5.5 and 5.6 you need the latest version.
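For example, a quick way to check the symlink chain and repoint it if needed (the versioned file name below is just a placeholder for whatever driver you actually downloaded):
# follow the symlink chain to find the physical jar being used
ls -l /usr/hdp/sqoop-client/lib/mysql-connector-java.jar
ls -l /usr/share/java/mysql-connector-java.jar
# if it points to the wrong file, repoint it to the correct driver version
ln -sf /usr/share/java/mysql-connector-java-5.1.38.jar /usr/share/java/mysql-connector-java.jar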
02-22-2016
11:21 PM
3 Kudos
Hi @Cristina Lopes, you can add your jar and other files to HDFS using the Files view, then go back to the Hive view and write your Hive scripts referring to those files. If you run into any specific problems, please let us know.
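As a rough sketch of the flow (the jar name, HDFS path and UDF class here are made up for illustration; the Files view upload is equivalent to the hdfs dfs -put below):
# upload the jar to HDFS (the Files view does this for you through the browser)
hdfs dfs -mkdir -p /user/cristina/libs
hdfs dfs -put my-udfs.jar /user/cristina/libs/
# then, in your Hive view script, reference the uploaded jar:
#   ADD JAR hdfs:///user/cristina/libs/my-udfs.jar;
#   CREATE TEMPORARY FUNCTION my_udf AS 'com.example.MyUdf';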
02-22-2016
06:44 AM
3 Kudos
Hi @Sam Mingolelli, namespaces are indeed poorly documented. Here is a list of HBase shell commands you can use: alter_namespace, create_namespace, describe_namespace, drop_namespace, list_namespace, list_namespace_tables. The first four are self-explanatory. With list_namespace you can list all available namespaces. The system namespace is now called "hbase", and tables created without a namespace go into the "default" namespace. With "list_namespace_tables <namespace>" you can list the tables in a given namespace. After that you can reference tables as 'namespace:table'; note that a colon is used as the delimiter, not a dot as in the design document.
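A minimal hbase shell session illustrating the above (the 'sales' namespace and 'orders' table are arbitrary example names):
hbase shell <<'EOF'
create_namespace 'sales'
# the colon separates the namespace from the table name
create 'sales:orders', 'cf'
list_namespace_tables 'sales'
put 'sales:orders', 'row1', 'cf:qty', '3'
scan 'sales:orders'
EOF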
02-21-2016
07:08 AM
1 Kudo
Hi @Shishir Saxena, the Oracle connector for Hadoop, the so-called OraOop, is included in Sqoop 1.4.5 and 1.4.6 (shipped with HDP 2.3.x). The Sqoop user guide has a very detailed explanation here. It's enabled when "--direct" is used. Regarding benchmarks, it's best to build your own, for example by running Sqoop with and without OraOop, with different numbers of mappers, various table sizes, etc.
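For example, a hedged sketch of such an import with --direct (the JDBC URL, credentials, table and target directory below are placeholders):
sqoop import --direct \
  --connect jdbc:oracle:thin:@//oradb-host:1521/ORCL \
  --username SCOTT \
  --password-file /user/scott/.oracle-password \
  --table SCOTT.EMPLOYEES \
  --num-mappers 8 \
  --target-dir /data/employees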
02-19-2016
11:06 AM
4 Kudos
Hi @Kevin Vasko, yes, such things can happen, especially early on, and XML and JSON, while seemingly trivial, are not well suited for direct processing in Hadoop. Also, in the case of your JSON efforts, it looks like you are facing a classic "last 10%" problem: the main functionality works early on, but then you realize you have to handle many special cases. Regarding the problem at hand, I found an interesting presentation about XML that suggests pre-parsing the XML (or JSON) files into Avro and processing them using Pig, and I agree with that approach. Also, when reading huge XML files, you can extract the desired elements using SAX instead of XPath. And these days, instead of Avro you can also use ORC files, which provide much better performance. Pre-parsing will take a while, but it has to be done only once; after that you end up with Hadoop-friendly input files that you can process repeatedly in many ways. Just keep going, and I'm sure you will overcome these early issues quickly!
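To make the idea concrete, here is a rough sketch of that workflow; the converter jar, the Pig script and all paths are hypothetical, the point is just the split between a one-time conversion and repeated downstream jobs:
# one-time pre-parsing step: convert raw XML/JSON into Avro (or ORC)
hadoop jar xml-to-avro-converter.jar /raw/xml /data/events_avro
# afterwards, run as many Pig (or Hive) jobs as you like against the clean files
pig -param INPUT=/data/events_avro -f analyze_events.pig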
02-19-2016
08:15 AM
3 Kudos
@Rushikesh Deshmukh, -getmerge will download all parts from HDFS to your local machine and merge them into a single local destination file there. If you have 8 parts of, say, 128 MB each, you will end up downloading 1 GB of data, so it only makes sense for small outputs. However, if you want to keep the resulting file on HDFS, one way to do it is to run an MR job with identity mappers and a single identity reducer. For example:
hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar -Dmapred.reduce.tasks=1 -mapper cat -reducer cat -input 'inputdir/part-r-*' -output outputdir
If you keep the output on HDFS, another question is "seen as a single file by whom?". The command above will create a single output file, but if you run another MR job using that file as input, the MR framework will by default "see" it as 8 input splits, one per HDFS block (assuming a block size of 128 MB), and will process it with 8 mappers.
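For comparison, the local-merge variant is a one-liner (the local destination path is up to you):
# downloads all files under inputdir and concatenates them into one local file
hadoop fs -getmerge inputdir /tmp/merged_output.txt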
02-18-2016
10:45 AM
1 Kudo
Hi @petri koski, well, I'm sorry, but as you can imagine the sandbox is not of industrial strength 🙂 It's intended for functional testing, to make sure your scripts work as expected. Several hundred Pig jobs are a piece of cake for a small cluster, but they are heavy lifting for the sandbox with 8 GB of RAM. Now, to answer your question: I have also noticed some strange behavior in my sandbox after my Mac reboots (crashes) while the sandbox is running (I'm using VirtualBox). Some files may get damaged, I guess, but IMHO it doesn't make sense to troubleshoot the sandbox at that level.
02-18-2016
08:19 AM
5 Kudos
Hi @Rushikesh Deshmukh, for a list of backup options check this. CopyTable is a nice option: using multiple mappers, it copies individual tables to the same or another cluster. You may miss a few edits made during the copy, but you will end up with a useful copy.
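As an illustration, a CopyTable run to another cluster could look like this (the peer ZooKeeper quorum, znode and table names are placeholders):
hbase org.apache.hadoop.hbase.mapreduce.CopyTable \
  --peer.adr=remote-zk1,remote-zk2,remote-zk3:2181:/hbase \
  --new.name=orders_backup \
  orders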
02-18-2016
06:14 AM
3 Kudos
Hi @Rushikesh Deshmukh, Hazelcast is somewhat related to CouchDB in that both provide a kind of distributed cache; Hazelcast is more Java-oriented, while CouchDB is a document store (like MongoDB). Cassandra is not in that group; it's a column-family-oriented key-value store similar to HBase. Here is a great and pretty exhaustive list of NoSQL storage solutions, more than 225 of them right now! It makes for great reading, but at the end of the day you have to select one, or a few, for your application.