Member since: 09-24-2015
Posts: 816
Kudos Received: 488
Solutions: 189
My Accepted Solutions
Title | Views | Posted
---|---|---
| 2628 | 12-25-2018 10:42 PM
| 12079 | 10-09-2018 03:52 AM
| 4169 | 02-23-2018 11:46 PM
| 1850 | 09-02-2017 01:49 AM
| 2171 | 06-21-2017 12:06 AM
02-23-2016 04:22 AM
@Sagar Shimpi, nice article! Are there any prebuilt jars (or rpm/tar packages) ready to be installed? There are many environments where I cannot install git and Gradle, but the Ambari shell could still show up there as a kind of "coming soon" add-on for Ambari. Thanks.
02-23-2016 03:39 AM
1 Kudo
Hi @rajdip chaudhuri, mysql-connector-java.jar in /usr/hdp/sqoop-client/lib is a symlink to /usr/share/java/mysql-connector-java.jar, which itself is a symlink to the real jar file with a version in its name. Make sure it points to the right file: run ls -l /usr/share/java/mysql-connector-java.jar and inspect the destination file. Also, the mysql-connector package that can be installed by yum is good for MySQL 5.1, but for 5.5 and 5.6 you need the latest version.
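For example, a quick check could look like this (the driver version at the end is only an example of a newer Connector/J release):

# follow the symlink chain and see which driver jar is actually in use
ls -l /usr/hdp/sqoop-client/lib/mysql-connector-java.jar
readlink -f /usr/share/java/mysql-connector-java.jar
# if it ends at an outdated driver, repoint the symlink to a newer Connector/J jar
ln -sf /usr/share/java/mysql-connector-java-5.1.38.jar /usr/share/java/mysql-connector-java.jar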
02-22-2016 11:21 PM
3 Kudos
Hi @Cristina Lopes, you can add your jar and other files to HDFS using the Files view, then go back to the Hive view and write your Hive scripts referring to those files. If you run into any specific problems, please let us know.
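As an illustration, assuming the UDF jar was uploaded to /user/admin/lib/my-udf.jar through the Files view (the path, class, function and table names below are only placeholders), a Hive script could reference it like this; the same statements could also go into the Hive view's query editor:

# write the script and run it with the Hive CLI
cat > use_udf.hql <<'EOF'
ADD JAR hdfs:///user/admin/lib/my-udf.jar;
CREATE TEMPORARY FUNCTION my_lower AS 'com.example.udf.MyLower';
SELECT my_lower(name) FROM customers LIMIT 10;
EOF
hive -f use_udf.hql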
02-22-2016 06:44 AM
3 Kudos
Hi @Sam Mingolelli, namespaces are indeed poorly documented. Here is a list of hbase shell commands you can use: alter_namespace, create_namespace, describe_namespace, drop_namespace, list_namespace, list_namespace_tables. The first four are self-explanatory. With list_namespace you can list all available namespaces. The system namespace is now called "hbase", and tables without a namespace all go to the "default" namespace. With "list_namespace_tables <namespace>" you can list the tables in a given namespace. After that you can reference created tables as 'namespace:table'; note that a colon is used as the delimiter, not a dot as in the design document.
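A quick illustration from the hbase shell (the namespace, table and column family names are just placeholders):

hbase shell <<'EOF'
create_namespace 'sales'
create 'sales:orders', 'cf'
list_namespace
list_namespace_tables 'sales'
put 'sales:orders', 'row1', 'cf:id', '42'
scan 'sales:orders'
EOF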
02-21-2016 07:08 AM
1 Kudo
Hi @Shishir Saxena, the Oracle connector for Hadoop, the so-called Oraoop, is included in Sqoop 1.4.5 and 1.4.6 (shipped with HDP 2.3.x). The Sqoop user guide has a very detailed explanation here. It's enabled when "--direct" is used. Regarding benchmarks, it's best to build your own, for example by running Sqoop with and without Oraoop, with different numbers of mappers, various table sizes, etc.
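A hedged example of such a run (the JDBC URL, credentials and table name are placeholders); the only difference between the "with" and "without" Oraoop runs is the --direct flag:

sqoop import \
  --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
  --username SCOTT -P \
  --table SCOTT.EMPLOYEES \
  --direct \
  --num-mappers 8 \
  --target-dir /data/oracle/employees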
02-19-2016 11:06 AM
4 Kudos
Hi @Kevin Vasko, yes, such things can happen, especially early on: XML and JSON, while seemingly trivial, are not well suited for direct processing in Hadoop. Also, in the case of your JSON efforts, it looks like you are having a classic "last 10%" problem: the main functionality works early on, but then you realize you have to handle many special cases. Regarding the problem at hand, I found an interesting presentation about XML suggesting pre-parsing the XML (or JSON) files to Avro and processing them using Pig, and I agree with that approach. Also, when reading huge XML files you can extract the desired elements using SAX instead of XPath. And these days, instead of Avro you can also use ORC files, which provide much better performance. Pre-parsing will take a while, but it has to be done only once. After that you end up with Hadoop-friendly input files which you can process repeatedly in many ways. Just keep going, and I'm sure you will overcome these early issues pretty quickly!
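To make the idea concrete, once the XML has been pre-parsed into Avro files on HDFS, a Pig script can read them directly. This is only a sketch, assuming Pig 0.14 or later (where AvroStorage is built in); the paths and field names are placeholders:

cat > process_preparsed.pig <<'EOF'
-- load the pre-parsed Avro files; the schema is read from the files themselves
events = LOAD '/data/preparsed/events' USING AvroStorage();
-- from here on it is ordinary Pig: filter, join, aggregate, ...
recent = FILTER events BY event_year >= 2015;
STORE recent INTO '/data/out/recent_events' USING AvroStorage();
EOF
pig -f process_preparsed.pig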
02-19-2016 08:15 AM
3 Kudos
@Rushikesh Deshmukh, -getmerge will download all parts from HDFS to your local machine and merge them into a local destination there. If you have 8 parts of, say, 128M each, you will end up downloading 1G of data, so it makes sense only for small files. However, if you want to keep the resulting file on HDFS, one way to do it is to create an MR job with identity mappers and a single identity reducer. For example: hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar -Dmapred.reduce.tasks=1 -mapper cat -reducer cat -input 'inputdir/part-r-*' -output outputdir
If you keep the output on HDFS, another question is "seen by whom as a single file?". The above command will create a single output file, but if you run another MR job using that file as input, the MR framework will by default "see" it as 8 splits, actually 8 HDFS blocks (assuming a block size of 128M), and will process it using 8 mappers.
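For completeness, the getmerge variant mentioned above would look something like this (the local path is a placeholder), with the merged result landing on the local file system rather than on HDFS:

# merge all part files under inputdir into a single local file
hdfs dfs -getmerge inputdir /tmp/merged-output.txt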
02-18-2016 10:45 AM
1 Kudo
Hi @petri koski, well, I'm sorry, but as you can imagine the sandbox is not of industrial strength 🙂 It's intended for functional testing, to make sure your scripts are working as expected. Several hundred Pig jobs are a piece of cake for a small cluster, but they are heavy lifting for the sandbox on 8G of RAM. Now, to answer your question, I have also noticed some strange behavior of my sandbox after my Mac reboots (or crashes) while the sandbox is running (I'm using VirtualBox). Some files can get damaged, I guess, but IMHO it doesn't make sense to troubleshoot the sandbox to that level.
02-18-2016 08:19 AM
5 Kudos
Hi @Rushikesh Deshmukh, for a list of backup options check this. CopyTable is a nice option: using multiple mappers, you can copy individual tables to the same or another cluster. You might miss a few edits, but you will end up with a useful copy.
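A hedged example of running CopyTable (the table names and ZooKeeper quorum are placeholders; note that the destination table must already exist with the same column families):

# copy 'mytable' into 'mytable_backup' on the same cluster
hbase org.apache.hadoop.hbase.mapreduce.CopyTable --new.name=mytable_backup mytable
# or copy it to another cluster by pointing at that cluster's ZooKeeper quorum
hbase org.apache.hadoop.hbase.mapreduce.CopyTable \
  --peer.adr=zk1,zk2,zk3:2181:/hbase-unsecure mytable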
02-18-2016 06:14 AM
3 Kudos
Hi @Rushikesh Deshmukh, Hazelcast is somewhat related to CouchDB since they both provide a kind of distributed cache; Hazelcast is more Java oriented, while CouchDB is more of a document store (like MongoDB). Cassandra is not in that group: it's a column-family oriented key-value store similar to HBase. Here is a great and pretty exhaustive list of NoSQL storage solutions, more than 225 of them right now! It's a great read, but at the end of the day you have to select one, or a few, for your application.