Member since: 12-30-2015
Posts: 68
Kudos Received: 16
Solutions: 3
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 2122 | 06-14-2016 08:56 AM
 | 1848 | 04-25-2016 11:59 PM
 | 2317 | 03-25-2016 06:50 PM
05-13-2016
05:59 PM
1 Kudo
I created a relation like:

A = LOAD 'sometable' USING org.apache.hive.hcatalog.pig.HCatLoader();
A_Filtered = FILTER A BY Col1 == 1;

Then I changed the underlying table structure in Hive: the type of Col1 went from int to string. To get the new data types reflected in relation A, I ran the same LOAD again:

A = LOAD 'sometable' USING org.apache.hive.hcatalog.pig.HCatLoader();

I understand that A_Filtered becomes invalid once A is redefined, and I got the error below when I did so. But now I cannot re-create A_Filtered, or any relation at all with the new definition; the shell keeps reporting that the relation A_Filtered has incompatible data types in the equal operator. How do I fix this? Can I delete the relation so the error stops occurring? It clears up once I exit the grunt shell and log in again, but I would like to know whether it can be fixed without exiting the shell.

2016-05-13 12:32:02,354 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2016-05-13 12:32:02,403 [main] WARN org.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.metastore.pre-event.listeners does not exist
2016-05-13 12:32:02,403 [main] WARN org.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.semantic.analyzer.factory.impl does not exist
2016-05-13 12:32:02,425 [main] INFO hive.metastore - Connected to metastore.
2016-05-13 12:32:02,519 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2016-05-13 12:32:02,553 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2016-05-13 12:32:02,603 [main] WARN org.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.metastore.pre-event.listeners does not exist
2016-05-13 12:32:02,604 [main] WARN org.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.semantic.analyzer.factory.impl does not exist
2016-05-13 12:32:02,666 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1039:
<line 2, column 56> In alias A_Filtered, incompatible types in Equal Operator left hand side:chararray right hand side:int
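For reference, here is the full sequence of statements I am running in the grunt shell (the table and column names are placeholders):

-- original definitions, while Col1 was still an int in Hive
A = LOAD 'sometable' USING org.apache.hive.hcatalog.pig.HCatLoader();
A_Filtered = FILTER A BY Col1 == 1;

-- after changing Col1 to string in Hive, I reload A to pick up the new schema
A = LOAD 'sometable' USING org.apache.hive.hcatalog.pig.HCatLoader();

-- re-defining the filter against the new schema is where I keep getting the ERROR 1039 above
A_Filtered = FILTER A BY Col1 == '1';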
Labels:
- Apache HCatalog
- Apache Hive
- Apache Pig
05-13-2016
03:58 AM
And one more question. I am loading the HBase table through Pig and generating the sequence number with RANK before storing. Even though I pre-split the table, the UI showed that only three regions of the HBase table were receiving data for the first few minutes. Why would that happen? When the table is pre-split, shouldn't all of its regions receive data from the previous operator simultaneously and load in parallel? Why were only three of the 10 regions loading? Could it be because the data is streamed down to the storer as soon as the previous operator generates the row number? Any suggestions would be helpful. Thanks!
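In case it helps, this is the shape of the script I am using to generate the sequence number and load the pre-split table (the paths, table and column names are placeholders):

data   = LOAD '/data/input' USING PigStorage('\t') AS (col1:chararray, col2:chararray);
ranked = RANK data;   -- prepends the sequence number, which becomes the HBase row key
STORE ranked INTO 'hbase://my_table'
    USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf:col1 cf:col2');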
05-13-2016
03:51 AM
Actually, the main purpose of the table is to be a lookup table. The major problem is that lookups against this table are based on the sequence number, so I chose the sequence number as the row key, and I think that is what caused the hot spots. May I know how Phoenix ensures a good distribution? I thought Phoenix was just a SQL layer on top of HBase to enable querying HBase tables.
05-13-2016
03:47 AM
Thanks for the info.
05-11-2016
09:58 PM
Hi Ben, thanks for taking the time to explain each of these questions.

For question 4, I actually meant to type "vertex" but wrote "node" instead. What I meant to ask was: by setting the number of reducers, do we affect all the vertices that run only reducers? Based on your explanation, I think it affects every vertex running reducers and does not affect any vertex running mappers, or a combination of mappers and combiners. Is that right? My understanding of mapper versus reducer is that any class extending Mapper is a mapper and any class extending Reducer is a reducer.

Out of curiosity, when I look at the Pig source code there are many operators like PORank, POFRJoin, etc., and these are what show up in the explain plan as the tasks of each vertex. So in a Tez DAG, Pig Latin essentially gets converted into these operators, right? Are these operators run inside mappers and reducers? In other words, irrespective of whether the underlying task is a true Mapper or Reducer class or one of the Tez Pig operators, is it correct to assume that the parallelism of the root vertices, which read the data from a file or table, is controlled by the file splits or table partitions, while the leaf vertices and the vertices in between all behave like reducers, with their parallelism controlled by the reducer properties (number of reducers, bytes per reducer)?

And if I write a UDF, is it possible to identify whether it is running inside a mapper class or a reducer class?
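Just so we are talking about the same setting, this is how I am setting the reducer parallelism in the script today (the values and names are arbitrary placeholders):

SET default_parallel 20;                  -- what I mean by "setting the number of reducers"
data    = LOAD '/data/input' AS (key:chararray, val:int);
grouped = GROUP data BY key PARALLEL 40;  -- per-operator override on a single reduce vertex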
05-10-2016
06:18 PM
Hi Artem, thanks for your reply. I will try hashing my row key. Even then, I think I will need to pre-split the HBase table so that there are multiple regions right from the start of the load (I was just reading about creating pre-splits). But couldn't HBase still end up placing all of those regions on the same server? Is there any option to ensure that the regions of a table are distributed across multiple servers?
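From what I have been reading, the pre-split create in the HBase shell would look something like this (the table name, column family and split points are placeholders):

create 'my_table', 'cf', SPLITS => ['1000000', '2000000', '3000000']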
05-10-2016
02:36 AM
Hi Josh, I looked at the table splits as you suggested. The table has 18 regions, but the problem is that all 18 regions are on the same node (region server). How do I spread the regions across multiple region servers? Also, is there a command to check the table splits and HBase regions from the CLI? And is there any parameter I can use to improve the load performance from Pig? Currently I am only using hbase.scanner.caching to reduce round trips. Thanks for your help!
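For reference, the only tuning I have in the Pig script right now is the scanner caching, set like this (I am not even certain this is the right property name):

SET hbase.scanner.caching 500;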
05-09-2016
03:00 AM
1 Kudo
I am trying to load an HBase table using Pig from an HDFS file. The file is only 3 GB, with 30,350,496 records, yet it takes a long time to load the table. Pig is running on Tez. Can you please suggest ways to improve the performance, and how to identify where the performance bottleneck is? I am not able to get much from the Pig explain plan. Is there any way to tell whether a single HBase region server is overloaded or whether the load is being distributed properly? How do I inspect the HBase region server splits?
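This is roughly what the load script looks like (the path, table and column names are placeholders):

raw = LOAD '/data/input_file' USING PigStorage('\t')
      AS (rowkey:chararray, col1:chararray, col2:chararray);
STORE raw INTO 'hbase://my_table'
      USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf:col1 cf:col2');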
Labels:
- Apache HBase
- Apache Pig
- Apache Tez
05-09-2016
02:54 AM
1 Kudo
I have a couple of basic questions about Tez DAG tasks, MapReduce jobs, and Pig running on Tez.

1. My understanding is that each vertex of a Tez task can run either mappers or reducers, but not both. Is this right?
2. Each vertex of a Tez task can have its own parallelism. Is this right? Say there are three vertices, one map vertex and two reducer vertices; can each reducer vertex run with a different parallelism? How do I control this?
3. When I look at the Pig explain plan on Tez, I see the vertices and the operations on them, but I do not see the parallelism of each vertex; I only see it when I dump the relation. How can I see the parallelism of each vertex in the Pig explain plan?
4. If I use the PARALLEL clause to control the number of reducers in Pig on Tez, does it control the parallelism of only the vertices running reducers, and does it affect all vertices running reducers? Is there a way to control the parallelism of each vertex separately? (See the small sketch after this list for what I am doing today.)
5. If there are 4 splits of a file, then ideally there would be 4 mappers, right? In that case, would Tez have 4 vertices each running one mapper, or one vertex running 4 mappers?
6. How do I control the number of mappers (or the parallelism of the vertex running the mappers)?
7. While the Pig command is running I can see the total number of tasks, but how do I find the number of tasks in each vertex?
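Here is the small sketch I mentioned in question 4; it is what I am running today to look at the parallelism (names and values are placeholders):

SET default_parallel 10;
data    = LOAD '/data/input' AS (key:chararray, val:int);
grouped = GROUP data BY key PARALLEL 20;          -- per-operator reducer parallelism
counts  = FOREACH grouped GENERATE group, COUNT(data);
explain counts;   -- I see the vertices and operators here, but not each vertex's parallelism
dump counts;      -- only at this point do I see the parallelism reported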
Labels:
- Apache Hadoop
- Apache Pig
- Apache Tez
04-26-2016
05:47 AM
Hi, thanks for your suggestion. I have just started using this, so can you please help me understand a few more things? I found the WebHCat server using Ambari, and the templeton.libjars value in webhcat-site.xml is as follows:

<name>templeton.libjars</name>
<value>/usr/hdp/${hdp.version}/zookeeper,/usr/hdp/${hdp.version}/hive/lib/hive-common.jar/zookeeper.jar</value>

I think this has wrong values or a typo in it, as you suggested, and I do not have access to edit this file.

1. Is there any other way to use WebHCat without editing the webhcat-site.xml file, for example by passing the value as a POST parameter in curl or something similar?
2. My cluster has an edge node. Why is the webhcat-site.xml file present only on the WebHCat server and not on the edge node? The edge node only has webhcat-default.xml; shouldn't all the *-site.xml files be present on the edge node as well?
3. How do I access HiveServer2 via Knox? Is it possible to use HiveServer2 to insert values into a Hive table from outside the cluster using Knox?
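For context, the way I am calling WebHCat today is a plain curl POST like the one below (the host, user and query are placeholders, and I do not know whether something like templeton.libjars can be overridden per request at all):

curl -s -X POST \
     -d user.name=myuser \
     -d execute="select count(*) from sometable;" \
     -d statusdir="/tmp/webhcat_output" \
     "http://webhcat-host:50111/templeton/v1/hive"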