Member since: 05-16-2016
Posts: 270
Kudos Received: 18
Solutions: 4

My Accepted Solutions

Title | Views | Posted
---|---|---
| 1717 | 07-23-2016 11:36 AM
| 3053 | 07-23-2016 11:35 AM
| 1563 | 06-05-2016 10:41 AM
| 1157 | 06-05-2016 10:37 AM
12-14-2016
07:00 AM
1 Kudo
Let's create two Hive tables, table_a and table_b. table_a contains the column you want to aggregate and has only one record per id (i.e. the key):

hive> CREATE TABLE table_a (
    >   id STRING,
    >   quantity INT
    > );
hive> INSERT INTO table_a VALUES (1, 30);
hive> INSERT INTO table_a VALUES (2, 20);
hive> INSERT INTO table_a VALUES (3, 10);

table_b has duplicate ids; note that id=1 appears twice:

hive> CREATE TABLE table_b (
    >   id STRING
    > );
hive> INSERT INTO table_b VALUES (1);
hive> INSERT INTO table_b VALUES (1);
hive> INSERT INTO table_b VALUES (2);
hive> INSERT INTO table_b VALUES (3);
If we aggregate the quantity column in table_a, we see that the aggregated quantity is 60:

hive> SELECT
    >   SUM(quantity)
    > FROM table_a;
60

If we join table_a and table_b together, you can see that the duplicate keys in table_b have caused there to be four rows, and not three:

hive> SELECT
    >   *
    > FROM table_a
    > LEFT JOIN table_b
    > ON table_a.id = table_b.id;
1 30 1
1 30 1
2 20 2
3 10 3

Since joins happen before aggregations, when we aggregate the quantity in table_a, the quantity for id=1 has been duplicated:

hive> SELECT
    >   SUM(quantity)
    > FROM table_a
    > LEFT JOIN table_b
    > ON table_a.id = table_b.id;
90

I suspect that's what's happening with your query.
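If the goal is the correct total, one way out (just a sketch, assuming you only need table_b for the join condition and not for any of its other columns) is to deduplicate table_b before joining:

hive> SELECT
    >   SUM(quantity)
    > FROM table_a
    > LEFT JOIN (SELECT DISTINCT id FROM table_b) b
    > ON table_a.id = b.id;
60

This keeps at most one row per id on the right side, so each quantity is counted exactly once.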
11-05-2016
02:27 AM
1 Kudo
@Simran Kaur Please try hadoop dfsadmin -refreshNodes (on current releases, hdfs dfsadmin -refreshNodes). However, I believe I have seen another question from you asking whether you can add a node in a different data center. If that is for this same cluster, it is not supported, and I highly recommend that you don't spend too much time trying to make it work.
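For context, refreshNodes only makes the NameNode re-read its include/exclude host lists, so it is normally paired with an edit to those files. A minimal sketch, assuming dfs.hosts.exclude points at an excludes file (the path and hostname below are illustrative):

# Add the host to the excludes file referenced by dfs.hosts.exclude
echo "node4.example.com" >> /etc/hadoop/conf/dfs.exclude
# Ask the NameNode to re-read its host lists
hdfs dfsadmin -refreshNodes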
12-08-2017
06:44 PM
Hi, after filtering the file I am not able to load it into Hive. Please help.

Pig Stack Trace
---------------
ERROR 1002: Unable to store alias C

org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1002: Unable to store alias C
    at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1647)
    at org.apache.pig.PigServer.registerQuery(PigServer.java:587)
    at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:1093)
    at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:501)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:198)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:173)
    at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:69)
    at org.apache.pig.Main.run(Main.java:547)
    at org.apache.pig.Main.main(Main.java:158)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: org.apache.pig.impl.plan.VisitorException: ERROR 0: <line 51, column 0> Output Location Validation Failed for: 'haasbat0200_10215.dslam_dlm_table_nokia_test More info to follow:
Pig 'bytearray' type in column 0(0-based) cannot map to HCat 'STRING'type. Target filed must be of HCat type {BINARY}
    at org.apache.pig.newplan.logical.rules.InputOutputFileValidator$InputOutputFileVisitor.visit(InputOutputFileValidator.java:75)
    at org.apache.pig.newplan.logical.relational.LOStore.accept(LOStore.java:66)
    at org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:64)
    at org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:66)
    at org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:66)
    at org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:66)
    at org.apache.pig.newplan.DepthFirstWalker.walk(DepthFirstWalker.java:53)
    at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:52)
    at org.apache.pig.newplan.logical.rules.InputOutputFileValidator.validate(InputOutputFileValidator.java:45)
    at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.compile(HExecutionEngine.java:311)
    at org.apache.pig.PigServer.compilePp(PigServer.java:1392)
    at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1317)
    at org.apache.pig.PigServer.execute(PigServer.java:1309)
    at org.apache.pig.PigServer.access$400(PigServer.java:122)
    at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1642)
    ... 14 more
Caused by: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 0: Pig 'bytearray' type in column 0(0-based) cannot map to HCat 'STRING'type. Target filed must be of HCat type {BINARY}
    at org.apache.hive.hcatalog.pig.HCatBaseStorer.throwTypeMismatchException(HCatBaseStorer.java:602)
    at org.apache.hive.hcatalog.pig.HCatBaseStorer.validateSchema(HCatBaseStorer.java:558)
    at org.apache.hive.hcatalog.pig.HCatBaseStorer.doSchemaValidations(HCatBaseStorer.java:495)
    at org.apache.hive.hcatalog.pig.HCatStorer.setStoreLocation(HCatStorer.java:201)
    at org.apache.pig.newplan.logical.rules.InputOutputFileValidator$InputOutputFileVisitor.visit(InputOutputFileValidator.java:68)
    ... 28 more
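The last "Caused by" points at the problem: column 0 of alias C is an untyped Pig bytearray, while the target Hive column is STRING, and HCatStorer refuses that mapping. A minimal sketch of one fix, assuming C has more than one field and only the first needs typing (the alias name D is illustrative):

-- Cast the untyped first field to chararray so HCatStorer can map it to the Hive STRING column
D = FOREACH C GENERATE (chararray)$0, $1 ..;
STORE D INTO 'haasbat0200_10215.dslam_dlm_table_nokia_test' USING org.apache.hive.hcatalog.pig.HCatStorer();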
10-11-2016
03:15 PM
Hi @Simran Kaur. Edge/client nodes exist only to give users access to the cluster. That said, they are not mandatory for a Hadoop cluster, since users can get access through other means (e.g. Ambari views, Zeppelin, WebHDFS, HDFS mounts, and others), so edge/client nodes are a bit of a side issue here. The main architecture of Hadoop is the master-slave architecture of its services: at the highest level, a service typically has a master that manages a job and slaves that do the work, distributed across the cluster. These are never on an edge node; an edge node merely lets the user communicate with the master service.
10-11-2016
09:58 AM
3 Kudos
@Simran Kaur Please find answers inline.

1. A node means a server, right? No VMs? - A node means a server, and a server can be either physical hardware or a virtual machine.

2. How many servers would I need to add to have a healthy cluster? - It depends on what type of configuration you use for production; generally this is a broader question to discuss. I would recommend deploying each master service on its own node, and adding slave nodes as per your requirement. In case of HA you need to revisit the placement of the services, for example:

master1 - Active NN, ZK, JN
master2 - Standby NN, ZK, JN, RM, AM, HS
master3 - Ambari, ZK, Hive, Sqoop, Oozie, Hue, Ranger, etc.
Slave nodes - DN, NM, etc.

3. Which of the above mentioned services should be co-located? - For HDFS, make sure JNs run on both NameNodes, and if possible give JN and ZK dedicated disks.

4. What should the distribution be like? - You can go for an n-1 distribution, where n is the latest stable HDP release. You can migrate services after installation.
11-23-2017
05:46 AM
If I am using the above function, I am getting NULL. Please reply.
10-04-2016
06:34 PM
1 Kudo
Assuming 'id' is never null in either table:

select case when s.id is null then t.id else s.id end,
       case when s.id is null then t.count else s.count end
from old_data t
full outer join new_data s
  on t.id = s.id

Though what you really want is the MERGE statement, which is WIP (https://issues.apache.org/jira/browse/HIVE-10924).
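As a side note, the same logic reads a bit more compactly with COALESCE, which Hive supports, assuming count is non-null whenever its row's id is non-null:

select coalesce(s.id, t.id),
       coalesce(s.count, t.count)
from old_data t
full outer join new_data s
  on t.id = s.id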
09-15-2016
06:00 PM
NiFi/HDF is the way to go: very easy, with a huge number of sources.

https://community.hortonworks.com/articles/52415/processing-social-media-feeds-in-stream-with-apach.html
https://community.hortonworks.com/content/kbentry/47854/accessing-facebook-page-data-from-apache-nifi.html
https://community.hortonworks.com/articles/46258/iot-example-in-apache-nifi-consuming-and-producing.html
https://community.hortonworks.com/articles/45531/using-apache-nifi-070s-new-putslack-processor.html
https://community.hortonworks.com/articles/45706/using-the-new-hiveql-processors-in-apache-nifi-070.html
https://community.hortonworks.com/content/kbentry/44018/create-kafka-topic-and-use-from-apache-nifi-for-hd.html
https://community.hortonworks.com/content/kbentry/55839/reading-sensor-data-from-remote-sensors-on-raspber.html
09-11-2016
10:44 PM
@Simran Kaur You should set TBLPROPERTIES on your table to treat blanks as NULL. See https://community.hortonworks.com/questions/21001/i-have-few-column-to-make-external-table-in-hive-f.html

TBLPROPERTIES ('serialization.null.format'='')
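For an already-created table, a minimal sketch (the table name is illustrative; this applies to the default text SerDe, and other SerDes may ignore the property):

hive> ALTER TABLE your_table SET TBLPROPERTIES ('serialization.null.format'='');

After that, empty fields in the underlying files are read back as NULL.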
08-31-2016
01:25 PM
1 Kudo
Without seeing the input file, you may need a pre-processing step in Pig, or a second CREATE TABLE AS step, in order to reformat the data correctly. It's great to use csv-serde since it does such a good job of stripping out quoted text, among other things, but you may need that extra processing to use csv-serde effectively (handling NULLs and double quotes the way you want).
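As a rough sketch of the second-step approach (the table and column names, and the blank-to-NULL rule, are assumptions about your data):

hive> CREATE TABLE clean_table AS
    > SELECT
    >   CASE WHEN trim(col1) = '' THEN NULL ELSE trim(col1) END AS col1,
    >   regexp_replace(col2, '^"|"$', '') AS col2
    > FROM raw_csv_table;

The CASE turns blank strings into real NULLs, and the regexp_replace strips any leading or trailing double quotes the SerDe left behind.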