Member since: 05-16-2016
Posts: 270
Kudos Received: 18
Solutions: 4

My Accepted Solutions

Title | Views | Posted
---|---|---
| 1717 | 07-23-2016 11:36 AM
| 3053 | 07-23-2016 11:35 AM
| 1563 | 06-05-2016 10:41 AM
| 1157 | 06-05-2016 10:37 AM
12-14-2016
07:00 AM
1 Kudo
Let's create two Hive tables, table_a and table_b. table_a contains the column you want to aggregate and has only one record per id (i.e. the key):

hive> CREATE TABLE table_a (
    >   id STRING,
    >   quantity INT
    > );
hive> INSERT INTO table_a VALUES (1, 30);
hive> INSERT INTO table_a VALUES (2, 20);
hive> INSERT INTO table_a VALUES (3, 10);

table_b has duplicate ids; note that id=1 appears twice:

hive> CREATE TABLE table_b (
    >   id STRING
    > );
hive> INSERT INTO table_b VALUES (1);
hive> INSERT INTO table_b VALUES (1);
hive> INSERT INTO table_b VALUES (2);
hive> INSERT INTO table_b VALUES (3);
If we aggregate the quantity column in table_a, we see that the aggregated quantity is 60:

hive> SELECT
    >   SUM(quantity)
    > FROM table_a;
60

If we join table_a and table_b together, you can see that the duplicate keys in table_b have caused there to be four rows, and not three:

hive> SELECT
    >   *
    > FROM table_a
    > LEFT JOIN table_b
    > ON table_a.id = table_b.id;
1 30 1
1 30 1
2 20 2
3 10 3

Since joins happen before aggregations, when we aggregate the quantity in table_a, the quantity for id=1 has been duplicated:

hive> SELECT
    >   SUM(quantity)
    > FROM table_a
    > LEFT JOIN table_b
    > ON table_a.id = table_b.id;
90

I suspect that's what's happening with your query.
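If the goal is the correct total, one way out (just a sketch, assuming you only need table_b for the join condition and not for any of its other columns) is to deduplicate table_b before joining:

hive> SELECT
    >   SUM(quantity)
    > FROM table_a
    > LEFT JOIN (SELECT DISTINCT id FROM table_b) b
    > ON table_a.id = b.id;
60

This keeps at most one row per id on the right side, so each quantity is counted exactly once.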
11-05-2016
02:27 AM
1 Kudo
@Simran Kaur Please try hadoop dfsadmin -refreshNodes (on current releases, hdfs dfsadmin -refreshNodes). However, I believe I have seen another question from you asking whether you can add a node in a different data center. If that is for this same cluster, it is not supported, and I highly recommend that you don't spend too much time trying to make it work.
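For context, refreshNodes only makes the NameNode re-read its include/exclude host lists, so it is normally paired with an edit to those files. A minimal sketch, assuming dfs.hosts.exclude points at an excludes file (the path and hostname below are illustrative):

# Add the host to the excludes file referenced by dfs.hosts.exclude
echo "node4.example.com" >> /etc/hadoop/conf/dfs.exclude
# Ask the NameNode to re-read its host lists
hdfs dfsadmin -refreshNodes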
12-08-2017
06:44 PM
Hi, after filtering the file I am not able to load it into Hive. Please help.

Pig Stack Trace
---------------
ERROR 1002: Unable to store alias C

org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1002: Unable to store alias C
    at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1647)
    at org.apache.pig.PigServer.registerQuery(PigServer.java:587)
    at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:1093)
    at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:501)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:198)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:173)
    at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:69)
    at org.apache.pig.Main.run(Main.java:547)
    at org.apache.pig.Main.main(Main.java:158)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: org.apache.pig.impl.plan.VisitorException: ERROR 0: <line 51, column 0> Output Location Validation Failed for: 'haasbat0200_10215.dslam_dlm_table_nokia_test More info to follow:
Pig 'bytearray' type in column 0(0-based) cannot map to HCat 'STRING'type. Target filed must be of HCat type {BINARY}
    at org.apache.pig.newplan.logical.rules.InputOutputFileValidator$InputOutputFileVisitor.visit(InputOutputFileValidator.java:75)
    at org.apache.pig.newplan.logical.relational.LOStore.accept(LOStore.java:66)
    at org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:64)
    at org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:66)
    at org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:66)
    at org.apache.pig.newplan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:66)
    at org.apache.pig.newplan.DepthFirstWalker.walk(DepthFirstWalker.java:53)
    at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:52)
    at org.apache.pig.newplan.logical.rules.InputOutputFileValidator.validate(InputOutputFileValidator.java:45)
    at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.compile(HExecutionEngine.java:311)
    at org.apache.pig.PigServer.compilePp(PigServer.java:1392)
    at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1317)
    at org.apache.pig.PigServer.execute(PigServer.java:1309)
    at org.apache.pig.PigServer.access$400(PigServer.java:122)
    at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1642)
    ... 14 more
Caused by: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 0: Pig 'bytearray' type in column 0(0-based) cannot map to HCat 'STRING'type. Target filed must be of HCat type {BINARY}
    at org.apache.hive.hcatalog.pig.HCatBaseStorer.throwTypeMismatchException(HCatBaseStorer.java:602)
    at org.apache.hive.hcatalog.pig.HCatBaseStorer.validateSchema(HCatBaseStorer.java:558)
    at org.apache.hive.hcatalog.pig.HCatBaseStorer.doSchemaValidations(HCatBaseStorer.java:495)
    at org.apache.hive.hcatalog.pig.HCatStorer.setStoreLocation(HCatStorer.java:201)
    at org.apache.pig.newplan.logical.rules.InputOutputFileValidator$InputOutputFileVisitor.visit(InputOutputFileValidator.java:68)
    ... 28 more
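The last "Caused by" points at the problem: column 0 of alias C is an untyped Pig bytearray, while the target Hive column is STRING, and HCatStorer refuses that mapping. A minimal sketch of one fix, assuming C has more than one field and only the first needs typing (the alias name D is illustrative):

-- Cast the untyped first field to chararray so HCatStorer can map it to the Hive STRING column
D = FOREACH C GENERATE (chararray)$0, $1 ..;
STORE D INTO 'haasbat0200_10215.dslam_dlm_table_nokia_test' USING org.apache.hive.hcatalog.pig.HCatStorer();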
10-11-2016
03:15 PM
Hi @Simran Kaur. Edge/client nodes exist only to give users access to the cluster. That said, they are not mandatory for a Hadoop cluster, since users can get access through other means (e.g. Ambari views, Zeppelin, WebHDFS, HDFS mounts, and others), so edge/client nodes are a bit of a side issue here. The main architecture of Hadoop is the master-slave architecture of its services: at the highest level, a service typically has a master that manages a job and slaves that do the work, distributed across the cluster. These are never on an edge node; an edge node merely lets the user communicate with the master service.
10-11-2016
09:58 AM
3 Kudos
@Simran Kaur Please find answers inline.

1. A node means a server, right? No VMs? - A node means a server, and a server can be either physical hardware or a virtual machine.

2. How many servers would I need to add to have a healthy cluster? - It depends on what type of configuration you use for production; generally this is a broader question to discuss. I would recommend deploying each master service on its own node, and adding slave nodes as per your requirement. In case of HA you need to revisit the placement of the services, for example:

master1 - Active NN, ZK, JN
master2 - Standby NN, ZK, JN, RM, AM, HS
master3 - Ambari, ZK, Hive, Sqoop, Oozie, Hue, Ranger, etc.
Slave nodes - DN, NM, etc.

3. Which of the above mentioned services should be co-located? - For HDFS, make sure JNs run on both NameNodes, and if possible give JN and ZK dedicated disks.

4. What should the distribution be like? - You can go for an n-1 distribution, where n is the latest stable HDP release. You can migrate services after installation.
11-23-2017
05:46 AM
If I am using the above function, I am getting NULL. Please reply.
10-04-2016
06:34 PM
1 Kudo
Assuming 'id' is never null in either table:

select case when s.id is null then t.id else s.id end,
       case when s.id is null then t.count else s.count end
from old_data t
full outer join new_data s
  on t.id = s.id

Though what you really want is the MERGE statement, which is WIP (https://issues.apache.org/jira/browse/HIVE-10924).
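As a side note, the same logic reads a bit more compactly with COALESCE, which Hive supports, assuming count is non-null whenever its row's id is non-null:

select coalesce(s.id, t.id),
       coalesce(s.count, t.count)
from old_data t
full outer join new_data s
  on t.id = s.id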
09-15-2016
06:00 PM
NiFi/HDF is the way to go: very easy, with a huge number of sources.

https://community.hortonworks.com/articles/52415/processing-social-media-feeds-in-stream-with-apach.html
https://community.hortonworks.com/content/kbentry/47854/accessing-facebook-page-data-from-apache-nifi.html
https://community.hortonworks.com/articles/46258/iot-example-in-apache-nifi-consuming-and-producing.html
https://community.hortonworks.com/articles/45531/using-apache-nifi-070s-new-putslack-processor.html
https://community.hortonworks.com/articles/45706/using-the-new-hiveql-processors-in-apache-nifi-070.html
https://community.hortonworks.com/content/kbentry/44018/create-kafka-topic-and-use-from-apache-nifi-for-hd.html
https://community.hortonworks.com/content/kbentry/55839/reading-sensor-data-from-remote-sensors-on-raspber.html
09-11-2016
10:44 PM
@Simran Kaur You should set TBLPROPERTIES on your table to treat blanks as NULL. See https://community.hortonworks.com/questions/21001/i-have-few-column-to-make-external-table-in-hive-f.html

TBLPROPERTIES ('serialization.null.format'='')
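For an already-created table, a minimal sketch (the table name is illustrative; this applies to the default text SerDe, and other SerDes may ignore the property):

hive> ALTER TABLE your_table SET TBLPROPERTIES ('serialization.null.format'='');

After that, empty fields in the underlying files are read back as NULL.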
08-31-2016
01:25 PM
1 Kudo
Without seeing the input file, you may need a pre-processing step in Pig, or a second CREATE TABLE AS step, in order to reformat the data correctly. It's great to use csv-serde since it does such a good job of stripping out quoted text, among other things, but you may need that extra processing to use csv-serde effectively (handling NULLs and double quotes the way you want).
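As a rough sketch of the second-step approach (the table and column names, and the blank-to-NULL rule, are assumptions about your data):

hive> CREATE TABLE clean_table AS
    > SELECT
    >   CASE WHEN trim(col1) = '' THEN NULL ELSE trim(col1) END AS col1,
    >   regexp_replace(col2, '^"|"$', '') AS col2
    > FROM raw_csv_table;

The CASE turns blank strings into real NULLs, and the regexp_replace strips any leading or trailing double quotes the SerDe left behind.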