Member since: 05-02-2019
Posts: 319
Kudos Received: 145
Solutions: 59
My Accepted Solutions
Title | Views | Posted |
---|---|---|
| | 6998 | 06-03-2019 09:31 PM |
| | 1672 | 05-22-2019 02:38 AM |
| | 2123 | 05-22-2019 02:21 AM |
| | 1321 | 05-04-2019 08:17 PM |
| | 1628 | 04-14-2019 12:06 AM |
03-21-2016
01:00 AM
2 Kudos
I ran into similar trouble a while back, as described at https://martin.atlassian.net/wiki/x/C4BRAQ. Try replacing REGISTER /tmp/stackexchange/piggybank.jar with REGISTER 'hdfs:///tmp/stackexchange/piggybank.jar' and let us know if that works.
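Here is a minimal sketch of how the top of the script might look with the fully qualified JAR location; the input path and the CSVExcelStorage loader are just placeholders for whatever your script actually uses:

REGISTER 'hdfs:///tmp/stackexchange/piggybank.jar';
-- placeholder load; substitute your real input and loader
posts = LOAD '/tmp/stackexchange/posts.csv' USING org.apache.pig.piggybank.storage.CSVExcelStorage();
DUMP posts;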
03-20-2016
10:30 PM
Yep, that could work. Putting it in HBase could also allow you to maintain versions of the record over time. Good luck and feel free to share more.
03-17-2016
10:41 PM
2 Kudos
A couple of observations and a few recommendations. First, if you are trying to run the Pig script from the Linux command line, I would recommend you save your Pig script locally and then run it. Also, you don't really need to fully qualify the location of the input file as you are doing above. Here is a walkthrough of something like what you are doing now, all from the command line.

SSH to the Sandbox and become maria_dev. I have an earlier 2.4 version and it does not have a local maria_dev user account (she does have an account in Ambari as well as an HDFS home directory), so I had to create that first as shown below. If the first "su" command works, then skip the "useradd" command. Then verify she has an HDFS home directory.

HW10653-2:~ lmartin$ ssh root@127.0.0.1 -p 2222
root@127.0.0.1's password:
Last login: Tue Mar 15 22:14:09 2016 from 10.0.2.2
[root@sandbox ~]# su maria_dev
su: user maria_dev does not exist
[root@sandbox ~]# useradd -m -s /bin/bash maria_dev
[root@sandbox ~]# su - maria_dev
[maria_dev@sandbox ~]$ hdfs dfs -ls /user
Found 17 items <<NOTE: I deleted all except the one I was looking for...
drwxr-xr-x - maria_dev hdfs 0 2016-03-14 22:49 /user/maria_dev

Then copy a file to HDFS that you can then later read.

[maria_dev@sandbox ~]$ hdfs dfs -put /etc/hosts hosts.txt
[maria_dev@sandbox ~]$ hdfs dfs -cat /user/maria_dev/hosts.txt
# File is generated from /usr/lib/hue/tools/start_scripts/gen_hosts.sh
# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1    localhost.localdomain localhost
10.0.2.15    sandbox.hortonworks.com sandbox ambari.hortonworks.com

Now put the following two lines of code into a LOCAL file called runme.pig, as shown when listing it below.

[maria_dev@sandbox ~]$ pwd
/home/maria_dev
[maria_dev@sandbox ~]$ cat runme.pig
data = LOAD '/user/maria_dev/hosts.txt';
DUMP data;

Then just run it (remember, no dashes!!). NOTE: many lines removed from the logging output that is bundled in with the DUMP of the hosts.txt file.

[maria_dev@sandbox ~]$ pig runme.pig
... REMOVED A BUNCH OF LOGGING MESSAGES ...
2016-03-17 22:38:45,636 [main] INFO org.apache.pig.tools.pigstats.mapreduce.SimplePigStats - Script Statistics:
HadoopVersion    PigVersion    UserId    StartedAt    FinishedAt    Features
2.7.1.2.4.0.0-169    0.15.0.2.4.0.0-169    maria_dev    2016-03-17 22:38:10    2016-03-17 22:38:45    UNKNOWN
Success!
Job Stats (time in seconds):
JobId    Maps    Reduces    MaxMapTime    MinMapTime    AvgMapTime    MedianMapTime    MaxReduceTime    MinReduceTime    AvgReduceTime    MedianReducetime    Alias    Feature    Outputs
job_1458253459880_0001    1    0    7    7    7    7    0    data    MAP_ONLY    hdfs://sandbox.hortonworks.com:8020/tmp/temp212662320/tmp-490136848,
Input(s):
Successfully read 5 records (670 bytes) from: "/user/maria_dev/hosts.txt"
Output(s):
Successfully stored 5 records (310 bytes) in: "hdfs://sandbox.hortonworks.com:8020/tmp/temp212662320/tmp-490136848"
Counters:
Total records written : 5
Total bytes written : 310
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
job_1458253459880_0001
... REMOVED ABOUT 10 MORE LOGGING MESSAGES ...
.... THE NEXT BIT IS THE RESULTS OF THE DUMP COMMAND ....
(# File is generated from /usr/lib/hue/tools/start_scripts/gen_hosts.sh)
(# Do not remove the following line, or various programs)
(# that require network functionality will fail.)
(127.0.0.1,,localhost.localdomain localhost)
(10.0.2.15,sandbox.hortonworks.com sandbox ambari.hortonworks.com)
2016-03-17 22:38:46,662 [main] INFO org.apache.pig.Main - Pig script completed in 43 seconds and 385 milliseconds (43385 ms)
[maria_dev@sandbox ~]$

Does this work for you?? If so, then you can run a Pig script from the CLI and remember... you do NOT need all the fully qualified naming junk when running this way. GOOD LUCK!
03-16-2016
11:22 AM
As http://pig.apache.org/docs/r0.15.0/perf.html#replicated-joins indicates, replicated joins allow for a much more efficient map-side join instead of the standard reduce-side join. Simply put, they make sense when the smaller relation you are joining on can fit into memory. In Hive this happens automatically when possible, but in Pig we have to explicitly ask for it. The explain plan should show you the difference, and remember that you are responsible for testing that this will work; the script will fail if the "smaller" relation does not fit into memory. NOTE: Pig's replicated joins only work for inner and left outer joins, because when a given map task sees a record in the replicated (small) input that does not match any records in the fragment (large) input, it doesn't know whether there might be a matching record in another map task, so it can't determine whether to emit a record in the current task. The replicated (small) input always needs to come last, or right, in the join order, which is why this technique can't be used for right or full outer joins.
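Here is a minimal sketch of the syntax, assuming a large relation named orders and a small lookup relation named countries; the names, paths, and schemas are just placeholders:

orders = LOAD '/user/it1/orders.txt' USING PigStorage(',') AS (id:int, country_code:chararray, amount:double);
countries = LOAD '/user/it1/countries.txt' USING PigStorage(',') AS (code:chararray, name:chararray);
-- the small relation is listed last so it can be replicated to every map task
joined = JOIN orders BY country_code, countries BY code USING 'replicated';
EXPLAIN joined;

Running EXPLAIN with and without USING 'replicated' should show you the map-side versus reduce-side difference described above.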
03-15-2016
11:18 PM
1 Kudo
OK on docs.hortonworks.com, but I'm wondering which websites you can access for the dev exam?
03-15-2016
11:16 PM
2 Kudos
Can you run "hdfs dfs -ls /" to see what is in the root directory? It looks like you are looking for "/a.pig", which would be in the root of the file system. Is that where you have stored your Pig script?
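For reference, a quick pair of commands to check both locations; the it1 user name here is just an example and may not match your setup:

hdfs dfs -ls /           # the HDFS root directory, where "/a.pig" would have to live
hdfs dfs -ls /user/it1   # a user home directory, where relative paths resolve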
03-15-2016
05:04 PM
2 Kudos
Don't supply the dash, so just type "pig risk.pig". If you want to guarantee you run it with Tez, then type "pig -x tez risk.pig". Well... that's assuming that risk.pig is on the local file system, not HDFS. Are you trying to run a Pig script that is stored on HDFS, or are you trying to reference a file to read from within your Pig script? If the latter, then you shouldn't need the full HDFS path, just the path such as "/user/it1/data.txt". If the script is on HDFS then you should be able to run it with "pig hdfs://nn.mydomain.com:9020/myscripts/script.pig" as described in http://pig.apache.org/docs/r0.15.0/start.html#batch-mode.
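To summarize the two cases, here is a quick sketch; risk.pig, the Tez flag, and the namenode address are the same placeholders used above, not values from your cluster:

# script saved on the LOCAL file system (no dash before the script name)
pig risk.pig
pig -x tez risk.pig
# script stored on HDFS (fully qualified URI, per the Pig batch-mode docs)
pig hdfs://nn.mydomain.com:9020/myscripts/script.pig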
03-15-2016
05:59 AM
1 Kudo
It is no wonder this one has sat unanswered for a few days, as it really is a big "it depends" question. That said, please check out my "Mutable Data in Hive's Immutable World" talk from the 2015 Hadoop Summit conference. The video is at https://www.youtube.com/watch?v=EUz6Pu1lBHQ and you can retrieve the presentation deck at http://www.slideshare.net/lestermartin/mutable-data-in-hives-immutable-world. Again, this is a BIG topic and this presentation talks through some of the classically simple & novel solutions that have worked for many people for a while. It would be good to compare & contrast the thoughts presented in this talk with those presented by others, as your end solution might be completely different from the approaches I discuss. GOOD LUCK!!
03-14-2016
10:06 PM
3 Kudos
You can look for the following stanza in /etc/hadoop/conf/hdfs-site.xml (this KVP can also be found in Ambari: Services > HDFS > Configs > Advanced > Advanced hdfs-site > dfs.namenode.rpc-address).

<property>
<name>dfs.namenode.rpc-address</name>
<value>sandbox.hortonworks.com:8020</value>
</property>

Then plug that value into your request.

[root@sandbox conf]# hadoop fs -ls hdfs://sandbox.hortonworks.com:8020/user/it1
Found 5 items
drwx------ - it1 hdfs 0 2016-03-07 06:16 hdfs://sandbox.hortonworks.com:8020/user/it1/.staging
drwxr-xr-x - it1 hdfs 0 2016-03-07 02:47 hdfs://sandbox.hortonworks.com:8020/user/it1/2016-03-07-02-47-10
drwxr-xr-x - it1 hdfs 0 2016-03-07 06:16 hdfs://sandbox.hortonworks.com:8020/user/it1/avg-output
drwxr-xr-x - maria_dev hdfs 0 2016-03-13 23:42 hdfs://sandbox.hortonworks.com:8020/user/it1/geolocation
drwxr-xr-x - it1 hdfs 0 2016-03-07 02:44 hdfs://sandbox.hortonworks.com:8020/user/it1/input

For HA solutions, you would use dfs.ha.namenodes.mycluster as described in http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.0/bk_hadoop-ha/content/ha-nn-config-cluster.html.
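For context, an HA hdfs-site.xml names the nameservice and its NameNodes along these lines; mycluster, nn1, nn2, and the host names below follow the pattern in the HA docs and are not values from your cluster:

<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn1</name>
  <value>machine1.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn2</name>
  <value>machine2.example.com:8020</value>
</property>

With that in place, clients can refer to hdfs://mycluster/... and let the failover proxy provider pick the active NameNode.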
03-09-2016
03:08 PM
3 Kudos
Looks like others have reported the same problem before. See http://grokbase.com/t/cloudera/hue-user/137axc7mpm/upload-file-over-64mb-via-hue and https://issues.cloudera.org/browse/HUE-2782. I do agree with HUE-2782's "we need to intentionally limit upload file size", since this is a web app and probably isn't the right tool once files reach a certain size. Glad to hear "hdfs dfs -put" is working fine. On the flip side, I did test this out a bit with the HDFS Files "Ambari View" that is available with the 2.4 Sandbox and, as you can see from the screenshot, user maria_dev was able to load an 80MB file to her home directory via the web interface, as well as a 500+MB file. I'm sure this Ambari View also has some upper limit. Maybe it is time to start thinking about moving from Hue to Ambari Views??