Member since: 05-02-2019
Posts: 319
Kudos Received: 145
Solutions: 59
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 6997 | 06-03-2019 09:31 PM
 | 1671 | 05-22-2019 02:38 AM
 | 2123 | 05-22-2019 02:21 AM
 | 1321 | 05-04-2019 08:17 PM
 | 1628 | 04-14-2019 12:06 AM
03-21-2016
01:00 AM
2 Kudos
I had similar trouble a while back, as shown at https://martin.atlassian.net/wiki/x/C4BRAQ. Try replacing REGISTER /tmp/stackexchange/piggybank.jar with REGISTER 'hdfs:///tmp/stackexchange/piggybank.jar' and let us know if that works.
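In other words, the one-line change in your Pig script (the /tmp/stackexchange path is just carried over from your example) would look like this:

-- before: bare path, which is what was failing
REGISTER /tmp/stackexchange/piggybank.jar

-- after: fully qualified HDFS URI, quoted
REGISTER 'hdfs:///tmp/stackexchange/piggybank.jar'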
03-20-2016
10:30 PM
Yep, that could work. Putting it in HBase would also let you keep multiple versions of the record. Good luck and feel free to share more.
03-17-2016
10:41 PM
2 Kudos
A couple of observations and a few recommendations. First, if you are trying to run the Pig script from the Linux command line, I would recommend you save your Pig script locally and then run it. Also, you don't really need to fully qualify the location of the input file like you are doing above. Here is a walkthrough of something like you are doing now, all from the command line.

SSH to the Sandbox and become maria_dev. I have an earlier 2.4 version and it does not have a local maria_dev user account (she does have an account in Ambari as well as an HDFS home directory), so I had to create that first as shown below. If the first "su" command works, then skip the "useradd" command. Then verify she has an HDFS home directory.

HW10653-2:~ lmartin$ ssh root@127.0.0.1 -p 2222
root@127.0.0.1's password:
Last login: Tue Mar 15 22:14:09 2016 from 10.0.2.2
[root@sandbox ~]# su maria_dev
su: user maria_dev does not exist
[root@sandbox ~]# useradd -m -s /bin/bash maria_dev
[root@sandbox ~]# su - maria_dev
[maria_dev@sandbox ~]$ hdfs dfs -ls /user
Found 17 items <<NOTE: I deleted all except the one I was looking for...
drwxr-xr-x - maria_dev hdfs 0 2016-03-14 22:49 /user/maria_dev

Then copy a file to HDFS that you can then later read.

[maria_dev@sandbox ~]$ hdfs dfs -put /etc/hosts hosts.txt
[maria_dev@sandbox ~]$ hdfs dfs -cat /user/maria_dev/hosts.txt
# File is generated from /usr/lib/hue/tools/start_scripts/gen_hosts.sh
# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1    localhost.localdomain localhost
10.0.2.15    sandbox.hortonworks.com sandbox ambari.hortonworks.com

Now put the following two lines of code into a LOCAL file called runme.pig, as shown when listing it below.

[maria_dev@sandbox ~]$ pwd
/home/maria_dev
[maria_dev@sandbox ~]$ cat runme.pig
data = LOAD '/user/maria_dev/hosts.txt';
DUMP data;

Then just run it (remember, no dashes!!). NOTE: many lines removed from the logging output that is bundled in with the DUMP of the hosts.txt file.

[maria_dev@sandbox ~]$ pig runme.pig
... REMOVED A BUNCH OF LOGGING MESSAGES ...
2016-03-17 22:38:45,636 [main] INFO org.apache.pig.tools.pigstats.mapreduce.SimplePigStats - Script Statistics:
HadoopVersion        PigVersion          UserId     StartedAt            FinishedAt           Features
2.7.1.2.4.0.0-169    0.15.0.2.4.0.0-169  maria_dev  2016-03-17 22:38:10  2016-03-17 22:38:45  UNKNOWN
Success!
Job Stats (time in seconds):
JobId  Maps  Reduces  MaxMapTime  MinMapTime  AvgMapTime  MedianMapTime  MaxReduceTime  MinReduceTime  AvgReduceTime  MedianReducetime  Alias  Feature  Outputs
job_1458253459880_00011077770dataMAP_ONLYhdfs://sandbox.hortonworks.com:8020/tmp/temp212662320/tmp-490136848,
Input(s):
Successfully read 5 records (670 bytes) from: "/user/maria_dev/hosts.txt"
Output(s):
Successfully stored 5 records (310 bytes) in: "hdfs://sandbox.hortonworks.com:8020/tmp/temp212662320/tmp-490136848"
Counters:
Total records written : 5
Total bytes written : 310
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
job_1458253459880_0001
... REMOVED ABOUT 10 MORE LOGGING MESSAGES ...
.... THE NEXT BIT IS THE RESULTS OF THE DUMP COMMAND ....
(# File is generated from /usr/lib/hue/tools/start_scripts/gen_hosts.sh)
(# Do not remove the following line, or various programs)
(# that require network functionality will fail.)
(127.0.0.1,,localhost.localdomain localhost)
(10.0.2.15,sandbox.hortonworks.com sandbox ambari.hortonworks.com)
2016-03-17 22:38:46,662 [main] INFO org.apache.pig.Main - Pig script completed in 43 seconds and 385 milliseconds (43385 ms)
[maria_dev@sandbox ~]$

Does this work for you?? If so, then you can run a Pig script from the CLI, and remember... you do NOT need all the fully qualified naming junk if running this way. GOOD LUCK!
03-16-2016
11:22 AM
As http://pig.apache.org/docs/r0.15.0/perf.html#replicated-joins indicates, replicated joins allow for a much more efficient map-side join instead of the standard reduce-side join. Simply put, it makes sense when the smaller relation you are joining on can fit into memory. In Hive this happens automatically when possible, but in Pig we have to explicitly ask for it. The explain plan should show you the difference, and remember that you are responsible for testing that this will work, as the script will fail if the "smaller" relation does not fit into memory. NOTE: Pig's replicated joins only work for inner or left outer joins because when a given map task sees a record in the replicated (small) input which does not match any records in the fragment (large) input, it doesn't know whether there might be a matching record in another map task, so it can't determine whether to emit a record or not in the current task. The replicated (small) input always needs to come last – or right – in the join order, which is why this technique can't be used for right or full outer joins.
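To make that concrete, here is a minimal sketch in Pig Latin; the relation names, paths, and schemas are made up purely for illustration:

-- 'big' is the large (fragment) relation; 'lookup' is small enough to fit in memory
big = LOAD '/user/maria_dev/big_data.txt' USING PigStorage('\t') AS (id:int, value:chararray);
lookup = LOAD '/user/maria_dev/small_lookup.txt' USING PigStorage('\t') AS (id:int, name:chararray);

-- the replicated (small) relation must be listed last in the join
joined = JOIN big BY id, lookup BY id USING 'replicated';
DUMP joined;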
03-15-2016
11:18 PM
1 Kudo
OK on docs.hortonworks.com, but I'm wondering what websites you can access for the dev exam?
03-15-2016
11:16 PM
2 Kudos
Can you run "hdfs dfs -ls /" to see what is in the root directory? It looks like you are looking for "/a.pig", which would be in the root of the file system; is that where you have stored your Pig script?
03-15-2016
05:04 PM
2 Kudos
Don't supply the dash, so just type "pig risk.pig". If you want to guarantee you run it with Tez, then type "pig -x tez risk.pig". Well... that's assuming that risk.pig is on the local file system, not HDFS. Are you trying to run a Pig script that is stored on HDFS, or are you trying, within your Pig script, to reference a file to read? If the latter, then you shouldn't need the full HDFS path, just the directory and file such as "/user/it1/data.txt". If the script itself is on HDFS, then you should be able to run it with "pig hdfs://nn.mydomain.com:9020/myscripts/script.pig" as described in http://pig.apache.org/docs/r0.15.0/start.html#batch-mode.
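Summarizing the invocations above (the script names are just the ones from your question and the Pig docs):

# run a local script with the default execution engine
pig risk.pig

# force the Tez execution engine
pig -x tez risk.pig

# run a script that is stored on HDFS (example from the Pig docs)
pig hdfs://nn.mydomain.com:9020/myscripts/script.pig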
03-15-2016
05:59 AM
1 Kudo
It is no wonder this one has sat unanswered for a few days, as it really is a big "it depends" question. That said, please check out my "Mutable Data in Hive's Immutable World" talk from the 2015 Hadoop Summit conference. The video is at https://www.youtube.com/watch?v=EUz6Pu1lBHQ and you can retrieve the presentation deck at http://www.slideshare.net/lestermartin/mutable-data-in-hives-immutable-world. Again, this is a BIG topic and this presentation talks through some of the classically simple & novel solutions that have worked for many for a while. It would be good to compare & contrast the thoughts presented in this talk with those presented by others, as your end solution might be completely different from the approaches I discuss. GOOD LUCK!!
03-14-2016
10:06 PM
3 Kudos
You can look for the following stanza in /etc/hadoop/conf/hdfs-site.xml (this KVP can also be found in Ambari: Services > HDFS > Configs > Advanced > Advanced hdfs-site > dfs.namenode.rpc-address).

<property>
<name>dfs.namenode.rpc-address</name>
<value>sandbox.hortonworks.com:8020</value>
</property>

Then plug that value into your request.

[root@sandbox conf]# hadoop fs -ls hdfs://sandbox.hortonworks.com:8020/user/it1
Found 5 items
drwx------ - it1 hdfs 0 2016-03-07 06:16 hdfs://sandbox.hortonworks.com:8020/user/it1/.staging
drwxr-xr-x - it1 hdfs 0 2016-03-07 02:47 hdfs://sandbox.hortonworks.com:8020/user/it1/2016-03-07-02-47-10
drwxr-xr-x - it1 hdfs 0 2016-03-07 06:16 hdfs://sandbox.hortonworks.com:8020/user/it1/avg-output
drwxr-xr-x - maria_dev hdfs 0 2016-03-13 23:42 hdfs://sandbox.hortonworks.com:8020/user/it1/geolocation
drwxr-xr-x - it1 hdfs 0 2016-03-07 02:44 hdfs://sandbox.hortonworks.com:8020/user/it1/input

For HA solutions, you would use dfs.ha.namenodes.mycluster as described in http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.0/bk_hadoop-ha/content/ha-nn-config-cluster.html.
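If you prefer not to dig through the XML by hand, a quick sketch of checking the same values from the command line (assuming the HDFS client configuration is deployed on the node you are logged into) is the hdfs getconf utility:

# list the NameNode host(s) the client configuration points at (shows both NameNodes for an HA cluster)
hdfs getconf -namenodes

# print a specific property, e.g. the RPC address used above
hdfs getconf -confKey dfs.namenode.rpc-address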
03-09-2016
03:08 PM
3 Kudos
Looks like others have reported the same problem before. See http://grokbase.com/t/cloudera/hue-user/137axc7mpm/upload-file-over-64mb-via-hue and https://issues.cloudera.org/browse/HUE-2782. I do agree with HUE-2782's "we need to intentionally limit upload file size", as this is a web app and probably isn't the right tool once files reach a certain size. Glad to hear "hdfs dfs -put" is working fine. On the flip side, I did test this out a bit with the HDFS Files "Ambari View" that is available with the 2.4 Sandbox and, as you can see from the screenshot, user maria_dev was able to load an 80MB file to her home directory via the web interface, as well as a 500+MB file. I'm sure this Ambari View also has some upper limit. Maybe it is time to start thinking about moving from Hue to Ambari Views??
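For reference, the CLI route mentioned above looks something like this (the file name is hypothetical):

# copy a large local file into maria_dev's HDFS home directory
hdfs dfs -put bigfile.csv /user/maria_dev/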