Member since: 05-02-2019
Posts: 319
Kudos Received: 144
Solutions: 58
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 3568 | 06-03-2019 09:31 PM
 | 730 | 05-22-2019 02:38 AM
 | 1051 | 05-22-2019 02:21 AM
 | 603 | 05-04-2019 08:17 PM
 | 774 | 04-14-2019 12:06 AM
07-26-2016
04:54 PM
Sorry, no, I wasn't suggesting you abandon Pig, just that you might need to wrap it with a script or a program that calls your generalized Pig script discretely, since Pig does not inherently have the general-purpose looping constructs we have in other languages. That said, check out my NEW answer and related link, which should be able to do what you want dynamically -- and in ONE line of Pig code!! Good luck!
07-26-2016
04:52 PM
2 Kudos
Better answer: check out my simple example of using MultiStorage at https://martin.atlassian.net/wiki/x/AgCHB. Then, assuming that the "Date" field in your original question is the first one in the record format of "a", the following should get you taken care of.

STORE a INTO '/path'
    USING org.apache.pig.piggybank.storage.MultiStorage(
        '/path', '0', 'none', '\\t');

This would create folders like /path/2016-07-01, each of which will contain the 1+ "part files" for that given date. You could then use that directory location as your input path for another job. Good luck!!
07-26-2016
03:34 PM
I'm not 100% sure of the sequence of events that got you to this point. If they are easily reproducible, please share the steps here and others may be able to look at it for you. Good luck.
07-26-2016
03:32 PM
The project's key top-level wiki pages like https://cwiki.apache.org/confluence/display/Hive/Home and https://cwiki.apache.org/confluence/display/Hive/LanguageManual have been life-savers for me and are often the target of many Google search results. I often find many of my answers in individual blog postings returned by Google searches. And for myself, although I'm now realizing it needs some love, I maintain a "cheat sheet" of links at https://martin.atlassian.net/wiki/x/QIAoAQ that I'd also appreciate update suggestions on. Good luck!
07-26-2016
03:23 PM
5 Kudos
That sounds like everything is working as designed/implemented: Ranger does not currently (as of HDP 2.4) have a supported plug-in for Spark, and when Spark reads Hive tables it really isn't going through the "front door" of Hive to actually run queries (it is reading the files from HDFS directly). That said, the underlying HDFS authorization policies (with or without Ranger) will be honored if they are in place.
07-24-2016
04:32 PM
As you already know, Pig really isn't a general-purpose programming language that accounts for things like this. The "Control Structures" page at http://pig.apache.org/docs/r0.15.0/cont.html gives you the project's recommendations on such things. Generally speaking, a custom script that fires off a generic Pig script, or maybe a Java program, might be your best friend; a sketch of that wrapper pattern follows below. Good luck!
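To make that concrete, here is a minimal sketch of the wrapper-script idea; the date list, the run_date parameter, and daily_etl.pig are hypothetical stand-ins for whatever your generalized script actually takes.

# Hypothetical driver that does the looping Pig itself cannot; daily_etl.pig would
# reference the parameter as $run_date inside the script.
import subprocess

dates = ["2016-07-01", "2016-07-02", "2016-07-03"]  # whatever values you need to iterate over

for d in dates:
    # Each pass launches the same generalized Pig script with a different parameter value.
    subprocess.check_call(["pig", "-param", "run_date=" + d, "daily_etl.pig"])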
07-21-2016
06:35 PM
Good to go then, @Johnny Fugers? No further help needed, right?
07-19-2016
10:48 PM
I know the tutorial at http://hortonworks.com/hadoop-tutorial/hello-world-an-introduction-to-hadoop-hcatalog-hive-and-pig/#section_6 does this slightly differently, as they have the Spark code save a DataFrame as an ORC file (step 4.5.2) and then run a Hive LOAD command (step 4.5.3), but your INSERT-AS-SELECT model sounds like it should work. Maybe someone could test it out if you supplied a simple working example of what you have, so they could try to resolve the issue for you.
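For what it's worth, here is a hedged PySpark sketch (Spark 1.6-era API) of both approaches; the paths and the events_staging/events_final names are hypothetical, and events_final is assumed to be a pre-existing ORC-backed Hive table.

from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="orc-ingest-sketch")
sqlContext = HiveContext(sc)

df = sqlContext.read.json("/tmp/events.json")      # any source DataFrame (hypothetical path)

# Tutorial-style: persist the DataFrame as ORC files that Hive can then LOAD or point at.
df.write.format("orc").save("/tmp/events_orc")

# INSERT-AS-SELECT style: register the DataFrame and push it through HiveQL.
df.registerTempTable("events_staging")
sqlContext.sql("INSERT INTO TABLE events_final SELECT * FROM events_staging")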
07-19-2016
10:39 PM
Maybe a small sample dataset would help so someone could try to run it; the gotcha may not be with the STORE itself, but its presence means all the transformations have to run. If the problem is prior to the STORE, then the error (whatever it is, since we don't have that output) should surface just the same if you used DUMP to display the data. If DUMP works and STORE does not, then it is probably just a permissions issue with the location you are storing to. Again, supply a small sample dataset and I'm sure somebody will chew on it for a few minutes to see if they can figure it out.
07-19-2016
03:20 PM
I agree with the notes identified in the comments section between you and @Benjamin Leonhardi that the gotcha is the zip files. http://stackoverflow.com/questions/17200731/pig-tagsource-doesnt-work-with-multiple-files suggests that you can set pig.splitCombination to false to get over the hump that you may be running multiple files in a single mapper. That said, I did a simple test on the 2.4 Sandbox having three files (named file1.txt, file2.txt and file3.txt) with the following contents.

a,b,c
a,b,c
a,b,c

I then ran the following simple script (tried it with MR as well as Tez as the execution engine).

a = LOAD '/user/maria_dev/multFiles' using PigStorage(' ','-tagFile');
DUMP a;

And I got the following (expected) output.

(file1.txt,a,b,c)
(file1.txt,a,b,c)
(file1.txt,a,b,c)
(file2.txt,a,b,c)
(file2.txt,a,b,c)
(file2.txt,a,b,c)
(file3.txt,a,b,c)
(file3.txt,a,b,c)
(file3.txt,a,b,c)

Good luck!
07-19-2016
01:47 PM
2 Kudos
A quick way to determine the specific versions of core Hadoop (and all the components making up HDP) is to visit the particular HDP version's release notes under http://docs.hortonworks.com. When you are on a box itself, you can read the "cookie crumbs" shown below, which confirm that HDP 2.4.2 uses Hadoop 2.7.1 as you identified above. Hint: look at the long jar name, which includes the Apache version number followed by the HDP version number.

[root@ip-172-30-0-91 hdp]# pwd
/usr/hdp
[root@ip-172-30-0-91 hdp]# ls
2.4.2.0-258 current
[root@ip-172-30-0-91 hdp]# cd current/hadoop-hdfs-client
[root@ip-172-30-0-91 hadoop-hdfs-client]# ls hadoop-hdfs-2*
hadoop-hdfs-2.7.1.2.4.2.0-258.jar
hadoop-hdfs-2.7.1.2.4.2.0-258-tests.jars
As for a Hadoop 2.8 release date, I'm surely not the person who can comment on that, but you can go to https://issues.apache.org/jira/browse/HADOOP/fixforversion/12329058/ to see all the JIRAs currently slated to be part of it. Good luck!
07-19-2016
01:29 PM
THEORETICALLY.... you could move the underlying files on a particular DataNode and put them on another DataNode, but... you'd have to have that DataNode's processes not running while you did it. When the other DataNode starts up with the files you moved to it, it will send a block report that contains the blocks you copied. If that was done in pretty tight synchronization with taking down the original DataNode it ~might~, again THEORETICALLY, work, but... DON'T DO THAT! Seriously, that is a bunny trail you would not really want to explore outside of a learning exercise in a non-production environment to help you understand how the NN and DN processes interoperate. Maintenance mode is not your answer here, as its whole premise is to help Ambari monitoring not send bad alarms about services you intentionally want unavailable.
Generally speaking, decommissioning a DataNode is the way to go, as it gives the NameNode time to stop placing new blocks on the DataNode being decommissioned while redistributing its existing blocks, so you are never under-replicated. If you just delete the node, then you'll be under-replicated until the NameNode can resolve that for you.
06-17-2016
03:05 PM
1 Kudo
What I think you are looking for is a list of the top N categories by highest total sales, and for each of those, the top N items sub-sorted by their own totals. For that, I created the following test data for myself featuring 6 categories, each with 3 products. As with your data (and to keep the example simple) I left in the total sales per category, but if the data did not have this we could calculate it easily enough... NOTE: your 2 rows of data for the Oil category each had a different category total -- I changed that in my test data.

[root@sandbox hcc]# cat raw_sales.txt
CatZ,Prod22-cZ,30,60
CatA,Prod88-cA,15,50
CatY,Prod07-cY,20,40
CatB,Prod18-cB,10,50
CatX,Prod29-cZ,40,60
CatC,Prod09-cC,80,140
CatZ,Prod83-cZ,20,60
CatA,Prod17-cA,25,50
CatY,Prod98-cY,10,40
CatB,Prod99-cB,30,50
CatX,Prod19-cZ,10,60
CatC,Prod73-cC,50,140
CatZ,Prod52-cZ,10,60
CatA,Prod58-cA,15,50
CatY,Prod57-cY,10,40
CatB,Prod58-cB,10,50
CatX,Prod59-cZ,10,60
CatC,Prod59-cC,10,140

That said, the end answer should show CatC (140 total sales) with Prod09-cC & Prod73-cC, as well as CatZ (60 total sales) with its Prod22-cZ and Prod83-cZ. Here's my code. I basically clumped up and ordered the items by category so I could throw away all but the top N of them first. After that, it was basically what you had already done.

[root@sandbox hcc]# cat salesHCC.pig
rawSales = LOAD 'raw_sales.txt' USING PigStorage(',')
AS (category: chararray, product: chararray,
sales: long, total_sales_category: long);
-- group them by the total sales / category combos
grpByCatTotals = GROUP rawSales BY
(total_sales_category, category);
-- put these groups in order from highest to lowest
sortGrpByCatTotals = ORDER grpByCatTotals BY group DESC;
-- just keep the top N
topSalesCats = LIMIT sortGrpByCatTotals 2;
-- do your original logic to get the top sales within the categories
topProdsByTopCats = FOREACH topSalesCats {
sorted = ORDER rawSales BY sales DESC;
top = LIMIT sorted 2;
GENERATE group, FLATTEN(top);
}
DUMP topProdsByTopCats;

The output is as initially expected.

[root@sandbox hcc]# pig -x tez salesHCC.pig
((140,CatC),CatC,Prod09-cC,80,140)
((140,CatC),CatC,Prod73-cC,50,140)
((60,CatZ),CatZ,Prod22-cZ,30,60)
((60,CatZ),CatZ,Prod83-cZ,20,60)

I hope this was what you were looking for. Either way, good luck!
06-10-2016
02:41 PM
2 Kudos
I'm not aware of anything like Spark's Accumulators being exposed as "first-class" objects in Pig, and I have always advised that you would need to build a UDF for such activities if you couldn't simply get away with filtering the things to count (such as "good" records and "rejects") into separate aliases and then counting them up. Here is a blog post going down the UDF path: https://dzone.com/articles/counters-apache-pig. Good luck & I'd love to hear if there is something I've been missing all along directly in Pig.
06-06-2016
09:10 PM
5 Kudos
For a batch model, the "classic" pattern works for many: Sqoop the incremental data you need to ingest into a working directory on HDFS, then run a Pig script that loads the data (have a Hive table defined against this working directory so you can inherit the schema via HCatLoader), does any transformations needed (possibly only a single FOREACH to project it in the correct order), and uses HCatStorer to store the data into a pre-existing ORC-backed Hive table. You can stitch it all together with an Oozie workflow. I know of a place that uses a simple-and-novel Pig script like this to ingest 500 billion records per day into Hive.
05-27-2016
10:13 PM
I've got a weird/wild one for sure and am wondering if anyone has any insight. Heck, I'm giving out "BONUS POINTS" for this one. I'm dabbling with sc.textFile()'s optional minPartitions parameter to make my Hadoop file have more RDD partitions than the number of HDFS blocks. When testing with a single-block HDFS file, all works fine up to 8 partitions, but from 9 onward it seems to add an extra partition, as shown below.

>>> rdd1 = sc.textFile("statePopulations.csv",8)
>>> rdd1.getNumPartitions()
8
>>> rdd1 = sc.textFile("statePopulations.csv",9)
>>> rdd1.getNumPartitions()
10
>>> rdd1 = sc.textFile("statePopulations.csv",10)
>>> rdd1.getNumPartitions()
11

I was wondering if there was some magical implementation activity happening at 9 partitions (or 9x the number of blocks), but I didn't see a similar behavior on a 5-block file I have.

>>> rdd2 = sc.textFile("/proto/2000.csv")
>>> rdd2.getNumPartitions()
5
>>> rdd2 = sc.textFile("/proto/2000.csv",9)
>>> rdd2.getNumPartitions()
9
>>> rdd2 = sc.textFile("/proto/2000.csv",45)
>>> rdd2.getNumPartitions()
45

Really not a pressing concern, but it sure has made me ask WTH? (What The Hadoop?) Anyone know what's going on?
Labels:
- Apache Spark
05-26-2016
02:28 PM
Hey Vijay, yep, this might be too big of a set of questions for HCC. My suggestion is to search for particular topics to see if they are already being addressed and then ultimately, imagine these as separate discrete questions. For example, see https://community.hortonworks.com/questions/35539/snapshots-backup-and-dr.html as a pointed set of questions around snapshots; ok... that one had a bunch of Q's in one, too. 😉 Another alternative is to get hold of a solutions engineer from a company like (well, like Hortonworks!) to try to help you through all of these what-if questions. Additionally, a consultant can help you build an operational "run book" that addresses all of these concerns in a customized version for your org. Good luck!
05-25-2016
02:28 PM
3 Kudos
Wow, a TON of questions around Snapshots; I'll try to hit on most of them. Sounds like you might have already found these older posts on the topic: http://hortonworks.com/blog/snapshots-for-hdfs/ & http://hortonworks.com/blog/protecting-your-enterprise-data-with-hdfs-snapshots/. For DR (data onto another cluster) you'll need to export these snapshots with a tool like distcp. As you go up into the Hive and HBase stacks, you have some other tools and options in addition to this. My recommendation is to open a dedicated HCC question for each topic after you do a little research, and we can all jump in to help with anything you don't understand. As with all things, the best way to find out is to give it a try. As the next bit shows, you cannot delete a snapshot like "normal"; you have to use the special deleteSnapshot command.

[root@sandbox ~]# hdfs dfs -mkdir testsnaps
[root@sandbox ~]# hdfs dfs -put /etc/group testsnaps/
[root@sandbox ~]# hdfs dfs -ls testsnaps
Found 1 items
-rw-r--r-- 3 root hdfs 1196 2016-05-25 14:18 testsnaps/group
[root@sandbox ~]# su - hdfs
[hdfs@sandbox ~]$ hdfs dfsadmin -allowSnapshot /user/root/testsnaps
Allowing snaphot on /user/root/testsnaps succeeded
[hdfs@sandbox ~]$ exit
logout
[root@sandbox ~]# hdfs dfs -createSnapshot /user/root/testsnaps snap1
Created snapshot /user/root/testsnaps/.snapshot/snap1
[root@sandbox ~]# hdfs dfs -ls testsnaps/.snapshot/snap1
Found 1 items
-rw-r--r-- 3 root hdfs 1196 2016-05-25 14:18 testsnaps/.snapshot/snap1/group
[root@sandbox ~]# hdfs dfs -rmr -skipTrash /user/root/testsnaps/.snapshot/snap1
rmr: DEPRECATED: Please use 'rm -r' instead.
rmr: Modification on a read-only snapshot is disallowed
[root@sandbox ~]# hdfs dfs -deleteSnapshot /user/root/testsnaps snap1
[root@sandbox ~]# hdfs dfs -ls testsnaps/.snapshot
[root@sandbox ~]#

There is no auto-delete of snapshots. The rule of thumb is that if you create them (likely with an automated process) then you need a complementary process to delete them, as you can clog up HDFS space if the data directory you are snapshotting actually does change. Snapshots should not adversely affect your quotas, with the exception I just called out about them hanging onto HDFS space for items you have deleted from the actual directory that still has 1+ snapshots pointing to it. Have fun playing around with snapshots & good luck!
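As a concrete (and hypothetical) illustration of the distcp export mentioned above, something along these lines would copy a read-only snapshot to a DR cluster; the NameNode hosts and target path are made up.

import subprocess

# The .snapshot path is read-only and consistent, which is what makes it a good distcp source.
src = "hdfs://prod-nn:8020/user/root/testsnaps/.snapshot/snap1"
dst = "hdfs://dr-nn:8020/backups/testsnaps/snap1"   # hypothetical DR location

subprocess.check_call(["hadoop", "distcp", src, dst])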
05-23-2016
12:12 PM
Good thing it is an ASF project!! 😉 See if http://zeppelin.apache.org/, http://zeppelin.apache.org/download.html, http://zeppelin.apache.org/docs/0.5.6-incubating/index.html, http://zeppelin.apache.org/docs/0.5.6-incubating/install/install.html and/or http://zeppelin.apache.org/docs/0.5.6-incubating/install/yarn_install.html can get you going. Good luck!
05-18-2016
10:06 PM
Looks like the same question over at https://community.hortonworks.com/questions/33621/input-path-on-sandbox-for-loading-data-into-spark.html that @Joe Widen answered. Note my comment (and example) below; as Joe also pointed out, the JSON object needs to be on a single line. Glad to see Joe got a "best answer" and I'd sure be appreciative of the same on this one. 😉
05-18-2016
02:03 PM
1 Kudo
Sounds like https://community.hortonworks.com/questions/33961/how-to-import-data-return-by-google-analytic-s-api.html was a repost of this earlier question. I provided my (more generic) answer over there, but maybe someone has a more specific response tied directly to Google Analytics and Hadoop. Good luck!
05-18-2016
02:01 PM
1 Kudo
Sure!! Just navigate over to http://hortonworks.com/products/sandbox/ to download the free Hortonworks Sandbox and to check out all the available tutorials listed there. I do assume you already know about this, so feel free to refine your question. 😉
05-18-2016
01:59 PM
1 Kudo
My thoughts on #1 & #2: Some googling shows there are folks out there with somewhat direct ways to open an HDFS file and write to it as you are getting the data from the external system (the Google API in this case). That said, I'd consider applying the KISS principle: have your Python program write the results into a file, and when you are done (and you are sure you are done -- this helps prevent a half-baked file in HDFS), simply use the hadoop fs -put command to drop the complete file exactly where you want it in HDFS. As for #3: You do have to create the table (external or not -- even "managed" tables can reside outside of /apps/hive/warehouse), since Hive is an ecosystem tool layered above base HDFS and the CREATE TABLE DDL command stores the metadata about the logical table you want mapped to your data. The good news is that you can create that table before or after you load the data. Additionally, if you are going to continue to add net-new data to the table, you don't have to create it again. Good luck!
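Here's a minimal sketch of that write-locally-then-put pattern; the paths and the stand-in rows are hypothetical, and in real life the rows would come from your Google API calls.

import subprocess

local_path = "/tmp/ga_results.csv"                     # hypothetical local staging file
hdfs_path = "/data/google_analytics/ga_results.csv"    # hypothetical HDFS target

rows = [("2016-05-18", "homepage", 1234)]              # stand-in for the API results

with open(local_path, "w") as out:
    for row in rows:
        out.write(",".join(str(v) for v in row) + "\n")

# Only after the file is completely written do we move it into HDFS,
# which avoids exposing a half-baked file to downstream consumers.
subprocess.check_call(["hadoop", "fs", "-put", local_path, hdfs_path])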
05-17-2016
08:51 PM
Good points; some examples of the "corner cases" with CSV files (especially those generated by tools like Excel) are discussed in https://martin.atlassian.net/wiki/x/WYBmAQ.
05-17-2016
12:58 AM
2 Kudos
Not sure what error you are getting (feel free to share some of the dataset and the error messages you received), but I'm wondering if you are accounting for the following warning called out in http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets: "Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. As a consequence, a regular multi-line JSON file will most often fail." I usually get something like the following when trying to use a multi-line file.

scala> val productsML = sqlContext.read.json("/tmp/hcc/products.json")
productsML: org.apache.spark.sql.DataFrame = [_corrupt_record: string]

That said, all seems to be working for me with a file like the following.

[root@sandbox ~]# hdfs dfs -cat /tmp/hcc/employees.json
{"id" : "1201", "name" : "satish", "age" : "25"}
{"id" : "1202", "name" : "krishna", "age" : "28"}
{"id" : "1203", "name" : "amith", "age" : "39"}
{"id" : "1204", "name" : "javed", "age" : "23"}
{"id" : "1205", "name" : "prudvi", "age" : "23"} As you can see by the two ways I read the JSON file below. SQL context available as sqlContext.
scala> val df1 = sqlContext.read.json("/tmp/hcc/employees.json")
df1: org.apache.spark.sql.DataFrame = [age: string, id: string, name: string]
scala> df1.printSchema()
root
|-- age: string (nullable = true)
|-- id: string (nullable = true)
|-- name: string (nullable = true)
scala> df1.show()
+---+----+-------+
|age| id| name|
+---+----+-------+
| 25|1201| satish|
| 28|1202|krishna|
| 39|1203| amith|
| 23|1204| javed|
| 23|1205| prudvi|
+---+----+-------+
scala> val df2 = sqlContext.read.format("json").option("samplingRatio", "1.0").load("/tmp/hcc/employees.json")
df2: org.apache.spark.sql.DataFrame = [age: string, id: string, name: string]
scala> df2.show()
+---+----+-------+
|age| id| name|
+---+----+-------+
| 25|1201| satish|
| 28|1202|krishna|
| 39|1203| amith|
| 23|1204| javed|
| 23|1205| prudvi|
+---+----+-------+

Again, if this doesn't help, feel free to share some more details. Good luck!
05-16-2016
11:17 PM
1 Kudo
As for ingestion: Pig is not really used for simple ingestion, and Sqoop is a great tool for importing data from an RDBMS, so "directly in(to) HDFS" seems like the logical answer. If your data is on an edge/ingestion node, a simple script using the hadoop fs -put command (https://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-common/FileSystemShell.html#put) can be a simple, novel & effective way to get your data loaded into HDFS. As for whether Spark is a good option for data transformation (I'm going to side-step the "segmentation" term as it means a lot of different things to a lot of different people 😉), I'd say this is really a matter of style, experience, and the results of POC testing based on your data & processing profile. So, yes, Spark could be an effective transformation engine.
03-21-2016
01:12 AM
2 Kudos
I'm surely not going to give you the best answer on this one, but "Hadoop Streaming", as described at http://hadoop.apache.org/docs/current/hadoop-streaming/HadoopStreaming.html, is a way to run a MapReduce job that executes your Python code. In this case, you'll need to have Python installed on all the cluster nodes, but since you're starting out on the Sandbox that makes it easy (just one place!). Yes, you could also run a Python app that queries Hive, but only that query itself will be running in the cluster. In this case, you'll obviously just need Python wherever you are running it from.
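To give a feel for it, here is a minimal, hypothetical word-count mapper sketch for Hadoop Streaming; a companion reducer would simply sum the emitted counts per word, and the streaming jar path shown in the comment varies by install.

#!/usr/bin/env python
# A word-count *mapper* for Hadoop Streaming: reads lines on stdin and emits
# tab-separated key/value pairs on stdout.
# Hypothetical invocation (jar path varies by distribution/version):
#   hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar \
#     -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py \
#     -input /tmp/input -output /tmp/output
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%d" % (word, 1))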
03-21-2016
01:00 AM
2 Kudos
I had some troubles a while back similar to this as shown at https://martin.atlassian.net/wiki/x/C4BRAQ. Try replacing REGISTER /tmp/stackexchange/piggybank.jar with REGISTER 'hdfs:///tmp/stackexchange/piggybank.jar' and let us know if that works.
03-20-2016
10:30 PM
Yep, that could work. Putting it in HBase could also allow you to maintain versions of the record, too. Good luck and feel free to share more.
03-17-2016
10:41 PM
2 Kudos
A couple of observations and a few recommendations. First, if you are trying to run the Pig script from the Linux command line, I would recommend you save your Pig script locally and then run it. Also, you don't really need to fully qualify the location of the input file like you are doing above. Here is a walkthrough of something like you are doing now, all from the command line.

SSH to the Sandbox and become maria_dev. I have an earlier 2.4 version and it does not have a local maria_dev user account (she does have an account in Ambari as well as an HDFS home directory), so I had to create that first as shown below. If the first "su" command works, then skip the "useradd" command. Then verify she has an HDFS home directory.

HW10653-2:~ lmartin$ ssh root@127.0.0.1 -p 2222
root@127.0.0.1's password:
Last login: Tue Mar 15 22:14:09 2016 from 10.0.2.2
[root@sandbox ~]# su maria_dev
su: user maria_dev does not exist
[root@sandbox ~]# useradd -m -s /bin/bash maria_dev
[root@sandbox ~]# su - maria_dev
[maria_dev@sandbox ~]$ hdfs dfs -ls /user
Found 17 items <<NOTE: I deleted all except the one I was looking for...
drwxr-xr-x - maria_dev hdfs 0 2016-03-14 22:49 /user/maria_dev

Then copy a file to HDFS that you can then later read.

[maria_dev@sandbox ~]$ hdfs dfs -put /etc/hosts hosts.txt
[maria_dev@sandbox ~]$ hdfs dfs -cat /user/maria_dev/hosts.txt
# File is generated from /usr/lib/hue/tools/start_scripts/gen_hosts.sh
# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1    localhost.localdomain localhost
10.0.2.15    sandbox.hortonworks.com sandbox ambari.hortonworks.com

Now put the following two lines of code into a LOCAL file called runme.pig, as shown when listing it below.

[maria_dev@sandbox ~]$ pwd
/home/maria_dev
[maria_dev@sandbox ~]$ cat runme.pig
data = LOAD '/user/maria_dev/hosts.txt';
DUMP data;

Then just run it (remember, no dashes!!). NOTE: many lines removed from the logging output that is bundled in with the DUMP of the hosts.txt file.

[maria_dev@sandbox ~]$ pig runme.pig
... REMOVED A BUNCH OF LOGGING MESSAGES ...
2016-03-17 22:38:45,636 [main] INFO org.apache.pig.tools.pigstats.mapreduce.SimplePigStats - Script Statistics:
HadoopVersion  PigVersion  UserId  StartedAt  FinishedAt  Features
2.7.1.2.4.0.0-169  0.15.0.2.4.0.0-169  maria_dev  2016-03-17 22:38:10  2016-03-17 22:38:45  UNKNOWN
Success!
Job Stats (time in seconds):
JobIdMapsReducesMaxMapTimeMinMapTimeAvgMapTimeMedianMapTimeMaxReduceTimeMinReduceTimeAvgReduceTimeMedianReducetimeAliasFeatureOutputs
job_1458253459880_00011077770dataMAP_ONLYhdfs://sandbox.hortonworks.com:8020/tmp/temp212662320/tmp-490136848,
Input(s):
Successfully read 5 records (670 bytes) from: "/user/maria_dev/hosts.txt"
Output(s):
Successfully stored 5 records (310 bytes) in: "hdfs://sandbox.hortonworks.com:8020/tmp/temp212662320/tmp-490136848"
Counters:
Total records written : 5
Total bytes written : 310
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
job_1458253459880_0001
... REMOVED ABOUT 10 MORE LOGGING MESSAGES ...
.... THE NEXT BIT IS THE RESULTS OF THE DUMP COMMAND ....
(# File is generated from /usr/lib/hue/tools/start_scripts/gen_hosts.sh)
(# Do not remove the following line, or various programs)
(# that require network functionality will fail.)
(127.0.0.1,,localhost.localdomain localhost)
(10.0.2.15,sandbox.hortonworks.com sandbox ambari.hortonworks.com)
2016-03-17 22:38:46,662 [main] INFO org.apache.pig.Main - Pig script completed in 43 seconds and 385 milliseconds (43385 ms)
[maria_dev@sandbox ~]$

Does this work for you?? If so, then you can run a Pig script from the CLI and remember... you do NOT need all the fully qualified naming junk when running this way. GOOD LUCK!