Member since: 01-12-2016
123 Posts
12 Kudos Received
1 Solution

My Accepted Solutions

Title | Views | Posted |
---|---|---|
 | 885 | 12-12-2016 08:59 AM |
04-12-2019
01:40 PM
What are an empty bag, an empty tuple, and null in Pig? How are they represented, and which operations generate an empty tuple or an empty bag? I am asking because I am not clear on when to use the datafu.pig.bags.NullToEmptyBag() and EmptyBagToNullFields functions. Can someone give an example to aid my understanding?
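For context, a minimal Pig sketch of where these two DataFu functions usually matter; the jar name/version, file names, and schemas below are assumptions for illustration, not from the original post.
-- Hedged sketch: COGROUP produces EMPTY bags for unmatched keys (never null),
-- while FLATTEN of an empty bag silently drops the record.
REGISTER 'datafu-pig-1.3.3.jar';  -- hypothetical jar name/version
DEFINE EmptyBagToNullFields datafu.pig.bags.EmptyBagToNullFields();
DEFINE NullToEmptyBag       datafu.pig.bags.NullToEmptyBag();

users  = LOAD 'users'  USING PigStorage(',') AS (id:int, name:chararray);
orders = LOAD 'orders' USING PigStorage(',') AS (id:int, amount:double);

grouped = COGROUP users BY id, orders BY id;

-- Convert the empty orders bag into one all-null tuple so users without orders
-- survive the FLATTEN (left-outer-join behaviour).
joined = FOREACH grouped GENERATE FLATTEN(users), FLATTEN(EmptyBagToNullFields(orders));

-- NullToEmptyBag goes the other direction: if a bag-typed field can be null
-- (for example after an outer join on a relation that already stores a bag),
-- replacing it with {} keeps COUNT and other bag operations from failing.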
Labels:
- Apache Pig
01-15-2019
10:08 AM
Hi all, any input on my clarifications? I have faced this scenario one more time.
12-08-2018
02:39 PM
This is from the Hive manual: "Recover Partitions (MSCK REPAIR TABLE). Hive stores a list of partitions for each table in its metastore. If, however, new partitions are directly added to HDFS (say by using the hadoop fs -put command) or removed from HDFS, the metastore (and hence Hive) will not be aware of these changes to partition information unless the user runs ALTER TABLE table_name ADD/DROP PARTITION commands on each of the newly added or removed partitions, respectively." Doubt: how do I add new partitions directly using hadoop fs -put? Can someone give an example of this? To my knowledge, I only know about ALTER TABLE ... ADD PARTITION and dynamic partitioning.
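For what it's worth, a minimal sketch of what "adding a partition directly with hadoop fs -put" can look like; the table name, partition column, and warehouse path below are hypothetical.
# Create a partition-style directory under the table's HDFS location and copy data in
# (table name 'sales', partition column 'dt', and the path are hypothetical).
hadoop fs -mkdir -p /apps/hive/warehouse/sales/dt=2018-12-01
hadoop fs -put sales_2018-12-01.csv /apps/hive/warehouse/sales/dt=2018-12-01/

# The metastore does not yet know about the new directory, so register it:
hive -e "MSCK REPAIR TABLE sales;"
# (equivalently: ALTER TABLE sales ADD PARTITION (dt='2018-12-01');)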
Labels:
- Apache Hive
11-24-2018
11:37 AM
In the MapReduce flow, which occurs first: shuffling or sorting? To my knowledge, shuffling occurs first and then sorting; correct me if I am wrong. Can anybody explain these two steps? Below is a statement from the Definitive Guide: "MapReduce makes the guarantee that the input to every reducer is sorted by key. The process by which the system performs the sort—and transfers the map outputs to the reducers as inputs—is known as the shuffle."
Labels:
- Apache Hadoop
11-15-2018
08:56 AM
@Aditya Sirna Do you mean that if we are familiar with Python, we can work on Spark? Is Python alone sufficient for real-time projects, or do I need to learn Scala or Java?
11-15-2018
05:47 AM
Could anybody guide me on the learning path for Spark?
I am familiar with Hadoop, Hive, Pig, Sqoop, Oozie, Python, and HBase, but I do not know much about Java.
Do I need to learn both Java and Scala to start with Spark?
I am completely confused about where to start.
Labels:
- Apache Spark
10-14-2018
08:49 AM
How do I apply "rows between unbounded preceding and unbounded following" in Pig? Currently, I am using the code below to calculate the cumulative sum:
A = load 'T' AS (si:chararray, i:int, d:long, f:float, s:chararray);
C = foreach (group A by si) {
Aord = order A by d;
generate flatten(Stitch(Aord, Over(Aord.f, 'sum(float)')));
}
D = foreach C generate s, $5;
This is equivalent to the SQL statement:
select s, sum(f) over (partition by si order by d) from T;
I know I need to modify the Over(Aord.f, 'sum(float)') clause, but I am not sure what exactly I need to change.
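One hedged workaround, reusing the relation names from the post: a frame of unbounded preceding to unbounded following is simply the total over the whole partition, so for an aggregate like SUM it can be computed over the ordered bag directly instead of through Over (check the Over UDF javadoc of your piggybank version for native window-frame arguments).
-- Hedged sketch: the whole-partition frame for SUM equals the group total,
-- attached to every row of the group.
C = foreach (group A by si) {
    Aord = order A by d;
    generate flatten(Aord), (float)SUM(Aord.f) as total_f;
}
D = foreach C generate $4 as s, $5 as total_f;  -- $4 is s, $5 is the partition-wide sum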
Labels:
- Apache Pig
10-13-2018
05:13 AM
I have set the number of reducers to 2, but Hive is still executing with 1. Can anybody help with this?
set hive.exec.reducers.max=2;
hive (default)> insert overwrite directory '/input123456'
> select count(*) from partitioned_user;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_201810122125_0003, Tracking URL = http://ubuntu:50030/jobdetails.jsp?jobid=job_201810122125_0003
Kill Command = /home/naresh/Work1/hadoop-1.2.1/libexec/../bin/hadoop job -kill job_201810122125_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2018-10-12 21:36:24,774 Stage-1 map = 0%, reduce = 0%
2018-10-12 21:36:32,825 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 4.12 sec
2018-10-12 21:36:41,919 Stage-1 map = 100%, reduce = 33%, Cumulative CPU 4.12 sec
2018-10-12 21:36:42,926 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 6.38 sec
MapReduce Total cumulative CPU time: 6 seconds 380 msec
Ended Job = job_201810122125_0003
Moving data to: /input123456
MapReduce Jobs Launched:
Job 0: Map: 1 Reduce: 1 Cumulative CPU: 6.38 sec HDFS Read: 354134 HDFS Write: 5 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 380 msec
OK
_c0
Time taken: 37.199 seconds
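A hedged observation on the log above: hive.exec.reducers.max only caps the reducer count, and a bare COUNT(*) needs a single final reducer to produce its one-row result anyway. To pin an exact number, the log itself points at mapred.reduce.tasks; a sketch where the extra reducers can actually be used (the gender column is hypothetical):
-- Hedged sketch: force an exact reducer count rather than only capping it.
SET mapred.reduce.tasks=2;
INSERT OVERWRITE DIRECTORY '/input123456'
SELECT gender, COUNT(*)      -- 'gender' is a hypothetical column of partitioned_user
FROM partitioned_user
GROUP BY gender;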
Labels:
- Apache Hive
10-11-2018
02:24 AM
How can I get the list of functions available in any jar file? Let us say I have piggybank.jar; it contains Reverse, UnixToISO(), etc. Is there a command to list the functions available in a jar file rather than searching Google for them?
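Since every Pig UDF is just a Java class inside the jar, listing the class entries of the archive gets most of the way there; a minimal sketch using standard JDK tooling (the jar file name is taken from the question):
# List every class packaged in the jar; each UDF corresponds to one class file.
jar tf piggybank.jar | grep '\.class$'

# Narrow down to, say, the string evaluation functions (Reverse lives here)
jar tf piggybank.jar | grep 'evaluation/string'

# unzip -l works as well if the jar tool is not on the PATH
unzip -l piggybank.jar | grep '\.class$'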
Labels:
- Apache Pig
03-09-2017
12:55 PM
I am new to Avro, and any input on the clarifications below is appreciated. These questions came up while reading an article on Avro with Hive (Article Link).
1) We are not mentioning an .avsc schema file here, but I have read other articles where an .avsc file is specified in the table properties. Is that mandatory or optional?
hive> CREATE EXTERNAL TABLE user_profile (id BIGINT, name STRING, bday STRING) STORED AS avro;
2) For the changes below, do I need to recreate the .avsc file each and every time? Is it a one-time activity, or do I need to recreate the .avsc file for every step (a, b, c)? You can see that a number of operations can be allowed as a simple requirement change:
a) Adding a new column to a table (the "country" column in the 2nd file)
b) Dropping a column from a table (the "id" column in the 3rd file)
c) Renaming a column (the "birthday" column in the 4th file)
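For reference, a hedged sketch contrasting the two styles the question is about; the second table name and the HDFS path are hypothetical. With STORED AS AVRO, Hive derives the Avro schema from the declared columns, so an explicit .avsc is optional; avro.schema.url (or avro.schema.literal) is how a table is pinned to an external schema file instead.
-- Style 1: no .avsc file; Hive generates the Avro schema from the column list.
CREATE EXTERNAL TABLE user_profile (id BIGINT, name STRING, bday STRING)
STORED AS AVRO;

-- Style 2: columns driven by an external .avsc file (path is hypothetical).
CREATE EXTERNAL TABLE user_profile_from_avsc
STORED AS AVRO
TBLPROPERTIES ('avro.schema.url'='hdfs:///schemas/user_profile.avsc');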
Tags:
- article
- Avro
- Data Processing
- data-processing
Labels:
- Apache Hive
03-07-2017
08:37 AM
Thanks for the comments. I will definitely do it, starting from this post.
03-03-2017
08:48 AM
Thanks for the input. What is the problem with my relation C? STRSPLIT generates a tuple as output, and here it will consist of two fields. (a1:chararray, a1of1:chararray) is also a tuple, since it is enclosed in parentheses and likewise consists of two fields.
03-02-2017
02:23 PM
My input file, a.txt, is below:
aaa.kyl,data,data
bbb.kkk,data,data
cccccc.hj,data,data
qa.dff,data,data
A = LOAD '/pigdata/a.txt' USING PigStorage(',') AS (a1:chararray, a2:chararray, a3:chararray);
How do I resolve the error below, and what is the reason for it?
ERROR:
C = FOREACH A GENERATE STRSPLIT(a1,'\\u002E') as (a1:chararray, a1of1:chararray),a2,a3;
2017-02-03 00:45:42,803 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1031: Incompatable schema: left is "a1:chararray,a1of1:chararray", right is ":tuple()"
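For context, a hedged sketch of one common fix: STRSPLIT returns a single tuple-typed field, so the two-field AS schema cannot be applied to it directly; FLATTEN the tuple first and the AS clause can then name the resulting two fields.
-- Hedged sketch: flatten the tuple produced by STRSPLIT before naming its fields.
C = FOREACH A GENERATE
        FLATTEN(STRSPLIT(a1, '\\u002E')) AS (a1:chararray, a1of1:chararray),
        a2, a3;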
Labels:
- Apache Hadoop
- Apache Pig
02-15-2017
03:05 PM
From the Pig textbook:
Tuple:
A = load 'input' as (t:tuple(x:int, y:int));
B = foreach A generate t.x, t.$1;
Bag: when you project fields in a bag, you are creating a new bag with only those fields:
A = load 'input' as (b:bag{t:(x:int, y:int)});
B = foreach A generate b.x;
This will produce a new bag whose tuples have only the field x in them.
How can I get more information about this, i.e., referencing a relation by using the dot operator (out_max.ninetieth)?
I cannot find anything about it in the Pig manual; any input on this is appreciated.
02-15-2017
01:04 PM
Thanks @Artem Ervits. The plays relation does not contain the alias ninetieth, and I understand it is generated in step 4. How can we use ninetieth in step 5 when plays does not contain that alias? In trim_outliers = foreach plays generate ... (here we need to select an alias from plays), can I select an alias from any other relation while using a foreach ... generate statement?
02-15-2017
09:11 AM
I am new to Pig, and any input is really appreciated. The plays relation does not contain a field called ninetieth, so how can we use out_max.ninetieth in step 5?
Labels:
- Apache Hadoop
- Apache Pig
02-09-2017
12:42 PM
The process for implementing SCD Type 1 is: identify new records and insert them into the dimension table, then identify changed records and update the dimension table. How do I implement SCD Type 1 in Hive? I have found the following pieces but am not sure how to assemble a complete solution:
a) We can generate a surrogate key using datafu.pig.hash.SHA() in Pig, or in Hive using row_number (http://www.remay.com.br/blog/hdp-2-2-how-to-create-a-surrogate-key-on-hive/).
b) For change capture, I can use a full outer join to identify new and updated records.
c) To use UPDATE statements in Hive, the table must have the transactional property enabled and use the ORC format.
I want to do this either in Hive or Pig. The source and target are as below.
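To make item b) concrete, here is a hedged HiveQL sketch of how SCD Type 1 is often expressed without UPDATE statements: rebuild the dimension with a FULL OUTER JOIN so the staged value wins wherever it exists. All table and column names are hypothetical.
-- Hedged SCD Type 1 sketch: staged rows overwrite matching dimension rows,
-- new keys are inserted, untouched rows are carried over unchanged.
INSERT OVERWRITE TABLE dim_customer
SELECT
    COALESCE(s.customer_id, d.customer_id) AS customer_id,
    COALESCE(s.name,        d.name)        AS name,
    COALESCE(s.city,        d.city)        AS city
FROM dim_customer d
FULL OUTER JOIN stg_customer s
  ON d.customer_id = s.customer_id;
-- If your Hive version objects to reading and overwriting the same table in one
-- statement, write into a staging copy of the dimension and swap it in instead.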
Tags:
- hadoop
- Hadoop Core
- HDFS
- Hive
- Pig
Labels:
- Apache Hadoop
- Apache Hive
- Apache Pig
01-24-2017
09:32 AM
Hi @cduby, thanks for the input. For loading the base table we are following: sqoop import --> create external table --> load the data into base_table (ORC format) from the external table. One small clarification on this: when I run sqoop import, it creates many part-m* files plus a _SUCCESS file and other files, since a MapReduce job is triggered. The external table should contain only the data from the part-m* files, so do I need to delete the other files (_SUCCESS, job.xml), or is there another option to make the external table skip them?
sqoop import --connect jdbc:teradata://{host name}/Database=retail
--connection-manager org.apache.sqoop.teradata.TeradataConnManager --username dbc
--password dbc --table SOURCE_TBL --target-dir /user/hive/base_table -m 1
CREATE TABLE base_table (
id STRING,
field1 STRING,
modified_date DATE)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS ORC;
01-23-2017
04:11 PM
This is a superb article, and I have the following clarifications: https://community.hortonworks.com/articles/29410/four-step-strategy-for-incremental-updates-in-apac.html
Clarification 1: base_table is not an external table, and how the data is loaded into base_table during the first run is not clear. Could you please provide input on this? We are not using any LOAD or INSERT INTO statement. I think that after running the statement below, we have to manually load the data from the files in /user/hive/incremental_table/incremental_table into base_table using a LOAD DATA statement:
sqoop import --connect jdbc:teradata://{host name or ip address}/Database=retail --connection-manager org.apache.sqoop.teradata.TeradataConnManager --username dbc --password dbc --table SOURCE_TBL --target-dir /user/hive/incremental_table -m 1
Clarification 2: During the first run only base_table will be loaded, and there is no need to implement the reconcile, compact, and purge processes since we do not have incremental data yet. Please correct me if I am wrong.
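If it helps, a hedged sketch of what that manual first-run load could look like, assuming the paths from the article and a base_table whose file format matches the sqoop output:
-- Hedged sketch: LOAD DATA INPATH moves the sqoop output files into the table.
LOAD DATA INPATH '/user/hive/incremental_table/incremental_table'
INTO TABLE base_table;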
Labels:
- Apache Hive
- Apache Sqoop
01-23-2017
04:04 PM
base_table is not an external table, and how the data is loaded into base_table during the first run is not clear. Could you please provide input on this? We are not using any LOAD or INSERT INTO statement. I think that after running the statement below, we have to manually load the data from the files in
/user/hive/incremental_table/incremental_table into base_table:
sqoop import --connect jdbc:teradata://{host name or ip address}/Database=retail --connection-manager org.apache.sqoop.teradata.TeradataConnManager --username dbc --password dbc --table SOURCE_TBL --target-dir /user/hive/incremental_table -m 1
01-13-2017
03:51 PM
a) Thanks for the input, @Artem Ervits; your input is always appreciated. I will go for a coordinator job with time- and data-availability-based scheduling, but I still have the following clarifications.
Clarification 1: Suppose I use the command below to trigger the coordinator job. Is running this command a one-time activity in production, since the coordinator will trigger based on its frequency on day 2, or do I need to run the command on day 2 as well? Please correct me if I am wrong.
oozie job -oozie http://sandbox.hortonworks.com:11000/oozie -config /path/to/job.properties -run
<coordinator-app name="my_first_job" start="2014-01-01T02:00Z"
end="2014-12-31T02:00Z" frequency="${coord:days(1)}"
xmlns="uri:oozie:coordinator:0.4">
Clarification 2: How do I implement conditional logic in the Oozie workflow so that if there is new data the actions run, and otherwise it proceeds to the end action?
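Regarding clarification 2, a hedged sketch of how such a check is commonly wired into a workflow: a decision node using the Oozie fs EL functions. The newDataDir property and the action names are hypothetical.
<!-- Hedged sketch: branch to the actions only when the input directory has data. -->
<decision name="check-new-data">
    <switch>
        <case to="process-data">${fs:exists(newDataDir) and fs:dirSize(newDataDir) gt 0}</case>
        <default to="end"/>
    </switch>
</decision>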
01-12-2017
09:43 AM
Hi @Santhosh B Gowda, thanks for the input.
a) My question is how to run the command below in production, since we should not run it manually:
oozie job --oozie http://host_nameofoozieserver:8080/oozie -D oozie.wf.application.path=hdfs://namenodepath/pathof_workflow_xml/workflow.xml -run
b) I know about the coordinator, but at this point I am not sure whether I should use data or time triggers; currently we are running Flume continuously.
01-12-2017
08:57 AM
Currently we are running an Oozie workflow (consisting of Hive, Pig, and Sqoop actions) with the command below in the dev environment. In the production environment we should not run it manually. Can I create a shell script for the command below and run that script via the crontab scheduler? Is this approach correct, and if so, what should the timing of the script be? If not, what is the right approach for running the command below in production?
oozie job --oozie http://host_nameofoozieserver:8080/oozie -D oozie.wf.application.path=hdfs://namenodepath/pathof_workflow_xml/workflow.xml -run
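For illustration only, a hedged sketch of the cron-wrapper idea being asked about; the script path, log location, and schedule are hypothetical, and a coordinator (discussed in the other posts of this thread) is usually the preferred approach.
#!/bin/bash
# run_workflow.sh - hypothetical wrapper around the Oozie CLI submission.
oozie job --oozie http://host_nameofoozieserver:8080/oozie \
    -D oozie.wf.application.path=hdfs://namenodepath/pathof_workflow_xml/workflow.xml \
    -run >> /var/log/oozie_workflow_submit.log 2>&1
A crontab entry such as 0 1 * * * /path/to/run_workflow.sh would then submit the workflow once a day at 01:00.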
Labels:
- Apache Oozie