Member since: 11-04-2015
Posts: 44
Kudos Received: 18
Solutions: 3

My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 1344 | 03-17-2017 06:17 AM |
 | 63091 | 02-29-2016 12:25 PM |
 | 12373 | 02-03-2016 01:25 PM |
03-17-2017
06:17 AM
1 Kudo
We can achieve this using JOIN as follows.

1. JOIN A and B by Id:
B_joined = JOIN A BY Id, B BY Id;

2. JOIN A and C by Id:
C_joined = JOIN A BY Id, C BY Id;

Now we can get the required fields of B and C from their respective joined data sets:

B_filtered = FOREACH B_joined GENERATE B::Id, B::t1;
C_filtered = FOREACH C_joined GENERATE C::Id;
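For reference, a minimal end-to-end sketch of the same approach, assuming the relations are loaded from the comma-delimited files named in the original question (header.csv, file1.csv, file2.csv) and that the output locations are hypothetical:

A = LOAD 'header.csv' USING PigStorage(',') AS (Id:chararray, f1:chararray, f2:chararray);
B = LOAD 'file1.csv' USING PigStorage(',') AS (Id:chararray, t1:chararray);
C = LOAD 'file2.csv' USING PigStorage(',') AS (Id:chararray);
-- join B and C to the header relation A on Id
B_joined = JOIN A BY Id, B BY Id;
C_joined = JOIN A BY Id, C BY Id;
-- project only the fields needed from each joined relation
B_filtered = FOREACH B_joined GENERATE B::Id, B::t1;
C_filtered = FOREACH C_joined GENERATE C::Id;
-- hypothetical output locations on HDFS
STORE B_filtered INTO 'file1_filtered' USING PigStorage(',');
STORE C_filtered INTO 'file2_filtered' USING PigStorage(',');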
03-16-2017
12:34 PM
I have many files. One of them, say header.csv, serves as a header file, i.e., it contains the primary key (in database terms) that serves as a foreign key in the rest of the files. Now, I want to do FOREACH and FILTER as follows:

A = LOAD 'header.csv' AS (Id:chararray, f1:chararray, f2:chararray);
B = LOAD 'file1.csv' AS (Id:chararray, t1:chararray);
C = LOAD 'file2.csv' AS (Id:chararray);
..........
D = FOREACH A {
    file1_filtered = FILTER file1 BY Id == A.Id;
    file2_filtered = FILTER file2 BY Id == A.Id;
    GENERATE file1_filtered, file2_filtered;
};

Finally, I need to access the relations file1_filtered and file2_filtered.
When I follow this approach, I get the following error:

"ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: Pig script failed to parse:
<line 2651, column 28> expression is not a project expression: (Name: ScalarExpression) Type: null Uid: null)"

How can I achieve this in Pig?
Labels:
- Apache Pig
03-11-2017
12:42 PM
1 Kudo
I have independent HDF and HDP clusters. I wonder if I can have a single KDC admin server for both clusters. If it is possible, how do I achieve that?
Labels:
03-07-2017
06:09 AM
Thank you very much @Lester Martin! This is exactly what I was looking for. If you don't mind, I have another related question. This logic is applied to more than 36 different files. In database terms, one of the files uses the ID and CreateDate fields as its primary key, and these fields are used as foreign keys in the rest of the files.
* The files are dropped daily into a local directory on the Hadoop node
* The files have the current date appended to their file names

So, I need to read all the files from that local directory, apply the above logic to each of them, and then store the results into HDFS. Is Pig an optimal (or even feasible) solution for my use case? Currently, I do this with a C# program that reads the files, applies the logic, and inserts the results into a relational database. The reason I am looking at Pig is to improve the performance of the ETL process. Any recommendations on this, please? Thanks!
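A rough sketch of how the daily, date-suffixed drops could be picked up in Pig with a file glob, assuming the files have already been copied into HDFS; the directory layout, file-name pattern, delimiter and schemas below are illustrative assumptions, not taken from the actual files:

-- assumed layout: /data/incoming/header_YYYYMMDD.csv, /data/incoming/file1_YYYYMMDD.csv, ...
A = LOAD '/data/incoming/header_*.csv' USING PigStorage(',')
      AS (Id:chararray, CreateDate:chararray, f1:chararray, f2:chararray);
B = LOAD '/data/incoming/file1_*.csv' USING PigStorage(',')
      AS (Id:chararray, CreateDate:chararray, t1:chararray);
-- join on the composite key (Id, CreateDate) shared by the files
B_joined = JOIN A BY (Id, CreateDate), B BY (Id, CreateDate);
B_result = FOREACH B_joined GENERATE B::Id, B::CreateDate, B::t1;
-- hypothetical HDFS output path
STORE B_result INTO '/data/output/file1_joined' USING PigStorage(',');

The glob picks up all dated drops in one pass; a per-day run would substitute a concrete date for the wildcard, for example via a -param value passed to the script.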
03-03-2017
03:43 PM
I have two files on my HDFS. One file (the latest file) contains updates to the other (the previous file). I want to check whether the values of specific columns in the latest file also exist in the previous file (i.e., whether they have the same values), and if so, replace those records of the previous file with the records from the latest file. In other words, I need to compare each record of the previous file against each record of the latest file based on specific columns; when a match is found, the whole record should be deleted from the previous file and replaced with the record from the latest file. How can I achieve this with Pig? Thanks!
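One way to express this replace-on-match logic in Pig is a left outer join of the previous file against the latest file, keeping the latest record whenever the join key matches. A minimal sketch, assuming hypothetical file names, a comma delimiter, and a simple (id, value) schema where id stands in for the specific matching columns:

prev   = LOAD 'previous_file.csv' USING PigStorage(',') AS (id:chararray, value:chararray);
latest = LOAD 'latest_file.csv' USING PigStorage(',') AS (id:chararray, value:chararray);
-- previous records with no match in the latest file keep a null latest::id after the join
joined    = JOIN prev BY id LEFT OUTER, latest BY id;
unchanged = FILTER joined BY latest::id IS NULL;
kept      = FOREACH unchanged GENERATE prev::id AS id, prev::value AS value;
-- matched records are taken from the latest file, which replaces the old versions
result = UNION kept, latest;
STORE result INTO 'merged_output' USING PigStorage(',');

This sketch assumes the latest file mostly contains updates to existing records; any brand-new ids in the latest file would also flow through via the UNION with latest.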
Labels:
- Apache Pig
02-22-2017
05:20 AM
@Pierre Villard
Thank you very much dear! You made my day!
02-21-2017
02:14 PM
1 Kudo
I have multiple files on my SFTP server with different file names (each file name has a date-time appended). I am using the ListSftp, RouteOnAttribute, FetchSftp and putHdfs processors, but I am not sure how to configure the FetchSftp processor so that all the files on the remote SFTP server are put into HDFS.
Is there any option to provide the list of file names to the "Remote File" property of the FetchSftp processor configuration? Thanks!
Labels:
- Apache NiFi
02-21-2017
12:03 PM
Thank you very much @Pierre Villard.
Your answer was really helpful.
02-21-2017
09:19 AM
I have a 7-node Kerberized HDP cluster. I have installed Apache NiFi on one of the HDP cluster nodes just for testing purposes. When I try to configure the putHdfs processor, a warning pops up. I tried to set the Kerberos properties in the processor configuration, and in addition I set nifi.kerberos.krb5.file=/etc/krb5.conf in the nifi.properties file.
What is the correct configuration (on the NiFi host or the HDFS host) for the putHdfs processor to work properly in this case? Do I need to create a Kerberos principal and keytab file for NiFi?
Which service's principal and keytab file am I required to provide in the "Kerberos Principal" and "Kerberos Keytab" fields of the putHdfs processor configuration (NiFi's or HDFS's)? Thanks.
Labels:
- Apache NiFi
- Cloudera DataFlow (CDF)
07-21-2016
01:38 PM
@Venkat ramanann In addition to @Jon Maestas' workaround, add the IP address of the host on which the PostgreSQL server is running to the pg_hba.conf file and make sure that the method is set to "trust". Note: replace <ip address> with the IP address of your host to allow connections.

# TYPE  DATABASE  USER  CIDR-ADDRESS     METHOD
# IPv4 local connections:
host    all       all   127.0.0.1/32     md5
host    all       all   <ip address>/24  trust
# IPv6 local connections:
host    all       all   ::1/128          md5

Now, restart the postgresql service. For more details, visit https://confluence.atlassian.com/confkb/confluence-unable-to-connect-to-postgresql-due-to-unconfigured-pg_hba-conf-file-300814422.html