Member since: 11-04-2015
Posts: 44
Kudos Received: 18
Solutions: 3

My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 1344 | 03-17-2017 06:17 AM |
 | 63091 | 02-29-2016 12:25 PM |
 | 12373 | 02-03-2016 01:25 PM |
03-17-2017
06:17 AM
1 Kudo
We can achieve this using JOIN as follows.

1. JOIN A and B by Id:
B_joined = JOIN A BY Id, B BY Id;

2. JOIN A and C by Id:
C_joined = JOIN A BY Id, C BY Id;

Now we can get the required fields of B and C from their respective joined data sets:

B_filtered = FOREACH B_joined GENERATE B::Id, B::t1;
C_filtered = FOREACH C_joined GENERATE C::Id;
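For reference, a minimal end-to-end sketch of the same approach, assuming the relations are loaded from the comma-delimited files named in the original question (header.csv, file1.csv, file2.csv) and that the output locations are hypothetical:

A = LOAD 'header.csv' USING PigStorage(',') AS (Id:chararray, f1:chararray, f2:chararray);
B = LOAD 'file1.csv' USING PigStorage(',') AS (Id:chararray, t1:chararray);
C = LOAD 'file2.csv' USING PigStorage(',') AS (Id:chararray);
-- join B and C to the header relation A on Id
B_joined = JOIN A BY Id, B BY Id;
C_joined = JOIN A BY Id, C BY Id;
-- project only the fields needed from each joined relation
B_filtered = FOREACH B_joined GENERATE B::Id, B::t1;
C_filtered = FOREACH C_joined GENERATE C::Id;
-- hypothetical output locations on HDFS
STORE B_filtered INTO 'file1_filtered' USING PigStorage(',');
STORE C_filtered INTO 'file2_filtered' USING PigStorage(',');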
03-16-2017
12:34 PM
I have many files. One of them, say header.csv, serves as a header file, i.e., it contains the primary key (in database terms) that serves as a foreign key in the rest of the files. Now, I want to do FOREACH and FILTER as follows:

A = LOAD 'header.csv' AS (Id:chararray, f1:chararray, f2:chararray);
B = LOAD 'file1.csv' AS (Id:chararray, t1:chararray);
C = LOAD 'file2.csv' AS (Id:chararray);
..........
D = FOREACH A {
    file1_filtered = FILTER file1 BY Id == A.Id;
    file2_filtered = FILTER file2 BY Id == A.Id;
    GENERATE file1_filtered, file2_filtered;
};

Finally, I need to access the relations file1_filtered and file2_filtered.
When I follow this approach, I get the following error:

"ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: Pig script failed to parse:
<line 2651, column 28> expression is not a project expression: (Name: ScalarExpression) Type: null Uid: null)"

How can I achieve this in Pig?
Labels:
- Apache Pig
03-11-2017
12:42 PM
1 Kudo
I have independent HDF and HDP clusters. I wonder if I can have a single KDC admin server for both clusters. If it is possible, how do I achieve that?
Labels:
03-07-2017
06:09 AM
Thank you very much @Lester Martin! This is exactly what I was looking for. If you don't mind, I have another related question. This logic is applied to more than 36 different files. In database terms, one of the files uses the ID and CreateDate fields as its primary key, and these fields are used as foreign keys in the rest of the files.
* The files are dropped daily into a local directory on the Hadoop node
* The files have the current date appended to their file names

So, I need to read all the files from that local directory, apply the above logic to each of them, and then store the results into HDFS. Is Pig an optimal (or even feasible) solution for my use case? Currently, I do this with a C# program that reads the files, applies the logic, and inserts the results into a relational database. The reason I am looking at Pig is to improve the performance of the ETL process. Any recommendations on this, please? Thanks!
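A rough sketch of how the daily, date-suffixed drops could be picked up in Pig with a file glob, assuming the files have already been copied into HDFS; the directory layout, file-name pattern, delimiter and schemas below are illustrative assumptions, not taken from the actual files:

-- assumed layout: /data/incoming/header_YYYYMMDD.csv, /data/incoming/file1_YYYYMMDD.csv, ...
A = LOAD '/data/incoming/header_*.csv' USING PigStorage(',')
      AS (Id:chararray, CreateDate:chararray, f1:chararray, f2:chararray);
B = LOAD '/data/incoming/file1_*.csv' USING PigStorage(',')
      AS (Id:chararray, CreateDate:chararray, t1:chararray);
-- join on the composite key (Id, CreateDate) shared by the files
B_joined = JOIN A BY (Id, CreateDate), B BY (Id, CreateDate);
B_result = FOREACH B_joined GENERATE B::Id, B::CreateDate, B::t1;
-- hypothetical HDFS output path
STORE B_result INTO '/data/output/file1_joined' USING PigStorage(',');

The glob picks up all dated drops in one pass; a per-day run would substitute a concrete date for the wildcard, for example via a -param value passed to the script.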
03-03-2017
03:43 PM
I have two files on my HDFS. One file (the latest file) contains updates to the other (the previous file). I want to check whether the values of specific columns in the latest file also exist in the previous file (i.e., whether they have the same values), and if so, replace those records of the previous file with the records from the latest file. In other words, I need to compare each record of the previous file against each record of the latest file based on specific columns; when a match is found, the whole record should be deleted from the previous file and replaced with the record from the latest file. How can I achieve this with Pig? Thanks!
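One way to express this replace-on-match logic in Pig is a left outer join of the previous file against the latest file, keeping the latest record whenever the join key matches. A minimal sketch, assuming hypothetical file names, a comma delimiter, and a simple (id, value) schema where id stands in for the specific matching columns:

prev   = LOAD 'previous_file.csv' USING PigStorage(',') AS (id:chararray, value:chararray);
latest = LOAD 'latest_file.csv' USING PigStorage(',') AS (id:chararray, value:chararray);
-- previous records with no match in the latest file keep a null latest::id after the join
joined    = JOIN prev BY id LEFT OUTER, latest BY id;
unchanged = FILTER joined BY latest::id IS NULL;
kept      = FOREACH unchanged GENERATE prev::id AS id, prev::value AS value;
-- matched records are taken from the latest file, which replaces the old versions
result = UNION kept, latest;
STORE result INTO 'merged_output' USING PigStorage(',');

This sketch assumes the latest file mostly contains updates to existing records; any brand-new ids in the latest file would also flow through via the UNION with latest.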
Labels:
- Apache Pig
02-22-2017
05:20 AM
@Pierre Villard
Thank you very much dear! You made my day!
02-21-2017
02:14 PM
1 Kudo
I have multiple files on my SFTP server with different file names (each file name has a date-time appended). I am using the ListSftp, RouteOnAttribute, FetchSftp and putHdfs processors, but I am not sure how to configure the FetchSftp processor so that all the files on the remote SFTP server are put into HDFS.
Is there any option to provide the list of file names to the "Remote File" property of the FetchSftp processor configuration? Thanks!
Labels:
- Apache NiFi
02-21-2017
12:03 PM
Thank you very much @Pierre Villard.
Your answer was really helpful.
02-21-2017
09:19 AM
I have a 7-node Kerberized HDP cluster. I have installed Apache NiFi on one of the HDP cluster nodes just for testing purposes. When I try to configure the putHdfs processor, a warning pops up. I tried to set the Kerberos properties in the processor configuration, and in addition I set nifi.kerberos.krb5.file=/etc/krb5.conf in the nifi.properties file.
What is the correct configuration (on the NiFi host or the HDFS host) for the putHdfs processor to work properly in this case? Do I need to create a Kerberos principal and keytab file for NiFi?
Which service's principal and keytab file am I required to provide in the "Kerberos Principal" and "Kerberos Keytab" fields of the putHdfs processor configuration (NiFi's or HDFS's)? Thanks.
Labels:
- Apache NiFi
- Cloudera DataFlow (CDF)
07-21-2016
01:38 PM
@Venkat ramanann In addition to @Jon Maestas' workaround, add the IP address of the host on which the PostgreSQL server is running to the pg_hba.conf file and make sure that the method is set to "trust". Note: replace <ip address> with the IP address of your host to allow connections.

# TYPE  DATABASE  USER  CIDR-ADDRESS     METHOD
# IPv4 local connections:
host    all       all   127.0.0.1/32     md5
host    all       all   <ip address>/24  trust
# IPv6 local connections:
host    all       all   ::1/128          md5

Now, restart the postgresql service. For more details, visit https://confluence.atlassian.com/confkb/confluence-unable-to-connect-to-postgresql-due-to-unconfigured-pg_hba-conf-file-300814422.html