Member since: 04-27-2016
Posts: 60
Kudos Received: 20
Solutions: 0
06-16-2016
10:01 AM
Hi experts, I'm trying to do some simple data transformations on my text files using Apache Pig. I have 80 text files in HDFS, and I want to add a new column based on the filename. I tested the code against a single text file and it works fine, but when I run it against all the files it doesn't do the job (it stays at 0% for a long time). Here is my code:
A = LOAD '/user/data' USING PigStorage(' ', '-tagFile');
STORE A INTO '/user/data/Data_Transformation/SourceFiles' USING PigStorage(' ');
In your opinion, is Pig the best way to do this? Thanks!!
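For anyone unsure what the `-tagFile` option in the script above does, here is a minimal Python sketch of the same idea: read every file in a directory and prepend the file's name as a new first column on each line. The directory layout and filenames are illustrative only, not the poster's actual data.

```python
import os

def tag_lines_with_filename(directory):
    """Prepend each file's name to its whitespace-delimited lines,
    mimicking what PigStorage's -tagFile option does in Pig."""
    tagged = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue
        with open(path) as fh:
            for line in fh:
                line = line.rstrip("\n")
                if line:
                    tagged.append(f"{name} {line}")
    return tagged
```

With `-tagFile`, Pig does this tagging itself, so the slow 0% job is more likely a cluster/resource issue than a logic issue in the script.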
Labels:
- Apache Hadoop
- Apache Pig
06-13-2016
01:13 PM
1 Kudo
Hi,
I have multiple files (in HDFS) with the same schema, and I want to aggregate all of them into a single Hive table. Each file represents a date, but I only have that information in the file title.
What is the best way to insert the file title (the date) as a new column in these files? Java? NiFi?
Thanks!
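As a sketch of one possible approach (before reaching for Java or NiFi): parse the date out of each file's title and prepend it to every record. The filename pattern below (`YYYY-MM-DD` embedded in the name) is an assumption; adjust the regex to the real naming scheme.

```python
import re

def date_from_filename(filename):
    """Extract a YYYY-MM-DD date embedded in a file name.
    The pattern is an assumption -- adapt it to the actual titles."""
    m = re.search(r"(\d{4}-\d{2}-\d{2})", filename)
    return m.group(1) if m else None

def add_date_column(filename, lines):
    """Prepend the date taken from `filename` to every record,
    tab-separated, so Hive can read it as an extra column."""
    date = date_from_filename(filename)
    return [f"{date}\t{line}" for line in lines]
```

Pig's `PigStorage(' ', '-tagFile')` can achieve something similar by emitting the filename as the first field, which you can then transform into a date.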
Labels:
- Apache Hadoop
06-06-2016
12:48 PM
Hi Paul, thanks for your attention. My goal is to do some social analysis (find patterns, etc.), which is why I want SAS too.
The subject is the relationships within a company. I have the emails, telephones, etc.
What I have:
- 5 months of data collection (Aug, Sep, Oct, Nov and Dec)
- Each text file corresponds to a day
- Each type of communication has a specific ID (for example, email has ID 1, phone has ID 2, etc.)
- Each line corresponds to an aggregation of multiple communications (separated by department and every 30 minutes)

The attributes are:
- Communication ID
- Time
- Department
- Email Code
- Phone Code
- Phone Duration

One possible line of the text file would be:
1 10:30:87 3 12 1 10:30:22 1 10:45:21 3 12 2 10:30:22 2 12 2 10:30:22 1 12 10:30:22
So as you can see, I can have multiple Communication IDs per line (that's one of my doubts about creating the Hive tables).
The total size of the text files is 6 GB.
Many thanks for your help, Paul 🙂 Hope you can understand the problem. Thanks!
06-05-2016
04:21 PM
I'm planning to analyse some data using Hadoop. I have 200 text files to analyze.
I'm thinking:
- Use Spark to load the data into HDFS (are Pig or Sqoop better?)
- Create the structure in Hive, creating the tables (basically this first data model will have 200 tables; each table will be one text file)
- Load the data into Hive (all the files)
- Do some data cleansing with Spark (I will need Spark to read from Hive) and try to reduce the amount of data
- Create the new data model in Hive (now with a smaller amount of data after the cleansing in the previous step)
- Use an analytical tool (like SAS, Tableau, etc.) to do some analytical operations (this tool will receive all the data returned by the previous step)

I believe this may not be the best way to analyze big data. My goal is, at the end of the process in Hadoop, to have a smaller data set that I can successfully integrate into SAS, for example.
What is your opinion?
Many thanks!
Labels:
- Apache Spark
05-28-2016
11:28 PM
Yes, when I think about Hadoop I mean storing the data in HDFS. I don't know what kind of advantage I could get from Spark. Data cleansing?
05-25-2016
07:12 PM
Are there any use cases that show how Hadoop and Spark work together? I've already read the theory, but I want to see something practical to get a better understanding.
Thanks!!!
Labels:
- Apache Hadoop
- Apache Spark
05-20-2016
12:41 PM
I've downloaded the cloudera-quickstart-vm-5.7.0-0-virtualbox virtual machine for my big data project. On my PC I have two zip files (2 GB each) that contain my source data (a lot of txt files). I need to upload these files to HDFS in the virtual machine, but I'm running into trouble when I try to copy/drag the txt files into the virtual machine. I was thinking of loading the files directly into HDFS (without using Sqoop, for example), so my question is: is there a way to load the source data from my local PC into HDFS? Java? Sqoop? There are a lot of txt files... Thanks!
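Once the zip files are on the VM (for example via a VirtualBox shared folder or `scp`), the HDFS shell can do the load directly; Sqoop is for relational databases, not local files. A rough command sketch, with illustrative paths that need adjusting to the actual VM layout:

```shell
# Unzip locally on the VM first, then push the files into HDFS.
# All paths here are examples, not the real ones.
unzip source_data_1.zip -d ~/source_data
unzip source_data_2.zip -d ~/source_data
hdfs dfs -mkdir -p /user/cloudera/source_data
hdfs dfs -put ~/source_data/*.txt /user/cloudera/source_data/
```

These commands require a running Hadoop environment, so they are a sketch rather than something to copy verbatim.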
Labels:
- Apache Sqoop
- HDFS
05-17-2016
09:04 AM
Hi Lester, many thanks for your attention 🙂 I was thinking of using Sqoop to get my data into the correct format, but I think it will be better, in terms of simplicity and speed, to put the files directly on HDFS.
When I talk about segmentation, I was thinking of cluster analysis, basically dividing the data into smaller data sets. However, I think I can do that in Hive.
Many thanks!!!
05-16-2016
09:05 PM
Hello experts, I have two simple questions. First: in your opinion, which is the best way to load data into HDFS (my source data are txt files)? Pig, Sqoop, directly into HDFS, etc.
Second: is Spark a good option for doing some data transformation and segmentation?
Thanks!
05-11-2016
12:26 PM
Sean, many thanks for your response. Does this machine have PySpark?