Member since: 01-21-2018
Posts: 58
Kudos Received: 4
Solutions: 3
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 2293 | 09-23-2017 03:05 AM
 | 988 | 08-31-2017 08:20 PM
 | 4333 | 05-15-2017 06:06 PM
05-11-2018
04:01 PM
Hello everyone, I have a situation and I would like to count on the community's advice and perspective. I'm working with PySpark 2.0 and Python 3.6 in an AWS environment with Glue. I need to pull historical information spanning many years and then join the results of a bunch of previous queries. So I decided to create a DataFrame for every query, so that I could easily iterate over the years and months I want to go back to and create the DFs on the fly. The problem comes up when I need to join the DFs created in the loop: I use the same DF name within the loop, and when I try to build a DF name inside the loop, that name is read as a string, not really as a DF, so I can't join them later. So far my code looks like:

query_text = 'SELECT * FROM TABLE WHERE MONTH = {}'
months = [1, 2]
frame_list = []
for item in months:
    df = 'cohort_2013_{}'.format(item)
    query = query_text.format(item)
    frame_list.append(df)  # I intend to keep the DF names in a list so I can recall them later
    df = spark.sql(query)
    df = DynamicFrame.fromDF(df, glueContext, "df")
    applyformat = ApplyMapping.apply(frame=df, mappings=[
        ("field1", "string", "field1", "string"),
        ("field2", "string", "field2", "string")],
        transformation_ctx="applyformat")

for df in frame_list:
    # create a join query for all the created DFs

Please, if someone knows how I could achieve this requirement, let me know your ideas. Thanks so much.
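For illustration, a minimal sketch of one way this could go: keep the DataFrame objects themselves in a dict keyed by the generated name, and join them afterwards. It assumes Spark 2.x, that the source table is already registered, and a shared join key called "id" (the query text and the key are placeholders, not from my real data).

from functools import reduce
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cohort-joins").getOrCreate()

query_text = 'SELECT * FROM TABLE WHERE MONTH = {}'  # placeholder query
months = [1, 2]

# Keep the DataFrame objects themselves, keyed by a generated name,
# instead of keeping only the name strings.
frames = {}
for item in months:
    name = 'cohort_2013_{}'.format(item)
    frames[name] = spark.sql(query_text.format(item))

# Join every collected DataFrame on a shared key (assumed here to be "id").
joined = reduce(lambda left, right: left.join(right, on='id'), frames.values())
joined.show()

Each frame can still be retrieved by name (e.g. frames['cohort_2013_1']) before or after the join.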
Labels:
- Apache Spark
02-25-2018
09:14 PM
Sorry, sometimes I don't read things completely and then an issue comes up 😞 It works seamlessly!
02-25-2018
08:50 PM
Hello everyone, my HDP 2.6 sandbox is already running like never before; actually, it's super cool. However, I'm not able to connect to Hive through Beeline, because I need to change my login to maria_dev (or any other user) in the shell. I tried doing su maria_dev (or any other user), but it doesn't go through. Take a look at the attached image. Please, if anyone can give me an idea about how to log in under the maria_dev credentials, I'd appreciate it, because I need to run Spark SQL and Spark scripts. Thanks so much.
Labels:
- Apache Hive
01-12-2018
06:16 PM
Do you know of any tutorial or documentation about the right way to set up the required services, at least Spark2 and HDFS? My VM is failing over and over, even if I turn off maintenance mode, and I'm not able to start the services. Thanks so much.
01-12-2018
03:23 PM
1 Kudo
Guys, thanks so much, I'm already in. I also want to include this amazing tutorial, with enough documentation to play with our sandbox: https://github.com/hortonworks/data-tutorials/blob/master/tutorials/hdp/learning-the-ropes-of-the-hortonworks-sandbox/tutorial.md. Thanks so much @Julián Rodríguez @Edgar Orendain
01-12-2018
03:16 PM
@Edgar Orendain @Julián Rodríguez Guys, I already got the machine up through my browser and PuTTY as you specified, but where can I find info about how to access it through Ambari, and about the user and password I can use over SSH? Thanks so much.
01-12-2018
02:06 PM
I'm going to try the suggested workaround; however, this is not the first time this has happened to me. Months ago I was playing with the HDP 2.4 VM and it was the same. I will keep you posted. My goal is really to play with Spark in HDP; I don't know if HDF can work for that purpose. Thanks.
01-09-2018
10:07 PM
Hello @Jay Kumar SenSharma, I get the same issue. Please take a look at my current host configuration: OS: Windows 10, VirtualBox 5.4.2, RAM 16 GB, hard disk 1 TB. Besides the issue with the downloaded image that says Docker (I re-downloaded it with the right MD5), my machine never starts up; 30 minutes later, see how it looks.
Labels:
- Hortonworks Data Platform (HDP)
01-09-2018
05:07 PM
Hello @Jay Kumar SenSharma, I get the same issue. Please take a look at my current host configuration: OS: Windows 10, VirtualBox 5.4.2, RAM 16 GB, hard disk 1 TB. Besides the issue with the downloaded image that says Docker (I re-downloaded it with the right MD5), my machine never starts up; 30 minutes later, see how it looks. #tutorial-350 #hdp-2.6.0
09-23-2017
03:05 AM
Hi guys, I'm so-so... Well, I just remembered that you can simply create an external table over the folder where all the files with the same structure are located. That way, I load all the records in one shot.

CREATE EXTERNAL TABLE bixi_his (
  STATIONS ARRAY<STRUCT<id:INT, s:STRING, n:STRING, st:STRING, b:STRING, su:STRING, m:STRING, lu:STRING, lc:STRING, bk:STRING, bl:STRING, la:FLOAT, lo:FLOAT, da:INT, dx:INT, ba:INT, bx:INT>>,
  SCHEMESUSPENDED STRING,
  TIMELOAD BIGINT
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION '/user/ingenieroandresangel/datasets/bixi2017/';

Thanks
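As a rough aside, the same folder-level idea can be sketched in PySpark: pointing the JSON reader at the directory picks up every file inside it, analogous to the external table defined over the folder above. A minimal sketch, assuming a working Spark 2.x session (the path is the one from the DDL; the app name is arbitrary):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bixi-folder-load").getOrCreate()

# Reading the directory loads every JSON file in it into one DataFrame,
# mirroring the folder-level external table above.
bixi = spark.read.json("/user/ingenieroandresangel/datasets/bixi2017/")
bixi.printSchema()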
09-23-2017
01:02 AM
Look, I'm trying to analyze many files with just one Hive table. Key insights: I'm working with JSON files, and the table structure is:

CREATE EXTERNAL TABLE test1 (
  STATIONS ARRAY<STRING>,
  SCHEMESUSPENDED STRING,
  TIMELOAD TIMESTAMP
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION '/user/andres/hive/bixihistorical/';

I need to load around 50 files, all of them with the same structure. I have tried things like:

LOAD DATA INPATH '/user/andres/datasets/bixi2017/*.json' OVERWRITE INTO TABLE test1;
LOAD DATA INPATH '/user/andres/datasets/bixi2017/*' OVERWRITE INTO TABLE test1;
LOAD DATA INPATH '/user/andres/datasets/bixi2017/' OVERWRITE INTO TABLE test1;

None of those have worked. Any idea, guys, about how I should proceed? Thanks so much
Labels:
- Apache Hive
09-08-2017
01:18 PM
Both VMware and VirtualBox, 13 GB RAM and bridged network.
09-08-2017
12:36 PM
I moved forward with VirtualBox and got the same results: I can't start up my VM. I don't get any error; the VM just stays loading the OS the whole time.
09-08-2017
01:14 AM
Hey everyone, I also have a problem setting up my sandbox. I'm on Windows 10 and I gave my VM 13 GB RAM and 2 processors. I have enough storage, but whenever I start it up I get this message: And then the machine gets stuck for a long time loading the CentOS OS. Could someone please let me know how I could figure it out? Thanks
08-31-2017
08:20 PM
Hi guys, I want to post the solution. Finally, I added the options below to my Flume file:

TwitterAgent.sources.Twitter.maxBatchSize = 50000
TwitterAgent.sources.Twitter.maxBatchDurationMillis = 100000

Thanks
08-31-2017
06:46 PM
Hello guys, I've used Flume to catch a few tweets. My agent and the whole Flume configuration run pretty well; my key point is when I need to read the output file with Hive. I created an Avro schema file to reuse in Hive to create the table that stores the Flume data (the Flume output file comes in Avro format). Once the table in Hive was ready, I tried to check it and confirm the right format, and that looks good, as you can see in the attached file tweettable.jpg. Then I ran the command to load the Flume data into this table and, according to the result message in Hive, that was also performed as expected, even if numRows is marked as 0 (attached file load.jpg). Now, finally, when I try to read the data, I get an error message saying that it's not possible to read this data. Unfortunately, I don't understand why, guys. Please, if someone can give me a hand with that, I'd really appreciate it (attached file result.jpg). If you need more details about the scripts and everything I used in this test, I've posted all the info in a GitHub repository: https://github.com/AndresUrregoAngel/Flume_Twitter. Thanks so much
Labels:
- Apache Flume
- Apache Hive
08-16-2017
06:48 PM
You are so amazing, I really appreciate each of your comments and the time you have put in. Thanks so much. Just to let you know, buddy, the part I forgot to tell you is that before going to Pig, I load the file information into a Hive table within the DB POC. That's why I used:

july = LOAD 'POC.july' USING org.apache.hive.hcatalog.pig.HCatLoader;

So the data coming from Hive already has a format, and the relation in Pig will match the same schema. The problem is that even after setting a schema for the output, I'm not able to store this outcome in a Hive table 😞 So, to reproduce my real scenario, you should:

1. Load the CSV file into HDFS without headers (I delete them beforehand to avoid filters); run: tail -n +2 OD_XXX.csv >> july.csv

2. Create the table and load the file in Hive:

create table july (
  start_date string,
  start_station int,
  end_date string,
  end_station int,
  duration int,
  member_s int)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

LOAD DATA INPATH '/user/andresangel/datasets/july.CSV'
OVERWRITE INTO TABLE july;

3. Follow my script posted above to the end to try to store the final outcome in a Hive table 🙂 Thanks, buddy @Dinesh Chitlangia
08-16-2017
05:39 PM
1 Kudo
Yes, exactly, you're right, but that's only for import; for the export operation it's --export-dir 🙂 That's how it works.
08-16-2017
04:00 PM
Checking the official documentation (link here), they suggest that each table automatically gets a separate output folder, with the data stored under the default HDFS path of the user who performs the operation.

$ sqoop import-all-tables --connect jdbc:mysql://db.foo.com/corp
$ hadoop fs -ls
Found 4 items
drwxr-xr-x - someuser somegrp 0 2010-04-27 17:15 /user/someuser/EMPLOYEES
drwxr-xr-x - someuser somegrp 0 2010-04-27 17:15 /user/someuser/PAYCHECKS
drwxr-xr-x - someuser somegrp 0 2010-04-27 17:15 /user/someuser/DEPARTMENTS
drwxr-xr-x - someuser somegrp 0 2010-04-27 17:15 /user/someuser/OFFICE_SUPPLIES
08-14-2017
07:54 PM
Hi everyone, I already have a Hive table called roles. I need to update this table with info coming from MySQL. So I used this script, thinking it would add new data and update existing data in my Hive table:

sqoop import --connect jdbc:mysql://xxxx/retail_export --username xxxx --password xxx \
--table roles --split-by id_emp --check-column id_emp --last-value 5 --incremental append \
--target-dir /user/ingenieroandresangel/hive/roles --hive-import --hive-database poc --hive-table roles

Unfortunately, that only inserts the new data, but I can't update the records that already exist. Before you ask, a couple of statements:
* the table doesn't have a PK
* if I don't specify --last-value as a parameter, I get duplicated records for those that already exist.
How could I figure this out without truncating the table or recreating it with a PK? Is there a way? Thanks, guys.
Labels:
- Apache Hive
- Apache Sqoop
08-14-2017
07:09 PM
Hi guys, I would like to know whether in Sqoop it's possible to load many files at the same time, the way we load all the files in a folder into a Pig relation just by indicating something like this in the load:

test = LOAD '/user/me/datasets/*' USING PigStorage(',');

I have tried to apply the same logic in a Sqoop statement to load or update, in one shot, many files with the same structure into a MySQL table, with this code:

sqoop export --connect jdbc:mysql://nn01.itversity.com/retail_export --username retail_dba --password itversity \
--table roles --update-key id_emp --update-mode allowinsert --export-dir /user/ingenieroandresangel/datasets/export/* \
-m 1

The weird thing in my scenario is that at the end of my statement the system shows this message:

17/08/14 15:05:02 INFO mapreduce.Job: Counters: 8
Job Counters
Failed map tasks=1
Launched map tasks=1
Rack-local map tasks=1
Total time spent by all maps in occupied slots (ms)=5548
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=2774
Total vcore-milliseconds taken by all map tasks=2774
Total megabyte-milliseconds taken by all map tasks=4260864
17/08/14 15:05:02 WARN mapreduce.Counters: Group FileSystemCounters is deprecated. Use org.apache.hadoop.mapreduce.FileSystemCounter instead
17/08/14 15:05:02 INFO mapreduce.ExportJobBase: Transferred 0 bytes in 19.3098 seconds (0 bytes/sec)
17/08/14 15:05:02 WARN mapreduce.Counters: Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead
17/08/14 15:05:02 INFO mapreduce.ExportJobBase: Exported 0 records.
17/08/14 15:05:02 ERROR mapreduce.ExportJobBase: Export job failed!
17/08/14 15:05:02 ERROR tool.ExportTool: Error during export: Export job failed!

Yet the table in MySQL has loaded all the records from my two proposed txt files, so in theory it worked even with this error, but I'm not sure. So please, if someone can explain that behavior in a technical sense, I'd appreciate it. Thanks.
Labels:
- Apache Sqoop
08-14-2017
04:07 PM
I couldn't upload it here because it's a little too huge. So please download it from my OneDrive just by clicking here. Thanks, buddy!
08-14-2017
02:02 PM
Thanks so much @Dinesh Chitlangia. I finally set the output format like this:

GENERATE FLATTEN(group) AS (day, code_station), (int)total_dura as (total_dura:int), (float)avg_dura as (avg_dura:float), (int)qty_trips as (qty_trips:int);

Now, before storing the output in Hive, I created the table below:

CREATE TABLE july_analysis
(day int, code_station int, total_dura double, avg_dura float, qty_trips int)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

My problem now is when I try to store the data, because I get back a message saying:

STORE july_result INTO 'poc.july_analysis' USING org.apache.hive.hcatalog.pig.HCatStorer();
2017-08-14 09:56:55,712 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1002: Unable to store alias july_result

I saved the output as a file to confirm that everything was coming out right, and that worked. Also, when I opened my Pig console, I typed pig -x tez -useHCatalog. Thanks for all the info you can provide, I appreciate it. Andres U.
08-10-2017
11:10 PM
Here is my deal today. Well, I created a relation as the result of a couple of transformations after reading the relation from Hive. The thing is that I want to store the final relation, after some analysis, back in Hive, but I can't. Let's look at my code to make it clearer. The first part is where I LOAD from Hive and transform my result:

july = LOAD 'POC.july' USING org.apache.hive.hcatalog.pig.HCatLoader;
july_cl = FOREACH july GENERATE GetDay(ToDate(start_date)) as day:int, start_station, duration;
jul_cl_fl = FILTER july_cl BY day == 31;
july_gr = GROUP jul_cl_fl BY (day, start_station);
july_result = FOREACH july_gr {
    total_dura = SUM(jul_cl_fl.duration);
    avg_dura = AVG(jul_cl_fl.duration);
    qty_trips = COUNT(jul_cl_fl);
    GENERATE FLATTEN(group), total_dura, avg_dura, qty_trips;
};

So now, when I try to store the relation july_result, I can't, because the schema has changed and I suppose it's no longer compatible with Hive:

STORE july_result INTO 'poc.july_analysis' USING org.apache.hive.hcatalog.pig.HCatStorer();

Even though I have tried to set a special schema for the final relation, I haven't figured it out:

july_result = FOREACH july_gr {
    total_dura = SUM(jul_cl_fl.duration);
    avg_dura = AVG(jul_cl_fl.duration);
    qty_trips = COUNT(jul_cl_fl);
    GENERATE FLATTEN(group) as (day:int), total_dura as (total_dura:int), avg_dura as (avg_dura:int), qty_trips as (qty_trips:int);
};

PS: the table in Hive already exists!
Labels:
- Apache Hive
- Apache Pig
06-23-2017
07:18 PM
Thanks so much @Lester Martin, I appreciate your help. It works now; I replaced my statement with yours and it worked:

salaries_cl = FOREACH salaries_fl GENERATE (int)year as year:int, $1, $2, $3, (long)salary as salary:long;

Weird that the other one didn't work, but well, thanks so much.
06-20-2017
01:07 PM
Hey everyone, my case today is a little weird for me because, as far as I can tell, I'm running the right scripts, but anyway. The thing is that I need to load a file with a char structure, but then I need to delete the headers and set the right structure for the relation. Here is my code: Finally, I'm trying to perform a sum using this new structure, but according to the log, Pig always returns an error casting the columns to the new format (long for salary and int for the year), so it's like Pig isn't able to pick up the new structure. Error: Could one of our gurus let me know the right way to get a clean transformation and run my scripts? Thanks so much
Labels:
- Apache Pig
06-14-2017
01:55 PM
Did you create the user that you use in Hive, or did someone else create it for you?
06-14-2017
01:26 PM
A couple of questions from my side. Just to let you know my scenario, I'm playing with a single-node configuration in a virtual machine with Hortonworks services such as Hive, Pig, etc.
* Do you use a specific user, created to get access to Hive in your cluster, using a view?
* Did you follow this tutorial? here
06-14-2017
01:15 PM
Have you already configured the ODBC connection parameters using the Hortonworks driver, and did it work? I'm reaching out to you because I'm trying that and I don't know how to do it. Thanks.
06-13-2017
02:43 AM
Any update, buddy?