Member since: 01-21-2018
Posts: 58
Kudos Received: 4
Solutions: 3
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 3442 | 09-23-2017 03:05 AM |
 | 1682 | 08-31-2017 08:20 PM |
 | 6402 | 05-15-2017 06:06 PM |
08-13-2020
12:42 AM
While starting the Hortonworks Sandbox it gets stuck on "extracting and loading the hortonworks sandbox..." and after some time it shows a critical error message, or sometimes it says "your system has run into an error, we'll restart it".
08-22-2019
10:43 AM
Did you find a solution to this?
05-11-2018
04:01 PM
Hello everyone, I have a situation and I would like to count on the community's advice and perspective. I'm working with PySpark 2.0 and Python 3.6 in an AWS environment with Glue. I need to pull historical information for many years and then apply a join across a bunch of previous queries. So I decided to create a DF for every query, so that I could easily iterate over the years and months I want to go back and create the DFs on the fly. The problem comes up when I need to join the DFs created in the loop: because I reuse the same DF name within the loop, and when I try to build a DF name dynamically it is read as a string rather than an actual DF, I cannot join them later. So far my code looks like:

    # AWS Glue job context assumed
    from awsglue.transforms import ApplyMapping
    from awsglue.dynamicframe import DynamicFrame

    query_text = 'SELECT * FROM TABLE WHERE MONTH = {}'
    months = [1, 2]
    frame_list = []
    for item in months:
        df = 'cohort_2013_{}'.format(item)
        query = query_text.format(item)
        frame_list.append(df)  # I intend to keep the DF names in a list to recall them later
        df = spark.sql(query)
        df = DynamicFrame.fromDF(df, glueContext, "df")
        applyformat = ApplyMapping.apply(frame = df, mappings =
            [("field1", "string", "field1", "string"),
             ("field2", "string", "field2", "string")],
            transformation_ctx = "applyformat")

    for df in frame_list:
        pass  # here I need to create a join query across all the created DFs

Please, if someone knows how I could achieve this requirement, let me know your ideas. Thanks so much.
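A minimal sketch of one way around this, keeping the DataFrame objects themselves (rather than their names) in a dictionary and joining them afterwards; the join key user_id and the inner join are assumptions to adapt to the real schema:

    # Sketch only: assumes an existing SparkSession `spark`, the same source
    # table as above, and a hypothetical join key column `user_id`.
    from functools import reduce

    query_text = 'SELECT * FROM TABLE WHERE MONTH = {}'
    months = [1, 2]

    frames = {}
    for item in months:
        name = 'cohort_2013_{}'.format(item)
        frames[name] = spark.sql(query_text.format(item))  # store the DataFrame object itself

    # Join all collected DataFrames pairwise on the assumed key.
    joined = reduce(
        lambda left, right: left.join(right, on='user_id', how='inner'),
        frames.values())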
Labels:
- Apache Spark
02-25-2018
09:14 PM
Sorry, sometimes I don't read things completely and an issue comes up 😞 It works seamlessly!
01-12-2018
06:53 PM
@Andres Urrego Regarding the VM failing, is it the services shutting down on their own and not staying up? One common cause of this is not enough memory: to reduce resource usage, try turning off all services and starting only HDFS, ZooKeeper, YARN and Spark. Also make sure that you give your VM at least 8GB of RAM (https://hortonworks.com/tutorial/sandbox-deployment-and-install-guide shows how). As far as documentation for Spark2/HDFS goes, here is a good Spark2 starter tutorial followed by a Spark2/HDFS project walkthrough:
https://hortonworks.com/tutorial/hands-on-tour-of-apache-spark-in-5-minutes/#option-2-download-and-setup-hortonworks-data-platform-hdp-sandbox
https://hortonworks.com/tutorial/sentiment-analysis-with-apache-spark/
09-23-2017
03:05 AM
Hi guys, I'm so so.... Well, I just remembered that you can create an external table over the folder where all the files with the same structure are located. That way I can load all the records in one shot.

    CREATE EXTERNAL TABLE bixi_his
    (
      STATIONS ARRAY<STRUCT<id:INT,s:STRING,n:STRING,st:STRING,b:STRING,su:STRING,m:STRING,lu:STRING,lc:STRING,bk:STRING,bl:STRING,la:FLOAT,lo:FLOAT,da:INT,dx:INT,ba:INT,bx:INT>>,
      SCHEMESUSPENDED STRING,
      TIMELOAD BIGINT
    )
    ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
    LOCATION '/user/ingenieroandresangel/datasets/bixi2017/';

Thanks
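A small sketch of querying this table and flattening the nested STATIONS array, assuming a Hive-enabled SparkSession named spark that can see bixi_his:

    # Sketch: flatten the STATIONS array of structs defined in the DDL above.
    flat = spark.sql("""
        SELECT s.id, s.la, s.lo, timeload
        FROM bixi_his
        LATERAL VIEW explode(stations) t AS s
    """)
    flat.show(5)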
08-31-2017
08:20 PM
Hi guys, I want to post the solution. Finally, I added the options below to my Flume configuration file:

    TwitterAgent.sources.Twitter.maxBatchSize = 50000
    TwitterAgent.sources.Twitter.maxBatchDurationMillis = 100000

Thanks
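For context, a sketch of where these two options sit in a Flume agent file for the Apache Twitter source; the agent, source and channel names and the placeholder credentials are assumptions, and the sink configuration is omitted:

    # flume.conf sketch (placeholders, not the original file)
    TwitterAgent.sources = Twitter
    TwitterAgent.channels = MemChannel
    TwitterAgent.channels.MemChannel.type = memory

    TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
    TwitterAgent.sources.Twitter.channels = MemChannel
    TwitterAgent.sources.Twitter.consumerKey = <consumer-key>
    TwitterAgent.sources.Twitter.consumerSecret = <consumer-secret>
    TwitterAgent.sources.Twitter.accessToken = <access-token>
    TwitterAgent.sources.Twitter.accessTokenSecret = <access-token-secret>
    TwitterAgent.sources.Twitter.maxBatchSize = 50000
    TwitterAgent.sources.Twitter.maxBatchDurationMillis = 100000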
08-28-2017
07:49 PM
Thank you @Nandish B Naidu..!! The solution worked.
08-15-2017
10:54 PM
1 Kudo
@Andres Urrego, What you are looking for (UPSERTs) isn't available in Sqoop import. There are several approaches to actually updating data in Hive. One of them is described here: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.1/bk_data-access/content/incrementally-updating-hive-table-with-sqoop-and-ext-table.html Other approaches include side loading and merging as post-Sqoop or scheduled jobs/processes. You can also look at Hive ACID transactions, or at the Hive-HBase integration package. Choosing the right approach is not trivial and depends on: initial volume, incremental volumes, frequency of incremental jobs, probability of updates, ability to identify uniqueness of records, acceptable latency, etc.
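A rough PySpark sketch of the "side load and merge" idea mentioned above; the table names base_table and incr_table, the key column id and the last_modified timestamp column are all assumptions to adapt to the real data:

    # Sketch: reconcile a Sqoop-loaded increment with the base table by keeping
    # the newest row per key. Assumes an existing SparkSession `spark` and
    # matching schemas.
    from pyspark.sql import Window
    from pyspark.sql import functions as F

    base = spark.table('base_table')
    incr = spark.table('incr_table')

    w = Window.partitionBy('id').orderBy(F.col('last_modified').desc())

    merged = (base.union(incr)                  # columns must line up in the same order
                  .withColumn('rn', F.row_number().over(w))
                  .filter(F.col('rn') == 1)
                  .drop('rn'))

    # Write the reconciled result to a staging table that can replace the base.
    merged.write.mode('overwrite').saveAsTable('base_table_merged')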
08-16-2017
06:48 PM
You are so amazing, I really appreciate each of your comments and the time you have put in. Thanks so much. Just to let you know, buddy, the part that I forgot to tell you is that before going to Pig I load the file information into a Hive table within the DB POC. That is why I used:

    july = LOAD 'POC.july' USING org.apache.hive.hcatalog.pig.HCatLoader();

So the data coming from Hive already has a schema and the relation in Pig will match it. The problem is that even after setting a schema for the output I'm not able to store the outcome in a Hive table 😞 . So to reproduce my real scenario you should:

1. Load the CSV file in HDFS without headers (I delete them beforehand to avoid filters). Run:

    tail -n +2 OD_XXX.csv >> july.csv

2. Create the table and load the file in Hive:

    create table july (
      start_date string,
      start_station int,
      end_date string,
      end_station int,
      duration int,
      member_s int)
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    LINES TERMINATED BY '\n'
    STORED AS TEXTFILE;

    LOAD DATA INPATH '/user/andresangel/datasets/july.CSV'
    OVERWRITE INTO TABLE july;

3. Follow my script posted above to the end to try to store the final outcome in a Hive table 🙂 Thanks buddy @Dinesh Chitlangia