Member since 01-21-2018
- 58 Posts
- 4 Kudos Received
- 3 Solutions
        My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
|  | 3966 | 09-23-2017 03:05 AM |
|  | 2010 | 08-31-2017 08:20 PM |
|  | 7252 | 05-15-2017 06:06 PM |
08-13-2020 12:42 AM
While starting the Hortonworks sandbox, it gets stuck on "extracting and loading the hortonworks sandbox..." After some time it shows a critical error message, or sometimes it says "your system has run into an error, we'll restart it".
08-22-2019 10:43 AM
Did you find a solution to this?
05-11-2018 04:01 PM
Hello everyone,

I have a situation and I would like to get the community's advice and perspective. I'm working with PySpark 2.0 and Python 3.6 in an AWS environment with Glue.

I need to pull historical information spanning many years and then apply a join across a bunch of previous queries. So I decided to create a DataFrame for every query, so that I can easily iterate over the years and months I want to go back and create the DataFrames on the fly.

The problem comes up when I need to join the DataFrames created in the loop: I use the same DataFrame name within the loop, and if I try to build a DataFrame name inside the loop, the name is read as a string, not really as a DataFrame, so I cannot join them later.

So far my code looks like:

query_text = 'SELECT * FROM TABLE WHERE MONTH = {}'
months = [1, 2]
frame_list = []
for item in months:
    df = 'cohort_2013_{}'.format(item)
    query = query_text.format(item)
    frame_list.append(df)  # I intend to keep the DF names in a list so I can recall them later
    df = spark.sql(query)
    df = DynamicFrame.fromDF(df, glueContext, "df")
    applyformat = ApplyMapping.apply(frame = df, mappings =
        [("field1", "string", "field1", "string"),
         ("field2", "string", "field2", "string")],
        transformation_ctx = "applyformat")
for df in frame_list:
    pass  # create a join query for all the created DFs

Please, if someone knows how I could achieve this requirement, let me know your ideas. Thanks so much.
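A minimal sketch of one way around this, assuming the goal is simply to combine the monthly results (the table name and the use of union are assumptions, not part of the original question): keep the DataFrame objects themselves in a list rather than strings holding their names, then fold the list together with functools.reduce. Replace the union with a join on a key column if a join is what is actually needed.

from functools import reduce

# Assumes the spark session from the Glue job is already available.
query_text = 'SELECT * FROM TABLE WHERE MONTH = {}'  # hypothetical table
months = [1, 2]

# Store the DataFrame objects themselves, not strings with their names.
frames = [spark.sql(query_text.format(m)) for m in months]

# Fold the list into a single DataFrame; union() stacks the monthly results.
combined = reduce(lambda a, b: a.union(b), frames)
combined.show()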
						
					
		
			
				
						
Labels: Apache Spark
    
	
		
		
02-25-2018 09:14 PM
Sorry, sometimes I don't read things completely and an issue comes up 😞 It works seamlessly!
01-12-2018 06:53 PM
@Andres Urrego Regarding the VM failing, is it the services shutting down on their own and not staying up? One common cause of this is not enough memory. To reduce resource usage, try turning off all services and starting only HDFS, ZooKeeper, YARN, and Spark. Also make sure that you give your VM at least 8 GB of RAM (https://hortonworks.com/tutorial/sandbox-deployment-and-install-guide shows how).

As far as documentation for Spark2/HDFS goes, here is a good Spark2 starter tutorial followed by a Spark2/HDFS project walkthrough:

https://hortonworks.com/tutorial/hands-on-tour-of-apache-spark-in-5-minutes/#option-2-download-and-setup-hortonworks-data-platform-hdp-sandbox

https://hortonworks.com/tutorial/sentiment-analysis-with-apache-spark/
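If you find yourself toggling services often, one way to script it is through Ambari's REST API. This is only a rough sketch: the endpoint, the admin/admin credentials, and the cluster name "Sandbox" are assumptions about a default sandbox setup, and it ignores service dependency ordering.

import requests

# Assumptions: Ambari on localhost:8080, default admin/admin credentials,
# and a cluster named "Sandbox" -- adjust all of these to your environment.
AMBARI = 'http://localhost:8080/api/v1/clusters/Sandbox'
AUTH = ('admin', 'admin')
HEADERS = {'X-Requested-By': 'ambari'}

def set_service_state(service, state):
    # 'INSTALLED' means stopped, 'STARTED' means running in Ambari terms.
    body = {'RequestInfo': {'context': 'Set {} to {}'.format(service, state)},
            'Body': {'ServiceInfo': {'state': state}}}
    r = requests.put('{}/services/{}'.format(AMBARI, service),
                     json=body, auth=AUTH, headers=HEADERS)
    r.raise_for_status()

# Keep only the core services running to save memory.
keep = {'HDFS', 'ZOOKEEPER', 'YARN', 'SPARK2'}
services = requests.get(AMBARI + '/services', auth=AUTH).json()['items']
for item in services:
    name = item['ServiceInfo']['service_name']
    set_service_state(name, 'STARTED' if name in keep else 'INSTALLED')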
				
			
			
			
			
			
			
			
			
			
		
			
    
	
		
		
09-23-2017 03:05 AM
					
Hi guys, I'm answering my own question. I just remembered that you can create an external table stored in the same folder where all the files with the same structure are located. That way I can load all the records in one shot.

CREATE EXTERNAL TABLE bixi_his (
  STATIONS ARRAY<STRUCT<id:INT, s:STRING, n:STRING, st:STRING, b:STRING, su:STRING, m:STRING, lu:STRING, lc:STRING, bk:STRING, bl:STRING, la:FLOAT, lo:FLOAT, da:INT, dx:INT, ba:INT, bx:INT>>,
  SCHEMESUSPENDED STRING,
  TIMELOAD BIGINT
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION '/user/ingenieroandresangel/datasets/bixi2017/';

Thanks.
08-31-2017 08:20 PM
					
Hi guys, I want to post the solution. I finally added the options below to my Flume configuration file:

TwitterAgent.sources.Twitter.maxBatchSize = 50000
TwitterAgent.sources.Twitter.maxBatchDurationMillis = 100000

Thanks.
08-28-2017 07:49 PM
					
Thank you @Nandish B Naidu! The solution worked.
08-15-2017 10:54 PM

1 Kudo
@Andres Urrego, what you are looking for (UPSERTs) isn't available in Sqoop import. There are several approaches to actually updating data in Hive. One of them is described here: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.1/bk_data-access/content/incrementally-updating-hive-table-with-sqoop-and-ext-table.html

Other approaches use a side load and merge as post-Sqoop or scheduled jobs/processes. You can also look at Hive ACID transactions, or at the Hive-HBase integration package. Choosing the right approach is not trivial and depends on: initial volume, incremental volumes, frequency of incremental jobs, probability of updates, ability to identify uniqueness of records, acceptable latency, etc.
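To make the "side load and merge" idea concrete, here is a hedged PySpark sketch of the general pattern rather than the linked Hortonworks procedure: the table names, the id key column, and the last_updated timestamp column are all hypothetical.

from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Hypothetical names: the base Hive table and a freshly sqooped increment.
base = spark.table('poc.customers')
increment = spark.table('poc.customers_increment')

# Union old and new rows, then keep only the latest version of each id.
w = Window.partitionBy('id').orderBy(F.col('last_updated').desc())
merged = (base.unionByName(increment)
              .withColumn('rn', F.row_number().over(w))
              .filter('rn = 1')
              .drop('rn'))

# Write the merged result to a staging table, then swap it into place.
merged.write.mode('overwrite').saveAsTable('poc.customers_merged')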
				
			
			
			
			
			
			
			
			
			
		
			
    
	
		
		
08-16-2017 06:48 PM
					
You are so amazing, I really appreciate each of your comments and the time you have put in. Thanks so much. Just to let you know, the part that I forgot to tell you is that before going to Pig I load the file information into a Hive table within the DB POC. That is why I used:

july = LOAD 'POC.july' USING org.apache.hive.hcatalog.pig.HCatLoader;

The data coming from Hive already has a format, and the relation in Pig will match the same schema. The problem is that even after setting a schema for the output I'm not able to store the outcome in a Hive table 😞 So to reproduce my real scenario you should:

1. Load the CSV file into HDFS without headers (I delete them beforehand to avoid filters):

tail -n +2 OD_XXX.csv >> july.csv

2. Create the table and load the file in Hive:

create table july (
  start_date string,
  start_station int,
  end_date string,
  end_station int,
  duration int,
  member_s int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

LOAD DATA INPATH '/user/andresangel/datasets/july.CSV'
OVERWRITE INTO TABLE july;

3. Follow my script posted above to the end to try to store the final outcome in a Hive table 🙂

Thanks buddy @Dinesh Chitlangia
        













