Member since 06-20-2016

488 Posts
433 Kudos Received
118 Solutions

        My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 3601 | 08-25-2017 03:09 PM |
| | 2501 | 08-22-2017 06:52 PM |
| | 4192 | 08-09-2017 01:10 PM |
| | 8969 | 08-04-2017 02:34 PM |
| | 8946 | 08-01-2017 11:35 AM |

09-09-2016 09:47 PM

@lgeorge Excellent. Thank you. I noticed that 2.4 also had %psql but 2.5 did not. Thoughts or comments?

09-09-2016 09:35 PM

In the 2.4 sandbox, the Zeppelin interpreters include %hive and %phoenix (but not %jdbc), while in the 2.5 TP sandbox there is a %jdbc interpreter (but not %hive and %phoenix). The 2.4 sandbox uses Zeppelin 0.6.0.2.5.0.0-817 whereas the 2.5 sandbox uses Zeppelin 0.6.0.2.4.0.0-169. Is this just the way the Zeppelin 0.6.0 minor versions were packaged for the sandboxes, or does Zeppelin GA (which is 0.6.0) only use %jdbc? Note that in the 2.5 sandbox I was able to create Hive tables with %jdbc identically to creating them with %hive.
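For illustration only, this is the kind of paragraph pair I mean: the first run via %jdbc on the 2.5 TP sandbox, the second via %hive on the 2.4 sandbox. The table name is a throwaway placeholder, and the hive prefix is an assumption about how the %jdbc interpreter is configured (a bare %jdbc would use its default connection):

%jdbc(hive)
CREATE TABLE IF NOT EXISTS demo_table (id INT, name STRING);

%hive
CREATE TABLE IF NOT EXISTS demo_table (id INT, name STRING);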
						
					
Labels: Apache Hive, Apache Phoenix, Apache Zeppelin

09-09-2016 09:24 PM

🙂 Understood. It is one of those ease-of-development (a few quick Pig lines) vs. highly-optimized (custom MapReduce program) questions. It should still be relatively performant in Pig, and I think the code above is the only way to do it in Pig.

09-09-2016 08:54 PM

Good question: you can use multiple conditions, grouping them in parentheses where needed. For example:

SPLIT A INTO X IF f1 < 7, Y IF f2 == 5, Z IF (f3 < 6 OR f5 == 0);

09-09-2016 08:45 PM
2 Kudos

This should work:

-- split into 2 datasets
SPLIT Input_data INTO A IF Field > 0, B IF Field <= 0;
-- count > 0 records
A_grp = GROUP A ALL;
A_count = FOREACH A_grp GENERATE COUNT(A);
-- count <= 0 records
B_grp = GROUP B ALL;
B_count = FOREACH B_grp GENERATE COUNT(B);

See:
https://pig.apache.org/docs/r0.7.0/piglatin_ref2.html#SPLIT
http://pig.apache.org/docs/r0.9.2/func.html#count (note the use of ALL here instead of a particular field)
http://www.tutorialspoint.com/apache_pig/apache_pig_count.htm

09-09-2016 01:14 PM
1 Kudo

For profiling data off Hadoop, see https://community.hortonworks.com/questions/35396/data-quality-analysis.html
For profiling data on Hadoop, the best solution for you should be:

- Zeppelin as your client/UI
- Spark in Zeppelin as your toolset to profile

Both Zeppelin and Spark are extremely powerful tools for interacting with data and are packaged in HDP. Zeppelin is a browser-based notebook UI (like IPython/Jupyter) that excels at interacting with and exploring data. Spark, of course, is in-memory data analysis and is lightning fast. Both are key pieces in the future of Big Data analysis. BTW, you can use Python in Spark or you can use Scala, including integration of external libraries. See the following links to get started:

http://hortonworks.com/apache/zeppelin/
http://www.social-3.com/solutions/personal_data_profiling.php

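Just as a hedged sketch of what a first profiling paragraph in Zeppelin might look like (the table name is a placeholder, and %pyspark with the provided sqlContext assumes the standard Spark interpreter setup):

%pyspark
# basic profile of a Hive table registered in the metastore (name is a placeholder)
df = sqlContext.table("default.my_table")

df.printSchema()          # columns and types
print(df.count())         # row count
df.describe().show()      # count/mean/stddev/min/max for numeric columns

# null counts per column
for c in df.columns:
    print(c, df.where(df[c].isNull()).count())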
						
					
09-08-2016 06:10 PM

Glad I could help 🙂

09-08-2016 12:51 PM
2 Kudos

@Mohan V  I would:

1. Land the data in a landing zone in HDFS. Decide whether or not to keep this going forward (you may want to reuse the raw data).

2. Use Pig scripts to transform the data into tab-delimited output for your HBase tables (see next step). Importantly, this involves inserting a key as the first column of your resulting TSV file; HBase, of course, is all about well-designed keys. You will use Pig's CONCAT() function to create a key from existing fields. It is often useful to concatenate the fields with a "-" separating each field in the resulting composite key. A single TSV output will be used to bulk load a single HBase table (next step), so these should be written to a tmp dir in HDFS to be used as input there. Note: you could take your Pig scripting to the next level and create a single flexible Pig script for creating TSV output for all HBase tables (see https://community.hortonworks.com/content/kbentry/51884/pig-doing-yoga-how-to-build-superflexible-pig-scri.html), but that is not necessary. A rough sketch of steps 2 and 3 follows this list.

3. Do a bulk import into your HBase table for each TSV (inserting record by record will be much too slow for large tables). See the following links on bulk imports:
http://hbase.apache.org/0.94/book/arch.bulk.load.html
http://hbase.apache.org/book.html#importtsv

I have used this workflow frequently, including loading 2.53 billion relational records into an HBase table. The more you do it, the more automated you find yourself making it.

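Purely as a sketch of steps 2 and 3, here is the shape of it (the field names, paths, table name, and column family are placeholders I made up, not your schema):

-- step 2 (Pig): build a composite row key and write tab-delimited output
raw   = LOAD '/landing/mydata' USING PigStorage(',') AS (custid:chararray, txndate:chararray, amount:chararray);
keyed = FOREACH raw GENERATE CONCAT(custid, CONCAT('-', txndate)) AS rowkey, custid, txndate, amount;
STORE keyed INTO '/tmp/mytable_tsv' USING PigStorage('\t');

# step 3 (shell): import the TSV into an existing HBase table
# (for very large tables, add -Dimporttsv.bulk.output=... and follow with completebulkload)
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
  -Dimporttsv.columns=HBASE_ROW_KEY,cf:custid,cf:txndate,cf:amount \
  mytable /tmp/mytable_tsv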
						
					
09-08-2016 03:05 AM
3 Kudos

S3 is slower than HDFS for MapReduce jobs. Besides that, are there any special considerations or optimizations for ORC files on S3 compared to HDFS? Do they achieve all the benefits on S3 that they do on HDFS? If not, why?
						
					
Labels: Apache Hive

09-08-2016 12:09 AM

@D Srini  You have provided the error but not the code itself. Based on the error, it looks like your code did not write SUM in caps. SUM is a built-in function of Pig, implemented by the Java class SUM in the package org.apache.pig.builtin, which ships in the Pig jars. Bottom line: the function name is case sensitive. I think that instead of writing

e = foreach d generate group as driverid, SUM(c.occurance) as t_occ;

you wrote

e = foreach d generate group as driverid, sum(c.occurance) as t_occ;

See https://pig.apache.org/docs/r0.16.0/func.html#built-in-functions to learn more about Pig functions, both built-in functions (which you do not need to register in your code, because Pig finds them dynamically in the Pig jars) and user-defined functions (where you create the jar and register it in your Pig code). If this was indeed the problem, let me know by accepting the answer.
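For context, a minimal end-to-end sketch of how the built-in SUM is normally used (the load path and schema here are guesses based on your snippet, not your actual script):

-- schema and path are placeholder guesses
c = LOAD '/data/driver_events' USING PigStorage(',') AS (driverid:chararray, occurance:int);
d = GROUP c BY driverid;
-- SUM must be written in upper case; it resolves to org.apache.pig.builtin.SUM
e = FOREACH d GENERATE group AS driverid, SUM(c.occurance) AS t_occ;
DUMP e;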
						
					