Member since: 06-20-2016
Posts: 488
Kudos Received: 433
Solutions: 118
09-04-2016
07:24 PM
@João Souza This requirement centers on FILTER, which retrieves the records that satisfy one or more conditions. There are two ways to do this. The first uses FILTER directly:

X = FILTER Count BY Field > 10;
Y = FILTER Count BY Field <= 10;

The second achieves the same result with different grammar:

SPLIT Count INTO X IF Field > 10, Y IF Field <= 10;

Please note that SUM requires a GROUP operation beforehand. In your case, you would have needed to GROUP the data before summing it in the first line of your code. It would look something like the following:

data = LOAD ... AS (amt:int, name:chararray);
grouped_data = GROUP data BY name;
summed_data = FOREACH grouped_data GENERATE group AS name, SUM(data.amt) AS amtSum;
X = FILTER summed_data BY amtSum > 10;
Y = FILTER summed_data BY amtSum <= 10;
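If you prefer the SPLIT form shown above, an equivalent sketch against that same summed_data relation would be:

SPLIT summed_data INTO X IF amtSum > 10, Y IF amtSum <= 10;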
See:
https://pig.apache.org/docs/r0.7.0/piglatin_ref2.html#SUM
http://www.thomashenson.com/sum-field-apache-pig/
(Let me know if this is what you are looking for by accepting the answer.)
09-03-2016
01:34 AM
2 Kudos
Sqoop is faster than NiFi at pulling data from relational databases because it parallelizes the transfer, whereas NiFi does not (https://community.hortonworks.com/questions/25228/can-i-use-nifi-to-replace-sqoop.html). NiFi is easy to develop with and has data lineage and other governance and monitoring capabilities out of the box, which makes a case for using it to ingest relational data into HDFS, at least for one-time offloads or smallish tables (e.g., for data science work). Are there any benchmark results out there that describe how long NiFi takes to offload relational tables of given sizes? Benchmarks are of course specific to implementations (e.g., CPU cores), but some numbers would be informative.
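(For context on the parallelism point, Sqoop splits an import across mappers. A minimal sketch, with a hypothetical connection string, table, and split column:

sqoop import \
  --connect jdbc:mysql://dbhost:3306/sales \
  --table orders \
  --split-by order_id \
  --num-mappers 8 \
  --target-dir /data/raw/orders

NiFi pulls the same table through a single processor thread unless you shard the query yourself.)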
Labels:
- Apache NiFi
- Apache Sqoop
09-02-2016
06:19 PM
1 Kudo
As far as using Pig to insert the data into an HBase table, these links should be helpful:
https://community.hortonworks.com/questions/31164/hbase-insert-from-pig.html
http://princetonits.com/blog/technology/loading-customer-data-into-hbase-using-a-pig-script/
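A minimal sketch of the pattern those links walk through, assuming a hypothetical CSV input and an existing HBase table named 'customers' with an 'info' column family; the first field of each tuple becomes the row key and the remaining fields map to the listed column descriptors:

data = LOAD '/data/customers.csv' USING PigStorage(',')
       AS (id:chararray, name:chararray, city:chararray);
STORE data INTO 'hbase://customers'
      USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('info:name info:city');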
09-02-2016
04:14 PM
@Ashwini Maurya A good starting point is here: http://hortonworks.com/developers/
09-01-2016
09:02 PM
1 Kudo
The Apache Hive wiki (https://cwiki.apache.org/confluence/display/Hive/GettingStarted) has details on logging, including local mode and version differences. I have pasted the key information below, but see the link above for more.

Hive Logging

Hive uses log4j for logging. By default, logs are not emitted to the console by the CLI. The default logging level is WARN for Hive releases prior to 0.13.0; starting with Hive 0.13.0, the default logging level is INFO. The logs are stored in the directory /tmp/<user.name>:

/tmp/<user.name>/hive.log

Note: In local mode, prior to Hive 0.13.0 the log file name was ".log" instead of "hive.log". This bug was fixed in release 0.13.0 (see HIVE-5528 and HIVE-5676). To configure a different log location, set hive.log.dir in $HIVE_HOME/conf/hive-log4j.properties. Make sure the directory has the sticky bit set (chmod 1777 <dir>):

hive.log.dir=<other_location>

If you wish, the logs can be emitted to the console by adding the arguments shown below:

bin/hive --hiveconf hive.root.logger=INFO,console        //for HiveCLI (deprecated)
bin/hiveserver2 --hiveconf hive.root.logger=INFO,console

Alternatively, you can change only the logging level by using:

bin/hive --hiveconf hive.root.logger=INFO,DRFA           //for HiveCLI (deprecated)
bin/hiveserver2 --hiveconf hive.root.logger=INFO,DRFA

Another option is TimeBasedRollingPolicy (applicable for Hive 0.15.0 and above, HIVE-9001), by providing the DAILY option as shown below:

bin/hive --hiveconf hive.root.logger=INFO,DAILY          //for HiveCLI (deprecated)
bin/hiveserver2 --hiveconf hive.root.logger=INFO,DAILY

Note that setting hive.root.logger via the 'set' command does not change logging properties, since they are determined at initialization time.

Hive also stores query logs on a per-Hive-session basis in /tmp/<user.name>/, but the location can be configured in hive-site.xml with the hive.querylog.location property.

Logging during Hive execution on a Hadoop cluster is controlled by the Hadoop configuration. Usually Hadoop will produce one log file per map and reduce task, stored on the cluster machine(s) where the task was executed. The log files can be obtained by clicking through to the Task Details page from the Hadoop JobTracker Web UI.

When using local mode (mapreduce.framework.name=local), Hadoop/Hive execution logs are produced on the client machine itself. Starting with release 0.6, Hive uses hive-exec-log4j.properties (falling back to hive-log4j.properties only if it is missing) to determine where these logs are delivered by default. The default configuration file produces one log file per query executed in local mode and stores it under /tmp/<user.name>. The intent of providing a separate configuration file is to enable administrators to centralize execution log capture if desired (on an NFS file server, for example). Execution logs are invaluable for debugging run-time errors.
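As an illustration of the hive.querylog.location property mentioned above, a hive-site.xml entry might look like the following (the path is only an example):

<property>
  <name>hive.querylog.location</name>
  <value>/var/log/hive/querylogs</value>
</property>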
08-21-2016
02:41 PM
One minor thing to remember about the answer is that the Maximum Number of Entries property must be left blank.
08-21-2016
02:40 PM
(@hduraiswamy please ignore the second question ... it is just a rephrasing of the first.)
08-20-2016
03:28 PM
1 Kudo
First, to get insight into what is happening, you can tail -f /var/log/nifi/nifi-app.log and grep for InvokeHTTP and PutFile to get the details for each of those processors. Second, you can right-click each of these processors and view stats; that will show you bytes read and written. This should give you insight into where data is flowing and where it is not. I believe your InvokeHTTP processor is repeatedly reading from the URL and emitting flowfiles (as determined by the Scheduling > Run Schedule property), which means it will repeatedly write to local storage. Set the PutFile property Conflict Resolution Strategy to ignore and see what happens. (And again, use the log file for details on what is happening for each processor.)
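For example, to follow both processors at once in the default log location mentioned above:

tail -f /var/log/nifi/nifi-app.log | grep -E 'InvokeHTTP|PutFile'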
08-20-2016
03:08 PM
1 Kudo
I am tailing a log file into MergeContent. I want MergeContent to merge log entries into a large FlowFile to put to HDFS. I have been fiddling with these properties:
- Merge Strategy
- Minimum Number of Entries
- Maximum Number of Entries
- Maximum Number of Bins

It has been pretty much trial-and-error. How do the above properties determine the MergeContent output FlowFile size, and what is the most direct way to, say, double the output file size compared to the existing settings? What is the most direct way to increase the size until a desired size is reached?
Labels:
- Apache NiFi
08-16-2016
10:03 PM
10 Kudos
Introduction

We know that parameter passing is valuable for Pig script reuse. A lesser-known point is that parameters do not simply pass variables to Pig scripts; more fundamentally, they pass text that replaces placeholders in the script. This is a subtle but powerful difference: it means we can dynamically pass code alternatives to the script at run-time. This allows us to build a script with the same larger purpose but whose logic, settings, schema, storage types, UDFs, etc. can be swapped at run-time. The result is significantly fewer yet more flexible scripts to build and maintain across your projects, group, or organization. In this article I will show techniques to leverage this approach through a few script examples. Keep in mind that the goal is not the final examples themselves, but rather the possibilities for your own scripting.

Key ideas and techniques are:
- parameters as text substitution
- -param whose parameter value holds a code snippet
- -param_file whose value loads a parameter file holding one or more parameters whose values hold code snippets
- passing multiple -param and -param_file arguments to a pig script
- the -dryrun parameter, which shows the inline result of the parameter substitution (for understanding purposes)

Example 1

In this example we want to load a dataset and insert a first column that is a key built from one or more of the original columns. If the first column is a composite key, we concatenate the values of multiple columns and separate each value with a dash. Note: in all cases I am calling the script from a command-line client. This command could be generated manually or via a program.

Technique 1: Simple parameter passing (values as variables)

Let's say we want to concatenate column 1 before column 0. The script would look as follows:

A = LOAD '$SRC' USING PigStorage(',');
X = FOREACH A generate CONCAT(CONCAT($1,'-'),$0), $0..;
STORE X into '$DEST';

and we trigger it with the following command:

pig -param SRC=../datasets/myData.csv -param DEST=../output/myDataWithKey.csv -f keyGenerator.pig

But what if we wanted to concatenate columns 2 and 3, or columns 1, 2, 5, or 5, 1, 4, 7? By passing parameters only as variables, we would have to write a different script each time with different CONCAT logic, and then annoyingly give each script a similar but still different name.

Technique 2: Passing code via parameters

Alternatively, we can maintain one script template and pass the CONCAT logic via a parameter. The script would look like this:

A = LOAD '$SRC' USING PigStorage(',');
X = FOREACH A generate $CON;
STORE X into '$DEST';

and we would call the script using any number of CONCAT logic possibilities, such as either of the following (I am showing only the new parameter here):

-param CON="CONCAT(CONCAT($1,'-'),$0),$0.."
-param CON="CONCAT(CONCAT($1,'-'),$2),$0,$1,$2,$4"

Note that I am defining the CONCAT logic in the -param value, and also which of the original fields to return. Note also that I am wrapping the -param value in quotes to escape it from the shell. As owner of the CONCAT logic, you would of course also need to understand the dataset you are loading. For example, you would not want to CONCAT using a column index that does not exist in the dataset (e.g., column 7 for a dataset that has only 5 columns).

Technique 3: Include -dryrun as a parameter to see the inline rendering

If you pass -dryrun in addition to the -param parameters, you will see the running pig script output a line like this:

2016-08-11 15:15:30,530 [main] INFO org.apache.pig.Main - Dry run completed. Substituted pig script is at keyGenerator.pig.substituted

When the script finishes, you will notice a file called keyGenerator.pig.substituted next to the actual script that ran (keyGenerator.pig). The .substituted file shows the original script with all of the parameter values inlined, as if you had hard-coded the full script. This shows the text replacement that occurs when the script is run. It is a good development technique for seeing how your parameter values are represented in the running script. Note that -dryrun produces the file as described above but does not execute the script. You could alternatively use -debug, which will both produce the file and execute the script. In an operational environment this may not be valuable, because each time the same script is run it overwrites the contents of the .substituted file it produces.
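For illustration, assuming the SRC and DEST values from the Technique 1 command and the first CON value above, keyGenerator.pig.substituted would contain roughly:

A = LOAD '../datasets/myData.csv' USING PigStorage(',');
X = FOREACH A generate CONCAT(CONCAT($1,'-'),$0),$0..;
STORE X into '../output/myDataWithKey.csv';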
Example 2

In this example we develop a reusable script to clean and normalize datasets to desired standards, using a library of UDFs we built. Using the techniques from above, our script would look like this:

REGISTER ./lib/pigUDFs.jar;
A = LOAD '$SRC' USING $LOADER;
B = FOREACH A generate $GEN;
STORE B into '$DEST' USING $STORAGE;

Note that in addition to the source and destination paths, we are able to define the LOAD details (storage type, schema) and the STORE details (storage type). I could, for example, run the following on the command line:

pig \
-param SRC=data.txt \
-param DEST=../output/myFile.txt \
-param LOADER="'TextLoader() AS (line:chararray)'" \
-param GEN="clean.CLEAN_CSV(line)" \
-param STORAGE="PigStorage('|')" \
-f clean.pig

or I could run this:

pig \
-param "SRC=data.txt" \
-param "DEST=../output/myOtherFile.txt" \
-param LOADER="PigStorage(',') AS (lastname:chararray, firstname:chararray, ssn:chararray, position:chararray, startdate:chararray, tenure:int, salary:double)" \
-param GEN="clean.REDACT(lastname), clean.CLEAN_TOLOWER(firstname), clean.REDACT(ssn), clean.CLEAN_TOLOWER(position), normalize.NORMALIZE_DATE(startdate), tenure, salary" \
-param STORAGE="PigStorage('|')" \
-f clean.pig

Using the same script template, in the first instance I use a UDF to apply a generic clean operation to each field in the entire line (the script knows the delimiter). In the second instance I use the same script template with different UDFs on each field, including both normalizing and cleaning. This requires knowledge of the schema, which is passed in with the LOADER parameter.

Note again the quotes to escape special characters in parameter values. Here we have a special additional need for quotes: Pig requires that when a parameter value contains spaces, you wrap the value in single quotes. Thus, notice:

-param LOADER="'TextLoader() AS (line:chararray)'"

The double quotes are for shell escaping, and the single quotes are required by pig because of the spaces.

Technique 4: Store parameters in parameter files and select the parameter file at run-time

The above is clearly clumsy on the command line. We could put some or all of the parameters in a parameter file and identify the file using -param_file. For the second example above, the file contents would look like:

LOADER='PigStorage(',') AS (lastname:chararray, firstname:chararray, ssn:chararray, position:chararray, startdate:chararray, tenure:int, salary:double)'
GEN='clean.REDACT(lastname), clean.CLEAN_TOLOWER(firstname), clean.REDACT(ssn), clean.CLEAN_TOLOWER(position), normalize.NORMALIZE_DATE(startdate), tenure, salary'
STORAGE=PigStorage()

Note that we only need the single-quote wrappers to satisfy pig's rule about spaces in parameter values. We would now call the script as follows:

pig -param SRC=data.txt -param DEST=../output/xform.txt -param_file thisJobParams.txt -f clean.pig

Technique 5: Store optimization settings in a separate parameter file and select the settings at run-time

We can store optimization settings in a set of parameter files and select which we want to apply at run-time. For example, imagine the following new parameter $OPT in the script:

$OPT
REGISTER ./lib/pigUDFs.jar;
A = LOAD '$SRC' USING $LOADER;
B = FOREACH A generate $GEN;
STORE B into '$DEST' USING $STORAGE;

Now imagine, say, 10 files, each with different optimization settings. One of the files could look like this:

OPT='\
SET opt.multiquery false; \
SET io.sort.mb 2048; \
SET mapreduce.task.timeout 1800;'

And our run of the script would be identical to the command line at the end of Technique 4, but with this additional parameter:

-param_file chosenOptParams.txt

We can thus have different parameter files that serve different purposes. In this case, the param file is used to inline optimization settings into the rendered script.

Technique 6: Implement multi-tenancy

Since your scripts will be reused and run concurrently, make them multi-tenant by passing in a job name and log file name. If our script looked like this:

SET job.name $JOBNAME;
SET pig.logfile ./pigLogPath/$JOBNAME;
$OPT
REGISTER ./lib/pigUDFs.jar;
A = LOAD '$SRC' USING $LOADER;
B = FOREACH A generate $GEN;
STORE B into '$DEST' USING $STORAGE;

we could pass in a unique name that includes the line of business or group and the job name. Thus, the full command line would look something like this:

pig -param SRC=myData.txt -param DEST=myOutput.txt -param_file myLob_MyJob_Params.txt -param_file chosenOptParams.txt -param JOBNAME=myLob_myJobName -f clean.pig

Given the above pig script template, one can only imagine the diverse range of parameters that could be passed into the same script to load, clean/normalize (or otherwise transform), and store files in an optimized and multi-tenant way. By defining the log file path and reusing the same job name for the same set of parameters, we get the benefit of appending job failures to a single log (as opposed to a new log file for each failure) and of writing it to a location other than where the script is located.

Conclusion

Passing parameters as code and not simply as variables opens up a new world of flexibility and reuse for your pig scripting. Along with your imagination and experimentation, the above techniques can lead to significantly less time building and maintaining pig scripts and more time leveraging a set of templates and parameter files that you maintain. Give it a try and see where it takes you. Process more of your Big Data into business-value data with more powerful scripting.