Member since: 10-20-2017
Posts: 63
Kudos Received: 0
Solutions: 0
09-24-2018
11:21 PM
@Raj ji Looks like this thread is older than the duplicate thread linked above, so I am posting my update from the other HCC thread here so that the other thread can be deleted. If you want to rotate as well as compress your logs (such as the audit log), you can use "RollingFileAppender" instead of "DailyRollingFileAppender", because "RollingFileAppender" gives you more options to rotate the logs based on various policies, such as "TimeBasedRollingPolicy", and it can also compress the rotated files (e.g. to "log.gz"). Please refer to the following example for more details:
https://community.hortonworks.com/articles/50058/using-log4j-extras-how-to-rotate-as-well-as-zip-th.html
Example using TimeBasedRollingPolicy:
hdfs.audit.logger=WARN,console
log4j.logger.org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit=${hdfs.audit.logger}
log4j.additivity.org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit=false
log4j.appender.DRFAAUDIT=org.apache.log4j.rolling.RollingFileAppender
log4j.appender.DRFAAUDIT.File=${hadoop.log.dir}/hdfs-audit.log
log4j.appender.DRFAAUDIT.layout=org.apache.log4j.PatternLayout
log4j.appender.DRFAAUDIT.layout.ConversionPattern=%d{ISO8601} %p %c{2}: %m%n
log4j.appender.DRFAAUDIT.DatePattern=.yyyy-MM-dd
log4j.appender.DRFAAUDIT.rollingPolicy=org.apache.log4j.rolling.TimeBasedRollingPolicy
log4j.appender.DRFAAUDIT.rollingPolicy.ActiveFileName=${hadoop.log.dir}/${hadoop.log.file}
log4j.appender.DRFAAUDIT.rollingPolicy.FileNamePattern=${hadoop.log.dir}/${hadoop.log.file}-.%d{yyyyMMdd}.log.gz
Please make sure to copy the "apache-log4j-extras-1.2.17.jar" file into the /usr/hdp/x.x.x.x.x/hadoop/lib/ directory as mentioned in the above article, followed by a restart of all required services. Similarly, "SizeBasedTriggeringPolicy" can be used as follows:
hdfs.audit.logger=WARN,console
log4j.logger.org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit=${hdfs.audit.logger}
log4j.additivity.org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit=false
log4j.appender.DRFAAUDIT=org.apache.log4j.rolling.RollingFileAppender
log4j.appender.DRFAAUDIT.rollingPolicy=org.apache.log4j.rolling.FixedWindowRollingPolicy
log4j.appender.DRFAAUDIT.rollingPolicy.maxIndex=10
log4j.appender.DRFAAUDIT.rollingPolicy.ActiveFileName=${hadoop.log.dir}/hdfs-audit.log
log4j.appender.DRFAAUDIT.rollingPolicy.FileNamePattern=${hadoop.log.dir}/hdfs-audit.log-%i.gz
log4j.appender.DRFAAUDIT.triggeringPolicy=org.apache.log4j.rolling.SizeBasedTriggeringPolicy
log4j.appender.DRFAAUDIT.triggeringPolicy.MaxFileSize=10485760
log4j.appender.DRFAAUDIT.layout=org.apache.log4j.PatternLayout
log4j.appender.DRFAAUDIT.layout.ConversionPattern=%d{ISO8601} %p %c: %m%n
Please change the value of "log4j.appender.DRFAAUDIT.triggeringPolicy.MaxFileSize" according to your requirement; here the value "10485760" is roughly 10 MB. Reference: https://community.hortonworks.com/questions/212567/log4g-logs-not-rotated-and-zipped.html
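As a quick sanity check (just a sketch; the symlinked lib path and the audit log location below are assumptions for a typical HDP layout, so adjust them to your cluster):
# copy the log4j-extras jar into the Hadoop lib directory (path is an assumed HDP symlink)
cp apache-log4j-extras-1.2.17.jar /usr/hdp/current/hadoop-client/lib/
# after restarting HDFS, confirm that rotated and compressed audit logs appear
ls -lh /var/log/hadoop/hdfs/hdfs-audit.log*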
09-20-2018
01:56 AM
@Raj ji Consider using the PutHBaseRecord processor instead of PutHBaseJson, and use an UpdateRecord processor to add the CompositePrimaryKey to the JSON document. Then use the CompositePrimaryKey field as the Row Identifier for the PutHBaseRecord processor and adjust the Batch Size property value to get the maximum number of puts into HBase.
09-18-2018
09:09 PM
@Raj ji Yes, you can use it. PutHBaseJson processor: 1. It expects individual JSON messages (not an array). 2. You need to extract the values of ServerName and ServerNo from the content using an EvaluateJsonPath processor, then use a Row Identifier of
${ServerName},${ServerNo} (or) PutHBaseRecord processor: With the record-based processor you don't need to split the array of JSON messages, but you do need to prepare the row_id with an UpdateRecord processor using the concat(/ServerName,',',/ServerNo) function. Refer to this link for more details regarding the concat function usage in the UpdateRecord processor.
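For illustration, an UpdateRecord configuration along these lines would build the composite key (the /row_id field name is only an example and must exist in your record schema):
Replacement Value Strategy
Record Path Value
/row_id
concat(/ServerName,',',/ServerNo)
Then point the Row Identifier Field Name property of PutHBaseRecord at row_id.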
09-15-2018
06:25 PM
Hi All, Good day. I'm checking the RouteText processor to route values: if the text matches, route it to one action, and if it does not match, route it to a different action. My sample flow looks like this: GenerateFlowFile | RouteText --- Flow 1 ---> PutFile, and --- Flow 2 ---> PutFile. My text content is simple: "My car color is Blue" and "My car color is Yellow". If the color is Blue it goes to Flow 1; if the color is Yellow it goes to Flow 2. Now that I have started the processor, I can see the queue from GenerateFlowFile is backed up to 10000, Flow 1 has about 6000 and Flow 2 has about 1000 flowfiles, with N number of duplicates. How can I restrict it to only the given content without any repetition? The expected output is only two text files on the respective flows, not N copies. How can I configure it that way? Can you please give high-level information on why this is happening? I cannot terminate the relationship in GenerateFlowFile as the connection is already given to the next processor. One important thing: Flow 1 should be executed first, and Flow 2 should be executed only after Flow 1 is complete.
08-30-2018
07:09 PM
Hi All, I'm trying to update JSON data into Hive using this approach: https://goo.gl/J7chi3 . With this approach I have two issues. 1. My input record count is 20k but the output record count is about 23.5k; some JSONs are breaking and creating duplicates. 2. My input record count is 10k and the output record count is 20k, whereas per the link it should update the records if they are already present in the table. Can anyone guide me on how to do upserts in Hive? Apart from the methods mentioned above, a few other approaches to upserts have failed for me. I tried the MERGE option in Hive (refer to https://community.hortonworks.com/articles/97113/hive-acid-merge-by-example.html); this MERGE is not suitable for merging 5 GB or more - it takes hours, does not complete, or hits heap memory errors even for 6 GB of data. Someone suggested MERGE with source and destination as partitions, but we get an error if the destination is partitioned, since MERGE cannot update the partition key value. The cluster RAM size is 250 GB. Can anyone help me with definitive steps? It should work for upserts (when a record matches then update, otherwise insert) on larger datasets, more than 5 TB. None of the solutions out there on the internet have worked for more than a month now. Could anyone let me know the valid steps for larger datasets with JSON?
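For reference, the MERGE pattern discussed above looks roughly like this when run through beeline (a minimal sketch; the JDBC URL, table names, and columns are placeholders, not the actual schema from this question):
# hypothetical upsert of a staging table into an ACID target (update on match, insert otherwise)
beeline -u "jdbc:hive2://<hiveserver2-host>:10000/default" -e "
MERGE INTO target_tbl AS t
USING staging_tbl AS s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET payload = s.payload
WHEN NOT MATCHED THEN INSERT VALUES (s.id, s.payload);"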
08-30-2018
11:33 AM
@Raj ji Check out this solution here: https://community.hortonworks.com/questions/147226/replacetextprocessor-remove-blank-lines.html If this answer is helpful, please choose ACCEPT to mark the question as resolved.
03-06-2019
07:07 PM
@Eugene Koifman We are facing an issue which seems to be a limitation of Hive 1.2 ACID tables. We are using MERGE for loading mutable data into Hive ACID tables, but loading/reading these ACID tables using Pig or Spark seems to be an issue. Do Hive ACID tables on Hive 1.2 support being read into Apache Pig using HCatLoader (or other means), or into Spark using SQLContext (or other means)? For Spark, it seems it is only possible to read ACID tables if the table is fully compacted, i.e. no delta folders exist in any partition. Details are in the following JIRA: https://issues.apache.org/jira/browse/SPARK-15348. However, I wanted to know whether Apache Pig supports reading Hive ACID tables at all. When I tried reading both an un-partitioned and a partitioned ACID table in Pig 0.16, I get 0 records read: Successfully read 0 records from: "dwh.acid_table". HDP version 2.6.5, Spark version 2.3, Pig version 0.16, Hive version 1.2.
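As a side note on the compaction point above, a major compaction can be requested and monitored roughly like this via beeline (the JDBC URL is a placeholder; partitioned tables need an explicit PARTITION clause per partition):
# request a major compaction so no delta directories remain (the precondition for the Spark read mentioned above)
beeline -u "jdbc:hive2://<hiveserver2-host>:10000/default" -e "ALTER TABLE dwh.acid_table COMPACT 'major';"
# check compaction progress
beeline -u "jdbc:hive2://<hiveserver2-host>:10000/default" -e "SHOW COMPACTIONS;"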
08-17-2018
09:04 AM
capture1.png @Jay Kumar SenSharma Everything is configured properly in Ambari, however it is not working as expected. Please refer to the attached screenshots; I am not sure why it is not removing the older logs that are more than 5 days old, as configured. capture.png
07-13-2018
02:16 PM
From GenerateTableFetch I'm able to get the flowfile. Upon running the flow, the select query failed to execute on DB2. On investigating, we found that the query generated by GenerateTableFetch looked like this:
select userid,timestamp from user11 where timestamp<='01-01-2018 12:00:00' order by timestamp limit 10000
I have used the NiFi Expression Language as per https://community.hortonworks.com/articles/167733/using-generatetablefetch-against-db2.html and created a query like this:
select ${generatetablefetch.columnnames} from ${generatetablefetch.tablename} where ${generatefetchtable.whereClause} order by ${generatetablefetch.maxColumnNames} fetch first ${generatetablefetch.limit} rows only
I expected to get:
select userid,timestamp from user11 where timestamp >= '01-01-2018 12:00:00' order by timestamp limit 1000
but I'm getting:
select userid,timestamp from user11 where order by timestamp limit 1000
In the above example the where condition is not picking up the value. Please refer to the screenshot for my configuration: 80483-nififlow.png I think I have made it halfway through this and am stuck here. What is missing in this?
07-04-2018
08:25 PM
@Raj ji First, make sure the Avro schema field name matches (case sensitive) the value specified in the Partition Columns property. If your Avro data file has the chrono field in capital letters, then you need to change the property value to match the Avro schema field name. Please refer to this link and this one, which explain the Hive streaming API.
07-02-2018
03:58 PM
For the services you have mentioned, simply go into Ambari, click on the service, click on the Configs tab, click on Advanced, type "log" in the search box, and you will see the settings that you can customize as you see fit.
06-27-2018
03:56 AM
@Raj ji You can use the ExecuteProcess processor (which doesn't allow any incoming connections) or the ExecuteStreamCommand processor to trigger the shell script. ExecuteProcess configs: as your executable script is on Machine 4 and NiFi is installed on Machine 1, create a shell script on Machine 1 which ssh'es into Machine 4 and triggers your Python script; see the sketch after this paragraph. Refer to this link and this one, which describe how to use a username/password while doing ssh to a remote machine. As you are going to store the logs in a file, you can use the TailFile processor to tail the log file, check whether there is any ERROR/WARN using the RouteText processor, and then trigger a mail. (or) Fetch the application id (or application name) of the process and then use the YARN REST API to get the status of the job. Please refer to how to monitor YARN applications using NiFi, Starting Spark jobs directly via YARN REST API, and this link, which describes the YARN REST API capabilities.
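A minimal wrapper for Machine 1 could look like this (a sketch only; the hostname, user, script path, and log location are placeholders, and key-based ssh is assumed instead of a password):
#!/bin/bash
# trigger_remote.sh - run the Python job on Machine 4 over ssh and append its output
# to a local log file that the TailFile processor can watch
ssh user@machine4 'python /path/to/job.py' >> /var/log/nifi_jobs/job.log 2>&1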
07-02-2018
08:18 AM
It has nothing to do with Spark; it may be a package mismatch on libcurl. Do a complete uninstall of curl and try a fresh install, for example by building from source:
wget http://curl.haxx.se/download/curl-7.48.0.tar.gz
tar -xvzf curl-7.48.0.tar.gz
cd curl-7.48.0
./configure
sudo make
sudo make install
If you are particular about the version, please refer to https://curl.haxx.se/download.html . Or install from your package manager:
sudo apt-get update
sudo apt-get install curl
sudo apt-get install libcurl3
Please change apt-get according to your Linux flavor.
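Once reinstalled, a quick check confirms which curl binary is picked up and which libcurl it links against:
# verify the freshly installed curl and its libcurl version
which curl
curl --version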
06-09-2018
06:24 PM
Looks like a duplicate of https://community.hortonworks.com/questions/196668/hive-server-2-hs-2-two-instance-through-ambari.html
06-14-2018
07:23 PM
@Raj ji If you are using Ambari to manage both HiveServer2 instances, you need not change the configs separately on the node; Ambari will take care of all the configs. Apart from this, you should see all the tables from both HiveServer2 instances.
06-09-2018
03:24 AM
Fantastic and detailed reply. I will try this out and reply if it works. Thanks a lot @Shu
05-31-2018
07:00 PM
Could you clarify a bit? Are you trying to configure authentication via LDAPS or LDAP authentication when Hive is accessed via HTTPS (or something else)? What problems are you running into?
05-15-2018
08:09 PM
Error shows there are missing blocks:
Caused by: org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-267577882-40.133.26.59-1515787116650:blk_1076168453_2430591 file=/user/backupdev/machineID=XEUS/delta_21551841_21551940/bucket_00003
at org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:995)
at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:638)
at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:888)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:945)
at java.io.DataInputStream.read(DataInputStream.java:100)
at org.apache.hadoop.tools.util.ThrottledInputStream.read(ThrottledInputStream.java:77)
at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.readBytes(RetriableFileCopyCommand.java:285)
... 16 more
Check the NameNode UI to see whether you have missing blocks.
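Beyond the NameNode UI, missing or corrupt blocks can also be listed from the command line, roughly like this (run as a user with HDFS superuser rights or read access to the path):
# cluster-wide summary of missing/corrupt blocks
hdfs dfsadmin -report | grep -iE 'missing|corrupt'
# list files with corrupt blocks under the affected path
hdfs fsck /user/backupdev -list-corruptfileblocks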
05-07-2018
06:54 AM
@Raj ji Yes, if it's an HDP cluster, Ranger is the only tool that makes authorization administration easy; otherwise the it_dev and it_admin users can individually create their own databases, which won't be shareable. On CDH you have Sentry, which reminds me of Oracle administration, where permission/authorization is a 3-step, CLI-based process: create a role, grant permissions to the role, then grant the role to users. Advantages of Ranger: centralized security administration to manage all security-related tasks in a central UI or using REST APIs; fine-grained authorization to do a specific action and/or operation with a Hadoop component/tool, managed through a central administration tool; standardized authorization methods across all Hadoop components; enhanced support for different authorization methods (role-based access control, geolocalized, time-based, UDF, attribute-based access control, etc.); and centralized auditing of user access and administrative (security-related) actions within all the components of Hadoop. So your only option is to have each user create their own database and tables; there is no concept of granting user X select, create, etc. in Hive. Hope that helps.
05-07-2018
08:29 AM
@Raj ji Yes, symlinks are preferred to tweaking the Python code! How I quickly analyzed it: the first pointer was "parent directory /usr/hdp/2.6.3.0/hive/conf doesn't exist", which from experience is a symlink that points to the configuration files. So without that symlink giving access to the conf files, you can't start the WebHCat server.
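To see what that symlink should look like on a healthy node, an inspection along these lines can help (a sketch; the exact versioned directory name depends on your build number):
# show where the hive conf directory/symlink points on this node
ls -ld /usr/hdp/2.6.3.0*/hive/conf /etc/hive/2.6.3.0*/0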
05-03-2018
07:10 PM
@Raj ji I am afraid there are only 2 options:
- Tweak the Python code (NOT recommended!)
- Ask the SysOps team to rename the mount and notify them of the caveat of using /home as a Hadoop mount point.
Hope that helps.
04-25-2018
01:18 AM
1 Kudo
@Raj ji
You can use the ExecuteProcess (or) ExecuteStreamCommand processors to pass arguments to the shell script.
ExecuteProcess processor: This processor doesn't need any upstream connection to trigger the script, i.e. it can run on its own based on the scheduler.
Example: I have a sample script which takes 2 command line arguments and echoes them:
bash$ cat sample_script.sh
#!/bin/bash
echo "First arg: $1"
echo "Second arg: $2"
Execution in a terminal:
bash$ ./sample_script.sh hello world
First arg: hello
Second arg: world
1. Execution in NiFi using the ExecuteProcess processor:
Command
bash
Command Arguments
/tmp/sample_script.sh hello world //here we are triggering the shell script and passing arguments separated by spaces
Batch Duration
No value set
Redirect Error Stream
false
Argument Delimiter
space //if the Argument Delimiter were ";", the Command Arguments would be /tmp/sample_script.sh;hello;world
Configs: the success relation from ExecuteProcess will output the below as the content of the flowfile:
First arg: hello
Second arg: world
2. Execution in NiFi using the ExecuteStreamCommand processor: This processor needs an upstream connection to trigger the script.
Flow: we have used a GenerateFlowFile processor as a trigger for the ExecuteStreamCommand script.
GenerateFlowFile configs: added two attributes, arg1 and arg2, to the flowfile.
ExecuteStreamCommand processor:
Command Arguments
${arg1};${arg2}
Command Path
/tmp/sample_script.sh
Argument Delimiter
;
Now we are using the attributes added in the GenerateFlowFile processor and passing them to the script. Use the output stream relation from the ExecuteStreamCommand processor; the output flowfile content will be the same:
First arg: hello
Second arg: world
By using these processors you can trigger the shell script and pass it arguments as well. If the answer helped to resolve your issue, click the Accept button below to accept the answer; that would be a great help to community users looking for a quick solution to these kinds of issues.
04-21-2018
11:42 AM
tar.png Hi All, I have only these options; I do not have a good internet connection. I have created a repo, but I can see only tar files in the HDF repo tarball: http://public-repo-1.hortonworks.com/HDF/centos7/3.x/updates/3.0.2.0 https://s3.amazonaws.com/public-repo-1.hortonworks.com/HDF/centos7/3.x/updates/3.0.2.0/HDF-3.0.2.0-centos7-tars-tarball.tar.gz If I download it and add the local repo URL (http://192.168.1.8/repo/HDF/) under Ambari Manage -> Versions, I get this error: "Some of the repositories failed validation. Make changes to the base url or skip validation if you are sure that urls are correct". However, when I look at the HDF repo directory, it has only tar files of the services. I do not think it is a valid repo.
02-22-2018
04:07 PM
Hi @Pratik Kumar, Are you following the entire process described in the article? Are you still trying to build an HDP 2.3 cluster? What version of Ambari have you installed?
03-30-2018
08:06 PM
@Jay Kumar SenSharma Do we need to clear these processes and reboot the host without stopping the services on that host, or do we need to manually stop the services on that host before rebooting it? Please advise.
09-20-2018
09:29 AM
I am also facing the same issue in production. Please provide the solution.
11-02-2017
02:42 AM
Unless I am mistaken, Ambari only checks that the DataNode / NodeManager process is running, not that a network connection from the DataNode to the ResourceManager is possible. SSH to the DataNodes, try to telnet to the ResourceManager port, and report back.
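For example, from a DataNode host, a reachability check looks like this (the hostname is a placeholder; the ResourceManager port is commonly 8050 on HDP or 8032 upstream, so check yarn.resourcemanager.address):
# test the network path from a DataNode/NodeManager host to the ResourceManager
telnet <resourcemanager-host> 8050
# or, if telnet is not installed
nc -vz <resourcemanager-host> 8050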
10-30-2017
07:45 PM
I'm running a Hive query and it creates an MR job. The table is a partitioned, ORC-formatted table. I'm not trying to insert values into the table; I need to filter non-null values from the table. When I tried to do that I got the above error. I still couldn't figure out why.