Member since: 10-06-2015
Posts: 273
Kudos Received: 202
Solutions: 81
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 4044 | 10-11-2017 09:33 PM |
 | 3566 | 10-11-2017 07:46 PM |
 | 2572 | 08-04-2017 01:37 PM |
 | 2215 | 08-03-2017 03:36 PM |
 | 2242 | 08-03-2017 12:52 PM |
05-17-2017
08:02 PM
Is there a way for NiFi to keep a DB connection open as it processes different incoming flowfiles? Or will the ExecuteSQL processor open/close the connection with every flowfile it processes?
Labels: Apache NiFi
05-17-2017
07:07 PM
2 Kudos
@J Nunes Unfortunately, Hue does not support interacting with the cluster through Knox. A Jira was opened for this ( https://issues.apache.org/jira/browse/KNOX-44 ), but the community decided not to address it. If you would like to use Hue, it will have to be from behind your firewall and Knox.
05-16-2017
07:05 PM
I see. I misread your question. This request was addressed in the following Jira: https://issues.apache.org/jira/browse/SQOOP-912 . The flag to specify the database is "--hive-database". Your command would look like:
sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES --hive-import --hive-database MYDATABASE
OR
sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES --hive-import --hive-table MYDATABASE.MYTABLE
05-16-2017
03:25 PM
9 Kudos
Use recommended mount options for all HDFS data disks
There are specific filesystem mount options that have proven to be more efficient for Hadoop clusters, and using them provides performance benefits. Since mount options are applied when the filesystem is mounted (on system boot or on a remount, for example), changes to /etc/fstab alone are not enough for these settings to take effect. The recommended approach is to make the mount option changes and then either manually remount the individual filesystems or reboot the host. Use the following mount options for the respective filesystems (see the example after this list):
- ext4: inode_readahead_blks=128, data=writeback, noatime, nodev
- xfs: noatime
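As a minimal sketch (the device name /dev/sdb1 and mount point /grid/0 are placeholders, not values from this article), an /etc/fstab entry for an ext4 data disk and the steps to apply it could look like:
# /etc/fstab entry for one ext4 HDFS data disk (example device and mount point)
/dev/sdb1  /grid/0  ext4  inode_readahead_blks=128,data=writeback,noatime,nodev  0 0
# apply without a full reboot: unmount and mount again so the fstab options take effect
umount /grid/0
mount /grid/0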
Configure HDFS block size for optimal performance
An optimal HDFS block size boosts NameNode performance as well as job execution performance. Make sure that the block size ('dfs.blocksize' in 'hdfs-site.xml') is within the recommended range of 134217728 to 1073741824 bytes (128 MB to 1 GB), exclusive.
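For example, to use a 256 MB block size (an illustrative value inside the recommended range), in hdfs-site.xml set:
dfs.blocksize=268435456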
Enable HDFS short-circuit reads
In HDFS, reads normally go through the DataNode: when the client asks the DataNode to read a file, the DataNode reads that file off the disk and sends the data to the client over a TCP socket. So-called short-circuit reads bypass the DataNode, allowing the client to read the file directly. This is only possible when the client is co-located with the data, but short-circuit reads provide a substantial performance boost to many applications. To configure short-circuit local reads, you will also need libhadoop.so and a DataNode domain socket path (dfs.domain.socket.path).
In hdfs-site.xml set the following:
dfs.client.read.shortcircuit=true
dfs.domain.socket.path=/var/lib/hadoop-hdfs/dn_socket
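As a quick sanity check (a sketch, not a step required by the article), you can confirm the values the HDFS client resolves with:
hdfs getconf -confKey dfs.client.read.shortcircuit
hdfs getconf -confKey dfs.domain.socket.path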
Avoid file sizes that are smaller than a block size
The average block size should be greater than the recommended size of 67108864 bytes (64 MB); an average size below this adds burden to the NameNode, can cause heap/GC issues, and makes storage and processing inefficient. Keep the average block size above that threshold.
Also, use one or more of the following techniques to consolidate smaller files (see the example after this list):
- run Hive/HBase compactions
- merge small files
- use HAR to compact small files
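As an illustration of the HAR option (the paths and archive name below are hypothetical), a directory of small files can be packed into a Hadoop archive and then read back through the har:// filesystem:
# pack /user/data/input into smallfiles.har under /user/data/archives
hadoop archive -archiveName smallfiles.har -p /user/data/input /user/data/archives
# list the archived contents
hdfs dfs -ls har:///user/data/archives/smallfiles.har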
Tune the DataNode JVM for optimal performance
The DataNode is sensitive to JVM performance and behavior, so make sure the DataNode JVM is configured for optimal performance (see the sketch after these settings). Sample JVM flags:
-Djava.net.preferIPv4Stack=true
-XX:ParallelGCThreads=8
-XX:+UseConcMarkSweepGC
-Xloggc:*
-verbose:gc
-XX:+PrintGCDetails
-XX:+PrintGCTimeStamps
-XX:+PrintGCDateStamps
Also:
-Xms should be the same as -Xmx
New generation size should be ⅛ of the total JVM size.
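A minimal sketch of how these flags could be applied in hadoop-env.sh, assuming an illustrative 4 GB DataNode heap (hence a 512 MB new generation, one eighth of the heap) and an example GC log path:
export HADOOP_DATANODE_OPTS="-Xms4g -Xmx4g -XX:NewSize=512m -XX:MaxNewSize=512m -XX:ParallelGCThreads=8 -XX:+UseConcMarkSweepGC -Djava.net.preferIPv4Stack=true -verbose:gc -Xloggc:/var/log/hadoop/hdfs/gc.log-datanode -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps ${HADOOP_DATANODE_OPTS}"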
Avoid reads or writes from stale DataNodes
DataNodes that have not sent a heartbeat to the NameNode for a defined interval may be under load or may have died. Avoid sending any read/write requests to such 'stale' DataNodes. In hdfs-site.xml set the following:
dfs.namenode.avoid.read.stale.datanode=true
dfs.namenode.avoid.write.stale.datanode=true
Use JNI-based group lookup over other implementations
Hadoop uses a pluggable interface, with multiple possible implementations, for looking up the group memberships of a user. The JNI-based implementation has better performance characteristics than the other implementations. In core-site.xml set:
hadoop.security.group.mapping=org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback
*** This article focuses on settings that would improve HDFS performance. However, they may impact other areas such as stability and uptime. Please understand the settings before applying them. ***
*** You might also be interested in the following articles: *** OS Configurations for Better Hadoop Performance
05-16-2017
02:15 AM
In the table at the bottom of the page for the first link, "--hive-table <table_name>" specifies the table name to use when importing data into Hive. So your command would look like:
sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES --hive-import --hive-table MYTABLE
05-16-2017
02:06 AM
3 Kudos
@Andres Urrego Take a look at the below link for the different ways to Sqoop data into Hive. https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.3/bk_dataintegration/content/using_sqoop_to_move_data_into_hive.html And below for HBase. http://www.dummies.com/programming/big-data/hadoop/importing-data-into-hbase-with-sqoop/
05-11-2017
03:55 PM
Glad I could help. No, unfortunately you won't be able to manage both clusters with the same Ambari.
05-11-2017
03:03 PM
You do not need to change the path for your existing cluster. Leave it as is. Chroot the second/new cluster. As for versions, you should be ok. To make your config life easier, I suggest you not install both Kafka versions on the same nodes though.
05-10-2017
04:18 PM
I've updated my response to include the steps.
05-10-2017
03:34 PM
@Silvio del Val Yes, you can install and manage the second set of Kafka brokers manually (not through Ambari). As for Zookeeper, it can manage multiple Kafka clusters; just "chroot" them differently, e.g. {zookeeperHost}:{portnumber}/{kafkacluster1} and {zookeeperHost}:{portnumber}/{kafkacluster2}. Take a look at the link below on how Zookeeper manages sessions and on chroot: https://zookeeper.apache.org/doc/r3.2.2/zookeeperProgrammers.html#ch_zkSessions
To do this:
1) Create the chroot in Zookeeper with the commands below (replace "yourZookeeperHost" and the port number with the appropriate values from your environment):
zkCli.sh -server yourZookeeperHost:2181
# note the empty brackets below are _required_
create /kafkaCluster1 []
This will create a path called "kafkaCluster1". You can verify it by running the "ls /" command.
2) In the Kafka broker configs, under "zookeeper.connect", add a "chroot" path, which will make all Kafka data for this cluster appear under that particular path. To do this, give a connection string in the form zkHost1:port1,zkHost2:port2,zkHost3:port3/chroot/path, for example yourZookeeperHost:2181/kafkaCluster1, which would put all of this cluster's data under the path /kafkaCluster1 (see the sketch at the end of this answer).
3) In the Kafka consumer configs, under "zookeeper.connect", input the same value as in step 2 for the brokers.
4) Repeat steps 1-3 for your second Kafka cluster using a different "chroot" path.
For more explanation of the Kafka configs take a look at the link below: https://kafka.apache.org/08/documentation/#configuration
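As a minimal sketch of step 2 (host, port, and chroot names as assumed above), the zookeeper.connect line in each cluster's broker server.properties differs only in the chroot:
# brokers of the first Kafka cluster
zookeeper.connect=yourZookeeperHost:2181/kafkaCluster1
# brokers of the second Kafka cluster
zookeeper.connect=yourZookeeperHost:2181/kafkaCluster2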