Member since: 10-06-2015
Posts: 273
Kudos Received: 202
Solutions: 81
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 4044 | 10-11-2017 09:33 PM |
 | 3566 | 10-11-2017 07:46 PM |
 | 2572 | 08-04-2017 01:37 PM |
 | 2215 | 08-03-2017 03:36 PM |
 | 2242 | 08-03-2017 12:52 PM |
05-17-2017
08:02 PM
Is there a way for NiFi to keep a DB connection open as it processes different incoming flowfiles? Or will the ExecuteSQL processor open/close the connection with every flowfile it processes?
Labels: Apache NiFi
05-17-2017
07:07 PM
2 Kudos
@J Nunes Unfortunately, Hue does not support interacting with the cluster through Knox. A Jira was opened for this ( https://issues.apache.org/jira/browse/KNOX-44 ), but the community decided not to address it. If you would like to use Hue, it will have to be from behind your firewall and Knox.
05-16-2017
07:05 PM
I see. I misread your question. This request was addressed in the following Jira: https://issues.apache.org/jira/browse/SQOOP-912 . The flag to specify the database is "--hive-database". Your command would look like:
sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES --hive-import --hive-database MYDATABASE
OR
sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES --hive-import --hive-table MYDATABASE.MYTABLE
05-16-2017
03:25 PM
9 Kudos
Use recommended mount options for all HDFS data disks
There are specific filesystem mount options that have proven to be more efficient for Hadoop clusters, and using them provides performance benefits. Since mount options are applied when the filesystem is mounted (on system boot or on a remount, for example), changes to /etc/fstab alone are not enough for these settings to take effect. The recommended approach is to make the mount option changes and then either manually remount the individual filesystems or reboot the host. Use the following mount options for the respective filesystems (see the example after this list):
- ext4: inode_readahead_blks=128, data=writeback, noatime, nodev
- xfs: noatime
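As a minimal sketch (the device name /dev/sdb1 and mount point /grid/0 are placeholders, not values from this article), an /etc/fstab entry for an ext4 data disk and the steps to apply it could look like:
# /etc/fstab entry for one ext4 HDFS data disk (example device and mount point)
/dev/sdb1  /grid/0  ext4  inode_readahead_blks=128,data=writeback,noatime,nodev  0 0
# apply without a full reboot: unmount and mount again so the fstab options take effect
umount /grid/0
mount /grid/0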
Configure HDFS block size for optimal performance
An optimal HDFS block size boosts NameNode performance as well as job execution performance. Make sure that the block size ('dfs.blocksize' in 'hdfs-site.xml') is within the recommended range of 134217728 to 1073741824 bytes (128 MB to 1 GB), exclusive.
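For example, to use a 256 MB block size (an illustrative value inside the recommended range), in hdfs-site.xml set:
dfs.blocksize=268435456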
Enable HDFS short-circuit reads
In HDFS, reads normally go through the DataNode: when the client asks the DataNode to read a file, the DataNode reads that file off the disk and sends the data to the client over a TCP socket. So-called short-circuit reads bypass the DataNode, allowing the client to read the file directly. This is only possible when the client is co-located with the data, but short-circuit reads provide a substantial performance boost to many applications. To configure short-circuit local reads, you will also need libhadoop.so and a DataNode domain socket path (dfs.domain.socket.path).
In hdfs-site.xml set the following:
dfs.client.read.shortcircuit=true
dfs.domain.socket.path=/var/lib/hadoop-hdfs/dn_socket
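As a quick sanity check (a sketch, not a step required by the article), you can confirm the values the HDFS client resolves with:
hdfs getconf -confKey dfs.client.read.shortcircuit
hdfs getconf -confKey dfs.domain.socket.path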
Avoid file sizes that are smaller than a block size
The average block size should be greater than the recommended size of 67108864 bytes (64 MB); an average size below this adds burden to the NameNode, can cause heap/GC issues, and makes storage and processing inefficient. Keep the average block size above that threshold.
Also, use one or more of the following techniques to consolidate smaller files (see the example after this list):
- run Hive/HBase compactions
- merge small files
- use HAR to compact small files
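As an illustration of the HAR option (the paths and archive name below are hypothetical), a directory of small files can be packed into a Hadoop archive and then read back through the har:// filesystem:
# pack /user/data/input into smallfiles.har under /user/data/archives
hadoop archive -archiveName smallfiles.har -p /user/data/input /user/data/archives
# list the archived contents
hdfs dfs -ls har:///user/data/archives/smallfiles.har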
Tune the DataNode JVM for optimal performance
The DataNode is sensitive to JVM performance and behavior, so make sure the DataNode JVM is configured for optimal performance (see the sketch after these settings). Sample JVM flags:
-Djava.net.preferIPv4Stack=true
-XX:ParallelGCThreads=8
-XX:+UseConcMarkSweepGC
-Xloggc:*
-verbose:gc
-XX:+PrintGCDetails
-XX:+PrintGCTimeStamps
-XX:+PrintGCDateStamps
Also:
-Xms should be the same as -Xmx
New generation size should be ⅛ of the total JVM size.
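A minimal sketch of how these flags could be applied in hadoop-env.sh, assuming an illustrative 4 GB DataNode heap (hence a 512 MB new generation, one eighth of the heap) and an example GC log path:
export HADOOP_DATANODE_OPTS="-Xms4g -Xmx4g -XX:NewSize=512m -XX:MaxNewSize=512m -XX:ParallelGCThreads=8 -XX:+UseConcMarkSweepGC -Djava.net.preferIPv4Stack=true -verbose:gc -Xloggc:/var/log/hadoop/hdfs/gc.log-datanode -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps ${HADOOP_DATANODE_OPTS}"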
Avoid reads or writes from stale DataNodes
DataNodes that have not sent a heartbeat to the NameNode for a defined interval may be under load or may have died. Avoid sending any read/write requests to such 'stale' DataNodes. In hdfs-site.xml set the following:
dfs.namenode.avoid.read.stale.datanode=true
dfs.namenode.avoid.write.stale.datanode=true
Use JNI-based group lookup over other implementations
Hadoop uses a pluggable interface, with multiple possible implementations, for looking up the group memberships of a user. The JNI-based implementation has better performance characteristics than the other implementations. In core-site.xml set:
hadoop.security.group.mapping=org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback
*** This article focuses on settings that would improve HDFS performance. However, they may impact other areas such as stability and uptime. Please understand the settings before applying them. ***
*** You might also be interested in the following articles: *** OS Configurations for Better Hadoop Performance
05-16-2017
02:15 AM
In the table at the bottom of the page for the first link, "--hive-table <table_name>" specifies the table name to use when importing data into Hive. So your command would look like:
sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES --hive-import --hive-table MYTABLE
05-16-2017
02:06 AM
3 Kudos
@Andres Urrego Take a look at the below link for the different ways to Sqoop data into Hive. https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.3/bk_dataintegration/content/using_sqoop_to_move_data_into_hive.html And below for HBase. http://www.dummies.com/programming/big-data/hadoop/importing-data-into-hbase-with-sqoop/
05-11-2017
03:55 PM
Glad I could help. No, unfortunately you won't be able to manage both clusters with the same Ambari.
05-11-2017
03:03 PM
You do not need to change the path for your existing cluster. Leave it as is. Chroot the second/new cluster. As for versions, you should be ok. To make your config life easier, I suggest you not install both Kafka versions on the same nodes though.
05-10-2017
04:18 PM
I've updated my response to include the steps.
05-10-2017
03:34 PM
@Silvio del Val Yes, you can install and manage the second set of Kafka brokers manually (not through Ambari). As for Zookeeper, it can manage multiple Kafka clusters; just "chroot" them differently, e.g. {zookeeperHost}:{portnumber}/{kafkacluster1} and {zookeeperHost}:{portnumber}/{kafkacluster2}. Take a look at the link below on how Zookeeper manages sessions and on chroot: https://zookeeper.apache.org/doc/r3.2.2/zookeeperProgrammers.html#ch_zkSessions
To do this:
1) Create the chroot in Zookeeper with the commands below (replace "yourZookeeperHost" and the port number with the appropriate values from your environment):
zkCli.sh -server yourZookeeperHost:2181
# note the empty brackets below are _required_
create /kafkaCluster1 []
This will create a path called "kafkaCluster1". You can verify it by running the "ls /" command.
2) In the Kafka broker configs, under "zookeeper.connect", add a "chroot" path, which will make all Kafka data for this cluster appear under that particular path. To do this, give a connection string in the form zkHost1:port1,zkHost2:port2,zkHost3:port3/chroot/path, for example yourZookeeperHost:2181/kafkaCluster1, which would put all of this cluster's data under the path /kafkaCluster1 (see the sketch at the end of this answer).
3) In the Kafka consumer configs, under "zookeeper.connect", input the same value as in step 2 for the brokers.
4) Repeat steps 1-3 for your second Kafka cluster using a different "chroot" path.
For more explanation of the Kafka configs take a look at the link below: https://kafka.apache.org/08/documentation/#configuration
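As a minimal sketch of step 2 (host, port, and chroot names as assumed above), the zookeeper.connect line in each cluster's broker server.properties differs only in the chroot:
# brokers of the first Kafka cluster
zookeeper.connect=yourZookeeperHost:2181/kafkaCluster1
# brokers of the second Kafka cluster
zookeeper.connect=yourZookeeperHost:2181/kafkaCluster2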