Member since: 04-12-2017
Posts: 17
Kudos Received: 6
Solutions: 1
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 641 | 06-18-2017 04:16 AM |
10-06-2017
08:51 AM
@RANKESH NATH Architecture-wise it is not a good option: Spark is a distributed processing framework (running on multiple nodes), so Spark writing to MySQL on a single node will create a performance bottleneck because MySQL is not distributed. If you want to make full use of Spark's distributed processing capability, it is advisable to use either a distributed file system or a distributed database.
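For illustration, here is a minimal sketch (assuming Spark 2.x with the DataFrame API; the JDBC URL, credentials, table names and HDFS paths are made-up examples) contrasting a JDBC write into a single MySQL instance with a parallel write to a distributed file system:

```scala
import java.util.Properties
import org.apache.spark.sql.{SaveMode, SparkSession}

object WriteSinkSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("write-sink-sketch").getOrCreate()
    val df = spark.read.parquet("hdfs:///data/events") // hypothetical input path

    // Option 1: JDBC write to MySQL. All executors funnel their output into a
    // single (non-distributed) database server, which becomes the bottleneck.
    val props = new Properties()
    props.setProperty("user", "app")        // hypothetical credentials
    props.setProperty("password", "secret")
    df.write
      .mode(SaveMode.Append)
      .jdbc("jdbc:mysql://dbhost:3306/mydb", "events", props)

    // Option 2: Write to a distributed file system (or a distributed store
    // such as Hive/HBase). Each executor writes its own partitions in
    // parallel, so throughput scales with the cluster, not with one DB node.
    df.write
      .mode(SaveMode.Overwrite)
      .parquet("hdfs:///warehouse/events_parquet")

    spark.stop()
  }
}
```

If MySQL is still needed as a serving layer, one common pattern is to keep the bulk output on HDFS/Hive and only load a much smaller aggregated result into MySQL.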
10-06-2017
08:42 AM
@MyeongHwan Oh Do you mean transferring data to a Hadoop data node? You can use your own test data in Oracle and import it using Sqoop, as there are no specific datasets or benchmarking results for Sqoop. Also, there are a few common performance improvement techniques for Sqoop, e.g. split-by and boundary-query, direct, fetch-size, and num-mappers. Please find below links, which are good starting points:
https://community.hortonworks.com/articles/70258/sqoop-performance-tuning.html
https://kb.informatica.com/h2l/HowTo%20Library/1/0930-SqoopPerformanceTuningGuidelines-H2L.pdf
http://www.xmsxmx.com/performance-tuning-data-load-into-hadoop-with-sqoop/
https://dzone.com/articles/apache-sqoop-performance-tuning
If this answers your question, please vote/accept the best answer.
10-06-2017
08:17 AM
Can you send a sample file? I will try it on my machine.
10-05-2017
12:56 PM
1 Kudo
Can you please verify the back-pressure threshold of the queue coming from the previous processor? Once the threshold is reached, the component right before the queue will no longer be triggered to run until the queue goes back below the threshold. You can see it is red in color. You can increase the threshold limit.
07-07-2017
10:29 PM
The Ambari Server is restarted automatically after the 'ambari-admin-password-reset' command is fired and the new password is provided, which makes the password change effective, so there is no need for 'ambari-server restart'. I feel step 3 is not mandatory, but it can be followed.
06-22-2017
03:01 AM
@Wynner But what if there are multiple sources for Get processors? Doesn't the primary node then become a bottleneck? Do we have a solution or pattern for this?
06-20-2017
04:59 AM
I am new to NiFi. As per my understanding, the primary node is used to run isolated processes, e.g. if we have a processor to get data from an FTP directory, it is better to run it on the primary node only so that there is no extra load on the FTP server. For a scenario where I have multiple Get processors, e.g. one for FTP and another to get data from a DB, won't the primary node's performance be a bottleneck, since we cannot have more than 1 primary node? Below are my queries:
1. Do we have any NiFi flow design patterns for the above scenario?
2. If we do not run the Get processors on the primary node, there is a possibility of fetching duplicate data. Please clarify.
Labels:
- Apache NiFi
06-19-2017
08:58 AM
@Artem Ervits This is a great article. Do we have a performance comparison of Hive on HDFS vs Hive on HBase? Is it advisable to go with Hive on HBase in production for large datasets?
06-18-2017
04:16 AM
2 Kudos
@Ankit Jain ExtractCCDAAttributes was introduced in NiFi 1.3.0. It extracts information from a Consolidated CDA formatted FlowFile and provides the individual attributes as FlowFile attributes.
Please refer to https://nifi.apache.org/docs.html
06-18-2017
03:55 AM
1 Kudo
@Balakumar Balasundaram Do the explanation and link I provided address your question? If so, please "accept" the answer to close the posting.
06-18-2017
03:53 AM
@priyal patel Does the explanation provided address your question? If so, please "accept" the answer to close the posting.
06-02-2017
04:01 AM
@priyal patel You can look at the application logs from the failed application. To fetch them, run the following command as the user who ran the Sqoop command:
yarn logs -applicationId application_1496289796598_0013 > appln_logs.txt
appln_logs.txt will contain more details on the errors. Please post them here if you can't figure it out.
05-30-2017
11:17 AM
@priyal patel Sqoop - Sqoop is used to move data from an existing RDBMS to Hadoop (or vice versa). Once the data is initially imported with Sqoop (i.e. the initial load is performed), the incremental data (i.e. the data that is updated in the RDBMS) is not updated automatically; it needs to be imported incrementally using incremental imports: https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_incremental_imports

Flume - Flume was previously the main tool to ingest log files, events, flat files, CSVs, etc. Flume has recently fallen out of favour and is often being replaced with HDF (Hortonworks DataFlow)/NiFi.

HDF (Hortonworks DataFlow)/NiFi - NiFi provides a visual user interface with more than 180 processors for collecting data from various sources (sensors, geo-location devices, machines, logs, files, feeds, etc.), performing simple event processing (e.g. parsing, filtering), and delivering it to storage platforms such as HDP in a secure environment.

Kafka - Kafka is a distributed, fault-tolerant messaging system that lets you publish and subscribe to streams of records. Generally it is used for real-time stream processing, where real-time messages are buffered in Kafka and consumed by Storm or Spark Streaming (see the sketch below).

The right tool for the job depends on your use case. Here is another good write-up on the same subject: https://community.hortonworks.com/questions/23337/best-tools-to-ingest-data-to-hadoop.html

As always, if you find this post useful, please accept the answer.
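As a rough illustration of the "Kafka buffered, consumed by Spark Streaming" pattern mentioned above, here is a minimal Spark Structured Streaming sketch. The broker address, topic name and HDFS paths are assumptions for the example, and it requires the spark-sql-kafka connector on the classpath:

```scala
import org.apache.spark.sql.SparkSession

object KafkaToHdfsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kafka-to-hdfs-sketch").getOrCreate()

    // Subscribe to a Kafka topic; Kafka buffers the real-time messages.
    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "kafka1:9092") // assumed broker
      .option("subscribe", "sensor-events")             // assumed topic
      .load()
      .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

    // Continuously land the consumed records on HDFS (a console sink can be
    // used instead while testing).
    val query = stream.writeStream
      .format("parquet")
      .option("path", "hdfs:///landing/sensor_events")
      .option("checkpointLocation", "hdfs:///checkpoints/sensor_events")
      .start()

    query.awaitTermination()
  }
}
```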
05-29-2017
09:18 AM
@Balakumar Balasundaram The main disadvantage of RDDs is that they don't perform particularly well. Whenever Spark needs to distribute the data within the cluster, or write the data to disk, it does so using Java serialization by default (although it is possible to use Kryo as a faster alternative in most cases). The overhead of serializing individual Java and Scala objects is expensive and requires sending both data and structure between nodes (each serialized object contains the class structure as well as the values). There is also the overhead of garbage collection that results from creating and destroying individual objects.

So in Spark 1.3 the DataFrame API was introduced, which seeks to improve the performance and scalability of Spark. The DataFrame API introduces the concept of a schema to describe the data, allowing Spark to manage the schema and only pass data between nodes, in a much more efficient way than using Java serialization. There are also advantages when performing computations in a single process, as Spark can serialize the data into off-heap storage in a binary format and then perform many transformations directly on this off-heap memory, avoiding the garbage-collection costs associated with constructing individual objects for each row in the data set. Because Spark understands the schema, there is no need to use Java serialization to encode the data. Query plans are created for execution using the Spark Catalyst optimiser. After an optimised execution plan has been prepared through several steps, the final execution still happens internally on RDDs, but that is completely hidden from the user.

Please find below a list of useful blogs:
https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html
http://www.adsquare.com/comparing-performance-of-spark-dataframes-api-to-spark-rdd/
https://www.youtube.com/watch?v=1a4pgYzeFwE
http://why-not-learn-something.blogspot.in/2016/07/apache-spark-rdd-vs-dataframe-vs-dataset.html
https://www.youtube.com/watch?v=pZQsDloGB4w
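To make the difference concrete, here is a small sketch (the names and data are made up) showing the same aggregation done on an RDD of Scala objects, which get serialized whenever they move between nodes, versus a DataFrame, where Spark knows the schema and can rely on Catalyst and its compact binary row format:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg

case class Person(name: String, age: Int)

object RddVsDataFrameSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("rdd-vs-df-sketch").getOrCreate()
    import spark.implicits._

    val people = Seq(Person("Alice", 34), Person("Bob", 29), Person("Carol", 41))

    // RDD version: each Person object is serialized (Java serialization by
    // default, Kryo if configured) whenever data is shuffled or spilled.
    val rdd = spark.sparkContext.parallelize(people)
    val (sum, count) = rdd
      .map(p => (p.age, 1))
      .reduce((a, b) => (a._1 + b._1, a._2 + b._2))
    println(s"RDD average age: ${sum.toDouble / count}")

    // DataFrame version: Spark manages the schema, stores rows in a compact
    // binary format, and lets the Catalyst optimiser plan the execution
    // (which ultimately still runs on RDDs under the hood).
    val df = people.toDF()
    df.agg(avg($"age")).show()

    spark.stop()
  }
}
```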
05-29-2017
03:20 AM
@Sonu Sonu You can use the FCC Conversion API to get the FIPS code from lat & long. For example: http://data.fcc.gov/api/block/2010/find?latitude=40.0&longitude=-85
Also, please check out http://www.datasciencetoolkit.org/, a ready-to-use virtual machine (VM) for geocoding and reverse geocoding; it provides useful information including FIPS codes. I hope it helps.
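As a minimal sketch of calling that endpoint programmatically (plain Scala with no extra libraries; the coordinates are just the example values from the URL above, and a real job would parse the response to pull out the block FIPS code instead of printing it):

```scala
import scala.io.Source

object FccBlockLookupSketch {
  def main(args: Array[String]): Unit = {
    // Example coordinates from the post; the response contains the block FIPS code.
    val url = "http://data.fcc.gov/api/block/2010/find?latitude=40.0&longitude=-85"

    val src = Source.fromURL(url)
    try {
      // Print the raw response; parse it to extract the FIPS code in a real job.
      println(src.mkString)
    } finally {
      src.close()
    }
  }
}
```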
05-24-2017
05:23 AM
1 Kudo
@Bala Vignesh N V Capacity planning depends on multiple factors:
- The amount of data that needs to be stored, the incremental data growth rate for the next 2-3 years, and the data retention period (a rough sizing sketch is included below)
- The kind of processing: real time or batch
- The use cases - based on the use case, workload patterns can be derived: balanced workload (load distributed equally across CPU, disk I/O and network I/O), compute intensive (CPU bound - complex algorithms, NLP, HPCC, etc.), or I/O intensive - jobs requiring very little compute power and a lot of I/O (archival use cases have lots of cold data). If you don't know the workload pattern, it is recommended to start with a balanced workload
- The SLAs for the system

You need to consider all the above factors. It really depends on what you want to do with the data, and you need to look at each and every piece: e.g. ingesting data from different sources (Sqoop, Flume, NiFi, etc.), transformations if any (Pig/Hive), consumption (Hive), real-time processing with a distributed message queue (Kafka), and storage (HBase). The best strategy in my opinion is to set up a development cluster, test things out, and then scale up. Hadoop is designed in a way that most tasks scale linearly with the resources allocated to them. The below 2 links are good starting points:
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.0/bk_cluster-planning/content/ch_hardware-recommendations_chapter.html
http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.0/bk_cluster-planning/content/index.html
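As a rough back-of-the-envelope illustration of the first factor (every number below is a made-up assumption, not a recommendation), the raw HDFS capacity can be estimated from daily ingest, retention, replication and some temporary-space headroom:

```scala
object CapacitySizingSketch {
  def main(args: Array[String]): Unit = {
    // Assumed inputs -- replace with your own numbers.
    val dailyIngestTb     = 0.5      // TB of new data landing per day
    val retentionDays     = 365 * 2  // keep data for ~2 years
    val annualGrowth      = 0.25     // ingest grows ~25% per year
    val replicationFactor = 3        // HDFS default replication
    val tempSpaceFactor   = 1.25     // ~25% headroom for intermediate/temp data

    // Crude average of daily ingest over the period, accounting for growth.
    val avgDailyIngestTb = dailyIngestTb * (1 + annualGrowth / 2)

    val logicalDataTb = avgDailyIngestTb * retentionDays
    val rawHdfsTb     = logicalDataTb * replicationFactor * tempSpaceFactor

    println(f"Logical data retained : $logicalDataTb%.1f TB")
    println(f"Raw HDFS capacity     : $rawHdfsTb%.1f TB")
  }
}
```

The point of the sketch is only to show how quickly replication and headroom multiply the logical data volume; the actual numbers must come from your own ingest rates, retention policy and SLAs.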