Member since: 01-14-2019
Posts: 144
Kudos Received: 48
Solutions: 17
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 576 | 10-05-2018 01:28 PM
 | 559 | 07-23-2018 12:16 PM
 | 807 | 07-23-2018 12:13 PM
 | 4850 | 06-25-2018 03:01 PM
 | 2209 | 06-20-2018 12:15 PM
01-22-2019
07:40 PM
2 Kudos
The Data Load Process
We looked at the performance of these engines in the last article; now it's time to look at how the data got loaded in. There are trade-offs to be aware of when loading data into each of these engines, as they use different mechanisms to accomplish the task.
Hive Load
There is no immediate need for a schema-on-write data load when you are using Hive with your native file format. Your only "load" operation is copying the files from your local file system to HDFS. With schema-on-read functionality, Hive can access data as soon as its underlying file is loaded into HDFS. In our case, the real data load step was converting this schema-on-read external Hive table data into the optimized ORC format, i.e., loading it from an external table into a Hive-managed table. This was a relatively short process, coming in well under an hour.
HBase Load
Contrast that with HBase, where a bulk data load for our sample data set of 200M rows (around 30GB on disk in CSV format) took 4+ hours using a single-threaded Java application running in the cluster. In this case, HBase went through a process of taking several columns of the CSV data and concatenating them together to build a composite key. This, along with the fact that the inserts were causing hot-spotting within the Region Servers, slowed things down. One way to improve this performance would be to pre-split the regions so your inserts aren't all going to one region to start with. We could also have parallelized the data load to improve performance, writing a MapReduce job to distribute the work.
Druid Load
Let's also contrast that with the Druid load, which took about 2 hours. Druid bulk loads data using a MapReduce job; this is a fairly efficient way of doing things since it distributes the work across the cluster, and it is why we're seeing a lower time relative to HBase. Druid still has to do the work of adding its own indexes on top of the data and optionally pre-aggregating the data to a user-defined level, so it doesn't have a trivial path to getting the data in either. Although we didn't choose to pre-aggregate this data, this is what allows Druid to save a lot of space; instead of storing the raw data, Druid can roll the data up to, say, minute-level granularity if you think your users will not query deeper than that. But remember: once you aggregate the data, you no longer have the raw data.
Space Considerations
Another interesting way to slice this data is by how much space it takes up in each of the three columnar formats.
Engine | Size on Disk with Replication
---|---
Hive - ORC w/ Zlib | 28.4GB
HBase - Snappy compression | 89.5GB
Druid | 31.5GB
Hive and Druid have compressed the data very efficiently considering the initial data size was 90GB with replication, but HBase is sitting right around the raw data size. At this point, we've covered both the relative loading times for the three engines and their storage requirements. These numbers may change as you use different compression formats or load different kinds of data, but this is intended as a general reference for understanding the relative strengths of the three.
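To make the Hive side of this concrete, here is a minimal sketch of that load path. The paths, connection string, and table schema below are illustrative placeholders (the exact DDL used in this test isn't shown here), but the pattern is the same: copy raw CSVs into HDFS, expose them through a schema-on-read external table, then convert into an ORC-backed managed table.
# Copy the raw CSV files into HDFS - this is the only "load" step needed for schema-on-read
hdfs dfs -mkdir -p /data/raw/transactions
hdfs dfs -put transactions_*.csv /data/raw/transactions
# Point an external table at the raw files
beeline -u "jdbc:hive2://<hiveserver2-host>:10000/default" -e "
CREATE EXTERNAL TABLE transactions_ext (trxn_time TIMESTAMP, trxn_amt DOUBLE, rep_id INT, qty INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/raw/transactions';"
# Convert it into an ORC-backed, Hive-managed table (Zlib compression, as used above)
beeline -u "jdbc:hive2://<hiveserver2-host>:10000/default" -e "
CREATE TABLE transactions STORED AS ORC TBLPROPERTIES ('orc.compress'='ZLIB')
AS SELECT * FROM transactions_ext;"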
- Find more articles tagged with:
- Data Ingestion & Streaming
- data-ingestion
- data-processing
- druid
- HBase
- Hive
- How-ToTutorial
01-03-2019
06:55 PM
It looks like you're thinking about the problem as a batch process rather than a real-time process. In a real-time, streaming flow, there is no start or end to a unit of work. Things are always on, and the processors we are discussing are designed to create user-defined chunks of work to segment the real-time flow of data in some way. What you're describing, however, is a very specific unit of work, defined by the set of files you are receiving. You have to understand, to some degree, the parameters of the work being done on those files in order to window it properly: What is the upper bound of records you can expect in the largest payload you may get? How long will the largest payload take to process? The answers can give you an idea of how to set those parameters. Additionally, in a real-time workflow you should not have to combine everything back into one file, as that is a serial process and will be a bottleneck. I would suggest taking a look at your downstream processes and considering parallelizing them.
01-03-2019
04:55 PM
5 Kudos
This article series expands on the technical details behind the Big Data Processing Engines blog post: https://hortonworks.com/blog/big-data-processing-engines-which-one-do-i-use-part-1/
Intro to performance analysis
Here, we will be deep diving into the performance side of the three Big Data Processing Engines discussed in the above blog post: Druid, HBase, and Hive. I ran a number of query types to represent the various workloads generally executed on these processing engines, and measured their performance in a consistent manner. I ran a few tests on each of the three engines to showcase their different strengths, and show where some are less effective (but could still fill the gap in a pinch).
We will be executing the following queries:
Simple count of all records in a table, highlighting aggregation capabilities
A select with a where clause, highlighting drill-down and "needle in the haystack" OLTP queries
A join, showcasing ad-hoc analysis across the dataset
Updates, showcasing scenarios in which data is constantly changing and our dataset needs to stay up to date
An aggregation much like an analyst would do, such as summing data on a column
Performance Analysis
A few notes about setup:
Data size: 200 million rows, 30GB on disk (90GB after replication)
Cluster size: 8 nodes, broken down into 2 masters and 6 workers
Node size: 8 core, 16 GB RAM machines in a virtualized environment
Method of querying: Hive on Tez+LLAP was used to query Hive-managed and Druid-managed data. Phoenix was used to query HBase-managed data. A reasonable configuration was used for each of the engines
Cache was taken out of the picture in order to get accurate estimates for initial query execution. Query execution and re-execution times will be much faster with cache in place for each of these engines
Note that this is not an ideal setup for Hive+HBase+Druid; dedicated nodes for each of these services would yield better numbers, but we decided to keep it approachable so you can reproduce these results on your own small cluster.
As I laid out, the three processing engines performed about how you would expect given their relative strengths and weaknesses. Take a look at the table below.
Query | HBase/Phoenix (seconds) | Hive (seconds) | Druid (seconds)
---|---|---|---
Count(*) | 281.44 | 4.72 | 0.71
Select with filter | 1.35 | 8.71 | 0.34
Select with join and filter | 365.41 | 9.16 | N/A
Update with filter | 1.52 | 9.75 | N/A
Aggregation with filter | 353.07 | 8.66 | 1.72
Here is that same information in a graph format, with HBase capped at 15s to keep the scale readable.
As expected, HBase outshone the other two when it came to ACID operations, averaging around 1.5 seconds on the updates; Druid is not capable of them, and Hive took a bit longer. HBase, however, is not great at aggregation queries, as seen in the ~6-minute query times. Druid is extremely efficient at everything it does support, returning nothing above 2 seconds and mostly sub-second results. Lastly, Hive with its latest updates has become a real-time database, servicing every query thrown at it in under 10 seconds.
Queries
Here are all of the queries that were run, multiple times each, to arrive at the results above.
--queries
select count(*) from transactions;
select count(*) from transactions_hbase;
select count(*) from transactions_druid;
select trxn_amt,rep_id from transactions_partitioned where trxn_date="2018-10-09" and trxn_hour=0 and trxn_time="2018-10-09 00:33:59.07";
select * from transactions_hbase_simple where row_key>="2018-09-11 12:03:05.860" and row_key<"2018-09-11 12:03:05.861";
select * from transactions_druid where `__time`='2018-09-11 12:03:05.85 UTC';
select distinct(b.name) from transactions_partitioned a join rep b on a.rep_id=b.id where rep_id in (1,2,3) and trxn_amt > 180;
select distinct b."name" from "transactions_hbase" a join "rep_hbase" b on a."rep_id"=b."ROW_KEY" where b."ROW_KEY" in (1,2,3) and a."trxn_amt">180.0;
update transactions_partitioned set qty=10 where trxn_date="2018-10-09" and trxn_hour=0 and trxn_time>="2018-10-09 00:33:59.07" and trxn_time<"2018-10-09 00:33:59.073";
insert into table transactions_hbase_simple values ('2018-09-11 12:03:05.860~xxx-xxx~xxx xxx~1~2017-02-09', null,null,null,10,null,null,null);
select sum(trxn_amt),rep_id from transactions_partitioned group by rep_id;
select sum("trxn_amt"),"rep_id" from "transactions_hbase" group by "rep_id";
select sum(trxn_amt),rep_id from transactions_druid group by rep_id;
- Find more articles tagged with:
- Data Processing
- data-processing
- druid
- HBase
- Hive
- How-ToTutorial
- performance
01-02-2019
06:25 PM
You may be looking for the alternative algorithm for MergeRecord, the bin-packing algorithm. This will put your FlowFiles into bins (you specify how many bins to keep active) and save them off every 1000 (configurable) records. You can specify which attribute to key off of and how long to keep the bins alive before force-writing the remainder of the records to a file.
Regarding summary stats, I think your best bet would be to look to Reporting Tasks to fill that role. Here's a good article on them: https://pierrevillard.com/tag/reporting-task/
If your needs are simple enough, you may be able to get away with using the NiFi API to query for record counts. You may be looking for something slightly different, but here's something to get your mind thinking: https://community.hortonworks.com/articles/83610/nifi-rest-api-flowfile-count-monitoring.html
And lastly, you can look into NiFi Counters, custom global variables you can increment and decrement: https://pierrevillard.com/2017/02/07/using-counters-in-apache-nifi/
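If you go the REST API route, here's a quick sketch of pulling queued FlowFile counts. The host, port, and process group ID are placeholders, jq is assumed to be installed, and the endpoint paths are from memory, so verify them against your NiFi version's REST API docs.
# Overall controller status, which includes a total queued FlowFile count
curl -s "http://<nifi-host>:8080/nifi-api/flow/status" | jq '.controllerStatus.flowFilesQueued'
# Or scope it to one process group (grab the group's ID from the NiFi UI)
curl -s "http://<nifi-host>:8080/nifi-api/flow/process-groups/<group-id>/status" | jq '.processGroupStatus.aggregateSnapshot.flowFilesQueued'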
01-02-2019
04:55 PM
You most likely have an issue either with root privileges on the destination machine or with networking blocking you from reaching the machine altogether. Can you try pinging the machine from the Ambari server host? Can you also try SSH'ing to it from there and then escalating to root?
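For example, from the Ambari server host (hostname and user are placeholders):
# Basic reachability check
ping -c 3 <target-host>
# Confirm SSH access and root escalation
ssh <your-user>@<target-host>
sudo su -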
11-13-2018
09:24 PM
You should be able to use the HDF management pack (mpack) to install NiFi on top of your existing HDP cluster. See the following documentation for details: https://docs.hortonworks.com/HDPDocuments/HDF3/HDF-3.0.1.1/bk_installing-hdf-on-hdp/content/ch_install-mpack.html
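As a rough outline of what's involved on the Ambari server host (the mpack tarball path/version below is a placeholder - use the exact one for your HDF version from the linked docs):
# Install the HDF management pack into Ambari, then restart so the new stack definitions are picked up
ambari-server install-mpack --mpack=/tmp/hdf-ambari-mpack-<version>.tar.gz --verbose
ambari-server restart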
11-13-2018
09:20 PM
@satish pujari Specific HDF-related service recommendations can be found at the following link; this article has been extremely useful for me and hopefully will be for you as well: https://docs.hortonworks.com/HDPDocuments/HDF3/HDF-3.1.1/bk_planning-your-deployment/content/ch_hardware-sizing.html
Here's a good general read for HDP: https://community.hortonworks.com/articles/16763/cheat-sheet-and-tips-for-a-custom-install-of-horto.html
Regarding some of the specific components you've called out:
- NameNode - the HDFS master service needs to be on a node with enough cores (probably 16 in your case, though it can get by with 8). RAM requirements are at least 32GB but preferably 64GB; you can probably stay with 32GB given a 7-node cluster, but the Java heap will grow as your NameNode keeps track of a larger number of files distributed across the data nodes. For disk, separate the OS disk from the data disks and follow the HDP guide above.
- Data node - multiple disks for parallel I/O and enough cores for parallel block processing are the main requirements.
- Hive/YARN/Spark - both Hive and Spark have computationally intensive workloads. Higher core counts (at least 16) and more RAM (at least 64GB, but 128GB recommended) are important here. YARN will be co-located with the data nodes, so you will have lots of disk space on these nodes.
- HBase - enough RAM to maintain temp space for CRUD operations (the memstore) as well as to cache results to serve; RAM needs scale up as the number of regions on the node grows. The recommendations are similar to the Hive workloads: more cores for more parallel processing and enough disk to store the regions. A node should not exceed 200 regions at 10GB/region, so you don't need more than ~3TB spread across multiple disks.
General recommendations: looking at the number of nodes you've got versus the number of services planned, I'd recommend at least a 12-node cluster (if not 16, which is preferable) to create more compute capacity, or reducing the number of workloads to start with. Hope that helps.
11-13-2018
08:00 PM
It all depends on whether there were any architectural changes to the parts of the OS that HDP interacts with. You can try installing it and testing it out to know for sure, and I would imagine it is possible, but I have not tried it myself, nor would I recommend it until it is certified.
11-13-2018
07:56 PM
It's hard to tell why from this exception, but you'll need to run the command hdfs dfs -chown kafka:hdfs /user/kafka and possibly do it recursively (hdfs dfs -chown -R kafka:hdfs /user/kafka) if you have files in that directory created with the incorrect permissions. That should fix the immediate issue for you.
However, I would recommend you use Beeline rather than the Hive CLI - the Hive CLI is deprecated in HDP 2.6 and is fully removed starting with HDP 3, so you won't be able to use it if you upgrade. Beeline performance is faster, and it also respects the security controls that have been put in place by Ranger, whereas the Hive CLI does not. To use Beeline, go to the Hive service in Ambari to get the JDBC connection string to connect with. Here's some documentation on how to use that tool: https://community.hortonworks.com/articles/155890/working-with-beeline.html
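Putting that together (the HiveServer2 host and credentials are placeholders for your environment):
# Fix ownership of the kafka user's HDFS home directory, recursively if needed
hdfs dfs -chown -R kafka:hdfs /user/kafka
# Then connect with Beeline instead of the deprecated Hive CLI
beeline -u "jdbc:hive2://<hiveserver2-host>:10000/default" -n <user> -p <password>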
11-02-2018
10:15 PM
3 Kudos
With the arrival of the Hive3Streaming processors, performance has never been better between NiFi and Hive 3. Below, we'll be taking a look at the PutHive3Streaming processor (separate from the PutHiveStreaming processor) and how it can fit into a basic Change Data Capture workflow.
We will be performing only inserts on the source Hive table and then carrying those inserts over into a destination Hive table. Updates are also supported through this process with the addition of a 'last_updated_datetime' timestamp, but that is out of scope for this article. This is meant to simulate copying data from one HDP cluster to another - we're using the same cluster as both the source and destination here, however.
Here is the completed flow. Take note that most of this flow is just keeping track of the latest ID that we've seen, so that we can pull that back out of HDFS periodically and query Hive for records that were added beyond that latest ID.
The last two processors, SelectHive3QL and PutHive3Streaming, are the ones doing the heavy lifting: first getting the data from the source table (based on the pre-determined latest ID) and then inserting the retrieved data into the destination Hive table.
Here's the configuration for the SelectHive3QL processor. Note the ${curr_id} variable used in the HiveQL Select Query field; that ensures our query will be dynamic.
This is the configuration for the PutHive3Streaming processor. Nothing special here - we've configured the Hive Configuration Resources with the hive-site.xml file and used the Avro format (above) to retrieve data and (below) to write it back out.
Here's the state of the table as we first check the destination table, then insert a record into the source table, and then check the new state of the destination table.
Here is the full NiFi flow: puthive3streaming-cdc-flow.xml
In conclusion, we've shown one of the ways you can use the new PutHive3Streaming processor to perform quick inserts into Hive and, at a broader level, perform Change Data Capture.
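As an approximation of the incremental pull the SelectHive3QL processor performs on each run (the table and column names here are illustrative; in the actual flow the latest ID is read back from HDFS and injected through the ${curr_id} attribute):
# Shell stand-in for the processor's HiveQL Select Query, with CURR_ID playing the role of ${curr_id}
CURR_ID=12345
beeline -u "jdbc:hive2://<hiveserver2-host>:10000/default" -e "SELECT * FROM transactions_source WHERE id > ${CURR_ID};"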
- Find more articles tagged with:
- cdc
- Data Ingestion & Streaming
- hive3
- How-ToTutorial
- NiFi
- puthive3streaming
11-02-2018
09:10 PM
Take a look at HDFS quotas; that may be what you're after: https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/HdfsQuotaAdminGuide.html
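For example (paths and limits are placeholders):
# Cap the number of names (files + directories) under a directory
hdfs dfsadmin -setQuota 100000 /user/project_a
# Cap the raw space (replication included) under a directory
hdfs dfsadmin -setSpaceQuota 500g /user/project_a
# Check current quota usage
hdfs dfs -count -q -h /user/project_a
# Remove the quotas again if needed
hdfs dfsadmin -clrQuota /user/project_a
hdfs dfsadmin -clrSpaceQuota /user/project_a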
11-02-2018
09:06 PM
Unfortunately, each table in HBase requires at least one region backing it. To resolve your issue, you will have to merge the tables before loading into HBase. Hive is another component that would probably fit your use case better: there are no regions in Hive; rather, the data is stored in the file system, so lots of small tables are possible.
11-02-2018
08:36 PM
NiFi doesn't do joins in-stream. To accomplish this, I'd recommend processing the first CSV file and storing it in HBase with key=id. You can then process the second csv file, creating a flow to use the id as a query to HBase and get back the reference data from the first CSV. If you'd like to do windowed joins, take a look at Streaming Analytics Manager inside of the HDF platform.
10-05-2018
01:28 PM
If you are running MapReduce on a single node, it will take more time than a sequential application due to the job creation overhead that MapReduce must undertake. There is extra time taken in the case of MapReduce to submit the job, copy the code and dependencies into a YARN container, and start the job. As you scale out to several nodes and more, you will see the performance benefits of MapReduce. In general however, MapReduce is used less often now on the platform - Hive runs on Tez now rather than MapReduce and I've only seen MapReduce of late being used for things like bulk loading data into HBase/Druid. In-memory processing, the likes of which both Hive/LLAP and Spark provide, can net you a significant performance boost depending on what you're trying to accomplish and the tool best suited for the job.
08-04-2018
01:37 PM
@Banshidhar Sahoo I'm unclear on what you mean by Reports Manager Working Directory. Could you explain what that is? FSImage is a file that HDFS uses to store metadata about the blocks on the cluster. In general you should not change or move this file as HDFS relies on it to perform its functions. See the HDFS Architecture documentation: https://hadoop.apache.org/docs/r3.1.0/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html
07-23-2018
12:25 PM
@JAy PaTel It looks like you don't have properly formatted JSON. Try running your JSON through an online validator like this one before you submit to Druid: https://jsonformatter.curiousconcept.com/ I've identified the issue to be an extra comma in your spec; here's the corrected JSON:
[
  {
    "dataSource":[
      {
        "spec":{
          "dataSchema":{
            "granularitySpec":{
              "queryGranularity":"none",
              "type":"uniform",
              "segmentGranularity":"hour"
            },
            "dataSource":"stockexchange",
            "parser":{
              "type":"string",
              "parseSpec":{
                "format":"csv",
                "timestampSpec":{
                  "format":"auto",
                  "column":"timestamp"
                },
                "columns":[
                  "timestamp",
                  "open",
                  "high",
                  "low",
                  "close",
                  "volume"
                ],
                "dimensionsSpec":{
                  "dimensions":[
                    "open",
                    "high",
                    "low",
                    "close",
                    "volume"
                  ]
                }
              }
            }
          },
          "ioConfig":{
            "type":"realtime"
          },
          "tuningConfig":{
            "type":"realtime",
            "intermediatePersistPeriod":"PT10M",
            "windowPeriod":"PT10M",
            "maxRowsInMemory":75000
          }
        },
        "properties":{
          "task.partitions":"2",
          "task.replicants":"2",
          "topicPattern":"stockexchange.*",
          "topicPattern.priority":"1"
        }
      }
    ],
    "properties":{
      "zookeeper.connect":"ip-xxx-xx-xxxx-xx.ec2.internal:2181",
      "zookeeper.timeout":"PT20S",
      "druid.selectors.indexing.serviceName":"druid/overlord",
      "druid.discovery.curator.path":"/druid/discovery",
      "kafka.zookeeper.connect":"ip-xxx-xx-xxxx-xx.ec2.internal:2181",
      "kafka.group.id":"xxxx-xxxxx-xxxx",
      "consumer.numThreads":"2",
      "commit.periodMillis":"15000",
      "reportDropsAsExceptions":"false"
    }
  }
]
07-23-2018
12:16 PM
1 Kudo
@laiju cbabu You may be using a region where the practice exam image is not available. Please try the US East (Northern Virginia) region; I'm able to find it there.
07-23-2018
12:13 PM
@kanna k You will have to use the Ambari REST API in conjunction with a custom application that you create to measure the total uptime of the system through restarts. You can also use Ambari Alerts + notifications to get push notifications when services are stopped or started. That way, you can have an email with a timestamp of when a service was stopped/started and calculate the total uptime based on those numbers.
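As a sketch of the polling side (host, cluster name, and credentials are placeholders), the service state is what you'd record with a timestamp each time you poll:
# Current state of a service (e.g. HDFS) - STARTED, INSTALLED (stopped), etc.
curl -s -u admin:<password> -H "X-Requested-By: ambari" "http://<ambari-host>:8080/api/v1/clusters/<cluster-name>/services/HDFS?fields=ServiceInfo/state"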
07-20-2018
07:54 PM
If you are trying to stream records from Kafka to HBase, I'd recommend giving Apache NiFi a look. It's part of our stack and is geared towards moving data from one place to another.
07-20-2018
07:38 PM
You could do this in NiFi using the following steps:
- Store the "1" and the "r1" in variables using the UpdateAttribute processor
- Substring the record after the first semicolon and before the last semicolon - ExtractText processor
- Use SplitText to split on every comma. You will now have 1 FlowFile for each param
- Re-structure the data as you need - add the "1" to the beginning and the "r1" to the end. This uses the ExtractText processor again
All of the above requires the use of the NiFi Expression Language, documented here: https://nifi.apache.org/docs/nifi-docs/html/expression-language-guide.html
07-20-2018
07:27 PM
@Harry Li If you do not already have Zookeeper installed on your cluster then yes you will need to install it. The Ambari UI gives you the default setup for installing Zookeeper, so use that and it will go smoothly. Zookeeper by default installs on 3 separate nodes of your cluster, so you don't need it on all of them but you do need it on at least 1 node (and the default is 3).
07-20-2018
07:17 PM
Yes, Ambari has a REST API, documented here: https://github.com/apache/ambari/blob/trunk/ambari-server/docs/api/v1/index.md
However, I don't think it has exactly what you are looking for - you will not be able to retrieve the IP address of the user performing any actions. It is also a polling mechanism rather than a publish/subscribe mechanism, so you will have to ask it what the state of a particular service is to receive an answer. This is in contrast to your ask, which is to be notified when a service changes state. You can probably accomplish what you are asking for through Ambari Alerts paired with the Ambari REST API.
Here are some examples of common actions: https://community.hortonworks.com/articles/81353/popular-ambari-rest-api-commands.html
Here's another example, using the REST API with Ambari Alerts: https://community.hortonworks.com/questions/9124/ambari-alert-rest-api-to-filter-out-services-on-ma.html
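If you pair alerts with polling, here's a sketch of pulling current alert states over the REST API (host, cluster name, and credentials are placeholders):
# List the cluster's current alerts with their label and state
curl -s -u admin:<password> -H "X-Requested-By: ambari" "http://<ambari-host>:8080/api/v1/clusters/<cluster-name>/alerts?fields=Alert/label,Alert/state"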
07-13-2018
12:39 PM
@Satya Nittala The Atlas API functionality is documented here: https://atlas.apache.org/api/v2/index.html I've used it before and it is straightforward to get started with. You can also refer to the following GitHub repo for many examples: https://github.com/vspw/atlas-custom-types
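As a starting point, here are a couple of read-only calls against the Atlas v2 API (host and credentials are placeholders; the endpoint paths are from memory, so double-check them against the docs linked above):
# List all type definitions currently registered in Atlas
curl -s -u admin:<password> "http://<atlas-host>:21000/api/atlas/v2/types/typedefs"
# Basic search for entities of a given type, e.g. Hive tables
curl -s -u admin:<password> "http://<atlas-host>:21000/api/atlas/v2/search/basic?typeName=hive_table"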
06-29-2018
08:26 PM
That is interesting - in my testing it receives the value of the last modified time from the filesystem you are pulling from. Can you double-check the file you're pulling on your FTP server? Or perhaps try creating a new file and testing? I tested using this public FTP test site: ftp://speedtest.tele2.net/ and my date matches the date shown on the FTP site.
06-29-2018
02:16 AM
The response headers all come back as attributes on the FlowFile that goes to the downstream queue. As an example, I made a sample request to a test website on the internet and then viewed the 'response' queue; when I viewed the attributes on the FlowFile, I could access all of the HTTP headers that came back with the response.
06-28-2018
11:27 PM
I would recommend you deploy the cluster directly onto your 3 machines (VMs). You shouldn't have to do anything related to Docker to set up an HDF cluster using Ambari. The step-by-step guide for how to set it up is here: https://docs.hortonworks.com/HDPDocuments/HDF3/HDF-3.1.1/bk_installing-hdf/content/ch_install-ambari.html
One of the things you will need is to be able to connect to the internet from your nodes so that you can download Ambari and then set up the rest of the services on the cluster. Once you get Ambari set up, the rest is a click-through, UI-based interface.
For your 3-node cluster, you will need to designate one node as both a 'management' node and a 'worker' node. That node will run Ambari as well as any other master/management services you end up installing as part of HDF (such as Storm Nimbus). Yes, you can install MySQL alongside HDF on several of those nodes if you want a MySQL HA configuration, but be aware that this may impact performance; services like NiFi and Kafka require dedicated disks for maximum throughput.
For a production deployment, bare metal (rather than VMs or Docker) is recommended for highest performance, so yes, you will have to install HDF from scratch (using the instructions linked above).
06-28-2018
04:12 PM
Upon further research, I see what you're running into. The processor does not accept incoming connections, so you cannot give it any dynamic properties. Unfortunately, because of the way that processor keeps track of state and of which files it has already retrieved, it does not support dynamically changing the properties of its source. See this question from 2016 on the same topic: http://apache-nifi.1125220.n5.nabble.com/ListSFTP-processor-with-dynamic-hostname-td12495.html You can try the suggestions outlined there. There is also the option of extending the processor to do what you want.
06-27-2018
11:28 PM
I haven't had time to try this out yet, but you could try exporting the tag-based policy to get the general JSON structure. Then modify if necessary and use the REST API to send the JSON back to Ranger to create policies. I don't see it documented anywhere either.
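A rough sketch of that round trip with the Ranger public REST API - the host, credentials, service name, and policy name are placeholders, and I haven't verified this against a tag-based service specifically, so treat it as a starting point only:
# Export an existing policy as JSON to see the structure
curl -s -u admin:<password> "http://<ranger-host>:6080/service/public/v2/api/service/<service-name>/policy/<policy-name>" -o policy.json
# Edit policy.json as needed, then create a new policy from it
curl -s -u admin:<password> -X POST -H "Content-Type: application/json" -d @policy.json "http://<ranger-host>:6080/service/public/v2/api/policy"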
06-27-2018
11:05 PM
@Rangareddy Y Is there a reason you have decided to use the NiFi API to perform this task? You could read from a configuration the list of SFTP endpoints (along with username, password, etc.), parse that, and store those values as FlowFile attributes that are passed into the ListSFTP processor. You can re-use the ListSFTP processor that way, much like you are envisioning. NiFi has the NiFi expression language which can be used to reference FlowFile attributes inside of processor configuration values.
06-27-2018
10:51 PM
@Jonathan Bell ListSFTP doesn't have any built-in way to do this, but to accomplish your task I would suggest using the file.lastModifiedTime attribute that the processor writes into the FlowFiles it creates. That way, you can add a RouteOnAttribute processor immediately downstream of this one that filters out any dates you are not interested in, based on the last time the file was modified. This also allows you to filter on other attributes if you desire; the full list of written attributes is in the NiFi ListSFTP processor documentation: https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.6.0/org.apache.nifi.processors.standard.ListSFTP/index.html