Member since: 06-20-2016
Posts: 488
Kudos Received: 433
Solutions: 118
05-12-2017
12:12 PM
1 Kudo
User 2 cannot see the details or contents of User 1's process group, but can still see that the process group is present on the canvas. This matches the behavior described in the NiFi Administration Guide (https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html): "User2 is unable to add components to the dataflow or move, edit, or connect components. The details and properties of the root process group and processors are hidden from User2."
05-12-2017
10:43 AM
2 Kudos
Have each user put all of his/her flows in a process group owned by that user. This top-level process group can hold multiple flows by the user (each of which can itself be organized into nested process groups). Because the top-level process group is owned by the user, only that user has permission to view and edit its contents. This post is a good reference: https://community.hortonworks.com/questions/33007/what-is-the-best-way-to-organize-many-nifi-process.html
05-12-2017
02:15 AM
17 Kudos
Intro

Scalable, cheap storage and parallel processing are at the foundation of Hadoop. These capabilities allow Hadoop to play a key role in complementing and optimizing traditional Enterprise Data Warehouse (EDW) workloads. In addition, recent technologies like Hive LLAP (in-memory, long-running execution threads) with ACID merge, and AtScale proprietary software (virtual OLAP cubes with aggregation caching in HDFS), now for the first time allow fast BI directly against TB- and PB-sized data on Hadoop using well-known BI tools (e.g. Tableau). This article describes reference architectures for three use cases of EDW optimization:

- Active archiving: archiving aged EDW data on Hadoop for cheap storage and for exploratory and BI analysis.
- EDW offload: offloading staging data and ETL transformations to Hadoop and exporting the final data structures back to the existing EDW system (for cheap storage and faster ETL on Hadoop, as well as faster BI queries on the existing EDW).
- BI on Hadoop: Business Intelligence tooling against TB and PB data on Hadoop, with the possibility of retiring the existing EDW.
The Problem

- EDW systems like Teradata and Netezza cost 50x-100x more per GB to store data compared to HDFS.
- Most of the data in EDW systems (typically up to 70%) is staged for transformation into the final tables used for BI queries. BI users do not directly query this staged data, yet it is extremely costly to store.
- Aged data either sits expensively on the EDW or is archived to cheaper systems like tape, where it is inaccessible to BI users and analysts.
- Transformation of staging data is typically long-running, often more than a day for one transformation job. Most of the CPU in EDW systems (typically > 50%) is used to process these transformations. This long-running background CPU usage lowers the performance of BI queries run at the same time; these BI queries are the reason for the EDW, yet typically do not perform optimally.
- The EDW schema-on-write requirement stresses the ability to load modern data sources like semi-structured social data.
Reference Architectures

Any of the architectures below can be implemented alone, or a combination can be implemented together, depending on your needs and strategic roadmap. In each diagram below, red represents the EDW optimization data architecture and black represents the existing data architecture.
Use Case 1: Active Archiving

In this use case aged data is offloaded to Hadoop instead of being stored on the EDW or on archival storage like tape. Offload can be done with tools such as Sqoop (native to Hadoop) or an ETL tool like Syncsort DMX-h (proprietary, integrated with the Hadoop framework including YARN and MapReduce). A minimal querying sketch follows the benefits list below.

Benefits:

- Aged data from the EDW is now stored more cheaply.
- Aged data from archival systems like tape is now accessible to querying.
- EDW data is now combined with new data sources (like geospatial, social or clickstream) in the lake. The combination of these sources allows greater analytical capabilities, such as enriched data or customer 360 analysis, for both BI users and data scientists.
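As a hedged illustration of the querying side, the sketch below assumes aged transaction rows have already been landed in HDFS (for example by a Sqoop or DMX-h job); the paths, column names and storage format are all hypothetical.

```sql
-- Expose archived data to BI and exploratory users as an external Hive table.
-- Storage format is an assumption; match whatever your landing job writes.
CREATE EXTERNAL TABLE IF NOT EXISTS archive_transactions (
  txn_id      BIGINT,
  customer_id BIGINT,
  txn_ts      TIMESTAMP,
  amount      DECIMAL(12,2)
)
PARTITIONED BY (txn_year INT)
STORED AS ORC
LOCATION '/data/archive/transactions';

-- Register a newly archived year, then query it like any other table.
ALTER TABLE archive_transactions ADD IF NOT EXISTS PARTITION (txn_year = 2012);

SELECT txn_year, COUNT(*) AS txn_count, SUM(amount) AS total_amount
FROM archive_transactions
WHERE txn_year BETWEEN 2010 AND 2012
GROUP BY txn_year;
```

Partitioning by year keeps each archival load incremental and lets queries prune old data cheaply.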
Use Case 2: EDW Offloading

In this use case both staging data and ETL are offloaded from the EDW to Hadoop. Raw data is stored in the lake and processed into cleaned and standardized data used for Hive LLAP tables. Cleaned and standardized data is transformed into structures that are exported to the existing EDW. BI users continue to use the existing EDW unchanged, unaware that the plumbing beneath has changed. A minimal ETL sketch follows the benefits list below.

Benefits:

- Raw data is centralized in the lake, available to data scientists and for repurposing in other use cases. Raw data is retained because storage is cheap.
- New (non-EDW) data sources are ingested into the lake, leading to greater analytical capabilities as described above.
- ETL is significantly faster on Hadoop because of parallel batch processing. ETL times are reduced from large parts of a day to minutes or hours.
- ETL is removed from the existing EDW. This frees significant CPU, resulting in noticeably faster BI queries and happier BI users.
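A minimal sketch of one offloaded ETL step, assuming raw delimited extracts land in HDFS and are cleaned into an ACID ORC table served by Hive LLAP; all names, types and the transformation logic are hypothetical, and the final structures would then be exported back to the EDW (for example with Sqoop export).

```sql
-- Raw staging data as landed in the lake (schema-on-read).
CREATE EXTERNAL TABLE IF NOT EXISTS stg_orders_raw (
  order_id    STRING,
  customer_id STRING,
  order_ts    STRING,
  amount      STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION '/data/staging/orders';

-- Cleaned, standardized target: ORC + ACID so corrections can be applied
-- with Hive's ACID MERGE. (Hive 2.x ACID tables must be bucketed.)
CREATE TABLE IF NOT EXISTS dw_orders (
  order_id    BIGINT,
  customer_id BIGINT,
  order_ts    TIMESTAMP,
  amount      DECIMAL(12,2)
)
CLUSTERED BY (order_id) INTO 8 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');

-- The offloaded transformation: cast, cleanse and upsert in parallel on Hadoop.
MERGE INTO dw_orders t
USING (
  SELECT CAST(order_id AS BIGINT)      AS order_id,
         CAST(customer_id AS BIGINT)   AS customer_id,
         CAST(order_ts AS TIMESTAMP)   AS order_ts,
         CAST(amount AS DECIMAL(12,2)) AS amount
  FROM stg_orders_raw
  WHERE order_id IS NOT NULL
) s
ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET customer_id = s.customer_id,
                             order_ts = s.order_ts,
                             amount = s.amount
WHEN NOT MATCHED THEN INSERT VALUES (s.order_id, s.customer_id, s.order_ts, s.amount);
```

The resulting table can be served directly by LLAP for exploration and exported back to the existing EDW on a schedule.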
Use Case 3: BI on Hadoop

This use case is identical to EDW offload above, except that the EDW is either fully replaced by, or augmented with, OLAP on Hadoop. For greenfield environments, replacement (i.e. prevention) of a traditional OLAP purchase is particularly attractive. A sketch of an OLAP-style query against the lake follows the benefits list below.

Benefits:

- Same as the previous use case.
- OLAP queries are run directly against data in the lake. OLAP in the lake can run against larger volumes of data than traditional OLAP and can include enriched data and new data sources (e.g. geolocation, social, clickstream).
- OLAP in the lake can replace, or prevent the implementation of, expensive and constrained traditional OLAP systems.
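As a hedged illustration, OLAP-style aggregations can be expressed directly in HiveQL against LLAP-served tables; the table and dimension names below are hypothetical, and an engine such as Druid, Jethro or AtScale would typically pre-aggregate or cache these results for interactive BI tools.

```sql
-- Roll-up across two hypothetical dimensions, the kind of aggregation
-- a BI dashboard would issue: per (region, category), per region, and grand total.
SELECT region,
       product_category,
       SUM(amount) AS total_sales,
       COUNT(*)    AS order_count
FROM dw_sales
WHERE order_date >= '2017-01-01'
GROUP BY region, product_category WITH ROLLUP;
```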
Notes on OLAP on Hadoop

Jethro and Druid are both viable technologies for fast BI / OLAP on Hadoop. Druid is open source and best implemented against OLAP cubes with relatively few drill-downs/roll-ups, because each aggregation of the cube is hard-coded in Druid statements. This makes complex or rapidly evolving OLAP cubes difficult to develop, test and maintain. Jethro is proprietary software from a certified Hortonworks partner and is integrated with Hive LLAP. It is best for fully implemented BI initiatives because of its "set-and-forget" implementation, which greatly reduces the role of IT in the BI program and thus allows user needs to be met more rapidly.

Conclusion

Traditional Enterprise Data Warehouses are feeling the strain of the modern Big Data era: these warehouses are unsustainably expensive; the majority of their data storage and processing is typically dedicated to prep work for BI queries rather than the queries themselves (the purpose of the warehouse); it is difficult to store varied data like semi-structured social and clickstream; and they are constrained in how much data volume they can store, for both cost and scaling reasons. Hadoop's scalable, inexpensive storage and parallel processing can be used to optimize existing EDWs by offloading data and ETL to this platform. Moreover, recent technologies like Hive LLAP and Druid or Jethro allow you to transfer your warehouse to Hadoop and run your BI tooling (Tableau, MicroStrategy, etc.) directly against TBs and PBs of data on Hadoop. The reference architectures in this article show how to architect your data on and off of Hadoop to achieve significant gains in cost, performance and Big Data strategy. Why wait?

References
Hive LLAP and ACID merge:
https://cwiki.apache.org/confluence/display/Hive/LLAP
https://hortonworks.com/blog/announcing-apache-hive-2-1-25x-faster-queries-much/
https://www.youtube.com/watch?v=EjkOIhdOqek

Syncsort DMX-h:
http://www.syncsort.com/en/Solutions/Hadoop-Solutions/Hadoop-ETL

Jethro:
https://jethro.io/architecture

Druid:
https://hortonworks.com/open-source/druid/
http://druid.io/
05-05-2017
08:46 PM
2 Kudos
This is where Phoenix will be quite useful: it is a SQL interface to HBase and can do joins with other tables (though they should be simple joins, usually just one other table). So you can have one table where patient_id and time form a composite key, with columns holding the data for that patient that changes over time. Then you can have a lookup table keyed by patient_id and join the two whenever you need to. The slideshare linked in the original answer shows you how to build these tables, and this may help as well: https://phoenix.apache.org/joins.html I am not sure of your specific query patterns, but based on the information given this should solve your problem. A minimal sketch is below.
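A hedged sketch of that layout in Phoenix SQL; all table and column names are hypothetical.

```sql
-- Time-series table: the composite primary key (patient_id, reading_time)
-- becomes the HBase row key, so reads for one patient scan contiguous rows.
CREATE TABLE IF NOT EXISTS patient_readings (
  patient_id   VARCHAR NOT NULL,
  reading_time TIMESTAMP NOT NULL,
  heart_rate   INTEGER,
  systolic_bp  INTEGER,
  CONSTRAINT pk PRIMARY KEY (patient_id, reading_time)
);

-- Lookup table keyed by patient_id alone.
CREATE TABLE IF NOT EXISTS patients (
  patient_id VARCHAR PRIMARY KEY,
  full_name  VARCHAR,
  birth_date DATE
);

-- Simple join of the two, restricted to one patient and a time range.
SELECT p.full_name, r.reading_time, r.heart_rate, r.systolic_bp
FROM patient_readings r
JOIN patients p ON p.patient_id = r.patient_id
WHERE r.patient_id = '12345'
  AND r.reading_time >= TO_TIMESTAMP('2017-05-01 00:00:00', 'yyyy-MM-dd HH:mm:ss');
```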
05-05-2017
07:08 PM
1 Kudo
Below are the two most important aspects of HBase table design.

Key design

The most important design aspect is row key design. Proper row key design allows subsecond query results against billions of rows of data, because row keys are sorted and indexed, which results in fast random lookup and contiguous scanning of rows. Your query pattern determines row key design. Thus if you want to query by patient, patient id is the proper row key. Keep in mind that row keys can be composite (multiple keys concatenated). For example, if you want to query by hospital but only over a range of dates, you would want a compound key of hospital_id-date. In this case, a query of a date range for a hospital will find and scan the contiguous rows from hospital-date1 to hospital-date2 and avoid scanning all other rows (even if there were a billion of them).

Column family design

There is really no such thing as a nested table in HBase; sometimes it is called a nested entity. The main idea is really a column family. A single column family contains one or more columns. Column families must be defined at table creation time, but columns can be added dynamically after table creation (if an insert statement names a column that does not exist in a column family, it will be created). Column families can thus be seen as holding an array of information that may have different lengths among rows (keys). You do not have to use them that way: you can always use identical columns for each column family.

Another feature of column families is that they are written to their own files. Thus queries read only the column families holding the columns in the query. This allows you to design very wide tables (hundreds of columns) and read only a subset of columns for each query (resulting in faster performance). Also, column families can have different properties, e.g. one can be compressed and the others not. The general rule is therefore to group columns that will be queried together into the same column family, and allow the number of columns in a column family to vary among records if you wish.

Altogether, a row key defines a single row or record. All rows of a table have the same set of one or more column families. Each column family can have the same or different numbers of columns among rows, because new columns can be added at insert time for a particular record rather than only at table-create time (for all records). A minimal sketch of both ideas is given below.

The best reference for HBase table design is probably: http://shop.oreilly.com/product/0636920014348.do This is a good overview of key design and scan behavior, for using HBase with the Phoenix SQL interface (which creates and queries native HBase tables underneath): https://www.slideshare.net/Hadoop_Summit/w-145p230-ataylorv2
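As a hedged sketch of both ideas, expressed through the Phoenix SQL interface mentioned above (all table, column family and column names are hypothetical):

```sql
-- Composite row key (hospital_id, visit_date, patient_id) supports
-- contiguous range scans for one hospital over a date range.
-- Column-family prefixes ("clinical", "billing") group columns that are
-- queried together; each family is stored in its own files.
CREATE TABLE IF NOT EXISTS hospital_visits (
  hospital_id        VARCHAR NOT NULL,
  visit_date         DATE NOT NULL,
  patient_id         VARCHAR NOT NULL,
  clinical.diagnosis VARCHAR,
  clinical.ward      VARCHAR,
  billing.amount     DECIMAL(10,2),
  CONSTRAINT pk PRIMARY KEY (hospital_id, visit_date, patient_id)
);

-- Finds and scans only the contiguous key range for one hospital and
-- date window, and reads only the "clinical" column family's files.
SELECT patient_id, diagnosis, ward
FROM hospital_visits
WHERE hospital_id = 'H001'
  AND visit_date BETWEEN TO_DATE('2017-01-01', 'yyyy-MM-dd')
                     AND TO_DATE('2017-03-31', 'yyyy-MM-dd');
```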
05-03-2017
07:32 PM
The good news is that it is quite easy to add new data nodes to your cluster. This is especially true if you use Ambari and its add-host workflow. The rest of the cluster becomes aware of the new data nodes as members of the existing cluster; no work needs to be done for clients, jobs, etc. Once new data nodes are added, newly written blocks start landing on them, and you can run the HDFS balancer (available as an Ambari action) to redistribute existing file blocks across the expanded cluster. So adding sets of, say, 4 or 8 nodes to the cluster periodically is operationally easy and has no impact on existing jobs. Purchasing, racking and stacking the servers that will become data nodes is the hardest part (unless you are in the cloud), so give it enough lead time in your capacity planning.
04-25-2017
03:54 PM
1 Kudo
Roughly speaking, your data size will triple in HDFS because data blocks are replicated into three copies for high availability (e.g. to survive a disk failure). In addition, you will need local (non-HDFS) disk of roughly 20% of HDFS capacity for intermediate storage of job results. That is a very high-level starting point. You can compress data in HDFS, and you will likely process data in HDFS while keeping the original, but those involve strategic decisions around how you will be using the data (a separate discussion/post). Since you do not want to cut it close with capacity planning, give yourself a margin beyond 3x + 20%. Keep in mind that HDFS storage is very cheap compared to MS SQL, so the redundant data blocks will still not have a large relative cost. A worked example is below.
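As a rough worked example with hypothetical numbers, suppose you have 10 TB of source data in MS SQL:

10 TB x 3 replicas = 30 TB of HDFS capacity
30 TB x 20% = 6 TB of local scratch disk for intermediate job output
Total: about 36 TB of raw disk, before adding a growth and safety margin (e.g. plan for 45-50 TB).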
04-21-2017
02:28 PM
2 Kudos
These may be helpful: https://community.hortonworks.com/questions/84075/how-to-integrate-nifi-with-atlas-for-metadata-line.html https://community.hortonworks.com/repos/66014/nifi-atlas-bridge.html
04-17-2017
06:22 PM
1 Kudo
Hi @Jatin Kheradiya, I suggest posting this as a standalone question to get full visibility and value for the community. 🙂
04-12-2017
12:40 PM
During installation of HDP you can uncheck the Atlas service at the step where you select services for the cluster: http://docs.hortonworks.com/HDPDocuments/Ambari-2.4.2.0/bk_ambari-user-guide/content/adding_a_service_to_your_hadoop_cluster.html If Atlas is already installed, you can click on the Atlas service in Ambari (left panel), then go to Service Actions (upper right) and delete the service.