Member since 06-20-2016
488 Posts
433 Kudos Received
118 Solutions

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 3106 | 08-25-2017 03:09 PM |
| | 1965 | 08-22-2017 06:52 PM |
| | 3393 | 08-09-2017 01:10 PM |
| | 8064 | 08-04-2017 02:34 PM |
| | 8115 | 08-01-2017 11:35 AM |
02-18-2018
04:47 PM
In my understanding, everything is anyhow based on process groups, and there is a single "root" process group that is shared among all users. It would be nice to allow the creation of multiple "root" process groups and to define, for a user or a group of users, which one is their root process group. Of course, super users or users belonging to several groups should have the possibility to see the list of root process groups they can access and enter them. What do you think? For real multi-tenancy this would be quite interesting.
05-12-2017
02:15 AM
17 Kudos
Intro

Scalable, cheap storage and parallel processing are at the foundation of Hadoop. These capabilities allow Hadoop to play a key role in complementing and optimizing traditional Enterprise Data Warehouse (EDW) workloads. In addition, recent technologies like Hive LLAP (in-memory, long-running execution threads) with ACID merge, and proprietary software like AtScale (virtual OLAP cubes with aggregation caching in HDFS), now for the first time allow fast BI directly against TB- and PB-sized data on Hadoop using well-known BI tools (e.g., Tableau). This article describes reference architectures for three use cases of EDW optimization:

- Active archiving: archiving aged EDW data on Hadoop for cheap storage and for exploratory and BI analysis.
- EDW offload: offloading staging data and ETL transformations to Hadoop and exporting the final data structures back to the existing EDW system (for cheap storage and faster ETL on Hadoop, as well as faster BI queries on the existing EDW).
- BI on Hadoop: Business Intelligence tooling against TB and PB data on Hadoop, with the possibility of retiring the existing EDW.

The Problem

- EDW systems like Teradata and Netezza cost 50x-100x more per GB of stored data than HDFS.
- Most of the data in EDW systems (typically up to 70%) is staged for transformation into the final tables used for BI queries. BI users do not directly query this staged data, yet it is extremely costly to store.
- Aged data either sits expensively on the EDW or is archived to cheaper systems like tape, where it is inaccessible to BI users and analysts.
- Transformation of staging data is typically long-running, often more than a day for one transformation job.
- Most of the CPU in EDW systems (typically more than 50%) is used to process these transformations. This long-running background CPU usage lowers the performance of BI queries run at the same time. These BI queries are the reason for the EDW, yet they typically do not perform optimally.
- The EDW schema-on-write requirement strains the ability to load modern data sources like semi-structured social data.

Reference Architectures

Note that any of the architectures below can be implemented alone, or a combination can be implemented together, depending on your needs and strategic roadmap.
Also, for each diagram below, red represents the EDW optimization data architecture and black represents the existing data architecture.

Use Case 1: Active Archiving

In this use case aged data is offloaded to Hadoop instead of being stored on the EDW or on archival storage like tape. Offload can be done via tools such as Sqoop (native to Hadoop) or an ETL tool like Syncsort DMX-h (proprietary, integrated with the Hadoop framework, including YARN and MapReduce). A rough Sqoop sketch follows the benefits list below.

Benefits:
- Aged data from the EDW is now stored more cheaply.
- Aged data from archival systems like tape is now accessible for querying.
- EDW data is now combined with new data sources (like geospatial, social or clickstream) in the lake. The combination of these sources allows greater analytical capabilities, like enriched data or customer 360 analysis, for both BI users and data scientists.
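To make the offload path concrete, here is a minimal sketch of a Sqoop-based archive of aged EDW data into the lake. The JDBC URL, credentials, table, columns and HDFS paths are hypothetical placeholders; substitute the specifics of your EDW, the appropriate JDBC driver, and your cluster layout.

```bash
# Hypothetical example: archive pre-2016 orders from the EDW into HDFS as Parquet.
sqoop import \
  --connect jdbc:teradata://edw.example.com/DATABASE=sales \
  --driver com.teradata.jdbc.TeraDriver \
  --username archiver -P \
  --table ORDERS \
  --where "ORDER_DATE < DATE '2016-01-01'" \
  --as-parquetfile \
  --target-dir /data/archive/sales/orders_pre2016 \
  --num-mappers 8

# Expose the archived data to analysts as an external Hive table
# (column names/types are placeholders matching the imported table).
hive -e "
CREATE EXTERNAL TABLE IF NOT EXISTS orders_archive_pre2016 (
  order_id    BIGINT,
  customer_id BIGINT,
  order_date  STRING,
  amount      DECIMAL(18,2)
)
STORED AS PARQUET
LOCATION '/data/archive/sales/orders_pre2016';
"
```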
Use Case 2: EDW Offloading

In this use case both staging data and ETL are offloaded from the EDW to Hadoop. Raw data is stored in the lake and processed into cleaned and standardized data used for Hive LLAP tables. Cleaned and standardized data is then transformed into structures that are exported to the existing EDW. BI users continue to use the existing EDW unchanged, unaware that the plumbing beneath has changed. A rough sketch of this flow follows the benefits list below.

Benefits:
- Raw data is centralized in the lake, available to data scientists and for repurposing in other use cases.
- Raw data is retained because storage is cheap.
- New (EDW) data sources are ingested into the lake, leading to greater analytical capabilities as described above.
- ETL is significantly faster on Hadoop because of parallel batch processing. ETL times are reduced from large parts of a day to minutes or hours.
- ETL is removed from the existing EDW. This frees significant CPU, resulting in noticeably faster BI queries and happier BI users.
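As a hedged illustration of the offloaded ETL and the export back to the EDW: the sketch below assumes raw data has already landed in a staging Hive table and that a target table already exists on the EDW. Database, table and column names and the JDBC URL are placeholders, not a prescribed design.

```bash
# Transform raw staging data into a cleaned, standardized ORC table for Hive LLAP.
hive -e "
CREATE DATABASE IF NOT EXISTS curated;
CREATE TABLE curated.orders STORED AS ORC AS
SELECT CAST(order_id AS BIGINT)      AS order_id,
       TRIM(UPPER(country_code))     AS country_code,
       CAST(amount AS DECIMAL(18,2)) AS amount,
       TO_DATE(order_ts)             AS order_date
FROM   staging.orders_raw
WHERE  order_id IS NOT NULL;
"

# Export the final structure back to the existing EDW so BI users see no change.
# HCatalog integration lets Sqoop read the ORC-backed Hive table directly.
sqoop export \
  --connect jdbc:teradata://edw.example.com/DATABASE=dw \
  --driver com.teradata.jdbc.TeraDriver \
  --username etl_user -P \
  --table ORDERS_FINAL \
  --hcatalog-database curated \
  --hcatalog-table orders \
  --num-mappers 8
```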
Use Case 3: BI on Hadoop

This use case is identical to EDW offload as described above, but the EDW is either fully replaced by, or augmented with, OLAP on Hadoop. For greenfield environments, replacement (or outright avoidance) of a traditional OLAP system is particularly attractive.

Benefits:
- Same as the previous use case.
- OLAP queries are run directly against data in the lake. OLAP in the lake can work against larger volumes of data than traditional OLAP and can include enriched data and new data sources (e.g. geolocation, social, clickstream).
- OLAP in the lake can replace, or prevent the implementation of, expensive and constrained traditional OLAP systems.

Notes on OLAP on Hadoop:
- Jethro and Druid are both viable technologies for fast BI / OLAP on Hadoop.
- Druid is open source and best implemented against OLAP cubes with relatively few drill-downs/roll-ups, because each aggregation of the cube is hard-coded in Druid statements. This makes complex or rapidly evolving OLAP cubes difficult to develop, test and maintain.
- Jethro is proprietary software from a certified Hortonworks partner and is integrated with Hive LLAP. It is best for fully implemented BI initiatives because of its "set-and-forget" implementation, which greatly reduces the role of IT in the BI program and thus allows user needs to be met more rapidly.

Conclusion

Traditional Enterprise Data Warehouses are feeling the strain of the modern Big Data era: these warehouses are unsustainably expensive; the majority of their data storage and processing is typically dedicated to prep work for BI queries, and not the queries themselves (the purpose of the warehouse); it is difficult to store varied data like semi-structured social and clickstream; and they are constrained in how much data volume can be stored, for both cost and scaling reasons. Hadoop's scalable, inexpensive storage and parallel processing can be used to optimize existing EDWs by offloading data and ETL to this platform. Moreover, recent technologies like Hive LLAP and Druid or Jethro allow you to transfer your warehouse to Hadoop and run your BI tooling (Tableau, MicroStrategy, etc.) directly against TBs and PBs of data on Hadoop. The reference architectures in this article show how to architect your data on and off of Hadoop to achieve significant gains in cost, performance and Big Data strategy. Why wait?

References

Hive LLAP and ACID merge
- https://cwiki.apache.org/confluence/display/Hive/LLAP
- https://hortonworks.com/blog/announcing-apache-hive-2-1-25x-faster-queries-much/
- https://www.youtube.com/watch?v=EjkOIhdOqek

Syncsort DMX-h
- http://www.syncsort.com/en/Solutions/Hadoop-Solutions/Hadoop-ETL

Jethro
- https://jethro.io/architecture

Druid
- https://hortonworks.com/open-source/druid/
- http://druid.io/
12-23-2016
12:50 PM
1 Kudo
The most direct way is to transform the date to the correct format in NiFi. Alternatively, you could land it in a Hive table and CTAS to a new table while transforming to the correct format. See this for the Hive timestamp format to be used in either case: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-timestamp

NiFi: Before putting to HDFS or Hive, use a ReplaceText processor. You will use a regex to find the timestamp pattern from the original Twitter JSON and replace it with the timestamp pattern needed in Hive/Kibana. This article should help you out: https://community.hortonworks.com/articles/57803/using-nifi-gettwitter-updateattributes-and-replace.html

Hive alternative: Here you either use a SerDe to transform the timestamp or you use a regex. In both cases, you land the data in a Hive table, then CTAS (Create Table As Select) to a final table. This should help you out for this approach: https://community.hortonworks.com/questions/19192/how-to-transform-hive-table-using-serde.html

To me, the NiFi approach is superior (unless you must store the original with the untransformed date in Hadoop).
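If you go the Hive route, here is a minimal sketch of the CTAS approach, assuming the raw tweets have already been landed in a table with created_at stored as a string; the table and column names are hypothetical.

```bash
# Twitter's created_at looks like "Wed Aug 27 13:08:45 +0000 2008".
# Parse it with unix_timestamp() and emit a Hive-friendly timestamp in the new table.
hive -e "
CREATE TABLE tweets_clean STORED AS ORC AS
SELECT tweet_id,
       user_name,
       tweet_text,
       CAST(from_unixtime(unix_timestamp(created_at, 'EEE MMM dd HH:mm:ss Z yyyy')) AS TIMESTAMP) AS created_ts
FROM   tweets_raw;
"
```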
12-19-2016
01:24 PM
3 Kudos
Sequence files are binary files containing key-value pairs. They can be compressed at the record (key-value pair) or block level. A Java API is typically used to write and read sequence files, but Sqoop can also import data as sequence files. Because they are binary, they have faster read/write than text-formatted files.

The small-file problem arises when many small files cause memory overhead for the NameNode, which has to keep a reference to each of them. "Many" is a relative term, but if, for example, you have daily ingests of many small files ... over time you will start paying the price in memory, as just stated. Also, MapReduce operates on blocks of data, and when files contain less than a block of data the job spins up more mappers (with overhead cost) compared to files holding more than a block of data.

Sequence files can solve the small-file problem if they are used in the following way: the sequence file is written to hold multiple key-value pairs, where the key is unique file metadata (like the ingest filename, or filename+timestamp) and the value is the content of the ingested file. Now you have a single file holding many ingested files as splittable key-value pairs. So if you loaded it into Pig, for example, and grouped by key, each file's content would be its own record. Sequence files are often used in custom-written MapReduce programs.

Like any decision about file formats, you need to understand what problem you are solving by choosing a particular file format for a particular use case. If you are writing your own MapReduce programs, and especially if you are also ingesting many small files repeatedly (and perhaps want to process the ingested file metadata as well as its contents), then sequence files are a good fit. If, on the other hand, you want to load the data into Hive tables (and especially where most queries are on subsets of columns), you would be better off landing the small files into HDFS, merging them and converting to ORC, and then deleting the landed small files.
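For the Hive-oriented path at the end (land, convert to ORC, clean up), here is a minimal sketch assuming delimited small files; the paths, columns and delimiter are placeholders. (The sequence-file path would instead use the Java SequenceFile.Writer API, with the ingest filename as the key and the file content as the value.)

```bash
# Land the many small CSV files in HDFS.
hdfs dfs -mkdir -p /data/landing/events
hdfs dfs -put events_*.csv /data/landing/events/

# Overlay an external table on the landed files, then compact into one ORC table.
hive -e "
CREATE EXTERNAL TABLE events_landing (event_id BIGINT, event_ts STRING, payload STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/landing/events';

CREATE TABLE events STORED AS ORC AS SELECT * FROM events_landing;

DROP TABLE events_landing;  -- external table: dropping it leaves the files in place
"

# Once the ORC table is verified, delete the landed small files.
hdfs dfs -rm -r -skipTrash /data/landing/events
```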
12-19-2016
12:12 PM
@Kumar, as tests by me and @Devin Pinkston show, it is the actual file content that you need to look at (not the UI). Thus ... no fears of the processor adding a new line.
12-16-2016
06:55 PM
If you are using Cloudbreak to deploy HDP to Azure, anything you deploy on premises can be deployed identically to Azure (also AWS, Google, OpenStack). This is the IaaS model, where only the infrastructure is virtualized in the cloud, not what is deployed there. (The PaaS model additionally abstracts deployed components as virtualized components and thus is not identical to on-premises deployments. HDInsight on Azure is PaaS.)

http://hortonworks.com/apache/cloudbreak/
http://hortonworks.com/products/cloud/azure-hdinsight/
12-19-2016
01:25 PM
Back at it this morning and, while I don't quite get what's happening, I consider this resolved. This morning I did the following:

- removed a prior install of Apache Zeppelin after I realized that, even after a reboot, it still responded on localhost:8080
- confirmed it was indeed gone
- started up VirtualBox and started the sandbox
- Zeppelin still responded on localhost:8080, which really confused me
- then tried localhost:9995, to which a different Zeppelin page responded - so that was a good thing
- then, remembering something from a previous experience, I tried 127.0.0.1:8080, and Ambari responded with its login page

This is now the second time I have seen localhost and 127.0.0.1 be treated differently; one of these days I'll have to figure out why. But for now, I'm back in business and continuing the tutorial. Thanks everyone for your help! Cecil
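In case it helps the next person who hits this: one possible explanation (an assumption, not verified against this sandbox) is that localhost resolves to the IPv6 address ::1 first, while VirtualBox NAT port-forwarding rules are bound to the IPv4 address 127.0.0.1, so the two names can end up reaching different services. A few commands to check, assuming a Linux or macOS host and a hypothetical VM name:

```bash
# How does localhost actually resolve on the host? (::1 vs 127.0.0.1)
cat /etc/hosts
getent hosts localhost            # Linux; on macOS: dscacheutil -q host -a name localhost

# What is listening on (or forwarded to) port 8080 on the host?
lsof -nP -iTCP:8080 -sTCP:LISTEN

# Inspect the sandbox VM's NAT port-forwarding rules (VM name is a placeholder).
VBoxManage showvminfo "Hortonworks Sandbox" | grep -i "Rule"
```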
12-16-2016
11:13 AM
2 Kudos
Your first screenshot shows the local file system, not HDFS. To see the files in HDFS from the command line you have to run the command: hdfs dfs -ls path

If you do not start path with /, the path is relative to the HDFS home directory of the user you are logged in as; in your case that is /user/root. Similarly, on the NiFi side, if you do not start the path in PutHDFS with /, it will put to HDFS under the nifi user (I think it will be /user/nifi/Nifi). It is a best practice to specify HDFS paths explicitly (i.e., starting with /); a couple of examples follow at the end of this post.

You can use the explorer UI to navigate the HDFS file system, and you can also use the Ambari Files View (just log into Ambari and go to views in the upper right, then Files View).

See the following links for more:
http://www.dummies.com/programming/big-data/hadoop/hadoop-distributed-file-system-shell-commands/
https://hadoop.apache.org/docs/r2.6.3/hadoop-project-dist/hadoop-common/FileSystemShell.html
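A quick illustration of relative vs. absolute HDFS paths, assuming you are logged in as root; the file and directory names are just placeholders.

```bash
# Relative path: resolves under the logged-in user's HDFS home (/user/root here).
hdfs dfs -mkdir Nifi
hdfs dfs -ls Nifi                 # actually lists /user/root/Nifi

# Absolute paths: unambiguous, recommended.
hdfs dfs -put localfile.txt /user/root/Nifi/
hdfs dfs -ls /user/root/Nifi
hdfs dfs -ls /
```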
12-12-2016
08:05 PM
Hope you give it a go with the free 5 day trial -- looking forward to seeing how it goes.
12-10-2016
03:18 PM
1 Kudo
You can be 100% sure that one forked flow file will not affect another. When a flow file is passed from one processor to another, the upstream processor passes a reference (to the flowfile in the content repository) to the second processor. When one processor forks the same flow file to two different processors, the flow file in the content repository is CLONED ... the reference to one clone is passed to one processor and the reference to the other clone is passed to the second processor. Note that viewing the provenance of your live flow shows these reference/clone details. This explains the flowfile life cycle, including the pass-by-reference behavior: https://nifi.apache.org/docs/nifi-docs/html/nifi-in-depth.html#pass-by-reference