Member since: 06-20-2016
Posts: 488
Kudos Received: 433
Solutions: 118
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 3117 | 08-25-2017 03:09 PM |
| | 1968 | 08-22-2017 06:52 PM |
| | 3414 | 08-09-2017 01:10 PM |
| | 8079 | 08-04-2017 02:34 PM |
| | 8128 | 08-01-2017 11:35 AM |
03-30-2017
12:28 PM
3 Kudos
Unfortunately the dataset is not in a simple field-delimited format, i.e., one where each line is a record consisting of fields separated by a delimiter like a comma, pipe, or tab. If it were, you could define the delimiter on LOAD with USING PigStorage('delim'), where delim is the actual delimiter such as , or | or \t. The Million Song data is stored in HDF5, a complex hierarchical format that holds both metadata and field data. See https://labrosa.ee.columbia.edu/millionsong/sites/default/files/AdditionalFiles/FileSchema.pdf

You need to use a wrapper API to work with it:
https://labrosa.ee.columbia.edu/millionsong/pages/hdf-what
https://support.hdfgroup.org/downloads/

In your case, you would need to use the wrapper API to iterate over the data and write it out in a delimited format. Then you could load it into Pig as described above. In addition to the links above, this page is generally useful for your dataset: https://labrosa.ee.columbia.edu/millionsong/faq
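As a rough sketch of that conversion step, here is a minimal Python example using the h5py library in place of the official wrapper. The group and field names (metadata/songs, analysis/songs, artist_name, title, duration, tempo) are taken from my reading of the file schema linked above, so verify them against your files before relying on this:

```python
# Minimal sketch: flatten Million Song HDF5 files into pipe-delimited rows.
# Field paths follow the published MSD file schema; verify them against your files.
import glob
import h5py

def song_to_row(path):
    with h5py.File(path, "r") as f:
        meta = f["metadata"]["songs"][0]      # compound row of song metadata
        analysis = f["analysis"]["songs"][0]  # compound row of audio analysis
        fields = [
            meta["artist_name"].decode("utf-8"),
            meta["title"].decode("utf-8"),
            str(analysis["duration"]),
            str(analysis["tempo"]),
        ]
    return "|".join(fields)

with open("songs.psv", "w") as out:
    for path in glob.glob("MillionSongSubset/**/*.h5", recursive=True):
        out.write(song_to_row(path) + "\n")

# The flattened file can then be loaded in Pig along the lines of:
#   songs = LOAD 'songs.psv' USING PigStorage('|')
#           AS (artist:chararray, title:chararray, duration:double, tempo:double);
```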
02-20-2017
01:37 PM
2 Kudos
NiFi is a perfect fit for this: UI/configuration-based, easy connection to the Twitter stream, automated processing, enterprise security, SDLC and deployment automation. NiFi usage is growing rapidly and is starting to replace Flume use cases.

Here are some links on NiFi in general:
http://hortonworks.com/apache/nifi/
https://nifi.apache.org/docs.html
http://www.slideshare.net/hortonworks/design-a-dataflow-in-7-minutes-58718224
https://community.hortonworks.com/questions/65360/can-someone-help-me-with-apache-nifis-basic-workin.html

Here are some how-to articles on NiFi with Twitter and NiFi with Hive:
https://community.hortonworks.com/articles/57803/using-nifi-gettwitter-updateattributes-and-replace.html
https://community.hortonworks.com/articles/52856/stream-data-into-hive-like-a-king-using-nifi.html

Here is an article on NiFi SDLC and reusable components:
https://community.hortonworks.com/articles/60868/enterprise-nifi-implementing-reusable-components-a.html
01-30-2017
09:10 PM
Could you add the exact errors (copy-paste) as a comment to your question?
12-28-2016
02:28 PM
2 Kudos
ambari-infra is a core service shared across HDP stack components. Currently (as of HDP 2.5), the ambari-infra service itself has only one component: a fully managed Apache Solr installation. This service is used for indexing by Atlas, Ranger, and Log Search.

So, in a nutshell: ambari-infra is a shared Solr service today, and in the future it will add other core services to be shared among stack components.
https://docs.hortonworks.com/HDPDocuments/Ambari-2.4.0.1/bk_ambari-user-guide/content/ch_ambari_infra.html

Note that Ranger and Atlas use ambari-infra by default for indexing, but each can be configured to use an externally managed SolrCloud instance instead (not ambari-infra).
12-23-2016
12:50 PM
1 Kudo
The most direct way is to transform the date to the correct format in NiFi. Alternatively, you could land it in a Hive table and CTAS to a new table while transforming it to the correct format. See this for the Hive timestamp format to be used in either case: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-timestamp

NiFi: Before putting to HDFS or Hive, use a ReplaceText processor. You use a regex to find the timestamp pattern in the original Twitter JSON and replace it with the timestamp pattern needed in Hive/Kibana. This article should help you out: https://community.hortonworks.com/articles/57803/using-nifi-gettwitter-updateattributes-and-replace.html

Hive alternative: Here you either use a SerDe to transform the timestamp or you use a regex. In both cases, you land the data in a Hive table, then CTAS (Create Table As Select) to a final table. This should help you out for this approach: https://community.hortonworks.com/questions/19192/how-to-transform-hive-table-using-serde.html

To me, the NiFi approach is superior (unless you must store the original with the untransformed date in Hadoop).
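To make the transformation concrete, here is a small Python sketch of the rewrite that the ReplaceText regex (or the Hive regex/SerDe step) has to perform, assuming Twitter's standard created_at format and Hive's yyyy-MM-dd HH:mm:ss timestamp format; the regex and sample JSON are illustrative only:

```python
# Sketch of the timestamp rewrite the NiFi ReplaceText (or a Hive regex CTAS)
# needs to perform: Twitter's created_at -> Hive/Kibana-friendly timestamp.
import json
import re
from datetime import datetime

TWITTER_FMT = "%a %b %d %H:%M:%S %z %Y"   # e.g. "Wed Dec 21 12:34:56 +0000 2016"
HIVE_FMT = "%Y-%m-%d %H:%M:%S"            # Hive TIMESTAMP literal format

def rewrite_created_at(tweet_json: str) -> str:
    """Replace the created_at value in a raw tweet JSON string."""
    def repl(match):
        ts = datetime.strptime(match.group(1), TWITTER_FMT)
        return '"created_at":"{}"'.format(ts.strftime(HIVE_FMT))
    return re.sub(r'"created_at":"([^"]+)"', repl, tweet_json)

raw = '{"created_at":"Wed Dec 21 12:34:56 +0000 2016","text":"hello"}'
print(json.loads(rewrite_created_at(raw))["created_at"])  # 2016-12-21 12:34:56
```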
12-21-2016
10:29 PM
1) No, the host instances that Cloudbreak creates in the cloud infrastructure use the following default base images:
Amazon Web Services: RHEL 7.1
Microsoft Azure: CentOS 7.1 OpenLogic
Google Cloud Platform: CentOS 7: centos-7-v20150603
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.0/bk_cldbrk_install/bk_CLBK_IAG/content/ch02s02.html
2) See *
3) No, you get Amazon Linux 2016.09: https://aws.amazon.com/amazon-linux-ami/
4) See *
5-6) See **
7-8) We strongly recommend NiFi, which has native connectors and is very easy and fast to develop, deploy, and operate: http://hortonworks.com/apache/nifi/
9) See **

* I recommend posting these as a separate question around storage options for HDP in the cloud.
** I recommend posting these as a separate question around VPC deployment of HDP in the cloud.

The reason I suggest this is that I am not the best person to answer these, and the questions will get better exposure and thus be of more benefit to you and the community.
12-21-2016
03:43 PM
1 Kudo
@learninghuman Thank you for your clarification on #2 -- I have updated my answer accordingly.

1) Yes, that is HDP IaaS on Azure. There are two ways to deploy this -- via the Hortonworks Cloudbreak tool or via the Azure Marketplace (your screenshots in the comment): https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.0/bk_HDP_AzureSetup/content/ch_HDP_AzureSetup.html

2) That is referred to as ADLS in my answer and is available to HDP IaaS on Azure and to HDInsight. Data Lake Store is designed to be Azure's form of a data lake, compatible with HDFS and processed via Hadoop technology. https://docs.microsoft.com/en-us/azure/data-lake-store/data-lake-store-overview

3) Compute and storage are not colocated, so yes, there is a cost to traveling across the wire. As the link above describes, ADLS is distributed, so it improves read/write performance through parallelization. Note that decoupling compute and storage in the cloud is typically seen as an advantage, since you can scale each separately (compute = expensive, storage = cheap, compute + storage = expensive).

4) HDCloud should be thought of much differently than HDP via Cloudbreak on AWS. HDCloud is deployed and billed via the AWS Marketplace (thus pay-as-you-go, i.e., by the minute) and is meant for ephemeral workloads -- ones you want to turn on/off to save cost -- whereas HDP via Cloudbreak should be seen as a full on-prem-style deployment, but in the cloud. HDCloud is very focused on the core use cases of data prep, ETL, data warehousing, data science, and analytics. It is a subset of the full HDP services. HDCloud clusters are preconfigured and can be spun up in mere minutes (the IaaS and HDP layers are deployed in one shot via preconfigured clusters). HDP via Cloudbreak is a full cluster deployment (whatever configuration you want), with the IaaS deployment assisted and managed via Cloudbreak and the HDP deployment via Ambari. In a nutshell: HDP via Cloudbreak is full control for long-running, full-featured clusters, while HDCloud offers pre-packaged, easy-to-deploy clusters intended to be spun up/down frequently to save money on a pay-by-the-minute basis.

5) HDC is not a managed service like HDInsight. Once you quickly get HDC up, you are completely inside the HDP world, not the managed-service world. HDC is IaaS, just with quick, prepackaged clusters that are easy to deploy.
12-21-2016
12:34 PM
4 Kudos
Note that there are 3 cloud deploy options for HDP -- you are missing HDC (Hortonworks Data Cloud). Also, your #2 is probably best called HDP IaaS so that it is not confused with managed services.

HDP via Cloudbreak (alternatively via the Azure Marketplace for Azure)
This is a full deployment of whatever cluster configuration you want, to AWS, Azure, Google, or OpenStack. This is pure IaaS. It is like having your full on-prem possibilities in the cloud. The Hortonworks Cloudbreak tool helps provision the IaaS instances and manage them, including autoscaling. Blueprints help you select preconfigured cluster types, but you can use whatever server instances and cluster configurations (number of master nodes and data nodes, how HDP services are distributed across them) you wish. It deploys all HDP services via Ambari as you would on prem. You can use HDFS, or S3, Azure Blob, etc. as storage, or a combination. It is meant for long-running ("permanent") clusters. For Azure, you can alternatively deploy your full cluster via the Azure Marketplace instead of using Cloudbreak.
http://hortonworks.com/apache/cloudbreak/
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.0/bk_HDP_AzureSetup/content/ch_HDP_AzureSetup.html

HDInsight
This is PaaS, or managed services, on Azure. It includes most, but not all, HDP services (e.g. it has HBase, Storm, and Spark but currently does not have Atlas) and also has R Server on Spark; these are managed as Azure services engineered by Microsoft and Hortonworks to conform to the Azure platform. It has a set of preconfigured cluster types to make spinning up a cluster easy and fast. It can use HDFS, ADLS, or WASB storage. It is meant for both long-running and ephemeral clusters (spun down when work is completed). For ephemeral clusters, data is persisted in ADLS or WASB and Hive metadata is persisted, so you can spin up a new cluster and pick up data and Hive state from any previous work.
http://hortonworks.com/products/cloud/azure-hdinsight/

Hortonworks Data Cloud (HDC)
This was released by Hortonworks in October 2016. It is AWS-only, provisioned via the AWS Marketplace (it takes just minutes), and offers a set of preconfigured clusters focused on the core use cases of data prep, ETL, data warehousing, data science, and analytics. As such, it includes only the HDP services around these use cases (e.g. no Storm). It is meant for ephemeral workloads where you frequently spin down and then spin up, and thus pay as you go. Data is persisted in S3 and Hive metadata persists behind the scenes, so you can pick up state after spinning down/up. It is a very cost-effective and rapid way to do Big Data. Example use cases: imagine daily feeds from your on-prem cluster that are sent to HDC and processed for one hour to tune a model deployed in a production app; after processing, the cluster is turned off, and this happens daily. Or imagine a data science project where the cluster is spun down each day when no one is working on it. HDC is optimized for S3, so processing is much faster than with AWS EMR. HDC is self-managed (IaaS as preconfigured HDP clusters).
http://hortonworks.com/products/cloud/aws/

Regarding your questions:
1) The rationale for multiple cluster types is to preconfigure clusters so you simply select the one you want and quickly get up and running (vs. designing/configuring them yourself).
2) The two Azure options are HDP IaaS on Azure via Cloudbreak (self-managed) and HDInsight (managed). The differences are explained in the summary above, but the main contrast is that HDInsight is more focused on Azure integration, managed services, an hourly pricing option, and ease of deployment, as well as both long-running and ephemeral workloads. As your link states, the HDInsight managed-service model means that support goes first to Microsoft (which escalates to Hortonworks if it is an HDP issue and not an Azure service-layer issue).
3) Difficult to benchmark directly -- you scale in the cloud to meet your performance needs.
4) HDP IaaS on Azure and HDInsight support both WASB and ADLS blob storage. Regarding performance, you scale to meet your needs.
5) HDP IaaS is deployed to the virtual cloud via the Cloudbreak tool, which as mentioned is very useful in provisioning the cluster (workflow, blueprints, autoscaling, etc.).
12-20-2016
12:22 PM
1 Kudo
You can do the following:
1. Schedule a FetchSFTP processor for the once-a-day arrival of the file.
2a. Fork the 'success' relationship to the flow that processes the file when it arrives.
2b. Fork the 'not.found' relationship to a PutEmail processor, where you configure recipients, the email message (body), etc.

You will have to use one FetchSFTP processor for each file that you expect to arrive in the SFTP location, and you will have to know the name of the file. You can reuse the PutEmail processor (by pointing multiple FetchSFTP 'not.found' relationships to the same PutEmail) with the same recipients and email body, writing the filename dynamically into the body using Expression Language. Alternatively, you can have each 'not.found' connection go to its own PutEmail.
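If it helps to see the logic outside NiFi, here is a rough Python sketch of what that flow does (check SFTP for an expected file and send an alert email if it is missing). The host, credentials, paths, and addresses below are placeholders, and in practice the FetchSFTP and PutEmail processors above handle all of this for you:

```python
# Rough equivalent of the NiFi flow: check SFTP for an expected daily file
# and send an alert email if it is not there. All connection details are placeholders.
import smtplib
from email.message import EmailMessage

import paramiko

EXPECTED_FILE = "/upload/daily_feed_2016-12-20.csv"   # placeholder path

def file_exists(host, user, password, path):
    transport = paramiko.Transport((host, 22))
    try:
        transport.connect(username=user, password=password)
        sftp = paramiko.SFTPClient.from_transport(transport)
        try:
            sftp.stat(path)          # raises IOError if the file is absent
            return True
        except IOError:
            return False
    finally:
        transport.close()

if not file_exists("sftp.example.com", "user", "secret", EXPECTED_FILE):
    msg = EmailMessage()
    msg["Subject"] = "Expected file not found"
    msg["From"] = "nifi-alerts@example.com"
    msg["To"] = "team@example.com"
    msg.set_content(f"The file {EXPECTED_FILE} did not arrive today.")
    with smtplib.SMTP("mail.example.com") as smtp:
        smtp.send_message(msg)
```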
12-20-2016
04:45 AM
Deterministic template export is a way to make the XML content of exported templates be constructed in a standard way, so that the same template can be diffed and change-managed in a traditional SCM like Git. Before deterministic template export, the XML elements and the IDs of elements were not built in a standardized way, so changes to a flow generated unpredictable changes in the exported template XML (for example, the XML child element of a processor could change position among all child elements in a way that did not map to changes in the flow). This made change management difficult, since deltas in the flows represented by a template did not map directly to deltas in the template XML. This was resolved in NiFi 1.0.0: https://issues.apache.org/jira/browse/NIFI-826