Member since: 06-20-2016
Posts: 488
Kudos Received: 433
Solutions: 118
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 3133 | 08-25-2017 03:09 PM |
| | 1996 | 08-22-2017 06:52 PM |
| | 3455 | 08-09-2017 01:10 PM |
| | 8131 | 08-04-2017 02:34 PM |
| | 8176 | 08-01-2017 11:35 AM |
11-23-2016
03:54 PM
1 Kudo
Unfortunately, when importing, Sqoop cannot access stored procedures on the source system -- you will have to implement the processing logic on the Hadoop side. To do this, you have three main choices:

- Ingest the raw data into a landing zone and use Pig to transform it (implementing the stored proc logic) into your target Hive table. Note that landing the data raw is a best practice in Hadoop: there is a good chance you will want this raw data for activities elsewhere (like reporting or data science), and storage is cheap.
- Same as above, but implement the logic in Hive HPL/SQL, a procedural SQL language for Hive: https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=59690156
- Same as above, but use a 3rd-party tool like Syncsort's DMX-h.

Note that there are advantages to offloading the stored proc processing to Hadoop:

- it typically takes much less time on Hadoop (parallel processing)
- it frees resources on your source system and thus improves performance on that side
- when exporting from Hadoop to the RDBMS you CAN trigger a stored procedure on the RDBMS side (see the sketch below)

If this is what you are looking for, let me know by accepting the answer; else, please follow up with any remaining gaps.
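For the export case, Sqoop can call a stored procedure for each exported record via --call (instead of generating INSERT statements). A rough sketch, assuming a MySQL target; the connection string, user, procedure name, and HDFS path are placeholders, not from an actual setup:

```bash
# Export rows from HDFS and invoke a stored procedure per record instead of INSERT.
# dbhost, salesdb, etl_user, upsert_order and the export directory are hypothetical.
sqoop export \
  --connect jdbc:mysql://dbhost:3306/salesdb \
  --username etl_user -P \
  --call upsert_order \
  --export-dir /data/processed/orders \
  --input-fields-terminated-by '\t'
```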
11-23-2016
03:34 PM
True. Here is the complete list of validation limitations: https://sqoop.apache.org/docs/1.4.3/SqoopUserGuide.html#validation

Validation currently only validates data copied from a single table into HDFS. The following are the limitations in the current implementation:

- all-tables option
- free-form query option
- data imported into Hive or HBase
- table import with --where argument
- incremental imports

For Hive: you can Sqoop to HDFS with an external Hive table pointed at it. You can use the --validate feature here, and if it passes validation, you can load your Hive managed table from the landing-zone external table using INSERT ... SELECT * (a sketch follows below). Note that this validation only checks the number of rows. If you want to validate every single value transferred from source to Hive, you would have to Sqoop back to your source and compare checksums of the original and sqooped data in that source system.

For HBase: you can put a Phoenix front end in front of your HBase table and do the same as with Hive above.
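To illustrate the HDFS-plus-external-table approach, a rough sketch; the connection string, table names, columns, and paths are placeholders:

```bash
# 1) Sqoop a single table to HDFS; --validate compares source and target row counts.
sqoop import \
  --connect jdbc:oracle:thin:@dbhost:1521:ORCL \
  --username etl_user -P \
  --table CUSTOMERS \
  --target-dir /landing/customers \
  --fields-terminated-by ',' \
  --validate

# 2) Point an external Hive table at the landing zone, then load the managed table.
hive -e "
CREATE EXTERNAL TABLE IF NOT EXISTS customers_landing (
  id INT, name STRING, city STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/landing/customers';

INSERT INTO TABLE customers_managed
SELECT * FROM customers_landing;
"
```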
11-23-2016
02:30 PM
Great article! I needed to do the following to get this to work fully. In the InferAvroSchema processor, the header City,Edition,Sport,sub_sport,Athlete,country,Gender,Event,Event_gender,Medal produced nulls in the Hive table for the columns in caps; once I made them all lower case, the values appeared in the Hive table (see the sketch below).
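For anyone else hitting this, one way to lowercase just the header row before ingestion; olympics.csv is a hypothetical file name, not from the original article:

```bash
# Lowercase only the header line so the inferred Avro field names match
# Hive's lowercase column names; "olympics.csv" is a hypothetical file name.
awk 'NR==1 {print tolower($0); next} {print}' olympics.csv > olympics_lower.csv
```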
11-19-2016
05:42 PM
I was able to find out that FlowFiles on disk are stored in a binary format that is not human-readable, so there is little need to encrypt them. They do appear readable to someone viewing provenance, but that can either be switched off or locked down by user role.
11-18-2016
11:22 PM
3 Kudos
HDF is best thought of as working with data in motion, and HDP as Hadoop, the popular Big Data platform, which in contrast can be seen as data at rest. Both are independent platforms but are often integrated. When integrated, they are deployed as separate clusters or platforms. Both are open source, and Hortonworks provides paid support for each separately.

HDF

HDF has NiFi, Storm, and Kafka (as well as the Ambari admin console). These components are used to get data from diverse sources (ranging from social media sites, log files, and IoT devices to databases) and send the data to an equally diverse range of target systems. In between, they can transform moving content, make decisions based on moving content, and run analytics on moving content. The actual movement of data is difficult to engineer; these components move data and handle the many challenges in doing so under the covers, with no low-level development needed. See: https://hortonworks.com/products/data-center/hdf/

HDP

HDP is more commonly known as the Hadoop or Big Data platform. It has HDFS, YARN, the MapReduce and Tez processing engines, the Hive database, the HBase NoSQL database, and many other tools to work with Big Data (data in large volumes, a wide variety of formats, and fast real-time velocity of arrival on the platform ... the 3 Vs). It stores this data cheaply and flexibly, and uses horizontal scaling of servers to parallel-process these 3 Vs of data in a short amount of time (compared to traditional databases, which face limits in working with the 3 Vs). What type of processing happens depends on the out-of-the-box or 3rd-party tools used and the use case / business case involved. See: https://hortonworks.com/products/data-center/hdp/

HDF + HDP

HDF and HDP are often integrated because HDF is an effective way to get diverse sources of data into HDP to be stored and processed all in one place, to be used by data scientists, for example. If this is what you were looking for, let me know by accepting the answer; else, please respond to this answer with further questions and I will follow up.
11-18-2016
10:08 PM
Thank you @Karthik Narayanan. I am wondering what the best practices are, especially since NiFi was built by the NSA. What would they recommend? (Not sure if you can answer this, but I would have thought on-disk encryption would be a more straightforward implementation.)
11-18-2016
09:44 PM
1 Kudo
I want to encrypt all flow files on disk in the NiFi cluster. Let's say I GetSFTP, run a flow with many processors, then PutSFTP. Do I simply place an EncryptContent processor (to encrypt) after the GetSFTP and another (to decrypt) before the PutSFTP? If so, won't the data be unencrypted on disk between GetSFTP and EncryptContent, and between EncryptContent and PutSFTP?
11-18-2016
06:45 PM
I was hoping to be granular about encrypting sensitive vs. non-sensitive data flowing into HDFS, for performance reasons. If the performance differences are not that large, then it is no big deal.
11-18-2016
05:55 PM
Yes, I saw that. In my environment I got it only during the filter and not the load. What Pig version are you using? What happens when you do: USING PigStorage() AS (str:chararray); ? In any case, it is just a warning to make sure nothing is happening invisibly under the covers.