Member since
07-29-2019
640
Posts
114
Kudos Received
48
Solutions
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 14479 | 12-01-2022 05:40 PM |
| | 3301 | 11-24-2022 08:44 AM |
| | 4959 | 11-12-2022 12:38 PM |
| | 1801 | 10-10-2022 06:58 AM |
| | 2589 | 09-11-2022 05:43 PM |
07-05-2021
09:11 PM
Hi @tja

Other community members may weigh in with their opinions, but I believe the answer to the first and last questions is, of course, "it depends on the job".

The most suitable use cases for Sqoop center on bulk structured data transfer between RDBMSs and HDFS. Sqoop takes the commands you provide at the CLI and internally generates MapReduce jobs to execute your desired data movement, with HDFS as either the source or the destination.

While you can do some of the same (simple) things with either Spark or Sqoop, the two are not interchangeable tools. You can do a lot more with Spark because it gives you a full-blown programming language (Scala) along with a set of libraries that support a fairly complete distributed processing framework. The "T" part of ETL is going to be a lot easier to tackle in Spark than in Sqoop, and you will probably encounter tasks that are nearly impossible to complete with Sqoop but fairly straightforward to address in Spark code, assuming you have the requisite software development background.

While I have not done any performance comparisons between a batch job in Sqoop and an equivalent job written in Spark (and I haven't read anybody else's work on that topic), reasoning from first principles would lead me to expect a considerable performance advantage for Spark over Sqoop import (given a sufficiently large data set), because Spark's in-memory processing should, in theory, outperform MapReduce.

Yes, unfortunately the Sqoop PMC voted in June to retire Sqoop and move the responsibility for its oversight to the Apache Attic. That does not mean that Apache Sqoop as a tool has lost all value. Cloudera still ships it as part of Cloudera Runtime, still fully supports Sqoop and responds to new feature requests coming from customers, and there's no plan to change this.
The change in status at Apache could mean that the software has reached maturity "as is" and still has its uses. But end-user development of a complete new ETL pipeline is probably not one of them.
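To make the comparison concrete, a typical Sqoop bulk import looks something like the sketch below. The JDBC URL, credentials, table name, and target directory are all hypothetical:

```shell
# Bulk import of one RDBMS table into HDFS. Sqoop turns this single
# command into a MapReduce job with 4 parallel map tasks.
# All names below are made up for illustration.
sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --username etl_user -P \
  --table orders \
  --target-dir /data/raw/orders \
  --num-mappers 4
```

The rough Spark equivalent would begin with a JDBC read into a DataFrame, after which the full DataFrame API is available for the "T" work before writing the result out; that extra expressiveness is the difference described above.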
07-04-2021
07:47 PM
@jiyo The explanation provided earlier by @Shelton isn't wrong, but I thought I would follow up and provide some context I think will be helpful.

In your original question, you wrote that you "got an email saying that the username is my Google email address." That email was likely referring to the username you use to log into the Cloudera Community, not the username you would use to access Cloudera's private repositories, where the binaries for Cloudera's distributions of Hadoop and/or Ambari are now located.

As he pointed out, and as hopefully you are now aware, Cloudera modified its download policies, and the binaries you are seeking are now only available in a private repository. If not, please see the announcement here: Transition to private repositories for CDH, HDP and HDF. The credentials for this private repository are generally not the same ones you use to access Cloudera's website or the Cloudera Community. The same announcement describes the new patch releases of Ambari that are required to access Cloudera's private repositories, which now contain these new and existing releases.

The reason you're getting the error message from the wget invocation is that the credentials used to access the aforementioned private repositories do not depend on your Google email address, and the command never actually gets to the point of accessing the host archive.cloudera.com. The HTTP 301 redirects you're seeing are responses from one of Google's web servers, not Cloudera's.
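For illustration, an authenticated download from the private repository has roughly this shape. The credential placeholders and the repository path below are hypothetical; substitute the paywall credentials described in the announcement, not a Google account:

```shell
# Placeholders throughout: use the paywall credentials Cloudera issues,
# not your Google email address. The exact repository path depends on
# the product and version you are downloading.
wget --user='<paywall-username>' --password='<paywall-password>' \
  "https://archive.cloudera.com/p/<repo-path>"
```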
07-04-2021
07:11 PM
Hi @roshanbi

I think there are really two questions here:

1. For each row of my data set, can I mask the last 5 digits of each value in the pri_identity column using Ranger?
2. Is this possible to achieve while using Kudu?

I'll restrict myself to addressing the first question. Your second question is a good one, though; most of the documentation I've read about this simply doesn't mention Kudu, so I'll leave that part of your question to another community member who has more experience with Apache Kudu as a storage option.

You didn't provide the versions of Impala, Ranger, or Kudu you're using, or which distribution, but I will attempt to point you in the right direction nonetheless.

You can see a quick demonstration of why and how to use a mask in Ranger on CDP in the first two minutes of this video: How to use Column Masking and Row Filtering in CDP. A slightly longer demonstration of something similar on HDP 3.1.x is in this video: How to mask Hive columns using Atlas tags and Ranger.

Neither quite shows how to establish a custom masking expression, though, which is what I think you'll need to satisfy your requirements. Ranger includes several "out of the box" masking types, but a cursory look at the documentation indicates that the masking policy you've described is not one of them. If that's true, you can always write a custom masking expression using the UDF syntax, which you can read about at the Apache.org site here: Hive Operators and User-Defined Functions (UDFs).

Hope this helps
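In case it's useful, a custom masking expression for this requirement might look something like `concat(substr({col}, 1, length({col}) - 5), 'xxxxx')`. This is a sketch in Hive UDF syntax that I have not verified against your Ranger version; it assumes pri_identity is string-typed, and `{col}` is the placeholder Ranger substitutes with the column name. The string transformation it performs, demonstrated in plain bash on a made-up sample value:

```shell
# What the custom mask should do: keep all but the last 5 characters,
# then append a fixed mask string. The sample value is made up.
val="1234567890"
keep=$(( ${#val} - 5 ))     # number of leading characters to keep
masked="${val:0:keep}xxxxx"
echo "$masked"              # 12345xxxxx
```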
06-25-2021
12:53 PM
3 Kudos
The Trial Version of CDP Private Cloud Base Edition includes an installer package. You can view the documentation on how to complete the installation here: INSTALLING CDP PRIVATE CLOUD BASE. Other members of this community have previously reported success with this approach.

Alternatively, if you're already familiar with VirtualBox and Vagrant, you might consider closely reading @carrossoni 's community article outlining how to create a CentOS 7 CDP-DC Trial VM for sandbox/learning purposes (CDP Private Cloud was formerly known as CDP Data Center).

Other than that, I have not personally seen any publicly available VM images that can be deployed using virtualization tools, but that certainly doesn't mean they don't exist. Cloudera's distributions have a vast ecosystem built up around them, and it's close to impossible to keep up with everything anyone is producing. But participating in and contributing to this community helps a lot. 😀
06-24-2021
07:18 PM
@jiyo Maybe. You wrote that "it doesn't work properly because of "@" in username". Why do you say that "it doesn't work properly"? What is the exact error message you're receiving?
06-24-2021
06:55 AM
1 Kudo
Hi @Cisco94 ,

Yes, we are no longer making the Cloudera QuickStart VM available for download (and haven't since March of 2020) because it was outdated: it was based on CDH 5.13, which went out of support in the Fall of last year. If you're a registered Cloudera partner, you can probably open a request with your partner contact at Cloudera and still get a copy of it.

I'm curious as to why anyone would ask you to do a demo of CDH at this point in time, or why you would want to show your client a distribution that does not include up-to-date releases of the various Hadoop ecosystem components. Cloudera's current distribution, since the Fall of 2020, is Cloudera Data Platform (CDP); a Trial Version of the CDP Private Cloud Base Edition can easily be downloaded and installed from the "Downloads" section of Cloudera's website.
06-21-2021
06:45 AM
Hi @M_Shash

You didn't indicate where you downloaded and installed the software you used to originally establish your "Hortonworks Data Flow cluster managed by Ambari", but your installation of Ambari probably isn't set up to supply the required authentication credentials.

As you've indicated you are aware, earlier this year Cloudera announced new versions of Ambari that can access Cloudera's private repositories for HDP/HDF installation and upgrade. Please see the announcement here: Transition to private repositories for CDH, HDP and HDF. The same announcement has extensive links to documentation on installing/upgrading HDF using Cloudera's private repository.

If this is for a new cluster, you should seriously consider upgrading to HDF 3.5.2. It is the last version of HDF that Cloudera will provide, and Cloudera requires customers to be on HDF 3.5.1 or HDF 3.5.2 before migrating to CFM on CDP.
06-18-2021
01:38 PM
1 Kudo
Hi @roshanbi

Just perusing the Hue user documentation, I read this:

> Scheduler: The application lets you build workflows and then schedule them to run regularly automatically. A monitoring interface shows the progress and logs, and allows actions like pausing or stopping jobs.
06-17-2021
08:34 AM
Hi @Exor ,

I am guessing that the key part of the log for this issue is this:

2021-06-15 08:39:14,498 INFO NodeConfiguratorThread-6-0:com.cloudera.server.cmf.node.NodeConfigurator: Using key bundle from URL: https://archive.cloudera.com/cm6/6.0.1/allkeys.asc
2021-06-15 08:39:14,960 INFO NodeConfiguratorThread-6-0:com.cloudera.server.cmf.node.NodeConfiguratorProgress: node5: Setting COPY_FILES as failed and done state

The log doesn't indicate it, but the server archive.cloudera.com is probably returning an HTTP 401 error because authentication is required. You didn't provide the version of Cloudera Manager you're using, but I would guess that the last time you attempted this operation you were not challenged for authentication by this particular host at Cloudera, and so you're wondering what changed recently. The answer is probably that your installation of Cloudera Manager isn't set up to supply the authentication credentials.

Earlier this year, Cloudera announced new versions of Cloudera Manager 6.x that are required to access Cloudera’s repositories. Please see the announcement here: Transition to private repositories for CDH, HDP and HDF. The same announcement describes the new patch releases of Cloudera Manager, which are required to access Cloudera’s private repositories, which now contain the new and legacy releases and other assets, such as those necessary to add a new host to an existing CDH cluster.
06-13-2021
08:48 AM
1 Kudo
It would be helpful to community members inclined to answer your question if you included which version of Hue you're using and which distribution (i.e., HDP, CDH, or CDP) you installed it from.