Member since: 07-30-2019
Posts: 181
Kudos Received: 205
Solutions: 51
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 4958 | 10-19-2017 09:11 PM |
| | 1591 | 12-27-2016 06:46 PM |
| | 1236 | 09-01-2016 08:08 PM |
| | 1176 | 08-29-2016 04:40 PM |
| | 3011 | 08-24-2016 02:26 PM |
04-27-2016 08:47 PM
@Roberto Sancho Pig is a good tool for ETL and data warehouse types of processing on your data. It provides an abstraction layer over the underlying processing engine (MapReduce or Tez), so you can use Tez as the execution engine to speed up processing. This Pig Tutorial has additional information.
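A minimal sketch of switching execution engines when launching Pig (the script name here is a placeholder):

```bash
# Run a Pig script on the default MapReduce engine
pig -f etl_script.pig

# Run the same script on Tez for faster DAG-based execution
pig -x tez -f etl_script.pig
```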
04-26-2016 03:55 PM
@David Lays The two main options for replicating the HDFS structure are Falcon and distcp. The distcp command is not very feature rich: you give it a source path and a destination cluster, and it copies everything to the same path on the destination. If the copy fails, you have to start it over yourself. Falcon is the other method for maintaining a replica of your HDFS structure; it offers more data-movement options and lets you manage the lifecycle of the data on both sides more effectively. If you're replicating Hive table structures, there is some added complexity in making sure the tables are created on the DR side, but the actual files are moved the same way.
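A minimal distcp sketch, assuming placeholder NameNode hostnames and paths:

```bash
# Copy /data from the source cluster to the same path on the DR cluster
hadoop distcp hdfs://nn-prod:8020/data hdfs://nn-dr:8020/data

# On subsequent runs, -update copies only files that have changed
hadoop distcp -update hdfs://nn-prod:8020/data hdfs://nn-dr:8020/data
```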
04-22-2016 04:19 PM
2 Kudos
@Hefei Li The data is stored encrypted, with a copy of the encrypted data encryption key (EDEK) attached to the file. No user can read the contents of the O/S-level files unless the KMS provides the decrypted data encryption key (DEK). The EDEK is stored with the file so the KMS can determine which key version was used to encrypt it and hand back the appropriate DEK once the policy checks for access to the file have passed. At the HDFS layer, the user must have policy access to the KMS key to decrypt the file; if that policy check fails, the user cannot decrypt it. If you uninstall Ranger and the KMS, you will start seeing errors in the HDFS logs when you try to access files in an encryption zone, because the NameNode can no longer reach the KMS for keys or Ranger for the key-access policies on the files.
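For reference, a hedged sketch of how an encryption zone is typically set up (the key name and path are examples):

```bash
# Create a key in the KMS
hadoop key create mykey

# Create an encryption zone backed by that key (the path must be an empty directory)
hdfs crypto -createZone -keyName mykey -path /secure

# List the configured encryption zones
hdfs crypto -listZones
```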
04-21-2016 06:37 PM
4 Kudos
@Artem Ervits This can definitely be done, but you'll need a different "database" (MySQL parlance) or "schema" (Oracle, DB2 parlance) for each Ambari cluster. For example, you might create an "ambari-Prod1" database or schema for the Prod1 HDP cluster and an "ambari-Test2" database/schema for the Test2 HDP cluster.
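A hedged MySQL sketch of creating one such database per cluster (names, user, and password are placeholders; note that identifiers containing hyphens need backtick quoting, or just use underscores):

```bash
mysql -u root -p <<'SQL'
-- One database per Ambari-managed cluster
CREATE DATABASE `ambari_prod1`;
CREATE USER 'ambari'@'%' IDENTIFIED BY 'changeme';
GRANT ALL PRIVILEGES ON `ambari_prod1`.* TO 'ambari'@'%';
FLUSH PRIVILEGES;
SQL
```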
04-20-2016 02:46 PM
2 Kudos
@rbiswas Using the security features of NiFi (like HTTPS transport) is a great way to secure the data in motion. You will also want to make sure the connection from NiFi to the HDP cluster is secured (depending on how you land the data, possibly with WebHDFS over HTTPS or Knox). Once the data has landed, consider at-rest encryption using the Ranger KMS for additional protection.
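As an illustrative check that the landing cluster answers over HTTPS (hostname, port, and path are placeholders; 50470 is a common default HTTPS port for the Hadoop 2.x NameNode):

```bash
# Verify WebHDFS responds over HTTPS on the secured cluster
curl --cacert /path/to/ca.pem \
  "https://namenode.example.com:50470/webhdfs/v1/landing/dir?op=LISTSTATUS"
```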
04-19-2016 01:58 PM
@Gowrisankar Periyasamy HDFS allocates space one block at a time, and each block belongs to a single file. If a file only partially fills its last block, that block (and its replicas) remains unfilled until an append is done to the file; an append then writes into that last block (and its replicas) until it is full. For very large files (which is mostly why people use Hadoop), having at most <blocksize> MB (plus replicas) of unused space per file is not too large of a concern. For example, a 99.9 GB file allocates 799 full blocks (at 128 MB/block) plus one block that is only about 20% full, which works out to roughly 0.1% unused space for that file.
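If you want to see how a file's blocks are actually laid out, one way is the fsck report (the path here is a placeholder):

```bash
# Report the blocks backing a file, including the partially filled last block
hdfs fsck /data/large_file.dat -files -blocks
```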
04-15-2016 01:20 PM
1 Kudo
@Alexander Check out this question thread and see if it helps.
04-13-2016 05:39 PM
1 Kudo
You can delete a service through the Ambari REST API. You will need to stop the service first, then use the following command:
curl -u admin:admin -i -H 'X-Requested-By: ambari' -X DELETE http://sandbox.hortonworks.com:8080/api/v1/clusters/Sandbox/services/FALCON
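For completeness, a hedged sketch of stopping the service via the same API before deleting it (this uses the standard Ambari state-change request body; the cluster and service names match the example above):

```bash
curl -u admin:admin -i -H 'X-Requested-By: ambari' -X PUT \
  -d '{"RequestInfo":{"context":"Stop FALCON"},"Body":{"ServiceInfo":{"state":"INSTALLED"}}}' \
  http://sandbox.hortonworks.com:8080/api/v1/clusters/Sandbox/services/FALCON
```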
04-12-2016 09:43 PM
@shannon luo What are the configuration attributes of your EvaluateJsonPath processor? Is your Destination set to "flowfile-content" or "flowfile-attribute"? I have a processor set up to evaluate Twitter JSON, and the Destination is set to "flowfile-attribute" with a number of attributes identified. Can you take a look at the attached image and see if your attributes are configured similarly?
04-12-2016 08:14 PM
@shannon luo You can use an EvaluateJsonPath processor to pull out the fields you want in the flow. Add a property for each field in the JSON that you want on the output flow file: the property name is what you want the field to be called on the output, and the value is a JsonPath expression pointing at the field in the input JSON (e.g., Name = twitter.name with Value = $.user.screen_name reads the user.screen_name value from the input JSON and creates an attribute called twitter.name on the output flow file). Thanks, Erik
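A hedged sketch of how those processor properties might look (the attribute names on the left are whatever you choose; the JsonPath expressions assume Twitter's JSON layout):

```
Destination        = flowfile-attribute
twitter.name       = $.user.screen_name
twitter.text       = $.text
twitter.created_at = $.created_at
```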