Member since: 04-04-2016
Posts: 166
Kudos Received: 168
Solutions: 29

My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 2901 | 01-04-2018 01:37 PM |
 | 4913 | 08-01-2017 05:06 PM |
 | 1573 | 07-26-2017 01:04 AM |
 | 8919 | 07-21-2017 08:59 PM |
 | 2607 | 07-20-2017 08:59 PM |
07-13-2017
07:45 PM
@Sami Ahmad You might want to delete this duplicate post. I already answered in your other post: https://community.hortonworks.com/questions/112965/cant-find-parameter-dfsdatadirs.html#answer-114021
07-13-2017
07:43 PM
@Sami Ahmad Option 1: Search for dfs.datanode.data.dir in the HDFS service configuration in Ambari. The values are comma-separated; count the directories listed there and enter that number. Option 2: You can also go to your terminal, issue sudo fdisk -l, and look at the disks. Option 1 is easier, though. Thanks
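If you prefer the command line, you can also read that property directly with the HDFS client. A minimal sketch, assuming it is run on a cluster node that has the client configuration deployed:

# Print the configured DataNode data directories (comma-separated list)
hdfs getconf -confKey dfs.datanode.data.dir
# Count them
hdfs getconf -confKey dfs.datanode.data.dir | tr ',' '\n' | wc -l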
07-13-2017
03:54 PM
@Mahendra Malpute You can use hdfs dfs -chmod -R 755 /user/maria_dev. If you do not want to open up permissions on the maria_dev folder, you can either place the file in /tmp with 755 permissions or create another directory owned by hive:hdfs and place your file there. Another option is placing the file under /user/hive, which already has the required permissions.
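A minimal sketch of those options on the command line (the file name timesheet.csv and the staging directory name are placeholders for illustration):

# Option A: open up the existing folder
hdfs dfs -chmod -R 755 /user/maria_dev
# Option B: stage the file in /tmp and make it readable
hdfs dfs -put timesheet.csv /tmp/
hdfs dfs -chmod 644 /tmp/timesheet.csv
# Option C: create a directory owned by hive:hdfs and place the file there
sudo -u hdfs hdfs dfs -mkdir /user/hive/staging
sudo -u hdfs hdfs dfs -chown hive:hdfs /user/hive/staging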
07-13-2017
01:11 AM
1 Kudo
Design approach

The designs build on the work done in the Jira below, where the DataNode is conceptualized as a collection of heterogeneous storage types with different durability and performance characteristics.
https://issues.apache.org/jira/browse/HDFS-2832

Design 1
1) Hot data with partitions that are wholly hosted by HDFS.
2) Cold data with partitions that are wholly hosted by S3.
3) A view that unions these two tables; this view is the live table exposed to end users.

Design 2
1) Hot data with partitions that are wholly hosted by HDFS.
2) Cold data with partitions that are wholly hosted by S3.
3) Both hot and cold partitions are in the same table.

Design 2 is chosen over Design 1 because Design 1 is not transparent to the application layer: changing from the old table to the view would inherently push some porting/integration work onto the application.

Architecture Diagram

High Level Design
Automation Flow Diagram

Code

Automation tool codebase:
https://github.com/RajdeepBiswas/HybridArchiveStorage/blob/master/hive_hybrid_storage.sh
Example configuration file:
https://github.com/RajdeepBiswas/HybridArchiveStorage/blob/master/test_table.conf

Setup & Run

Setup
cd /root/scripts/dataCopy
vi hive_hybrid_storage.sh   ##Put the script here
chmod 755 hive_hybrid_storage.sh
cd /root/scripts/dataCopy/conf
vi test_table.conf   ##This is where the cold partition names are placed

Run

Option 1: Retain the HDFS partition and delete it manually after data verification.
./hive_hybrid_storage.sh schema_name.test_table test_table.conf retain

Option 2: Delete the HDFS partition as part of the script. The partition is deleted only after the data is copied to S3, so there is still an option to copy it back if you want to revert the partition location to HDFS.
./hive_hybrid_storage.sh schema_name.test_table test_table.conf delete

A sketch of the underlying partition-relocation commands is shown below.

For part 1 of the article refer to the following link:
https://community.hortonworks.com/content/kbentry/113932/hive-hybrid-storage-mechanism-to-reduce-storage-co.html
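Below is a minimal sketch of the kind of operations such a relocation involves, assuming a table partitioned by a ds column and a hypothetical bucket named my-archive-bucket; it illustrates the approach and is not the exact contents of hive_hybrid_storage.sh:

# Copy one cold partition from HDFS to S3 (paths are illustrative)
hadoop distcp hdfs:///apps/hive/warehouse/schema_name.db/test_table/ds=2016-01-01 s3a://my-archive-bucket/test_table/ds=2016-01-01

# Repoint that partition of the same Hive table at its new S3 location
hive -e "ALTER TABLE schema_name.test_table PARTITION (ds='2016-01-01') SET LOCATION 's3a://my-archive-bucket/test_table/ds=2016-01-01';"

# Optionally remove the old HDFS copy after verifying the data (the 'delete' mode)
hdfs dfs -rm -r -skipTrash /apps/hive/warehouse/schema_name.db/test_table/ds=2016-01-01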
07-13-2017
12:58 AM
2 Kudos
Introduction

A traditional data warehouse archive strategy involves moving old data to offsite tapes. This does not fit modern analytics applications, because the archived data is unavailable when the business needs to analyze it in real time. Mature Hadoop clusters need a modern data archival strategy to keep storage expenses in check as data volume grows exponentially. The term hybrid here designates an archival solution that is always available as well as completely transparent to the application layer.

This document will cover:
Use case
Requirement
Storage cost analysis
Design approach
Architecture diagram
Code
How to set up and run the code

Use case

The entire business data set is in HDFS (HDP clusters) backed by Amazon EBS, and a disaster recovery solution is in place. Amazon claims S3 storage delivers 99.999999999% durability; in the case of data loss from S3, we would recover the data from the disaster recovery site.

Requirement

Decrease storage costs.
Archived data should be available to perform analytics 24x7.
Hot and cold (archived) data must be accessible simultaneously from the application.
The solution should be transparent to the application layer. In other words, absolutely no change should be required from the application layer after the hybrid archival strategy is implemented.
Performance should be acceptable.

Storage cost analysis

Storage vs Cost Graph

Basis for calculation

For S3: $0.023 per GB-month of usage.
Source: https://aws.amazon.com/s3/pricing/

For EBS SSD (gp2): $0.10 per GB-month of provisioned storage. Including a replication factor of 3, this becomes a net $0.30 per GB.
Source: https://aws.amazon.com/ebs/pricing/

Important Note

EBS is provisioned storage, whereas S3 is pay-as-you-use. In other words, for future data growth, say you provision 1 TB of EBS storage: you pay 100% of it regardless of whether you are using 0% or 90% of it. With S3 you pay only for the storage you actually use, so for 2 GB you pay for 2 GB and for 500 GB you pay for 500 GB. Hence the S3 price in the calculation is divided by 2, roughly modeling how its usage will grow in correlation to the provisioned HDFS/EBS storage. A rough worked example with these rates follows below.

Please refer to part 2 for the architecture of the proposed solution and the codebase:
https://community.hortonworks.com/articles/113934/hive-hybrid-storage-mechanism-to-reduce-storage-co-1.html
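As a rough worked example with the rates above (this arithmetic is not taken from the graph): keeping 1 TB (1,024 GB) of cold data on triple-replicated EBS-backed HDFS costs about 1,024 x $0.30 = $307 per month, whereas holding the same 1 TB in S3 costs about 1,024 x $0.023 = $24 per month, a reduction of over 90% for that portion of the data.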
07-12-2017
08:17 PM
1 Kudo
@JT Ng Just omit the quotes and fire the insert:
INSERT INTO TABLE tmp PARTITION (datehour=${hiveVar:var}) SELECT * FROM tmp2;
You can also test the substitution before firing the insert:
SELECT ${hiveVar:var};
You should be good. Thanks
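For completeness, a minimal sketch of passing the variable from the shell (the value 2017071200 is a placeholder; note that the namespace is conventionally written in lowercase as hivevar):

# Test the substitution first
hive --hivevar var=2017071200 -e 'SELECT ${hivevar:var};'
# Then fire the insert
hive --hivevar var=2017071200 -e 'INSERT INTO TABLE tmp PARTITION (datehour=${hivevar:var}) SELECT * FROM tmp2;'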
07-12-2017
07:44 PM
@Mahendra Malpute Check the permissions on the folder. The user running the query should have access to the folder/file; ownership can be hive:hdfs if you are running as hive. An easy way to test, as the hive user:
sudo su hive
hdfs dfs -cat /user/maria_dev/timesheet.csv | head
You can also test it by placing the file in the /tmp folder and giving it read permission. Thanks
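A quick way to check both ownership and readability (a sketch; the path is taken from your question):

# Show ownership and permissions on the directory and file
hdfs dfs -ls /user/maria_dev
# Try reading the file as the hive user
sudo -u hive hdfs dfs -cat /user/maria_dev/timesheet.csv | head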
07-03-2017
05:02 PM
Synopsis: In addition to authentication and access control, data encryption adds a robust layer of security by making data unreadable in transit over the network or at rest on disk. Encryption helps protect sensitive data in the case of an external breach or unauthorized access by privileged users. The automation of this task is expected to save close to 4-6 hours of manual intervention per occurrence. It can also be used as a custom disaster recovery solution. Github link for the code: https://github.com/RajdeepBiswas/EncryptedDataTransfer
Script (common code) location:
cluster1: under root@cluster1, /root/scripts/dataCopy/hdfs_data_move.sh
cluster2: under root@cluster2, /root/scripts/dataCopy/hdfs_data_move.sh

Usage:

Scenario 1: Copying an encrypted HDFS folder from cluster2 to cluster1
Example folder name: /tmp/zone_encr_test, encrypted with key "testKey123"

In cluster2:
sudo su root
cd /root/scripts/dataCopy/
./hdfs_data_move.sh export keys

After the above execution finishes, in cluster1:
sudo su root
cd /root/scripts/dataCopy/
./hdfs_data_move.sh import keys

After the above execution finishes:
./hdfs_data_move.sh create /tmp/zone_encr_test testKey123

After the above execution finishes, in cluster2:
sudo su root
cd /root/scripts/dataCopy/
./hdfs_data_move.sh export /tmp/zone_encr_test

Glossary: Quick setup of an HDFS encryption zone

How to set up an encryption zone:
sudo su hdfs
hdfs dfs -mkdir /tmp/zone_encr_test
hdfs crypto -createZone -keyName testKey123 -path /tmp/zone_encr_test
hdfs crypto -listZones
hdfs dfs -chown -R hive:hdfs /tmp/zone_encr_test
exit
sudo su hive
hdfs dfs -chmod -R 750 /tmp/zone_encr_test
hdfs dfs -copyFromLocal /home/hive/encr_file.txt /tmp/zone_encr_test
hdfs dfs -cat /tmp/zone_encr_test/encr_file.txt
exit
sudo su hdfs
hdfs dfs -cat /tmp/zone_encr_test/encr_file.txt
NOTE: The last command will fail even though it runs as the hdfs superuser, because the hdfs user does not have access to the encryption key.
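One prerequisite the glossary assumes: the key referenced by -keyName must already exist in the KMS before the encryption zone is created. A minimal sketch, assuming Ranger/Hadoop KMS is configured as the key provider:

# Create the key and confirm it is visible to the cluster
hadoop key create testKey123
hadoop key list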
07-03-2017
04:17 PM
DataTransfer: Generic HDFS data and Hive database transfer automation between any environments (Production/QA/Development) utilizing Amazon S3 storage
Github link for the code: https://github.com/RajdeepBiswas/DataTransfer
Synopsis: Exporting and importing data between different layers of environment, such as production, QA, and development, is a recurring task. Due to security considerations, these environments cannot talk to each other directly, so we use Amazon S3 storage as an intermediate point for transferring data seamlessly across environments. The automation of this task is expected to save close to 4 hours of manual intervention per occurrence. The code can be re-used for disaster recovery automation.

Code location:
Place your scripts here:
Script: /root/scripts/dataCopy/datamove.sh
Configuration file: /root/scripts/dataCopy/conf/conf_datamove_devs3.conf
Note: The name of the configuration file can be different for different S3 locations and can be passed to the script, but it needs to be in the conf folder under the /root/scripts/dataCopy directory.

Usage:

Scenario 1: Exporting a database from cluster1 to cluster2
Example database name: testdb

In cluster1:
sudo su root
cd /root/scripts/dataCopy/
./datamove.sh export testdb db conf_datamove_devs3.conf

After the above execution finishes, in cluster2:
sudo su root
cd /root/scripts/dataCopy/
./datamove.sh import testdb db conf_datamove_devs3.conf

Scenario 2: Exporting HDFS data (a directory) from cluster1 to cluster2
Example directory name: /tmp/tomcatLog

In cluster1:
sudo su root
cd /root/scripts/dataCopy/
./datamove.sh export /tmp/tomcatLog dir conf_datamove_devs3.conf

After the above execution finishes, in cluster2:
sudo su root
cd /root/scripts/dataCopy/
./datamove.sh import /tmp/tomcatLog dir conf_datamove_devs3.conf

Note: The script can be run in the background (nohup &), and the logs are stored inside a folder named after the database or directory, with a timestamp. An example background invocation is shown after the log listing below.

Logs:
[root@cluster1 tomcatLog]# pwd
/root/scripts/dataCopy/tomcatLog
[root@cluster1 tomcatLog]# ls -lrt
total 3
-rw-r--r--. 1 root root 4323 Jun 27 20:53 datamove_2017_06_27_20_52_42.log
-rw-r--r--. 1 root root 4358 Jun 27 20:54 datamove_2017_06_27_20_54_15.log
-rw-r--r--. 1 root root 4380 Jun 27 20:57 datamove_2017_06_27_20_57_31.log
[root@cluster1 tomcatLog]# head datamove_2017_06_27_21_29_24.log
[2017/06/27:21:29:24]: dir tomcatLog copy initiation...
[2017/06/27:21:29:24]: dir tomcatLog import initiation...
17/06/27 21:29:25 INFO tools.DistCp: Input Options: DistCpOptions{atomicCommit=false, syncFolder=true, deleteMissing=false, ignoreFailures=false, overwrite=false, skipCRC=false, blocking=true, numListstatusThreads=0, maxMaps=20, mapBandwidth=100, sslConfigurationFile='null', copyStrategy='uniformsize', preserveStatus=[REPLICATION, BLOCKSIZE, USER, GROUP, PERMISSION, CHECKSUMTYPE, TIMES], preserveRawXattrs=false, atomicWorkPath=null, logPath=null, sourceFileListing=null, sourcePaths=[s3a://s3.path/tmp/tomcatLog], targetPath=hdfs:/tmp/tomcatLog, targetPathExists=true, filtersFile='null'}
17/06/27 21:29:26 INFO impl.TimelineClientImpl: Timeline service address: http://cluster1:8188/ws/v1/timeline/
17/06/27 21:29:26 INFO client.RMProxy: Connecting to ResourceManager at test:8050
17/06/27 21:29:26 INFO client.AHSProxy: Connecting to Application History server at test:10200
17/06/27 21:29:28 INFO tools.SimpleCopyListing: Paths (files+dirs) cnt = 9; dirCnt = 0
17/06/27 21:29:28 INFO tools.SimpleCopyListing: Build file listing completed.
17/06/27 21:29:29 INFO tools.DistCp: Number of paths in the copy list: 9
17/06/27 21:29:29 INFO tools.DistCp: Number of paths in the copy list: 9
[root@cluster1 tomcatLog]#
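As mentioned in the Note above, the script can run in the background. A minimal sketch of such an invocation, assuming the log directory for a database export follows the same pattern as the tomcatLog example:

cd /root/scripts/dataCopy/
nohup ./datamove.sh export testdb db conf_datamove_devs3.conf > /dev/null 2>&1 &
# Follow the timestamped log(s) for this export
tail -f /root/scripts/dataCopy/testdb/datamove_*.log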
05-19-2017
02:46 PM
1 Kudo
Hi, I managed to fix the issue. The fix was to put the HBase dispatch in the service.xml:
<dispatch classname="org.apache.hadoop.gateway.hbase.HBaseDispatch"/>
Thanks
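For anyone else hitting this, the dispatch element sits inside the WEBHBASE service definition. A rough sketch of where it goes; the role/name/version and routes shown here are typical of a Knox 0.x webhbase service.xml and may differ in your installation:

<service role="WEBHBASE" name="webhbase" version="0.2.3.0">
  <routes>
    <route path="/hbase/?**"/>
    <route path="/hbase/**?**"/>
  </routes>
  <!-- The dispatch that fixed the issue -->
  <dispatch classname="org.apache.hadoop.gateway.hbase.HBaseDispatch"/>
</service>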