Member since: 04-04-2016
Posts: 166
Kudos Received: 168
Solutions: 29

My Accepted Solutions
Title | Views | Posted
---|---|---
  | 2901 | 01-04-2018 01:37 PM
  | 4913 | 08-01-2017 05:06 PM
  | 1573 | 07-26-2017 01:04 AM
  | 8919 | 07-21-2017 08:59 PM
  | 2607 | 07-20-2017 08:59 PM
07-03-2017
05:02 PM
Synopsis: In addition to authentication and access control, data encryption adds a robust layer of security by making data unreadable in transit over the network or at rest on disk. Encryption helps protect sensitive data in the event of an external breach or unauthorized access by privileged users. The automation of this task is expected to save close to 4-6 hours of manual intervention per occurrence, and it can also serve as a custom disaster recovery solution. GitHub link for the code: https://github.com/RajdeepBiswas/EncryptedDataTransfer
Script (common code) location:
cluster1: under root@cluster1, /root/scripts/dataCopy/hdfs_data_move.sh
cluster2: under root@cluster2, /root/scripts/dataCopy/hdfs_data_move.sh

Usage:
Scenario 1: Copy an encrypted HDFS folder from cluster2 to cluster1.
Example folder name: /tmp/zone_encr_test, encrypted with key "testKey123"

In cluster2:
sudo su root
cd /root/scripts/dataCopy/
./hdfs_data_move.sh export keys

After the above execution finishes, in cluster1:
sudo su root
cd /root/scripts/dataCopy/
./hdfs_data_move.sh import keys

After the above execution finishes:
./hdfs_data_move.sh create /tmp/zone_encr_test testKey123

After the above execution finishes, in cluster2:
sudo su root
cd /root/scripts/dataCopy/
./hdfs_data_move.sh export /tmp/zone_encr_test

Glossary: Quick setup of an HDFS encryption zone

How to set up an encryption zone:
sudo su hdfs
hdfs dfs -mkdir /tmp/zone_encr_test
hdfs crypto -createZone -keyName testKey123 -path /tmp/zone_encr_test
hdfs crypto -listZones
hdfs dfs -chown -R hive:hdfs /tmp/zone_encr_test
exit
sudo su hive
hdfs dfs -chmod -R 750 /tmp/zone_encr_test
hdfs dfs -copyFromLocal /home/hive/encr_file.txt /tmp/zone_encr_test
hdfs dfs -cat /tmp/zone_encr_test/encr_file.txt
exit
sudo su hdfs
hdfs dfs -cat /tmp/zone_encr_test/encr_file.txt
NOTE: The last command fails even though it runs as the hdfs superuser, because the hdfs user does not have access to the encryption key.
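For context, when both clusters can reach each other's NameNodes and the right key already exists on the destination KMS, the manual equivalent of such a transfer looks roughly like the sketch below. This is a minimal sketch, not the script's actual logic; the hostnames and ports are placeholders, and -update/-skipcrccheck follow the usual guidance for copying into encryption zones (block checksums differ between zones).

# Run on cluster1 (the destination); key name, zone path and NameNode hosts are placeholders.
hadoop key create testKey123                                    # only if the key does not yet exist on the destination KMS
sudo -u hdfs hdfs dfs -mkdir /tmp/zone_encr_test
sudo -u hdfs hdfs crypto -createZone -keyName testKey123 -path /tmp/zone_encr_test
# Data is transparently decrypted on read from cluster2 and re-encrypted on write into the
# destination zone; -skipcrccheck is needed because checksums differ across encryption zones.
hadoop distcp -update -skipcrccheck hdfs://cluster2-nn:8020/tmp/zone_encr_test hdfs://cluster1-nn:8020/tmp/zone_encr_test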
07-03-2017
04:17 PM
DataTransfer: Generic HDFS data and Hive database transfer automation between any environments (Production/QA/Development) using Amazon S3 storage. GitHub link for the code: https://github.com/RajdeepBiswas/DataTransfer
Synopsis: Exporting and importing data between different layers of environment, such as production, QA and development, is a recurring task. Due to security considerations, these environments cannot talk to each other directly, so Amazon S3 storage is used as an intermediate staging point for transferring data seamlessly across environments. The automation of this task is expected to save close to 4 hours of manual intervention per occurrence. The code can be reused for disaster recovery automation.

Code location:
Script: /root/scripts/dataCopy/datamove.sh
Configuration file: /root/scripts/dataCopy/conf/conf_datamove_devs3.conf
Note: The configuration file name can differ for different S3 locations and is passed to the script, but it must live in the conf folder under the /root/scripts/dataCopy directory.

Usage:
Scenario 1: Export a database from cluster1 to cluster2.
Example database name: testdb

In cluster1:
sudo su root
cd /root/scripts/dataCopy/
./datamove.sh export testdb db conf_datamove_devs3.conf

After the above execution finishes, in cluster2:
sudo su root
cd /root/scripts/dataCopy/
./datamove.sh import testdb db conf_datamove_devs3.conf

Scenario 2: Export an HDFS directory from cluster1 to cluster2.
Example directory name: /tmp/tomcatLog

In cluster1:
sudo su root
cd /root/scripts/dataCopy/
./datamove.sh export /tmp/tomcatLog dir conf_datamove_devs3.conf

After the above execution finishes, in cluster2:
sudo su root
cd /root/scripts/dataCopy/
./datamove.sh import /tmp/tomcatLog dir conf_datamove_devs3.conf

Note: The script can be run in the background (nohup &), and the logs are stored inside a folder named after the database or directory, with a timestamp in the log file name.

Logs:
[root@cluster1 tomcatLog]# pwd
/root/scripts/dataCopy/tomcatLog
[root@cluster1 tomcatLog]# ls -lrt
total 3
-rw-r--r--. 1 root root 4323 Jun 27 20:53 datamove_2017_06_27_20_52_42.log
-rw-r--r--. 1 root root 4358 Jun 27 20:54 datamove_2017_06_27_20_54_15.log
-rw-r--r--. 1 root root 4380 Jun 27 20:57 datamove_2017_06_27_20_57_31.log
[root@cluster1 tomcatLog]# head datamove_2017_06_27_21_29_24.log
[2017/06/27:21:29:24]: dir tomcatLog copy initiation...
[2017/06/27:21:29:24]: dir tomcatLog import initiation...
17/06/27 21:29:25 INFO tools.DistCp: Input Options: DistCpOptions{atomicCommit=false, syncFolder=true, deleteMissing=false, ignoreFailures=false, overwrite=false, skipCRC=false, blocking=true, numListstatusThreads=0, maxMaps=20, mapBandwidth=100, sslConfigurationFile='null', copyStrategy='uniformsize', preserveStatus=[REPLICATION, BLOCKSIZE, USER, GROUP, PERMISSION, CHECKSUMTYPE, TIMES], preserveRawXattrs=false, atomicWorkPath=null, logPath=null, sourceFileListing=null, sourcePaths=[s3a://s3.path/tmp/tomcatLog], targetPath=hdfs:/tmp/tomcatLog, targetPathExists=true, filtersFile='null'}
17/06/27 21:29:26 INFO impl.TimelineClientImpl: Timeline service address: http://cluster1:8188/ws/v1/timeline/
17/06/27 21:29:26 INFO client.RMProxy: Connecting to ResourceManager at test:8050
17/06/27 21:29:26 INFO client.AHSProxy: Connecting to Application History server at test:10200
17/06/27 21:29:28 INFO tools.SimpleCopyListing: Paths (files+dirs) cnt = 9; dirCnt = 0
17/06/27 21:29:28 INFO tools.SimpleCopyListing: Build file listing completed.
17/06/27 21:29:29 INFO tools.DistCp: Number of paths in the copy list: 9
17/06/27 21:29:29 INFO tools.DistCp: Number of paths in the copy list: 9
[root@cluster1 tomcatLog]#
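For reference, a transfer like this boils down to commands roughly like the sketch below. This is not the script itself; the bucket name, staging path and table name are placeholders (the script drives the real values from the conf file), and it assumes s3a credentials are already configured on both clusters.

# HDFS directory round trip via S3:
hadoop distcp -update /tmp/tomcatLog s3a://your-bucket/tmp/tomcatLog          # on cluster1
hadoop distcp -update s3a://your-bucket/tmp/tomcatLog /tmp/tomcatLog          # on cluster2

# Hive database round trip: one way is per-table EXPORT/IMPORT plus distcp of the staging directory.
hive -e "USE testdb; EXPORT TABLE some_table TO '/staging/testdb/some_table';"   # on cluster1
hadoop distcp -update /staging/testdb s3a://your-bucket/staging/testdb           # on cluster1
hadoop distcp -update s3a://your-bucket/staging/testdb /staging/testdb           # on cluster2
hive -e "USE testdb; IMPORT TABLE some_table FROM '/staging/testdb/some_table';" # on cluster2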
05-19-2017
02:46 PM
1 Kudo
Hi, I managed to fix the issue. The fix was to add the HBase dispatch to the service.xml:
<dispatch classname="org.apache.hadoop.gateway.hbase.HBaseDispatch"/>
Thanks
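For reference, on an HDP install the relevant Knox service definition is typically found along these lines; the paths and the webhbase version directory below are assumptions and may differ in your deployment.

GATEWAY_HOME=/usr/hdp/current/knox-server                                # assumed install location
SERVICE_XML=$GATEWAY_HOME/data/services/webhbase/0.2.3.0/service.xml     # version directory may differ
grep -n dispatch "$SERVICE_XML"                                          # check whether a dispatch element is already declared
# If it is missing, add this element inside the <service> root of service.xml:
#   <dispatch classname="org.apache.hadoop.gateway.hbase.HBaseDispatch"/>
$GATEWAY_HOME/bin/gateway.sh stop && $GATEWAY_HOME/bin/gateway.sh start  # or restart Knox from Ambari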
05-08-2017
04:51 PM
1 Kudo
Scenario: Adding new columns to an already partitioned Hive table.
Problem: The newly added columns show up as NULL values for data present in existing partitions.
Solution:
One workaround is to copy/move the data to a temporary location, drop the partition, add the data back, and then re-add the partition. That works and the new columns pick up values, but for big tables it is not a viable solution.
Best approach: Construct the ALTER statement to add the columns with the CASCADE option, as follows:
ALTER TABLE default.test_table ADD COLUMNS (column1 string, column2 string) CASCADE;
From the Hive documentation:
“ALTER TABLE CHANGE COLUMN with CASCADE command changes the columns of a table's metadata, and cascades the same change to all the partition metadata. RESTRICT is the default, limiting column change only to table metadata.”
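A minimal sketch to see the difference (the table, partition and column names are illustrative):

hive -e "
CREATE TABLE default.test_table (id INT) PARTITIONED BY (dt STRING) STORED AS ORC;
ALTER TABLE default.test_table ADD PARTITION (dt='2017-05-01');
-- Without CASCADE only the table-level metadata gets the new columns; the existing partition
-- keeps its old column list, so column1/column2 read back as NULL for data loaded into it.
ALTER TABLE default.test_table ADD COLUMNS (column1 STRING, column2 STRING) CASCADE;
DESCRIBE default.test_table PARTITION (dt='2017-05-01');
"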
I found that this option is not in wide use, and it can help anyone who faces this situation. Thanks
03-15-2017
07:34 PM
1 Kudo
PROBLEM: Hive queries were getting stuck/hanging on both the MR and Tez engines while selecting from a table containing a few CSV files. The query worked fine for some CSV files; for others it simply hung, with nothing in the logs either.
I was using Hive 1.2.1 on HDP 2.5.3.0. After some investigation, I found that those files contain empty values ('') in the fields where an rpad function was being used. You can easily reproduce the issue by firing:
select rpad('',1,'');
You will see that the query just hangs, because it goes into an infinite loop. More details here: HIVE-15792.
RESOLUTION:
nvl will not work in this case: select nvl('','D'); returns '', because an empty string is not NULL. I resolved it with a query like this:
SELECT rpad(CASE WHEN LENGTH(nvl(COLUMN_NAME,null)) > 0 THEN COLUMN_NAME ELSE null END, 1, '');
This query returns NULL for both NULL and empty-string values occurring in COLUMN_NAME.
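A quick way to see the behaviour side by side (COLUMN_NAME and some_table below are placeholders; be careful with the first statement on an affected version, since it will hang the session):

hive -e "select rpad('', 1, '');"      # hangs on affected versions (HIVE-15792)
hive -e "select nvl('', 'D');"         # returns '' because an empty string is not NULL
hive -e "SELECT rpad(CASE WHEN LENGTH(COLUMN_NAME) > 0 THEN COLUMN_NAME ELSE NULL END, 1, '') FROM some_table;"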
Hope this helps. Thanks,
Rajdeep
03-05-2017
06:17 PM
Found in Ambari version 2.4.2.0
Generation of the SmartSense bundle fails with the following error: ERROR 2017-02-24 10:52:53,284 shell.py:95 - Execution of command returned 1. Exception in thread "main" java.lang.IllegalArgumentException: Illegal group reference
at java.util.regex.Matcher.appendReplacement(Matcher.java:857)
at java.util.regex.Matcher.replaceAll(Matcher.java:955)
at java.lang.String.replaceAll(String.java:2223)
at com.hortonworks.smartsense.anonymization.BundleAnonymizer.handleApplyGroupPatternKeyValue(BundleAnonymizer.java:690)
at com.hortonworks.smartsense.anonymization.BundleAnonymizer.applyGroupPattern(BundleAnonymizer.java:673)
at com.hortonworks.smartsense.anonymization.BundleAnonymizer.applyPropertyRule(BundleAnonymizer.java:612)
at com.hortonworks.smartsense.anonymization.BundleAnonymizer.applyRules(BundleAnonymizer.java:393)
at com.hortonworks.smartsense.anonymization.BundleAnonymizer.anonymizeFolder(BundleAnonymizer.java:291)
at com.hortonworks.smartsense.anonymization.BundleAnonymizer.anonymizeFolder(BundleAnonymizer.java:259)
at com.hortonworks.smartsense.anonymization.BundleAnonymizer.anonymizeFolder(BundleAnonymizer.java:224)
at com.hortonworks.smartsense.anonymization.BundleAnonymizer.anonymize(BundleAnonymizer.java:160)
at com.hortonworks.smartsense.anonymization.Main.run(Main.java:82)
at com.hortonworks.smartsense.anonymization.Main.start(Main.java:210)
at com.hortonworks.smartsense.anonymization.Main.main(Main.java:294)
ERROR 2017-02-24 10:52:53,284 anonymize.py:67 - Execution of script /usr/java/default/bin/java -Xmx2048m -Xms1024m -Dlog.file.name=anonymization.log -Djava.io.tmpdir=/hadoop/smartsense/hst-agent/data/tmp -cp :/etc/hst/conf/:/usr/hdp/share/hst/hst-common/lib/* com.hortonworks.smartsense.anonymization.Main -m /hadoop/smartsense/hst-agent/data/tmp/master001.dev.company.com-a-00027129-c-00065260_comhdpdev_0_2017-02-24_10-52-04 -c /etc/hst/conf/hst-agent.ini failed
ERROR 2017-02-24 10:52:53,284 anonymize.py:68 - Execution of command returned 1. Exception in thread "main" java.lang.IllegalArgumentException: Illegal group reference
at java.util.regex.Matcher.appendReplacement(Matcher.java:857)
at java.util.regex.Matcher.replaceAll(Matcher.java:955)
at java.lang.String.replaceAll(String.java:2223)
at com.hortonworks.smartsense.anonymization.BundleAnonymizer.handleApplyGroupPatternKeyValue(BundleAnonymizer.java:690)
at com.hortonworks.smartsense.anonymization.BundleAnonymizer.applyGroupPattern(BundleAnonymizer.java:673)
at com.hortonworks.smartsense.anonymization.BundleAnonymizer.applyPropertyRule(BundleAnonymizer.java:612)
at com.hortonworks.smartsense.anonymization.BundleAnonymizer.applyRules(BundleAnonymizer.java:393)
at com.hortonworks.smartsense.anonymization.BundleAnonymizer.anonymizeFolder(BundleAnonymizer.java:291)
at com.hortonworks.smartsense.anonymization.BundleAnonymizer.anonymizeFolder(BundleAnonymizer.java:259)
at com.hortonworks.smartsense.anonymization.BundleAnonymizer.anonymizeFolder(BundleAnonymizer.java:224)
at com.hortonworks.smartsense.anonymization.BundleAnonymizer.anonymize(BundleAnonymizer.java:160)
at com.hortonworks.smartsense.anonymization.Main.run(Main.java:82)
at com.hortonworks.smartsense.anonymization.Main.start(Main.java:210)
at com.hortonworks.smartsense.anonymization.Main.main(Main.java:294)
ERROR 2017-02-24 10:52:53,285 AnonymizeBundleCommand.py:62 - Anonymization failed. Please check logs.
Traceback (most recent call last):
File "/usr/hdp/share/hst/hst-agent/lib/hst_agent/command/AnonymizeBundleCommand.py", line 58, in execute
context['bundle_dir'] = anonymizer.anonymize(bundle_dir)
File "/usr/hdp/share/hst/hst-agent/lib/hst_agent/anonymize.py", line 69, in anonymize
raise Exception("Anonymization failed.")
Exception: Anonymization failed.
Cause:
A regex group exception occurs when the constant REPLACE_PROPERTY_VALUE_PATTERN regex pattern is not able to properly group-search the patternStr parameter.
Resolution for the above error:
Option 1 (preferred): Upgrade Ambari and follow the post-upgrade procedures in the Hortonworks docs. NOTE: Make sure that the current Ambari version is lower than 2.4.2.8-2.
Option 2 (a consolidated sketch of the per-node commands follows the steps below):
1. Stop SmartSense from Ambari
2. Uninstall smartsense-hst rpm on all nodes
rpm -e smartsense-hst
3. Install smartsense-hst rpm on all nodes
For Centos/Redhat 7:
rpm -ivh http://private-repo-1.hortonworks.com/ambari/centos7/2.x/updates/2.4.2.8-2/smartsense/smartsense-hst-1.3.1.0-2.x86_64.rpm
For Centos/Redhat 6:
rpm -ivh http://private-repo-1.hortonworks.com/ambari/centos6/2.x/updates/2.4.2.8-2/smartsense/smartsense-hst-1.3.1.0-2.x86_64.rpm
4. On Ambari Server host run
hst add-to-ambari
5. Restart Ambari Server
6. Delete SmartSense service from Ambari if already there
7. Add SmartSense service through Ambari's add service wizard
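For convenience, steps 2-3 can be scripted across nodes along these lines; the node list is a placeholder, and the URL shown is the CentOS/RHEL 7 one from step 3 (swap in the CentOS 6 URL where needed).

NODES="node1 node2 node3"    # placeholder: all cluster nodes running the hst agent
RPM_URL=http://private-repo-1.hortonworks.com/ambari/centos7/2.x/updates/2.4.2.8-2/smartsense/smartsense-hst-1.3.1.0-2.x86_64.rpm
for h in $NODES; do
  ssh "$h" "rpm -e smartsense-hst && rpm -ivh $RPM_URL"
done
# On the Ambari Server host:
hst add-to-ambari
ambari-server restart
# Then remove and re-add the SmartSense service through Ambari (steps 6-7 above).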
Many thanks to @sheetal for providing the solution.
12-14-2016
10:53 PM
@vperiasamy
Thank you. I am going to accept this answer, since there really is no built-in way to bulk load it. I had to do some custom scripting to call the REST API to get the policies, change the repository, and post it back. Not complicated, but not the cleanest way either, since you have to hop from one cluster to the other. :)
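For anyone looking for a starting point, custom scripting like that could look roughly like the sketch below. It is illustrative only, not the exact script used here; the hosts, credentials and repository names are placeholders, and the public v2 REST endpoint paths may vary slightly between Ranger versions.

SRC=http://ranger-source:6080
DST=http://ranger-target:6080
# Pull all policies for a repository (service) from the source Ranger:
curl -s -u admin:password "$SRC/service/public/v2/api/service/cl1_hadoop/policy" -o policies.json
# Rewrite the "service" field in each policy to the target repository name (e.g. with jq or sed),
# then push each policy JSON object to the target Ranger:
curl -s -u admin:password -H "Content-Type: application/json" \
     -X POST "$DST/service/public/v2/api/policy" -d @one_policy.json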
10-17-2016
07:02 PM
2 Kudos
Although it is a simple solution, there is not much reference to this exact problem on the web. We have seen people doing multiple projections, flattening and group-bys to get the data into shape for storing into the final table with Pig after the aggregation and group operations.

Source Table:
CREATE EXTERNAL TABLE IF NOT EXISTS source(
pk1 string,
pk2 string,
agg1 INT,
agg2 INT
)
STORED AS ORC tblproperties("orc.compress"="ZLIB");
Data:

Target Table:
CREATE EXTERNAL TABLE IF NOT EXISTS target_aggregated(
pk1 string,
pk2 string,
sum_agg1 BIGINT,
sum_agg2 BIGINT
)
STORED AS ORC tblproperties("orc.compress"="ZLIB");
Pig Script:
--Load the data in pig relations
staging = LOAD 'DEFAULT.SOURCE' USING org.apache.hive.hcatalog.pig.HCatLoader();
--Group the data
group_staging = group staging BY (pk1,pk2);
--Flatten the grouped data and generate aggregates with same attribute names as the target table
calculate_group_staging = FOREACH group_staging GENERATE FLATTEN(group) AS(pk1,pk2),SUM(staging.agg1) as sum_agg1, SUM(staging.agg2) as sum_agg2;
--Order the data if required
calculate_group_staging_ordered = ORDER calculate_group_staging BY pk1,pk2;
--Store the data using HCatStorer
--Data will be automatically dereferenced by using the HCatalog metastore
STORE calculate_group_staging_ordered INTO 'DEFAULT.TARGET_AGGREGATED' USING org.apache.hive.hcatalog.pig.HCatStorer();
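If the script above is saved to a file (the name below is just an example), it can be run with the HCatalog jars on the classpath like this:

pig -useHCatalog aggregate_to_target.pig     # add -x tez to run on the Tez engine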
Final Results:

Hope this helps. Thanks.
01-04-2018
11:54 AM
Hey everyone, I have a somewhat similar question, which I posted here: https://community.hortonworks.com/questions/155681/how-to-defragment-hdfs-data.html I would really appreciate any ideas. cc @Lester Martin @Jagatheesh Ramakrishnan @rbiswas
10-24-2016
09:04 PM
1 Kudo
https://community.hortonworks.com/questions/3037/what-size-of-tables-make-the-best-out-of-orc-forma.html
Large ORC files with large stripes generally give the best performance. Also have a look at this Yahoo presentation on Hive and ORC at scale: http://www.slideshare.net/Hadoop_Summit/hive-at-yahoo-letters-from-the-trenches
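If you want to experiment with stripe size yourself, it can be set per table through a table property; the value below (256 MB, in bytes) and the table name are only illustrative, not a recommendation from the links above:

hive -e "CREATE TABLE big_orc_table (id BIGINT, payload STRING)
STORED AS ORC TBLPROPERTIES ('orc.compress'='ZLIB', 'orc.stripe.size'='268435456');"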