Member since: 04-04-2016
Posts: 166
Kudos Received: 168
Solutions: 29
11-18-2021
01:30 PM
rbiswas1, I tried your code, but pssh returned a timeout error: it was waiting for the password, and I never got a prompt to enter it. Could you elaborate on your method? Thanks.
07-14-2017
05:33 PM
2 Kudos
Steps to replicate:

hdfs dfs -ls /apps/hive/warehouse/testraj.db/testtable/filename=test.csv.gz
Found 1 items
-rw-rw-rw- 1 hive hive 38258 2017-06-27 21:04 /apps/hive/warehouse/testraj.db/testtable/filename=test.csv.gz/000000_0

Using hive -f script:

cat /tmp/test.txt
ALTER TABLE testraj.testtable PARTITION (filename="test.csv.gz") SET LOCATION "hdfs://ip-1-1-1-1.us-west-2.compute.internal:8020/apps/hive/warehouse/testraj.db/testtable/filename=test.csv.gz";

Error from hive -f scriptname:

[hive@ip-1-1-1-1 rbiswas]$ hive -f /tmp/test.txt
Logging initialized using configuration in file:/etc/hive/2.5.3.0-37/0/hive-log4j.properties
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Unable to alter partition. alter is not possible
[hive@ip-1-1-1-1 rbiswas]$

Error from Beeline:

0: jdbc:hive2://ip-1-1-1-1.us-west-2.com> ALTER TABLE testraj.testtable PARTITION (filename="test.csv.gz") SET LOCATION 'hdfs://ip-1-1-1-1.us-west-2.compute.internal:8020/apps/hive/warehouse/testraj.db/testtable/filename=test.csv.gz';
Error: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Unable to alter partition. alter is not possible (state=08S01,code=1)
0: jdbc:hive2://ip-1-1-1-1.us-west-2.com>

It does work when run directly in the Hive CLI:

hive> ALTER TABLE testraj.testtable PARTITION (filename="test.csv.gz") SET LOCATION "hdfs://ip-1-1-1-1.us-west-2.compute.internal:8020/apps/hive/warehouse/testraj.db/testtable/filename=test.csv.gz";
OK
Time taken: 0.605 seconds

Solution: In the script, rather than using schema_name.tablename, use two separate statements:

use dbname;
alter table tablename ...; --Note: no schema name prefix

The same solution applies to Beeline. So the script becomes:

cat /tmp/test.txt
use testraj;
ALTER TABLE testtable PARTITION (filename="test.csv.gz") SET LOCATION "hdfs://ip-1-1-1-1.us-west-2.compute.internal:8020/apps/hive/warehouse/testraj.db/testtable/filename=test.csv.gz";
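After the ALTER succeeds, the new location can be confirmed with a standard Hive command: DESCRIBE FORMATTED on the partition prints its Location field. The sketch below (table and partition names taken from this post) only builds and echoes the command so it can be reviewed before running on a node with the Hive client:

```shell
# Build the verification command; DESCRIBE FORMATTED on a partition
# prints the partition's Location, which should now be the new path.
table="testraj.testtable"
partition="filename='test.csv.gz'"
cmd="hive -e \"DESCRIBE FORMATTED $table PARTITION ($partition);\""
echo "$cmd"
```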
07-13-2017
01:11 AM
1 Kudo
Design approach

The design depends on the work done in the Jira below, where a data node is conceptualized as a collection of heterogeneous storage with different durability and performance requirements.
https://issues.apache.org/jira/browse/HDFS-2832

Design 1
1) Hot data with partitions that are wholly hosted by HDFS.
2) Cold data with partitions that are wholly hosted by S3.
3) A view that unions these two tables, which is the live table exposed to end users.

Design 2
1) Hot data with partitions that are wholly hosted by HDFS.
2) Cold data with partitions that are wholly hosted by S3.
3) Both hot and cold data are in the same table.

Design 2 was chosen over Design 1 because Design 1 is not transparent to the application layer: the change from the old table to the view would inherently transfer some porting/integration work to the application.

Architecture Diagram
(high-level design diagram not reproduced here)

Automation Flow Diagram
(diagram not reproduced here)

Code

Automation tool codebase:
https://github.com/RajdeepBiswas/HybridArchiveStorage/blob/master/hive_hybrid_storage.sh
Example configuration file:
https://github.com/RajdeepBiswas/HybridArchiveStorage/blob/master/test_table.conf

Setup & Run

Setup:
cd /root/scripts/dataCopy
vi hive_hybrid_storage.sh ##Put the script here
chmod 755 hive_hybrid_storage.sh
cd /root/scripts/dataCopy/conf
vi test_table.conf ##This is where the cold partition names are placed

Run

Option 1: Retain the HDFS partition and delete it manually after data verification.
./hive_hybrid_storage.sh schema_name.test_table test_table.conf retain

Option 2: Delete the HDFS partition as part of the script. It is deleted only after the data is copied to S3, so there is an option to copy it back to HDFS if you want to revert the partition location.
./hive_hybrid_storage.sh schema_name.test_table test_table.conf delete

For part 1 of the article refer to the following link:
https://community.hortonworks.com/content/kbentry/113932/hive-hybrid-storage-mechanism-to-reduce-storage-co.html
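The core operation such a tool performs can be sketched as two steps: copy a cold partition's directory to S3, then repoint the Hive partition at the s3a location. This is a hedged outline under assumptions, not the script itself; the bucket name and partition values are hypothetical, and the commands are only echoed so they can be reviewed before running on a real cluster:

```shell
# Sketch: archive one cold partition to S3 (bucket and names are made up).
db="schema_name"; table="test_table"
part_key="filename"; part_val="2016-01.csv.gz"
hdfs_dir="/apps/hive/warehouse/${db}.db/${table}/${part_key}=${part_val}"
s3_dir="s3a://my-archive-bucket${hdfs_dir}"

# Step 1: copy the partition directory from HDFS to S3.
echo "hadoop distcp hdfs://${hdfs_dir} ${s3_dir}"
# Step 2: repoint the partition (USE <db> first rather than a schema prefix,
# which avoids the ALTER failure described in the 07-14-2017 post above).
echo "hive -e \"USE ${db}; ALTER TABLE ${table} PARTITION (${part_key}='${part_val}') SET LOCATION '${s3_dir}';\""
```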
07-13-2017
12:58 AM
2 Kudos
Introduction

A traditional data warehouse archive strategy involves moving old data onto offsite tapes. That does not fit modern analytics applications well, since the data becomes unavailable for real-time business analytics. Mature Hadoop clusters need a modern data archival strategy to keep storage expenses in check as data volume grows exponentially. The term "hybrid" here designates an archival solution that is always available as well as completely transparent to the application layer.

This document will cover:
- Use case
- Requirement
- Storage cost analysis
- Design approach
- Architecture diagram
- Code
- How to set up and run the code

Use case

The entire business data set is in HDFS (HDP clusters) backed by Amazon EBS, and a disaster recovery solution is in place. Amazon claims S3 storage delivers 99.999999999% durability; in the event of data loss from S3, we recover the data from the disaster recovery site.

Requirement

- Decrease storage costs.
- Archived data should be available for analytics 24x7.
- Access hot and cold (archived) data simultaneously from the application.
- The solution should be transparent to the application layer. In other words, absolutely no change should be required from the application layer after the hybrid archival strategy is implemented.
- Performance should be acceptable.

Storage cost analysis

Storage vs Cost Graph: (graph not reproduced here)

Basis for calculation:
- S3: $0.023 per GB-month of usage. Source: https://aws.amazon.com/s3/pricing/
- EBS SSD (gp2): $0.10 per GB-month of provisioned storage. Including a replication factor of 3, this becomes a net $0.30 per GB. Source: https://aws.amazon.com/ebs/pricing/

Important note: EBS is provisioned storage, whereas S3 is pay-as-you-use. If you provision 1 TB of EBS storage for future data growth, you pay 100% of it whether you are using 0% or 90%. With S3 you pay only for the storage you actually use: 2 GB costs 2 GB, 500 GB costs 500 GB. Hence the S3 price in the calculation is divided by 2, roughly modeling how usage will grow in correlation with the HDFS EBS storage.

Please refer to part 2 for the architecture of the proposed solution and the codebase:
https://community.hortonworks.com/articles/113934/hive-hybrid-storage-mechanism-to-reduce-storage-co-1.html
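The rates above can be turned into a quick back-of-the-envelope comparison. A minimal sketch using the article's figures ($0.10/GB-month for EBS gp2 times 3 replicas, $0.023/GB-month for S3), for an example volume of 1 TB:

```shell
# Monthly cost (USD) of keeping the same data on replicated EBS vs. S3.
data_gb=1000   # example volume: 1 TB
ebs_cost=$(awk -v g="$data_gb" 'BEGIN { printf "%.2f", g * 0.10 * 3 }')
s3_cost=$(awk -v g="$data_gb" 'BEGIN { printf "%.2f", g * 0.023 }')
echo "EBS (3x replicated): \$${ebs_cost}/month; S3: \$${s3_cost}/month"
```

For 1 TB this works out to $300/month on triple-replicated EBS versus $23/month on S3, which is the gap the hybrid archive exploits.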
07-03-2017
05:02 PM
Synopsis: In addition to authentication and access control, data encryption adds a robust layer of security by making data unreadable in transit over the network or at rest on disk. Encryption helps protect sensitive data in the case of an external breach or unauthorized access by privileged users. The automation of this task is expected to save close to 4-6 hours of manual intervention per occurrence. It can also be used as a custom disaster recovery solution.

Github link for the code:
https://github.com/RajdeepBiswas/EncryptedDataTransfer

Script (common code) location:
cluster1: under root@cluster1, /root/scripts/dataCopy/hdfs_data_move.sh
cluster2: under root@cluster2, /root/scripts/dataCopy/hdfs_data_move.sh

Usage

Scenario 1: Copying an encrypted HDFS folder from cluster2 to cluster1.
Example folder name: /tmp/zone_encr_test, encrypted with key "testKey123"

In cluster2:
sudo su root
cd /root/scripts/dataCopy/
./hdfs_data_move.sh export keys

After the above execution finishes, in cluster1:
sudo su root
cd /root/scripts/dataCopy/
./hdfs_data_move.sh import keys

After the above execution finishes:
./hdfs_data_move.sh create /tmp/zone_encr_test testKey123

After the above execution finishes, in cluster2:
sudo su root
cd /root/scripts/dataCopy/
./hdfs_data_move.sh export /tmp/zone_encr_test

Glossary: Quick setup of an HDFS encryption zone

How to set up an encryption zone:
sudo su hdfs
hdfs dfs -mkdir /tmp/zone_encr_test
hdfs crypto -createZone -keyName testKey123 -path /tmp/zone_encr_test
hdfs crypto -listZones
hdfs dfs -chown -R hive:hdfs /tmp/zone_encr_test
exit
sudo su hive
hdfs dfs -chmod -R 750 /tmp/zone_encr_test
hdfs dfs -copyFromLocal /home/hive/encr_file.txt /tmp/zone_encr_test
hdfs dfs -cat /tmp/zone_encr_test/encr_file.txt
exit
sudo su hdfs
hdfs dfs -cat /tmp/zone_encr_test/encr_file.txt

NOTE: The last command will fail even though it runs as the hdfs superuser, because reading decrypted data requires access to the encryption key, not just HDFS permissions.
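Between clusters, encrypted HDFS data is commonly moved with DistCp over the /.reserved/raw path, which copies the raw ciphertext together with the xattrs holding the per-file encrypted keys, without ever decrypting. This is a hedged sketch of that standard technique (not necessarily what hdfs_data_move.sh does internally; the NameNode host names are hypothetical), and the command is only echoed for review:

```shell
# Copy raw encrypted bytes between clusters; -px preserves xattrs, and
# /.reserved/raw on both sides keeps the data encrypted end to end.
src="hdfs://cluster2-nn:8020/.reserved/raw/tmp/zone_encr_test"
dst="hdfs://cluster1-nn:8020/.reserved/raw/tmp/zone_encr_test"
cmd="hadoop distcp -px $src $dst"
echo "$cmd"   # run as a user with access to /.reserved/raw on both clusters
```

For the destination cluster to read the data, the same encryption key (here "testKey123") must exist in its KMS, which is what the export keys / import keys steps above take care of.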
07-03-2017
04:17 PM
DataTransfer: Generic HDFS data and Hive database transfer automation between any environments (production/QA/development) utilizing Amazon S3 storage.

Github link for the code:
https://github.com/RajdeepBiswas/DataTransfer

Synopsis: Exporting and importing data between different layers of environment, like production, QA, and development, is a recurring task. Due to security considerations, these environments cannot talk to each other, so we use Amazon S3 storage as an intermediate storage point for transferring data seamlessly across environments. The automation of this task is expected to save close to 4 hours of manual intervention per occurrence. The code can be reused for disaster recovery automation.

Code location

Place your scripts here:
Script: /root/scripts/dataCopy/datamove.sh
Configuration file: /root/scripts/dataCopy/conf/conf_datamove_devs3.conf

Note: The name of the configuration file can be different for different S3 locations and is passed to the script, but it needs to be in the conf folder under the /root/scripts/dataCopy directory.

Usage

Scenario 1: Exporting a database from cluster1 to cluster2.
Example database name: testdb

In cluster1:
sudo su root
cd /root/scripts/dataCopy/
./datamove.sh export testdb db conf_datamove_devs3.conf

After the above execution finishes, in cluster2:
sudo su root
cd /root/scripts/dataCopy/
./datamove.sh import testdb db conf_datamove_devs3.conf

Scenario 2: Exporting HDFS data (a directory) from cluster1 to cluster2.
Example directory name: /tmp/tomcatLog

In cluster1:
sudo su root
cd /root/scripts/dataCopy/
./datamove.sh export /tmp/tomcatLog dir conf_datamove_devs3.conf

After the above execution finishes, in cluster2:
sudo su root
cd /root/scripts/dataCopy/
./datamove.sh import /tmp/tomcatLog dir conf_datamove_devs3.conf

Note: The script can be run in the background (nohup &), and the logs are stored inside a folder named after the database or directory, with a timestamp.

Logs:
[root@cluster1 tomcatLog]# pwd
/root/scripts/dataCopy/tomcatLog
[root@cluster1 tomcatLog]# ls -lrt
total 3
-rw-r--r--. 1 root root 4323 Jun 27 20:53 datamove_2017_06_27_20_52_42.log
-rw-r--r--. 1 root root 4358 Jun 27 20:54 datamove_2017_06_27_20_54_15.log
-rw-r--r--. 1 root root 4380 Jun 27 20:57 datamove_2017_06_27_20_57_31.log
[root@cluster1 tomcatLog]# head datamove_2017_06_27_21_29_24.log
[2017/06/27:21:29:24]: dir tomcatLog copy initiation...
[2017/06/27:21:29:24]: dir tomcatLog import initiation...
17/06/27 21:29:25 INFO tools.DistCp: Input Options: DistCpOptions{atomicCommit=false, syncFolder=true, deleteMissing=false, ignoreFailures=false, overwrite=false, skipCRC=false, blocking=true, numListstatusThreads=0, maxMaps=20, mapBandwidth=100, sslConfigurationFile='null', copyStrategy='uniformsize', preserveStatus=[REPLICATION, BLOCKSIZE, USER, GROUP, PERMISSION, CHECKSUMTYPE, TIMES], preserveRawXattrs=false, atomicWorkPath=null, logPath=null, sourceFileListing=null, sourcePaths=[s3a://s3.path/tmp/tomcatLog], targetPath=hdfs:/tmp/tomcatLog, targetPathExists=true, filtersFile='null'}
17/06/27 21:29:26 INFO impl.TimelineClientImpl: Timeline service address: http://cluster1:8188/ws/v1/timeline/
17/06/27 21:29:26 INFO client.RMProxy: Connecting to ResourceManager at test:8050
17/06/27 21:29:26 INFO client.AHSProxy: Connecting to Application History server at test:10200
17/06/27 21:29:28 INFO tools.SimpleCopyListing: Paths (files+dirs) cnt = 9; dirCnt = 0
17/06/27 21:29:28 INFO tools.SimpleCopyListing: Build file listing completed.
17/06/27 21:29:29 INFO tools.DistCp: Number of paths in the copy list: 9
17/06/27 21:29:29 INFO tools.DistCp: Number of paths in the copy list: 9
[root@cluster1 tomcatLog]#
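As the DistCp log above suggests, the underlying directory transfer is a copy to and from an s3a:// URI. A minimal hedged sketch of the two hops (the bucket name is hypothetical, and the real script wraps this with logging and Hive export/import logic); the commands are echoed rather than run:

```shell
# Hop 1 (on cluster1): push the HDFS directory to S3.
# Hop 2 (on cluster2): pull it from S3 into HDFS; -update matches the
# syncFolder=true behavior visible in the log above.
dir="/tmp/tomcatLog"
bucket="s3a://my-transfer-bucket"
push="hadoop distcp hdfs://${dir} ${bucket}${dir}"
pull="hadoop distcp -update ${bucket}${dir} hdfs://${dir}"
echo "$push"   # run on cluster1
echo "$pull"   # run on cluster2
```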
05-08-2017
04:51 PM
1 Kudo
Scenario: Trying to add new columns to an already partitioned Hive table.

Problem: The newly added columns show up as null values in the data present in existing partitions.

Solution:
One workaround is copying/moving the data to a temporary location, dropping the partition, adding the data back, and then adding the partition back. It works, and the new columns pick up the values, but for big tables this is not a viable solution.

Best approach: Construct the ALTER statement to add columns with the CASCADE option, as follows:

ALTER TABLE default.test_table ADD columns (column1 string,column2 string) CASCADE;

From the Hive documentation:
"ALTER TABLE CHANGE COLUMN with CASCADE command changes the columns of a table's metadata, and cascades the same change to all the partition metadata. RESTRICT is the default, limiting column change only to table metadata."

This option does not seem to be widely known, and I hope it helps anyone who faces this situation. Thanks.
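A self-contained walk-through of the fix can be sketched as below. The table and column names are made up for illustration, and the final hive command is only echoed; run it on a node with the Hive client:

```shell
# End-to-end sketch: create a partitioned table, add a partition, then add
# columns with CASCADE so existing partition metadata picks up the change.
cat > /tmp/add_cols_cascade.sql <<'EOF'
CREATE TABLE demo_part (id INT) PARTITIONED BY (dt STRING);
ALTER TABLE demo_part ADD PARTITION (dt='2017-01-01');
-- Without CASCADE, rows already in dt='2017-01-01' would read the new
-- columns as NULL; CASCADE pushes the schema change to partition metadata.
ALTER TABLE demo_part ADD COLUMNS (col1 STRING, col2 STRING) CASCADE;
EOF
echo "hive -f /tmp/add_cols_cascade.sql"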
03-15-2017
07:34 PM
1 Kudo
PROBLEM: Hive queries were getting stuck/hanging on both the mr and tez engines while selecting from a table containing a few CSV files. The query worked fine for some CSV files; for others it just hung, with nothing in the logs either.

I was using Hive 1.2.1 on HDP 2.5.3.0. After some investigation, I found that those files contained empty values '' in fields where an rpad function was being used. You can easily reproduce the issue by firing:

select rpad('',1,'');

You will see that the query just hangs. The reason is that it goes into an infinite loop: an empty pad string can never increase the length of the result. More details here: HIVE-15792

RESOLUTION:
nvl will not work in this case; that is:

select nvl('','D'); --will return ''

I resolved it using a query like this:

SELECT rpad(CASE WHEN LENGTH(nvl(COLUMN_NAME,null)) > 0 THEN COLUMN_NAME ELSE null END, 1, '');

In this case the query returns null for both null and empty string values occurring in COLUMN_NAME. Hope this helps.

Thanks,
Rajdeep
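The hang is easy to reason about outside Hive. The sketch below reimplements rpad's padding loop in shell to show why an empty pad never terminates, and guards against it in the same spirit as the query above (an illustrative analogue, not Hive code):

```shell
# rpad(s, n, pad): append pad until length >= n, then truncate to n chars.
# If pad is empty and s is shorter than n, the loop can never make progress,
# which is exactly the infinite loop reported in HIVE-15792.
rpad_safe() {
  s=$1; n=$2; pad=$3
  if [ -z "$pad" ] && [ ${#s} -lt "$n" ]; then
    echo ""   # guard: emulate returning NULL instead of looping forever
    return
  fi
  while [ ${#s} -lt "$n" ]; do s="$s$pad"; done
  printf '%s\n' "$(printf '%s' "$s" | cut -c1-"$n")"
}
rpad_safe "ab" 5 "x"   # -> abxxx
rpad_safe ""   1 ""    # -> empty line (would hang without the guard)
```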
03-05-2017
06:17 PM
Found in version: Ambari 2.4.2.0
Generation of the SmartSense bundle fails with the following error:

ERROR 2017-02-24 10:52:53,284 shell.py:95 - Execution of command returned 1. Exception in thread "main" java.lang.IllegalArgumentException: Illegal group reference
at java.util.regex.Matcher.appendReplacement(Matcher.java:857)
at java.util.regex.Matcher.replaceAll(Matcher.java:955)
at java.lang.String.replaceAll(String.java:2223)
at com.hortonworks.smartsense.anonymization.BundleAnonymizer.handleApplyGroupPatternKeyValue(BundleAnonymizer.java:690)
at com.hortonworks.smartsense.anonymization.BundleAnonymizer.applyGroupPattern(BundleAnonymizer.java:673)
at com.hortonworks.smartsense.anonymization.BundleAnonymizer.applyPropertyRule(BundleAnonymizer.java:612)
at com.hortonworks.smartsense.anonymization.BundleAnonymizer.applyRules(BundleAnonymizer.java:393)
at com.hortonworks.smartsense.anonymization.BundleAnonymizer.anonymizeFolder(BundleAnonymizer.java:291)
at com.hortonworks.smartsense.anonymization.BundleAnonymizer.anonymizeFolder(BundleAnonymizer.java:259)
at com.hortonworks.smartsense.anonymization.BundleAnonymizer.anonymizeFolder(BundleAnonymizer.java:224)
at com.hortonworks.smartsense.anonymization.BundleAnonymizer.anonymize(BundleAnonymizer.java:160)
at com.hortonworks.smartsense.anonymization.Main.run(Main.java:82)
at com.hortonworks.smartsense.anonymization.Main.start(Main.java:210)
at com.hortonworks.smartsense.anonymization.Main.main(Main.java:294)
ERROR 2017-02-24 10:52:53,284 anonymize.py:67 - Execution of script /usr/java/default/bin/java -Xmx2048m -Xms1024m -Dlog.file.name=anonymization.log -Djava.io.tmpdir=/hadoop/smartsense/hst-agent/data/tmp -cp :/etc/hst/conf/:/usr/hdp/share/hst/hst-common/lib/* com.hortonworks.smartsense.anonymization.Main -m /hadoop/smartsense/hst-agent/data/tmp/master001.dev.company.com-a-00027129-c-00065260_comhdpdev_0_2017-02-24_10-52-04 -c /etc/hst/conf/hst-agent.ini failed
ERROR 2017-02-24 10:52:53,284 anonymize.py:68 - Execution of command returned 1. Exception in thread "main" java.lang.IllegalArgumentException: Illegal group reference
at java.util.regex.Matcher.appendReplacement(Matcher.java:857)
at java.util.regex.Matcher.replaceAll(Matcher.java:955)
at java.lang.String.replaceAll(String.java:2223)
at com.hortonworks.smartsense.anonymization.BundleAnonymizer.handleApplyGroupPatternKeyValue(BundleAnonymizer.java:690)
at com.hortonworks.smartsense.anonymization.BundleAnonymizer.applyGroupPattern(BundleAnonymizer.java:673)
at com.hortonworks.smartsense.anonymization.BundleAnonymizer.applyPropertyRule(BundleAnonymizer.java:612)
at com.hortonworks.smartsense.anonymization.BundleAnonymizer.applyRules(BundleAnonymizer.java:393)
at com.hortonworks.smartsense.anonymization.BundleAnonymizer.anonymizeFolder(BundleAnonymizer.java:291)
at com.hortonworks.smartsense.anonymization.BundleAnonymizer.anonymizeFolder(BundleAnonymizer.java:259)
at com.hortonworks.smartsense.anonymization.BundleAnonymizer.anonymizeFolder(BundleAnonymizer.java:224)
at com.hortonworks.smartsense.anonymization.BundleAnonymizer.anonymize(BundleAnonymizer.java:160)
at com.hortonworks.smartsense.anonymization.Main.run(Main.java:82)
at com.hortonworks.smartsense.anonymization.Main.start(Main.java:210)
at com.hortonworks.smartsense.anonymization.Main.main(Main.java:294)
ERROR 2017-02-24 10:52:53,285 AnonymizeBundleCommand.py:62 - Anonymization failed. Please check logs.
Traceback (most recent call last):
File "/usr/hdp/share/hst/hst-agent/lib/hst_agent/command/AnonymizeBundleCommand.py", line 58, in execute
context['bundle_dir'] = anonymizer.anonymize(bundle_dir)
File "/usr/hdp/share/hst/hst-agent/lib/hst_agent/anonymize.py", line 69, in anonymize
raise Exception("Anonymization failed.")
Exception: Anonymization failed.
Cause:
A group regex exception occurs when the constant REPLACE_PROPERTY_VALUE_PATTERN regex pattern cannot properly group-search the parameter patternStr.

Resolution for the above error:

Option 1 (preferred): Upgrade Ambari and follow the post-upgrade procedures per the Hortonworks docs.
NOTE: Make sure that the current Ambari version is lower than 2.4.2.8-2.

Option 2:
1. Stop SmartSense from Ambari
2. Uninstall smartsense-hst rpm on all nodes
rpm -e smartsense-hst
3. Install smartsense-hst rpm on all nodes
For Centos/Redhat 7:
rpm -ivh http://private-repo-1.hortonworks.com/ambari/centos7/2.x/updates/2.4.2.8-2/smartsense/smartsense-hst-1.3.1.0-2.x86_64.rpm
For Centos/Redhat 6:
rpm -ivh http://private-repo-1.hortonworks.com/ambari/centos6/2.x/updates/2.4.2.8-2/smartsense/smartsense-hst-1.3.1.0-2.x86_64.rpm
4. On Ambari Server host run
hst add-to-ambari
5. Restart Ambari Server
6. Delete SmartSense service from Ambari if already there
7. Add SmartSense service through Ambari's add service wizard
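Before choosing Option 2, it may help to confirm the installed build really is older than the fixed one. A small hedged helper (the version strings come from this article; how you obtain the current version on your nodes may vary, and sort -V needs GNU coreutils):

```shell
# Returns success (exit 0) if version $1 sorts strictly before version $2.
version_lt() {
  [ "$1" != "$2" ] && [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$1" ]
}
current="2.4.2.0"   # substitute your installed Ambari/SmartSense build here
fixed="2.4.2.8"
if version_lt "$current" "$fixed"; then
  echo "affected build: apply Option 1 or Option 2"
else
  echo "already at or past the fixed build"
fi
```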
Many thanks to @sheetal for providing the solution. Thanks.
10-17-2016
07:02 PM
2 Kudos
Although it is a simple solution, there is not much reference to this exact problem on the web. We have seen people doing multiple projections, flattening, and group-bys to get the data into shape for storing into the final table using Pig after the aggregation and group operations.

Source Table:

CREATE EXTERNAL TABLE IF NOT EXISTS source(
pk1 string,
pk2 string,
agg1 INT,
agg2 INT
)
STORED AS ORC tblproperties("orc.compress"="ZLIB");
Data: (sample data screenshot not reproduced here)

Target Table:

CREATE EXTERNAL TABLE IF NOT EXISTS target_aggregated(
Table: CREATE EXTERNAL TABLE IF NOT EXISTS target_aggregated(
pk1 string,
pk2 string,
sum_agg1 BIGINT,
sum_agg2 BIGINT
)
STORED AS ORC tblproperties("orc.compress"="ZLIB");
Pig Script:

--Load the data in pig relations
staging = LOAD 'DEFAULT.SOURCE' USING org.apache.hive.hcatalog.pig.HCatLoader();
--Group the data
group_staging = group staging BY (pk1,pk2);
--Flatten the grouped data and generate aggregates with same attribute names as the target table
calculate_group_staging = FOREACH group_staging GENERATE FLATTEN(group) AS(pk1,pk2),SUM(staging.agg1) as sum_agg1, SUM(staging.agg2) as sum_agg2;
--Order the data if required
calculate_group_staging_ordered = ORDER calculate_group_staging BY pk1,pk2;
--Store the data using HCatStorer
--Data will be automatically dereferenced by using the HCatalog metastore
STORE calculate_group_staging_ordered INTO 'DEFAULT.TARGET_AGGREGATED' USING org.apache.hive.hcatalog.pig.HCatStorer();
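The script above needs HCatalog on the Pig classpath, which the Pig launcher enables via the -useHCatalog flag. A small sketch (the script filename is made up; the command is echoed for review rather than executed):

```shell
# Save the Pig script to a file, then launch it with HCatalog integration
# so HCatLoader/HCatStorer can resolve the Hive metastore tables.
script="aggregate_to_target.pig"
cmd="pig -useHCatalog $script"
echo "$cmd"   # run on a node with the Pig and HCatalog clients installed
```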
Final Results: (output screenshot not reproduced here)

Hope this helps. Thanks.