Member since: 07-31-2019
Posts: 346
Kudos Received: 259
Solutions: 62
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 2893 | 08-22-2018 06:02 PM |
| | 1671 | 03-26-2018 11:48 AM |
| | 4145 | 03-15-2018 01:25 PM |
| | 5064 | 03-01-2018 08:13 PM |
| | 1418 | 02-20-2018 01:05 PM |
09-07-2016
12:34 PM
2 Kudos
Hi @Sridhar Subramaniam. The best place to start is with the large number of tutorials you can find here: https://github.com/hortonworks/tutorials/tree/hdp-2.5/tutorials/hortonworks. Once you feel comfortable, you can begin experimenting with data sets from work or from public sources like www.data.gov. Good luck and have fun!
09-05-2016
08:08 PM
Hi @Sarah Maadawy. Did you run ./tpcds-setup.sh 100? That argument is the scale factor, so it generates roughly 100 GB of data. Are you sure you wanted that much? You might be running out of disk space.
08-26-2016
02:59 PM
1 Kudo
Hi @Vasilis Vagias. It looks like you downloaded the VMDK for VMware. Make sure you download the OVA for VirtualBox instead.
08-25-2016
06:04 PM
3 Kudos
Hi @Adi Jabkowsky. Take a look at the QueryDatabaseTable processor, which incrementally fetches new rows based on a maximum-value column: https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.standard.QueryDatabaseTable/index.html. You can also use the ExecuteSQL processor to run an arbitrary query: https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.standard.ExecuteSQL/index.html
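For intuition, QueryDatabaseTable remembers the largest value it has seen in the configured Maximum-value Column and fetches only newer rows on each scheduled run. A rough SQL sketch of that behavior, assuming a hypothetical orders table keyed by an auto-incrementing order_id (the table, column, and the literal 1234 are illustrative, not from this thread):

```sql
-- Roughly what QueryDatabaseTable issues on each scheduled run,
-- where 1234 stands in for the last maximum value saved in
-- processor state (hypothetical table and column names).
SELECT *
FROM orders
WHERE order_id > 1234
ORDER BY order_id;
```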
08-23-2016
02:33 PM
Changing "choose authorization" to NONE fixed the problem. 2016-08-23-09-34-47.png
08-22-2016
04:40 PM
1 Kudo
@ripunjay godhani

1. YARN provides resource isolation for most data access; the exception is streaming workloads, for which you will want to size and dedicate hardware appropriately. You can use the Capacity Scheduler for fine-grained resource allocation (see the sketch below). Node labels are also available if you want to run certain jobs on certain nodes based on their hardware. You will still want to do your due diligence around service co-location and proactively monitor and maintain your environment for proper performance; SmartSense is vital in this regard.

2. You cannot run multiple HDP versions under a single Ambari server. HDP 2.5 will allow multiple Spark versions, but the core HDP version must be the same. Use development and/or test environments to test upgrades or tech-preview components.

3. Again, the YARN Capacity Scheduler can fence off application resources so that no single application consumes all your cluster resources. Security is always a concern, but following best practices around encryption, authentication, authorization, auditing, RBAC policies, etc. will address most scenarios. If you think about it, we've been using shared storage (SAN) for over a decade; HDFS follows the same basic centralized-storage concept while being far more versatile.
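As a rough illustration of point 1, here is a minimal capacity-scheduler.xml sketch. The "batch" and "streaming" queue names and the percentages are hypothetical examples, not values from this thread:

```xml
<!-- Minimal sketch: two hypothetical queues under root. -->
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>batch,streaming</value>
</property>
<property>
  <!-- Guaranteed share for the batch queue. -->
  <name>yarn.scheduler.capacity.root.batch.capacity</name>
  <value>70</value>
</property>
<property>
  <!-- Elastic ceiling: batch may borrow idle capacity up to 90%,
       so it can never crowd out the streaming queue entirely. -->
  <name>yarn.scheduler.capacity.root.batch.maximum-capacity</name>
  <value>90</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.streaming.capacity</name>
  <value>30</value>
</property>
```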
08-22-2016
04:05 PM
2 Kudos
Hi @ripunjay godhani, there may be some confusion. A data lake is a concept, not a technology, unless you are referring to Azure Data Lake: https://azure.microsoft.com/en-us/solutions/data-lake/. Any time you bring siloed data together and store it on a single HDFS cluster, you are creating a data lake. The benefits include:

1. Centralized security across all your data.
2. Centralized data governance and data lineage.
3. Centralized cluster monitoring and tuning.
4. Multiple data-access patterns over single data sets (batch, real-time, in-memory, interactive, ad hoc...).
5. A central data repository for third-party BI tools and other visualization applications.
6. A centralized CoE for development, management, operations, and control.
7. Centralized budgeting and charge-back models.

This list is by no means exhaustive. In summary, a data lake is a break away from application-driven silos toward a data-driven, data-centric architecture. Hope this helps!
08-16-2016
01:55 PM
@Vincent Romeo. Your use case makes a lot of sense. I don't know for sure, but you might be able to override the setting. Adding Wei to the conversation. + @Wei Zheng
08-16-2016
01:03 PM
1 Kudo
Hi @Vincent Romeo, I think I know what's going on here. The issue isn't that external and internal tables differ technically; rather, there is a design expectation behind the two features. A user chooses external tables because they expect the data not to change: you can apply multiple schemas to the same data set without fear of any one user deleting or changing the data, and if someone drops a table, the underlying data isn't removed. The sole purpose of ACID is to insert, update, and delete data, which goes against the basic premise of external tables. To honor that expectation, the developers essentially disabled ACID on external tables, i.e., they disabled compaction, which is the change mechanism for Hive ACID. Hope this helps!
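To make the contrast concrete, here is a minimal HiveQL sketch. The table names, columns, bucket count, and location are hypothetical, and the ACID DDL reflects Hive 1.x-era requirements (ORC storage, bucketing, transactional=true):

```sql
-- Assumes the cluster already has hive.support.concurrency=true and
-- hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager.

-- Managed (internal) table set up for ACID: ORC, bucketed, transactional.
-- UPDATE/DELETE work here, and compaction rewrites the delta files.
CREATE TABLE customers_acid (
  id INT,
  name STRING
)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');

UPDATE customers_acid SET name = 'Acme' WHERE id = 1;  -- allowed

-- External table: Hive only tracks the schema; dropping the table
-- leaves the files in place, and ACID operations are not supported.
CREATE EXTERNAL TABLE customers_ext (
  id INT,
  name STRING
)
STORED AS ORC
LOCATION '/data/customers';

-- UPDATE customers_ext ... would fail: external tables are not transactional.
```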
08-09-2016
12:55 PM
1 Kudo
Hi @devers, ACID does carry some performance considerations, especially if you have a high volume of inserts and deletes. Performance will slowly degrade over time until a compaction runs; after compaction, performance normalizes. Based on your use case, you'll want to adjust the compaction frequency to find your unique performance sweet spot. https://community.hortonworks.com/questions/15095/hive-compaction-for-acid-transactions.html
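As a rough sketch of what that tuning looks like, assuming a hypothetical ACID table named events; the threshold properties in the comments are metastore-side hive-site.xml settings, not per-session settings:

```sql
-- Trigger a compaction manually rather than waiting for the automatic
-- initiator (useful after a heavy burst of inserts/updates/deletes).
ALTER TABLE events COMPACT 'minor';  -- merge delta files together
ALTER TABLE events COMPACT 'major';  -- rewrite base + deltas into a new base

-- Watch compaction progress and history.
SHOW COMPACTIONS;

-- Automatic compaction frequency is governed by metastore-side settings
-- in hive-site.xml (shown as comments since they are not SET per session):
--   hive.compactor.initiator.on        = true  (enable auto-compaction)
--   hive.compactor.delta.num.threshold = 10    (deltas before a minor compaction)
--   hive.compactor.delta.pct.threshold = 0.1   (delta/base size ratio before a major)
```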