Member since: 07-31-2019
Posts: 346
Kudos Received: 259
Solutions: 62
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 2893 | 08-22-2018 06:02 PM |
| | 1671 | 03-26-2018 11:48 AM |
| | 4145 | 03-15-2018 01:25 PM |
| | 5064 | 03-01-2018 08:13 PM |
| | 1418 | 02-20-2018 01:05 PM |
09-07-2016
12:34 PM
2 Kudos
Hi @Sridhar Subramaniam. The best place to start is with the large number of tutorials you can find here: https://github.com/hortonworks/tutorials/tree/hdp-2.5/tutorials/hortonworks. Once you feel comfortable, you can begin experimenting with data sets from work or from public sources like www.data.gov. Good luck and have fun!
09-05-2016
08:08 PM
Hi @Sarah Maadawy. Did you run ./tpcds-setup.sh 100? That argument is the scale factor, so it generates roughly 100 GB of data. Are you sure you wanted that much? You might be running out of disk space.
08-26-2016
02:59 PM
1 Kudo
Hi @Vasilis Vagias. It looks like you downloaded the VMDK for VMware. Make sure you download the OVA for VirtualBox instead.
08-25-2016
06:04 PM
3 Kudos
Hi @Adi Jabkowsky. Take a look at the QueryDatabaseTable processor, which incrementally fetches new rows based on a maximum-value column: https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.standard.QueryDatabaseTable/index.html. You can also use the ExecuteSQL processor to run an arbitrary query: https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.standard.ExecuteSQL/index.html
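For intuition, QueryDatabaseTable remembers the largest value it has seen in the configured Maximum-value Column and fetches only newer rows on each scheduled run. A rough SQL sketch of that behavior, assuming a hypothetical orders table keyed by an auto-incrementing order_id (the table, column, and the literal 1234 are illustrative, not from this thread):

```sql
-- Roughly what QueryDatabaseTable issues on each scheduled run,
-- where 1234 stands in for the last maximum value saved in
-- processor state (hypothetical table and column names).
SELECT *
FROM orders
WHERE order_id > 1234
ORDER BY order_id;
```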
08-23-2016
02:33 PM
Changing "choose authorization" to NONE fixed the problem. 2016-08-23-09-34-47.png
08-22-2016
04:40 PM
1 Kudo
@ripunjay godhani

1. YARN provides resource isolation for most data access; the exception is streaming workloads, for which you will want to size and dedicate hardware appropriately. You can use the Capacity Scheduler for fine-grained resource allocation (see the sketch below). Node labels are also available if you want to run certain jobs on certain nodes based on their hardware. You will still want to do your due diligence around service co-location and proactively monitor and maintain your environment for proper performance; SmartSense is vital in this regard.

2. You cannot run multiple HDP versions under a single Ambari server. HDP 2.5 will allow multiple Spark versions, but the core HDP version must be the same. Use development and/or test environments to test upgrades or tech-preview components.

3. Again, the YARN Capacity Scheduler can fence off application resources so that no single application consumes all your cluster resources. Security is always a concern, but following best practices around encryption, authentication, authorization, auditing, RBAC policies, etc. will address most scenarios. If you think about it, we've been using shared storage (SAN) for over a decade; HDFS follows the same basic centralized-storage concept while being far more versatile.
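As a rough illustration of point 1, here is a minimal capacity-scheduler.xml sketch. The "batch" and "streaming" queue names and the percentages are hypothetical examples, not values from this thread:

```xml
<!-- Minimal sketch: two hypothetical queues under root. -->
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>batch,streaming</value>
</property>
<property>
  <!-- Guaranteed share for the batch queue. -->
  <name>yarn.scheduler.capacity.root.batch.capacity</name>
  <value>70</value>
</property>
<property>
  <!-- Elastic ceiling: batch may borrow idle capacity up to 90%,
       so it can never crowd out the streaming queue entirely. -->
  <name>yarn.scheduler.capacity.root.batch.maximum-capacity</name>
  <value>90</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.streaming.capacity</name>
  <value>30</value>
</property>
```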
08-22-2016
04:05 PM
2 Kudos
Hi @ripunjay godhani, there may be some confusion. A data lake is a concept, not a technology, unless you are referring to Azure Data Lake: https://azure.microsoft.com/en-us/solutions/data-lake/. Any time you bring siloed data together and store it on a single HDFS cluster, you are creating a data lake. The benefits include:

1. Centralized security across all your data.
2. Centralized data governance and data lineage.
3. Centralized cluster monitoring and tuning.
4. Multiple data-access patterns over single data sets (batch, real-time, in-memory, interactive, ad hoc...).
5. A central data repository for third-party BI tools and other visualization applications.
6. A centralized CoE for development, management, operations, and control.
7. Centralized budgeting and charge-back models.

This list is by no means exhaustive. In summary, a data lake is a break away from application-driven silos toward a data-driven, data-centric architecture. Hope this helps!
08-16-2016
01:55 PM
@Vincent Romeo. Your use case makes a lot of sense. I don't know for sure, but you might be able to override the setting. Adding Wei to the conversation. + @Wei Zheng
08-16-2016
01:03 PM
1 Kudo
Hi @Vincent Romeo, I think I know what's going on here. The issue isn't that external and internal tables differ technically; rather, there is a design expectation behind the two features. A user chooses external tables because they expect the data not to change: you can apply multiple schemas to the same data set without fear of any one user deleting or changing the data, and if someone drops a table, the underlying data isn't removed. The sole purpose of ACID is to insert, update, and delete data, which goes against the basic premise of external tables. To honor that expectation, the developers essentially disabled ACID on external tables, i.e., they disabled compaction, which is the change mechanism for Hive ACID. Hope this helps!
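To make the contrast concrete, here is a minimal HiveQL sketch. The table names, columns, bucket count, and location are hypothetical, and the ACID DDL reflects Hive 1.x-era requirements (ORC storage, bucketing, transactional=true):

```sql
-- Assumes the cluster already has hive.support.concurrency=true and
-- hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager.

-- Managed (internal) table set up for ACID: ORC, bucketed, transactional.
-- UPDATE/DELETE work here, and compaction rewrites the delta files.
CREATE TABLE customers_acid (
  id INT,
  name STRING
)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');

UPDATE customers_acid SET name = 'Acme' WHERE id = 1;  -- allowed

-- External table: Hive only tracks the schema; dropping the table
-- leaves the files in place, and ACID operations are not supported.
CREATE EXTERNAL TABLE customers_ext (
  id INT,
  name STRING
)
STORED AS ORC
LOCATION '/data/customers';

-- UPDATE customers_ext ... would fail: external tables are not transactional.
```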
08-09-2016
12:55 PM
1 Kudo
Hi @devers, ACID does carry some performance considerations, especially if you have a high volume of inserts and deletes. Performance will slowly degrade over time until a compaction runs; after compaction, performance normalizes. Based on your use case, you'll want to adjust the compaction frequency to find your unique performance sweet spot. https://community.hortonworks.com/questions/15095/hive-compaction-for-acid-transactions.html
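As a rough sketch of what that tuning looks like, assuming a hypothetical ACID table named events; the threshold properties in the comments are metastore-side hive-site.xml settings, not per-session settings:

```sql
-- Trigger a compaction manually rather than waiting for the automatic
-- initiator (useful after a heavy burst of inserts/updates/deletes).
ALTER TABLE events COMPACT 'minor';  -- merge delta files together
ALTER TABLE events COMPACT 'major';  -- rewrite base + deltas into a new base

-- Watch compaction progress and history.
SHOW COMPACTIONS;

-- Automatic compaction frequency is governed by metastore-side settings
-- in hive-site.xml (shown as comments since they are not SET per session):
--   hive.compactor.initiator.on        = true  (enable auto-compaction)
--   hive.compactor.delta.num.threshold = 10    (deltas before a minor compaction)
--   hive.compactor.delta.pct.threshold = 0.1   (delta/base size ratio before a major)
```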