Member since: 07-31-2019
Posts: 346
Kudos Received: 259
Solutions: 62
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 1625 | 08-22-2018 06:02 PM |
| | 876 | 03-26-2018 11:48 AM |
| | 2131 | 03-15-2018 01:25 PM |
| | 3548 | 03-01-2018 08:13 PM |
| | 752 | 02-20-2018 01:05 PM |
07-08-2019
07:28 PM
@Baris Akgun This is a feature of Hive 3. The integration between Spark and Hive in Hive 3 is solely through the connector. There are a number of feature-related reasons why this is the case.
01-30-2019
05:41 PM
@Nethaji R You can use a single Sqoop command (import-all-tables). https://data-flair.training/blogs/sqoop-import-all-tables/
01-29-2019
03:56 PM
Hi @Misha Beek. DAS is definitely the best tool for what you are looking for. Also, HDP 3.1 has a sys database and an information schema you can query to get some of the results you need. Finally, if your queries are not showing up, it may be a security issue.
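For example, here is a rough sketch of querying the information schema in HDP 3.1; it assumes the information_schema database has been populated on your cluster and uses the standard INFORMATION_SCHEMA column names:

```sql
-- Minimal sketch: list the tables visible to the current user.
-- Assumes Hive 3's information_schema database is available.
SELECT table_schema,
       table_name,
       table_type
FROM   information_schema.tables
ORDER  BY table_schema, table_name;
```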
01-24-2019
06:43 PM
@Abhay Kasturia There is no workaround. Merge requires ACID and ACID is only supported for Hive managed tables.
01-17-2019
03:15 PM
@Abhishek Gupta I'm seeing a permission denied error in the logs. /hadoop/yarn/local/usercache/hive/appcache/application_1547703090681_0001/container_e27_1547703090681_0001_01_000004/app/install//bin/runLlapDaemon.sh: Permission denied
01-08-2019
09:10 PM
@Shak All LLAP queries will go through the default LLAP queue. You will see this in the YARN view or the Tez view. Make sure you are connecting to the correct HiveServer2 or, if you are using the Ambari Hive View, that you have configured the view for interactive queries.
12-07-2018
02:39 PM
Hi @Mauro Ruffino. DAS is part of the recent Sandbox 3.0.1 release. The Hive tutorial has been updated with the new steps. https://hortonworks.com/tutorial/how-to-process-data-with-apache-hive/
11-19-2018
05:45 PM
1 Kudo
@Anurag Mishra Hive is used as a traditional data warehouse. It provides scalability, in-memory processing, ACID support for CDC, and full TPC-DS and ANSI SQL compliance. Hive is a great tool for BI analytics and visualization with reporting tools such as Tableau. Hive is not a tool for machine learning, real-time streaming, complex event processing, or simple event processing (though it can be a key component in all of those architectures). There are other tools in the stack that provide much better functionality for those specific use cases.
08-22-2018
06:02 PM
1 Kudo
Hi @Leonardo Araujo. Enabling ACID will not automatically convert existing tables to transactional tables. You will need to manually write (or script) new CREATE TABLE statements with the transactional table property and then insert data into those tables. Be aware of the existing requirements for transactional tables, such as ORC format and bucketing. In HDP 3.0, all Hive managed tables are ACID by default.
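For example, a minimal sketch of the conversion (the table and column names here are made up for illustration) might look like this on HDP 2.x, where ORC and bucketing are still required:

```sql
-- Hypothetical example: create a new transactional table and copy the
-- data from the existing (non-ACID) table into it.
CREATE TABLE sales_acid (
  id        BIGINT,
  amount    DECIMAL(10,2),
  sale_date DATE
)
CLUSTERED BY (id) INTO 8 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

INSERT INTO sales_acid
SELECT id, amount, sale_date FROM sales_legacy;
```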
07-30-2018
12:28 PM
@Sanaz Janbakhsh Hive CLI is deprecated; we suggest using Beeline. There is work being done on a more robust SQL IDE that provides significantly more functionality than the Hive View.
07-30-2018
12:08 PM
3 Kudos
Hi @Anurag Mishra. HDI is used for ephemeral clusters based on a finite set of services. Its primary purpose is to quickly set up the services, run a workload, and then bring the cluster down. HDI was not designed to handle long-running workloads or a production data lake architecture. Finally, HDI is not configurable; you only have the features provided in the images. With HDI, security is an additional cost: you will need to leverage both Ranger and Azure Active Directory. If you would like more control and more of a production-ready environment, I'd suggest running HDP as IaaS (Infrastructure as a Service). This can be quickly and easily provisioned using Cloudbreak: https://hortonworks.com/open-source/cloudbreak/ Hope this helps.
07-10-2018
01:45 PM
@tauqeer khan I'd recommend upgrading to 2.6 prior to using LLAP in production.
07-05-2018
04:20 PM
Hi @Vinay Khandelwal, were you able to find a solution to this problem? This case is a bit dated, but I've recently seen others with similar issues, and we are still trying to track down a resolution.
06-28-2018
08:35 PM
Check the firewall settings between subnets. The firewall may not be allowing the connection or timing out the connection.
05-19-2018
12:27 PM
Hi @Uday Allu. Verify that you are trying to ssh to the public IP. You also may need to verify port 22 is open. You can do this by checking the networking settings for your Azure account.
05-15-2018
12:31 PM
Hi @Nick Xu. _col0 is the first column in the table definition. You can find out what it is by running "DESCRIBE <tablename>"
05-14-2018
11:42 PM
Hi @Eric Lucas. Use DAS (direct-attached storage). Hadoop is effectively a DoS attack against a NAS or SAN. You can reference the VMware documentation for more info: https://www.vmware.com/content/dam/digitalmarketing/vmware/en/pdf/products/vsphere/vmware-hadoop-deployment-guide.pdf. Always follow the best-practice recommendations. Virtualizing masters and edge nodes (and dev) is OK, but be cautious with data nodes.
05-09-2018
03:04 PM
Hi @Shesh Kumar. It's interpreting the table name as an alias rather than a table reference. You can execute "select count(*) foobar" and it will come back with 1 as a result.
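To illustrate with a hypothetical table name, compare the two statements below; without a FROM clause the identifier becomes a column alias and the count runs over a single implicit row:

```sql
-- 'foobar' is parsed as an alias for the count(*) column, not as a table:
SELECT count(*) foobar;        -- returns 1, in a column named foobar

-- With FROM, foobar is the table actually being counted:
SELECT count(*) FROM foobar;   -- returns the real row count
```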
05-04-2018
02:10 PM
Hi @Venkat Gali, based on the process you described, I'd recommend Sqoop for general data loading from an RDBMS. You'll need to use a third-party solution like Attunity (or Oracle's GoldenGate) for more real-time data loads. Once the data is loaded, your fastest transformation workloads can be handled by Spark, Pig, Hive LLAP, or a combination of all of them. You may also want to look at HPL/SQL, but it's new on the scene and not fully baked into the platform. I hope this helps get you started.
03-28-2018
02:25 PM
1 Kudo
Hi @Mushtaq Rizvi, that sounds like a creative and good idea. I'm glad you are working something out that others can learn from. Thanks for posting!
03-28-2018
01:59 PM
1 Kudo
Hi @Mushtaq Rizvi, thinking out loud: if you are looking at the Hive metastore and it's running on Oracle, MySQL, or MariaDB, I suppose you could create standard triggers to notify you when something changes. I know this can be done in SQL Server, but I haven't explored the other RDBMS options. Be careful about how this would affect performance, depending on the rate of change. I'm not aware of a solution native to Hive. Hive does not support triggers, though there may be some better options once HPL/SQL is introduced into Hive. Please update this post if you find another solution.
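As a very rough sketch only (the metastore schema varies between Hive versions, so the TBLS table and its columns should be verified against your deployment, and the audit table is something you would create yourself), a MySQL-side trigger could look something like this:

```sql
-- Hypothetical MySQL trigger on the Hive metastore backing database.
-- Assumes TBLS has TBL_ID and TBL_NAME columns; check your schema version.
CREATE TABLE metastore_audit (
  audit_id   INT AUTO_INCREMENT PRIMARY KEY,
  tbl_id     BIGINT,
  tbl_name   VARCHAR(256),
  changed_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

DELIMITER //
CREATE TRIGGER tbls_after_insert
AFTER INSERT ON TBLS
FOR EACH ROW
BEGIN
  -- Record every new table registered in the metastore.
  INSERT INTO metastore_audit (tbl_id, tbl_name)
  VALUES (NEW.TBL_ID, NEW.TBL_NAME);
END//
DELIMITER ;
```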
03-28-2018
01:49 PM
Hi @Sebastien F, are you referring to sampling data https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Sampling? I think I need a little more clarification in order to better help.
03-27-2018
05:38 PM
Hi @Vinod Kumar. I'll see if I can find someone who will contact you.
03-27-2018
03:56 PM
8 Kudos
AD (After Druid)

In my opinion, Druid creates a new analytic service that I term the real-time EDW. In traditional EDWs, the process of getting from the EDW to an optimized OLAP cube could take hours depending on the size of the EDW. Having a cube build take six or more hours was not unusual for a lot of companies, and the process tended to be brittle and needed to be closely monitored.

In addition to real-time, Druid also facilitates long-term analytics, which effectively provides a Lambda architecture for your data warehouse. Data streams into Druid and is held in-memory for a configurable amount of time. While in-memory, the data can be queried and visualized. After a period of time the data is passed to long-term (historical) storage as segments on HDFS. These segments can be part of the same visualization as the real-time data.

As mentioned previously, all data in Druid contains a timestamp. The other data elements consist of the same properties as traditional EDWs: dimensions and measures. The timestamp simplifies the aggregation, and Druid is completely denormalized into a single table. Remember that dimensions are descriptions or attributes and measures are always additive numbers. Since this is always true, it is easy for Druid to infer from the data which elements are dimensional attributes and which are measures. For each timestamp duration Druid can, in real time, aggregate facts along all dimensional attributes. This makes Druid ideal for topN, timeseries, and group-by queries, with group-by being the least performant.

The challenges around Druid and other NoSQL-type technologies like MongoDB are in the visualization layer as well as the architectural and storage complexities. Druid stores JSON data, and JSON data can be difficult to manage and visualize in standard tools such as Tableau or Power BI. This is where the integration between Druid and Hive becomes most useful. There is a three-part series describing the integration:
- Druid and Hive Part 1: https://hortonworks.com/blog/apache-hive-druid-part-1-3/
- Druid and Hive Part 2: https://hortonworks.com/blog/sub-second-analytics-hive-druid/
- Druid and Hive Part 3: https://hortonworks.com/blog/connect-tableau-druid-hive/

The integration provides a single pane of glass against real-time pre-aggregated cubes, standard Hive tables, and historical OLAP data. More importantly, the data can be accessed through standard ODBC and JDBC visualization tools, as well as managed and secured through Ambari, Ranger, and Atlas. Druid provides an out-of-the-box Lambda architecture for time-series data and, coupled with Hive, the flexibility and ease of access associated with standard RDBMSs.
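As a sketch of what the integration looks like from the Hive side (the table and column names below are invented for illustration), a Druid-backed table can be created with the Druid storage handler that ships with Hive; the __time column maps to Druid's required timestamp:

```sql
-- Illustrative only: materialize a Hive query as a Druid datasource.
-- Column names are hypothetical; __time is Druid's required timestamp.
CREATE TABLE web_events_druid
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES ('druid.segment.granularity' = 'HOUR')
AS
SELECT CAST(event_time AS TIMESTAMP) AS `__time`,  -- timestamp
       page,                                       -- dimension
       country,                                    -- dimension
       clicks                                      -- additive measure
FROM   web_events_raw;
```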
03-27-2018
01:39 PM
8 Kudos
Druid is an OLAP solution for streaming event data as well as OLAP for long-term storage. All Druid data requires a timestamp. Druid's storage architecture is based on the timestamp, similar to how HBase stores by key.

Following are some key benefits of Druid:
- Real-time EDW on event data (time-series)
- Long-term storage leveraging HDFS
- High availability
- Extremely performant querying over large data sets
- Aggregation and indexing
- High level of data compression
- Hive integration

Druid provides a specific solution for specific problems that could not be handled by any other technology. With that being said, there are instances where Druid may not be a good fit:
- Data without a timestamp
- No need for real-time streaming
- Normalized (transactional) data (no joins in Druid)
- Small data sets
- No need for aggregating measures
- Non-BI queries like Spark or streaming lookups

Why Druid?

BD (Before Druid)

In traditional EDWs, data is broken into dimension tables and fact tables. Dimensions describe an object. For example, a product dimension will have colors, sizes, names, and other descriptors of a product. Dimensions are always descriptors of something, whether it is a product, a store, or something that is part of every EDW: date. In addition to dimensions, EDWs have facts, or measures. Measures are always numbers that can be added. For example, the number 10 can be a measure but an average cannot: you can add 10 to another number, but adding two averages does not make numerical sense.

The reason for dimensions and facts is two-fold. First, it is a means to denormalize the data and reduce joins; most EDWs are architected so that you will not need more than two joins to get any answer. Second, dimensions and facts map easily to business questions (see Agile Data Warehouse Design in the reference section). For example, take the following question: "How many of product x were purchased last month in store y?" We can dissect this sentence in the following way: product, month, and store are all dimensions, while the question "how many" is the fact, or measure. For that single question you can begin building your star schema (a SQL sketch follows at the end of this post):

Figure 1: Star Schema

The fact table will have a single row for each unique product sold in a particular store for a particular time frame. The difference between an EDW and OLAP is that an OLAP system will pre-aggregate this answer. Prior to the query, you run a process that anticipates this question and adds up all the sales totals for all the products for all time ranges. This is fundamentally why, in traditional EDW development, all possible questions needed to be fleshed out prior to building the schemas. The questions being asked define how the model is designed. This makes traditional EDW development extremely difficult, prone to errors, and expensive. Interviewing LOBs to find what questions they may ask the system or, more likely, looking at existing reports and trying to reproduce the data in an EDW design was only the first step.

Once the EDW was built, you still had to work on what is called the "semantic layer". This is the point where you instruct the OLAP tool how to aggregate the data. Tools like SQL Server Analysis Services (SSAS) are complicated and require a deep understanding of OLAP concepts. They are based on the Kimball methodology and therefore, to some extent, require the schema to look as much like a star schema as possible.

Figure 2: SSAS

In these tools, the first thing you need to do is define hierarchies. The easiest hierarchy to define is date. Date always follows the pattern year, month, day, hour, seconds. Other hierarchies include geography: country, state, county, city, zip code. Hierarchies are important in OLAP because they describe how the user will drill through the data and how the data will be aggregated at each level of the hierarchy. The semantic layer is also where you define what the analyst will actually see in their visualization tools. For example, exposing an EDW surrogate key would only confuse the analyst. In the Hadoop space, the semantic layer is handled by vendors and software like Jethrodata, AtScale, Kyvos, and Kylin (open source).
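To make the product/store/date example above concrete, here is a minimal star-schema sketch (all table and column names are invented for illustration); the fact table holds one additive measure per product, store, and date:

```sql
-- Illustrative star schema for "how many of product x were purchased
-- last month in store y". Table and column names are hypothetical.
CREATE TABLE dim_product (product_key INT, product_name STRING, color STRING, size STRING);
CREATE TABLE dim_store   (store_key INT, store_name STRING, city STRING);
CREATE TABLE dim_date    (date_key INT, calendar_date DATE, year INT, month INT);

CREATE TABLE fact_sales (
  product_key   INT,   -- FK to dim_product
  store_key     INT,   -- FK to dim_store
  date_key      INT,   -- FK to dim_date
  quantity_sold INT    -- additive measure
);

-- The business question maps to a join-and-aggregate over the schema:
SELECT SUM(f.quantity_sold) AS units_sold
FROM   fact_sales f
JOIN   dim_product p ON f.product_key = p.product_key
JOIN   dim_store   s ON f.store_key   = s.store_key
JOIN   dim_date    d ON f.date_key    = d.date_key
WHERE  p.product_name = 'x'
AND    s.store_name   = 'y'
AND    d.year = 2018 AND d.month = 2;
```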
Labels: Design & Architecture, druid, FAQ, rdbms, use-cases
03-26-2018
11:48 AM
1 Kudo
Hi @Daniela Mohan. HDP 2.6 ships with both Hive 1.2.1 and 2.1.0. Due to packaging and testing cycles, HDP will be slightly behind the official Apache release: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.4/bk_release-notes/content/comp_versions.html. This is in order to provide stability and interoperability with other Apache software. In 2.6 you will need to enable Hive interactive (LLAP) to use 2.1.0.
03-15-2018
02:48 PM
Hi @Nde Gerald Awa, sorry to hear no one has contacted you. You should have received an email notification. I reached out internally to our training team. You should be hearing from someone soon.
03-15-2018
01:27 PM
1 Kudo
Additionally, the Sqoop/merge process is easily automated using Workflow Manager.
03-15-2018
01:25 PM
1 Kudo
Hi @Timothy Spann, the recommended approach is Attunity -> Kafka -> NiFi -> Hive -> Merge. If you want 100% open source, then Sqoop the data to a staging area and run a merge to get the deltas.
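For the open-source path, here is a hedged sketch of the merge step once Sqoop has loaded the deltas into a staging table (the table and column names are assumptions):

```sql
-- Illustrative MERGE from a Sqoop-loaded staging table into an ACID
-- target table; op_type marks deletes captured from the source system.
MERGE INTO customer_target t
USING customer_staging s
ON t.customer_id = s.customer_id
WHEN MATCHED AND s.op_type = 'D' THEN DELETE
WHEN MATCHED THEN UPDATE SET name = s.name, email = s.email
WHEN NOT MATCHED THEN INSERT VALUES (s.customer_id, s.name, s.email);
```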