About sluangsay

sluangsay · 05-13-2016

Hive is often described as the "Hadoop Data Warehouse", so yes, you can create fact and dimension tables in it. Here is a document that gives an introduction to the subject: http://www.ibm.com/developerworks/library/bd-hivewarehouse/

If you are new to Hive, I also recommend starting your journey by downloading the Hortonworks Sandbox and going through the tutorials we have:
http://hortonworks.com/hadoop-tutorial/loading-data-into-the-hortonworks-sandbox/
http://hortonworks.com/apache/hive/#tutorials

sluangsay · 05-12-2016

My 2 cents to complement Marco Gaido's answer. Doing data validation entirely in Hive is not uncommon. Some companies use Hive as an ETL solution the same way they used to with traditional databases:

1) Load the data into a first ("staging") table without doing any validation. The easiest way in your case is a text-format table where all the columns are defined as "string", so that no value is turned into NULL by a failed type conversion when you select from it.
2) Create a second table (ideally in ORC format) with the types correctly enforced (int, date, string...).
3) Use an HQL select to fill the second table from the first one. Some basic transformations may be needed to put the values into the format the target types expect. If a value does not pass the validation, that field is marked as NULL or with some other "ERROR" token. In that query, you might have to write some complex conditions using regexes, UDFs, etc.
4) Use another query to extract all the valid rows from the second table and insert them into a third table (the production table). All the rows containing NULL or ERROR go into a fourth table (the reporting table).

Note: there may be ways to merge steps 3 and 4 to reduce the number of queries, using Hive multi-inserts.
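The validation flow described in this post can be sketched in HiveQL. The snippet below is only an illustration under assumptions: the table names (staging_orders, prod_orders, rejected_orders), the three columns and the comma delimiter are all hypothetical, and steps 3 and 4 are merged into one multi-insert, so the typed values are computed in a subquery instead of being stored in an intermediate table. It is wrapped in a small shell script that writes the statements to a file; running that file with hive -f (or beeline -f) is left to you.

```shell
#!/bin/sh
# Sketch only: all table and column names below are hypothetical.
hql_file="${TMPDIR:-/tmp}/validate_orders.hql"

# Quoted heredoc delimiter: the shell leaves the HiveQL untouched.
cat > "$hql_file" <<'HQL'
-- Step 1: staging table, every column a STRING, no validation on load.
CREATE TABLE IF NOT EXISTS staging_orders (
  id STRING, amount STRING, order_date STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Step 2: production table with the real types, stored as ORC.
CREATE TABLE IF NOT EXISTS prod_orders (
  id INT, amount DECIMAL(10,2), order_date DATE)
STORED AS ORC;

-- Reporting table: keeps the raw strings of the rejected rows.
CREATE TABLE IF NOT EXISTS rejected_orders (
  id STRING, amount STRING, order_date STRING)
STORED AS ORC;

-- Steps 3 and 4 merged: CAST yields NULL when a value does not fit the type,
-- and a single multi-insert statement feeds both target tables.
FROM (
  SELECT id AS raw_id, amount AS raw_amount, order_date AS raw_date,
         CAST(id AS INT)               AS id_ok,
         CAST(amount AS DECIMAL(10,2)) AS amount_ok,
         CAST(order_date AS DATE)      AS date_ok
  FROM staging_orders
) v
INSERT OVERWRITE TABLE prod_orders
  SELECT v.id_ok, v.amount_ok, v.date_ok
  WHERE v.id_ok IS NOT NULL AND v.amount_ok IS NOT NULL AND v.date_ok IS NOT NULL
INSERT OVERWRITE TABLE rejected_orders
  SELECT v.raw_id, v.raw_amount, v.raw_date
  WHERE v.id_ok IS NULL OR v.amount_ok IS NULL OR v.date_ok IS NULL;
HQL

echo "HiveQL written to $hql_file"
```

With this simple rule, a row whose source field is legitimately empty is also rejected; real validation would probably need the regex or UDF conditions mentioned in step 3.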

sluangsay · 03-30-2016

Is your cluster kerberized?

sluangsay · 03-29-2016

Whether Tez is faster than MR generally depends on the kind of queries you run, but faster is what I have seen in several customer projects. Could you tell us which version of HDP you use? I acknowledge that Hive views are not as intuitive as the MR web UI at the beginning, but they do not seem that buggy to me, and you can still send the logs as a URL to people on your team. To diagnose the bottlenecks, I would recommend trying Swimlanes with Tez: https://github.com/apache/tez/tree/master/tez-tools/swimlanes This is a graphical tool that helps you understand which container or vertex is the bottleneck in your query.

sluangsay · 03-29-2016

Strange, I have usually seen the opposite pattern: things failing with Tez but working with MR, and the Tez error being fixed after upgrading to the latest version of HDP. Could you tell us why you don't want to use Tez? Tez is usually much faster than MR.

sluangsay · 03-25-2016

You're using a window function and a GROUP BY on the same column, and that seems to be the cause of your error. Try this:

    SELECT stackdata_clean.owneruserid, SUM(stackdata_clean.score) AS sumscore
    FROM stackdata_clean
    GROUP BY stackdata_clean.owneruserid
    ORDER BY sumscore DESC
    LIMIT 10;

sluangsay · 03-18-2016

Introduction

This wiki page describes the script sqoopTables.sh (GitHub: https://github.com/sourygnahtw/hadoopUtils/blob/master/scripts/sqoop/sqoopTables.sh). The purpose of the script is to sqoop many tables in parallel and to store them in Hive.

Motivations

In a previous project, I needed to download around 400 tables (out of 500) from a SQL Server database. Most of the tables were quite small (less than a few MB), which means that the overhead of fetching the metadata in Sqoop (establishing the connections, getting the DDL of the table...) is very large compared to the time spent on the real job (downloading the data). Getting the 400 tables could take around 6 hours.

To speed up the download, Sqoop usually "splits" a table into 4 parts and fetches them as 4 parallel streams. In my project, that approach was not working:

- Many tables had no clear primary key, or the distribution of the keys was not uniform. Trying to find better split columns for 400 tables was a waste of time...
- Splitting a table into several parts for downloading is sometimes not optimal for the database (how are those rows stored on disk?).
- Most of the tables hold little data, so speeding up the download of the data brings little improvement. What we need to speed up is the fetching of the metadata.

So the idea of the script is, instead of parallelising the download of 1 single table into several streams, to download several tables at the same time, each table having only 1 single download stream. With that approach, I was able to download the 400 tables in less than 1 hour.

Examples of execution

The easiest way to execute the script is (take care to first configure the SQL driver, user and password; see the Configuration section):

    ./sqoopTables.sh fileWithListOfTable

Here fileWithListOfTable is a file that lists all the tables we want to Sqoop. For instance, if we want to Sqoop the 6 tables table1, table2, table3, table4, table5, table6, then the file must contain exactly 6 lines, 1 for each table:

    table1
    table2
    table3
    table4
    table5
    table6

The script first launches 4 Sqoop processes to download the first 4 tables. When one of those processes finishes, the script launches another Sqoop process for "table5", so that there are always 4 active Sqoop processes until there are no more tables to sqoop.

You can also use some options to tune the script's behaviour. For instance:

    ./sqoopTables.sh -d myDatabase2 -H myHiveDatabase3 -p 6 -q etl listOfTables

In this case, we change the names of the relational and Hive databases. We also change the parallelism to have 6 Sqoop processes working at the same time, and we choose the "etl" Yarn queue instead of the default one.

Configuration

The default configuration can be found at the beginning of the script. You will have to change the default values or override them on the command line. Here are the variables that can be modified:

    origServer=myRelationalDatabase.example.com            # The FQDN of the relational database you want to fetch (option: -o)
    origDatabase=myDatabase                                # The name of the database that contains the tables to fetch (option: -d)
    hiveDatabase=myHiveDatabase                            # The name of the Hive database that will receive the fetched tables (option: -H)
    parallelism=4                                          # The number of tables (sqoop processes) you want to download at the same time (option: -p)
    queue=default                                          # The queue used in Yarn (option: -q)
    baseDir=/tmp/sqoopTables                               # Base directory where the log files will be stored (option: -b)
    dirJavaGeneratedCode=/tmp/doSqoopTable-`id -u`-tmp     # Directory for the java code generated by Sqoop (option: -c)
    targetTmpHdfsDir=/tmp/doSqoopTable-`id -u`-tmp/$$      # Temporary directory in HDFS to store downloaded data before moving it to Hive

Important! This script targets SQL Server. For another database, search for the "sqoop import" line (in the middle of the script) and change the beginning of the connection URL appropriately. Take care also to change, in that same line, the user and password needed to connect to the relational database.

A few more notes

Logging

The script prints on the standard output the name of each table it has started to download, so that you can easily see how much of the work defined in the "listOfTables" file has been accomplished. It also stores more information in the logging directory (by default: /tmp/sqoopTables). For each parallelisation stream (4 streams by default), 2 kinds of files are available:

- process_N-summary.log: an overview log. After executing the script, you should always have a look at these files. You may see the Hive stats (for instance: "[numFiles=1, totalSize=1263]"). But you must also make sure that there is no java error caused by a failure while sqooping some table (the standard output does not show errors, which is why you must look at these files).
- process_N-raw.log: the whole standard output of each Sqoop execution for all the tables downloaded by this stream. After it finishes downloading a table, the script "tails" the last 6 lines of this file and writes them to the process_N-summary.log file. That is why the "summary log" is a quick way to detect errors. The "raw log" gives you enough details to debug any issue that might happen.

Parallelism

By default, the script uses 4 streams, meaning that 4 tables are sqooped at the same time (thus, 4 connections to the relational database are established). This number was chosen because Sqoop uses 4 as a default. However, it is quite conservative and you can easily set a higher degree of parallelism (even more so if your tables are quite small). In my SQL Server project for instance, I set that number to 12. For another project, on Teradata, I used 54 (a limit that came more from the number of containers available on the Hadoop side than from Teradata). To avoid wasting containers, the script makes use of ubertask.

Mapping of the name of the tables

The names of the tables in Hive might be a bit different from the names in the relational database. In the middle of the doSqoop() function, there is a (commented) example showing how to establish some mappings. For instance, you might want to replace the prefix "raw_" with "ro_" in all the table names that start with it.
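The scheduling idea behind the script (several tables at the same time, one download stream per table) can be illustrated with a short shell sketch. This is not the code of sqoopTables.sh: it swaps in xargs -P (available in GNU and BSD xargs, though not required by POSIX) to keep a fixed number of workers busy, and fetch_table.sh is a hypothetical stand-in that only sleeps where the real script would run one "sqoop import". The directory and file names are assumptions as well.

```shell
#!/bin/sh
# Sketch only: directory, file and worker names are hypothetical.
parallelism=4
work_dir="${TMPDIR:-/tmp}/sqoop_pool_demo"
mkdir -p "$work_dir"
list_file="$work_dir/listOfTables"
done_log="$work_dir/done.log"
: > "$done_log"    # start with an empty completion log

# One table name per line, as in the list file described in the article.
printf '%s\n' table1 table2 table3 table4 table5 table6 > "$list_file"

# Stand-in worker: $1 = table name, $2 = completion log.
cat > "$work_dir/fetch_table.sh" <<'EOF'
echo "Starting $1"
sleep 1              # placeholder for one single-stream download
echo "$1" >> "$2"
EOF

# xargs starts a new worker as soon as one of the running ones exits,
# so at most $parallelism tables are being fetched at any time.
xargs -P "$parallelism" -I {} sh "$work_dir/fetch_table.sh" {} "$done_log" < "$list_file"

echo "Fetched $(wc -l < "$done_log" | tr -d ' ') tables"
```

With 6 tables and a parallelism of 4, the first 4 workers start immediately and the last 2 start as slots free up, which mirrors the "table5" behaviour described in the examples.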

sluangsay · 03-18-2016

@Robin Dong As mentioned by Ancil, you might want a script that runs the Sqoop downloads in parallel. You also need to control the degree of parallelism quite carefully, above all if you want to avoid the typical "No more spool space in..." error. Here's a script to do that: https://community.hortonworks.com/articles/23602/sqoop-fetching-lot-of-tables-in-parallel.html

Another problem I saw with Teradata is that some data types are not supported when you try to insert the data directly into Hive from Sqoop. So the solution I took was the traditional one:

1) Sqoop to HDFS.
2) Build external tables on top of the downloaded files.
3) Create the ORC tables and insert the data from the external tables.

sluangsay · 02-08-2016

If you run the 2 subqueries of your join independently (the queries for price.tkonkurent and price.toprice), do they work? I think the one that is failing in your overall query is tkonkurent, but let's make sure by executing both queries.

sluangsay · 02-08-2016

How much memory in total is dedicated to YARN in your cluster (the Yarn WebUI shows it)? Could you try reducing the number of reducers (you have 366 + 174 on the main reduce vertices), for instance by tuning the variable hive.exec.reducers.bytes.per.reducer? If you only have 75 mappers as input, I am not sure you need that many reducers.
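To get a feel for the effect of that variable, here is a rough back-of-the-envelope calculation. It assumes Hive's basic size-based estimate (input size divided by hive.exec.reducers.bytes.per.reducer, rounded up and capped by hive.exec.reducers.max); on Tez the final count also depends on statistics and on settings such as auto reducer parallelism, so treat the result as an order of magnitude only. The input size and the cap below are made-up example values, not numbers from your cluster.

```shell
#!/bin/sh
# Sketch only: both numbers below are hypothetical example values.
input_bytes=10737418240      # 10 GiB of input to the reduce stage
max_reducers=1009            # stand-in for hive.exec.reducers.max

estimate_reducers() {
    # $1 = bytes per reducer; prints ceil(input / $1), clamped to [1, max]
    n=$(( (input_bytes + $1 - 1) / $1 ))
    [ "$n" -lt 1 ] && n=1
    [ "$n" -gt "$max_reducers" ] && n=$max_reducers
    echo "$n"
}

small=$(estimate_reducers 67108864)      # 64 MiB per reducer
large=$(estimate_reducers 1073741824)    # 1 GiB per reducer
echo "64 MiB per reducer -> about $small reducers"
echo "1 GiB per reducer  -> about $large reducers"
```

In a Hive session the value itself is changed with a statement such as: set hive.exec.reducers.bytes.per.reducer=1073741824;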

Member Since	09-29-2015 07:44 AM
Last Visited	05-30-2016 01:32 PM
Posts	67
Kudos received	45