Member since: 09-29-2015
Posts: 67
Kudos Received: 45
Solutions: 10
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 805 | 05-25-2016 10:24 AM |
| | 7399 | 05-19-2016 11:24 AM |
| | 5612 | 05-13-2016 10:09 AM |
| | 2074 | 05-13-2016 06:41 AM |
| | 5830 | 03-25-2016 09:15 AM |
05-30-2016
07:39 AM
Could you explain what command you executed or what program you wrote, or copy-paste part of the logs? It is very difficult to help with so little information.
05-25-2016
10:24 AM
1 Kudo
IMHO, you should avoid putting complex logic into home-developed shell scripts. Such scripts are fine for quick tests, but once you move into a PoC you need something less error-prone and more efficient (shell scripts launch many Java processes, which adds considerable overhead and latency). I recommend looking at ingestion tools such as Flume (http://flume.apache.org/FlumeUserGuide.html#spooling-directory-source) or NiFi (https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.standard.FetchFile/index.html). These tools already provide many features for ingesting files into your cluster and archiving them afterwards.
05-19-2016
11:24 AM
2 Kudos
This is not a problem at all. Hive is just telling you that it is running a "map-only" job. Usually, in MapReduce (Hive now prefers Tez over MapReduce, but let's stick with MapReduce here because it is easier to understand), a job has the following steps: Map -> Shuffle -> Reduce. The Map and Reduce steps are where the computation (in Hive: projections, aggregations, filtering...) happens.
Shuffle is just data travelling over the network from the nodes that ran the mappers to the nodes that run the reducers. So if the job can be executed as a "map-only" job, skipping the Shuffle and Reduce steps, all the better: it will generally be much faster and use fewer cluster resources (network, CPU, disk and memory). The query you are showing in this example is very simple, which is why Hive can turn it into a map-only job. To better understand how Hive queries are translated into MapReduce/Tez jobs, have a look at the EXPLAIN command: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Explain
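To make this concrete, here is a minimal sketch (the table and column names are invented for illustration): EXPLAIN on a simple projection and filter usually shows a single map stage, while adding an aggregation introduces a shuffle and a reduce stage.

```sql
-- Hypothetical table: web_logs(user_id STRING, status INT, bytes BIGINT)

-- Simple projection + filter: Hive can plan this as a map-only job.
EXPLAIN
SELECT user_id, bytes
FROM web_logs
WHERE status = 200;

-- Aggregation forces a shuffle and a reduce stage.
EXPLAIN
SELECT user_id, SUM(bytes) AS total_bytes
FROM web_logs
GROUP BY user_id;
```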
05-19-2016
10:23 AM
This doc just shows an example. Instead of using the principal "spark/blue1@EXAMPLE.COM", we could also consider using the principal "app1/blue1@EXAMPLE.COM" for one application, "app2/blue1@EXAMPLE.COM" for a second application, and so on.
05-19-2016
10:09 AM
You could use the same script to import the data from your views: Sqoop will fetch the data from your view and store it into Hive/HDFS. If you don't want to import the data but just want to create a view in Hive, then take the definition of your view in SQL Server (DDL) and create the same view in Hive (a few adaptations might be needed; check the documentation: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Create/Drop/AlterView ). Another recommendation: run the Sqoop commands in parallel. Otherwise, if you have many tables and you use "-m 1", it will take a lot of time. You can check the script I wrote for that: https://community.hortonworks.com/articles/23602/sqoop-fetching-lot-of-tables-in-parallel.html
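As a rough illustration of the second option (the view definition and names below are invented, not taken from your database), a SQL Server view usually translates almost directly into Hive:

```sql
-- Hypothetical SQL Server view:
--   CREATE VIEW dbo.v_active_customers AS
--   SELECT customer_id, name FROM dbo.customers WHERE active = 1;
-- Hive equivalent, assuming the base table was already imported with Sqoop:
CREATE VIEW IF NOT EXISTS v_active_customers AS
SELECT customer_id, name
FROM customers
WHERE active = 1;
```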
05-19-2016
07:09 AM
Other very good ways to load data into HDFS are Flume and NiFi. "hadoop fs -put" is fine, but it has limitations and lacks flexibility, which can make it difficult to use in a production environment. If you look at the documentation of the Flume HDFS sink for instance ( http://flume.apache.org/FlumeUserGuide.html#hdfs-sink ), you'll see that Flume lets you define how to rotate the files, how to name them, etc. Other options can be defined for the source (your local text files) or for the channel. "hadoop fs -put" is more basic and doesn't offer those possibilities.
05-13-2016
11:26 AM
Looking at the mismatch: you have 7909097 rows in your source table and 15818167 rows in your Hive table. The latter is nearly double the former, which tends to confirm the warning Sqoop gave you: "If your database sorts in a case-insensitive order, this may result in a partial import or duplicate records". And as you said, when you use an int column you don't get that kind of duplication. (By the way, you don't need to run a "select count(*)" in Hive to know the number of rows. You can just check the counters "Map input records=15818167" and "Map output records=15818167" in your Sqoop job; that gives you as much or more information and will help you debug.) As a test, you could try "--num-mappers 1", which should remove the duplication. You will sacrifice speed, but that may not be an issue if you sqoop many tables in parallel, as I mentioned in my other post. You could also try choosing a split column with a numeric type, so that you don't suffer duplication due to case-insensitive sorting.
05-13-2016
10:13 AM
1 Kudo
If you have many tables to sqoop, choosing a specific split column for each of them can be cumbersome. In that case, you might consider using just 1 mapper per table and launching many Sqoop processes in parallel. You can refer to this article for that: https://community.hortonworks.com/articles/23602/sqoop-fetching-lot-of-tables-in-parallel.html
05-13-2016
10:09 AM
1 Kudo
Try having a look at the "--split-by" option (see the documentation: https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_syntax). It allows you to choose exactly which column to split on, instead of letting Sqoop choose a default one ("ChassiNo" in your example), which might not be optimal. When choosing the column, Sqoop recommends not taking a string column (which can be problematic if your SQL Server database sorts that column in a case-insensitive way). You may also choose a column that is not a primary key. The important things are that the column has an even distribution (you want to avoid skewed Sqoop fetch tasks) and that each split maps to a contiguous disk location in your source table (try to get sequential reads).
05-13-2016
09:42 AM
Those are a lot of (broad) questions! In the first place I would recommend looking at the "Hive performance tuning" documentation on our website: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.2/bk_performance_tuning/content/ch_hive_architectural_overview.html I guess you could also find some answers on this forum: https://community.hortonworks.com/topics/Hive.html However, given the number of questions you have, I would recommend contacting Hortonworks professional services so that a consultant can help you with your specific implementation (there is no universal "holy grail" tuning; in the end, configurations and queries are optimised for specific use cases).
05-13-2016
06:41 AM
Hive is considered the "Hadoop data warehouse", so you can indeed create fact and dimension tables. Here is a doc giving an introduction to that: http://www.ibm.com/developerworks/library/bd-hivewarehouse/ If you are new to Hive, I also recommend starting your journey by downloading the Hortonworks Sandbox and looking at the tutorials we have: http://hortonworks.com/hadoop-tutorial/loading-data-into-the-hortonworks-sandbox/ http://hortonworks.com/apache/hive/#tutorials
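As a minimal illustration (a hypothetical star schema, not taken from the links above), a dimension table and a fact table in Hive could look like this:

```sql
-- Hypothetical dimension table, keyed by a surrogate key.
CREATE TABLE dim_customer (
  customer_key  BIGINT,
  customer_name STRING,
  country       STRING
)
STORED AS ORC;

-- Hypothetical fact table referencing the dimensions by key,
-- partitioned by date for typical warehouse-style pruning.
CREATE TABLE fact_sales (
  customer_key BIGINT,
  product_key  BIGINT,
  quantity     INT,
  amount       DECIMAL(10,2)
)
PARTITIONED BY (sale_date STRING)
STORED AS ORC;
```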
05-12-2016
07:09 AM
My 2 cents to complement Marco Gaido's answer a bit. Doing data validation entirely in Hive is not uncommon. Basically, some companies use Hive as an ETL solution the same way they used to with traditional databases:
1) Load the data into a first ("staging") table without doing any validation. The easiest way in your case would be a text-file-format table where all the columns are defined as string, so that you don't get NULL values when doing a select.
2) Create a second table (ideally in ORC format) with the types correctly enforced (int, date, string...).
3) Use an HQL SELECT to fill the second table from the first, applying some basic transformations to adapt the data to the corresponding types. When a row does not pass validation, the field is set to NULL or to some "ERROR" token. In that query you may have to write some complex expressions/conditions using regexes, UDFs, etc.
4) Have another query that extracts all the valid rows from the second table and inserts them into a third table (the production table), while the rows containing NULL or ERROR go into a fourth table (a reporting table).
Note: there are ways to merge steps 3 and 4 and reduce the number of queries by using Hive multi-inserts (see the sketch below).
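Here is a rough sketch of such a multi-insert (all table and column names are invented, the validation rules are only examples, and the target tables are assumed to already exist):

```sql
-- Read the staging table once, route valid rows to the production table
-- and invalid rows to a reporting table.
FROM staging_events s
INSERT INTO TABLE prod_events
  SELECT CAST(s.id AS INT), CAST(s.event_date AS DATE), s.payload
  WHERE s.id RLIKE '^[0-9]+$'
    AND CAST(s.event_date AS DATE) IS NOT NULL
INSERT INTO TABLE rejected_events
  SELECT s.id, s.event_date, s.payload
  WHERE NOT (s.id RLIKE '^[0-9]+$')
    OR CAST(s.event_date AS DATE) IS NULL;
```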
03-29-2016
01:51 PM
1 Kudo
I guess whether Tez is faster than MR generally depends on the kind of queries you have, but this is what I have seen in different customers' projects. Could you tell us which version of HDP you use? I acknowledge that the Hive views are not as intuitive as the MR web UI at first, but they do not seem that buggy to me, and you can still send the logs as a URL to people on your team. As for diagnosing bottlenecks, I recommend trying the Tez swimlanes tool: https://github.com/apache/tez/tree/master/tez-tools/swimlanes This is a graphical tool that helps you understand which container/vertex is the bottleneck in your query.
03-29-2016
01:18 PM
1 Kudo
Strange, I have usually seen the opposite pattern: things failing with Tez but working with MR, with the Tez error fixed after moving to the latest version of HDP. Could you tell us why you don't want to use Tez? Tez is usually much faster than MR.
03-25-2016
09:15 AM
1 Kudo
You're doing a window function and a GROUP BY on the same column, and that seems to be the cause of your error. Try this:
SELECT stackdata_clean.owneruserid,
       SUM(stackdata_clean.score) AS sumscore
FROM stackdata_clean
GROUP BY stackdata_clean.owneruserid
ORDER BY sumscore DESC
LIMIT 10;
03-18-2016
02:40 PM
7 Kudos
Introduction

This article describes the script sqoopTables.sh (GitHub: https://github.com/sourygnahtw/hadoopUtils/blob/master/scripts/sqoop/sqoopTables.sh). The purpose of the script is to sqoop many tables in parallel and store them in Hive.

Motivations

In a previous project, I needed to download around 400 tables (out of 500) of a SQL Server database. Most of the tables were quite small (less than a few MB), which means that the overhead of fetching the metadata in Sqoop (establishing the connections, getting the DDL of the table...) is very high compared to the time spent on the real job (downloading the data). Getting the 400 tables could take around 6 hours.

To speed up the download, Sqoop usually "splits" a table into 4 parts, to parallelise the download into 4 streams. In my project, that approach was not working:
- Many tables had no clear primary key, or the distribution of those keys was not uniform. Trying to find better split columns for 400 tables was a waste of time.
- Splitting a table into several parts for downloading is sometimes not optimal for the database (how are those rows stored on disk?).
- Most of the tables held little data, so speeding up the data transfer itself would not bring much improvement; what needed to be sped up was the fetching of the metadata.

So the idea of the script is, instead of parallelising the download of one single table into several streams, to download several tables at the same time, each table with a single download stream. With that approach, I was able to download the 400 tables in less than 1 hour.

Examples of execution

The easiest way to execute the script is (take care to first configure the SQL driver, user and password; see the next chapter):

./sqoopTables.sh fileWithListOfTable

with fileWithListOfTable being a file that lists all the tables we want to sqoop. For instance, if we want to sqoop the 6 tables table1, table2, table3, table4, table5 and table6, the file must contain exactly 6 lines, one per table:

table1
table2
table3
table4
table5
table6

The script will first launch 4 Sqoop processes to download the first 4 tables. When one of those processes finishes, the script launches another Sqoop process for "table5", so that there are always 4 active Sqoop processes until there are no more tables to sqoop.

You can also use some options to tune the script's behaviour. For instance:

./sqoopTables.sh -d myDatabase2 -H myHiveDatabase3 -p 6 -q etl listOfTables

In this case, we change the names of the relational and Hive databases, we raise the parallelism to 6 Sqoop processes working at the same time, and we choose the "etl" YARN queue instead of the default one.

Configuration

The default configuration can be found at the beginning of the script. You will have to change the default values or override them on the command line. Here are the variables that can be modified:

origServer=myRelationalDatabase.example.com # The FQDN of the relational database you want to fetch from (option: -o)
origDatabase=myDatabase # The name of the database that contains the tables to fetch (option: -d)
hiveDatabase=myHiveDatabase # The name of the Hive database that will receive the fetched tables (option: -H)
parallelism=4 # The number of tables (Sqoop processes) you want to download at the same time (option: -p)
queue=default # The queue used in YARN (option: -q)
baseDir=/tmp/sqoopTables # Base directory where the log files will be stored (option: -b)
dirJavaGeneratedCode=/tmp/doSqoopTable-`id -u`-tmp # Directory for the Java code generated by Sqoop (option: -c)
targetTmpHdfsDir=/tmp/doSqoopTable-`id -u`-tmp/$$ # Temporary directory in HDFS to store downloaded data before moving it to Hive

Important! This script targets SQL Server. Search for the "sqoop import" line (in the middle of the script) and change the header of the JDBC URL appropriately. Take care also to change, on that line, the user and password needed to connect to the relational database.

A few more notes

Logging

The script prints on standard output the name of each table it has started to download, so you can easily track how much of the work defined in the "listOfTables" file has been accomplished. It also stores more information in the logging directory (by default: /tmp/sqoopTables). For each parallelisation stream (4 streams by default), 2 kinds of files are available:
- process_N-summary.log: an overview log. After executing the script, you should always have a look at these files. You will see the Hive stats (for instance: "[numFiles=1, totalSize=1263]"), but you must also make sure there is no Java error caused by a failure while sqooping some tables (standard output does not show errors, which is why you must check these files).
- process_N-raw.log: the whole standard output of every Sqoop execution for all the tables downloaded by this stream. After finishing a table, the script "tails" the last 6 lines of this file and writes them into the process_N-summary.log file. That is why the summary log is a quick way to detect errors, while the raw log gives you enough detail to debug any issue that might happen.

Parallelism

By default, the script uses 4 streams, meaning that 4 tables are sqooped at the same time (and thus 4 connections to the relational database are established). This number was chosen because Sqoop uses 4 as a default. However, it is quite conservative and you can easily use a higher degree of parallelism (even more so if your tables are small). In my SQL Server project, for instance, I set that number to 12. For another Teradata project, I used 54 (more because of a limitation on containers on the Hadoop side than a limitation on the Teradata side). To avoid wasting containers, the script makes use of ubertasks.

Mapping of table names

The table names in Hive may be slightly different from the names in the relational database. In the middle of the doSqoop() function there is a (commented) example showing how to establish such mappings. For instance, you might want to replace the "raw_" prefix of all table names with "ro_".
Labels: Data Ingestion & Streaming, How-To/Tutorial, Sqoop
03-18-2016
11:06 AM
@Robin Dong As mentioned by Ancil, you might want a script that runs the Sqoop downloads in parallel, and you need to control quite carefully how high your parallelism is, above all if you want to avoid the typical "No more spool space in..." error. Here's a script to do that: https://community.hortonworks.com/articles/23602/sqoop-fetching-lot-of-tables-in-parallel.html Another problem I saw with Teradata is that some data types are not supported when you try to insert the data directly into Hive from Sqoop. So the solution I took was the traditional one: 1) sqoop to HDFS, 2) build external tables on top of the files, 3) create ORC tables and insert the data from the external tables (a sketch is shown below).
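A rough sketch of steps 2 and 3 (table names, columns and paths are invented):

```sql
-- Step 2: external table over the directory where Sqoop landed the files.
CREATE EXTERNAL TABLE staging_orders (
  order_id   BIGINT,
  order_date STRING,
  amount     DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/staging/orders';

-- Step 3: ORC table filled from the external table.
CREATE TABLE orders STORED AS ORC AS
SELECT order_id, order_date, amount
FROM staging_orders;
```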
02-08-2016
12:19 PM
If you run the 2 subqueries of your join independently (the queries on price.tkonkurent and price.toprice; I think the one failing in your overall query is tkonkurent, but let's make sure by executing both), do they work?
02-08-2016
08:29 AM
1 Kudo
What is the total amount of memory dedicated to YARN in your cluster (check the YARN web UI to find out)? Could you try reducing the number of reducers (you have 366 + 174 on the main reduce vertices), for instance by playing with the variable hive.exec.reducers.bytes.per.reducer? With only 75 mappers as input, I am not sure you need that many reducers.
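For reference, this is how those knobs are typically set in a Hive session (the values below are purely illustrative; tune them to your data volume):

```sql
-- Larger bytes-per-reducer => fewer reducers.
SET hive.exec.reducers.bytes.per.reducer=1073741824;   -- 1 GB per reducer

-- Optionally cap the total number of reducers for a test run.
SET hive.exec.reducers.max=100;
```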
01-28-2016
10:48 PM
Actually, I upgraded my Sandbox to the latest version of HDP. When I run "locate" on my Sandbox, I no longer find a reference to any spark-1.4.1 jar, only 1.5.2 jars.
01-28-2016
10:17 PM
Interesting, SPARK_HOME is not defined in that file on my side. I only have this comment: "# export SPARK_HOME # (required) When it is defined, load it instead of Zeppelin embedded Spark libraries". Would that explain why the Spark libraries embedded in the Zeppelin jar are used instead of the one defined in spark.yarn.jar?
01-28-2016
10:13 PM
I see the same behaviour on my Sandbox (HDP 2.3.4). This seems strange because the version numbers in spark.yarn.jar and in spark.home appear to be completely bypassed. If you look at the jar zeppelin-spark-0.6.0-incubating-SNAPSHOT.jar inside <ZEPPELIN-HOME>/interpreter/spark and extract the file META-INF/maven/org.apache.zeppelin/zeppelin-spark/pom.xml, you'll see this: <spark.version>1.4.1</spark.version>
12-03-2015
05:50 AM
1 Kudo
To use both S3 and HDFS for your Hive table, you could use an external table with partitions pointing to different locations (see the sketch below). Look for the section that starts at "An interesting benefit of this flexibility is that we can archive old data on inexpensive storage" in this link: Hive def guide. To automate this process you could use cron, but I guess Falcon should also work.
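A minimal sketch of the idea (table name, bucket and paths are invented):

```sql
-- One external table whose partitions live on different storage tiers.
CREATE EXTERNAL TABLE events (
  event_id BIGINT,
  payload  STRING
)
PARTITIONED BY (event_month STRING)
STORED AS ORC;

-- Recent partition kept on HDFS for fast access.
ALTER TABLE events ADD PARTITION (event_month='2015-12')
  LOCATION 'hdfs:///data/events/2015-12';

-- Older partition archived on S3 (cheaper storage).
ALTER TABLE events ADD PARTITION (event_month='2015-01')
  LOCATION 's3a://my-archive-bucket/events/2015-01';
```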
12-01-2015
02:34 PM
3 Kudos
A few months ago I asked a similar question and got this reply: https://issues.apache.org/jira/browse/HIVE-11937 So I don't think you can use the stats in Hive 0.14 for the kind of query you want to do; maybe in the next Hive version. A possible workaround would be to get the names of all the partitions in that table and have a script (in Python, Bash or a Java program) generate a query for each partition (see the sketch below). Not sure it works, but you might give it a try.
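A rough sketch of what the generated queries could look like (table and column names are invented):

```sql
-- List the partitions; the script parses this output.
SHOW PARTITIONS sales;   -- e.g. returns: ds=2015-11-01, ds=2015-11-02, ...

-- The script then emits one query per partition:
SELECT MAX(amount) FROM sales WHERE ds = '2015-11-01';
SELECT MAX(amount) FROM sales WHERE ds = '2015-11-02';
```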
11-18-2015
02:18 PM
1 Kudo
I confirm that DbVisualizer works fine with HiveServer2. We use it quite a lot in one big ETL-on-Hive project we have. Take care: the free version has some limitations (for instance, you can't have two of the same tab open at the same time), and depending on how you use the tool you might want the commercial version.
11-18-2015
12:26 AM
In my previous company we developed a rules engine/CEP based on Hadoop. I don't remember the exact reasons, but we discarded Drools (the other software on the market did not match our needs either). Hive was definitely not an option because its latency was too high (take those last two sentences with care: those design decisions were made 3 years ago, a lot has changed since, and you might reconsider those choices). The first implementation of the CEP was done with MapReduce and HBase (to maintain the states). The rules were loaded from a MySQL database and applied by the MapReduce job. Since we still had some latency (due to MR), we started moving the code to Spark Streaming, still keeping HBase as a backend. Using HBase coprocessors was also an idea; I can't say much because I left the company before seeing that change in production. The front-end was a graphical, web-based drag-and-drop interface, so it allowed the users to quickly implement the business logic without our help. I'm not sure my answer is exactly what you were looking for. If you find some good open-source CEP projects that suit you, please let me know; I'm still curious about it.
11-16-2015
08:32 PM
1 Kudo
In order to avoid, or at least reduce the risk of, the dangers mentioned above, a couple of recommendations:
- write the results to a temporary HDFS directory (on the same HDFS volume as the target directory);
- use a DFS command to move (instead of copy) the files to the target directory.
That way, the "move" operation is essentially atomic and the risk of race conditions is quite low.
11-11-2015
03:58 PM
1 Kudo
According to the documentation, the multiport_syslogtcp source is faster than syslogtcp. Does anybody have benchmarks that show this? Please share your experience.
Labels: Apache Flume
11-11-2015
03:46 PM
1 Kudo
Instead of spending time writing a new SerDe, wouldn't it be possible to use the following approach: 1) use a Regex SerDe (https://hive.apache.org/javadocs/r1.2.1/api/org/apache/hadoop/hive/serde2/RegexSerDe.html ) to load the 8 "key" columns and the last (string) dynamic column into a first temporary table; 2) with a CTAS, insert the data into an ORC table, using the str_to_map() UDF to transform the dynamic string column into a map (a sketch is shown below). This step would also give you a more performant storage format for your data.
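A rough sketch of step 2 (table and column names and the delimiters are invented; pairs of the form "k1=v1;k2=v2" are assumed in dynamic_col):

```sql
-- CTAS into ORC, turning the raw dynamic column into a map.
CREATE TABLE events_orc
STORED AS ORC
AS
SELECT
  key1, key2, key3, key4, key5, key6, key7, key8,
  str_to_map(dynamic_col, ';', '=') AS dynamic_attrs
FROM events_raw_regex;
```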