03-18-2016 02:40 PM - 7 Kudos
Introduction

This wiki page describes the script sqoopTables.sh (GitHub: https://github.com/sourygnahtw/hadoopUtils/blob/master/scripts/sqoop/sqoopTables.sh). The purpose of the script is to sqoop many tables in parallel and store them in Hive.

Motivations

In a previous project, I needed to download around 400 tables (out of 500) of a SQLserver database. Most of the tables were quite small (less than a few MB), which means that the overhead of fetching the metadata in Sqoop (establishing the connections, getting the DDL of the table...) is very high compared to the time spent doing the real job (downloading the data). Getting the 400 tables could take around 6 hours. To speed up the download, Sqoop usually "splits" a table into 4 parts, so that it can be downloaded in 4 parallel streams. In my project, that approach did not work well:
- Many tables had no clear primary key, or the distribution of the keys was not uniform. Trying to find better split columns for 400 tables would have been a waste of time.
- Splitting a table into several parts for downloading is sometimes not optimal for the database (how are those rows stored on disk?).
- Most of the tables hold little data, so speeding up the data transfer itself brings little improvement. What really needs to be sped up is the fetching of the metadata.

So the idea of the script is, instead of parallelising the download of one single table into several streams, to download several tables at the same time, each table with one single download stream. With that approach, I was able to download the 400 tables in less than 1 hour.

Examples of execution

The easiest way to execute the script is (take care to first configure the SQL driver, user and password; see the next chapter):

./sqoopTables.sh fileWithListOfTable

fileWithListOfTable is a file that lists all the tables we want to sqoop. For instance, if we want to sqoop the 6 tables table1, table2, table3, table4, table5 and table6, then the file must contain exactly 6 lines, one per table:

table1
table2
table3
table4
table5
table6

The script first launches 4 Sqoop processes, to download the first 4 tables. When one of those processes finishes, the script launches another Sqoop process for table5, so that there are always 4 active Sqoop processes until there is no table left to sqoop (a minimal sketch of this idea is shown below).

You can also use some options to tune the script behaviour. For instance:

./sqoopTables.sh -d myDatabase2 -H myHiveDatabase3 -p 6 -q etl listOfTables

In this case, we change the names of the relational and Hive databases, we raise the parallelism to 6 Sqoop processes working at the same time, and we choose the "etl" Yarn queue instead of the default one.
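The core parallelisation idea can be sketched as follows. This is only a simplified illustration, not the actual sqoopTables.sh: the JDBC URL, credentials and database names are placeholders, and the real script adds logging, Hive options and error handling on top of this.

# Simplified sketch of the parallelisation loop (not the actual script).
# The connection string, credentials and database names are placeholders.
parallelism=4
while read -r table; do
    # Wait while the maximum number of Sqoop processes are already running
    while [ "$(jobs -rp | wc -l)" -ge "$parallelism" ]; do
        sleep 5
    done
    # One single download stream per table (--num-mappers 1), run in the background;
    # stdin is redirected so sqoop does not consume the table list
    sqoop import \
        --connect "jdbc:sqlserver://myRelationalDatabase.example.com:1433;databaseName=myDatabase" \
        --username myUser --password myPassword \
        --table "$table" --num-mappers 1 \
        --hive-import --hive-table "myHiveDatabase.$table" < /dev/null &
done < listOfTables
wait    # wait for the last Sqoop processes to finish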
Configuration

The default configuration can be found at the beginning of the script. You will have to change the default values or override them on the command line. Here are the variables that can be modified:

origServer=myRelationalDatabase.example.com # The FQDN of the relational database you want to fetch from (option: -o)
origDatabase=myDatabase # The name of the database that contains the tables to fetch (option: -d)
hiveDatabase=myHiveDatabase # The name of the Hive database that will receive the fetched tables (option: -H)
parallelism=4 # The number of tables (sqoop processes) you want to download at the same time (option: -p)
queue=default # The queue used in Yarn (option: -q)
baseDir=/tmp/sqoopTables # Base directory where the log files will be stored (option: -b)
dirJavaGeneratedCode=/tmp/doSqoopTable-`id -u`-tmp # Directory for the java code generated by Sqoop (option: -c)
targetTmpHdfsDir=/tmp/doSqoopTable-`id -u`-tmp/$$ # Temporary directory in HDFS to store downloaded data before moving it to Hive

Important! This script is geared towards SQLserver. Search for the "sqoop import" line (in the middle of the script) and adapt the JDBC URL accordingly. Take care also to change, on this line, the user and password needed to connect to the relational database.

A few more notes

Logging

The script shows on the standard output the name of each table it has started to download, so that you can easily see how much of the work defined in the "listOfTables" file has been accomplished. It also stores more information in the logging directory (by default: /tmp/sqoopTables). For each parallelisation stream (4 streams by default), you will have 2 kinds of files available:
- process_N-summary.log: an overview log. After executing the script, you should always have a look at those files. You will see the Hive stats (for instance: "[numFiles=1, totalSize=1263]"), but you must also make sure that there is no Java error caused by a failure while sqooping some tables (the standard output does not show errors, which is why you must look at those files).
- process_N-raw.log: the whole standard output of each Sqoop execution, for all the tables downloaded by this stream. After finishing downloading a table, the script "tails" the last 6 lines of this file and appends them to the process_N-summary.log file. That is why the summary log is a quick way to detect errors, while the raw log gives you enough detail to debug any issue that might happen.

Parallelism

By default, the script uses 4 streams, meaning that 4 tables are sqooped at the same time (thus 4 connections to the relational database are established). This number was chosen because Sqoop uses 4 as its default. However, it is quite conservative and you can easily use a higher degree of parallelism (even more so if your tables are quite small).
In my SQLserver project for instance, I set that number to 12. For another Teradata project, I used 54 (limited more by the number of containers on the Hadoop side than by Teradata itself). To avoid wasting containers, the script makes use of the ubertask optimisation.

Mapping of the names of the tables

The names of the tables in Hive might need to be a bit different from the names in the relational database. In the middle of the doSqoop() function, there is a commented example showing how to establish such mappings. For instance, you might want to replace the "raw_" prefix of all table names by "ro_", as illustrated below.
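For illustration, such a mapping could look like the following snippet (illustrative only: the variable names $table and $hiveTable are assumptions here; the actual commented example lives inside the doSqoop() function of the script):

# Derive the Hive table name from the original table name,
# replacing a leading "raw_" prefix by "ro_"
hiveTable=$(echo "$table" | sed 's/^raw_/ro_/')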
10-12-2015 10:30 AM - 6 Kudos
Introduction

This wiki page describes the script validationHiveTablesLinuxMode.sh (GitHub: https://github.com/sourygnahtw/hadoopUtils/blob/master/scripts/hive/validationHiveTablesLinuxMode.sh). The purpose of the script is to compare one Hive table with another, typically to validate a table generated by a Hive transformation against a source-of-truth (sqooped) table.

Motivations

My current project is the typical Hadoop project that consists of offloading an ETL process currently done in a relational database (SQLserver) to a Hadoop process. That involves rewriting all the (many) SQL queries of the ETL process as HiveQL queries. Some criteria of success for such a project are:

- Is the new Hive ETL workflow faster than the old one?
- Do the output tables of the new workflow contain exactly the same data as in the original workflow?
- Does the customer pay less for licenses and hardware with this new workflow?

So, making sure that the HiveQL queries written in such an ETL offloading project are correct is a key subject. Hence the need for a validation tool. At the beginning of the current project, no validation tool existed and we were doing all the validation "manually": looking at the data in the SQLserver tables and checking that they were the same in the generated Hive tables.
Due to the high number of columns and rows in each table (and also the fact that I already wear glasses...), this approach was quite limited. I found no tool on the internet to do such a validation. So the script described here was the first tool to do the desired validation (but not the last: Ryan Merriman, another fellow worker on the same project, has since developed a better validation tool in Java, although it is a bit more complicated to deploy).

How to use the script

Basic execution

Let's suppose that we want to validate the table "footable" and that this table is in the Hive database "bardatabase". The process to do the validation is (a sketch of the preparation steps is shown after this list):

1. First get a source-of-truth "footable" table in another Hive database on the same Hadoop cluster. Since this is an offloading project, we can assume that this source-of-truth table is in the relational database, so we need to sqoop it into Hive:
   - Create the "source-of-truth" database in Hive. For instance: "create database sot;"
   - Create the source-of-truth "footable" table, taking care that its DDL is the same as that of the generated table (if the DDL differs a bit, there will be a problem with the validation; that is why we explicitly create the DDL instead of letting Sqoop create the table for us). The easiest way is to take the DDL directly from the table to validate: "use sot; create table footable like bardatabase.footable;"
   - Sqoop the table from the relational database and ensure that the sqoop process ran without error. If you want to validate a lot of tables, you will have to sqoop many tables, so you might be interested in the following article: Sqoop: fetching lot of tables in parallel
2. Edit the 2 variables at the beginning of the script: origDatabase=sot and resultDatabase=bardatabase
3. On a Linux server that has access to the cluster (typically an edge node), execute the command: ./validationHiveTablesLinuxMode.sh footable
4. Check the result of the validation. If the script has found some errors, you can get more information by executing the "vimdiff" command that is shown on the standard output.
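For reference, the preparation of step 1 could look roughly like this. This is a sketch only: the JDBC URL, credentials and sqoop options are placeholders to adapt to your environment.

# Step 1 sketch: build the source-of-truth table (placeholders to adapt)
hive -e "create database if not exists sot;
         use sot; create table footable like bardatabase.footable;"

# Load the source-of-truth data from the relational database into sot.footable
sqoop import \
    --connect "jdbc:sqlserver://myServer:1433;databaseName=myDatabase" \
    --username myUser --password myPassword \
    --table footable \
    --hive-import --hive-table "sot.footable"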
Example of result

Let's show the result of a similar basic execution:

[sourygna@donzelot ~]$ ./validationHiveTablesLinuxMode.sh input_table1
#### Comparing tovalidate.input_table1 with sot.input_table1
ERROR: the number of rows is different from orig table (7868 rows) to result table (7869 rows)
Number of differences: 2 (out of 7868 rows)
To see the differences visually: vimdiff -o -c "windo set wrap foldcolumn=0" /data/sourygna/validation/tovalidate_sot/input_table1/{orig,result}/sorted.gz

The first line on the standard output describes the comparison performed by the script (this helps when the script validates many tables at once).
The second line shows an important ERROR message. This message only appears when the number of rows does not match between the two tables, which implies a serious problem (for instance, a faulty JOIN).
The third line summarises the number of differences. In our case, less than 0.1% of the rows have errors. Of course, the goal is to have 0 errors, but this is still far better than, say, 75% of rows in error.
The last line only appears when there are errors. It gives the vimdiff command that you must execute to get more insight into the errors. Let's then execute this command:

vimdiff -o -c "windo set wrap foldcolumn=0" /data/sourygna/validation/tovalidate_sot/input_table1/{orig,result}/sorted.gz

The vimdiff screen is separated in 2 parts: the upper part shows the data of the "orig" source-of-truth table, and the lower part the data of the generated Hive table. In each part, each line represents a row of the table. For instance, in the source-of-truth table we can clearly see the first 11 rows. The rest of the rows of the table (7857 rows) have been folded by vimdiff because they are exactly identical to the rows in the table to validate. The columns are separated by "|" (if that character is used in your data, then you will have to modify the separator in the script).

Vimdiff highlights the differences using blue, red and pink colours:

- The first row in the result table (the line with the Null (\N) characters and the "Manual" values) is entirely red. This is because this line does not exist in the source-of-truth table (the "ERROR" message we saw before referred to that line). In the first half of the screen (which shows the original table), vimdiff has inserted a "blue line" to show where that line would be if the 2 tables were identical.
- The 6th line in the result table (which corresponds to the 5th line in the source-of-truth table) is pink. That means that this line exists in both tables but with some differences. The differences appear in red: here, a problem with the month in the second column.

What motivates me to use vimdiff is that it clearly shows the differences, and not only the differences but also the whole row where the difference appears, plus some "good" rows. With such context, it is much easier to understand where the error is and then correct the Hive query.

Advanced execution

Let's suppose that we now have 3 tables to validate: table1, table2 and table3. Let's also assume that table2 has 2 special columns:

- the column bar1 is a timestamp of when the query was executed
- the column bar2 holds a random value

Obviously, it makes no sense to try to validate those 2 columns: if we did, the script would show 100% of the rows in error for table2.
What we have to do is exclude those 2 columns from the validation process. To do so, on the command line after the name of the table, we put the list of those columns prefixed by the ":" character. The columns in this list must be separated by a "," (and if you are unlucky enough to have column names containing spaces, substitute the spaces by "."). Since we want to validate 3 tables, we place 3 arguments on the command line:

./validationHiveTablesLinuxMode.sh table1 table2:bar1,bar2 table3

The script will first validate table1, then table2 (without trying to validate the 2 excluded columns) and finally table3.

Requisites for a validation tool

Here are the requisites I consider important for a validation tool:

1. Don't put load on the customer's database
2. Check all the rows, and all the columns if possible
3. Possibility to exclude certain columns (for instance, those that use "current timestamp" or the columns that are surrogate keys)
4. Don't rely on indexes to compare the rows (in my project, many tables have no indexes)
5. Don't assume that there are no repeated rows
6. Offer a visual and comprehensive way to see the errors
7. Try to handle big tables and lots of data

We can see that the script I have developed covers most of those aspects (a simplified sketch of this comparison pipeline is shown below):

- Sqoop all the tables into a "source of truth" read-only Hive database. Further analysis is only done against that Hive database (requisite 1)
- With Hive, the script does an "INSERT OVERWRITE" of both tables ("source of truth" and results) to the Linux filesystem (current problem: everything is done on the edge node, which does not scale...)
- During the Hive "insert overwrite" command, we can remove columns (requisite 3)
- Sort all the rows (requisites 2, 4 and 5)
- The comparison is done by: wc -l to see if the numbers of rows match, diff to count the errors, and vimdiff (requisite 6)
- Large files/tables are divided into chunks, to make vimdiff faster (requisite 7, partially covered)
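To make the approach more concrete, here is a much simplified sketch of that pipeline. The real script also handles column exclusion, chunking, compression, reports and cleanup; the paths and table names below are just examples:

# Simplified sketch of the comparison pipeline (not the actual script).
# Dump both tables to the local filesystem of the edge node, sort them,
# then compare the sorted files.
hive -e "INSERT OVERWRITE LOCAL DIRECTORY '/tmp/validation/orig'
         ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
         SELECT * FROM sot.footable;"
hive -e "INSERT OVERWRITE LOCAL DIRECTORY '/tmp/validation/result'
         ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
         SELECT * FROM bardatabase.footable;"

sort /tmp/validation/orig/*   > /tmp/validation/orig_sorted
sort /tmp/validation/result/* > /tmp/validation/result_sorted

wc -l /tmp/validation/orig_sorted /tmp/validation/result_sorted       # do the row counts match?
diff /tmp/validation/orig_sorted /tmp/validation/result_sorted | grep -c '^<'   # count the differences
vimdiff -o /tmp/validation/orig_sorted /tmp/validation/result_sorted  # inspect the differences visually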
A few more notes

Parameters in the script

Apart from the variables origDatabase and resultDatabase, there are other parameters that you might want to change:

- tez.queue.name: to specify the Yarn queue to use instead of "default"
- baseDir: this directory, on the Linux server where you execute the script, contains not only the reports and the logs but also the (temporary) data of the Hive tables that you compare. So it is important to make sure that you have enough space in the corresponding partition. You might also need to delete some of the data in that directory from time to time (the data of the tables is compressed, but it might still take some space if you do a lot of validations).
- splitNumLines: if a file/table is bigger than splitNumLines, then it is split into chunks (each chunk having splitNumLines lines at most), in order to make the comparisons with vimdiff easier and faster. Vimdiff can take some time if the tables are big (I have mainly executed this script on a server with 64GB of RAM), so you might need to decrease that number if it takes too much time.

Problems with float, double and decimals

If some columns hold numerical values that are not integers (or big integers), then you might have a problem because the representation of the float "0.2" might not be the same on the Hadoop cluster as in the relational database. So for those columns you might see a lot of small rounding differences. The script does not provide any solution to cope with that problem. What I sometimes do is first check with vimdiff that all the differences are small (for instance: 0.20001 versus 0.2). If so, I then run the validation script another time on the same table, excluding the columns with those numerical values that I have manually checked.

Logging and reports

The script keeps track of all your validation activities in the baseDir directory. In that directory, you will find a directory for each couple (source-of-truth database, database to validate). In each of those subdirectories, you will find:

- the file globalReport, which lists the validations that have been performed for that couple of databases (day/time, arguments, results)
- a directory for each table that has been validated, containing:
  - a tmp directory, where you find the Hive commands executed and their logs
  - the orig and result directories, which store the compressed data of the tables (if both tables are identical, the script deletes the data in the result directory)
  - the orig.old and result.old directories, which store the compressed data of the tables of the previous validation (the script only keeps 1 "backup")
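To give an idea, and following the example paths shown earlier in this article, the layout under baseDir might look like this (illustrative only):

/data/sourygna/validation/            # baseDir (example value)
    tovalidate_sot/                   # one directory per (source of truth, database to validate) couple
        globalReport                  # history of the validations for that couple
        input_table1/                 # one directory per validated table
            tmp/                      # Hive commands executed and their logs
            orig/  result/            # compressed data of the current validation (e.g. sorted.gz)
            orig.old/  result.old/    # compressed data of the previous validation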