Member since: 08-08-2016
Posts: 42
Kudos Received: 32
Solutions: 8
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 2747 | 09-22-2017 09:10 PM |
| | 1403 | 06-16-2017 11:29 AM |
| | 908 | 06-14-2017 09:27 PM |
| | 1740 | 02-28-2017 03:51 PM |
| | 430 | 11-02-2016 02:00 PM |
10-19-2017
05:53 PM
If it is only hive-site.xml and core-site.xml, could you please add hdfs-site.xml?
10-19-2017
05:52 PM
@Raj B Based on your statement, you are using the same XML files (hive-site.xml, core-site.xml). Can you confirm the XML files?
10-19-2017
05:25 PM
1 Kudo
@Raj B Can you try adding core-site.xml, hdfs-site.xml, and hive-site.xml to the config resources property? Please let us know the outcome. Thanks.
09-24-2017
01:35 PM
3 Kudos
Spark Load Testing

A load testing framework built on a number of distributed technologies, including Gatling, Livy, Akka, and HDP. Using an Akka server powered by Livy (Spark as a Service) provides the following benefits:

- REST friendly and Docker friendly
- Low latency execution
- Sharing cache across jobs
- Separation of concerns
- Multi-tenancy
- Direct Spark SQL execution
- Configuration in one place
- Auditing and logging
- Complete statement history and metrics

Livy Server

Livy is an open source REST interface for interacting with Apache Spark from anywhere. It supports executing snippets of code or programs in a Spark context that runs locally or in Apache Hadoop YARN. Livy offers three modes to run Spark jobs:
- Using the programmatic API
- Running interactive statements through the REST API
- Submitting batch applications with the REST API

Livy provides the following features:

- Interactive Scala, Python, and R shells
- Batch submissions in Scala, Java, Python
- Multiple users can share the same server (impersonation support)
- Can be used for submitting jobs from anywhere with REST
- Does not require any code change to your programs
- Support Spark1/Spark2, Scala 2.10/2.11 within one build

Livy provides the following advantages:
- Programmatically upload a JAR file and run a job, then add additional applications that connect to the same cluster and upload a JAR with the next job. If you use spark-submit, you must manually upload the JAR file to the cluster and run the command; everything must be prepared before the run.
- Use Spark in interactive mode, which is hard to do with spark-submit or the Thrift Server at scale.
- Security: reduce exposure of the cluster to the outside world.
- Stability: Spark is a complex framework and there are many factors that can affect its long-term performance and stability. Decoupling the Spark context and the application allows Spark issues to be handled gracefully, without full downtime of the application.
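For reference, a stock Livy server can also be exercised directly over REST before putting the Akka service (which exposes its own /insrun endpoint) in front of it. This is a minimal sketch assuming Livy listens on its default port 8998 on localhost and that the first call returned session id 0:

# Create an interactive Spark session; the response JSON contains the new session id
curl -s -X POST -H "Content-Type: application/json" \
  -d '{"kind": "spark"}' \
  http://localhost:8998/sessions

# Once the session is idle, submit a Scala statement to it (session id 0 assumed)
curl -s -X POST -H "Content-Type: application/json" \
  -d '{"code": "sc.parallelize(1 to 10).count()"}' \
  http://localhost:8998/sessions/0/statements

# Poll for the statement result
curl -s http://localhost:8998/sessions/0/statements/0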
Gatling Server

Gatling is a highly capable load testing tool. It is designed for ease of use, maintainability, and high performance. The Gatling server provides the following benefits:
- Powerful scripting using Scala, Akka, and Netty
- Run multiple scenarios in one simulation
- Scenarios = code + DSL
- Graphical reports with clear and concise graphs

Gatling's architecture is asynchronous as long as the underlying protocol, such as HTTP, can be implemented in a non-blocking way. This kind of architecture lets us implement virtual users as messages instead of dedicated threads, making them very resource cheap. Thus, running thousands of concurrent virtual users is not an issue.

val theScenarioBuilder =
  scenario("Interactive Spark Command Scenario Using LIVY Rest Services $sessionId").exec(
    /* "Interactive Spark Command Simulation" is a name that describes the request. */
    http("Interactive Spark Command Simulation")
      .get("/insrun?sessionId=${sessionId}&statement=sparkSession.sql(%22%20select%20event.site_id%20from%20siteexposure_event%20as%20event%20where%20st_intersects(st_makeBBOX(${bbox})%2C%20geom)%20limit%205%20%22).show")
      .check()
  ).pause(4 second)
So this is great: we can load test our Spark interactive command with one user! Let's increase the number of users. To increase the number of simulated users, all you have to do is change the configuration of the simulation as follows:

setUp(
  theScenarioBuilder.inject(atOnceUsers(10))
).protocols(theHttpProtocolBuilder)

If you want to simulate 3000 users, you might not want them to start at the same time. Indeed, real users are more likely to connect to your web application gradually. Gatling provides rampUsers to implement this behavior. The value of the ramp indicates the duration over which the users will be linearly started. In our scenario, let's have 10 regular users and ramp them over 10 seconds so we don't hammer the Livy server:

setUp(
  theScenarioBuilder.inject(rampUsers(10) over (10 seconds))
).protocols(theHttpProtocolBuilder)
Labels: How-ToTutorial, Sandbox & Learning, Spark
09-22-2017
09:10 PM
1 Kudo
@sudheer Could you please run a major compaction after the ETL ingestion? Please find the ALTER statement below for reference:
alter table <<table_name>> compact 'MAJOR';
06-16-2017
11:29 AM
1 Kudo
@timc c Can you also check the following properties and modify them according to your needs?
webhcat.proxyuser.knox.hosts = *
webhcat.proxyuser.knox.groups = *
As Jay pointed out, once you set the appropriate permissions, restart all the necessary components plus all Hive services.
06-14-2017
09:27 PM
1 Kudo
It seems to be an issue with the SasFileParser. Could you please check whether you have the latest SasFileParser library? Please find my SBT config below:
libraryDependencies ++= Seq(
  "com.databricks" % "spark-csv_2.11" % "1.5.0",
  "org.slf4j" % "slf4j-api" % "1.7.5"
)
06-13-2017
09:47 PM
Can you try with s3a instead of s3n and post the outcome here?
06-13-2017
06:06 PM
What is your cloud storage? WASB could be an issue.
02-28-2017
04:18 PM
@Steve Wong What is your allocated LLAP queue percentage? Queries submitted to LLAP will use only the configured percentage of the LLAP queue. LLAP capacity is determined by the number of concurrent queries and the number of daemons.
02-28-2017
03:55 PM
Also, please check the maximum container size and the NodeManager capacities. It looks like these parameters are set incorrectly on the cluster.
02-28-2017
03:51 PM
1 Kudo
@Nube Technologies
Please check the maximum container size and the NodeManager capacities; it looks like these parameters are set incorrectly on the cluster. Could you also check the following items in your cluster:
1. Log in to the ZooKeeper client using the following command:
zkCli.sh -server $hostName:2181
where $hostName is the hostname of the ZooKeeper server.
2. Verify the ACLs set on the znode:
getAcl /llap-sasl/user-hive
It should give results like:
'sasl,'hive
: cdrwa
'world,'anyone
: r
3. If the ACLs are not set properly, manually set the ACLs on the znode, comma separated:
setAcl /llap-sasl/user-hive sasl:hive:cdrwa,world:anyone:r
4. Verify the ACLs set on the znode again:
getAcl /llap-sasl/user-hive
12-07-2016
09:19 AM
6 Kudos
Sqoop Overview

Sqoop is a tool designed to transfer data between Hadoop and relational databases or mainframes. You can use Sqoop to import data from a relational database management system (RDBMS) such as MySQL or Oracle, or from a mainframe, into the Hadoop Distributed File System (HDFS), transform the data in Hadoop MapReduce, and then export the data back into an RDBMS. Sqoop automates most of this process, relying on the database to describe the schema for the data to be imported. Sqoop uses MapReduce to import and export the data, which provides parallel operation as well as fault tolerance.

Sqoop Performance Tuning Best Practices

Tune the following Sqoop arguments in the JDBC connection or Sqoop mapping to optimize performance:

- batch
- split-by and boundary-query
- direct
- fetch-size
- num-mappers

1. Inserting Data in Batches

Specifies that you can group the related SQL statements into a batch when you export data. The JDBC interface exposes an API for doing batches in a prepared statement with multiple sets of values. With the --batch parameter, Sqoop can take advantage of this. This API is present in all JDBC drivers because it is required by the JDBC interface. Enable JDBC batching using the --batch parameter:

sqoop export --connect <<JDBC URL>> --username <<SQOOP_USER_NAME>> --password <<SQOOP_PASSWORD>> --table <<TABLE_NAME>> --export-dir <<FOLDER_URI>> --batch

The second option is to use the property sqoop.export.records.per.statement to specify
the number of records that will be used in each insert statement:

sqoop export -Dsqoop.export.records.per.statement=10 --connect <<JDBC URL>> --username <<SQOOP_USER_NAME>> --password <<SQOOP_PASSWORD>> --table <<TABLE_NAME>> --export-dir <<FOLDER_URI>>

Finally, you can set how many rows will be inserted per transaction with the sqoop.export.statements.per.transaction property:

sqoop export -Dsqoop.export.statements.per.transaction=10 --connect <<JDBC URL>> --username <<SQOOP_USER_NAME>> --password <<SQOOP_PASSWORD>> --table <<TABLE_NAME>> --export-dir <<FOLDER_URI>>

The default values can vary from connector to connector. Sqoop defaults to disabled batching and to 100 for both the sqoop.export.records.per.statement and sqoop.export.statements.per.transaction properties.

2. Custom Boundary Queries

Specifies the range of values that you can import. You can
use boundary-query if you do not get the desired results by using the split-by argument alone. When you configure the boundary-query argument, you must specify the min(id) and max(id) along with the table name. If you do not configure the argument, Sqoop runs the equivalent min/max query itself. For example:

sqoop import --connect <<JDBC URL>> --username <<SQOOP_USER_NAME>> --password <<SQOOP_PASSWORD>> --query <<QUERY>> --split-by <<ID>> --target-dir <<TARGET_DIR_URI>> --boundary-query "select min(<<ID>>), max(<<ID>>) from <<TABLE>>"

3. Importing Data Directly into Hive

Specifies the direct import fast path when you
import data from an RDBMS. Rather than using the JDBC interface for transferring data, the direct mode delegates the job of transferring data to the native utilities provided by the database vendor. In the case of MySQL, mysqldump and mysqlimport will be used for retrieving data from the database server or moving data back. In the case of PostgreSQL, Sqoop will take advantage of the pg_dump utility to import data. Using native utilities will greatly improve performance, as they are optimized to provide the best possible transfer speed while putting less burden on the database server. There are several limitations that come with this faster import. For one, not all databases have native utilities available, so this mode is not available for every supported database. Out of the box, Sqoop has direct support only for MySQL and PostgreSQL.

sqoop import --connect <<JDBC URL>> --username <<SQOOP_USER_NAME>> --password <<SQOOP_PASSWORD>> --table <<TABLE_NAME>> --direct

4. Importing Data Using Fetch-size

Specifies the number of entries that Sqoop can
import at a time. Use the following syntax:

--fetch-size=<n>

where <n> represents the number of entries that Sqoop must fetch at a time. The default is 1000. Increase the value of the fetch-size argument based on the volume of data that needs to be read, and set the value based on the available memory and bandwidth.
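As a quick illustrative sketch (the placeholders mirror those used elsewhere in this article, and 10000 is only an assumed value, not a recommendation), the argument is appended to a normal import:

sqoop import --connect <<JDBC URL>> --username <<SQOOP_USER_NAME>> --password <<SQOOP_PASSWORD>> --table <<TABLE_NAME>> --target-dir <<TARGET_DIR_URI>> --fetch-size=10000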
5. Controlling Parallelism

Specifies the number of map tasks that can run in parallel. The default is 4. To optimize performance, set the number of map tasks to a value lower than the maximum number of connections that the database supports. Use the parameter --num-mappers if you want Sqoop to use a different number of mappers. For example, to suggest 10 concurrent tasks, use the following Sqoop command:

sqoop import --connect jdbc:mysql://mysql.example.com/sqoop --username sqoop --password sqoop --table cities --num-mappers 10

Controlling the amount of parallelism that Sqoop will use to transfer data is the main way to control the load on your database. Using more mappers will lead to a higher number of concurrent data transfer tasks, which can result in faster job completion. However, it will also increase the load on the database, as Sqoop will execute more concurrent queries.

6. Split-By

Specifies the column name based on which Sqoop must split
the work units. Use the following syntax:

--split-by <column name>

sqoop import --connect <<JDBC URL>> --username <<SQOOP_USER_NAME>> --password <<SQOOP_PASSWORD>> --query <<QUERY>> --split-by <<ID>>

Note: If you do not specify a column name, Sqoop splits the work units based on the primary key.
Labels: Data Ingestion & Streaming, data-ingestion, development, documentation, How-ToTutorial, Sqoop
12-07-2016
04:37 AM
@Rene Sluiter Can you check whether the JAR is present in the share folder?
ls /usr/share/java/mysql-connector-java.jar
11-02-2016
02:00 PM
@Peter Coates Look at the parameters fs.s3a.multipart.threshold and fs.s3a.multipart.size.
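For illustration only, both properties can be overridden per job with generic -D options; the byte values, source path, and bucket placeholder below are assumptions rather than recommendations:

hadoop distcp -Dfs.s3a.multipart.threshold=134217728 -Dfs.s3a.multipart.size=67108864 /data s3a://<<BUCKET>>/data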
09-20-2016
02:50 PM
It might be due to a network connectivity issue. Please check the network configuration to see whether there is any packet loss.
09-20-2016
10:58 AM
1 Kudo
You must supply the generic arguments -conf, -D, and so on after the tool name but before any tool-specific arguments (such as --connect). Note that generic Hadoop arguments are preceded by a single dash character (-), whereas tool-specific arguments start with two dashes (--), unless they are single-character arguments such as -P. https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_using_generic_and_specific_arguments
09-20-2016
10:54 AM
6 Kudos
@Gayathri Reddy G Pass generic arguments like -D right after sqoop job, e.g., sqoop job -Dhadoop.security.credential.provider.path=jceks ... The general syntax is:
sqoop-job (generic-args) (job-args) [-- [subtool-name] (subtool-args)]
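A minimal sketch of that ordering, where the job name, provider path, and import arguments are placeholder assumptions:

sqoop job -Dhadoop.security.credential.provider.path=<<JCEKS_PROVIDER_PATH>> --create <<JOB_NAME>> -- import --connect <<JDBC URL>> --table <<TABLE_NAME>> --password-alias <<PASSWORD_ALIAS>>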
09-12-2016
04:22 PM
2 Kudos
You have to specify MM for the month; mm is for the minutes (for example, in a pattern like yyyy-MM-dd HH:mm:ss).
09-11-2016
09:38 PM
5 Kudos
Using SQOOP with MySQL as Metastore

To set up MySQL for use with SQOOP:

1. On the SQOOP Server host, install the connector.
RHEL/CentOS/Oracle Linux: yum install mysql-connector-java
SLES: zypper install mysql-connector-java
2. Confirm that the .jar is in the Java share directory:
ls /usr/share/java/mysql-connector-java.jar
Make sure the .jar file has the appropriate permissions (644).
3. Create a user for SQOOP and grant it permissions, for example using the MySQL database admin utility:
# mysql -u root -p
CREATE USER '<SQOOPUSER>'@'%' IDENTIFIED BY '<SQOOPPASSWORD>';
GRANT ALL PRIVILEGES ON *.* TO '<SQOOPUSER>'@'%';
CREATE USER '<SQOOPUSER>'@'localhost' IDENTIFIED BY '<SQOOPPASSWORD>';
GRANT ALL PRIVILEGES ON *.* TO '<SQOOPUSER>'@'localhost';
CREATE USER '<SQOOPUSER>'@'<SQOOPSERVERFQDN>' IDENTIFIED BY '<SQOOPPASSWORD>';
GRANT ALL PRIVILEGES ON *.* TO '<SQOOPUSER>'@'<SQOOPSERVERFQDN>';
FLUSH PRIVILEGES;
where <SQOOPUSER> is the SQOOP user name, <SQOOPPASSWORD> is the SQOOP user password, and <SQOOPSERVERFQDN> is the fully qualified domain name of the SQOOP Server host.
4. Configure sqoop-site.xml to create the sqoop database and load the SQOOP Server database schema:
<configuration>
  <property><name>sqoop.metastore.client.enable.autoconnect</name><value>true</value></property>
  <property><name>sqoop.metastore.client.autoconnect.url</name><value>jdbc:mysql://<<MYSQLHOSTNAME>>/sqoop?createDatabaseIfNotExist=true</value></property>
  <property><name>sqoop.metastore.client.autoconnect.username</name><value>$$SQOOPUSER$$</value></property>
  <property><name>sqoop.metastore.client.autoconnect.password</name><value>$$$SQOOPPASSWORD$$$</value></property>
  <property><name>sqoop.metastore.client.record.password</name><value>true</value></property>
  <property><name>sqoop.metastore.server.location</name><value>/usr/lib/sqoop/metastore/</value></property>
  <property><name>sqoop.metastore.server.port</name><value>16000</value></property>
</configuration>
5. Execute the following command to create the initial database and tables:
sqoop job --list
6. If you get any error or exception, pre-load the SQOOP tables with the mandatory values:
mysql -u <SQOOPUSER> -p
USE <SQOOPDATABASE>;
-- Insert the following row
INSERT INTO SQOOP_ROOT VALUES( NULL, 'sqoop.hsqldb.job.storage.version', '0' );
where <SQOOPUSER> is the SQOOP user name and <SQOOPDATABASE> is the SQOOP database name.
7. Execute the following command one more time to create all the required SQOOP internal meta tables:
sqoop job --list
Once all the necessary sqoop tables are created, sqoop job will use the metastore for SQOOP job execution.
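To confirm that job definitions are actually persisted in the MySQL-backed metastore, a simple round trip can be used; the job name and import arguments below are illustrative placeholders rather than part of the original procedure:

sqoop job --create <<JOB_NAME>> -- import --connect <<JDBC URL>> --username <<SQOOPUSER>> --table <<TABLE_NAME>> --target-dir <<TARGET_DIR_URI>> -m 1
sqoop job --list
sqoop job --exec <<JOB_NAME>>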
Labels: Data Ingestion & Streaming, data-ingestion, FAQ, How-ToTutorial, metadata, Metastore, MySQL, Sqoop
09-01-2016
03:05 PM
@Jon Roberts Could you please elaborate on how external tables support secured clusters? I am not sure how HAWQ handles HDFS writes to a different secured Hadoop cluster using a writable external table. Thanks in advance.
09-01-2016
02:49 PM
@Jon Roberts Sqoop can be run in parallel based on a split-by column id or by externally providing the number of mappers. In the majority of places, HAWQ will be managed by a different team, so creating the external table involves a lot of process changes. I am not sure how HAWQ will handle HDFS writes in the case of a secured cluster.
09-01-2016
01:13 PM
1 Kudo
@Yogeshprabhu Migrating multiple nested sub-partitions from HAWQ to Hive using Sqoop will be challenging; if you need to implement it, you would need to use the Sqoop2 APIs. I would recommend importing the table as-is, with one parent partition, into HDFS. Then create an external table and migrate it to an internal table with the necessary partitions. Please remember that having many partitions with small amounts of data in Hive might hinder performance.
09-01-2016
01:07 PM
1 Kudo
@Yogeshprabhu Move the HAWQ primary key/check constraint into the data ingestion script. For example, in the case of Sqoop, use a custom query to fetch only the data matching the check constraint and create a child table in Hive. In this way you can achieve the same schema structure between HAWQ and Hive.
CONSTRAINT rank_1_prt_2_check CHECK (year >= 2001 AND year < 2002)
)
INHERITS ("Test".rank)
Move this constraint into the Sqoop script's condition and create a separate Hive table for each HAWQ child table.
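A minimal sketch of that Sqoop condition, assuming a free-form query import (the connection placeholders and the Hive table name rank_2001 are illustrative assumptions):

sqoop import --connect <<JDBC URL>> --username <<SQOOP_USER_NAME>> --password <<SQOOP_PASSWORD>> --query 'SELECT * FROM "Test".rank WHERE year >= 2001 AND year < 2002 AND $CONDITIONS' --split-by <<ID>> --target-dir <<TARGET_DIR_URI>> --hive-import --hive-table rank_2001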
08-31-2016
05:44 PM
1 Kudo
Any thoughts on how to integrate Oracle IDM with Ranger for applying custom policies?
Labels: Apache Ranger
08-29-2016
07:35 PM
OK, Sami. It is better to provide the schema so that we don't need to prefix the Hive tables with 'default.'.
08-29-2016
07:08 PM
1 Kudo
Use the -- --schema option to specify additional information about the source schema.