Member since: 05-05-2016
Posts: 147
Kudos Received: 223
Solutions: 18
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 3724 | 12-28-2018 08:05 AM
 | 3712 | 07-29-2016 08:01 AM
 | 3053 | 07-29-2016 07:45 AM
 | 7113 | 07-26-2016 11:25 AM
 | 1382 | 07-18-2016 06:29 AM
10-07-2016
03:42 PM
2 Kudos
The PostgreSQL extension PG-Strom allows users to customize the data scan path and run queries faster. CPU-intensive workloads are identified and offloaded to the GPU, taking advantage of the GPU's powerful parallel execution to complete the data task. Compared with a CPU's small number of cores and limited RAM bandwidth, the GPU has a unique advantage: GPUs typically have hundreds of processor cores and RAM bandwidth several times larger than a CPU's, so they can handle large numbers of computations in parallel very efficiently. PG-Strom is based on two basic ideas:
1. On-the-fly native GPU code generation.
2. Asynchronous pipelined execution.
The figure below shows how a query is submitted to the execution engine. During the query optimization phase, PG-Strom detects whether a given query is fully or partially executable on the GPU and then determines whether the query can be offloaded. If it can, PG-Strom generates the source code for the GPU native binaries on the fly, starting just-in-time compilation before the execution phase. Next, PG-Strom loads the extracted row set into the DMA buffer (the size of one buffer defaults to 15MB) and asynchronously starts the DMA transfers and GPU kernel execution. The CUDA platform allows these tasks to run in the background, so PostgreSQL can move the current process ahead in the meantime; because the GPU-accelerated slices run asynchronously, they also hide the usual latency. After loading PG-Strom, running SQL on the GPU requires no special instructions (a short sketch of this follows at the end of this post). It allows the user to customize the way PostgreSQL scans data and provides additional paths for scan/join logic that can run on the GPU. If the expected cost is reasonable, the task manager places the custom scan node in place of the built-in query execution logic.
The graph below shows the benchmark results for PG-Strom and plain PostgreSQL; the x-axis is the number of joined tables and the y-axis is the query execution time. In this test, all relevant inner relations could be loaded into GPU RAM in one pass, and pre-aggregation greatly reduces the number of rows the CPU needs to process. For more details, the test code can be viewed at https://wiki.postgresql.org/wiki/PGStrom. As can be seen from the figure, PG-Strom is much faster than PostgreSQL alone.
There are a few ways you can improve the performance of PostgreSQL:
1. Homogeneous (similar) vertical scaling
2. Heterogeneous vertical scaling
3. Horizontal scaling
PG-Strom uses the heterogeneous vertical scaling approach, which maximizes the hardware benefit for the workload characteristics. In other words, PG-Strom dispatches simple, high-volume numerical calculations to the GPU device instead of running them on the CPU cores.
https://www.linkedin.com/pulse/pg-storm-let-postgresql-run-faster-gpu-mukesh-kumar?trk=prof-post
Evolution, right...
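As a rough sketch of what trying PG-Strom looks like in practice (the extension name and the pg_strom.enabled setting are taken from the PG-Strom documentation as best I recall; the tables, columns, and exact plan node names are assumptions and vary by version):

-- postgresql.conf must preload the module first, e.g. shared_preload_libraries = 'pg_strom'
CREATE EXTENSION pg_strom;

-- With PG-Strom active, EXPLAIN should show GPU custom scan nodes (e.g. GpuScan,
-- GpuJoin, GpuPreAgg) replacing the built-in plan nodes when the cost estimate favors them.
EXPLAIN ANALYZE
SELECT c.category, count(*), avg(s.amount)
  FROM sales s JOIN categories c ON s.category_id = c.id
 GROUP BY c.category;

-- Turning the extension off lets you compare against the plain CPU plan and timing.
SET pg_strom.enabled = off;
EXPLAIN ANALYZE
SELECT c.category, count(*), avg(s.amount)
  FROM sales s JOIN categories c ON s.category_id = c.id
 GROUP BY c.category;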
08-26-2016
06:43 AM
Thanks @lgeorge for your response. I have tried with and without the new consumer and get the same error message.
08-25-2016
11:52 AM
Mirror command below:
./kafka-run-class.sh kafka.tools.MirrorMaker --consumer.config /usr/hdp/current/kafka-broker/config/consumer_mirr.properties --producer.config /usr/hdp/current/kafka-broker/config/producer_mirr.properties --whitelist MukeshTest --new.consumer
08-25-2016
11:50 AM
Hi, I am following the Kafka mirroring steps given at "http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.6/bk_kafka-user-guide/bk_kafka-user-guide-20160628.pdf" and have two separate clusters, but when I run kafka.tools.MirrorMaker I get the error below:
[2016-08-25 17:20:00,081] WARN The configuration serializer.class = kafka.serializer.DefaultEncoder was supplied but isn't a known config. (org.apache.kafka.clients.producer.ProducerConfig)
[2016-08-25 17:20:00,136] ERROR Exception when starting mirror maker. (kafka.tools.MirrorMaker$)
org.apache.kafka.common.config.ConfigException: Missing required configuration "bootstrap.servers" which has no default value.
at org.apache.kafka.common.config.ConfigDef.parse(ConfigDef.java:148)
at org.apache.kafka.common.config.AbstractConfig.<init>(AbstractConfig.java:49)
at org.apache.kafka.common.config.AbstractConfig.<init>(AbstractConfig.java:56)
at org.apache.kafka.clients.consumer.ConsumerConfig.<init>(ConsumerConfig.java:336)
at org.apache.kafka.clients.consumer.KafkaConsumer.<init>(KafkaConsumer.java:541)
at org.apache.kafka.clients.consumer.KafkaConsumer.<init>(KafkaConsumer.java:523)
at kafka.tools.MirrorMaker$$anonfun$4.apply(MirrorMaker.scala:330)
at kafka.tools.MirrorMaker$$anonfun$4.apply(MirrorMaker.scala:328)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.Range.foreach(Range.scala:141)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at kafka.tools.MirrorMaker$.createNewConsumers(MirrorMaker.scala:328)
at kafka.tools.MirrorMaker$.main(MirrorMaker.scala:246)
at kafka.tools.MirrorMaker.main(MirrorMaker.scala)
Exception in thread "main" java.lang.NullPointerException
at kafka.tools.MirrorMaker$.main(MirrorMaker.scala:276)
at kafka.tools.MirrorMaker.main(MirrorMaker.scala)
consumer_mirr.properties:
zookeeper.connect=sourceHOST:2181
zookeeper.connection.timeout.ms=6000
group.id=test-consumer-group-mirror
consumer.timeout.ms=5000
shallow.iterator.enable=true
mirror.topics.whitelist=app_log
producer_mirr.properties:
metadata.broker.list=targetHOST:6667
request.required.acks=0
producer.type=async
compression.codec=none
serializer.class=kafka.serializer.DefaultEncoder
queue.enqueue.timeout.ms=-1
max.message.size=1000000
queue.time=1000
Your help is really appreciated here. Thanks in advance!!!
Labels: Apache Kafka
07-29-2016
11:12 AM
Hi @Himanshu Rawat There are two approaches for this problem.
1. Create partition-wise separate files using Unix or any other tool and load them individually into static partitions, like below:
ALTER TABLE Unm_Parti ADD PARTITION (Department='A')
location '/user/mukesh/HIVE/HiveTrailFolder/A';
ALTER TABLE Unm_Parti ADD PARTITION (Department='B')
location '/user/mukesh/HIVE/HiveTrailFolder/B';
ALTER TABLE Unm_Parti ADD PARTITION (Department='C')
location '/user/mukesh/HIVE/HiveTrailFolder/C';
2. Create an external table and put the file into the external table's HDFS location; we can call it a staging table. Now create the final partitioned table and load it with dynamic partitioning enabled (a sketch of the two tables follows this post):
1. set hive.exec.dynamic.partition=true;
This enables dynamic partitions; by default it is false.
2. set hive.exec.dynamic.partition.mode=nonstrict;
We are using a dynamic partition without a static partition (a table can be partitioned on multiple columns in Hive); in such a case we have to enable non-strict mode. In strict mode we can use a dynamic partition only together with a static partition.
Now use the below statement to load the data:
INSERT OVERWRITE TABLE Final_Table PARTITION(c2) SELECT c1, c4, c3, c2 FROM stage_table;
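For approach 2, here is a minimal sketch of the staging and final tables implied above; the column names c1..c4 and their types are assumptions taken from the INSERT statement (c2 being the partition column), and the HDFS location is a placeholder:

-- Hypothetical staging table mapped over the raw, unpartitioned file.
CREATE EXTERNAL TABLE stage_table(
  c1 STRING,
  c2 STRING,
  c3 STRING,
  c4 STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/user/<username>/staging';

-- Final table partitioned on c2; the INSERT OVERWRITE above fills its partitions dynamically.
CREATE TABLE Final_Table(
  c1 STRING,
  c4 STRING,
  c3 STRING)
PARTITIONED BY (c2 STRING)
STORED AS TEXTFILE;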
07-29-2016
09:20 AM
Yes, because it is required to map the schema, the last column in the file is the partition column. But if you are loading from another table, then in the SELECT statement keep your partition column last, as in the sketch below.
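A minimal illustration of that ordering (the table and column names here are assumptions, not from the question): the dynamic partition column Department is listed last in the SELECT so it lines up with the PARTITION clause.

-- Hypothetical tables; only the column order matters: the partition column goes last.
INSERT OVERWRITE TABLE Unm_Parti PARTITION(Department)
SELECT emp_name, emp_id, Department FROM stage_table;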
07-29-2016
08:01 AM
3 Kudos
We can create partitions on both external as well as managed tables. Yes, we need to define the partition columns while creating the table. For more on the performance aspects, go to the link below:
https://community.hortonworks.com/questions/15161/can-we-apply-the-partitioning-on-the-already-exist.html
See below an example of a partition on an external table:
CREATE EXTERNAL TABLE `myTable`(
`ossc_rc` string,
`subnetwork1` string)
PARTITIONED BY (
`part` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'hdfs://location/ready'
TBLPROPERTIES (
'transient_lastDdlTime'='1433520068')
;
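Once the table exists, data directories can be registered as partitions explicitly; a minimal sketch (the partition value and HDFS path below are assumptions for illustration):

ALTER TABLE `myTable` ADD PARTITION (part='2016-07')
LOCATION 'hdfs://location/ready/part=2016-07';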
07-29-2016
07:45 AM
1 Kudo
Hope this helps. Type the below on Google and in the result page you will have the release notes of all versions:
cloudbreak release notes site:sequenceiq.com
07-27-2016
10:48 AM
1 Kudo
You can create an external table to map the schema and move the file to HDFS:
CREATE EXTERNAL TABLE IF NOT EXISTS Cars(
Name STRING,
Miles_per_Gallon INT,
Cylinders INT,
Displacement INT,
Horsepower INT,
Weight_in_lbs INT,
Acceleration DECIMAL,
Year DATE,
Origin CHAR(1))
COMMENT 'Data about cars from a public database'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
location '/user/<username>/visdata';
hdfs dfs -copyFromLocal cars.csv /user/<username>/visdata
Now create the ORC table:
CREATE TABLE IF NOT EXISTS mycars(
Name STRING,
Miles_per_Gallon INT,
Cylinders INT,
Displacement INT,
Horsepower INT,
Weight_in_lbs INT,
Acceleration DECIMAL,
Year DATE,
Origin CHAR(1))
COMMENT 'Data about cars from a public database'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS ORC;
Insert the data from the external table into the Hive ORC table:
INSERT OVERWRITE TABLE mycars SELECT * FROM cars;
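For a quick sanity check that the data is now in the ORC table, something like the following works (a sketch; the output layout depends on the Hive version):

-- Confirm the storage format and the row count of the new table.
DESCRIBE FORMATTED mycars;
SELECT COUNT(*) FROM mycars;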
07-27-2016
10:39 AM
Hi @Arun A K, it has been observed that most of the time is consumed when we write data downstream to Cassandra, as a single node is serving the Cassandra cluster. Now we are planning to add multiple Cassandra nodes inside the Hadoop cluster for faster writes. I'll keep you updated on progress.