Member since: 05-05-2016
Posts: 147
Kudos Received: 223
Solutions: 18
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 3724 | 12-28-2018 08:05 AM
 | 3712 | 07-29-2016 08:01 AM
 | 3053 | 07-29-2016 07:45 AM
 | 7113 | 07-26-2016 11:25 AM
 | 1382 | 07-18-2016 06:29 AM
10-07-2016
03:42 PM
2 Kudos
The PostgreSQL extension PG-Strom allows users to customize the data scan path and run queries faster. CPU-intensive workloads are identified and offloaded to the GPU, taking advantage of the GPU's powerful parallel execution to complete the data task. Compared with a CPU's small number of cores and limited RAM bandwidth, the GPU has a unique advantage: GPUs typically have hundreds of processor cores and RAM bandwidth several times larger than a CPU's, so they can handle large numbers of computations in parallel very efficiently. PG-Strom is based on two basic ideas:
1. On-the-fly native GPU code generation.
2. Asynchronous pipelined execution.
The figure below shows how a query is submitted to the execution engine. During the query optimization phase, PG-Strom detects whether a given query is fully or partially executable on the GPU and then determines whether the query can be offloaded. If it can, PG-Strom generates the source code for the GPU native binaries on the fly, starting just-in-time compilation before the execution phase. Next, PG-Strom loads the extracted row set into the DMA buffer (the size of one buffer defaults to 15MB) and asynchronously starts the DMA transfers and GPU kernel execution. The CUDA platform allows these tasks to run in the background, so PostgreSQL can move the current process ahead in the meantime; because the GPU-accelerated slices run asynchronously, they also hide the usual latency. After loading PG-Strom, running SQL on the GPU requires no special instructions (a short sketch of this follows at the end of this post). It allows the user to customize the way PostgreSQL scans data and provides additional paths for scan/join logic that can run on the GPU. If the expected cost is reasonable, the task manager places the custom scan node in place of the built-in query execution logic.
The graph below shows the benchmark results for PG-Strom and plain PostgreSQL; the x-axis is the number of joined tables and the y-axis is the query execution time. In this test, all relevant inner relations could be loaded into GPU RAM in one pass, and pre-aggregation greatly reduces the number of rows the CPU needs to process. For more details, the test code can be viewed at https://wiki.postgresql.org/wiki/PGStrom. As can be seen from the figure, PG-Strom is much faster than PostgreSQL alone.
There are a few ways you can improve the performance of PostgreSQL:
1. Homogeneous (similar) vertical scaling
2. Heterogeneous vertical scaling
3. Horizontal scaling
PG-Strom uses the heterogeneous vertical scaling approach, which maximizes the hardware benefit for the workload characteristics. In other words, PG-Strom dispatches simple, high-volume numerical calculations to the GPU device instead of running them on the CPU cores.
https://www.linkedin.com/pulse/pg-storm-let-postgresql-run-faster-gpu-mukesh-kumar?trk=prof-post
Evolution, right...
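As a rough sketch of what trying PG-Strom looks like in practice (the extension name and the pg_strom.enabled setting are taken from the PG-Strom documentation as best I recall; the tables, columns, and exact plan node names are assumptions and vary by version):

-- postgresql.conf must preload the module first, e.g. shared_preload_libraries = 'pg_strom'
CREATE EXTENSION pg_strom;

-- With PG-Strom active, EXPLAIN should show GPU custom scan nodes (e.g. GpuScan,
-- GpuJoin, GpuPreAgg) replacing the built-in plan nodes when the cost estimate favors them.
EXPLAIN ANALYZE
SELECT c.category, count(*), avg(s.amount)
  FROM sales s JOIN categories c ON s.category_id = c.id
 GROUP BY c.category;

-- Turning the extension off lets you compare against the plain CPU plan and timing.
SET pg_strom.enabled = off;
EXPLAIN ANALYZE
SELECT c.category, count(*), avg(s.amount)
  FROM sales s JOIN categories c ON s.category_id = c.id
 GROUP BY c.category;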
08-26-2016
06:43 AM
Thanks @lgeorge for your response. I have tried with and without the new consumer and get the same error message.
08-25-2016
11:52 AM
Mirror command below:
./kafka-run-class.sh kafka.tools.MirrorMaker --consumer.config /usr/hdp/current/kafka-broker/config/consumer_mirr.properties --producer.config /usr/hdp/current/kafka-broker/config/producer_mirr.properties --whitelist MukeshTest --new.consumer
08-25-2016
11:50 AM
Hi, I am following the Kafka mirroring steps given at "http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.6/bk_kafka-user-guide/bk_kafka-user-guide-20160628.pdf" and have two separate clusters, but when I run kafka.tools.MirrorMaker I get the error below:
[2016-08-25 17:20:00,081] WARN The configuration serializer.class = kafka.serializer.DefaultEncoder was supplied but isn't a known config. (org.apache.kafka.clients.producer.ProducerConfig)
[2016-08-25 17:20:00,136] ERROR Exception when starting mirror maker. (kafka.tools.MirrorMaker$)
org.apache.kafka.common.config.ConfigException: Missing required configuration "bootstrap.servers" which has no default value.
at org.apache.kafka.common.config.ConfigDef.parse(ConfigDef.java:148)
at org.apache.kafka.common.config.AbstractConfig.<init>(AbstractConfig.java:49)
at org.apache.kafka.common.config.AbstractConfig.<init>(AbstractConfig.java:56)
at org.apache.kafka.clients.consumer.ConsumerConfig.<init>(ConsumerConfig.java:336)
at org.apache.kafka.clients.consumer.KafkaConsumer.<init>(KafkaConsumer.java:541)
at org.apache.kafka.clients.consumer.KafkaConsumer.<init>(KafkaConsumer.java:523)
at kafka.tools.MirrorMaker$$anonfun$4.apply(MirrorMaker.scala:330)
at kafka.tools.MirrorMaker$$anonfun$4.apply(MirrorMaker.scala:328)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.Range.foreach(Range.scala:141)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at kafka.tools.MirrorMaker$.createNewConsumers(MirrorMaker.scala:328)
at kafka.tools.MirrorMaker$.main(MirrorMaker.scala:246)
at kafka.tools.MirrorMaker.main(MirrorMaker.scala)
Exception in thread "main" java.lang.NullPointerException
at kafka.tools.MirrorMaker$.main(MirrorMaker.scala:276)
at kafka.tools.MirrorMaker.main(MirrorMaker.scala)
consumer_mirr.properties:
zookeeper.connect=sourceHOST:2181
zookeeper.connection.timeout.ms=6000
group.id=test-consumer-group-mirror
consumer.timeout.ms=5000
shallow.iterator.enable=true
mirror.topics.whitelist=app_log
producer_mirr.properties:
metadata.broker.list=targetHOST:6667
request.required.acks=0
producer.type=async
compression.codec=none
serializer.class=kafka.serializer.DefaultEncoder
queue.enqueue.timeout.ms=-1
max.message.size=1000000
queue.time=1000
Your help is really appreciated here. Thanks in advance!!!
Labels: Apache Kafka
07-29-2016
11:12 AM
Hi @Himanshu Rawat There are two approaches for this problem.
1. Create partition-wise separate files using Unix or any other tool and load them individually into static partitions, like below:
ALTER TABLE Unm_Parti ADD PARTITION (Department='A')
location '/user/mukesh/HIVE/HiveTrailFolder/A';
ALTER TABLE Unm_Parti ADD PARTITION (Department='B')
location '/user/mukesh/HIVE/HiveTrailFolder/B';
ALTER TABLE Unm_Parti ADD PARTITION (Department='C')
location '/user/mukesh/HIVE/HiveTrailFolder/C';
2. Create an external table and put the file into the external table's HDFS location; we can call it a staging table. Now create the final partitioned table and load it with dynamic partitioning enabled (a sketch of the two tables follows this post):
1. set hive.exec.dynamic.partition=true;
This enables dynamic partitions; by default it is false.
2. set hive.exec.dynamic.partition.mode=nonstrict;
We are using a dynamic partition without a static partition (a table can be partitioned on multiple columns in Hive); in such a case we have to enable non-strict mode. In strict mode we can use a dynamic partition only together with a static partition.
Now use the below statement to load the data:
INSERT OVERWRITE TABLE Final_Table PARTITION(c2) SELECT c1, c4, c3, c2 FROM stage_table;
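For approach 2, here is a minimal sketch of the staging and final tables implied above; the column names c1..c4 and their types are assumptions taken from the INSERT statement (c2 being the partition column), and the HDFS location is a placeholder:

-- Hypothetical staging table mapped over the raw, unpartitioned file.
CREATE EXTERNAL TABLE stage_table(
  c1 STRING,
  c2 STRING,
  c3 STRING,
  c4 STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/user/<username>/staging';

-- Final table partitioned on c2; the INSERT OVERWRITE above fills its partitions dynamically.
CREATE TABLE Final_Table(
  c1 STRING,
  c4 STRING,
  c3 STRING)
PARTITIONED BY (c2 STRING)
STORED AS TEXTFILE;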
07-29-2016
09:20 AM
Yes, because it is required to map the schema, the last column in the file is the partition column. But if you are loading from another table, then in the SELECT statement keep your partition column last, as in the sketch below.
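A minimal illustration of that ordering (the table and column names here are assumptions, not from the question): the dynamic partition column Department is listed last in the SELECT so it lines up with the PARTITION clause.

-- Hypothetical tables; only the column order matters: the partition column goes last.
INSERT OVERWRITE TABLE Unm_Parti PARTITION(Department)
SELECT emp_name, emp_id, Department FROM stage_table;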
07-29-2016
08:01 AM
3 Kudos
We can create partitions on both external as well as managed tables. Yes, we need to define the partition columns while creating the table. For more on the performance aspects, go to the link below:
https://community.hortonworks.com/questions/15161/can-we-apply-the-partitioning-on-the-already-exist.html
See below an example of a partition on an external table:
CREATE EXTERNAL TABLE `myTable`(
`ossc_rc` string,
`subnetwork1` string)
PARTITIONED BY (
`part` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'hdfs://location/ready'
TBLPROPERTIES (
'transient_lastDdlTime'='1433520068')
;
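Once the table exists, data directories can be registered as partitions explicitly; a minimal sketch (the partition value and HDFS path below are assumptions for illustration):

ALTER TABLE `myTable` ADD PARTITION (part='2016-07')
LOCATION 'hdfs://location/ready/part=2016-07';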
07-29-2016
07:45 AM
1 Kudo
Hope this helps. Type the below on Google and in the result page you will have the release notes of all versions:
cloudbreak release notes site:sequenceiq.com
07-27-2016
10:48 AM
1 Kudo
You can create an external table to map the schema and move the file to HDFS:
CREATE EXTERNAL TABLE IF NOT EXISTS Cars(
Name STRING,
Miles_per_Gallon INT,
Cylinders INT,
Displacement INT,
Horsepower INT,
Weight_in_lbs INT,
Acceleration DECIMAL,
Year DATE,
Origin CHAR(1))
COMMENT 'Data about cars from a public database'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
location '/user/<username>/visdata';
hdfs dfs -copyFromLocal cars.csv /user/<username>/visdata
Now create the ORC table:
CREATE TABLE IF NOT EXISTS mycars(
Name STRING,
Miles_per_Gallon INT,
Cylinders INT,
Displacement INT,
Horsepower INT,
Weight_in_lbs INT,
Acceleration DECIMAL,
Year DATE,
Origin CHAR(1))
COMMENT 'Data about cars from a public database'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS ORC;
Insert the data from the external table into the Hive ORC table:
INSERT OVERWRITE TABLE mycars SELECT * FROM cars;
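For a quick sanity check that the data is now in the ORC table, something like the following works (a sketch; the output layout depends on the Hive version):

-- Confirm the storage format and the row count of the new table.
DESCRIBE FORMATTED mycars;
SELECT COUNT(*) FROM mycars;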
07-27-2016
10:39 AM
Hi @Arun A K, it has been observed that most of the time is consumed when we write data downstream to Cassandra, as a single node is serving the Cassandra cluster. Now we are planning to add multiple Cassandra nodes inside the Hadoop cluster for faster writes. I'll keep you updated on progress.