Member since: 09-18-2015
Posts: 3274
Kudos Received: 1159
Solutions: 426
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 2137 | 11-01-2016 05:43 PM
 | 6506 | 11-01-2016 05:36 PM
 | 4147 | 07-01-2016 03:20 PM
 | 7102 | 05-25-2016 11:36 AM
 | 3442 | 05-24-2016 05:27 PM
04-15-2016
09:23 AM
@Marco Gaido Marco, take a look at this thread: https://community.hortonworks.com/questions/4067/snappy-vs-zlib-pros-and-cons-for-each-compression.html I would test the same task with zlib to compare the behavior; a sketch of such a test follows.
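As a minimal sketch of that comparison (the table names and source_table are illustrative, not from the thread; orc.compress is the standard ORC table property), the same data can be written once per codec and the task timed against each copy:

-- copy the data twice, once per ORC compression codec (hypothetical table names)
CREATE TABLE task_input_snappy STORED AS ORC
  TBLPROPERTIES ('orc.compress' = 'SNAPPY')
AS SELECT * FROM source_table;

CREATE TABLE task_input_zlib STORED AS ORC
  TBLPROPERTIES ('orc.compress' = 'ZLIB')
AS SELECT * FROM source_table;

-- then run the same task against each table and compare runtimes and file sizes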
04-15-2016
09:09 AM
4 Kudos
Original Post
Calcite is a highly customizable engine for parsing and planning queries on data in a wide variety of formats. It allows database-like access, and in particular a SQL interface and advanced query optimization, for data not residing in a traditional database.
Apache Calcite is a dynamic data management framework.
It contains many of the pieces that comprise a typical database management system, but omits some key functions: storage of data, algorithms to process data, and a repository for storing metadata.
Calcite intentionally stays out of the business of storing and processing data. As we shall see, this makes it an excellent choice for mediating between applications and one or more data storage locations and data processing engines. It is also a perfect foundation for building a database: just add data. (Source: https://calcite.apache.org/)
Tutorial https://calcite.apache.org/docs/tutorial.html
Demo:
Read the EMPS and DEPTS tables.
Create a test table based on the existing CSV example. Read the tutorial link to understand model.json and the schema.
In the demo, you can see that I run EXPLAIN PLAN on the queries and then use smart.json to change the plan; a sketch of such a session follows.
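As a minimal sketch of such a session (assuming the tutorial's sales schema with its EMPS and DEPTS CSVs; the demo's actual queries may differ):

-- from the sqlline shell, connect using the tutorial's model file:
!connect jdbc:calcite:model=target/test-classes/model.json admin admin

-- ask Calcite for the plan it chose for a join:
EXPLAIN PLAN FOR
SELECT e.name, d.name AS dept
FROM emps AS e
JOIN depts AS d ON e.deptno = d.deptno;

Swapping smart.json for model.json in the connect string changes which planner rules apply, and the EXPLAIN output changes accordingly.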
Watch the demo and then read the following links
model.json https://calcite.apache.org/docs/tutorial.html#schema-discovery (a sketch of this file follows below)
Query tuning https://calcite.apache.org/docs/tutorial.html#optimizing-queries-using-planner-rules
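The model.json from the tutorial's sales example looks roughly like this (the directory value depends on where your sample CSVs live):

{
  "version": "1.0",
  "defaultSchema": "SALES",
  "schemas": [
    {
      "name": "SALES",
      "type": "custom",
      "factory": "org.apache.calcite.adapter.csv.CsvSchemaFactory",
      "operand": {
        "directory": "sales"
      }
    }
  ]
}

It simply points Calcite's CSV schema factory at a directory; each CSV file in that directory shows up as a table.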
Calcite https://calcite.apache.org/
The Calcite site also documents the SQL dialect recognized by its default SQL parser, the available adapters, and the JDBC driver.
Calcite is embedded in Apache Drill, Hive, and Kylin.
04-01-2016
12:03 PM
@Usman Shahid See this https://www.digitalocean.com/community/tutorials/how-to-install-solr-on-ubuntu-14-04
03-30-2016
12:44 PM
1 Kudo
@Rainer Geissendoerfer See if it helps https://community.hortonworks.com/content/kbentry/1620/how-to-increase-sandbox-disk-space-virtualbox.html
03-29-2016
04:27 PM
@Vadim See if this is helpful.
03-28-2016
06:23 PM
@marksf Do this: 1) run top as root, 2) locate the top PIDs and see which processes those PIDs belong to. Once you are 100% sure that MySQL is causing the issue and that those SQL queries are not related to Hadoop or Ambari, then follow this
03-22-2016
09:41 PM
1 Kudo
@mkataria You can use Apache Falcon http://hortonworks.com/hadoop/falcon/ or see this https://community.hortonworks.com/articles/9933/apache-nifi-aka-hdf-data-flow-across-data-center.html
03-20-2016
11:35 AM
@Mohamed Ashiq Please see this: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.0/bk_HDP_RelNotes/content/ch_relnotes_v240.html HDP 2.4 ships with Apache Kafka 0.9.0, which is currently the highest Kafka version available in the HDP stack.
03-19-2016
09:44 PM
2 Kudos
@gopal
03-19-2016
02:12 AM
4 Kudos
@Mahesh Deshmukh See this: https://cwiki.apache.org/confluence/display/Hive/Tutorial
It is also a good idea to bucket the tables on certain columns so that efficient sampling queries can be run against the data set. Without bucketing, random sampling can still be done on the table, but it is not efficient because the query has to scan all the data. The following example illustrates the page_view table bucketed on the userid column (a sampling query against it is sketched after the excerpt):

CREATE TABLE page_view(viewTime INT, userid BIGINT,
    page_url STRING, referrer_url STRING,
    ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
CLUSTERED BY(userid) SORTED BY(viewTime) INTO 32 BUCKETS
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\001'
  COLLECTION ITEMS TERMINATED BY '\002'
  MAP KEYS TERMINATED BY '\003'
STORED AS SEQUENCEFILE;

In the example above, the table is clustered by a hash function of userid into 32 buckets. Within each bucket the data is sorted in increasing order of viewTime. Such an organization allows the user to do efficient sampling on the clustered column, in this case userid. The sorting property allows internal operators to take advantage of the known sort order to evaluate queries more efficiently.
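As a hedged follow-up (not part of the Hive tutorial excerpt above), bucketing is what makes Hive's TABLESAMPLE efficient: when the sample expression matches the clustering column and bucket count, Hive reads only the buckets the sample needs instead of scanning the whole table:

-- read only bucket 3 of the 32 userid buckets rather than the full table
SELECT *
FROM page_view TABLESAMPLE(BUCKET 3 OUT OF 32 ON userid) s;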