Member since: 09-21-2015
Posts: 28
Kudos Received: 40
Solutions: 5
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 5373 | 12-02-2016 05:54 PM |
| | 1084 | 07-22-2016 04:12 PM |
| | 1730 | 04-22-2016 04:28 PM |
| | 31831 | 04-22-2016 07:58 AM |
| | 7483 | 10-08-2015 10:32 PM |
04-22-2016 04:28 PM
1 Kudo
You will need to specify the column you are clustering on, and then achieve it in multiple statements:

CREATE TABLE emp1 LIKE emp;
ALTER TABLE emp1 SET FILEFORMAT ORC;
ALTER TABLE emp1 CLUSTERED BY (empId) INTO 4 BUCKETS;
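If it helps, here is a minimal sketch of the follow-up load step, assuming the emp source table from above (the enforcement setting applies to Hive 1.x; Hive 2.x always enforces bucketing):

-- Hive 1.x needs bucketing enforcement switched on; Hive 2.x enforces it by default.
SET hive.enforce.bucketing=true;
-- Rewrite the data so it is laid out in the 4 empId buckets declared above.
INSERT OVERWRITE TABLE emp1
SELECT * FROM emp;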
04-22-2016 07:58 AM
2 Kudos
If you create a Hive table over an existing data set in HDFS, you need to tell Hive about the format of the files as they are on the filesystem ("schema on read"). For text-based files, use the keywords STORED AS TEXTFILE. Once you have declared your external table, you can convert the data into a columnar format like Parquet or ORC with a CREATE TABLE ... AS SELECT statement.

CREATE EXTERNAL TABLE sourcetable (col bigint)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ","
STORED AS TEXTFILE
LOCATION 'hdfs:///data/sourcetable';
Once the data is mapped, you can convert it to other formats like Parquet:

SET parquet.compression=SNAPPY; -- this is the default actually
CREATE TABLE testsnappy_pq
STORED AS PARQUET
AS SELECT * FROM sourcetable;
For the Hive-optimized ORC format, the syntax is slightly different:

CREATE TABLE testsnappy_orc
STORED AS ORC
TBLPROPERTIES("orc.compress"="snappy")
AS SELECT * FROM sourcetable;
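As a quick sanity check (using the table names from above), you can confirm the storage format and compare row counts after the conversion:

-- The InputFormat/OutputFormat lines in the output confirm Parquet and ORC storage.
DESCRIBE FORMATTED testsnappy_pq;
DESCRIBE FORMATTED testsnappy_orc;
-- Row counts should match the original external table.
SELECT COUNT(*) FROM sourcetable;
SELECT COUNT(*) FROM testsnappy_pq;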
12-16-2015 06:35 PM
The COUNT(DISTINCT) could be the bottleneck if it is not being parallelized. Can you share the explain plan?
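To illustrate the single-reducer issue (the table and column names below are hypothetical, not from the original question), a common workaround is to rewrite the query so the distinct values are computed in parallel first:

-- Original pattern: the global distinct count funnels all rows into one reducer.
SELECT COUNT(DISTINCT user_id) FROM events;

-- Rewritten pattern: the inner DISTINCT runs across many reducers,
-- and only the final COUNT(*) is a single-reducer step.
SELECT COUNT(*)
FROM (SELECT DISTINCT user_id FROM events) t;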
12-10-2015 05:43 PM
The work to generically create a table by reading a schema from ORC, Parquet, and Avro files is tracked in HIVE-10593.
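Until that lands, Avro is the one case where something similar already works, because the AvroSerDe can derive the columns from a schema file. A sketch, with a hypothetical schema path and table name:

-- Hypothetical example: the column list is omitted; the layout comes from the .avsc file.
CREATE EXTERNAL TABLE events_avro
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 'hdfs:///data/events_avro'
TBLPROPERTIES ('avro.schema.url'='hdfs:///schemas/events.avsc');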
11-20-2015 10:03 PM
A mistyped hadoop fs -rmr -skipTrash can have catastrophic consequences, which snapshots can protect against. What are the performance concerns?
11-17-2015 12:02 AM
1 Kudo
What size NiFi system would we need to read 400 MB/s from a Kafka topic and store the output in HDFS? The input is log lines, 100 B to 1 KB in length each.
Labels:
- Apache NiFi
10-30-2015 11:26 PM
2 Kudos
Pig does not support appending to an existing partition through HCatalog. What workarounds are there to perform the append and get behavior similar to Hive's INSERT INTO TABLE from Pig?
Labels:
- Apache HCatalog
- Apache Pig
10-27-2015 08:58 PM
You would assign one folder to each of the DataNode disks, closely mirroring dfs.datanode.data.dir. On a 12-disk system you would have 12 YARN local-dir locations.
10-22-2015 08:34 PM
1 Kudo
In an HA environment, you should always refer to the nameservice, not any one of the NameNodes. The syntax for the URL is hdfs://<nameservice>/ (notice that no port number is specified). The HA configuration should be defined in /etc/hadoop/conf/core-site.xml and accessible by the process. WebHDFS does not natively support NameNode HA, but you can use Knox to provide that functionality.
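Tying this back to Hive as an example (the nameservice name and path below are placeholders, not from the original question), a table location would reference the nameservice with no host or port:

-- 'mycluster' stands in for whatever nameservice is configured for the cluster;
-- the HA client resolves the active NameNode, so no host:port appears in the URL.
CREATE EXTERNAL TABLE ha_example (col BIGINT)
STORED AS TEXTFILE
LOCATION 'hdfs://mycluster/data/ha_example';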
10-10-2015 12:12 AM
1 Kudo
That was it, thanks! I used http://tweeterid.com/ to convert from username to user ID.