Member since: 06-07-2016
Posts: 923
Kudos Received: 322
Solutions: 115
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 3286 | 10-18-2017 10:19 PM |
| | 3645 | 10-18-2017 09:51 PM |
| | 13322 | 09-21-2017 01:35 PM |
| | 1349 | 08-04-2017 02:00 PM |
| | 1732 | 07-31-2017 03:02 PM |
06-26-2016
05:18 AM
This is quite a custom requirement: you are converting some rows to columns and other rows to both rows and columns. You'll have to write most of the code yourself, but you can take advantage of the pivot functionality in Spark; see https://databricks.com/blog/2016/02/09/reshaping-data-with-pivot-in-apache-spark.html for more details. For a full transpose of a small RDD you can do: sc.parallelize(rdd.collect.toSeq.transpose)
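To make the pivot part concrete, here is a minimal Scala sketch, assuming Spark 1.6 or later; the DataFrame and its column names (id, quarter, amount) are purely illustrative and not from the original question.

import org.apache.spark.sql.functions._

// Toy data: one row per (id, quarter, amount); the real schema will differ.
val df = sqlContext.createDataFrame(Seq(
  ("a", "q1", 10), ("a", "q2", 20), ("b", "q1", 30)
)).toDF("id", "quarter", "amount")

// pivot() turns the distinct values of `quarter` into columns,
// aggregating `amount` for each (id, quarter) pair.
val pivoted = df.groupBy("id").pivot("quarter").agg(sum("amount"))
pivoted.show()

Note that the collect-and-transpose trick above brings everything to the driver, so it is only suitable when the data fits in driver memory.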
06-26-2016
01:54 AM
@Akash Mehta So, even the following won't work for you? If not, I think there is currently no other way, given that we have looked at all the other possible options.

// A DataFrame can be created for a JSON dataset represented by
// an RDD[String] storing one JSON object per string.
val anotherPeopleRDD = sc.parallelize(
  """{"name":"Yin","address":{"city":"Columbus","state":"Ohio"}}""" :: Nil)
val anotherPeople = sqlContext.read.json(anotherPeopleRDD)
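As a quick follow-up (using the same names as in the snippet above), you can verify that the nested fields came through and query them with dotted paths:

anotherPeople.printSchema()
anotherPeople.select("name", "address.city").show()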
06-25-2016
08:44 PM
@Sri Bandaru Since you are not running in a sandbox, what does --master yarn resolve to?
06-23-2016
11:51 PM
load will infer the schema and convert the data to rows. The question is whether it will accept an http URL. Can you try? If it doesn't, a possible fallback is sketched below.
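If load() turns out not to accept an http URL directly, one possible fallback is to fetch the text over HTTP on the driver and hand it to read.json via an RDD, as in the sc.parallelize example earlier on this page. This is only a sketch: it assumes the endpoint returns a single, reasonably small JSON object, and the URL below is a stand-in for the real one.

import scala.io.Source

// Hypothetical endpoint; replace with the actual URL serving the JSON.
val url = "http://example.com/data.json"

// Fetch the whole payload on the driver (fine only for small documents).
val jsonString = Source.fromURL(url).mkString

// One JSON object per RDD element; Spark infers the schema from it.
val jsonRDD = sc.parallelize(Seq(jsonString))
val df = sqlContext.read.json(jsonRDD)
df.printSchema()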
06-23-2016
10:58 PM
@Akash Mehta Can you do something like this?
dataframe = sqlContext.read.format("json").load(your json here)
06-23-2016
06:17 PM
1 Kudo
Assuming you have Hive 0.14 or later:
ALTER TABLE MAGNETO.SALES_FLAT_ORDER SET SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde';
06-23-2016
05:47 PM
@Simran Kaur Check your data with "cat" to see what it looks like: are the fields separated by a space, or something else? Alternatively, you can create the table yourself, specify in the CREATE TABLE statement what you want your fields to be terminated by, and then do the import using Sqoop.
06-23-2016
05:13 PM
It is likely a field-delimiter issue. The default in Hive is ^A, so you should specify what your fields are delimited by using --fields-terminated-by. You might want to set --lines-terminated-by as well.
06-21-2016
04:42 PM
4 Kudos
@Kaliyug Antagonist
Hi, I would disagree with your assumption that it doesn't make sense to back up petabytes of data. Think about what you would do if there were a fire in a data center and your data were physically destroyed. Even at petabyte scale, it is very important to have a backup and DR strategy.

Snapshots only create point-in-time backups of data. You can mark a directory "snapshottable" and then create snapshots of the data in that directory, which gives you the ability to go back and restore the data to that particular point in time. Please see the following link for more details:
https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/HdfsSnapshots.html

Snapshots still don't solve the problem you are trying to solve. You need to back up the data using either distcp or tools like Falcon. Please see:
https://community.hortonworks.com/questions/394/what-are-best-practices-for-setting-up-backup-and.html
http://hortonworks.com/apache/falcon/

As for your question number 3: when a data node or the name node goes down, I don't think your backups help. When a data node goes down, Hadoop will take care of re-creating the lost copies by replicating the data, and someone in operations will likely be working to bring the datanode back up. Similarly, if your name node goes down, your cluster should fail over to the standby namenode while your operations team works to restore the lost namenode. Backing up metadata doesn't help in this particular case because, between the namenode and the standby namenode with a quorum journal manager, you already have multiple copies of the metadata (this does not discount the significance of a backup and DR strategy that includes metadata backups). Please check the following links; they will help you understand how this works:
http://hortonworks.com/blog/namenode-high-availability-in-hdp-2-0/
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.2/bk_hadoop-ha/content/ch_HA-NameNode.html (if you are interested in more detail)

Thanks
Imad
06-15-2016
06:07 PM
@chandramouli muthukumaran No; for HDFS files, their storage will depend only on the replication factor. Think about it this way: you start with a fresh Linux install and have different mount points in your system with different capacities. Which mount points would you like to use to store your HDFS data (datanode) as well as your metadata (namenode)?