Member since: 06-07-2016
Posts: 923
Kudos Received: 322
Solutions: 115
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 3275 | 10-18-2017 10:19 PM |
| | 3635 | 10-18-2017 09:51 PM |
| | 13298 | 09-21-2017 01:35 PM |
| | 1343 | 08-04-2017 02:00 PM |
| | 1718 | 07-31-2017 03:02 PM |
06-26-2016
05:18 AM
This is quite a custom requirement: you are converting some rows to columns and other rows to both rows and columns. You will have to write a fair amount of the code yourself, but you can take advantage of the pivot functionality in Spark; the following link has more details: https://databricks.com/blog/2016/02/09/reshaping-data-with-pivot-in-apache-spark.html You can also transpose the collected data directly with sc.parallelize(rdd.collect.toSeq.transpose).
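For illustration, here is a minimal pivot sketch against Spark 1.6+ (the column names and values are made up for the example):

import org.apache.spark.sql.functions.first

// Hypothetical long-format data: one row per (day, metric) pair.
val df = sqlContext.createDataFrame(Seq(
  ("2016-06-01", "sales", 100),
  ("2016-06-01", "returns", 7),
  ("2016-06-02", "sales", 120)
)).toDF("day", "metric", "value")

// Turn the distinct values of "metric" into columns, one output row per "day".
val pivoted = df.groupBy("day").pivot("metric").agg(first("value"))
pivoted.show()

Keep in mind that the transpose trick with rdd.collect pulls the whole dataset to the driver, so it is only practical for small data.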
06-26-2016
01:54 AM
@Akash Mehta So, even the following won't work for you? If not, I don't think there is currently any other way, given that we have looked at all the other possible options. // A DataFrame can be created for a JSON dataset represented by
// an RDD[String] storing one JSON object per string.
val anotherPeopleRDD = sc.parallelize(
"""{"name":"Yin","address":{"city":"Columbus","state":"Ohio"}}""" :: Nil)
val anotherPeople = sqlContext.read.json(anotherPeopleRDD)
06-25-2016
08:44 PM
@Sri Bandaru Since you are not running in a sandbox, what does --master yarn resolve to?
06-23-2016
11:51 PM
load will infer the schema and convert each record into a Row. The question is whether it will accept an HTTP URL. Can you try?
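If load() does not accept an HTTP URL, one fallback is to fetch the JSON yourself and hand it to Spark as an RDD[String]; a rough sketch (the URL is a placeholder):

import scala.io.Source

// Placeholder endpoint; one JSON object per line is assumed,
// which is what sqlContext.read.json expects.
val url = "http://example.com/data.json"
val jsonLines = Source.fromURL(url).getLines().toSeq

val jsonRDD = sc.parallelize(jsonLines)
val df = sqlContext.read.json(jsonRDD)  // schema is inferred from the JSON
df.printSchema()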
06-23-2016
10:58 PM
@Akash Mehta Can you do something like this?
dataframe = sqlContext.read.format("json").load(your json here)
06-23-2016
06:17 PM
1 Kudo
Assuming you have Hive 0.14 or later: ALTER TABLE MAGNETO.SALES_FLAT_ORDER SET SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde';
06-23-2016
05:47 PM
@Simran Kaur Check your data by doing a "cat" to see what it looks like: are the fields separated by a space, or by something else? Alternatively, you can create the table yourself, specify in the CREATE TABLE statement what you want your fields to be terminated by, and then do the import using Sqoop.
06-23-2016
05:13 PM
It is likely an issue with the field delimiter; the default in Hive is ^A. You should specify what your fields are delimited by using --fields-terminated-by, and you may want to set --lines-terminated-by as well.
06-21-2016
04:42 PM
4 Kudos
@Kaliyug Antagonist
Hi, I would disagree with your assumption that it doesn't make sense to back up petabytes of data. Think about what you would do if there were a fire in a data center and your data were physically destroyed. Even at petabyte scale, it is very important to have a backup and DR strategy.

Now, snapshots only capture data at a point in time. You can mark a directory "snapshottable" and then create snapshots of the data in that directory, which gives you the ability to go back and restore the data to that particular point in time. Please see the following link for more details: https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/HdfsSnapshots.html

Snapshots still don't solve the problem you are trying to solve. You need to back up the data using distcp, or use tools like Falcon to help. Please see the following links: https://community.hortonworks.com/questions/394/what-are-best-practices-for-setting-up-backup-and.html http://hortonworks.com/apache/falcon/

As for your question number 3, when your data nodes or name node go down, I don't think your backups help. When a data node goes down, Hadoop takes care of re-creating the lost replicas by replicating the data, and someone in operations will likely be working to bring the datanode back up. Similarly, if your name node goes down, your cluster should fail over to the standby namenode, and your operations team should be working to restore the lost namenode. Backing up metadata doesn't help in this particular case, because with a namenode, a standby namenode, and a quorum journal manager you already have multiple copies of the metadata (this does not discount the significance of a backup and DR strategy that includes metadata backup). Please check the following links; they will help you understand how this works: http://hortonworks.com/blog/namenode-high-availability-in-hdp-2-0/ https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.2/bk_hadoop-ha/content/ch_HA-NameNode.html (if you are interested in more details)

Thanks,
Imad
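For reference, the "allow snapshot" and "create snapshot" steps described in the HDFS snapshots document above can also be driven from the FileSystem API. A rough Scala sketch (the directory path is hypothetical, and fs.defaultFS is assumed to point at HDFS):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.hdfs.DistributedFileSystem

val conf = new Configuration()  // picks up core-site.xml / hdfs-site.xml from the classpath
val fs = FileSystem.get(conf).asInstanceOf[DistributedFileSystem]

val dir = new Path("/data/important")  // hypothetical directory to protect

// Equivalent to: hdfs dfsadmin -allowSnapshot /data/important (requires superuser)
fs.allowSnapshot(dir)

// Equivalent to: hdfs dfs -createSnapshot /data/important s20160621
val snapshot = fs.createSnapshot(dir, "s20160621")
println("Snapshot created at " + snapshot)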
06-15-2016
06:07 PM
@chandramouli muthukumaran No; for HDFS files, their storage will depend only on the replication factor. Think about it this way: you start with a fresh Linux install and have different mount points in your system with different capacities. Which mount points would you like to use to store your HDFS data (datanode) as well as your metadata (namenode)?