Member since: 05-02-2019
Posts: 319
Kudos Received: 145
Solutions: 59

My Accepted Solutions

| Title | Views | Posted |
|---|---|---|
| | 7086 | 06-03-2019 09:31 PM |
| | 1715 | 05-22-2019 02:38 AM |
| | 2165 | 05-22-2019 02:21 AM |
| | 1347 | 05-04-2019 08:17 PM |
| | 1660 | 04-14-2019 12:06 AM |
05-25-2016
02:28 PM
3 Kudos
Wow, a TON of questions around snapshots; I'll try to hit on most of them. Sounds like you might have already found these older posts on the topic: http://hortonworks.com/blog/snapshots-for-hdfs/ & http://hortonworks.com/blog/protecting-your-enterprise-data-with-hdfs-snapshots/. For DR (getting data onto another cluster) you'll need to export these snapshots with a tool like distcp. As you go up into the Hive and HBase stacks, you have some other tools and options in addition to this. My recommendation is to open a dedicated HCC question for each after you do a little research, and we can all jump in to help with anything you don't understand.

As with all things, the best way to find out is to give it a try. As the session below shows, you cannot delete a snapshot like a "normal" file; you have to use the dedicated -deleteSnapshot command.

```
[root@sandbox ~]# hdfs dfs -mkdir testsnaps
[root@sandbox ~]# hdfs dfs -put /etc/group testsnaps/
[root@sandbox ~]# hdfs dfs -ls testsnaps
Found 1 items
-rw-r--r--   3 root hdfs       1196 2016-05-25 14:18 testsnaps/group
[root@sandbox ~]# su - hdfs
[hdfs@sandbox ~]$ hdfs dfsadmin -allowSnapshot /user/root/testsnaps
Allowing snaphot on /user/root/testsnaps succeeded
[hdfs@sandbox ~]$ exit
logout
[root@sandbox ~]# hdfs dfs -createSnapshot /user/root/testsnaps snap1
Created snapshot /user/root/testsnaps/.snapshot/snap1
[root@sandbox ~]# hdfs dfs -ls testsnaps/.snapshot/snap1
Found 1 items
-rw-r--r--   3 root hdfs       1196 2016-05-25 14:18 testsnaps/.snapshot/snap1/group
[root@sandbox ~]# hdfs dfs -rmr -skipTrash /user/root/testsnaps/.snapshot/snap1
rmr: DEPRECATED: Please use 'rm -r' instead.
rmr: Modification on a read-only snapshot is disallowed
[root@sandbox ~]# hdfs dfs -deleteSnapshot /user/root/testsnaps snap1
[root@sandbox ~]# hdfs dfs -ls testsnaps/.snapshot
[root@sandbox ~]#
```

There is no auto-delete of snapshots. The rule of thumb is that if you create them (likely with an automated process), then you need a complementary process to delete them (a rough sketch of such a rotation script is below), because snapshots can clog up HDFS space if the directory you are snapshotting actually does change. Snapshots should not adversely affect your quotas, with the exception just called out: they hang onto HDFS space for items you have deleted from the actual directory for as long as one or more snapshots still point to them. Have fun playing around with snapshots & good luck!
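If you do automate them, here is a rough sketch of the kind of rotation script I mean; the directory, the snap_YYYYMMDD naming convention, and the retention window are all placeholders, and it assumes GNU date (as on the Sandbox):

```bash
#!/bin/bash
# Rough daily snapshot rotation sketch (paths, naming, and retention are placeholders).
# Assumes the directory is already snapshottable (hdfs dfsadmin -allowSnapshot).
DIR=/user/root/testsnaps
KEEP_DAYS=7

# Take today's snapshot, named with the date so snapshots compare/sort easily
TODAY=$(date +%Y%m%d)
hdfs dfs -createSnapshot "$DIR" "snap_${TODAY}"

# Delete any snapshot older than the retention window
CUTOFF=$(date -d "-${KEEP_DAYS} days" +%Y%m%d)
for SNAP in $(hdfs dfs -ls "$DIR/.snapshot" | grep "$DIR/.snapshot/snap_" | awk -F/ '{print $NF}'); do
  if [ "${SNAP#snap_}" -lt "$CUTOFF" ]; then
    hdfs dfs -deleteSnapshot "$DIR" "$SNAP"
  fi
done
```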
05-23-2016
12:12 PM
Good thing it is an ASF project!! 😉 See if the following can get you going:

- http://zeppelin.apache.org/
- http://zeppelin.apache.org/download.html
- http://zeppelin.apache.org/docs/0.5.6-incubating/index.html
- http://zeppelin.apache.org/docs/0.5.6-incubating/install/install.html
- http://zeppelin.apache.org/docs/0.5.6-incubating/install/yarn_install.html

Good luck!
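If it helps, here's a rough sketch of a standalone install, assuming you grabbed the 0.5.6-incubating binary tarball from the download page above (adjust the file and directory names to whatever you actually downloaded):

```bash
# Unpack the binary tarball (file name assumed) and start the Zeppelin daemon
tar -xzf zeppelin-0.5.6-incubating-bin-all.tgz
cd zeppelin-0.5.6-incubating-bin-all
bin/zeppelin-daemon.sh start
bin/zeppelin-daemon.sh status

# The web UI defaults to port 8080 (zeppelin.server.port in conf/zeppelin-site.xml);
# browse to http://<host>:8080 once the daemon reports it is running
```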
05-18-2016
10:06 PM
Looks like the same question over at https://community.hortonworks.com/questions/33621/input-path-on-sandbox-for-loading-data-into-spark.html, which @Joe Widen answered. Note my comment (and example) below; as Joe also pointed out, each JSON object needs to be on a single line. Glad to see Joe got a "best answer" and I'd sure be appreciative of the same on this one. 😉
05-18-2016
02:03 PM
1 Kudo
Sounds like https://community.hortonworks.com/questions/33961/how-to-import-data-return-by-google-analytic-s-api.html was a repost of this earlier question. I provided my (more generic) answer over there, but maybe someone has a more specific response tied directly to Google Analytics and Hadoop. Good luck!
05-18-2016
02:01 PM
1 Kudo
Sure!! Just navigate over to http://hortonworks.com/products/sandbox/ to download the free Hortonworks Sandbox and to check out all the available tutorials listed there. I do assume you already know about this, so feel free to refine your question. 😉
05-18-2016
01:59 PM
1 Kudo
My thoughts on #1 & #2: Some googling shows there are folks out there with somewhat direct ways to open an HDFS file and write to it as you pull the data from the external system (the Google API in this case). That said, I'd consider applying the KISS principle: have your Python program write the results into a local file, and once you are done (and you are sure you are done -- this helps prevent a half-baked file in HDFS), simply use the hadoop fs -put command to drop the complete file exactly where you want it in HDFS.

As for #3: You do have to create the table (external or not -- even "managed" tables can reside outside of /apps/hive/warehouse), because Hive is an ecosystem tool layered on top of base HDFS, and the CREATE TABLE DDL statement stores the metadata that maps your logical table onto your data. The good news is that you can create that table before or after you load the data, and if you keep adding net-new data to the table, you don't have to create it again. A minimal sketch of the whole flow is below. Good luck!
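For illustration only, here is that flow end to end; the script name, paths, and table/column names are all made up, so adjust them to your data:

```bash
# 1) Let the Python job write its complete output to a local staging file
python pull_ga_data.py > /tmp/ga_results.csv        # hypothetical script and output file

# 2) Only after it finishes cleanly, push the finished file into HDFS
hdfs dfs -mkdir -p /data/ga/results
hdfs dfs -put /tmp/ga_results.csv /data/ga/results/

# 3) Map a Hive table onto that directory (before or after loading is fine);
#    the columns here are placeholders for whatever the API actually returns
hive -e "
CREATE EXTERNAL TABLE IF NOT EXISTS ga_results (
  session_date STRING,
  sessions     INT,
  pageviews    INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/ga/results';
"
```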
05-17-2016
08:51 PM
Good points; some examples of the "corner cases" in CSV files (especially those generated by tools like Excel) are discussed in https://martin.atlassian.net/wiki/x/WYBmAQ.
05-17-2016
12:58 AM
2 Kudos
Not sure what error you are getting (feel free to share some of the dataset and the error messages you received), but I'm wondering if you are accounting for the following warning called out in http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets:

> Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. As a consequence, a regular multi-line JSON file will most often fail.

I usually get something like the following when trying to use a multi-line file.

```
scala> val productsML = sqlContext.read.json("/tmp/hcc/products.json")
productsML: org.apache.spark.sql.DataFrame = [_corrupt_record: string]
```

That said, all seems to be working for me with a file like the following.

```
[root@sandbox ~]# hdfs dfs -cat /tmp/hcc/employees.json
{"id" : "1201", "name" : "satish", "age" : "25"}
{"id" : "1202", "name" : "krishna", "age" : "28"}
{"id" : "1203", "name" : "amith", "age" : "39"}
{"id" : "1204", "name" : "javed", "age" : "23"}
{"id" : "1205", "name" : "prudvi", "age" : "23"}
```

You can see it working in the two ways I read the JSON file below.

```
SQL context available as sqlContext.

scala> val df1 = sqlContext.read.json("/tmp/hcc/employees.json")
df1: org.apache.spark.sql.DataFrame = [age: string, id: string, name: string]

scala> df1.printSchema()
root
 |-- age: string (nullable = true)
 |-- id: string (nullable = true)
 |-- name: string (nullable = true)

scala> df1.show()
+---+----+-------+
|age|  id|   name|
+---+----+-------+
| 25|1201| satish|
| 28|1202|krishna|
| 39|1203|  amith|
| 23|1204|  javed|
| 23|1205| prudvi|
+---+----+-------+

scala> val df2 = sqlContext.read.format("json").option("samplingRatio", "1.0").load("/tmp/hcc/employees.json")
df2: org.apache.spark.sql.DataFrame = [age: string, id: string, name: string]

scala> df2.show()
+---+----+-------+
|age|  id|   name|
+---+----+-------+
| 25|1201| satish|
| 28|1202|krishna|
| 39|1203|  amith|
| 23|1204|  javed|
| 23|1205| prudvi|
+---+----+-------+
```

Again, if this doesn't help feel free to share some more details. Good luck!
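If your source file is a regular multi-line JSON array, one workaround is to flatten it into one-object-per-line form before handing it to Spark. A minimal sketch, assuming jq is installed and /tmp/products.json holds a top-level JSON array:

```bash
# Rewrite a JSON array as one compact JSON object per line (what read.json expects),
# then push the flattened file into HDFS
jq -c '.[]' /tmp/products.json > /tmp/products_one_per_line.json
hdfs dfs -put /tmp/products_one_per_line.json /tmp/hcc/
```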
05-16-2016
11:17 PM
1 Kudo
As for ingestion: Pig is not really used for simple ingestion, and Sqoop is a great tool for importing data from an RDBMS, so "directly in(to) HDFS" seems like the logical answer. If your data is on an edge/ingestion node, simply scripting the hadoop fs "put" command, https://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-common/FileSystemShell.html#put, can be a simple and effective way to get your data loaded into HDFS (see the small example below). As for whether Spark is a good option for data transformation (I'm going to side-step the "segmentation" term, as it means a lot of different things to a lot of different people 😉), I'd say this is really a matter of style, experience, and the results of POC testing based on your data & processing profile. So, yes, Spark could be an effective transformation engine.
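As a tiny illustration (the staging and target paths are made up), an edge-node load script can be as small as:

```bash
# Push every file that landed in the edge node's staging directory into HDFS,
# then remove the local copy so the next run starts clean (paths are illustrative)
STAGING=/staging/inbound
TARGET=/data/incoming

hdfs dfs -mkdir -p "$TARGET"
for f in "$STAGING"/*.csv; do
  [ -e "$f" ] || continue          # nothing staged this run
  hdfs dfs -put "$f" "$TARGET"/ && rm "$f"
done
```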
03-21-2016
01:12 AM
2 Kudos
I'm surely not going to give you the best answer on this one, but "Hadoop Streaming", as described at http://hadoop.apache.org/docs/current/hadoop-streaming/HadoopStreaming.html, is a way to run a MapReduce job that executes your Python code. In this case, you'll need Python installed on all the cluster nodes, but since you're starting out on the Sandbox that makes it easy (just one place!). Yes, you could also run a Python app that queries Hive, but then only the query itself runs in the cluster, and you obviously just need Python wherever you run that app from.
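To give a feel for it, here is a rough sketch of a streaming run; the jar location, the HDFS paths, and the mapper.py/reducer.py script names are assumptions, so substitute your own:

```bash
# Run a MapReduce job whose map and reduce logic live in two Python scripts.
# The streaming jar path varies by distribution; on an HDP Sandbox it is
# commonly under /usr/hdp/current/hadoop-mapreduce-client/ (verify locally).
hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar \
  -files mapper.py,reducer.py \
  -input  /tmp/streaming/input \
  -output /tmp/streaming/output \
  -mapper  "python mapper.py" \
  -reducer "python reducer.py"

# Inspect the results
hdfs dfs -cat /tmp/streaming/output/part-*
```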