Member since: 05-02-2019
Posts: 319
Kudos Received: 145
Solutions: 59
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 6997 | 06-03-2019 09:31 PM |
 | 1671 | 05-22-2019 02:38 AM |
 | 2123 | 05-22-2019 02:21 AM |
 | 1321 | 05-04-2019 08:17 PM |
 | 1628 | 04-14-2019 12:06 AM |
05-25-2016
02:28 PM
3 Kudos
Wow, a TON of questions around Snapshots; I'll try to hit on most of them. Sounds like you might have already found these older posts on the topic: http://hortonworks.com/blog/snapshots-for-hdfs/ & http://hortonworks.com/blog/protecting-your-enterprise-data-with-hdfs-snapshots/. For DR (getting the data onto another cluster) you'll need to export these snapshots with a tool like distcp. As you go up into the Hive and HBase stacks, you have some other tools and options in addition to this. My recommendation is to open a dedicated HCC question for each topic after you do a little research, and we can all jump in to help with anything you don't understand.

As with all things, the best way to find out is to give it a try. As the next bit shows, you cannot delete a snapshot like a "normal" file or directory; you have to use the special deleteSnapshot command.

[root@sandbox ~]# hdfs dfs -mkdir testsnaps
[root@sandbox ~]# hdfs dfs -put /etc/group testsnaps/
[root@sandbox ~]# hdfs dfs -ls testsnaps
Found 1 items
-rw-r--r-- 3 root hdfs 1196 2016-05-25 14:18 testsnaps/group
[root@sandbox ~]# su - hdfs
[hdfs@sandbox ~]$ hdfs dfsadmin -allowSnapshot /user/root/testsnaps
Allowing snaphot on /user/root/testsnaps succeeded
[hdfs@sandbox ~]$ exit
logout
[root@sandbox ~]# hdfs dfs -createSnapshot /user/root/testsnaps snap1
Created snapshot /user/root/testsnaps/.snapshot/snap1
[root@sandbox ~]# hdfs dfs -ls testsnaps/.snapshot/snap1
Found 1 items
-rw-r--r-- 3 root hdfs 1196 2016-05-25 14:18 testsnaps/.snapshot/snap1/group
[root@sandbox ~]# hdfs dfs -rmr -skipTrash /user/root/testsnaps/.snapshot/snap1
rmr: DEPRECATED: Please use 'rm -r' instead.
rmr: Modification on a read-only snapshot is disallowed
[root@sandbox ~]# hdfs dfs -deleteSnapshot /user/root/testsnaps snap1
[root@sandbox ~]# hdfs dfs -ls testsnaps/.snapshot
[root@sandbox ~]#

There is no auto-delete of snapshots. The rule of thumb is that if you create them (likely with an automated process), then you need a complementary process to delete them, since snapshots can clog up HDFS space if the directory you are snapshotting actually does change. Snapshots should not adversely affect your quotas, with the exception just called out: snapshots hang onto HDFS space for items you have deleted from the actual directory as long as one or more snapshots still point to them. Have fun playing around with snapshots & good luck!
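To make the DR point above a bit more concrete, here is a minimal, hedged sketch of exporting snapshot state to another cluster with distcp (the snapshot names, target cluster address and backup path are all made up for illustration):

# Take a baseline snapshot and seed the DR cluster from it
hdfs dfs -createSnapshot /user/root/testsnaps dr1
hadoop distcp /user/root/testsnaps/.snapshot/dr1 hdfs://drcluster:8020/backups/testsnaps

# Later, take a second snapshot and copy only the delta between the two.
# Note that -diff assumes the target directory is also snapshottable,
# holds a matching dr1 snapshot, and has not changed since it was taken.
hdfs dfs -createSnapshot /user/root/testsnaps dr2
hadoop distcp -update -diff dr1 dr2 /user/root/testsnaps hdfs://drcluster:8020/backups/testsnaps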
05-23-2016
12:12 PM
Good thing it is an ASF project!! 😉 See if http://zeppelin.apache.org/, http://zeppelin.apache.org/download.html, http://zeppelin.apache.org/docs/0.5.6-incubating/index.html, http://zeppelin.apache.org/docs/0.5.6-incubating/install/install.html and/or http://zeppelin.apache.org/docs/0.5.6-incubating/install/yarn_install.html can get you going. Good luck!
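If it helps, here is a rough, hedged sketch of a standalone install (the download URL and version below are illustrative only -- grab the current tarball from the download page above):

# Download and unpack a binary release (file name/version are examples only)
wget http://archive.apache.org/dist/incubator/zeppelin/0.5.6-incubating/zeppelin-0.5.6-incubating-bin-all.tgz
tar -xzf zeppelin-0.5.6-incubating-bin-all.tgz
cd zeppelin-0.5.6-incubating-bin-all

# Start the Zeppelin daemon, then browse to http://localhost:8080
bin/zeppelin-daemon.sh start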
05-18-2016
10:06 PM
Looks like the same question over at https://community.hortonworks.com/questions/33621/input-path-on-sandbox-for-loading-data-into-spark.html that @Joe Widen answered. Note my comment (and example) below that answer, where Joe also pointed out that each JSON object needs to be on a single line. Glad to see Joe got a "best answer" and I'd sure be appreciative of the same on this one. 😉
05-18-2016
02:03 PM
1 Kudo
Sounds like https://community.hortonworks.com/questions/33961/how-to-import-data-return-by-google-analytic-s-api.html was a repost of this earlier question. I provided my (more generic) answer over there, but maybe someone has a more specific response tied directly to Google Analytics and Hadoop. Good luck!
05-18-2016
02:01 PM
1 Kudo
Sure!! Just navigate over to http://hortonworks.com/products/sandbox/ to download the free Hortonworks Sandbox and to check out all the available tutorials listed there. I do assume you already know about this, so feel free to refine your question. 😉
05-18-2016
01:59 PM
1 Kudo
My thoughts on #1 & #2: Some googling shows there are folks out there with somewhat direct ways to open an HDFS file and write to it as you are getting the data from the external system (the Google API in this case). That said, I'd consider applying the KISS principle and have your Python program write the results into a local file so that when you are done (and you are sure you are done -- i.e. this helps prevent a half-baked file in HDFS), you simply use the hadoop fs -put command to drop the complete file exactly where you want it in HDFS.

As for #3: You have to create the table (external or not -- even "managed" tables can reside outside of /apps/hive/warehouse), since Hive is a layered-on ecosystem tool above base HDFS and the CREATE TABLE DDL command stores metadata about the logical table you want mapped to your data. The good news is that you can create that table before or after you load the data. Additionally, if you are going to continue to add net-new data to the table, you don't have to create it again. Good luck!
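As a minimal sketch of that flow (the file names, paths and columns below are all made up for illustration):

# Push the finished local file into HDFS only once it is complete
hadoop fs -mkdir -p /user/root/ga_results
hadoop fs -put ga_results.csv /user/root/ga_results/

# Map a Hive table over that directory; this DDL can be run before or after
# the data lands, and new files added to the directory later are picked up as-is
hive -e "CREATE EXTERNAL TABLE ga_results (
           page STRING,
           sessions INT,
           pageviews INT)
         ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
         LOCATION '/user/root/ga_results'"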
05-17-2016
08:51 PM
Good points; some of the "corner cases" with CSV files (especially those generated by tools like Excel) are discussed in https://martin.atlassian.net/wiki/x/WYBmAQ.
05-17-2016
12:58 AM
2 Kudos
Not sure what error you are getting (feel free to share some of the dataset and the error messages you received), but I'm wondering if you are accounting for the following warning called out in http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets: "Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. As a consequence, a regular multi-line JSON file will most often fail." I usually get something like the following when trying to use a multi-line file.

scala> val productsML = sqlContext.read.json("/tmp/hcc/products.json")
productsML: org.apache.spark.sql.DataFrame = [_corrupt_record: string]

That said, all seems to be working for me with a file like the following.

[root@sandbox ~]# hdfs dfs -cat /tmp/hcc/employees.json
{"id" : "1201", "name" : "satish", "age" : "25"}
{"id" : "1202", "name" : "krishna", "age" : "28"}
{"id" : "1203", "name" : "amith", "age" : "39"}
{"id" : "1204", "name" : "javed", "age" : "23"}
{"id" : "1205", "name" : "prudvi", "age" : "23"} As you can see by the two ways I read the JSON file below. SQL context available as sqlContext.
scala> val df1 = sqlContext.read.json("/tmp/hcc/employees.json")
df1: org.apache.spark.sql.DataFrame = [age: string, id: string, name: string]
scala> df1.printSchema()
root
|-- age: string (nullable = true)
|-- id: string (nullable = true)
|-- name: string (nullable = true)
scala> df1.show()
+---+----+-------+
|age| id| name|
+---+----+-------+
| 25|1201| satish|
| 28|1202|krishna|
| 39|1203| amith|
| 23|1204| javed|
| 23|1205| prudvi|
+---+----+-------+
scala> val df2 = sqlContext.read.format("json").option("samplingRatio", "1.0").load("/tmp/hcc/employees.json")
df2: org.apache.spark.sql.DataFrame = [age: string, id: string, name: string]
scala> df2.show()
+---+----+-------+
|age| id| name|
+---+----+-------+
| 25|1201| satish|
| 28|1202|krishna|
| 39|1203| amith|
| 23|1204| javed|
| 23|1205| prudvi|
+---+----+-------+

Again, if this doesn't help, feel free to share some more details. Good luck!
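If your source data really is a multi-line JSON document (for example, one big JSON array), one hedged workaround -- assuming a tool like jq is available on your edge node, and products.json is just an example file name -- is to flatten it to one object per line before loading it:

# jq's -c flag emits compact, single-line output, one array element per line
jq -c '.[]' products.json > products_flat.json
hdfs dfs -put products_flat.json /tmp/hcc/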
05-16-2016
11:17 PM
1 Kudo
As for ingestion: Pig is not really used for simple ingestion, and Sqoop is a great tool for importing data from an RDBMS, so loading "directly in(to) HDFS" seems like the logical answer. If your data is on an edge/ingestion node, you can easily script around the hadoop fs "put" command, https://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-common/FileSystemShell.html#put, which can be a simple, novel & effective way to get your data loaded into HDFS.

As for whether Spark is a good option for data transformation (I'm going to side-step the "segmentation" term, as it means a lot of different things to a lot of different people 😉), I'd say this is really a matter of style, experience and the results of POC testing based on your data & processing profile. So, yes, Spark could be an effective transformation engine.
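A hedged sketch of what that edge-node scripting could look like (the directory names are invented for illustration):

# Push any files that land in a local staging directory into HDFS,
# then move them aside so they are not loaded twice
for f in /data/staging/*.csv; do
  hadoop fs -put "$f" /user/root/landing/ && mv "$f" /data/staging/done/
done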
03-21-2016
01:12 AM
2 Kudos
I'm surely not going to give you the best answer on this one, but "Hadoop Streaming", as described at http://hadoop.apache.org/docs/current/hadoop-streaming/HadoopStreaming.html, is a way to run a MapReduce job that executes your Python code. In this case, you'll need to have Python installed on all the cluster nodes, but since you're starting out on the Sandbox that makes it easy (just one place!). Yes, you could also run a Python app that queries Hive, but only that query itself will be running in the cluster. In this case, you'll obviously just need Python wherever you are running it from.
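As a rough sketch of what submitting such a job looks like (the streaming jar path below is where HDP typically places it, and mapper.py/reducer.py are hypothetical scripts you would supply):

# Ship the Python scripts with the job and invoke them as mapper and reducer
hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar \
  -files mapper.py,reducer.py \
  -mapper "python mapper.py" \
  -reducer "python reducer.py" \
  -input /user/root/streaming_in \
  -output /user/root/streaming_out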