Member since: 09-23-2015
Posts: 42
Kudos Received: 91
Solutions: 8
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 1285 | 02-01-2016 08:56 PM
 | 2981 | 01-16-2016 12:40 PM
 | 6902 | 01-15-2016 01:14 PM
 | 5642 | 01-14-2016 09:37 PM
 | 7102 | 12-14-2015 01:02 PM
04-19-2016
07:13 PM
Ignore the comment about Phoenix. I see now that you are using MySQL, based on the "ENGINE=InnoDB".
04-19-2016
07:11 PM
1 Kudo
Chris, can you validate that your DBCPConnectionPool controller service is pointing to the appropriate database instance? The JSONToSQL processor will attempt a "describe" using the Connection Service, and this error is often the result of that Connection Service not pointing to the desired database. Or, if you are using a Phoenix table, be careful: the Phoenix JDBC driver is case sensitive, which can make things a little trickier.
04-06-2016
08:29 PM
3 Kudos
Since the number of salt buckets can only be set at table creation time, this can be a little tricky. It takes a small amount of foresight about how the table will be used, i.e. whether it will be more read-heavy or write-heavy. A neutral stance is to set the number of salt buckets to the number of HBase RegionServers in your cluster. If you anticipate heavy write loads, increase that to roughly {HBase RegionServer count * 1.2}, which raises the number of buckets by 20% and allows for a more evenly distributed write load. Setting the number of salt buckets too high, however, may reduce your flexibility when you perform range-based queries.
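For illustration, a hedged DDL sketch (the table, columns, and RegionServer count are made up): on a hypothetical cluster with 10 RegionServers and a write-heavy table, the rule of thumb above gives 10 * 1.2 = 12 buckets.
-- SALT_BUCKETS can only be declared here, at table creation time
CREATE TABLE METRICS (
    HOST VARCHAR NOT NULL,
    METRIC_TS TIMESTAMP NOT NULL,
    METRIC_VALUE DOUBLE,
    CONSTRAINT PK PRIMARY KEY (HOST, METRIC_TS)
) SALT_BUCKETS = 12;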
02-01-2016
08:56 PM
2 Kudos
Wes - I know you are asking about the REST API in your question, but it seems to me that this information would be better pulled from Flume's JMX MBeans. It sounds like you are looking for lower-level metrics such as memory used, CPU, etc.
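A hedged sketch of how that is typically exposed (the port number is arbitrary, and this line would go in flume-env.sh or the agent's startup options):
export JAVA_OPTS="-Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=5445 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false"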
01-16-2016
12:40 PM
2 Kudos
@Narendra Bidari clientId and groupId are not the same. clientId is a user-specified string value that is sent along with every request to help with tracing and debugging. groupId, on the other hand, is a unique identifier for a group of consumer processes. Since the Kafka read offset is stored in ZooKeeper per groupId, you don't start reading from the beginning of that topic on restart. This is also why you are able to read the entire topic when you change the topic name: no previous offset has been stored for it. Hope this helps.
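A minimal sketch of where the two settings live in the consumer configuration (the connection string and names here are just placeholders):
import java.util.Properties

val props = new Properties()
props.put("zookeeper.connect", "localhost:2181") // the old consumer stores offsets here, keyed by group.id
props.put("group.id", "invoice-loader")          // offsets are tracked per group.id, so a new group.id (or topic) starts fresh
props.put("client.id", "invoice-loader-debug-1") // purely an identifier attached to requests, for tracing/debugging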
01-15-2016
01:14 PM
2 Kudos
@Kausha Simpson ReplaceTextWithMapping works exactly like ReplaceText, with the exception that the ReplaceText property "Replacement Value" is defined in an external file. Unfortunately the format of that file is not very well documented. I have attached an example of that file and a sample workflow, but at a high level the mapping file is newline-delimited, one mapping per line, with a tab (\t) character separating the mapping key from its desired replacement value. Good luck and hope this information helps.
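For reference, a minimal sketch of that mapping file format (the keys and values are placeholders, and the whitespace between the columns is a single tab character):
oldValue1	newValue1
oldValue2	newValue2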
01-14-2016
09:37 PM
3 Kudos
Ashwin - You can start the spark shell by running /usr/hdp/current/spark-client/bin/spark-shell in the sandbox.
01-14-2016
05:14 PM
1 Kudo
I agree with @jpercivall and @mpayne that ReplaceText is the best way to go. I created a quick workflow that you can reference. It assumes an input of AABBBBCC, as you suggested. You can change the GetFile path, the PutFile path, and the regex in ReplaceText to test with your real data. fixedwidthexample.xml
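For the AABBBBCC sample, a hedged sketch of the kind of ReplaceText configuration involved (the exact expression in the attached template may differ, and property names vary slightly across NiFi versions):
Search regex:       (.{2})(.{4})(.{2})
Replacement value:  $1,$2,$3      (produces AA,BBBB,CC)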
12-14-2015
01:02 PM
Hey Divya, there are a couple of ways to do this, but the main flow is as follows.
Load/parse the data into DataFrames. It seems like you have already done this, but since you didn't pass along that snippet I'm just going to make something up. You did mention you were using the spark-csv package, so the example does the same.
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
val df = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true") // Use first line of all files as header
.option("inferSchema", "true") // Automatically infer data types
.load("cars.csv");
Write the DataFrame to the HDFS location where you plan to create the Hive external table, or to the directory of an existing Hive table.
df.select("year", "model").write
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .save("hdfs://hdfs_location/newcars.csv");
Create the external Hive table by creating a HiveContext.
val hiveSQLContext = new org.apache.spark.sql.hive.HiveContext(sc)
// Several other options can be passed in here for other formats, partitions, etc.
hiveSQLContext.sql("CREATE EXTERNAL TABLE cars(year INT, model STRING) STORED AS TEXTFILE LOCATION 'hdfs_location'");
Query the Hive table with whatever query you wish.
// Queries are expressed in HiveQL
hiveSQLContext.sql("SELECT * FROM cars").collect().foreach(println)
12-02-2015
03:28 PM
Hortonworks has a tutorial that shows how to configure Solr to store index files in HDFS. Since HDFS is already a fault-tolerant file system, does that mean that with this approach we can keep a replication factor of 1 for any collections (shards) we create? It sounds like a lot of redundancy to keep the default HDFS replication factor of 3 plus Solr replication on top of that.
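For concreteness, the kind of collection creation I have in mind (the host, collection name, and shard count are just placeholders) looks like:
curl 'http://solr-host:8983/solr/admin/collections?action=CREATE&name=my_collection&numShards=2&replicationFactor=1'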
Labels:
- Apache Solr