Member since: 09-23-2015
Posts: 42
Kudos Received: 91
Solutions: 8
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 1285 | 02-01-2016 08:56 PM
 | 2981 | 01-16-2016 12:40 PM
 | 6902 | 01-15-2016 01:14 PM
 | 5642 | 01-14-2016 09:37 PM
 | 7102 | 12-14-2015 01:02 PM
04-19-2016
07:13 PM
Ignore the comment about Phoenix. I see now that you are using MySQL, based on the "ENGINE=InnoDB".
04-19-2016
07:11 PM
1 Kudo
Chris, can you validate that your DBCPConnectionPool controller service is pointing to the appropriate database instance? The JSONToSQL processor will attempt a "describe" using the Connection Service, and this error is often the result of that Connection Service not pointing to the desired database. Or, if you are using a Phoenix table, be careful: the Phoenix JDBC driver is case sensitive, which can make things a little trickier.
04-06-2016
08:29 PM
3 Kudos
Since the number of salt buckets can only be set at table creation time, this can be a little tricky. It takes a small amount of foresight about how the table will be used, i.e. whether it will be more read-heavy or write-heavy. A neutral stance is to set the number of salt buckets to the number of HBase RegionServers in your cluster. If you anticipate heavy write loads, increase that to roughly {HBase RegionServer count * 1.2}, which raises the number of buckets by 20% and allows for a more evenly distributed write load. Setting the number of salt buckets too high, however, may reduce your flexibility when you perform range-based queries.
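For illustration, a hedged DDL sketch (the table, columns, and RegionServer count are made up): on a hypothetical cluster with 10 RegionServers and a write-heavy table, the rule of thumb above gives 10 * 1.2 = 12 buckets.
-- SALT_BUCKETS can only be declared here, at table creation time
CREATE TABLE METRICS (
    HOST VARCHAR NOT NULL,
    METRIC_TS TIMESTAMP NOT NULL,
    METRIC_VALUE DOUBLE,
    CONSTRAINT PK PRIMARY KEY (HOST, METRIC_TS)
) SALT_BUCKETS = 12;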
02-01-2016
08:56 PM
2 Kudos
Wes - I know you are asking about the REST API in your question, but it seems to me that this information would be better pulled from Flume's JMX MBeans. It sounds like you are looking for lower-level metrics such as memory used, CPU, etc.
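A hedged sketch of how that is typically exposed (the port number is arbitrary, and this line would go in flume-env.sh or the agent's startup options):
export JAVA_OPTS="-Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=5445 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false"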
01-16-2016
12:40 PM
2 Kudos
@Narendra Bidari clientId and groupId are not the same. clientId is a user-specified string value that is sent along with every request to help with tracing and debugging. groupId, on the other hand, is a unique identifier for a group of consumer processes. Since the Kafka read offset is stored in ZooKeeper per groupId, you don't start reading from the beginning of that topic on restart. This is also why you are able to read the entire topic when you change the topic name: no previous offset has been stored for it. Hope this helps.
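A minimal sketch of where the two settings live in the consumer configuration (the connection string and names here are just placeholders):
import java.util.Properties

val props = new Properties()
props.put("zookeeper.connect", "localhost:2181") // the old consumer stores offsets here, keyed by group.id
props.put("group.id", "invoice-loader")          // offsets are tracked per group.id, so a new group.id (or topic) starts fresh
props.put("client.id", "invoice-loader-debug-1") // purely an identifier attached to requests, for tracing/debugging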
01-15-2016
01:14 PM
2 Kudos
@Kausha Simpson ReplaceTextWithMapping works exactly like ReplaceText, with the exception that the ReplaceText property "Replacement Value" is defined in an external file. Unfortunately the format of that file is not very well documented. I have attached an example of that file and a sample workflow, but at a high level the mapping file is newline-delimited, one mapping per line, with a tab (\t) character separating the mapping key from its desired replacement value. Good luck and hope this information helps.
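For reference, a minimal sketch of that mapping file format (the keys and values are placeholders, and the whitespace between the columns is a single tab character):
oldValue1	newValue1
oldValue2	newValue2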
01-14-2016
09:37 PM
3 Kudos
Ashwin - You can start the spark shell by running /usr/hdp/current/spark-client/bin/spark-shell in the sandbox.
01-14-2016
05:14 PM
1 Kudo
I agree with @jpercivall and @mpayne that ReplaceText is the best way to go. I created a quick workflow that you can reference. It assumes an input of AABBBBCC, as you suggested. You can change the GetFile path, the PutFile path, and the regex in ReplaceText to test with your real data. fixedwidthexample.xml
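For the AABBBBCC sample, a hedged sketch of the kind of ReplaceText configuration involved (the exact expression in the attached template may differ, and property names vary slightly across NiFi versions):
Search regex:       (.{2})(.{4})(.{2})
Replacement value:  $1,$2,$3      (produces AA,BBBB,CC)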
12-14-2015
01:02 PM
Hey Divya, there are a couple of ways to do this, but the main flow is as follows.
Load/parse the data into DataFrames. It seems like you have already done this, but since you didn't pass along that snippet I'm just going to make something up. You did mention you were using the spark-csv package, so the example does the same.
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
val df = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true") // Use first line of all files as header
.option("inferSchema", "true") // Automatically infer data types
.load("cars.csv");
Write the DataFrame to the HDFS location where you plan to create the Hive external table, or to the directory of an existing Hive table.
df.select("year", "model").write
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .save("hdfs://hdfs_location/newcars.csv");
Create the external Hive table by creating a HiveContext.
val hiveSQLContext = new org.apache.spark.sql.hive.HiveContext(sc)
// Several other options can be passed in here for other formats, partitions, etc.
hiveSQLContext.sql("CREATE EXTERNAL TABLE cars(year INT, model STRING) STORED AS TEXTFILE LOCATION 'hdfs_location'");
Query the Hive table with whatever query you wish.
// Queries are expressed in HiveQL
hiveSQLContext.sql("SELECT * FROM cars").collect().foreach(println)
12-02-2015
03:28 PM
Hortonworks has a tutorial that shows how to configure Solr to store index files in HDFS. Since HDFS is already a fault-tolerant file system, does that mean that with this approach we can keep a replication factor of 1 for any collections (shards) we create? It sounds like a lot of redundancy to keep the default HDFS replication factor of 3 plus Solr replication on top of that.
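For concreteness, the kind of collection creation I have in mind (the host, collection name, and shard count are just placeholders) looks like:
curl 'http://solr-host:8983/solr/admin/collections?action=CREATE&name=my_collection&numShards=2&replicationFactor=1'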
Labels:
- Apache Solr