Created on 08-01-2023 03:06 AM - edited 08-02-2023 04:12 AM
Apache Solr (which stands for Searching On Lucene w/ Replication) is the popular, blazing-fast, open-source enterprise search platform built on Apache Lucene. It is designed to provide powerful full-text search, faceted search, and indexing capabilities that enable fast and accurate search functionality for many types of data.
Solr is the second-most popular enterprise search engine after Elasticsearch.
Written in Java, Solr provides RESTful XML/HTTP and JSON APIs as well as client libraries for many programming languages such as Java, Python, Ruby, C#, and PHP, and is used to build search-based and big data analytics applications for websites, databases, files, and more.
Solr is often used as a search engine for applications, websites, and enterprise systems that require robust search capabilities. It can handle a wide range of data formats, including text, XML, JSON, and more. Solr offers advanced features like distributed searching, fault tolerance, near-real-time indexing, and high availability.
Key features of Apache Solr include:
To search a document, Apache Solr performs the following operations in sequence:
The key terms associated with Solr are as follows:
Core = an instance of Lucene Index + Solr configuration
Refer to the following article for Solr server tuning and additional tuning resources.
The Spark Solr Connector is a library that allows seamless integration between Apache Spark and Apache Solr, enabling you to read data from Solr into Spark and write data from Spark into Solr. It provides a convenient way to leverage the power of Spark's distributed processing capabilities with Solr's indexing and querying capabilities.
The Spark Solr Connector provides more advanced functionalities for working with Solr and Spark, such as query pushdown, schema inference, and custom Solr field mapping.
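For example, query pushdown means that a filter applied to a Solr-backed DataFrame is translated into a Solr query, so the filtering happens in Solr rather than in Spark. A minimal sketch of the idea (the zkhost and collection values below are placeholders; full, working examples appear later in this article):

// Filters on a Solr-backed DataFrame are pushed down to Solr as queries, so Solr
// returns only the matching documents instead of Spark filtering them afterwards.
// The zkhost and collection values below are placeholders.
val options = Map("zkhost" -> "zk01.example.com:2181/solr-infra", "collection" -> "sample-collection")
val df = spark.read.format("solr").options(options).load()
df.filter("age > 25").show()  // the age filter is evaluated by Solr, not by Spark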
| Service | Protocol | Port | Access | Purpose |
| --- | --- | --- | --- | --- |
| Solr search/update | HTTP | 8983 | External | All Solr-specific actions such as query and update |
| Solr Admin | HTTP | 8984 | Internal | Administrative use |
| CDP Version | Spark2 Supported | Spark3 Supported |
| --- | --- | --- |
| CDP 7.1.6 | No | No |
| CDP 7.1.7 | Yes | No |
| CDP 7.1.8 | Yes | No |
| CDP 7.1.9 | Yes | Yes |
The solrctl utility is a wrapper shell script included with Cloudera Search for managing collections, instance directories, configs, Apache Sentry permissions, and more.
Config templates are immutable configuration templates that you can use as a starting point when creating configs for Solr collections. Cloudera Search contains templates by default and you can define new ones based on existing configs.
Configs can be declared as immutable, which means they cannot be deleted or have their Schema updated by the Schema API. Immutable configs are uneditable config templates that are the basis for additional configs. After a config is made immutable, you cannot change it back without accessing ZooKeeper directly as the solr (or solr@EXAMPLE.COM principal, if you are using Kerberos) superuser.
Solr provides a set of immutable config templates. These templates are only available after Solr initialization, so templates are not available in upgrades until after Solr is initialized or re-initialized.
Templates include:
| Template Name | Supports Schema API | Uses Schemaless Solr |
| --- | --- | --- |
| managedTemplate | Yes | No |
| schemalessTemplate | Yes | Yes |
Config templates are managed using the solrctl config command.
For example:
To create a new config based on the managedTemplate template:
solrctl config --create [***NEW CONFIG***] managedTemplate -p immutable=false
Replace [NEW CONFIG] with the name of the config you want to create.
To create a new template (immutable config) from an existing config:
solrctl config --create [***NEW TEMPLATE***] [***EXISTING CONFIG***] -p immutable=true
Replace [NEW TEMPLATE] with a name for the new template you want to create and [EXISTING CONFIG] with the name of the existing config that you want to base [NEW TEMPLATE] on.
solrctl config --create [***NEW CONFIG***] [***TEMPLATE***] [-p [***NAME***]=[***VALUE***]]
where [***NEW CONFIG***] is the name of the config to create, [***TEMPLATE***] is the existing template or config it is based on, and the optional -p [***NAME***]=[***VALUE***] flag overrides a template property such as immutable.
For example:
solrctl config --create testConfig managedTemplate -p immutable=false
To list all available config templates:
solrctl instancedir --list
solrctl collection --create [***COLLECTION NAME***] -s [***NUMBER OF SHARDS***] -c [***COLLECTION CONFIGURATION***]
where [***COLLECTION NAME***] is the name of the collection, -s sets the number of shards, and -c specifies the config on which the collection is based.
For example:
Create a collection with 2 shards:
solrctl collection --create testcollection -s 2 -c testConfig
To list all available collections:
solrctl collection --list
If you are using Kerberos, obtain a Kerberos ticket first:
kinit solradmin@EXAMPLE.COM
Replace EXAMPLE.COM with your Kerberos realm name.
solrctl config --create sample-config managedTemplate -p immutable=false
solrctl collection --create sample-collection -s 1 -c sample-config
Log in to the host where the Solr instance is running:
cat /etc/solr/conf/solr-env.sh
export SOLR_ZK_ENSEMBLE=zk01.example.com:2181,zk02.example.com:2181,zk03.example.com:2181/solr-infra
export SOLR_HADOOP_DEPENDENCY_FS_TYPE=shared
Note: Make sure that the SOLR_ZK_ENSEMBLE environment variable is set in the above configuration file.
To integrate Spark with Solr, you need to use the spark-solr library. You can specify this library using --jars or --packages options when launching Spark.
Example(s):
Using --jars option:
spark-shell \
--jars /opt/cloudera/parcels/CDH/jars/spark-solr-3.9.0.7.1.8.3-363-shaded.jar
Using --packages option:
spark-shell \
--packages com.lucidworks.spark:spark-solr:3.9.0.7.1.8.15-5 \
--repositories https://repository.cloudera.com/artifactory/cloudera-repos/
In the following example(s), I have used the --jars option.
ls /opt/cloudera/parcels/CDH/jars/*spark-solr*
For example, if the JAR file is located at /opt/cloudera/parcels/CDH/jars/spark-solr-3.9.0.7.1.8.3-363-shaded.jar, note down the path.
spark-shell \
--deploy-mode client \
--jars /opt/cloudera/parcels/CDH/jars/spark-solr-3.9.0.7.1.8.3-363-shaded.jar
Replace /opt/cloudera/parcels/CDH/jars/spark-solr-3.9.0.7.1.8.3-363-shaded.jar with the actual path to the spark-solr JAR file obtained in Step 1.
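Once the shell starts, you can optionally confirm that the connector is available before going further. The check below is only a sanity test: solr.DefaultSource is the data source class registered by the connector (it also appears in the stack traces later in this article).

// Should return the class; a ClassNotFoundException means the spark-solr JAR is not on the classpath.
Class.forName("solr.DefaultSource")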
Step 1: Create a JAAS file
cat /tmp/solr-client-jaas.conf
Client {
com.sun.security.auth.module.Krb5LoginModule required
doNotPrompt=true
useKeyTab=true
storeKey=true
useTicketCache=false
keyTab="sampleuser.keytab"
principal="sampleuser@EXAMPLE.COM";
};
Replace the values of keyTab and principal with your specific configuration.
Step 2: Find the spark-solr JAR
Use the following command to locate the spark-solr JAR file:
ls /opt/cloudera/parcels/CDH/jars/*spark-solr*
For example, if the JAR file is located at /opt/cloudera/parcels/CDH/jars/spark-solr-3.9.0.7.1.8.3-363-shaded.jar, note down the path.
Step 3: Launch the spark-shell
Before running the following spark-shell command, replace the keytab, principal, and JAR path (collected in Step 2) with your own values:
spark-shell \
--deploy-mode client \
--jars /opt/cloudera/parcels/CDH/jars/spark-solr-3.9.0.7.1.8.3-363-shaded.jar \
--principal sampleuser@EXAMPLE.COM \
--keytab sampleuser.keytab \
--files /tmp/solr-client-jaas.conf#solr-client-jaas.conf,sampleuser.keytab \
--driver-java-options "-Djava.security.auth.login.config=/tmp/solr-client-jaas.conf -Djavax.security.auth.useSubjectCredsOnly=false" \
--conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=solr-client-jaas.conf -Djavax.security.auth.useSubjectCredsOnly=false"
Step 1: Create a JAAS file
cat /tmp/solr-client-jaas.conf
Client {
com.sun.security.auth.module.Krb5LoginModule required
doNotPrompt=true
useKeyTab=true
storeKey=true
useTicketCache=false
keyTab="sampleuser.keytab"
principal="sampleuser@EXAMPLE.COM";
};
Replace the values of keyTab and principal with your specific configuration.
Step 2: Find the spark-solr JAR
Use the following command to locate the spark-solr JAR file:
ls /opt/cloudera/parcels/CDH/jars/*spark-solr*
For example, if the JAR file is located at /opt/cloudera/parcels/CDH/jars/spark-solr-3.9.0.7.1.8.3-363-shaded.jar, note down the path.
Step 3: Launch the spark-shell
Before running the following spark-shell command, replace the keytab, principal, and JAR path (collected in Step 2), as well as the javax.net.ssl.trustStore file and javax.net.ssl.trustStorePassword value in both the driver and executor Java options:
spark-shell \
--deploy-mode client \
--jars /opt/cloudera/parcels/CDH/jars/spark-solr-3.9.0.7.1.8.3-363-shaded.jar \
--principal sampleuser@EXAMPLE.COM \
--keytab sampleuser.keytab \
--files /tmp/solr-client-jaas.conf#solr-client-jaas.conf,sampleuser.keytab \
--driver-java-options "-Djava.security.auth.login.config=/tmp/solr-client-jaas.conf -Djavax.security.auth.useSubjectCredsOnly=false -Djavax.net.ssl.trustStore=/var/lib/cloudera-scm-agent/agent-cert/cm-auto-global_truststore.jks -Djavax.net.ssl.trustStorePassword='changeit'" \
--conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=solr-client-jaas.conf -Djavax.security.auth.useSubjectCredsOnly=false -Djavax.net.ssl.trustStore=/var/lib/cloudera-scm-agent/agent-cert/cm-auto-global_truststore.jks -Djavax.net.ssl.trustStorePassword='changeit'"
Replace the collectionName and zkHost values with the details collected in the Collecting the Solr ZooKeeper details step.
case class Employee(id:Long, name: String, age: Short, salary: Float)
val employeeDF = Seq(
Employee(1L, "Ranga", 34, 15000.5f),
Employee(2L, "Nishanth", 5, 35000.5f),
Employee(3L, "Meena", 30, 25000.5f)
).toDF()
val collectionName = "sample-collection"
val zkHost = "zk01.example.com:2181,zk02.example.com:2181,zk03.example.com:2181/solr-infra"
val solrOptions = Map("zkhost" -> zkHost, "collection" -> collectionName, "commitWithin" -> "1000")
// Write data to Solr
employeeDF.write.format("solr").options(solrOptions).mode("overwrite").save()
Write Optimization Parameters:
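The connector accepts several write-side options in addition to zkhost and collection. As a hedged sketch (batch_size and gen_uniq_key follow the spark-solr project documentation, so verify the exact option names against the connector version shipped with your CDP release):

// Example write options; verify the option names against your spark-solr version.
val writeOptions = Map(
  "zkhost" -> zkHost,
  "collection" -> collectionName,
  "commitWithin" -> "1000",  // as in the example above, commit within ~1 second
  "batch_size" -> "1000",    // number of documents sent to Solr per batch
  "gen_uniq_key" -> "true"   // generate a unique key if the documents have no id field
)
employeeDF.write.format("solr").options(writeOptions).mode("overwrite").save()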
Replace the collectionName and zkHost values with the details collected in the Collecting the Solr ZooKeeper details step.
val collectionName = "sample-collection"
val zkHost = "zk01.example.com:2181,zk02.example.com:2181,zk03.example.com:2181/solr-infra"
val solrOptions = Map("zkhost" -> zkHost, "collection" -> collectionName)
// Read data from Solr
val df = spark.read.format("solr").options(solrOptions).load()
df.show()
Read Optimization Parameters:
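Similarly, several read-side options control how data is pulled from Solr. As a hedged sketch (query, fields, rows, and splits follow the spark-solr project documentation, so verify the exact option names against the connector version shipped with your CDP release):

// Example read options; verify the option names against your spark-solr version.
val readOptions = Map(
  "zkhost" -> zkHost,
  "collection" -> collectionName,
  "query" -> "age:[25 TO *]",  // push the filter down to Solr
  "fields" -> "id,name,age",   // return only these fields from Solr
  "rows" -> "10000",           // page size used when streaming results from Solr
  "splits" -> "true"           // split each shard into multiple Spark partitions
)
val filteredDF = spark.read.format("solr").options(readOptions).load()
filteredDF.show()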
vi /tmp/spark_solr_connector_app.py
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType, ShortType, FloatType

def main():
    spark = SparkSession.builder.appName("Spark Solr Connector App").getOrCreate()

    data = [(1, "Ranga", 34, 15000.5), (2, "Nishanth", 5, 35000.5), (3, "Meena", 30, 25000.5)]
    schema = StructType([
        StructField("id", LongType(), True),
        StructField("name", StringType(), True),
        StructField("age", ShortType(), True),
        StructField("salary", FloatType(), True)
    ])
    employeeDF = spark.createDataFrame(data=data, schema=schema)

    zkHost = "zk01.example.com:2181,zk02.example.com:2181,zk03.example.com:2181/solr-infra"
    collectionName = "sample-collection"
    solrOptions = {"zkhost": zkHost, "collection": collectionName}

    # Write data to Solr
    employeeDF.write.format("solr").options(**solrOptions).mode("overwrite").save()

    # Read data from Solr
    df = spark.read.format("solr").options(**solrOptions).load()
    df.show()

    # Filter the data
    df.filter("age > 25").show()

    spark.stop()

if __name__ == "__main__":
    main()
Note: Replace the collectionName and zkHost values with the details collected in the Collecting the Solr ZooKeeper details step.
Client Mode:
spark-submit \
--master yarn \
--deploy-mode client \
--jars /opt/cloudera/parcels/CDH/jars/spark-solr-3.9.0.7.1.8.3-363-shaded.jar \
/tmp/spark_solr_connector_app.py
Cluster Mode:
spark-submit \
--master yarn \
--deploy-mode cluster \
--jars /opt/cloudera/parcels/CDH/jars/spark-solr-3.9.0.7.1.8.3-363-shaded.jar \
/tmp/spark_solr_connector_app.py
{
  "responseHeader": {
    "status": 400,
    "QTime": 238
  },
  "Operation create caused exception:": "org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Cannot create collection testcollection. Value of maxShardsPerNode is 1, and the number of nodes currently live or live and part of your createNodeSet is 1. This allows a maximum of 1 to be created. Value of numShards is 2, value of nrtReplicas is 1, value of tlogReplicas is 0 and value of pullReplicas is 0. This requires 2 shards to be created (higher than the allowed number)",
  "exception": {
    "msg": "Cannot create collection testcollection. Value of maxShardsPerNode is 1, and the number of nodes currently live or live and part of your createNodeSet is 1. This allows a maximum of 1 to be created. Value of numShards is 2, value of nrtReplicas is 1, value of tlogReplicas is 0 and value of pullReplicas is 0. This requires 2 shards to be created (higher than the allowed number)",
    "rspCode": 400
  },
  "error": {
    "metadata": [
      "error-class", "org.apache.solr.common.SolrException",
      "root-error-class", "org.apache.solr.common.SolrException"
    ],
    "msg": "Cannot create collection testcollection. Value of maxShardsPerNode is 1, and the number of nodes currently live or live and part of your createNodeSet is 1. This allows a maximum of 1 to be created. Value of numShards is 2, value of nrtReplicas is 1, value of tlogReplicas is 0 and value of pullReplicas is 0. This requires 2 shards to be created (higher than the allowed number)",
    "code": 400
  }
}
Problem:
The error message org.apache.solr.common.SolrException: Cannot create collection testcollection. Value of maxShardsPerNode is 1, and the number of nodes currently live or live and part of your createNodeSet is 1. This allows a maximum of 1 to be created. indicates that the create request asks for more shards (numShards is 2) than the cluster can host: only one Solr node is live and maxShardsPerNode is set to 1, so at most one shard can be created.
Solution:
To resolve this issue, we have a few options:
Remove Existing Collection: If you no longer need the existing collection with the same name, you can remove it from the Solr cluster. You can use the Solr Admin UI or the Solr API to delete the existing collection. After removing the existing collection, try creating the new collection again.
Choose a Different Collection Name: If you want to keep the existing collection and create a new collection with a similar name, choose a different name for the new collection. Make sure the new collection name is unique and doesn't conflict with any existing collections in the Solr cluster.
Increase maxShardsPerNode: If you want a single node to host more than one shard of the collection, increase the value of maxShardsPerNode. This is a collection-level setting applied when the collection is created (for example, via the maxShardsPerNode parameter of the Collections API CREATE command), so re-create the collection with the higher value.
Add Additional Nodes: If you need more shards than the current nodes can host given maxShardsPerNode, add more Solr nodes to your cluster. With more live nodes, the requested number of shards can be created.
com.google.common.util.concurrent.UncheckedExecutionException: org.apache.solr.common.SolrException: Cannot connect to cluster at hostname:2181: cluster not found/not ready
at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2051)
at com.google.common.cache.LocalCache.get(LocalCache.java:3953)
at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3976)
at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4960)
at com.lucidworks.spark.util.SolrSupport$.getCachedCloudClient(SolrSupport.scala:250)
at com.lucidworks.spark.util.SolrQuerySupport$.getUniqueKey(SolrQuerySupport.scala:107)
at com.lucidworks.spark.rdd.SolrRDD.<init>(SolrRDD.scala:39)
at com.lucidworks.spark.rdd.SelectSolrRDD.<init>(SelectSolrRDD.scala:29)
... 49 elided
Caused by: org.apache.solr.common.SolrException: Cannot connect to cluster at hostname:2181: cluster not found/not ready
at org.apache.solr.common.cloud.ZkStateReader.createClusterStateWatchersAndUpdate(ZkStateReader.java:508)
at org.apache.solr.client.solrj.impl.ZkClientClusterStateProvider.getZkStateReader(ZkClientClusterStateProvider.java:176)
at org.apache.solr.client.solrj.impl.ZkClientClusterStateProvider.connect(ZkClientClusterStateProvider.java:160)
at org.apache.solr.client.solrj.impl.BaseCloudSolrClient.connect(BaseCloudSolrClient.java:335)
at com.lucidworks.spark.util.SolrSupport$.getSolrCloudClient(SolrSupport.scala:223)
at com.lucidworks.spark.util.SolrSupport$.getNewSolrCloudClient(SolrSupport.scala:242)
at com.lucidworks.spark.util.CacheCloudSolrClient$$anon$1.load(SolrSupport.scala:38)
at com.lucidworks.spark.util.CacheCloudSolrClient$$anon$1.load(SolrSupport.scala:36)
at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3529)
at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2278)
at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2155)
at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2045)
... 56 more
Problem:
The error message org.apache.solr.common.SolrException: Cannot connect to cluster at hostname:2181: cluster not found/not ready indicates that the Solr connector is unable to connect to the Solr cluster specified by the provided hostname and port.
val options = Map("zkhost" -> "localhost:2181", "collection" -> "testcollection")
Solution:
Provide the full ZooKeeper connection string, including the znode chroot (for example /solr-infra), collected from the /etc/solr/conf/solr-env.sh file.
export SOLR_ZK_ENSEMBLE=localhost:2181/solr-infra
export SOLR_HADOOP_DEPENDENCY_FS_TYPE=shared
For example,
val options = Map("zkhost" -> "localhost:2181/solr-infra", "collection" -> "testcollection")
py4j.protocol.Py4JJavaError: An error occurred while calling o85.save.
: java.lang.NoClassDefFoundError: scala/Product$class
at com.lucidworks.spark.util.SolrSupport$CloudClientParams.<init>(SolrSupport.scala:184)
at com.lucidworks.spark.util.SolrSupport$.getCachedCloudClient(SolrSupport.scala:250)
at com.lucidworks.spark.util.SolrSupport$.getSolrBaseUrl(SolrSupport.scala:254)
at com.lucidworks.spark.SolrRelation.insert(SolrRelation.scala:655)
at solr.DefaultSource.createRelation(DefaultSource.scala:29)
at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:111)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
Problem:
The error java.lang.NoClassDefFoundError: scala/Product$class typically occurs when there is a mismatch between the Scala versions of Spark and of the library being used: Spark 3 is built with Scala 2.12, while Spark 2 is built with Scala 2.11.
In this case, the error occurred because a Spark 3 application was run with the Spark 2 spark-solr JAR.
Solution:
Use the Spark3-supported spark-solr connector jar.
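If you are not sure which Scala version your Spark installation was built with, you can check it from the shell before choosing the JAR. For example, running the following inside spark-shell prints the Scala version (2.11.x for Spark 2, 2.12.x for Spark 3):

// Prints the Scala version of the running shell, for example "version 2.12.10"
scala.util.Properties.versionString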
java.lang.RuntimeException: org.apache.solr.client.solrj.SolrServerException: IOException occurred when talking to server at: https://localhost:8995/solr
at solr.DefaultSource.createRelation(DefaultSource.scala:33)
at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:146)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:142)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:170)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:167)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:142)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:93)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:91)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:704)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:704)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:80)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:127)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:75)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:704)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:305)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:291)
... 49 elided
Caused by: org.apache.solr.client.solrj.SolrServerException: IOException occurred when talking to server at: https://localhost:8995/solr
at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:682)
at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:265)
at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:248)
at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:211)
at org.apache.solr.client.solrj.SolrClient.query(SolrClient.java:1003)
at org.apache.solr.client.solrj.SolrClient.query(SolrClient.java:1018)
at com.lucidworks.spark.util.SolrSupport$.getSolrVersion(SolrSupport.scala:88)
at com.lucidworks.spark.SolrRelation.solrVersion$lzycompute(SolrRelation.scala:67)
at com.lucidworks.spark.SolrRelation.solrVersion(SolrRelation.scala:67)
at com.lucidworks.spark.SolrRelation.insert(SolrRelation.scala:659)
at solr.DefaultSource.createRelation(DefaultSource.scala:29)
... 69 more
Problem:
The above exception occurs in a Kerberized (and SSL-enabled) environment when the Kerberos, JAAS, and truststore parameters are not specified correctly in the spark-submit command.
Solution:
cat /tmp/solr-client-jaas.conf
Client {
com.sun.security.auth.module.Krb5LoginModule required
doNotPrompt=true
useKeyTab=true
storeKey=true
useTicketCache=false
keyTab="sampleuser.keytab"
principal="sampleuser@EXAMPLE.COM";
};
Client mode:
spark-submit \
--deploy-mode client \
--jars /opt/cloudera/parcels/CDH/jars/spark-solr-3.9.0.7.1.8.3-363-shaded.jar \
--principal sampleuser@EXAMPLE.COM \
--keytab sampleuser.keytab \
--files /tmp/solr-client-jaas.conf#solr-client-jaas.conf,sampleuser.keytab \
--driver-java-options "-Djava.security.auth.login.config=/tmp/solr-client-jaas.conf -Djavax.security.auth.useSubjectCredsOnly=false -Djavax.net.ssl.trustStore=/var/lib/cloudera-scm-agent/agent-cert/cm-auto-global_truststore.jks -Djavax.net.ssl.trustStorePassword='changeit'" \
--conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=solr-client-jaas.conf -Djavax.security.auth.useSubjectCredsOnly=false -Djavax.net.ssl.trustStore=/var/lib/cloudera-scm-agent/agent-cert/cm-auto-global_truststore.jks -Djavax.net.ssl.trustStorePassword='changeit'" \
/tmp/spark_solr_connector_app.py
Cluster mode:
spark-submit \
--deploy-mode cluster \
--jars /opt/cloudera/parcels/CDH/jars/spark-solr-3.9.0.7.1.8.3-363-shaded.jar \
--principal sampleuser@EXAMPLE.COM \
--keytab sampleuser1.keytab \
--files /tmp/solr-client-jaas.conf#solr-client-jaas.conf,sampleuser.keytab \
--conf "spark.driver.extraJavaOptions=-Djava.security.auth.login.config=solr-client-jaas.conf -Djavax.security.auth.useSubjectCredsOnly=false -Djavax.net.ssl.trustStore=/var/lib/cloudera-scm-agent/agent-cert/cm-auto-global_truststore.jks -Djavax.net.ssl.trustStorePassword='changeit'" \
--conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=solr-client-jaas.conf -Djavax.security.auth.useSubjectCredsOnly=false -Djavax.net.ssl.trustStore=/var/lib/cloudera-scm-agent/agent-cert/cm-auto-global_truststore.jks -Djavax.net.ssl.trustStorePassword='changeit'" \
/tmp/spark_solr_connector_app.py
In cluster mode, the keytab passed with --keytab cannot have the same file name as a file passed with --files, so create a copy of the keytab with a different name (for example, sampleuser1.keytab) and pass that to --keytab in spark-submit.
Problem:
After saving data to Solr, if you try to read it back immediately, you can see the above exception because the newly indexed documents have not yet been committed to the collection.
Solution:
1. Commit the changes explicitly after saving the data to Solr, as shown below.
// Write data to Solr
employeeDF.write.format("solr").options(solrOptions).mode("overwrite").save()
// Commit the changes in Solr so the new documents become searchable immediately.
// Note: use the Solr HTTP base URL here (example host shown), not the ZooKeeper ensemble.
import org.apache.solr.client.solrj.impl.HttpSolrClient
val solrClient = new HttpSolrClient.Builder("http://solr01.example.com:8983/solr").build()
solrClient.commit(collectionName)
solrClient.close()
2. Add the "commitWithin" with less value, so the documents are committed to the Solr collection after being indexed. It controls the interval at which the changes made to the collection are made searchable.
"commitWithin" -> "500"
3. After verifying that the collection has data and the schema is correctly configured, you can retry the query with the Spark-Solr Connector code.
Created on 09-25-2024 11:00 PM
Tried all the above steps but the program gets stuck while reading the data from solr
Created on 10-25-2024 06:57 AM
We’re attempting to run a basic Spark job to read/write data from Solr, using the following versions:
CDP version: 7.1.9
Spark: Spark3
Solr: 8.11
Spark-Solr Connector: opt/cloudera/parcels/SPARK3/lib/spark3/spark-solr/spark-solr-3.9.3000.3.3.7191000.0-78-shaded.jar
When we attempt to interact with Solr through Spark, the execution stalls indefinitely without any errors or results (similar to the issue which @hadoopranger mentioned). Other components, such as Hive and HBase, integrate smoothly with Spark, and we are using a valid Kerberos ticket that successfully connects with other Hadoop components. Additionally, testing REST API calls via both curl and Python's requests library confirms we can access Solr and retrieve data using the Kerberos ticket.
The issue seems isolated to Solr’s connection with Spark, as we have had no problems with other systems. Has anyone encountered a similar issue or have suggestions for potential solutions? @RangaReddy @hadoopranger