
Table of Contents:

 

  1. Apache Solr
  2. Spark Solr Connector
  3. Spark Solr Collection Introduction
  4. Spark Solr Integration
  5. Troubleshooting

1. Apache Solr

1.1 Solr Introduction

 

Apache Solr (short for Searching On Lucene With Replication) is a popular, blazing-fast, open-source enterprise search platform built on Apache Lucene. It is designed to provide powerful full-text search, faceted search, and indexing capabilities that enable fast and accurate search across many types of data.

 

Solr is the second-most popular enterprise search engine after Elasticsearch.

 

Written in Java, Solr has RESTful XML/HTTP and JSON APIs and client libraries for many programming languages such as Java, Python, Ruby, C#, and PHP, which are used to build search-based and big data analytics applications for websites, databases, files, and more.

 

Solr is often used as a search engine for applications, websites, and enterprise systems that require robust search capabilities. It can handle a wide range of data formats, including text, XML, JSON, and more. Solr offers advanced features like distributed searching, fault tolerance, near-real-time indexing, and high availability.

 

1.2 Solr Features

 

Key features of Apache Solr include:

 

  • Full-Text Search Capabilities: Solr provides advanced full-text search capabilities, allowing users to search for relevant documents based on keywords, phrases, or complex queries. It supports stemming, fuzzy search, wildcard search, phrase matching, and more.
  • Faceted search: Solr offers faceted search or guided navigation, allowing users to refine search results based on specific criteria or filters. It enables users to drill down into search results using facets, which are pre-computed categories or attributes associated with the indexed data.
  • Indexing and document processing: Solr supports efficient indexing of large volumes of data. It provides flexible document processing and ingestion capabilities, allowing you to ingest data from various sources, transform it, and index it for fast and accurate retrieval.
  • High Scalability and distributed search: Solr is designed to scale horizontally and can distribute data across multiple nodes for increased performance and fault tolerance. It supports distributed searching, where queries are parallelized and executed across multiple shards or replicas.
  • Near-real-time indexing: Solr supports near-real-time indexing, which means that indexed data becomes searchable almost immediately after it is ingested. This enables applications to provide up-to-date search results without significant delay.
  • Advanced text analysis and language support: Solr provides extensive text analysis capabilities, including tokenization, stemming, stop-word filtering, synonym expansion, and more. It supports multiple languages and offers language-specific analyzers and tokenizers.
  • Integration and extensibility: Solr offers a rich set of APIs and integration options, allowing seamless integration with various systems and frameworks. It provides RESTful APIs, XML/JSON APIs, and client libraries for popular programming languages. Solr can also be extended with custom plugins and components to add additional functionality.
  • Built-in security: Solr can be secured with SSL/TLS, authentication, and role-based authorization.

 

1.3 Solr Operations

 

To search a document, Apache Solr performs the following operations in sequence (see the sketch after this list):

 

  • Indexing: converts the documents into a machine-readable format.
  • Querying: understanding the terms of a query asked by the user. These terms can be images or keywords, for example.
  • Mapping: Solr maps the user query to the documents stored in the database to find the appropriate result.
  • Ranking: as soon as the engine searches the indexed documents, it ranks the outputs by their relevance.
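To make the indexing and querying steps concrete, here is a minimal SolrJ (Solr's Java client) sketch. The Solr URL, collection name, and field values are placeholders, and it assumes the SolrJ library is on the classpath; it is an illustration of the flow above, not a production recipe.

import org.apache.solr.client.solrj.SolrQuery
import org.apache.solr.client.solrj.impl.HttpSolrClient
import org.apache.solr.common.SolrInputDocument

// Placeholder Solr node URL and collection name; replace with your own values.
val solrClient = new HttpSolrClient.Builder("http://solr01.example.com:8983/solr").build()
val collection = "sample-collection"

// Indexing: turn a record into a Solr document and add it to the index.
val doc = new SolrInputDocument()
doc.addField("id", "1")
doc.addField("name", "Ranga")
doc.addField("age", 34)
solrClient.add(collection, doc)
solrClient.commit(collection) // make the document searchable

// Querying and ranking: keyword search; results come back ordered by relevance score.
val query = new SolrQuery("name:Ranga")
query.setRows(10)
val results = solrClient.query(collection, query).getResults
for (i <- 0 until results.size()) {
  val d = results.get(i)
  println(d.getFieldValue("id") + " -> " + d.getFieldValue("name"))
}

solrClient.close()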

 

1.4 Solr Terms

 

The key terms associated with Solr are as follows:

 

  • SolrCloud: An umbrella term for a suite of functionality in Solr that allows managing a Cluster of Solr Nodes for scalability, fault tolerance, and high availability.
  • Cluster: In Solr, a cluster is a set of Solr nodes operating in coordination with each other via ZooKeeper, and managed as a unit. A cluster may contain many collections.
  • Instance: an instance of Solr running in the Java Virtual Machine (JVM). In stand-alone mode, it only offers one instance, whereas, in cloud mode, you can have one or more instances.
  • Node: A JVM instance running Solr. Also known as a Solr server. A single-node system cannot provide high availability or fault-tolerant behavior. Production systems should have at least two nodes.
  • Core: An individual Solr instance (which represents a logical index). Multiple cores can run on a single node. A core has a name like my_first_index_shard1_replica_n2.

 

Core = an instance of Lucene Index + Solr configuration 

 

  • Collection: In Solr, one or more documents are grouped in a single logical index using a single configuration and Schema. A collection may be divided up into multiple logical shards, which may in turn be distributed across many nodes, or in a Single node Solr installation, a collection may be a single Core. Collections have names like my-first-index.
  • Document: A group of fields and their values. Documents are the basic unit of data in a collection. Documents are assigned to shards using standard hashing, or by specifically assigning a shard within the document ID. Documents are versioned after each write operation.
  • Commit: To make document changes permanent in the index. In the case of added documents, they would be searchable after a commit.
  • Field: The content to be indexed/searched along with metadata defining how the content should be processed by Solr.
  • Metadata: Literally, data about data. Metadata is information about a document, such as its title, author, or location.
  • Facet: The arrangement of search results into categories based on indexed terms.
  • Shard: A logical partition of a single collection. Every shard consists of at least one physical Replica, but there may be multiple Replicas distributed across multiple Nodes for fault tolerance.
  • Replica: A Core that acts as a physical copy of a Shard in a SolrCloud Collection.
  • Replication: A method of copying a leader index from one server to one or more "follower" or "child" servers.
  • Leader: A single Replica for each Shard that takes charge of coordinating index updates (document additions or deletions) to other replicas in the same shard. This is a transient responsibility assigned to a node via an election, if the current Shard Leader goes down, a new node will automatically be elected to take its place.
  • Transaction log: An append-only log of write operations maintained by each Replica. This log is required with SolrCloud implementations and is created and managed automatically by Solr.
  • ZooKeeper: The system used by SolrCloud to keep track of configuration files and node names for a cluster. A ZooKeeper cluster is used as the central configuration store for the cluster, a coordinator for operations requiring distributed synchronization, and the system of record for cluster topology.

 

1.5 Solr Server Tuning

 

Refer to the following article for Solr server tuning and additional tuning resources.

 

 

2. Spark Solr Connector

 

2.1 Spark Solr Connector Introduction

 

The Spark Solr Connector is a library that allows seamless integration between Apache Spark and Apache Solr, enabling you to read data from Solr into Spark and write data from Spark into Solr. It provides a convenient way to leverage the power of Spark's distributed processing capabilities with Solr's indexing and querying capabilities.

 

The Spark Solr Connector provides more advanced functionalities for working with Solr and Spark, such as query pushdown, schema inference, and custom Solr field mapping.

 

2.2 Spark Solr Connector Features

 

  • Send objects from Spark (Streaming or DataFrames) into Solr.
  • Read the results of a Solr query as a Spark RDD or DataFrame.
  • Shard partitioning, intra-shard splitting, and streaming of results.
  • Stream documents from Solr using the /export handler (works only for fields that have docValues enabled).
  • Read large result sets from Solr using cursors or the /export handler (see the sketch after this list).
  • Data locality: if Spark workers and Solr processes are co-located on the same nodes, the partitions are placed on the nodes where the replicas are located.
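As a sketch of the /export-based streaming read mentioned above: the zkhost and collection values are placeholders, the request_handler option name is assumed from the spark-solr documentation for your connector version, and every exported field must have docValues enabled.

val exportOptions = Map(
  "zkhost"          -> "zk01.example.com:2181,zk02.example.com:2181,zk03.example.com:2181/solr-infra",
  "collection"      -> "sample-collection",
  "request_handler" -> "/export",      // stream via /export instead of paging through /select
  "fields"          -> "id,name,age"   // exported fields must have docValues enabled
)

val exportedDF = spark.read.format("solr").options(exportOptions).load()
println(exportedDF.count())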

 

2.3 Advantages of Spark Solr Connector

 

  • Seamless integration: The Spark Solr Connector provides seamless integration between Apache Spark and Apache Solr, allowing you to leverage the strengths of both platforms for data processing and search.
  • High-performance: The connector leverages Spark's distributed processing capabilities and Solr's indexing and querying capabilities, enabling high-performance data processing and search operations.
  • Query pushdown: The connector supports query pushdown, which allows parts of the query to be executed directly in Solr, reducing data transfer between Spark and Solr and improving overall performance (see the sketch after this list).
  • Schema inference: The connector can automatically infer the schema of the Solr collection and apply it to the Spark DataFrame, eliminating the need for manual schema definition.
  • Flexible data processing: With the Spark Solr Connector, you can easily read data from Solr into Spark for further processing, analysis, and machine learning tasks, and write data from Spark into Solr for indexing and search.
  • Solr field mapping: The connector provides a flexible mapping between Solr fields and Spark DataFrame columns, allowing you to handle schema evolution and mapping discrepancies between the two platforms.
  • Support for streaming expressions: The connector allows you to execute Solr streaming expressions directly from Spark, enabling advanced analytics and aggregations on data stored in Solr collections.
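To illustrate query pushdown, the following sketch reads the sample collection used later in this article and applies a DataFrame filter; the connector translates the predicate into a Solr filter so that only matching documents are transferred to Spark. The zkhost and collection values are placeholders.

val options = Map(
  "zkhost"     -> "zk01.example.com:2181,zk02.example.com:2181,zk03.example.com:2181/solr-infra",
  "collection" -> "sample-collection"
)

// Read the collection and filter it; the predicate is pushed down to Solr,
// so only matching documents are fetched into Spark.
val df = spark.read.format("solr").options(options).load()
val adults = df.filter("age > 25").select("id", "name", "age")
adults.explain()  // inspect the physical plan to verify what is pushed to the source
adults.show()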

 

2.4 Disadvantages of Spark Solr Connector

 

  • Complex setup: Setting up and configuring the Spark Solr Connector may require some initial effort, including dependencies management and ensuring compatibility between different versions of Spark and Solr.
  • Limited functionality: While the Spark Solr Connector provides essential functionality for data integration between Spark and Solr, it may not cover all the advanced features and options available in Solr. Customizations and advanced configurations may require additional development.

 

2.5 Ports Used

 

Service              Protocol   Port   Access     Purpose
Solr search/update   HTTP       8983   External   All Solr-specific actions such as query and update
Solr Admin           HTTP       8984   Internal   Administrative use

 

2.6 CDP Spark Solr Connector Supportability 

 

CDP Version   Spark2 Supported   Spark3 Supported
CDP 7.1.6     No                 No
CDP 7.1.7     Yes                No
CDP 7.1.8     Yes                No
CDP 7.1.9     Yes                Yes

 

3. Spark Solr Collection Introduction

 

The solrctl utility is a wrapper shell script included with Cloudera Search for managing collections, instance directories, configs, Apache Sentry permissions, and more.

 

3.1 Cloudera Search config templates

 

Config templates are immutable configuration templates that you can use as a starting point when creating configs for Solr collections. Cloudera Search contains templates by default and you can define new ones based on existing configs.

 

Configs can be declared as immutable, which means they cannot be deleted or have their Schema updated by the Schema API. Immutable configs are uneditable config templates that are the basis for additional configs. After a config is made immutable, you cannot change it back without accessing ZooKeeper directly as the solr (or solr@EXAMPLE.COM principal, if you are using Kerberos) superuser.

Solr provides a set of immutable config templates. These templates are only available after Solr initialization, so templates are not available in upgrades until after Solr is initialized or re-initialized.

 

Templates include:

Template Name        Supports Schema API   Uses Schemaless Solr
managedTemplate      Yes                   No
schemalessTemplate   Yes                   Yes

 

Config templates are managed using the solrctl config command.

 

For example:

 

To create a new config based on the managedTemplate template:

 

solrctl config --create [***NEW CONFIG***] managedTemplate -p immutable=false

Replace [NEW CONFIG] with the name of the config you want to create.

 

 

To create a new template (immutable config) from an existing config:

 

solrctl config --create [***NEW TEMPLATE***] [***EXISTING CONFIG***] -p immutable=true

 

 

Replace [NEW TEMPLATE] with a name for the new template you want to create and [EXISTING CONFIG] with the name of the existing config that you want to base [NEW TEMPLATE] on.

 

3.2 Generating collection configuration using the solrctl config command

 

 

solrctl config --create [***NEW CONFIG***] [***TEMPLATE***] [-p [***NAME***]=[***VALUE***]]

 

 

where

  • [NEW CONFIG] is the user-specified name of the config
  • [TEMPLATE] is the name of an existing config template
  • -p [NAME]=[VALUE] Overrides a [TEMPLATE] setting. The only config property that you can override is immutable, so the possible options are -p immutable=true and -p immutable=false. If you are copying an immutable config, such as a template, use -p immutable=false to make sure that you can edit the new config.

For example:

 

solrctl config --create testConfig managedTemplate -p immutable=false

 

 

To list all available config templates:

 

solrctl instancedir --list

 

 

3.3 Create a collection using the solrctl collection command

 

 

solrctl collection --create [***COLLECTION NAME***] -s [***NUMBER OF SHARDS***] -c [***COLLECTION CONFIGURATION***]

 

 

where

  • [COLLECTION NAME] User-defined name of the collection.
  • [NUMBER OF SHARDS] The number of shards you want to split your collection into.
  • [COLLECTION CONFIGURATION] The name of an existing collection configuration.

For example:

 

Create a collection with 2 shards.

 

solrctl collection --create testcollection -s 2 -c testConfig

 

 

To list all available collections:

 

solrctl collection --list

 

 

3.4 Solr Collection Reference(s) 

 

  1. Cloudera Search config templates
  2. Generating collection configuration using configs

4. Spark Solr Integration

4.1 Solr Collection Creation for Integration

 

  1. If you are using Kerberos, kinit as a user with permission to create the collection and its configuration:

 

kinit solradmin@EXAMPLE.COM

 

 Replace EXAMPLE.COM with your Kerberos realm name.

 

  2. Generate configuration files for the collection:

 

solrctl config --create sample-config managedTemplate -p immutable=false

 

 

  3. Create a new Solr collection with 1 shard:

 

solrctl collection --create sample-collection -s 1 -c sample-config

 

 

4.2 Collecting the Solr Zookeeper details

 

Log in to the host where the Solr instance is running:

 

 

cat /etc/solr/conf/solr-env.sh

export SOLR_ZK_ENSEMBLE=zk01.example.com:2181,zk02.example.com:2181,zk03.example.com:2181/solr-infra

export SOLR_HADOOP_DEPENDENCY_FS_TYPE=shared

 

 

Note: Make sure that the SOLR_ZK_ENSEMBLE environment variable is set in the above configuration file.

4.3 Launch the Spark shell

 

To integrate Spark with Solr, you need to use the spark-solr library. You can specify this library using --jars or --packages options when launching Spark.

 

Example(s):

 

Using --jars option:

 

 

spark-shell \
  --jars /opt/cloudera/parcels/CDH/jars/spark-solr-3.9.0.7.1.8.3-363-shaded.jar

 

 

Using --packages option:

 

 

spark-shell \
  --packages com.lucidworks.spark:spark-solr:3.9.0.7.1.8.15-5 \
  --repositories https://repository.cloudera.com/artifactory/cloudera-repos/

 

 

In the following example(s), I have used the --jars option.

 

4.3.1 Cluster is non-kerberized

 

Step 1: Find the spark-solr JAR. Use the following command to locate the spark-solr JAR file:

 

ls /opt/cloudera/parcels/CDH/jars/*spark-solr*

 

For example, if the JAR file is located at /opt/cloudera/parcels/CDH/jars/spark-solr-3.9.0.7.1.8.3-363-shaded.jar, note down the path.

 

Step 2: Launch the Spark shell by running the following command:

 

spark-shell \
  --deploy-mode client \
  --jars /opt/cloudera/parcels/CDH/jars/spark-solr-3.9.0.7.1.8.3-363-shaded.jar

 

Replace /opt/cloudera/parcels/CDH/jars/spark-solr-3.9.0.7.1.8.3-363-shaded.jar with the actual path to the spark-solr JAR file obtained in Step 1.

 

4.3.2 Cluster is Kerberized and SSL is not enabled

 

Step 1: Create a JAAS file

 

cat /tmp/solr-client-jaas.conf

Client {
  com.sun.security.auth.module.Krb5LoginModule required
  doNotPrompt=true
  useKeyTab=true
  storeKey=true
  useTicketCache=false
  keyTab="sampleuser.keytab"
  principal="sampleuser@EXAMPLE.COM";
};

 

 

 Replace the values of keyTab and principal with your specific configuration.

 

Step 2: Find the spark-solr JAR

Use the following command to locate the spark-solr JAR file:

 

ls /opt/cloudera/parcels/CDH/jars/*spark-solr*

 

 

For example, if the JAR file is located at /opt/cloudera/parcels/CDH/jars/spark-solr-3.9.0.7.1.8.3-363-shaded.jar, note down the path.

 

Step 3: Launch the spark-shell

Before running the following spark-shell command, replace the keytab, principal, and JAR path (collected in Step 2) with your own values:

 

spark-shell \
  --deploy-mode client \
  --jars /opt/cloudera/parcels/CDH/jars/spark-solr-3.9.0.7.1.8.3-363-shaded.jar \
  --principal sampleuser@EXAMPLE.COM \
  --keytab sampleuser.keytab \
  --files /tmp/solr-client-jaas.conf#solr-client-jaas.conf,sampleuser.keytab \
  --driver-java-options "-Djava.security.auth.login.config=/tmp/solr-client-jaas.conf -Djavax.security.auth.useSubjectCredsOnly=false" \
  --conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=solr-client-jaas.conf -Djavax.security.auth.useSubjectCredsOnly=false"

 

 

4.3.3 Cluster is kerberized and SSL enabled

Step 1: Create a JAAS file

 

cat /tmp/solr-client-jaas.conf

Client {
  com.sun.security.auth.module.Krb5LoginModule required
  doNotPrompt=true
  useKeyTab=true
  storeKey=true
  useTicketCache=false
  keyTab="sampleuser.keytab"
  principal="sampleuser@EXAMPLE.COM";
};

 

 

Replace the values of keyTab and principal with your specific configuration.

 

Step 2: Find the spark-solr JAR

Use the following command to locate the spark-solr JAR file:

 

ls /opt/cloudera/parcels/CDH/jars/*spark-solr*

 

 

For example, if the JAR file is located at /opt/cloudera/parcels/CDH/jars/spark-solr-3.9.0.7.1.8.3-363-shaded.jar, note down the path.

 

Step 3: Launch the spark-shell

Before running the following spark-shell command, replace the keytab, principal, and JAR path (collected in Step 2), as well as the javax.net.ssl.trustStore file and javax.net.ssl.trustStorePassword value, in both the driver and executor Java options.

 

 

spark-shell \
  --deploy-mode client \
  --jars /opt/cloudera/parcels/CDH/jars/spark-solr-3.9.0.7.1.8.3-363-shaded.jar \
  --principal sampleuser@EXAMPLE.COM \
  --keytab sampleuser.keytab \
  --files /tmp/solr-client-jaas.conf#solr-client-jaas.conf,sampleuser.keytab \
  --driver-java-options "-Djava.security.auth.login.config=/tmp/solr-client-jaas.conf -Djavax.security.auth.useSubjectCredsOnly=false -Djavax.net.ssl.trustStore=/var/lib/cloudera-scm-agent/agent-cert/cm-auto-global_truststore.jks -Djavax.net.ssl.trustStorePassword='changeit'" \
  --conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=solr-client-jaas.conf -Djavax.security.auth.useSubjectCredsOnly=false -Djavax.net.ssl.trustStore=/var/lib/cloudera-scm-agent/agent-cert/cm-auto-global_truststore.jks -Djavax.net.ssl.trustStorePassword='changeit'"

 

 

4.4 Writing data to Solr

Replace the collectionName and zkHost values (collected in the Collecting the Solr Zookeeper details step) with your own.

 

case class Employee(id:Long, name: String, age: Short, salary: Float)

val employeeDF = Seq(
  Employee(1L, "Ranga", 34, 15000.5f),
  Employee(2L, "Nishanth", 5, 35000.5f),
  Employee(3L, "Meena", 30, 25000.5f)
).toDF()

val collectionName = "sample-collection"
val zkHost = "zk01.example.com:2181,zk02.example.com:2181,zk03.example.com:2181/solr-infra"
val solrOptions = Map("zkhost" -> zkHost, "collection" -> collectionName, "commitWithin" -> "1000")
// Write data to Solr
employeeDF.write.format("solr").options(solrOptions).mode("overwrite").save()

 

 

Write Optimization Parameters (see the example after this list):

 

  1. batchSize: Specifies the number of documents to be sent to Solr in each batch during the write operation. Increasing the batch size can improve indexing performance by reducing the number of round trips between Spark and Solr. Higher batch sizes can improve indexing throughput but may require more memory. The default value is 500.
  2. commitWithin: Sets the time interval (in milliseconds) within which the documents should be committed to Solr. It controls the soft commit behavior, where the documents are made searchable but not persisted to disk immediately. Setting a lower value can improve indexing speed but may increase the overhead of frequent commits. The default value is 1000.
  3. queueSize: Specifies the maximum number of documents that can be buffered in memory before being sent to Solr. It determines the size of the write buffer. Increasing the queue size can improve indexing performance by allowing more documents to be buffered before sending them to Solr. However, setting it too high may consume excessive memory. The default value is 10000.
  4. softCommit: Determines whether a soft commit is performed after each indexing operation. Soft commit makes the indexed documents searchable immediately but may impact the overall indexing performance. You can set this parameter based on your requirements for near real-time searchability. The default value is false.
  5. commitRefresh: Controls whether the Solr index is refreshed after a commit. Setting this parameter to true ensures that the indexed data is immediately available for search. The default value is true.
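A short sketch of passing some of these write options, reusing the employeeDF, collectionName, and zkHost values from the example above. The option keys are spelled as they are named in this list; treat that spelling as an assumption and verify it against the spark-solr version shipped with your CDP release.

// Option keys follow the names used in this article; confirm the exact spelling
// for your spark-solr version before relying on them.
val writeOptions = Map(
  "zkhost"       -> zkHost,
  "collection"   -> collectionName,
  "batchSize"    -> "1000",  // documents sent to Solr per batch
  "commitWithin" -> "5000",  // soft-commit within 5 seconds
  "softCommit"   -> "true"   // make documents searchable without waiting for a hard commit
)

employeeDF.write.format("solr").options(writeOptions).mode("overwrite").save()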

 

4.5 Reading data from Solr

 

Replace the collectionName and zkHost values (collected in the Collecting the Solr Zookeeper details step) with your own.

 

 

val collectionName = "sample-collection"
val zkHost = "zk01.example.com:2181,zk02.example.com:2181,zk03.example.com:2181/solr-infra"
val solrOptions = Map("zkhost" -> zkHost, "collection" -> collectionName)

// Read data from Solr
val df = spark.read.format("solr").options(solrOptions).load()
df.show()

 

 

Read Optimization Parameters (see the example after this list):

 

  1. splitField: Specifies the field to be used for splitting the Solr data into Spark partitions during the read operation. Splitting the data helps in parallelizing the reading process across multiple partitions, enhancing read performance. Choosing an appropriate split field based on the data distribution can significantly enhance the read performance. This parameter is applicable when using SolrCloud collections.
  2. filters: Allows specifying filters to limit the data fetched from Solr during a read operation. Applying filters can reduce the amount of data transferred between Solr and Spark, improving query performance.
  3. rows: Specifies the number of rows to fetch per Solr query during a read operation. Adjusting this parameter can control the amount of data loaded into Spark and impact memory usage.
  4. partitionBy: Allows partitioning the Solr data by one or more fields during the read operation. Partitioning the data can improve read performance by parallelizing the data retrieval process. It is useful when the data is evenly distributed across the specified partitioning fields.
  5. partitionCount: Sets the number of partitions to be created during the read operation. It determines the level of parallelism for reading the Solr data. Adjusting the partition count based on the available resources and the size of the Solr data can optimize the read performance.
  6. query: Allows specifying a custom query string to filter the data during the read operation. Limiting the amount of data fetched from Solr by providing a specific query can enhance read performance, especially when dealing with large datasets.
  7. fields: Specifies the fields to be selected while querying data from Solr. By selecting only the required fields, unnecessary data transfer and processing overhead can be reduced.
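A short sketch combining some of these read options, again reusing the collectionName and zkHost values from the example above; the query, fields, and rows values are illustrative only.

val readOptions = Map(
  "zkhost"     -> zkHost,
  "collection" -> collectionName,
  "query"      -> "age:[25 TO *]",  // restrict the result set in Solr itself
  "fields"     -> "id,name,age",    // fetch only the fields that are needed
  "rows"       -> "1000"            // rows fetched per Solr request
)

val filteredDF = spark.read.format("solr").options(readOptions).load()
filteredDF.show()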

 

4.6 Pyspark Example

 

 

vi /tmp/spark_solr_connector_app.py

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType, ShortType, FloatType

def main():
    spark = SparkSession.builder.appName("Spark Solr Connector App").getOrCreate()
    data = [(1, "Ranga", 34, 15000.5), (2, "Nishanth", 5, 35000.5),(3, "Meena", 30, 25000.5)]

    schema = StructType([ \
        StructField("id",LongType(),True), \
        StructField("name",StringType(),True), \
        StructField("age",ShortType(),True), \
        StructField("salary", FloatType(), True)
      ])
    employeeDF = spark.createDataFrame(data=data,schema=schema)
    zkHost = "zk01.example.com:2181,zk02.example.com:2181,zk03.example.com:2181/solr-infra"
    collectionName = "sample-collection"
    solrOptions = { "zkhost" : zkHost, "collection" : collectionName }
    # Write data to Solr
    employeeDF.write.format("solr").options(**solrOptions).mode("overwrite").save()

    # Read data from Solr
    df = spark.read.format("solr").options(**solrOptions).load()
    df.show()
    # Filter the data
    df.filter("age > 25").show()
    spark.stop()
if __name__ == "__main__":
    main()

 

 

Note: Replace the collectionName and zkHost values (collected in the Collecting the Solr Zookeeper details step) with your own.

 

Client Mode:

 

spark-submit \
  --master yarn \
  --deploy-mode client \
  --jars /opt/cloudera/parcels/CDH/jars/spark-solr-3.9.0.7.1.8.3-363-shaded.jar \
  /tmp/spark_solr_connector_app.py

 

 

Cluster Mode:

 

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --jars /opt/cloudera/parcels/CDH/jars/spark-solr-3.9.0.7.1.8.3-363-shaded.jar \
  /tmp/spark_solr_connector_app.py

 

 

5. Troubleshooting

Issue1 - org.apache.solr.common.SolrException: Cannot create collection testcollection. The value of maxShardsPerNode is 1, and the number of nodes currently live or live and part of your createNodeSet is 1. This allows a maximum of 1 to be created.

 

 

{
  "responseHeader": {
    "status": 400,
    "QTime": 238
  },
  "Operation create caused exception:": "org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Cannot create collection testcollection. Value of maxShardsPerNode is 1, and the number of nodes currently live or live and part of your createNodeSet is 1. This allows a maximum of 1 to be created. Value of numShards is 2, value of nrtReplicas is 1, value of tlogReplicas is 0 and value of pullReplicas is 0. This requires 2 shards to be created (higher than the allowed number)",
  "exception": {
    "msg": "Cannot create collection testcollection. Value of maxShardsPerNode is 1, and the number of nodes currently live or live and part of your createNodeSet is 1. This allows a maximum of 1 to be created. Value of numShards is 2, value of nrtReplicas is 1, value of tlogReplicas is 0 and value of pullReplicas is 0. This requires 2 shards to be created (higher than the allowed number)",
    "rspCode": 400
  },
  "error": {
    "metadata": [
      "error-class", "org.apache.solr.common.SolrException",
      "root-error-class", "org.apache.solr.common.SolrException"
    ],
    "msg": "Cannot create collection testcollection. Value of maxShardsPerNode is 1, and the number of nodes currently live or live and part of your createNodeSet is 1. This allows a maximum of 1 to be created. Value of numShards is 2, value of nrtReplicas is 1, value of tlogReplicas is 0 and value of pullReplicas is 0. This requires 2 shards to be created (higher than the allowed number)",
    "code": 400
  }
}

 

 

Problem:

 

The error message org.apache.solr.common.SolrException: Cannot create collection testcollection. Value of maxShardsPerNode is 1, and the number of nodes currently live or live and part of your createNodeSet is 1. This allows a maximum of 1 to be created. indicates that the collection was requested with two shards (numShards=2), but only one Solr node is live and maxShardsPerNode is set to 1, so Solr can place at most one shard and refuses to create the collection.

 

Solution:

 

To resolve this issue, you have a few options:

Reduce the shard count: If a single shard is sufficient, create the collection with -s 1 so that the requested number of shards does not exceed what the live nodes allow.

Increase maxShardsPerNode: If you want a single node to host more than one shard of the collection, specify a higher maxShardsPerNode value when creating the collection.

Add additional nodes: If each shard should live on its own node, add more Solr nodes to the cluster so that enough live nodes are available for the requested number of shards.

Clean up or rename: If an earlier attempt left a partially created collection with the same name, delete it using the Solr Admin UI or the Collections API, or choose a different, unique collection name, and then retry the creation.

Issue2 - Cannot connect to cluster at hostname:2181: cluster not found/not ready

 

 

com.google.common.util.concurrent.UncheckedExecutionException: org.apache.solr.common.SolrException: Cannot connect to cluster at hostname:2181: cluster not found/not ready
  at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2051)
  at com.google.common.cache.LocalCache.get(LocalCache.java:3953)
  at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3976)
  at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4960)
  at com.lucidworks.spark.util.SolrSupport$.getCachedCloudClient(SolrSupport.scala:250)
  at com.lucidworks.spark.util.SolrQuerySupport$.getUniqueKey(SolrQuerySupport.scala:107)
  at com.lucidworks.spark.rdd.SolrRDD.<init>(SolrRDD.scala:39)
  at com.lucidworks.spark.rdd.SelectSolrRDD.<init>(SelectSolrRDD.scala:29)
  ... 49 elided
Caused by: org.apache.solr.common.SolrException: Cannot connect to cluster at hostname:2181: cluster not found/not ready
  at org.apache.solr.common.cloud.ZkStateReader.createClusterStateWatchersAndUpdate(ZkStateReader.java:508)
  at org.apache.solr.client.solrj.impl.ZkClientClusterStateProvider.getZkStateReader(ZkClientClusterStateProvider.java:176)
  at org.apache.solr.client.solrj.impl.ZkClientClusterStateProvider.connect(ZkClientClusterStateProvider.java:160)
  at org.apache.solr.client.solrj.impl.BaseCloudSolrClient.connect(BaseCloudSolrClient.java:335)
  at com.lucidworks.spark.util.SolrSupport$.getSolrCloudClient(SolrSupport.scala:223)
  at com.lucidworks.spark.util.SolrSupport$.getNewSolrCloudClient(SolrSupport.scala:242)
  at com.lucidworks.spark.util.CacheCloudSolrClient$$anon$1.load(SolrSupport.scala:38)
  at com.lucidworks.spark.util.CacheCloudSolrClient$$anon$1.load(SolrSupport.scala:36)
  at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3529)
  at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2278)
  at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2155)
  at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2045)
  ... 56 more

 

 

Problem:

 

 The error message org.apache.solr.common.SolrException: Cannot connect to cluster at hostname:2181: cluster not found/not ready indicates that the Solr connector is unable to connect to the Solr cluster specified by the provided hostname and port.

 

val options = Map("zkhost" -> "localhost:2181", "collection" -> "testcollection")

 

Solution:

 

Provide the full ZooKeeper ensemble value, including the chroot (for example /solr-infra), as found in the /etc/solr/conf/solr-env.sh file:

 

export SOLR_ZK_ENSEMBLE=localhost:2181/solr-infra

export SOLR_HADOOP_DEPENDENCY_FS_TYPE=shared

 

For example,

 

val options = Map("zkhost" -> "localhost:2181/solr-infra", "collection" -> "testcollection")

 

 

Issue3 - java.lang.NoClassDefFoundError: scala/Product$class

 

 

py4j.protocol.Py4JJavaError: An error occurred while calling o85.save.
: java.lang.NoClassDefFoundError: scala/Product$class
  at com.lucidworks.spark.util.SolrSupport$CloudClientParams.<init>(SolrSupport.scala:184)
  at com.lucidworks.spark.util.SolrSupport$.getCachedCloudClient(SolrSupport.scala:250)
  at com.lucidworks.spark.util.SolrSupport$.getSolrBaseUrl(SolrSupport.scala:254)
  at com.lucidworks.spark.SolrRelation.insert(SolrRelation.scala:655)
  at solr.DefaultSource.createRelation(DefaultSource.scala:29)
  at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
  at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:111)
  at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
  at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
  at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)

 

 

Problem:

 

The error java.lang.NoClassDefFoundError: scala/Product$class typically occurs when there is a mismatch between the Scala version Spark was built with and the Scala version the spark-solr jar was built with. Spark 3 uses Scala 2.12, while Spark 2 uses Scala 2.11.

 

In this case, the error occurred because a Spark 3 application was run with the Spark 2 build of the spark-solr jar.
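A quick way to confirm which Scala and Spark versions your environment uses is to check them from the spark-shell; a minimal sketch:

// Run inside spark-shell to confirm the Scala and Spark versions in use.
println(scala.util.Properties.versionString)  // e.g. "version 2.12.x" on Spark 3, "version 2.11.x" on Spark 2
println(spark.version)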

 

Solution:

 

Use the Spark3-supported spark-solr connector jar.

Issue4 - java.lang.RuntimeException: org.apache.solr.client.solrj.SolrServerException: IOException occurred when talking to server at: https://localhost:8995/solr

 

 

java.lang.RuntimeException: org.apache.solr.client.solrj.SolrServerException: IOException occurred when talking to server at: https://localhost:8995/solr
  at solr.DefaultSource.createRelation(DefaultSource.scala:33)
  at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:146)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:142)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:170)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:167)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:142)
  at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:93)
  at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:91)
  at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:704)
  at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:704)
  at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:80)
  at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:127)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:75)
  at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:704)
  at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:305)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:291)
  ... 49 elided
Caused by: org.apache.solr.client.solrj.SolrServerException: IOException occurred when talking to server at: https://localhost:8995/solr
  at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:682)
  at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:265)
  at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:248)
  at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:211)
  at org.apache.solr.client.solrj.SolrClient.query(SolrClient.java:1003)
  at org.apache.solr.client.solrj.SolrClient.query(SolrClient.java:1018)
  at com.lucidworks.spark.util.SolrSupport$.getSolrVersion(SolrSupport.scala:88)
  at com.lucidworks.spark.SolrRelation.solrVersion$lzycompute(SolrRelation.scala:67)
  at com.lucidworks.spark.SolrRelation.solrVersion(SolrRelation.scala:67)
  at com.lucidworks.spark.SolrRelation.insert(SolrRelation.scala:659)
  at solr.DefaultSource.createRelation(DefaultSource.scala:29)
  ... 69 more

 

 

Problem:

 

The above exception occurs in a Kerberized (and SSL-enabled) environment when the required Kerberos and TLS parameters are not passed correctly to spark-submit.

 

Solution:

 

 

cat /tmp/solr-client-jaas.conf
Client {
  com.sun.security.auth.module.Krb5LoginModule required
  doNotPrompt=true
  useKeyTab=true
  storeKey=true
  useTicketCache=false
  keyTab="sampleuser.keytab"
  principal="sampleuser@EXAMPLE.COM";

};

 

 

Client mode:

 

spark-submit \
  --deploy-mode client \
  --jars /opt/cloudera/parcels/CDH/jars/spark-solr-3.9.0.7.1.8.3-363-shaded.jar \
  --principal sampleuser@EXAMPLE.COM \
  --keytab sampleuser.keytab \
  --files /tmp/solr-client-jaas.conf#solr-client-jaas.conf,sampleuser.keytab \
  --driver-java-options "-Djava.security.auth.login.config=/tmp/solr-client-jaas.conf -Djavax.security.auth.useSubjectCredsOnly=false -Djavax.net.ssl.trustStore=/var/lib/cloudera-scm-agent/agent-cert/cm-auto-global_truststore.jks -Djavax.net.ssl.trustStorePassword='changeit'" \
  --conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=solr-client-jaas.conf -Djavax.security.auth.useSubjectCredsOnly=false -Djavax.net.ssl.trustStore=/var/lib/cloudera-scm-agent/agent-cert/cm-auto-global_truststore.jks -Djavax.net.ssl.trustStorePassword='changeit'" \
  /tmp/spark_solr_connector_app.py

 

 

Cluster mode:

 

spark-submit \
  --deploy-mode cluster \
  --jars /opt/cloudera/parcels/CDH/jars/spark-solr-3.9.0.7.1.8.3-363-shaded.jar \
  --principal sampleuser@EXAMPLE.COM \
  --keytab sampleuser1.keytab \
  --files /tmp/solr-client-jaas.conf#solr-client-jaas.conf,sampleuser.keytab \
  --conf "spark.driver.extraJavaOptions=-Djava.security.auth.login.config=solr-client-jaas.conf -Djavax.security.auth.useSubjectCredsOnly=false -Djavax.net.ssl.trustStore=/var/lib/cloudera-scm-agent/agent-cert/cm-auto-global_truststore.jks -Djavax.net.ssl.trustStorePassword='changeit'" \
  --conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=solr-client-jaas.conf -Djavax.security.auth.useSubjectCredsOnly=false -Djavax.net.ssl.trustStore=/var/lib/cloudera-scm-agent/agent-cert/cm-auto-global_truststore.jks -Djavax.net.ssl.trustStorePassword='changeit'" \
  /tmp/spark_solr_connector_app.py

 

 

In cluster mode, you cannot pass the same keytab file name to both --keytab and --files, because Spark would try to distribute two files with the same name. Create a copy of the keytab with a different name, for example sampleuser1.keytab, and pass that copy to --keytab as shown above.

Issue5 - com.lucidworks.spark.CollectionEmptyException: No fields defined in query schema for query: q=*:*&rows=5000&qt=/select&collection=sample-collection. This is likely an issue with the Solr collection sample-collection, does it have data?

 

Problem:

 

If you try to read the data immediately after saving it to Solr, you can see the above exception because the newly written documents have not been committed yet, so the collection appears to have no fields or data.

 

Solution:

 

1. Commit the changes explicitly after saving the data to Solr:

 

 

// Write data to Solr
employeeDF.write.format("solr").options(solrOptions).mode("overwrite").save()
// Commit in Solr so the new documents become searchable. HttpSolrClient needs a
// Solr node base URL plus the collection name, not the ZooKeeper ensemble string.
import org.apache.solr.client.solrj.impl.HttpSolrClient
val solrBaseUrl = "http://solr01.example.com:8983/solr" // replace with one of your Solr server URLs
val solrClient = new HttpSolrClient.Builder(s"$solrBaseUrl/$collectionName").build()
solrClient.commit()
solrClient.close()

 

 

2. Add the "commitWithin" with less value, so the documents are committed to the Solr collection after being indexed. It controls the interval at which the changes made to the collection are made searchable.

"commitWithin" -> "500"

 

3. After verifying that the collection has data and the schema is correctly configured, you can retry the query with the Spark-Solr Connector code.

 

 
