Member since
04-05-2016
39
Posts
8
Kudos Received
9
Solutions
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 4424 | 07-30-2019 11:52 PM |
| | 6133 | 06-07-2019 01:01 AM |
| | 9935 | 04-14-2017 08:31 PM |
| | 6770 | 08-03-2016 12:52 AM |
| | 3694 | 06-22-2016 02:10 AM |
10-15-2025
10:25 PM
This story outlines the steps taken to configure a multi-GPU host in the datacenter for hosting and serving open-source models. The primary objective is to leverage two L40s GPUs to provide robust LLM inference capabilities locally.
The deployment of large language models (LLMs) for local inference represents a crucial step in enhancing our on-premises AI capabilities. Given the increasing scale and complexity of contemporary LLMs, a single GPU frequently proves insufficient to accommodate the entire model within its Video RAM (VRAM). This necessitates the utilization of multiple GPUs to distribute the model's parameters and facilitate efficient inference. Our configuration, employing two L40s GPUs, is specifically designed to address this challenge, ensuring our ability to host robust LLMs that demand substantial memory and processing power for optimal performance.
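To put that memory pressure in concrete terms, a rough back-of-the-envelope calculation (illustrative arithmetic only; it counts model weights and ignores KV cache, activations, and runtime overhead) shows why a 70B-parameter model cannot fit on a single 48 GB L40s at 16-bit precision but becomes feasible across two GPUs once quantized to 4 bits:

params = 70e9  # parameter count of a Llama-3.1-70B-class model

def weight_gb(bytes_per_param):
    # Weight memory only; KV cache and framework overhead are excluded.
    return params * bytes_per_param / 1024**3

print(f"FP16 weights: ~{weight_gb(2):.0f} GB")    # ~130 GB, far beyond a single 48 GB L40s
print(f"INT4 weights: ~{weight_gb(0.5):.0f} GB")  # ~33 GB, fits across 2 x 48 GB with headroom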
Prerequisites:
We conducted this exercise on a host system with the following hardware and operating system specifications:
Number of GPUs: 2
GPU Model: NVIDIA L40s
Total VRAM: 96 GB
CPU: Intel(R) Xeon(R) Gold 6438Y+
RAM: 512 GB
Operating System: Rocky Linux 8.10
The host requires preparation through the installation of necessary tools prior to running the LLMs.
Installing Podman-Based Docker on the Host
This section outlines the installation process for Podman-based Docker on the host machine. This will provide the necessary containerization environment for deploying and managing our models, ensuring isolation and consistent execution across different environments.
To install Podman-based Docker, execute the following commands:
sudo dnf update -y
sudo dnf install -y podman podman-docker
Verify the installation:
podman --version
docker --version
Run the test container:
podman run hello-world
Installing Torch and Other Required Libraries
Once the containerization environment is set up, the next step involves installing the core libraries essential for model deployment and inference. These libraries are crucial for efficient LLM operations and GPU utilization.
The following commands will install Torch and other necessary libraries:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
Note that the Torch build must match the installed CUDA version; here the cu121 wheels target CUDA 12.1.
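As a quick sanity check (a minimal sketch, assuming the PyTorch wheels above installed cleanly), the following Python snippet confirms that PyTorch was built with CUDA support and can see both GPUs:

import torch

# Verify the CUDA build and that both L40s GPUs are visible to PyTorch.
print("CUDA available:", torch.cuda.is_available())
print("CUDA build    :", torch.version.cuda)
print("GPU count     :", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(f"GPU {i}:", torch.cuda.get_device_name(i))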
Installing NVIDIA’s Toolkit for GPU Acceleration
To fully utilize the capabilities of the L40s GPUs, NVIDIA’s toolkit must be installed. This toolkit provides the drivers, libraries, and utilities required for GPU acceleration, allowing the system to harness the full potential of the installed hardware.
To install NVIDIA’s toolkit for GPU acceleration, use the following commands.
Set up the NVIDIA Container Toolkit Repository:
curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | \
  sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
Install the Toolkit:
sudo dnf install -y nvidia-container-toolkit
sudo dnf clean all
sudo dnf module install -y nvidia-driver:latest-dkms
sudo dnf install -y cuda-toolkit
Configure Docker and Restart:
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
Verify the Installation:
docker run --rm --gpus all nvidia/cuda:12.1.1-base-ubuntu22.04 nvidia-smi
Running vLLM Image to Host Appropriate Models
With the current configuration providing approximately 92GB of combined VRAM, the system is well-equipped to support the deployment of demanding Large Language Models (LLMs).
This substantial memory capacity allows for two primary deployment strategies: hosting robust, quantized LLMs, such as the hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 model, which use quantization to reduce their memory footprint while maintaining performance; or deploying high-quality models with a lower parameter count, such as the openai/gpt-oss-20b model, which can use the generous VRAM for optimal inference speed and efficiency.
This flexibility enables the system to serve a range of LLM requirements effectively. Let's deploy each of these models one at a time.
hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 Model
The final step is to run the vLLM image to host the 4-bit quantized Llama 3.1 70B model. This will enable local inference on our dual L40s GPU setup, leveraging the combined VRAM and processing power to handle this robust LLM.
The command to run the vLLM image and host the model is as follows:
docker run -d --gpus all -p 8000:8000 --ipc=host \
  vllm/vllm-openai:latest \
  --model hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 \
  --tensor-parallel-size 2 \
  --max-model-len 8192 \
  --max-num-batched-tokens 16384 \
  --gpu-memory-utilization 0.92 \
  --enforce-eager \
  --swap-space 16
After the model has been loaded onto both available GPUs, the nvidia-smi command will show output like the following, with the actual memory usage:
nvidia-smi output after deploying the Meta-Llama-3.1-70B-Instruct-AWQ-INT4 model
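Before sending real traffic, a simple way to confirm the container has finished loading the weights is to query the server's model list (a minimal sketch using the Python requests library; the host and port match the docker run command above):

import requests

# The vLLM OpenAI-compatible server exposes /v1/models; once loading
# completes it should list the served model ID.
resp = requests.get("http://localhost:8000/v1/models", timeout=10)
resp.raise_for_status()
for model in resp.json().get("data", []):
    print(model["id"])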
openai/gpt-oss-20b model
With sufficient VRAM (~92 GB) in place, we can also serve OpenAI's recently open-sourced 20B model, gpt-oss-20b, from this setup. We use the command below to deploy this model on the dual-GPU setup:
docker run --gpus all -p 8000:8000 --ipc=host \
  vllm/vllm-openai:latest \
  --model openai/gpt-oss-20b \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.95 \
  --max-model-len 16384 \
  --max-num-seqs 256 \
  --host 0.0.0.0
Below is the nvidia-smi output:
nvidia-smi output after deploying gpt-oss-20B model
nvidia-smi shows a slightly higher memory footprint for the gpt-oss-20b model due to the increased max-model-len parameter, which is set to 16384 compared to 8192 for the Meta-Llama-3.1-70B-Instruct-AWQ-INT4 model. This larger token limit requires more VRAM to accommodate longer sequences during inference.
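Much of that extra footprint is the KV cache that vLLM reserves for the configured context length; per-sequence KV memory grows linearly with max-model-len. The sketch below illustrates the relationship (the layer count, KV head count, head dimension, and data type are placeholder assumptions for illustration, not the actual gpt-oss-20b architecture):

# Illustrative KV-cache sizing; these dimensions are placeholder assumptions,
# not the real gpt-oss-20b configuration.
num_layers, num_kv_heads, head_dim, bytes_per_elem = 24, 8, 128, 2

def kv_cache_gb(seq_len):
    # Keys and values (factor of 2) per layer, per KV head, per token.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem / 1024**3

for seq_len in (8192, 16384):
    print(f"max-model-len={seq_len}: ~{kv_cache_gb(seq_len):.2f} GB per full-length sequence")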
Testing the Deployment
The curl command below can be used to execute a test completion request against the newly deployed server hosting the LLM:
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-oss-20b",
    "messages": [
      {"role": "system", "content": "You are a helpful and concise assistant."},
      {"role": "user", "content": "What is the difference between a GPU and a CPU?"}
    ],
    "max_tokens": 500,
    "temperature": 0.7
  }'
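The same request can also be issued from Python with the openai client library by pointing its base URL at the local vLLM server (a minimal sketch; the API key is a dummy value because the server above was started without authentication):

from openai import OpenAI

# vLLM exposes an OpenAI-compatible API, so the standard client works
# once base_url points at the local server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[
        {"role": "system", "content": "You are a helpful and concise assistant."},
        {"role": "user", "content": "What is the difference between a GPU and a CPU?"},
    ],
    max_tokens=500,
    temperature=0.7,
)
print(response.choices[0].message.content)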
API Documentation
The available APIs can be accessed at http://localhost:8000/docs
Serving the Open Source LLMs On Multi-GPU Host Setup was originally published in Engineering@Cloudera on Medium, where people are continuing the conversation by highlighting and responding to this story.
02-29-2024
10:24 PM
How about submitting a CDE job from CML to a private cloud base cluster?
07-31-2020
05:54 AM
In a CDH 6.3.2 cluster I have an Anaconda parcel distributed and activated, which of course has the numpy module installed. However, the Spark nodes seem to ignore the CDH configuration and keep using the system-wide Python from /usr/bin/python. Nevertheless, I have installed numpy in the system-wide Python across all cluster nodes, yet I still get "ImportError: No module named numpy". I would appreciate any further advice on how to solve this problem. I am not sure how to implement the solution referred to in https://stackoverflow.com/questions/46857090/adding-pyspark-python-path-in-oozie.
02-05-2020
10:12 PM
While running this program I got the following errors; can anyone help me with this?
[cloudera@quickstart ~]$ spark-shell --master yarn-client
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/zookeeper/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/flume-ng/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/parquet/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/avro/avro-tools-1.7.6-cdh5.13.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Welcome to Spark version 1.6.0
Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_67)
Type in expressions to have them evaluated.
Type :help for more information.
20/02/05 22:03:26 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
20/02/05 22:03:42 WARN shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
Spark context available as sc (master = yarn-client, app id = application_1580968178673_0001).
SQL context available as sqlContext.
07-30-2019
11:52 PM
1 Kudo
Since the "list" command gets the apps from the ResourceManager and doesn't set any explicit filters or limits (beyond those provided with it) on the request, it technically returns all the applications currently known to the RM. That number is controlled by the "yarn.resourcemanager.max-completed-applications" config. Hope that clarifies.
06-07-2019
01:01 AM
1 Kudo
As your intent seems to be capturing the driver logs in a separate file while executing the app in cluster mode, make sure that the '/some/path/to/edgeNode/' directory is present on all of the NodeManager hosts, since in cluster mode the driver runs inside the YARN application's ApplicationMaster. If you can't guarantee that, follow the general practice of pointing the log file to a pre-existing path, e.g. "/var/log/SparkDriver.log".
09-24-2018
11:19 PM
Hello Experts,
We are upgrading our Cloudera Hive from 1.3 to 2.0. Could you please let us know if there are any known issues related to this? I did a search in Tableau and the Cloudera Community, but I didn't find any issues.
Thanks in Advance!!!
Regards,
Muthu Venkatesh
07-28-2017
12:20 PM
How do I query Hive tables from Spark 2.0? Could you share the steps?
05-01-2017
11:11 PM
Hi, I am facing the issue mentioned below. Please help me solve it.
17/05/02 11:07:13 ERROR ShutdownHookManager: Exception while deleting Spark temp dir: C:\Users\arpitbh\AppData\Local\Temp\spark-07d9637a-2eb8-4a32-8490-01e106a80d6b
java.io.IOException: Failed to delete: C:\Users\arpitbh\AppData\Local\Temp\spark-07d9637a-2eb8-4a32-8490-01e106a80d6b
  at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:1010)
  at org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:65)
  at org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:62)
  at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
  at org.apache.spark.util.ShutdownHookManager$$anonfun$1.apply$mcV$sp(ShutdownHookManager.scala:62)
  at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216)
  at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188)
  at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
  at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
  at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1951)
  at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:188)
  at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
  at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
  at scala.util.Try$.apply(Try.scala:192)
  at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
  at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
  at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
04-14-2017
08:31 PM
It is the line below which is setting the data type for both fields to StringType:
val schema = StructType(
  schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))
You can define your own custom schema as follows:
val customSchema = StructType(Array(
  StructField("name", StringType, true),
  StructField("age", IntegerType, true)))
You can add additional fields as well in the above schema definition. Then you can use this customSchema while creating the DataFrame as follows:
val peopleDataFrame = sqlContext.createDataFrame(rowRDD, customSchema)
Also, for details, please see this page.