Deploying DeepSeek-R1 on GCP involves setting up a GPU-accelerated environment, configuring the model for inference, and deploying it as a scalable service. Self-hosting also brings reliability and control: after the recent large-scale cyberattacks that disrupted DeepSeek's own services, running the model yourself mitigates that exposure and ensures consistent access. This guide walks through the entire process using Google Cloud Run.
Prerequisites
- Google Cloud account with billing enabled
- Google Cloud SDK installed and initialized (gcloud init)
- Basic knowledge of Docker and Python
1. Set Up Google Cloud Environment
Create a New Project
- Go to Google Cloud Console.
- Click the project dropdown, select New Project, name it, and click Create.
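Alternatively, you can do the same from the CLI; the project ID below is a placeholder and must be globally unique:
gcloud projects create deepseek-r1-demo --name="DeepSeek R1"
gcloud config set project deepseek-r1-demo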
Enable Necessary APIs
Run the following to enable required APIs:
gcloud services enable compute.googleapis.com \
run.googleapis.com \
artifactregistry.googleapis.com
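To confirm they were enabled, list the project's enabled services and check that all three appear:
gcloud services list --enabled | grep -E 'run|compute|artifactregistry'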
2. Install Google Cloud SDK
macOS
brew install --cask google-cloud-sdk
gcloud init
Ubuntu/Debian
# google-cloud-cli is served from Google's apt repository; see
# https://cloud.google.com/sdk/docs/install for the one-time repository setup
sudo apt-get update && sudo apt-get install google-cloud-cli
gcloud init
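Whichever route you take, verify the installation and your active account before continuing:
gcloud --version
gcloud auth list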
3. Prepare the Application
Set Up Project Directory
mkdir deepseek-app && cd deepseek-app
python3 -m venv venv
source venv/bin/activate
Install Dependencies
Create requirements.txt:
fastapi
uvicorn
transformers
torch
Install packages:
pip install -r requirements.txt
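As an optional sanity check, confirm the libraries import cleanly; torch.cuda.is_available() will print False on a machine without a GPU, which is fine for local development:
python3 -c "import torch, transformers; print(torch.__version__, transformers.__version__, torch.cuda.is_available())"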
Create FastAPI Application
Create main.py:
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
app = FastAPI()

# DeepSeek-R1 distilled 7B model (Qwen-based); loaded once at startup
model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, device_map="auto", torch_dtype=torch.float16
)

class InferenceRequest(BaseModel):
    prompt: str
    max_tokens: int = 512

@app.post("/v1/inference")
async def inference(request: InferenceRequest):
    # Tokenize the prompt and move the tensors to the model's device
    inputs = tokenizer(request.prompt, return_tensors="pt").to(model.device)
    # Pass input_ids and attention_mask together so generate() sees both
    outputs = model.generate(**inputs, max_new_tokens=request.max_tokens)
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return {"response": response}
4. Containerize the Application
Create Dockerfile:
FROM nvidia/cuda:12.0.0-base-ubuntu22.04
RUN apt-get update && apt-get install -y python3 python3-pip

WORKDIR /app
COPY requirements.txt .
RUN pip3 install --no-cache-dir -r requirements.txt

COPY . .
EXPOSE 8080
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8080"]
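If you have Docker and the NVIDIA Container Toolkit installed locally, you can test the image before deploying (the tag name is arbitrary):
docker build -t deepseek-app .
docker run --gpus all -p 8080:8080 deepseek-app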
5. Deploy to Google Cloud Run with GPU Support
Cloud Run GPUs (NVIDIA L4) are available in select regions such as us-central1 and require at least 4 CPUs and 16 GiB of memory per instance:
gcloud run deploy deepseek-service \
  --source . \
  --region us-central1 \
  --platform managed \
  --gpu 1 \
  --gpu-type nvidia-l4 \
  --no-cpu-throttling \
  --memory 16Gi \
  --cpu 4 \
  --allow-unauthenticated
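The --source flag has Cloud Build produce the container image for you (this requires the cloudbuild.googleapis.com API). If you prefer to build and push explicitly, one approach is the following, where deepseek-repo is a placeholder repository name:
gcloud artifacts repositories create deepseek-repo \
  --repository-format=docker --location=us-central1
gcloud builds submit \
  --tag us-central1-docker.pkg.dev/$(gcloud config get-value project)/deepseek-repo/deepseek-app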
6. Test the Deployment
Get Service URL
gcloud run services describe deepseek-service --region us-central1 --format 'value(status.url)'
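For convenience, capture the URL in a shell variable so the test request can reference it:
SERVICE_URL=$(gcloud run services describe deepseek-service --region us-central1 --format 'value(status.url)')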
Send a Test Request
curl -X POST \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is the capital of France?", "max_tokens": 50}' \
  "$SERVICE_URL/v1/inference"
7. Monitor and Scale
Check Logs
gcloud logging read "resource.type=cloud_run_revision AND resource.labels.service_name=deepseek-service" --limit 50
Set Scaling Limits
gcloud run services update deepseek-service --region us-central1 --min-instances 1 --max-instances 5
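Because each request occupies the GPU for the full generation, it can also help to cap per-instance concurrency; the value of 1 below is a conservative starting point rather than a tuned setting:
gcloud run services update deepseek-service --region us-central1 --concurrency 1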
8. Secure the Service
To restrict access:
gcloud run services update deepseek-service --region us-central1 --no-allow-unauthenticated
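With unauthenticated access disabled, callers need the roles/run.invoker role and must send an identity token. For example, to grant yourself access (replace YOUR_EMAIL) and invoke the service:
gcloud run services add-iam-policy-binding deepseek-service \
  --region us-central1 \
  --member="user:YOUR_EMAIL" \
  --role="roles/run.invoker"
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-identity-token)" \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello", "max_tokens": 20}' \
  "$SERVICE_URL/v1/inference"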
9. Cleanup Resources
gcloud run services delete deepseek-service --region us-central1
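If you created a dedicated project for this guide, deleting the project removes any remaining resources (container images, build artifacts) in one step; the project ID matches the placeholder used earlier:
gcloud projects delete deepseek-r1-demo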