Deploying DeepSeek-R1 on Google Cloud Platform (GCP)

Stan Chen
3 min read · Jan 31, 2025


Deploying DeepSeek-R1 on GCP involves setting up a GPU-accelerated environment, configuring the model for inference, and deploying it as a scalable service. This guide walks through the entire process using Google Cloud Run.

Self-hosting on GCP also offers greater reliability and control, especially in light of the recent large-scale cyberattacks that disrupted DeepSeek's own services. Running your own deployment mitigates that exposure and ensures consistent access to the model.

Prerequisites

  • Google Cloud account with billing enabled
  • Google Cloud SDK installed and initialized (gcloud init); a quick sanity check follows this list
  • Basic knowledge of Docker and Python
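
A quick way to confirm the SDK is installed, authenticated, and pointing at the right project (standard gcloud commands, nothing specific to this guide):

gcloud auth list
gcloud config get-value project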

1. Set Up Google Cloud Environment

Create a New Project

  1. Go to Google Cloud Console.
  2. Click the project dropdown, select New Project, name it, and click Create (or create the project from the command line, as shown below).
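
The same can be done with the gcloud CLI (the project ID below is a placeholder; billing still has to be linked in the console):

gcloud projects create my-deepseek-project
gcloud config set project my-deepseek-project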

Enable Necessary APIs

Run the following to enable required APIs:

gcloud services enable compute.googleapis.com \
run.googleapis.com \
artifactregistry.googleapis.com

2. Install Google Cloud SDK

macOS

brew install --cask google-cloud-sdk
gcloud init

Ubuntu/Debian

# Requires Google's Cloud SDK apt repository to be configured first (see the official install docs)
sudo apt-get update && sudo apt-get install google-cloud-sdk
gcloud init

3. Prepare the Application

Set Up Project Directory

mkdir deepseek-app && cd deepseek-app
python3 -m venv venv
source venv/bin/activate

Install Dependencies

Create requirements.txt:

fastapi
uvicorn
transformers
torch
accelerate

Install packages:

pip install -r requirements.txt

Create FastAPI Application

Create main.py:

from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

app = FastAPI()

# Load the tokenizer and the 7B distilled R1 model once at startup
# (fp16 weights, automatic device placement onto the GPU)
model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, device_map="auto", torch_dtype=torch.float16
)

class InferenceRequest(BaseModel):
    prompt: str
    max_tokens: int = 512

@app.post("/v1/inference")
async def inference(request: InferenceRequest):
    inputs = tokenizer(request.prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(inputs.input_ids, max_new_tokens=request.max_tokens)
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return {"response": response}
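
Before containerizing, you can optionally smoke-test the API on a machine with enough GPU memory (or temporarily swap in a smaller model just for the test):

uvicorn main:app --host 0.0.0.0 --port 8080
curl -X POST -H "Content-Type: application/json" -d '{"prompt": "Hello", "max_tokens": 20}' http://localhost:8080/v1/inference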

4. Containerize the Application

Create Dockerfile:

FROM nvidia/cuda:12.0.0-base-ubuntu22.04

RUN apt-get update && apt-get install -y python3 python3-pip

WORKDIR /app

COPY requirements.txt .
RUN pip3 install --no-cache-dir -r requirements.txt

COPY . .

EXPOSE 8080

CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8080"]
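
To verify the image before deploying, you can build and run it locally (the --gpus flag requires the NVIDIA Container Toolkit; the image tag is just an example):

docker build -t deepseek-app .
docker run --gpus all -p 8080:8080 deepseek-app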

5. Deploy to Google Cloud Run with GPU Support

At the time of writing, Cloud Run GPUs (NVIDIA L4) are available in select regions such as us-central1 and require at least 4 CPUs, 16 GiB of memory, and CPU always allocated; check the Cloud Run docs for current limits and quota.

gcloud run deploy deepseek-service \
--source . \
--region us-central1 \
--platform managed \
--gpu 1 \
--gpu-type nvidia-l4 \
--memory 16Gi \
--cpu 4 \
--no-cpu-throttling \
--allow-unauthenticated

6. Test the Deployment

Get Service URL

gcloud run services describe deepseek-service --region us-central1 --format 'value(status.url)'

Send a Test Request

curl -X POST \
-H "Content-Type: application/json" \
-d '{"prompt": "What is the capital of France?", "max_tokens": 50}' \
YOUR_SERVICE_URL/v1/inference
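
The two steps can be combined by capturing the URL in a shell variable (the variable name is arbitrary):

SERVICE_URL=$(gcloud run services describe deepseek-service --region us-central1 --format 'value(status.url)')
curl -X POST \
-H "Content-Type: application/json" \
-d '{"prompt": "What is the capital of France?", "max_tokens": 50}' \
"$SERVICE_URL/v1/inference"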

7. Monitor and Scale

Check Logs

gcloud logging read "resource.type=cloud_run_revision AND resource.labels.service_name=deepseek-service" --limit 50

Set Scaling Limits

gcloud run services update deepseek-service --region us-central1 --min-instances 1 --max-instances 5

8. Secure the Service

To restrict access:

gcloud run services update deepseek-service --region us-central1 --no-allow-unauthenticated
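
With unauthenticated access disabled, callers need the Cloud Run Invoker role and must pass an identity token, for example:

curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-identity-token)" \
-H "Content-Type: application/json" \
-d '{"prompt": "What is the capital of France?", "max_tokens": 50}' \
YOUR_SERVICE_URL/v1/inference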

9. Cleanup Resources

gcloud run services delete deepseek-service --region us-central1
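
If you created a dedicated project just for this guide, deleting the whole project removes any remaining resources as well (the project ID is the placeholder used earlier):

gcloud projects delete my-deepseek-project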
