Deploying DeepSeek-R1 on GCP involves setting up a GPU-accelerated environment, configuring the model for inference, and deploying it as a scalable service. Self-hosting also brings reliability and control: after the recent large-scale cyberattacks that disrupted DeepSeek's own services, running the model yourself mitigates that exposure and ensures consistent access. This guide walks through the entire process using Google Cloud Run.
Prerequisites
- Google Cloud account with billing enabled
- Google Cloud SDK installed and initialized (gcloud init)
- Basic knowledge of Docker and Python
1. Set Up Google Cloud Environment
Create a New Project
- Go to Google Cloud Console.
- Click the project dropdown, select New Project, name it, and click Create.
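Alternatively, you can do the same from the CLI; the project ID below is a placeholder and must be globally unique:
gcloud projects create deepseek-r1-demo --name="DeepSeek R1"
gcloud config set project deepseek-r1-demo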
Enable Necessary APIs
Run the following to enable required APIs:
gcloud services enable compute.googleapis.com \
run.googleapis.com \
artifactregistry.googleapis.com
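To confirm they were enabled, list the project's enabled services and check that all three appear:
gcloud services list --enabled | grep -E 'run|compute|artifactregistry'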
2. Install Google Cloud SDK
macOS
brew install --cask google-cloud-sdk
gcloud init
Ubuntu/Debian
# google-cloud-cli is served from Google's apt repository; see
# https://cloud.google.com/sdk/docs/install for the one-time repository setup
sudo apt-get update && sudo apt-get install google-cloud-cli
gcloud init
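Whichever route you take, verify the installation and your active account before continuing:
gcloud --version
gcloud auth list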
3. Prepare the Application
Set Up Project Directory
mkdir deepseek-app && cd deepseek-app
python3 -m venv venv
source venv/bin/activate
Install Dependencies
Create requirements.txt:
fastapi
uvicorn
transformers
torch
Install packages:
pip install -r requirements.txt
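As an optional sanity check, confirm the libraries import cleanly; torch.cuda.is_available() will print False on a machine without a GPU, which is fine for local development:
python3 -c "import torch, transformers; print(torch.__version__, transformers.__version__, torch.cuda.is_available())"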
Create FastAPI Application
Create main.py:
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
app = FastAPI()

# DeepSeek-R1 distilled 7B model (Qwen-based); loaded once at startup
model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, device_map="auto", torch_dtype=torch.float16
)

class InferenceRequest(BaseModel):
    prompt: str
    max_tokens: int = 512

@app.post("/v1/inference")
async def inference(request: InferenceRequest):
    # Tokenize the prompt and move the tensors to the model's device
    inputs = tokenizer(request.prompt, return_tensors="pt").to(model.device)
    # Pass input_ids and attention_mask together so generate() sees both
    outputs = model.generate(**inputs, max_new_tokens=request.max_tokens)
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return {"response": response}
4. Containerize the Application
Create Dockerfile:
FROM nvidia/cuda:12.0.0-base-ubuntu22.04
RUN apt-get update && apt-get install -y python3 python3-pip

WORKDIR /app
COPY requirements.txt .
RUN pip3 install --no-cache-dir -r requirements.txt

COPY . .
EXPOSE 8080
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8080"]
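If you have Docker and the NVIDIA Container Toolkit installed locally, you can test the image before deploying (the tag name is arbitrary):
docker build -t deepseek-app .
docker run --gpus all -p 8080:8080 deepseek-app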
5. Deploy to Google Cloud Run with GPU Support
Cloud Run GPUs (NVIDIA L4) are available in select regions such as us-central1 and require at least 4 CPUs and 16 GiB of memory per instance:
gcloud run deploy deepseek-service \
  --source . \
  --region us-central1 \
  --platform managed \
  --gpu 1 \
  --gpu-type nvidia-l4 \
  --no-cpu-throttling \
  --memory 16Gi \
  --cpu 4 \
  --allow-unauthenticated
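The --source flag has Cloud Build produce the container image for you (this requires the cloudbuild.googleapis.com API). If you prefer to build and push explicitly, one approach is the following, where deepseek-repo is a placeholder repository name:
gcloud artifacts repositories create deepseek-repo \
  --repository-format=docker --location=us-central1
gcloud builds submit \
  --tag us-central1-docker.pkg.dev/$(gcloud config get-value project)/deepseek-repo/deepseek-app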
6. Test the Deployment
Get Service URL
gcloud run services describe deepseek-service --region us-central1 --format 'value(status.url)'
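For convenience, capture the URL in a shell variable so the test request can reference it:
SERVICE_URL=$(gcloud run services describe deepseek-service --region us-central1 --format 'value(status.url)')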
Send a Test Request
curl -X POST \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is the capital of France?", "max_tokens": 50}' \
  "$SERVICE_URL/v1/inference"
7. Monitor and Scale
Check Logs
gcloud logging read "resource.type=cloud_run_revision AND resource.labels.service_name=deepseek-service" --limit 50
Set Scaling Limits
gcloud run services update deepseek-service --region us-central1 --min-instances 1 --max-instances 5
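Because each request occupies the GPU for the full generation, it can also help to cap per-instance concurrency; the value of 1 below is a conservative starting point rather than a tuned setting:
gcloud run services update deepseek-service --region us-central1 --concurrency 1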
8. Secure the Service
To restrict access:
gcloud run services update deepseek-service --region us-central1 --no-allow-unauthenticated
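With unauthenticated access disabled, callers need the roles/run.invoker role and must send an identity token. For example, to grant yourself access (replace YOUR_EMAIL) and invoke the service:
gcloud run services add-iam-policy-binding deepseek-service \
  --region us-central1 \
  --member="user:YOUR_EMAIL" \
  --role="roles/run.invoker"
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-identity-token)" \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello", "max_tokens": 20}' \
  "$SERVICE_URL/v1/inference"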
9. Cleanup Resources
gcloud run services delete deepseek-service --region us-central1
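If you created a dedicated project for this guide, deleting the project removes any remaining resources (container images, build artifacts) in one step; the project ID matches the placeholder used earlier:
gcloud projects delete deepseek-r1-demo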