San Francisco, CA

Arnav Gupta

Building ML & AI Systems for Biology

I build ML systems that turn genomic data into drug targets. My work spans the full stack: ML, bioinformatics pipelines, foundation models, and agentic AI tools specialized for single-cell data.

01

About

I'm a Data Scientist specializing in Computational Biology at Gordian Biotechnology, where I build across the full stack of single-cell processing, from analysis pipelines to fine-tuning single-cell foundation models.

I work the way good early-stage teams do: with a bias toward shipping, honest benchmarking, and a focus on impact. Lately I've gone deeper on agentic AI and LLMs, building multi-agent and RAG systems that help scientists explore complex biological data.

I hold an M.S. in Computational Biology from Carnegie Mellon and dual degrees (B.E. in Electrical & Electronics Engineering and M.Sc. in Biological Sciences) from BITS Pilani, Goa, India.

02

Experience

Data Scientist I, Computational Biology

Mar 2025 – Present

Gordian Biotechnology · South San Francisco, CA

Designed a perturbation-assignment algorithm that lifted screen signal-to-noise by 10% across tissues (bioRxiv preprint), and deployed benchmarked ambient-RNA-removal pipelines across screens of >1M scRNA/snRNA-seq cells.

Senior Data Analyst, Human Genomics

Jun 2023 – Mar 2025

Gordian Biotechnology · South San Francisco, CA

Fine-tuned Geneformer, a single-cell foundation model, for disease classification and in-silico perturbation prediction to nominate screening targets. Architected end-to-end scRNA-seq pipelines for >1M cells/run on GCP, cutting processing from days to hours and feeding targets directly to wet-lab validation.

Computational Biology Intern

May 2022 – Aug 2022

Helomics · Pittsburgh, PA

Trained TensorFlow autoencoders for multi-omic representation learning on cancer patient data, beating the production model's accuracy by 10%.

Data Science Intern

Jan 2020 – Jun 2020

Bounce · Bengaluru, India

Worked on geospatial data to reduce asset loss and damage.

MITACS Globalink Summer Research Intern

May 2019 – Aug 2019

Queen's University · Kingston, ON

03

Projects

Machine Learning & AI

SpatialChat — Multi-Agent RAG for Spatial Transcriptomics

  • LangGraph
  • LangSmith
  • ChromaDB
  • Streamlit

An agentic system for exploring spatial transcriptomics data. A supervisor agent decomposes questions and routes them to Pydantic-validated, tool-calling agents for dataset discovery, spatial statistics, and gene-expression analysis. The system is backed by a ChromaDB RAG layer over dataset metadata, with LangSmith tracing and a Streamlit chat interface.

Multi-Omics Cancer Classification with Deep Representation Learning

  • PyTorch
  • Scikit-Learn

Trained autoencoders for unsupervised multi-omic representation learning (gene expression + CNV), reaching 95% accuracy on lung cancer subtype classification. Systematically benchmarked mono-omic vs. multi-omic embeddings across SVM and Random Forest classifiers, showing multi-omic representations outperform single-source features.

Bioinformatics

eCLIP RNA-Binding Protein Pipeline

  • eCLIP
  • bioinformatics pipeline
  • peak calling
  • motif discovery

An end-to-end bioinformatics pipeline for mapping RNA-binding protein (RBP) targets from eCLIP sequencing data. It processes raw reads through peak calling to identify protein–RNA binding sites, then runs motif discovery to characterize the sequence specificity of each RBP.

04

Publications

Full list on Google Scholar. Author name shown in bold.

Selected peer-reviewed articles

Preprint

05

Education

2021 – 2023

Carnegie Mellon University

M.S. Computational Biology

GPA 4.0 / 4.0 · Pittsburgh, PA

2015 – 2020

BITS Pilani, K.K. Birla Goa Campus

B.E. Electrical & Electronics Engineering
M.Sc. Biological Sciences

GPA 9.29 / 10.00 · Goa, India

06

Skills

Languages

  • Python
  • C++
  • SQL
  • R
  • Bash

AI / ML

  • PyTorch
  • TensorFlow
  • Scikit-Learn
  • Hugging Face Transformers
  • LangChain
  • LangGraph
  • RAG
  • ChromaDB

Agentic AI / LLMs

  • Multi-agent systems
  • Function calling
  • Tool orchestration
  • Prompt engineering

Bioinformatics

  • scRNA-seq
  • snRNA-seq
  • Spatial transcriptomics
  • CRISPR screens
  • Scanpy
  • Squidpy
  • AnnData
  • STAR
  • bedtools

Infrastructure

  • AWS
  • GCP
  • Docker
  • Singularity
  • Cromwell/WDL
  • Git
  • CI/CD
  • HPC

07

Contact

Open to conversations about computational biology, ML for genomics, and agentic AI. Get in touch.

arnav1211@gmail.com