An agentic system for exploring spatial transcriptomics data. A supervisor agent decomposes questions and routes them to Pydantic-validated, tool-calling agents for dataset discovery, spatial statistics, and gene-expression analysis. It is backed by a ChromaDB RAG layer over dataset metadata, with LangSmith tracing and a Streamlit chat interface.
San Francisco, CA
Arnav Gupta
Building ML & AI Systems for Biology
I build ML systems that turn genomic data into drug targets. My work spans the full stack: ML, bioinformatics pipelines, foundation models, and agentic AI tools for single-cell data.
About
I'm a Data Scientist specializing in Computational Biology at Gordian Biotechnology, where I build across the full stack of single-cell processing, from analysis pipelines to fine-tuning single-cell foundation models.
I work the way good early-stage teams do: with a bias toward shipping, honest benchmarking, and a focus on impact. Lately I've gone deeper on agentic AI and LLMs, building multi-agent and RAG systems that let scientists better explore complex biological data.
I hold an M.S. in Computational Biology from Carnegie Mellon and dual degrees (B.E. in Electrical & Electronics Engineering and M.Sc. in Biological Sciences) from BITS Pilani, Goa, India.
Experience
Data Scientist I, Computational Biology
Mar 2025 – Present
Designed a perturbation-assignment algorithm that lifted screen signal-to-noise by 10% across tissues (bioRxiv preprint), and deployed benchmarked ambient-RNA-removal pipelines across screens of >1M scRNA/snRNA-seq cells.
Senior Data Analyst, Human Genomics
Jun 2023 – Mar 2025
Fine-tuned Geneformer, a single-cell foundation model, for disease classification and in-silico perturbation prediction to nominate screening targets. Architected end-to-end scRNA-seq pipelines for >1M cells/run on GCP, cutting processing from days to hours and feeding targets directly to wet-lab validation.
Computational Biology Intern
May 2022 – Aug 2022
Trained TensorFlow autoencoders for multi-omic representation learning on cancer patient data, beating the production model's accuracy by 10%.
Data Science Intern
Jan 2020 – Jun 2020
Worked on geospatial data to reduce asset loss and damage.
MITACS Globalink Summer Research Intern
May 2019 – Aug 2019
Projects
Machine Learning & AI
Trained autoencoders for unsupervised multi-omic representation learning (gene expression + CNV), reaching 95% accuracy on lung cancer subtype classification. Systematically benchmarked mono-omic vs. multi-omic embeddings across SVM and Random Forest classifiers, showing multi-omic representations outperform single-source features.
Bioinformatics
An end-to-end bioinformatics pipeline for mapping RNA-binding protein (RBP) targets from eCLIP sequencing data, processing raw reads through peak calling to identify protein–RNA binding sites, followed by motif discovery to characterize the underlying sequence specificity of each RBP.
Publications
Full list on Google Scholar. Author name shown in bold.
Selected peer-reviewed articles
- Isoflavone diet ameliorates experimental autoimmune encephalomyelitis through modulation of gut bacteria depleted in patients with multiple sclerosis.
  Science Advances 7(28), eabd4595 (2021) · 80 citations
- Prospective correlation between the patient microbiome with response to and development of immune-mediated adverse effects to immunotherapy in lung cancer.
  BMC Cancer 21(1), 1–14 (2021) · 75 citations
- Type II diabetes mellitus and obesity: common links, existing therapeutics and future developments.
  Journal of Biosciences 44, 1–13 (2019) · 29 citations
Preprint
- A multispecies, modality-agnostic scalable in vivo mosaic screening platform for therapeutic target discovery.
  bioRxiv 2026.02.26.708253 (2026)
Education
Carnegie Mellon University
M.S. Computational Biology
BITS Pilani, K.K. Birla Goa Campus
B.E. Electrical & Electronics Engineering
M.Sc. Biological Sciences
Skills
Languages
- Python
- C++
- SQL
- R
- Bash
AI / ML
- PyTorch
- TensorFlow
- Scikit-Learn
- Hugging Face Transformers
- LangChain
- LangGraph
- RAG
- ChromaDB
Agentic AI / LLMs
- Multi-agent systems
- Function calling
- Tool orchestration
- Prompt engineering
Bioinformatics
- scRNA-seq
- snRNA-seq
- Spatial transcriptomics
- CRISPR screens
- Scanpy
- Squidpy
- AnnData
- STAR
- bedtools
Infrastructure
- AWS
- GCP
- Docker
- Singularity
- Cromwell/WDL
- Git
- CI/CD
- HPC
Contact
Open to conversations about computational biology, ML for genomics, and agentic AI. Get in touch.
arnav1211@gmail.com