SHASHWAT // SYSTEM ARCHIVE

∞ ── AI.RESEARCH

Research

Papers I've read deeply, reproduced, or applied in projects. Focused on LLM systems, efficient inference, and agentic AI.

DEEP READ · APPLIED · IMPLEMENTED · BUILDING
01

Efficient ML

Flash Attention & Efficient Transformers

DEEP READ

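Explored IO-aware exact attention algorithms that cut reads and writes to GPU high-bandwidth memory by tiling the computation through on-chip SRAM and recomputing attention blocks in the backward pass instead of storing the full attention matrix. Key insight: memory bandwidth, not FLOPs, is the bottleneck. Applied this understanding to optimize inference pipelines.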

Attention · CUDA · Transformers · Memory Efficiency
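A minimal NumPy sketch of the tiling idea behind IO-aware attention, added for illustration. It processes keys and values one block at a time and keeps a running max and running sum (online softmax), so the full score matrix is never materialized. The function names, shapes, and block size are assumptions for this example; a real kernel does this in on-chip memory on the GPU, which NumPy cannot show.

```python
import numpy as np

def naive_attention(q, k, v):
    """Reference version: materializes the full (n_q, n_k) score matrix."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def tiled_attention(q, k, v, block_size=16):
    """Exact attention computed one key/value block at a time.

    Only an (n_q, block_size) slice of scores exists at any moment. A running
    row max and row sum rescale the earlier partial results (online softmax).
    """
    n_q, d = q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros((n_q, v.shape[-1]))
    row_max = np.full((n_q, 1), -np.inf)
    row_sum = np.zeros((n_q, 1))
    for start in range(0, k.shape[0], block_size):
        k_blk = k[start:start + block_size]
        v_blk = v[start:start + block_size]
        s = (q @ k_blk.T) * scale
        new_max = np.maximum(row_max, s.max(axis=-1, keepdims=True))
        # Rescale what was accumulated under the old max (0 on the first block).
        correction = np.exp(row_max - new_max)
        p = np.exp(s - new_max)
        row_sum = row_sum * correction + p.sum(axis=-1, keepdims=True)
        out = out * correction + p @ v_blk
        row_max = new_max
    return out / row_sum

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((64, 32)) for _ in range(3))
print(np.allclose(tiled_attention(q, k, v), naive_attention(q, k, v)))
```

The result matches the naive version to floating-point tolerance, which is the sense in which this family of algorithms is "exact" rather than approximate.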
02

Inference Optimization

KV Cache Compression & Long Context LLMs

APPLIED

Studied eviction strategies for KV caches to handle 100k+ token contexts. Implemented sliding window attention and experimented with the attention-sink phenomenon described in StreamingLLM (keeping the first few tokens' KV entries alongside a recent window) in local inference setups.

KV Cache · Long Context · LLM Inference · Streaming
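A small sketch of the eviction policy described in this entry: keep the first few "sink" tokens permanently and a sliding window of the most recent tokens, dropping everything in between. The class name `SinkWindowCache` and the string placeholders standing in for key/value tensors are hypothetical. The sketch only tracks which positions survive and does not handle positional re-indexing inside the cache.

```python
from collections import deque

class SinkWindowCache:
    """Toy KV cache: permanent attention sinks plus a recent-token window."""

    def __init__(self, num_sinks=4, window=8):
        self.num_sinks = num_sinks
        self.sinks = []                      # never evicted
        self.recent = deque(maxlen=window)   # oldest entry drops automatically

    def append(self, position, kv):
        if len(self.sinks) < self.num_sinks:
            self.sinks.append((position, kv))
        else:
            self.recent.append((position, kv))

    def positions(self):
        """Original token positions currently held in the cache."""
        return [p for p, _ in self.sinks] + [p for p, _ in self.recent]

cache = SinkWindowCache(num_sinks=2, window=3)
for pos in range(10):
    cache.append(pos, kv=f"kv{pos}")  # placeholder for real key/value tensors

print(cache.positions())  # [0, 1, 7, 8, 9]
```

Memory stays bounded at `num_sinks + window` entries no matter how long the stream runs, which is the property that makes this policy attractive for long contexts.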
03

Model Compression

Quantization: GPTQ, GGUF, AWQ

IMPLEMENTED

Ran local experiments comparing 4-bit quantization schemes on Llama 2/3 class models. Measured perplexity degradation against throughput gains. Key finding from these runs: AWQ outperformed GPTQ on reasoning tasks at the same bit-width.

Quantization · GPTQ · AWQ · GGUF · Llama
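For orientation, here is a sketch of plain group-wise 4-bit round-to-nearest quantization. This is the simple baseline, not GPTQ or AWQ, both of which use calibration data to decide how to round or scale weights. The function names, group size, and random weight matrix are assumptions chosen for the example.

```python
import numpy as np

def quantize_rtn(w, bits=4, group_size=32):
    """Asymmetric round-to-nearest quantization, one scale per weight group."""
    levels = 2 ** bits - 1                       # 15 for 4-bit codes
    groups = w.reshape(-1, group_size)
    w_min = groups.min(axis=1, keepdims=True)
    w_max = groups.max(axis=1, keepdims=True)
    scale = (w_max - w_min) / levels
    scale = np.where(scale == 0, 1.0, scale)     # guard constant groups
    codes = np.clip(np.round((groups - w_min) / scale), 0, levels)
    return codes.astype(np.uint8), scale, w_min

def dequantize(codes, scale, w_min, shape):
    """Map integer codes back to approximate floating-point weights."""
    return (codes * scale + w_min).reshape(shape)

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 128)).astype(np.float32)  # stand-in weight matrix
codes, scale, w_min = quantize_rtn(w)
w_hat = dequantize(codes, scale, w_min, w.shape)
max_err = np.abs(w - w_hat).max()
print(f"max abs error: {max_err:.4f}")
```

Each weight lands within half a quantization step of its original value. Reconstruction error like this is only a proxy: comparing schemes properly still requires measuring perplexity or task accuracy on the quantized model.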
04

AI Agents

Agentic Frameworks & Tool-Use Patterns

BUILDING

Actively building with LangGraph to understand stateful multi-agent architectures. Exploring ReAct vs Plan-and-Execute patterns, tool routing strategies, and failure recovery mechanisms for production-grade agents.

LangGraph · ReAct · Tool Use · Multi-agent
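A framework-free sketch of a ReAct-style loop with basic tool routing and failure recovery. It deliberately does not use the LangGraph API. `run_react`, `scripted_policy`, and the `divide` tool are hypothetical names, and the scripted policy stands in for what would be an LLM call in a real agent.

```python
def run_react(policy, tools, question, max_steps=5):
    """Alternate between choosing an action and observing its result."""
    history = [("question", question)]
    for _ in range(max_steps):
        action, arg = policy(history)
        if action == "finish":
            return arg, history
        tool = tools.get(action)                 # route by tool name
        if tool is None:
            observation = f"error: unknown tool {action!r}"
        else:
            try:
                observation = tool(arg)
            except Exception as exc:
                # Feed the failure back to the policy instead of crashing.
                observation = f"error: {exc}"
        history.append((action, arg))
        history.append(("observation", observation))
    return None, history                         # step budget exhausted

def scripted_policy(history):
    """Stand-in for an LLM: reacts only to the most recent history entry."""
    kind, value = history[-1]
    if kind == "question":
        return "divide", (10, 0)                 # deliberately bad first try
    if kind == "observation" and str(value).startswith("error"):
        return "divide", (10, 4)                 # recover with a valid call
    return "finish", value

tools = {"divide": lambda args: args[0] / args[1]}
answer, trace = run_react(scripted_policy, tools, "What is 10 / 4?")
print(answer)  # 2.5
```

A Plan-and-Execute agent would instead produce the full list of steps up front and run them in order, replanning only on failure. The loop above decides one step at a time from the latest observation.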
05

Fine-Tuning

LoRA & Parameter-Efficient Fine-Tuning

APPLIED

Fine-tuned Mistral 7B and Phi-3 on domain-specific datasets using LoRA adapters. Studied the rank vs capacity tradeoff and experimented with QLoRA for consumer GPU training. Result: strong domain alignment while training fewer than 1% additional parameters.

LoRA · QLoRA · PEFT · Fine-tuning · HuggingFace
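A sketch of why LoRA's parameter overhead is so small, with illustrative numbers rather than the actual configuration used in these runs. A frozen weight matrix W of shape (d_out, d_in) gets a low-rank update B·A, where A is (r, d_in) and B is (d_out, r). The dimensions, rank, and `alpha` value below are assumptions.

```python
import numpy as np

def lora_forward(x, w, a, b, alpha):
    """y = x W^T + (alpha / r) * x A^T B^T, with W kept frozen."""
    r = a.shape[0]
    return x @ w.T + (alpha / r) * (x @ a.T) @ b.T

# Parameter count for one 4096 x 4096 projection at rank 8 (illustrative).
d_in, d_out, r = 4096, 4096, 8
full_params = d_in * d_out
lora_params = r * (d_in + d_out)
overhead = lora_params / full_params
print(f"{lora_params} trainable vs {full_params} frozen ({overhead:.2%})")

# Tiny forward pass: B starts at zero, so the adapter is initially a no-op.
rng = np.random.default_rng(0)
x = rng.standard_normal((2, 16))
w = rng.standard_normal((32, 16))
a = rng.standard_normal((4, 16)) * 0.01   # A: (r, d_in), small random init
b = np.zeros((32, 4))                     # B: (d_out, r), zero init
y = lora_forward(x, w, a, b, alpha=8)
print(np.allclose(y, x @ w.T))
```

Doubling the rank doubles the adapter's parameter count, so the rank vs capacity tradeoff is linear in trainable parameters. The whole-model overhead also depends on which layers receive adapters.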
06

Retrieval Systems

RAG Architecture Patterns

BUILDING

Designed and deployed RAG pipelines, comparing dense retrieval (Pinecone + ada-002) against hybrid BM25 + dense approaches. Exploring HyDE, query expansion, and re-ranking to improve retrieval quality on technical corpora.

RAG · Vector Search · Embeddings · BM25 · Re-ranking
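One common way to combine a BM25 ranking with a dense ranking is reciprocal rank fusion (RRF), sketched below. This is an assumption about how a hybrid pipeline might merge results, not necessarily the fusion method used in these pipelines. The document ids are made up, and `k=60` is a commonly used default constant.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc ids: score(d) = sum of 1 / (k + rank_of_d)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; break ties by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

bm25_ranking = ["doc_a", "doc_b", "doc_c"]    # hypothetical lexical results
dense_ranking = ["doc_b", "doc_d", "doc_a"]   # hypothetical embedding results
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)  # ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

RRF uses only ranks, so it sidesteps the problem that BM25 scores and cosine similarities live on different scales. A re-ranker can then be applied to the fused top results.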

Longer write-ups and notes live in the blog.