∞ ── AI.RESEARCH
Research
Papers I've read deeply, reproduced, or applied in projects. Focused on LLM systems, efficient inference, and agentic AI.
Efficient ML
Flash Attention & Efficient Transformers
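Explored IO-aware exact attention algorithms that cut reads and writes to GPU high-bandwidth memory by tiling the computation in on-chip SRAM and recomputing attention blocks in the backward pass instead of storing the full attention matrix. Key insight: memory bandwidth is the bottleneck, not FLOPs. Applied this understanding to optimize inference pipelines.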
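To make the block-wise idea concrete, here is a minimal NumPy sketch of exact attention computed one key/value block at a time with a running (online) softmax. It only illustrates the arithmetic, not a GPU kernel: the function names and `block_size` are illustrative, and a real implementation fuses these steps on-chip.

```python
import numpy as np

def naive_attention(q, k, v):
    """Reference: materializes the full (n_q, n_k) score matrix."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def tiled_attention(q, k, v, block_size=4):
    """Same result, but only one (n_q, block_size) score tile exists at a time."""
    n_q, d = q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros((n_q, v.shape[1]))
    row_max = np.full(n_q, -np.inf)   # running max of scores per query
    row_sum = np.zeros(n_q)           # running softmax denominator per query

    for start in range(0, k.shape[0], block_size):
        k_blk = k[start:start + block_size]
        v_blk = v[start:start + block_size]
        scores = (q @ k_blk.T) * scale

        new_max = np.maximum(row_max, scores.max(axis=1))
        # Rescale what was accumulated so far to the new max (0 on the first block).
        correction = np.exp(row_max - new_max)
        p = np.exp(scores - new_max[:, None])

        row_sum = row_sum * correction + p.sum(axis=1)
        out = out * correction[:, None] + p @ v_blk
        row_max = new_max

    return out / row_sum[:, None]
```

Because the rescaling is exact, the tiled version matches the naive one up to floating-point error while never holding the full attention matrix in memory.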
Inference Optimization
KV Cache Compression & Long Context LLMs
Studied eviction strategies for KV caches to handle 100k+ token contexts. Implemented sliding window attention and experimented with StreamingLLM's attention sink phenomenon in local inference setups.
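As a rough illustration of the eviction policy, the sketch below keeps the first few tokens as attention sinks plus a sliding window of the most recent tokens, and drops everything in between. The class name, the default sizes, and the `(position, kv)` tuple layout are assumptions for illustration; StreamingLLM also re-indexes positions relative to the cache, which this sketch leaves out.

```python
from collections import deque

class SinkWindowCache:
    """Toy KV cache: a few permanent sink entries plus a rolling window."""

    def __init__(self, num_sinks=4, window=8):
        self.num_sinks = num_sinks
        self.sinks = []                      # earliest tokens, never evicted
        self.recent = deque(maxlen=window)   # oldest entry drops automatically

    def append(self, position, kv):
        if len(self.sinks) < self.num_sinks:
            self.sinks.append((position, kv))
        else:
            self.recent.append((position, kv))

    def positions(self):
        """Original token positions currently held in the cache."""
        return [p for p, _ in self.sinks] + [p for p, _ in self.recent]

    def __len__(self):
        return len(self.sinks) + len(self.recent)
```

The cache size stays bounded at `num_sinks + window` no matter how long generation runs, which is what makes very long streams feasible on local hardware.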
Model Compression
Quantization: GPTQ, GGUF, AWQ
Ran local experiments comparing 4-bit quantization schemes on Llama 2/3 class models. Measured perplexity degradation against throughput gains. Key finding from these runs: AWQ outperformed GPTQ on reasoning tasks at the same bit-width.
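For context on what 4-bit storage involves, here is a minimal sketch of plain group-wise round-to-nearest quantization in NumPy. This is the simple baseline that methods such as GPTQ and AWQ improve on, not either of those algorithms, and `group_size=32` is an illustrative choice.

```python
import numpy as np

def quantize_groupwise(weights, bits=4, group_size=32):
    """Asymmetric round-to-nearest quantization with one scale/offset per group."""
    levels = 2 ** bits - 1                       # 15 for 4-bit
    groups = weights.reshape(-1, group_size)     # assumes size divisible by group_size
    w_min = groups.min(axis=1, keepdims=True)
    w_max = groups.max(axis=1, keepdims=True)
    scale = (w_max - w_min) / levels
    scale = np.where(scale == 0, 1.0, scale)     # guard against constant groups
    codes = np.clip(np.round((groups - w_min) / scale), 0, levels).astype(np.uint8)
    return codes, scale, w_min

def dequantize_groupwise(codes, scale, w_min, shape):
    """Map integer codes back to approximate float weights."""
    return (codes * scale + w_min).reshape(shape)
```

The reconstruction error per weight is bounded by half the group's scale, so smaller groups give lower error at the cost of storing more scales and offsets.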
AI Agents
Agentic Frameworks & Tool-Use Patterns
Actively building with LangGraph to understand stateful multi-agent architectures. Exploring ReAct vs Plan-and-Execute patterns, tool routing strategies, and failure recovery mechanisms for production-grade agents.
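The ReAct pattern can be sketched without any framework as a loop of thought, action, and observation. The version below is plain Python rather than LangGraph, and `scripted_policy` is a hypothetical stand-in for an LLM call; the tool, the dictionary keys, and `max_steps` are all illustrative assumptions.

```python
def calculator(expression: str) -> str:
    """Toy tool: only understands 'a + b' with integers."""
    left, right = expression.split("+")
    return str(int(left) + int(right))

TOOLS = {"calculator": calculator}

def scripted_policy(question, history):
    """Hypothetical stand-in for the model: act once, then answer."""
    if not history:
        return {"thought": "I need to add the numbers.",
                "action": "calculator", "input": "2 + 3"}
    return {"thought": "I have the result.",
            "final": history[-1]["observation"]}

def react_loop(question, policy, tools, max_steps=5):
    """Alternate policy decisions and tool calls until a final answer or step limit."""
    history = []
    for _ in range(max_steps):
        step = policy(question, history)
        if "final" in step:
            return step["final"], history
        tool = tools.get(step["action"])
        if tool is None:
            observation = f"error: unknown tool {step['action']!r}"
        else:
            try:
                observation = tool(step["input"])
            except Exception as exc:
                # Feed the failure back so the policy can try to recover.
                observation = f"error: {exc}"
        history.append({**step, "observation": observation})
    return None, history  # step budget exhausted
```

Returning errors as observations instead of raising, together with the hard step limit, is one simple form of failure recovery: the policy sees what went wrong, and the loop cannot run forever.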
Fine-Tuning
LoRA & Parameter-Efficient Fine-Tuning
Fine-tuned Mistral 7B and Phi-3 on domain-specific datasets using LoRA adapters. Studied the rank vs capacity tradeoff and experimented with QLoRA for consumer GPU training. Results: strong domain alignment with under 1% additional trainable parameters.
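The parameter overhead follows from the low-rank shapes: a frozen weight of size d_out × d_in gains two small matrices with rank × (d_in + d_out) entries in total. The NumPy sketch below shows that arithmetic and a LoRA-style forward pass; the 4096 × 4096 layer size, rank, and alpha are illustrative values, not the settings used in these runs.

```python
import numpy as np

def lora_param_fraction(d_in, d_out, rank):
    """Adapter parameters as a fraction of the frozen weight's parameters."""
    base = d_in * d_out
    adapter = rank * (d_in + d_out)
    return adapter / base

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update scaled by alpha / rank."""

    def __init__(self, weight, rank, alpha, rng):
        d_out, d_in = weight.shape
        self.weight = weight                                   # frozen
        self.a = rng.normal(scale=0.01, size=(rank, d_in))     # trainable
        self.b = np.zeros((d_out, rank))                       # trainable, zero init
        self.scaling = alpha / rank

    def __call__(self, x):
        base = x @ self.weight.T
        update = (x @ self.a.T) @ self.b.T
        return base + update * self.scaling

# Rank 8 on a 4096 x 4096 projection adds about 0.39% parameters for that matrix.
print(f"{lora_param_fraction(4096, 4096, 8):.4%}")
```

Because `b` starts at zero, the adapted layer initially reproduces the base model exactly, and raising the rank trades more trainable parameters for more capacity.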
Retrieval Systems
RAG Architecture Patterns
Designed and deployed RAG pipelines comparing dense retrieval (Pinecone + ada-002) vs hybrid BM25+dense approaches. Exploring HyDE, query expansion, and re-ranking to improve retrieval quality on technical corpora.
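One common way to combine a BM25 ranking with a dense ranking is reciprocal rank fusion, sketched below. This is an illustrative choice and not necessarily the fusion method used in these pipelines; the document IDs are made up, and `k=60` is the constant commonly used for this method.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into a single ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    so documents ranked well by multiple retrievers rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score, breaking ties by ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

bm25_ranking = ["a", "b", "c"]    # hypothetical lexical results
dense_ranking = ["b", "d", "a"]   # hypothetical embedding results
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
```

Rank-based fusion sidesteps the problem that BM25 scores and cosine similarities live on different scales, and the fused list can then be handed to a re-ranker.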
Longer write-ups and notes live in the blog → Read blog