Conducted an in-depth feasibility study on leveraging advanced large language models (LLMs) for automating the generation of dbt (data build tool) documentation, addressing VRT's undocumented dbt codebase
In this project, I explored the feasibility of using Large Language Models (LLMs) to document dbt (data build tool) queries and implemented a solution tailored to the VRT data team's needs. dbt is an open-source command-line tool designed for data analysts and engineers to collaboratively develop, test, and deploy data transformations.
The VRT data team maintained a large dbt project with numerous models, packages, and macros, but the project lacked documentation, which made the queries difficult to understand and maintain. My goal was to determine whether LLMs could generate clear, accurate documentation for both technical and non-technical users.
I ran initial inference experiments on Google Colab but hit GPU memory limits, so Llama 2-7b-Chat was the only model I could test successfully there. To scale up, I later moved to AWS SageMaker and experimented with Hugging Face's hosted API.
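As a rough illustration of that Colab-style setup, the sketch below loads the model with the Hugging Face transformers library; the meta-llama/Llama-2-7b-chat-hf checkpoint and the generation settings are assumptions, not the exact notebook configuration.

```python
# Minimal sketch of the Colab-style inference setup (assumed checkpoint and
# settings; the actual notebook configuration may have differed).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# Loading in float16 helps keep the 7B model within a single Colab GPU's memory.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

prompt = "Explain what the following dbt query does:\n\nSELECT ..."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```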
I found that prompts explicitly stating "dbt query" produced better responses. Complex cases, such as explaining models with multiple joins, required multi-step prompting, highlighting the need for efficient prompting strategies.
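The sketch below illustrates the multi-step idea in hedged form: `generate` stands in for whichever LLM call is in use (Colab, SageMaker endpoint, or hosted API), and the prompt wording is illustrative rather than the exact prompts from the project.

```python
# Hedged sketch of a two-step prompting strategy for multi-join dbt models.
# `generate` is a placeholder for the underlying LLM call.

def document_dbt_model(sql: str, generate) -> str:
    # Step 1: naming "dbt query" explicitly steers the model toward the
    # right domain instead of a generic SQL explanation.
    join_prompt = (
        "The following is a dbt query. List every join it performs and "
        f"explain, in one sentence each, why the tables are joined:\n\n{sql}"
    )
    join_summary = generate(join_prompt)

    # Step 2: feed the intermediate summary back in to produce the final,
    # reader-facing documentation for non-technical users.
    doc_prompt = (
        "Using this summary of the joins in a dbt query:\n"
        f"{join_summary}\n\n"
        "Write a short, plain-language description of what the model produces."
    )
    return generate(doc_prompt)
```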
Research and Planning Documentation
Throughout these implementations, I iteratively refined the solution in collaboration with the VRT data engineers, incorporating their feedback on the generated documentation to ensure clarity, accuracy, and relevance.
The first approach, prompt-based, requires minimal infrastructure: the dbt query and instructions are sent directly to the LLM as a prompt.
Prompt-Based Approach workflow
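A minimal sketch of this workflow follows, assuming the Hugging Face Inference API as the hosted LLM; the model id, file path, and token placeholder are illustrative, not the team's actual configuration.

```python
# Minimal-infrastructure sketch: read a dbt model's SQL and send it directly
# to a hosted LLM via the Hugging Face Inference API.
from pathlib import Path
import requests

API_URL = "https://api-inference.huggingface.co/models/meta-llama/Llama-2-7b-chat-hf"
HEADERS = {"Authorization": "Bearer <HF_API_TOKEN>"}  # placeholder token

# Hypothetical dbt model file, used only for illustration.
sql = Path("models/marts/example_model.sql").read_text()

prompt = (
    "The following is a dbt query. Document what it does for both technical "
    f"and non-technical readers:\n\n{sql}"
)

response = requests.post(API_URL, headers=HEADERS, json={"inputs": prompt})
print(response.json())
```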
The second approach, retrieval-augmented generation (RAG), uses embeddings and a vector store to retrieve richer context from the dbt project before prompting the model.
RAG Notebook Workflow
Key components: LangChain for orchestration, all-MiniLM-L6-v2 sentence-transformer embeddings, and dynamic chunk retrieval from Postgres with pgvector (after an initial Pinecone prototype); a sketch of the retrieval step follows below.
Response after system prompt
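The following is a hedged sketch of that retrieval step, assuming LangChain's community PGVector wrapper and a pgvector-enabled Postgres instance; the connection string, collection name, and example chunks are placeholders rather than the team's actual setup.

```python
# Hedged sketch of embedding dbt project chunks and retrieving context
# from Postgres + pgvector via LangChain (placeholder configuration).
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores.pgvector import PGVector

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# Hypothetical connection string; real credentials would live in a secrets store.
CONNECTION = "postgresql+psycopg2://user:password@localhost:5432/dbt_docs"

# Index chunks of dbt SQL / YAML so related context can be retrieved later.
chunks = [
    "{{ config(materialized='table') }} SELECT ...",
    "-- macro definition ...",
]
store = PGVector.from_texts(
    texts=chunks,
    embedding=embeddings,
    collection_name="dbt_project_chunks",
    connection_string=CONNECTION,
)

# At documentation time, pull the most relevant chunks and prepend them to the prompt.
context = store.similarity_search("What does this dbt model produce?", k=4)
context_text = "\n\n".join(doc.page_content for doc in context)
```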
The third approach, fine-tuning an LLM on dbt queries for domain-specific accuracy, remained conceptual and was not implemented.
Fine-tuning LLM conceptual diagram
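Purely as an illustration of the concept (this was not implemented during the project), a parameter-efficient fine-tuning setup could look roughly like the sketch below; all names and hyperparameters are assumptions.

```python
# Conceptual sketch only: parameter-efficient (LoRA) fine-tuning on
# (dbt query, documentation) pairs. Not implemented in the project;
# checkpoint and hyperparameters are illustrative.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

# LoRA trains small adapter matrices instead of updating all 7B parameters.
lora = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)

# Training pairs would map raw dbt SQL to human-approved documentation,
# e.g. {"prompt": "<dbt query>", "completion": "<approved documentation>"}.
```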
During my internship at VRT, I not only deepened my technical expertise in both dbt and Large Language Models, but also learned the power of collaboration and continuous improvement. By iteratively refining prompt designs, embedding strategies, and retrieval workflows, always incorporating feedback from the VRT data engineers, I gained a practical understanding of how to balance model capabilities with real-world constraints like token limits and GPU memory. Experimenting on Google Colab and AWS SageMaker taught me to troubleshoot performance bottlenecks in real time, while migrating from Pinecone to a Postgres + pgvector solution reinforced best practices in cost efficiency and data security. Leveraging LangChain and sentence-transformer embeddings expanded my skill set in modular pipeline design and semantic search.

Above all, this project underscored the importance of clear communication: translating complex SQL transformations into user-friendly documentation and iteratively validating those explanations with my teammates not only improved the end product but also strengthened my adaptability, problem-solving, and teamwork, lessons I'll carry forward into any future data engineering challenge.