Senior Data Engineer (Data + Applied AI)
Plume
About the Role
The Senior Data and AI Engineer is a high-impact individual contributor on the Data and AI team. This role is responsible for building, maintaining, and optimizing data pipelines, transformation models, and BI deliverables. Applied AI (RAG pipelines, MLOps) is a growing area of the role, not the primary focus today.
The right person for the role will be a skilled, self-sufficient engineer who takes well-defined architectural direction and executes it with precision, quality, and ownership. They will also contribute their own technical judgment to day-to-day implementation decisions.
They are deeply hands-on: writing production dbt models, building Airflow DAGs, delivering clinical dashboards, and contributing to RAG pipelines and MLOps workflows within a regulated healthcare environment. They will also work closely with offshore contractors, reviewing their work, providing feedback, and ensuring output meets the team's quality and engineering standards.
The ideal candidate will have a strong command of cloud data warehousing, dimensional modeling, Python, and applied AI tooling, as well as a collaborative mindset that thrives in a team-oriented environment. This role is an excellent opportunity for a senior engineer looking to deepen their full-stack data and AI expertise alongside experienced technical leadership.
Responsibilities:
- Building and maintaining production-grade data pipelines in cloud data warehouses such as Google BigQuery or equivalent, following architectural standards set by the Director of Data and AI.
- Designing and developing dbt models across bronze, silver, and gold layers, including a focus on quality and governance via automated tests, documentation, and incremental load strategies.
- Creating and optimizing Airflow DAGs for data workflow orchestration, including scheduling, dependency management, error handling, and alerting.
- Implementing dimensional data models and data mart structures that support clinical BI and ML feature consumption, guided by the team's modeling standards.
- Crafting clear, easy-to-understand visualizations and dashboards in Looker or equivalent BI tools, aligned with commonly used business analytics standards, in close collaboration with product analytics, finance, operations, growth, and clinical stakeholders.
- Integrating healthcare data from sources such as EHRs, Stripe, 3rd-party APIs, and application database feeds, normalizing incoming data into the unified data platform.
- Applying HIPAA-compliant data handling practices, including PHI/PII masking, tokenization, audit logging, and role-based access controls across all pipeline and AI system work.
- Architecting and implementing RAG pipelines — including document ingestion, chunking, embedding generation, and retrieval — using frameworks such as LangChain or LangGraph.
- Supporting MLOps workflows, including model training pipeline maintenance, deployment support, performance monitoring, and retraining triggers.
- Reviewing teammates' PRs, providing constructive technical feedback, and upholding the team's engineering standards.
- Collaborating closely with product managers to understand requirements and deliver reliable data and AI products.
- Monitoring and triaging assigned pipeline and data quality failures, escalating architectural issues as appropriate.
- Documenting pipeline designs, data models, and technical decisions in alignment with the team's governance and lineage tracking standards.
- Evaluating new tools and frameworks, providing hands-on prototyping and technical assessments.
Must-Have Requirements:
- 5+ years of hands-on experience in data engineering, analytics engineering, or a closely related role.
- 2+ years of experience working within the healthcare industry, including working knowledge of healthcare data standards, clinical workflows, regulated data environments, and domain-specific data visualizations.
- Working knowledge of HIPAA — including PHI/PII classification, data masking, audit logging, and access control requirements.
- Proven production experience with at least one major cloud data warehouse: BigQuery, Snowflake, or Redshift — including advanced SQL and query optimization.
- Strong hands-on experience with dbt (Core or Cloud), including incremental models, tests, documentation, and multi-environment workflows.
- Deep experience with Apache Airflow for workflow orchestration, including DAG design, scheduling, monitoring, and failure handling.
- Demonstrated knowledge of dimensional data modeling — star/snowflake schemas, SCD Types 1/2, fact and dimension table design.
- Hands-on experience delivering dashboards and reports in at least one enterprise BI tool: Looker, Power BI, Tableau, Qlik, etc.
- Proficiency in Python for data pipeline development, API integrations, and automation (Pandas, PySpark, or similar).
- Practical exposure to RAG pipeline development and LLM integration using LangChain, LangGraph, or LlamaIndex.
- Hands-on exposure to MLOps concepts — model deployment, monitoring, and retraining workflows.
- Knowledge of CI/CD tooling for data and AI workloads (GitHub Actions, dbt Cloud CI).
- Strong understanding of data quality and governance principles (lineage, access controls, data contracts, and automated testing), plus experience with data governance tools such as OpenMetadata.
- Excellent written and verbal communication skills with the ability to collaborate effectively across engineering, analytics, and clinical teams.
- Ability to work independently on assigned workstreams while keeping the Director and team informed of progress, blockers, and risks.
Nice-to-have:
- Experience with real-time or streaming data pipelines using Kafka, Kinesis, or Pub/Sub, particularly for ADT or clinical event feeds.
- Knowledge of vector databases such as Pinecone, Weaviate, FAISS, or Chroma.
- Familiarity with responsible AI principles, including bias detection and model explainability in a healthcare context.
- Experience with data observability tools such as Monte Carlo, Bigeye, or Soda
- Familiarity with data lakehouse patterns (Delta Lake, Iceberg, Apache Hudi).
- Experience working toward or maintaining SOC2 or HITRUST certification
- Familiarity with semantic layer tools (Looker LookML, dbt Semantic Layer).
- Experience with population health, revenue cycle, or clinical quality reporting datasets
- Exposure to Kubernetes or containerized ML workloads.
$158,000–$168,000 USD per year