We are looking for a Lead Data Engineer to join the Machine Learning (ML) engineering development team. Your primary focus will be to gather requirements from ML and Data Science (DS) teams, identify the optimal solution, and then design, implement, monitor, and maintain scalable, distributed big data pipelines for a range of ML use cases. You will work with Data Scientists to train, refresh, and serve models using these pipelines.
Collaborate with ML engineers and Data Scientists to gather requirements.
Design and implement ETL big data pipelines to train ML models.
Build stream-processing and batch pipelines using UDFs and ML libraries, and load the processed data to multiple distributed data sources.
Apply API programming knowledge to train and serve ML models.
Select and integrate the big data tools and frameworks required for processing.
Take responsibility for the availability, scalability, reliability, and performance of the big data platform.
Skills and Qualifications
Minimum of 6 years of relevant experience.
Proven background in ETL development and large-scale data processing.
Proficiency with the big data ecosystem: Spark (PySpark), Hadoop, HDFS, Hive, NoSQL, and modern cloud data lakes (Cloudera Data Platform or Delta Lake).
Strong SQL expertise, including optimizing complex joins, and a solid grasp of database concepts.
Strong software development experience in languages such as Python and Java.
Experience building stream-processing systems using Spark Streaming.
Experience with workflow orchestration tools such as Oozie or Airflow.
Experience with Unix/Shell or Python scripting.
Knowledge of AWS is a plus.
Knowledge of AI/ML and MLOps is a plus.