Key Insights for Data Engineers: Mastering Big Data Practices
Understanding Big Data Insights
Before my time in Silicon Valley, I worked as a Business Intelligence Engineer and Data Engineer in the healthcare and tech industries. Two years of exposure to big data practices have since taught me more than all of my prior roles combined.
The Cost of Compute vs. Engineer Time
One major realization was how cheap compute is compared to a Data Engineer's time. I had previously focused heavily on query optimization, but at many companies the compute a query consumes costs far less than the hours spent tuning it. Squeezing out compute time looks worthwhile on paper, yet writing a highly optimized query can take days or even weeks on larger projects. When you weigh the cost of engineer time against the savings from optimization, shipping quickly is often the more economical choice. The same logic applies to storage: rather than laboring over a complex 'merge' statement, you can simply capture a full snapshot of the data on each refresh. The data piles up quickly, perhaps 1TB per pipeline per year, but the cost of storing it is minimal and still falling.
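To make the snapshot idea concrete, here is a minimal sketch of what it can look like in SQL. The table and column names are illustrative, not from any specific pipeline: each refresh appends the full extract with a load date, and no merge logic is needed.

```sql
-- Illustrative snapshot table: every refresh appends the full extract
-- along with the date it was captured, instead of merging changes in place.
CREATE TABLE IF NOT EXISTS analytics.orders_snapshot (
    snapshot_date DATE,
    order_id      BIGINT,
    customer_id   BIGINT,
    order_total   DECIMAL(12, 2)
);

-- Run on each refresh; earlier rows stay untouched, so history comes for free.
INSERT INTO analytics.orders_snapshot (snapshot_date, order_id, customer_id, order_total)
SELECT CURRENT_DATE, order_id, customer_id, order_total
FROM staging.orders;
```

Reading the latest state is then just a filter on the most recent snapshot_date, which usually costs far less engineer time than maintaining and debugging merge logic.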
SQL as a Codebase
Even though SQL is technically a query language, many organizations treat it as code: it is committed, reviewed, held to formatting standards, and documented, so the next developer isn't left guessing when they have to debug it. Large companies have built substantial infrastructure around their data, and that data is woven deeply into how they operate. As the application systems evolve, the queries that read from them have to evolve too. Keeping SQL alongside application code makes it easy for application engineers and Data Engineers to coordinate when a field or value changes meaning, and it leaves a traceable history of those business-context changes so data definitions and pipelines evolve in sync.
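As a rough illustration of what treating SQL as code can look like, here is a hypothetical view definition kept in version control. The file path, the ownership note, and the 90-day business rule are assumptions for the example, and the syntax is ANSI-leaning, so it may need adjusting for your warehouse.

```sql
-- analytics/views/active_customers.sql
-- Purpose : single, reviewed definition of an "active customer" for reporting.
-- Owner   : data engineering; changes go through the same review as app code.
CREATE OR REPLACE VIEW analytics.active_customers AS
SELECT
    customer_id,
    MAX(order_date) AS last_order_date
FROM analytics.orders
GROUP BY customer_id
HAVING MAX(order_date) >= CURRENT_DATE - INTERVAL '90' DAY;
```

When the business redefines what "active" means, the change shows up as a reviewed diff in the repository rather than a silent edit to a dashboard query.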
Centralizing Logic for Efficiency
Building libraries for SQL, reusable snippets and functions kept in importable files, can streamline work for every developer who touches the same data source. In my experience, source systems often produce peculiar output, such as unusual date formats or inconsistent casing. If each developer cleans that data in their own way, you end up with multiple sources of truth and growing technical debt. Centralizing that code should therefore be a core strategy. Teams that do this well check whether the problem has already been solved before writing their own fix; when it hasn't, I might create a dedicated library for that source system or table so future Data Engineers don't repeat the work and the debt never accumulates.
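Here is a minimal sketch of that kind of centralization, assuming a BigQuery-style warehouse and a made-up source that emits dates as 'YYYYMMDD' strings: the cleanup lives in one shared function instead of being re-implemented in every pipeline.

```sql
-- Hypothetical shared helper kept in a central `lib` dataset so every
-- pipeline normalizes this source's odd date format the same way.
-- BigQuery-style syntax; other warehouses have equivalent UDF features.
CREATE OR REPLACE FUNCTION lib.parse_source_date(raw STRING)
RETURNS DATE AS (
    SAFE.PARSE_DATE('%Y%m%d', NULLIF(TRIM(raw), ''))  -- blanks become NULL
);

-- Pipelines call the shared function instead of copying the parsing logic.
SELECT
    lib.parse_source_date(order_date_raw) AS order_date,
    LOWER(customer_email)                 AS customer_email  -- normalize casing once
FROM staging.vendor_orders;
```

If the source ever changes its format, the fix happens in one place and every downstream pipeline picks it up.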
Considerations for Smaller Enterprises
These insights might seem relevant only to large organizations, but free and open-source tools can achieve similar outcomes. Some companies have spent years building sophisticated internal tooling, yet the underlying principles belong in every workplace: spend compute freely where it saves engineer time, treat SQL as code, and centralize shared logic. Those are goals any Data Engineer can work toward.
Chapter 2: Essential Tools for Data Engineers
In this chapter, we explore the critical tools that every data engineer should be familiar with to thrive in 2024.
The first video, "What Tools Should Data Engineers Know In 2024," delves into the essential software and tools that enhance data engineering efficiency.
Chapter 3: Understanding Data Engineering Roles
Data engineering encompasses various roles and responsibilities that are crucial for organizational success.
The second video, "These 3 Things Can Help You Understand Data Engineering Roles," provides insights into the different facets of data engineering and how they contribute to business objectives.