Overview of Data Engineering
Data Engineering Teams
The data engineer is a hub between data producers, such as software engineers and data architects, and data consumers, such as data analysts, data scientists, and ML engineers. In addition, data engineers interact with those in operational roles, such as DevOps engineers and site-reliability engineers (SREs).
ML Engineers vs. Data Engineers
The ML engineer's role overlaps with data engineering, but ML engineers develop more advanced ML techniques, train models, and design and maintain the infrastructure that runs ML processes. The role emphasizes MLOps and other mature practices such as DevOps.
Main components of Data Engineering
Data Generation
- Data is coming from a variety of sources now
- It's hard to keep data in a single schema; schema evolution is now treated as part of an Agile approach to data.
Storage
- Cloud architectures often have several storage solutions
- Hot data is data commonly retrieved many times per day. It should be geared for fast retrieval
- Cold data is seldom queried and mostly for archival purposes.
Ingestion
- A common bottleneck of the lifecycle; source systems are outside of your direct control and might have data of poor quality
- The separation of storage and compute makes streaming more popular
- Can my downstream services handle the volume of data I'm sending them?
- Batch is still widely used because of legacy systems; streaming can be useful, but it adds complexity
Key Questions:
- What are the use cases for the data?
- What is the destination?
- In what volume will it arrive?
- What format is the data in?
Transformation
- Change data into correct types, put records into standard formats, and remove bad records
- Large-scale aggregation for reporting or featurizing data for ML processes
Serving
- Data is served for analytics, ML, and reverse ETL
- Before investing in ML, companies need to build a solid data foundation
- This means setting up the best systems and architectures across the data engineering and ML lifecycles.
- Analytics then ML
- Reverse ETL feeds data back into source systems. It's useful for businesses relying on SaaS.
Data Ops
- DataOps automation has a similar framework and workflow to DevOps, consisting of change management (environment, code, and data version control), continuous integration/continuous deployment (CI/CD), and configuration as code.
- DataOps consists of automation, observability and monitoring, and incident response
Observability and Monitoring
- Data is a silent killer; bad data can linger in reports for a long time
- You need to implement observing, monitoring, logging, alerting, and tracing
- DODD (data observability-driven development) focuses on making data observability a first-class consideration in the data engineering lifecycle
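As a concrete illustration of monitoring, here is a minimal Python sketch of a data-freshness check. It assumes a table with an updated_at column and an illustrative six-hour SLA; the table name and alerting are hypothetical stand-ins.

import sqlite3
from datetime import datetime, timedelta, timezone

FRESHNESS_SLA = timedelta(hours=6)  # assumed SLA; tune per dataset

def check_freshness(con, table):
    # Find the newest record and compare its age against the SLA.
    newest_str = con.execute(f"SELECT MAX(updated_at) FROM {table}").fetchone()[0]
    newest = datetime.fromisoformat(newest_str)
    lag = datetime.now(timezone.utc) - newest
    if lag > FRESHNESS_SLA:
        print(f"ALERT: {table} is stale by {lag}")  # in production, page or notify
        return False
    return True

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (id INTEGER, updated_at TEXT)")
con.execute("INSERT INTO orders VALUES (1, ?)", (datetime.now(timezone.utc).isoformat(),))
print(check_freshness(con, "orders"))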
Data Architecture
Types of data architectures
- Data warehouse: structured, includes the compute
- A subject-oriented, integrated, nonvolatile, and time-variant collection of data in support of management decisions
- Separates OLAP from OLTP
- Centralized and organizes data
- e.g., Snowflake, Amazon Redshift
- Data lake: unstructured, data can be in any format
- Common during big data era
- Data can be queried with MapReduce, Spark, Presto, etc...
- Data lakehouse: combines benefits from both; introduced by Databricks
- Includes the controls, data management, and data structures of a warehouse
- Still houses data in object storage and supports a variety of query and transformation engines
Modern Data Stack
- Instead of using monolithic toolsets, use cloud-based, plug and play, off-the-shelf components
- Includes data pipelines, storage, monitoring, etc...
Data Generation
Data Logs
- Insert only: always retain the records, add a timestamp
Messages and Streams
- Message queues and streaming platforms are often used interchangeably
- A message is raw data communicated across two or more systems
- A message is normally sent through a message queue from a publisher to a consumer
- Once the message is delivered, it is removed from the queue
- A stream is an append only log of event records
- You use streams when you care about what happened over many events
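A toy Python contrast between the two, using in-memory structures instead of a real broker: consuming from a queue removes the message, while a stream is an append-only log that consumers read by offset.

from collections import deque

# Message queue: a delivered message is removed from the queue.
queue = deque()
queue.append({"msg_id": 1, "body": "order created"})
msg = queue.popleft()              # consumed and gone; another consumer won't see it
print("queue after consume:", list(queue))

# Stream: an append-only log; reading never removes records.
stream = []
stream.append({"offset": 0, "event": "order created"})
stream.append({"offset": 1, "event": "order shipped"})

consumer_offset = 0                # each consumer tracks its own position
while consumer_offset < len(stream):
    print("stream read:", stream[consumer_offset])
    consumer_offset += 1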
Types of time
- Event time: when the event is generated
- Ingestion time: when the event is ingested
- Process time: when the event is processed
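A minimal example of one event record annotated with all three timestamps as it moves through a pipeline; the field names are illustrative.

from datetime import datetime, timezone

event = {"type": "click", "event_time": "2024-01-01T12:00:00+00:00"}  # set by the producer
event["ingestion_time"] = datetime.now(timezone.utc).isoformat()      # set on arrival
event["process_time"] = datetime.now(timezone.utc).isoformat()        # set when processed
print(event)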
Ways of ingesting data
- APIs
- REST
- GraphQL
- WebHooks
- Event-based data transmission pattern
- Data source can be an application backend, a web page, or a mobile app
- RPC and gRPC: remote procedure call library; bidirectional exchange of data over HTTP/2
- Imposes more technical standards than REST
- Message queues
- Critical for decoupled microservices and event-driven architectures
- Need to keep in mind the frequency of delivery, message ordering and scalability
- Message ordering is not always guaranteed, which can cause issues for consumers
- Ideally, message processing should be idempotent: the outcome is the same after processing a message multiple times (see the sketch after this list)
- Event streaming platform
- Produces data in an ordered log of records
- Includes
- Topics - collection of related events
- Stream partitions
- Tend to be more fault tolerant because they're distributed
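A minimal Python sketch of the idempotency point above: processing the same message twice leaves the result unchanged because processed message IDs are remembered. Names and amounts are illustrative.

processed_ids = set()
totals = {"revenue": 0}

def handle(message):
    if message["msg_id"] in processed_ids:   # duplicate delivery: ignore
        return
    totals["revenue"] += message["amount"]
    processed_ids.add(message["msg_id"])

# Queues may redeliver a message, e.g., after a timeout or retry.
for msg in [{"msg_id": "a1", "amount": 10},
            {"msg_id": "a1", "amount": 10},  # duplicate
            {"msg_id": "b2", "amount": 5}]:
    handle(msg)

print(totals)  # {'revenue': 15}, not 25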
Ingestion undercurrents
- Security: how is the data secured and encrypted; do we have a VPN, or an SSO?
- Data management: who manages the data, what's the quality, what if the schema changes, is the data HIPAA?
- DataOps: how will we know when there's an issue, how are we monitoring, is the schema conformant, what do we do if something bad happens
- Architecture: what if the system fails, what do we do if we lose data, how available are the sources, who's in charge of them
- Orchestration: how often do we receive the data, can we integrate with the upstream application team
- SWE: Can the code access the source with the right credentials? How do we authenticate? How do we access the data (API, REST)? Can we parallelize the work? How do we deploy code changes?
Storage
The raw ingredients are disk drives, memory, networking and CPU, serialization, compression, and caching
Raw Ingredients
- Magnetic Disk Drives
- They can be slow but are cheap
- They are constrained by disk transfer speed and rotational latency
- Storing info in parallel can make them faster
- The max transfer speed is 200-300 MB/s (scanning a full 30 TB drive at that rate takes roughly 28-42 hours)
- Used widely because of how cheap they are
- If we use multiple, the bottleneck becomes network and CPU
- Solid State Drives
- Stores data in flash memory cells
- They're really fast (0.1 ms)
- Used for OLTP systems; they let databases such as Postgres serve many transactions per second
- Cost 20-30 cents per GB (10x magnetic drives)
- SSD can be used in some OLAP for caching
- RAM
- Attached to the CPU; stores the code and data the CPU works on
- It's volatile: data is lost if power is interrupted
- 1000 times faster than SSD
- $10/GB
- Stores data in capacitors, which need to be refreshed periodically
- Networking and CPU
- Availability zones impact the storage access
- Trade-off in spreading data across zones vs. keeping it in one zone
- Serialization
- Flattening and packing data into a standard format that a reader can decode
- e.g., Apache Parquet (see the sketch after this list)
- Compression
- Increases effective scan speed per disk
- Caching
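A short Python sketch of serialization and compression together, assuming pyarrow is installed: an in-memory table is packed into a columnar Parquet file with snappy compression that any Parquet reader can decode.

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "order_id": [1, 2, 3],
    "amount": [19.99, 5.00, 42.50],
})

# Serialize to a standard columnar format; compression shrinks bytes on disk
# and improves effective scan speed.
pq.write_table(table, "orders.parquet", compression="snappy")

print(pq.read_table("orders.parquet").to_pydict())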
Data Storage Systems
BASE
- Basis of eventual consistency
- Basically available; consistency is not guaranteed
- Soft state - state of transaction is fuzzy
- Eventual consistency - at some point, reading data will return consistent values
Why do BASE? It allows us to use large-scale distributed systems, i.e., to scale horizontally.
When do we do strong consistency? When we can tolerate longer query times but want the correct data every time.
Types of file storage
- Cloud filesystem services: e.g., Amazon Elastic File System; exposed through the NFS4 protocol
- Block storage: type of raw storage provided by SSDs and magnetic disks; standard for VMs
- RAID: redundant array of independent disks. Controls multiple disks to improve data durability, enhance performance, and combine capacity
- Cloud virtualized block storage: similar to storage area network (block storage over a network). They separate data from the host server. Can persist the data when an EC2 instance shuts down. Highly scalable.
- Object storage - contains objects of all shapes and sizes, e.g., Amazon S3, Azure Blob Storage. Compared with file storage on a local disk, we lose a lot of write flexibility: objects are written once as a stream of bytes, and to change or append data you must rewrite the full object (see the sketch after this list).
- Object stores are the gold standard for data lakes.
- Object stores are key value stores.
- We have bucket names and item values
- Cache and memory storage systems: e.g., Memcached, a key-value store designed for caching DB query results, API call responses, and more. Redis, a key-value store, supports more complex data types. Can tolerate a small amount of data loss.
- Streaming storage: stored data is temporal; Kafka now allows long-term data retention.
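A minimal sketch of object storage as a key-value store, assuming boto3 and configured AWS credentials; the bucket and key names are hypothetical.

import boto3

s3 = boto3.client("s3")
bucket, key = "example-data-lake", "raw/orders/2024-01-01.json"

# Objects are written once as a stream of bytes; changing or appending data
# means rewriting the full object under the same key.
s3.put_object(Bucket=bucket, Key=key, Body=b'{"order_id": 1, "amount": 19.99}')

obj = s3.get_object(Bucket=bucket, Key=key)   # read it back by key
print(obj["Body"].read())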
Data abstractions
- Data warehouse: standard OLAP architecture.
- Data lake: massive store where data was retained raw and unprocessed. Originally built on Hadoop systems. Now we're separating compute and storage.
- Data lakehouse: combines aspects of the data warehouse and data lake. Stores data in object storage. Adds robust table and schema support for incremental updates and deletes. Has a file-management layer with data management and transformation tools. Easier to serve data because other tools can read directly from object storage.
Data Ingestion
It's the process of moving data from one place to another. A data pipeline is the combination of architecture, systems, and processes that move data through the stages of the DE lifecycle.
Considerations:
- Bounded vs. unbounded
- Discrete batches or a continuous stream of data
- Frequency
- Batch, micro batch or real time (or streaming)
- Sync or async
- With sync, the source, ingestion, and destination have complex dependencies and are tightly coupled
- With async, dependencies can operate at the level of individual events; like microservices.
- Serialization or deserialization
- Encoding the data from the source and preparing data structures for transmission
- Throughput and scalability
- Will your ingestion be able to keep up with a sudden influx of backlogged data
- Reliability and durability
- Reliability entails high uptime and proper failover for ingestion systems
- Durability entails making sure data isn't lost or corrupted
- Need to build redundancy and self healing based on the impact and cost of losing data
- Payload
- Data type
- Data shape (json, tabular, unstructured)
- Schema
- Push vs. pull vs. poll patterns
- Push: the source sends data to the destination
- Pull: the destination retrieves data from the source
- Poll: periodically check the source for changes, then pull when something new appears (see the sketch after this list)
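A small Python sketch of the poll pattern, using SQLite as a stand-in source and a hypothetical id watermark: the destination periodically checks for new rows and pulls only what it hasn't seen.

import time
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE source_events (id INTEGER PRIMARY KEY, payload TEXT)")
con.execute("INSERT INTO source_events (payload) VALUES ('order created')")

last_seen_id = 0
for _ in range(2):                           # two polling cycles for illustration
    rows = con.execute(
        "SELECT id, payload FROM source_events WHERE id > ? ORDER BY id",
        (last_seen_id,),
    ).fetchall()
    for row_id, payload in rows:             # pull only the new rows
        print("pulled:", row_id, payload)
        last_seen_id = row_id
    time.sleep(0.1)                          # in production, a schedule or interval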
Ways to ingest data
- Direct DB connection
- ODBC or JDBC via API
- Change Data Capture
- Ingesting changes from a source system
- Batch-oriented CDC; e.g., every 24 hours, pull the records that changed (see the watermark sketch after this list)
- Continuous CDC; captures all table history and supports near-real-time data ingestion
- CDC can consume database resources: CPU time, network bandwidth, disk bandwidth
- APIs
- Increasingly popular
- Common with SaaS platforms
- Message Queues and Event-Streaming
- Managed data connectors
- Managed by vendor
- SFTP and SCP; run over SSH connection
- Webhooks; aka reverse APIs
- The data provider defines an API request spec, and it makes the API calls rather than receiving them
- The consumer needs to provide an endpoint for the provider to call
- Can be brittle, difficult to maintain and inefficient
- Web scraping
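A minimal sketch of batch-oriented CDC using an updated_at high-water mark; the table, columns, and timestamps are illustrative.

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customers (id INTEGER, name TEXT, updated_at TEXT)")
con.executemany("INSERT INTO customers VALUES (?, ?, ?)", [
    (1, "Ada", "2024-01-01T00:00:00"),
    (2, "Grace", "2024-01-02T09:30:00"),
])

last_watermark = "2024-01-01T12:00:00"       # persisted between runs in practice

changed = con.execute(
    "SELECT id, name, updated_at FROM customers WHERE updated_at > ?",
    (last_watermark,),
).fetchall()
print("changed rows:", changed)              # only Grace's record is newer

new_watermark = max(r[2] for r in changed) if changed else last_watermark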
Undercurrent of data ingestion
- DataOps
- Need to monitor uptime, latency, and data volumes
- How will you respond if an ingestion job fails?
- Monitoring should be there from the beginning
- Is there proper testing and deployment automation? Can you roll it back?
- Data quality
- Data can regress whenever and without warning
- Need to monitor statistics, spikes, etc...
- Implement logs to capture history, checks, exception handling
- Can you do some statistical data testing? averages, nulls, outliers?
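A lightweight example of those statistical checks, assuming pandas; the thresholds are illustrative rather than recommendations.

import pandas as pd

batch = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "amount": [19.99, None, 42.50, 10_000.0],
})

null_rate = batch["amount"].isna().mean()
mean, std = batch["amount"].mean(), batch["amount"].std()
outliers = batch[(batch["amount"] - mean).abs() > 3 * std]   # simple 3-sigma rule

if null_rate > 0.05:
    print(f"WARNING: null rate {null_rate:.1%} exceeds threshold")
print("mean:", mean, "outliers flagged:", len(outliers))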
Data Transformation
The lifetime of a SQL query is:
- Query issued
- Parsing and conversion to bytecode
- Query planning and optimization
- Query execution
- Return results
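The planning step can be made visible with SQLite's EXPLAIN QUERY PLAN; a small sketch with an illustrative table and index.

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)")
con.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")

# The engine parses the SQL, picks a plan (here: use the index), executes it,
# and returns results.
plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT SUM(amount) FROM orders WHERE customer_id = ?",
    (42,),
).fetchall()
for step in plan:
    print(step)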
A data model represents the way data relates to the real world: how it must be organized to reflect the organization's processes, definitions, workflows, and logic.
Types of normalization (Codd normal forms)
- Denormalized - no normalization, nested and redundant data allowed
- First normal form - each column is unique and has a single value; the table has a unique primary key
- Second normal form - 1NF plus partial dependencies are removed
- Third normal form - 2NF plus each table contains only fields relevant to its primary key and has no transitive dependencies
Data modeling
- Star schema: data is modeled with two general types of tables: facts and dimensions
- Fact tables hold quantitative, numeric event data, e.g., order ID, customer key, date key, gross sale amount
- Dimension tables hold qualitative data referenced by facts and are smaller than fact tables, e.g., date key, year, quarter, month, day of week
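A tiny star schema in SQLite: a numeric fact table joined to a smaller date dimension for a report; the schema is illustrative.

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, year INTEGER, quarter INTEGER, month INTEGER);
CREATE TABLE fact_sales (order_id INTEGER, customer_key INTEGER, date_key INTEGER, gross_sale_amount REAL);
INSERT INTO dim_date VALUES (20240101, 2024, 1, 1);
INSERT INTO fact_sales VALUES (1, 7, 20240101, 19.99), (2, 8, 20240101, 42.50);
""")

report = con.execute("""
    SELECT d.year, d.month, SUM(f.gross_sale_amount)
    FROM fact_sales f JOIN dim_date d USING (date_key)
    GROUP BY d.year, d.month
""").fetchall()
print(report)  # one row per (year, month) with total gross sales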
Update patterns
- Truncate and reload; wipe old data and reload
- Insert only; insert new records without changing or deleting old records.
- Delete; deletes are more expensive than inserts in a columnar system. An insert-based (soft) delete adds a new record with a deleted flag instead of physically removing the row. Queries can get more complicated
- Upsert/merge; upserting takes a set of source records and looks for matches using a primary key. When a match is found, it updates the record; otherwise, it inserts the new record.
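A minimal upsert sketch using SQLite's INSERT ... ON CONFLICT; merge syntax differs by warehouse, so treat this as the general shape rather than a specific platform's API.

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, tier TEXT)")
con.execute("INSERT INTO customers VALUES (1, 'Ada', 'free')")

incoming = [(1, "Ada", "pro"), (2, "Grace", "free")]   # one match, one new record
con.executemany("""
    INSERT INTO customers (id, name, tier) VALUES (?, ?, ?)
    ON CONFLICT (id) DO UPDATE SET name = excluded.name, tier = excluded.tier
""", incoming)

print(con.execute("SELECT * FROM customers ORDER BY id").fetchall())
# [(1, 'Ada', 'pro'), (2, 'Grace', 'free')]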
Serving data
What a data engineer should know about ML:
- When to use which models?
- How to wrangle structured and unstructured data?
- How to encode data of various types?
- Batch vs. online learning?
- When to use AutoML vs. handcrafting an ML model?
- Classification vs. regression
- When to train locally, on a cluster, or at the edge
- What are data cascades?
New trends in data engineering
- Application stacks will be data stacks. Applications will integrate real time automation and decision making powered by streaming pipelines and ML
- We will have a live data stack; application and source systems will feed data to low latency fast query processing DBs, which will feed data to report, analytics, and ML
- We will have more streaming pipelines and real time analytical DBs
- Data will move in a continuous flow, moving away from batch ingestion
- Shift away from ELT toward stream, transform, and load.
- This live data stack will be powered by OLAP DBs built for streaming
- We will have tighter feedback loops between applications and ML.