Understanding Data
Data consists of distinct pieces of information that are gathered and translated for some purpose. Data can come in the form of text, observations, figures, images, numbers, graphs, or symbols. For example, data might include individual prices, weights, addresses, ages, names, temperatures, dates, or distances.
Types of Data
Structured
Unstructured
Semi-Structured
Structured Data - Data that is organized in a defined manner or schema, typically found in relational databases.
Characteristics: • Easily queryable • Organized in rows and columns • Has a consistent structure
Examples: • Database tables • CSV files with consistent columns • Excel spreadsheets
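To make "easily queryable" concrete, here is a minimal sketch using Python's built-in sqlite3 module. The products table and its rows are hypothetical, invented only for illustration.

```python
import sqlite3

# Hypothetical table and rows, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT, price REAL, weight_kg REAL)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [("kettle", 24.50, 1.2), ("toaster", 31.00, 1.8), ("mug", 6.25, 0.3)],
)

# Every row has the same columns, so a declarative query is all that is needed.
rows = conn.execute(
    "SELECT name FROM products WHERE price < ? ORDER BY name", (30,)
).fetchall()
cheap_names = [row[0] for row in rows]
print(cheap_names)
conn.close()
```

The consistent rows-and-columns layout is what lets the database answer the question without any custom parsing.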
Unstructured Data - Data that doesn't have a predefined structure or schema.
Characteristics: • Not easily queryable without preprocessing • May come in various formats
Examples: • Text files without a fixed format • Videos and audio files • Images • Emails and word processing documents
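The "preprocessing" step can be as simple as pulling recognizable patterns out of free text. A rough sketch, assuming a hypothetical customer note and two illustrative regular expressions (real text usually needs far more robust handling):

```python
import re

# Hypothetical free-text note; there are no columns or tags to query directly.
note = "Spoke to Ana on 2024-03-05; she paid 19.99 and will call back 2024-03-12."

# Preprocessing: extract ISO-style dates and decimal amounts with patterns.
dates = re.findall(r"\d{4}-\d{2}-\d{2}", note)
amounts = [float(m) for m in re.findall(r"\b\d+\.\d{2}\b", note)]

print(dates)
print(amounts)
```

Only after this extraction step do the dates and amounts become values that can be filtered or aggregated.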
Semi-Structured Data - Data that is not as organized as structured data but has some level of structure in the form of tags, hierarchies, or other patterns.
Characteristics: • Elements might be tagged or categorized in some way • More flexible than structured data but not as chaotic as unstructured data
Examples: • XML and JSON files • Emails (structured header fields such as date and subject, combined with unstructured data in the body) • Log files with varied formats
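A short sketch of what "some level of structure" looks like in practice, using Python's json module. The log lines below are hypothetical; note that the records carry named keys but do not all share the same fields.

```python
import json

# Hypothetical JSON log lines: tagged fields, but no fixed set of columns.
raw_lines = [
    '{"user": "ana", "action": "login", "device": {"os": "linux"}}',
    '{"user": "ben", "action": "purchase", "amount": 19.99}',
    '{"user": "ana", "action": "logout"}',
]
records = [json.loads(line) for line in raw_lines]

# The keys give structure, yet each record may omit or nest fields.
all_keys = sorted({key for record in records for key in record})
amounts = [record.get("amount") for record in records]
os_names = [record.get("device", {}).get("os") for record in records]

print(all_keys)
print(amounts)
print(os_names)
```

Missing fields show up as None here, which is the flexibility (and the extra handling) that separates semi-structured data from a fixed table.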
Properties of Data
Volume
Velocity
Variety
Volume - Refers to the amount or size of data that organizations are dealing with at any given time.
Characteristics: • May range from gigabytes to petabytes or even more • Challenges in storing, processing, and analyzing high volumes of data
Examples: • A popular social media platform processing terabytes of data daily from user posts, images, and videos. • Retailers collecting years' worth of transaction data, amounting to several petabytes.
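One common response to the processing challenge is to read data in fixed-size chunks so that memory use stays bounded no matter how large the input is. A minimal sketch, where count_lines and the in-memory stand-in for a large file are illustrative assumptions:

```python
import io

def count_lines(stream, chunk_size=1024):
    """Count newline characters while holding only one chunk in memory."""
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # an empty read means end of stream
            break
        total += chunk.count(b"\n")
    return total

# Stand-in for a large file; a real input would be opened in binary mode.
data = io.BytesIO(b"row\n" * 10_000)
line_count = count_lines(data, chunk_size=4096)
print(line_count)
```

The same loop works unchanged whether the input is a few kilobytes or many gigabytes, since only chunk_size bytes are held at once.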
Velocity - Refers to the speed at which new data is generated, collected, and processed.
Characteristics: • High velocity requires real-time or near-real-time processing capabilities • Rapid ingestion and processing can be critical for certain applications
Examples: • Sensor data from IoT devices streaming readings every millisecond. • High-frequency trading systems where milliseconds can make a difference in decision-making.
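Near-real-time processing usually means updating a result as each reading arrives instead of waiting for a complete batch. A small sketch, assuming a hypothetical sensor stream and a rolling window of three readings:

```python
from collections import deque

def rolling_mean(readings, window=3):
    """Yield the mean of the most recent `window` readings as each one arrives."""
    recent = deque(maxlen=window)  # older readings drop off automatically
    for value in readings:
        recent.append(value)
        yield sum(recent) / len(recent)

# Hypothetical sensor readings; in practice these would arrive over time.
stream = iter([20.0, 22.0, 24.0, 26.0])
means = list(rolling_mean(stream))
print(means)
```

Because the generator keeps only the last few readings, it can keep pace with an unbounded stream without storing its full history.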
Variety - Refers to the different types, structures, and sources of data.
Characteristics: • Data can be structured, semi-structured, or unstructured • Data can come from multiple sources and in various formats
Examples: • A business analyzing data from relational databases (structured), emails (unstructured), and JSON logs (semi-structured). • Healthcare systems collecting data from electronic medical records, wearable health devices, and patient feedback forms.
Managing Data - Data Warehouse, Data Lake & Data Lakehouse
Data Warehouse: A centralized repository optimized for analysis where data from different sources is stored in a structured format.
Characteristics: • Designed for complex queries and analysis • Data is extracted from source systems, cleaned and transformed, then loaded (ETL process) • Typically uses a star or snowflake schema • Optimized for read-heavy operations
Examples: • Amazon Redshift • Google BigQuery • Microsoft Azure SQL Data Warehouse (since rebranded as part of Azure Synapse Analytics)
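The ETL process and the star schema can be sketched on a very small scale with Python's built-in sqlite3 module. This is only an illustration of the idea, not how any of the products above work internally; the raw_sales records, table names, and the dimension_id helper are all hypothetical.

```python
import sqlite3

# Extract: hypothetical raw records with inconsistent formatting.
raw_sales = [
    {"product": " Kettle ", "store": "north", "amount": "24.50"},
    {"product": "mug", "store": "North", "amount": "6.25"},
    {"product": "Mug", "store": "south", "amount": "6.25"},
]

conn = sqlite3.connect(":memory:")
# Star schema: one fact table pointing at small dimension tables.
conn.executescript("""
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE dim_store (store_id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product(product_id),
        store_id INTEGER REFERENCES dim_store(store_id),
        amount REAL
    );
""")

def dimension_id(table, id_column, name):
    """Return the id for `name` in a dimension table, inserting it if new."""
    conn.execute(f"INSERT OR IGNORE INTO {table} (name) VALUES (?)", (name,))
    row = conn.execute(
        f"SELECT {id_column} FROM {table} WHERE name = ?", (name,)
    ).fetchone()
    return row[0]

for record in raw_sales:
    # Transform: normalize names and convert the amount to a number.
    product = record["product"].strip().lower()
    store = record["store"].strip().lower()
    amount = float(record["amount"])
    # Load: write the cleaned row into the fact table.
    conn.execute(
        "INSERT INTO fact_sales VALUES (?, ?, ?)",
        (
            dimension_id("dim_product", "product_id", product),
            dimension_id("dim_store", "store_id", store),
            amount,
        ),
    )

# Read-heavy analysis: join the fact table to a dimension and aggregate.
totals = conn.execute("""
    SELECT p.name, SUM(f.amount)
    FROM fact_sales AS f JOIN dim_product AS p USING (product_id)
    GROUP BY p.name ORDER BY p.name
""").fetchall()
print(totals)
conn.close()
```

Cleaning happens before loading, so every analytical query afterwards can rely on consistent names and numeric amounts.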
Data Lake: A storage repository that holds vast amounts of raw data in its native format, including structured, semi-structured, and unstructured data.
Characteristics: • Can store large volumes of raw data without a predefined schema • Data is loaded as-is, no need for preprocessing • Supports batch, real-time, and stream processing • Can be queried for data transformation or exploration purposes
Examples: • Amazon Simple Storage Service (S3) when used as a data lake • Azure Data Lake Storage • Hadoop Distributed File System (HDFS)
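The "loaded as-is" idea can be imitated locally with a plain directory standing in for the lake. In this rough sketch the file names and contents are hypothetical, and a temporary folder replaces real object storage such as S3 or HDFS.

```python
import csv
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as lake_dir:
    lake = Path(lake_dir)

    # Load as-is: files land in their native formats with no shared schema.
    (lake / "orders.csv").write_text("order_id,amount\n1,24.50\n2,6.25\n")
    (lake / "events.json").write_text('[{"user": "ana", "action": "login"}]')
    (lake / "note.txt").write_text("Customer asked about delivery times.")

    stored = sorted(path.name for path in lake.iterdir())

    # Structure is imposed only when a file is read for a specific question.
    with (lake / "orders.csv").open(newline="") as handle:
        order_total = sum(float(row["amount"]) for row in csv.DictReader(handle))

print(stored)
print(order_total)
```

Nothing was validated or transformed on the way in; interpreting the CSV columns was deferred until the total was actually needed.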
Data Lakehouse: A hybrid data architecture that combines the best features of data lakes and data warehouses, aiming to provide the performance, reliability, and capabilities of a data warehouse while maintaining the flexibility, scale, and low-cost storage of data lakes.
Characteristics: • Supports both structured and unstructured data. • Allows for schema-on-write and schema-on-read. • Provides capabilities for both detailed analytics and machine learning tasks. • Typically built on top of cloud or distributed architectures. • Benefits from technologies like Delta Lake, which bring ACID transactions to big data.
Examples: • AWS Lake Formation (with S3 and Redshift Spectrum) • Delta Lake: An open-source storage layer that brings ACID transactions to Apache Spark and big data workloads. • Databricks Lakehouse Platform: A unified platform that combines the capabilities of data lakes and data warehouses. • Azure Synapse Analytics: Microsoft's analytics service that brings together big data and data warehousing
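The difference between schema-on-write and schema-on-read can be shown in a few lines of plain Python. This is a conceptual sketch only and does not reflect how Delta Lake or any platform above implements it; SCHEMA, write_validated, and read_with_schema are hypothetical names.

```python
# Hypothetical schema: field name -> expected Python type.
SCHEMA = {"order_id": int, "amount": float}

def write_validated(table, record):
    """Schema-on-write: reject a record unless it matches SCHEMA exactly."""
    if set(record) != set(SCHEMA):
        raise ValueError(f"unexpected fields: {sorted(record)}")
    for field, expected in SCHEMA.items():
        if not isinstance(record[field], expected):
            raise ValueError(f"{field} must be {expected.__name__}")
    table.append(record)

def read_with_schema(raw_records):
    """Schema-on-read: keep raw data untouched, coerce it only when read."""
    for raw in raw_records:
        try:
            yield {field: cast(raw[field]) for field, cast in SCHEMA.items()}
        except (KeyError, ValueError, TypeError):
            continue  # skip records that cannot be interpreted under SCHEMA

# Schema-on-write: bad data is stopped at the door.
table = []
write_validated(table, {"order_id": 1, "amount": 24.5})
try:
    write_validated(table, {"order_id": "1", "amount": 24.5})
    rejected = False
except ValueError:
    rejected = True

# Schema-on-read: everything is stored, problems surface at query time.
raw_store = [{"order_id": "2", "amount": "6.25"}, {"order_id": "x"}]
readable = list(read_with_schema(raw_store))

print(table, rejected, readable)
```

Validating on write gives warehouse-style reliability, while coercing on read keeps lake-style flexibility; a lakehouse aims to offer both paths over the same storage.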