Difference between S3 and a data lake

In the Snowflake as a Data Lake blog, we saw the importance of a data lake, its technical challenges, and how Snowflake can act as a data lake solution. We also touched on how a data lake can be implemented in Snowflake. In this part of the blog, we will see how Snowflake outplays other competitors in the market, such as Amazon S3 and Delta Lake.

We are considering the following factors for comparison:

  • Continuous data integration
  • Consumption and exposure of data
  • SQL interface
  • Sharing of data across accounts
  • Compression of data
  • Native stack (better integration)
  • Supported data formats
  • How each data lake solution updates data
A factor-by-factor comparison:

Continuous data integration

  • Snowflake: Built-in options such as Streams.
  • Amazon S3: Achieved using additional technologies or tools such as AWS Glue, Athena, and Spark.
  • Delta Lake: Achieved using ETL tools.

Consuming / exposing data

  • Snowflake: JDBC, ODBC, .NET, and Go drivers, plus Node.js, Python, Spark, and Kafka connectors. Snowflake also provides Java and Python APIs to simplify working with its REST API.
  • Amazon S3: REST API, SOAP API (deprecated), JDBC and ODBC drivers, and connectors for JavaScript, Python, PHP, .NET, Ruby, Java, C++, and Node.js.
  • Delta Lake: Delta ACID API for consuming data and the Delta JDBC connector for exposing it.

SQL interface

  • Snowflake: Built in (Worksheets).
  • Amazon S3: Needs Athena or Presto (additional cost).
  • Delta Lake: Apache Spark SQL, Azure SQL Data Warehouse/DB.

Sharing of data across accounts

  • Snowflake: Actual data is not copied or shared with another account; read-only access is provided to a consumer account. Sharing is achieved with a simple “share” command and incurs compute cost but no additional storage cost.
  • Amazon S3: Accessing files across accounts can be achieved using Amazon QuickSight, which incurs additional cost.
  • Delta Lake: Achieved using Azure Data Share, which is snapshot based. Azure Data Share incurs a cost for the operation that moves a dataset from source to destination, plus the cost of the resources used to move the data.

Compression (data storage)

  • Snowflake: Automatically compresses data as it stores it in a columnar format, at a ratio of roughly 4:1.
  • Amazon S3: Can be achieved manually using EC2 machines.
  • Delta Lake: Loads all data in the Apache Parquet file format to leverage efficient compression.

Native stack (better integration)

  • Snowflake: Snowflake partner tools provide better integration than other tools.
  • Amazon S3: Amazon stack (Amazon S3 for storage, Amazon Redshift as the data warehouse, Amazon Athena for querying, Amazon RDS as the database, AWS Data Pipeline for orchestration, etc.).
  • Delta Lake: Microsoft stack (Blob storage, Azure Databricks for data preparation, Azure Synapse Analytics as the data warehouse, Azure SQL DB as the database, Azure DevOps, Power BI for reporting, etc.).

Supported formats

  • Snowflake: Structured and semi-structured data (JSON, Avro, ORC, Parquet, and XML).
  • Amazon S3: Structured, semi-structured, and unstructured data.
  • Delta Lake: Structured, semi-structured, and unstructured data.

Data with updates

  • Snowflake: Updates the specific rows in a table with new values where the condition matches.
  • Amazon S3: You cannot add, remove, or modify just part of an existing S3 object; you must read the object, make the changes, and write the entire object back to S3 (see the sketch after this comparison).
  • Delta Lake: Can update specific values in the data where the condition matches.
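To make the “data with updates” difference concrete, here is a minimal Python sketch of the read-modify-write cycle S3 requires, using boto3. The bucket name, object key, and record fields are hypothetical.

```python
import json
import boto3

s3 = boto3.client("s3")

BUCKET = "example-data-lake"                 # hypothetical bucket name
KEY = "events/2023/01/01/part-0.json"        # hypothetical object key

# 1. Read the entire object -- S3 has no partial-update API.
obj = s3.get_object(Bucket=BUCKET, Key=KEY)
records = [json.loads(line) for line in obj["Body"].read().splitlines()]

# 2. Modify the records in memory (here: fix a single field).
for record in records:
    if record.get("user_id") == "u-123":
        record["country"] = "US"

# 3. Write the whole object back, replacing the previous version.
body = "\n".join(json.dumps(r) for r in records).encode("utf-8")
s3.put_object(Bucket=BUCKET, Key=KEY, Body=body)
```

By contrast, in Snowflake or Delta Lake the same change is a single UPDATE ... WHERE statement against the table.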

Snowflake offers faster analytics and a simpler service, stores diverse data across various cloud platforms, and can be scaled up as required; this makes it one of the most cost-effective solutions in the market.

Snowflake provides a single integrated service across the three major clouds. You can have data stored in Microsoft Azure, Amazon S3, or Google Cloud and still integrate all of it inside Snowflake. If we later want to move data across cloud vendors, Snowflake will still work efficiently.

Because Snowflake is built entirely on ANSI SQL, it is effortless to have a data lake with a full SQL environment. Complete resource isolation and control enable Snowflake virtual warehouses to run queries against the same objects independently, without one affecting the other. Automatic metadata management and history allow Snowflake to deliver faster analytics with built-in control and governance, keeping data flowing quickly.

Hence, with Snowflake we can extract batch or streaming data, build materialized views and external tables, and then deliver insights and business results much faster. Most importantly, it does not require manual intervention to rescale the cluster. Because compute cost and storage cost are separated, overall cost stays low, making Snowflake the top contender for data lakes in the market.
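As an illustration of the stream and materialized-view workflow described above, here is a hedged Python sketch using the snowflake-connector-python package. The connection parameters, table names, and column names are placeholders, not a reference implementation.

```python
import snowflake.connector  # pip install snowflake-connector-python

# Placeholder credentials -- replace with your own account details.
conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="my_password",
    warehouse="ANALYTICS_WH",
    database="LAKE_DB",
    schema="RAW",
)
cur = conn.cursor()

# A stream tracks inserts/updates/deletes on the raw table (continuous data integration).
cur.execute("CREATE STREAM IF NOT EXISTS raw_events_stream ON TABLE raw_events")

# A materialized view precomputes an aggregate for faster analytics.
cur.execute("""
    CREATE MATERIALIZED VIEW IF NOT EXISTS daily_event_counts AS
    SELECT event_date, COUNT(*) AS event_count
    FROM raw_events
    GROUP BY event_date
""")

# Consume the stream: load only newly inserted rows into a curated table
# (raw_events, curated_events, and their columns are hypothetical).
cur.execute("""
    INSERT INTO curated_events (event_id, event_date, payload)
    SELECT event_id, event_date, payload
    FROM raw_events_stream
    WHERE METADATA$ACTION = 'INSERT'
""")

cur.close()
conn.close()
```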

Learn more about Visual BI Solutions’ Snowflake offerings here and read more about similar Self-Service BI topics here.

This article presents reference architectures and examples of data lake implementations, from around the web as well as from Upsolver customers. For a more detailed, hands-on example of building a data lake to store, process and analyze petabytes of data, check our data lake webinar with ironSource and Amazon Web Services. 

Upsolver’s newest offering, SQLake, takes advantage of the same cloud-native processing engine used by the Upsolver customers featured here. With SQLake you can build and run reliable data pipelines on streaming and batch data via an all-SQL experience. Try it for free. No credit card required.

So you’ve decided it’s time to overhaul your data architecture. What’s next? How do you go about building a data lake that delivers the results you’re expecting?

Well, we’re strong believers in the notion that an example is worth a thousand convoluted explanations. That’s why we’ll jump right into real-life examples of companies that have built their data lakes on Amazon S3, after covering some basic principles of data lake architecture. Use this guide for inspiration, reference, or as your gateway to learn more about the different components you’ll need to become familiar with for your own initiative.

What is data lake architecture?

A data lake is an architecture pattern rather than a specific platform, built around a big data repository that uses a schema-on-read approach. In a data lake, we store large amounts of unstructured data in an object store such as Amazon S3, without structuring the data in advance, while maintaining the flexibility to perform further ETL and ELT on it in the future. This makes it ideal for businesses that need to analyze constantly changing data or very large datasets.
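Here is a minimal Python sketch of the schema-on-read idea, assuming a hypothetical bucket, Athena database, and event shape: raw JSON lands in S3 exactly as it arrives, and a schema is only declared later, at query time, via an Athena external table created through boto3.

```python
import json
import boto3

s3 = boto3.client("s3")
athena = boto3.client("athena")

BUCKET = "example-data-lake"   # hypothetical bucket
RAW_PREFIX = "raw/clicks/"     # raw zone; no schema enforced on write

# Write side: dump events exactly as they arrive (schema-on-read, not schema-on-write).
event = {"user_id": "u-1", "page": "/home", "ts": "2023-01-01T00:00:00Z"}
s3.put_object(
    Bucket=BUCKET,
    Key=RAW_PREFIX + "2023-01-01/event-0001.json",
    Body=json.dumps(event).encode("utf-8"),
)

# Read side: the schema is declared only when we want to query the data.
ddl = f"""
CREATE EXTERNAL TABLE IF NOT EXISTS raw_clicks (
    user_id string,
    page string,
    ts string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://{BUCKET}/{RAW_PREFIX}'
"""
athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={"Database": "analytics"},  # hypothetical Athena database
    ResultConfiguration={"OutputLocation": f"s3://{BUCKET}/athena-results/"},
)
```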

Data lake architecture is simply the combination of tools used to build and operationalize this type of approach to data – starting from event processing tools, through ingestion and transformation pipelines, to analytics and query tools. As we shall see in the examples below, there are many different combinations of these tools that can be used to build a data lake, based on the specific skill set and tooling available in the organization.

To learn more, check out What is a Data Lake? and Understanding Data Lakes and Data Lake Platforms.

Design Principles and Best Practices for Building a Data Lake

We’ve covered design principles and best practices in more depth elsewhere – you can check out the links to dive deeper. In this article, we’ll quickly run through the most important factors when building a data lake:

  1. Event sourcing: store all incoming events in an immutable log, which can then be used for ETL jobs and analytics use cases. This approach has many benefits, including the ability to reduce costs, validate hypotheses retroactively, and trace issues with processed data.
  2. Storage in open file formats: A data lake should store data in open formats such as Apache Parquet, retain historical data, and use a central metadata repository. This will enable ubiquitous access to the data and reduce operational costs.
  3. Optimize for performance: You’ll eventually want to put the data in your lake to use. Store your data in a way that makes it easy to query, using columnar file formats and keeping files to a manageable size. You also need to partition your data efficiently so that queries only retrieve the relevant data (see the sketch after this list).
  4. Implement data governance and access control: Tools such as AWS Lake Formation make it easier to control access to data in a data lake and address security concerns.
  5. Schema visibility: You should have the ability to understand the data as it is being ingested in terms of the schema of each data source, sparsely populated fields, and metadata properties. Gaining this visibility up front, as the data is ingested, rather than trying to infer it later will save you a lot of trouble down the line by enabling you to build ETL pipelines based on the most accurate and available data.
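To illustrate points 2 and 3 above, here is a minimal Python sketch that writes columnar Parquet partitioned by date using pandas and pyarrow; the bucket path and column names are hypothetical.

```python
import pandas as pd  # requires pyarrow; s3fs is needed for s3:// paths

# A small batch of events; in practice this comes from your ingestion pipeline.
events = pd.DataFrame(
    {
        "event_date": ["2023-01-01", "2023-01-01", "2023-01-02"],
        "user_id": ["u-1", "u-2", "u-1"],
        "action": ["view", "click", "view"],
    }
)

# Write open, columnar Parquet partitioned by date, so a query that filters on
# event_date only scans the matching directories (e.g. event_date=2023-01-01/).
events.to_parquet(
    "s3://example-data-lake/curated/events/",  # hypothetical bucket; a local path also works
    engine="pyarrow",
    partition_cols=["event_date"],
    index=False,
)
```

Keeping individual files to a manageable size (and periodically compacting small files) matters just as much for query performance as the format and partitioning choices shown here.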

Learn more about how to structure your data lake.

8 Data Lake Examples to Copy and Learn From

1. Sisense Builds a Versatile Data Lake with Minimal Engineering Overhead

As a leading global provider of business intelligence software, Sisense has data-driven decision making embedded in its DNA. One of the richest sources of data the company has to work with is product usage logs, which capture all manner of users interacting with the Sisense server, the browser, and cloud-based applications.

Over time, and with the rapid growth in Sisense’s customer base, this data had accumulated to more than 70 billion records. To effectively manage and analyze this data, the company quickly realized it would have to use a data lake architecture, and decided to build one using the AWS ecosystem. We’ve written a more detailed case study about this architecture, which you can read here.

The Data Lake

To quickly generate value for the business and avoid the complexities of a Spark/Hadoop-based project, Sisense’s CTO Guy Boyangu opted for a solution based on Upsolver, S3, and Amazon Athena.

Product logs are streamed via Amazon Kinesis and processed using Upsolver, which then writes the data to S3 as CSV and columnar Parquet files. These are used for visualization and business intelligence in Sisense’s own software. Additionally, structured tables are sent to Athena to support ad hoc analysis and data science use cases.

To learn more about Sisense’s data lake architecture, check out the case study.

2. Depop Goes From Data Swamp to Data Lake

Depop is a peer-to-peer social shopping app based in London, serving thousands of users. These users take various actions in the app – following, messaging, purchasing and selling products, and so on – creating a constant stream of events.

The Depop team documented their journey in two excellent blog posts. After an initial attempt to create replicas of the data on Redshift, they quickly realized that performance tuning and schema maintenance on Redshift would prove highly cumbersome and resource intensive. This led Depop to adopt a data lake approach using Amazon S3.

The Data Lake

Image source: Depop Engineering Blog.

The data lake at Depop consists of three different pipelines:

  • Ingest: Messages are written via RabbitMQ and dispatched via a fanout Lambda function.
  • Fanout: The Lambda function sets up the relevant AWS infrastructure based on event type and creates an AWS Kinesis stream (a sketch of this pattern follows the list).
  • Transform: The final step creates columnar Parquet files from the raw JSON data and is handled using AWS Glue ETL jobs and Crawlers. From there, data is output to Athena for analysis.
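The fanout step is essentially a Lambda that routes each message to a per-event-type Kinesis stream. Below is a hedged Python sketch of that general pattern, not Depop’s actual code; the input event shape and the stream naming convention are assumptions.

```python
import json
import boto3

kinesis = boto3.client("kinesis")

def handler(event, context):
    """Hypothetical fanout Lambda: route each message to a Kinesis stream per event type."""
    messages = event.get("messages", [])              # assumed input shape
    for message in messages:
        event_type = message["event_type"]            # e.g. "purchase", "follow"
        stream_name = f"events-{event_type}"          # hypothetical naming convention

        # Create the stream for this event type if it does not exist yet.
        # Note: a brand-new stream takes a moment to become ACTIVE; a production
        # version would wait for it or pre-create streams.
        try:
            kinesis.create_stream(StreamName=stream_name, ShardCount=1)
        except kinesis.exceptions.ResourceInUseException:
            pass  # stream already exists

        # Dispatch the message to its stream.
        kinesis.put_record(
            StreamName=stream_name,
            Data=json.dumps(message).encode("utf-8"),
            PartitionKey=message.get("user_id", "unknown"),
        )

    return {"dispatched": len(messages)}
```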

For more information about Depop’s data lake, check out their blog on Medium.

3. ironSource Streams its Way to Hypergrowth

ironSource is the leading in-app monetization and video advertising platform and includes the industry’s largest in-app video network. The company collects, stores, and prepares vast amounts of streaming data from millions of end devices.

When ironSource’s rapid data growth effectively obsoleted its previous system, it chose Upsolver to help it adopt a streaming data lake architecture, including storing raw event data on object storage.

The Data Lake

  1. Using Upsolver’s no-code self-service UI, ironSource ingests Kafka streams of up to 500K events per second, and stores the data in S3. 
  2. Upsolver automatically prepares data for consumption in Athena, including compaction, compression, partitioning, and creating and managing tables in the AWS Glue Data Catalog. 
  3. ironSource’s BI teams use Upsolver to enrich and filter data, write it to Redshift to build reporting dashboards in Tableau, and send tables to Athena for ad hoc analytic queries (see the sketch after this list).
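For ad hoc queries like those in step 3, Athena can also be driven programmatically. The sketch below submits a query through boto3 and polls for the results; the database, table, partition column, and result bucket are hypothetical and unrelated to ironSource’s actual setup.

```python
import time
import boto3

athena = boto3.client("athena")

# Submit an ad hoc query against a Glue Data Catalog table (names are hypothetical).
query_id = athena.start_query_execution(
    QueryString=(
        "SELECT app_id, COUNT(*) AS events "
        "FROM impressions WHERE dt = '2023-01-01' GROUP BY app_id"
    ),
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={"OutputLocation": "s3://example-query-results/"},
)["QueryExecutionId"]

# Poll until the query finishes, then print the result rows.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])
```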

Learn more about ironSource’s data lake architecture.

4. Bigabid Builds a High-Performance Real-time Architecture with Minimal Data Engineering

Bigabid brings performance-based advertisements to app developers, so clients only pay when new users come to their application through the ad.  Bigabid takes advantage of machine learning (ML) for predictive decision-making and goes beyond social media and search engine data sourcing to create an in-depth customer user profile.

Bigabid had to replace its daily batch processing with real-time stream processing so it could update user profiles based on users’ most recent actions and continue to develop and scale ad campaigns. Using Upsolver’s visual no-code UI, Bigabid built its streaming architecture so quickly that it saved the equivalent of 6-12 months of engineering work by four dedicated engineers.

The Data Lake

Bigabid uses Kinesis Firehose to ingest multiple data streams into its Amazon S3 data lake, then uses Upsolver for ETL – combining, cleaning, and enriching data from multiple streams to build complete user profiles in real time. The company also uses Upsolver and Athena for business intelligence (BI) reporting, which its data science team uses to improve machine learning models.

Upsolver also automatically prepares data for Athena, optimizing Athena’s storage layer (Parquet format, compaction of small files) and sharply accelerating queries.

Learn more about Bigabid’s real-time data architecture.

5. SimilarWeb Crunches Hundreds of Terabytes of Data

SimilarWeb is a leading market intelligence company that provides insights into the digital world. To provide this service at scale, the company collects massive amounts of data from various sources, which it uses to better understand the way users interact with websites.

In a recent blog post published on the AWS Big Data Blog, Yossi Wasserman from SimilarWeb details the architecture the company uses to generate insights from the hundreds of terabytes of anonymous data it collects from mobile sources.

The Data Lake

Image source: AWS blog

The SimilarWeb solution utilizes S3 as its events storage layer, Amazon Athena for SQL querying, and Upsolver for data preparation and ETL. In his article, Wasserman details the way data is sent from Kafka to S3, reduced to include only the relevant fields needed for analysis, and then sent as structured tables to Athena for querying and analysis.

Read more about SimilarWeb’s data lake solution on the AWS blog.

6. An Event-driven, Serverless Architecture at Natural Intelligence

Natural Intelligence runs comparison sites across many different verticals. As Denise Schlesinger details on her blog, the company was looking to accomplish two goals with its data:

  1. Query raw, unstructured data for real-time analytics, alerts and machine learning
  2. Store structured and optimized data for BI and analytics

To effectively work with unstructured data, Natural Intelligence decided to adopt a data lake architecture based on AWS Kinesis Firehose, AWS Lambda, and a distributed SQL engine.

The Data Lake

Image source: Denise Schlesinger on Medium

S3 is used as the data lake storage layer into which raw data is streamed via Kinesis. AWS Lambda functions are written in Python to process the data, which is then queried via a distributed engine and finally visualized using Tableau.
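As a rough illustration, here is a hedged sketch of a Python Lambda used as a Kinesis Data Firehose transformation function; the event and return shapes follow the standard Firehose transformation contract, but the payload fields are hypothetical and this is not Natural Intelligence’s actual code.

```python
import base64
import json

def lambda_handler(event, context):
    """Hypothetical Firehose transformation Lambda: clean raw events before they land in S3."""
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))

        # Example transformation: keep only the fields needed downstream.
        cleaned = {
            "user_id": payload.get("user_id"),
            "vertical": payload.get("vertical"),
            "ts": payload.get("timestamp"),
        }

        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode((json.dumps(cleaned) + "\n").encode("utf-8")).decode("utf-8"),
        })

    return {"records": output}
```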

For more details about this architecture, check out Denise’s blog on Medium.

7. Peer39 Contextualizes Billions of Pages for Targeting and Analytics

Peer39 is an innovative leader in the ad and digital marketing industry. Each day, Peer39 analyzes more than 450 million unique web pages holistically to contextualize the true meaning of the page text and topics. This helps advertisers optimize their spend and place ads in the right place at the right time.

The Data Lake

Upsolver automates and orchestrates Peer39’s data workflow, incorporating built-in data lake best practices including: 

  • partitioning
  • queuing
  • guaranteed message delivery
  • exactly-once processing
  • optimization of Athena’s storage layer (Parquet format, compaction of small files) so queries run much faster.

Peer39’s updated streaming data architecture unified teams across data science, analytics, data engineering, and traditional database administration, enabling the company to speed go-to-market with existing staff.

Learn more about Peer39’s streaming data architecture.

8. Browsi – Managing ETL Pipelines for 4 Billion Events with a Single Engineer

Browsi provides an AI-powered adtech solution that helps publishers monetize content by offering advertising inventory-as-a-service. Browsi automatically optimizes ad placements and layout to ensure relevant ad content.

Browsi needed to move from batch processing to stream processing, and move off of manually-coded data solutions.

The Data Lake

Browsi implemented Upsolver to replace both the Lambda architecture used for ingest and the Spark/EMR implementation used to process data.

Events generated by scripts on publisher websites are streamed via Amazon Kinesis Streams. Upsolver ingests the data from Kinesis and writes it to S3 while enforcing partitioning, exactly-once processing, and other data lake best practices.

From there, Browsi outputs ETL flows to Amazon Athena, which it uses for data science as well as BI reporting via Domo. End-to-end latency from Kinesis to Athena is now mere minutes.  Meanwhile, a homegrown solution creates internal reports from Upsolver’s daily aggregations of data.

And it’s all managed by a single data engineer.

Learn more about Browsi’s streaming ETL pipelines.

Ready to build your own data lake?

Upsolver is the fastest and easiest way to get your S3 data lake from 0 to 1. Schedule a demo to learn how you can go from streams to analytics in a matter of minutes. For additional data infrastructure best practices, check out some of these data pipeline architecture diagrams.

Try SQLake for free for 30 days. SQLake is Upsolver’s newest offering. It lets you build and run reliable data pipelines on streaming and batch data via an all-SQL experience. Try it for free. No credit card required.

Can S3 be a data lake?

Amazon S3 is the best place to build data lakes because of its unmatched durability, availability, scalability, security, compliance, and audit capabilities. With AWS Lake Formation, you can build secure data lakes in days instead of months.
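As a small illustration of that access control, here is a hedged boto3 sketch that grants a role read-only access to one catalog table via Lake Formation; the role ARN, database, and table names are hypothetical.

```python
import boto3

lf = boto3.client("lakeformation")

# Grant an analyst role read-only access to a single table in the data lake catalog.
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/analyst"},  # hypothetical role
    Resource={"Table": {"DatabaseName": "analytics", "Name": "events"}},                  # hypothetical table
    Permissions=["SELECT"],
)
```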

Is S3 a data warehouse?

Data lakes often coexist with data warehouses, where data warehouses are often built on top of data lakes. In terms of AWS, the most common implementation of this is using S3 as the data lake and Redshift as the data warehouse.

What is the purpose of a data lake?

A data lake is a centralized repository designed to store, process, and secure large amounts of structured, semi-structured, and unstructured data. It can store data in its native format and process any variety of it, regardless of size.

What is a data pool vs data lake?

A data pool is an independent, isolated micro-data lake. A data lake includes at least one, but ideally many, data pools that belong to the same organization and are managed independently (they can even run on different cloud vendors!).