November 20, 2022 · 108 mins

We take a peek into some of the challenges Twitter has faced while solving data problems at large scale, while Michael challenges the audience, Joe speaks from experience, and Allen blindsides them both.

The full show notes for this episode are available at https://www.codingblocks.net/episode198.

News

  • Want to help us out? Leave us a review!
  • The 2023 Game Ja-Ja-Ja Jam is coming up!

Twitter has a Data Problem

Moving an Exabyte of Data

  • In 2019, over 100 million people visited Twitter every day.
  • Every tweet and user action creates an event that is used by machine learning and employees for analytics.
  • Their goal was to democratize data analysis within Twitter to allow people with various skillsets to analyze and/or visualize the data.
  • At the time, various technologies were used for data analysis:
    • Scalding, which required programmer knowledge, and
    • Presto and Vertica, which had performance issues at scale (see the sketch after this list).
  • Another problem was having data spread across multiple systems without a simple way to access it.
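
Presto's appeal was plain SQL access to data sitting in Hadoop. As a rough illustration of that access pattern, here is a minimal sketch using the presto-python-client; the coordinator host, catalog, and table names are hypothetical placeholders, not Twitter's actual setup.

```python
# Minimal sketch: querying Hadoop-resident data through Presto's SQL
# interface. Host, catalog, and table names are hypothetical.
import prestodb

conn = prestodb.dbapi.connect(
    host="presto-coordinator.example.com",  # hypothetical coordinator host
    port=8080,
    user="analyst",
    catalog="hive",     # Hive catalog backed by HDFS
    schema="default",
)

cur = conn.cursor()
cur.execute("""
    SELECT event_type, count(*) AS events
    FROM tweet_events          -- hypothetical table
    GROUP BY event_type
    ORDER BY events DESC
    LIMIT 10
""")
for event_type, events in cur.fetchall():
    print(event_type, events)
```

The flip side, per the episode, is that interactive SQL like this ran into performance problems at Twitter's scale.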

Moving pieces to Google Cloud Platform

  • The Google Cloud big data tools at play:
    • BigQuery, a cost-effective, serverless, multicloud enterprise data warehouse, and
    • Data Studio, which unifies data in one place, with the ability to explore, visualize, and tell stories with the data.
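
To make the BigQuery side concrete, here is a minimal sketch of running an interactive query with the google-cloud-bigquery Python client; the project, dataset, and table names are hypothetical placeholders, not Twitter's schema.

```python
# Minimal sketch: an interactive SQL query against BigQuery using the
# google-cloud-bigquery client. All names below are hypothetical.
from google.cloud import bigquery

client = bigquery.Client(project="my-analytics-project")  # hypothetical project

query = """
    SELECT event_type, COUNT(*) AS events
    FROM `my-analytics-project.tweet_events.events_daily`  -- hypothetical table
    WHERE event_date = '2019-06-01'
    GROUP BY event_type
    ORDER BY events DESC
"""

for row in client.query(query).result():
    print(row.event_type, row.events)
```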

History of Data Warehousing at Twitter

  • 2011 – Data analysis was done with Vertica and Hadoop, and data was ingested using Pig, which compiles down to MapReduce jobs.
  • 2012 – Replaced Pig with Scalding, whose Scala APIs were geared toward creating complex pipelines that were easy to test. However, it was difficult for people with SQL skills to pick up.
  • 2016 – Started using Presto to access Hadoop data via SQL, and Spark for ad hoc data science and machine learning (see the PySpark sketch after this list).
  • 2018 …
    • Scalding for production pipelines,
    • Scalding and Spark for ad hoc data science and machine learning,
    • Vertica and Presto for ad hoc, interactive SQL analysis,
    • Druid for interactive, exploratory access to time-series metrics, and
    • Tableau, Zeppelin, and Pivot for data visualization.
  • So why the change? To simplify analytical tools for Twitter employees.
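
Spark shows up twice above for ad hoc data science. Here is a minimal PySpark sketch of that kind of exploratory work; the path and column names are hypothetical.

```python
# Minimal PySpark sketch of an ad hoc analysis over event data.
# The HDFS path and column names are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("adhoc-tweet-analysis").getOrCreate()

# Load a day's worth of events from HDFS (hypothetical path and schema).
events = spark.read.parquet("hdfs:///data/tweet_events/2018-01-01")

# Ad hoc question: which client apps produce the most events?
(events
    .groupBy("client_app")
    .agg(F.count("*").alias("events"))
    .orderBy(F.desc("events"))
    .show(10))
```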

BigQuery for Everyone

  • Challenges:
    • Needed to develop an infrastructure to reliably ingest large amounts of data,
    • Support company-wide data management,
    • Implement access controls,
    • Ensure customer privacy, and
    • Build systems for:
      • Resource allocation,
      • Monitoring, and
      • Charge-back.
  • In 2018, they rolled out an alpha release.
    • The most frequently used tables were offered with personal data removed.
    • Over 250 users from engineering, finance, and marketing used the alpha.
    • Sometime around June of 2019, they had a month where 8,000 queries were run that processed over 100 petabytes of data, not including scheduled reports.
    • The alpha turned out to be a large success, so they moved forward with expanding their use of BigQuery.
  • They have a nice diagram giving an overview of the process at this point: data was pushed from on-premise Hadoop clusters into GCS, Airflow then moved it into BigQuery, and Data Studio pulled its data from BigQuery (a minimal Airflow sketch follows below).
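
As a rough sketch of the Airflow leg of that diagram, here is what a daily GCS-to-BigQuery load might look like using Airflow's Google provider operators. The bucket, dataset, and table names are hypothetical, and the Hadoop-to-GCS copy is assumed to have already landed files in the bucket.

```python
# Minimal sketch of the GCS -> BigQuery leg of the pipeline using
# Airflow's Google providers. All resource names are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import (
    GCSToBigQueryOperator,
)

with DAG(
    dag_id="tweet_events_to_bigquery",
    start_date=datetime(2019, 6, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Load the run date's files from GCS into a BigQuery table.
    load_events = GCSToBigQueryOperator(
        task_id="load_events",
        bucket="example-tweet-events",              # hypothetical GCS bucket
        source_objects=["events/{{ ds }}/*.avro"],  # partitioned by run date
        source_format="AVRO",
        destination_project_dataset_table="my-project.tweet_events.events",  # hypothetical
        write_disposition="WRITE_TRUNCATE",
    )
```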

Ease of Use

  • BigQuery was easy to use because it didn't require installing special tools; it could be navigated via a simple web UI.
    • Users did need to become familiar with …