November 20, 2022 · 108 mins

We take a peek into some of the challenges Twitter has faced while solving data problems at large scale, while Michael challenges the audience, Joe speaks from experience, and Allen blindsides them both.

The full show notes for this episode are available at https://www.codingblocks.net/episode198.

News

  • Want to help us out? Leave us a review!
  • The 2023 Game Ja-Ja-Ja Jam is coming up!

Twitter has a Data Problem

Moving an Exabyte of Data

  • In 2019, over 100 million people visited Twitter every day.
  • Every tweet and user action creates an event that is used by machine learning and employees for analytics.
  • Their goal was to democratize data analysis within Twitter to allow people with various skillsets to analyze and/or visualize the data.
  • At the time, various technologies were used for data analysis:
    • Scalding, which required programmer knowledge, and
    • Presto and Vertica, which had performance issues at scale (see the sketch after this list).
  • Another problem was having data spread across multiple systems without a simple way to access it.
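
Presto's appeal was plain SQL access to data sitting in Hadoop. As a rough illustration of that access pattern, here is a minimal sketch using the presto-python-client; the coordinator host, catalog, and table names are hypothetical placeholders, not Twitter's actual setup.

```python
# Minimal sketch: querying Hadoop-resident data through Presto's SQL
# interface. Host, catalog, and table names are hypothetical.
import prestodb

conn = prestodb.dbapi.connect(
    host="presto-coordinator.example.com",  # hypothetical coordinator host
    port=8080,
    user="analyst",
    catalog="hive",     # Hive catalog backed by HDFS
    schema="default",
)

cur = conn.cursor()
cur.execute("""
    SELECT event_type, count(*) AS events
    FROM tweet_events          -- hypothetical table
    GROUP BY event_type
    ORDER BY events DESC
    LIMIT 10
""")
for event_type, events in cur.fetchall():
    print(event_type, events)
```

The flip side, per the episode, is that interactive SQL like this ran into performance problems at Twitter's scale.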

Moving pieces to Google Cloud Platform

  • The Google Cloud big data tools at play:
    • BigQuery, a cost-effective, serverless, multicloud enterprise data warehouse, and
    • Data Studio, which unifies data in one place, with the ability to explore, visualize, and tell stories with the data.
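
To make the BigQuery side concrete, here is a minimal sketch of running an interactive query with the google-cloud-bigquery Python client; the project, dataset, and table names are hypothetical placeholders, not Twitter's schema.

```python
# Minimal sketch: an interactive SQL query against BigQuery using the
# google-cloud-bigquery client. All names below are hypothetical.
from google.cloud import bigquery

client = bigquery.Client(project="my-analytics-project")  # hypothetical project

query = """
    SELECT event_type, COUNT(*) AS events
    FROM `my-analytics-project.tweet_events.events_daily`  -- hypothetical table
    WHERE event_date = '2019-06-01'
    GROUP BY event_type
    ORDER BY events DESC
"""

for row in client.query(query).result():
    print(row.event_type, row.events)
```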

History of Data Warehousing at Twitter

  • 2011 – Data analysis was done with Vertica and Hadoop, and data was ingested using Pig, which compiles down to MapReduce jobs.
  • 2012 – Replaced Pig with Scalding, whose Scala APIs were geared toward creating complex pipelines that were easy to test. However, it was difficult for people with SQL skills to pick up.
  • 2016 – Started using Presto to access Hadoop data via SQL, and Spark for ad hoc data science and machine learning (see the PySpark sketch after this list).
  • 2018 …
    • Scalding for production pipelines,
    • Scalding and Spark for ad hoc data science and machine learning,
    • Vertica and Presto for ad hoc, interactive SQL analysis,
    • Druid for interactive, exploratory access to time-series metrics, and
    • Tableau, Zeppelin, and Pivot for data visualization.
  • So why the change? To simplify analytical tools for Twitter employees.
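
Spark shows up twice above for ad hoc data science. Here is a minimal PySpark sketch of that kind of exploratory work; the path and column names are hypothetical.

```python
# Minimal PySpark sketch of an ad hoc analysis over event data.
# The HDFS path and column names are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("adhoc-tweet-analysis").getOrCreate()

# Load a day's worth of events from HDFS (hypothetical path and schema).
events = spark.read.parquet("hdfs:///data/tweet_events/2018-01-01")

# Ad hoc question: which client apps produce the most events?
(events
    .groupBy("client_app")
    .agg(F.count("*").alias("events"))
    .orderBy(F.desc("events"))
    .show(10))
```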

BigQuery for Everyone

  • Challenges:
    • Needed to develop an infrastructure to reliably ingest large amounts of data,
    • Support company-wide data management,
    • Implement access controls,
    • Ensure customer privacy, and
    • Build systems for:
      • Resource allocation,
      • Monitoring, and
      • Charge-back.
  • In 2018, they rolled out an alpha release.
    • The most frequently used tables were offered with personal data removed.
    • Over 250 users from engineering, finance, and marketing used the alpha.
    • Sometime around June of 2019, they had a month where 8,000 queries were run that processed over 100 petabytes of data, not including scheduled reports.
    • The alpha turned out to be a large success, so they moved forward with expanding their use of BigQuery.
  • They have a nice diagram giving an overview of the process at this point: data was pushed from on-premise Hadoop clusters into GCS, Airflow then moved it into BigQuery, and Data Studio pulled its data from BigQuery (a minimal Airflow sketch follows below).
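
As a rough sketch of the Airflow leg of that diagram, here is what a daily GCS-to-BigQuery load might look like using Airflow's Google provider operators. The bucket, dataset, and table names are hypothetical, and the Hadoop-to-GCS copy is assumed to have already landed files in the bucket.

```python
# Minimal sketch of the GCS -> BigQuery leg of the pipeline using
# Airflow's Google providers. All resource names are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import (
    GCSToBigQueryOperator,
)

with DAG(
    dag_id="tweet_events_to_bigquery",
    start_date=datetime(2019, 6, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Load the run date's files from GCS into a BigQuery table.
    load_events = GCSToBigQueryOperator(
        task_id="load_events",
        bucket="example-tweet-events",              # hypothetical GCS bucket
        source_objects=["events/{{ ds }}/*.avro"],  # partitioned by run date
        source_format="AVRO",
        destination_project_dataset_table="my-project.tweet_events.events",  # hypothetical
        write_disposition="WRITE_TRUNCATE",
    )
```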

Ease of Use

  • BigQuery was easy to use because it didn't require installing special tools; it could be navigated via a simple web UI.
    • Users did need to become familiar with …