Distributed data workflows: PySpark vs Dask Offline 2021

Talk rejected

Abstract

Apache Spark is a popular distributed computing tool for tabular datasets and is becoming a dominant name in Big Data analysis. Dask has several components that overlap with this space, and we are often asked, “How does Dask compare with Spark?”

This talk attempts to answer that question by presenting real-life use cases solved with both tools and walking through numerous benchmarks run on petabyte-scale data.

Vaibhav Srivastav

Deloitte Consulting LLP

Vaibhav is a Data Scientist working with Deloitte Consulting LLP. He works with Fortune Technology 10 clients to help them make data-driven (profitable) decisions. In his spare time he serves as a Subject Matter Expert on Google Cloud Platform, helping to build scalable, resilient and fault-tolerant cloud workflows.

Prior to this, he worked with startups across India to build Social Media Analytics Dashboards, Chatbots, Recommendation Engines and Forecasting Models.

His core interests lie in Natural Language Processing, Machine Learning/Statistics and Cloud-based Product development.

If you have ideas around Data Science or Google Cloud, or are interested in collaborating, do drop him a note on Twitter (@reach_vb) or at Vaibhavs10@gmail.com.