High Performance Data Processing in Python
Networking, backend and web development

Talk accepted into the conference programme
Donald Whyte
Engineers Gate

Donald is a senior software engineer at Engineers Gate, a New York-based quantitative hedge fund. There, he builds large-scale data pipelines and has processed over two dozen datasets. An avid Python and Rust developer and data enthusiast, Donald has given many talks about these languages around the world.

Previously, he organised hackathons in several countries and worked at Bloomberg L.P. where he built core, high performance database infrastructure that's still used across the firm globally.

Abstract

The Internet age generates vast amounts of data. Most of this data is unstructured and needs to be post-processed in some way. Python has become the standard tool for transforming this data into more usable forms.

numpy and numba are popular Python libraries for processing large quantities of data. When running complex transformations on large datasets, many developers fall into common pitfalls that kill the performance of these libraries.

This talk explains how numpy/numba work under the hood and how they use vectorisation to process large amounts of data extremely quickly. We use these tools to reduce the processing time of a large 600GB dataset from one month to less than an hour.
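As a rough illustration of what vectorisation means here, the sketch below compares a pure-Python loop with an equivalent numpy expression. The function names and the toy workload (scale every element, then sum) are hypothetical, chosen only for illustration; they are not taken from the talk.

```python
import numpy as np


def scale_and_sum_loop(values, factor):
    """Pure-Python loop: one interpreted iteration per element."""
    total = 0.0
    for v in values:
        total += v * factor
    return total


def scale_and_sum_vectorised(values, factor):
    """Vectorised: the multiply and the sum each run inside numpy's compiled loops."""
    return float((values * factor).sum())


# Hypothetical toy input; a real dataset would be far larger.
values = np.arange(100_000, dtype=np.float64)

loop_result = scale_and_sum_loop(values, 2.0)
vec_result = scale_and_sum_vectorised(values, 2.0)
```

Both functions compute the same value, but the vectorised version hands the whole array to compiled code in two calls instead of executing one interpreted step per element. On large arrays this typically makes it much faster, although the exact speed-up depends on the workload and hardware.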

Python
