High Performance Data Processing in Python
Donald is a senior software engineer at Engineers Gate, a New York-based quantitative hedge fund. There, he builds large-scale data pipelines and has processed over two dozen datasets. An avid Python and Rust developer and data enthusiast, Donald has given many talks about these languages across the world.
Previously, he organised hackathons in several countries and worked at Bloomberg L.P., where he built core, high-performance database infrastructure that's still used across the firm globally.
The Internet age generates vast amounts of data. Most of this data is unstructured and needs to be post-processed in some way. Python has become the standard tool for transforming this data into more usable forms.
numpy and numba are popular Python libraries for processing large quantities of data. When running complex transformations on large datasets, many developers fall into common pitfalls that kill the performance of these libraries.
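To make the kind of pitfall concrete, here is a minimal sketch of our own (an illustration, not code from the talk): looping over a numpy array element by element pays Python interpreter overhead on every iteration, while the equivalent vectorised expression pushes the loop into numpy's compiled C code.

    import numpy as np

    data = np.random.rand(1_000_000)

    # Pitfall: an element-by-element Python loop over a numpy array.
    # Every iteration goes through the interpreter.
    def slow_scale(arr, factor):
        out = np.empty_like(arr)
        for i in range(len(arr)):
            out[i] = arr[i] * factor
        return out

    # Vectorised: the same computation as a single array expression;
    # the loop runs inside numpy's compiled code instead.
    def fast_scale(arr, factor):
        return arr * factor

    assert np.allclose(slow_scale(data, 2.0), fast_scale(data, 2.0))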
This talk explains how numpy/numba work under the hood and how they use vectorisation to process large amounts of data extremely quickly. We use these tools to reduce the processing time of a large 600 GB dataset from one month to less than an hour.
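Where a transformation cannot be expressed as pure array arithmetic, numba offers another route: its @njit decorator JIT-compiles an ordinary Python loop to machine code. The rolling-mean function below is a hedged sketch of ours, not the talk's actual pipeline; it assumes only numba's documented njit API.

    import numpy as np
    from numba import njit

    @njit  # compiles this function to machine code on first call
    def rolling_mean(arr, window):
        # Illustrative example, not from the talk: a sliding-window mean
        # kept incrementally so each step is O(1).
        out = np.empty(arr.size - window + 1)
        acc = arr[:window].sum()
        out[0] = acc / window
        for i in range(1, out.size):
            # Under @njit this explicit loop becomes tight native code,
            # running at speeds a pure-Python loop cannot approach.
            acc += arr[i + window - 1] - arr[i - 1]
            out[i] = acc / window
        return out

    values = np.random.rand(10_000_000)
    means = rolling_mean(values, 50)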