Mars: A tensor-based unified framework for large-scale data computation
AI/ML and data visualization
I am a senior software engineer at Alibaba Group and a Python enthusiast, working on combining big data with the Python language.
Currently, as the architect and core developer, I am leading the open-source project Mars, a tensor-based unified framework for large-scale data processing. Mars extends NumPy with parallel and distributed computing; in the long term, it aims to provide distributed counterparts of the SciPy stack that are not limited by the capacity of a single machine. I also worked on a project named PyODPS, which lets users write pandas-like DataFrame code that is compiled into SQL for big data platforms.
When I was a student, I developed a distributed crawling framework named cola.
Mars is a tensor-based unified framework for large-scale data computation. GitHub: https://github.com/mars-project/mars.
Mars tensor provides a NumPy-compatible interface, so users can handle extremely large tensors/ndarrays by simply replacing an import. We extend the NumPy interface to support creating tensors/ndarrays on a GPU by specifying gpu=True on all implemented array creation functions, and to create sparse matrices by passing sparse=True to some array creation functions such as zeros and eye.
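The "import replacement" idea can be sketched as follows. This is a minimal illustration, not code from the Mars project: the helper names `row_sums` and `run_with_mars` are hypothetical, and the sketch assumes that `mars.tensor` exposes `ones` and `sum` with NumPy-like signatures and that Mars tensors are evaluated lazily through an `execute()` method. Only the NumPy path is run here; the Mars path requires Mars to be installed.

```python
import numpy as np


def row_sums(xp, shape=(4, 3)):
    """Compute row sums of (ones + 1) using the array module `xp`.

    `xp` can be numpy or, assuming a NumPy-compatible interface,
    mars.tensor. Only the import differs between the two.
    """
    a = xp.ones(shape)
    result = (a + 1).sum(axis=1)
    # Assumption: Mars tensors are lazy and expose execute() to trigger
    # the computation; plain NumPy arrays have no such method.
    if hasattr(result, "execute"):
        return result.execute()
    return result


def run_with_mars():
    """Hypothetical Mars variant; not called here since Mars may be absent."""
    import mars.tensor as mt  # the only change compared with the NumPy path
    # Per the abstract, creation functions would also accept gpu=True,
    # and some (such as zeros and eye) sparse=True.
    return row_sums(mt)


print(row_sums(np))
```

With Mars installed, calling `run_with_mars()` would be expected to build the same expression on Mars tensors; the exact return type of `execute()` may differ between Mars versions.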
Mars can scale in to a laptop and scale out to a cluster with thousands of machines. Because the local and distributed versions share the same code, migrating from a single machine to a cluster as data grows is fairly simple. Mars is evolving quickly toward production readiness.
Mars is completely open source and builds on great projects from the Python community, such as NumPy, CuPy, numexpr, and pyarrow. In the long term, Mars aims to provide a distributed counterpart of the SciPy stack that is not limited by the capacity of a single machine.
This talk will focus on why we started the Mars project and what we have done to keep the API simple while delivering performance on terabyte-scale tensor/ndarray computation.