This is a work in progress — mileage will vary
I was working on rebuilding a tech stack. A huge part of it was unwinding a data store from NoSQL to a relational structure. The product I took over was feature rich, but neither scalable nor particularly functional.
My work was definitely cut out for me. As you can imagine, the process was to break it down into phases: prioritize, migrate, iterate. For each iteration, we had to decide whether it still made business sense, keep the product live throughout, and be able to stop at any point if budget or other issues came up (and they did).
This is a classic strangler process, and anyone who has ever done one will tell you it is a hell of a lot harder to pull off than you can imagine.
Python was part of the new toolkit I brought in house to make data management viable. Towards the end, however, one thing became apparent: large projects and ever-changing tech stacks are not Python friendly.
What do I mean? Moving from MongoDB to MySQL + Elasticsearch, from Lambda to Docker/Celery, building a SaaS BI solution and a recommendation engine, moving user RBAC out of authentication storage, and building a basic admin tool to manage it all meant different tech was used at different times.
Each iteration went through the same set of questions: decisions you are making day in, day out.
Something I noticed as the project came to an end, though: Phase 1 dependencies were still present in the environment.
Code cleanup does not mean environment cleanup.
Not exactly a house-on-fire issue, but as you refactor, modularize for reuse, change runtime environments, change repo layout, and clean up code, you suddenly get hit with a tidal wave of incompatibilities.
This is a weakness of Python: requirements.txt is generated from the environment, not from the code that actually imports things. One reason is the decision to allow executable code as part of Python's install/dependency process, e.g. setup.py.
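To make that mismatch concrete, here is a minimal sketch (my own illustration, not depend-py) that compares what the current interpreter's environment reports as installed against what a project's source actually imports. The helper names are made up for the example, and it only uses the standard library.

from __future__ import annotations

# Minimal sketch: environment contents vs. imports found in source.
import ast
import sys
from importlib import metadata
from pathlib import Path

def imported_modules(project_path: str) -> set[str]:
    """Top-level module names imported anywhere in the project source."""
    root = Path(project_path)
    names: set[str] = set()
    for py_file in root.rglob("*.py"):
        rel = py_file.relative_to(root).parts
        if any(p.startswith(".") or p == "site-packages" for p in rel):
            continue  # skip virtualenvs and hidden directories
        try:
            tree = ast.parse(py_file.read_text(encoding="utf-8"))
        except SyntaxError:
            continue
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                names.update(a.name.split(".")[0] for a in node.names)
            elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
                names.add(node.module.split(".")[0])
    return names

def installed_distributions() -> set[str]:
    """Everything installed in the environment this script runs under."""
    return {dist.metadata["Name"] for dist in metadata.distributions()}

if __name__ == "__main__":
    project = sys.argv[1] if len(sys.argv) > 1 else "."
    used = imported_modules(project)
    installed = installed_distributions()
    # Import names and distribution names don't always match
    # (dateutil vs python-dateutil), which is part of why this is hard.
    print("imported in source:", sorted(used))
    print("installed:", sorted(installed, key=str.lower))

The gap between the two lists is exactly the kind of Phase 1 leftover that a freeze of the environment will happily carry forward.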
Dependency Hell
pip, Conda, virtualenv, Poetry, etc. each solve a different piece of the puzzle, but none of them solves it fully; dependency management is hard.
The problems I kept running into all came back to the same thing: the environment and the code drifting out of sync.
I recently came up with an idea to look at solving this problem; for the moment it's called depend-py (expect it to be renamed).
The objective with this was two-fold: work out what the code actually depends on, and work out where each package in the environment came from.
Warning
This is very early in the project; I only pushed it to GitHub the other day after a Christmas break.
Starting to solve the problem
depend-py can be installed directly from git and run with Python; it has no additional dependencies.
Here is the output of running depend-py against another Python project:
$ python depend.py --path ../reconciliation/
{'active': {'project_pkgs': {'reconciliation': [('reconciliation', '0.3', ['Flask', 'Flask-Jsonpify', 'marshmallow'])]},
'vendor_pkgs': {'flask': [('Flask', '2.0.2', ['Werkzeug', 'Jinja2', 'itsdangerous', 'click', 'asgiref', 'python-dotenv'])],
'flask_jsonpify': [('Flask-Jsonpify', '1.5.0', ['Flask'])],
'marshmallow': [('marshmallow', '3.14.1', ['pytest', 'pytz', 'simplejson', 'mypy', 'flake8', 'flake8-bugbear', 'pre-commit', 'tox', 'sphinx', 'sphinx-issues', 'alabaster', 'sphinx-version-warning', 'autodocsumm', 'mypy', 'flake8', 'flake8-bugbear', 'pre-commit', 'pytest', 'pytz', 'simplejson'])],
'pandas': [('pandas', '1.3.4', ['python-dateutil', 'pytz', 'numpy', 'numpy', 'numpy', 'numpy', 'hypothesis', 'pytest', 'pytest-xdist'])],
'setuptools': [('setuptools', '58.3.0', ['sphinx', 'jaraco.packaging', 'rst.linker', 'jaraco.tidelift', 'pygments-github-lexers', 'sphinx-inline-tabs', 'sphinxcontrib-towncrier', 'furo', 'pytest', 'pytest-checkdocs', 'pytest-flake8', 'pytest-cov', 'pytest-enabler', 'mock', 'flake8-2020', 'virtualenv', 'pytest-virtualenv', 'wheel', 'paver', 'pip', 'jaraco.envs', 'pytest-xdist', 'sphinx', 'jaraco.path', 'pytest-black', 'pytest-mypy'])]}},
'installed': {'_distutils_hack': [('setuptools', '58.3.0', ['sphinx', 'jaraco.packaging', 'rst.linker', 'jaraco.tidelift', 'pygments-github-lexers', 'sphinx-inline-tabs', 'sphinxcontrib-towncrier', 'furo', 'pytest', 'pytest-checkdocs', 'pytest-flake8', 'pytest-cov', 'pytest-enabler', 'mock', 'flake8-2020', 'virtualenv', 'pytest-virtualenv', 'wheel', 'paver', 'pip', 'jaraco.envs', 'pytest-xdist', 'sphinx', 'jaraco.path', 'pytest-black', 'pytest-mypy'])],
'autopep8': [('autopep8', '1.6.0', ['pycodestyle', 'toml'])],
'bleach': [('bleach', '4.1.0', ['packaging', 'six', 'webencodings'])],
'certifi': [('certifi', '2021.10.8', [])],
'charset_normalizer': [('charset-normalizer', '2.0.9', ['unicodedata2'])],
'click': [('click', '8.0.3', ['colorama', 'importlib-metadata'])],
'colorama': [('colorama', '0.4.4', [])],
'dateutil': [('python-dateutil', '2.8.2', ['six'])],
'docutils': [('docutils', '0.18.1', [])],
'et_xmlfile': [('et-xmlfile', '1.1.0', [])],
'flask': [('Flask', '2.0.2', ['Werkzeug', 'Jinja2', 'itsdangerous', 'click', 'asgiref', 'python-dotenv'])],
'flask_jsonpify': [('Flask-Jsonpify', '1.5.0', ['Flask'])],
'idna': [('idna', '3.3', [])],
'importlib_metadata': [('importlib-metadata', '4.8.2', ['zipp', 'typing-extensions', 'sphinx', 'jaraco.packaging', 'rst.linker', 'ipython', 'pytest', 'pytest-checkdocs', 'pytest-flake8', 'pytest-cov', 'pytest-enabler', 'packaging', 'pep517', 'pyfakefs', 'flufl.flake8', 'pytest-perf', 'pytest-black', 'pytest-mypy', 'importlib-resources'])],
'itsdangerous': [('itsdangerous', '2.0.1', [])],
'jinja2': [('Jinja2', '3.0.3', ['MarkupSafe', 'Babel'])],
'keyring': [('keyring', '23.4.0', ['importlib-metadata', 'SecretStorage', 'jeepney', 'pywin32-ctypes', 'sphinx', 'jaraco.packaging', 'rst.linker', 'jaraco.tidelift', 'pytest', 'pytest-checkdocs', 'pytest-flake8', 'pytest-cov', 'pytest-enabler', 'pytest-black', 'pytest-mypy'])],
'markupsafe': [('MarkupSafe', '2.0.1', [])],
'marshmallow': [('marshmallow', '3.14.1', ['pytest', 'pytz', 'simplejson', 'mypy', 'flake8', 'flake8-bugbear', 'pre-commit', 'tox', 'sphinx', 'sphinx-issues', 'alabaster', 'sphinx-version-warning', 'autodocsumm', 'mypy', 'flake8', 'flake8-bugbear', 'pre-commit', 'pytest', 'pytz', 'simplejson'])],
'numpy': [('numpy', '1.21.4', [])],
'openpyxl': [('openpyxl', '3.0.9', ['et-xmlfile'])],
'packaging': [('packaging', '21.3', ['pyparsing'])],
'pandas': [('pandas', '1.3.4', ['python-dateutil', 'pytz', 'numpy', 'numpy', 'numpy', 'numpy', 'hypothesis', 'pytest', 'pytest-xdist'])],
'pip': [('pip', '21.3.1', [])],
'pkg_resources': [('setuptools', '58.3.0', ['sphinx', 'jaraco.packaging', 'rst.linker', 'jaraco.tidelift', 'pygments-github-lexers', 'sphinx-inline-tabs', 'sphinxcontrib-towncrier', 'furo', 'pytest', 'pytest-checkdocs', 'pytest-flake8', 'pytest-cov', 'pytest-enabler', 'mock', 'flake8-2020', 'virtualenv', 'pytest-virtualenv', 'wheel', 'paver', 'pip', 'jaraco.envs', 'pytest-xdist', 'sphinx', 'jaraco.path', 'pytest-black', 'pytest-mypy'])],
'pkginfo': [('pkginfo', '1.8.2', ['coverage', 'nose'])],
'pycodestyle': [('pycodestyle', '2.8.0', [])],
'pygments': [('Pygments', '2.10.0', [])],
'pyparsing': [('pyparsing', '3.0.6', ['jinja2', 'railroad-diagrams'])],
'pytz': [('pytz', '2021.3', [])],
'readme_renderer': [('readme-renderer', '30.0', ['bleach', 'docutils', 'Pygments', 'cmarkgfm'])],
'requests': [('requests', '2.26.0', ['urllib3', 'certifi', 'chardet', 'idna', 'charset-normalizer', 'idna', 'PySocks', 'win-inet-pton', 'chardet'])],
'requests_toolbelt': [('requests-toolbelt', '0.9.1', ['requests'])],
'rfc3986': [('rfc3986', '1.5.0', ['idna'])],
'setuptools': [('setuptools', '58.3.0', ['sphinx', 'jaraco.packaging', 'rst.linker', 'jaraco.tidelift', 'pygments-github-lexers', 'sphinx-inline-tabs', 'sphinxcontrib-towncrier', 'furo', 'pytest', 'pytest-checkdocs', 'pytest-flake8', 'pytest-cov', 'pytest-enabler', 'mock', 'flake8-2020', 'virtualenv', 'pytest-virtualenv', 'wheel', 'paver', 'pip', 'jaraco.envs', 'pytest-xdist', 'sphinx', 'jaraco.path', 'pytest-black', 'pytest-mypy'])],
'six': [('six', '1.16.0', [])],
'toml': [('toml', '0.10.2', [])],
'tqdm': [('tqdm', '4.62.3', ['colorama', 'py-make', 'twine', 'wheel', 'ipywidgets', 'requests'])],
'twine': [('twine', '3.7.1', ['pkginfo', 'readme-renderer', 'requests', 'requests-toolbelt', 'tqdm', 'importlib-metadata', 'keyring', 'rfc3986', 'colorama'])],
'urllib3': [('urllib3', '1.26.7', ['brotlipy', 'pyOpenSSL', 'cryptography', 'idna', 'certifi', 'ipaddress', 'PySocks'])],
'webencodings': [('webencodings', '0.5.1', [])],
'werkzeug': [('Werkzeug', '2.0.2', ['dataclasses', 'watchdog'])],
'wheel': [('wheel', '0.37.0', ['pytest', 'pytest-cov'])],
'zipp': [('zipp', '3.6.0', ['sphinx', 'jaraco.packaging', 'rst.linker', 'pytest', 'pytest-checkdocs', 'pytest-flake8', 'pytest-cov', 'pytest-enabler', 'jaraco.itertools', 'func-timeout', 'pytest-black', 'pytest-mypy'])]},
'missing': [],
'path': '../reconciliation/',
'python_paths': ['../reconciliation/.env/lib/python3.8/site-packages'],
'source_deps': {'flask': ['request'],
'flask_jsonpify': ['jsonpify'],
'json': [],
'marshmallow': ['Schema', 'fields'],
'marshmallow.decorators': ['post_dump', 'post_load'],
'os': ['name'],
'pandas': [],
'pprint': ['pprint'],
'reconciliation.reconcile': ['EntityType',
'InvalidUsage',
'Property',
'ReconcileRequest',
'ReconcileService'],
'setuptools': ['setup'],
'typing': []},
'source_pkgs': {'reconciliation': [('reconciliation', '0.3', ['Flask', 'Flask-Jsonpify', 'marshmallow'])]},
'system': ['json', 'os', 'pprint', 'typing']}
You can see what's active, what's installed, what's a third-party vendor package, what's provided by Python itself, and what's provided by the project.
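As a rough idea of how that kind of classification can be made (again, my own sketch, not necessarily how depend-py does it), the standard library on Python 3.10+ already knows which bucket an import name falls into:

# Illustrative only: bucket an import name the way the output above does.
# Assumes Python 3.10+ for sys.stdlib_module_names and packages_distributions().
import sys
from importlib.metadata import packages_distributions
from pathlib import Path

def classify(name: str, project_root: str) -> str:
    top = name.split(".")[0]
    if top in sys.stdlib_module_names:
        return "system"        # e.g. json, os, pprint, typing
    root = Path(project_root)
    if (root / top).is_dir() or (root / f"{top}.py").exists():
        return "project"       # e.g. reconciliation
    if top in packages_distributions():
        return "vendor"        # e.g. flask -> Flask
    return "missing"           # imported but not found anywhere

for mod in ["json", "flask", "marshmallow", "reconciliation", "not_a_real_module"]:
    print(mod, "->", classify(mod, "../reconciliation/"))

Note that this sketch only sees the interpreter it runs under, whereas depend-py points at the target project's own site-packages (the python_paths entry in the output above).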
There's even code in there to help you trace a dependency back to what required it:
$ python deep_resolve.py --path ../reconciliation/ --depends-on Flask
Flask is required by ['Flask-Jsonpify', 'reconciliation']
List of ['Flask-Jsonpify', 'reconciliation'] has a source package
Flask is NOT a source dependency
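For comparison, here is a minimal sketch of the same reverse lookup done with nothing but the standard library. It only sees the install requirements declared by installed distributions, not the project's own source, which is the part depend-py layers on top; it is my illustration, not the tool's code.

# Walk installed distributions and report whose requirements mention the package.
import re
import sys
from importlib import metadata

def required_by(target: str) -> list:
    target = target.lower()
    dependents = set()
    for dist in metadata.distributions():
        for req in dist.requires or []:
            # Requirement strings look like "Flask (>=1.0)" or "Flask>=1.0; extra == 'docs'"
            name = re.split(r"[\s;(<>=!~\[]", req, maxsplit=1)[0].lower()
            if name == target:
                dependents.add(dist.metadata["Name"])
                break
    return sorted(dependents)

if __name__ == "__main__":
    pkg = sys.argv[1] if len(sys.argv) > 1 else "Flask"
    print(f"{pkg} is required by {required_by(pkg)}")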
There's obviously a lot more that can be done with this, but it's a start towards getting things under control.
Top image from XKCD
All other images from the mind of a GPU