Source code for serializejson

"""
serializejson
=============

+---------------------------+--------------------------------------------------------------------------------------------------------------------------+
| **Authors**               | `Baptiste de La Gorce <contact@smartaudiotools.com>`_                                                                    |
+---------------------------+--------------------------------------------------------------------------------------------------------------------------+
| **PyPI**                  | https://pypi.org/project/serializejson                                                                                   |
+---------------------------+--------------------------------------------------------------------------------------------------------------------------+
| **Documentation**         | https://smartaudiotools.github.io/serializejson                                                                          |
+---------------------------+--------------------------------------------------------------------------------------------------------------------------+
| **Sources**               | https://github.com/SmartAudioTools/serializejson                                                                         |
+---------------------------+--------------------------------------------------------------------------------------------------------------------------+
| **Issues**                | https://github.com/SmartAudioTools/serializejson/issues                                                                  |
+---------------------------+--------------------------------------------------------------------------------------------------------------------------+
| **Noncommercial license** | `Prosperity Public License 3.0.0 <https://github.com/SmartAudioTools/serializejson/blob/master/LICENSE-PROSPERITY.rst>`_ |
+---------------------------+--------------------------------------------------------------------------------------------------------------------------+
| **Commercial license**    | `Patron License 1.0.0 <https://github.com/SmartAudioTools/serializejson/blob/master/LICENSE-PATRON.rst>`_                |
|                           | ⇒ `Sponsor me ! <https://github.com/sponsors/SmartAudioTools>`_ or `contact me ! <contact@smartaudiotools.com>`_         |
+---------------------------+--------------------------------------------------------------------------------------------------------------------------+


**serializejson** is a python library for fast serialization and deserialization
of python objects to and from `JSON <http://json.org>`_, designed as a safe, interoperable and human-readable drop-in replacement for the Python `pickle <https://docs.python.org/3/library/pickle.html>`_ package.
Complex python object hierarchies can be serialized, deserialized or updated at once, allowing for example to save or restore a complete application state in a few lines of code.
The library is built upon
`python-rapidjson <https://github.com/python-rapidjson/python-rapidjson>`_,
`pybase64 <https://github.com/mayeut/pybase64>`_ and
`blosc <https://github.com/Blosc/python-blosc>`_ for optional `zstandard <https://github.com/facebook/zstd>`_ compression.

Some of the main features:

- supports Python 3.7 or greater (possibly lower versions as well).
- serializes arbitrary python objects into a dictionary by adding a `__class__` key and, when needed, `__init__`, `__new__`, `__state__` and `__items__` keys (see the sketch after this list).
- calls the same object methods as pickle, so almost all picklable objects can be serialized with serializejson without any modification.
- objects that are not picklable yet can always be serialized by adding methods to the object or by creating plugins for pickle or serializejson.
- generally 2x slower than pickle for dumping and 3x slower for loading (on our benchmark), except for big arrays (optimization is planned).
- serializes and deserializes bytes and bytearray very quickly in base64 thanks to `pybase64 <https://github.com/mayeut/pybase64>`_ and lossless `blosc <https://github.com/Blosc/python-blosc>`_ compression.
- can serialize properties and attributes with getters and setters if wanted (unlike pickle).
- json data stays directly loadable even if you have turned some attributes into slots or properties since your last serialization (unlike pickle).
- can serialize `__init__(self, ..)` arguments by name instead of position, allowing to skip arguments with default values and making json data robust to a change of `__init__` parameter order.
- serialized objects generally take less space than with pickle: for binary data, the 30% overhead of base64 encoding is usually more than compensated by the lossless `blosc <https://github.com/Blosc/python-blosc>`_ compression.
- serialized objects are human-readable. Unlike pickled data, your data will never become unreadable if your code evolves: you can always modify it with a text editor (with find & replace, for example, if you rename an attribute).
- serialized objects are text and can therefore be versioned and compared with versioning and diff tools.
- can safely load untrusted / unauthenticated sources if the `authorized_classes` parameter is carefully restricted to the strictly necessary classes (unlike pickle).
- can update existing objects recursively instead of overriding them; serializejson can be used to save and restore a complete application state in place (⚠ not yet well tested).
- filters attributes starting with "_" by default (unlike pickle). You can keep them if wanted with `attributes_filter = False`.
- numpy arrays can be serialized as plain lists, with automatic conversion both ways, or in a conservative way.
- supports circular references and serializes duplicated objects only once, using a "$ref" key and a path to the first occurrence in the json: `{"$ref": "root.xxx.elt"}` (⚠ not yet if the object is a list or dictionary).
- accepts json with comments (// and /\* \*/) if `accept_comments = True`.
- can automatically recognize objects in json from their key names and recreate them without the need of a `__class__` key, if they are passed in `recognized_classes`.
- is easily interoperable outside of the Python ecosystem, thanks to this recognition of objects from key names or to `__class__` translation between python classes and classes of other languages.
- dump and load accept string paths.
- can iteratively encode (with append) and decode (with an iterator) a list in a json file, which saves memory during serialization and deserialization and is useful for logs.
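
For objects without a dedicated plugin or ``__getstate__`` method, the dumped json is essentially the instance's ``__dict__`` behind a ``__class__`` key. A minimal sketch (``Point`` and the printed output are illustrative; exact keys and indentation depend on the object's methods and the Encoder parameters):

.. code-block:: python

    import serializejson

    class Point:
        def __init__(self):
            self.x = 1.0
            self.y = 2.0

    print(serializejson.dumps(Point()))
    # hypothetical output, assuming Point is defined in the __main__ module:
    # {
    #     "__class__": "__main__.Point",
    #     "x": 1.0,
    #     "y": 2.0
    # }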

.. warning::

    **⚠** Do not load serializejson files from untrusted / unauthenticated sources without carefully setting the load authorized_classes parameter.

    **⚠** Never dump a dictionary containing the `__class__` key, otherwise serializejson will attempt to reconstruct an object when loading the json.
    Be careful not to let a user manually enter a dictionary key somewhere without checking that it is not `__class__`.
    Due to a current limitation of rapidjson, we cannot at the moment efficiently detect dictionaries with the `__class__` key in order to raise an error.
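
For example, when loading json from an untrusted source, restrict reconstruction to the classes you actually expect (a minimal sketch; ``Point`` is the illustrative class from the sketch above and "untrusted.json" a hypothetical path):

.. code-block:: python

    import serializejson

    # Only Point (plus the default built-in classes) may be recreated from "__class__" keys;
    # any other class raises a TypeError.
    loaded = serializejson.load("untrusted.json", authorized_classes=[Point])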


Installation
============

**Latest official release**

.. code-block::

    pip install serializejson

**Unreleased development version**

.. code-block::

    pip install git+https://github.com/SmartAudioTools/serializejson.git

Examples
================

**Serialization with function-based API**

.. code-block:: python

    import serializejson

    # serialize in string
    object1 = set([1,2])
    dumped1 = serializejson.dumps(object1)
    loaded1 = serializejson.loads(dumped1)
    print(dumped1)
    >{
    >        "__class__": "set",
    >        "__init__": [1,2]
    >}


    # serialize in file
    object2 = set([3,4])
    serializejson.dump(object2,"dumped2.json")
    loaded2 = serializejson.load("dumped2.json")
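
Note that ``dumps`` returns a ``str``. If your code expects the ``bytes`` that ``pickle.dumps`` returns, use ``dumpb`` instead, as its docstring suggests for a drop-in replacement:

.. code-block:: python

    import serializejson

    dumped_bytes = serializejson.dumpb(set([1, 2]))  # json encoded as utf-8 bytes

    # or, as a drop-in pickle replacement:
    from serializejson import dumpb as dumps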

**Serialization with class-based API**

.. code-block:: python

    import serializejson
    encoder = serializejson.Encoder()
    decoder = serializejson.Decoder()

    # serialize in string

    object1 = set([1,2])
    dumped1 = encoder.dumps(object1)
    loaded1 = decoder.loads(dumped1)
    print(dumped1)

    # serialize in file
    object2 = set([3,4])
    encoder.dump(object2,"dumped2.json")
    loaded2 = decoder.load("dumped2.json")

**Update existing object**

.. code-block:: python

    import serializejson
    object1 = set([1,2])
    object2 = set([3,4])
    dumped1 = serializejson.dumps(object1)
    print(f"id {id(object2)} :  {object2}")
    serializejson.loads(dumped1,obj = object2, updatables_classes = [set])
    print(f"id {id(object2)} :  {object2}")

**Iterative serialization and deserialization**

.. code-block:: python

    import serializejson
    encoder = serializejson.Encoder("my_list.json",indent = None)
    for elt in range(3):
        encoder.append(elt)
    print(open("my_list.json").read())
    for elt in serializejson.Decoder("my_list.json"):
        print(elt)
    >[0,1,2]
    >0
    >1
    >2
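
The same kind of file can also be grown with the module-level ``append`` function, which opens and closes the file on each call (a short sketch; "my_list2.json" is an illustrative path and must be empty or already contain a json list):

.. code-block:: python

    import serializejson

    for elt in range(3):
        serializejson.append(elt, "my_list2.json")
    for elt in serializejson.Decoder("my_list2.json"):
        print(elt)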

More examples and complete documentation `here <https://smartaudiotools.github.io/serializejson/>`_

License
=======

Copyright 2020 Baptiste de La Gorce

For noncommercial use, or for a thirty-day free-trial period of commercial use, this project is licensed under the `Prosperity Public License 3.0.0 <https://github.com/SmartAudioTools/serializejson/blob/master/LICENSE-PROSPERITY.rst>`_.

For unrestricted commercial use, this project is licensed under the `Patron License 1.0.0 <https://github.com/SmartAudioTools/serializejson/blob/master/LICENSE-PATRON.rst>`_.
To acquire a license please `contact me <mailto:contact@smartaudiotools.com>`_, or just `sponsor me on GitHub <https://github.com/sponsors/SmartAudioTools>`_ under the appropriate tier! This funding model helps me make my work sustainable and compensates me for the work it took to write this library!

Third-party contributions are licensed under `Apache License, Version 2.0 <http://www.apache.org/licenses/LICENSE-2.0>`_ and belong to their respective authors.
"""

try:
    import importlib.metadata as importlib_metadata  # New in version 3.8
except ModuleNotFoundError:
    import importlib_metadata
try:
    __version__ = importlib_metadata.version("serializejson")
except importlib_metadata.PackageNotFoundError:  # package is not installed
    pass
import os
import warnings
import io
import rapidjson
import gc
import blosc
import errno
from collections import deque
from pybase64 import b64decode, b64encode_as_string
from _collections_abc import list_iterator
try:
    import numpy
    from numpy import ndarray

    use_numpy = True
except ModuleNotFoundError:
    use_numpy = False
from . import serialize_parameters


# def add_authorized_classes(*classes):
#    if len(classes) == 1 and type(classes[0]) in (tuple, list, set):
#        classes = classes[0]
#    for elt in classes:
#        if not type(elt) is str:
#            elt = class_str_from_class(elt)
#        authorized_classes.add(elt)

from .tools import (
    getstate,
    setstate,
    instance,
    tuple_from_instance,
    class_str_from_class,
    class_from_class_str,
    from_name,
    _get_getters,
    _get_setters,
    _get_properties,
    encoder_parameters,
    _onlyOneDimSameTypeNumbers,
    _onlyOneDimNumbers,
    blosc_compressions,
    setters_names_from_class,
    slots_from_class,
    authorized_classes,
    Reference,
    constructors,
)


authorized_classes.update(
    {
        "bytes",
        "bytearray",
        "complex",
        "frozenset",
        "tuple",
        "type",
        "range",
        "set",
        "slice",
        "dict_non_str_keys",
        "collections.Counter",
        "collections.defaultdict",
        "collections.deque",
        "collections.OrderedDict",
    }
)

__all__ = [
    "dumps",
    "dump",
    "loads",
    "load",
    "append",
    "Encoder",
    "Decoder",
    "getstate",
    "class_from_class_str",
]
# sentinel allowing to keep None as an allowed value for the Decoder's default_value.
no_default_value = []
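# e.g. Decoder(default_value=None) must stay distinguishable from "no default value given",
# so the code tests `value is no_default_value` rather than comparing against None.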


# --- FUNCTION-BASED API ----------------------


def dump(obj, file, **argsDict):
    """
    Dump an object into a json file.

    Args:
        obj: object to dump.
        file (str or file-like): path or file.
        **argsDict: parameters passed to the Encoder (see documentation).
    """
    if isinstance(file, str):
        fp = open(file, "wb")
    else:
        fp = file
    Encoder(**argsDict)(obj, fp)
def dumps(obj, **argsDict):
    """
    Dump an object into a json string.

    If you want bytes returned, for a drop-in pickle replacement, you should either replace
    `pickle.dumps` calls with `serializejson.dumpb` calls, or do
    `from serializejson import dumpb as dumps` at the start of your script.

    Args:
        obj: object to dump.
        **argsDict: parameters passed to the Encoder (see documentation).
    """
    return Encoder(**argsDict)(obj, return_bytes=False)
def dumpb(obj, **argsDict):
    """
    Dump an object into json bytes.

    Args:
        obj: object to dump.
        **argsDict: parameters passed to the Encoder (see documentation).
    """
    return Encoder(**argsDict)(obj, return_bytes=False).encode("utf_8")
def append(obj, file=None, *, indent="\t", **argsDict):
    """
    Append an object into a json file.

    Args:
        obj: object to dump.
        file (str or file-like): path or file. The file must be empty or contain a json list.
        indent: indent passed to the Encoder.
        **argsDict: other parameters passed to the Encoder (see documentation).
    """
    file = _open_for_append(file, indent)
    Encoder(**argsDict)(obj, file)
    _close_for_append(file, indent)
def loads(json, *, obj=None, iterator=False, **argsDict):  # cannot update an object at the same time
    """
    Load an object from a json string or bytes.

    Args:
        json: the json string or bytes.
        obj (optional): If provided, the object `obj` will be updated and no new object will be created.
        iterator: if `True` and the json corresponds to a list, then the items will be read one by
            one, which reduces RAM consumption.
        **argsDict: parameters passed to the Decoder (see documentation).

    Return:
        created object, updated object if `obj` is provided, or elements iterator if `iterator` is `True`.
    """
    decoder = Decoder(**argsDict)
    if iterator:
        return decoder
    else:
        return decoder(json=json, obj=obj)
def load(file, *, obj=None, iterator=False, **argsDict):
    """
    Load an object from a json file.

    Args:
        file (str or file-like): the json path or file-like object.
        obj (optional): if provided, the object `obj` will be updated and no new object will be created.
        iterator: if `True` and the json corresponds to a list, then the items will be read one by
            one, which reduces RAM consumption.
        **argsDict: parameters passed to the Decoder (see documentation).

    Return:
        created object, updated object if `obj` is provided, or elements iterator if `iterator` is `True`.
    """
    if iterator:
        return Decoder(file, **argsDict)
    else:
        return Decoder(**argsDict).load(file=file, obj=obj)
def jsonpath(obj):
    """Return the json path of a loaded object."""
    return id_to_path.get(id(obj), None)


# --- CLASS-BASED API -------------------------------------------------------
class Encoder(rapidjson.Encoder):
    """
    Class for serialization of python objects into json.

    Args:

        file (str or file-like): The json path or file-like object. When specified, the encoded
            result will be written there if you don't pass a file to the `dump()` method later.

        attributes_filter (bool or set/list/tuple): Controls whether "private" attributes starting
            with "_" are removed from the saved state, for objects without plugin, __getstate__,
            __serializejson__ or reimplemented __reduce_ex__ or __reduce__ methods.

            - `False` : filter private attributes for no class (if not filtered in __reduce__ or __getstate__ methods)
            - `True` : filter private attributes for all classes
            - `set/list/tuple` : filter private attributes for these classes

            Use it temporarily.

            - In order to stay compatible with pickle, you should rather code one of the
              __getstate__, __reduce_ex__, __reduce__ methods or a pickle plugin filtering
              attributes starting with "_".
            - Otherwise, in order to be independent of this parameter, code a __serializejson__
              method or a serializejson plugin.
            - In this method or plugin you can call the helper function:
              state = serialize.__getstate__(self, attributes_filter=True)

        properties (bool, None, set/list/tuple, dict): Controls whether properties are added to the
            saved state, for objects without plugin, __getstate__, __serializejson__ or reimplemented
            __reduce_ex__ or __reduce__ methods.

            - `False` : (default) add properties for no class (as pickle)
            - `True` : add properties for all classes
            - `None` : add the properties defined in the serializejson.properties dict
              (added by plugins or manually before the encoder call)
              (see documentation section :ref:`"Add plugins to serializejson"<add-plugins-label>`.)
            - `set/list/tuple` : add all properties for the classes in this set/list/tuple, in addition
              to the properties defined in the serializejson.properties dict [class1, class2, ..]
              (not secure with untrusted json, use it only for debugging)
            - `dict` : add the properties defined in this dict, in addition to the properties defined
              in the serializejson.properties dict {class1: ["property1", "property2"], class2: True}

            Use it temporarily.

            - In order to stay compatible with pickle, you should rather code one of the __getstate__,
              __reduce_ex__, __reduce__ methods or a pickle plugin, retrieving the values of properties
              and returning them in the same dictionary as __slots__, as the second element of a state tuple.
            - Otherwise, in order to be independent of this parameter, code a __serializejson__ method
              or a serializejson plugin retrieving the values of properties and returning them in the
              state dictionary.
            - In this method or plugin you can call the helper function:
              state = serialize.__getstate__(self, properties=True or list of property names)

        getters (bool, None, set/list/tuple, dict): Controls whether values retrieved with getters are
            added to the saved state, for objects without plugin, __getstate__, __serializejson__ or
            reimplemented __reduce_ex__ or __reduce__ methods.

            - `False` : (default) save no other getters than those called in __getstate__ methods, like pickle.
            - `True` : save getters for all objects
            - `None` : save the getters defined in the serializejson.getters dict (added by
              plugins or manually before the encoder call)
              (see documentation section :ref:`"Add plugins to serializejson"<add-plugins-label>`.)
            - `set/list/tuple` : save getters for the classes in this set/list/tuple, in addition to the
              getters defined in the serializejson.getters dict [class1, class2, ..]
              (not secure with untrusted json, use it only for debugging)
            - `dict` : save the getters defined in this dict, in addition to the getters defined in the
              serializejson.getters dict {class1: {"attribute_name": "getter_name", ...}, class2: True}

            Use it temporarily.

            - In order to stay compatible with pickle, you should rather code one of the __getstate__,
              __reduce_ex__, __reduce__ methods or a pickle plugin, retrieving the values with getters
              and returning them in the state, and code a __setstate__ method calling the setters for
              these values.
            - Otherwise, in order to be independent of this parameter, code a __serializejson__ method
              or a serializejson plugin retrieving the values with getters and returning them in the
              state, and code a __setstate__ method calling the setters for these values, or leave the
              Decoder's setters parameter as True.
            - In this method or plugin you can call the helper function:
              state = serialize.__getstate__(self, getters=True or {"a": "getA", "b": "getB"}).
              With getters as True, the getters will be automatically guessed. Getters as a dict allow
              the finest control and are faster, because getters are not guessed from introspection.
              With a tuple as key in this dict, you can retrieve several attribute values from one getter.

        remove_default_values (bool or set/list/tuple): Controls whether values equal to their default
            value are removed from the state in order to save memory space, for objects without plugin,
            __getstate__, __serializejson__ or reimplemented __reduce_ex__ or __reduce__ methods.

            - `False` : remove default values for no class
            - `True` : remove default values for all classes
            - `set/list/tuple` : remove default values for these classes.

            Use it temporarily.

            - Since the default values will not be stored and may change between different versions of
              your code, never use it for long term storage. Be aware that in order to know the default
              values, serializejson will create an instance of the object's class without any __init__
              argument.
            - In order to stay compatible with pickle, you should rather code one of the __getstate__,
              __reduce_ex__, __reduce__ methods or a pickle plugin removing the values equal to their
              default value.
            - Otherwise, in order to be independent of this parameter, code a __serializejson__ method
              or a serializejson plugin removing the values equal to their default value.
            - In this method or plugin you can call the helper function:
              state = serialize.__getstate__(self, remove_default_values=True or dict {name: default_value, ...})

        chunk_size: Write the file in chunks of this size at a time.

        ensure_ascii: Whether non-ascii str are dumped with escaped unicode or utf-8.

        indent (None, int or '\\\\t'): Indentation width to produce pretty printed JSON.

            - `None` : json in one line (quicker than with indent).
            - `int` : new lines and `indent` spaces of indentation.
            - '\\\\t' : new lines and tabulations for indentation (takes less space than int > 1).

        single_line_init: whether `__init__` args must be serialized in one line.

        single_line_new: whether `__new__` args must be serialized in one line.

        single_line_list_numbers: whether lists of numbers of the same type must be serialized in one line.

        sort_keys: whether dictionary keys should be sorted alphabetically. Since python 3.7 dictionary
            order is guaranteed to be insertion order. Some code may now rely on this particular order,
            like the key order of the state returned by __getstate__.

        bytes_compression (None, str or tuple): Compression for bytes, bytearray and numpy arrays:

            - `None` : no compression, use only base 64.
            - `str` : compression name ("blosc_zstd", "blosclz", "blosc_lz4", "blosc_lz4hc" or
              "blosc_zlib") with maximum compression level 9.
            - `tuple` : (compression name, compression level) with compression level from 0
              (no compression) to 9 (maximum compression).

            By default the "blosc_zstd" compression is used with compression level 1.
            For the highest compression (but slower dumping) use "blosc_zstd" with compression level 9.

        bytes_compression_diff_dtypes (tuple of dtype): tuple of dtypes for which serializejson encodes
            the first element followed by the differences between consecutive elements of an array
            before the compression. A cumulative sum will be used for the decompression.

        bytes_compression_threads (int, str): Number of threads used for the compression.

            - `int` : number of threads used for the compression
            - `"cpus"` : use as many threads as cpus
            - `"determinist"` : use one thread with blosc compression for deterministic compression,
              otherwise as many threads as cpus

        bytes_size_compression_threshold (int): bytes size threshold beyond which compression is tried,
            to reduce the size of bytes, bytearray and numpy arrays if `bytes_compression` is not None.
            The default value is 512; below it the compression is generally not worth it due to the
            header size and the additional cpu cost.

        array_readable_max_size (int, None or dict): Defines the maximum array.array size for
            serialization in readable numbers. By default array_readable_max_size is set to 0 and all
            non empty arrays are encoded in base 64.

            - `int` : all arrays smaller than or equal to this size are serialized in readable numbers.
            - `None` : there is no maximum size and all arrays are serialized in readable numbers.
            - `dict` : for each typecode key, the value defines the maximum size of arrays of this
              typecode for serialization in readable numbers. If the value is `None` there is no
              maximum and arrays of this typecode are all serialized in readable numbers. If you want
              only signed int arrays to be readable, then you should pass
              `array_readable_max_size = {"i": None}`.

            .. note::

                serialization of int arrays can take much less space in readable numbers, but is much
                slower than in base 64 for big arrays. If you have a lot of large int arrays and
                performance matters, then you should keep the default value 0.

        numpy_array_readable_max_size (int, None or dict): Defines the maximum numpy array size
            (product of the array's dimensions) for serialization in readable numbers. By default
            numpy_array_readable_max_size is set to 0 and all non empty numpy arrays are encoded in
            base 64.

            - `int` : all numpy arrays smaller than or equal to this size are serialized in readable numbers.
            - `None` : there is no maximum size and all numpy arrays are serialized in readable numbers.
            - `dict` : for each dtype key, the value defines the maximum size of arrays of this dtype
              for serialization in readable numbers. If the value is `None` there is no maximum and
              numpy arrays of this dtype are all serialized in readable numbers. If you want only int32
              numpy arrays to be readable, then you should pass
              `numpy_array_readable_max_size = {"int32": None}`.

            .. note::

                serialization in readable numbers can take much less space for int32 if the values are
                smaller than or equal to 9999, but is much slower than in base 64 for big arrays. If
                you have a lot of large int32 numpy arrays and performance matters, then you should
                keep the default value 0.

        numpy_array_to_list: whether numpy arrays should be serialized as lists.

            .. warning::

                This should be used only for interoperability with other json libraries.
                If you want readable values in your json, we recommend using
                `numpy_array_readable_max_size` instead, which is not destructive.
                With `numpy_array_to_list` set to `True`:

                - numpy arrays will be indistinguishable from lists in the json.
                - `Decoder(numpy_array_from_list=True)` will recreate numpy arrays from lists of bool,
                  int or float, if not an `__init__` args list, with the risk of unwanted conversion of
                  lists to numpy arrays.
                - the dtype of the numpy array will be lost if not bool, int32 or float64, and
                  converted to bool, int32 or float64 when loading.
                - empty numpy arrays will be converted to [] without any way to guess the dtype, and
                  will stay an empty list when loading, even with `numpy_array_from_list = True`.

        numpy_types_to_python_types: whether numpy integers and floats outside of an array must be
            converted to python types. It saves space and generally doesn't affect loading.

        strict_pickle (False by default): If True, serialize with exactly the same behaviour as pickle:

            - disabling serializejson plugins for custom serialization (no numpyB64)
            - disabling attributes_filter
            - disabling keys sorting
            - disabling numpy_array_to_list
            - disabling numpy_types_to_python_types
            - keeping __dict__ and __slots__ separated in a tuple if both are present, instead of
              merging them into one dictionary (your __setstate__ methods should be prepared to
              receive either a tuple or a dictionary)
            - making the same checks as pickle
            - raising the same errors as pickle

        **plugins_parameters: extra keyword arguments are stored in the "serialize_parameters" global
            module and accessible in plugin modules, in order to allow a choice between serialization
            options in plugins.
    """

    """
    bytearray_use_bytearrayB64: save bytearray with references to serializejson.bytearrayB64 instead
        of a verbose use of base64.b64decode. It saves space but makes the json file dependent on the
        serializejson module.
    numpy_array_use_numpyB64: save numpy arrays with references to serializejson.numpyB64 instead of
        a verbose use of base64.b64decode. It saves space but makes the json file dependent on the
        serializejson module.
    """

    def __new__(
        cls,
        file=None,
        *,
        strict_pickle=False,
        return_bytes=False,
        attributes_filter=True,
        properties=False,
        getters=False,
        remove_default_values=False,
        chunk_size=65536,
        ensure_ascii=False,
        indent="\t",
        single_line_init=True,
        single_line_new=True,
        single_line_list_numbers=True,
        sort_keys=False,
        bytes_compression=("blosc_zstd", 1),
        bytes_compression_diff_dtypes=tuple(),
        bytes_size_compression_threshold=512,
        bytes_compression_threads=1,
        array_use_arrayB64=True,  # keep it?
        array_readable_max_size=0,  # 'int32': -1
        numpy_array_use_numpyB64=True,  # keep it?
        numpy_array_readable_max_size=0,  # 'int32': -1
        numpy_array_to_list=False,
        numpy_types_to_python_types=True,
        protocol=4,  # protocol for pickle
        **plugins_parameters,
    ):
        # if not bytes_to_string:
        #     bytes_mode = rapidjson.BM_NONE
        # else:
        #     bytes_mode = rapidjson.BM_UTF8
        if strict_pickle:
            attributes_filter = False
            sort_keys = False
            numpy_array_to_list = False
            numpy_types_to_python_types = False
        self = super().__new__(
            cls,
            ensure_ascii=ensure_ascii,
            indent=indent,
            sort_keys=sort_keys,
            bytes_mode=rapidjson.BM_NONE,
            number_mode=rapidjson.NM_NAN,
            iterable_mode=rapidjson.IM_ONLY_LISTS,
            mapping_mode=rapidjson.MM_ONLY_DICTS
            # **argsDict
        )
        self.protocol = protocol
        self.attributes_filter = bool_or_set(attributes_filter)
        self.properties = _get_properties(properties)
        self.getters = _get_getters(getters)
        self.remove_default_values = bool_or_set(remove_default_values)
        self.file = file
        self.return_bytes = return_bytes
        if indent is None:
            self.single_line_list_numbers = False
            self.single_line_init = False
            self.single_line_new = False
        else:
            self.single_line_list_numbers = single_line_list_numbers
            self.single_line_init = single_line_init
            self.single_line_new = single_line_new
        # rapidjson stores self.indent_char and self.indent_count, but gives no way to know whether indent is None ...
        self.indent = indent
        self._dump_one_line = indent is None
        self.dumped_classes = set()
        self.chunk_size = chunk_size
        bytes_compression_level = 9
        if bytes_compression is not None:
            if isinstance(bytes_compression, (list, tuple)):
                bytes_compression, bytes_compression_level = bytes_compression
            if bytes_compression not in blosc_compressions:
                raise Exception(
                    f"{bytes_compression} compression unknown. Available values for bytes_compression are {', '.join(blosc_compressions)}"
                )
        self.bytes_compression = bytes_compression
        self.bytes_compression_threads = bytes_compression_threads
        self.bytes_compression_diff_dtypes = bytes_compression_diff_dtypes
        self.bytes_compression_level = bytes_compression_level
        self.bytes_size_compression_threshold = bytes_size_compression_threshold
        self.array_use_arrayB64 = array_use_arrayB64
        self.array_readable_max_size = array_readable_max_size
        self.numpy_array_to_list = numpy_array_to_list
        self.numpy_array_use_numpyB64 = numpy_array_use_numpyB64
        self.numpy_array_readable_max_size = numpy_array_readable_max_size
        self.numpy_types_to_python_types = numpy_types_to_python_types
        self.strict_pickle = strict_pickle
        unexpected_keywords_arguments = set(plugins_parameters) - set(encoder_parameters)
        if unexpected_keywords_arguments:
            raise TypeError(
                "serializejson.Encoder got unexpected keywords arguments '"
                + ", ".join(unexpected_keywords_arguments)
                + "'"
            )
        self.plugins_parameters = encoder_parameters.copy()
        self.plugins_parameters.update(plugins_parameters)
        return self
    def dump(self, obj, file=None, close=True):
        """
        Dump object into json file.

        Args:
            obj: object to dump.
            file (optional str or file-like): the json path or file-like object. When specified, json
                is written into this file. Otherwise json is written into the file passed to the
                `Encoder()` constructor.
            close (optional bool): whether dump must close the file after dumping (True by default).
        """
        if file is None:
            file = self.file
        if isinstance(file, str):
            self.fp = open(file, "wb")
        else:
            self.fp = file
        self.__call__(obj, fp=self.fp)
        if close:
            self.fp.close()
            del self.fp
    def dumps(self, obj):
        """
        Dump object into json string.
        """
        return self.__call__(obj, return_bytes=False)
    def dumpb(self, obj):
        """
        Dump object into json bytes.
        """
        return self.__call__(obj, return_bytes=True)
    def close(self):
        if hasattr(self, "fp"):
            self.fp.close()
            del self.fp
        # else:
        #     raise Exception("json file already closed")

    def clear(self, close=False):
        self._reset()
        self._update_serialize_parameters()
        # self.file = open(self.file, "rb+")
        if isinstance(self.file, str):
            path = self.file
            if os.path.exists(path):
                self.fp = open(path, "rb+")
                self.fp.truncate(0)
            else:
                self.fp = open(path, "wb+")
        else:
            self.fp.truncate(0)
        if close:
            self.fp.close()
            del self.fp

    # @profile
    def append(self, obj, file=None, close=False):
        """
        Append object into json file.

        Args:
            obj: object to dump.
            file (optional str or file-like): path or file. If provided, the object will be dumped
                into this file instead of the file passed to the Encoder constructor.
                The file must be empty or contain a json list.
            close:
                - `True` : the file will be closed after the append and reopened at the next append.
                - `False` : (default) the file will be kept open for the next append. You will have
                  to close the file manually with encoder.close().
        """
        self._update_serialize_parameters()
        if file is None:
            file = self.file
        if hasattr(self, "fp"):
            fp = _open_for_append(self.fp, self.indent)
        else:
            self.fp = fp = _open_for_append(file, self.indent)
        rapidjson.Encoder.__call__(self, obj, stream=fp, chunk_size=self.chunk_size)
        _close_for_append(fp, self.indent)
        if close:
            fp.close()
            del self.fp
    def get_dumped_classes(self):
        """
        Return all the dumped classes, in order to reuse them as the `authorized_classes` argument
        when loading with a ``serializejson.Decoder``.
        """
        return self.dumped_classes
    # @profile
    def default(self, inst):
        # Equivalent to the "default" callback that can be passed to dump or dumps
        id_ = id(inst)
        if id_ in self._already_serialized:
            path = self._get_path(inst, already_explored=set([id(locals())]))
            if path is not None:
                return rapidjson.RawJSON(f'{{"$ref": "{path}"}}')
        else:
            self._already_serialized.add(id_)
        type_inst = type(inst)
        if self.numpy_types_to_python_types and type_inst in _numpy_types:
            return _numpy_dtypes_to_python_types[type_inst](inst)
        if use_numpy and type_inst is ndarray and self.numpy_array_to_list:
            if self._dump_one_line or not self.single_line_list_numbers:
                # TO REVIEW: not great... should test whether the numbers are all of the
                # same type and not use rapidjson.NM_NATIVE?
                return inst.tolist()
            if inst.dtype in _numpy_float_dtypes:
                number_mode = self.number_mode
            else:
                number_mode = rapidjson.NM_NATIVE  # speeds things up quite a bit
            if inst.ndim == 1:
                return rapidjson.RawJSON(
                    rapidjson.dumps(
                        inst.tolist(),
                        ensure_ascii=False,
                        number_mode=number_mode,
                        iterable_mode=rapidjson.IM_ONLY_LISTS,
                        mapping_mode=rapidjson.MM_ONLY_DICTS,
                    )
                )
            return [
                rapidjson.RawJSON(
                    rapidjson.dumps(
                        elt.tolist(),
                        ensure_ascii=False,
                        number_mode=number_mode,
                        iterable_mode=rapidjson.IM_ONLY_LISTS,
                        mapping_mode=rapidjson.MM_ONLY_DICTS,
                    )
                )
                for elt in inst
            ]  # inst.tolist()
        if type_inst is tuple:
            # isinstance(inst, tuple) would catch struct_time.
            # Placed here rather than in tuple_from_instance because it is very json-specific and
            # tuples have no __reduce__, unlike set, which is currently handled in
            # dict_from_instance -> tuple_from_instance
            self.dumped_classes.add(tuple)
            dic = {"__class__": "tuple", "__new__": list(inst)}
        elif type_inst is Reference:
            return rapidjson.RawJSON(
                '{"$ref": "%s%s"}'
                % (
                    self._get_path(inst.obj, already_explored=set([id(inst.__dict__)])),
                    inst.sup_str,
                )
            )
        else:
            # 8.6 % of the time (the base64 conversion with pybase64.b64encode) on
            # obj = bytes(numpy.arange(2**20, dtype=numpy.float64).data)
            dic = self._dict_from_instance(inst)
        if not self._dump_one_line:
            if self.single_line_init:
                args = dic.get("__init__", None)
                if isinstance(args, list):
                    # 91.2 % of the time with obj = bytes(numpy.arange(2**20, dtype=numpy.float64).data)
                    dic["__init__"] = rapidjson.RawJSON(
                        rapidjson.dumps(
                            args,
                            ensure_ascii=self.ensure_ascii,
                            default=self._default_one_line,
                            sort_keys=self.sort_keys,
                            bytes_mode=self.bytes_mode,
                            number_mode=self.number_mode,
                            iterable_mode=rapidjson.IM_ONLY_LISTS,
                            mapping_mode=rapidjson.MM_ONLY_DICTS
                            # **self.kargs
                        )
                    )
            if self.single_line_new:
                args = dic.get("__new__", None)
                if type(args) is list:
                    # 91.2 % of the time with obj = bytes(numpy.arange(2**20, dtype=numpy.float64).data)
                    dic["__new__"] = rapidjson.RawJSON(
                        rapidjson.dumps(
                            args,
                            ensure_ascii=self.ensure_ascii,
                            default=self._default_one_line,
                            sort_keys=self.sort_keys,
                            bytes_mode=self.bytes_mode,
                            number_mode=self.number_mode,
                            iterable_mode=rapidjson.IM_ONLY_LISTS,
                            mapping_mode=rapidjson.MM_ONLY_DICTS
                            # **self.kargs
                        )
                    )
            if self.single_line_list_numbers:
                for key, value in dic.items():
                    if (
                        key != "__class__"
                        and (key != "__init__" or not self.single_line_init)
                        and (key != "__new__" or not self.single_line_new)
                        and type(value) is list
                        and _onlyOneDimSameTypeNumbers(value)
                    ):
                        dic[key] = rapidjson.RawJSON(
                            rapidjson.dumps(
                                value,
                                ensure_ascii=self.ensure_ascii,
                                default=self._default_one_line,
                                bytes_mode=self.bytes_mode,
                                number_mode=self.number_mode,
                                iterable_mode=rapidjson.IM_ONLY_LISTS,
                                mapping_mode=rapidjson.MM_ONLY_DICTS
                                # **self.kargs
                            )
                        )
        # self._already_serialized_id_dic_to_obj_dic[id(dic)] = (
        #     inst,
        #     dic,
        # )  # important to keep dic too, otherwise it would be destroyed and its id reused.
        # if self.add_id:
        #     dic["_id"] = id_
        return dic
        # raise TypeError('%r is not JSON serializable' % inst)

    # @profile
    def _default_one_line(self, inst):
        type_inst = type(inst)
        if self.numpy_types_to_python_types and type_inst in _numpy_types:
            return _numpy_dtypes_to_python_types[type_inst](inst)
        if type_inst is tuple:
            # isinstance(inst, tuple) would catch struct_time (same remark as in default())
            self.dumped_classes.add(tuple)
            return {"__class__": "tuple", "__new__": list(inst)}
        if type_inst is Reference:
            return {
                "$ref": self._get_path(
                    inst.obj, already_explored=set([id(inst.__dict__)])
                )
                + inst.sup_str
            }
        if type_inst is ndarray and self.numpy_array_to_list:
            return inst.tolist()
        return self._dict_from_instance(inst)

    def _dict_from_instance(self, inst):
        if type(inst) is dict:  # dictionary with non string keys
            d = {"__class__": "dict_non_str_keys"}
            init_dict = d
            for key, value in inst.items():
                # if type(key) is tuple:
                #     key = list(key)
                new_key = None
                type_key = type(key)
                if type_key is int:
                    # floats not included, to keep -inf and inf (nan misbehaves in dictionaries)
                    new_key = str(key)
                elif type_key is str:
                    try:
                        rapidjson.loads(key)
                    except:
                        if key.endswith("'") and (
                            key.startswith("'")
                            or key.startswith("b'")
                            or key.startswith("b64'")
                        ):
                            new_key = f"'{key}'"
                        else:
                            new_key = key
                    else:
                        new_key = f"'{key}'"
                elif type_key is bytes:
                    try:
                        new_key = f"b'{key.decode('ascii_printables')}'"
                    except:
                        new_key = f"b64'{b64encode_as_string(key)}'"
                elif type_key is tuple:
                    key = list(key)
                if new_key is None:
                    new_key = rapidjson.dumps(
                        key,
                        default=self._default_one_line,
                        ensure_ascii=self.ensure_ascii,
                        sort_keys=self.sort_keys,
                        bytes_mode=self.bytes_mode,
                        number_mode=rapidjson.NM_NATIVE,
                        iterable_mode=rapidjson.IM_ONLY_LISTS,
                        # mapping_mode=rapidjson.MM_ONLY_DICTS
                        # **self.kargs
                    )
                init_dict[new_key] = value
            return d
        # if type(inst) is OrderedDict:
        #     if not self.sort_keys:  # needs access to self.sort_keys and is specific to serializejson
        #         return {
        #             "__class__": "collections.OrderedDict",
        #             "__items__": dict(inst),
        #         }
        #     else:
        #         return {
        #             "__class__": "collections.OrderedDict",
        #             "__items__": list(inst.items()),
        #         }
        if type(inst) is dotdict:
            return dict(inst)
        class_, initArgs, state, listitems, dictitems, newArgs = tuple_from_instance(
            inst, self.protocol
        )
        if type(class_) is not str:
            class_ = class_str_from_class(class_)
        self.dumped_classes.add(class_)
        dictionnaire = {"__class__": class_}
        for args, method in ((newArgs, "__new__"), (initArgs, "__init__")):
            if args is not None:
                if type(args) is dict:
                    dictionnaire[method] = args
                else:
                    if class_ in remove_add_braces:
                        if args:
                            dictionnaire[method] = args[0]
                        else:
                            dictionnaire[method] = []
                    elif len(args) == 1:
                        type_first = type(args[0])
                        if (
                            type_first not in (tuple, list)
                            and not (
                                self.numpy_array_to_list and type_first is numpy.ndarray
                            )
                            and ((type_first is not dict) or "__class__" in args[0])
                        ):
                            dictionnaire[method] = args[0]
                        else:
                            dictionnaire[method] = list(args)  # args is a tuple
                    else:
                        dictionnaire[method] = list(args)  # args is a tuple
        if listitems:
            dictionnaire["__items__"] = listitems
        elif dictitems:
            dictionnaire["__items__"] = dictitems
        if state:
            if (type(state) is not dict) or (
                hasattr(inst, "__setstate__") and not all_keys_are_str(state)
            ):
                dictionnaire["__state__"] = state
            else:
                dictionnaire.update(state)
        return dictionnaire

    def __call__(self, obj, fp=None, return_bytes=None):
        if return_bytes is None:
            return_bytes = self.return_bytes
        if (
            type(obj) is list
            and self.single_line_list_numbers
            and _onlyOneDimSameTypeNumbers(obj)
        ):
            return rapidjson.dumps(
                obj,
                ensure_ascii=False,
                default=self._default_one_line,
                bytes_mode=self.bytes_mode,
                number_mode=self.number_mode,
                iterable_mode=rapidjson.IM_ONLY_LISTS,
                mapping_mode=rapidjson.MM_ONLY_DICTS,
                # return_bytes=return_bytes
                # **self.kargs
            )
        self._update_serialize_parameters()
        self._reset()
        self._root = obj
        encoded = rapidjson.Encoder.__call__(
            self, obj, stream=fp, chunk_size=self.chunk_size
        )
        self._clean()
        return encoded

    def _update_serialize_parameters(self):
        blosc.set_nthreads(self.bytes_compression_threads)
        serialize_parameters.__dict__.update(self.__dict__)
        serialize_parameters.__dict__.update(self.plugins_parameters)

    def _reset(self):
        self.dumped_classes = set()
        self._already_serialized = set()
        # self._already_serialized_id_dic_to_obj_dic = dict()

    def _clean(self):
        del self._already_serialized
        # del self.dumped_classes
        # del self._already_serialized_id_dic_to_obj_dic

    # @profile
    def _searchSerializedParent(self, obj, already_explored=set(), attribut=None):  # , list_deep=10):
        root = self._root
        if obj is root:
            return [(["root"], False)]
        id_obj = id(obj)
        if id_obj in already_explored:
            return []
        already_explored = already_explored.copy()
        already_explored.add(id_obj)
        already_explored.add(id(locals()))
        pathElements = list()
        refs = gc.get_referrers(obj)
        already_explored.add(id(refs))
        # potential_parents = [parent_test for parent_test in gc.get_referrers(obj) if ((id(parent_test) not in already_explored) and isinstance(parent_test, (dict, list)))]
        # print(len(potential_parents))
        for parent_test in refs:
            id_parent_test = id(parent_test)
            if id_parent_test not in already_explored:
                type_parent_test = type(parent_test)
                if type_parent_test is dict:
                    if self.sort_keys:
                        parent_test_keys = sorted(parent_test)
                    else:
                        parent_test_keys = parent_test.keys()
                    for key in parent_test_keys:  # sorted(parent_test):
                        value = parent_test[key]
                        if value is obj:
                            for elt, is_attribut in self._searchSerializedParent(
                                parent_test, already_explored, attribut=obj
                            ):
                                if is_attribut:
                                    pathElements.append((elt, False))
                                else:
                                    pathElements.append((elt + [f"['{key}']"], False))
                            break
                elif (
                    type_parent_test is list
                    and not type(parent_test[-1]) is list_iterator
                ):
                    for key, value in enumerate(parent_test):
                        if value is obj:
                            for elt, _ in self._searchSerializedParent(
                                parent_test, already_explored
                            ):
                                pathElements.append((elt + ["[%d]" % key], False))
                            break
                elif id_parent_test in self._already_serialized:
                    if hasattr(parent_test, "__dict__"):
                        dic = self._dict_from_instance(parent_test)
                        for i, (key, value) in enumerate(dic.items()):
                            if value is attribut:
                                for elt, _ in self._searchSerializedParent(
                                    parent_test, already_explored
                                ):
                                    pathElements.append((elt + [".", i, key], True))
                                break
                    if hasattr(parent_test, "__slots__"):
                        dic = self._dict_from_instance(parent_test)
                        for i, (key, value) in enumerate(dic.items()):
                            if value is obj:
                                for elt, _ in self._searchSerializedParent(
                                    parent_test, already_explored
                                ):
                                    pathElements.append((elt + [".", i, key], True))
                                break
        return pathElements

    def _get_path(self, obj, already_explored=set()):
        already_explored.add(id(locals()))
        pathElements = self._searchSerializedParent(
            obj, already_explored=already_explored
        )
        if not pathElements:
            return None
            # return f'impossible to find a path from root object for {obj}'
            # raise Exception("impossible to find a path from root object for %s" % obj)
        # print("!", pathElements)
        # return pathElements[0][0]
        return "".join([e for e in sorted(pathElements)[0][0] if isinstance(e, str)])
class Decoder(rapidjson.Decoder):
    """
    Decoder for loading objects serialized in json files or strings.

    Args:

        file (string or file-like): the json path or file-like object. When specified, the decoder
            will read json from this file if you don't pass a file to the `load()` method later.

        authorized_classes (set/list/tuple): Defines the classes that serializejson is authorized to
            recreate from the `__class__` keywords in json, in addition to the default authorized
            classes and the classes authorized by plugins.
            The default authorized classes are:
            array.array, bytearray, bytes, range, set, slice, time.struct_time, tuple,
            type, frozenset, collections.Counter, collections.OrderedDict,
            collections.defaultdict, collections.deque, complex, datetime.date,
            datetime.datetime, datetime.time, datetime.timedelta, decimal.Decimal,
            numpy.array, numpy.bool_, numpy.dtype, numpy.float16, numpy.float32,
            numpy.float64, numpy.frombuffer, numpy.int16, numpy.int32, numpy.int64,
            numpy.int8, numpy.ndarray, numpy.uint16, numpy.uint32, numpy.uint64,
            numpy.uint8, numpyB64.
            authorized_classes must be a set/list/tuple of classes, or of strings corresponding to
            the qualified names of classes (`module.class_name`). If the loaded json contains an
            unauthorized `__class__`, serializejson will raise a TypeError exception.

            .. warning::

                Do not load serializejson files from untrusted / unauthenticated sources without
                carefully setting the `authorized_classes` parameter.
                Never authorize "eval", "exec", "apply" or other functions or classes which could
                allow execution of malicious code with for example:
                ``{"__class__":"eval","__init__":"do_bad_things()"}``

        unauthorized_classes_as_dict (False by default): Controls whether unauthorized classes should
            be decoded as dict without raising a TypeError (or as dotdict if the dotdict parameter is
            True, see the "dotdict" parameter for further explanation).

        recognized_classes (set/list/tuple): Classes (strings with qualified names, or classes) that
            serializejson will try to recognize from key names. A class will be recognized if the key
            names of a json dictionary are a superset of the class's default attribute names. A
            class's default attribute names are the slots and the attribute names in __dict__ not
            starting with "_" after initialization (serializejson will create an instance of each
            class passed in recognized_classes in order to determine these attributes).
            The instance will be created with __new__ (with no argument), and __init__ will not be
            called. If you want to execute some initialization code, you must add a __setstate__()
            method to your object, or setters/properties with the setters/properties Encoder
            parameters activated.

        updatables_classes (set/list/tuple): Classes (strings with qualified names, or classes) that
            serializejson will try to update if already in the provided object `obj` when calling
            `load` or `loads`. Objects will be recreated for other classes.

        properties (bool, None, set/list/tuple, dict): Controls whether `load` will call properties'
            setters instead of putting the values in self.__dict__, when the object has no
            `__setstate__` method and properties are merged with attributes in the state dictionary
            when dumping (merged if strict_pickle is False).

            - `False` : call properties' setters for no class (as pickle)
            - `True` : (default) call properties' setters for all classes
            - `None` : call only the properties' setters defined in the serializejson.properties dict
              (added by plugins or manually before the decoder call)
              (see documentation section :ref:`"Add plugins to serializejson"<add-plugins-label>`.)
            - `set/list/tuple` : call all properties' setters for the classes in this set/list/tuple,
              in addition to the properties defined in the serializejson.properties dict
              [class1, class2, ..] (not secure with untrusted json, use it only for debugging)
            - `dict` : call the properties' setters defined in this dict, in addition to the
              properties defined in the serializejson.properties dict
              {class1: ["property1", "property2"], class2: True}

            .. warning::

                **The properties' setters are called in the json order!**

                - in alphabetic order if `sort_keys = True` or if the object has no __getstate__ method.
                - in the order returned by the __getstate__ method if `sort_keys = False`.
                - Be careful if you rename an attribute, because the properties' setters call order
                  can change.
                - If `properties = True` (or is a list) then serializejson's load will differ from
                  pickle, which doesn't call attributes' setters.

                **It is best to add a __setstate__() method to your object:**

                - If you want to stay compatible with pickle with the same behavior.
                - If you want to call properties' setters in a different order than alphabetic order
                  and don't want to code a __getstate__ method giving the order.
                - If you want to call properties' setters in an order robust to an attribute name change.
                - If you want to be robust to a change of this `properties` parameter.
                - If you want to avoid transitional states while setting attributes one by one.

                In this method you can call the helper function:
                serialize.__setstate__(self, properties=True)

        setters (bool, None, set/list/tuple, dict): Controls whether `load` will try to call
            `setxxx`, `set_xxx` or `setXxx` methods or the `xxx` property setter for each attribute
            of the serialized objects, when the object has no `__setstate__` method.

            - `False` : call no other setters than those called in __setstate__ methods, like pickle.
            - `True` : (default) explore and call all setters for all objects
              (not secure with untrusted json, use it only for debugging)
            - `None` : call only the setters defined in the serializejson.setters dict (added by
              plugins or manually before the decoder call)
              (see documentation section :ref:`"Add plugins to serializejson"<add-plugins-label>`.)
            - `set/list/tuple` : explore and call setters for the classes in this set/list/tuple, in
              addition to the setters defined in the serializejson.setters dict [class1, class2, ..]
              (not secure with untrusted json, use it only for debugging)
            - `dict` : call the setters defined in this dict, in addition to the setters defined in
              the serializejson.setters dict {class1: {"attribute_name": "setter_name", ...}, class2: True}

            .. warning::

                **The attributes' setters are called in the json order!**

                - in alphabetic order if `sort_keys = True` or if the object has no __getstate__ method.
                - in the order returned by the __getstate__ method if `sort_keys = False`.
                - Be careful if you rename an attribute, because the setters call order can change.
                - If `setters = True` (or is a list) then serializejson's load will differ from
                  pickle, which doesn't call attributes' setters.

                **It is best to add a __setstate__() method to your object:**

                - If you want to stay compatible with pickle with the same behavior.
                - If you want to call setters in a different order than alphabetic order and don't
                  want to code a __getstate__ method giving the order.
                - If you want to call setters in an order robust to an attribute name change.
                - If you want to be robust to a change of this `setters` parameter.
                - If you want to avoid transitional states while setting attributes one by one.

                In this method you can call the helper function:
                serialize.__setstate__(self, setters=True or dict {name: setter_name, ...})

        strict_pickle (False by default): If True, deserialize with exactly the same behaviour as pickle:

            - disabling properties' setters
            - disabling setters
            - disabling numpy_array_from_list

        accept_comments (bool): Controls whether serializejson accepts to parse json with comments.

        numpy_array_from_list (bool): Controls whether lists of bool, int or float elements of the
            same type should be loaded into numpy arrays.

        numpy_array_from_heterogenous_list (bool): Controls whether lists of bool, int or float
            elements of the same or heterogeneous types should be loaded into numpy arrays.

        default_value: The value returned if the path passed to `load` doesn't exist. It allows
            having a default object at the first run of the script, or when the json has been
            deleted, without raising a FileNotFoundError.

        chunk_size (int): Chunk size used when reading the json file.

        dotdict (bool): load dicts as serializejson.dotdict, a dict subclass with access to key names
            with a dot, as object attributes. A dotdict will be serialized as a dict again when
            dumping. dotdict allows you to access the elements of a deserialized dictionary more
            easily, with the same '.' access syntax as for an object, allowing you, if you wish, to
            later turn the dictionaries in your jsons into real objects with the addition of the
            "__class__" field, without having to modify your code.

        add_jsonpath: If True, the source json path will be added to the loaded object as a
            `_jsonpath` attribute. If False (by default), nothing will be added to the loaded object,
            but you can still retrieve the source json path with the "serializejson.jsonpath"
            function, which finds the path from the object's identifier.
    """

    """
    Inherited from rapidjson.Decoder:
        number_mode (int): Enable particular behaviors in handling numbers
        datetime_mode (int): How should datetime, time and date instances be handled
        uuid_mode (int): How should UUID instances be handled
        parse_mode (int): Whether the parser should allow non-standard JSON extensions (nan, -inf, inf)
    """

    def __new__(
        cls,
        file=None,
        *,
        authorized_classes=None,
        unauthorized_classes_as_dict=False,
        recognized_classes=None,
        updatables_classes=None,
        setters=True,
        properties=True,
        default_value=no_default_value,
        accept_comments=False,
        numpy_array_from_list=False,
        numpy_array_from_heterogenous_list=False,
        chunk_size=65536,
        strict_pickle=False,
        dotdict=False,
        add_jsonpath=False,
    ):
        if accept_comments:
            parse_mode = rapidjson.PM_COMMENTS
        else:
            parse_mode = rapidjson.PM_NONE
        self = super().__new__(cls, parse_mode=parse_mode)  # , **argsDict)
        self.strict_pickle = strict_pickle
        if strict_pickle:
            setters = False
            properties = False
            numpy_array_from_list = False
            numpy_array_from_heterogenous_list = False
            add_jsonpath = False
        self.file = file
        self.setters = _get_setters(setters)
        self.properties = _get_properties(properties)
        self._authorized_classes_strs = _get_authorized_classes_strings(
            authorized_classes
        )
        self.unauthorized_classes_as_dict = unauthorized_classes_as_dict
        self._class_from_attributes_names = _get_recognized_classes_dict(
            recognized_classes
        )
        self.set_updatables_classes(updatables_classes)
        # self.accept_comments = accept_comments
        # self.numpy_array_from_list = numpy_array_from_list
        self.default_value = default_value
        self.chunk_size = chunk_size
        self.dotdict = dotdict
        self.add_jsonpath = add_jsonpath
        self.file_iter = None
        self._updating = False
        self.numpy_array_from_list = numpy_array_from_list
        self.numpy_array_from_heterogenous_list = numpy_array_from_heterogenous_list
        if numpy_array_from_heterogenous_list:
            self.numpy_array_from_list = True
            self.end_array = self._end_array_if_numpy_array_from_heterogenous_list
        elif numpy_array_from_list:
            self.end_array = self._end_array_if_numpy_array_from_list
        return self
    def load(self, file=None, obj=None):
        """
        Load object from json file.

        Args:
            file (optional str or file-like): the json path or file-like object. When specified,
                json is read from this file. Otherwise json is read from the file passed to the
                `Decoder()` constructor.
            obj (optional): If provided, the object `obj` will be updated and no new object will be
                created.

        Return:
            created object, or updated object if `obj` is provided.
        """
        if file is None:
            file = self.file
        path = None
        if isinstance(file, str):
            path = file
            # print("load", file)
            if not os.path.exists(file):
                if self.default_value is no_default_value:
                    raise FileNotFoundError(
                        errno.ENOENT, os.strerror(errno.ENOENT), file
                    )
                return self.default_value
            file = _open_with_good_encoding(file)
        elif file is None:  # otherwise presumably a file pointer
            raise ValueError('Decoder.load needs a "file" path/file argument')
        loaded = self.__call__(json=file, obj=obj)
        if path:
            if self.add_jsonpath:
                loaded._jsonpath = path
            id_to_path[id(loaded)] = path
        return loaded
    def loads(self, json, obj=None):
        """
        Load object from json string or bytes.

        Args:
            json: the json string or bytes.
            obj (optional): If provided, the object `obj` will be updated and no new object will be
                created.

        Return:
            created object, or updated object if `obj` is provided.
        """
        return self.__call__(json=json, obj=obj)
    def set_default_value(self, value=no_default_value):
        """
        Set the value returned if the path passed to load doesn't exist. It allows having a default
        object at the first run of the script, or when the json has been deleted, without raising a
        FileNotFoundError. decoder.set_default_value() without any argument removes the
        default_value and reactivates the raising of FileNotFoundError.
        """
        self.default_value = value
    def set_authorized_classes(self, classes):
        """
        Define the classes that serializejson is authorized to recreate from the `__class__`
        keywords in json, in addition to the usual classes. The usual classes are:
        complex, bytes, bytearray, Decimal, type, set, frozenset, range, slice, deque,
        datetime, timedelta, date, time, numpy.array, numpy.dtype.
        authorized_classes must be a list of classes, or of strings corresponding to the qualified
        names of classes (`module.class_name`). If the loaded json contains an unauthorized
        `__class__`, serializejson will raise a TypeError exception.

        .. warning::

            Do not load serializejson files from untrusted / unauthenticated sources without
            carefully setting the `authorized_classes` parameter.
            Never authorize "eval", "exec", "apply" or other functions or classes which could allow
            execution of malicious code with for example:
            ``{"__class__":"eval","__init__":"do_bad_things()"}``
        """
        self._authorized_classes_strs = _get_authorized_classes_strings(classes)
[docs]    def set_recognized_classes(self, classes):
        """
        Set the classes (strings with qualified names, or classes) that serializejson will try to recognize from their key names.
        """
        self._class_from_attributes_names = _get_recognized_classes_dict(classes)
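    # Sketch: a json dict whose keys match the public attributes of a recognized
    # class is recreated as that class even without a "__class__" field (Point is
    # a hypothetical class with a no-argument constructor, as required by
    # _get_recognized_classes_dict below):
    #
    # >>> decoder.set_recognized_classes([mymodule.Point])   # Point() has .x and .y
    # >>> p = decoder.loads('{"x": 1, "y": 2}')              # recreated as a Point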
[docs]    def set_updatables_classes(self, updatables):
        """
        Set the classes (strings with qualified names, or classes) that serializejson will try to update in place
        when they are already present in the object `obj` passed to `load` or `loads`. Otherwise the objects are recreated.
        """
        updatableClassStrs = set()
        if updatables is not None:
            for updatable in updatables:
                if isinstance(updatable, str):
                    updatableClassStrs.add(updatable)
                else:
                    updatableClassStrs.add(class_str_from_class(updatable))
        self.updatableClassStrs = updatableClassStrs
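    # Sketch of in-place update: classes declared updatable are updated rather
    # than recreated when an existing object is passed to load()/loads()
    # ("mymodule.AppState" and `app_state` are hypothetical):
    #
    # >>> decoder = serializejson.Decoder("state.json")
    # >>> decoder.set_updatables_classes(["mymodule.AppState"])
    # >>> decoder.load(obj=app_state)   # app_state is updated in place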
    def start_object(self):
        dict_ = dict()
        if self.root is None and self.json_startswith_curly:
            # not necessarily the real root: the actual root may for example be a list
            self.root = dict_
        if self._updating:
            id_ = id(dict_)
            self.ancestors.append(id_)
        return dict_

    def end_object(self, inst):
        if self._updating:
            self.ancestors.pop()  # pops itself
        class_str = inst.get("__class__", None)
        if class_str:
            if self._updating:
                if class_str in self.updatableClassStrs:
                    ancestor = self.ancestors[-1]
                    self.node_has_descendants_to_recreate.add(ancestor)
                else:
                    # ideally we would avoid exploring and rehydrate the descendants directly;
                    # the problem is that hydration is not in place, so the containers
                    # (e.g. a list) would not get their fields updated
                    return self._exploreDictToReCreateObjects(inst)
            else:
                return self._inst_from_dict(inst)
        elif "$ref" in inst and len(inst) == 1:
            if self.root:
                # try to replace right away if possible
                inst_potential = from_name(
                    inst["$ref"], accept_dict_as_object=True, root=self.root
                )
                if inst is inst_potential:
                    raise Exception('{"$ref": "%s"} pointing to itself' % inst["$ref"])
                if not type(inst_potential) is dict:
                    # check that it is not an object which has not been recreated yet
                    return inst_potential
                if "__class__" not in inst_potential:
                    return inst_potential
                inst_potential_epured = {
                    key: inst_potential[key]
                    for key in ["__class__", "__init__", "__new__"]
                    if key in inst_potential
                }
                inst = self._inst_from_dict(inst_potential_epured)
                inst_potential["__class__"] = inst
                return inst
            self.duplicates_to_replace.append(inst)
        elif self._class_from_attributes_names:
            # recognition of objects from their attribute names alone
            class_from_attributes_names = self._class_from_attributes_names
            attributes_tuple = tuple(sorted(inst))
            if attributes_tuple in class_from_attributes_names:
                inst["__class__"] = class_from_attributes_names[attributes_tuple]
                recognized = True
            else:
                attributes_set = set(attributes_tuple)
                for attribute_names in class_from_attributes_names.keys():
                    if attributes_set.issuperset(attribute_names):
                        inst["__class__"] = class_from_attributes_names[attribute_names]
                        recognized = True
                        break
                else:
                    recognized = False
            if recognized:
                if self._updating:
                    if inst["__class__"] in self.updatableClassStrs:
                        ancestor = self.ancestors[-1]
                        self.node_has_descendants_to_recreate.add(ancestor)
                    else:
                        # ideally we would avoid exploring and rehydrate the descendants directly;
                        # the problem is that hydration is not in place, so the containers
                        # (e.g. a list) would not get their fields updated
                        return self._exploreDictToReCreateObjects(inst)
                else:
                    # no authorization check: recognized objects are considered authorized
                    return instance(**inst)
        if self.dotdict:
            return dotdict(inst)
        return inst

    def __call__(self, json, obj=None):
        """
        Args:
            json: file-like, str or bytes (UTF-8) containing the JSON to be decoded.
            obj: object to update (optional).

        Returns:
            a python value

        Examples:

        >>> decoder = Decoder()
        >>> decoder('"€ 0.50"')
        '€ 0.50'
        >>> decoder(b'"\xe2\x82\xac 0.50"')
        '€ 0.50'
        >>> decoder(io.StringIO('"€ 0.50"'))
        '€ 0.50'
        >>> decoder(io.BytesIO(b'"\xe2\x82\xac 0.50"'))
        '€ 0.50'
        """
        blosc.set_nthreads(blosc.ncores)
        serialize_parameters.strict_pickle = self.strict_pickle
        serialize_parameters.setters = self.setters
        serialize_parameters.properties = self.properties
        self.converted_numpy_array_from_lists = set()
        self._updating = False
        # for duplicates -----------
        self.root = None
        if isinstance(json, str):
            self.json_startswith_curly = json.startswith("{")
        elif isinstance(json, bytes):
            self.json_startswith_curly = json.startswith(b"{")
        else:
            self.json_startswith_curly = json.read(1) in ("{", b"{")
            json.seek(0)
        self.duplicates_to_replace = []
        # for updating ------------------
        if obj is None:
            self._updating = False
            loaded = rapidjson.Decoder.__call__(self, json, chunk_size=self.chunk_size)
        else:  # update
            self._updating = True
            self.ancestors = deque()
            self.ancestors.append(None)
            self.node_has_descendants_to_recreate = set()
            loaded_dict = rapidjson.Decoder.__call__(self, json, chunk_size=self.chunk_size)
            loaded = self._exploreToUpdate(obj, loaded_dict)
        # restore the duplicates we could not resolve during deserialization
        # (in a list, or a duplicate referencing a parent)
        duplicates_to_replace = self.duplicates_to_replace
        if duplicates_to_replace:
            # probably not essential, but done out of caution:
            # https://docs.python.org/3/library/gc.html#gc.get_referrers
            gc.collect()
            while duplicates_to_replace:
                duplicate_to_replace = duplicates_to_replace.pop()
                referenced = from_name(
                    duplicate_to_replace["$ref"],
                    accept_dict_as_object=True,
                    root=loaded,
                )
                if referenced is duplicate_to_replace:
                    raise Exception('{"$ref": "%s"} pointing to itself' % duplicate_to_replace["$ref"])
                refs = gc.get_referrers(duplicate_to_replace)
                skip = (locals(), refs)
                for parent in refs:
                    if parent not in skip:
                        if type(parent) is dict:
                            for key, value in parent.items():
                                if value is duplicate_to_replace:
                                    parent[key] = referenced
                                    break
                        elif type(parent) is list:
                            for key, value in enumerate(parent):
                                if value is duplicate_to_replace:
                                    parent[key] = referenced
                                    break
                        elif hasattr(parent, "__slots__"):
                            for slot in parent.__slots__:
                                if (
                                    hasattr(parent, slot)
                                    and getattr(parent, slot) is duplicate_to_replace
                                ):
                                    setattr(parent, slot, referenced)
                                    break
        # clean ---------------
        del self.duplicates_to_replace
        if self._updating:
            del self.ancestors
            del self.node_has_descendants_to_recreate
            self._updating = False
        if obj is not None:
            return obj
        return loaded

    def __iter__(self):
        self._updating = False
        file = self.file
        if isinstance(file, str):
            if not os.path.exists(file):
                # iter() is needed: __iter__ must return an iterator, not a list
                return iter([self.default_value])
            self.file_iter = _json_object_file_iterator(file, mode="rb")
        else:
            raise Exception("not yet able to load_iter on %s" % str(type(file)))
        return self

    def _inst_from_dict(self, inst):
        class_str = inst["__class__"]
        if class_str in self._authorized_classes_strs or not isinstance(class_str, str):
            for key in ("__init__", "__new__", "__items__"):
                if key in inst:
                    if (
                        self.numpy_array_from_list
                        and isinstance(inst[key], numpy.ndarray)
                        and id(inst[key]) in self.converted_numpy_array_from_lists
                    ):
                        inst[key] = inst[key].tolist()
                    if key != "__items__" and class_str in remove_add_braces:
                        inst[key] = (inst[key],)
            if inst["__class__"] == "dict_non_str_keys":
                # kept here because it is too json-specific to live in tools
                # (which is shared with serializePython and serializeRepr)
                return dict_non_str_keys(inst)
            return instance(**inst)
        if self.unauthorized_classes_as_dict:
            if self.dotdict:
                warnings.warn(
                    f"{inst['__class__']} not in authorized_classes, left as dotdict",
                    Warning,
                )
                return dotdict(inst)
            warnings.warn(
                f"{inst['__class__']} not in authorized_classes, left as dict", Warning
            )
            return inst
        raise TypeError(f"{inst['__class__']} is not in authorized_classes")

    # @profile
    def _exploreToUpdate(self, obj, loaded_node):
        # handle the case where loaded_node is a dict --------------------------
        if isinstance(loaded_node, dict):
            # None rather than an empty set: an object may have no initialized attributes or slots
            obj_keys = None
            obj_class = obj.__class__
            if obj_class is dict and ("dict" in self.updatableClassStrs):
                is_dict = True
                obj_keys = set(obj)
            else:  # make sure it is an instance
                is_dict = False
                class_str = loaded_node.get("__class__")
                if (
                    (class_str is not None)
                    and (class_str in self.updatableClassStrs)
                    and (class_str == class_str_from_class(obj_class))
                ):
                    if class_str == "set":
                        # the set itself can be updated, BUT NOT THE OBJECTS INSIDE IT,
                        # because we cannot know which existing element corresponds to which json element
                        obj.clear()
                        obj.update(self._exploreDictToReCreateObjects(loaded_node))
                        return obj
                    if hasattr(obj, "__setstate__"):
                        # hasMethod(inst, "__setstate__") had to be replaced by hasattr(inst, "__setstate__")
                        # to be able to deserialize sklearn.tree._tree.Tree from json:
                        # "__setstate__" was not recognized as a method even though it is there
                        if "__state__" in loaded_node:
                            obj.__setstate__(loaded_node["__state__"])
                        else:
                            loaded_node.__delitem__("__class__")
                            if "__init__" in loaded_node:
                                loaded_node.__delitem__("__init__")
                            obj.__setstate__(loaded_node)
                        return obj
                    if hasattr(obj, "__dict__"):
                        # TO REVIEW: does not work with slots
                        obj_keys = set(obj.__dict__)
                    if hasattr(obj, "__slots__"):
                        if obj_keys is None:
                            obj_keys = set()
                        for slot in slots_from_class(obj_class):
                            if hasattr(obj, slot):
                                obj_keys.add(slot)
            if obj_keys is not None:
                if not is_dict:
                    setters = serialize_parameters.setters
                    if type(setters) is dict:
                        setters = setters.get(obj_class, False)
                    if setters is True:
                        setters = setters_names_from_class(obj_class)
                # update when the pre-existing object is an object (with __dict__; __slots__ not yet supported) or a dict --
                loaded_node_has_descendants_to_recreate = (
                    id(loaded_node) in self.node_has_descendants_to_recreate
                )
                # remove the object's attributes that are not in the loaded node
                only_in_obj = obj_keys - set(loaded_node)
                for key in only_in_obj:
                    if is_dict:
                        del obj[key]
                    elif not key.startswith("_"):
                        obj.__delattr__(key)
                # update or recreate the other attributes
                for key, value in loaded_node.items():
                    if key not in ("__class__", "__init__"):
                        if key in obj_keys:
                            if is_dict:
                                old_value = obj[key]
                            else:
                                old_value = obj.__getattribute__(key)
                            value = self._exploreToUpdate(old_value, value)
                        elif loaded_node_has_descendants_to_recreate:
                            if isinstance(value, dict):
                                value = self._exploreDictToReCreateObjects(value)
                            elif isinstance(value, list):
                                value = self._exploreListToReCreateObjects(value)
                        if is_dict:
                            obj[key] = value
                        elif setters and key in setters:
                            obj.__getattribute__(setters[key])(value)
                        else:
                            obj.__setattr__(key, value)
                return obj
            return self._exploreDictToReCreateObjects(loaded_node)
        # handle the case where loaded_node is a list --------------------------
        if isinstance(loaded_node, list):
            if isinstance(obj, list) and ("list" in self.updatableClassStrs):
                # update when the pre-existing object is a list
                len_obj = len(obj)
                del obj[len(loaded_node):]
                for i, value in enumerate(loaded_node):
                    if i < len_obj and isinstance(value, (list, dict)):
                        obj[i] = self._exploreToUpdate(obj[i], value)
                    elif i < len_obj:
                        # replace scalars in place (appending them would misorder the list)
                        obj[i] = value
                    else:
                        if isinstance(value, dict):
                            value = self._exploreDictToReCreateObjects(value)
                        elif isinstance(value, list):
                            value = self._exploreListToReCreateObjects(value)
                        obj.append(value)
                return obj
            else:  # otherwise replace
                return self._exploreListToReCreateObjects(loaded_node)
        # handle the other cases
        return loaded_node  # replace

    def _exploreDictToReCreateObjects(self, loaded_node):
        if id(loaded_node) in self.node_has_descendants_to_recreate:
            for key, value in loaded_node.items():
                if isinstance(value, dict):  # and "__class__" in value
                    loaded_node[key] = self._exploreDictToReCreateObjects(value)
                elif isinstance(value, list):
                    loaded_node[key] = self._exploreListToReCreateObjects(value)
        if "__class__" in loaded_node:
            return self._inst_from_dict(loaded_node)
        else:
            return loaded_node

    def _exploreListToReCreateObjects(self, loaded_node):
        for i, value in enumerate(loaded_node):
            if isinstance(value, dict):
                loaded_node[i] = self._exploreDictToReCreateObjects(value)
            elif isinstance(value, list):
                loaded_node[i] = self._exploreListToReCreateObjects(value)
        return loaded_node

    # ---------------------------------

    def _end_array_if_numpy_array_from_list(self, sequence):
        if _onlyOneDimSameTypeNumbers(sequence):
            array = numpy.array(sequence, dtype=type(sequence[0]))
            self.converted_numpy_array_from_lists.add(id(array))
            return array
        if len(sequence) and isinstance(sequence[0], ndarray):
            first_elt = sequence[0]
            first_elt_shape = first_elt.shape
            first_elt_dtype = first_elt.dtype
            if all(
                (
                    isinstance(elt, ndarray)
                    and elt.dtype is first_elt_dtype
                    and elt.shape == first_elt_shape
                )
                for elt in sequence
            ):
                array = numpy.array(sequence, dtype=first_elt_dtype)
                self.converted_numpy_array_from_lists.add(id(array))
                return array
        return sequence

    def _end_array_if_numpy_array_from_heterogenous_list(self, sequence):
        if _onlyOneDimNumbers(sequence):
            array = numpy.array(sequence)
            self.converted_numpy_array_from_lists.add(id(array))
            return array
        if len(sequence) and isinstance(sequence[0], ndarray):
            first_elt = sequence[0]
            first_elt_shape = first_elt.shape
            if all(
                (isinstance(elt, ndarray) and elt.shape == first_elt_shape)
                for elt in sequence
            ):
                array = numpy.array(sequence)
                self.converted_numpy_array_from_lists.add(id(array))
                return array
        return sequence

    def __next__(self):
        try:
            return rapidjson.Decoder.__call__(
                self, self.file_iter, chunk_size=self.chunk_size
            )
        except rapidjson.JSONDecodeError as error:
            self.file_iter.close()
            if error.args[0] == "Parse error at offset 0: The document is empty.":
                raise StopIteration
            else:
                raise
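    # Sketch: iterating a Decoder whose file is a top-level json list yields its
    # elements one at a time, using _json_object_file_iterator below to slice the
    # file into one json value per read ("log.json" is illustrative):
    #
    # >>> decoder = serializejson.Decoder("log.json")   # file content: [1, "two", {"n": 3}]
    # >>> for element in decoder:
    # ...     print(element)
    # 1
    # two
    # {'n': 3}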
# ----------------------------------------------------------------------------------------------------------------------------
# --- INTERNALS ----------------------------------------------------------------------------------------------------
# ----------------------------------------------------------------------------------------------------------------------------


class dotdict(dict):
    """dot notation access to dictionary attributes"""

    def __getattr__(self, attr):
        try:
            return self[attr]
        except KeyError:
            raise AttributeError(attr)

    __setattr__ = dict.__setitem__
    __delattr__ = dict.__delitem__


def bool_or_set(value):
    if value is None:
        return set()
    if isinstance(value, (bool, set)):
        return value
    if isinstance(value, (list, tuple)):
        return set(value)
    raise TypeError


def bool_or_dict(value):
    if value is None:
        return dict()
    if isinstance(value, (bool, dict)):
        return value
    if isinstance(value, (set, list, tuple)):
        return {key: True for key in value}
    raise TypeError


def dict_non_str_keys(dict_):
    d = dict()
    del dict_["__class__"]
    for key, value in dict_.items():
        try:
            key = loads(key)
        except Exception:
            if key.endswith("'"):
                if key.startswith("'"):
                    key = key[1:-1]
                elif key.startswith("b'"):
                    key = key[2:-1].encode("ascii_printables")
                elif key.startswith("b64'"):
                    key = b64decode(key[4:])
        else:
            if type(key) is list:
                key = tuple(key)
        d[key] = value
    return d


def all_keys_are_str(dict_):
    for key in dict_:
        if type(key) != str:
            return False
    return True


if use_numpy:
    _numpy_float_dtypes = set(
        (numpy.dtype("float16"), numpy.dtype("float32"), numpy.dtype("float64"))
    )
    _numpy_types = set(
        (
            numpy.bool_,
            numpy.int8,
            numpy.int16,
            numpy.int32,
            numpy.int64,
            numpy.uint8,
            numpy.uint16,
            numpy.uint32,
            numpy.uint64,
            numpy.float16,
            numpy.float32,
            numpy.float64,
        )
    )
    _numpy_float_types = set(
        (
            numpy.float16,
            numpy.float32,
            numpy.float64,
        )
    )
    _numpy_int_types = set(
        (
            numpy.int8,
            numpy.int16,
            numpy.int32,
            numpy.int64,
            numpy.uint8,
            numpy.uint16,
            numpy.uint32,
            numpy.uint64,
        )
    )
    _numpy_dtypes_to_python_types = {numpy.bool_: bool}
    for numpy_type in _numpy_int_types:
        _numpy_dtypes_to_python_types[numpy_type] = int
    for numpy_type in _numpy_float_types:
        _numpy_dtypes_to_python_types[numpy_type] = float
else:
    _numpy_types = set()

NoneType = type(None)

remove_add_braces = {
    "set",
    "frozenset",
    "tuple",
    "collections.OrderedDict",
    "collections.Counter",
}


def _close_for_append(fp, indent):
    if indent is None:
        try:
            fp.write(b"]")
        except TypeError:
            fp.write("]")
    else:
        try:
            fp.write(b"\n]")
        except TypeError:
            fp.write("\n]")


def _open_for_append(fp, indent):
    length = 0
    remove_last_square_close = True
    if isinstance(fp, str):
        path = fp
        if os.path.exists(path):
            fp = open(path, "rb+")
            # detect encoding
            bytes_ = fp.read(3)
            len_bytes = len(bytes_)
            if len_bytes:
                if bytes_[0] == 0:
                    if bytes_[1] == 0:
                        fp = open(path, "r+", encoding="utf_32_be")
                    else:
                        fp = open(path, "r+", encoding="utf_16_be")
                elif len_bytes > 1 and bytes_[1] == 0:
                    if len_bytes > 2 and bytes_[2] == 0:
                        fp = open(path, "r+", encoding="utf_32_le")
                    else:
                        fp = open(path, "r+", encoding="utf_16_le")
            # remove last ]
            remove_last_square_close = True
        else:
            fp = open(path, "wb")
            remove_last_square_close = False
    elif fp is None:
        raise Exception("Incorrect file (file, str or unicode)")
    if remove_last_square_close:
        fp.seek(0, 2)
        length = fp.tell()
        if length == 1:
            fp.close()
            raise Exception("serializejson can append only to serialized lists")
        if length >= 2:
            fp.seek(-1, 2)  # go to the last character
            lastcChar = fp.read(1)
            if lastcChar in (b"]", "]"):
                fp.seek(-2, 2)
                beforlastcChar = fp.read(1)
                if beforlastcChar in (b"\n", "\n"):
                    fp.seek(-2, 2)
                else:
                    fp.seek(-1, 2)  # go to the last character
                fp.truncate()
            else:
                fp.close()
                raise Exception("serializejson can append only to serialized lists")
    if length == 0:
        if indent is None:
            fp.write(b"[")
        else:
            fp.write(b"[\n")
    elif length > 2:
        if indent is None:
            try:
                fp.write(b",")
            except TypeError:
                fp.write(",")
        else:
            try:
                fp.write(b",\n")
            except TypeError:
                fp.write(",\n")
    return fp


def _open_with_good_encoding(path):
    # https://stackoverflow.com/questions/4990095/json-specification-and-usage-of-bom-charset-encoding/38036753
    fp = open(path, "rb")
    bytes_ = fp.read(3)
    fp.seek(0)
    len_bytes = len(bytes_)
    if len_bytes:
        if bytes_ == b"\xef\xbb\xbf":
            # should normally not happen: json files should never start with a BOM,
            # but if the file was created by hand in a text editor there may be one
            # (example: personnel.json)
            fp = open(path, "r", encoding="utf_8_sig")
        elif bytes_[0] == 0:
            if bytes_[1] == 0:
                fp = open(path, "r", encoding="utf_32_be")
            else:
                fp = open(path, "r", encoding="utf_16_be")
        elif len_bytes > 1 and bytes_[1] == 0:
            if len_bytes > 2 and bytes_[2] == 0:
                fp = open(path, "r", encoding="utf_32_le")
            else:
                fp = open(path, "r", encoding="utf_16_le")
    return fp


def _get_authorized_classes_strings(classes):
    if not type(classes) in (set, list, tuple):
        if classes is None:
            classes = set()
        else:
            classes = [classes]
    _authorized_classes_strs = authorized_classes.copy()
    for elt in classes:
        if not type(elt) is str:
            elt = class_str_from_class(elt)
        _authorized_classes_strs.add(elt)
    return _authorized_classes_strs


def _get_recognized_classes_dict(classes):
    if classes is None:
        return dict()
    if not isinstance(classes, (list, tuple)):
        classes = [classes]
    _class_from_attributes_names = dict()
    for class_ in classes:
        if isinstance(class_, str):
            classToRecStr = class_
            classToRecClass = class_from_class_str(class_)
        else:
            classToRecStr = class_str_from_class(class_)
            classToRecClass = class_
        serializedattributes = []
        instanceVide = classToRecClass()
        # use classToRecClass: class_ may be a string at this point
        for attribute in list(instanceVide.__dict__.keys()) + slots_from_class(classToRecClass):
            if not attribute.startswith("_"):
                serializedattributes.append(attribute)
        serializedattributes = tuple(sorted(serializedattributes))
        _class_from_attributes_names[serializedattributes] = classToRecStr
    return _class_from_attributes_names


class _json_object_file_iterator(io.FileIO):
    def __init__(self, fp, mode, **kwargs):
        io.FileIO.__init__(self, fp, mode=mode, **kwargs)
        self.in_quotes = False
        self.in_curlys = 0
        self.in_squares = 0
        self.in_simple = False
        self.in_object = False
        self.backslash_escape = False
        self.shedule_break = False
        self.in_chunk_start = 0
        self.s = None
        # s = io.FileIO.read(self, 1)
        # if s not in (b"[", "["):
        #     raise Exception('the json data must start with "["')
        if "b" in mode:
            self.interesting = set(b'\\"{}[]')
            self.separators = set(b", \t\n\r")
            self.chars = list(b'\\"{}[]')
        else:
            self.interesting = set('\\"{}[]')
            self.separators = set(", \t\n\r")
            self.chars = list('\\"{}[]')

    def read(self, size=-1):
        if self.shedule_break:
            self.shedule_break = False
            return ""
        (
            backslash,
            doublecote,
            curly_open,
            curly_close,
            square_open,
            square_close,
        ) = self.chars
        interesting = self.interesting
        separators = self.separators
        in_quotes = self.in_quotes
        in_curlys = self.in_curlys
        in_squares = self.in_squares
        in_simple = self.in_simple
        in_object = self.in_object
        backslash_escape = self.backslash_escape  # True if we just saw a backslash
        in_chunk_start = self.in_chunk_start
        if in_chunk_start == 0:
            s = self.s = io.FileIO.read(self, size)
        else:
            s = self.s
        for i in range(in_chunk_start, len(s)):
            ch = s[i]
            if in_simple:
                if ch in separators or ch in ("]", 93):  # "]" if str, 93 == ord("]") if bytes
                    if in_chunk_start < i:
                        # schedule a stop at the next read; otherwise reading would stop anyway
                        # and we could not reset self.shedule_break to False
                        self.shedule_break = True
                    self.in_chunk_start = (i + 1) % len(s)
                    self.in_quotes = False
                    self.in_curlys = 0
                    self.in_squares = in_squares
                    self.in_simple = False
                    self.in_object = False
                    return s[in_chunk_start:i]
            elif ch in interesting:
                check = False
                if in_quotes:
                    if backslash_escape:
                        # we must have just seen a backslash; reset that flag and continue
                        backslash_escape = False
                    elif ch == backslash:
                        # we are in a quote and we see a backslash; escape next char
                        backslash_escape = True
                    elif ch == doublecote:
                        in_quotes = False
                        check = True  # we are leaving something: a check is needed
                elif ch == doublecote:  # "
                    in_quotes = True
                    in_object = True
                elif ch == curly_open:  # {
                    in_curlys += 1
                    in_object = True
                elif ch == curly_close:  # }
                    in_curlys -= 1
                    check = True
                elif ch == square_open:  # [
                    in_squares += 1
                    if in_squares > 1:
                        in_object = True
                    else:
                        in_chunk_start = (i + 1) % len(s)
                elif ch == square_close:  # ]
                    in_squares -= 1
                    check = True
                    if not in_squares:
                        # we reached the end of the json list
                        return ""
                if check and not in_quotes and not in_curlys and in_squares < 2:
                    if in_chunk_start < (i + 1):
                        # schedule a stop at the next read; otherwise reading would stop anyway
                        # and we could not reset self.shedule_break to False
                        self.shedule_break = True
                    self.in_chunk_start = (i + 1) % len(s)
                    self.in_quotes = False
                    self.in_curlys = False
                    self.in_squares = in_squares
                    self.in_simple = False
                    self.in_object = False
                    return s[in_chunk_start: i + 1]
            elif not in_object:
                if ch in separators:
                    in_chunk_start = i + 1
                else:
                    in_simple = True
        self.in_quotes = in_quotes
        self.in_curlys = in_curlys
        self.in_squares = in_squares
        self.in_simple = in_simple
        self.in_object = in_object
        self.backslash_escape = backslash_escape
        self.in_chunk_start = 0
        if in_chunk_start:
            return s[in_chunk_start:]
        return s


id_to_path = dict()
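# Sketch of dotdict access, which is what Decoder(dotdict=True) returns for
# plain json dicts (values are illustrative):
#
# >>> d = dotdict({"x": 1, "y": 2})
# >>> d.x
# 1
# >>> d.z
# Traceback (most recent call last):
#     ...
# AttributeError: z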