Friday, August 28, 2015

Different approaches to dump and load objects using Python. Performance analysis.


The previous post gave you an example demonstrating the usage of the JSON format to store an object's value. 

We have several ways to store and load (or serialize and deserialize) the data that we collect during a run of our application. 
Before choosing one, we should think about what we will do with that data. Will we load it back into our application? Will we send it to another application (which could be written in another programming language, not necessarily Python)? Should we store the data in a format that can be understood by a human, so that the stored data is readable?
What approaches might we have if we used Python?
  1. JSON (http://json.org/)
  2. XML (http://www.w3.org/XML/)
  3. YAML (http://yaml.org/ and RFC is here)
  4. Pickle (Python 2 Docs, Python 3 Docs)
JSON, XML and YAML can be used to store information in a human-readable format, and data in these formats can be loaded by all modern programming languages. Pickle is a Python-specific module. Its advantage is that we can dump objects more complicated than the standard data structures and types without any issues and without adding special methods to our classes that convert class instances into a form suitable for dumping to XML, JSON or YAML.
So, it is time to look at each option in a bit more detail.

First of all, we should keep in mind that for serialization and deserialization we need a simple approach that allows us to store and load our data (simple variables, collections, objects) without any problems. We will look at modules that provide both a dump function and a load function.

JSON 

This format is supported by a Python standard module (module description for Python 3.X). As you may see from the documentation, the functionality provided by this module is pretty simple. It allows us to serialize an object into a string or a stream and to deserialize it back. The types that can be encoded and decoded include: the collection types dict, list and tuple; string types (str and unicode); numeric types (int, long and float); the boolean values True and False; and, of course, None.
This library provides both a dump function and a load function, so JSON is chosen for our experiment.
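For instance, a minimal round-trip sketch with the standard json module (the variable names are just for illustration):

import json

# Built-in types that json handles directly:
# dict, list, str/unicode, int/long/float, bool and None.
data = {"name": "example", "values": [1, 2, 3], "enabled": True, "comment": None}

encoded = json.dumps(data)      # object -> JSON string (dump() writes to a file object)
decoded = json.loads(encoded)   # JSON string -> object (load() reads from a file object)

assert decoded == data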

XML

XML support in Python is implemented in several modules. Full information can be found here. As I haven't found any existing module with the required dump/load functionality, we would have to write our own serialization and deserialization module, for example based on the Expat parser. That is not the purpose of this article, so we will not use XML in this experiment.

YAML

The PyYAML module can be used for working with this markup language. It provides simple functions to dump and load information. A comparison with JSON is available here and with XML here. This module has both dump and load functions, so we will use it in our experiment.
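A minimal sketch with PyYAML, again with illustrative names, restricted here to standard YAML tags:

import yaml  # provided by the PyYAML package

data = {"name": "example", "values": [1, 2, 3], "enabled": True}

# safe_dump()/safe_load() stick to standard YAML tags; dump()/load() can also
# handle Python-specific objects such as tuples, sets and class instances.
encoded = yaml.safe_dump(data)   # object -> human-readable YAML string
decoded = yaml.safe_load(encoded)

assert decoded == data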

Pickle

Pickle is a standard module that provides serialization and deserialization of objects. It converts an object hierarchy into a byte stream (the 'pickling' operation) and restores the object hierarchy from a byte stream (the 'unpickling' operation). This format is compared with JSON in the following article for Python 3.X.
Since Pickle is a standard module, it also contains both a dump function and a load function, and we will use this module as part of our experiment.
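A small sketch of a pickle round trip; CSample here is just an illustrative class, not the data preparation class used later in the experiment:

import pickle

class CSample(object):
    # A custom class: pickle handles instances without any extra helper methods.
    def __init__(self, name, tags):
        self.name = name
        self.tags = tags

original = {"sample": CSample("demo", set(["a", "b"])), "pair": (1, 2)}

blob = pickle.dumps(original)     # pickling: object hierarchy -> byte stream
restored = pickle.loads(blob)     # unpickling: byte stream -> object hierarchy

assert restored["pair"] == (1, 2) and restored["sample"].tags == set(["a", "b"])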

After this brief description, let's switch to the implementation details and the results of our experiment.

I've prepared a simple class and a set of functions to be used for serialization. Our scope will consist of:
  • an instance of a custom class (CData_Preparation) holding a string, a number, a dictionary, an array, a set and a tuple;
  • a standalone set;
  • a standalone tuple.
The module that is responsible for data preparation is available here.
Both YAML and Pickle provide full support for the types listed above. The experiment's output for both cases will be the same:

Serialization... 


Deserialization... 


Done... 

With JSON we have some limitations.

Serialization... 

Object 'CData_Preparation' is not JSON serializable
<data_preparation.CData_Preparation object at 0x0000000001F9FF28> is not JSON serializable
Object 'set' is not JSON serializable: 
set(['wrf2mr54', '6wrs43vw']) is not JSON serializable

Deserialization... 

Object 'CData_Preparation' could not be decoded after serialization: 
No JSON object could be decoded
Incorrect serialization/deserialization of Data object. 
Expected: String variable: w09ca5gayh
Number variable: 2
Dictionary: 
Key: 9up => Value: uo27txvz
Key: ca => Value: zbszhfo0c
Key: qv7o => Value: {'8': 'm70dc7m8'}
Array: ['od1cv', '6ge', 'oykyzj1y', 'zjsf6']
Set: set(['8rfif', 'yh6r', 'q7', '8a4bxzk3'])
Tuple: ('j5lw3b15', '4pp', 'pi1o')

Actual: None
Object 'set' could not be decoded after serialization: 
No JSON object could be decoded
Incorrect serialization/deserialization of Set object. 
Expected: set(['wrf2mr54', '6wrs43vw'])
Actual: None
Incorrect serialization/deserialization of Tuple object. 
Expected: ('35vyd', 'c', '8v1', '5vddg')
Actual: [u'35vyd', u'c', u'8v1', u'5vddg']

Done... 

Let's look at the JSON example more closely.
The first issue is related to the limitation that serialization/deserialization with JSON in Python does not support custom objects and sets, as described here.
The second issue is related to the conversion tables from Python objects to JSON objects and back. Notice that both a tuple and a list in Python are converted into an array in JSON, but an array in JSON is converted back only into a list in Python. We should keep this limitation in mind if we would like to use JSON for serialization and deserialization in our program. Both issues are easy to reproduce, as the sketch below shows.
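import json

# Sets (and arbitrary custom objects) are not serializable by default.
try:
    json.dumps(set(["wrf2mr54", "6wrs43vw"]))
except TypeError as error:
    print(error)   # "... is not JSON serializable"

# Tuples are encoded as JSON arrays, so they come back as lists.
restored = json.loads(json.dumps(("35vyd", "c", "8v1", "5vddg")))
print(type(restored))  # list, not tuple

# One possible workaround (not part of the original experiment): convert sets
# to lists on the fly with a custom default handler.
print(json.dumps(set(["a", "b"]), default=list))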

Several links to the source code:

Another important topic that applies to JSON, YAML and Pickle alike is the performance of serialization and deserialization.

Since all three approaches support the dictionary and the list as the most complex collections, let's generate several samples to measure performance. The list of samples includes the following cases:
  • A big, flat dictionary (depth of one)
  • A big array/list
  • A big, deeply nested dictionary
On each iteration I randomly generated each object and performed both a serialize operation and a deserialize operation on it, using JSON, YAML and Pickle. 
For time measurements I used the standard timeit module. Since serialization and deserialization took a significant amount of time for the generated dictionaries and arrays, I set the number of times to execute each operation (the number parameter in timeit) to one; a rough sketch of such a measurement is shown after the list below. All three types of data were generated using the data preparation module mentioned before. An additional module was developed for the performance measurements - it can be found here. During the initial set of experiments I ran into situations where a significant amount of time was required to serialize or deserialize an array or a dictionary with a thousand elements, so I tuned the size of both dictionaries and lists, as well as the depth of dictionaries, to keep the experiment's execution time reasonable. In the end the following was enough:
  • Dictionaries. The number of objects is from one to five. Each object can be a simple string, an array or another dictionary. Nested arrays and dictionaries get the same configuration settings (number of "leaves") as the parent object. The depth of a dictionary depends on the value of a parameter of a dedicated procedure.
  • Arrays. Arrays in my experiment are simple arrays with between ten and fifty string elements.
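A rough sketch of the measurement with timeit and number=1; the sample object here is a trivial stand-in for the real data preparation module linked above:

import json
import timeit

import yaml

# A stand-in for the generated data (the real generator is linked above).
sample = dict(("key%d" % i, ["value%d" % i] * 5) for i in range(5))

# number=1: run each serialization/deserialization exactly once per measurement.
json_dump_time = timeit.timeit(lambda: json.dumps(sample), number=1)
yaml_dump_time = timeit.timeit(lambda: yaml.safe_dump(sample), number=1)

encoded = json.dumps(sample)
json_load_time = timeit.timeit(lambda: json.loads(encoded), number=1)

print(json_dump_time, yaml_dump_time, json_load_time)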
Let's look at the results of this experiment.
I launched it two times. Each launch took about one day!
As a reminder, each iteration consists of the following:
  1. Generate two dictionaries and one array.
  2. Execute serialization with JSON, YAML and Pickle for the objects generated in step one. Measure the execution time for each module (in seconds).
  3. Execute deserialization with JSON, YAML and Pickle for the objects generated in step one. Measure the execution time for each module (in seconds).
It would not be accurate to compare absolute times, since new dictionaries and arrays were generated on each iteration. It is better to look at the relative time for serialization and deserialization. 
I collected statistics for ten iterations, and according to the results we have the following. First of all, JSON is the fastest way to serialize and deserialize the built-in data types. Now in detail:
  • Big, flat dictionary. YAML is about 100 times slower than JSON on serialization and about 115 times slower than JSON on deserialization. Pickle is about 7 times slower on serialization and about 5 times slower on deserialization than JSON.
  • Big arrays. YAML is about 190 times slower on serialization and about 210 times slower on deserialization than JSON. Pickle is about 16 times slower on serialization and about 9 times slower on deserialization than JSON.
  • Big, deeply nested dictionary. YAML is about 210 times slower on serialization and about 170 times slower on deserialization than JSON. Pickle is about 16 times slower on serialization and 7 times slower on deserialization than JSON.
The performance gap for Pickle, and especially for YAML, compared to JSON is very interesting. 
It looks like we should use the JSON approach to serialization and deserialization if we would like to work with Python's built-in data types like strings, numbers, lists and dictionaries. 
The full execution log for the first five iterations (with dumps of the objects) is located here, for the second five iterations - here, for the third five iterations - here, and for the fourth five iterations - here.

Another interesting comparison is serialization time vs. deserialization time. In this view, only Pickle shows practically the same time for both operations, with a ratio close to one. For both JSON and YAML the ratio is about 0.6. It is also interesting that JSON and YAML show such similar behaviour for both serialization and deserialization.

I have to admit that I did not conduct a performance analysis for YAML and Pickle with custom user objects in this experiment. 

2 comments:

  1. Thanks for doing this comparison! I would be interested in knowing how pickle compares when using cPickle because I always use that module rather than the pure Python one. According to the docs cPickle can be 1000 times faster.

    ReplyDelete
    Replies
    1. Hi Mark,

      Thanks a lot for your comment!
      I added the required support for cPickle. I measured the same number of iterations as in my original article. The cPickle module has approximately the same performance characteristics as the json module. I uploaded the results of my experiments to my GitHub account. A summary of the measurements (JSON / YAML / Pickle / cPickle) is here: https://github.com/MikeLaptev/sandbox_python/blob/master/mera/serialization_deserialization/Performance%20Summary.%20Additional%20launch%20with%20cPickle.xlsx

      I also uploaded the source code and detailed outputs to GitHub (https://github.com/MikeLaptev/sandbox_python/tree/master/mera/serialization_deserialization)

      I didn't see a 1000x improvement, but cPickle is faster than Pickle, and again, cPickle's performance is on the same level as JSON's.
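      For reference, the usual idiom to prefer cPickle when it is available (a sketch of the common Python 2 pattern, not a quote from the measurement code):

      try:
          import cPickle as pickle   # C implementation, Python 2 only
      except ImportError:
          import pickle              # Python 3 uses the C accelerator automatically

      blob = pickle.dumps({"answer": 42}, pickle.HIGHEST_PROTOCOL)
      print(pickle.loads(blob))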

      Thanks,
      Mike

      Delete