Getting pimcore ready for big datasets
Hi folks,
pimcore 1.4.5, with lots of improvements and new features, has been around for a few weeks now. We hope you like it!
One major objective during the development of this version was performance testing and optimization. In this blog post we want to share some of the tests we did and the results we got, and thereby answer a question we frequently hear from you, our clients and others: how do the performance and scalability of pimcore hold up with big datasets?
What we did
Our goal was to test and optimize the pimcore backend and API (CLI scripts for importing, updating, …) for performance with big datasets. We concentrated our tests on the objects part of pimcore, but all optimizations were also applied to documents and assets where possible.
Our dataset
To get a big dataset, we took a real world example of a PIM installation and enlarged the dataset extensively. By doing so, we ended up with the following dataset:
| Object count   | 3.2 million |
| Product count  | 2.5 million |
| Relation count | 20 million  |
| Version count  | 5 million   |
The relation count might seem a little low. But it is in fact quite a realistic number, for two reasons:
- There are lots of ‘small’ accessory products which don’t have that many relations themselves.
- The clever use of inheritance not only reduces the maintenance effort but also decreases the number of relations to save: if relations are assigned to parent objects and inherited by the concrete products, the relation itself has to be saved only once, even though many products use it (see the sketch below this list).
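As a quick illustration of that second point, here is a minimal sketch. The class and field names (Object_Product, Object_Accessory, accessories) are made up for this example and depend entirely on your own data model:

```php
<?php
// Hypothetical example: the relation is stored once on the parent object
// and inherited by every concrete product below it.
$cable = Object_Accessory::getById(10);        // some accessory objects
$bag   = Object_Accessory::getById(11);

$group = Object_Product::getById(123);         // a parent object / product group
$group->setAccessories(array($cable, $bag));   // this relation is saved only once
$group->save();

// A concrete product below the group leaves the field empty; with inheritance
// enabled on the class, the getter falls back to the value of the parent.
$product = Object_Product::getById(456);
$accessories = $product->getAccessories();     // returns the group's relation
```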
The products themselves were quite complex, with extensive use of different datatypes for attributes, object bricks, field collections and inheritance.
| Attributes per product              | 60 |
| Average object bricks per product   | 4  |
| Average attributes per object brick | 30 |
Our results
First of all, it has to be said that our test dataset was not designed to produce good performance figures, but to be as realistic as possible and therefore give a real-world impression of how pimcore performs. Performance also depends strongly on the data model and the use cases; with less complex objects you will get completely different results. So these results can only give a feeling for what is easily possible.
The performance of pimcore while working with the backend is quite hard to quantify, since it depends on many different factors.
But from an editor's perspective, normal, fast working was possible and the pimcore backend stayed perfectly responsive.
Most of the common interactions (like opening sub-trees, opening objects, moving objects, saving objects, searching for objects) were executed in less than a second most of the time. If the cache was not filled, a huge number of sub-objects had to be loaded or complex fulltext searches were made, requests took about two to five seconds, ten at most.
But of course there are interactions (like renaming or moving folders containing a huge number of objects) which took much longer, simply because of the amount of data that has to be processed.
In terms of batch script performance we ran three simple tests: copying products and saving the new ones (to simulate import performance), updating products and saving the modified products, and deleting products:
|                 | 1,000 products | 10,000 products |
| Copy Products   | 50 sec         | 490 sec         |
| Update Products | 80 sec         | 480 sec         |
| Delete Products | 40 sec         | 420 sec         |
Again, one needs to take into consideration that in this case the products were quite complex and that the results of such tests depend strongly on the use case, data model and so on.
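To give an idea of what such a batch script looks like, here is a minimal sketch of the update case. Object_Product_List and the price field are hypothetical names; the generated classes depend on your own data model, and the script assumes the usual pimcore CLI bootstrap has already run:

```php
<?php
// Minimal sketch of a batch update (hypothetical class and field names).
$list = new Object_Product_List();
$list->setCondition("o_published = 1");   // example condition, adjust as needed
$list->setLimit(1000);

foreach ($list->load() as $product) {
    $product->setPrice($product->getPrice() * 1.1); // modify an attribute
    $product->save();                               // persist and create a new version
}
```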
How we did it
First of all, we got ourselves a server. Not the biggest or newest one; the important thing is plenty of RAM and fast disk access:
| Processor    | 2 x Quad-Core AMD Opteron(tm) Processor 2350 |
| RAM          | 48 GB                                        |
| HDD pimcore  | SATA, 1.5 TB                                 |
| HDD database | SSD, 120 GB                                  |
Installed Software
- Debian 6.0.4
- php 5.3.9-ZS5.6.0
- Apache/2.2.16 (Debian)
Configuration of database
Since MySQL has made lots of performance improvements over the last versions, it is important to use the latest version (in our case MySQL 5.5.20).
Configuration-wise, we gave MySQL plenty of RAM and tuned cache sizes etc. for our use case. You can find information about that in the MySQL documentation, at http://www.mysqlperformanceblog.com/ or via Google ;).
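Just to give a rough idea, here is an illustrative my.cnf excerpt for a dedicated database server with lots of RAM. The values are assumptions for this example, not the exact settings we used, and have to be adapted to your hardware and dataset:

```ini
[mysqld]
# InnoDB gets the lion's share of the RAM on a dedicated database server
innodb_buffer_pool_size        = 32G
innodb_log_file_size           = 512M
innodb_file_per_table          = 1
# trade a tiny bit of durability for faster writes during imports
innodb_flush_log_at_trx_commit = 2
query_cache_size               = 128M
table_open_cache               = 4096
max_connections                = 200
```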
We also took advantage of the partitioning feature of MySQL and split up the big tables into partitions.
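For example (the table and column names below are placeholders, not necessarily pimcore's actual schema), a big table can be split into hash partitions like this:

```sql
-- Placeholder names: split a large relation table into 16 hash partitions.
-- Keep in mind that in MySQL the partitioning key has to be part of every
-- unique key of the table.
ALTER TABLE object_relations_big
    PARTITION BY HASH (src_id)
    PARTITIONS 16;
```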
Again, the optimal MySQL configuration is highly dependent on your use case and needs to be planned carefully when setting up the system. Most performance improvements can be achieved simply by setting up MySQL properly.
Configuration of cache
Since memcache doesn’t support tags, we simulate this feature with a database table. With datasets of this scale, however, that solution is far too slow.
In our tests we used the NoSQL database MongoDB (http://www.mongodb.org/) as the cache instead, with great results. Tagged caching is supported out-of-the-box this way and it is really fast. The MongoDB cache backend (Pimcore_Cache_Backend_Mongodb) is shipped with pimcore 1.4.5 and can be configured in cache.xml.
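To illustrate why MongoDB fits this use case so well, here is a conceptual sketch of tag-based caching with the Mongo php extension. This is not the code of the shipped backend, just the underlying idea: tags are stored as an array field, so invalidating by tag is a single query.

```php
<?php
// Conceptual sketch only - not Pimcore_Cache_Backend_Mongodb itself.
$mongo = new Mongo('mongodb://localhost:27017');
$cache = $mongo->selectDB('pimcore_cache')->selectCollection('items');

// store a cache entry together with its tags
$cache->save(array(
    '_id'  => 'object_1234',
    'data' => serialize(array('name' => 'demo product')),
    'tags' => array('object_1234', 'class_product'),
));

// read a single entry by its id
$entry = $cache->findOne(array('_id' => 'object_1234'));

// invalidate everything tagged 'class_product' with one query
$cache->remove(array('tags' => 'class_product'));
```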
Configuration for cli scripts
Since php is single-threaded and not exactly built for long-running import processes, we looked for a way to deal with these drawbacks and came up with the job server gearman (http://gearman.org). The idea behind it is to
- split up huge workloads (importing, updating, …) into small work packages and
- run them in parallel.
With gearman and a small framework we built around it, we now have a generic way to do this: we can take advantage of multi-core server processors and get quite good performance when importing or updating pimcore data.
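A minimal sketch of the idea using php's gearman extension (this is not the framework mentioned above; the job name, chunk size and the Object_Product class are assumptions for this example):

```php
<?php
// Client side: split the workload into chunks and queue one background job each.
$client = new GearmanClient();
$client->addServer('127.0.0.1', 4730);

$ids = range(1, 100000);                          // IDs of the products to update
foreach (array_chunk($ids, 500) as $chunk) {
    $client->doBackground('updateProducts', serialize($chunk));
}

// Worker side: start several of these workers to process the chunks in parallel.
$worker = new GearmanWorker();
$worker->addServer('127.0.0.1', 4730);
$worker->addFunction('updateProducts', function (GearmanJob $job) {
    foreach (unserialize($job->workload()) as $id) {
        $product = Object_Product::getById($id);  // hypothetical generated class
        if ($product) {
            $product->setPrice(9.99);             // hypothetical field
            $product->save();
        }
    }
});
while ($worker->work());
```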
What we learned
The most important thing we learned is that pimcore is perfectly scalable. Since pimcore is built on a solid technical base, it can take full advantage of all the performance and scalability features of that base.
So right now we are far from the limit of what is possible. In our case we used just a single server with quite old components. Modern components, Intel processors (for which php is better optimized), a dedicated database server or more than one application server are just a few possible next steps.
But of course, during our tests and with your input in the forum, we found a few bottlenecks within pimcore, all of which we fixed in pimcore 1.4.5. So if you plan to set up a big system, use the latest version.
So that is that. We hope this gives you a good impression of what is possible with pimcore and how you can make it happen.
As always, comments, suggestions and feedback are welcome.
Whatever you are up to with pimcore in the upcoming weeks, may it be good fun and profitable.
Have a nice day and a good spring,
Christian and the pimcore team