Getting pimcore ready for big datasets


Hi folks,

pimcore 1.4.5, with lots of improvements and new features, has been available for a few weeks now. We hope you like it!

One major objective during the development of this version was performance testing and optimization. In this blog post we want to share with you some of the tests we did and the results we got, and by doing so give you some answers about the performance and scalability of pimcore in combination with big datasets – a question we frequently hear from you, our clients and others. 

What we did

Our goal was to test and optimize the pimcore backend and API (cli scripts for importing, updating, …) for performance with big datasets. We concentrated our tests on the objects part of pimcore, but where possible the optimizations were applied to documents and assets as well.

Our dataset

To get a big dataset, we took a real-world example of a PIM installation and enlarged it extensively. By doing so, we ended up with the following:

  • Object count: 3.2 million
  • Product count: 2.5 million
  • Relation count: 20 million
  • Version count: 5 million

The relation count might seem a little low. But this is in fact a quite realistic number, for two reasons:

  • There are lots of ‘small’ accessory products which don’t have many relations themselves.
  • The clever use of inheritance not only reduces the maintenance effort but also decreases the number of relations to save. If relations are assigned to parent objects and inherited by the concrete products, the relation itself has to be saved only once, although many products use it (see the sketch after this list).
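
To illustrate the second point: with inheritance enabled in the class definition, a relation stored on a parent object is transparently returned by the getters of its children. The class and field names below ("Product", "accessories") are just examples from a fictitious data model, not part of pimcore itself.

    <?php
    // "Product" class with inheritance enabled; the "accessories" relation
    // is maintained only on the parent object.
    $parent  = Object_Product::getByPath("/products/series-x");
    $variant = Object_Product::getByPath("/products/series-x/model-1");

    // The relation is saved once, on the parent. Because of inheritance the
    // getter on the variant falls back to the parent's value:
    $accessories = $variant->getAccessories(); // same objects as $parent->getAccessories()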

The products themselves were quite complex, with extensive use of different datatypes for attributes, object bricks, field collections and inheritance.

  • Attributes per product: 60
  • Average object bricks per product: 4
  • Average attributes per object brick: 30

Our results

First of all, it has to be said that our test dataset was not designed to produce good performance figures, but to be as realistic as possible and therefore give a real-world impression of how pimcore performs. Performance also depends strongly on the data model and the use cases; with less complex objects you will get completely different results. So these results can only give a feeling for what is easily possible.

The performance of pimcore while working with the backend is quite hard to quantify since it depends on many different factors.

But from an editor's perspective, normal, quick work was possible and the pimcore backend stayed perfectly responsive.

Most of the common interactions (like opening subtrees, opening objects, moving objects, saving objects, searching for objects) were executed in less than a second most of the time. If the cache was not yet filled, a huge number of sub-objects had to be loaded, or complex full-text searches were made, the requests took about two to five seconds, ten at the most.

But of course there are interactions (like renaming or moving folders with a huge number of objects within them) which took much longer, simply because of the amount of data that has to be processed.

In terms of batch script performance we made three simple tests – copying products and saving the new ones (to simulate import performance), updating and saving existing products, and deleting products:

                        1,000 products    10,000 products
    Copy products       50 sec            490 sec
    Update products     80 sec            480 sec
    Delete products     40 sec            420 sec

Again, one needs to take into consideration that in this case the products were quite complex and that the results of such tests depend strongly on the use case, the data model and so on.
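
For those interested, the update test boiled down to a script along the following lines. It is only a rough sketch: "Object_Product" and its fields come from our example class definition, and the bootstrap include may differ depending on how your cli scripts are set up.

    <?php
    // bootstrap pimcore for the command line (as done by the cli scripts shipped with pimcore)
    include_once("pimcore/cli/startup.php");

    // load the products in pages to keep memory usage under control
    $list = new Object_Product_List();
    $list->setCondition("price IS NOT NULL"); // "price" is an example field
    $list->setLimit(1000);
    $list->setOffset(0);

    foreach ($list->load() as $product) {
        /* @var $product Object_Product */
        $product->setPrice($product->getPrice() * 1.1); // example modification
        $product->save();                               // saving also creates a new version
        // $product->delete();                          // the delete test simply removed the objects instead
    }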

How we did it

First of all, we got ourselves a server – not the biggest or newest one, but the important thing is plenty of RAM and fast disk access:

  • Processor: 2 × Quad-Core AMD Opteron 2350
  • RAM: 48 GB
  • HDD (pimcore): SATA, 1.5 TB
  • HDD (database): SSD, 120 GB

Installed Software

  • Debian 6.0.4
  • php 5.3.9-ZS5.6.0
  • Apache/2.2.16 (Debian)

Configuration of database

Since MySQL has seen lots of performance improvements over the last versions, it is important to use the latest one (in our case MySQL 5.5.20).

Configuration-wise, we supplied MySQL with plenty of RAM and tuned the cache sizes etc. for our use case. You can find information about that in the MySQL documentation, at http://www.mysqlperformanceblog.com/ or on Google ;).

We also took advantage of the partitioning feature of MySQL and split up the big tables into partitions.

Again, the optimal MySQL configuration is highly dependent on your use case and needs to be planned carefully during the setup of the system. Most performance gains can be achieved simply by setting up MySQL properly.
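
As an example of the partitioning mentioned above: a large table such as the versions table can simply be hashed over its primary key. Which tables you partition, and into how many partitions, depends entirely on your dataset, so take this as an illustration only.

    ALTER TABLE versions
        PARTITION BY HASH(id) PARTITIONS 16;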

Configuration of cache

Since memcache doesn’t support tags, we simulate this feature with a database table. But with datasets of this size, that solution is far too slow.

In our tests we used the NoSQL database MongoDB (http://www.mongodb.org/) as cache instead – with great results. Tagged caching is supported this way out-of-the-box and it is really fast. The MongoDB cache backend (Pimcore_Cache_Backend_Mongodb) is shipped with pimcore 1.4.5 and can be configured in the cache.xml. 
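
A cache.xml for the MongoDB backend looks roughly like the following. The connection options shown here (host, port, dbname, collection) are meant purely as an illustration, so please check which options the backend class actually accepts before using it.

    <?xml version="1.0"?>
    <zend-config xmlns:zf="http://framework.zend.com/xml/zend-config-xml/1.0/">
        <backend>
            <type>Pimcore_Cache_Backend_Mongodb</type>
            <custom>true</custom>
            <options>
                <host>localhost</host>
                <port>27017</port>
                <dbname>pimcore_cache</dbname>
                <collection>cache</collection>
            </options>
        </backend>
    </zend-config>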

Configuration for cli scripts

Since php is single-threaded and not exactly built for long-running import processes, we looked for a way to deal with these drawbacks and came up with the job server gearman (http://gearman.org). The idea behind it is to

  • split up huge workloads (of importing, updating, …) to small work packages and
  • run them in parallel.

With gearman and a small framework we built around it, we now have a generic way to do this: we can take advantage of multi-core server processors and get quite good performance when importing or updating pimcore data.
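
Stripped of our framework, the basic pattern with the php gearman extension looks like this. The job name ("updateProductChunk"), the chunk size and the product class are our own examples; the framework around it mainly adds job management, logging and error handling.

    <?php
    // ---- producer: split the workload into small packages ----
    $client = new GearmanClient();
    $client->addServer("127.0.0.1", 4730);

    $productIds = array(/* ids collected e.g. via an Object_Product_List */);
    foreach (array_chunk($productIds, 100) as $chunk) {
        // one background job per chunk; the workers pick them up in parallel
        $client->doBackground("updateProductChunk", serialize($chunk));
    }

    // ---- worker: started once per CPU core ----
    $worker = new GearmanWorker();
    $worker->addServer("127.0.0.1", 4730);
    $worker->addFunction("updateProductChunk", function (GearmanJob $job) {
        foreach (unserialize($job->workload()) as $id) {
            $product = Object_Product::getById($id);
            // ... modify the product ...
            $product->save();
        }
    });
    while ($worker->work());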

What we learned

The most important thing we learned is that pimcore scales very well. Since pimcore is built on a solid technical base, it can take full advantage of all the performance and scalability features of that base.

So right now we are far from the limit of what is possible. In our case we used just one single server with quite old components. Using modern components, Intel processors (for which php is better optimized), a dedicated database server, or more than one application server are just a few possible next steps.

But of course, during our tests and with your input in the forum we found a few bottlenecks within pimcore all of which we fixed in pimcore 1.4.5. So if you plan to set up a big system, use the latest version.  

So that is that. We hope this gives you a good impression of what is possible with pimcore and how you can make it happen.

As always, comments, suggestions and feedback are welcome. 

Whatever you are up to with pimcore in the upcoming weeks, may it be good fun and profitable. 
Have a nice day and a good spring,

Christian and the pimcore team

If you want to follow along or help out, check out our blog section on a regular basis. You can also follow the development in real-time on GitHub.
