Comments (12)
We just ran into the same issue as bitbucket-import. The reason this is "wrong" is two-fold:
- Calling insert_batch() invalidates the last_insert_id, which is unexpected by anyone who didn't read the insert_batch() code.
- Calling insert_batch() does not maintain isolation integrity of systems that implement an isolation solution for multiple insert (such that all inserts are guaranteed to be all-or-nothing and sequential without a race condition from an external writer).
In the case of MySQL at least, last_insert_id is guaranteed to point to the first item inserted in an autoincrement table. That guarantee is lost if you're using Codeigniter's insert_batch. What's worse, is if you need the value of the first insert there is now no safe way to obtain that information, it is lost forever.
I didn't notice any reason in the comments for choosing 100 for the limit. Perhaps it is a limit of one of the particular database systems, but if that's the case, it seems best to impose the limit in the driver for that specific database. My bug-fix recommendation would be to remove the limit altogether and send the entire insert query, or at least to document in the function why there is an imposed limit and how said limit was chosen. The MySQL limit seems to be constrained only by the maximum packet size which defaults to 16M.
How to fix outside of Codeigniter:
In case this issue isn't fixed and you're reading this on the Internet, the fix I would recommend would be to implement your own form of batch insert that does not use an arbitrary limit based on the average number of fingers possessed by humans. ;-) Or, you could re-write the insert_batch function to be wrapped in a transaction (so you know no other inserts asynchronously get added) that saves the last_insert_id from the first insert and returns it upon successful completion of the transaction. That solution is less clean, however, as you might have other code expecting a true/false answer or relying on last_insert_id, which would still be corrupted.
from codeigniter.
Hi Andrey & Mat, thanks for the quick response.
Details and examples below, but the main issue is this: insert_batch silently changes the API to a database. One of the quintessential aspects of a database is atomicity. From wikipedia:
In database systems, atomicity (or atomicness; from Greek a-tomos, undividable) is one of the ACID transaction properties. An atomic transaction is a series of database operations such that either all occur, or nothing occurs. The series of operations cannot be divided apart and executed partially from each other, which makes the series of operations "indivisible", hence the name.
In Codeigniter, the implementation of insert_batch defies the very definition of "atomicity" and "indivisible" for all of the databases behind it. The fact that MySQL has a limit means that at some point it will return an error, which the user code can handle. However, insert_batch makes the operation non-atomic which may result in errors the user cannot handle (i.e. non-deterministic change to last_insert_id). Sorry, I was probably unclear when I said "invalidates", but that's what I meant. It silently (no errors, etc.) changes the behavior of the underlying DB and I can no longer accurately obtain the IDs returned by the insert.
Because atomicity is broken, not only is last_insert_id broken, but what happens if one of the chunks in the middle fails? Now I have 1 set of IDs inserted and another set (or sets) that failed with no way to correlate what worked and what didn't. It's why the ACID principle is so important to databases.
As it was reported earlier, I suspect that other users are running into it. We didn't notice until data started getting corrupted in the asynchronous case (see below). It would be very easy to use this and never know that inserts > 100 could be corrupting your data (i.e. because you expect last_insert_id to be deterministic).
Ok, the examples and answer to @narfbg 's other question:
The first one? Am I missing something here? Isn't it supposed to be the last one?
No, as per the MySQL documentation, last_insert_id is always the FIRST entry in multiple inserts.
Important
If you insert multiple rows using a single INSERT statement, LAST_INSERT_ID() returns the value generated for the first inserted row only. The reason for this is to make it possible to reproduce easily the same INSERT statement against some other server.
For an example, let's say I'm inserting 500 entries. The first entry is inserted as id 2000. I do a MySQL multiple insert (not using insert_batch, but just a raw query). Calling last_insert_id will return 2000 since it was the first entry. I inserted 500 entries, so I now know my IDs are 2000 to 2499 (or of course if we hit a limit somewhere it will fail, and I will know because I received an error).
When I use insert_batch, two things go wrong here. In the non-threaded (simpler) case, I insert the same 500 entries in a table whose next ID will be 2000. After this call, however, last_insert_id won't return 2000 as we would expect, it will return 2400. Same as before, my IDs are 2000 to 2499, but now last_insert_id points somewhere in the middle. Not what the user is expecting.
In the threaded case, I can't even determine what my IDs are. For example, say insert_batch does its first batch, IDs 2000 through 2099. Then _insert_batch returns and our loop is about to call it again with the next batch. Right at that moment another thread writes 5 entries into the database. Now my second set is going to be 2104 through 2203 and information is lost (the list of IDs is now non-deterministic). My ids are now 2000-2099, 2104-2203, and so forth. By breaking my atomic query into a non-atomic query it has broken the ACID transaction property of such a database.
I hope that makes more sense. Please let me know if my examples are confusing or if I've brought up other questions.
Thanks!
from codeigniter.
How does that make it wrong?
from codeigniter.
Calling insert_batch() invalidates the last_insert_id
What do you mean by "invalidate" exactly?
In the case of MySQL at least, last_insert_id is guaranteed to point to the first item inserted in an autoincrement table
The first one? Am I missing something here? Isn't it supposed to be the last one?
from codeigniter.
http://dev.mysql.com/doc/refman/5.7/en/information-functions.html#function_last-insert-id
Apparently, MySQL's last_insert_id()
function returns the first ID inserted by a batch insert. CodeIgniter doesn't use that function directly, but I'm not sure whether MySQLi's insert_id
uses the function (or mimics its behavior). Either way, if you do a batch insert of more than 100 rows, CI's insert_id()
function is probably not going to give you a useful value (though I wouldn't expect the behavior of MySQL's last_insert_id()
function if I wasn't reading the MySQL manual).
One of the reasons for splitting up a batch insert is that MySQLi (and probably other PHP database extensions) has a limit on the size of the query ( http://dev.mysql.com/doc/refman/5.7/en/packet-too-large.html ).
While 100 is arbitrary, it apparently works for most common uses of insert_batch()
, and anything non-arbitrary would have to be a platform-specific implementation, as you would have to determine the actual limit for the platform, then divide the insert into the appropriate size (which, at least for MySQLi, would break the batch down into a number of inserts based on the amount of data inserted into each row, which would be unnecessary 99.9% of the time).
from codeigniter.
At the very least, adding a "note" to the insert_batch public documentation would be worthwhile. Given that insert_batch and insert_id are both part of the public API, inserting >100 rows and retrieving the insert_id should either be supported, or marked as a known limitation. If keeping current behavior and updating documentation is the desired path, I'm happy to submit a pull request for the docs.
from codeigniter.
I've got another idea about an eventual solution, but I still don't quite see the problem and am waiting for an answer to my first question for @bgeisel1.
from codeigniter.
To be honest, most of what you said is kind of irrelevant (and further confusing), but I believe I've got what you mean. :)
Does that last commit resolve the problem?
from codeigniter.
That commit is definitely an improvement, since it would be possible to pass count($set)
in as $batch_size
and bypass the batching mechanism. If you were to call insert_batch
that way, I think you would re-gain access to insert_id
afterward.
I don't like that the current batch implementation can silently fail to insert some of the batches, such as experienced by this user:
http://stackoverflow.com/questions/18015852/codeigniters-insert-batch-with-thousands-of-inserts-has-missing-records
I would suggest wrapping the batch inserts in a transaction, but not all databases support that. Any ideas on how to prevent some of the inserts from silently failing?
from codeigniter.
That's not possible if you don't utilize the new parameter, not in a way that would work on all databases.
from codeigniter.
That's not possible if you don't utilize the new parameter, not in a way that would work on all databases.
That's really the reason it's such a bad idea to break a database query into chunks -- it will not work on all databases. It will seem to work until using it in a large production environment. Then it will silently corrupt data.
Breaking the query into chunks is bad. In some very valid use cases, it corrupts data.
Not breaking the query into chunks is good. It models the way databases work and doesn't corrupt data.
My recommendation would be to default insert_batch to not break the query into chunks. It was a mistake to chunk it in the first place. There is no good reason to continue chunking and many good reasons to discontinue chunking. Just remove the for loop that is doing the array_slice.
It used to corrupt data and no one noticed is not a good reason to go on doing so.
from codeigniter.
While it would probably be a good idea to not break the query up, it would also be a good idea to:
- run the batch in a transaction (which you can do without any modification to the query builder code)
- compare the result of
$this->db->affected_rows()
to the number of rows you wanted to insert - don't rely on the result of
$this->db->insert_id()
matching the behavior you described
The last is just because it won't match your description for all database platforms. In some cases, the method is simply not supported; in at least one case, the method will return the last ID inserted, rather than the first. (I expected it to return the last ID inserted, which is why I linked the documentation earlier to support your description of the behavior under MySQL.)
from codeigniter.
Related Issues (20)
- migration
- PHP 8.2 - Optional parameter declared before required parameter HOT 1
- Is Codeigniter version 3.1.10 is support for php 8.0 or php 8.1 ? HOT 13
- Deprecated Deprecations of Expected Exceptions HOT 1
- CodeIgniter + Session usage with Redis is slow HOT 4
- mysqli_driver version_compare issue CI 3.1.13
- Controller classes must exist as files (why??)
- Drop support for PHP < 8.1 HOT 10
- Crazy error, php-api/system/database/DB_driver.php --------> function escape($str) error HOT 4
- Email library does not send email through SMTP on PHP 8.2
- Upcoming deprecations in PHP 8.3 HOT 2
- Codeigniter 3.1.4 and PHP 8.1 HOT 4
- mysqli num_rows always return 0 HOT 1
- Session Lost for CI 3.1.13 on Page Refresh HOT 4
- Misleading docblock of the file_get_info function
- xss_clean % error HOT 1
- Bug: inconsistent behavior of Query Builder limit 0 HOT 8
- bug: codeigniter3 db->or_where_not_in(...) HOT 1
- Are there plans for CodeIgniter 3.1.14? HOT 2
- XSS when data is passed on to the web application via an API HOT 3
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from codeigniter.