For the last year I've been working on code to import the OSNews website into Drupal 7. It actually wasn't as long a task as it might sound. I'm working on this project gratis, spending my limited free time to make it happen. If someone had hired me to do this, I might have gotten it done inside of a month.
The project has gone in fits and starts for the entire year since I agreed to work on it. Some days I would just have no time, or no energy, or the requirements of work or home life were simply too demanding to set my mind to coding.
The last month I've been dealing with the purchase of a house, and the move to the house. Recently things have been calm enough that I could stop and focus on Drupal code. My top priority was the Flag module. I'm co-maintainer, and I felt a responsibility to keep it up. For several days I would open the issue queue and try to make what contributions I could. Soon, however, I realized that Flag development was in a lull.
So back to OSNews I came.
The wretched thing about the Batch API is that its documentation is horrid. There's no straightforward way to tell a batch to stop processing. Returning error messages is in theory supported, but the terse docs leave me unsure how to accomplish it.
This led to a rather depressing realization.
An Untestable Mess
The code I had written was perfectly valid Drupal code. It used the Batch API as consistently with best practices as I could make it. Like much Drupal 7 code, however, it was an untestable mess.
The problem was that I had made it so dependent on the Batch API that when I needed to put the entire process in a sandbox and attempt to migrate a single news article, it wasn't entirely possible. The code was function-oriented, like most Drupal code, and relied heavily on the persistent variable system in order to maintain state. Putting a single import into a repeatable box required nothing short of a complete refactor of the entire code base. I looked up Dependency Injection Containers to handle the variable storage, but it was already late that night and I had to just go to bed.
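The refactor idea can be sketched roughly like this. This is a hypothetical illustration, not the project's actual code (and in Python rather than PHP): hide the persistent-variable calls behind a small injectable interface, so a single import can run against an in-memory "sandbox" store during testing instead of Drupal's variable system.

```python
# Hypothetical sketch: decouple the importer from persistent state by
# injecting a store object. Names here are illustrative, not from the project.
class VariableStore:
    """The interface Drupal's variable_get()/variable_set() would sit behind."""
    def get(self, name, default=None):
        raise NotImplementedError

    def set(self, name, value):
        raise NotImplementedError


class MemoryStore(VariableStore):
    """Drop-in replacement for tests: no database, no Batch API."""
    def __init__(self):
        self._vars = {}

    def get(self, name, default=None):
        return self._vars.get(name, default)

    def set(self, name, value):
        self._vars[name] = value


def import_one_article(news_id, store):
    # The importer asks the injected store for its state instead of
    # reaching out to the persistent variable system directly.
    done = store.get('imported_ids', [])
    done.append(news_id)
    store.set('imported_ids', done)
    return done


store = MemoryStore()
import_one_article(42, store)
# store.get('imported_ids') now holds [42], with no Drupal bootstrap needed
```

In production, a second implementation of `VariableStore` would wrap the real `variable_get()`/`variable_set()` calls; the importer itself never needs to know which one it was handed.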
The next evening I poked at the code once more while Netflix streamed quietly in the background. The persistent high level of stress had left my neck in agony, so I was lying on the large bean bag we keep in the living room with my head supported by a small fluffy mountain of pillows. I wasn't entirely focused due to the pain. Sometimes that's what you need for realization to strike.
A Query Too Far
It turned out the problem wasn't an error. In fact, the code was working exactly in the manner in which I had instructed it. The problem was, my instructions were wrong. Programmers can easily forget in the thick of things that computers do what you tell them to do, not what you want them to do. Your instructions must be precise, otherwise you'll find yourself, as I did, looking for the problem on the screen instead of in your head.
When I had written my query on the News table, I had specified "WHERE news_id >= $myNewsID". The use of the greater-than-or-equal operator was intended to handle the edge case of news_ids not being contiguous. Gaps were possible. That part of the query went back to the earliest part of the project. Back then, I was merely generating an array of contiguous numbers when building the content import batch. Later, I replaced this with a query to the database that pulled actual news_ids from the database, instead of plucking them out of the air. As a result, the greater-than-or-equal operator was no longer necessary.
But, "if it ain't broke, don't fix it."
At that point, the content import worked great; nothing was skipped mysteriously. As I made the content import more complete, I started to expand my query. Instead of simply querying the News table, I joined it against the Topics and Editors tables, hoping to eliminate a separate database query elsewhere in my code.
You see, I committed another blunder when I wrote my SQL query. I used an INNER JOIN, one that only returns results when all criteria are satisfied. Imagine a row in the News table with news_id x. That row specifies a topic_id of y. If I select that row from the News table, and do an INNER JOIN on the Topics table, the Topics table must have a row with a topic_id of y. What happens if there isn't a topic_id of y? Nothing gets returned. Null. Nada. Zip.
Unless you make your matching criteria a little more liberal.
Remember how I used the greater-than-or-equal operator? That came back to bite me. When I queried the database, it faithfully grabbed the first row in the News table with a matching news_id. For some news items, however, the Topics table had no matching row. Had I used an exact match, the INNER JOIN would have returned nothing at all, and the failure would have been obvious. With the greater-than-or-equal operator, the database instead moved on to the next news_id that did satisfy the join, silently leaving the unmatched item behind.
That's how the "skipping" occurred.
Multiple Queries vs. LEFT JOINs
When I figured out what I'd done, I pulled out the inner join on the Topics table and queried it separately. This turned out to be the smarter choice, as there are only a limited number of Topics, and I was caching those in memory. Unfortunately, I had committed the same blunder with the Editors table. Not only was I performing an INNER JOIN against News and Topics, but against Editors too. The same problem reared its ugly head.
This left me with two possible solutions. I could pull the query on the Editors table out and perform it separately, as I had for the Topics table, or I could perform a LEFT JOIN. LEFT JOINs differ from INNER JOINs in that the "left" table (News in this case) need not have a matching result in the "right" table (Editors). If a match isn't found on the JOIN, the corresponding columns from the "right" table come back NULL. In other words, while an INNER JOIN returns the intersection of two tables, a LEFT JOIN returns that intersection plus all the remaining rows of the "left" table.
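The difference is easy to see side by side in SQLite. Table names follow the discussion above; the data is made up:

```python
# INNER JOIN vs LEFT JOIN on the same data: the orphaned news item
# disappears from the inner join but survives the left join with a NULL.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE news (news_id INTEGER, editor_id INTEGER, title TEXT);
CREATE TABLE editors (editor_id INTEGER, name TEXT);
INSERT INTO news VALUES (1, 5, 'kept'), (2, 6, 'orphaned');
INSERT INTO editors VALUES (5, 'alice');  -- editor 6 no longer exists
""")

inner = conn.execute("""
    SELECT n.news_id, e.name FROM news n
    INNER JOIN editors e ON e.editor_id = n.editor_id
    ORDER BY n.news_id
""").fetchall()
# inner == [(1, 'alice')] -- only the intersection; news_id 2 is gone

left = conn.execute("""
    SELECT n.news_id, e.name FROM news n
    LEFT JOIN editors e ON e.editor_id = n.editor_id
    ORDER BY n.news_id
""").fetchall()
# left == [(1, 'alice'), (2, None)] -- every news row survives,
# and the missing editor comes back as NULL (None in Python)
```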
I toyed with this, and it did seem to work. Editors, however, had the same nature as Topics -- there were only a handful of rows in Editors compared to the number of rows in the News table. It was much better to cache it in memory and avoid the disk IO entirely.
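The caching approach can be sketched like this. The function and variable names are illustrative, not from the project, but the shape is the one described: one query at startup, then dictionary lookups instead of a join on every fetch.

```python
# Load the handful of Topics rows once, then resolve topic_ids from a
# dict in memory rather than joining against the table on every query.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE topics (topic_id INTEGER, name TEXT);
INSERT INTO topics VALUES (10, 'Linux'), (11, 'Apple');
""")

# One query at startup; the table is tiny, so this is cheap.
topic_cache = {tid: name for tid, name in
               conn.execute("SELECT topic_id, name FROM topics")}

def topic_name(topic_id):
    # A missing topic surfaces explicitly here, instead of silently
    # knocking a row out of an INNER JOIN.
    return topic_cache.get(topic_id, '(unknown topic)')

# topic_name(10) resolves from memory; topic_name(99) flags the gap
```

Besides avoiding the disk IO, this moves the "missing row" case out of the join entirely, so an orphaned reference becomes an explicit, handleable condition in code.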
This made me think: is this why Drupal performs so many database queries in a single page request? While you could combine them all into a single, big query, the edge cases and missed matches make the code much more complicated. It's easier and more robust to write several independent queries instead of one big one.
With the import code seemingly working and no longer randomly skipping articles, I started running a massive import effort. The OSNews database copy I had received had some 25000 news items. The best way to test the import code was to throw all of the news articles at it and to see what happened. It took hours to process that many news items given the massive number of comments, but in the end, it worked.
While I had a few failed imports, I was able to resolve them with a bit of investigation. What's left now is to examine what was imported and see if there are additional problems. I found one already -- backslash escaping and BBcode markup in comments. That isn't a huge problem, but it does reflect the need to cross-check what was imported to make sure everything is as it should be. When I'm sure I've gotten all of the problems, I can delete the existing database and start the import process afresh. This is necessary to ensure I have all of the news articles that were "skipped" earlier.