parse_ini_file, et cetera!
I guess the question you're all asking is: why would you want to parse an SRT file? Well, there's no specific reason, but this article does cover some good programming concepts. There are naturally many ways to parse the same file; this is simply the way in which I would parse an SRT file.
First off, I had to download an SRT file from a website that indexes subtitle files. See the attachments below for our SRT file. For this article I have chosen the English subtitle file from Babel - which is incidentally a truly brilliant film, as SOCK knows!
As always we're going to begin with a simple function declaration. No revelation, I know, but it's worth it for those who are staring blankly:
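A minimal sketch of that declaration, assuming the name parse_srt (taken from the call near the end of this article) and the $szSRTFile parameter used in the next step. The body is only a placeholder at this point:

```php
<?php
// Sketch only: the function name and parameter are taken from elsewhere
// in this article; the body is a placeholder for the steps that follow.
function parse_srt($szSRTFile)
{
    // The parsing steps described below go here.
    return array();
}
```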
Our function now accepts an argument for the file name that is to be parsed and then returned. Our first port of call is to open the file, which is fairly simple to do and requires no explanation:
$szSRT = file_get_contents($szSRTFile);
The following line is in sharp contrast, and perhaps requires a paragraph or two to explain. Windows ends each line with a carriage return followed by a line break (\r\n), whereas Linux uses only a line break (\n), so we need to detect which one the file uses in order to break it up correctly. We could call the nl2br function and then split via the break tag, or whatever we decided to use, but we're going for a different approach:
Note: We could also just replace all the carriage returns with nothing, leaving us with nothing but line breaks. However, I have done it this way so that if you ever do need to detect carriage returns, you know how.
Using the ternary operator, which Matt did a fantastic job of explaining, we detect whether a carriage return is present in the document. We look for the carriage return character via its ASCII value (13) and the chr() function, but you could just as feasibly detect it with the simple \r notation. If we find a carriage return then we set the line break to \r\n; otherwise it's just a simple line break: \n - which incidentally has an ASCII value of 10.
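The detection described above can be sketched roughly as follows. The helper name detect_line_break and the sample string are assumptions made for illustration; in the article's function the result is simply assigned to $szBreak:

```php
<?php
// detect_line_break() is a hypothetical helper name used only for this sketch.
function detect_line_break($szSRT)
{
    // chr(13) is the carriage return; "\r" would work just as well.
    // strpos() can legitimately return 0, hence the strict !== false check.
    return (strpos($szSRT, chr(13)) !== false) ? "\r\n" : "\n";
}

// Made-up sample content standing in for the file's contents.
$szBreak = detect_line_break("1\r\n00:00:01,000 --> 00:00:02,000\r\nHello\r\n");
var_dump($szBreak === "\r\n"); // bool(true)
```

Note that a loose check such as if (strpos(...)) would fail whenever the carriage return happened to sit at position 0, which is why the comparison against false is strict.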
Once we know what to split our lines by, and in theory which operating system the subtitle file was compiled on, we can move ahead and explode the lines. The first few lines of our SRT file look like so:
From BabelDvDrip[Eng]-aXXo.srt:
1
00:01:46,172 --> 00:01:47,696
It's aImost new.

2
00:01:50,443 --> 00:01:52,206
Three hundred cartridges.
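Each block is separated from the next by a blank line, in other words by two line breaks in a row. This is where the str_repeat line comes in use - str_repeat simply concatenates a string with itself a given number of times, providing a more short-hand version of $szBreak . $szBreak: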
We now have our subtitle chunks broken up nicely. If the above quote was the start and the end of our SRT file, then we'd have 2 items in our array.
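As a rough sketch of that split, assuming Linux-style line breaks and assuming the two quoted blocks carry the index numbers 1 and 2:

```php
<?php
$szBreak = "\n";

// The two blocks quoted above; the index lines "1" and "2" are assumed.
$szSRT = "1\n00:01:46,172 --> 00:01:47,696\nIt's aImost new.\n\n"
       . "2\n00:01:50,443 --> 00:01:52,206\nThree hundred cartridges.";

// str_repeat($szBreak, 2) is the same as writing $szBreak . $szBreak.
$aData = explode(str_repeat($szBreak, 2), $szSRT);

echo count($aData); // 2
```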
Next comes the core of the function which will break up all of the data and assign it to a huge array for returning back to the front-end:
Aside from setting the $iIndex variable to 0, we're also declaring the array that all the parsed data will be placed into. We then assign the current array item, by reference, to a shorter variable:
$szItem =& $aData[$iIndex];
Now if I were to modify $szItem then $aData[$iIndex] (where $iIndex denotes the current array's index) would also change. This is called assigning by reference, which we have a good article on already.
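A small illustration of that behaviour, using made-up array contents:

```php
<?php
// Made-up data purely for illustration.
$aData  = array('first block', 'second block');
$iIndex = 0;

// $szItem is now another name for $aData[0], not a copy of it.
$szItem =& $aData[$iIndex];
$szItem = 'changed';

echo $aData[0]; // changed

// Break the link so later writes to $szItem no longer touch the array.
unset($szItem);
```

The unset() at the end is worth the habit inside loops: a reference left dangling after the loop can otherwise overwrite an array item by surprise.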
We then need to check whether any of our blocks are empty. Sometimes a subtitle file will have some extra line breaks and/or carriage returns at the end of the document - the empty construct will take care of that for us:
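A sketch of that check, assuming a made-up one-block file that ends with two extra line breaks. The $aKept array is a hypothetical name used only to show which blocks survive:

```php
<?php
// The trailing "\n\n" leaves an empty string as the final element.
$aData = explode("\n\n", "1\n00:00:01,000 --> 00:00:02,000\nHello\n\n");

$aKept = array(); // hypothetical name, for illustration only
foreach ($aData as $szItem) {
    if (empty($szItem)) {
        continue; // skip the empty leftover block
    }
    $aKept[] = $szItem;
}

echo count($aData), ' ', count($aKept); // 2 1
```

One caveat with this approach: if the file ends with an odd number of breaks, the leftover block is "\n" rather than an empty string, which empty will not catch. Comparing trim($szItem) against '' is one way to cover that case as well.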
We now need to break each block up into its lines. Sometimes, however, the actual subtitle text runs over more than one line, and therefore we can never guarantee how many lines we will end up with. By providing the optional third argument to the explode function, we can tell it to split into at most 3 parts - the last part simply contains the rest of the text - which leaves us with a guaranteed 3 or, in some unwanted circumstances, fewer than 3 items in our array.
Also, as the time-stamps for when a subtitle should appear and then disappear are separated by an arrow, we have rather an easy task in splitting the start time-stamp from the end time-stamp.
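Both splits can be sketched as below. The block contents are made up, the names $aLines, $szStart and $szEnd are assumptions, and the arrow is assumed to be surrounded by single spaces as in the sample above:

```php
<?php
$szBreak = "\n";

// Made-up block whose subtitle text runs over two lines.
$szItem = "7\n00:02:10,000 --> 00:02:12,500\nFirst line\nSecond line";

// A limit of 3 means the third element keeps everything that is left,
// including any further line breaks inside the subtitle text.
$aLines = explode($szBreak, $szItem, 3);

// $aLines[1] holds both time-stamps, separated by the arrow.
list($szStart, $szEnd) = explode(' --> ', $aLines[1]);

echo $szStart, ' / ', $szEnd; // 00:02:10,000 / 00:02:12,500
```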
Last but not least we now want to place our items into the master array, ready to be returned to the front-end. Seeing as how we have all the parts for our current block, we can simply assign the values like so:
Notice that we are using type-casting for the assignment of the subtitle index - that is because it's an integer, and we don't want PHP to class it as a string; otherwise, when we later come to perform some fancy things with our data, we may run into issues. Again, we have an informative article on type-juggling/type-casting.
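The assignment might look something like the sketch below. The array keys ('index', 'start', 'end', 'text') and the name $aSubtitles are assumptions chosen for illustration, as is the sample block:

```php
<?php
// Made-up parts of one block, as produced by the splitting step.
$aLines = array('7', '00:02:10,000 --> 00:02:12,500', 'Hello there.');
list($szStart, $szEnd) = explode(' --> ', $aLines[1]);

$aSubtitles   = array(); // hypothetical name for the master array
$aSubtitles[] = array(
    'index' => (int) $aLines[0], // cast the string "7" to the integer 7
    'start' => $szStart,
    'end'   => $szEnd,
    'text'  => $aLines[2],
);

var_dump($aSubtitles[0]['index'] === 7); // bool(true)
```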
Now we've got all the data we require! All beautifully parsed into one very enormous array. The last thing we need to do with our function is to issue the return construct to return the array:
All that's left to do is to call the function we've just crafted, and to test if it works. This can be done like this:
$aSRT = parse_srt('BabelDvDrip[Eng]-aXXo.srt');
This should echo out the value: Get out right now. If it doesn't then you've done something wrong along the way. Fear not, however: please see the attached documents for both the script itself and the SRT file that I used for the article.
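For reference, the steps above can be pulled together into one sketch. This is not the attached script: the array keys, the $aSubtitles and $aTimes names, the guards against malformed blocks and the made-up sample data are all assumptions, and trim() is used in place of the reference-plus-empty check so that stray trailing breaks are skipped too:

```php
<?php
function parse_srt($szSRTFile)
{
    $szSRT = file_get_contents($szSRTFile);
    if ($szSRT === false) {
        return array(); // unreadable file: nothing to parse
    }

    // Windows files contain carriage returns, Linux files do not.
    $szBreak = (strpos($szSRT, chr(13)) !== false) ? "\r\n" : "\n";
    $aData   = explode(str_repeat($szBreak, 2), $szSRT);

    $aSubtitles = array();
    for ($iIndex = 0, $iCount = count($aData); $iIndex < $iCount; $iIndex++) {
        $szItem = trim($aData[$iIndex]);
        if ($szItem === '') {
            continue; // extra breaks at the end of the file
        }
        $aLines = explode($szBreak, $szItem, 3);
        if (count($aLines) < 3) {
            continue; // malformed block
        }
        $aTimes = explode(' --> ', $aLines[1]);
        if (count($aTimes) !== 2) {
            continue; // no arrow between the time-stamps
        }
        $aSubtitles[] = array(
            'index' => (int) $aLines[0],
            'start' => $aTimes[0],
            'end'   => $aTimes[1],
            'text'  => $aLines[2],
        );
    }
    return $aSubtitles;
}

// Usage with a small made-up file written to a temporary location.
$szFile = tempnam(sys_get_temp_dir(), 'srt');
file_put_contents(
    $szFile,
    "1\r\n00:00:01,000 --> 00:00:02,500\r\nHello\r\n\r\n"
    . "2\r\n00:00:03,000 --> 00:00:04,000\r\nTwo\r\nlines\r\n"
);
$aSRT = parse_srt($szFile);
unlink($szFile);

echo $aSRT[0]['text']; // Hello
```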
This article has hopefully given you a good insight into how to go about parsing files. When you start parsing websites things become a whole lot different. That is where the DOM is intended to help, by classing each node as an object which has 0, 1 or many attributes. The key is to look for a common pattern that can be used to usher the content off into its own blocks so that you can begin parsing it even further. In our case, the pattern was the double \r\n\r\n or \n\n, depending on the value of $szBreak, and then each inner block was separated via a single line break (with or without a carriage return). The times, meanwhile, were separated by an ASCII arrow, giving us several different examples of how to split up a document.
I'm sure you can now appreciate why we need such standards as XML and JSON!
Files: Download SRT and PHP files.