TalkPHP
 
 
Walkthrough: Parsing a SRT Subtitle File
This article introduces a couple of nifty tricks that I've seen scattered around the Internet, one of which is how to detect the presence of those pesky carriage returns. It will show you how to take a correctly formatted SRT file and parse it using PHP: another parse function to sit alongside parse_url, parse_ini_file, et cetera!

I guess the question you're all asking is: why would you want to parse an SRT file? Well, there's no specific reason. However, this article does cover some good programming concepts. There are naturally many ways to parse the same file, but this is the way in which I would parse an SRT file.

First off, I had to download an SRT file from a website that indexes subtitle files. See the attachments below for our SRT file. For this article I have chosen the English subtitle file from Babel - which is incidentally a truly brilliant film, as SOCK knows!

As always we're going to begin with a simple function declaration. No revelation, I know, but it's worth it for those who are staring blankly:

php Code:
function parse_srt($szSRTFile)
{

}

Our function now accepts a single argument: the name of the file to be parsed. Our first port of call is to read the file's contents into a string, which is fairly simple to do and requires no explanation:

php Code:
$szSRT = file_get_contents($szSRTFile);

That is in sharp contrast to the following line, which perhaps requires a paragraph or two to explain. Windows ends each line with a carriage return followed by a line feed (\r\n), whereas Linux uses only a line feed (\n), so we need to detect which convention the file uses in order to break it up correctly. We could call the nl2br function and then split on the break tag, or whatever we decided to use, but we're going for a different approach:

php Code:
$szBreak = strstr($szSRT, chr(13)) !== false ? "\r\n" : "\n";

Note: We could also just replace all the carriage returns with nothing, leaving us with nothing but line breaks. However, I have done it this way so that if you ever do need to detect carriage returns, you know how.
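
As a quick sketch of the alternative mentioned in the note, stripping the carriage returns could look something like this. The helper name strip_carriage_returns and the sample string are made up for illustration and are not part of the article's function:

```php
<?php
// Hypothetical helper: remove every carriage return so that only
// line feeds (\n) are left in the text.
function strip_carriage_returns($szText)
{
    return str_replace("\r", '', $szText);
}

// Sample block with Windows line endings.
$szWindows = "2\r\n00:01:50,443 --> 00:01:52,206\r\nThree hundred cartridges.\r\n";
$szUnix    = strip_carriage_returns($szWindows);

var_dump(strpos($szUnix, "\r")); // bool(false): no carriage returns remain
```

After this step you could always split on \n and skip the detection entirely.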

Using the ternary operator, which Matt did a fantastic job explaining, we detect whether a carriage return is present anywhere in the document. We look for the carriage return character via its ASCII value (13) and the chr() function, but you could equally detect it with the simple "\r" notation in a double-quoted string. If we find a carriage return then we set the line break to \r\n; otherwise it's just a simple line feed, \n, which incidentally has an ASCII value of 10.
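
To see the detection in isolation, here is a minimal sketch. The wrapper function detect_break and both sample strings are hypothetical, and it uses strpos with "\r" rather than strstr with chr(13). It also assumes the file uses one convention consistently:

```php
<?php
// Hypothetical wrapper around the detection line.
// "\r" and chr(13) are the same single character.
function detect_break($szText)
{
    return strpos($szText, "\r") !== false ? "\r\n" : "\n";
}

$szWindows = "2\r\n00:01:50,443 --> 00:01:52,206\r\nThree hundred cartridges.";
$szUnix    = "2\n00:01:50,443 --> 00:01:52,206\nThree hundred cartridges.";

var_dump(detect_break($szWindows) === "\r\n"); // bool(true)
var_dump(detect_break($szUnix) === "\n");      // bool(true)
var_dump("\r" === chr(13));                    // bool(true)
var_dump("\n" === chr(10));                    // bool(true)
```

Note the strict !== false comparison: strpos returns 0 when the match is at the very start of the string, and a loose comparison would treat that as "not found".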

Once we know what to split our lines by, and in theory which operating system the subtitle file was compiled on, we can move ahead and explode the file into lines. The first few lines of our SRT file look like this:

Quote:
Originally Posted by Babel[2006]DvDrip[Eng]-aXXo.srt
1
00:01:46,172 --> 00:01:47,696
It's aImost new.

2
00:01:50,443 --> 00:01:52,206
Three hundred cartridges.

To get the individual blocks we simply split by \r\n\r\n or, alternatively, just \n\n. For that we take the $szBreak value which we deduced in the previous line and repeat it twice. This is where str_repeat comes in: str_repeat($szBreak, 2) is equivalent to concatenating $szBreak with itself, but is more concise:

php Code:
$aData = explode(str_repeat($szBreak, 2), $szSRT);

We now have our subtitle chunks broken up nicely. If the above quote was the start and the end of our SRT file, then we'd have 2 items in our array.
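
Here is a small sketch of that claim, using the two quoted blocks as an inline string instead of reading a file. The text is copied verbatim from the subtitle file, and Unix line breaks are assumed:

```php
<?php
$szBreak = "\n";

// The two blocks quoted from the subtitle file, separated by a blank line.
$szSRT = "1\n00:01:46,172 --> 00:01:47,696\nIt's aImost new.\n\n"
       . "2\n00:01:50,443 --> 00:01:52,206\nThree hundred cartridges.";

// str_repeat("\n", 2) gives "\n\n", the separator between blocks.
$aData = explode(str_repeat($szBreak, 2), $szSRT);

echo count($aData), "\n"; // 2
echo $aData[1], "\n";     // the whole second block, still three lines long
```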

Next comes the core of the function which will break up all of the data and assign it to a huge array for returning back to the front-end:

php Code:
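for($iIndex = 0, $iDataLen = count($aData), $aSubtitles = array(); $iIndex < $iDataLen; $iIndex++)
{

}

Aside from setting the $iIndex variable to 0, we're also caching the number of blocks in $iDataLen and declaring the array that will hold all the parsed data. The loop runs while $iIndex is less than $iDataLen, because the last valid index is one less than the count. Inside the loop, we assign the current array item by reference to a shorter variable: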

php Code:
$szItem =& $aData[$iIndex];

Now if I were to modify $szItem then $aData[$iIndex] (where $iIndex denotes the current array index) would also change. This is called assigning by reference; we already have a good article on references.
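
If references are new to you, this little sketch (with a made-up two-item array) shows the difference between a reference and a plain copy:

```php
<?php
$aData = array('first block', 'second block');

$szItem =& $aData[0];          // $szItem is now an alias of $aData[0]
$szItem = strtoupper($szItem); // the change writes through to the array

echo $aData[0], "\n"; // FIRST BLOCK

unset($szItem);       // drops the alias only; the array element is kept

$szCopy = $aData[1];  // a plain assignment copies the value instead
$szCopy = 'changed';

echo $aData[1], "\n"; // second block
```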

We then need to check whether the block is empty. Sometimes a subtitle file will have extra line breaks and/or carriage returns at the end of the document, which can leave an empty string as the final block after the split. The empty construct catches that case for us, and we stop the loop:

php Code:
if(empty($szItem))
{
    break;
}
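
To see where the empty block comes from, here is a sketch with one block followed by a trailing blank line (Unix line breaks assumed). The trim() variation at the end is my own assumption rather than part of the function, for files whose leftover chunk still holds a stray break:

```php
<?php
$szBreak = "\n";

// One block followed by a trailing blank line.
$szSRT = "1\n00:01:46,172 --> 00:01:47,696\nIt's aImost new.\n\n";

$aData = explode(str_repeat($szBreak, 2), $szSRT);

var_dump(count($aData));    // int(2)
var_dump(empty($aData[1])); // bool(true): the trailing chunk is ''

// Variation (an assumption): a chunk holding only a stray break is not
// empty, so trim it before testing.
$szStray = "\n";
var_dump(empty($szStray));       // bool(false)
var_dump(trim($szStray) === ''); // bool(true)
```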

We now need to break each block up into lines. Sometimes, however, the actual text of a subtitle runs over more than one line, and therefore we can never guarantee how many lines we will end up with. By providing the optional third argument to the explode function, we can tell it to split into at most 3 parts; the last part simply contains the rest of the text. That leaves us with a guaranteed 3 or, in some unwanted circumstances, fewer than 3 items in our array.

Also, as the time-stamps for when a subtitle should appear and then disappear are separated by an arrow, we have rather an easy task in splitting the start time-stamp from the end time-stamp.

php Code:
$aLine = explode($szBreak, $szItem, 3);
$aTime = explode('-->', $aLine[1]);
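
As an illustration, here is how those two lines behave on a made-up block whose text spans two lines (the index, times and text are hypothetical, with Unix line breaks assumed):

```php
<?php
$szBreak = "\n";

// Hypothetical block with two lines of subtitle text.
$szItem = "7\n00:02:10,000 --> 00:02:12,500\nFirst line of text\nSecond line of text";

// The limit of 3 keeps everything after the second break in one piece.
$aLine = explode($szBreak, $szItem, 3);
$aTime = explode('-->', $aLine[1]);

echo count($aLine), "\n";    // 3
echo $aLine[2], "\n";        // both text lines, break between them preserved
echo trim($aTime[0]), "\n";  // 00:02:10,000
echo trim($aTime[1]), "\n";  // 00:02:12,500
```

Without the limit, the second line of text would land in $aLine[3] and we'd have to glue the pieces back together.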

Last but not least we now want to place our items into the master array, ready to be returned to the front-end. Seeing as how we have all the parts for our current block, we can simply assign the values like so:

php Code:
$aSubtitles[] = array(
    'index'      => (int) $aLine[0],
    'time_start' => trim($aTime[0]),
    'time_end'   => trim($aTime[1]),
    'text'       => $aLine[2]
);

Notice that we are type-casting the subtitle index with (int). That is because it's an integer, and we don't want PHP to class it as a string; otherwise, when we later come to perform some fancy things with our data, we may run into issues. Again, we have an informative article on type-juggling/type-casting.
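
A short sketch of what the cast buys us, using a made-up $aLine array:

```php
<?php
// Hypothetical result of exploding one block.
$aLine = array('10', '00:02:10,000 --> 00:02:12,500', 'Some text');

$iIndex = (int) $aLine[0];

var_dump($aLine[0]); // string(2) "10"
var_dump($iIndex);   // int(10)

// A strict comparison tells the two types apart.
var_dump($aLine[0] === 10); // bool(false)
var_dump($iIndex === 10);   // bool(true)
```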

Now we've got all the data we require, all beautifully parsed into one enormous array! The last thing our function needs to do is issue a return statement to hand the array back:

php Code:
return $aSubtitles;

All that's left to do is to call the function we've just crafted, and to test if it works. This can be done like this:

php Code:
$aSRT = parse_srt('Babel[2006]DvDrip[Eng]-aXXo.srt');
echo $aSRT[200]['text'];

This should echo out the value: Get out right now. If it doesn't then you've done something wrong along the way. Fear not, however: please see the attached documents for both the script itself and the SRT file that I used for the article.
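
If you don't have the attachments to hand, the steps above can be assembled into one function roughly like this. Treat it as a sketch rather than the attached script: the loop bound uses < so it never reads past the last element, and the test data is just the two quoted blocks written to a temporary file, not the full Babel subtitle file:

```php
<?php
function parse_srt($szSRTFile)
{
    $szSRT   = file_get_contents($szSRTFile);
    $szBreak = strstr($szSRT, chr(13)) !== false ? "\r\n" : "\n";
    $aData   = explode(str_repeat($szBreak, 2), $szSRT);

    for($iIndex = 0, $iDataLen = count($aData), $aSubtitles = array(); $iIndex < $iDataLen; $iIndex++)
    {
        $szItem =& $aData[$iIndex];

        // Stop at the empty chunk left by trailing breaks.
        if(empty($szItem))
        {
            break;
        }

        // Index, time line, then all remaining text in one piece.
        $aLine = explode($szBreak, $szItem, 3);
        $aTime = explode('-->', $aLine[1]);

        $aSubtitles[] = array(
            'index'      => (int) $aLine[0],
            'time_start' => trim($aTime[0]),
            'time_end'   => trim($aTime[1]),
            'text'       => $aLine[2]
        );
    }

    return $aSubtitles;
}

// Hypothetical test data: the two quoted blocks with Windows line endings.
$szFile = tempnam(sys_get_temp_dir(), 'srt');
file_put_contents($szFile,
    "1\r\n00:01:46,172 --> 00:01:47,696\r\nIt's aImost new.\r\n\r\n"
  . "2\r\n00:01:50,443 --> 00:01:52,206\r\nThree hundred cartridges.\r\n\r\n");

$aSRT = parse_srt($szFile);
unlink($szFile);

echo count($aSRT), "\n";           // 2
echo $aSRT[1]['time_start'], "\n"; // 00:01:50,443
echo $aSRT[1]['text'], "\n";       // Three hundred cartridges.
```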

This article has hopefully given you a good insight into how to go about parsing files. When you start parsing websites, things become a whole lot different. That is where the DOM is intended to help, by classing each node as an object which has 0, 1 or many attributes. The key is to look for a common pattern that can be used to split the content into its own blocks so that you can begin parsing it even further. In our case, the pattern was the double break, \r\n\r\n or \n\n depending on the value of $szBreak, and then each inner block was separated by a single line break. The times, meanwhile, were separated by an ASCII arrow, giving us several different examples of how to split up a document.

I'm sure you can now appreciate why we need such standards as XML and JSON!

Files: Download SRT and PHP files.