TalkPHP
 
 
Account Login
Latest Articles
» The basic usage of PHPTAL, a XML/XHTML template library for PHP
» Vulnerable methods and the areas they are commonly trusted in.
» Simple way to protect a form from bot
» The Basics On: How Session Stealing Works
» How to keep your forms from double posting data
IRC Channel
IRC Speech Bubble Join the friendly bunch on IRC...
(#TalkPHP on Freenode)

...Also available via a web interface.

See this thread for information on the TalkPHP Free Hugs Initiative™. Subject to availability.
Associates
Associates
CSS Tutorials
Reply
 
LinkBack Thread Tools Search this Thread Display Modes
Old 09-08-2007, 12:08 AM   #1 (permalink)
Super Moderator
Advanced Programmer 
 
bluesaga's Avatar
 
Join Date: Sep 2007
Posts: 165
Thanks: 0
bluesaga is on a distinguished road
Default When and Why to normalise?

As a rule of thumb, if a peice of data will ever be dynamically categorised then you need to normalise it, normalisation is normally done by creating an additional table and using it as a referance.

For example, say you have a table for:
movies:
id
name
genre

genre, could be many different varietys, thus needs to be normalised so something like:
movies:
id
name
genre (foreign_key genre.id)

Genre:
id
genre_name

Would be the normalised way of doing things, linking via id's. This is much easier to manipulate, and because you are using id's instead of repeating the genre it will save a hell of a lot of space in the long run!
bluesaga is offline  
Reply With Quote
Old 11-16-2007, 04:23 AM   #2 (permalink)
The Wanderer
 
Join Date: Nov 2007
Location: Mumbai, India
Posts: 24
Thanks: 0
sunilbhatia79 is on a distinguished road
Default

Great info.

Would like to add that normalization is mainly an architectural decision... many times while developing MIS systems we tend to de-normalize tables.

For reporting having one table with repeated or ambiguous values is better than having to join multiple tables every time a user generates a report... So for this we create nightly batch jobs that will read new records and update the reporting table...
__________________
Sunil Bhatia www.twitter.com/sunilbhatia79 - Follow me on Twitter
PHP5 Tutorials
Career Articles
sunilbhatia79 is offline  
Reply With Quote
Old 12-01-2007, 07:51 AM   #3 (permalink)
The Contributor
 
Join Date: Apr 2005
Location: Kent, UK
Posts: 54
Thanks: 0
Dr John is on a distinguished road
Default

Not really a good example.

genre_name will be unique, so then adding a unique id to it in the seperate table forces a join when you wish to return the value of genre_name in a query, as well as storing unnecessary information. Besides when entering data into the movie table, what makes more sense - genre = 7 or genre = comedy ? (Although the user may never see the id of course, depending on your interface)

And you missed out the link table - movie_genre (mid, genre) where mid is movie's id and genre is the different genres that a single movie may belong to EG a single movie could be comedy, and action, and Japanese. As this would be a multi-valued column, this is one important reason why it requires normalisation.

table schema: movie--<movie_genre>--genre

Normalisation should be done based on functional dependancies to remove redundant repetative data and multi-valued data (although a little de-normalisation certainly helps at times).
__________________
www.kidneydialysis.org.uk
Dr John is offline  
Reply With Quote
Old 12-01-2007, 03:28 PM   #4 (permalink)
The Reckoner
Advanced Programmer Top Contributor 
 
Karl's Avatar
 
Join Date: Sep 2007
Posts: 437
Thanks: 22
Karl is on a distinguished road
Default

Dr John, I think the join table was omitted in an effort to make the information easier to read and understand. I also don't see how you can say the example is wrong, maybe a little basic, but I wouldn't say it's wrong.

Storing the genre_id in the movies table and then moving the genres to their own table is normalisation, storing the genre name in the same table as the movie is not. If you were to do that, you'd be potentially repeating data time and time again.

As your quire rightly mentioned storing the id isn't much use to the user, i.e. "genre - 7" doesn't make much sense to me or you, but as you already stated, the user wouldn't see that and in all honesty, thats got nothing to do with normalisation and is therefore irrelevant to this topic.

"Normalisation should be done based on functional dependancies to remove redundant repetative data and multi-valued data"

So how is the genre name not repetitive data? If I were to add Matrix 1-3, I would have

Matrix - Action
Matrix 2 - Action
Matrix 3 - Action

Giving me 3 pieces of repeated data.
__________________
Any fool can write code that a computer can understand. Good programmers write code that humans can understand.
Karl is offline  
Reply With Quote
Old 12-01-2007, 04:18 PM   #5 (permalink)
The Contributor
 
Join Date: Apr 2005
Location: Kent, UK
Posts: 54
Thanks: 0
Dr John is on a distinguished road
Default

You're misunderstanding what I meant, due in part to the example being a very simple one. Repetative data is easier to see in this example (clearly a VERY bad design that needs normalising)

id, name, address, age, employer_name, empl_address, emp_phone
2, fred, 1 low st, 34, jones and co, 10 main road, 123456
22, john, 10 high st, 49, jones and co, 10 main road, 123456
717,Jane, 12 high st, 21, smith ltd, 15 big avenue, 654321
123, Carole, 43 skimmingdish lane, 30, smithltd, 15 big avenue, 654321
98, james, 12 high st, 55, windrushers ltd, 1 skimmingdish lane, 321321
221, john, 19 high st, 29, jones and co, 10 main road, 123456

(assuming that one big company owns several little ones and has a centralised database, perhaps)

Here in the unnormalised table, we are repeatedly entering the same address and same phone number for some of the employers. Time after time.

We should normalise the table to have worker and employer as separate entities.
employer table (empID, employer_name, empl_address, emp_phone) PK= empId
worker table (workID, name, address, age, empID) PK = workID, FK = empID

and employer--<worker

Now, the attributes that are functionally dependant on the workID and those dependant on the empID (which wasn't included in the original table as there it wasn't necessary) are now grouped together.

Now we need only enter the empID next to the worker's data, avoiding the repetative entry of the employer's other data, with, as you will notice, the risk of a typo ocurring (it was a real error as well). With hundreds of workers, the scope for typos is huge, and a search on "smith ltd" would miss out some people.

The simple example used didn't have scope to show what was meant by repetative redundant data. I do accept that the link table was probably missed out to shorten the post. But I've always found that examples which are very simple can give a beginner the wrong impression.

The original example is actually an example of a look-up table, which is usually used on important searchable fields to avoid misspellings causing problems eg comedy and comedies and comeddy. (Look-up tables are not fully required by normalisation, they are a ocnvenience.) But the link table is necessary as a film can be in several different categories, which wasn't immediately obvious (matrix, action) and (matrix, sci-fi) for example. And thus normalisation was required when it was added.

(Addresses are a field that are often left unnormalised for convenience.)

Hope this shows things a bit more clearly. Functional dependancies are always the best way to normalise things, they are part of the early research into relationships.
__________________
www.kidneydialysis.org.uk
Dr John is offline  
Reply With Quote
The Following User Says Thank You to Dr John For This Useful Post:
Karl (12-02-2007)
Old 12-01-2007, 04:40 PM   #6 (permalink)
The Reckoner
Advanced Programmer Top Contributor 
 
Karl's Avatar
 
Join Date: Sep 2007
Posts: 437
Thanks: 22
Karl is on a distinguished road
Default

Ah yes, I understand what you mean, but I still always assumed that the example given was normalisation, albeit only first normal form. Whereas the example you gave is classed as third normal form. Both examples are normalisation, they are just different levels of it.

* Edit *

I've been doing some reading and I think I finally understand what you mean. The example given isn't a true example of normalisation because the new table (the lookup table) doesn't provide anything other than a name related to an id. To be true normalisation (and not just a look up table) the table must also extend the data which has been extracted (in this case, the genre) by adding additional fields.

Have I got that right?
__________________
Any fool can write code that a computer can understand. Good programmers write code that humans can understand.

Last edited by Karl : 12-01-2007 at 06:04 PM.
Karl is offline  
Reply With Quote
Old 12-01-2007, 11:38 PM   #7 (permalink)
The Contributor
 
Join Date: Apr 2005
Location: Kent, UK
Posts: 54
Thanks: 0
Dr John is on a distinguished road
Default

Yes, that's the sort of thing I was trying to get at. A look-up table is for convenience and usually only has the one attribute (it doesn't need an artificial ID, the genre were unique to begin with). But as no other attribute had a functional dependancy on genre, there was no real reason to remove it to another table, UNTIL the multi-valued part was considered (matrix, action) and (matrix, sci-fi), then it required the link table to list each movie and genre pair. Which was why I included that bit.

I personally always think in terms of entities, as I seem to be much faster at normalising things that way. A movie can belong to many genres. A genre can apply to many movies. A many to many relationship - normalise, and add link table.
movie--<movie_genre>--genre

Suppose a movie could only belong to one genre - a one to one relationship.
Use a look-up table for convenience, if you wish to. (But in this case, you'd end up with some awkward genres eg comedy-action-japanese-sc-fi, then you'd realise a movie did NOT belong to just one genre and spot the many to many, as searching for comedy should return this movie, and you've gone multi-valued, not allowed in relatiional DBs.


I've noticed it is very common for people to add an ID to a table when one of the attributes is already a unique identifier. When you add ID to genre you end up with three superkeys - id, genre and (id,genre), of which two are candidate keys - id, and genre, and you choose either of the candidate keys to be the primary key. But id was added as if there was no natural simple candidate key, when genre did that job by itself. So the id in the genre table was instantly unnecessary.

Hope this helps a bit more. As I said, over simplifaction of examples masks what is actually being done and why.
__________________
www.kidneydialysis.org.uk
Dr John is offline  
Reply With Quote
Reply



Currently Active Users Viewing This Thread: 1 (0 members and 1 guests)
 
Thread Tools Search this Thread
Search this Thread:

Advanced Search
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On


All times are GMT. The time now is 01:47 AM.

 
     

Powered by vBulletin® Version 3.6.8
Copyright ©2000 - 2013, Jelsoft Enterprises Ltd.
Search Engine Optimization by vBSEO 3.1.0
Inactive Reminders By Icora Web Design