TalkPHP

TalkPHP (http://www.talkphp.com/forums.php)
-   MySQL & Databases (http://www.talkphp.com/mysql-databases/)
-   -   When and Why to normalise? (http://www.talkphp.com/mysql-databases/1051-when-why-normalise.html)

bluesaga 09-08-2007 12:08 AM

When and Why to normalise?
 
As a rule of thumb, if a peice of data will ever be dynamically categorised then you need to normalise it, normalisation is normally done by creating an additional table and using it as a referance.

For example, say you have a table for:
movies:
id
name
genre

genre, could be many different varietys, thus needs to be normalised so something like:
movies:
id
name
genre (foreign_key genre.id)

Genre:
id
genre_name

Would be the normalised way of doing things, linking via id's. This is much easier to manipulate, and because you are using id's instead of repeating the genre it will save a hell of a lot of space in the long run!

sunilbhatia79 11-16-2007 04:23 AM

Great info.

Would like to add that normalization is mainly an architectural decision... many times while developing MIS systems we tend to de-normalize tables.

For reporting having one table with repeated or ambiguous values is better than having to join multiple tables every time a user generates a report... So for this we create nightly batch jobs that will read new records and update the reporting table...

Dr John 12-01-2007 07:51 AM

Not really a good example.

genre_name will be unique, so then adding a unique id to it in the seperate table forces a join when you wish to return the value of genre_name in a query, as well as storing unnecessary information. Besides when entering data into the movie table, what makes more sense - genre = 7 or genre = comedy ? (Although the user may never see the id of course, depending on your interface)

And you missed out the link table - movie_genre (mid, genre) where mid is movie's id and genre is the different genres that a single movie may belong to EG a single movie could be comedy, and action, and Japanese. As this would be a multi-valued column, this is one important reason why it requires normalisation.

table schema: movie--<movie_genre>--genre

Normalisation should be done based on functional dependancies to remove redundant repetative data and multi-valued data (although a little de-normalisation certainly helps at times).

Karl 12-01-2007 03:28 PM

Dr John, I think the join table was omitted in an effort to make the information easier to read and understand. I also don't see how you can say the example is wrong, maybe a little basic, but I wouldn't say it's wrong.

Storing the genre_id in the movies table and then moving the genres to their own table is normalisation, storing the genre name in the same table as the movie is not. If you were to do that, you'd be potentially repeating data time and time again.

As your quire rightly mentioned storing the id isn't much use to the user, i.e. "genre - 7" doesn't make much sense to me or you, but as you already stated, the user wouldn't see that and in all honesty, thats got nothing to do with normalisation and is therefore irrelevant to this topic.

"Normalisation should be done based on functional dependancies to remove redundant repetative data and multi-valued data"

So how is the genre name not repetitive data? If I were to add Matrix 1-3, I would have

Matrix - Action
Matrix 2 - Action
Matrix 3 - Action

Giving me 3 pieces of repeated data.

Dr John 12-01-2007 04:18 PM

You're misunderstanding what I meant, due in part to the example being a very simple one. Repetative data is easier to see in this example (clearly a VERY bad design that needs normalising)

id, name, address, age, employer_name, empl_address, emp_phone
2, fred, 1 low st, 34, jones and co, 10 main road, 123456
22, john, 10 high st, 49, jones and co, 10 main road, 123456
717,Jane, 12 high st, 21, smith ltd, 15 big avenue, 654321
123, Carole, 43 skimmingdish lane, 30, smithltd, 15 big avenue, 654321
98, james, 12 high st, 55, windrushers ltd, 1 skimmingdish lane, 321321
221, john, 19 high st, 29, jones and co, 10 main road, 123456

(assuming that one big company owns several little ones and has a centralised database, perhaps)

Here in the unnormalised table, we are repeatedly entering the same address and same phone number for some of the employers. Time after time.

We should normalise the table to have worker and employer as separate entities.
employer table (empID, employer_name, empl_address, emp_phone) PK= empId
worker table (workID, name, address, age, empID) PK = workID, FK = empID

and employer--<worker

Now, the attributes that are functionally dependant on the workID and those dependant on the empID (which wasn't included in the original table as there it wasn't necessary) are now grouped together.

Now we need only enter the empID next to the worker's data, avoiding the repetative entry of the employer's other data, with, as you will notice, the risk of a typo ocurring (it was a real error as well). With hundreds of workers, the scope for typos is huge, and a search on "smith ltd" would miss out some people.

The simple example used didn't have scope to show what was meant by repetative redundant data. I do accept that the link table was probably missed out to shorten the post. But I've always found that examples which are very simple can give a beginner the wrong impression.

The original example is actually an example of a look-up table, which is usually used on important searchable fields to avoid misspellings causing problems eg comedy and comedies and comeddy. (Look-up tables are not fully required by normalisation, they are a ocnvenience.) But the link table is necessary as a film can be in several different categories, which wasn't immediately obvious (matrix, action) and (matrix, sci-fi) for example. And thus normalisation was required when it was added.

(Addresses are a field that are often left unnormalised for convenience.)

Hope this shows things a bit more clearly. Functional dependancies are always the best way to normalise things, they are part of the early research into relationships.

Karl 12-01-2007 04:40 PM

Ah yes, I understand what you mean, but I still always assumed that the example given was normalisation, albeit only first normal form. Whereas the example you gave is classed as third normal form. Both examples are normalisation, they are just different levels of it.

* Edit *

I've been doing some reading and I think I finally understand what you mean. The example given isn't a true example of normalisation because the new table (the lookup table) doesn't provide anything other than a name related to an id. To be true normalisation (and not just a look up table) the table must also extend the data which has been extracted (in this case, the genre) by adding additional fields.

Have I got that right?

Dr John 12-01-2007 11:38 PM

Yes, that's the sort of thing I was trying to get at. A look-up table is for convenience and usually only has the one attribute (it doesn't need an artificial ID, the genre were unique to begin with). But as no other attribute had a functional dependancy on genre, there was no real reason to remove it to another table, UNTIL the multi-valued part was considered (matrix, action) and (matrix, sci-fi), then it required the link table to list each movie and genre pair. Which was why I included that bit.

I personally always think in terms of entities, as I seem to be much faster at normalising things that way. A movie can belong to many genres. A genre can apply to many movies. A many to many relationship - normalise, and add link table.
movie--<movie_genre>--genre

Suppose a movie could only belong to one genre - a one to one relationship.
Use a look-up table for convenience, if you wish to. (But in this case, you'd end up with some awkward genres eg comedy-action-japanese-sc-fi, then you'd realise a movie did NOT belong to just one genre and spot the many to many, as searching for comedy should return this movie, and you've gone multi-valued, not allowed in relatiional DBs.


I've noticed it is very common for people to add an ID to a table when one of the attributes is already a unique identifier. When you add ID to genre you end up with three superkeys - id, genre and (id,genre), of which two are candidate keys - id, and genre, and you choose either of the candidate keys to be the primary key. But id was added as if there was no natural simple candidate key, when genre did that job by itself. So the id in the genre table was instantly unnecessary.

Hope this helps a bit more. As I said, over simplifaction of examples masks what is actually being done and why.


All times are GMT. The time now is 04:06 AM.

Powered by vBulletin® Version 3.6.8
Copyright ©2000 - 2013, Jelsoft Enterprises Ltd.
Search Engine Optimization by vBSEO 3.1.0