 |
Account Login
|
 |
 |
Latest Articles
|
 |
 |
IRC Channel
|
 |
 |
Associates
|
 |
 |
Associates
|
 |
|
 |
 |
|
 |
09-08-2007, 12:08 AM
|
#1 (permalink)
|
|
Super Moderator
Join Date: Sep 2007
Posts: 165
Thanks: 0
|
When and Why to normalise?
As a rule of thumb, if a peice of data will ever be dynamically categorised then you need to normalise it, normalisation is normally done by creating an additional table and using it as a referance.
For example, say you have a table for:
movies:
id
name
genre
genre, could be many different varietys, thus needs to be normalised so something like:
movies:
id
name
genre (foreign_key genre.id)
Genre:
id
genre_name
Would be the normalised way of doing things, linking via id's. This is much easier to manipulate, and because you are using id's instead of repeating the genre it will save a hell of a lot of space in the long run!
|
|
|
|
11-16-2007, 04:23 AM
|
#2 (permalink)
|
|
The Wanderer
Join Date: Nov 2007
Location: Mumbai, India
Posts: 24
Thanks: 0
|
Great info.
Would like to add that normalization is mainly an architectural decision... many times while developing MIS systems we tend to de-normalize tables.
For reporting having one table with repeated or ambiguous values is better than having to join multiple tables every time a user generates a report... So for this we create nightly batch jobs that will read new records and update the reporting table...
|
|
|
|
12-01-2007, 07:51 AM
|
#3 (permalink)
|
|
The Contributor
Join Date: Apr 2005
Location: Kent, UK
Posts: 54
Thanks: 0
|
Not really a good example.
genre_name will be unique, so then adding a unique id to it in the seperate table forces a join when you wish to return the value of genre_name in a query, as well as storing unnecessary information. Besides when entering data into the movie table, what makes more sense - genre = 7 or genre = comedy ? (Although the user may never see the id of course, depending on your interface)
And you missed out the link table - movie_genre (mid, genre) where mid is movie's id and genre is the different genres that a single movie may belong to EG a single movie could be comedy, and action, and Japanese. As this would be a multi-valued column, this is one important reason why it requires normalisation.
table schema: movie--<movie_genre>--genre
Normalisation should be done based on functional dependancies to remove redundant repetative data and multi-valued data (although a little de-normalisation certainly helps at times).
|
|
|
|
12-01-2007, 03:28 PM
|
#4 (permalink)
|
|
The Reckoner
Join Date: Sep 2007
Posts: 437
Thanks: 22
|
Dr John, I think the join table was omitted in an effort to make the information easier to read and understand. I also don't see how you can say the example is wrong, maybe a little basic, but I wouldn't say it's wrong.
Storing the genre_id in the movies table and then moving the genres to their own table is normalisation, storing the genre name in the same table as the movie is not. If you were to do that, you'd be potentially repeating data time and time again.
As your quire rightly mentioned storing the id isn't much use to the user, i.e. "genre - 7" doesn't make much sense to me or you, but as you already stated, the user wouldn't see that and in all honesty, thats got nothing to do with normalisation and is therefore irrelevant to this topic.
"Normalisation should be done based on functional dependancies to remove redundant repetative data and multi-valued data"
So how is the genre name not repetitive data? If I were to add Matrix 1-3, I would have
Matrix - Action
Matrix 2 - Action
Matrix 3 - Action
Giving me 3 pieces of repeated data.
__________________
Any fool can write code that a computer can understand. Good programmers write code that humans can understand.
|
|
|
|
12-01-2007, 04:18 PM
|
#5 (permalink)
|
|
The Contributor
Join Date: Apr 2005
Location: Kent, UK
Posts: 54
Thanks: 0
|
You're misunderstanding what I meant, due in part to the example being a very simple one. Repetative data is easier to see in this example (clearly a VERY bad design that needs normalising)
id, name, address, age, employer_name, empl_address, emp_phone
2, fred, 1 low st, 34, jones and co, 10 main road, 123456
22, john, 10 high st, 49, jones and co, 10 main road, 123456
717,Jane, 12 high st, 21, smith ltd, 15 big avenue, 654321
123, Carole, 43 skimmingdish lane, 30, smithltd, 15 big avenue, 654321
98, james, 12 high st, 55, windrushers ltd, 1 skimmingdish lane, 321321
221, john, 19 high st, 29, jones and co, 10 main road, 123456
(assuming that one big company owns several little ones and has a centralised database, perhaps)
Here in the unnormalised table, we are repeatedly entering the same address and same phone number for some of the employers. Time after time.
We should normalise the table to have worker and employer as separate entities.
employer table (empID, employer_name, empl_address, emp_phone) PK= empId
worker table (workID, name, address, age, empID) PK = workID, FK = empID
and employer--<worker
Now, the attributes that are functionally dependant on the workID and those dependant on the empID (which wasn't included in the original table as there it wasn't necessary) are now grouped together.
Now we need only enter the empID next to the worker's data, avoiding the repetative entry of the employer's other data, with, as you will notice, the risk of a typo ocurring (it was a real error as well). With hundreds of workers, the scope for typos is huge, and a search on "smith ltd" would miss out some people.
The simple example used didn't have scope to show what was meant by repetative redundant data. I do accept that the link table was probably missed out to shorten the post. But I've always found that examples which are very simple can give a beginner the wrong impression.
The original example is actually an example of a look-up table, which is usually used on important searchable fields to avoid misspellings causing problems eg comedy and comedies and comeddy. (Look-up tables are not fully required by normalisation, they are a ocnvenience.) But the link table is necessary as a film can be in several different categories, which wasn't immediately obvious (matrix, action) and (matrix, sci-fi) for example. And thus normalisation was required when it was added.
(Addresses are a field that are often left unnormalised for convenience.)
Hope this shows things a bit more clearly. Functional dependancies are always the best way to normalise things, they are part of the early research into relationships.
|
|
|
|
|
The Following User Says Thank You to Dr John For This Useful Post:
|
|
12-01-2007, 04:40 PM
|
#6 (permalink)
|
|
The Reckoner
Join Date: Sep 2007
Posts: 437
Thanks: 22
|
Ah yes, I understand what you mean, but I still always assumed that the example given was normalisation, albeit only first normal form. Whereas the example you gave is classed as third normal form. Both examples are normalisation, they are just different levels of it.
* Edit *
I've been doing some reading and I think I finally understand what you mean. The example given isn't a true example of normalisation because the new table (the lookup table) doesn't provide anything other than a name related to an id. To be true normalisation (and not just a look up table) the table must also extend the data which has been extracted (in this case, the genre) by adding additional fields.
Have I got that right?
__________________
Any fool can write code that a computer can understand. Good programmers write code that humans can understand.
Last edited by Karl : 12-01-2007 at 06:04 PM.
|
|
|
|
12-01-2007, 11:38 PM
|
#7 (permalink)
|
|
The Contributor
Join Date: Apr 2005
Location: Kent, UK
Posts: 54
Thanks: 0
|
Yes, that's the sort of thing I was trying to get at. A look-up table is for convenience and usually only has the one attribute (it doesn't need an artificial ID, the genre were unique to begin with). But as no other attribute had a functional dependancy on genre, there was no real reason to remove it to another table, UNTIL the multi-valued part was considered (matrix, action) and (matrix, sci-fi), then it required the link table to list each movie and genre pair. Which was why I included that bit.
I personally always think in terms of entities, as I seem to be much faster at normalising things that way. A movie can belong to many genres. A genre can apply to many movies. A many to many relationship - normalise, and add link table.
movie--<movie_genre>--genre
Suppose a movie could only belong to one genre - a one to one relationship.
Use a look-up table for convenience, if you wish to. (But in this case, you'd end up with some awkward genres eg comedy-action-japanese-sc-fi, then you'd realise a movie did NOT belong to just one genre and spot the many to many, as searching for comedy should return this movie, and you've gone multi-valued, not allowed in relatiional DBs.
I've noticed it is very common for people to add an ID to a table when one of the attributes is already a unique identifier. When you add ID to genre you end up with three superkeys - id, genre and (id,genre), of which two are candidate keys - id, and genre, and you choose either of the candidate keys to be the primary key. But id was added as if there was no natural simple candidate key, when genre did that job by itself. So the id in the genre table was instantly unnecessary.
Hope this helps a bit more. As I said, over simplifaction of examples masks what is actually being done and why.
|
|
|
|
|
Currently Active Users Viewing This Thread: 1 (0 members and 1 guests)
|
|
|
| Thread Tools |
Search this Thread |
|
|
|
| Display Modes |
Linear Mode
|
Posting Rules
|
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts
HTML code is Off
|
|
|
|