![]() |
Database Basics, Tips and Best Practices
As my third entry in the GFY Educational Series, I'd like to try and shed some light on another important topic which people often have difficulty with - database management. For most websites, the database can be considered the "core" or single-most important piece to the website puzzle; it is also one of the most-neglected. Most programmers spend their time coding; they can build a simple, functional database that interacts with their code and that's all they need. Unless they are specifically asked by their clients to do so, they don't spend time on database optimization or analysis - they delivered a working product and wish to be paid and move on to the next job. It may also be that they simply lack real-world experience. For example, you cannot realistically expect someone who has never worked with large-scale websites to understand some of the bottlenecks that only occur on million-record databases for a site getting millions of visitors.
As this article is aimed at a broad audience, who may or may not have deep technical knowledge, I will attempt to cover the more common and simple tips and best practices. If the demand if there, I would be happy to cover more advanced topics such as Clusters, Detailed Storage Engine Comparisons, Hardware and OS Benchmarking, RAID Configurations, Constraints, Triggers, Procedures, Events, Fulltext Indexing, Transactions and the like in a future article. Additionally, while much of the below subject matter relates to most database softwares, for my examples I will be referring to MySQL - as I believe PHP/MySQL to be the most popular development environment for the adult community. 1) INDEXES Without a doubt, this is the most common reason for slow applications. I have personally seen popular softwares amongst the GFY community, built by fellow GFYers, containing MySQL tables that lack indexes completely. An index, in database jargon, can be thought of like a "hint" system. When a field is indexed, an analysis of the data contained within is put into a quick-access "list". When a query is performed, if an index is available for the search query parts, it may increase performance by avoiding the need to scan the entire table of data. There are many types of indexes available, but the main three are: Primary Indexes, Unique Indexes and Non-Unique (default) Indexes. You can also utilize multiple-column indexes, which can boost performance further in specific cases. Primary Index The Primary Index type, or "Primary Key" as it is more often referred to, uniquely identifies each record in a table. For example, if you have a table containing user information, this field might be named "user_id" and be set to auto_increment. One common misunderstanding is that a Primary Key MUST be auto_incrementing; this is false - the key simply must be unique. If you wished, you could use an actual username, as long as the intent is for no two records to have the same username value, it may be defined as a Primary Key. Your Primary Key can also be composed of more than one field; this can be useful in eliminating the need for an extra, otherwise useless ID column. For example, let's say you have a customer table and a product table. In your application, every customer can have a custom limit for the amount of each product they can order. Thus, you have a three-column "link" table: CustID,ProdID,CustLimit. Since you know that for every customer, his limit can vary per product BUT he will only ever have one limit per product, you can use a multiple-column (or composite) primary key here on the first two fields (CustID,ProdID). That Primary Key can then be called upon to reference any individual row, without need for a space-wasting auto_increment ID column. Note: when using Primary Keys, you do not need to add an additional "regular" index on the field. In the case of composite primary keys, you may require additional indexes if calling the keys out of turn. For example, "SELECT CustLimit FROM table WHERE CustID=1 AND ProdID=1" would be fine, but "SELECT CustLimit FROM table WHERE ProdID=1" would require an individual index on ProdID. Unique Indexes Unique indexes are extremely simple types of keys. For columns where there must be no duplicate values, you should make it unique. For example, say in your customer table, you have a field for their email address and you do not wish that any customer record can have a duplicate address - that is a prime target for a Unique Index. Non-Unique Indexes These are your regular, run-of-the-mill indexes. They are best put to use on fields which are used as search criteria or for ordering of results. For example, let's say you have a field called age or birthdate. When a search is performed, perhaps you need to find only those records between the age of 25 and 40. Your age or birthdate field needs an index. Another example, is you select our records from your table and wish to order them by last name. The last name column should be indexed. There is a special kind of index for text-type fields, but that is a little more advanced and will be covered in a future article. As with the above, you can also create composite/multi-column non-unique indexes. Let's say that you have a table where two of the fields are named username and password. You perhaps perform this query: "SELECT something FROM table WHERE username='user' and password='whatever'". You can easily make sure that both username and password have an individual index, but for higher performance, you can also create an index on both fields (username,password). Just remember that adding too many columns to a composite index may defeat the purpose and always remember indexes are read from left-to-right. CONTINUED IN NEXT POST... |
2) TABLE STRUCTURES
While opinions on this topic vary greatly and everyone has their own way of doing things, I'm going to present what works best for me. One mistake I see all the time, is programmers create their table structures on-the-fly, without proper thought put into them. Basically, they come up with tables and fields as their code dictates. This is the easiest way to end up with bad, redundant structures and bloated field types. Naming Conventions I believe in simple, descriptive and unique names for tables and columns. Here is basically, how I go about it:
The advantages to the above naming scheme is in the fact everything is named uniquely; it helps when doing combined multi-server backups, extensive JOIN queries and avoids any conflicts when importing 3rd-party db tables. I can't tell you how many times I've been frustrated when debugging large JOIN queries in other people's code, where they have columns with the same name in different tables which are not related/referential. Choosing the Right Field Type This is another one, done all the time, that really gets to me. It's nothing more than laziness, pure and simple. When constructing a database table, your goal should be to make it as "tight" and compact as possible. To that end, you should always pick the right field type for the job. For example, let's say you have a field which stores a value that will always be 2 letters. The right type in this case would be CHAR(2) and not VARCHAR(255). When you create a varchar field, the default in most mysql implementations will be 255 characters. That's a lot of wasted space if you only have to store 2 letters, isn't it? Integer fields are no different. If you plan to store a value between 100 and 200, you shouldn't have a regular integer field; you should use TINYINT instead. I highly recommend reading up on the available data types and their limitations for your particular database and version. Many arguments revolve around the use of ENUM as a field type. This special field type allows you to specify specific values and only these values are accepted. People's main argument is, what if you need to add a new value? ALTERing the table structure is always something to avoid, when possible; as it could lead to downtime, or worse, table corruption. I find, however, they can greatly boost performance when used properly and they enforce data integrity to boot. A good example, is a field such as gender. Perhaps your site only allows the following genders: male, female, tv/ts/tg. You don't expect that to change in the future. Here, I would use an ENUM and specify those three values; versus creating a separate table containing one record per gender, along with an ID, which is then reference in your main table. In this case, you save yourself both an extra table and perhaps a JOIN clause in your queries. Let's say, however, that your column in question was "Interests". You currently have 20 different values possible, but that may change frequently in the future. You should avoid using an ENUM here. Not Null Whenever possible, you should specify your columns as NOT NULL. Unless you have good reason to use NULL (there are a few cases), NOT NULL saves space and will make your queries less complicated. Primary Keys Aside from when used in "link" or "association" tables, you should always try and have an integer type of Primary Key field. This will save space and not require you to perform updates should data change. For example, if you were using username and the person was allowed to change their username - you need to update ti across all tables it is used in. Unsigned or Signed? Unless you need negative values, you should always make sure your numeric type fields are UNSIGNED. This saves space, which if you haven't figured out yet, is a good thing. Left-to-Right You should always try to build out your tables from left-to-right, as that is the order in which they are read. Your primary key/keys should always come first, followed by any reference "link" fields, followed by fields in increasing order of byte size. Related data, different table? Something I have found useful, is to split up tables based on what happens to the fields contained within. My general rule is, if a field is not indexed, ship it off into another table. Keep your main tables composed solely of columns you plan to use in WHERE clauses, JOINs and ORDER BYs. For example, take a profile table on a dating site. Perhaps there are a few text fields such as "about me", "what I'm seeking", "headline". This particular site does not allow searching within these fields. There is no reason to then keep them in your main profile table - move them out to a table whose primary key is your user_id, and select them when needed. CONTINUED IN NEXT POST... |
3) QUERIES
Once you have a solid table structure and proper indexing, your battle is nearly won. Nearly, but not quite. Badly-formed queries are the second most common case for slow performance after the lack of indexing. Often, there are numerous methods for achieving the same result set. This does not mean they are all the right way to query your database. In fact, a single malformed query can even crash the most powerful of databases if executed; such is the power they wield. Using Explain In MySQL, the EXPLAIN command is one of the most important tools at your disposal. Unfortunately, it won't do much good until you have a realistic data environment; to that end, if your site is in preproduction, you should try and write a quick script to populate the tables with as good a sampling of actual expected data as you can. Once you have a realistic demo environment, EXPLAIN every single one of your SELECT queries. ALL of them. Not only will it help you find potential typos or syntax errors, but it will analyze how the MySQL Optimizer sees your query and which Indexes it will use to perform the query. Generally, you want to make sure that MySQL is using the indexes you want it to use and that results are simple and short. You also want to avoid using "Filesort" or "Temporary Tables" (under the 'extra' column of the EXPLAIN command) whenever possible; there are some cases where it's unavoidable though. Sometimes, EXPLAIN will be using the wrong index - when this happens, consider if adding a new multi-column index can fix things. If not, you may possibly need FORCE INDEX in your query to help "guide" the optimizer (it isn't perfect). Lastly, if you just cannot get the result you want, it may be time to examine other alternatives such as a different table structure or breaking your super-query into multiple, simpler queries. Trial and error is the absolute key to all of this. Joins and Sub-queries When you need to query multiple tables at once, you usually end up performing a JOIN. There are different types that suit different situations, but for the most part, you will use the default JOIN (or INNER JOIN); where you virtually join two tables via a matching key. Rows that do not match are excluded. Try and make sure the column you perform a join on is indexed and of the same data type in both tables (integer if possible). If it's a primary key, even better. JOINs can get very complicated very quickly; a good rule of thumb is, if you can no longer even understand the query you have wrote by looking at it - you have made things too complicated and performance may suffer. As stated above, your best friend for diagnosing wayward JOINs is EXPLAIN. Sub-queries, on the other hand, should be used sparingly and in specific situations only. Otherwise, when selecting large result sets, you may end up performing thousands of queries without even realizing it. I typically only use them to compare against a singular column in a large table, or to select out a certain field or fields from one of my non-indexed information tables (see the profile example above). Select * From This is done all the time; when in fact, it should rarely (or never) be used. The obvious problem is, many times, you do not need to select all data a table contains; you only require certain fields. Selecting out additional data for no reason, is a waste of system resources and can contribute to network or delivery delays. The second reason, is with regards to your code. I know that when I am writing code, it is much easier to work when I have the available field names right in front of me, in the query. It will be for other people as well; especially those who may not have database access. Readability is a key component of efficient coding and using Select * goes against that completely. Order By and Rand() When trying to select a random group of results, many people use the ORDER BY rand() bit of SQL to achieve the desired result. Depending on your data, this can utilize a lot of processing power and thus become slow. If you know the number of rows in your table, you can use a random value less than that as an offset - speeding up your query greatly. If you need multiple results, however, that method won't work unless you select the random results out one by one using the previous method. In the right circumstances, performing 10 separate queries can be much faster than retrieving 10 random records from a very large table with a single query. Trial and error, EXPLAIN and you'll be fine. Limit 1 Sometimes, you know you are seeking for just one specific record; or you only want to fetch a single record. In these cases, it's best to add a LIMIT 1 clause at the end of your query. This will boost performance by stopping immediately after the first match it found, as opposed to scanning all possible results. It's the little things like this that can add up to big performance savings. 4) MYISAM & INNODB INNODB is the default MySQL database engine. It does have multiple drawbacks and while engine comparisons can become quite lengthy and argumentative, I will simply tell you instead when I find myself using them. I recommend using INNODB tables on very small, compact tables that are rarely changed (INSERTED, UPDATED, DELETED). For such cases, I generally only end up using it for reference tables (for example, the interests table described up above). For everything else, I recommend InnoDB. The key to using InnoDB successfully is to configure its settings properly. If you leave things to the default, you may find InnoDB causes a lot more harm than good. Finely-tuned (and that goes for all your my.cnf settings in general), you'll be able to store and query much larger tables at much larger speeds. You'll be able to better utilize indexes, joins and everything you have grown to love (or hate) about SQL. Please do remember, that if your database becomes corrupted (generally due to a crash), upon restarting the InnoDB tables will attempt to repair themselves automatically. The same is not true for INNODB; make sure you don't forget about them or as soon as a query hits one of those tables, you may find yourself in trouble again. As for the actual my.cnf settings file, it truly is a case-by-case basis. Hardware and software both come into play heavily, so my best advice is, if you aren't sure of what you are doing with it - hire someone to help. 5) CONCLUSION There is so much to this topic, that I could go on for a year. The above should give you a broad overview, with a few specific examples and recommendations that may help you out. As mentioned, I'd be happy to cover the more advanced topics in a future article if the demand is there. I didn't want this particular article to alienate 95% of you reading it, though :) For the record, I have dealt with multiple projects involving MySQL and PostgreSQL (I have actually worked for a time with one of the developers of PostgreSQL, attempting to implement a few optimizations into their engine), involving millions upon millions of records and some very complex queries. Clusters/replication, advanced procedures/triggers, constraints?that said, I still learn new tricks practically every week. You will never stop learning a better way to do something. If anyone has any questions, feel free to ask away below and I'll answer them to the best of my ability. |
Awesome read :D! This might make me want to go back and play with DBs...
|
InnoDB can suck my left nut..
|
Quote:
|
Quote:
tables: models models_films films models_films links models to the films they are in. You want to delete a model from the models table.... In INNODB, you have to delete the model entry in models table plus all the entries in models_films that contains the model_id. With all the re-indexing that mysql does behind the scenes on data delete In InnoDB, you simply delete the model, and if the relationships are correctly set, all the entries in models_films will auto-delete too. When you've got like 10 tables that are relationally-linked, this simple shit is priceless.... Not to mention rolling back on a transaction. So, no, myisam can suck both my nuts ;) Well written series Varius :thumbsup |
Thank you Varius for that great article.:thumbsup
|
Quote:
For example, full-text indexes are not possible in InnoDB yet, but they are in INNODB. INNODB is for high-read volume with little to no update volume (update,insert,delete). InnoDB is for high-update volume, as table vs. row-level locking make a huge difference. |
Fuck me dude, that was pure aces, good job! :thumbsup
|
Another great addition to our Educational Series lineup. Keith, I can't thank you enough for being so kind to contribute again. :)
|
thank you very much...
|
No one has any questions? Maybe I should have mentioned that databases killed show attendance or something to get some replies :winkwink:
|
nice intro to databases, thanks for sharing
|
Quote:
|
Quote:
Anyone have other questions before I venture off to enjoy my bday weekend ? |
Great write up again G...
|
Quote:
|
Quote:
|
Great article, databases have to be the most overlooked component of websites ever.
|
Quote:
Quote:
Of course, if you are just starting off with it, be prepared to go through a lot of late nights when slaves fail or corrupt for one reason or another until you get used to re-syncing them with the Master without any downtime at all. Overall though, yes, I have used it and would use it again and if it fits your situation, definitely use it. |
One of the best GFY posts i've saw this year (which keeps getting lower year by year sadly)
|
Greate writeup! I'm not sure if you mentioned this but one thing to note about InnoDB that you would want to set is doing InnoDB per table. This way all of your tables that are set to InnoDB are not jumbled up in 1 innodb file. This is also a problem when you want to delete data from a table it will not shrink the InnoDB file. So doing InnoDB per table is a good idea.
|
That's very useful read, I will read it all carefully for sure.
|
Where can I read something about database design?
I know basic programming and I can write a quick app, but for some reason I cannot figure out how to properly design a database. When to use different tables,how many fields to use on a single table etc... |
Quote:
As with INNODB vs. InnoDB though, "one file" vs. "file per table" each have their own set of Pros and Cons and should be used depending on your situation. "file per table" PROS:
"file per table" CONS:
So, if you have a database that doesn't grow and shrink very often (say you almost never delete records), but which only grows and grows, you would be better off with using a single large file. If you have a ton of small tables, you'll also likely be better off with a single file. If your database does do a lot of inserts/deletes and doesn't have hundreds of tables, then yes, use "file per table". |
Quote:
Honestly, I wouldn't know where to point you for the best articles for starting off, I would suggest to search for tutorials like "database normalization", "introduction to database design", "database best practices", etc...sites like DevShed, dev.mysql.com and so forth should have what you seek. Or, post any specific questions you have here or join other database/mysql forums/mailing lists and post them there. |
Happy B-Day!
|
great article!!!
|
I detest InnoDB due to its complexity and management headaches. I'll take INNODB for being cut and dry.
|
Awesome thank you!
|
Thanks for the insite
|
excellent article. very well written and very informative. thank you very much and keep posting such nice articles.
|
bump bump
|
I appreciate the 101!
|
Yoooo man .... fabulous work done by you !!! very hot post :)
|
thanks, good stuff.
|
gfelife is da best
Quote:
|
Great read, thanks for share
|
I have been trying to go through this thread for last few days and I'm able to do some work on DB after reading all the contents including replies. Thanks to everyone and I want to add something worth following to this content but I think this isn't a right time for me.
|
Nice and useful article. Thanks
|
Quote:
dont want more |
Your points cannot be stressed enough especially for sites that expect to scale to the hundreds of thousands and millions of users.
Great tidbits. |
thanks for useful information
|
So much vital information, great article
|
Tnx for the great tips! I'm looking forward for more articles.
|
This is HUGE man, thanks!
|
Excellent post, thanks a lot !!
|
thank you for sharing all this with us!
|
Killer article.
|
All times are GMT -7. The time now is 05:38 PM. |
Powered by vBulletin® Version 3.8.8
Copyright ©2000 - 2025, vBulletin Solutions, Inc.
©2000-, AI Media Network Inc