Remove MySQL duplicate and spam content from MySQL table:
1) Remove duplicate content:
Create a tmp table and create a unique index to the column. then copy the unique content form your table to tmp table.
create table tmp like `table`;
ALTER TABLE `tmp` ADD UNIQUE INDEX(text1, text2, text3, text4, text5, text6); insert IGNORE into `tmp` select * from `table`;
delete duplicate content from your main table:
where id not in (select id from tmp);
We can also do this with the main table it self, but if you data is too large it will take some time so i prefer above
ALTER IGNORE TABLE
ADD UNIQUE INDEX (mycolumn1, mycolumn2);
Once the process is complete remove unique index from the column.
2) Remove spam content:
i) Identify the spam first, for me spam contents are those who have certain keywords in their text. These can be links, foul words,
e.t.c so we can remove those lines by.
where `mycolumn` REGEXP 'foul word|http://|buy|purchase'
ii) Remove non english charectors like chinese, russians e.t.c
where `mycolumn` != CONVERT(myclumn USING ASCII)
iii)Remove very records which doesnt meet the text length standards, for examples comments column with just 3 char is not a valid comments.
Delete from `table` where LENGTH(mycloumn)<5;