You have probably heard the saying that nothing is ever lost on the Internet. It sounds almost like “manuscripts do not burn,” but here the meaning is entirely literal. Any search engine works by finding, processing, and storing the data that appears on the Internet. On the one hand, this is good, because we have access to that data. On the other hand, it is a problem, because the website we created is also nothing more than data. That makes it easy to scan the site, download all of its information, and then use it however one likes. This is done with parsing programs. The risk that your site will be subjected to this is always there. How should you treat it, and what can you do about it?
How parsing works
For the user, an online store (like any site) is a collection of pictures, text, and video, that is, everything intended for human perception. For a computer, a site is a collection of data intended to be processed and converted into a specific format. The extracted data is handled by programs (scripts) written in languages such as PHP, Perl, Ruby, or Python. They give each page of the site its structure. Depending on this structure, the page can be stored in a format such as .html, .xml, .sql, .txt, and others.
At first, scripts do their work on the local computer, while the site is being filled with content. Once the site appears on the Internet, search engine bots take over. They analyze and transform the content of the site so that a user sees in the search results a link that matches the query. This process is called parsing.
As long as the content of the site is available only to search engines for subsequent processing, everything is fine. The parser program extracts from the page exactly the information that the user needs. This is what brings visitors to the site from search. But the contents of the site can also be parsed for other purposes, in particular to obtain and reuse the content.
Parsing content for use on another site is a reality that every webshop owner has to deal with. After all, such a site consists of hundreds or thousands of product descriptions of the same type, technical characteristics, and other similar content. Unlike other types of sites, the content of an online store is formalized and uniform, so it is easy to reproduce. The user does not care which website is the source: he searches for a product and follows whichever link the search results or advertising offer.
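To see why formalized content is so easy to reproduce, here is a minimal sketch of a parser for uniform product cards, using only Python’s standard html.parser module. The markup, the class names (product, name, price), and the sample data are assumptions made up for this illustration; a real store’s pages will differ.

```python
from html.parser import HTMLParser

# Hypothetical product-card markup; the class names are assumptions for illustration.
SAMPLE_PAGE = """
<div class="product"><span class="name">Kettle A1</span><span class="price">19.90</span></div>
<div class="product"><span class="name">Toaster B2</span><span class="price">34.50</span></div>
"""


class ProductParser(HTMLParser):
    """Collects name/price pairs from uniformly structured product cards."""

    def __init__(self):
        super().__init__()
        self.products = []   # one dict per product card
        self._field = None   # field whose text we are currently inside

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "div" and css_class == "product":
            self.products.append({})          # a new card starts
        elif tag == "span" and css_class in ("name", "price") and self.products:
            self._field = css_class           # remember which field follows

    def handle_data(self, data):
        if self._field:
            self.products[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


def extract_products(html):
    """Return a list of {'name': ..., 'price': ...} dicts found in the page."""
    parser = ProductParser()
    parser.feed(html)
    return parser.products
```

Because every card follows the same pattern, a few dozen lines are enough to turn the whole catalog into structured records, which is exactly what makes an online store an attractive target.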
You can condemn the use of someone else’s content as much as you like, but we live in a world where the notion of intellectual property has changed a lot. In addition, technical descriptions and product characteristics are content that it would be foolish to rewrite in other words, so it is not anyone’s property. But creating a complete copy of someone else’s shop is already theft, as is the partial use of content to which you hold the rights. By the way, there is another term, grabbing (from the English “grab”, i.e. to seize, to intercept): the collection of information by certain parameters.
We all engage in grabbing when we download anything from torrents. But we do not consider ourselves thieves until we are caught making commercial use of someone else’s intellectual property.
How parsing gets in the way of your work
Parsing (grabbing) of content creates several types of problems: technical, commercial, and psychological.
The technical problem is that bots and scanners are useless traffic that increases the load on the server. Sometimes the site statistics show a burst of visits and a “cosmic” viewing depth, but this is hardly a reason for joy. Most likely, it is the work of scripts, scanners, and bots. If your hosting plan has limits, exceeding the load is a problem and a real reason to start investigating who is parsing you.
The commercial problem is obvious: if a competitor launches a parser, collects the product database from your online store, and starts selling the same goods at a lower price, you lose customers.
The psychological problem can be described by a single exclamation: “Well, how can this be!” Any kind of theft brings out exactly this emotion, especially when we do not know whether it will be possible to punish the culprit.
Let’s start by looking at things realistically. So far there is no way to block and punish parsing and grabbing with one hundred percent effectiveness. Therefore, creating an ordinary store with ordinary content is always a risk. If your store is popular, parasitic traffic should serve as a reason to develop the business in a direction that is easier to protect. The human factor plays an important role here: it lets you create an intellectual product that can still be copied, but whose authorship is easy to prove, which should at least discourage thieves.
And yet, what if you have decided to fight parsing? There are several types of measures: technical, legal, and psychological.
Specialized forums have discussion threads on how to fight parsers by technical means. The good news is that solutions exist; the bad news is that in the long term they can do more harm than good.
The simplest working method is to work out which IP address is parsing you and block its access. To do this, you need a log table that records user data and the time of each page request. One way to identify a parser is the time between requests. If the site is accessed too often and too regularly (that is, in 80% of the requests the interval deviates from the average interval by less than 10 seconds), it is a parser. Another way is to check whether the client downloads page assets such as images or CSS styles. Bots, including useful ones, are unlikely to download them.
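The interval rule above can be sketched roughly as follows. This is only one possible reading of the heuristic: the log is assumed to be a list of (ip, unix_timestamp) pairs, and the function name, the parameter names, and the min_requests cut-off are assumptions introduced for this example, not part of the original rule.

```python
from collections import defaultdict
from statistics import mean


def suspicious_ips(log_rows, max_deviation=10.0, share=0.8, min_requests=5):
    """Flag IPs whose request intervals are suspiciously regular.

    log_rows: iterable of (ip, unix_timestamp) pairs taken from the access log.
    An IP is flagged when at least `share` of its intervals deviate from
    that IP's average interval by less than `max_deviation` seconds.
    """
    times_by_ip = defaultdict(list)
    for ip, ts in log_rows:
        times_by_ip[ip].append(ts)

    flagged = []
    for ip, times in times_by_ip.items():
        if len(times) < min_requests:
            continue  # too little data to judge (assumed cut-off)
        times.sort()
        deltas = [later - earlier for earlier, later in zip(times, times[1:])]
        average = mean(deltas)
        regular = sum(1 for d in deltas if abs(d - average) < max_deviation)
        if regular / len(deltas) >= share:
            flagged.append(ip)
    return sorted(flagged)
```

Note that a person who clicks through pages at a steady pace would also match this rule, so the result is better treated as a list of candidates for further checks than as a ready-made block list.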
The next step is to determine whether the bot visiting your resource is useful or useless. This is difficult, since many bots disguise themselves as normal search bots or browsers. Such bots can be identified only by a combination of factors, and this requires developing custom software.
The problem here is that if you do not take various factors into account, you can block a search engine bot or some other harmless bot, since not all of them identify themselves correctly in the user-agent. And the pests have now learned to reduce their request frequency as a disguise.
IP blocking is generally reasonable only in the most obviously malicious cases, because IP addresses can be dynamically allocated. However, adding a limit on the frequency of requests and on the number of visits will not be a superfluous measure. And all this, we recall, applies to only one method.
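A limit on request frequency can be sketched as a sliding-window counter. This is a minimal in-memory illustration, not a production setup: the class name, the default limits, and the idea of passing the current time explicitly are assumptions made for the example, and client_id can be an IP address or any other identifier you trust more.

```python
from collections import defaultdict, deque


class RateLimiter:
    """Allows at most max_requests per client within a sliding time window."""

    def __init__(self, max_requests=30, window_seconds=60.0):
        self.max_requests = max_requests      # assumed default, tune per site
        self.window_seconds = window_seconds  # assumed default, tune per site
        self._hits = defaultdict(deque)       # client_id -> recent request times

    def allow(self, client_id, now):
        """Return True if the request at time `now` (seconds) may proceed."""
        hits = self._hits[client_id]
        # Forget requests that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # over the limit: delay, challenge, or reject
        hits.append(now)
        return True
```

For example, with RateLimiter(max_requests=3, window_seconds=10.0), a fourth request from the same client inside ten seconds is refused, while other clients are unaffected. Refused requests can be slowed down or sent to a captcha instead of being blocked outright, which limits the damage if a harmless visitor is caught.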
Another method is to use services that protect against DDoS attacks. These services try to determine how heavily your site is loaded. With a high number of connections per second, the parser’s activity is treated as similar to a DDoS attack. The visitor then gets a delay and a warning message on the screen. With this approach, we assume that the parser creates load in several threads and does not pause between page requests. In a number of cases this can help, but only against the simplest parsing bots.
The fourth method is the captcha that everyone dislikes. The method could be considered effective if it were not for two drawbacks:
- captcha annoys your users and reduces their loyalty to your resource;
- there are services for
So, the results of using it are questionable, given the possible loss of interest in your site.
It is more useful to use reCAPTCHA, as it tries to determine whether the visitor browsing the site is a person or not.
More sophisticated approaches require more effort: first, you need to somehow determine that the site is being visited by a parser, then identify it, and then either “allow” or “forbid” its work. In general, using technical measures resembles a person’s struggle with a literary or mythological adversary: the Lernaean Hydra, for example, or windmills. You can try, but the likelihood of harming yourself is higher.
Finally, there is a method that helps at least partially, without harm and without regular effort. If we cannot protect ourselves from bots, we can at least make our content difficult to use. Its main value may lie in the product photos. Put watermarks that are difficult to remove on your images. Removing them automatically is hard, and the need to restore the original image greatly complicates its use on someone else’s resource.
An important organizational measure is to get new pages of the site indexed quickly, before anyone has managed to parse them. Search for “authorship in Google and Yandex” and use every available method to notify search bots about new pages. Naturally, this works only if your content is original.
Monitoring the Internet for borrowings of your materials (search queries, anti-plagiarism systems) can open your eyes to the fact that your content has been copied. If you have established that it was borrowed, you can try to negotiate with the owner of the other resource. Depending on how the negotiations go, you may also end up filing legal claims.
It is important to understand the properties of information on the Internet:
- Information spreads quickly, so it can be technically difficult to prove that your resource is the primary source;
- Information on the Internet is often not original authored material but various compilations of it. In this respect, copyright can be powerless;
- Legal issues concerning the Internet are not well developed, and additional judicial red tape may only complicate matters rather than resolve them, especially in your favor;
- There are many legal loopholes that are used by such giants of the IT industry as search engines. It is quite possible that those who collect your content will use them too.
- You can file claims over the illegal use of photographs and other content whose copyright is easy to prove. Start right away with a complaint to the search engines. At a minimum, this will restore your site’s advantage as the original source. But a complaint usually results in a narrow penalty: for example, Google may act against only a single picture.
How far you should go in organizing countermeasures against parsing depends on the specific situation and on what is at stake. One justified reason to fight parsers without fail is when they try to collect personal data from your resource. A leak of such data discredits your resource. The loss of trust, as a rule, immediately affects both traffic and profit. In some cases, it can turn into a confrontation with the authorities.
However, in most cases, litigation over borrowed content does not lead to anything good. It requires time and attention, and the result may not pay off at all.
The remaining, psychological measures can be chosen based on your attitude and on what is expedient. Let’s draw an analogy with “good cop, bad cop”. If you are strict, you cannot do without technical and legal measures, whose purpose is to show that it is better to find another site than to fight with you.
Or you may think that it is easier to come to terms with parser operators. The bottom line is that if the information on your resource is in demand, there will always be those who wish to parse it. The parser will collect all the necessary information and build an export file. This can be in Excel, XML, or YML format. For reference, a YML file is a document that Yandex processes for its Market service. As the saying goes, if you cannot fight a phenomenon, you can organize it.
Offer affiliate and cooperation programs, set up a data export interface, and make a profit. Yesterday’s customers of the parsers will become your customers, and you can come to an agreement with them on mutually beneficial terms. Of course, this will work only in certain market segments and with certain types of business. However, the partner programs you organize can increase your number of visitors or sales.
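A data export interface can start as something very small. The sketch below builds a generic XML export of a product list with Python’s standard xml.etree.ElementTree module. The element names (catalog, offer, name, price) and the product fields are assumptions chosen for this illustration; this is not the actual YML schema, which has its own required structure.

```python
import xml.etree.ElementTree as ET


def build_export(products):
    """Serialize products (dicts with id, name, price) into an XML string."""
    root = ET.Element("catalog")  # hypothetical root element name
    for product in products:
        offer = ET.SubElement(root, "offer", id=str(product["id"]))
        ET.SubElement(offer, "name").text = product["name"]
        # Prices are written with two decimal places for consistency.
        ET.SubElement(offer, "price").text = f'{product["price"]:.2f}'
    return ET.tostring(root, encoding="unicode")
```

Partners who receive such a file no longer need to scrape your pages, and you decide which fields they get and on what terms.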
- Any obstacle to parsing can be circumvented; it is only a question of how badly someone needs your information.
- The cost of obtaining your data for the other party depends on how difficult it is to get, that is, on the price you pay to protect yourself from parsing. So, evaluate the feasibility and the necessary complexity of the obstacles based on the value of the content you have.
- The cost of your protection may include not only the price of setting up a system to counteract parsing, but also the risks of it working incorrectly. Most SEO investments may not pay off at all if the protection system blocks search bots. This is the worst-case scenario. Users will also not be delighted by slower pages, accidental blocks, and the need to enter a captcha.
- Problems with search engines may cost more than the data you are trying to protect. Maybe it makes sense to think about how to earn more from your resource and to accept losses from parsing as a reality of our time, like online piracy?
In the near future, the era of the semantic web, which has been actively discussed for so long, may arrive. In that future, parsers will be completely different, and they will create completely different problems. In the meantime, while we are still on the threshold of the semantic web, in most cases it makes no sense to create additional difficulties for ourselves.