Re: [BULK] Re: [WebDNA] set HTTP-Status Code from webdna

This WebDNA talk-list message, posted by Tom Duke in 2016, keeps its original formatting.
Hi all,

Thought I would add my approach to 'pretty' URLs using mod_rewrite rather than routing through an error document.

Basically, everything except images and the folders/files that I specify is routed to 'parser.tmpl'. That template then parses the URL, and you can then search databases, include files, etc.
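
As a rough, untested illustration (not the actual parser.tmpl), a minimal parser template might look something like this, assuming the rewrite rule below passes the original URI in a 'requestedurl' parameter and that pages are listed in a hypothetical pages.db with 'path' and 'template' fields:

[!] parser.tmpl - hypothetical sketch only [/!]
[search db=pages.db&eqpathdatarq=[requestedurl]]
[showif [numfound]>0]
[founditems][include file=[template]][/founditems]
[/showif]
[hideif [numfound]>0]
[include file=/404.tmpl]
[/hideif]
[/search]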

Here's a sample htaccess file with all the mod_rewrite stuff and some other things that people might find useful.

- Tom


PS. This is a great resource on what can be done using the htaccess file:
https://github.com/h5bp/html5-boilerplate/blob/master/dist/.htaccess

# Better website experience for IE
Header set X-UA-Compatible "IE=edge"
<FilesMatch "\.(appcache|crx|css|eot|gif|htc|ico|jpe?g|js|m4a|m4v|manifest|mp4|oex|oga|ogg|ogv|otf|pdf|png|safariextz|svgz?|ttf|vcf|webapp|webm|webp|woff|xml|xpi)$">
Header unset X-UA-Compatible
</FilesMatch>

DirectoryIndex index.html index.tmpl

# Proper MIME types for all files
AddType application/javascript                 js
AddType application/json                       json
AddType video/mp4                              mp4 m4v f4v f4p
AddType video/x-flv                            flv
AddType application/font-woff                  woff
AddType application/vnd.ms-fontobject          eot
AddType application/x-font-ttf                 ttc ttf
AddType font/opentype                          otf
AddType image/svg+xml                          svg svgz
AddEncoding gzip                               svgz
AddType application/x-shockwave-flash          swf
AddType application/xml                        atom rdf rss xml
AddType image/x-icon                           ico
AddType text/vtt                               vtt
AddType text/x-component                       htc
AddType text/x-vcard                           vcf
AddType text/csv                               csv

# UTF-8 encoding
AddDefaultCharset utf-8
AddCharset utf-8 .atom .css .js .json .rss .vtt .webapp .xml

# Security - Block access to directories without a default document
Options -Indexes

# Block access to backup and source files
<FilesMatch "(^#.*#|\.(bak|config|dist|fla|inc|ini|log|psd|sh|sql|sw[op])|~)$">
Order allow,deny
Deny from all
Satisfy All
</FilesMatch>

# Rewrite engine
RewriteEngine On

# Redirect to Main 'www' Domain
RewriteCond %{HTTP_HOST} ^yourdomain\.com [NC]
RewriteRule ^(.*)$ http://www.yourdomain.com/$1 [R=301,NC,L]

# Exclude these directories and files from rewrite
RewriteRule ^(admin|otherdirectories|parser\.tmpl|robots\.txt)($|/) - [L]

# Exclude images from rewrite
RewriteCond %{REQUEST_URI} !\.(gif|jp?g|png|css|ico) [NC]

# Route everything else through parser.tmpl
RewriteRule . /parser.tmpl?requestedurl=%{REQUEST_URI}&query=%{QUERY_STRING}&serverport=%{SERVER_PORT} [L]

==============================================
Digital Revolutionaries
1st Floor, Castleriver House
14-15 Parliament Street
Temple Bar, Dublin 2
Ireland
----------------------------------------------
[t]: + 353 1 4403907
[e]: tom@revolutionaries.ie
[w]: http://www.revolutionaries.ie/
==============================================

On 24 March 2016 at 17:50, <christophe.billiottet@webdna.us> wrote:

What about using [referrer] to allow your customers to navigate your website but disallow bookmarking and outside links? You could also use [session] to limit the navigation to X minutes or Y pages, even for bots, then "kick" the visitor out.


- chris
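
A bare-bones, untested sketch of the [referrer] idea above (the domain and file names are placeholders, not anyone's production code):

[!] hypothetical referrer gate - show content only to visitors navigating from our own pages [/!]
[showif [referrer]^yourdomain.com]
[include file=/real-content.tmpl]
[/showif]
[hideif [referrer]^yourdomain.com]
[!] empty referrer (bookmark) or an outside link ends up here [/!]
[include file=/no-outside-links.tmpl]
[/hideif]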




> On Mar 24, 2016, at 20:30, Brian Burton <brian@burtons.com> wrote:
>
> Backstory: the site in question is a replacement parts business and has hundreds of thousands of pages of cross-reference material, all stored in databases and generated as needed. Competitors and dealers that carry competitors' brand parts seem to think that copying our cross reference is easier than creating their own (it would be), so code was written to block this.
>
> YES, I KNOW that if they are determined, they will find a way around my blockades (I've seen quite a few variations on this: Tor, AWS, other VPNs…)
>
> Solution: looking at the stats for the average use of the website, we found that 95% of the site traffic visited 14 pages or less. So…
> I have a visitors.db. The system logs all page requests tracked by IP address, and after a set amount (more than 14 pages, but still a pretty low number) starts showing visitors a nice Page Limit Exceeded page instead of what they were crawling through. After an unreasonable number of pages I just 404 them out to save server time and bandwidth. The count resets at midnight, because I'm far too lazy to track 24 hours since the first or last page request (per IP). In some cases, when I'm feeling particularly mischievous, once a bot is detected I start feeding them fake info :D
>
> Here's the Visitors.db header: (not sure if it will help, but it is what it is)
> VID  IPadd  ipperm  ipname  visitdate  pagecount  starttime  endtime  domain  firstpage  lastpage  browtype  lastsku  partner  linkin  page9  page8  page7  page6  page5  page4  page3  page2  page1
>
>
> All the code that does the tracking and counting and map/reduction to store stats and stuff is proprietary (sorry), but I'll see what (if anything) I can share a bit later, and try to write it up as a blog post or something.
>
> -Brian B. Burton
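
For anyone wanting to experiment with the same general idea, a stripped-down, untested sketch of the per-IP counting step might look roughly like this (field names follow the header above; the threshold and file names are placeholders, and this is not Brian's proprietary code):

[!] hypothetical per-IP page counter - run near the top of each tracked page [/!]
[text]hits=0[/text][text]pages=0[/text]
[search db=visitors.db&eqIPadddatarq=[ipaddress]&eqvisitdatedatarq=[date]]
[text]hits=[numfound][/text]
[founditems][text]pages=[pagecount][/text][/founditems]
[/search]
[showif [hits]=0]
[!] first request today from this IP - new row per day, so the count resets at midnight [/!]
[append db=visitors.db]IPadd=[ipaddress]&visitdate=[date]&pagecount=1[/append]
[/showif]
[showif [hits]>0]
[replace db=visitors.db&eqIPadddatarq=[ipaddress]&eqvisitdatedatarq=[date]]pagecount=[math][pages]+1[/math][/replace]
[/showif]
[showif [pages]>14]
[!] over the limit - show the Page Limit Exceeded page instead of the real content [/!]
[include file=/pagelimit.tmpl]
[/showif]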
>
>> On Mar 24, 2016, at 11:41 AM, Jym Duane <jym@purposemedia.com> wrote:
>>
>> curious how to determine... non google/bing/yahoo bots and others attempting to crawl/copy the entire site?
>>
>>
>>
>> On 3/24/2016 9:28 AM, Brian Burton wrote:
>>> Noah,
>>>
>>> Similar to you, and wanting to use pretty URLs, I built something similar, but did it a different way.
>>> _All_ page requests are caught by a url-rewrite rule and get sent to dispatch.tpl
>>> Dispatch.tpl has hundreds of rules that decide what page to show, and uses includes to do it.
>>> (this keeps everything in-house to WebDNA so I don't have to go mucking about in WebDNA here, and Apache there, and Linux somewhere else, etc…)
>>>
>>> Three special circumstances came up that needed special code to send out proper HTTP status codes:
>>>
>>> <!-- for page URLs that have permanently moved (WebDNA sends out a 302 temporarily-moved code on a redirect) -->
>>> [function name=301public]
>>> [text]eol=[unurl]%0D%0A[/unurl][/text]
>>> [returnraw]HTTP/1.1 301 Moved Permanently[eol]Location: http://www.example.com[link][eol][eol][/returnraw]
>>> [/function]
>>>
>>> <!-- I send this to non google/bing/yahoo bots and others attempting to crawl/copy the entire site -->
>>> [function name=404hard]
>>> [text]eol=[unurl]%0D%0A[/unurl][/text]
>>> [returnraw]HTTP/1.0 404 Not Found[eol]Status: 404 Not Found[eol]Content-type: text/html[eol][eol]<html>[eol]<body>[eol]<h1>404 Not Found</h1>[eol]The page that you have requested ([thisurl]) could not be found.[eol]</body>[eol]</html>[/returnraw]
>>> [/function]
>>>
>>> <!-- and finally a pretty 404 page for humans -->
>>> [function name=404soft]
>>> [text]eol=[unurl]%0D%0A[/unurl][/text]
>>> [returnraw]HTTP/1.0 404 Not Found[eol]Status: 404 Not Found[eol]Content-type: text/html[eol][eol][include file=/404pretty.tpl][/returnraw]
>>> [/function]
>>>
>>> Hope this helps
>>> -Brian B. Burton
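
If memory serves, parameters passed when calling a [function] become local tags inside it, so once defined these should be callable from dispatch.tpl along the following lines (the paths and the bot flag are made up for illustration):

[!] hypothetical calls from dispatch.tpl [/!]
[showif [thisurl]^/old-catalog/]
[301public link=/new-catalog/index.html]
[/showif]
[showif [suspectedbot]=T]
[404hard]
[/showif]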
>

-----------------------------------------------------------
This message is sent to you because you are subscribed to
the mailing list <talk@webdna.us>.
To unsubscribe, E-mail to: <talk-leave@webdna.us>

