saving content from another site
I need to save some content from another site (with permission, of
course). It's just the text portion, no graphics, and I can't simply
use frames to target the intended content. Rather, I need to save the
content to a file on my server for later importing into a database and
searching for differences. Can someone tell me how I can do this
without loading the target page into my desktop browser? Essentially,
I want this to run on my server. I can navigate to my page, click on a
button and have it run this process, if need be, but automatically
every hour or so would be much better.
Thanks.
Re: saving content from another site
"javelin" <google.1.jvmail [at] spamgourmet.com> wrote in message
news:1172005485.770120.4900 [at] q2g2000cwa.googlegroups.com...
|I need to save some content from another site (with permission, of
| course). It's just the text portion, no graphics, and I can't simply
| use frames to target the intended content. Rather, I need to save the
| content to a file on my server for later importing into a database and
| searching for differences. Can someone tell me how I can do this
| without loading the target page into my desktop browser? Essentially,
| I want this to run on my server. I can navigate to my page, click on a
| button and have it run this process, if need be, but automatically
| every hour or so would be much better.
fopen - rtfm
| Thanks.
np.
Re: saving content from another site
fopen fails to produce anything but a blank page, as I expected it
would when I "rtfm". The target site is not mine, and I can't hope to
have it opened for access anytime soon. It is, essentially, a blog,
and I need to get the posts off of a certain category every hour or
so. I just needed some PHP or CGI code that can automatically import
the html from a supplied URL, then display it. From there, I can work
out a means to manipulate the resulting page.
Any recommendations from the manual or elsewhere I can look into?
Re: saving content from another site
javelin wrote:
> fopen fails to produce anything but a blank page, as I expected it
> would when I "rtfm". The target site is not mine, and I can't hope to
> have it opened for access anytime soon.
You need a wrapper, read appendix M for a bit more information.
> It is, essentially, a blog,
> and I need to get the posts off of a certain category every hour or
> so. I just needed some PHP or CGI code that can automatically import
> the html from a supplied URL, then display it. From there, I can work
> out a means to manipulate the resulting page.
wget can download remote web pages, follow links, and a lot more. You can
call it from PHP with the help of exec().
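
A rough sketch of the wget route, in case it helps. It assumes wget is installed on the server and that exec() has not been disabled in php.ini; the function names build_wget_command() and fetch_with_wget() are made up for illustration.

```php
<?php
// Sketch only: assumes wget is on the PATH and exec() is allowed.
// build_wget_command() and fetch_with_wget() are hypothetical helper names.

function build_wget_command(string $url, string $outFile): string
{
    // -q: no progress output, -O: write the page to the given file.
    // escapeshellarg() stops the shell from interpreting the URL or path.
    return 'wget -q -O ' . escapeshellarg($outFile) . ' ' . escapeshellarg($url);
}

function fetch_with_wget(string $url, string $outFile): bool
{
    exec(build_wget_command($url, $outFile), $output, $exitCode);
    return $exitCode === 0; // wget exits with 0 when the download worked
}
```

Always run anything that ends up on a command line through escapeshellarg(), even if the URL comes from your own config.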
--
//Aho
Re: saving content from another site
"javelin" <google.1.jvmail [at] spamgourmet.com> wrote in message
news:1172079828.170362.243730 [at] l53g2000cwa.googlegroups.com...
| fopen fails to produce anything but a blank page, as I expected it
| would when I "rtfm". The target site is not mine, and I can't hope to
| have it opened for access anytime soon.
you don't need to have it 'opened for access'. you aren't getting source
code from them, just html...which they offer publicly.
| It is, essentially, a blog,
doesn't matter
| and I need to get the posts off of a certain category every hour or
| so.
write your script (the one where you will use fopen because you rtfm) and
put it on a cron job or windows scheduled task.
| I just needed some PHP or CGI code that can automatically import
| the html from a supplied URL, then display it. From there, I can work
| out a means to manipulate the resulting page.
painfully, but probably so. just kidding. ;^)
| Any recommendations from the manual or elsewhere I can look into?
fopen
fread
stream_get_meta_data
http_response_header
(hell, even file_get_contents works)
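
For what it's worth, here is a minimal sketch of how those functions might fit together. It assumes allow_url_fopen is On for http:// URLs (a blank result can mean it is Off), and fetch_with_fopen() is just an illustrative name, not something from the manual.

```php
<?php
// Sketch only: http:// URLs need allow_url_fopen=On in php.ini.
// fetch_with_fopen() is a hypothetical helper name.

function fetch_with_fopen(string $url): ?string
{
    $handle = @fopen($url, 'r');
    if ($handle === false) {
        return null; // could not open: bad URL, wrapper disabled, host down
    }

    $body = '';
    while (!feof($handle)) {
        $chunk = fread($handle, 8192); // read in 8 KB pieces until EOF
        if ($chunk === false) {
            break;
        }
        $body .= $chunk;
    }

    $meta = stream_get_meta_data($handle); // includes the 'timed_out' flag
    fclose($handle);

    return empty($meta['timed_out']) ? $body : null;
}
```

For the hourly run, a crontab entry along the lines of `0 * * * * php /path/to/fetch.php` would do; the path is a placeholder. For http:// URLs the response headers should also show up under $meta['wrapper_data'].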
Re: saving content from another site
Thanks for all the refs. I'll get to reading again, and let you know
if my eyes get crossed up in a knot (won't be the first time).
Re: saving content from another site
"javelin" <google.1.jvmail [at] spamgourmet.com> wrote in message
news:1172088195.195327.10520 [at] l53g2000cwa.googlegroups.com...
| Thanks for all the refs. I'll get to reading again, and let you know
| if my eyes get crossed up in a knot (won't be the first time).
look, if you don't have to log in to get to the forum/post you wanna see,
then just do:
echo file_get_contents($url);
that way you can see the html that was at $url. other than that, you can
parse the result of the function, store it, whatever. i've borrowed *many* a
'secure' site's functionality using only that function alone. it's great to
'borrow' the ability to get barcoded images from someone else's site...or
'borrow' pie, bar, or other chart images doing the same...or...whatever.
it's not hard. even hacking 'secure' sites is surprisingly easy.
however, you don't even have to go to great lengths here since they are
giving you the html knowingly and freely. ;^)
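
Since the original question was about saving the page to a file and spotting differences later, a sketch along these lines might be a starting point. save_if_changed() and its arguments are hypothetical names, and comparing the whole page is the crudest possible "diff".

```php
<?php
// Sketch only: save_if_changed() is a hypothetical helper name.
// It stores the fetched page and reports whether it differs from the last copy.

function save_if_changed(string $url, string $file): bool
{
    $new = @file_get_contents($url);
    if ($new === false) {
        return false; // fetch failed, keep the old copy untouched
    }

    $old = is_file($file) ? file_get_contents($file) : null;
    if ($old === $new) {
        return false; // nothing changed since the last run
    }

    file_put_contents($file, $new, LOCK_EX); // LOCK_EX avoids half-written files
    return true; // new or changed content was saved
}
```

A dynamic page (ads, timestamps, counters) will look "changed" on every run, so in practice you would probably cut out just the post text before comparing.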
Re: saving content from another site
javelin wrote:
> Thanks for all the refs. I'll get to reading again, and let you know
> if my eyes get crossed up in a knot (won't be the first time).
when all that fails, try cURL. :-)
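
A minimal cURL sketch, assuming the curl extension is compiled in. fetch_with_curl() and the 30-second default timeout are made up for the example.

```php
<?php
// Sketch only: requires PHP's curl extension.
// fetch_with_curl() is a hypothetical helper name.

function fetch_with_curl(string $url, int $timeout = 30): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,     // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,     // follow redirects
        CURLOPT_TIMEOUT        => $timeout, // give up after this many seconds
    ]);

    $body   = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);

    if ($body === false || $status >= 400) {
        return null; // network error or an HTTP error status
    }
    return $body;
}
```

Unlike the fopen/file_get_contents route, this does not depend on allow_url_fopen, which is handy on shared hosts that turn it off.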
Re: saving content from another site
> echo file_get_contents($url);
>
Thanks for this idea. It works well for the top level, but I now
need each sub link to be opened via the same echo
file_get_contents($url) function. Any chance I can easily parse this
and dynamically create a new page for each link? (I don't ask for
much, do I).
Thanks.
Re: saving content from another site
"javelin" <google.1.jvmail [at] spamgourmet.com> wrote in message
news:1172549204.330607.13680 [at] p10g2000cwp.googlegroups.com...
|
|
| > echo file_get_contents($url);
| >
|
| Thanks for this idea. It works well for the top level, but I now
| need each sub link to be opened via the same echo
| file_get_contents($url) function. Any chance I can easily parse this
| and dynamically create a new page for each link? (I don't ask for
| much, do I).
no you don't ask for much. the question is 'any chance *i* can easily...' to
which the response is yes. so no, you aren't asking for much. as for writing
it for you, that's another story.
you need to learn the preg functions and write a regular expression that
captures html attributes such as src= and href=, etc. once you get your matches,
get the contents using file_get_contents. it looks like you're either
screen-scraping or hacking...either way, there are tons of examples of the
regex you'll need. just google 'php screen scraping'.
hth
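
To give an idea of the kind of regex meant here, a sketch using preg_match_all. extract_urls() is an illustrative name, and the pattern only handles quoted href= and src= values; it will miss unquoted attributes and anything built by JavaScript.

```php
<?php
// Sketch only: extract_urls() is a hypothetical helper name.
// The pattern covers quoted href= and src= values and nothing fancier.

function extract_urls(string $html): array
{
    // (["\']) captures the opening quote, \1 requires the same closing quote.
    $pattern = '/\b(?:href|src)\s*=\s*(["\'])(.*?)\1/is';
    preg_match_all($pattern, $html, $matches);

    // $matches[2] holds the attribute values; drop duplicates and reindex.
    return array_values(array_unique($matches[2]));
}
```

Each value that comes back can then go through file_get_contents in a loop. Relative links such as /post/1 need the site's scheme and host put in front first.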
Re: saving content from another site
> you need to learn the preg functions and write a regular expression that
> captures html attributes such as src= and href=, etc. once you get your matches,
> get the contents using file_get_contents. it looks like you're either
> screen-scraping or hacking...either way, there are tons of examples of the
> regex you'll need. just google 'php screen scraping'.
I'm familiar with regular expressions, so I should be able to
parse out the data. I was simply wondering if there were any
functions or open-source code available for this task. I did google
'php screen scraping' and got lots of pages. Looks like it's going to
be a late night for me :-)
Thanks again.
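
On the question of ready-made functions: one option that ships with PHP is the DOM extension, which avoids hand-written regexes for links. A sketch, assuming the dom extension is enabled; link_targets() is a made-up name.

```php
<?php
// Sketch only: requires the dom extension (libxml).
// link_targets() is a hypothetical helper name.

function link_targets(string $html): array
{
    $doc = new DOMDocument();

    // Real-world HTML is rarely valid, so collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $targets = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = trim($anchor->getAttribute('href'));
        if ($href !== '') {
            $targets[$href] = trim($anchor->textContent); // href => link text
        }
    }
    return $targets;
}
```

The parser copes with sloppy markup better than most regexes, at the cost of loading the whole page into memory.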
Re: saving content from another site
"javelin" <google.1.jvmail [at] spamgourmet.com> wrote in message
news:1172684528.567355.69130 [at] j27g2000cwj.googlegroups.com...
|
| > you need to learn the preg functions and write a regular expression that
| > captures html attributes such as src= and href=, etc. once you get your
| > matches, get the contents using file_get_contents. it looks like you're
| > either screen-scraping or hacking...either way, there are tons of examples
| > of the regex you'll need. just google 'php screen scraping'.
|
| I'm familiar with regular expressions, so I should be able to
| parse out the data. I was simply wondering if there were any
| functions or open-source code available for this task. I did google
| 'php screen scraping' and got lots of pages. Looks like it's going to
| be a late night for me :-)
no problem...good luck.