Spss Combine Variables

1. Introduction


When you have two data files, you can combine them by merging them side by side, matching up observations based on an identifier. For example, below we have a file containing dads and we have a file containing faminc. We would like to match merge the files together so we have the dads observation on the same line with the faminc observation based on the key variable famid.

After match merging the dads and faminc, the data would look like this.


2. One-to-one merge

Let’s start by creating the files that we will be merging. Below we create the files dads.sav and faminc.sav.
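The original data listings did not survive here, so the following is only a sketch of how the two files could be created. The variable names (famid, name, inc, faminc96 to faminc98) come from the text; the data values are made up for illustration, and the files are written to the current working directory.

```spss
* Illustrative values only; the variable names follow the text.
DATA LIST LIST / famid (F2.0) name (A4) inc (F8.0).
BEGIN DATA
2 Art 22000
1 Bill 30000
3 Paul 25000
END DATA.
SAVE OUTFILE='dads.sav'.
LIST.

DATA LIST LIST / famid (F2.0) faminc96 (F8.0) faminc97 (F8.0) faminc98 (F8.0).
BEGIN DATA
3 75000 76000 77000
1 40000 40500 41000
2 45000 45400 45800
END DATA.
SAVE OUTFILE='faminc.sav'.
LIST.
```

The rows are deliberately not in famid order, which is why the sorting steps in the merge are needed.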

The output of these statements is shown below, confirming that we have read the data properly.

There are three steps to match merge dads.sav with faminc.sav. (Note that this is a one to one merge because there is a one to one correspondence between the dads and faminc records.) These three steps are illustrated below.

  1. Use SORT CASES to sort dads on famid and save that file (we will call it dads2.sav)
  2. Use SORT CASES to sort faminc on famid and save that file (we will call it faminc2.sav)
  3. Use MATCH FILES to merge the dads2.sav and faminc2.sav files based on famid

Below we show the commands for performing the merge.
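The three steps can be sketched as follows, assuming dads.sav and faminc.sav exist in the current working directory:

```spss
* Step 1: sort dads on famid and save the result as dads2.sav.
GET FILE='dads.sav'.
SORT CASES BY famid.
SAVE OUTFILE='dads2.sav'.

* Step 2: sort faminc on famid and save the result as faminc2.sav.
GET FILE='faminc.sav'.
SORT CASES BY famid.
SAVE OUTFILE='faminc2.sav'.

* Step 3: merge the two sorted files on famid and list the result.
MATCH FILES
  /FILE='dads2.sav'
  /FILE='faminc2.sav'
  /BY famid.
LIST.
```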


The output below shows that the match merge worked properly.

3. One-to-many merge

The next example considers a one to many merge where one observation in one file may have multiple matching records in another file. Imagine that we had a file with dads like we saw in the previous example, and we had a file with kids where a dad could have more than one kid. You see why this is called a one to many merge since you are matching one dad observation to one or more (many) kids observations. Remember that the dads file is the file with one observation, and the kids file is the one with many observations. Below, we create the data file for the dads and for the kids.

As you see below, the steps for doing a one to many merge are similar to those for the one to one merge that we saw above.

  1. Use SORT CASES BY to sort dads on famid and save that file (we will call it dads2)
  2. Use SORT CASES BY to sort kids on famid and save that file (we will call it kids2)
  3. Use MATCH FILES to merge the dads2 and kids2 files. However, since the dads file is the file with one observation, use /TABLE='dads2.sav', not /FILE='dads2.sav' to specify the dads file.
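A minimal sketch of these steps, assuming the unsorted files are called dads.sav and kids.sav (the name kids.sav is an assumption; the text only names the sorted copies) and that both contain famid:

```spss
GET FILE='dads.sav'.
SORT CASES BY famid.
SAVE OUTFILE='dads2.sav'.

GET FILE='kids.sav'.
SORT CASES BY famid.
SAVE OUTFILE='kids2.sav'.

* The dads file has one record per famid, so it is named on TABLE.
MATCH FILES
  /FILE='kids2.sav'
  /TABLE='dads2.sav'
  /BY famid.
LIST.
```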

The output below shows that this merge worked as we hoped.

The key difference between a one to one merge and a one to many merge is that you need to use /TABLE='dads2.sav' instead of /FILE='dads2.sav'. For your data, when you do a one to many merge, ask yourself which file plays the role of one (in one to many). For that file, use /TABLE= instead of /FILE=.

Let’s intentionally make an error and use /FILE='dads2.sav' and see what SPSS does.
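The faulty merge would look roughly like this (same assumed files as before, with dads2.sav and kids2.sav already sorted on famid):

```spss
* Deliberately wrong: dads2.sav is named on FILE instead of TABLE.
MATCH FILES
  /FILE='dads2.sav'
  /FILE='kids2.sav'
  /BY famid.
LIST.
```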

The first thing we notice is that SPSS gives us the warning shown below. This is telling us that there are multiple kids for a given dad.

As SPSS advises, we will inspect the results carefully. Indeed, we see the results are not what we desired. When there were multiple kids per dad, it only merged the dad with the first kid, and then the following kids with the same dads were assigned missing values for the dads information (name and inc). When we used the /TABLE= subcommand in the previous example, SPSS carried the dads information across all of the kids.

4. Ordering the variables in the new file

You can use the /MAP subcommand with the ADD FILES command to see the order of the variables in the new file, as illustrated below; MATCH FILES accepts the same subcommand. If you would like to rearrange the order of the variables in the new file, you can also add the /KEEP subcommand. The variables will be ordered in the new file in the order that you list them on the /KEEP subcommand. If you do not list all of the variables on the /KEEP subcommand, the variables not listed will not be present in the new file. If only the first few variables need to be reordered, you can list just those and then use the keyword ALL to include the rest of the variables; the variables covered by ALL keep the order they have in the original files.
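Both ADD FILES and MATCH FILES accept /MAP and /KEEP. As a sketch, here is the merge from section 2 with the variables reordered; the file names are the ones assumed there.

```spss
* KEEP puts name, famid and inc first; ALL appends the remaining variables.
* MAP is placed last so that it reports the final variable order.
MATCH FILES
  /FILE='dads2.sav'
  /FILE='faminc2.sav'
  /BY famid
  /KEEP=name famid inc ALL
  /MAP.
LIST.
```

Leaving out ALL would drop the family income variables from the merged file.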

As you can see, the variables in the new file are now in the order name, famid, inc.

5. Problems to look out for

5.1 Mismatching records in one-to-one merge

The two data files may have records that do not match. Below we illustrate this by including an extra dad (Karl in famid 4) who does not have a corresponding family, and there are two extra families (5 and 6) in the family file that do not have a corresponding dad.
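A sketch of such a merge follows. The data values are invented for illustration; only Karl in famid 4, families 5 and 6, and the variable names come from the text. Note that this overwrites any earlier dads2.sav and faminc2.sav in the working directory.

```spss
* Illustrative data: dad Karl (famid 4) has no family record.
DATA LIST LIST / famid (F2.0) name (A4) inc (F8.0).
BEGIN DATA
2 Art 22000
1 Bill 30000
3 Paul 25000
4 Karl 95000
END DATA.
SORT CASES BY famid.
SAVE OUTFILE='dads2.sav'.

* Illustrative data: families 5 and 6 have no dad record.
DATA LIST LIST / famid (F2.0) faminc96 (F8.0) faminc97 (F8.0) faminc98 (F8.0).
BEGIN DATA
3 75000 76000 77000
1 40000 40500 41000
2 45000 45400 45800
5 55000 65000 70000
6 22000 24000 28000
END DATA.
SORT CASES BY famid.
SAVE OUTFILE='faminc2.sav'.

* Each IN creates a 0/1 flag for the file named just before it.
MATCH FILES
  /FILE='dads2.sav'
  /IN=fromdad
  /FILE='faminc2.sav'
  /IN=fromfam
  /BY famid.
LIST.
CROSSTABS /TABLES=fromdad BY fromfam.
```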

As you see above, we use /IN=fromdad to create a 0/1 variable that indicates whether the resulting file contains a record with data from the dads file. Likewise, we use /IN=fromfam to indicate if the resulting file has a record from the faminc file. The LIST and CROSSTABS then show us about the mismatching records.

The output from the LIST command shows us what happens when there are mismatching records. For famid 4, the value of fromdad is 1 and fromfam is 0, as we would expect since there was data from dads for famid 4, but no data from faminc. Also, as we expect, this record has valid data for the variables from the dads file (name and inc) and missing data for the variables from faminc (faminc96, faminc97 and faminc98). We see the reverse pattern for famid 5 and 6.

If we look at the fromdad and fromfam variables, we can see that there are three records that have matching data, one that has data from the dads only, and two records that have data from the faminc file only. The crosstab below shows us the same results and is easier than tallying the matches by hand.

When matching files, we suggest that you use this strategy to check the matching of the two files. If there are unexpected mismatched records, then you should investigate to understand the cause of the mismatched records.

You can use SELECT IF to eliminate some of the non-matching records. For example, if you wanted to keep just the records where the dads matched with the family information, you could type
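A minimal sketch, assuming the merged file with the fromdad and fromfam flags is the active dataset:

```spss
* Keep only the records that have data from both files.
SELECT IF (fromdad = 1 AND fromfam = 1).
LIST.
```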

The results are shown below, including just the three matching records.

5.2 Mismatching records in one-to-many merge

SPSS handles the inclusion of mismatched records in a one-to-many merge differently than in a one-to-one merge. Remember that in a one-to-many merge, there is a file that has one observation that matches to many observations in the other file; let us refer to these as the one file and the many file. If there are observations in the one file that do not match to the many file, then these observations will not appear in the merged file at all. If there are observations in the many file that do not match the one file, those records will appear in the merged file. If this is what you desire, then you can merge the files as illustrated in Section 3, and use the /IN= as illustrated in the prior section to track the matching. However, if you would like mismatched records from the one and many file to both appear in the merged file, then you can use the matching strategy outlined below.

Below we use our example to merge dads with kids, and in this example we have mismatched records in both files. We match the files so that all mismatched records are included in the merged file. Compared with Section 3, the only difference is one extra step before the final merge.
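The strategy can be sketched as follows. The file names dadid.sav, temp.sav and kids.sav are assumptions based on the names used in the text (dadid, temp), and both input files are assumed to contain famid.

```spss
* Sort the dads and save them, plus a copy that holds only famid.
GET FILE='dads.sav'.
SORT CASES BY famid.
SAVE OUTFILE='dads2.sav'.
SAVE OUTFILE='dadid.sav' /KEEP=famid.

GET FILE='kids.sav'.
SORT CASES BY famid.
SAVE OUTFILE='kids2.sav'.

* Extra step: add every famid found in the dads file to the kids file.
MATCH FILES
  /FILE='dadid.sav'
  /FILE='kids2.sav'
  /BY famid.
SAVE OUTFILE='temp.sav'.

* Final merge: temp now has a famid for every dad.
MATCH FILES
  /FILE='temp.sav'
  /TABLE='dads2.sav'
  /BY famid.
LIST.
```

SPSS may warn about duplicate keys in the extra step; since dadid.sav contains nothing but the key variable, no dad information is lost there.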

The extra step adds to the kids file any values of famid that are only in the dads file. It does this by doing a one-to-one merge between dadid and the kids file and saving the result as temp. Since dadid contains just the famid of all of the dads, this merge adds an observation for any famid that is in the dads file but not in the kids file. We can then merge temp with dads2, and temp will have a famid for every observation in the dads2 file. This ensures that the resulting file will include all observations from the dads file, even if they do not have a matching record in the kids file. The result is shown below. Indeed, the file contains the observation for the dad Karl, who does not have any matching kids. If we omitted the extra step, that record would not have been included in this file.

5.3 Variables with the same name, but different information

Below we have the files with the information about the dads and family, but look more closely at the names of the variables. In the dads file, there is a variable called inc98, and in the family file there are variables inc96, inc97 and inc98. Let’s go ahead and merge these files and see what SPSS does.

The results are shown below. As you see, the variable inc98 has the data from the dads file, the file that appeared first in the MATCH FILES command. When you match files that have the same variable, SPSS will use the values from the file that appears earliest in the MATCH FILES command.

There are a couple of ways you can solve this problem.

Solution #1. The most obvious solution is to choose variable names in the original files that will not conflict with each other. However, you may receive files where the names have already been chosen.

Solution #2. You can rename the variables in the MATCH FILES command (which renames the variables before doing the matching). This allows you to select variable names that do not conflict with each other, as illustrated below.
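A sketch of such a merge follows. The new names dadinc98 and faminc96 to faminc98 are chosen here for illustration, and the sorted files are assumed to be dads2.sav and faminc2.sav.

```spss
* Each RENAME applies to the file named immediately before it.
MATCH FILES
  /FILE='dads2.sav'
  /RENAME=(inc98 = dadinc98)
  /FILE='faminc2.sav'
  /RENAME=(inc96 inc97 inc98 = faminc96 faminc97 faminc98)
  /BY famid.
LIST.
```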

As you can see below, the variables were renamed as we specified.

5.4 The same variables with different dictionary information

This problem is similar to the one outlined above. In this example, we have two variables with the same name and the same information, but with different dictionary information associated with them. This dictionary information could include value labels and/or variable labels. As with the example above, SPSS will take the information from the file listed first in the MATCH FILES command. No error or warning message will be issued to let you know that the information from the variable in the later file has been lost. The solution to this problem is to list first in the MATCH FILES command the file with the dictionary information that you want in the resulting file.

5.5 You have run the ADD FILES command, and nothing happened

If you run just the ADD FILES command, as shown below, SPSS will not do anything. However, you will see a note in the lower right corner of the data editor saying 'transformation pending'.

Solution: Add either the EXECUTE command or a procedure command that forces the transformation to run, such as LIST or CROSSTABS.
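ADD FILES and MATCH FILES behave the same way in this respect. A minimal sketch using the merge from section 2 (file names as assumed there):

```spss
MATCH FILES
  /FILE='dads2.sav'
  /FILE='faminc2.sav'
  /BY famid.
* Without the next command the merge stays pending.
EXECUTE.
```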

6. For more information

  • For more information about Match Merging data files, see the MATCH FILES command in SPSS Syntax Reference Guide.
  • For information on concatenating data files, see the SPSS Learning Module on Concatenating Data Files in SPSS

Merging files can mean two different things. You may ADD two or more files that (usually) contain different cases, but have at least partly the same variables. ADDing files means that all cases previously in separate files will end up in one file; that is, the resulting file will have more cases, but very often not more variables. This case usually is rather simple; for instance, you may have the same sort of data for several school classes and wish to have them in a single file.

You may MATCH two files that contain the same cases (at least in part), but have different variables. MATCHing these files often means that you will deal with the same cases as before, but you will have more information (more variables) about them. This case may be very simple as well (see the first example on MATCHing below), but sometimes it requires (or allows) you to deal with quite sophisticated problems.

Simple example for ADDing files:

ADD FILES
/ file = 'c:\subdir\mydata1.sav'
/ file = 'c:\subdir\mydata2.sav'
/ file = 'c:\subdir\mydata3.sav'.

Simple example for MATCHing files:

MATCH FILES
/ file = 'c:\subdir\mydata1.sav'
/ file = 'c:\subdir\mydata2.sav'
/ by id.

Complex example for MATCHing files:

MATCH FILES
/ file = 'c:\subdir\mydata1.sav'
/ rename (var17 var23a var300 = v17 v23a var301a )
/ drop var12 var313 d1
/ table = 'c:\subdir\mydata2.sav'
/ by id.

Go immediately to MATCH FILES.


ADD FILES

As explained in the introduction, adding files is normally a very simple operation. More specifically, this is the case if the structures of the files you wish to add are indeed identical (more on this below). Then the simple example from the first section will do its job. Up to ten files can be ADDed in one step by listing them as FILE subcommands separated by slashes; each file will be stacked below the others. If there are more than ten files, you just have to write more than one ADD FILES command.

What ADDing files means can be shown by a simple example. Let's suppose you have grades in maths for the boys and the girls in your class, but these are in different files. Thus you have two files with different cases, but identical variables. For instance, part of the boys' data will look like this (grades are in German notation, where 1 means A, and 5 means F):

Name Maths
John 2
Mike 3

Likewise, part of the girls' file may look like this:

Name Maths
Lisa 2
Anne 1

Let's suppose that the boys' data are in file 'mboys.sav' which is located in subdirectory 'c:\class98', and the girls' data are in file 'mgirls.sav' in the same directory. Then ADDing these two files with the command

ADD FILES
/ file = 'c:\class98\mboys.sav'
/ file = 'c:\class98\mgirls.sav'.
EXECUTE.

will yield the new file

Name Maths
John 2
Mike 3
Lisa 2
Anne 1

Note that you have to SAVE this file if you wish to keep it as one file that you can work with later on.
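A minimal sketch of adding and saving in one go; the output name 'mall.sav' is made up for illustration:

```spss
ADD FILES
  /FILE='c:\class98\mboys.sav'
  /FILE='c:\class98\mgirls.sav'.
* SAVE reads the data itself, so no separate EXECUTE is needed here.
SAVE OUTFILE='c:\class98\mall.sav'.
```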

Often, one of the files you wish to ADD to other files will be your working file. You can refer to your working file by the '*' sign. You may place it anywhere in your list of files; that is, the working file does not have to be the first file in your list, as in the following case, where the first file addressed is 'data1.sav', followed by the current working file and then 'data2.sav':

ADD FILES
/ file = 'c:\subdir\data1.sav'
/ file = *
/ file = 'c:\subdir\data2.sav'.
EXECUTE.

CAUTION: If you use this possibility of referring to your working file, the resulting file will have the same name as the initial working file. Saving this file without a change of name will overwrite the old file. This may be very annoying if you have used the 'drop' or the 'keep' subcommand (see below).

Some additional complications may arise, but there are several possibilities to deal with these. What these complications may consist in becomes clear when we think about what it means for two files to have an identical structure. I refer to two files as having an identical structure if they contain the 'same' variables. This means that (1) all variables that are in one file are also present in the other file(s) and that no file contains variables that are not present in other files, and that (2) all the variables in all files that have the same name are of the same type. If the first condition is violated – i.e. if a file contains variables that are not present in other files –, nothing serious will happen; only the resulting data set may be bigger than you wish, with the result that speed of operations will slow down. The second case will be detrimental, that is, the operation of ADDing files will not be performed. I will deal with these two cases in turn.

Violation of the 'same variables' condition

Let's assume you have two data sets you wish to ADD. Data set 'data1.sav' contains variables var1 and var2, but data set 'data2.sav' in addition has variable var3. What happens if you ADD the two files? The resulting data set will have variables var1, var2 and var3, and the cases from 'data1.sav' will have system missing values in var3. Of course this may be precisely what you want, but if you do not need var3, it may be good practice to drop it from the resulting data set, because each 'useless' variable makes itself felt negatively as far as computational speed is concerned. This can be achieved easily like this:

ADD FILES
/ file = 'c:\subdir\data1.sav'
/ file = 'c:\subdir\data2.sav'
/ drop var3.

Sometimes you may have (or wish) to drop so many variables that it is simpler to tell SPSS which variables you wish to KEEP:

ADD FILES
/ file = 'c:\subdir\data1.sav'
/ file = 'c:\subdir\data2.sav'
/ keep var1 to var3.

Note that of course you may always use the 'drop' or 'keep' commands to get rid of variables you do not need.

A problem that may occur (if you have not prepared your data well, that is) is that you have the same variables in both data sets, but these are named differently. Let's assume that 'data1.sav' has the same information that in 'data2.sav' is stored in variables var1, var2 and var3. But unfortunately, in 'data1.sav' these variables were named var5, var6 and var4. You can amend this by RENAMing variables (note the order of variables!):

ADD FILES
/ file = 'c:\subdir\data1.sav'
/ rename (var5 var6 var4 = var1 var2 var3)
/ file = 'c:\subdir\data2.sav'
/ keep var1 to var3.

Of course, you may combine RENAMEing and DROPping (or KEEPing) variables. For instance, 'data1.sav' may contain the data that should be in var2, but they are called var5. Instead, what is called var2 contains data that you do not need anymore. Thus, you will write:

ADD FILES
/ file = 'c:\subdir\data1.sav'
/ rename (var5 var2 = var2 var4 )
/ file = 'c:\subdir\data2.sav'
/ drop var4.

Violation of the 'same type' condition


If two variables have the same name, but are of different type, SPSS will not perform the ADD FILES operation. You cannot amend this during the ADD FILES run. You have to change the type of the variable prior to ADDing files. The interesting thing is which variables will be treated as 'of the same type' and which will not.

I deal here only with the most usual cases, that is numeric and string variables. What is obvious is that 'numeric' is not the same type as 'string'. What is less obvious is that string variables of different width will also be treated as different type! If SPSS aborts an ADD FILE procedure with an error message, you may consider this possibility.
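One way to bring string widths into line before ADDing is the ALTER TYPE command, which to my knowledge is available only in more recent releases (SPSS 16 or later). The sketch below assumes that 'name' is a string of width 8 in 'mboys.sav' and of width 12 in 'mgirls.sav'; the widths and the file name 'mboys12.sav' are invented for illustration.

```spss
* Widen name to 12 characters and save a copy under a new name.
GET FILE='c:\class98\mboys.sav'.
ALTER TYPE name (A12).
SAVE OUTFILE='c:\class98\mboys12.sav'.

ADD FILES
  /FILE='c:\class98\mboys12.sav'
  /FILE='c:\class98\mgirls.sav'.
EXECUTE.
```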

What happens with numeric variables that have different formats? Since the different formats apply only to the way variables are displayed, SPSS finds nothing wrong in ADDing these variables. It will use the format that is used in the first file on the ADD FILES list for display in the resulting working file.

MATCH FILES

Matching files may be very easy. For instance, you may have in one file the maths grades of your class, and in another file the sports grades. Now you may wish to join these two (or perhaps more) files. I give an example:

Here's the maths grades:

Name Maths
John 2
Mike 1

Here's the sports grades (you see that you can be good both in maths and in sports - certainly in imaginary data):

Name Sports
John 2
Mike 1

If (IF!) you have prepared your data well, MATCHing these files will be achieved quite easily.

(EXAMPLE TO BE USED WITH EXTREME CAUTION):

MATCH FILES
/ file = 'c:\subdir\maths.sav'
/ file = 'c:\subdir\sports.sav'.

Here's the resulting file:

Name Maths Sports
John 2 2
Mike 1 1

However, things will work as smoothly only in a simple case like this, when the cases in both data sets are the same and they are in the same order. In addition, different variables indeed must be named differently. (If they have the same names, SPSS will use the information from the first data set; this situation can be amended, however.)

SPSS can deal with cases when one or two of these conditions are not met, however. The opportunities offered to deal with these situations allow for accomplishing fairly complex data management tasks with a few operations. Therefore, you may have a look at the following even if right now you don't have any such problems. Especially the use of the keyword BY should become a rule.

Keyword BY

Matching BY a given variable ensures that the information of the same cases in the different files will indeed be matched correctly. Usually, each case in a data set has a unique identification variable (henceforth ID), and this identification variable will be used in any data set where this case occurs. If the cases in each data set are sorted in ascending order by the ID variable(s) (indeed, there may be more than one ID variable!!), the data sets may be matched as follows (with 'name' used as ID variable):

MATCH FILES
/ file = 'c:\subdir\maths.sav'
/ file = 'c:\subdir\sports.sav'
/ BY name.

This will result in a 'meaningful' data set even if some of the cases in the two (or more) different data sets are not identical. Let's assume we have the two following files:

Name Maths
John 2
Mike 1
Andy 3

Name Sports
John 2
Mike 1
Rick 5

Matching these two files BY name will yield the following file:

Name Maths Sports
John 2 2
Mike 1 1
Andy 3 .
Rick . 5

If you are annoyed by having incomplete cases in your file, this situation can be partly amended with the TABLE keyword, which will now be explained. Note that this keyword has an additional use.

Note: As described above (ADD command), you may refer to a file you have worked with prior to MATCHing (your working file) with the '*' sign. If you use this possibility of referring to your working file, the resulting file will have the same name as the initial working file. Saving this file without a change of name will overwrite the old file. This may be very annoying if you have used the 'drop' or the 'keep' subcommand, which are also applicable (and often very usefully so) with the MATCH command.

Keyword TABLE

The first use of the TABLE keyword is to deal with the situation explained above. If one (or more) of the files to be matched is invoked with keyword TABLE rather than with keyword FILE, only those of its cases that also occur in the file(s) invoked as FILE will be matched; cases that occur only in a TABLE file are dropped. Let's assume we match the two files we used above as follows:

MATCH FILES
/ file = 'c:\subdir\maths.sav'
/ table = 'c:\subdir\sports.sav'
/ BY name.

The result will be the following file (with Rick being dropped from the file because he has no equivalent in the maths file, but with Andy still present and with a Missing Value in variable 'Sports'):

Name Maths Sports
John 2 2
Mike 1 1
Andy 3 .

There can be more than one TABLE files, and more than one file that is addressed as FILE (in this case, all cases that are in any of the files addressed as FILES will end up in the resulting data set). However, at least one FILE keyword is necessary; that is, the cases of this file will be the minimum cases that will be contained in the resulting file.

The second use of keyword TABLE is as follows: If the file(s) matched with keyword FILE contain more than one case with the same ID variable, and the file(s) matched with keyword TABLE have only one case for each ID, the variables of the TABLE file(s) will be matched to all cases in the FILE file. This is often useful if there is some aggregated information that is to be matched to other data. For instance, I have analyzed unemployment data on individuals, and I was wondering what effect the unemployment rate in the respective Bundesland ('State') has on the individual length of unemployment. Of course, the data set on individuals has many persons from each Bundesland. Thus, we may have the following file on individuals' duration of unemployment (extract):

ID Land Duration
1 1 5
2 1 10
3 1 2
4 2 6
5 2 11

The data on regional unemployment rates are in the following file

Land Unem_rat
1 10.1
2 5.8

Since you wish to have the information of the unemployment rate on each individual in each Bundesland, you have to match the files by the variable 'Land' (be sure that both files are sorted by this variable!). The command:

MATCH FILES
/ file = 'c:\subdir\individ.sav'
/ table = 'c:\subdir\unemprat.sav'
/ BY land.

will yield the following file:

ID Land Duration Unem_rat
1 1 5 10.1
2 1 10 10.1
3 1 2 10.1
4 2 6 5.8
5 2 11 5.8
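The match above only works if both files are sorted by 'land'. A sketch of the preparatory sorting, saving sorted copies under made-up names ('individ_s.sav', 'unemprat_s.sav') so that the original files stay untouched:

```spss
GET FILE='c:\subdir\individ.sav'.
SORT CASES BY land.
SAVE OUTFILE='c:\subdir\individ_s.sav'.

GET FILE='c:\subdir\unemprat.sav'.
SORT CASES BY land.
SAVE OUTFILE='c:\subdir\unemprat_s.sav'.

* Match the sorted copies; every individual gets the rate of his or her Land.
MATCH FILES
  /FILE='c:\subdir\individ_s.sav'
  /TABLE='c:\subdir\unemprat_s.sav'
  /BY land.
EXECUTE.
```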

Same variables in data sets to be matched

If the data sets to be matched contain the same variables - that is, variables with the same names - (other than any variables BY which the data sets are to be matched), SPSS will use the first occurrence of each variable. For instance, if var2 is in 'data1.sav' and also in 'data2.sav', and 'data1.sav' is mentioned first on the MATCH FILE statement, var2 from 'data1.sav' will be in the new data set and the information from var2 in 'data2.sav' will be lost. Often, this will be precisely what you want. In other cases (perhaps if the two variables, albeit having the same name, have different content) you have to RENAME the variables. The 'more complex example' at the beginning of the page explains how to do this. (A RENAME keyword can follow any FILE or TABLE line and will refer only to that file.)

Variables in data sets to be matched that are not needed in the resulting file

Often, one (or several) of the data sets that are to be matched contain variables that are not needed for further analysis. These variables may be dropped from the resulting data set with keyword DROP (indeed). The 'more complex example' at the beginning of this page has a line with an example for that case. (A DROP keyword can occur after any FILE or TABLE line and will refer to the respective file only.) The situation where there are many variables to be DROPped is one of the few instances where I use the SPSS DIALOG BOX, because you can easily click variables from one window ('new working data file') to the other ('excluded variables').

© W. Ludwig-Mayerhofer, IGSW Last update: 15 Apr 2002